PROC UNIVARIATE - Part 02: HISTOGRAM, QQPLOT, and Normality Testing
Overview
Part 01 covered the core statistical summary produced by PROC UNIVARIATE: moments, percentiles, extreme observations, the OUTPUT= statement, and merging summary statistics back to detail data.
This lesson focuses on the graphical and distributional testing capabilities: producing histograms, Q-Q plots, and formal normality test statistics.
These tools are used to assess whether a variable follows a normal (Gaussian) distribution — an assumption required by many parametric statistical tests.
Note: PROC UNIVARIATE produces traditional SAS/GRAPH or ODS Graphics output depending on your SAS environment. The code shown here uses ODS Graphics, which has been a production feature since SAS 9.2 and is enabled by default in many environments from SAS 9.3 onward.
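The data step that the notes in this section describe is not reproduced in this text. A minimal sketch consistent with those notes might look like the following. The dataset name WORK.MEASURES and the variable names normal_val and skewed_val come from the lesson; the row count (200), the mean (50), the standard deviation (10), and the loop counter i are assumptions chosen only for illustration.

```sas
/* Sketch: build WORK.MEASURES with one normal and one right-skewed variable. */
/* Assumed for illustration: 200 rows, mean 50, standard deviation 10.        */
data work.measures;
    call streaminit(42);                      /* fixed seed so reruns give the same data */
    do i = 1 to 200;                          /* i is a hypothetical loop counter */
        normal_val = rand('normal', 50, 10);  /* normal with assumed mean 50, std 10 */
        skewed_val = rand('exponential');     /* exponential, scale 1: right-skewed */
        output;
    end;
    drop i;                                   /* keep only the two analysis variables */
run;
```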
CALL STREAMINIT(42) sets a random number seed for reproducibility — running this code will produce the same dataset every time.
RAND('normal', mean, std) generates values from a normal distribution with the specified mean and standard deviation.
RAND('exponential') generates values from an exponential distribution which is inherently right-skewed — a useful contrast for demonstrating normality tests.
Inspect WORK.MEASURES to see the two variables before running any analysis.
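One simple way to inspect the dataset is to print its first few rows. This is only a sketch; the 10-row limit is an arbitrary choice.

```sas
/* Print the first 10 rows of WORK.MEASURES (row limit chosen arbitrarily) */
proc print data=work.measures(obs=10);
run;
```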
HISTOGRAM Statement
The HISTOGRAM statement in PROC UNIVARIATE produces a graphical display of the distribution of a variable.
By default, it uses automatic bin widths. You can overlay a normal distribution curve using the NORMAL option to visually assess how closely the data follow a normal shape.
The MIDPOINTS= option specifies exact bin midpoints if you want to control the histogram bins manually.
A histogram that is roughly bell-shaped and symmetric is consistent with normality; long tails, sharp peaks, or asymmetry suggest non-normality.
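A step matching the description in this section might be written as below. It is a sketch rather than the lesson's exact code: the statements and options (HISTOGRAM / NORMAL, INSET MEAN STD / POSITION=NE) are the ones the notes name, while the commented MIDPOINTS= range is an assumption that only makes sense if normal_val is centered near 50.

```sas
ods graphics on;                              /* enable ODS Graphics for this step */

proc univariate data=work.measures;
    var normal_val skewed_val;
    /* One histogram per variable, each with a fitted normal curve overlaid */
    histogram normal_val skewed_val / normal;
    /* Statistics box in the northeast corner of each histogram */
    inset mean std / position=ne;
    /* Manual bins (assumed range, for normal_val only) would look like:   */
    /*   histogram normal_val / normal midpoints=20 to 80 by 5;            */
run;

ods graphics off;                             /* disable ODS Graphics again */
```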
ODS GRAPHICS ON enables the modern graphical output system. ODS GRAPHICS OFF disables it after the procedure so that it is not left active for subsequent steps.
The NORMAL option on the HISTOGRAM statement overlays a fitted normal distribution curve on the histogram bars. This makes it easy to see whether the data shape matches the normal curve.
INSET MEAN STD / POSITION=NE adds a small statistics box in the northeast corner of the plot showing the mean and standard deviation.
Examine the histogram for normal_val — it should appear roughly bell-shaped. The histogram for skewed_val will show a long right tail, inconsistent with the overlaid normal curve.
Q-Q Plot (Quantile-Quantile Plot)
A Q-Q plot compares the quantiles of your data against the theoretical quantiles of a normal distribution.
If the data are normally distributed, the points on the Q-Q plot should fall approximately along a straight diagonal reference line.
Deviations from the line indicate non-normality: an S-shaped pattern suggests tails that are heavier or lighter than those of a normal distribution, while a consistently bowed (convex or concave) pattern suggests skewness.
The QQPLOT statement in PROC UNIVARIATE generates a Q-Q plot. The NORMAL option, when given MU= and SIGMA= values, overlays the reference line for the corresponding normal distribution.
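A Q-Q plot step along these lines could be written as follows. This is a sketch that assumes the WORK.MEASURES dataset and the variable names used elsewhere in the lesson.

```sas
ods graphics on;

proc univariate data=work.measures;
    /* Normal Q-Q plot per variable; MU=EST and SIGMA=EST draw the reference */
    /* line using the sample mean and standard deviation                     */
    qqplot normal_val skewed_val / normal(mu=est sigma=est);
run;

ods graphics off;
```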
MU=EST and SIGMA=EST instruct SAS to estimate the mean and standard deviation from the data (rather than requiring you to specify them manually) and use those estimates for the reference line.
For normal_val, the points should cluster closely around the diagonal reference line — minor deviations at the tails are expected with small samples.
For skewed_val, the points will curve away from the reference line, especially in the upper tail, confirming the right-skewed distribution.
Q-Q plots are generally more sensitive to distributional departures than histograms, particularly in the tails, making them a valuable complement to the histogram.
Formal Normality Tests with the NORMAL Option
Visual assessment of histograms and Q-Q plots is subjective. PROC UNIVARIATE also provides formal statistical tests for normality.
Add the NORMAL option to the PROC UNIVARIATE statement to produce four normality test statistics in the output.
The four tests are: Shapiro-Wilk (which PROC UNIVARIATE computes only when n is 2000 or fewer), Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling.
The null hypothesis for all four tests is that the data are normally distributed. A small p-value (typically less than 0.05) leads to rejecting the null hypothesis and concluding non-normality.
With large sample sizes, these tests become very sensitive and may flag minor, inconsequential departures from normality — always combine test results with visual assessment.
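Requesting the tests takes only the NORMAL option on the PROC statement. A minimal sketch, assuming the same WORK.MEASURES dataset and variables:

```sas
/* NORMAL on the PROC statement adds the "Tests for Normality" table */
proc univariate data=work.measures normal;
    var normal_val skewed_val;
run;
```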
Look for the "Tests for Normality" table in the output for each variable.
For normal_val, the p-values for all four tests should be large (well above 0.05), indicating no evidence against normality.
For skewed_val, the p-values should be small (below 0.05), indicating that the data are not consistent with a normal distribution.
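The Shapiro-Wilk W statistic is greater than 0 and at most 1, and values close to 1 are consistent with normality. A W below 0.9 is sometimes used as a rough flag of concern, but the value at which W becomes significant depends on sample size, so base conclusions on the p-value.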
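If the full PROC UNIVARIATE listing makes the test table hard to find, one option is to restrict the output with ODS SELECT. This is a sketch of that approach, not part of the lesson's own code; TestsForNormality is the ODS name of the "Tests for Normality" table.

```sas
/* Show only the "Tests for Normality" table for each variable */
ods select TestsForNormality;
proc univariate data=work.measures normal;
    var normal_val skewed_val;
run;
```

The ODS SELECT list applies to this step only and resets when the procedure ends.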
Combining Histogram and QQPLOT in One Step
You can include both HISTOGRAM and QQPLOT in a single PROC UNIVARIATE call to get all distributional diagnostics at once.
This is efficient for a standard normality check routine across multiple variables.
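A combined step might look like the sketch below. The INSET keywords follow the statistics listed in the notes for this section; the POSITION=NE placement is an assumption carried over from the earlier histogram example.

```sas
ods graphics on;

proc univariate data=work.measures normal;    /* NORMAL: formal test statistics */
    var normal_val skewed_val;
    /* Histograms with a fitted normal curve */
    histogram normal_val skewed_val / normal;
    /* Summary box on each histogram (position is an assumed choice) */
    inset n mean std skewness kurtosis / position=ne;
    /* Q-Q plots with a reference line from the estimated mean and std */
    qqplot normal_val skewed_val / normal(mu=est sigma=est);
run;

ods graphics off;
```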
The INSET statement here adds n, mean, standard deviation, skewness, and kurtosis to the histogram plot — a quick distributional summary embedded in the graphic.
Skewness near 0 and kurtosis near 0 are consistent with normality: PROC UNIVARIATE reports excess kurtosis (kurtosis minus 3), so a normal distribution gives a value near 0 rather than 3. Skewness above 1 or below -1 is generally considered a meaningful departure.
This combined call produces histogram plots, Q-Q plots, and formal test statistics in a single PROC step — the standard approach for a normality assessment in a clinical or statistical analysis program.
Key Points
HISTOGRAM / NORMAL overlays a fitted normal curve on the distribution bars — a quick visual check of distributional shape.
QQPLOT / NORMAL(MU=EST SIGMA=EST) plots observed vs theoretical quantiles: points along the diagonal line are consistent with normality, while curves or S-shapes indicate departures.
Adding NORMAL to the PROC UNIVARIATE statement produces four formal normality test statistics, including the Shapiro-Wilk W test, which PROC UNIVARIATE computes only for sample sizes of 2000 or fewer.
A p-value below 0.05 in the normality tests leads to rejecting the null hypothesis of normality, but with large samples even trivial departures become significant, so always use visual assessment alongside formal tests.
Wrap PROC UNIVARIATE graphical output in ODS GRAPHICS ON and ODS GRAPHICS OFF to control when the graphical system is active.