PROC UNIVARIATE produces a detailed statistical summary of one or more numeric variables
Compared to PROC MEANS, which focuses on basic summary statistics, PROC UNIVARIATE also reports percentiles, extreme values, and tests for location by default, and can produce normality tests, histograms, and probability plots on request
It is especially useful for understanding the distribution of a variable — whether it is symmetric, skewed, or has extreme outliers
In clinical programming, PROC UNIVARIATE is commonly used for verifying value ranges, identifying outliers, and assessing whether a continuous endpoint meets normality assumptions before statistical testing
Create sample data
The dataset `lab_results` contains simulated haemoglobin values for a small group of subjects
A few values are intentionally set at the extremes to make the percentile and extreme-value output more informative
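The original DATA step is not reproduced here, so the following is only a minimal sketch of what such a dataset could look like. The variable name `sex` and the individual values are illustrative assumptions, chosen so that subject 1007 holds the minimum (9.8), subject 1008 holds the maximum (17.1), and the median falls between 12.8 and 13.0 as described in this section. The sex assigned to each subject other than 1007 is likewise assumed.

```sas
/* Illustrative data only: values are assumptions, not the original dataset */
data lab_results;
    length usubjid $8 sex $1;
    input usubjid $ sex $ hgb;
    label hgb = "Haemoglobin (g/dL)";
    datalines;
1001 F 12.3
1002 M 14.2
1003 F 12.8
1004 M 15.0
1005 F 11.9
1006 M 13.0
1007 F 9.8
1008 M 17.1
1009 F 13.5
1010 M 12.6
;
run;
```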
`hgb` is the haemoglobin value in g/dL — the dataset has ten subjects with a deliberate low value (9.8) and a high value (17.1) to exercise the extreme-values section of the output
Inspect the dataset and note the range of values before running PROC UNIVARIATE
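One simple way to inspect the data, sketched here on the assumption that `lab_results` exists in the WORK library with the variables `usubjid` and `hgb`, is to print it sorted by value so the lowest and highest readings are easy to spot. The output dataset name `lab_by_hgb` is hypothetical.

```sas
/* Sort a copy by hgb so the extremes appear at the top and bottom */
proc sort data=lab_results out=lab_by_hgb;
    by hgb;
run;

proc print data=lab_by_hgb noobs;
    var usubjid hgb;
run;
```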
Basic PROC UNIVARIATE output
The default PROC UNIVARIATE output contains several panels for each variable
The Moments panel shows the mean, standard deviation, variance, skewness, kurtosis, and sample size
The Basic Statistical Measures panel shows mean, median, mode, standard deviation, variance, range, and interquartile range
The Tests for Location panel tests whether the mean or median differs significantly from zero by default; the null value can be changed with the `mu0=` option
The Quantiles panel shows percentiles including the minimum, maximum, and quartiles
The Extreme Observations panel lists the five smallest and five largest values with their observation numbers
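A minimal call that produces the default panels listed above might look like the following. It assumes `lab_results` exists with a numeric variable `hgb`, and it requests no optional output.

```sas
/* Default output: Moments, Basic Statistical Measures, Tests for Location,
   Quantiles, and Extreme Observations for hgb */
proc univariate data=lab_results;
    var hgb;
run;
```

Omitting the `var` statement would analyse every numeric variable in the dataset, so naming the variable keeps the output focused.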
Review the Quantiles section — the minimum should be 9.8 and the maximum 17.1
With ten values, the median (50th percentile) is the average of the fifth and sixth ranked values; verify it is between 12.8 and 13.0
The Extreme Observations section identifies rows by observation number at this stage: the lowest value (9.8) should be the row for subject 1007 and the highest (17.1) the row for subject 1008; confirm this against the source data
Adding an ID variable to extreme observations
By default, the Extreme Observations panel identifies rows by their observation number
The `id` statement adds the values of a chosen variable to the Extreme Observations panel, alongside the observation number, making it much easier to trace extreme values back to specific subjects
In clinical data, you would typically use the subject identifier variable as the ID
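As a sketch, assuming the subject identifier in `lab_results` is named `usubjid`, the `id` statement is added to the same call:

```sas
/* usubjid values are listed beside each extreme value */
proc univariate data=lab_results;
    var hgb;
    id usubjid;
run;
```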
The Extreme Observations panel now shows the `usubjid` value next to each extreme value, in addition to the row number
Confirm that the five lowest and five highest values are identified by subject ID — this is far more useful than a row number when investigating outliers
Running PROC UNIVARIATE by group
A `by` statement produces a separate set of PROC UNIVARIATE output tables for each BY group
The data must be sorted by the BY variable before the procedure runs, typically with PROC SORT
This is useful when you need distribution summaries broken out by sex, treatment group, or visit
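A sketch of the sort-then-summarise pattern follows. It assumes the grouping variable is called `sex`; the sorted dataset name `lab_sorted` is hypothetical and is used so the original dataset order is left untouched.

```sas
/* BY-group processing requires the input to be sorted by the BY variable */
proc sort data=lab_results out=lab_sorted;
    by sex;
run;

proc univariate data=lab_sorted;
    by sex;
    var hgb;
    id usubjid;
run;
```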
Two separate output sections are produced — one for female subjects and one for male subjects
Compare the medians and ranges between the two groups; the female group includes subject 1007, whose low value of 9.8 pulls the female minimum down
Saving key statistics to an output dataset
The `output` statement saves selected statistics from PROC UNIVARIATE into a new dataset
Each statistic keyword is followed by `=` and the name to give that variable in the output dataset
Commonly saved statistics include `n=`, `mean=`, `median=`, `std=`, `min=`, `max=`, `q1=` (25th percentile), and `q3=` (75th percentile)
Saving statistics to a dataset is essential when you need to use them programmatically — for example to flag values more than two standard deviations from the mean
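The following sketch writes a handful of statistics to `hgb_stats`. The output variable names (`hgb_n`, `hgb_mean`, and so on) are assumptions made for illustration; any valid SAS names can be used on the right-hand side of each `=`.

```sas
/* noprint suppresses the printed tables; only the dataset is created */
proc univariate data=lab_results noprint;
    var hgb;
    output out=hgb_stats
        n=hgb_n
        mean=hgb_mean
        std=hgb_std
        median=hgb_median
        min=hgb_min
        max=hgb_max
        q1=hgb_q1
        q3=hgb_q3;
run;

/* Quick look at the one-row result */
proc print data=hgb_stats noobs;
run;
```

Prefixing the output names with the analysis variable avoids name clashes when the statistics are later combined with the detail data.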
The `noprint` option suppresses the usual printed output — the statistics are written only to `hgb_stats`
Inspect `hgb_stats` — it should contain a single row with one column per requested statistic
Verify the mean, min, max, and n values against what you observed in the earlier full PROC UNIVARIATE output
Using saved statistics to flag outliers
Once statistics are in a dataset, they can be merged back to the source data to derive flags
A common rule flags values more than two standard deviations above or below the mean as potential outliers
The step below attaches the single-row summary statistics to every subject-level record, in effect a cross join with a one-row table, and then creates the flag
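A minimal sketch of such a step is shown here. It assumes `hgb_stats` contains exactly one row with variables named `hgb_mean` and `hgb_std` (hypothetical names; substitute whatever names were given on the `output` statement). The explicit check for missing `hgb` is an added precaution, because SAS treats a missing numeric value as smaller than any number and would otherwise flag it as a low outlier.

```sas
data lab_flagged;
    /* Read the one-row statistics dataset on the first iteration only */
    if _n_ = 1 then set hgb_stats;
    set lab_results;

    /* Two-standard-deviation band around the mean */
    lower_limit = hgb_mean - 2 * hgb_std;
    upper_limit = hgb_mean + 2 * hgb_std;

    /* 1 = outside the band, 0 = inside, missing when hgb is missing */
    if missing(hgb) then outlier_flag = .;
    else outlier_flag = (hgb < lower_limit or hgb > upper_limit);
run;
```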
The `if _n_ = 1 then set hgb_stats;` pattern reads the single-row statistics dataset on the first iteration only; because variables read by a `set` statement are automatically retained, its values stay available on all subsequent rows, which makes this a standard SAS idiom for combining a summary row with detail data
Inspect `lab_flagged` and check the `outlier_flag` column — subjects whose `hgb` falls outside the two-standard-deviation band should have `outlier_flag = 1`
Also inspect `lower_limit` and `upper_limit` to confirm the band is calculated correctly
Key points to remember
PROC UNIVARIATE provides richer distributional detail than PROC MEANS: moments, quantiles, extreme values, and, when the `normal` option is specified, normality tests
The `id` statement adds a subject identifier next to each value in the Extreme Observations panel
A `by` statement produces a separate set of output tables per group; the data must be pre-sorted by the BY variable
The `output` statement saves named statistics to a dataset — use `noprint` to suppress the printed output when only the dataset is needed
Saved statistics can be combined with detail data using the `if _n_ = 1 then set` pattern to create per-subject flags or other derived values