*Copyright @ www.mycsg.in;

What is PROC UNIVARIATE

PROC UNIVARIATE produces a detailed statistical summary of one or more numeric variables
Compared to PROC MEANS, which focuses on basic summary statistics, PROC UNIVARIATE also reports percentiles and extreme values and, on request, produces tests for normality, histograms, and probability plots
It is especially useful for understanding the distribution of a variable — whether it is symmetric, skewed, or has extreme outliers
In clinical programming, PROC UNIVARIATE is commonly used for verifying value ranges, identifying outliers, and assessing whether a continuous endpoint meets normality assumptions before statistical testing

Create sample data

The dataset `lab_results` contains simulated haemoglobin values for a small group of subjects
A few values are intentionally set at the extremes to make the percentile and extreme-value output more informative

data lab_results;
    infile datalines truncover;
    input usubjid $ sex $ age hgb;
datalines;
1001 F 45 12.1
1002 M 52 14.8
1003 F 38 11.5
1004 M 61 16.2
1005 F 29 13.0
1006 M 47 15.5
1007 F 55  9.8
1008 M 33 17.1
1009 F 42 12.8
1010 M 58 14.0
;
run;

`hgb` is the haemoglobin value in g/dL — the dataset has ten subjects with a deliberate low value (9.8) and a high value (17.1) to exercise the extreme-values section of the output
Inspect the dataset and note the range of values before running PROC UNIVARIATE

Basic PROC UNIVARIATE output

The default PROC UNIVARIATE output contains several panels for each variable
The Moments panel shows the mean, standard deviation, variance, skewness, kurtosis, and sample size
The Basic Statistical Measures panel shows mean, median, mode, standard deviation, variance, and range
The Tests for Location panel tests whether the mean or median differs significantly from zero
The Quantiles panel shows percentiles including the minimum, maximum, and quartiles
The Extreme Observations panel lists the five smallest and five largest values with their observation numbers

proc univariate data=lab_results;
    var hgb;
run;

Review the Quantiles section — the minimum should be 9.8 and the maximum 17.1
The median (50th percentile) of ten values is the average of the fifth and sixth ranked values, here 13.0 and 14.0, so verify that it is 13.5
The Extreme Observations section lists observation 7 (subject 1007, 9.8) as the lowest value and observation 8 (subject 1008, 17.1) as the highest, so confirm this matches the source data
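The normality tests and distribution plots mentioned in the introduction are not part of the default output shown above. A minimal sketch of how they might be requested follows, assuming ODS Graphics (or SAS/GRAPH) is available in your session for the plots. With only ten observations the test results are illustrative rather than conclusive.

```sas
/* Sketch: request normality tests and distribution plots for hgb.    */
/* The NORMAL option adds the Tests for Normality table to the output. */
proc univariate data=lab_results normal;
    var hgb;
    /* Histogram with a fitted normal curve overlaid */
    histogram hgb / normal;
    /* Q-Q plot with a reference line based on the estimated mean and SD */
    qqplot hgb / normal(mu=est sigma=est);
run;
```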

Adding an ID variable to extreme observations

By default, the Extreme Observations panel identifies rows by their observation number
The `id` statement adds a chosen variable to the panel alongside the observation number, making it much easier to trace extreme values back to specific subjects
In clinical data, you would typically use the subject identifier variable as the ID

proc univariate data=lab_results;
    var hgb;
    id usubjid;
run;

The Extreme Observations panel now shows the `usubjid` value next to each extreme value, in addition to the row number
Confirm that the five lowest and five highest values are identified by subject ID, which is far more useful than a row number alone when investigating outliers
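If only the extreme values are of interest, the output can be narrowed down. The sketch below assumes you want the three lowest and three highest values rather than the default five. It uses the `nextrobs=` procedure option and an `ods select` statement that names the `ExtremeObs` output table. The value 3 is an arbitrary choice for illustration.

```sas
/* Sketch: print only the Extreme Observations table, limited to 3 rows per tail. */
/* ODS SELECT applies to the next procedure step only.                            */
ods select ExtremeObs;

proc univariate data=lab_results nextrobs=3;
    var hgb;
    id usubjid;
run;
```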

Running PROC UNIVARIATE by group

A `by` statement produces a separate PROC UNIVARIATE output panel for each group
The data must be sorted by the BY variable before the procedure runs
This is useful when you need distribution summaries broken out by sex, treatment group, or visit

proc sort data=lab_results;
    by sex;
run;
 
proc univariate data=lab_results;
    var hgb;
    id usubjid;
    by sex;
run;

Two separate output sections are produced — one for female subjects and one for male subjects
Compare the medians and ranges between the two groups — the female group includes subject 1007 with the low value of 9.8 which will pull the female minimum down
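As an alternative to the `by` statement, a `class` statement can also produce per-group summaries and does not require the data to be sorted first. A minimal sketch, offered as one possible variation rather than a replacement for the approach above:

```sas
/* Sketch: per-group output using CLASS instead of BY. */
/* No prior PROC SORT is needed for a CLASS variable.  */
proc univariate data=lab_results;
    class sex;
    var hgb;
    id usubjid;
run;
```

BY processing remains the usual choice when the same grouping also drives later steps that expect sorted data.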

Saving key statistics to an output dataset

The `output` statement saves selected statistics from PROC UNIVARIATE into a new dataset
Each statistic keyword is followed by `=` and the name to give that variable in the output dataset
Commonly saved statistics include `mean=`, `median=`, `std=`, `min=`, `max=`, `q1=` (25th percentile), and `q3=` (75th percentile)
Saving statistics to a dataset is essential when you need to use them programmatically — for example to flag values more than two standard deviations from the mean

proc univariate data=lab_results noprint;
    var hgb;
    output out=hgb_stats
           mean   = hgb_mean
           median = hgb_median
           std    = hgb_std
           min    = hgb_min
           max    = hgb_max
           q1     = hgb_q1
           q3     = hgb_q3
           n      = hgb_n;
run;

The `noprint` option suppresses the usual printed output — the statistics are written only to `hgb_stats`
Inspect `hgb_stats` — it should contain a single row with one column per requested statistic
Verify the mean, min, max, and n values against what you observed in the earlier full PROC UNIVARIATE output
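Percentiles other than the quartiles can be saved in the same way. The sketch below assumes the 5th and 95th percentiles are wanted. The dataset name `hgb_pctls` and the prefix `hgb_p` are hypothetical names chosen for illustration. `pctlpts=` lists the percentiles and `pctlpre=` supplies the variable-name prefix, so the output variables would be `hgb_p5` and `hgb_p95`.

```sas
/* Sketch: save the 5th and 95th percentiles of hgb to a dataset.  */
/* hgb_pctls and the hgb_p prefix are illustrative names only.     */
proc univariate data=lab_results noprint;
    var hgb;
    output out=hgb_pctls
           pctlpts = 5 95
           pctlpre = hgb_p;
run;

/* Print the single-row result to inspect it */
proc print data=hgb_pctls noobs;
run;
```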

Using saved statistics to flag outliers

Once statistics are in a dataset, they can be merged back to the source data to derive flags
A common rule flags values more than two standard deviations above or below the mean as potential outliers
The step below merges the summary statistics into the subject-level data using a cross-join pattern and then creates the flag

data lab_flagged;
    if _n_ = 1 then set hgb_stats;
    set lab_results;
    lower_limit = hgb_mean - 2 * hgb_std;
    upper_limit = hgb_mean + 2 * hgb_std;
    outlier_flag = (hgb lt lower_limit or hgb gt upper_limit);
run;

The `if _n_ = 1 then set hgb_stats;` pattern reads the single-row statistics dataset once and retains its values across all subsequent rows — this is a standard SAS idiom for merging a summary row into detail data
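Inspect `lab_flagged` and check the `outlier_flag` column: subjects whose `hgb` falls outside the two-standard-deviation band have `outlier_flag = 1` and all others have `0`
Also inspect `lower_limit` and `upper_limit` to confirm the band is calculated correctly; for this sample the mean is 13.68 and the standard deviation is about 2.27, giving a band of roughly 9.15 to 18.21, so even 9.8 and 17.1 fall inside it and no subject is flagged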
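One caveat with the comparison above: SAS treats a missing numeric value as smaller than any number, so a subject with a missing `hgb` would satisfy `hgb lt lower_limit` and be flagged as an outlier. The sample data has no missing values, but real lab data often does. The sketch below shows one possible guard. The dataset name `lab_flagged_safe` is hypothetical, and leaving the flag missing for missing results is an assumption about the desired behaviour.

```sas
/* Sketch: same two-SD rule, but the flag stays missing when hgb is missing. */
/* lab_flagged_safe is an illustrative dataset name.                         */
data lab_flagged_safe;
    if _n_ = 1 then set hgb_stats;
    set lab_results;
    lower_limit = hgb_mean - 2 * hgb_std;
    upper_limit = hgb_mean + 2 * hgb_std;
    /* Only evaluate the rule when a result is present */
    if not missing(hgb) then
        outlier_flag = (hgb lt lower_limit or hgb gt upper_limit);
run;

/* Count flagged, unflagged, and missing-flag records */
proc freq data=lab_flagged_safe;
    tables outlier_flag / missing;
run;
```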

Key points to remember

PROC UNIVARIATE provides richer distributional detail than PROC MEANS: moments, quantiles, extreme values, and, when requested with the `normal` option, normality tests
The `id` statement adds a subject identifier to the Extreme Observations panel alongside the observation number
A `by` statement produces separate output panels per group — data must be pre-sorted
The `output` statement saves named statistics to a dataset — use `noprint` to suppress the printed output when only the dataset is needed
Saved statistics can be merged back to detail data using the `if _n_ = 1 then set` pattern to derive per-subject flags or derived values
