Program Data Vector - Part 02: RETAIN, DROP/KEEP, and Implicit vs Explicit OUTPUT
Overview
Part 01 introduced the Program Data Vector (PDV) as SAS's internal working row — the area in memory where SAS builds each output observation during DATA step execution.
This lesson goes deeper into three behaviours that directly affect the PDV: the RETAIN statement, the effect of DROP and KEEP on PDV variables, and the difference between implicit and explicit OUTPUT.
Understanding these concepts is essential for writing correct accumulation logic, controlling which variables appear in output, and producing the right number of output rows.
RETAIN and its Effect on the PDV
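By default, SAS resets variables created within the DATA step (by assignment or INPUT statements) to missing at the start of each new iteration, that is, each time execution returns to the top of the step to process the next input row.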
The RETAIN statement overrides this default: any variable listed in RETAIN keeps its value from the previous iteration instead of being reset.
RETAIN is the mechanism behind running totals, row counters, carry-forward values, and any logic that needs memory of a previous row.
Variables read from an input dataset via SET or MERGE are never reset to missing between iterations, with or without RETAIN; they are simply overwritten when the next row is read. RETAIN therefore matters for variables created in the DATA step itself, for example by assignment statements.
The following example demonstrates the difference between a variable with and without RETAIN when accumulating a running total.
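A minimal sketch of such an example is given here. The output dataset and the variables no_retain_total, running_total, and amount follow the names used in this lesson; the input dataset work.sales, the id variable, and the data values are assumptions made for illustration.

```sas
/* Hypothetical input data: dataset name and values are illustrative only */
data work.sales;
    input id amount;
    datalines;
1 100
2 250
3 75
;
run;

data work.running_totals;
    set work.sales;
    retain running_total 0;   /* kept across iterations, initial value 0 */

    /* no_retain_total is reset to missing at the top of every iteration */
    no_retain_total = sum(no_retain_total, amount);

    /* running_total still holds the value from the previous iteration */
    running_total = sum(running_total, amount);
run;
```

With these three rows, no_retain_total is 100, 250, 75 and running_total is 100, 350, 425.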
Inspect WORK.RUNNING_TOTALS carefully.
no_retain_total resets to missing at the start of each iteration, so SUM(missing, amount) simply returns amount itself — you see just the current row amount, not a running total.
running_total correctly accumulates because RETAIN prevents the reset. The initial value 0 given on the RETAIN statement also sets running_total to zero before the first iteration.
The shorthand variable + expression (sum statement) is equivalent to retain variable 0; variable = sum(variable, expression); — it implicitly retains and initialises to zero.
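The sum statement can be sketched as follows. The dataset names, the row_count variable, and the data values are hypothetical; one amount is deliberately missing to show that the sum statement skips missing values in the same way the SUM function does.

```sas
/* Hypothetical input data with one missing amount */
data work.sales;
    input id amount;
    datalines;
1 100
2 .
3 75
;
run;

data work.sum_statement;
    set work.sales;
    running_total + amount;   /* implicit RETAIN, starts at 0, ignores missing amount */
    row_count + 1;            /* a row counter built the same way */
run;
```

Here running_total is 100, 100, 175 and row_count is 1, 2, 3.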
DROP and KEEP - What Stays in the PDV vs What Goes to the Output Dataset
DROP and KEEP control which variables are written to the output dataset, but they do NOT remove a variable from the PDV during processing.
A dropped variable is still present in the PDV and can still be used in assignment statements or conditional logic within the DATA step — it just will not appear in the output dataset.
This distinction matters when you need an intermediate calculation variable that should not be in the final output.
DROP= and KEEP= as dataset options on the input dataset in the SET statement work differently: they stop SAS from reading those variables into the PDV in the first place, which is more efficient for large datasets.
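A minimal sketch of the output-side DROP= pattern is given here. The dataset names work.source and work.result and the variables subject, pre, post, change, and pct_change follow this lesson; the data values and the exact pct_change formula are assumptions.

```sas
/* Hypothetical input data: values are illustrative only */
data work.source;
    input subject $ pre post;
    datalines;
001 120 110
002 135 140
;
run;

/* DROP= on the output dataset: pre and post stay in the PDV during the step */
data work.result (drop=pre post);
    set work.source;
    change     = post - pre;            /* pre and post are still available here */
    pct_change = 100 * change / pre;    /* assumed definition of percent change */
run;
```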
Even though DROP=pre post is specified on the DATA statement, pre and post are still read into the PDV from the SET statement and are available for the change and pct_change calculations.
Check WORK.RESULT — it should contain subject, change, and pct_change only. Pre and post are absent from the output because of the DROP= option.
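If you had written SET work.source (drop=pre post) instead, pre and post would never enter the PDV from the input dataset. The references to them in the calculations would create new, uninitialized variables, so change and pct_change would be missing on every row. Avoid that pattern when you still need the dropped variables for logic.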
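The pattern to avoid can be sketched as below, assuming the same hypothetical work.source data and pct_change formula; the output name work.result_missing is also made up for this illustration.

```sas
/* Hypothetical input data: values are illustrative only */
data work.source;
    input subject $ pre post;
    datalines;
001 120 110
002 135 140
;
run;

/* Pattern to avoid: DROP= on the input dataset removes pre and post before the PDV */
data work.result_missing;
    set work.source (drop=pre post);
    change     = post - pre;            /* pre and post are new, uninitialized variables */
    pct_change = 100 * change / pre;    /* missing as well */
run;
```

The log for this step reports pre and post as uninitialized, and because they were created inside the step they also appear in the output as all-missing columns.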
Implicit vs Explicit OUTPUT
Implicit OUTPUT: by default SAS writes the current state of the PDV to the output dataset at the bottom of the DATA step, once per iteration. This happens automatically without any OUTPUT statement.
Explicit OUTPUT: when you add an OUTPUT statement anywhere in the DATA step, SAS writes the PDV to the output dataset at that exact point. Importantly, the implicit output at the bottom of the step is then suppressed — SAS will only output when it encounters an explicit OUTPUT statement.
Explicit OUTPUT is used to produce more than one output row per input row, to output rows conditionally, or to route rows to multiple output datasets.
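Routing rows to multiple output datasets can be sketched as follows. All dataset names, the amount variable, and the threshold of 100 are hypothetical.

```sas
/* Hypothetical input data: values are illustrative only */
data work.sales;
    input id amount;
    datalines;
1 100
2 250
3 75
;
run;

/* Two output datasets are named on the DATA statement */
data work.large work.small;
    set work.sales;
    if amount >= 100 then output work.large;   /* written to work.large only */
    else output work.small;                    /* written to work.small only */
run;
```

With this data, work.large receives two rows and work.small receives one. Because an OUTPUT statement is present, no row is written automatically at the bottom of the step.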
The following example produces multiple output rows from a single input row by placing an explicit OUTPUT inside a DO loop.
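A minimal sketch of such an expansion is given here. The output dataset work.expanded and the variables subject, day, start_day, and end_day follow this lesson; the input dataset name work.windows is an assumption.

```sas
/* Hypothetical input: one row per subject with a start and end day */
data work.windows;
    input subject $ start_day end_day;
    datalines;
001 1 5
002 3 6
;
run;

data work.expanded;
    set work.windows;
    do day = start_day to end_day;
        output;               /* writes the PDV once per value of day */
    end;
    drop start_day end_day;   /* used by the loop, excluded from the output */
run;
```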
Without the OUTPUT statement inside the DO loop, SAS would output one row per input observation after the loop finishes, and day would hold end_day + 1, the value that caused the loop to stop, rather than the last value processed inside the loop.
With the explicit OUTPUT inside the DO loop, SAS writes one row for every value of day, expanding each input row into multiple output rows.
Check WORK.EXPANDED — subject 001 should produce 5 rows (days 1 through 5), and subject 002 should produce 4 rows (days 3 through 6).
The DROP statement removes start_day and end_day from the output since they are no longer needed once the expansion is done.
Combining RETAIN and Explicit OUTPUT - a Practical Pattern
A common pattern in clinical programming combines RETAIN, explicit OUTPUT, and a DO loop to produce a dataset of planned visit windows from a subject-level start and end date.
This demonstrates all three PDV concepts working together.
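The pattern can be sketched as below. The output dataset work.visit_rows and the variables subject and studyday follow this lesson; the input dataset work.subjects, the date variables startdt, enddt, and visitdt, and the data values are assumptions.

```sas
/* Hypothetical subject-level input with start and end dates */
data work.subjects;
    input subject $ startdt :date9. enddt :date9.;
    format startdt enddt date9.;
    datalines;
001 01JAN2024 04JAN2024
002 10JAN2024 12JAN2024
;
run;

data work.visit_rows;
    set work.subjects;
    retain subject;   /* redundant: variables read by SET are retained automatically */
    do studyday = 1 to enddt - startdt + 1;
        visitdt = startdt + studyday - 1;   /* calendar date for this study day */
        output;                             /* one row per study day */
    end;
    format visitdt date9.;
    drop startdt enddt;
run;
```

With this data, subject 001 produces 4 rows and subject 002 produces 3 rows.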
RETAIN subject is redundant here: subject is read by SET, so it is never reset to missing, and values in the PDV persist across every OUTPUT within a single iteration anyway. It is included only to make the carry-over explicit.
Each iteration of the DO loop assigns a new studyday value before calling OUTPUT, so each output row has a unique studyday.
Check WORK.VISIT_ROWS — each subject should have one row per study day within their observation window.
Key Points
RETAIN prevents SAS from resetting a variable to missing at the start of each DATA step iteration — use it for running totals, counters, and carry-forward values.
Variables created by the sum statement (variable + expression) are implicitly retained and initialised to zero.
DROP and KEEP on the DATA statement control output, not PDV presence — dropped variables are still available for calculations during the step.
DROP= and KEEP= on the SET statement prevent variables from entering the PDV at all. This is efficient, but those variables cannot then be used in calculations.
Explicit OUTPUT suppresses the automatic end-of-step output and writes the PDV at the exact point of the OUTPUT statement — essential for expanding rows or conditional routing.