Merging - Part 04: UPDATE Statement and Interleaving Datasets
Overview
Previous merging lessons covered: L101 (SET-based appending), L102 (MERGE with IN= flags), L103 (one-to-many merges and many-to-many hazards).
This lesson covers two further DATA step combining operations: the UPDATE statement and interleaving with SET and BY.
UPDATE is a targeted overwrite operation distinct from MERGE — it replaces values selectively without touching variables that are not in the transaction dataset.
Interleaving combines multiple datasets into one by reading observations in BY-variable order rather than concatenating them in sequence.
The UPDATE Statement
The UPDATE statement reads one BASE dataset and one TRANSACTION dataset. For each matching BY-key in the transaction, non-missing values in the transaction overwrite the corresponding values in the base record.
The critical distinction from MERGE: UPDATE only overwrites values that are non-missing in the transaction. If a transaction variable is missing (blank), the base value is kept unchanged.
MERGE with a matching BY key always writes the transaction value, even if that value is missing — which can overwrite good data with blanks. UPDATE avoids this problem entirely.
Both datasets must be sorted by the BY variable. The BASE dataset must have unique BY values (one row per key). The TRANSACTION dataset may have multiple rows per key — the last non-missing value per variable wins.
SAS Log
Subject 002: only weight is non-missing in the transaction (67.5). The UPDATE statement replaces weight, but leaves site, country, and age from the base record unchanged.
Subject 003: transaction has site and country populated but age and weight are missing. Base values for age and weight are preserved; site and country are overwritten (same values here, so no visible change).
Subject 004: only age (53) is non-missing in the transaction. Weight stays at 91.3 from the base.
Subject 001 has no transaction row — all base values carry through unchanged.
Check WORK.UPDATED: compare it side-by-side with WORK.BASE to confirm only the intended values changed and all others are preserved.
Dataset View
UPDATE vs MERGE - the Key Difference
To reinforce why UPDATE is preferred for corrections, the following example shows what a MERGE would produce instead.
MERGE overwrites with missing values — UPDATE does not. This matters when the transaction dataset has incomplete records.
SAS Log
Check WORK.MERGED_RESULT and compare it to WORK.UPDATED.
For subject 002, the MERGE result will have missing values for site and country (because the transaction row had missing values for those fields). WORK.UPDATED preserves them.
For subject 004, the MERGE result will have missing weight because the transaction row had a missing weight. WORK.UPDATED preserves the base weight of 91.3.
This confirms that UPDATE is the appropriate choice when you want to apply selective corrections without risk of overwriting non-missing data with blanks.
Dataset View
Interleaving Datasets with SET and BY
Concatenation (SET without BY) appends datasets one after another in the order they are listed: all rows from dataset 1, then all rows from dataset 2, and so on.
Interleaving (SET with BY) reads from all listed datasets simultaneously, outputting observations in the sorted order of the BY variable — similar to a merge sort.
The result of interleaving is a single dataset ordered by the BY variable, as if all source datasets had been concatenated and then sorted — but without running a separate PROC SORT step.
All datasets in the SET statement must already be sorted by the BY variable. The interleaved output is guaranteed to be in BY order.
SAS Log
WORK.CONCATENATED will have rows from site_a first, then site_b, then site_c — subjects appear in arrival order, not subject number order.
WORK.INTERLEAVED will have all rows sorted by subject: 001 rows first, then 002, 003, 004, 005 — regardless of which source dataset they came from.
Confirm both datasets have the same total observation count (9 rows). The difference is purely in row order.
Interleaving is most valuable when you receive data from multiple sites or batches and want a single subject-ordered dataset without an extra PROC SORT step.
Dataset View
Interleaving with FIRST. and LAST. for BY-Group Logic
When you interleave with a BY statement, the FIRST. and LAST. automatic variables are available, just as they are in a standard DATA step with SET and BY.
This allows you to apply BY-group logic (such as counting visits per subject or flagging the first record) on the interleaved output in a single pass.
SAS Log
visit_seq increments within each subject group across all source datasets combined.
total_visits is populated only on the last row per subject — it records the total number of observations for that subject across all three site datasets.
This demonstrates that interleaving with BY is not just a sorting convenience — it also enables efficient single-pass BY-group aggregation.
Inspect WORK.INTERLEAVED_GROUPED to verify subject 001 has 2 visits (one from site_a, one from site_c), subject 002 has 2 visits (both from site_b), and subject 003 has 2 visits (both from site_a).
Dataset View
Key Points
UPDATE overwrites base values only with non-missing transaction values — missing transaction values leave the base value intact. This makes UPDATE the correct choice for applying data corrections.
MERGE always overwrites with the transaction value regardless of whether it is missing — use UPDATE when you cannot guarantee the transaction record is complete.
Both datasets must be sorted by the BY variable before UPDATE, and the base dataset must have unique BY values.
Interleaving (SET with BY) reads multiple sorted datasets simultaneously and outputs rows in BY-variable order — producing a merged sort without a separate PROC SORT step.
FIRST. and LAST. BY-group variables are fully available during interleaving, enabling BY-group logic across multiple source datasets in a single DATA step.