*Copyright @ www.mycsg.in;

Merging - Part 04: UPDATE Statement and Interleaving Datasets

Overview

Previous merging lessons covered: L101 (SET-based appending), L102 (MERGE with IN= flags), L103 (one-to-many merges and many-to-many hazards).
This lesson covers two further DATA step combining operations: the UPDATE statement and interleaving with SET and BY.
UPDATE is a targeted overwrite operation distinct from MERGE — it replaces values selectively without touching variables that are not in the transaction dataset.
Interleaving combines multiple datasets into one by reading observations in BY-variable order rather than concatenating them in sequence.

The UPDATE Statement

The UPDATE statement reads one BASE dataset and one TRANSACTION dataset. For each matching BY-key in the transaction, non-missing values in the transaction overwrite the corresponding values in the base record.
The critical distinction from MERGE: UPDATE only overwrites values that are non-missing in the transaction. If a transaction variable is missing (blank), the base value is kept unchanged.
MERGE with a matching BY key always writes the transaction value, even if that value is missing — which can overwrite good data with blanks. UPDATE avoids this problem entirely.
Both datasets must be sorted by the BY variable. The BASE dataset must have unique BY values (one row per key). The TRANSACTION dataset may have multiple rows per key — the last non-missing value per variable wins.

*------------------------------------------------------------------------------;
* Base dataset: master subject record ;
*------------------------------------------------------------------------------;
data work.base;
    input subject $ site $ country $ age weight;
    datalines;
001 S01 USA  34 72.5
002 S01 USA  28 65.0
003 S02 GBR  45 80.1
004 S02 GBR  52 91.3
;
run;
 
*------------------------------------------------------------------------------;
* Transaction dataset: corrections and updates ;
*------------------------------------------------------------------------------;
data work.transactions;
    input subject $ site $ country $ age weight;
    datalines;
002 .   .    .  67.5
003 S02 GBR  .  .
004 .   .    53 .
;
run;
 
proc sort data=work.base;        by subject; run;
proc sort data=work.transactions; by subject; run;
 
data work.updated;
    update work.base work.transactions;
    by subject;
run;

Subject 002: only weight is non-missing in the transaction (67.5). The UPDATE statement replaces weight, but leaves site, country, and age from the base record unchanged.
Subject 003: transaction has site and country populated but age and weight are missing. Base values for age and weight are preserved; site and country are overwritten (same values here, so no visible change).
Subject 004: only age (53) is non-missing in the transaction. Weight stays at 91.3 from the base.
Subject 001 has no transaction row — all base values carry through unchanged.
Check WORK.UPDATED: compare it side-by-side with WORK.BASE to confirm only the intended values changed and all others are preserved.

UPDATE vs MERGE - the Key Difference

To reinforce why UPDATE is preferred for corrections, the following example shows what a MERGE would produce instead.
MERGE overwrites with missing values — UPDATE does not. This matters when the transaction dataset has incomplete records.

data work.merged_result;
    merge work.base work.transactions;
    by subject;
run;

Check WORK.MERGED_RESULT and compare it to WORK.UPDATED.
For subject 002, the MERGE result will have missing values for site and country (because the transaction row had missing values for those fields). WORK.UPDATED preserves them.
For subject 004, the MERGE result will have missing weight because the transaction row had a missing weight. WORK.UPDATED preserves the base weight of 91.3.
This confirms that UPDATE is the appropriate choice when you want to apply selective corrections without risk of overwriting non-missing data with blanks.

Interleaving Datasets with SET and BY

Concatenation (SET without BY) appends datasets one after another in the order they are listed: all rows from dataset 1, then all rows from dataset 2, and so on.
Interleaving (SET with BY) reads from all listed datasets simultaneously, outputting observations in the sorted order of the BY variable — similar to a merge sort.
The result of interleaving is a single dataset ordered by the BY variable, as if all source datasets had been concatenated and then sorted — but without running a separate PROC SORT step.
All datasets in the SET statement must already be sorted by the BY variable. The interleaved output is guaranteed to be in BY order.

*------------------------------------------------------------------------------;
* Three separate datasets - each sorted by subject ;
*------------------------------------------------------------------------------;
data work.site_a;
    input subject $ visit result;
    datalines;
001 1 10
003 1 14
003 2 12
;
run;
 
data work.site_b;
    input subject $ visit result;
    datalines;
002 1  9
002 2 11
004 1 16
;
run;
 
data work.site_c;
    input subject $ visit result;
    datalines;
001 2 13
004 2 15
005 1 18
;
run;
 
*------------------------------------------------------------------------------;
* Concatenation - rows appear in dataset order ;
*------------------------------------------------------------------------------;
data work.concatenated;
    set work.site_a work.site_b work.site_c;
run;
 
*------------------------------------------------------------------------------;
* Interleaving - rows appear in BY order across all datasets ;
*------------------------------------------------------------------------------;
data work.interleaved;
    set work.site_a work.site_b work.site_c;
    by subject;
run;

WORK.CONCATENATED will have rows from site_a first, then site_b, then site_c — subjects appear in arrival order, not subject number order.
WORK.INTERLEAVED will have all rows sorted by subject: 001 rows first, then 002, 003, 004, 005 — regardless of which source dataset they came from.
Confirm both datasets have the same total observation count (9 rows). The difference is purely in row order.
Interleaving is most valuable when you receive data from multiple sites or batches and want a single subject-ordered dataset without an extra PROC SORT step.

Interleaving with FIRST. and LAST. for BY-Group Logic

When you interleave with a BY statement, the FIRST. and LAST. automatic variables are available, just as they are in a standard DATA step with SET and BY.
This allows you to apply BY-group logic (such as counting visits per subject or flagging the first record) on the interleaved output in a single pass.

data work.interleaved_grouped;
    set work.site_a work.site_b work.site_c;
    by subject;
 
    if first.subject then visit_seq = 0;
    visit_seq + 1;
 
    if last.subject then total_visits = visit_seq;
    else total_visits = .;
run;

visit_seq increments within each subject group across all source datasets combined.
total_visits is populated only on the last row per subject — it records the total number of observations for that subject across all three site datasets.
This demonstrates that interleaving with BY is not just a sorting convenience — it also enables efficient single-pass BY-group aggregation.
Inspect WORK.INTERLEAVED_GROUPED to verify subject 001 has 2 visits (one from site_a, one from site_c), subject 002 has 2 visits (both from site_b), and subject 003 has 2 visits (both from site_a).

Key Points

UPDATE overwrites base values only with non-missing transaction values — missing transaction values leave the base value intact. This makes UPDATE the correct choice for applying data corrections.
MERGE always overwrites with the transaction value regardless of whether it is missing — use UPDATE when you cannot guarantee the transaction record is complete.
Both datasets must be sorted by the BY variable before UPDATE, and the base dataset must have unique BY values.
Interleaving (SET with BY) reads multiple sorted datasets simultaneously and outputs rows in BY-variable order — producing a merged sort without a separate PROC SORT step.
FIRST. and LAST. BY-group variables are fully available during interleaving, enabling BY-group logic across multiple source datasets in a single DATA step.

*Copyright @ www.mycsg.in;

Merging - Part 04: UPDATE Statement and Interleaving Datasets

Overview

The UPDATE Statement

SAS Log

Dataset View

UPDATE vs MERGE - the Key Difference

SAS Log

Dataset View

Interleaving Datasets with SET and BY

SAS Log

Dataset View

Interleaving with FIRST. and LAST. for BY-Group Logic

SAS Log

Dataset View

Key Points