*Copyright @ www.mycsg.in;

One-to-many merges in SAS

What is a one-to-many merge

Previous lessons covered one-to-one merging where each BY key value appears exactly once in both datasets
In a one-to-many merge, one dataset has exactly one row per BY key value and the other dataset has multiple rows for the same key
The typical real-world example is a header dataset that holds one record per subject and a detail dataset that holds multiple records per subject, such as visits, events, or measurements
SAS handles this automatically in the MERGE statement, but the behaviour of variable values across repeated rows deserves careful attention

Create sample datasets

We create a subjects dataset with one row per subject holding demographic information
We also create a visits dataset with multiple rows per subject representing repeated assessments
Both datasets share the key variable `subject` and will be sorted before merging

data subjects;
    input subject $ age sex $;
    datalines;
001 28 M
002 35 F
003 22 F
004 41 M
;
run;
 
data visits;
    input subject $ visit $ result;
    datalines;
001 VISIT1 10
001 VISIT2 12
001 VISIT3 15
002 VISIT1  8
002 VISIT2 11
003 VISIT1  9
003 VISIT2 13
003 VISIT3 14
003 VISIT4 16
;
run;

`subjects` has 4 rows — one per subject including subject 004 who has no visits
`visits` has 9 rows — subjects 001 and 003 have three and four visits respectively, subject 002 has two, and subject 004 has none
Inspect both datasets before merging to understand what the expected output should look like

Sort both datasets before merging

Like all BY-variable merges, a one-to-many merge requires both datasets to be sorted by the BY variable(s)
SAS does not sort automatically and will issue an error if the data is not in order

proc sort data=subjects; by subject; run;
proc sort data=visits;   by subject; run;

Perform the one-to-many merge

Basic merge - combine subjects with visits

In a one-to-many merge, SAS reads one row from the single-row dataset and holds it in the PDV while it processes all matching rows from the many-row dataset
The variables from the single-row dataset (`age`, `sex`) carry forward to every matching visit row automatically
This is why merging a header onto a detail file is so convenient — you do not need to explicitly retain the header values

data subjects_with_visits;
    merge subjects(in=in_s) visits(in=in_v);
    by subject;
run;

Inspect `subjects_with_visits` and confirm that each visit row now carries the `age` and `sex` values from the corresponding subject
Subject 004, who appears in `subjects` but has no visits, should appear as one row in the output with missing values for `visit` and `result`
This happens because SAS outputs one row per observation in the more-frequent dataset, and for subjects with no visits it outputs one row from the subjects side with the visit variables blank

Keep only rows that exist in both datasets

If you want only subjects who have at least one visit, use the `IN=` flags to filter the output
Subjects with no visits are excluded by requiring `in_v=1`

data matched_only;
    merge subjects(in=in_s) visits(in=in_v);
    by subject;
    if in_s and in_v;
run;

Subject 004 should now be absent from `matched_only` because that subject has no visits in the visits dataset
Confirm the row count: 9 rows matching the 9 rows in `visits`

What happens when both datasets have repeated BY key values

Sequential pairing - not a cartesian product

When both input datasets contain more than one row for the same BY key value, SAS does not produce a cartesian product
Instead, SAS pairs rows one by one within the group: the first row from each dataset, then the second row from each, and so on
When one dataset runs out of rows for a key, the last-seen variable values from that dataset are retained for the remaining pairings
The number of output rows for a BY group equals the maximum number of rows that group has in either input dataset
SAS issues the WARNING: `MERGE statement has more than one data set with repeats of BY values` when this situation is detected
The result is not a cartesian join explosion, but it is almost always wrong because retained variable values silently carry over into rows where they do not belong

data events_a;
    input subject $ event_a $;
    datalines;
001 EVA1
001 EVA2
002 EVA1
;
run;
 
data events_b;
    input subject $ event_b $;
    datalines;
001 EVB1
001 EVB2
001 EVB3
002 EVB1
;
run;
 
proc sort data=events_a; by subject; run;
proc sort data=events_b; by subject; run;
 
data many_to_many_demo;
    merge events_a events_b;
    by subject;
run;

Run this code and check the SAS log for the MERGE warning about repeated BY values
Inspect `many_to_many_demo` — for subject 001, SAS produces 3 rows, which is the maximum of 2 (from `events_a`) and 3 (from `events_b`)
The pairing is: EVA1 with EVB1, EVA2 with EVB2, then EVA2 retained with EVB3 because `events_a` has no third row for subject 001
Subject 002 produces 1 row because both datasets have exactly one row for it
Total output rows: 4 — but the EVA2 value appearing twice in subject 001 is not intentional data and is a silent error
This is the core danger: wrong values appear without any error, only a WARNING in the log
When a true many-to-many relationship exists and all combinations are required, use PROC SQL with an explicit join on the key variable instead of DATA step MERGE

Safe approach - verify one-to-many assumption before merging

Before performing a one-to-many merge, it is good practice to verify that the header dataset truly has one row per BY key
PROC SORT with the `NODUPKEY` option can enforce uniqueness on the header side and will delete duplicate keys if any exist, printing a note to the log
An even safer approach is to check for duplicates first using PROC FREQ or a DATA step before merging

proc sort data=subjects nodupkey;
    by subject;
run;

If duplicates are found, SAS prints a note stating how many observations were deleted
If no duplicates exist, the step runs silently and the data is unchanged
After confirming the header is unique, proceed with the one-to-many merge with confidence

Key points to remember

A one-to-many merge combines a unique-key header dataset with a detail dataset that has multiple rows per key
Sort both datasets by the BY variable before merging
SAS automatically carries header variable values across all matching detail rows, so you do not need RETAIN for this purpose
Subjects in the header but absent from the detail file produce one output row with missing values for detail variables unless filtered with IN= flags
If both datasets have repeated BY values, SAS pairs rows sequentially and retains the last value from the shorter side — this is not a cartesian product but produces wrong data silently; use PROC SQL for true many-to-many relationships
Use `proc sort nodupkey` or a pre-merge uniqueness check to confirm the one-to-many assumption before production runs

*Copyright @ www.mycsg.in;

One-to-many merges in SAS

What is a one-to-many merge

Create sample datasets

SAS Log

Dataset View

Sort both datasets before merging

SAS Log

Perform the one-to-many merge

Basic merge - combine subjects with visits

SAS Log

Dataset View

Keep only rows that exist in both datasets

SAS Log

Dataset View

What happens when both datasets have repeated BY key values

Sequential pairing - not a cartesian product

SAS Log

Dataset View

Safe approach - verify one-to-many assumption before merging

SAS Log

Key points to remember