PROC SQL - Part 06: INNER JOIN, LEFT JOIN, and FULL OUTER JOIN
Overview
SQL joins combine rows from two or more tables based on a matching condition, typically shared key values.
The type of join determines which rows are included in the result when a key exists in one table but not the other.
PROC SQL supports INNER JOIN, LEFT JOIN (also called LEFT OUTER JOIN), RIGHT JOIN, and FULL OUTER JOIN.
This lesson covers the three most commonly used types in practice: INNER JOIN, LEFT JOIN, and FULL OUTER JOIN — with practical clinical data examples for each.
Note: joins in PROC SQL are functionally equivalent to MERGE operations in the DATA step, but use different syntax and have different behaviour when keys are non-unique.
SAS Log
WORK.SUBJECTS has 5 subjects (001-005). Subject 006 does not appear in WORK.SUBJECTS.
WORK.LABS has lab results for subjects 001, 002, 003, and 006. Subject 004 and 005 have no lab results. Subject 006 has labs but no enrolment record.
This setup demonstrates the different join behaviours clearly.
Dataset View
INNER JOIN - Only Matching Rows
An INNER JOIN returns only the rows where the join key exists in both tables. Rows that exist in only one table are excluded entirely.
This is the most restrictive join type — use it when you only want observations that have corresponding records in both datasets.
In clinical data: INNER JOIN subjects to labs to get only enrolled subjects who have at least one lab result.
SAS Log
Only subjects present in BOTH work.subjects AND work.labs appear in the result: subjects 001, 002, 003.
Subjects 004 and 005 are excluded because they have no lab rows. Subject 006 is excluded because it has no subject record.
Subjects 001 and 003 each appear twice because they have two lab results — the subject row is repeated once per matching lab row.
Inspect WORK.INNER_RESULT: it should have 5 rows (001-ALT, 001-AST, 002-ALT, 003-ALT, 003-AST).
Dataset View
LEFT JOIN - All Left Rows, Matching Right Rows
A LEFT JOIN returns all rows from the left (first) table, and the matching rows from the right (second) table. When there is no match in the right table, the right-side columns are set to null (missing).
The row count in the result is at least as large as the left table's row count — it may be larger if the right table has multiple matching rows per key.
Use LEFT JOIN when you want to keep all records from the primary dataset and add information from a secondary dataset where available — subjects with no lab results still appear, with missing lab values.
SAS Log
All 5 subjects from WORK.SUBJECTS appear in the result.
Subjects 001 and 003 appear twice each (two lab rows each); subject 002 appears once (one lab row).
Subjects 004 and 005 appear once each with param = missing and aval = missing — they have no lab results but are retained from the left table.
Subject 006 does not appear — it is not in the left table (work.subjects).
Inspect WORK.LEFT_RESULT: it should have 7 rows. Confirm subjects 004 and 005 are present with missing lab columns.
Dataset View
FULL OUTER JOIN - All Rows from Both Tables
A FULL OUTER JOIN returns all rows from both tables. Where a match exists, columns from both sides are populated. Where no match exists in either direction, the non-matching side's columns are set to null.
Use FULL OUTER JOIN when you want a complete picture of both datasets, including rows that have no match in either — useful for reconciliation and completeness checks.
In PROC SQL, use the COALESCE function on key columns to produce a single unified subject identifier column from both tables.
SAS Log
COALESCE(a.subject, b.subject) ensures the subject column is populated from whichever side has a non-missing value — without this, rows from the right-only side would have a missing subject on the left.
Subject 006 now appears: it has lab rows in WORK.LABS but no enrolment record in WORK.SUBJECTS, so site and trta are missing, and match_status = "Labs only - not enrolled".
Subjects 004 and 005 appear with match_status = "Enrolled - no labs" because they have no lab rows.
Subjects 001-003 appear as "Matched" with all columns populated.
Inspect WORK.FULL_RESULT: it should have 8 rows. Confirm subject 006 appears and match_status is correct for all rows.
Dataset View
Choosing the Right Join Type
INNER JOIN: use when you only want observations that are fully linked across both datasets. Row count will be less than or equal to the smaller dataset.
LEFT JOIN: use when the left dataset is your primary source and you want to add data from the right where available. All left rows are preserved. This is the most common join type in clinical programming.
FULL OUTER JOIN: use for completeness checks and reconciliation — it reveals rows exclusive to either side.
A RIGHT JOIN is the mirror of LEFT JOIN — swap the table order and use LEFT JOIN instead for consistency and readability.
Observation count check: after any join, verify the row count matches your expectation. An unexpected increase often means a non-unique key in one of the tables, creating unintended row multiplication.
SAS Log
Verify: n_inner = 5, n_left = 7, n_full = 8.
If any count is higher than expected, check whether the join key is unique in both tables — a non-unique key will produce one output row for every pair of matching rows, which is rarely intended.
Use PROC SQL with COUNT(DISTINCT key) to check key uniqueness before joining.
Key Points
INNER JOIN keeps only rows with matching keys in both tables — the most restrictive join; unmatched rows from either side are dropped.
LEFT JOIN keeps all rows from the left table, with right-side columns set to missing where there is no match — the standard choice when the left table is the anchor dataset.
FULL OUTER JOIN keeps all rows from both tables, with missing values on whichever side has no match — use for reconciliation and completeness audits.
Use COALESCE on the key column in a FULL OUTER JOIN to produce a single unified identifier that is populated regardless of which side provided the row.
Always verify the output row count after a join — unexpected inflation usually indicates a non-unique join key creating unintended row multiplication.