Sequential row numbering and group size counting within BY groups
What this lesson covers
Previous BY group lessons covered the FIRST. and LAST. automatic variables and how to build per-group summaries
This lesson focuses on two closely related derivations that appear constantly in clinical and analytical SAS programming
The first is assigning a sequential row number within each BY group so every observation knows its position in the group
The second is computing the total size of each BY group and carrying that count back to every row in the group
Both techniques rely on RETAIN to preserve values across observations within the same group
Create sample data
We create a dataset of clinical measurements for multiple subjects across multiple visits
Each subject has a different number of visits so group sizes differ — this makes the counting logic easy to verify
SAS Log
Subject 001 has 4 visits, subject 002 has 2 visits, subject 003 has 5 visits
After this lesson each row will carry its position within the subject's visits and the subject's total visit count
Dataset View
Sequential row numbering within each group
Using RETAIN and FIRST to number rows
The approach is to maintain a counter variable using RETAIN
When FIRST.subject is 1 we reset the counter to 1 because a new group is starting
For every subsequent row in the same group we add 1 to the counter
Without RETAIN the counter would reset to missing at the start of every DATA step iteration
SAS Log
Inspect `measurements_numbered` and confirm that `row_num` starts at 1 for the first visit of each subject and increments by 1 for each subsequent visit
Subject 001 should have row_num values 1, 2, 3, 4, and subject 002 should have 1, 2
Note that `row_num+1` is a sum statement and automatically retains its value — it is equivalent to `retain row_num; row_num=row_num+1;`
Dataset View
Alternative: using a sum statement directly
The sum statement (`variable+expression`) is a shorthand that both retains and increments the variable
Combining it with FIRST detection gives a very compact form of the row counter pattern
Initialising to 0 using `RETAIN row_num2 0` and then resetting at each new group is equivalent to the previous code
SAS Log
The output is identical to the previous DATA step because the sum statement on `row_num2` implicitly retains its value between iterations
Either form is acceptable — choose the one that reads most clearly in your programs
Group size counting - how many rows are in each group
Two-pass approach using PROC FREQ or PROC SQL
The simplest way to add a group size variable to every row is a two-step approach
In the first step we compute the count per group using PROC FREQ or PROC SQL and save it as a summary dataset
In the second step we merge that count back onto the original data
This approach is readable and works well when the data is already sorted
SAS Log
Inspect `measurements_with_n` and confirm that `n_visits` is the same value for every row belonging to the same subject
Subject 001 should have n_visits=4, subject 002 n_visits=2, subject 003 n_visits=5 on all their rows
Dataset View
One-pass approach using RETAIN and LAST
When avoiding an extra sort and merge is important, the count can be derived in a single DATA step pass using RETAIN
The idea: use the row counter from earlier to build up a count, then at LAST.subject that counter equals the group size
However, the problem is that earlier rows are already written to the output before the final count is known
The solution is to first count using a LAST-only output pass, saving the group size to a separate dataset, then merge back — or to use the DOW (DO-WHILE) loop pattern
The approach below uses a clean two-step within a single DATA step using DOW loop logic to count first then output with the total attached to every row
SAS Log
This DOW loop pattern reads the group twice: once to count, once to output with the count attached
The first loop counts rows until LAST.subject and retains the final count
The second loop re-reads the same group and outputs every row with `n_visits` already known
This is an advanced technique — review the output and confirm it matches `measurements_with_n` from the simpler approach above
Combined row number and group size in one dataset
For many practical purposes both the sequential row number and the group total size are needed together
The simplest production-grade approach is to combine the PROC FREQ merge method for group size with the RETAIN counter for row number in one DATA step
SAS Log
Inspect `measurements_full` and confirm that each row has both `row_num` (position within the subject) and `n_visits` (total visits for that subject)
For subject 003 the last row should show row_num=5 and n_visits=5
These two variables together are useful for identifying first and last observations, flagging specific positions, or computing per-subject percentages
Dataset View
Key points to remember
Sequential row numbering within groups uses RETAIN with a counter that resets to 1 at FIRST.byvar and increments by 1 for all subsequent rows
The sum statement (`var+1`) is a compact alternative that automatically retains the value — set it back to 1 at FIRST.byvar for group-relative numbering
Group size counting is most cleanly done with a PROC FREQ or PROC SQL summary followed by a merge back onto the detail data
Advanced users can use the DOW loop pattern to count and output in one DATA step but the two-step merge is more transparent and easier to debug
Both `row_num` and `n_visits` together enable common clinical programming patterns such as identifying the last non-missing observation or computing a within-subject percentile