*Copyright @ www.mycsg.in;

Sequential row numbering and group size counting within BY groups

What this lesson covers

Previous BY group lessons covered the FIRST. and LAST. automatic variables and how to build per-group summaries
This lesson focuses on two closely related derivations that appear constantly in clinical and analytical SAS programming
The first is assigning a sequential row number within each BY group so every observation knows its position in the group
The second is computing the total size of each BY group and carrying that count back to every row in the group
Both techniques rely on RETAIN to preserve values across observations within the same group

Create sample data

We create a dataset of clinical measurements for multiple subjects across multiple visits
Each subject has a different number of visits so group sizes differ — this makes the counting logic easy to verify

data measurements;
    input subject $ visit $ aval;
    datalines;
001 VISIT1 10
001 VISIT2 12
001 VISIT3 15
001 VISIT4 14
002 VISIT1  8
002 VISIT2 11
003 VISIT1  9
003 VISIT2 13
003 VISIT3 16
003 VISIT4 18
003 VISIT5 17
;
run;
 
proc sort data=measurements; by subject visit; run;

Subject 001 has 4 visits, subject 002 has 2 visits, subject 003 has 5 visits
After this lesson each row will carry its position within the subject's visits and the subject's total visit count

Sequential row numbering within each group

Using RETAIN and FIRST to number rows

The approach is to maintain a counter variable using RETAIN
When FIRST.subject is 1 we reset the counter to 1 because a new group is starting
For every subsequent row in the same group we add 1 to the counter
Without RETAIN the counter would reset to missing at the start of every DATA step iteration

data measurements_numbered;
    set measurements;
    by subject;
 
    retain row_num 0;
 
    if first.subject then row_num=1;
    else row_num+1;
run;

Inspect `measurements_numbered` and confirm that `row_num` starts at 1 for the first visit of each subject and increments by 1 for each subsequent visit
Subject 001 should have row_num values 1, 2, 3, 4, and subject 002 should have 1, 2
Note that `row_num+1` is a sum statement and automatically retains its value — it is equivalent to `retain row_num; row_num=row_num+1;`

Alternative: using a sum statement directly

The sum statement (`variable+expression`) is a shorthand that both retains and increments the variable
Combining it with FIRST detection gives a very compact form of the row counter pattern
Initialising to 0 using `RETAIN row_num2 0` and then resetting at each new group is equivalent to the previous code

data measurements_numbered2;
    set measurements;
    by subject;
 
    if first.subject then row_num2=1;
    else row_num2+1;
run;

The output is identical to the previous DATA step because the sum statement on `row_num2` implicitly retains its value between iterations
Either form is acceptable — choose the one that reads most clearly in your programs

Group size counting - how many rows are in each group

Two-pass approach using PROC FREQ or PROC SQL

The simplest way to add a group size variable to every row is a two-step approach
In the first step we compute the count per group using PROC FREQ or PROC SQL and save it as a summary dataset
In the second step we merge that count back onto the original data
This approach is readable and works well when the data is already sorted

proc freq data=measurements noprint;
    tables subject / out=subject_counts(keep=subject count rename=(count=n_visits));
run;
 
data measurements_with_n;
    merge measurements subject_counts;
    by subject;
run;

Inspect `measurements_with_n` and confirm that `n_visits` is the same value for every row belonging to the same subject
Subject 001 should have n_visits=4, subject 002 n_visits=2, subject 003 n_visits=5 on all their rows

One-pass approach using RETAIN and LAST

When avoiding an extra sort and merge is important, the count can be derived in a single DATA step pass using RETAIN
The idea: use the row counter from earlier to build up a count, then at LAST.subject that counter equals the group size
However, the problem is that earlier rows are already written to the output before the final count is known
The solution is to first count using a LAST-only output pass, saving the group size to a separate dataset, then merge back — or to use the DOW (DO-WHILE) loop pattern
The approach below uses a clean two-step within a single DATA step using DOW loop logic to count first then output with the total attached to every row

data measurements_with_n2;
    do until (last.subject);
        set measurements;
        by subject;
        n_visits+1;
    end;
 
    do until (last.subject);
        set measurements;
        by subject;
        output;
    end;
 
    n_visits=n_visits;
    drop n_visits;
 
    retain n_visits;
run;

This DOW loop pattern reads the group twice: once to count, once to output with the count attached
The first loop counts rows until LAST.subject and retains the final count
The second loop re-reads the same group and outputs every row with `n_visits` already known
This is an advanced technique — review the output and confirm it matches `measurements_with_n` from the simpler approach above

Combined row number and group size in one dataset

For many practical purposes both the sequential row number and the group total size are needed together
The simplest production-grade approach is to combine the PROC FREQ merge method for group size with the RETAIN counter for row number in one DATA step

data measurements_full;
    merge measurements subject_counts;
    by subject;
 
    retain row_num 0;
    if first.subject then row_num=1;
    else row_num+1;
run;

Inspect `measurements_full` and confirm that each row has both `row_num` (position within the subject) and `n_visits` (total visits for that subject)
For subject 003 the last row should show row_num=5 and n_visits=5
These two variables together are useful for identifying first and last observations, flagging specific positions, or computing per-subject percentages

Key points to remember

Sequential row numbering within groups uses RETAIN with a counter that resets to 1 at FIRST.byvar and increments by 1 for all subsequent rows
The sum statement (`var+1`) is a compact alternative that automatically retains the value — set it back to 1 at FIRST.byvar for group-relative numbering
Group size counting is most cleanly done with a PROC FREQ or PROC SQL summary followed by a merge back onto the detail data
Advanced users can use the DOW loop pattern to count and output in one DATA step but the two-step merge is more transparent and easier to debug
Both `row_num` and `n_visits` together enable common clinical programming patterns such as identifying the last non-missing observation or computing a within-subject percentile

*Copyright @ www.mycsg.in;

Sequential row numbering and group size counting within BY groups

What this lesson covers

Create sample data

SAS Log

Dataset View

Sequential row numbering within each group

Using RETAIN and FIRST to number rows

SAS Log

Dataset View

Alternative: using a sum statement directly

SAS Log

Group size counting - how many rows are in each group

Two-pass approach using PROC FREQ or PROC SQL

SAS Log

Dataset View

One-pass approach using RETAIN and LAST

SAS Log

Combined row number and group size in one dataset

SAS Log

Dataset View

Key points to remember