*Copyright @ www.mycsg.in;

What is BY group processing in SAS

BY group processing means SAS processes observations in groups formed by one or more variables listed on a BY statement
Each unique value or unique combination of BY variable values becomes one group
This technique is used in procedures as well as in DATA step programming
It is useful when the same logic must be repeated separately for each group, such as each sex, each treatment, each site, or each subject
An important rule is that the input dataset must normally be sorted by the same BY variables, in the same order, before BY group processing is attempted

Create example datasets for learning BY group processing

We start by copying `sashelp.class` into a simple working dataset named `class01`
We then create a second dataset named `class_detail` that adds two grouping variables named `age_group` and `height_band`
These extra variables help us demonstrate single-variable grouping and multiple-variable grouping in later examples
After running the code, confirm that `class01` and `class_detail` contain the same 19 students with the added derived variables in `class_detail`

data class01;
   set sashelp.class;
run;
 
data class_detail;
   set class01;
   length age_group $12 height_band $12;
 
   if age le 12 then age_group="Age 11 to 12";
   else age_group="Age 13 to 16";
 
   if height lt 60 then height_band="Below 60";
   else height_band="60 and above";
run;

`class01` gives us the original student records from `sashelp.class`
`class_detail` gives us two extra character variables that can be used for grouping
Inspect the output datasets and verify that the values of `age_group` and `height_band` are populated for every observation
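
One possible way to carry out this check is a small frequency table, shown here as a minimal sketch rather than a required step
The `missing` option asks `proc freq` to list blank values as their own row, so an unpopulated `age_group` or `height_band` would be visible
The counts in the table should add up to the 19 observations in `class_detail`

```sas
/* Cross-tabulate the two derived variables in list form */
/* MISSING makes blank values appear as a separate row   */
proc freq data=class_detail;
   tables age_group*height_band / list missing;
run;
```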

Why sorting is usually required before a BY statement

SAS expects observations that belong to the same BY group to be adjacent to each other
Sorting arranges the observations in that required order
If the observations are not in the order of the BY variables, many procedures and DATA steps stop with an error stating that the dataset is not sorted in ascending sequence
The safest habit is to sort the dataset first, and then use the BY statement with the exact same variables in the exact same order

Example of the required sort step before BY group processing

`proc sort` reads `class01` and writes a sorted output dataset named `class_sex_sort`
The `by sex;` statement tells SAS to arrange all female records together and all male records together
After the sort finishes, confirm in the output dataset that all observations for one sex appear together as one block

proc sort data=class01 out=class_sex_sort;
   by sex;
run;

`class_sex_sort` is now ready for BY group processing by `sex`
Notice that the grouping variable used in the sort step is the same grouping variable that will be used in the next procedure
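
As an optional check, `proc contents` can be used to see whether SAS has recorded a sort order on the output dataset
This is only a sketch of one way to confirm the sort, and the exact layout of the report can differ between SAS environments
The report should include sort information that names `sex` as the sort variable

```sas
/* Display dataset attributes, including any recorded sort information */
proc contents data=class_sex_sort;
run;
```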

Syntax only example of an incorrect approach

The following block is intentionally kept as syntax only and is not executed
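It shows the type of code that can fail when a BY statement is used on data that was not first sorted by the BY variable, because `class01` keeps the original order of `sashelp.class`, in which male and female records are interleaved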
Keep this pattern in mind as a warning, not as a working program

/*
proc means data=class01;
   by sex;
   var height weight;
run;
*/
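
For comparison, the sketch below shows one alternative that does not depend on sort order
It uses the `class` statement of `proc means` instead of a BY statement, so it is not BY group processing and is included only to show the contrast
The grouped statistics are printed in one combined table rather than in a separate section for each sex

```sas
/* CLASS groups the statistics by sex without requiring sorted input */
proc means data=class01 n min max mean;
   class sex;
   var height weight;
run;
```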

Use BY group processing in a procedure

Here we use `proc means` on the sorted dataset `class_sex_sort`
The `by sex;` statement tells SAS to produce separate statistics for each value of `sex`
`var height weight;` requests descriptive statistics for the numeric variables `height` and `weight` within each group
When you review the results window, you should see one section for females and another section for males

proc means data=class_sex_sort n min max mean;
   by sex;
   var height weight;
run;

The calculations are no longer for the full dataset as one unit
Instead, SAS repeats the same summary calculation once for each BY group
Confirm that the observation counts for males and females together match the total observation count in `class01`
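
If the BY group statistics are needed in a dataset rather than only in the results window, an `output` statement can be added
The following is a minimal sketch, and the names `sex_stats`, `n_height`, `n_weight`, `mean_height`, and `mean_weight` are illustrative choices that do not appear elsewhere in this lesson
The output dataset should contain one observation for each value of `sex`

```sas
/* NOPRINT suppresses the printed report; results go to sex_stats */
proc means data=class_sex_sort noprint;
   by sex;
   var height weight;
   /* Statistic names are matched to the VAR variables in order */
   output out=sex_stats n=n_height n_weight mean=mean_height mean_weight;
run;

/* Review the one-row-per-group summary */
proc print data=sex_stats;
run;
```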

Use multiple variables on the BY statement

BY group processing can be based on more than one variable
In that case, the data must be sorted by all BY variables in the same sequence
Each distinct combination of the listed variables becomes a separate group
In this example, we group first by `sex` and then by `age_group`

proc sort data=class_detail out=class_multi_sort;
   by sex age_group;
run;
 
proc print data=class_multi_sort;
   by sex age_group;
   var name age height weight;
run;

The sorted dataset `class_multi_sort` is arranged first by `sex`, and then within each sex by `age_group`
The printed output appears in separate blocks for each unique combination, such as female plus age 11 to 12 or male plus age 13 to 16
Inspect the sorted dataset and confirm that records belonging to the same pair of values are adjacent to each other
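
A dataset sorted by `sex age_group` is also in order by `sex` alone, because `sex` is the leading sort variable
The short sketch below relies on that fact and reuses `class_multi_sort` with a single BY variable
The reverse does not hold, since a statement such as `by age_group;` on this dataset would not find the age groups in one continuous order

```sas
/* BY sex works here because sex is the first variable of the sort key */
proc means data=class_multi_sort n mean;
   by sex;
   var height;
run;
```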

Understand FIRST and LAST temporary variables in a DATA step

When a sorted dataset is read with a BY statement in a DATA step, SAS automatically creates temporary indicators named `FIRST.variable` and `LAST.variable`
`FIRST.variable` is 1 for the first observation in each BY group and 0 for every other observation
`LAST.variable` is 1 for the last observation in each BY group and 0 for every other observation
These temporary variables are not stored in the output dataset unless we copy their values into real variables
This feature is one of the most useful parts of BY group processing because it allows us to reset counters, start totals, and write one final record per group

Create an output dataset that stores the FIRST and LAST flags

We read the already sorted dataset `class_sex_sort`
`by sex;` tells SAS to identify the beginning and ending observation for each sex group
We copy `first.sex` and `last.sex` into regular numeric variables so they can be inspected in the output dataset
After running the code, review the first and last observation within each sex and confirm that the flags are set to 1 only on those boundary rows

data class_with_flags;
   set class_sex_sort;
   by sex;
 
   first_sex=first.sex;
   last_sex=last.sex;
run;

`class_with_flags` makes the temporary BY group boundaries visible as regular variables
For each value of `sex`, one observation should have `first_sex=1` and one observation should have `last_sex=1`
If a group contains only one observation, then both indicators would be 1 on the same row
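
The same idea extends to more than one BY variable, where SAS creates a `FIRST.` and `LAST.` pair for each variable on the BY statement
The sketch below is illustrative only, and the dataset names `class_two_level` and `class_two_level_flags` are assumptions made for this example
Whenever `first_sex` is 1, `first_age_group` is also 1, because a change in an earlier BY variable starts a new group for every later BY variable

```sas
/* Sort by both grouping variables so the BY statement is valid */
proc sort data=class_detail out=class_two_level;
   by sex age_group;
run;

data class_two_level_flags;
   set class_two_level;
   by sex age_group;

   /* Boundaries of the outer group (sex) */
   first_sex=first.sex;
   last_sex=last.sex;

   /* Boundaries of the inner group (age_group within sex) */
   first_age_group=first.age_group;
   last_age_group=last.age_group;

   keep name sex age_group first_sex last_sex first_age_group last_age_group;
run;
```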

Build a group level summary in a DATA step using RETAIN and LAST

This example shows a very common BY group pattern in DATA step programming
We retain a counter across observations within the same group
We reset that counter when a new sex group starts by checking `first.sex`
We increase the counter for each observation in the group
We output only the final row of each group by checking `last.sex`, so that one summary record is written per group

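data sex_counts;
   set class_sex_sort;
   by sex;
   /* Keep student_count across iterations; the sum statement below also implies this */
   retain student_count;
 
   /* Reset the counter at the start of each sex group */
   if first.sex then student_count=0;
   /* Sum statement: add 1 for the current observation */
   student_count+1;
 
   /* Write one record per group, on its last observation */
   if last.sex then output;
 
   keep sex student_count;
run;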

`sex_counts` contains one record for females and one record for males
The value of `student_count` shows how many observations belonged to each BY group
Verify that the counts in `sex_counts` match what you see when you inspect `class_sex_sort` and the earlier `proc means` output
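
The same reset, accumulate, and output pattern can carry a running total as well as a count
The sketch below derives a mean height per sex, and the names `sex_height_summary`, `height_total`, and `mean_height` are hypothetical names chosen for this example
It assumes that `height` is never missing, which holds for `sashelp.class`, because a missing height would still be counted in `student_count`

```sas
data sex_height_summary;
   set class_sex_sort;
   by sex;

   /* Reset both accumulators at the start of each sex group */
   if first.sex then do;
      student_count=0;
      height_total=0;
   end;

   /* Sum statements retain their variables automatically */
   student_count+1;
   height_total+height;

   /* On the last row of the group, derive the mean and write one record */
   if last.sex then do;
      mean_height=height_total/student_count;
      output;
   end;

   keep sex student_count mean_height;
run;
```

The values of `mean_height` can be compared with the mean of `height` reported for each sex by the earlier `proc means` step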

Key points to remember

Sort the dataset before BY group processing unless you are working with data that is already guaranteed to be in the correct BY order
The variables listed in the BY statement must follow the sort order, which means they must be the leading sort variables in the same sequence
Procedures use BY groups to repeat the same analysis separately for each group
DATA steps use BY groups together with `FIRST.` and `LAST.` variables to control row-level logic inside each group
When more than one variable is used in the BY statement, each unique combination of values forms a separate group
