BY group processing means SAS processes observations in groups formed by one or more variables listed on a BY statement
Each unique value or unique combination of BY variable values becomes one group
This technique is used in procedures as well as in DATA step programming
It is useful when the same logic must be repeated separately for each group, such as each sex, each treatment, each site, or each subject
An important rule is that the input dataset must be sorted by the same BY variables before BY group processing is attempted, unless it is already known to be in that order
Create example datasets for learning BY group processing
We start by copying `sashelp.class` into a simple working dataset named `class01`
We then create a second dataset named `class_detail` that adds two grouping variables named `age_group` and `height_band`
These extra variables help us demonstrate single-variable grouping and multiple-variable grouping in later examples
After running the code, confirm that `class01` and `class_detail` contain the same 19 students with the added derived variables in `class_detail`
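The two steps described above might look like the following minimal sketch. The age labels follow the groups mentioned later in this tutorial (11 to 12 and 13 to 16). The height cutoff of 60 and the `height_band` labels are assumptions made for illustration only.

```sas
/* Copy the sample data into a simple working dataset */
data class01;
    set sashelp.class;
run;

/* Add two character grouping variables.
   The height cutoff and labels are assumed for illustration. */
data class_detail;
    set class01;
    length age_group $8 height_band $12;

    /* Two age groups covering ages 11 through 16 */
    if age <= 12 then age_group = '11 to 12';
    else age_group = '13 to 16';

    /* Two height bands split at an assumed cutoff of 60 */
    if height < 60 then height_band = 'Below 60';
    else height_band = '60 and above';
run;
```

Declaring the lengths up front keeps the longer label from being truncated to the length of whichever value is assigned first.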
`class01` gives us the original student records from `sashelp.class`
`class_detail` gives us two extra character variables that can be used for grouping
Inspect the output datasets and verify that the values of `age_group` and `height_band` are populated for every observation
Why sorting is usually required before a BY statement
SAS expects observations that belong to the same BY group to be adjacent to each other
Sorting arranges the observations in that required order
If the dataset is not sorted in the same order as the BY variables, SAS typically stops the procedure or DATA step with an error saying that the dataset is not sorted
The safest habit is to sort the dataset first, and then use the BY statement with the same variables in the same order
Example of the required sort step before BY group processing
`proc sort` reads `class01` and writes a sorted output dataset named `class_sex_sort`
The `by sex;` statement tells SAS to arrange all female records together and all male records together
After the sort finishes, confirm in the output dataset that all observations for one sex appear together as one block
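A minimal sketch of the sort step just described, assuming `class01` already exists in the WORK library:

```sas
/* Sort class01 by sex and write the result to a new dataset,
   leaving class01 itself unchanged */
proc sort data=class01 out=class_sex_sort;
    by sex;
run;
```

Using `out=` keeps the original dataset in its original order, which is convenient when the unsorted version is still needed elsewhere.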
`class_sex_sort` is now ready for BY group processing by `sex`
Notice that the grouping variable used in the sort step is the same grouping variable that will be used in the next procedure
Syntax only example of an incorrect approach
The following block is intentionally kept as syntax only and is not executed
It shows the type of code that can fail when a BY statement is used on data that was not first sorted by the BY variable
Keep this pattern in mind as a warning, not as a working program
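As an illustration of that pattern, the sketch below applies a BY statement directly to `class01`, which is still in the original order of `sashelp.class` and is not sorted by `sex`. The choice of `proc means` here is an assumption. Any step with `by sex;` on unsorted data would be expected to fail in a similar way.

```sas
/* Syntax only: shown as a warning, not as a working program.
   class01 has not been sorted by sex, so the BY statement
   would be expected to stop with a "not sorted" error. */
proc means data=class01;
    by sex;
    var height weight;
run;
```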
Use BY group processing in a procedure
Here we use `proc means` on the sorted dataset `class_sex_sort`
The `by sex;` statement tells SAS to produce separate statistics for each value of `sex`
`var height weight;` requests descriptive statistics for the numeric variables `height` and `weight` within each group
When you review the results window, you should see one section for females and another section for males
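The procedure call described above can be sketched as follows, assuming `class_sex_sort` was created by the earlier sort step:

```sas
/* Descriptive statistics for height and weight,
   produced separately for each value of sex */
proc means data=class_sex_sort;
    by sex;
    var height weight;
run;
```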
The calculations are no longer for the full dataset as one unit
Instead, SAS repeats the same summary calculation once for each BY group
Confirm that the observation counts for males and females add up to the 19 observations in `class01`
Use multiple variables on the BY statement
BY group processing can be based on more than one variable
In that case, the data must be sorted by all BY variables in the same sequence
Each distinct combination of the listed variables becomes a separate group
In this example, we group first by `sex` and then by `age_group`
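A minimal sketch of this two-variable example, assuming `class_detail` contains the `age_group` variable described earlier. The use of `proc print` for the grouped output is an assumption based on the "printed output" mentioned in this section.

```sas
/* Sort by both BY variables, in the order they will be used */
proc sort data=class_detail out=class_multi_sort;
    by sex age_group;
run;

/* Print one block per unique combination of sex and age_group */
proc print data=class_multi_sort;
    by sex age_group;
run;
```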
The sorted dataset `class_multi_sort` is arranged first by `sex`, and then within each sex by `age_group`
The printed output appears in separate blocks for each unique combination, such as female plus age 11 to 12 or male plus age 13 to 16
Inspect the sorted dataset and confirm that records belonging to the same pair of values are adjacent to each other
Understand FIRST and LAST temporary variables in a DATA step
When a sorted dataset is read with a BY statement in a DATA step, SAS automatically creates a pair of temporary indicators named `FIRST.variable` and `LAST.variable` for each BY variable
`FIRST.variable` is 1 on the first observation in each BY group and 0 on all other observations
`LAST.variable` is 1 on the last observation in each BY group and 0 on all other observations
These temporary variables are not stored in the output dataset unless we copy their values into real variables
This feature is one of the most useful parts of BY group processing because it allows us to reset counters, start totals, and write one final record per group
Create an output dataset that stores the FIRST and LAST flags
We read the already sorted dataset `class_sex_sort`
`by sex;` tells SAS to identify the beginning and ending observation for each sex group
We copy `first.sex` and `last.sex` into regular numeric variables named `first_sex` and `last_sex` so they are kept and can be inspected in the output dataset
After running the code, review the first and last observation within each sex and confirm that the flags are set to 1 only on those boundary rows
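The step described above can be sketched as follows, assuming `class_sex_sort` is already sorted by `sex`:

```sas
/* Copy the temporary BY group indicators into regular variables
   so they are kept in the output dataset */
data class_with_flags;
    set class_sex_sort;
    by sex;
    first_sex = first.sex;   /* 1 on the first row of each sex group, else 0 */
    last_sex  = last.sex;    /* 1 on the last row of each sex group, else 0 */
run;
```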
`class_with_flags` makes the temporary BY group boundaries visible as regular variables
For each value of `sex`, exactly one observation should have `first_sex=1` and exactly one observation should have `last_sex=1`, with 0 on every other row
If a group contains only one observation, then both indicators would be 1 on the same row
Build a group level summary in a DATA step using RETAIN and LAST
This example shows a very common BY group pattern in DATA step programming
We retain a counter across observations within the same group
We reset that counter when a new sex group starts by checking `first.sex`
We increase the counter for each observation in the group
We output only the final row of each group by checking `last.sex`, so that one summary record is written per group
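The four steps above can be sketched as follows, assuming `class_sex_sort` is already sorted by `sex`. The `keep` statement is an addition made here so that the summary dataset holds only the group variable and its count.

```sas
/* One summary record per sex group */
data sex_counts;
    set class_sex_sort;
    by sex;

    retain student_count;                 /* carry the value across observations */
    if first.sex then student_count = 0;  /* reset at the start of each group */
    student_count = student_count + 1;    /* count the current observation */
    if last.sex then output;              /* write only the final row per group */

    keep sex student_count;
run;
```

With the standard 19-row `sashelp.class`, this should give a count of 9 for females and 10 for males.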
`sex_counts` contains one record for females and one record for males
The value of `student_count` shows how many observations belonged to each BY group
Verify that the counts in `sex_counts` match what you see when you inspect `class_sex_sort` and the earlier `proc means` output
Key points to remember
Sort the dataset before BY group processing unless you are working with data that is already guaranteed to be in the correct BY order
The variables listed in the BY statement must match the sort order, with the same variables in the same sequence
Procedures use BY groups to repeat the same analysis separately for each group
DATA steps use BY groups together with `FIRST.` and `LAST.` variables to control row-level logic inside each group
When more than one variable is used in the BY statement, each unique combination of values forms a separate group