*Copyright @ www.mycsg.in;
GROUP BY with aggregate functions and HAVING in PROC SQL
What does GROUP BY do
Previous PROC SQL lessons covered SELECT, WHERE, ORDER BY, joining datasets, and subqueries
`GROUP BY` groups rows that share the same value in one or more columns and reduces each group to a single summary row
Aggregate functions such as `COUNT`, `SUM`, `MEAN`, `MIN`, and `MAX` compute one value per group
Every column in the SELECT list must either appear in the GROUP BY clause or be wrapped in an aggregate function
The HAVING clause filters the grouped results in the same way WHERE filters individual rows — but HAVING operates after aggregation
Create sample data
We create a sales transactions dataset with multiple rows per department and per region
This data will be used to demonstrate counts, sums, averages, and HAVING filters across different groupings
data transactions; input dept $ region $ amount units; datalines; Marketing East 1200 10 Marketing East 1500 12 Marketing West 900 8 Marketing West 1100 10 Sales East 2000 20 Sales East 1800 18 Sales West 2200 22 Sales West 1600 16 Support East 500 5 Support West 450 4 ; run;
Copy Code
View Log
SAS Log
data transactions; input dept $ region $ amount units; datalines; NOTE: The data set WORK.TRANSACTIONS has 10 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds ; run;
Confirm that `transactions` has 10 rows: 4 for Marketing, 4 for Sales, and 2 for Support
Each row represents a transaction with a department, region, amount, and number of units
View Data
Dataset View
Basic GROUP BY - count and sum per group
Count rows per department
`COUNT(*)` counts all rows in each group regardless of missing values
The result has one row per distinct department value
The column alias after `as` gives the computed column a readable name in the output
proc sql; create table dept_counts as select dept, count(*) as n_transactions from transactions group by dept; quit;
Copy Code
View Log
SAS Log
proc sql; create table dept_counts as select dept, count(*) as n_transactions from transactions group by dept; NOTE: Table WORK.DEPT_COUNTS created, with 3 rows and 2 columns. quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
`dept_counts` should have 3 rows: Marketing with 4, Sales with 4, Support with 2
View Data
Dataset View
Sum and average amount per department
Multiple aggregate functions can appear in the same SELECT statement
Here we compute the total amount, average amount, and total units for each department in a single query
proc sql; create table dept_summary as select dept, count(*) as n_transactions, sum(amount) as total_amount, mean(amount) as avg_amount, sum(units) as total_units from transactions group by dept; quit;
Copy Code
View Log
SAS Log
proc sql; create table dept_summary as select dept, count(*) as n_transactions, sum(amount) as total_amount, mean(amount) as avg_amount, sum(units) as total_units from transactions group by dept; NOTE: Table WORK.DEPT_SUMMARY created, with 3 rows and 5 columns. quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Review `dept_summary` and confirm the totals match your mental arithmetic against the raw data
Marketing should have total amount 4700, Sales 7600, Support 950
View Data
Dataset View
Group by multiple columns - department and region
Listing multiple columns in GROUP BY creates one output row for each unique combination of those columns
With two grouping columns (dept and region) we expect up to 6 combinations from the sample data
proc sql; create table dept_region_summary as select dept, region, count(*) as n_transactions, sum(amount) as total_amount from transactions group by dept, region; quit;
Copy Code
View Log
SAS Log
proc sql; create table dept_region_summary as select dept, region, count(*) as n_transactions, sum(amount) as total_amount from transactions group by dept, region; NOTE: Table WORK.DEPT_REGION_SUMMARY created, with 6 rows and 4 columns. quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.01 seconds cpu time 0.00 seconds
Inspect `dept_region_summary` and verify it has 6 rows: one for each department-region combination
The total for Marketing East should be 2700, Marketing West 2000, and so on
View Data
Dataset View
HAVING - filter groups after aggregation
Why WHERE cannot be used to filter group totals
WHERE filters individual rows before any grouping occurs
This means WHERE cannot reference aggregated values such as `sum(amount)` because those values do not exist yet at the WHERE stage
HAVING is applied after GROUP BY and can reference any aggregate expression
Think of WHERE as a row-level filter and HAVING as a group-level filter
Use HAVING to keep only departments above a total threshold
In this example, we want only departments where the total amount exceeds 4000
This cannot be done with WHERE because the total amount is only known after aggregation
proc sql; create table high_value_depts as select dept, sum(amount) as total_amount from transactions group by dept having sum(amount) gt 4000; quit;
Copy Code
View Log
SAS Log
proc sql; create table high_value_depts as select dept, sum(amount) as total_amount from transactions group by dept having sum(amount) gt 4000; NOTE: Table WORK.HIGH_VALUE_DEPTS created, with 2 rows and 2 columns. quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
`high_value_depts` should contain only Marketing (4700) and Sales (7600) — Support (950) is excluded
The HAVING condition references `sum(amount)` directly, not the alias `total_amount` — SAS requires the aggregate expression itself in HAVING
View Data
Dataset View
Combine WHERE and HAVING in the same query
WHERE and HAVING can be used together — WHERE filters rows before grouping, and HAVING filters groups after aggregation
In this example, we exclude the Support department entirely using WHERE and then further filter groups by transaction count using HAVING
proc sql; create table east_dept_counts as select dept, region, count(*) as n_transactions from transactions where region='East' group by dept, region having count(*) ge 2; quit;
Copy Code
View Log
SAS Log
proc sql; create table east_dept_counts as select dept, region, count(*) as n_transactions from transactions where region='East' group by dept, region having count(*) ge 2; NOTE: Table WORK.EAST_DEPT_COUNTS created, with 2 rows and 3 columns. quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
WHERE first keeps only East rows (5 rows remain)
GROUP BY then groups those 5 rows by department and region
HAVING then keeps only groups with 2 or more transactions
Inspect the result and confirm Support East (which has only 1 East row) is excluded
View Data
Dataset View
MIN and MAX within groups
`MIN` and `MAX` compute the smallest and largest values within each group
This is useful for finding the lowest and highest transaction values per department without sorting the raw data
proc sql; create table dept_range as select dept, min(amount) as min_amount, max(amount) as max_amount, max(amount) - min(amount) as amount_range from transactions group by dept; quit;
Copy Code
View Log
SAS Log
proc sql; create table dept_range as select dept, min(amount) as min_amount, max(amount) as max_amount, max(amount) - min(amount) as amount_range from transactions group by dept; NOTE: Table WORK.DEPT_RANGE created, with 3 rows and 4 columns. quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
The `amount_range` column is a derived column computed from two aggregate expressions
Verify that for Marketing, the range is 600 (1500 minus 900), and for Sales the range is 600 (2200 minus 1600)
View Data
Dataset View
Key points to remember
`GROUP BY` reduces each group of rows to a single summary row — every SELECT column must either be in GROUP BY or wrapped in an aggregate function
Common aggregate functions: `COUNT(*)` counts rows, `SUM` totals values, `MEAN` averages values, `MIN` and `MAX` find extremes
HAVING filters groups after aggregation using aggregate expressions — it is the group-level equivalent of WHERE
WHERE and HAVING can appear in the same query: WHERE filters before grouping, HAVING filters after
In PROC SQL, HAVING must reference the aggregate expression itself such as `sum(amount)` rather than the column alias such as `total_amount`
Use `create table` with GROUP BY queries to save summary results as SAS datasets for further use