*Copyright @ www.mycsg.in;

GROUP BY with aggregate functions and HAVING in PROC SQL

What does GROUP BY do

Previous PROC SQL lessons covered SELECT, WHERE, ORDER BY, joining datasets, and subqueries
`GROUP BY` groups rows that share the same value in one or more columns and reduces each group to a single summary row
Aggregate functions such as `COUNT`, `SUM`, `MEAN`, `MIN`, and `MAX` compute one value per group
Every column in the SELECT list must either appear in the GROUP BY clause or be wrapped in an aggregate function
The HAVING clause filters the grouped results in the same way WHERE filters individual rows — but HAVING operates after aggregation

Create sample data

We create a sales transactions dataset with multiple rows per department and per region
This data will be used to demonstrate counts, sums, averages, and HAVING filters across different groupings

data transactions;
    input dept $ region $ amount units;
    datalines;
Marketing East  1200 10
Marketing East  1500 12
Marketing West   900  8
Marketing West  1100 10
Sales     East  2000 20
Sales     East  1800 18
Sales     West  2200 22
Sales     West  1600 16
Support   East   500  5
Support   West   450  4
;
run;

Confirm that `transactions` has 10 rows: 4 for Marketing, 4 for Sales, and 2 for Support
Each row represents a transaction with a department, region, amount, and number of units

Basic GROUP BY - count and sum per group

Count rows per department

`COUNT(*)` counts all rows in each group regardless of missing values
The result has one row per distinct department value
The column alias after `as` gives the computed column a readable name in the output

proc sql;
    create table dept_counts as
    select dept,
           count(*) as n_transactions
    from transactions
    group by dept;
quit;

`dept_counts` should have 3 rows: Marketing with 4, Sales with 4, Support with 2

Sum and average amount per department

Multiple aggregate functions can appear in the same SELECT statement
Here we compute the total amount, average amount, and total units for each department in a single query

proc sql;
    create table dept_summary as
    select dept,
           count(*)       as n_transactions,
           sum(amount)    as total_amount,
           mean(amount)   as avg_amount,
           sum(units)     as total_units
    from transactions
    group by dept;
quit;

Review `dept_summary` and confirm the totals match your mental arithmetic against the raw data
Marketing should have total amount 4700, Sales 7600, Support 950

Group by multiple columns - department and region

Listing multiple columns in GROUP BY creates one output row for each unique combination of those columns
With two grouping columns (dept and region) we expect up to 6 combinations from the sample data

proc sql;
    create table dept_region_summary as
    select dept,
           region,
           count(*)    as n_transactions,
           sum(amount) as total_amount
    from transactions
    group by dept, region;
quit;

Inspect `dept_region_summary` and verify it has 6 rows: one for each department-region combination
The total for Marketing East should be 2700, Marketing West 2000, and so on

HAVING - filter groups after aggregation

Why WHERE cannot be used to filter group totals

WHERE filters individual rows before any grouping occurs
This means WHERE cannot reference aggregated values such as `sum(amount)` because those values do not exist yet at the WHERE stage
HAVING is applied after GROUP BY and can reference any aggregate expression
Think of WHERE as a row-level filter and HAVING as a group-level filter

Use HAVING to keep only departments above a total threshold

In this example, we want only departments where the total amount exceeds 4000
This cannot be done with WHERE because the total amount is only known after aggregation

proc sql;
    create table high_value_depts as
    select dept,
           sum(amount) as total_amount
    from transactions
    group by dept
    having sum(amount) gt 4000;
quit;

`high_value_depts` should contain only Marketing (4700) and Sales (7600) — Support (950) is excluded
The HAVING condition references `sum(amount)` directly, not the alias `total_amount` — SAS requires the aggregate expression itself in HAVING

Combine WHERE and HAVING in the same query

WHERE and HAVING can be used together — WHERE filters rows before grouping, and HAVING filters groups after aggregation
In this example, we exclude the Support department entirely using WHERE and then further filter groups by transaction count using HAVING

proc sql;
    create table east_dept_counts as
    select dept,
           region,
           count(*) as n_transactions
    from transactions
    where region='East'
    group by dept, region
    having count(*) ge 2;
quit;

WHERE first keeps only East rows (5 rows remain)
GROUP BY then groups those 5 rows by department and region
HAVING then keeps only groups with 2 or more transactions
Inspect the result and confirm Support East (which has only 1 East row) is excluded

MIN and MAX within groups

`MIN` and `MAX` compute the smallest and largest values within each group
This is useful for finding the lowest and highest transaction values per department without sorting the raw data

proc sql;
    create table dept_range as
    select dept,
           min(amount) as min_amount,
           max(amount) as max_amount,
           max(amount) - min(amount) as amount_range
    from transactions
    group by dept;
quit;

The `amount_range` column is a derived column computed from two aggregate expressions
Verify that for Marketing, the range is 600 (1500 minus 900), and for Sales the range is 600 (2200 minus 1600)

Key points to remember

`GROUP BY` reduces each group of rows to a single summary row — every SELECT column must either be in GROUP BY or wrapped in an aggregate function
Common aggregate functions: `COUNT(*)` counts rows, `SUM` totals values, `MEAN` averages values, `MIN` and `MAX` find extremes
HAVING filters groups after aggregation using aggregate expressions — it is the group-level equivalent of WHERE
WHERE and HAVING can appear in the same query: WHERE filters before grouping, HAVING filters after
In PROC SQL, HAVING must reference the aggregate expression itself such as `sum(amount)` rather than the column alias such as `total_amount`
Use `create table` with GROUP BY queries to save summary results as SAS datasets for further use

*Copyright @ www.mycsg.in;

GROUP BY with aggregate functions and HAVING in PROC SQL

What does GROUP BY do

Create sample data

SAS Log

Dataset View

Basic GROUP BY - count and sum per group

Count rows per department

SAS Log

Dataset View

Sum and average amount per department

SAS Log

Dataset View

Group by multiple columns - department and region

SAS Log

Dataset View

HAVING - filter groups after aggregation

Why WHERE cannot be used to filter group totals

Use HAVING to keep only departments above a total threshold

SAS Log

Dataset View

Combine WHERE and HAVING in the same query

SAS Log

Dataset View

MIN and MAX within groups

SAS Log

Dataset View

Key points to remember