*Copyright @ www.mycsg.in;
Create a sample dataset
This dataset contains symbols, digit strings, lowercase letters, and uppercase letters
It helps us study how SAS sorts character values by default and how the sort sequence can be modified
Notice that some values, such as `10` and `101`, look numeric, but they are stored as character values in the variable `var`
data series01;
    length var group $10;
    var="%";   group="Symbol";    output;
    var="$";   group="Symbol";    output;
    var="1";   group="Digits";    output;
    var="2";   group="Digits";    output;
    var="10";  group="Digits";    output;
    var="101"; group="Digits";    output;
    var="21";  group="Digits";    output;
    var="11";  group="Digits";    output;
    var="a";   group="Lowcase";   output;
    var="ab";  group="Lowcase";   output;
    var="A";   group="Uppercase"; output;
    var="AB";  group="Uppercase"; output;
    var="b";   group="Lowcase";   output;
    var="B";   group="Uppercase"; output;
run;
SAS Log
NOTE: The data set WORK.SERIES01 has 14 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.01 seconds
Inspect `series01` before sorting and notice that the observations are in creation order, not sorted order
The variable `group` is included only to help you identify what kind of character value each row contains
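One way to carry out this inspection is sketched below. It assumes the DATA step above has already run in the current session. `proc contents` should report `var` with a type of Char, and `proc print` lists the rows in the order they are stored.

```sas
/* Sketch: confirm that var is a character variable. */
proc contents data=series01;
run;

/* List the observations in stored (creation) order. */
proc print data=series01;
run;
```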
Understand the default sort sequence of character variables in SAS
By default, SAS orders character values by the collating sequence of the operating environment: ASCII on Windows and UNIX, EBCDIC on z/OS
On an ASCII system, the values in this example sort as symbols such as `$` and `%` first, then digit strings, then uppercase letters, then lowercase letters
Character digits are compared character by character, not by numeric value
That means `101` sorts before `11`: the first characters match, and at the second position `0` is lower than `1`
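The left-to-right comparison can be checked directly with a small DATA step. This is a minimal sketch; the variable names (`a`, `b`, `rank_dollar`, and so on) are chosen here for illustration and are not part of the original example. The `rank` function returns the position of a character in the session's collating sequence, so on an ASCII system the four ranks should come out as 36, 49, 65, and 97.

```sas
/* Sketch: compare two digit strings as text and look up collating positions. */
data _null_;
    length a b $10;
    a = "101";
    b = "11";

    /* Character comparison runs left to right, one character at a time. */
    if a < b then put a= "sorts before " b=;
    else put b= "sorts before " a=;

    /* RANK gives the position of a character in the collating sequence. */
    rank_dollar = rank("$");
    rank_digit  = rank("1");
    rank_upper  = rank("A");
    rank_lower  = rank("a");
    put rank_dollar= rank_digit= rank_upper= rank_lower=;
run;
```

The results are written to the SAS log. A lower rank means the character sorts earlier under the default sequence.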
Create a dataset named series01_sort01 by sorting series01 in ascending order of var
`proc sort` sorts the observations by the character variable `var`, and `out=` writes the result to a new dataset so that `series01` stays unchanged
Review the sorted output carefully and notice the relative position of symbols, digits, uppercase values, and lowercase values
Also compare the order of `1`, `10`, `101`, `11`, `2`, and `21` to see character-based ordering in action
proc sort data=series01 out=series01_sort01;
    by var;
run;
SAS Log
NOTE: There were 14 observations read from the data set WORK.SERIES01.
NOTE: The data set WORK.SERIES01_SORT01 has 14 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
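The result demonstrates the default collating behavior for character data: on an ASCII system the sorted values are `$`, `%`, `1`, `10`, `101`, `11`, `2`, `21`, `A`, `AB`, `B`, `a`, `ab`, `b`
This example is important because many programmers initially expect character digits to be sorted numerically, but that is not the default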
Modify the sorting sequence of character variables using the SORTSEQ option
The `sortseq=` option of `proc sort` changes how SAS compares character values during sorting
`sortseq=linguistic` applies language-aware collating rules instead of the default comparison by character code
Group lowercase and uppercase letters together
Linguistic collation compares base letters first and treats letter case as a lower-priority difference
As a result, values such as `A` and `a` sort next to each other instead of in separate uppercase and lowercase blocks
proc sort data=series01 out=series01_sort02 sortseq=linguistic;
    by var;
run;
SAS Log
NOTE: There were 14 observations read from the data set WORK.SERIES01.
NOTE: The data set WORK.SERIES01_SORT02 has 14 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
Compare `series01_sort02` with the default sorted dataset and observe how letter case is handled differently
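A side-by-side listing makes this comparison easier. The sketch below assumes both `series01_sort01` and `series01_sort02` already exist. The names `compare_sorts`, `default_order`, and `linguistic_order` are illustrative and do not come from the original example. A `merge` without a `by` statement pairs the datasets row by row, which is what we want here because both hold the same 14 values in different orders.

```sas
/* Sketch: place the two sort orders next to each other, row by row. */
data compare_sorts;
    merge series01_sort01(keep=var rename=(var=default_order))
          series01_sort02(keep=var rename=(var=linguistic_order));
run;

proc print data=compare_sorts;
run;
```

The same pattern can be reused to compare any two of the sorted datasets in this section.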
Group lowercase and uppercase letters together and give preference to lowercase letters
The `case_first=lower` suboption tells SAS to place lowercase values before uppercase values when the letters are otherwise equivalent
This can be useful when a business rule or output requirement expects lowercase values to appear first
proc sort data=series01 out=series01_sort03 sortseq=linguistic(case_first=lower);
    by var;
run;
SAS Log
NOTE: There were 14 observations read from the data set WORK.SERIES01.
NOTE: The data set WORK.SERIES01_SORT03 has 14 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Review the order of pairs such as `a` and `A`, or `ab` and `AB`, and confirm that the lowercase value comes first in each pair
Group lowercase and uppercase letters together and give preference to uppercase letters
The `case_first=upper` suboption does the reverse and places uppercase values first when the base letters match
This demonstrates that sort order can be tuned to meet display or reporting conventions
proc sort data=series01 out=series01_sort04 sortseq=linguistic(case_first=upper);
    by var;
run;
SAS Log
NOTE: There were 14 observations read from the data set WORK.SERIES01.
NOTE: The data set WORK.SERIES01_SORT04 has 14 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Compare the position of uppercase and lowercase values against the previous example and confirm that uppercase values now appear first
Sort character digits based on their numeric value
The `numeric_collation=on` suboption tells SAS to compare digit strings within character values by numeric value rather than character by character
This is useful when character variables store codes like `1`, `2`, `10`, and `101` and you want them sorted numerically
proc sort data=series01 out=series01_sort05 sortseq=linguistic(numeric_collation=on);
    by var;
run;
SAS Log
NOTE: There were 14 observations read from the data set WORK.SERIES01.
NOTE: The data set WORK.SERIES01_SORT05 has 14 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
Check the order of the digit values in `series01_sort05` and compare it with `series01_sort01`
You should now see the digit strings in numeric order: `1`, `2`, `10`, `11`, `21`, `101`
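The suboptions shown so far can also be combined in a single sort. The sketch below assumes that several collating rules may be listed together, separated by spaces, inside the parentheses after `linguistic`. The output dataset name `series01_sort06` is hypothetical and not part of the original example. Check the log for errors when you run it to confirm that your SAS release accepts this form.

```sas
/* Sketch: numeric ordering of digit strings plus uppercase-first letters. */
proc sort data=series01 out=series01_sort06
          sortseq=linguistic(case_first=upper numeric_collation=on);
    by var;
run;

proc print data=series01_sort06;
run;
```

If the step runs cleanly, compare the listing with `series01_sort04` and `series01_sort05` to see both rules applied at once.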
Key points to remember
Character values are not always sorted in the way users intuitively expect
Default sorting compares character data one character at a time using the collating sequence, not by numeric meaning
`sortseq=linguistic` can provide more natural handling of letter case
`case_first=` controls whether lowercase or uppercase values come first
`numeric_collation=on` helps character digit strings sort by numeric value