True duplicate record: Values of all variables are equal on an observation is same as one or more other observations
Duplicate based on the key variable: Values of a variable are equal on an observation when compared to one or more other observations
Duplicate based on more than one key variable: Values of the key variables are equal on an observation when compared to one or more other observations
Name Sex Age Height Weight
Alfred M 14 69 112.5
Alfred M 14 69 112.5
Carol F 14 62.8 102.5
Carol F 12 55 84
Carol F 11 51 79
Jane F 12 59.8 84.5
In the above example, row 2 is considered a true duplicate as all the variables have same information as row 1
In the above example, row 4 and row 5 are considered duplicates based on the key variable 'Name' as the value in the Name is 'Carol' is same as that on the row 3
Duplicates based on key variable depend on the expected structure of the dataset. If we expect a dataset to have only one record with a particular name value but there exists
more than one record with the same name value the subsequent records with that name value will be called a duplicate with respect to the variable 'Name'
If the expected record structure is one record per subject and there exists more than one record per subject in the dataset, we call that there is a duplicate record
with subject as key variable
In the above example, if the key variables are Name and Age, only row 2 will be called a duplicate (of row 1) and none of the records with Name value 'Carol' are considered
duplicate as the values of the second variable(Age) differ on all three records
In the above example, if the key variables are Name and Sex, row 2 will be considered as a duplicate (of row 1) as on both the records the values on the key variables are same.
Similiarly, rows 4 and 5 are considered duplicates (of row 3) as the values on the key variables Name and sex are same on all 3 records of Carol
In the above example, if the key variable is Sex alone, row 2 will be considered a duplicate(row 1) and rows 4,5,6 will be considered as duplicates of row 3
Where do we need to identify the duplicate records (and separate or delete them)?
Check if there is more than one blood sample collected on a particular date
Check if there is more than one instance of a particular adverse event for a subject
Check if there exists more than one record per subject when only record is expected per subject in a dataset
Identify the unique values seen for a particular variable in a dataset irrespective of the values present in other variables
Identify the unique value combinations for a group of variables in a dataset irrespective of the values present in other variables
Creating a sample dataset
Identifying if there is a duplicate record based on a key variable(s)
Create a dataset named class_names by removing the duplicates based on Name variable
In this example, let us assume that we are expecting only record per Name value
We will use an option called 'nodupkey' to identify there is a duplicate record based on the key variable (Name)
'Nodupkey' option has to be specified on the proc sort statemnet
Name of the input dataset has to be specified in the data= option on proc sort statement
Name of the output dataset has to be specified on the out= option on proc sort statement
The list of key variables have to specified on the by statement
When nodupkey option is used on proc sort statement, the first record within each unique value of the key variable
will be written to the output dataset and subsequent records with that unique value will NOT be written to the output dataset.
From the above data, as there are two records with a name value of Alfred, the first record with Name=Alfred will be written to
the output dataset and subsequent records with Name=Alfred will not be written to the output dataset. Simliarly, of the 3 records
with Name=Carol, only the first record with Carol in name will be written to the output dataset
Create a dataset named class_names_age by removing the duplicates based on Name and Age variables
As we are interested in checking if a there is a duplicate record based on Name and Age variable values,
we need to provide Name and Age variables in the by statement
The records get sorted first based on Name and Age value and the first record within a Name and Age combination will be written to the
output dataset, subsequent recods with the occurrences of that Name and Age combination will not be written to the output dataset
As there are two records with Alfred in Name and 14 in Age variables, only the first instance of Alfred 14 will be written to output
datasset. For Carol, as the records have 3 different values in Age variable, all 3 records in Name and Age combination become unique and are
written to the output dataset.
Capture the duplicate records getting deleted into another dataset
What if we are interested in seeing the records which are not being written to the output dataset when using nodupkey?
We have an option called dupout= to specify a dataset name to hold the records which are not captured in output dataset
The first record within a combination of the the key variables get captured in the dataset listed in out= option, and subsequent
occurrences of that combination will be written to the dataset specified in dupout=option
Create a dataset named class_names_dup to hold the duplicate records based on Name variable
The record with first instance of each name value will be written to the dataset specified in out= option and the records
with subsequent occurrence of a name will be written to the dataset specified in dupout= option
First record of Alfred and Carol will be written to calss_names2 dataset and the next instances of Alfred and Carol will be written
to class_names_dup dataset. As there is only one record with a name of Jane, it is considered as first instance and get written to class_names2
and no record for Jane will be seen in class_names_dup as there is no 'duplicate' for Jane when Name is the key variable
Question
Answer
What would happen if out= option is not used when nodupkey is specified?
Input dataset gets overwritten with the dataset containing the first instance records of each key variable combinations.
Identify and remove the records which are true duplicate records with nodupkey
Create a dataset named class_noduplicates by removing the records with all variable values are same on more than one record
The key to this task is the usage of _all_ keyword which is used for referring all variables in a dataset
When we use nodupkey and specify a list of variables in the by statement, SAS only checks if the values in the listed variables
are same before deleting a record. So, instead of specifying a list of variables we specify _all_ in the by statement so that SAS checks
if all the variables have same values before deleting a record
Removing duplicate records with NODUPREC option
NODUPREC option can be used to remove observations where every variable value on the current record is same as the previous observation
NODUP option is synonymous with NODUPREC. That is, NODUP can be used in place of NODUPREC.
Create a dataset named class_noduplicates2 by removing the duplicate records with NODUPREC option
Alfred's record on row 2 has same values for all variables as that of the previous observation (row 1). So the record will not be written
to the output dataset
Create an input dataset
Create a dataset named class_noduplicates3 using classx dataset by removing the duplicate records with NODUPREC option
By definition, while checking for duplicates with NODUPREC proc sort will only check the previous observation to determine
if the current record is a duplicate record (all variables have same values as previous observation)
For row 2 of Alfred, as all variables have same values when compared to previous observation, SAS will not write row 2 to output dataset
For row 3 of Alfred, as the Height variable has a different value when compared previous observation(row 2), row 3 is not considered a duplicate
and will be written to output dataset
For row 4 of Alfred, as all variables have same values when compared to previous observation, SAS will not write row 4 to output dataset
For row 5 of Alfred, as the Height variable has a different value when compared previous observation(row 4), row 5 is not considered a duplicate
and will be written to output dataset. However, when compared to row 1 and row 2, row 5 of Alfred is a duplicate record
To prevent the above issue, we need to make sure that observations with same values on all variables are sorted together
We can use _all_ in the by statement so that the records with all variables values same are sorted next to each other