Merge/left join two datasets in SAS and R tidyverse
This post is part of 'SASnR | Merging/joining data' series
When working with data, we frequently encounter data such that different pieces of information of same entities are collected in separate datasets.
And sometimes, the entities in two datasets may not be same.
We need programming features to merge or join data horizontally from two or more datasets and keep both the matching entities from both the datasets and the unmatching entities from one of the datasets.
Let us assume that we have height and weight of a few students collected in two separate datasets as shown below.
Name
|
Height
|
Janet
|
62.5
|
Ronald
|
67
|
William
|
66.5
|
Name
|
Weight
|
Janet
|
112.5
|
Mary
|
112
|
William
|
112
|
- Overall, there are 4 students - Janet, Ronald, William, Mary
- Janet and William has both height and weight collected
- Ronald has only height collected but not weight
- Mary has only weight collected but not weight
Let us assume that we want to merge or join the above two pieces of data such that the height and weight are present in a single dataset side by side for the entities (students) which are common to both the datasets and the students who only have height collected as shown below.
Name
|
Height
|
Weight
|
Janet
|
62.5
|
112.5
|
Ronald
|
67
|
|
William
|
66.5
|
112
|
There are multiple ways of achieving this result in both SAS and R. Below is one basic approach in SAS and R.
SAS code
data HEIGHT;
infile datalines dlm='|' dsd missover;
input Name : $8. Height : best32.;
datalines;
Janet|62.5
Ronald|67
William|66.5
;
run;
data WEIGHT;
infile datalines dlm='|' dsd missover;
input Name : $8. Weight : best32.;
datalines;
Janet|112.5
Mary|112
William|112
;
run;
proc sort data=height ;
by name;
run;
proc sort data=weight ;
by name;
run;
data left;
merge height(in=inHeight) weight(in=inWeight);
by name;
if inHeight=1 then output;
run;
- Sample data for this task is created using instream data infile/input/cards
- height dataset contains the name and height of the students
- weight dataset contains the name and weight of the students
- height and weight of a student can be identified by the specific value in name variable. So, name is considered as a linking variable between these two datasets
- To merge/join data horizontally from two or more datasets, we need to specify the names of the datasets on the merge statement separated by a space
- We need to specify the name of the linking variables on the by statement
- in= dataset option is used to create a temporary variable which holds a value of 1 or 0. The value the temporary variable indicated after the in= option will be 1 when that dataset contributed to the observation in the output dataset, otherwise it will be 0.
- inHeight variable will have a value of 1 for Janet, Ronald, and William rows
- inWeight variable will have a value of 1 for Janet, Mary, and William rows
- if statement is used to conditionally write only the entities which are coming from height dataset - this will include both common students and the students who are exclusively present in height dataset only
R tidyverse code
library(tidyverse)
library(dplyr)
height<-tribble(
~Name,~Height,
"Janet",62.5,
"Ronald",67,
"William",66.5,
)
weight<-tribble(
~Name,~Weight,
"Janet",112.5,
"Mary",112,
"William",112,
)
left<-left_join(height,weight,by="Name")
- Sample data for this task is created using tribble function of tidyverse
- height and weight datasets are joined using left_join function of tidyverse
- left_join function keeps both those rows which are common to both the datasets based on the values in the by variables and which are exclusive the dataset which is specified first in first argument
- the names of the input datasets are specified as the first two arguments
- the linking variables are specified on the by= parameter
- notice that the name of the linking variable is specified within quotes