Merge/inner join two datasets in SAS and R tidyverse


This post is part of 'SASnR | Merging/joining data' series

When working with data, we frequently encounter data such that different pieces of information of same entities are collected in separate datasets.

And sometimes, the entities in two datasets may not be same.

We need programming features to merge or join data horizontally from two or more datasets and keep only the matching entities from both the datasets.


Let us assume that we have height and weight of a few students collected in two separate datasets as shown below.

Name

Height

Janet

62.5

Ronald

67

William

66.5

 

Name

Weight

Janet

112.5

Mary

112

William

112

  • Overall, there are 4 students - Janet, Ronald, William, Mary
  • Janet and William has both height and weight collected
  • Ronald has only height collected but not weight
  • Mary has only weight collected but not weight

 


Let us assume that we want to merge or join the above two pieces of data such that the height and weight are present in a single dataset side by side for the entities (students) which are common to both the datasets as shown below.

Name

Height

Weight

Janet

62.5

112.5

William

66.5

112

There are multiple ways of achieving this result in both SAS and R. Below is one basic approach in SAS and R.


SAS code

data HEIGHT;
infile datalines dlm='|' dsd missover;
input Name : $8. Height : best32.;
label ;
format ;
datalines;
Janet|62.5
Ronald|67
William|66.5
;
run;

data WEIGHT;
infile datalines dlm='|' dsd missover;
input Name : $8. Weight : best32.;
label ;
format ;
datalines;
Janet|112.5
Mary|112
William|112
;
run;

proc sort data=height ;
   by name;
run;

proc sort data=weight ;
   by name;
run;

data both;
   merge height(in=inHeight) weight(in=inWeight);
   by name;
   if inHeight=1 and inWeight=1 then output;
run;
 
  • Sample data for this task is created using instream data infile/input/cards
  • height dataset contains the name and height of the students
  • weight dataset contains the name and weight of the students
  • height and weight of a student can be identified by the specific value in name variable. So, name is considered as a linking variable between these two datasets
  • To merge/join data horizontally from two or more datasets, we need to specify the names of the datasets on the merge statement separated by a space
  • We need to specify the name of the linking variables on the by statement
  • in= dataset option is used to create a temporary variable which holds a value of 1 or 0. The value the temporary variable indicated after the in= option will be 1 when that dataset contributed to the observation in the output dataset, otherwise it will be 0.
  • inHeight variable will have a value of 1 for Janet, Ronald, and William rows
  • inWeight variable will have a value of 1 for Janet, Mary, and William rows
  • if statement is used to conditionally write only the entities which are coming from both height and weight datasets

R tidyverse code

 
library(tidyverse)
library(dplyr)
 
height<-tribble(
  ~Name,~Height,
  "Janet",62.5,
  "Ronald",67,
  "William",66.5,
)
 
weight<-tribble(
  ~Name,~Weight,
  "Janet",112.5,
  "Mary",112,
  "William",112,
)
 
both<-inner_join(height,weight,by="Name")  
 
  • Sample data for this task is created using tribble function of tidyverse
  • height and weight datasets are joined using inner_join function of tidyverse
  • inner_join function keeps only those rows which are common to both the datasets based on the values in the by variables 
  • the names of the input datasets are specified as the first two arguments
  • the linking variables are specified on the by= parameter 
  • notice that the name of the linking variable is specified within quotes

 





Filter a category
AllSASnRSDTMSASDomainADaMStatistics

List of other posts


Domain


General
What are CDISC standards?
What is a clinical development plan?
What is a clinical study report?
What is a clinical trial registry?
What is the importance of baseline characteristics in a clinical trial?
Drug discovery
Drug Development
Preclinical Research
The Investigational New Drug Process
What is ICH?
Efficacy data vs Safety data
What is a clinical trial?
Why are clinical trials conducted?
What is a clinical trial protocol?
Who conducts clinical trials?
Inclusion/Exclusion Criteria in a clinical trial
What are the phases of clinical trials?
What happens after a clinical trial is completed?
FDA Drug Review
Key steps in developing new drugs and biologicals

Trial design aspects
What is a crossover clinical trial?

Terminology
What is a cohort?

SAS


Definitions
What is a computer?
What happens when we execute a SAS program?
What is software?
What is SAS?
What is data?
What is data entry?
What is data retrieval?
What is data management?
What is "Report"?
What is statistics?
What is Statistical Analysis?
How do we use SAS?
What kind of questions can SAS help us answer?
How do we provide instructions to SAS?
What is a SAS program?
What does a SAS program contain?

General
Attributes of a SAS dataset
Rules for SAS dataset names
Rules for SAS variable names
Rules for SAS library names
Rules for character SAS format names
Reserved SAS dataset names
Rules for numeric SAS format names
What can SAS dataset options do?
Attributes of a SAS variable
Automatic conversion of data types in SAS
How does SAS expect our data to be organized?
Introduction to SAS interface
By groups in SAS

Informats
Rules for character SAS informat names

Proc freq
Count the number of times a particular value occurred in a variable of a dataset

Proc contents
Check the list of variables in a SAS dataset

Proc datasets
Delete all sas datasets from a library
Delete specific sas datasets from a library
Save specific sas datasets (and delete others) of a library
Rename SAS datasets using proc datasets change statement

Log issues
WARNING: No matching members in directory.

One-line definitions
What is a SAS library?
What is a libref?
What is an input statement?
What is infile statement?
What is set statement?
What is length statement?

SDTM


General
How to derive baseline flag in SDTM
How to create SEQ variable in SDTM
New domains in SDTM IG 3.3
What is a codelist?
How to derive study day variable in SDTM
What is SDTM?
Why do we need SDTM?
How to convert original results to standard results using conversion factors

Demographics
What information does SDTM.DM (Demographics) contain?

Adverse Events
What is causality assessment?
What information does SDTM.AE (Adverse Events) contain?

Disposition
What information does SDTM.DS (Disposition) contain?

Concomitant Medications
What information does SDTM.CM (Concomitant/Prior Medications) contain?

Procedures
What information does SDTM.PR (Procedures) contain?

ADaM


General
What is ADaM?

ADSL
What is ADSL as per ADaM standard?

BDS
What is BDS as per ADaM standard?

Statistics


General
Alpha and beta errors
What is correlation?

SASnR


Introduction
What is R?
What is an R package?
What is tidyverse?
What are the core packages of tidyverse?
What is haven package of tidyverse?
How to install tidyverse?
How to load core tidyverse packages into the R session?

Reading data
Import/Read SAS dataset into R

Creating sample data
How to create some sample data in SAS and R tidyverse

Subset variables (columns)
How to select only required variables/columns in SAS and R tidyverse?
How to drop unwanted variables/columns in SAS and R tidyverse?

Subset observations (rows)
How to select/subset required rows in SAS and R tidyverse

Appending data
Append two datasets in SAS and R tidyverse

Merging/joining data
Merge/full join two datasets in SAS and R tidyverse
Merge/inner join two datasets in SAS and R tidyverse
Merge/left join two datasets in SAS and R tidyverse

Sort (order) observations
Sort/order observations based on the values in a single variable in SAS and R tidyverse

Transpose/Restructure data
Restructure/transpose long data to wide data
Restructure/transpose wide data to long data

Obtain frequencies
Obtain frequencies/counts based on one variable - one-way frequencies in SAS and R tidyverse
Obtain frequencies/counts based on two variables - two-way frequencies in SAS and R tidyverse

Descriptive statistics
Descriptive statistics for a numeric variable using SAS and R tidyverse