How to select only required variables/columns in SAS and R tidyverse?


This post is part of the 'SASnR | Subset variables (columns)' series.

When working with data, we frequently need to work with only a selected set of variables. For this, we need programming features to subset only required variables or columns.

There are multiple ways of selecting only the required variables in both SAS and R tidyverse. Below is one basic approach for each.

Let us assume that we have a dataset named "class" with 5 variables named Name, Sex, Age, Height, Weight.

Name     Sex  Age  Height  Weight
Alfred   M    14   69      112.5
Alice    F    13   56.5    84
Barbara  F    13   65.3    98
Carol    F    14   62.8    102.5
Henry    M    14   63.5    102.5
James    M    12   57.3    83
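For readers who want to follow along in R without the SAS file, the six example rows above can be typed in by hand. This is only a sketch: it assumes the tibble package is installed, and it reproduces just the rows shown in this post, not the full sashelp.class dataset.

```r
library(tibble)

# Hand-typed copy of the six example rows shown above
# (an illustration only, not the full sashelp.class dataset)
class <- tribble(
  ~Name,     ~Sex, ~Age, ~Height, ~Weight,
  "Alfred",  "M",  14,   69.0,    112.5,
  "Alice",   "F",  13,   56.5,    84.0,
  "Barbara", "F",  13,   65.3,    98.0,
  "Carol",   "F",  14,   62.8,    102.5,
  "Henry",   "M",  14,   63.5,    102.5,
  "James",   "M",  12,   57.3,    83.0
)

names(class)  # the five variable names
dim(class)    # number of rows and columns
```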

Let us assume that we only need Name, Sex and Age variables. 

Name     Sex  Age
Alfred   M    14
Alice    F    13
Barbara  F    13
Carol    F    14
Henry    M    14
James    M    12

We can create a new dataset (a tibble/data frame in R) containing only these variables using the code below.

SAS code


data class;
   set sashelp.class;
   keep name sex age;
run;

Notes:

  • the data statement specifies the name of the newly created dataset
  • the set statement specifies the name of the input dataset
  • the keep statement specifies the names of the required variables
  • note that SAS is not case sensitive with respect to variable/column names

R tidyverse code


library(tidyverse)
library(haven)
# Read the SAS dataset into a tibble
class <- haven::read_sas("class.sas7bdat")
# Keep only the required columns
class_selvars <- select(class, Name, Sex, Age)

Notes:

  • tidyverse is loaded into the R session using the library function
  • as haven is not a core tidyverse package, it has to be loaded explicitly
  • the read_sas function of haven reads the SAS dataset into the R session as a tibble
  • the select verb of dplyr (part of tidyverse) selects only the required columns
  • the first argument of the select verb (function) is the name of the input tibble, followed by the required variables, separated by commas
  • note that R is case sensitive with respect to variable/column names, so we need to use the same letter case as in the dataset
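Since there are multiple ways of selecting columns, here is a small sketch of a few equivalent select calls, plus a check of the case-sensitivity point from the notes. It assumes dplyr and tibble are installed, and it uses two hand-typed rows from the example data so that it runs without the SAS file. The object names by_name, by_range, by_drop and wrong_case are made up for illustration.

```r
library(dplyr)
library(tibble)

# Two rows from the example data, typed in by hand for illustration
class <- tibble(
  Name   = c("Alfred", "Alice"),
  Sex    = c("M", "F"),
  Age    = c(14, 13),
  Height = c(69, 56.5),
  Weight = c(112.5, 84)
)

# Three ways to end up with Name, Sex and Age
by_name  <- select(class, Name, Sex, Age)       # list each column
by_range <- class %>% select(Name:Age)          # range of adjacent columns
by_drop  <- class %>% select(-Height, -Weight)  # drop the columns not needed

names(by_name)
names(by_range)
names(by_drop)

# Column names are case sensitive: lowercase "name" does not match "Name",
# so this call fails; try() captures the error instead of stopping the script
wrong_case <- try(select(class, name), silent = TRUE)
inherits(wrong_case, "try-error")
```

Listing the columns by name is the most explicit form. The range and drop forms are shorter when the required columns are adjacent or when only a few columns are unwanted.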

 

The example class dataset (SAS dataset) can be downloaded from here.




