SAS has a programming language and many utilities to perform statistical analysis. These utilities are called procedures
These utilities (procedures) expect data in a format that SAS understands
The format of data that SAS understands is called a SAS dataset
Data collected to answer a specific question in an experiment or survey is generally stored in a format that SAS cannot understand directly
One such non-standard (with respect to SAS) data collection happens in text files
These text files containing the original data are called raw data
SAS is equipped with features to convert the data collected in text files into a SAS dataset
Conversion of raw data into a SAS dataset is called 'reading raw data'
What do we mean by data arranged in columns?
When information is collected, it is generally collected as distinct components. Each distinct component is called a variable
For example, if we are collecting demographic information about the students of a class, we collect it as distinct components such as
Name, Age, Sex, Height and Weight of each student. Each of these components (Name, Age, Sex, Height, Weight) is called a variable.
The collection of components about one entity (a student, person, item, etc.) is called an observation
The space for each character on a line of raw data is considered a column. For example, 'google' occupies 6 columns (one column for each letter).
Similarly, 'google.in' occupies 9 columns and 'www.google.in' occupies 13 columns
When a specified number of columns is reserved for each variable on a row, we say that the data is arranged in columns
For example, on each row the first 16 columns can be reserved for name, column 17 for sex (captured as M or F), and columns 19, 20 and 21 for
age. Even when a name is shorter than 16 characters, the sex variable still has to be entered in the reserved 17th column when the data is arranged in columns
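The layout described above can be sketched as a short DATA step. This is only an illustration: the dataset name layout_demo and the two records are made up, and it uses in-stream data lines and the dollar sign ($) for character variables, both of which are explained later in this document. The ruler in the comment is there to make the reserved columns visible.

```sas
/* Column ruler for the records below (reference only):
   ----+----1----+----2-
   Name  = columns 1-16
   Sex   = column  17
   Age   = columns 19-21                                  */
data layout_demo;                 /* hypothetical dataset name */
    input Name $ 1-16 Sex $ 17 Age 19-21;
    datalines;
Alfred          M  14
Mary Ann        F  13
;
run;
```

Both names are shorter than 16 characters, yet M and F still sit in column 17, because the columns are reserved rather than packed together.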
Types of variables in SAS
A SAS dataset has a tabular structure, with rows representing observations and columns representing variables
The type of information that can be stored in a SAS dataset column (a variable) is restricted to either character or numeric format
Based on the type of information that can be stored in a variable, SAS variables are of two types: NUMERIC variables and CHARACTER variables
The variables which store numbers, typically those on which we intend to perform arithmetic operations, are called NUMERIC variables
The variables which store text, such as letters, special characters and even numbers (where numeric values are treated as character strings), are called CHARACTER variables
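The difference between the two types can be seen in a small sketch. The dataset name type_demo, the variable names and the two records are assumptions made for illustration. A postal code is read as a character variable because no arithmetic is intended on it, while a score is read as a numeric variable so that it can be used in a calculation.

```sas
data type_demo;                   /* hypothetical dataset name */
    /* Zip is character ($), so it is kept as text, leading zero included */
    /* Score is numeric, so arithmetic can be performed on it             */
    input Zip $ 1-5 Score 7-9;
    Bonus = Score + 10;           /* works because Score is numeric */
    datalines;
02134  85
10001  92
;
run;

proc print data=type_demo;
run;
```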
How do we read raw data arranged in columns into a SAS dataset?
'DATA STEP' of SAS is used to create a SAS dataset from the raw data
We need to tell SAS the following things to read raw data
The name of the dataset to be created, using the DATA statement
The filename, file extension and location of the raw data file, using the INFILE statement
The variable names, the type of each variable (character or numeric) and the column positions of the data components present on each row of the raw data file, using the INPUT statement
A RUN statement to tell SAS to compile and execute the 'DATA STEP' code
Read raw data from external file
A dataset named students gets created
The complete filename and location are specified
Five components are collected in raw data
Name: Columns 1 to 11 are reserved to capture the name of the student and are read into a character variable 'Name'.
It is assumed here that no student's name exceeds 11 characters. If the maximum number of characters expected for a student's name
is 35, then we have to reserve 35 columns for it, and the next piece of information should start at or after column 36.
Sex: Column 12 is reserved for capturing the gender of the student and is read into a character variable 'Sex'
Age: Columns 13 to 15 are reserved for the age of the student and are read into a numeric variable 'Age'
Height: Columns 17 to 20 are reserved for the height of the student and are read into a numeric variable 'Height'
Weight: Columns 23 to 27 are reserved for the weight of the student and are read into a numeric variable 'Weight'
Notice that when a piece of information is read into a character variable, we have to indicate this to SAS by placing a
dollar sign ($) after the name of the variable. If the dollar sign is omitted, SAS assumes that we are asking it
to create a variable of numeric type.
Each record of a student becomes an observation (row) in the SAS dataset
Each information component becomes a variable in the SAS dataset. So, we will have 5 variables in the dataset.
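Putting the pieces above together, a DATA step for this example might look like the following sketch. The column positions and variable names are taken from the description above; the file path is a placeholder assumption and must be replaced with the actual location of the raw data file.

```sas
data students;                          /* name of the dataset to create */
    /* Hypothetical path: substitute the real filename and location */
    infile 'C:\data\students.txt';
    input Name   $ 1-11                 /* character, columns 1 to 11  */
          Sex    $ 12                   /* character, column 12        */
          Age      13-15                /* numeric,   columns 13 to 15 */
          Height   17-20                /* numeric,   columns 17 to 20 */
          Weight   23-27;               /* numeric,   columns 23 to 27 */
run;
```

Each variable in the INPUT statement is followed by the columns it occupies, and only Name and Sex carry the dollar sign because they are the character variables.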
Reading instream raw data
In the previous example, we read the raw data from an external file. However, we can also enter the raw data as part of the SAS DATA step itself
When raw data is part of the SAS program, it is called instream data
We need to note a few things about processing instream data
As the raw data is now part of the DATA step program, an INFILE statement is not required
We need to indicate clearly where the raw data lines begin in the program. We use the 'DATALINES' statement to mark the beginning of the raw data
The raw data lines must come immediately after the DATALINES statement, which must be the last statement in the DATA step. A line containing only a semicolon marks the end of the data, and no other DATA step statements may follow the data lines
In this example, we place the raw data lines from the external file used in the example above directly into the SAS program (as instream data)
The data in students01 created in this step will be exactly the same as in the students dataset
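A sketch of such an instream DATA step is given here. The column positions are the same as in the external-file example; the three records are illustrative values only and are not the contents of the original raw data file. Note that the data lines start in column 1 and are not indented, since the column positions are counted from the start of each line.

```sas
data students01;
    input Name $ 1-11 Sex $ 12 Age 13-15 Height 17-20 Weight 23-27;
    datalines;
Alfred     M 14 69.0  112.5
Alice      F 13 56.5   84.0
Barbara    F 13 65.3   98.0
;
run;
```

There is no INFILE statement, DATALINES is the last statement before the data, and the line with the lone semicolon ends the data.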
Understanding the attributes of a SAS dataset
When a SAS dataset is available, we would be interested in knowing details about the dataset like
Name of the dataset
Number of observations in the dataset
Number of variables in the dataset
Date of creation and modification of dataset
The size of a dataset (the space required to store it on the drive) depends on the number of variables and observations
The size of a dataset therefore depends on the space (length) required for each variable
The space required for a variable depends on the type of the variable (numeric or character) and, for a character variable, on the number of characters stored in it
We have a procedure called CONTENTS to check the details (attributes) of a SAS dataset
The procedure displays some important information about the dataset on which it is invoked
(in this example, we invoke PROC CONTENTS on the students01 dataset)
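A minimal call of the procedure on the students01 dataset would be:

```sas
/* List the attributes of the dataset and of each of its variables */
proc contents data=students01;
run;
```

The DATA= option names the dataset to describe.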
The screenshots shown below highlight some important information from the PROC CONTENTS output
From the variable attributes screenshot, we can see that there are 3 key attributes for each variable.
Name of the variable
Type of the variable
Length of the variable (the size of the variable in bytes)
We have read 11 columns (columns 1 to 11; 11 characters) for the Name variable, so SAS assigned a length of 11 (bytes) to the Name variable
We have read 1 column (column 12; 1 character) for the Sex variable, so SAS assigned a length of 1 (byte) to the Sex variable
For numeric variables, irrespective of the number of columns read, SAS assigns a length of 8 (bytes).
For character data, we need 1 byte to store 1 character, but a numeric variable is stored as a number rather than as text, so SAS can store a value with more than 8 digits within a length of 8 bytes
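This can be checked with a small sketch. The dataset name length_demo, the variable names and the single record are made up for illustration. A 15-digit value is read from 15 columns into a numeric variable, and PROC CONTENTS should still report a length of 8 for it, while the character variable read from 3 columns gets a length of 3.

```sas
data length_demo;                 /* hypothetical dataset name */
    /* Id: character, 3 columns; Amount: numeric, 15 columns */
    input Id $ 1-3 Amount 5-19;
    datalines;
A01 123456789012345
;
run;

proc contents data=length_demo;
run;
```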