6  Introduction to SAS

6.1 Introduction

SAS (Statistical Analysis System) is a software package that was originally created in the 1960s. Today, it is widely used by statisticians working in biostatistics and the pharmaceutical industry. Unlike Python and R, it is proprietary software. The full license is quite expensive for a low-usage case such as ours. Thankfully, there is a free web-based version that we can use for our course.

6.2 Registering for a SAS Studio Account

The first step is to create a SAS profile. Use your NUS email address to register using this link.

Once you have verified your account using the email that will be sent to you, the following link should take you to the login page shown in Figure 6.1.

Figure 6.1: SAS Studio Login

Logging in should then take you to the landing page, where you can begin writing and running SAS code. This interface can be seen in Figure 6.2.

Figure 6.2: SAS Studio

6.3 An Overview of SAS Language

The SAS language is not a fully-fledged programming language in the way that Python, or even R, is. For the most part, we are going to rely on the point-and-click interface of SAS Studio in our course. Even so, it is good to understand a little about the language so that we can modify the options for different procedures as necessary.

A SAS program is a sequence of statements executed in order. Keep in mind that:

Every SAS statement ends with a semicolon.

SAS programs are constructed from two basic building blocks: DATA steps and PROC steps. A typical program starts with a DATA step to create a SAS data set and then passes the data to a PROC step for processing.

Example 6.1 (Creating and Printing a Dataset) Here is a simple program that converts miles to kilometres in a DATA step and then prints the results with a PROC step:

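/* DATA step: create a dataset called distance with one observation */
DATA distance;
    Miles = 26.22;
    Kilometer = 1.61 * Miles;   /* 1 mile is approximately 1.61 km */

/* PROC step: print the dataset; starting it also ends the DATA step */
PROC PRINT DATA=distance;
RUN;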

To run the above program, click on the “Running Man” icon in SAS Studio. You should obtain the output shown in Figure 6.3.

Figure 6.3: SAS output

This dataset has only one observation (row).

Data steps start with the DATA keyword. This is followed by the name for the dataset. Procedures start with PROC followed by the name of the particular procedure (e.g. PRINT, SORT or PLOT) you wish to run on the dataset. Most SAS procedures have only a handful of possible statements. A step ends when SAS encounters a new step (marked by a DATA or PROC statement) or a RUN statement. RUN statements are not part of a DATA or PROC step; they are global statements.

Example 6.2 (Creating a Dataset Inline) The following program explicitly creates a dataset within the DATA step.

/* Creating data manually */

DATA ex_1;
INPUT subject gender $ CA1 CA2 HW $;
DATALINES;
10 m 80 84 a
7 m 85 89 a
4 f 90 86 b
20 m 82 85 b
25 f 94 94 a
14 f 88 84 c
;

PROC MEANS DATA=ex_1;
VAR CA1 CA2;
RUN;

The output for the above code is shown in Figure 6.4 and Figure 6.5.

Figure 6.4: Dataset output
Figure 6.5: Proc output

In the statements above, each $ in the INPUT statement informs SAS that the preceding variable (here, gender and HW) is a character variable. Note how the semicolon for DATALINES appears only after all the data have been listed.

PROC MEANS creates basic summary statistics for the variables listed.

To review, there are only two types of steps in SAS programs:

DATA steps

  • begin with DATA statements.
  • read and modify data.
  • create a SAS dataset.

PROC steps

  • begin with PROC statements.
  • perform specific analysis or function.
  • produce reports or results.

6.4 Basic Rules for SAS Programs

For SAS statements

  • All SAS statements (except those containing data) must end with a semicolon (;).
  • SAS statements typically begin with a SAS keyword (e.g. DATA, PROC).
  • SAS statements are not case sensitive, that is, they can be entered in lowercase, uppercase, or a mixture of the two.
    • Example : SAS keywords (DATA, PROC) are not case sensitive
  • A delimited comment begins with a forward slash-asterisk (/*) and ends with an asterisk-forward slash (*/). All text within the delimiters is ignored by SAS.

For SAS names

  • Variable and dataset names must contain between 1 and 32 characters. (Library names are shorter: at most 8 characters.)
  • The first character appearing in a name must be a letter (A, B, …, Z, a, b, …, z) or an underscore (_). Subsequent characters must be letters, numbers, or underscores. That is, no other characters, such as $, %, or &, are permitted.
  • Blanks also cannot appear in SAS names.
  • SAS names are not case sensitive, that is, they can be entered in lowercase, uppercase, or a mixture of the two. (SAS is only case sensitive within quotation marks.)

For SAS variables

  • If a variable in the INPUT statement is followed by a dollar sign ($), SAS treats it as a character variable. Otherwise, the variable is treated as numeric.
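
The rules in this section can be seen together in one short program. This is only an illustrative sketch: the dataset name _scores1, the variables name and Score_1, and the two data lines are all made up for the example.

```sas
/* A delimited comment: SAS ignores everything between the delimiters. */
data _scores1;                 /* lowercase keyword; name starts with an underscore */
    input name $ Score_1;      /* $ marks name as character; Score_1 is numeric */
    datalines;
Ann 71
Bob 65
;

PROC PRINT DATA=_SCORES1;      /* same dataset: SAS names are not case sensitive */
    WHERE name = 'Ann';        /* text inside quotation marks is case sensitive */
RUN;
```

Replacing 'Ann' with 'ann' in the WHERE statement should select no rows, because the comparison happens inside quotation marks.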

6.5 Reading Data into SAS

In this topic, we shall introduce a new dataset, also from the UCI Machine Learning repository.

Example 6.3 (Bike Rentals)

The dataset was collected by Fanaee-T and Gama (2013). It contains information on bike-sharing rentals in Washington, D.C., USA for the years 2011 and 2012, along with measurements of weather. The original dataset contained hourly and daily aggregated data. For our class, we use a re-coded version of the daily data. Our dataset can be found on Canvas as bike2.csv.

Here is the data dictionary:

Field        Description
instant      Record index
dteday       Date
season       Season (spring, summer, fall, winter)
yr           Year (0: 2011, 1: 2012)
mnth         Abbreviated month
holiday      Whether the day is a holiday or not
weekday      Abbreviated day of the week
workingday   yes: the day is neither a weekend nor a holiday; no: otherwise
weathersit   Weather situation:
               clear: Clear, Few clouds, Partly cloudy
               mist: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
               light_precip: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
               heavy_precip: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp         Normalized temperature in Celsius. Divided by 41 (max)
atemp        Normalized feeling temperature in Celsius. Divided by 50 (max)
hum          Normalized humidity. Divided by 100 (max)
windspeed    Normalized wind speed. Divided by 67 (max)
casual       Count of casual users
registered   Count of registered users
cnt          Count of total rental bikes, including both casual and registered

Our first step will be to load the dataset into SAS Studio.

6.6 Uploading and Using Datasets

To use our own datasets on SAS Studio, we have to execute the following steps:

  1. Create a new library. In SAS, a library is a collection of datasets. If you already have a library created, you can simply import datasets into it. The default library in SAS is called WORK. However, its datasets are purged every time you sign out, so it is better to create a new one.
  2. Import your dataset (csv, xlsx, etc.) into the library.
  3. After this, the data will be available for use with the reference name <library-name>.<dataset-name>.

From the “Libraries” menu on the left of SAS Studio, click on the “New library” icon (the one circled in red in Figure 6.6), and create a new library called “ST2137”. You can use the default suggested path for the library.

Figure 6.6: New library
Figure 6.7: Upload data

Now expand the menu for “Server Files and Folders” and upload the bike2.csv file to SAS, using the circled icon in Figure 6.7.

Finally, right-click on the top of the main Studio area (where we write code) and select “New Import Data”. Select the bike2.csv file that has just been uploaded, and modify the OUTPUT DATA settings to Library ST2137 and Data set name BIKE2. Click on the running man, and your dataset is now ready for use in SAS Studio!
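
The point-and-click steps above correspond roughly to the program below. It is a minimal sketch, not the exact code that SAS Studio generates: the user directory u12345678 and both paths are hypothetical placeholders, so substitute the locations shown under “Server Files and Folders” in your own account.

```sas
/* Hypothetical paths: replace u12345678 with your own user directory.
   The folder named in LIBNAME is assumed to exist already. */
LIBNAME ST2137 "/home/u12345678/ST2137";

/* Read the uploaded CSV file into the dataset ST2137.BIKE2 */
PROC IMPORT DATAFILE="/home/u12345678/bike2.csv"
    OUT=ST2137.BIKE2
    DBMS=CSV
    REPLACE;              /* overwrite the dataset if it already exists */
    GETNAMES=YES;         /* take variable names from the first row */
    GUESSINGROWS=MAX;     /* scan all rows when guessing variable types */
RUN;

/* List the variables and their types to check the import */
PROC CONTENTS DATA=ST2137.BIKE2;
RUN;
```

After this, the data can be referred to as ST2137.BIKE2 in any DATA or PROC step, following the <library-name>.<dataset-name> pattern described earlier.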

6.7 Summarising Numerical Data

The SAS routines we are going to work with can be found in the “Tasks and Utilities” section (see highlighted tasks in Figure 6.8).

Figure 6.8: Common ST2137 Tasks

Numerical Summaries

Example 6.4 (5-number Summaries)

We expect that the total count of users will vary by the seasons. Hence, we begin by computing five-number summaries for each season.

Under Tasks, go to Statistics > Summary Statistics. Select cnt as the analysis variable, and season as the classification variable. Under the options tab, select the lower and upper quartiles, along with comparative boxplots. The output should look like Figure 6.9:

Figure 6.9: Summaries, Bike data

We observe that the median count is highest for fall, followed by summer, winter and lastly spring. The spreads, as measured by IQR, are similar across the seasons: approximately 2000 users. In the middle 50%, the count distribution for spring is the most right-skewed.
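
For reference, a summary of this kind can also be requested directly in code. The following is a sketch under the assumption that the data were imported as ST2137.BIKE2; the code that the Summary Statistics task generates may differ in its details.

```sas
/* Five-number summary of cnt for each season */
PROC MEANS DATA=ST2137.BIKE2 MIN Q1 MEDIAN Q3 MAX;
    CLASS season;     /* classification variable */
    VAR cnt;          /* analysis variable */
RUN;

/* Comparative boxplots of cnt, one box per season */
PROC SGPLOT DATA=ST2137.BIKE2;
    VBOX cnt / CATEGORY=season;
RUN;
```

The statistic keywords on the PROC MEANS statement are the part you would typically edit, for example adding MEAN or STD to the list.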

Scatter Plots

Example 6.5 (Casual vs Registered Scatterplot)

To create a scatterplot in SAS, go to Tasks > Graph > Scatter Plot.

Specify casual on the x-axis, registered on the y-axis, and workingday as the Group. You should obtain the plot shown in Figure 6.10:

Figure 6.10: Scatter plot, Bike Data

We can see that there seem to be two different relationships between the counts of casual and registered users. The two relationships correspond to whether it is a working day or not.
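
A grouped scatterplot like this can be sketched in a few lines of code, assuming the dataset is stored as ST2137.BIKE2. The task may add further appearance options that are omitted here.

```sas
/* casual against registered, with points coloured by workingday */
PROC SGPLOT DATA=ST2137.BIKE2;
    SCATTER X=casual Y=registered / GROUP=workingday;
RUN;
```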

Histograms

Example 6.6 (Casual Users Distribution)

Now suppose we focus on casual users, and study the distribution of counts by whether a day is a working day or not. To create a histogram, go to Tasks > Graph > Histogram. Select casual as the analysis variable, and workingday as the group variable.

Figure 6.11: Histograms, Bike Data

From Figure 6.11, we can see that the distribution is right-skewed in both cases. However, the range of counts for non-working days extends further, to about 3500.
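
One way to draw separate histograms per group in code is with PROC SGPANEL, sketched below under the assumption that the data are in ST2137.BIKE2. This is a swapped-in approach chosen for illustration; the Histogram task may produce different code and a different layout.

```sas
/* One histogram of casual per level of workingday, stacked in rows
   so that the two distributions share a common x-axis */
PROC SGPANEL DATA=ST2137.BIKE2;
    PANELBY workingday / LAYOUT=ROWLATTICE;
    HISTOGRAM casual;
RUN;
```

Stacking the panels vertically makes it easier to compare how far each distribution extends to the right.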

Boxplots

Example 6.7 (Boxplots for Casual Users, by Season) In Example 6.4, we observed that total counts vary by season, and in Example 6.6, we observed that working days seem to have fewer casual users. Let us investigate if this difference is related to season.

To create boxplots, go to Tasks > Graph > Box Plot. Select casual as the analysis variable, season as the category and workingday as the subcategory. You should obtain a plot like Figure 6.12:

Figure 6.12: Boxplots, Bike Data

To arrange the seasons in calendar order, I had to add the xaxis line to the generated code:

proc sgplot data=ST2137.BIKE2;
    vbox casual / category=season group=workingday grouporder=ascending;
    xaxis values=('spring' 'summer' 'fall' 'winter');
    yaxis grid;
run;

There is little insight from the previous two examples. However, now try the same plots, but on the log scale (modify the APPEARANCE tab and re-run). You should now obtain Figure 6.13:

Figure 6.13: Boxplots log scale, Bike Data
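
As an alternative to the APPEARANCE tab, the logarithms can be computed in a DATA step and plotted directly. This is a sketch of that alternative, not the code behind Figure 6.13: bike_log and log_casual are names made up for the illustration, and because the log values are drawn on an ordinary linear axis, the tick labels will differ from those in the figure.

```sas
/* Copy the data into the WORK library and add the natural log of casual.
   bike_log and log_casual are illustrative names. */
DATA bike_log;
    SET ST2137.BIKE2;
    log_casual = LOG(casual);   /* LOG is the natural logarithm in SAS */
RUN;

/* Same boxplot as before, but on the log-transformed counts */
PROC SGPLOT DATA=bike_log;
    VBOX log_casual / CATEGORY=season GROUP=workingday GROUPORDER=ASCENDING;
    XAXIS VALUES=('spring' 'summer' 'fall' 'winter');
    YAXIS GRID;
RUN;
```

The SET statement is how a DATA step reads an existing dataset; each new variable defined after it is added as a column.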

Now we can observe that the difference between working and non-working days within each season is roughly constant across seasons. Because the difference in logarithms is constant, it means that, on the original scale, it is a constant multiplicative factor that increases counts from working days to non-working days.
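
To spell out the step from logarithms to a multiplicative factor, write \(m_w\) and \(m_n\) for a typical (say, median) casual count on working and non-working days in a given season, and suppose the gap on the natural-log scale is the same constant \(c\) in every season. These symbols are introduced here for illustration only.

\[
\log m_n - \log m_w = c
\quad\Longrightarrow\quad
\log\frac{m_n}{m_w} = c
\quad\Longrightarrow\quad
\frac{m_n}{m_w} = e^{c}
\]

For instance, a purely hypothetical gap of \(c = 1\) would mean that non-working days see about \(e \approx 2.72\) times as many casual users as working days, in every season.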

We have arrived at a more succinct representation of the relationship by using the log transform.

QQ-plots

Example 6.8 (Normality Check for Humidity) To create QQ-plots, we go to Tasks > Statistics > Distribution Analysis.

Select hum for the analysis variable. Under options, add the normal curve, the kernel density estimate, and the Normal quantile-quantile plot. You should obtain the following two charts:

Histogram for humidity

QQ-plot for humidity

The plot shows that humidity values are quite close to a Normal distribution, apart from a single observation on the left.
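
The same charts can be requested in code with PROC UNIVARIATE. The following is a minimal sketch, assuming the dataset ST2137.BIKE2; the Distribution Analysis task may generate additional options.

```sas
PROC UNIVARIATE DATA=ST2137.BIKE2;
    VAR hum;
    /* Histogram with a fitted normal curve and a kernel density estimate */
    HISTOGRAM hum / NORMAL KERNEL;
    /* Normal QQ-plot; the reference line uses the estimated mean and sd */
    QQPLOT hum / NORMAL(MU=EST SIGMA=EST);
RUN;
```

Points that follow the reference line closely are consistent with a Normal distribution; a point far from the line at one end is an unusual observation in that tail.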

6.8 Categorical Data

We now turn to categorical data methods with SAS. We return to the dataset on student performance that we used in the topic on summarising data. Upload and store student-mat.csv as ST2137.STUD_PERF on the SAS Studio website.

Example 6.9 (\(\chi^2\) Test for Independence)

For a test of independence of address and paid, go to Tasks > Statistics > Table Analysis, and select:

  • address as the column variable
  • paid as the row variable.
  • Under OPTIONS, check the “Chi-square statistics” box.

The following output should enable you to perform the test (Figure 6.14 and Figure 6.15).

Figure 6.14: Observed & Expected Counts
Figure 6.15: Test statistic, p-value
Figure 6.16: Mosaic Plot
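
In code, the test can be sketched with PROC FREQ, assuming the data were stored as ST2137.STUD_PERF. The options shown are one reasonable choice for reproducing the three outputs, not necessarily those that the Table Analysis task generates.

```sas
PROC FREQ DATA=ST2137.STUD_PERF;
    /* rows are paid, columns are address */
    TABLES paid*address / CHISQ              /* chi-square test statistics */
                          EXPECTED           /* expected counts under independence */
                          PLOTS=MOSAICPLOT;  /* mosaic plot of the table */
RUN;
```

In a TABLES request of the form a*b, the first variable defines the rows and the second defines the columns.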

For measures of association, we only need to select the option for “Measures of Association” to generate the Kendall \(\tau_b\) that we covered earlier.

Example 6.10 (Kendall \(\tau\) for Walc and Dalc)

Once we load the data ST2137.STUD_PERF, we go to Tasks > Statistics > Table Analysis. After selecting the two variables, we check the appropriate box to obtain Figure 6.17.

Figure 6.17: Walc vs Dalc

You may observe that the particular associations computed and returned are similar to those returned by the Desc R package that we used in Example 4.10.
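
The equivalent request in code is a PROC FREQ call with the MEASURES option. This is a sketch that assumes the dataset ST2137.STUD_PERF and the variable names Walc and Dalc as given above.

```sas
PROC FREQ DATA=ST2137.STUD_PERF;
    /* MEASURES requests ordinal measures of association,
       including Kendall's tau-b, with their standard errors */
    TABLES Walc*Dalc / MEASURES;
RUN;
```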

6.9 References

6.10 Website References

  1. SAS account sign-up: Use this link to sign up for a SAS account.
  2. SAS Studio link: Once you have activated your account, use this link to log in to SAS Studio online.
  3. SAS Studio Help: This link contains help on SAS Studio features and commands.