Subsets: Selecting Variables
When preparing your data for analysis it is generally a great idea to reduce your data set to just those variables (and observations) that you need. This not only allows the computer to work more efficiently, but also reduces the clutter when you try and look at the data yourself. When data cleaning is written into a SAS script, it is easy to revise later if you decide you need to look are more (or fewer!) variables.
Extracting variables is generally something you do in a DATA step or in a utility PROC (COPY, SORT, STANDARDIZE). In most statistical PROCs you already select variables with a VAR, TABLES, or MODEL statement.
There are two main SAS language elements that you use to select variables:
You can pretty easily guess what these keywords mean. The only subtlety is in whether you select the variables before SAS reads data in or before SAS writes data out.
KEEP or DROP statements
Used as statements in a DATA step, KEEP and DROP control what variables a written to the output data set. All of the variables that are SET are read into the Program Data Vector and are available in the DATA step but only the selected variables are saved in the new data set.
While these statements could be written anywhere among your DATA step statements - they are only used during the compile phase of the DATA step - it aids in understanding your code when they are written at the bottom of the step, just before the (implied) OUTPUT.
You should generally use either KEEP or DROP, and not both statements in the same step (although it is technically possible, albeit with a warning). You may use multiple KEEPs or multiple DROPS, but this tends to make your code hard to read. You will generally prefer whichever allows you to write the shortest variable list. However, for anyone who may read your code in the future, using KEEP makes it clear what variables will be in the result without having to be overly familiar with your data.
If I have a data set with heights and weights, but I am interested in analyzing body mass index, I might be interesting in constructing the following data sets.
- For a t-test, I might KEEP just the two variables needed
data class;
set sashelp.class;
bmi = (weight/height**2)*703;
keep sex bmi;
run;
ods select conflimits ttests;
proc ttest data=class;
class sex;
var bmi;
run;
The TTEST Procedure
Variable: bmi
Sex Method Mean 95% CL Mean Std Dev
F 17.0510 15.5357 18.5664 1.9714
M 18.5942 17.1546 20.0339 2.0125
Diff (1-2) Pooled -1.5432 -3.4755 0.3890 1.9933
Diff (1-2) Satterthwaite -1.5432 -3.4745 0.3880
Sex Method 95% CL Std Dev
F 1.3316 3.7768
M 1.3843 3.6740
Diff (1-2) Pooled 1.4957 2.9882
Diff (1-2) Satterthwaite
Method Variances DF t Value Pr > |t|
Pooled Equal 17 -1.69 0.1103
Satterthwaite Unequal 16.86 -1.69 0.1100
- For a regression, I might DROP the variables I don’t need. Here you see that although we use
height
and weight
to calculate bmi
, dropping it only affects the output data set.
data class;
set sashelp.class;
bmi = (weight/height**2)*703;
drop name height weight;
run;
ods select parameterestimates;
proc glm data=class;
class sex;
model bmi = sex age / solution;
run;
The GLM Procedure
Dependent Variable: bmi
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 12.60539830 B 4.13540007 3.05 0.0077
Sex F -1.46375027 B 0.88815751 -1.65 0.1188
Sex M 0.00000000 B . . .
Age 0.44692874 0.30523460 1.46 0.1625
KEEP or DROP data set options
The keywords KEEP and DROP may also be used as data set options. Data set options can be used with both input and output data sets, and they can be used in PROCs as well as in DATA steps. The general form is
data-set (KEEP= var1|varlist <var2 ...>)
data-set (DROP= var1|varlist <var2 ...>)
Used as an input data set option, they promote efficient computing in DATA steps because they prevent SAS from moving data into the PDV. Used as an output data set option, the promote future efficient computing.
Used with an output data set, they do the same thing as a DATA step statement.
We could use these options on both input and output data sets in the same step.
data class (drop=height weight);
set sashelp.class(drop=name);
bmi = (weight/height**2)*703;
run;
proc means data=class n mean stddev;
class sex;
run;
The MEANS Procedure
N
Sex Obs Variable N Mean Std Dev
------------------------------------------------------------
F 9 Age 9 13.2222222 1.3944334
bmi 9 17.0510391 1.9714107
M 10 Age 10 13.4000000 1.6465452
bmi 10 18.5942434 2.0124784
------------------------------------------------------------
Used to modify a input data set, these options may also appear in PROCs. (In the next example, this does not provide any computational benefit, it merely gives me an alternative way of specifying TABLES.)
proc freq data=sashelp.class(keep=sex age);
tables _all_ /nocum;
run;
The FREQ Procedure
Sex Frequency Percent
----------------------------
F 9 47.37
M 10 52.63
Age Frequency Percent
----------------------------
11 2 10.53
12 5 26.32
13 3 15.79
14 4 21.05
15 4 21.05
16 1 5.26