Subsets: Selecting Variables

When preparing your data for analysis it is generally a great idea to reduce your data set to just those variables (and observations) that you need. This not only allows the computer to work more efficiently, but also reduces the clutter when you try and look at the data yourself. When data cleaning is written into a SAS script, it is easy to revise later if you decide you need to look are more (or fewer!) variables.

Extracting variables is generally something you do in a DATA step or in a utility PROC (COPY, SORT, STANDARDIZE). In most statistical PROCs you already select variables with a VAR, TABLES, or MODEL statement.

There are two main SAS language elements that you use to select variables:

data set options KEEP or DROP
KEEP or DROP statements

You can pretty easily guess what these keywords mean. The only subtlety is in whether you select the variables before SAS reads data in or before SAS writes data out.

KEEP or DROP statements

Used as statements in a DATA step, KEEP and DROP control what variables a written to the output data set. All of the variables that are SET are read into the Program Data Vector and are available in the DATA step but only the selected variables are saved in the new data set.

While these statements could be written anywhere among your DATA step statements - they are only used during the compile phase of the DATA step - it aids in understanding your code when they are written at the bottom of the step, just before the (implied) OUTPUT.

You should generally use either KEEP or DROP, and not both statements in the same step (although it is technically possible, albeit with a warning). You may use multiple KEEPs or multiple DROPS, but this tends to make your code hard to read. You will generally prefer whichever allows you to write the shortest variable list. However, for anyone who may read your code in the future, using KEEP makes it clear what variables will be in the result without having to be overly familiar with your data.

If I have a data set with heights and weights, but I am interested in analyzing body mass index, I might be interesting in constructing the following data sets.

For a t-test, I might KEEP just the two variables needed

data class;
  set sashelp.class;
  bmi = (weight/height**2)*703;
  keep sex bmi;
  run;

ods select conflimits ttests;
proc ttest data=class;
    class sex;
    var bmi;
    run;

                            The TTEST Procedure
 
                              Variable:  bmi

 Sex           Method               Mean       95% CL Mean        Std Dev

 F                               17.0510     15.5357  18.5664      1.9714
 M                               18.5942     17.1546  20.0339      2.0125
 Diff (1-2)    Pooled            -1.5432     -3.4755   0.3890      1.9933
 Diff (1-2)    Satterthwaite     -1.5432     -3.4745   0.3880            

             Sex           Method             95% CL Std Dev

             F                                1.3316   3.7768
             M                                1.3843   3.6740
             Diff (1-2)    Pooled             1.4957   2.9882
             Diff (1-2)    Satterthwaite                     

        Method           Variances        DF    t Value    Pr > |t|

        Pooled           Equal            17      -1.69      0.1103
        Satterthwaite    Unequal       16.86      -1.69      0.1100

For a regression, I might DROP the variables I don’t need. Here you see that although we use height and weight to calculate bmi, dropping it only affects the output data set.

data class;
  set sashelp.class;
  bmi = (weight/height**2)*703;
  drop name height weight;
  run;

ods select parameterestimates;
proc glm data=class;
    class sex;
    model bmi = sex age / solution;
    run;

                             The GLM Procedure
 
                        Dependent Variable: bmi   

                                         Standard
   Parameter           Estimate             Error    t Value    Pr > |t|

   Intercept        12.60539830 B      4.13540007       3.05      0.0077
   Sex       F      -1.46375027 B      0.88815751      -1.65      0.1188
   Sex       M       0.00000000 B       .                .         .    
   Age               0.44692874        0.30523460       1.46      0.1625

KEEP or DROP data set options

The keywords KEEP and DROP may also be used as data set options. Data set options can be used with both input and output data sets, and they can be used in PROCs as well as in DATA steps. The general form is

data-set (KEEP= var1|varlist <var2 ...>)
data-set (DROP= var1|varlist <var2 ...>)

Used as an input data set option, they promote efficient computing in DATA steps because they prevent SAS from moving data into the PDV. Used as an output data set option, the promote future efficient computing.

Used with an output data set, they do the same thing as a DATA step statement.

We could use these options on both input and output data sets in the same step.

data class (drop=height weight);
  set sashelp.class(drop=name);
  bmi = (weight/height**2)*703;
  run;

proc means data=class n mean stddev;
  class sex;
run;

                            The MEANS Procedure

                N
       Sex    Obs    Variable     N            Mean         Std Dev
       ------------------------------------------------------------
       F        9    Age          9      13.2222222       1.3944334
                     bmi          9      17.0510391       1.9714107

       M       10    Age         10      13.4000000       1.6465452
                     bmi         10      18.5942434       2.0124784
       ------------------------------------------------------------

Used to modify a input data set, these options may also appear in PROCs. (In the next example, this does not provide any computational benefit, it merely gives me an alternative way of specifying TABLES.)

proc freq data=sashelp.class(keep=sex age);
  tables _all_ /nocum;
  run;

                            The FREQ Procedure

                       Sex    Frequency     Percent
                       ----------------------------
                       F             9       47.37 
                       M            10       52.63 

                       Age    Frequency     Percent
                       ----------------------------
                        11           2       10.53 
                        12           5       26.32 
                        13           3       15.79 
                        14           4       21.05 
                        15           4       21.05 
                        16           1        5.26