Subsets: Selecting Observations

Another common task for the data analyst is to extract selected observations from a data set. It may be that only a subgroup within the population is of analytical interst. You may want to pull a random sample from a sampling frame. Or you may simply want a smaller subset with which to test the SAS code that you are developing.

Conditional Observations

You can conditionally select observations in DATA steps or in PROC steps. You can use both statements and data set options in either context.

  • WHERE statements
  • subsetting IF statements
  • WHERE data set options

WHERE statements

A WHERE statement can be used in either a DATA step or a PROC step - any step that reads each observation. It controls what data SAS reads for the step. This makes it a very efficient method of extracting observations, because it skips reading some observations. It takes the form

WHERE expression

where expression is any logical expression that can be composed using the variables in the data set to be read. The key NOTE here says “There were 7 observations read …”.

DATA step WHERE

2          data class;
3            set sashelp.class;
4            where age lt 13;
5          run;

NOTE: There were 7 observations read from the data set SASHELP.CLASS.
      WHERE age<13;
NOTE: The data set WORK.CLASS has 7 observations and 5 variables.

PROC step WHERE

proc freq data=sashelp.class;
  where age lt 13;
  tables sex / nocum;
run;
                            The FREQ Procedure

                       Sex    Frequency     Percent
                       ----------------------------
                       F             3       42.86 
                       M             4       57.14 

PROC SORT WHERE

If you need to work with a subset of observations, but also SORT your data for a later MERGE or BY group processing, it would be very efficient to subset at the start of the SORT procedure.

2          proc sort data=sashelp.class out=class;
3            where age lt 13;
4            by sex;
5          run;

NOTE: There were 7 observations read from the data set SASHELP.CLASS.
      WHERE age<13;
NOTE: The data set WORK.CLASS has 7 observations and 5 variables.

Subsetting IF

A subsetting IF statement can only be used in a DATA step. It is less efficient than WHERE because all the observations are read into the Program Data Vector before the condition is applied. The IF statement has been part of the SAS language longer than the WHERE statement, so it sometimes appears in old code, or code written by old coders. It is useful where some value cannot be calculated from the input data set(s).

Inefficient

Here note that 19 observations are read.

2          data class;
3            set sashelp.class;
4            if age lt 13;
5          run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.CLASS has 7 observations and 5 variables.

Table Lookup

One place a subsetting IF is often required is in a table lookup. When performing a many-to-one match merge, it is often the case that our secondary table includes observations which are not used and do not add anything to our analysis. The variable which tells us whether an input data set contributes to the output data set does not exist beyond the PDV, so it cannot be used in a WHERE statement.

This IF statement outputs only observations which had data contributed by the first data set.

data complete;
  merge individual(in=ds1) county;
  by location_id;
  if ds1;
  run;

WHERE data set option

As a data set option, the WHERE option can be used any time a data set is referenced. For input data sets this is essentially the same as a WHERE statement. However, it can also be used with output data sets.

As a data set option the form is

data-set (WHERE=(logical-expression))

Here, the equals symbol and all the parentheses are required.

For example, suppose we were interested in producing a table of mean city mpg by number of cylinders, but only for those categories with more than 5 observations (dropping the 3, 10, and 12 cylinder categories). Using a WHERE option saves us a DATA step to drop the unwanted categories.

ods output summary=mpg(where=(nobs > 5));

proc means data=sashelp.cars n mean std;
    class cylinders;
    var mpg_city;
    run;

proc print data=mpg noobs; run;
                               MPG_        MPG_City_       MPG_City_
         Cylinders    NObs    City_N            Mean          StdDev

                 4    136      136      24.941176471    5.2093430394
                 5      7        7      19.857142857    0.8997354108
                 6    190      190      18.515789474    1.7630130408
                 8     87       87      15.873563218    1.8912564896

Observation Ranges

It can occasionally be useful to pick a subset of observations by observation number. This is especially useful when creating small data sets with which to test code. Two data set options, FIRSTOBS and OBS make this very easy to code.

Both options reference observation numbers and not numbers of observations. That is

data-set (firstobs=10 obs=20)

extracts observation 10 through 20 (11 observations), and not the twenty observations beginning with observation 10!

2          data cars;
3            set sashelp.cars(firstobs=10 obs=20);
4            run;

NOTE: There were 11 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 11 observations and 15 variables.

Because these are data set options, these could equally be use with output data sets.

Random Sampling

You could write a DATA step that selects a random sample or you can use PROC SURVEYSELECT. This PROC has more than a dozen probability sampling methods (depending on how you count them) built in.

A simple random sample:

proc surveyselect data=sashelp.cars(keep=make model invoice) 
    out=carsample
    method=srs
    n=10;
run;

proc print noobs; run;
                        The SURVEYSELECT Procedure

                Selection Method    Simple Random Sampling

                    Input Data Set                 CARS
                    Random Number Seed        687613000
                    Sample Size                      10
                    Selection Probability      0.023364
                    Sampling Weight                42.8
                    Output Data Set           CARSAMPLE
 
                                                                           
 
          Make             Model                         Invoice

          Audi             A4 3.0 Quattro 4dr manual     $30,366
          Chevrolet        TrailBlazer LT                $27,479
          Ford             Focus LX 4dr                  $12,906
          Jaguar           X-Type 2.5 4dr                $27,355
          Mercedes-Benz    C320 4dr                      $36,162
          Mercedes-Benz    CL500 2dr                     $88,324
          Nissan           Murano SL                     $27,300
          Suzuki           Forenza EX 4dr                $15,378
          Toyota           Prius 4dr (gas/electric)      $18,926
          Volvo            C70 HPT convertible 2dr       $40,083