Constructing Indicator Variables with SAS

CLASS-less PROCs

Most SAS estimation procedures let you declare classification (aka categorical, factor) variables using a CLASS statement. The CLASS statement lets SAS know that these variables should be represented by sets of indicator (binary, dummy) variables.

However, some regression procedures (notably PROC REG) do not automatically generate indicator variables for classification variables, leaving this up to you.

If you are working with just a few categorical variables and just a few categories, you can code the indicators in a DATA step and the extra work is manageable. But with lots of variables, lots of categories, and/or higher-order effects (interactions, polynomials), this can be quite a bit of work.

An Example

For example, this PROC REG produces an error:

proc reg data=sashelp.class;
  model weight = sex;
  run;
2          proc reg data=sashelp.class;
3            model weight = sex;
ERROR: Variable Sex in list does not match type prescribed for this list.
NOTE: The previous statement has been deleted.
4            run;

WARNING: No variables specified for an SSCP matrix. Execution terminating.

ERROR: Errors printed on page 1.

Although sex has only two categories in this small data set, it is stored as a character variable. PROC REG requires every variable in the MODEL statement to be numeric, so a character variable triggers the “type” error shown in the log. While there are a variety of ways to encode sex numerically, the model will be easiest to interpret if sex is represented by an indicator variable.

Solutions

There are at least three solutions to this problem.

  • Switch to a procedure like PROC GLM that handles classification variables.
  • Create your own indicators, so you can use the extra features of PROC REG (diagnostics, multiple model statements, extra plots, etc.).
  • Use PROC GLMMOD to create indicators, and then use PROC REG for analysis.

Switch to PROC GLM

Perhaps the simplest solution is to switch to PROC GLM. Using PROC GLM you can let the software create indicator variables for you, by declaring CLASS variables.

proc glm data=sashelp.class;
  class sex;
  model weight = sex / solution ss3;
  run;
                             The GLM Procedure

                         Class Level Information
 
                      Class         Levels    Values

                      Sex                2    F M   

                  Number of Observations Read          19
                  Number of Observations Used          19
 
                                                                           
 
                             The GLM Procedure
 
                       Dependent Variable: Weight   

                                     Sum of
 Source                    DF       Squares   Mean Square  F Value  Pr > F

 Model                      1   1681.122953   1681.122953     3.73  0.0702

 Error                     17   7654.613889    450.271405                 

 Corrected Total           18   9335.736842                               

            R-Square     Coeff Var      Root MSE    Weight Mean

            0.180074      21.21402      21.21960       100.0263


 Source                    DF   Type III SS   Mean Square  F Value  Pr > F

 Sex                        1   1681.122953   1681.122953     3.73  0.0702

                                         Standard
   Parameter           Estimate             Error    t Value    Pr > |t|

   Intercept        108.9500000 B      6.71022656      16.24      <.0001
   Sex       F      -18.8388889 B      9.74973316      -1.93      0.0702
   Sex       M        0.0000000 B       .                .         .    

NOTE: The X'X matrix has been found to be singular, and a generalized 
      inverse was used to solve the normal equations.  Terms whose 
      estimates are followed by the letter 'B' are not uniquely estimable.

An advantage of PROC GLM is that it also allows you to easily create interaction and polynomial terms.

proc glm data=sashelp.class;
  class sex;
  model weight = sex | age;  /* main effects plus sex by age interaction */
  run;
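
A polynomial term can be requested in the same way. The following is a minimal sketch, assuming a quadratic in age is of interest; the choice of a quadratic is an illustration rather than part of the original example.

proc glm data=sashelp.class;
  class sex;
  /* age*age adds a squared term; age is continuous, so it is not in CLASS */
  model weight = sex age age*age / solution;
  run;

Because age is not listed on the CLASS statement, PROC GLM treats age*age as the square of a continuous variable rather than as a crossing of classification levels.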

Create Indicators in a DATA Step

This is straightforward if you don’t have too many categories to create indicators for.

data class;
  set sashelp.class;
  female = (sex eq 'F');
  run;

proc reg data=class;
  model weight = female;
  run;
                             The REG Procedure
                               Model: MODEL1
                        Dependent Variable: Weight 

                  Number of Observations Read          19
                  Number of Observations Used          19

                           Analysis of Variance
 
                                   Sum of          Mean
 Source                  DF       Squares        Square   F Value   Pr > F

 Model                    1    1681.12295    1681.12295      3.73   0.0702
 Error                   17    7654.61389     450.27141                   
 Corrected Total         18    9335.73684                                 

           Root MSE             21.21960    R-Square     0.1801
           Dependent Mean      100.02632    Adj R-Sq     0.1318
           Coeff Var            21.21402                       

                           Parameter Estimates
 
                        Parameter       Standard
   Variable     DF       Estimate          Error    t Value    Pr > |t|

   Intercept     1      108.95000        6.71023      16.24      <.0001
   female        1      -18.83889        9.74973      -1.93      0.0702
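
The same approach extends to a variable with more than two categories: create one indicator for every category except the one you choose as the reference. The sketch below assumes the SASHELP.CARS sample data set, where origin takes the values Asia, Europe, and USA; the data set name cars, the indicator names asia and europe, and the choice of USA as the reference category are illustrative.

data cars;
  set sashelp.cars;
  asia   = (origin eq 'Asia');    /* 1 if Asia, 0 otherwise */
  europe = (origin eq 'Europe');  /* 1 if Europe, 0 otherwise */
  /* no indicator for USA: it is the reference category */
  run;

proc reg data=cars;
  model msrp = asia europe;
  run;

Each coefficient is then the difference in mean msrp between that origin and the reference category, and the intercept is the mean for the reference category.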

Use PROC GLMMOD to Create Indicators

This is useful where you would benefit from some special feature of PROC REG, and you have a lot of categories to encode as indicators.

The GLMMOD procedure essentially constitutes the model-building front end for the GLM procedure; it constructs and saves the design matrix for the model you specify. You can take the output from the GLMMOD procedure and use it as input to PROC REG or other SAS procedures.

proc glmmod data=sashelp.class;
  class sex;
  model weight = sex;
  run;
                           The GLMMOD Procedure

                         Class Level Information
 
                      Class         Levels    Values

                      Sex                2    F M   

                  Number of Observations Read          19
                  Number of Observations Used          19
 
                                                                           
 
                           The GLMMOD Procedure

                           Parameter Definitions
 
                                   Name of
                        Column    Associated    CLASS Variable Values
                        Number      Effect      Sex

                            1     Intercept        
                            2     Sex            F 
                            3     Sex            M 
 
                                                                           
 
                           The GLMMOD Procedure

                              Design Points
 
                   Observation              Column Number
                     Number       Weight    1    2    3

                          1       112.5     1    0    1
                          2        84.0     1    1    0
                          3        98.0     1    1    0
                          4       102.5     1    1    0
                          5       102.5     1    0    1
                          6        83.0     1    0    1
                          7        84.5     1    1    0
                          8       112.5     1    1    0
                          9        84.0     1    0    1
                         10        99.5     1    0    1
                         11        50.5     1    1    0
                         12        90.0     1    1    0
                         13        77.0     1    1    0
                         14       112.0     1    1    0
                         15       150.0     1    0    1
                         16       128.0     1    0    1
                         17       133.0     1    0    1
                         18        85.0     1    0    1
                         19       112.0     1    0    1

Model specification is the same as for PROC GLM. The printed output describes the model matrix. What we want is to save the design matrix (“Design Points” table) as a data set to be used as input for our analysis PROC.

  • The CLASS statement is where you identify the classification variables to be used in the analysis.
  • The MODEL statement names the dependent variable(s) and all the independent effects.

SAS gives us two options for saving the design matrix:

  • an OUTDESIGN option on the PROC statement, or
  • an ODS data set.

Each of these options has its awkward points. The OUTDESIGN option (usually paired with an OUTPARM option) uses column numbers to name its design variables, making it more difficult to interpret your results. The ODS data set uses intelligible variable names. With either option, you would want to suppress the printed output for any large data set.

ods html exclude all; /* pause printed output */
ods output designpoints = classmatrix;
proc glmmod data=sashelp.class;
  class sex;
  model weight = sex;
  run;
ods html select all; /* resume printed output */

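This creates an indicator for each sex, Sex_F and Sex_M. Use only one of them in a regression that includes an intercept: the two indicators sum to 1 in every row, so including both would make the model collinear with the intercept.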

proc reg data=classmatrix;
  model weight = sex_f;
  run;
                             The REG Procedure
                               Model: MODEL1
                        Dependent Variable: Weight 

                  Number of Observations Read          19
                  Number of Observations Used          19

                           Analysis of Variance
 
                                   Sum of          Mean
 Source                  DF       Squares        Square   F Value   Pr > F

 Model                    1    1681.12295    1681.12295      3.73   0.0702
 Error                   17    7654.61389     450.27141                   
 Corrected Total         18    9335.73684                                 

           Root MSE             21.21960    R-Square     0.1801
           Dependent Mean      100.02632    Adj R-Sq     0.1318
           Coeff Var            21.21402                       

                            Parameter Estimates
 
                               Parameter      Standard
Variable    Label       DF      Estimate         Error   t Value   Pr > |t|

Intercept   Intercept    1     108.95000       6.71023     16.24     <.0001
Sex_F       Sex F        1     -18.83889       9.74973     -1.93     0.0702
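
For comparison, the OUTDESIGN route can be sketched as follows. This is a minimal example under stated assumptions: the data set names design and parm are illustrative, and the design columns are assumed to carry the default names Col1, Col2, Col3, matching the column numbers in the Parameter Definitions table above.

proc glmmod data=sashelp.class outdesign=design outparm=parm noprint;
  class sex;
  model weight = sex;
  run;

/* parm maps each column number to its effect and class level */
proc print data=parm;
  run;

/* Col2 is the indicator for Sex = F; PROC REG adds its own intercept */
proc reg data=design;
  model weight = col2;
  run;

The estimates should match the Sex_F model above, but the parameter is reported as Col2, so you need the OUTPARM data set to tell which category it represents.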