2 proc reg data=sashelp.class;
3 model weight = sex;
ERROR: Variable Sex in list does not match type prescribed for this list.
NOTE: The previous statement has been deleted.
4 run;
WARNING: No variables specified for an SSCP matrix. Execution terminating.
ERROR: Errors printed on page 1.
Constructing Indicator Variables with SAS
CLASS-less PROCs
Most SAS estimation procedures let you declare classification (aka categorical, factor) variables using a CLASS statement. The CLASS statement lets SAS know that these variables should be represented by sets of indicator (binary, dummy) variables. However, some regression procedures (notably PROC REG) do not automatically generate indicator variables for classification variables, leaving this up to you.
If you are working with just a few categorical variables and just a few categories, you can code this in a DATA step and the extra work is manageable. But with lots of variables, lots of categories, and/or higher-order effects (interactions, polynomials), this can be quite a bit of work.
An Example
For example, this PROC REG produces an error:
proc reg data=sashelp.class;
model weight = sex;
run;
Although sex has only two categories in this small data set, it is stored as a character variable. PROC REG requires every variable in a MODEL statement to be numeric, which is why the log reports a “type” error. While there are a variety of ways to encode sex numerically, the model will be easiest to interpret if this is an indicator variable.
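If you are unsure how a variable is stored, a quick check such as the following sketch lists each variable with its type. For sashelp.class it should report Sex as a character variable and Weight as numeric.

```sas
/* List the variables in sashelp.class together with their types and lengths */
proc contents data=sashelp.class;
run;
```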
Solutions
There are at least three solutions to this problem.
- Switch to a procedure like PROC GLM that handles classification variables.
- Create your own indicators, so you can use the extra features of PROC REG (diagnostics, multiple model statements, extra plots, etc.).
- Use PROC GLMMOD to create indicators, and then use PROC REG for analysis.
Switch to PROC GLM
Perhaps the simplest solution is to switch to PROC GLM. Using PROC GLM you can let the software create indicator variables for you, by declaring CLASS variables.
proc glm data=sashelp.class;
class sex;
model weight = sex / solution ss3;
run;
The GLM Procedure
Class Level Information
Class Levels Values
Sex 2 F M
Number of Observations Read 19
Number of Observations Used 19
The GLM Procedure
Dependent Variable: Weight
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 1681.122953 1681.122953 3.73 0.0702
Error 17 7654.613889 450.271405
Corrected Total 18 9335.736842
R-Square Coeff Var Root MSE Weight Mean
0.180074 21.21402 21.21960 100.0263
Source DF Type III SS Mean Square F Value Pr > F
Sex 1 1681.122953 1681.122953 3.73 0.0702
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 108.9500000 B 6.71022656 16.24 <.0001
Sex F -18.8388889 B 9.74973316 -1.93 0.0702
Sex M 0.0000000 B . . .
NOTE: The X'X matrix has been found to be singular, and a generalized
inverse was used to solve the normal equations. Terms whose
estimates are followed by the letter 'B' are not uniquely estimable.
An advantage of PROC GLM is that it also allows you to easily create interaction and polynomial terms.
proc glm data=sashelp.class;
class sex;
model weight = sex | age; /* sex by age interaction */
run;
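The interaction example above covers one kind of higher-order effect. A polynomial term can be written in the same way. The following is a minimal sketch, assuming you want a quadratic in age alongside the sex indicator; the choice of terms is illustrative only and is not a recommended model for these data.

```sas
proc glm data=sashelp.class;
  class sex;
  /* age is not listed in CLASS, so it is treated as continuous; */
  /* age*age adds the squared (quadratic) term                   */
  model weight = sex age age*age / solution;
run;
```

Because age does not appear in the CLASS statement, age*age is a squared continuous term rather than a crossing of categories.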
Create Indicators in a DATA Step
This is straightforward if you don’t have too many categories to create indicators for.
data class;
set sashelp.class;
female = (sex eq 'F');
run;
proc reg data=class;
model weight = female;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight
Number of Observations Read 19
Number of Observations Used 19
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 1681.12295 1681.12295 3.73 0.0702
Error 17 7654.61389 450.27141
Corrected Total 18 9335.73684
Root MSE 21.21960 R-Square 0.1801
Dependent Mean 100.02632 Adj R-Sq 0.1318
Coeff Var 21.21402
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 108.95000 6.71023 16.24 <.0001
female 1 -18.83889 9.74973 -1.93 0.0702
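The same approach extends to a variable with more than two categories: create one indicator per category, leaving one category out as the reference. The sketch below assumes the sashelp.cars sample data set, whose Origin variable takes the values Asia, Europe, and USA. The indicator names origin_asia and origin_europe are made up for this illustration, and USA is arbitrarily chosen as the reference category.

```sas
data cars;
  set sashelp.cars;
  /* One indicator per non-reference category; USA is the reference */
  origin_asia   = (origin eq 'Asia');
  origin_europe = (origin eq 'Europe');
run;

proc reg data=cars;
  /* Each coefficient compares that origin with USA */
  model mpg_city = origin_asia origin_europe;
run;
```

With k categories you need k - 1 such assignment statements, which is where the DATA step approach starts to become tedious.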
Use PROC GLMMOD to Create Indicators
This is useful where you would benefit from some special feature of PROC REG, and you have a lot of categories to encode as indicators.
The GLMMOD procedure essentially constitutes the model-building front end for the GLM procedure; it constructs and saves the design matrix for the model you specify. You can take the output from the GLMMOD procedure and use it as input to PROC REG or other SAS procedures.
proc glmmod data=sashelp.class;
class sex;
model weight = sex;
run;
The GLMMOD Procedure
Class Level Information
Class Levels Values
Sex 2 F M
Number of Observations Read 19
Number of Observations Used 19
The GLMMOD Procedure
Parameter Definitions
Name of
Column Associated CLASS Variable Values
Number Effect Sex
1 Intercept
2 Sex F
3 Sex M
The GLMMOD Procedure
Design Points
Observation Column Number
Number Weight 1 2 3
1 112.5 1 0 1
2 84.0 1 1 0
3 98.0 1 1 0
4 102.5 1 1 0
5 102.5 1 0 1
6 83.0 1 0 1
7 84.5 1 1 0
8 112.5 1 1 0
9 84.0 1 0 1
10 99.5 1 0 1
11 50.5 1 1 0
12 90.0 1 1 0
13 77.0 1 1 0
14 112.0 1 1 0
15 150.0 1 0 1
16 128.0 1 0 1
17 133.0 1 0 1
18 85.0 1 0 1
19 112.0 1 0 1
Model specification is the same as for PROC GLM. The printed output describes the model matrix. What we want is to save the design matrix (“Design Points” table) as a data set to be used as input for our analysis PROC.
- The CLASS statement is where you identify the classification variables to be used in the analysis.
- The MODEL statement names the dependent variable(s) and all the independent effects.
SAS gives us two options for saving the design matrix:
- an OUTDESIGN option on the PROC statement, or
- an ODS data set.
Each of these options has its awkward points. The OUTDESIGN option (usually paired with an OUTPARM option) uses column numbers to name its design variables, making it more difficult to interpret your results. The ODS data set uses intelligible variable names. With either option, you would want to suppress the printed output for any large data set.
ods html exclude all; /* pause printed output */
ods output designpoints = classmatrix;
proc glmmod data=sashelp.class;
class sex;
model weight = sex;
run;
ods html select all; /* resume printed output */
This creates indicators for both sexes, sex_f and sex_m. Pick one for your regression; including both alongside the intercept would make the model collinear.
proc reg data=classmatrix;
model weight = sex_f;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight
Number of Observations Read 19
Number of Observations Used 19
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 1681.12295 1681.12295 3.73 0.0702
Error 17 7654.61389 450.27141
Corrected Total 18 9335.73684
Root MSE 21.21960 R-Square 0.1801
Dependent Mean 100.02632 Adj R-Sq 0.1318
Coeff Var 21.21402
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 108.95000 6.71023 16.24 <.0001
Sex_F Sex F 1 -18.83889 9.74973 -1.93 0.0702
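For comparison, the OUTDESIGN route mentioned earlier can be sketched as follows. The data set names design and parms are placeholders chosen for this illustration. The design variables are named by column number (Col1, Col2, ...), so the OUTPARM data set is printed to show what each column represents.

```sas
/* Save the design matrix (OUTDESIGN=) and the column definitions (OUTPARM=); */
/* NOPRINT suppresses the printed tables                                      */
proc glmmod data=sashelp.class outdesign=design outparm=parms noprint;
  class sex;
  model weight = sex;
run;

/* Look up which effect and level each column number stands for */
proc print data=parms;
run;

/* Per the Parameter Definitions table: Col1 = Intercept, Col2 = Sex F, Col3 = Sex M */
proc reg data=design;
  model weight = col2;
run;
```

The estimates should match the sex_f model above. Only the variable name in the Parameter Estimates table differs, which is the interpretability cost of this option.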