Assigning Values
The most common thing you will do in a DATA step is to assign values to a variable. The basic assignment statement is
variable = expression;
where variable names either a new or existing variable, and expression is some combination of constants, variable names, operators, and functions. When an expression is evaluated (executed) it results in a data value.
This is one of the few SAS statements that does not begin with a SAS keyword (the sum statement is another). Assignment always occurs within a DATA step. The SAS language includes all the usual arithmetic operators and numeric functions.
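As a minimal sketch of how constants, variable names, operators, and functions combine in an expression, consider the step below. The variable names (radius, area, side) are made up for illustration, and the step writes to the log rather than creating a data set.

```sas
data _null_;                              /* _null_: run the step without creating a data set */
    radius = 2;                           /* a numeric constant */
    area   = constant('pi') * radius**2;  /* a function, operators, and an existing variable */
    side   = round(sqrt(area), 0.01);     /* nested function calls */
    put radius= area= side=;              /* write the resulting values to the log */
run;
```

Each statement is evaluated in turn, so area can use radius and side can use area.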
For example, using the SAS example data set, class, we can add a variable for the body mass index of each kid, calculated from the existing variables height and weight. In this example, we read the data from the SASHELP library, create a new copy in the WORK library (in WORK because we have not named any other library on the DATA statement), and call the new variable bmi.
data class;
    set sashelp.class;               /* read SASHELP.CLASS, write WORK.CLASS */
    bmi = (weight/height**2)*703;    /* 703 converts lb/in**2 to kg/m**2 */
run;
proc means data=class n mean stddev;
    var weight height bmi;
run;
The MEANS Procedure
Variable     N            Mean         Std Dev
----------------------------------------------
Weight      19     100.0263158      22.7739335
Height      19      62.3368421       5.1270752
bmi         19      17.8632519       2.0926193
----------------------------------------------
A DATA step can include any number of assignment statements, and they are executed in order, one observation at a time. You should think of a DATA step as an implicit loop: SAS reads in one observation according to a statement that is (usually) at the top of the step, runs through each line of the step with that one observation, and outputs the observation to the output data set at the bottom of the step. Then SAS returns to the top of the step, and repeats for as long as it finds a new observation to read. See Understanding SAS DATA Steps for a more detailed explanation.
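One way to watch the implicit loop at work is to print the automatic variable _N_, which counts iterations of the DATA step. This is only a sketch: the output data set name class_loop is an assumption made for illustration.

```sas
data class_loop;                          /* hypothetical output data set name */
    set sashelp.class;                    /* top of the loop: read one observation */
    put 'Iteration ' _n_ 'read ' name=;   /* one log line per pass through the step */
    bmi = (weight/height**2)*703;         /* executed once for each observation */
    /* bottom of the step: the observation is output, then SAS returns to the top */
run;
```

The log should show one line for each of the 19 observations, in the order they are read.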
Overwriting Existing Variables (Recoding)
Assignment can also overwrite an existing variable. For example, another approach to our bmi calculation might begin by converting Imperial heights and weights into SI units. In this example, notice that we are transforming weight and height in place: each appears on both the left and the right of the assignment operator (the equals sign). And the order matters. It would be a mistake to put the bmi statement first, because bmi would then be calculated from the unconverted pounds and inches. However, changing the statement order would still give us output, with no error messages in the log!
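data class;
    set sashelp.class;
    weight = weight/2.2;       /* pounds to kilograms (approximate) */
    height = height/39.37;     /* inches to meters */
    bmi = weight/height**2;    /* uses the converted values, so it must come last */
run;
proc means data=class n mean stddev;
    var weight height bmi;
run;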
The MEANS Procedure
Variable     N            Mean         Std Dev
----------------------------------------------
Weight      19      45.4665072      10.3517880
Height      19       1.5833590       0.1302280
bmi         19      17.9024862       2.0972154
----------------------------------------------
Compare with this, where the bmi values are nonsense!
2 data class;
3 set sashelp.class;
4 bmi = weight/height**2;
5 weight = weight/2.2;
6 height = height/39.37;
7 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.CLASS has 19 observations and 6 variables.
8
9 proc means data=class n mean stddev;
10 var weight height bmi;
11 run;
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: The PROCEDURE MEANS printed page 1.
The MEANS Procedure
Variable     N            Mean         Std Dev
----------------------------------------------
Weight      19      45.4665072      10.3517880
Height      19       1.5833590       0.1302280
bmi         19       0.0254100       0.0029767
----------------------------------------------
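One defensive alternative, sketched here under the assumption that you are free to add variables, is to store the converted values under new names instead of overwriting weight and height. The names class_si, weight_kg, and height_m are hypothetical.

```sas
data class_si;                       /* hypothetical output data set name */
    set sashelp.class;
    weight_kg = weight/2.2;          /* new variable; weight keeps its value in pounds */
    height_m  = height/39.37;        /* new variable; height keeps its value in inches */
    bmi = weight_kg/height_m**2;     /* depends on the two statements above */
run;
```

If the bmi statement were moved above the two conversions in this version, weight_kg and height_m would still be missing when bmi is calculated. The bmi values would then be missing, and the log should carry a note that missing values were generated, which is easier to spot than plausible-looking wrong numbers.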
Missing Values
When using most SAS operators and functions, a missing value in an expression results in a missing value (with a few important exceptions). In
z = x + y;
if either x or y is missing, z will be missing.
data missing;
input x y;
z = x + y;
datalines;
59 1
60 .
. -39
;
proc print noobs; run;
 x      y      z
59      1     60
60      .      .
 .    -39      .
Observation-wise Summary Statistics
The functions that calculate summary statistics within an observation are exceptions to missing value propagation. Functions like MEAN(), STD(), STDERR(), and SUM() usually have several variables as arguments, and they ignore any arguments that are missing. As long as their values are not all missing (within an observation), the result is generally not missing either. (STD() and STDERR() need at least two nonmissing values to return a result.)
These functions can generally take one of two forms, using either variable names separated by commas, or OF and a variable list.
mean(var1, var2, var3 ...)
mean(of varlist)
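data missing;
    input x1 x2 x3 x4;
    a = mean(x1, x2, x3, x4);     /* arguments listed with commas */
    b = std (x1, x2, x3, x4);
    c = stderr(of x1-x4);         /* OF with a numbered range list */
    d = sum (of _numeric_);       /* every numeric variable so far: x1-x4, a, b, c */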
datalines;
59 1 7 2
60 . -3 5
. -39 . 0
;
proc print noobs; run;
x1     x2    x3    x4           a          b          c          d
59      1     7     2     17.2500    27.9568    13.9784    128.185
60      .    -3     5     20.6667    34.2977    19.8018    136.766
 .    -39     .     0    -19.5000    27.5772    19.5000    -11.423
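To make the contrast with the + operator explicit, the sketch below computes the same total both ways and also counts the nonmissing and missing arguments with N() and NMISS(). The data set name compare and the variable names are made up for illustration.

```sas
data compare;                       /* hypothetical data set name */
    input x1 x2 x3;
    plus   = x1 + x2 + x3;          /* operator: missing if any operand is missing */
    total  = sum(of x1-x3);         /* function: ignores missing arguments */
    n_ok   = n(of x1-x3);           /* number of nonmissing arguments */
    n_miss = nmiss(of x1-x3);       /* number of missing arguments */
    datalines;
59 1 7
60 . -3
. . 5
;
proc print noobs; run;
```

In the second and third observations plus is missing, while total is still calculated from whatever values are present (57 and 5).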
Sum Operator
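As described in Understanding DATA steps, the sum operator (written as a statement of the form variable + expression;), which accumulates sums across observations, does not propagate missing values. A missing value in the expression is ignored, and the running total keeps its previous value.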
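A small sketch of the difference, using made-up names (running, total, naive): total is accumulated with the sum operator, while naive is a retained variable updated with an ordinary assignment.

```sas
data running;                       /* hypothetical data set name */
    input x;
    total + x;                      /* sum operator: total is retained and starts at 0 */
    retain naive 0;
    naive = naive + x;              /* ordinary assignment: a missing x makes naive missing */
    datalines;
59
.
-39
;
proc print noobs; run;
```

Here total is 59, 59, and 20 across the three observations. In contrast, naive becomes missing at the second observation and stays missing, because the missing value is retained and propagates through every later addition.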