data survey;
infile datalines;
array q{10};
input q{*};
datalines;
2 5 3 3 5 9 3 1 3 8
4 5 2 5 5 1 8 8 2 3
4 3 4 8 4 2 5 1 9 5
9 5 9 9 8 3 2 9 3 2
8 1 3 3 9 4 3 2 4 8
8 1 5 4 9 1 3 9 2 1
2 4 2 5 8 9 1 8 3 4
3 2 8 5 8 8 2 3 2 8
2 9 9 4 5 2 9 3 5 1
5 4 2 3 2 2 3 2 1 1
;
Arrays - Looping Over Variables
When preparing data for analysis, you often need to perform the same calculation on several variables: for example, converting units of measure from imperial to metric, or recoding non-responses to survey questions.
An Alternative to Variable Names
In SAS, an array is a DATA step language construct that makes it easy to loop over a collection of variables.
Using arrays is a two-part process.
- Define an array with an ARRAY statement. Arrays are simply a language construct, convenient aliases (like LIBNAMEs and FILENAMEs), and they must be defined in every DATA step where you wish to use them. They are not saved in the output data set.
- Use an array reference anywhere you might use a variable name.
Defining an Array
Within a DATA step, the basic ARRAY statement takes the form
ARRAY array-name {size} variable-list;
For example, if I have 5 variables I’d like to loop over, my DATA step might look like
data new;
set old;
array v {5} q1a q1b q1c q1d q1e;
...
Here, my array is named “v”, it has five elements, and the data is actually stored in the variables q1a through q1e.
The variable list is optional. When you omit the variable list, the SAS default is to use variables with the array name as a prefix and the array position as a suffix. For example:
data new;
set old;
array v {5};
...
Here the variables being referenced are named v1 through v5. If they already exist in your data set, SAS uses those; otherwise SAS creates new variables.
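As a small illustration of the default naming, the following sketch starts with no input data at all, so SAS has to create v1 through v5 itself. The data set name demo and the assigned values are made up for this example.

```sas
data demo;
  array v{5};          /* no variable list, so SAS uses v1-v5 */
  do i = 1 to 5;
    v{i} = i * 10;     /* v1 = 10, v2 = 20, ..., v5 = 50 */
  end;
  drop i;              /* the loop counter is not needed in the output */
run;

proc print data=demo;
run;
```

The printed data set should contain a single observation with the five new variables v1 through v5.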
Array References
Having defined an array, you use elements of the array through an array reference (and you can still use the actual variable names, where that is convenient).
An array reference simply takes the form
array-name{element}
In the first DATA step above, I can reference q1a as v{1}, q1e as v{5}, etc. Because array references have an index, it is easy to loop over the variable list.
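To make the looping idea concrete, here is a minimal sketch of the unit-conversion case. It assumes, purely for illustration, that q1a through q1e hold measurements in inches; the two rows of input values are invented.

```sas
/* Hypothetical input: five measurements in inches per observation */
data old;
  input q1a q1b q1c q1d q1e;
  datalines;
60 62 65 70 72
61 63 66 71 73
;

data new;
  set old;
  array v{5} q1a q1b q1c q1d q1e;
  do i = 1 to 5;
    v{i} = v{i} * 2.54;   /* convert inches to centimeters in place */
  end;
  drop i;
run;
```

One assignment inside the loop replaces five nearly identical assignment statements, one per variable.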
Where SAS can use a variable list, you can also use the reference
array-name{*}
to refer to the whole array. In the previous example, the reference v{*} would mean v1-v5.
Using Arrays as Variable Lists
Perhaps the simplest way to use an array is as a quick way to define and refer to a group of variables.
Arrays for Data Input
An ARRAY statement maps a correspondence between array references and actual variables in a data set. If the variables do not already exist, the ARRAY statement creates them in the program data vector (PDV). This is a handy way to create a set of related variables.
An ARRAY statement of the form
ARRAY v{10};
seeks to create references to variables named v1 through v10 because I have not given an explicit variable list. If it finds those variables in the PDV, it uses them whenever an array reference is used. If it doesn’t find them, it creates them.
If I had a survey with 10 questions, and I wanted to name the variables Q1 through Q10, my DATA step to input the data might look like the data survey step at the top of this page.
Here the ARRAY statement creates a group of variables, q1 through q10. In the DATA step’s compile phase, these ten variables are added to the PDV at this point.
Then the array reference, q{*}, stands in for the variable list q1-q10 on the INPUT statement - a small shortcut.
In subsequent use of the data set, we refer to the variables by their variable names (there are no array references in PROCs).
proc means n min max;
var q1 q5;
run;
The MEANS Procedure
Variable N Minimum Maximum
----------------------------------------------
q1 10 2.0000000 9.0000000
q5 10 2.0000000 9.0000000
----------------------------------------------
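Although array references are not available in PROCs, a numbered range list gives a similar shortcut there. The sketch below assumes the survey data set created earlier and summarizes all ten questions at once.

```sas
proc means data=survey n min max;
  var q1-q10;   /* numbered range list: q1, q2, ..., q10 */
run;
```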
Arrays in DATA Step Functions
There are a few DATA step functions that accept variable lists as arguments. We can use array references as a special form of variable list.
For example, the MEAN function has a special form that uses variable lists:
MEAN(OF variable-list)
Note we must define an array in order to use it here - the array definition does not carry over with the survey data set.
data rowmean;
set survey;
array v{*} q1-q10;
x = mean(of v{*});
run;
proc print;
run;
Obs q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 x
1 2 5 3 3 5 9 3 1 3 8 4.2
2 4 5 2 5 5 1 8 8 2 3 4.3
3 4 3 4 8 4 2 5 1 9 5 4.5
4 9 5 9 9 8 3 2 9 3 2 5.9
5 8 1 3 3 9 4 3 2 4 8 4.5
6 8 1 5 4 9 1 3 9 2 1 4.3
7 2 4 2 5 8 9 1 8 3 4 4.6
8 3 2 8 5 8 8 2 3 2 8 4.9
9 2 9 9 4 5 2 9 3 5 1 4.9
10 5 4 2 3 2 2 3 2 1 1 2.5
Here I name my array v just to emphasize the point that the name of the array and the variable names are not required to match. However, it’s less confusing when they do match, don’t you think?
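The same OF form works with several other summary functions. The following sketch assumes the survey data set from above; the output data set name rowstats and the new variable names are chosen only for this illustration.

```sas
data rowstats;
  set survey;
  array q{10};                    /* refers to the existing q1-q10 */
  total      = sum(of q{*});      /* row total */
  lowest     = min(of q{*});      /* smallest response in the row */
  highest    = max(of q{*});      /* largest response in the row */
  n_answered = n(of q{*});        /* count of non-missing values */
  n_missing  = nmiss(of q{*});    /* count of missing values */
run;
```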
Loop over Variables
An array is a very handy construct to loop over variables.
Suppose in the example above, the data are from a survey where 8 = “don’t know” and 9 = “refused to answer”. For most analyses, we would want to recode the 8’s and 9’s as missing. For just 2 or 3 variables it might be easy to code
data
...
if (q1 eq 8 or q1 eq 9) then q1 = .;
if (q2 eq 8 or q2 eq 9) then q2 = .;
...
Where there are many variables it will be much easier to put this in a DO loop.
Notice again that although the variables already exist (from the previous example), we still have to define the array in this new DATA step.
data recode;
set survey;
array q{10};
do i = 1 to 10;
if (q{i} eq 8 or q{i} eq 9) then q{i} = .;
end;
drop i;
run;
proc means n min max;
var q1 q5;
run;
The MEANS Procedure
Variable N Minimum Maximum
----------------------------------------------
q1 7 2.0000000 5.0000000
q5 5 2.0000000 5.0000000
----------------------------------------------
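A slightly more compact variant of the same recode is sketched below. It assumes the survey data set from above and uses the IN operator in place of the two comparisons; the data set name recode2 is hypothetical.

```sas
data recode2;
  set survey;
  array q{10};
  do i = 1 to dim(q);                  /* dim(q) is 10 here */
    if q{i} in (8, 9) then q{i} = .;   /* 8 = don't know, 9 = refused */
  end;
  drop i;
run;

/* NMISS shows how many values were recoded for each question */
proc means data=recode2 n nmiss min max;
  var q1-q10;
run;
```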
Array Dimensions and Sizes
An array can have one or more dimensions. The preceding examples are all one-dimensional arrays of various sizes.
Two Dimensional Arrays
For an example of a two-dimensional array, reconsider the previous example. Suppose this survey went back to the same respondents twice a year (January and July) for 5 years, and suppose we wish to identify those years (if any) in which a respondent dropped out. It is convenient to treat the ten responses as a 5-by-2 array, where the first array dimension represents a year and the second represents a survey instance within that year.
data dropouts;
set recode;
array q{5,2}; /* response array */
array y{5} y1-y5; /* array to count not-missed instances */
do year = 1 to 5;
y{year} = 0; /* initial value for each observation-year */
do instance = 1 to 2;
if (q{year,instance} not eq .) then y{year} + 1;
end;
end;
drop year instance;
run;
proc print; run;
Obs q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 y1 y2 y3 y4 y5
1 2 5 3 3 5 . 3 1 3 . 2 2 1 2 1
2 4 5 2 5 5 1 . . 2 3 2 2 2 0 2
3 4 3 4 . 4 2 5 1 . 5 2 1 2 2 1
4 . 5 . . . 3 2 . 3 2 1 0 1 1 2
5 . 1 3 3 . 4 3 2 4 . 1 2 1 2 1
6 . 1 5 4 . 1 3 . 2 1 1 2 1 1 2
7 2 4 2 5 . . 1 . 3 4 2 2 0 1 2
8 3 2 . 5 . . 2 3 2 . 2 1 0 2 1
9 2 . . 4 5 2 . 3 5 1 1 1 2 1 2
10 5 4 2 3 2 2 3 2 1 1 2 2 2 2 2
Notice that the actual variables are still q1 through q10 by default. If we had used more intuitive variable names we would need to provide them explicitly, e.g. q2020_1, q2020_2, q2021_1, etc.
We are still processing one observation (one data row) at a time. Using an array, we let SAS translate a two-dimensional concept - years and instances - into a one-dimensional data structure.
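As a sketch of the explicit-name case, the step below renames q1 through q10 to year-based names and then lists them in the ARRAY statement. It assumes the recode data set from above; the new variable names, the data set name named, and the variable answered_2020 are hypothetical. Variables fill the array row by row, so the two instances for a year sit next to each other in the list.

```sas
data named;
  set recode(rename=(q1=q2020_1 q2=q2020_2 q3=q2021_1 q4=q2021_2
                     q5=q2022_1 q6=q2022_2 q7=q2023_1 q8=q2023_2
                     q9=q2024_1 q10=q2024_2));
  /* explicit variable list: the second subscript varies fastest */
  array q{5,2} q2020_1 q2020_2 q2021_1 q2021_2 q2022_1 q2022_2
               q2023_1 q2023_2 q2024_1 q2024_2;
  /* q{1,1} is q2020_1 and q{1,2} is q2020_2 */
  answered_2020 = n(q{1,1}, q{1,2});
run;
```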
Dimension Subscripts
Dimension sizes and array references are not limited to ordinal positions. We can also use lower-bound:upper-bound specifications to establish array sizes, and then use values within that range as references.
For example, our last example might be more intuitive if we coded
data dropouts;
set recode;
array q{2020:2024,2}; /* response array */
array y{2020:2024} y2020-y2024; /* array to count not-missed instances */
do year = 2020 to 2024;
y{year} = 0; /* initial value for each observation-year */
do instance = 1 to 2;
if (q{year,instance} not eq .) then y{year} + 1;
end;
end;
drop year instance;
run;
proc print; run;
Obs q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 y2020 y2021 y2022 y2023 y2024
1 2 5 3 3 5 . 3 1 3 . 2 2 1 2 1
2 4 5 2 5 5 1 . . 2 3 2 2 2 0 2
3 4 3 4 . 4 2 5 1 . 5 2 1 2 2 1
4 . 5 . . . 3 2 . 3 2 1 0 1 1 2
5 . 1 3 3 . 4 3 2 4 . 1 2 1 2 1
6 . 1 5 4 . 1 3 . 2 1 1 2 1 1 2
7 2 4 2 5 . . 1 . 3 4 2 2 0 1 2
8 3 2 . 5 . . 2 3 2 . 2 1 0 2 1
9 2 . . 4 5 2 . 3 5 1 1 1 2 1 2
10 5 4 2 3 2 2 3 2 1 1 2 2 2 2 2
Notice that here again the array q still refers to variables q1 to q10 (the default), despite the use of a subscript range.
Letting SAS Determine Dimension Sizes
When Defining Arrays
For one-dimensional arrays, you can let SAS automatically determine how many elements are in the array if you explicitly give variable names.
For example if you specify
data new;
set old;
array v {*} v1 v4 v6;
...
SAS understands that the array v has three elements regardless of any other variables with a v prefix. However, you cannot specify array v {*}; with no variable names.
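A quick way to confirm the element count is to write it to the log. In this sketch the variables and their values are invented, and v2 exists only to show that it is ignored.

```sas
data _null_;
  v1 = 1; v2 = 2; v4 = 4; v6 = 6;   /* v2 is not part of the array */
  array v{*} v1 v4 v6;              /* SAS counts the listed variables */
  n_elements = dim(v);
  put n_elements=;                  /* writes n_elements=3 to the log */
run;
```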
Iterating
When iterating over arrays, you can use the dim() function to determine the number of elements in an array dimension; it serves directly as the upper limit of a DO loop when the dimension's lower bound is 1. Alternatively, you can use the hbound and lbound functions to determine the upper and lower bounds of an array dimension. These are particularly useful in DO loops.
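The difference between the three functions is easiest to see on an array whose lower bound is not 1. This sketch reuses the y{2020:2024} definition from the earlier example and writes the results to the log; the variable names n, lo, and hi are chosen only for illustration.

```sas
data _null_;
  array y{2020:2024} y2020-y2024;
  n  = dim(y);       /* number of elements */
  lo = lbound(y);    /* lower bound of the subscript range */
  hi = hbound(y);    /* upper bound of the subscript range */
  put n= lo= hi=;    /* writes n=5 lo=2020 hi=2024 to the log */
run;
```

Here a loop written as do year = 1 to dim(y) would use subscripts outside the array's range, which is why lbound and hbound are the safer choice for the loop limits.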
In the first version of our example, dim() supplies both loop limits:
data dropouts;
set recode;
array q{5,2}; /* response array */
array y{5} y1-y5; /* array to count not-missed instances */
do year = 1 to dim(q, 1);
y{year} = 0; /* initial value for each observation-year */
do instance = 1 to dim(q, 2);
if (q{year,instance} not eq .) then y{year} + 1;
end;
end;
drop year instance;
run;
proc print; var y1-y5; run;
Obs y1 y2 y3 y4 y5
1 2 2 1 2 1
2 2 2 2 0 2
3 2 1 2 2 1
4 1 0 1 1 2
5 1 2 1 2 1
6 1 2 1 1 2
7 2 2 0 1 2
8 2 1 0 2 1
9 1 1 2 1 2
10 2 2 2 2 2
In the second version, lbound() and hbound() supply the limits for the year loop:
data dropouts;
set recode;
array q{2020:2024,2}; /* response array */
array y{2020:2024} y2020-y2024; /* array to count not-missed instances */
do year = lbound(q,1) to hbound(q,1);
y{year} = 0; /* initial value for each observation-year */
do instance = 1 to dim(q,2);
if (q{year,instance} not eq .) then y{year} + 1;
end;
end;
drop year instance;
run;
proc print; var y2020-y2024; run;
Obs y2020 y2021 y2022 y2023 y2024
1 2 2 1 2 1
2 2 2 2 0 2
3 2 1 2 2 1
4 1 0 1 1 2
5 1 2 1 2 1
6 1 2 1 1 2
7 2 2 0 1 2
8 2 1 0 2 1
9 1 1 2 1 2
10 2 2 2 2 2