data fishdata;
infile datalines missover;
input Location & $10. Date date7.
Length1 Weight1 Length2 Weight2 Length3 Weight3
Length4 Weight4;
format date date7.;
datalines;
Cole Pond 2JUN95 31 .25 32 .3 32 .25 33 .3
Cole Pond 3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond 4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake 2JUN95 32 .35 32 .25 33 .30
Eagle Lake 3JUL95 30 .20 36 .45
Eagle Lake 4AUG95 33 .30 33 .28 34 .42
;
Reshape Wide to Long
We sometimes need to reshape a data set, converting data that has been recorded in multiple variables into multiple observations, or vice versa. This is an especially common data manipulation where we have data with repeated observations for a common unit.
Consider this example (from SAS):
Each observation records the fish weights and lengths observed on a given date from a give body of water. There are two to four fish observed on each occasion.
If we wanted to know the mean length and mean weight of all the fish observed, we would like all the data values to be in two columns so that we can calculate this with PROC MEANS.
There are generally two approaches we can take to reshaping these data. If our goal is to convert a single repeated measure, we can use PROC TRANSPOSE. More generally, for example where we have multiple repeated measures, we make use of ARRAYs.
Using Arrays
Recall that a SAS ARRAY
is essentially a list of variables, and that this list makes it easy to loop over those variables. In this case we have two lists of variables - length1-length4 and weight1-weight4. We would like to take the values from one list and put them into a single column. We can use an ARRAY to loop over the variables. The key here is to use an explict OUTPUT
statement to turn one observation into four observations.
To simplify this first example, let’s focus on fish length
.
data long;
set fishdata;
array len{*} length1-length4;
do fish=1 to 4;
length = len{fish};
output;
end;
run;
proc means data=long;
var length;
run;
The MEANS Procedure
Analysis Variable : length
N Mean Std Dev Minimum Maximum
------------------------------------------------------------------
20 32.6000000 1.9029064 29.0000000 37.0000000
------------------------------------------------------------------
- The ARRAY statement defines
len
as an array reference, a convenient way of specifying any of the four length variables. - DO and END form our loop-within-an-observation. We call our loop index
fish
because this variable will eventually identify individual fish in our long data. - We assign
len{fish}
values all to a single new variable,length
. - We OUTPUT for each new value of
length
.
Looking at the resulting data set (or at least part of it)
proc print data=long(obs=8
drop=date weight1-weight4);
run;
Obs Location Length1 Length2 Length3 Length4 fish length
1 Cole Pond 31 32 32 33 1 31
2 Cole Pond 31 32 32 33 2 32
3 Cole Pond 31 32 32 33 3 32
4 Cole Pond 31 32 32 33 4 33
5 Cole Pond 33 34 37 32 1 33
6 Cole Pond 33 34 37 32 2 34
7 Cole Pond 33 34 37 32 3 37
8 Cole Pond 33 34 37 32 4 32
we see we have retained the wide data, in addition to the properly structured length
variable at the end. To complete this exercise we can drop the original variables in our output data set, restructure our weight variables, and remove observations with missing data.
data long;
set fishdata;
array len{*} length1-length4;
array wgt{*} weight1-weight4;
do fish=1 to 4;
length = len{fish};
weight = wgt{fish};
/* delete empty observations */
if length eq . and weight eq . then delete;
/* output good observations */
else output;
end;
/* drop no longer needed variables */
drop length1-length4 weight1-weight4;
run;
proc means data=long;
var length weight;
run;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------
length 20 32.6000000 1.9029064 29.0000000 37.0000000
weight 20 0.3195000 0.0831280 0.2000000 0.4800000
-------------------------------------------------------------------------
proc print data=long;
run;
Obs Location Date fish length weight
1 Cole Pond 02JUN95 1 31 0.25
2 Cole Pond 02JUN95 2 32 0.30
3 Cole Pond 02JUN95 3 32 0.25
4 Cole Pond 02JUN95 4 33 0.30
5 Cole Pond 03JUL95 1 33 0.32
6 Cole Pond 03JUL95 2 34 0.41
7 Cole Pond 03JUL95 3 37 0.48
8 Cole Pond 03JUL95 4 32 0.28
9 Cole Pond 04AUG95 1 29 0.23
10 Cole Pond 04AUG95 2 30 0.25
11 Cole Pond 04AUG95 3 34 0.47
12 Cole Pond 04AUG95 4 32 0.30
13 Eagle Lake 02JUN95 1 32 0.35
14 Eagle Lake 02JUN95 2 32 0.25
15 Eagle Lake 02JUN95 3 33 0.30
16 Eagle Lake 03JUL95 1 30 0.20
17 Eagle Lake 03JUL95 2 36 0.45
18 Eagle Lake 04AUG95 1 33 0.30
19 Eagle Lake 04AUG95 2 33 0.28
20 Eagle Lake 04AUG95 3 34 0.42
PROC TRANSPOSE
PROC TRANSPOSE is another method of converting data from wide to long form. It is appealing because you do not need to understand DATA step programming. However, it’s limitation is that all the data values to be transposed end up in one column (in the previous example, multiple repeated measures ended up in multiple columns).
- an OUT option specifies the output data set
- BY specifies the variable(s) that identify unique observations
- VAR specifies the variables to be transformed
The new data column and the within-observation-index will have default names.
proc transpose data=fishdata
out=transfish;
by location date;
var length1-length4;
run;
proc print; run;
Obs Location Date _NAME_ COL1
1 Cole Pond 02JUN95 Length1 31
2 Cole Pond 02JUN95 Length2 32
3 Cole Pond 02JUN95 Length3 32
4 Cole Pond 02JUN95 Length4 33
5 Cole Pond 03JUL95 Length1 33
6 Cole Pond 03JUL95 Length2 34
7 Cole Pond 03JUL95 Length3 37
8 Cole Pond 03JUL95 Length4 32
9 Cole Pond 04AUG95 Length1 29
10 Cole Pond 04AUG95 Length2 30
11 Cole Pond 04AUG95 Length3 34
12 Cole Pond 04AUG95 Length4 32
13 Eagle Lake 02JUN95 Length1 32
14 Eagle Lake 02JUN95 Length2 32
15 Eagle Lake 02JUN95 Length3 33
16 Eagle Lake 02JUN95 Length4 .
17 Eagle Lake 03JUL95 Length1 30
18 Eagle Lake 03JUL95 Length2 36
19 Eagle Lake 03JUL95 Length3 .
20 Eagle Lake 03JUL95 Length4 .
21 Eagle Lake 04AUG95 Length1 33
22 Eagle Lake 04AUG95 Length2 33
23 Eagle Lake 04AUG95 Length3 34
24 Eagle Lake 04AUG95 Length4 .
We can clean this up a little with:
- PREFIX names the final column of repeated measurements
- RENAME the within-observation-index
- WHERE removes observations with missing data
proc transpose data=fishdata
out=transfish(
rename=(_name_=fish)
where=(length1 ne .))
prefix=length;
by location date;
var length1-length4;
run;
proc print; run;
Obs Location Date fish length1
1 Cole Pond 02JUN95 Length1 31
2 Cole Pond 02JUN95 Length2 32
3 Cole Pond 02JUN95 Length3 32
4 Cole Pond 02JUN95 Length4 33
5 Cole Pond 03JUL95 Length1 33
6 Cole Pond 03JUL95 Length2 34
7 Cole Pond 03JUL95 Length3 37
8 Cole Pond 03JUL95 Length4 32
9 Cole Pond 04AUG95 Length1 29
10 Cole Pond 04AUG95 Length2 30
11 Cole Pond 04AUG95 Length3 34
12 Cole Pond 04AUG95 Length4 32
13 Eagle Lake 02JUN95 Length1 32
14 Eagle Lake 02JUN95 Length2 32
15 Eagle Lake 02JUN95 Length3 33
16 Eagle Lake 03JUL95 Length1 30
17 Eagle Lake 03JUL95 Length2 36
18 Eagle Lake 04AUG95 Length1 33
19 Eagle Lake 04AUG95 Length2 33
20 Eagle Lake 04AUG95 Length3 34
Transforming multiple measures (both lengths and weights) would require more than a single PROC TRANSPOSE - it is more efficient to use the ARRAY technique!
Last revised: 08/22/24