data fishlong;
infile datalines truncover;
input Location & $10. Date date7. Length Weight @;
do fish = 1 by 1 while (length gt . and weight gt .);
output;
input Length Weight @;
end;
format date date7.;
datalines;
Cole Pond 2JUN95 31 .25 32 .3 32 .25 33 .3
Cole Pond 3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond 4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake 2JUN95 32 .35 32 .25 33 .30
Eagle Lake 3JUL95 30 .20 36 .45
Eagle Lake 4AUG95 33 .30 33 .28 34 .42
;
Reshape Long to Wide
As with the more common reshaping wide-to-long, reshaping long-to-wide can be be accomplished either through the use of ARRAYS or with PROC TRANSPOSE. Where you just have one variable to reshape, PROC TRANSPOSE is perhaps simpler, but for the more general case ARRAYS give you the most flexibility and power.
proc print data=fishlong(obs=8);
run;
Obs Location Date Length Weight fish
1 Cole Pond 02JUN95 31 0.25 1
2 Cole Pond 02JUN95 32 0.30 2
3 Cole Pond 02JUN95 32 0.25 3
4 Cole Pond 02JUN95 33 0.30 4
5 Cole Pond 03JUL95 33 0.32 1
6 Cole Pond 03JUL95 34 0.41 2
7 Cole Pond 03JUL95 37 0.48 3
8 Cole Pond 03JUL95 32 0.28 4
Using PROC TRANSPOSE
Each group of observations to be combined into a single observation is identified BY location
and date
. In this example we will reshape the length
column (VAR), and give names to the new columns based on fish
(ID).
proc transpose data=fishlong out=fishwide;
by location date;
var length;
id fish;
run;
proc print data=fishwide;
run;
Obs Location Date _NAME_ _1 _2 _3 _4
1 Cole Pond 02JUN95 Length 31 32 32 33
2 Cole Pond 03JUL95 Length 33 34 37 32
3 Cole Pond 04AUG95 Length 29 30 34 32
4 Eagle Lake 02JUN95 Length 32 32 33 .
5 Eagle Lake 03JUL95 Length 30 36 . .
6 Eagle Lake 04AUG95 Length 33 33 34 .
We can have better variable names for our reshaped data if we add a PREFIX option to the PROC statement. And we could drop the NAME variable.
proc transpose data=fishlong prefix=Length
out=fishwide(drop=_name_);
by location date;
var length;
id fish;
run;
proc print data=fishwide;
run;
Obs Location Date Length1 Length2 Length3 Length4
1 Cole Pond 02JUN95 31 32 32 33
2 Cole Pond 03JUL95 33 34 37 32
3 Cole Pond 04AUG95 29 30 34 32
4 Eagle Lake 02JUN95 32 32 33 .
5 Eagle Lake 03JUL95 30 36 . .
6 Eagle Lake 04AUG95 33 33 34 .
An undocumented feature of PROC TRANSPOSE is that if we have a _NAME_ variable in the long data (for example if we had previously transposed wide-to-long), SAS will use that for the new variable ID.
proc transpose data=fishlong(rename=(fish=_name_))
prefix=Length
out=fishwide(drop=_name_);
by location date;
var length;
run;
proc print data=fishwide;
run;
Obs Location Date Length1 Length2 Length3 Length4
1 Cole Pond 02JUN95 31 32 32 33
2 Cole Pond 03JUL95 33 34 37 32
3 Cole Pond 04AUG95 29 30 34 32
4 Eagle Lake 02JUN95 32 32 33 .
5 Eagle Lake 03JUL95 30 36 . .
6 Eagle Lake 04AUG95 33 33 34 .
Using Arrays
We will discuss this is four stages in order to more easily understand the elements of the process, but in practice you just use the final data step.
To transform our long data to wide form involves three DATA step concepts.
- assigning values to new variables
- retaining values from one observation to the next
- initialing each by-group (not always required)
- outputting only the final observation of a group
Moving data to separate columns
We want variables length
and weight
distributed across four new variables each. The variable fish
along with two array references enables us to do just that, where fish
picks out the new column from a set of columns.
data fishwide;
set fishlong;
array len{4};
array wgt{4};
len{fish}=length;
wgt{fish}=weight;
run;
proc print data=fishwide(obs=8);
run;
L
o
c L W
a e e
t D n i f l l l l w w w w
O i a g g i e e e e g g g g
b o t t h s n n n n t t t t
s n e h t h 1 2 3 4 1 2 3 4
1 Cole Pond 02JUN95 31 0.25 1 31 . . . 0.25 . . .
2 Cole Pond 02JUN95 32 0.30 2 . 32 . . . 0.30 . .
3 Cole Pond 02JUN95 32 0.25 3 . . 32 . . . 0.25 .
4 Cole Pond 02JUN95 33 0.30 4 . . . 33 . . . 0.30
5 Cole Pond 03JUL95 33 0.32 1 33 . . . 0.32 . . .
6 Cole Pond 03JUL95 34 0.41 2 . 34 . . . 0.41 . .
7 Cole Pond 03JUL95 37 0.48 3 . . 37 . . . 0.48 .
8 Cole Pond 03JUL95 32 0.28 4 . . . 32 . . . 0.28
Notice here that we are using array references, but we are not looping within each observation - the reference just picks out one particular variable from a set of possibilities.
Retaining data across observations
Ordinarily, new variables are given a missing value at the top of each iteration of the DATA step. The RETAIN statement alters that behavior, so that the value of a new variable will be carried over until it is explicitly changed.
data fishwide;
set fishlong;
array len{4};
array wgt{4};
retain len1-len4 wgt1-wgt4;
len{fish}=length;
wgt{fish}=weight;
run;
proc print data=fishwide(obs=8);
run;
L
o
c L W
a e e
t D n i f l l l l w w w w
O i a g g i e e e e g g g g
b o t t h s n n n n t t t t
s n e h t h 1 2 3 4 1 2 3 4
1 Cole Pond 02JUN95 31 0.25 1 31 . . . 0.25 . . .
2 Cole Pond 02JUN95 32 0.30 2 31 32 . . 0.25 0.30 . .
3 Cole Pond 02JUN95 32 0.25 3 31 32 32 . 0.25 0.30 0.25 .
4 Cole Pond 02JUN95 33 0.30 4 31 32 32 33 0.25 0.30 0.25 0.30
5 Cole Pond 03JUL95 33 0.32 1 33 32 32 33 0.32 0.30 0.25 0.30
6 Cole Pond 03JUL95 34 0.41 2 33 34 32 33 0.32 0.41 0.25 0.30
7 Cole Pond 03JUL95 37 0.48 3 33 34 37 33 0.32 0.41 0.48 0.30
8 Cole Pond 03JUL95 32 0.28 4 33 34 37 32 0.32 0.41 0.48 0.28
The RETAIN statement is used during the compile phase of the DATA step, so the order in which it appears in the DATA step is not important. Placing it immediately before or after the ARRAY statements keeps all the statements that define the behavior of these variables in one place, and is common style.
By-group initialization
The 4th and 8th observations are in the final form we seek. However, RETAINing data can also create observations with data values from the wrong fish, as in observations 5 through 7.
Unless we always have the same number of observations in each group (i.e. no missing repetitions), we should re-initialize the first observation in each group. For this we need by group processing.
data fishwide;
set fishlong;
array len{4};
array wgt{4};
retain len1-len4 wgt1-wgt4;
by location date;
if first.date then do i=1 to 4;
len{i} = .;
wgt{i} = .;
end;
drop i;
len{fish}=length;
wgt{fish}=weight;
run;
proc print data=fishwide(firstobs=9 obs=16);
run;
L
o
c L W
a e e
t D n i f l l l l w w w w
O i a g g i e e e e g g g g
b o t t h s n n n n t t t t
s n e h t h 1 2 3 4 1 2 3 4
9 Cole Pond 04AUG95 29 0.23 1 29 . . . 0.23 . . .
10 Cole Pond 04AUG95 30 0.25 2 29 30 . . 0.23 0.25 . .
11 Cole Pond 04AUG95 34 0.47 3 29 30 34 . 0.23 0.25 0.47 .
12 Cole Pond 04AUG95 32 0.30 4 29 30 34 32 0.23 0.25 0.47 0.3
13 Eagle Lake 02JUN95 32 0.35 1 32 . . . 0.35 . . .
14 Eagle Lake 02JUN95 32 0.25 2 32 32 . . 0.35 0.25 . .
15 Eagle Lake 02JUN95 33 0.30 3 32 32 33 . 0.35 0.25 0.30 .
16 Eagle Lake 03JUL95 30 0.20 1 30 . . . 0.20 . . .
The BY statement assumes the data are already sorted. It creates two automatic variables for each BY-variable, FIRST.variable and LAST.variable (like ARRAYS, these only exist for the duration of the DATA step, they are not part of the output data set).
Like RETAIN, the position of the BY and DROP statements is not too important. It is common style to place the BY after the SET statement, and the DROP at the bottom of the DATA step. I have repositioned them here just to keep all the newly added statements together.
The conditional DO loop with it’s assignment statements must come before the assignment statements that move data values into the correct columns.
Keep the final observation in each by-group
All that is left is to OUTPUT the correct observation, skipping all the intermediate observations. For this we use the LAST automatic variable, which comes from the BY statement we have already added.
data fishwide;
set fishlong;
by location date;
array len{4};
array wgt{4};
retain len1-len4 wgt1-wgt4;
if first.date then do i=1 to 4;
len{i} = .;
wgt{i} = .;
end;
len{fish}=length;
wgt{fish}=weight;
if last.date then output;
drop i length weight;
run;
proc print data=fishwide;
run;
Obs Location Date fish len1 len2 len3 len4 wgt1 wgt2 wgt3 wgt4
1 Cole Pond 02JUN95 4 31 32 32 33 0.25 0.30 0.25 0.30
2 Cole Pond 03JUL95 4 33 34 37 32 0.32 0.41 0.48 0.28
3 Cole Pond 04AUG95 4 29 30 34 32 0.23 0.25 0.47 0.30
4 Eagle Lake 02JUN95 3 32 32 33 . 0.35 0.25 0.30 .
5 Eagle Lake 03JUL95 2 30 36 . . 0.20 0.45 . .
6 Eagle Lake 04AUG95 3 33 33 34 . 0.30 0.28 0.42 .