Reshape Long to Wide

As with the more common reshaping wide-to-long, reshaping long-to-wide can be be accomplished either through the use of ARRAYS or with PROC TRANSPOSE. Where you just have one variable to reshape, PROC TRANSPOSE is perhaps simpler, but for the more general case ARRAYS give you the most flexibility and power.

data fishlong;
   infile datalines truncover;
   input Location & $10. Date date7. Length Weight @;
    do fish = 1 by 1 while (length gt . and weight gt .);
        output;
        input Length Weight @;
        end;
    
   format date date7.;
   datalines;
Cole Pond   2JUN95 31 .25 32 .3  32 .25 33 .3
Cole Pond   3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond   4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake  2JUN95 32 .35 32 .25 33 .30
Eagle Lake  3JUL95 30 .20 36 .45
Eagle Lake  4AUG95 33 .30 33 .28 34 .42
;
proc print data=fishlong(obs=8);
run;
          Obs    Location        Date    Length    Weight    fish

            1    Cole Pond    02JUN95      31       0.25       1 
            2    Cole Pond    02JUN95      32       0.30       2 
            3    Cole Pond    02JUN95      32       0.25       3 
            4    Cole Pond    02JUN95      33       0.30       4 
            5    Cole Pond    03JUL95      33       0.32       1 
            6    Cole Pond    03JUL95      34       0.41       2 
            7    Cole Pond    03JUL95      37       0.48       3 
            8    Cole Pond    03JUL95      32       0.28       4 

Using PROC TRANSPOSE

Each group of observations to be combined into a single observation is identified BY location and date. In this example we will reshape the length column (VAR), and give names to the new columns based on fish (ID).

proc transpose data=fishlong out=fishwide;
    by location date;
    var length;
    id fish;
    run;

proc print data=fishwide;
run;
      Obs     Location        Date    _NAME_    _1    _2    _3    _4

       1     Cole Pond     02JUN95    Length    31    32    32    33
       2     Cole Pond     03JUL95    Length    33    34    37    32
       3     Cole Pond     04AUG95    Length    29    30    34    32
       4     Eagle Lake    02JUN95    Length    32    32    33     .
       5     Eagle Lake    03JUL95    Length    30    36     .     .
       6     Eagle Lake    04AUG95    Length    33    33    34     .

We can have better variable names for our reshaped data if we add a PREFIX option to the PROC statement. And we could drop the NAME variable.

proc transpose data=fishlong prefix=Length
      out=fishwide(drop=_name_);
    by location date;
    var length;
    id fish;
    run;

proc print data=fishwide;
run;
 Obs     Location        Date    Length1    Length2    Length3    Length4

  1     Cole Pond     02JUN95       31         32         32         33  
  2     Cole Pond     03JUL95       33         34         37         32  
  3     Cole Pond     04AUG95       29         30         34         32  
  4     Eagle Lake    02JUN95       32         32         33          .  
  5     Eagle Lake    03JUL95       30         36          .          .  
  6     Eagle Lake    04AUG95       33         33         34          .  

An undocumented feature of PROC TRANSPOSE is that if we have a _NAME_ variable in the long data (for example if we had previously transposed wide-to-long), SAS will use that for the new variable ID.

proc transpose data=fishlong(rename=(fish=_name_))
      prefix=Length
      out=fishwide(drop=_name_);
    by location date;
    var length;
    run;

proc print data=fishwide;
run;
 Obs     Location        Date    Length1    Length2    Length3    Length4

  1     Cole Pond     02JUN95       31         32         32         33  
  2     Cole Pond     03JUL95       33         34         37         32  
  3     Cole Pond     04AUG95       29         30         34         32  
  4     Eagle Lake    02JUN95       32         32         33          .  
  5     Eagle Lake    03JUL95       30         36          .          .  
  6     Eagle Lake    04AUG95       33         33         34          .  

Using Arrays

Note

We will discuss this is four stages in order to more easily understand the elements of the process, but in practice you just use the final data step.

To transform our long data to wide form involves three DATA step concepts.

  • assigning values to new variables
  • retaining values from one observation to the next
  • initialing each by-group (not always required)
  • outputting only the final observation of a group

Moving data to separate columns

We want variables length and weight distributed across four new variables each. The variable fish along with two array references enables us to do just that, where fish picks out the new column from a set of columns.

data fishwide;
    set fishlong;

    array len{4};
    array wgt{4};
    len{fish}=length;
    wgt{fish}=weight;

    run;

proc print data=fishwide(obs=8);
run;
        L
        o
        c                L    W
        a                e    e
        t         D      n    i   f   l   l   l   l    w     w     w     w
 O      i         a      g    g   i   e   e   e   e    g     g     g     g
 b      o         t      t    h   s   n   n   n   n    t     t     t     t
 s      n         e      h    t   h   1   2   3   4    1     2     3     4

 1  Cole Pond  02JUN95  31  0.25  1  31   .   .   .  0.25   .     .     .  
 2  Cole Pond  02JUN95  32  0.30  2   .  32   .   .   .    0.30   .     .  
 3  Cole Pond  02JUN95  32  0.25  3   .   .  32   .   .     .    0.25   .  
 4  Cole Pond  02JUN95  33  0.30  4   .   .   .  33   .     .     .    0.30
 5  Cole Pond  03JUL95  33  0.32  1  33   .   .   .  0.32   .     .     .  
 6  Cole Pond  03JUL95  34  0.41  2   .  34   .   .   .    0.41   .     .  
 7  Cole Pond  03JUL95  37  0.48  3   .   .  37   .   .     .    0.48   .  
 8  Cole Pond  03JUL95  32  0.28  4   .   .   .  32   .     .     .    0.28

Notice here that we are using array references, but we are not looping within each observation - the reference just picks out one particular variable from a set of possibilities.

Retaining data across observations

Ordinarily, new variables are given a missing value at the top of each iteration of the DATA step. The RETAIN statement alters that behavior, so that the value of a new variable will be carried over until it is explicitly changed.

data fishwide;
    set fishlong;
    array len{4};
    array wgt{4};

    retain len1-len4 wgt1-wgt4;

    len{fish}=length;
    wgt{fish}=weight;
    run;

proc print data=fishwide(obs=8);
run;
        L
        o
        c                L    W
        a                e    e
        t         D      n    i   f   l   l   l   l    w     w     w     w
 O      i         a      g    g   i   e   e   e   e    g     g     g     g
 b      o         t      t    h   s   n   n   n   n    t     t     t     t
 s      n         e      h    t   h   1   2   3   4    1     2     3     4

 1  Cole Pond  02JUN95  31  0.25  1  31   .   .   .  0.25   .     .     .  
 2  Cole Pond  02JUN95  32  0.30  2  31  32   .   .  0.25  0.30   .     .  
 3  Cole Pond  02JUN95  32  0.25  3  31  32  32   .  0.25  0.30  0.25   .  
 4  Cole Pond  02JUN95  33  0.30  4  31  32  32  33  0.25  0.30  0.25  0.30
 5  Cole Pond  03JUL95  33  0.32  1  33  32  32  33  0.32  0.30  0.25  0.30
 6  Cole Pond  03JUL95  34  0.41  2  33  34  32  33  0.32  0.41  0.25  0.30
 7  Cole Pond  03JUL95  37  0.48  3  33  34  37  33  0.32  0.41  0.48  0.30
 8  Cole Pond  03JUL95  32  0.28  4  33  34  37  32  0.32  0.41  0.48  0.28

The RETAIN statement is used during the compile phase of the DATA step, so the order in which it appears in the DATA step is not important. Placing it immediately before or after the ARRAY statements keeps all the statements that define the behavior of these variables in one place, and is common style.

By-group initialization

The 4th and 8th observations are in the final form we seek. However, RETAINing data can also create observations with data values from the wrong fish, as in observations 5 through 7.

Unless we always have the same number of observations in each group (i.e. no missing repetitions), we should re-initialize the first observation in each group. For this we need by group processing.

data fishwide;
    set fishlong;

    array len{4};
    array wgt{4};
    retain len1-len4 wgt1-wgt4;

    by location date;
    if first.date then do i=1 to 4;
      len{i} = .;
      wgt{i} = .;
      end;
    drop i;

    len{fish}=length;
    wgt{fish}=weight;
    run;

proc print data=fishwide(firstobs=9 obs=16);
run;
        L
        o
        c                 L    W
        a                 e    e
        t          D      n    i   f   l   l   l   l    w     w     w    w
 O      i          a      g    g   i   e   e   e   e    g     g     g    g
 b      o          t      t    h   s   n   n   n   n    t     t     t    t
 s      n          e      h    t   h   1   2   3   4    1     2     3    4

 9  Cole Pond   04AUG95  29  0.23  1  29   .   .   .  0.23   .     .     . 
10  Cole Pond   04AUG95  30  0.25  2  29  30   .   .  0.23  0.25   .     . 
11  Cole Pond   04AUG95  34  0.47  3  29  30  34   .  0.23  0.25  0.47   . 
12  Cole Pond   04AUG95  32  0.30  4  29  30  34  32  0.23  0.25  0.47  0.3
13  Eagle Lake  02JUN95  32  0.35  1  32   .   .   .  0.35   .     .     . 
14  Eagle Lake  02JUN95  32  0.25  2  32  32   .   .  0.35  0.25   .     . 
15  Eagle Lake  02JUN95  33  0.30  3  32  32  33   .  0.35  0.25  0.30   . 
16  Eagle Lake  03JUL95  30  0.20  1  30   .   .   .  0.20   .     .     . 

The BY statement assumes the data are already sorted. It creates two automatic variables for each BY-variable, FIRST.variable and LAST.variable (like ARRAYS, these only exist for the duration of the DATA step, they are not part of the output data set).

Like RETAIN, the position of the BY and DROP statements is not too important. It is common style to place the BY after the SET statement, and the DROP at the bottom of the DATA step. I have repositioned them here just to keep all the newly added statements together.

The conditional DO loop with it’s assignment statements must come before the assignment statements that move data values into the correct columns.

Keep the final observation in each by-group

All that is left is to OUTPUT the correct observation, skipping all the intermediate observations. For this we use the LAST automatic variable, which comes from the BY statement we have already added.

data fishwide;
    set fishlong;
    by location date;
    array len{4};
    array wgt{4};
    retain len1-len4 wgt1-wgt4;
    if first.date then do i=1 to 4;
      len{i} = .;
      wgt{i} = .;
      end;
    len{fish}=length;
    wgt{fish}=weight;
  
    if last.date then output;

    drop i length weight;
    run;


proc print data=fishwide;
run;
    Obs  Location     Date fish len1 len2 len3 len4 wgt1 wgt2 wgt3 wgt4

     1  Cole Pond  02JUN95   4   31   32   32   33  0.25 0.30 0.25 0.30
     2  Cole Pond  03JUL95   4   33   34   37   32  0.32 0.41 0.48 0.28
     3  Cole Pond  04AUG95   4   29   30   34   32  0.23 0.25 0.47 0.30
     4  Eagle Lake 02JUN95   3   32   32   33    .  0.35 0.25 0.30  .  
     5  Eagle Lake 03JUL95   2   30   36    .    .  0.20 0.45  .    .  
     6  Eagle Lake 04AUG95   3   33   33   34    .  0.30 0.28 0.42  .