Reshape Wide to Long

We sometimes need to reshape a data set, converting data that has been recorded in multiple variables into multiple observations, or vice versa. This is an especially common data manipulation where we have data with repeated observations for a common unit.

Consider this example (from SAS):

data fishdata;
   infile datalines missover;
   input Location & $10. Date date7.
         Length1 Weight1 Length2 Weight2 Length3 Weight3
         Length4 Weight4;
   format date date7.;
   datalines;
Cole Pond   2JUN95 31 .25 32 .3  32 .25 33 .3
Cole Pond   3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond   4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake  2JUN95 32 .35 32 .25 33 .30
Eagle Lake  3JUL95 30 .20 36 .45
Eagle Lake  4AUG95 33 .30 33 .28 34 .42
;

Each observation records the fish weights and lengths observed on a given date from a give body of water. There are two to four fish observed on each occasion.

Fish data wide form

If we wanted to know the mean length and mean weight of all the fish observed, we would like all the data values to be in two columns so that we can calculate this with PROC MEANS.

Fish data long form

There are generally two approaches we can take to reshaping these data. If our goal is to convert a single repeated measure, we can use PROC TRANSPOSE. More generally, for example where we have multiple repeated measures, we make use of ARRAYs.

Using Arrays

Recall that a SAS ARRAY is essentially a list of variables, and that this list makes it easy to loop over those variables. In this case we have two lists of variables - length1-length4 and weight1-weight4. We would like to take the values from one list and put them into a single column. We can use an ARRAY to loop over the variables. The key here is to use an explict OUTPUT statement to turn one observation into four observations.

To simplify this first example, let’s focus on fish length.

data long;
    set fishdata;
    array len{*} length1-length4;
    do fish=1 to 4;
        length = len{fish};
        output;
    end;
run;
proc means data=long;
  var length;
run;
                            The MEANS Procedure

                        Analysis Variable : length 
 
     N            Mean         Std Dev         Minimum         Maximum
    ------------------------------------------------------------------
    20      32.6000000       1.9029064      29.0000000      37.0000000
    ------------------------------------------------------------------
  • The ARRAY statement defines len as an array reference, a convenient way of specifying any of the four length variables.
  • DO and END form our loop-within-an-observation. We call our loop index fish because this variable will eventually identify individual fish in our long data.
  • We assign len{fish} values all to a single new variable, length.
  • We OUTPUT for each new value of length.

Looking at the resulting data set (or at least part of it)

proc print data=long(obs=8
    drop=date weight1-weight4);
run;
  Obs   Location    Length1   Length2   Length3   Length4   fish   length

    1   Cole Pond      31        32        32        33       1      31  
    2   Cole Pond      31        32        32        33       2      32  
    3   Cole Pond      31        32        32        33       3      32  
    4   Cole Pond      31        32        32        33       4      33  
    5   Cole Pond      33        34        37        32       1      33  
    6   Cole Pond      33        34        37        32       2      34  
    7   Cole Pond      33        34        37        32       3      37  
    8   Cole Pond      33        34        37        32       4      32  

we see we have retained the wide data, in addition to the properly structured length variable at the end. To complete this exercise we can drop the original variables in our output data set, restructure our weight variables, and remove observations with missing data.

data long;
    set fishdata;
    array len{*} length1-length4;
    array wgt{*} weight1-weight4;
    do fish=1 to 4;
        length = len{fish};
        weight = wgt{fish};
/* delete empty observations */
    if length eq . and weight eq . then delete;
/* output good observations  */
        else output;
    end;
/* drop no longer needed variables */
  drop length1-length4 weight1-weight4;
run;
proc means data=long;
  var length weight;
run;
                            The MEANS Procedure

 Variable    N           Mean        Std Dev        Minimum        Maximum
 -------------------------------------------------------------------------
 length     20     32.6000000      1.9029064     29.0000000     37.0000000
 weight     20      0.3195000      0.0831280      0.2000000      0.4800000
 -------------------------------------------------------------------------
proc print data=long;
run;
         Obs     Location        Date    fish    length    weight

           1    Cole Pond     02JUN95      1       31       0.25 
           2    Cole Pond     02JUN95      2       32       0.30 
           3    Cole Pond     02JUN95      3       32       0.25 
           4    Cole Pond     02JUN95      4       33       0.30 
           5    Cole Pond     03JUL95      1       33       0.32 
           6    Cole Pond     03JUL95      2       34       0.41 
           7    Cole Pond     03JUL95      3       37       0.48 
           8    Cole Pond     03JUL95      4       32       0.28 
           9    Cole Pond     04AUG95      1       29       0.23 
          10    Cole Pond     04AUG95      2       30       0.25 
          11    Cole Pond     04AUG95      3       34       0.47 
          12    Cole Pond     04AUG95      4       32       0.30 
          13    Eagle Lake    02JUN95      1       32       0.35 
          14    Eagle Lake    02JUN95      2       32       0.25 
          15    Eagle Lake    02JUN95      3       33       0.30 
          16    Eagle Lake    03JUL95      1       30       0.20 
          17    Eagle Lake    03JUL95      2       36       0.45 
          18    Eagle Lake    04AUG95      1       33       0.30 
          19    Eagle Lake    04AUG95      2       33       0.28 
          20    Eagle Lake    04AUG95      3       34       0.42 

PROC TRANSPOSE

PROC TRANSPOSE is another method of converting data from wide to long form. It is appealing because you do not need to understand DATA step programming. However, it’s limitation is that all the data values to be transposed end up in one column (in the previous example, multiple repeated measures ended up in multiple columns).

  • an OUT option specifies the output data set
  • BY specifies the variable(s) that identify unique observations
  • VAR specifies the variables to be transformed

The new data column and the within-observation-index will have default names.

proc transpose data=fishdata
  out=transfish;
  by location date;
  var length1-length4;
  run;
  
proc print; run;
              Obs     Location        Date    _NAME_     COL1

                1    Cole Pond     02JUN95    Length1     31 
                2    Cole Pond     02JUN95    Length2     32 
                3    Cole Pond     02JUN95    Length3     32 
                4    Cole Pond     02JUN95    Length4     33 
                5    Cole Pond     03JUL95    Length1     33 
                6    Cole Pond     03JUL95    Length2     34 
                7    Cole Pond     03JUL95    Length3     37 
                8    Cole Pond     03JUL95    Length4     32 
                9    Cole Pond     04AUG95    Length1     29 
               10    Cole Pond     04AUG95    Length2     30 
               11    Cole Pond     04AUG95    Length3     34 
               12    Cole Pond     04AUG95    Length4     32 
               13    Eagle Lake    02JUN95    Length1     32 
               14    Eagle Lake    02JUN95    Length2     32 
               15    Eagle Lake    02JUN95    Length3     33 
               16    Eagle Lake    02JUN95    Length4      . 
               17    Eagle Lake    03JUL95    Length1     30 
               18    Eagle Lake    03JUL95    Length2     36 
               19    Eagle Lake    03JUL95    Length3      . 
               20    Eagle Lake    03JUL95    Length4      . 
               21    Eagle Lake    04AUG95    Length1     33 
               22    Eagle Lake    04AUG95    Length2     33 
               23    Eagle Lake    04AUG95    Length3     34 
               24    Eagle Lake    04AUG95    Length4      . 

We can clean this up a little with:

  • PREFIX names the final column of repeated measurements
  • RENAME the within-observation-index
  • WHERE removes observations with missing data
proc transpose data=fishdata
  out=transfish(
    rename=(_name_=fish)
    where=(length1 ne .))
  prefix=length;
  by location date;
  var length1-length4;
  run;
  
proc print; run;
            Obs     Location        Date     fish      length1

              1    Cole Pond     02JUN95    Length1       31  
              2    Cole Pond     02JUN95    Length2       32  
              3    Cole Pond     02JUN95    Length3       32  
              4    Cole Pond     02JUN95    Length4       33  
              5    Cole Pond     03JUL95    Length1       33  
              6    Cole Pond     03JUL95    Length2       34  
              7    Cole Pond     03JUL95    Length3       37  
              8    Cole Pond     03JUL95    Length4       32  
              9    Cole Pond     04AUG95    Length1       29  
             10    Cole Pond     04AUG95    Length2       30  
             11    Cole Pond     04AUG95    Length3       34  
             12    Cole Pond     04AUG95    Length4       32  
             13    Eagle Lake    02JUN95    Length1       32  
             14    Eagle Lake    02JUN95    Length2       32  
             15    Eagle Lake    02JUN95    Length3       33  
             16    Eagle Lake    03JUL95    Length1       30  
             17    Eagle Lake    03JUL95    Length2       36  
             18    Eagle Lake    04AUG95    Length1       33  
             19    Eagle Lake    04AUG95    Length2       33  
             20    Eagle Lake    04AUG95    Length3       34  

Transforming multiple measures (both lengths and weights) would require more than a single PROC TRANSPOSE - it is more efficient to use the ARRAY technique!

Last revised: 08/22/24