6  Restructuring Datasets

In this chapter you’ll learn how to restructure datasets. You’ll learn how to reshape data so you can switch at will between long form (one row per level one unit) and wide form (one row per level two unit). You’ll also learn how to turn a dataset containing both level one and level two units into a dataset containing just level two units.

Terminology Alert

This chapter depends heavily on the concepts and terminology introduced in Hierarchical Data Concepts, so be sure you’ve read that section before proceeding.

6.1 Set Up

Create a new do file called restructure.do that loads acs_hh_inc.dta (if you downloaded the example files, you can skip the URL):

capture log close
log using restructure.log, replace
clear all
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_hh_inc
set linesize 120 // only needed for Jupyter Notebook
-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/restructure.log
  log type:  text
 opened on:  30 Dec 2025, 13:55:01

This is the ACS, as cleaned up in the First Steps chapter, with one additional variable.

Recall that this is hierarchical data consisting of people living in households, so a person is a level one unit and a household is a level two unit. The level one and level two identifier variables are person and household, and you can confirm that they uniquely identify observations with:

duplicates report household person

Duplicates in terms of household person

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |        27410             0
--------------------------------------

The familiar variables age, race, marital_status, edu, income, female, and hispanic all describe individual persons, making them level one variables. The added variable household_income describes the household, making it a level two variable. Persons who live in the same household always have the same value of household_income.
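
If you want to verify that claim yourself, one option is a bysort and assert combination. This is just a sketch, not part of the chapter’s do file, and it assumes household_income is never missing for only some members of a household:

// confirm household_income never varies within a household
bysort household (household_income): assert household_income[1] == household_income[_N]

assert is silent if the claim holds and stops with an error if it does not.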

6.2 Reshape

The reshape command converts data sets between long form and wide form. It is easy to use if you understand the structure of your data set. In particular, you need to know the level one and level two identifiers, and which variables are level one variables and which are level two.

The ACS data set has one observation per person, or level one unit, so the data set is currently in long form. (Take a look at the data set in the data browser if you’re not familiar with it.) In wide form it would have one observation per household, and reshape can do that for you:

reshape wide age race marital_status edu income female hispanic, ///
i(household) j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations           27,410   ->   10,565      
Number of variables                  10   ->   114         
j variable (16 values)           person   ->   (dropped)
xij variables:
                                    age   ->   age1 age2 ... age16
                                   race   ->   race1 race2 ... race16
                         marital_status   ->   marital_status1 marital_status2 ... marital_status16
                                    edu   ->   edu1 edu2 ... edu16
                                 income   ->   income1 income2 ... income16
                                 female   ->   female1 female2 ... female16
                               hispanic   ->   hispanic1 hispanic2 ... hispanic16
-----------------------------------------------------------------------------

The syntax of the reshape command begins by specifying the form you want the data to be in, in this case wide.

Then you give a list of all the level one variables in the data set, not including the level one identifier. This is a bit different from the usual syntax where a list of variables tells the command which variables to act on. The reshape command always reshapes the entire data set, but to do so it needs to understand which variables are level one and which are level two. The variables you list are level one variables; any variables you do not list are assumed to be level two variables. Make sure this is true!

Finally, the command needs to know the identifiers. The i() option specifies the level two identifier, which could be a compound identifier with multiple variables; the j() option specifies the level one identifier. It calls them i and j rather than level one and level two because reshape can be used on data with more than two levels of hierarchy.

Take a moment to view the result in the data browser (it won’t fit here). As promised, there is now just one observation per household, and the household variable is now a unique identifier all by itself. Most of the variable names now have two parts: the name of the quantity described (e.g. income) followed by the number of the person being described. The person variable has been changed from part of a compound row identifier to part of a compound column identifier.

Note that there are now 16 columns for each of the level one variables. The largest household had 16 people in it, so storing the information for all of them required 16 versions of each level one variable. Since the data set has to be rectangular, that means all the households have 16 of each level one variable, with most of them containing missing values. The household_income variable (on the far right in the data browser) remains a single variable because it is a household level variable.
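
If you want to see how sparse the later columns are, a quick sketch is to summarize a few of them and compare the number of non-missing observations (the exact counts depend on the data, but the drop-off is steep):

summarize age1 age2 age3 age16

The Obs column shrinks rapidly as the person number rises, since only the largest households have, say, a 16th member.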

Reshaping from wide form to long form requires the exact same command, just replacing wide with long:

reshape long age race marital_status edu income female hispanic, ///
i(household) j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations           10,565   ->   169,040     
Number of variables                 114   ->   10          
j variable (16 values)                    ->   person
xij variables:
                    age1 age2 ... age16   ->   age
                 race1 race2 ... race16   ->   race
marital_status1 marital_status2 ... marital_status16->marital_status
                    edu1 edu2 ... edu16   ->   edu
           income1 income2 ... income16   ->   income
           female1 female2 ... female16   ->   female
     hispanic1 hispanic2 ... hispanic16   ->   hispanic
-----------------------------------------------------------------------------

However, the meaning is quite different: age, race, etc. do not refer to individual variables, but to groups of variables, and j(person) does not refer to an existing variable at all. The reshape long command will identify all the variables that start with age, race, etc. then take the number that follows and store it in a new variable called person.

With reshape wide, the list of level one variables is a list of actual variables, so you can use shortcuts like age-hispanic. With reshape long you are giving a list of “stubs” of variable names, so you must list them all individually (individual stubs, that is, not individual variables). We recommend dropping any variables you won’t actually use very early in the data wrangling process, but you definitely want to get rid of them before using reshape long.
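
For example, if the wide data also contained columns you had no plans to use, say a hypothetical weight1 through weight16, you could drop them before reshaping rather than adding weight to the stub list:

drop weight*   // hypothetical variables, not actually in this data set

Otherwise reshape long would treat the unlisted weight1 through weight16 as level two variables and carry all sixteen of them along into the long data set.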

If you look in the data browser, you’ll see there’s a lot more missing data than there used to be. You can see the problem by running:

duplicates report household

Duplicates in terms of household

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
       16 |       169040        158475
--------------------------------------

Every household now has exactly 16 observations in it. That’s because in wide form every household had 16 of each level one variable. This will happen any time you reshape a data set from wide to long: every level two unit will end up with the same number of level one units as the largest level two unit. So now you have a bunch of observations that do not actually represent people. Fortunately you can detect and drop the extraneous level one units because they have missing values for all the level one variables:

drop if age==. & race==. & marital_status==. & ///
edu==. & income==. & female==. & hispanic==.
(141,630 observations deleted)

Keep in mind that real people can have missing data, so you would not want to drop observations that have a missing value for one level one variable or even a few of them. Include all the level one variables in your if condition.
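
If typing out all those == . tests feels error prone, an alternative sketch uses egen’s rowmiss() function, which counts how many of the listed variables are missing in each observation:

egen n_missing = rowmiss(age race marital_status edu income female hispanic)
drop if n_missing == 7   // all seven level one variables are missing
drop n_missing

This should drop the same observations as the drop if command above.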

Exercise

Load the dataset nlsy.dta. Identify the level one and level two units, and the level one and level two variables.

Run duplicates report id. Reshape the dataset to wide form, and then reshape it again to long. Run duplicates report id again. Why don’t you need to worry about extraneous observations in this case?

(Yes, you can skip the first part of the exercise if you just completed the previous chapter and remember the answers.)

First load the data:

clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy

You can quickly find that while id does not uniquely identify observations, a combination of id and year does:

duplicates report id

Duplicates in terms of id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
       19 |       241034        228348
--------------------------------------
duplicates report id year

Duplicates in terms of id year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |       241034             0
--------------------------------------

That tells you that you have one observation per person (id) per year. A person-year combination (or just year) is a level one unit, and a person is a level two unit. Now look at a few observations (ideally in the data browser):

list in 1/6, ab(30)

     +-------------------------------------------------------+
     | id   year   year_of_birth          edu   income   age |
     |-------------------------------------------------------|
  1. |  1   1979              58   12TH GRADE     4620    21 |
  2. |  1   1980              58            .        .    22 |
  3. |  1   1981              58   12TH GRADE     5000    23 |
  4. |  1   1982              58            .        .    24 |
  5. |  1   1983              58            .        .    25 |
     |-------------------------------------------------------|
  6. |  1   1984              58            .        .    26 |
     +-------------------------------------------------------+

year_of_birth is a level two variable (it does not change over time) and edu, income, and age are level one variables. Now you know enough about the dataset to reshape it. But first run duplicates report id again so you have it handy:

duplicates report id

Duplicates in terms of id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
       19 |       241034        228348
--------------------------------------
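
You could also confirm that year_of_birth really is a level two variable before reshaping. A minimal sketch, not part of the exercise solution and assuming year_of_birth is never missing:

// confirm year_of_birth never varies within a person
bysort id (year_of_birth): assert year_of_birth[1] == year_of_birth[_N]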

Now reshape to wide:

reshape wide edu income age, i(id) j(year)
(j = 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1996 1998 2000)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations          241,034   ->   12,686      
Number of variables                   6   ->   59          
j variable (19 values)             year   ->   (dropped)
xij variables:
                                    edu   ->   edu1979 edu1980 ... edu2000
                                 income   ->   income1979 income1980 ... income2000
                                    age   ->   age1979 age1980 ... age2000
-----------------------------------------------------------------------------

And back to long:

reshape long edu income age, i(id) j(year)
(j = 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1996 1998 2000)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations           12,686   ->   241,034     
Number of variables                  59   ->   6           
j variable (19 values)                    ->   year
xij variables:
            edu1979 edu1980 ... edu2000   ->   edu
   income1979 income1980 ... income2000   ->   income
            age1979 age1980 ... age2000   ->   age
-----------------------------------------------------------------------------

Finally, run duplicates report id again:

duplicates report id

Duplicates in terms of id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
       19 |       241034        228348
--------------------------------------

Unlike the ACS, reshaping the NLSY from long to wide and then back to long did not create any extraneous observations. That’s because the NLSY includes a row for every person for every year whether they could find the person or not. Thus every person started the reshape cycle with exactly 19 observations and ended the reshape cycle with exactly 19 observations.
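
If you want to confirm that balance directly, a quick sketch is:

bysort id: assert _N == 19

which stops with an error if any person has more or fewer than 19 observations.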

6.3 Creating Data Sets of Level Two Units

Sometimes you want to get rid of the level one units in your data set entirely so you’re left with a data set of level two units and level two variables. For the ACS that would be a data set of households with no individual-level variables.

If all the level two variables you want already exist, that’s very easy to do.

6.3.1 Existing Level Two Variables in Wide Form

If the data set is already in wide form, just use drop or keep to eliminate all the level one variables and you’re done. acs_wide_hh_inc contains the ACS data in wide form with the additional level two variable household_income:

clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_wide_hh_inc

Right now it’s a mix of level one and level two data:

list household age1 income1 age2 income2 age3 income3 household_income in 1/3, ab(30)

     +---------------------------------------------------------------------------------+
     | household   age1   income1   age2   income2   age3   income3   household_income |
     |---------------------------------------------------------------------------------|
  1. |        37     20     10000     19      5300     19      4700              20000 |
  2. |       241     50     32500      .         .      .         .              32500 |
  3. |       242     29     30000      .         .      .         .              30000 |
     +---------------------------------------------------------------------------------+

Since there’s just one level two variable it’s easier to use keep to keep it and the level two identifier:

keep household household_income
list in 1/3, ab(30)

     +------------------------------+
     | household   household_income |
     |------------------------------|
  1. |        37              20000 |
  2. |       241              32500 |
  3. |       242              30000 |
     +------------------------------+

Now this is strictly a dataset of households with only household-level variables.

6.3.2 Existing Level Two Variables in Long Form

Next consider a data set that’s in long form but has the level two variables you need, like the ACS with household_income:

clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_hh_inc

list household person age race income household_income in 1/5, ab(30)

     +--------------------------------------------------------------+
     | household   person   age    race   income   household_income |
     |--------------------------------------------------------------|
  1. |        37        1    20   White    10000              20000 |
  2. |        37        2    19   White     5300              20000 |
  3. |        37        3    19   Black     4700              20000 |
  4. |       241        1    50   White    32500              32500 |
  5. |       242        1    29   White    30000              30000 |
     +--------------------------------------------------------------+

Begin by again keeping just the level two identifier and variables:

keep household household_income
list in 1/5, ab(30)

     +------------------------------+
     | household   household_income |
     |------------------------------|
  1. |        37              20000 |
  2. |        37              20000 |
  3. |        37              20000 |
  4. |       241              32500 |
  5. |       242              30000 |
     +------------------------------+

Now you just need to reduce the dataset to one observation for each household. Note how all the observations for a given household are now identical, so it doesn’t matter which one you keep; keeping the first observation is easy to do:

by household: keep if _n==1
(16,845 observations deleted)

list in 1/3, ab(30)

     +------------------------------+
     | household   household_income |
     |------------------------------|
  1. |        37              20000 |
  2. |       241              32500 |
  3. |       242              30000 |
     +------------------------------+

Now this is again strictly a dataset of households with only household-level variables.
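
As an aside, because the observations within each household were exact duplicates after the keep command, an equivalent alternative (a sketch, not what this chapter does) would have been to let duplicates do the work:

duplicates drop

That keeps one copy of each distinct observation, which here means one row per household.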

6.3.3 Creating New Level Two Variables

Often, however, you need to create new level two variables based on the level one units before you can eliminate the level one units entirely. In the next example you’ll create variables for “number of people in the household” and “proportion of the household that is female” as well as keeping the existing household income variable. You can always do that by creating the level two variables using the methods described in the previous chapter and then dropping the level one units using the methods just described. But if all the variables you need to create are summary statistics, the collapse command can do the entire process for you quickly and easily.

The collapse command takes all the observations (level one units) for a given level two unit and aggregates them into a single observation. Thus it needs to know the level two identifier, which variables you want aggregated, and how you want to aggregate them. Most of the aggregation rules are based on summary statistics.

Reload the data, and then run the following collapse command:

clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_hh_inc

collapse ///
    (first) household_income ///
    (mean) proportion_female=female ///
    (count) household_size=person, ///
    by(household)
    
list in 1/6, ab(30)

     +-------------------------------------------------------------------+
     | household   household_income   proportion_female   household_size |
     |-------------------------------------------------------------------|
  1. |        37              20000                   1                3 |
  2. |       241              32500                   1                1 |
  3. |       242              30000                   1                1 |
  4. |       377              51900                   1                1 |
  5. |       418              12200                   1                1 |
     |-------------------------------------------------------------------|
  6. |       465               2600                  .5                2 |
     +-------------------------------------------------------------------+

This gives you a data set of households and household-level variables as advertised. Note how the only variables remaining in the data set are the variables mentioned in the collapse command. Now consider the elements of the collapse command in turn:

(first) means the variables that follow should be aggregated using the rule “keep the first value.” The household_income variable is already a level two variable, so all the values for a given level two unit are the same and you just need to keep the first one.

(mean) means the variables that follow should be aggregated using the rule “take the mean.” The female variable is binary, so taking its mean tells us the proportion of household members that are female. However, you don’t want to call the result female, so this renames it to proportion_female. It’s a bit backwards, but (mean) proportion_female=female can be read “take the mean of the female variable and call the result proportion_female.”

(count) means the variables that follow should be aggregated using the rule “count the number of non-missing values.” What you really want is the number of observations, but for any variable with no missing values that will be the same thing. If you had to, you could create such a variable (gen temp = 1), but person has no missing values so you can use it. Again, you want to give the result a new name, and the syntax (count) household_size=person can be read “count the number of non-missing values of person and call the result household_size.”
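
To make that fallback concrete, here is a sketch of what you could run instead of the collapse command above, starting again from the freshly loaded data (temp is just a made-up name):

gen temp = 1
collapse ///
    (first) household_income ///
    (mean) proportion_female=female ///
    (count) household_size=temp, ///
    by(household)

Since temp is 1 for every person and never missing, counting it gives the number of observations, and thus people, in each household.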

You can list many variables after each aggregation rule, not just one. You can also skip specifying an aggregation rule, in which case Stata will assume you want means. Type help collapse to see a list of available aggregation rules.

The by(household) option tells collapse the level two identifier so it knows which observations to aggregate.
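
For example, a minimal sketch that relies on the default rule (again starting from the freshly loaded acs_hh_inc data):

collapse age income, by(household)

would give you the mean age and mean income of each household’s members, stored in variables still called age and income.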

Exercise

Load nlsy.dta and use collapse to convert it into a data set with one row per person. The resulting data set should contain the person’s year of birth, their mean income over the study period, and their maximum educational attainment.

clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy

collapse ///
    (first) year_of_birth ///
    (mean) mean_income=income ///
    (max) max_edu=edu, ///
    by(id)
    
list in 1/5, ab(30)

     +--------------------------------------------+
     | id   year_of_birth   mean_income   max_edu |
     |--------------------------------------------|
  1. |  1              58          4810        12 |
  2. |  2              59      11289.47        12 |
  3. |  3              61      4735.294        12 |
  4. |  4              62         12874        14 |
  5. |  5              59         20631        18 |
     +--------------------------------------------+