In this chapter you’ll learn how to restructure datasets. You’ll learn how reshape data so you can switch at will between the long form (one row per level one unit) and wide form (one row per level two unit). You’ll also learn how to turn a dataset containing both level one and level two units into a dataset containing just level two units.
Terminology Alert
This chapter depends heavily on the concepts and terminology introduced in Hierarchical Data Concepts, so be sure you’ve read that section before proceeding.
6.1 Set Up
Create a new do file called restructure.do that loads acs_hh_inc.dta (if you downloaded the example files, you can skipt the URL):
capturelogcloselogusing restructure.log, replaceclearalluse https://sscc.wisc.edu/sscc/pubs/dws/data/acs_hh_incsetlinesize 120 // only needed for Jupyter Notebook
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/d/dimond/kb/dws/restructure.log
log type: text
opened on: 30 Dec 2025, 13:55:01
This is the ACS, as cleaned up in the First Steps chapter, with one additional variable.
Recall that this is hierarchical data consisting of people living in households, so a person is a level one unit and a household is a level two unit. The level one and level two identifier variables are person and household, and you can confirm that they uniquely identify observations with:
duplicatesreport household person
Duplicates in terms of household person
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 27410 0
--------------------------------------
The familiar variables age, race, marital_status, edu, income, female, and hispanic all describe individual persons, making them level one variables. The added variable household_income describes the household, making it a level two variable. Persons who live in the same household always have the same value of household_income.
6.2 Reshape
The reshape command converts data sets between long form and wide form. It is easy to use if you understand the structure of the your data set. In particular, you need to know the level one and level two identifiers and which variables are level one variables and which are level two.
The ACS data set has one observation per person, or level one unit, so the data set is currently in long form. (Take a look at the data set in the data browser if you’re not familiar with it.) In wide form it would have one observation per household, and reshape can do that for you:
reshapewide age race marital_status edu income female hispanic, ///i(household) j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 27,410 -> 10,565
Number of variables 10 -> 114
j variable (16 values) person -> (dropped)
xij variables:
age -> age1 age2 ... age16
race -> race1 race2 ... race16
marital_status -> marital_status1 marital_status2 ... marital_status16
edu -> edu1 edu2 ... edu16
income -> income1 income2 ... income16
female -> female1 female2 ... female16
hispanic -> hispanic1 hispanic2 ... hispanic16
-----------------------------------------------------------------------------
The syntax of the reshape command begins by specifying the form you want the data to be in, in this case wide.
Then you give a list of all the level one variables in the data set, not including the level one identifier. This is a bit different from the usual syntax where a list of variables tells the command which variables to act on. The reshape command always reshapes the entire data set, but to do so it needs to understand which variables are level one and which are level two. The variables you list are level one variables; any variables you do not list are assumed to be level two variables. Make sure this is true!
Finally, the command needs to know the identifiers. The i() option specifies the level two identifier, which could be a compound identifier with multiple variables; the j() option specifies the level one identifier. It calls them i and j rather than level one and level two because reshape can be used on data with more than two levels of hierarchy.
Take a moment to view the result in the data browser (it won’t fit here). As promised, there is now just one observation per household, and the household variable is now a unique identifier all by itself. Most of the variable names now have two parts: the name of the quantity described (e.g. income) followed by the number of the person being described. The person variable has been changed from part of a compound row identifier to part of a compound column identifier.
Note that there are now 16 columns for each of the level one variables. The largest household had 16 people in it, so storing the information for all of them required 16 versions of each level one variable. Since the data set has to be rectangular, that means all the households have 16 of each level one variable, with most of them containing missing values. The household_income variable (on the far right in the data browser) remains a single variable because it is a household level variable.
Reshaping from wide form to long form requires the exact same command, just replacing wide with long:
reshapelong age race marital_status edu income female hispanic, ///i(household) j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 10,565 -> 169,040
Number of variables 114 -> 10
j variable (16 values) -> person
xij variables:
age1 age2 ... age16 -> age
race1 race2 ... race16 -> race
marital_status1 marital_status2 ... marital_status16->marital_status
edu1 edu2 ... edu16 -> edu
income1 income2 ... income16 -> income
female1 female2 ... female16 -> female
hispanic1 hispanic2 ... hispanic16 -> hispanic
-----------------------------------------------------------------------------
However, the meaning is quite different: age, race, etc. do not refer to individual variables, but to groups of variables, and j(person) does not refer to an existing variable at all. The reshape long command will identify all the variables that start with age, race, etc. then take the number that follows and store it in a new variable called person.
With reshape wide, the list of level one variables is a list of actual variables, so you can use shortcuts like age-hispanic. With reshape long you are giving a list of “stubs” of variable names, so you must list them all individually (individual stubs, that is, not individual variables). We recommend dropping any variables you won’t actually use very early in the data wrangling process, but you definitely want to get rid of them before using reshape long.
If you look in the data browser there’s a lot more missing data than there used to be. You can see the problem by running:
duplicatesreport household
Duplicates in terms of household
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
16 | 169040 158475
--------------------------------------
Every household now has exactly 16 observations in it. That’s because in wide form every household had 16 of each level one variable. This will happen any time you reshape a data set from wide to long: every level two unit will end up with same number of level one units as the largest level two unit. So now you have a bunch of observations that do not actually represent people. Fortunately you can detect and drop the extraneous level one units because they have missing values for all the level one variables:
Keep in mind that real people can have missing data, so you would not want to drop observations that have a missing value for one level one variable or even a few of them. Include all the level one variables in your if condition.
Exercise
Load the dataset nlsy.dta. Identify the level one and level two units, and the level one and level two variables.
Run duplicates report id. Reshape the dataset to wide form, and then reshape it again to long. Run duplicates report id again. Why don’t you need to worry about extraneous observations in this case?
(Yes, you can skip the first part of the exercise if you just completed the previous chapter and remember the answers.)
You can quickly find that while id does not uniquely identify observations, a combination if id and year does:
duplicatesreport id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
19 | 241034 228348
--------------------------------------
duplicatesreport id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 241034 0
--------------------------------------
That tells you you have one observation per person (id) per year. A person-year combination (or just year) is a level one unit, and a person is a level two unit. Now look at a few observations (ideally in the data browser):
year_of_birth is a level two variable (it does not change over time) and edu, income, and age are level one variables. Now you know enough about the dataset to reshape it. But first run duplicates report id again so you have it handy:
duplicatesreport id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
19 | 241034 228348
--------------------------------------
Now reshape to wide:
reshapewide edu income age, i(id) j(year)
(j = 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1996 1998 2000)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 241,034 -> 12,686
Number of variables 6 -> 59
j variable (19 values) year -> (dropped)
xij variables:
edu -> edu1979 edu1980 ... edu2000
income -> income1979 income1980 ... income2000
age -> age1979 age1980 ... age2000
-----------------------------------------------------------------------------
And back to long:
reshapelong edu income age, i(id) j(year)
(j = 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1996 1998 2000)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 12,686 -> 241,034
Number of variables 59 -> 6
j variable (19 values) -> year
xij variables:
edu1979 edu1980 ... edu2000 -> edu
income1979 income1980 ... income2000 -> income
age1979 age1980 ... age2000 -> age
-----------------------------------------------------------------------------
Finally, run duplicates report id again:
duplicatesreport id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
19 | 241034 228348
--------------------------------------
Unlike the ACS, reshaping the NLSY from long to wide and then back to long did not create any extraneous observations. That’s because the NLSY includes a row for every person for every year whether they could find the person or not. Thus every person started the reshape cycle with exactly 19 observations and ended the reshape cycle with exactly 19 observations.
6.3 Creating Data Sets of Level Two Units
Sometimes you want to get rid of the level one units in your data set entirely so you’re left with a data set of level two units and level two variables. For the ACS that would be a data set of households with no individual-level variables.
If all the level two variables you want already exist, that’s very easy to do.
6.3.1 Existing Level Two Variables in Wide Form
If the data set is already in wide form, just drop or keep to eliminate all the level 1 variables and you’re done. acs_wide_hh_inc contains the ACS data in wide form with the additional level two variable household income:
Now you just need to reduce the dataset to one observation for each household. Note how all the observations for a given household are now identical so it doesn’t matter which one you keep, but keeping the first observation is easy to do:
Now this is again strictly a dataset of households with only household-level variables.
6.3.3 Creating New Level Two Variables
Often, however, you need to create new level two variables based on the level one units before you can eliminate the level one units entirely. In the next example you’ll create variables for “number of people in the household” and “proportion of the household that is female” as well as keeping the existing household income variable. You can always do that by creating the level two variables using the methods described in the previous chapter and then dropping the level one units using the methods just described. But if all the variables you need to create are summary statistics, the collapse command can do the entire process for you quickly and easily.
The collapse command takes all the observations (level one units) for a given level two unit and aggregates them into a single observation. Thus it needs to know the level two identifier, which variables you want aggregated, and how you want to aggregate them. Most of the aggregation rules are based on summary statistics.
Reload the data, and then run the following collapse command:
This gives you a data set of households and household-level variables as advertised. Note how the only variables remaining in the data set are the variables mentioned in the collapse command. Now consider the elements of the collapse command in turn:
(first) means the variables that follow should be aggregated using the rule “keep the first value.” The household_income variable is already a level two variable, so all the values for a given level two unit are the same and you just need to keep the first one.
(mean) means the variables that follow should be aggregated using the rule “take the mean.” The female variable is binary, so taking its mean tells us the proportion of household members that are female. However, you don’t want to call the result female, so this renames it to proportion_female. It’s a bit backwards, but (mean) proportion_female=female can be read “take the mean of the female variable and call the result proportion_female.”
(count) means the variables that follow should be aggregated using the rule “count the number of non-missing values.” What you really want is the number of observations, but for any variable with no missing values that will be the same thing. If you had to we could create such a variable (gen temp = 1), but person has no missing values so you can use it. Again, you want to give the result a new name, and the syntax (count) household_size=person can be read “count the number of non-missing values of person and call the result household_size.”
You can list many variables after each aggregation rule, not just one. You can also skip specifying an aggregation rule, in which case Stata will assume you want means. Type help collapse to see a list of available aggregation rules.
The by(household) option tells collapse the level two identifier so it knows which observations to aggregate.
Exercise
Load nlsy_extract.dta and use collapse to convert it into a data set with one row per person. The resulting data set should contain the person’s year of birth, their mean income over the study period, and their maximum educational attainment.