Hierarchical data is any kind of data where observations fall into groups or clusters. The most common examples at the SSCC are individuals living in a household and a subject being observed multiple times, but there are many other applications: schools within a district, courses taken by a student, or even individuals who are part of a subject's social network can all be treated as hierarchical data. Hierarchies can also have more than two levels; for example students may be grouped into classrooms which are grouped into schools which are grouped into districts.
One of the most useful tools for managing hierarchical data in Stata is the reshape command. It is very simple to use, but you need to be able to picture in your mind what it does and why. This article will try to give you that picture and then discuss using reshape in a variety of situations.
Note that SPSS has a tool called the Restructure Data Wizard which looks very different but does essentially the same thing, and you can write a program to do it in SAS (though with substantially more effort). Thus the conceptual parts of this article may be of interest even if you don't use Stata.
Describing Hierarchical Data
Since hierarchical data can describe so many different things, we need some to define terms that can apply to all of them. We'll describe the smallest unit in the data as the level one unit. In the examples mentioned above the level one unit would be an individual within the household, a particular time the subject was observed, a school within the district, a course taken by the student, or an individual within the subject's social network. A level two unit is then a group of level one units: the household in which the individuals live, the subject which is measured repeatedly, the district which contains the schools, the student who takes the courses or the subject whose social network is being described. If needed, a level three unit is a collection of level two units, and so forth.
Normally a hierarchical data set will include some variables which describe the level one units and some which describe the level two units. For example, a data set of individuals living in households may contain the age and sex of each individual, plus the household income of the household as a whole. Age and sex would then be level one variables while household income would be a level two variable.
Level two variables are easy to identify: they always have the same value for all level one units in the same level two unit. For example, the total number of people in the household must be the same for every member of a household, or if a subject is observed multiple times he or she must have the same race each time. Anything that varies within a level two unit is a level one variable: individuals within a household can obviously have different employment statuses, so employment status must be a level one variable. What occasionally gets forgotten is that individuals observed over time can change their education level or marital status, so those must be level one variables as well. While it's rarely difficult to identify which of your variables are level one and which are level two, taking a moment to do so during the planning stage of your project and before writing any code can help you avoid a lot of headaches.
In an ideal world, each level would have an associated identifier: for example a household ID and an individual ID, or a subject ID and a wave ID. The level one identifiers only need to be unique within a level two group, and in fact reshape will only be able to use it if this is the case. Often data sets do not come with a usable level one identifier, but one can be created very easily.
Representing Hierarchical Data as a Matrix
Stata (like most statistical programs) stores its data in a matrix, where rows are observations and columns are variables. But when working with hierarchical data "observation" is an ambiguous term: it could mean either a level one unit or a level two unit. The purpose of the reshape command is to allow you to go back and forth between the two definitions at will, restructuring your data accordingly.
If an observation represents a level one unit, then your data are in the long form. The long form is so named because it has a larger number of observations but fewer variables. In the long form, both level one and level two variables are represented by columns in the data matrix. However, level two variables will have many repeated values, since all the observations in the same level two group will share the same values of all the level two variables.
If, on the other hand, an observation represents a level two unit, then your data are in the wide form. The wide form is so named because it has fewer observations but more variables. In wide form, level two variables are represented by columns as usual. However, level one variables are represented by sets of columns, with each set containing a column for each level one unit. Thus the values of the level one variables for each level one unit within a level two unit are stored in the same row, but in different columns.
Consider the following data (taken from the Penn World Table).
Country | Continent | Year | Population | GDP Per-Capita |
---|---|---|---|---|
Afghanistan | Asia | 2000 | 25889 | 478 |
Afghanistan | Asia | 2001 | 26813 | 428 |
Afghanistan | Asia | 2002 | 27756 | 547 |
Albania | Europe | 2000 | 3474 | 3797 |
Albania | Europe | 2001 | 3490 | 4269 |
Albania | Europe | 2002 | 3509 | 4431 |
Algeria | Africa | 2000 | 31194 | 5753 |
Algeria | Africa | 2001 | 31736 | 5534 |
Algeria | Africa | 2002 | 32278 | 5639 |
This is panel data, where countries are observed over time. Thus a level two unit is a country and a level one unit is a country's data in a particular year. Population and GDP Per-Capita are level one variables as they change over time. Continent is a level two variable as it does not change (at least not over the time scale we're interested in). As listed above, each row represents a level one unit. Thus this data set is in the long form. Now consider the exact same data in a different format:
Country | Pop2000 | CGDP2000 | Pop2001 | CGDP2001 | Pop2002 | CGDP2002 | Continent |
---|---|---|---|---|---|---|---|
Afghanistan | 25889 | 478 | 26813 | 428 | 27756 | 547 | Asia |
Albania | 3474 | 3797 | 3490 | 4269 | 3509 | 4431 | Europe |
Algeria | 31194 | 5753 | 31736 | 5534 | 32278 | 5639 | Africa |
Now a row represents a level two unit, so this is the wide form. The level one variables Population and GDP Per-Capita (abbreviated Pop and CGDP) are represented by three columns each: one for the year 2000, one for 2001, and one for 2002.
Using Reshape
Stata's reshape command allows you switch between the two forms at will. The general syntax is:
reshape [long or wide] ["stubs" of level one variable names], i(level two ID) j(level one ID)
Before talking through the syntax in detail let's do an example. First load the Stata version of this data by typing (or more likely, copying and pasting):
use http://ssc.wisc.edu/sscc/pubs/files/4-28/reshape1
Do a list (l) and you'll see it's currently in the long form (as in this table). To change that, type:
reshape wide pop cgdp, i(country) j(year)
Do another list to see the results (it should look like this table). To go back to long form, type:
reshape long pop cgdp, i(country) j(year)
Referring back to the general syntax, long or wide is the form in which you want to put the data. Next comes a list of level one variables, but note that when the data set is in wide form it does not contain any variables called literally pop or cgdp. Instead you have pop2000, pop2001 and so forth. Note too that continent is not in the list, as it is a level two variable.
The i() option is where you give the level two identifier variable. j() is then the level one identifier--but note again that in wide form the data set does not have a variable called year. When reshaping from wide to long, pop cgdp combined with j(year) can be interpreted as "look for variables that start with pop or cgdp, then take whatever follows that and put it in a new variable called year."
Incidentally this is why reshape can't use Stata's varlist syntax (which is unfortunate if your data set includes a large number of level one variables). The stubs aren't really variable names.
Removing Extraneous Cases
Now consider a (fabricated) data set consisting of individuals living in households:
household | income | age1 | female1 | age2 | female2 | age3 | female3 |
---|---|---|---|---|---|---|---|
1 | 30000 | 30 | 1 | 2 | 1 | . | . |
2 | 90000 | 45 | 0 | 43 | 1 | 15 | 0 |
Here the level one unit is an individual, the level two unit is a household (with the household variable as the identifier), income is a level two variable, and age and female are level one variables. Since an observation represents a household, the level two unit, this data set is in wide form.
Load this data with:
use http://ssc.wisc.edu/sscc/pubs/files/4-28/reshape2
If you need an observation to represent an individual instead, all you need to do is reshape the data into the long form:
reshape long age female, i(household) j(j)
Most of this is familiar, but there is one puzzle: what is j? At one level the answer is "The number at the end of age and female." But if you want to go further and ask "What does j mean?" the answer in this case is "We don't know." It's the order in which the individuals in the household are listed in the data set, but they may have been listed in more or less random order--in which case j would mean absolutely nothing. Or j may be very important--person one may always be the respondent, or the head of the household. We don't know without consulting the documentation for the data set (and since this data set was fabricated no documentation exists). So we'll punt and just call it j.
Here is the result of the reshape command:
household | j | income | age | female |
---|---|---|---|---|
1 | 1 | 30000 | 30 | 1 |
1 | 2 | 30000 | 2 | 1 |
1 | 3 | 30000 | . | . |
2 | 1 | 90000 | 45 | 0 |
2 | 2 | 90000 | 43 | 1 |
2 | 3 | 90000 | 15 | 0 |
There's just one problem: person number three in household one. Note that in wide form the number of columns is determined by the household with the largest number of members--the data set must be wide enough to contain all the individuals in that household. However, since all rows have the same number of columns, smaller households have as many columns as the biggest household. In this data set, household one only has two people so age3 and female3 are missing. But the reshape command doesn't choose not create an observation just because these variables don't have values. After all, conceivably there could be a third person in household one and we just don't know their age or sex.
Such agnosticism is rarely an option with real data. In big surveys the largest household tends to be very large indeed, so you could easily have variables like age26 and female26. Converting them all to observations results in a data set consisting mostly of missing values. Thus you most likely want to drop these extraneous observations:
drop if age==. & female==.
If you had more level one variables you would still use all of them in the if condition, just to be sure that you don't eliminate a real person who is only missing a few variables.
Note that what we've discussed so far covers 90% of the ways reshape is used. The remaining examples in this article cover special cases, so continue reading at your discretion.
Creating a Level One Identifier
Now suppose you had gotten the data in long form, but without the j variable.
household | income | age | female |
---|---|---|---|
1 | 30000 | 30 | 1 |
1 | 30000 | 2 | 1 |
2 | 90000 | 45 | 0 |
2 | 90000 | 43 | 1 |
2 | 90000 | 15 | 0 |
Type:
use http://ssc.wisc.edu/sscc/pubs/files/4-28/reshape3
to load this in Stata.
This data set contains all the information in the original data set. But what if you had to convert it to wide form? There's no j variable and nothing that can be used as a j variable. So you'll have to make one:
bysort household: gen j=_n
A list (l) will verify that this works, but you should be very careful if the order of the household members is important. By default Stata's sort algorithm is not "stable" meaning that ties are not necessarily left in the same order they started in. Thus when you sort by household, you could change the order of the individuals within each household.
One solution is to add the stable option to all your sorts. Then Stata will use a (slightly slower) stable sort algorithm. However, this can't be used with bysort. Thus you'd have to instead type:
sort household, stable
by household: gen j=_n
Another approach is to create a new variable that keeps track of the original order. This has the advantage of of allowing you to return to the proper sort order later if you ever have to sort by something else.
gen originalOrder=_n
sort household originalOrder
by household: gen j=_n
This can be done with bysort, since bysort understands variables in parentheses as variables it should sort by but not use to form by groups:
gen originalOrder=_n
bysort household (originalOrder): gen j=_n
Using Strings as Identifiers
Suppose you had data on students who have taken the SAT, including their SAT section scores. The SAT has three sections, which we'll call verbal, math and writing (officially they're now called Critical Reading, Mathematics and Writing). Each student takes all three sections.
id | section | score |
---|---|---|
13 | verbal | 571 |
13 | math | 236 |
13 | writing | 533 |
55 | verbal | 722 |
55 | math | 353 |
55 | writing | 226 |
68 | verbal | 454 |
68 | math | 739 |
68 | writing | 513 |
First load the data:
use http://ssc.wisc.edu/sscc/pubs/files/4-28/reshape4
The level one unit in this data set is a section and the level two unit is a student. The variables id and section are the level two and level one identifiers respectively, and score is a level one variable. However, section is stored as a string. That's not a problem: just add the string option to reshape:
reshape wide score, i(id) j(section) string
The process of reshaping the data is the same, but the values of section which are added to the score stub are now verbal, math and writing rather than numbers. The resulting variables are scoremath, scoreverbal and scorewriting.
These variable names are functional, but somewhat hard to read. One solution would be rename score to score_ before reshaping (ren score score_). Then the reshaped variables would be score_math, score_verbal and score_writing. Another would be to capitalize the first letter in the values of section using the proper function (replace section=proper(section)). Then the variables would be scoreMath, scoreVerbal and scoreWriting.
In any case the structure will be the same:
id | scoreMath | scoreVerbal | scoreWriting |
---|---|---|---|
13 | 236 | 571 | 533 |
55 | 353 | 722 | 226 |
68 | 739 | 454 | 513 |
Handling Non-standard Variable Names
Now suppose you got this data set in wide form and needed to convert it to long, but it had slightly different variable names:
id | SATMathScore | SATVerbalScore | SATWritingScore |
---|---|---|---|
13 | 236 | 571 | 533 |
55 | 353 | 722 | 226 |
68 | 739 | 454 | 513 |
Load this with:
use http://ssc.wisc.edu/sscc/pubs/files/4-28/reshape5
These variable names are very clear, but they don't follow the stub+suffix convention. Fortunately you can tell Stata where to find the j variable within the stub using the @ sign:
reshape long SAT@Score, i(id) j(section) string
This tells Stata to look for variables than start with SAT and end with Score, and then the j variable, section, is everything in between.
id | section | SATScore |
---|---|---|
13 | Math | 236 |
13 | Verbal | 571 |
13 | Writing | 533 |
55 | Math | 353 |
55 | Verbal | 722 |
55 | Writing | 226 |
68 | Math | 739 |
68 | Verbal | 454 |
68 | Writing | 513 |
Working With More Than Two Levels
If you have more than two levels things get more complicated at the conceptual level, but the Stata code remains the same except that you may have to apply the reshape command multiple times. Suppose you had data on two schools, each with two classes, each of which has two students (yes, these are very small schools and classes, but it makes for manageable tables). Thus the level one unit is a student, the level two unit is a class and the level three unit is a school. The data set includes an identifier for each level (school, class, student) and one variable for each level (lunch--percentage of the school's students receiving free or reduced price lunches, exp--the number of years of teaching experience the class's teacher has, and score--the student's score on a test). Also note that the identifiers are 1 and 11 rather than 1 and 2 or random numbers. This is to guarantee we run into a problem that is otherwise merely common.
Load this with:
use http://ssc.wisc.edu/sscc/pubs/files/4-28/reshape6
Now that there are three levels, a row could represent a level one unit, a level two unit or a level three unit. The structure where a row represents a level one unit is sometimes called the long-long form:
school | lunch | class | exp | student | score |
---|---|---|---|---|---|
1 | 45 | 1 | 1 | 1 | 83 |
1 | 45 | 1 | 1 | 11 | 82 |
1 | 45 | 11 | 4 | 1 | 96 |
1 | 45 | 11 | 4 | 11 | 71 |
11 | 26 | 1 | 7 | 1 | 90 |
11 | 26 | 1 | 7 | 11 | 91 |
11 | 26 | 11 | 8 | 1 | 90 |
11 | 26 | 11 | 8 | 11 | 76 |
Now consider reshaping so that a row represents a level two unit. The j variable is clearly student. But the i variable is not just class, because class only uniquely identifies a class within a school. To fully identify a class you need both school and class. Thus the command is:
reshape wide score, i(school class) j(student)
And the result is:
school | class | score1 | score11 | lunch | exp |
---|---|---|---|---|---|
1 | 1 | 83 | 82 | 45 | 1 |
1 | 11 | 96 | 71 | 45 | 4 |
11 | 1 | 90 | 91 | 26 | 7 |
11 | 11 | 90 | 76 | 26 | 8 |
This is sometimes called long-wide form. It is long with respect to classes, but wide with respect to students.
To make a row represent a level three unit, a school, you need to reshape again:
reshape wide score1 score11 exp, i(school) j(class)
But this gives an error:
score11 already defined
r(110);
The trouble is the identifiers. The reshape command identifies the values of the j variable (class) as 1 and 11. It then tries to add them to the names of the existing variables score1 and score11. But that means the score for student 1 in class 1 would be stored as score11, and that variable already exists. It's true that the existing score11 will be renamed something else shortly, but even if Stata recognized that and proceeded the results would still be problematic: both student 1, class 11 and student 11, class 1 would have their scores stored as score111.
While this data set was constructed specifically to ensure that this problem arose, it will probably come up on its own if you have more than two levels and the values of the identifiers do not always have the same lengths. The solution is to separate the student numbers and the class numbers somehow. The following adds an underscore between them:
ren score1 score1_
ren score11 score11_
reshape wide score1_ score11_ exp, i(school) j(class)
The result is the following:
school | score1_1 | score11_1 | exp1 | score1_11 | score11_11 | exp11 | lunch |
---|---|---|---|---|---|---|---|
1 | 83 | 82 | 1 | 96 | 71 | 4 | 45 |
11 | 90 | 91 | 7 | 90 | 76 | 8 | 26 |
This is sometimes known as the wide-wide form, because it is wide in both class and student.
Note that Stata doesn't really understand the nesting structure of your data set and will allow you to do things that don't make much sense. For example, you could start with the original data in long-long form and type:
reshape wide score exp, i(school student) j(class)
Stata will carry out its instructions without complaint, but this is treating a class as the level one unit and a student as the level two unit--as if all the student 1s had something important in common. While Stata will prevent you from doing a reshape that is clearly wrong (for example, misidentifying a level one variable as a level two variable) it is your responsibility to make sure you've correctly identified the levels and that the reshape makes sense.
Last Revised: 7/2/2008