5Variable Transformation With Hierarchical (Grouped) Data
Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is things are observed multiple times (panel/longitudinal data or repeated measures). These structures may seem very different, but the same concepts apply to both and often even the same code.
5.1 Hierarchical Data Concepts
In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data. Having a more-or-less formal way of talking and thinking about hierarchical data makes it much less confusing.
5.1.1 Levels
Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:
EXAMPLE DATA STRUCTURE
LEVEL ONE UNIT
LEVEL TWO UNIT
Individuals living in households
Individuals
Households
States grouped into regions
States
Regions
Students in classes
Students
Classes
Tests taken by students
Tests
Students
Monthly observations of individuals
Monthly observations
Individuals
Visits to the doctor by patients
Doctor visits
Patients
Social media posts by individuals
Posts
Individuals
If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.
Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.
EXAMPLE DATA STRUCTURE
LEVEL ONE VARIABLES
LEVEL TWO VARIABLES
Individuals living in households
Age, Sex
Household income, Address
States grouped into regions
State population
Regional income per capita
Students in classes
Student’s race
Teacher’s race, Class size
Tests taken by students
Test Score
Free lunch eligibility
Monthly observations of individuals
Employment Status
Highest degree earned
Visits to the doctor by patients
BMI, Diagnosis
Race
Social media posts by individuals
Length, Topic
Sex
Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (observations of the same individual) have different values for it.
5.1.2 Data Forms
With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people once a month for three months:
person_id
month
years_edu
employed
1
1
16
True
1
2
16
True
1
3
16
True
2
1
12
False
2
2
12
True
2
3
12
True
Exercise
Identify the level two units, level one units, level two variable(s), and level one variable(s) in the above data set.
Solution
In this data set, a person is a level two unit, a person’s observation for a particular month is a level one unit, years_edu is a level two variable (it never changes for a given person), and employed is a level one variable (it changes for person 2 from month 1 to month 2).
In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:
person_id
years_edu
employed1
employed2
employed3
1
16
True
True
True
2
12
False
True
True
In this form, each observation represents a person. Because years_edu is a level two variable, there’s just one value per person and thus one variable. However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables.
We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more variables. In most cases the long form is easier to work with, so we’ll do most of our examples in this form.
Now consider the identifiers in this data set. In the long form, person_id and month were a compound identifier for the rows in the data set, while the variable names were a simple column identifier. In the wide form, person_id is a simple row identifier, but now the variable names for the level one variables are compound identifiers with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the person_id, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier.
5.1.3 Level 2 Units that Vary in Size
Suppose that in the previous example, person 2 dropped out of the study after month 2. In the long form, person 2 would typically have just two rows:
person_id
month
years_edu
employed
1
1
16
True
1
2
16
True
1
3
16
True
2
1
12
False
2
2
12
True
(As you’ll see later, some long form data sets would include a third row for person 2 that contains missing values, but this is less common.)
However, a data set cannot have different variables for different rows. So in wide form, person 2 must have an employed3 variable, but it will contain a missing value.
person_id
years_edu
employed1
employed2
employed3
1
16
True
True
True
2
12
False
True
Missing
A wide form data set must have enough columns to store all the data for the biggest level 2 unit in the data set. Our ACS sample has a household with 16 people in it, so in wide form it must have 16 columns for each level 1 variable. But most households do not have even close to 16 people, so you end up with a large number of missing values. Most of those missing values do not mean “We don’t know the value of this observation for this person” but rather “This person does not actually exist.”
5.2 Creating Level Two Variables in Long Form
A very common data wrangling task is creating level two variables based on the level one variables in the data set. If the data set is in long form, one critical tool for doing so is the by prefix, so that commands are executed separately for each level two unit. Another is egen, with its aggregate functions that act across observations.
In the ACS the level one units are people and the level 2 units are households, and in our example file there is one row per person. That makes it a long form data set.
Start a do file called level2.do that loads acs.dta. If you downloaded the example files for this book you can skip the URL.
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/d/dimond/kb/dws/level2.log
log type: text
opened on: 12 Dec 2025, 10:28:36
I suggest having the data browser open while you work on the examples so you can see the result of what you’re doing. The book will list the key variables for a few observations, but you can skip the list commands if you look at the data browser instead. (We will create a lot of variables in this chapter. You may want to reload the data every time you reach a new heading just to reduce the clutter in the data browser. The variables created in previous sections will not be needed again.)
I will increase the length of lines in the web book output so I have a bit more room for lists, but you don’t need to do that:
setlinesize 90
5.2.1 Continuous Variables
We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.
For many purposes, the total income of a household is more relevant than individual incomes. You can calculate it with a combination of by and the egentotal() function:
by household: egen household_income = total(income)list household person income household_income in 1/5, ab(30)
The reason is that young children have missing values for income,which prompts egen to completely ignore them in the calculation. Thus for household 484, inc_per_person_wrong is the total income of the household divided by the number of people who have a non-missing value for income, two, when it should be total income divided by the total number of people in the household, four.
Fortunately, these egen functions can act on mathematical expressions, not just single variables. Since mean(x) = total(x)/N = total(x/N), we can instead use:
by household: egen income_per_person = total(income/_N)list household person income household_income income_per_person ///if household==484, ab(30)
_N is a system variable, or a variable Stata always keeps track of and you can use even though you never created it. It contains the number of observations in the data set, or, if you are using by, the number of observations in the by group. In this case that’s the number of people in the household. Sometimes that’s useful all by itself:
by household: gen household_size = _Nlist household person household_size in 1/5, ab(30)
Sometimes the way egen ignores missing values can be useful. Consider trying to calculate the total income earned by the children of a household. You might think you could run:
by household: egen child_income_wrong = total(income) if age<18
(20,118 missing values generated)
But examine household 8787:
list household person age income child_income_wrong if household==8787, ab(30)
child_income_wrong has the right number, but it’s not a proper household-level variable in that the parent has a missing value instead. That makes it unusable if, for example, you wanted to talk about the effect of children’s contribution to the household on the parent. The reason is that egen used the condition if age<18 for two different purposes, one that you want and one that you don’t. The first is that in calculating the total it only includes observations where age is less than 18, which is exactly what you want. The second is that it only stores the result in observations where age is less than 18, leaving the other observations with missing values. This is not what you want.
One solution is to first create a variable income_if_child containing only the values of income that you want to be included in the total (i.e. the children’s incomes). Observations that should not be included (i.e. the adults) get a missing value. Then you can add up all the values of that variable and store the result in all the observations. egen ignore the missing values, giving you exactly what you want:
gen income_if_child = income if age<18by household: egen child_income = total(income_if_child)
(26,262 missing values generated)
list household person age income income_if_child child_income if household==8787, ab(30)
Now child_income is a proper household-level variable, with the same value for all the members of the household. income_if_child is no longer needed can be dropped at your convenience.
Exercise
Create a level two variable called mean_adult_age containing the mean age of the adults in the household. Make sure it has the same value for all the members of the household, including any children.
Solution
Now the subset of interest is the adults:
gen age_if_adult = age if age>=18by household: egen mean_adult_age = mean(age_if_adult)list household person age age_if_adult mean_adult_age if household==8787, ab(30)
When an indicator variable is coded one/zero, the egen functions used above take on new and very useful meanings. To see this in action, begin by creating an indicator called child which is 1 for people under 18 and 0 for everyone else:
gen child = (age<18)
The total of child, calculated by household, is the number of children in the household:
by household: egen num_children = total(child)list household person age child num_children if household==484, ab(30)
More broadly, the total of an indicator variable is the number of observations for which the indicator is one. If the indicator variable corresponds to a condition (like “this person is a child”) then it is the number of observations for which the condition is true.
The mean of an indicator variable is the proportion of observations for which the indicator is one. Create a variable containing the proportion of each household that is below 18 with:
by household: egen prop_children = mean(child)list household person age child prop_children if household==484, ab(30)
If a household has no children in it, then child is always zero and the maximum value is zero. If a household has any children in it, then those children have a one for child and the maximum value of child is one. Thus has_children is a household-level indicator variable for “this household has children in it.”
More generally, applying the max() function to an indicator variable creates a new indicator variable which is one for all observations (or all observations within a by group) if the original indicator variable is one for any observation. If the indicator variable corresponds to a condition, then the result is true for all observations if the condition is true for any observation.
You can use min() in a similar way: the result will be one if the indicator you apply it to is one for all observations, but the result will be zero if any observation has a zero. Use that to create an indicator variable for “all the people in this household are children”:
by household: egen all_children = min(child)list household person age child all_children if household==484, ab(30)
All of these functions can act on an expression rather than a variable, including conditions. Thus we could have used total(age<18), max(age<18), etc. rather than creating the indicator variable child, and we’d have gotten the same results.
This data set has no missing values for age, but it’s worth thinking about what this code would do with them. The condition age<18 returns a zero for observations where age is missing (recall that you can think of missing as infinity), as if the person were known to be an adult. Thus in the presence of missing values num_children isn’t really the number of children in the household, it’s the number of people known to be children. That may or may not be what you want.
Exercise
Create variables containing the number of college graduates in each household (Associate’s degree or above) and an indicator variable for “this household contains at least one college graduate.” Then create an indicator variable for “all the adults in this household are college graduates.” Remember that edu has missing values, but since the people with missing values of edu are all less than three years old you can safely assume they’re not college graduates.
Solution
First remind yourself of the coding scheme for edu:
labellist edu_label
edu_label:
0 Not in universe
1 None
2 Nursery school-4th grade
3 5th-6th grade
4 7th-8th grade
5 9th grade
6 10th grade
7 11th grade
8 12th grade, no diploma
9 High School graduate
10 Some college, <1 year
11 Some college, >=1 year
12 Associate degree
13 Bachelor's degree
14 Master's degree
15 Professional degree
16 Doctorate degree
Thus someone is a college graduate if edu is greater than 11 but not missing:
gen grad = (edu>11 & edu<.)
You could also identify college graduates with inlist(edu, 12, 13, 14, 15, 16).
Now you’re ready to create the straightforward household-level variables:
by household: egen num_grads = total(grad)by household: egen has_grad = max(grad)list household person edu num_grads has_grad if household==4398, ab(30)
“All the adults in the household are college graduates” is a little trickier. One approach is to treat this as a subsetting problem:
gen grad_adult = grad if age>=18by household: egen all_adults_grad1 = min(grad_adult)list household person edu age grad_adult all_adults_grad1 if household==4398, ab(30)
An alternative depends on combining conditions. “All the adults in the household are college graduates” requires that every individual be either a child (age<18) or a college graduate (grad). The min() function can act on that combination of conditions directly:
by household: egen all_adults_grad2 = min(age<18 | grad)list household person age edu all_adults_grad2 if household==4398, ab(30)
name: <unnamed>
log: /home/d/dimond/kb/dws/level2.log
log type: text
closed on: 12 Dec 2025, 10:28:37
------------------------------------------------------------------------------------------
If this were real work, you would save the modified data set first.
5.3 Creating Level Two Variables in Wide Form
Start a new do file, level2_wide.do that loads acs_wide.dta:
------------------------------------------------------------------------------------------
name: <unnamed>
log: /home/d/dimond/kb/dws/level2_wide.log
log type: text
opened on: 12 Dec 2025, 10:28:37
This is the exact same data set as before, but in wide form. Now there is just one row for each household, but it contains all the data about all the individuals in that household. The income1 variable contains the income of the first person in the household, income2 the second, etc.
There is one household in this ACS sample with 16 people in it, so every level one variable must have 16 columns (income1 through income16, for example). However, most households have fewer people. If a household only has three people, then for that household income1 through income3 will have values and all the other income variables will be missing.
list household income1 income2 income3 income4 income5 if household==37, ab(30)
I only included income1 through income5 in this list partially so the output wouldn’t be too long, but partially because typing long lists of variables is tiresome. Unfortunately, working with wide form data frequently involves long lists of variables. Fortunately, Stata includes some easy shortcuts so you can specify lists of variables without having to type them all out. So let’s take a brief digression to learn about varlist syntax.
5.3.1 Shortcuts for Variable Lists
The most common shortcut is to use the asterisk (*) as a wildcard character. Try:
describe income*
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
income1 long %10.0g 1 income
income2 long %10.0g 2 income
income3 long %10.0g 3 income
income4 long %10.0g 4 income
income5 long %10.0g 5 income
income6 long %10.0g 6 income
income7 long %10.0g 7 income
income8 long %10.0g 8 income
income9 long %10.0g 9 income
income10 long %10.0g 10 income
income11 long %10.0g 11 income
income12 long %10.0g 12 income
income13 long %10.0g 13 income
income14 long %10.0g 14 income
income15 long %10.0g 15 income
income16 long %10.0g 16 income
This tells the describe command to act on all variables that match the pattern “income followed by anything.” Note that “anything” can include nothing, so a variable just called income would be included as well. The wildcard can go anywhere:
describe *1
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
age1 byte %10.0g 1 age
race1 byte %25.0g race_label
1 race
marital_status1 byte %13.0g marital_status_label
1 marital_status
edu1 byte %24.0g edu_label
1 edu
income1 long %10.0g 1 income
female1 float %9.0g 1 female
hispanic1 float %9.0g 1 hispanic
age11 byte %10.0g 11 age
race11 byte %25.0g race_label
11 race
marital_stat~11 byte %13.0g marital_status_label
11 marital_status
edu11 byte %24.0g edu_label
11 edu
income11 long %10.0g 11 income
female11 float %9.0g 11 female
hispanic11 float %9.0g 11 hispanic
This matches all the variables with information about the first individual in the household, but also the variables with information about the eleventh individual. Be careful your wildcards don’t match more than what you want!
A question mark (?) is also a wildcard, but it matches exactly one character:
describe income?
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
income1 long %10.0g 1 income
income2 long %10.0g 2 income
income3 long %10.0g 3 income
income4 long %10.0g 4 income
income5 long %10.0g 5 income
income6 long %10.0g 6 income
income7 long %10.0g 7 income
income8 long %10.0g 8 income
income9 long %10.0g 9 income
This matches income1 through income9, but not income10 because it is income followed by two characters. It also would not match just income.
Another shortcut is to put a dash (-) between two variables. This will give you all the variables in between them:
describe age1-hispanic1
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
age1 byte %10.0g 1 age
race1 byte %25.0g race_label
1 race
marital_status1 byte %13.0g marital_status_label
1 marital_status
edu1 byte %24.0g edu_label
1 edu
income1 long %10.0g 1 income
female1 float %9.0g 1 female
hispanic1 float %9.0g 1 hispanic
This gives you just the variables with information about the first individual. The order used in resolving this shortcut is the order in which the variables are listed in the variables window or a describe command. You can use the order command to change the order of the variables to something more convenient.
A varlist can mix multiple kinds of shortcuts as well as individual variable names:
describe household age1-female1 income*
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
age1 byte %10.0g 1 age
race1 byte %25.0g race_label
1 race
marital_status1 byte %13.0g marital_status_label
1 marital_status
edu1 byte %24.0g edu_label
1 edu
income1 long %10.0g 1 income
female1 float %9.0g 1 female
income1 long %10.0g 1 income
income2 long %10.0g 2 income
income3 long %10.0g 3 income
income4 long %10.0g 4 income
income5 long %10.0g 5 income
income6 long %10.0g 6 income
income7 long %10.0g 7 income
income8 long %10.0g 8 income
income9 long %10.0g 9 income
income10 long %10.0g 10 income
income11 long %10.0g 11 income
income12 long %10.0g 12 income
income13 long %10.0g 13 income
income14 long %10.0g 14 income
income15 long %10.0g 15 income
income16 long %10.0g 16 income
5.3.2 Using Row Functions
Now consider calculating the household income. To calculate the household income in long form, we used:
by household: egen household_income = total(income)
This told Stata to add up all the values in the income column, grouping all the observations in each household.
In wide form, we don’t need by household: because each household is just one observation. But the individual incomes are not in a single column, they’re in the columns income1 through income16. This is a job for the egen function rowtotal(). The rowtotal() function takes a varlist as input, and adds up the value of all those variables across a single row. The varlist income* is a convenient way to tell it to add up all the income variables.
egen household_income = rowtotal(income*)list household household_income income1 income2 income3 income4 if household==37, ab(30)
The egen functions we used in the section on working with long form data all have row equivalents: rowtotal(), rowmean(), rowmax(), and rowmin(). The tasks we carried out in long form can easily be done in wide form with just three changes:
Remove the by prefix.
Replace the egen function with its row equivalent.
Replace the variable to act on with a variable list matching all the corresponding level one variables.
Exercise
Crete a variable containing the mean age of each household.
Solution
egen mean_age = rowmean(age*)list household age1 age2 age3 mean_age if household==37, ab(30)
Now consider creating an indicator for “this household has children.” In long form you could do that with:
by household: egen has_children = max(age<18)
One big difference between row functions and their column equivalents is that row functions can’t act on expressions like age<18, only variables. But creating a single indicator variable for child won’t work either: we need child1 (person one is a child), child2 (person 2 is a child), etc. Typing out 16 gen commands would be even more tiresome than typing out 16 variable names, so it’s time for another digression.
5.3.3 Creating Level 1 Variables in Wide Form
In this section we’ll demonstrate very briefly how to write loops in Stata. For a broader and deeper introduction to looping, see Stata Macros and Loops. Definitely read it if you find this brief introduction confusing.
In long form, we ran:
gen child = (age < 18)
to create an indicator for “this person is a child.” The wide form equivalent would be:
gen child1 = (age1<18) if age1<.
gen child2 = (age2<18) if age2<.
gen child3 = (age3<18) if age3<.
...
gen child16 = (age16<18) if age16<.
Fortunately, the following loop will do that much more easily:
forvalues i = 1/16 {gen child`i' = (age`i' < 18) if age`i' < .}
forvalues tells Stata we’re going to carry out the following code for a set of values, meaning execute the code once for each value. 1/16 defines the set of values as the integers 1 through 16. (Stata knows from context that we’re not doing division here.) Each value is stored in the macroi. Finally, the { tells Stata this is where the code to be executed as part of the loop begins.
gen child`i' = (age`i' < 18) if age`i' < .
Most of this is familiar, except for `i'. This is how you use the values of a macro. When Stata sees a name that starts with a left single quote (backtick) and ends with a right single quote (apostrophe) it will replace the name with the corresponding value of the macro before it executes the code. In this case, the forvalues loop defines the macro `i' to contain first the number 1, then the number 2, and so forth all the way up to 16. So what Stata actually runs is:
gen child1 = (age1 < 18) if age1 < .
gen child2 = (age2 < 18) if age2 < .
...
gen child16 = (age16 < 18) if age16 < .
Note that in long form age has no missing values, but in wide form the age variables will have missing values for all the people who don’t exist (e.g. if a household has three people in it age4 and higher will be missing). It’s very important that child also be missing for people who don’t exist.
Finally, } tells Stata this is the end of the code to be executed as part of the loop. It must go on its own line.
Once you have a set of indicator variables for “this person is a child”, you’re ready to create a has_children indicator:
egen has_children = rowmax(child*)
Note how has_children is a single variables, not a set of variables like child1, child2 etc. That’s because child is a person-level variable with one value per person in the household, while has_children is a household-level variables with one value for the entire household.
Subsetting in wide form works the same way. Use a loop to create a set of variables that contain the value of interest for people who are in the subset and missing for people outside it, and then combine using row functions.
Exercise
Use a loop to create a set of person-level indicator variables for “this person is black.” Then create household-level indicator variables for “this household has at least one black member” and “all the members of this household are black.”
Solution
First, remind yourself how the race variables are coded:
labellist race_label
race_label:
1 White
2 Black
3 American Indian
4 Alaska Native
5 Indigenous, tribe unknown
6 Asian
7 Pacific Islander
8 Other
9 Multiracial
So a 2 means the person is black. Next create indicator variables for “this person is black” using a loop:
forvalues i = 1/16 {gen black`i' = (race`i'==2) if race`i' < .}
To wrap up your do file for the section, close the log:
logclose
name: <unnamed>
log: /home/d/dimond/kb/dws/level2_wide.log
log type: text
closed on: 12 Dec 2025, 10:28:38
------------------------------------------------------------------------------------------
If this were real work, you’d save the modified data set first.
5.4 Panel Data
Panel data, or longitudinal data, are data where subjects are observed repeatedly over time and the timing is important. (If the timing isn’t important then we call it repeated measures data.) The National Longitudinal Survey of Youth is an example of panel data, and we’ll use a small extract from it as an example. (Note that this extract combines income variables from different years with slightly different definitions into a single income variable, so you really wouldn’t want to use this extract for actual research.) Create a do file called panel.do that loads it:
------------------------------------------------------------------------------------------
name: <unnamed>
log: /home/d/dimond/kb/dws/panel.txt
log type: smcl
opened on: 12 Dec 2025, 10:28:38
Exercise
As a review, spend some time getting familiar with this data set. In particular:
Identify the identifier variables. What is a level one unit in this data set? What is a level two unit?
Which variables are level one variables? Which are level two variables?
What type of variable is edu?
Why are income and edu are frequently missing for the same observation? What does it tell you about age that it is never missing?
Solution
Start by seeing what variables exist:
describe
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy.dta
Observations: 241,034
Variables: 6 27 Dec 2022 13:11
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
year int %9.0g
year_of_birth float %16.0g
edu float %24.0g edulabel
income float %9.0g
age float %9.0g
------------------------------------------------------------------------------------------
Sorted by: id year
Obviously id is an identifier. But does it uniquerly identify observations?
duplicatesreport id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
19 | 241034 228348
--------------------------------------
No, we need at least one more variable to uniquely identify observations. Recall that identifiers generally contain information, which excludes all but year. So try that:
duplicatesreport id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 241034 0
--------------------------------------
If id and year are the identifiers, then we must have people observed over time (the fact that we’re in a section on panel data is another pretty good clue). Thus a level two unit is a person, and a level one unit is a person’s observation for a particular year.
Someone’s year of birth can never change (if it does it must be a data error) so year_of_birth is a level two variable. A quick look at the data browser will show that all the other variables can and do change over time, and are thus level one variables.
The codebook command will help you identify the type of edu:
codebook edu
------------------------------------------------------------------------------------------
edu (unlabeled)
------------------------------------------------------------------------------------------
Type: Numeric (float)
Label: edulabel
Range: [0,20] Units: 1
Unique values: 21 Missing .: 43,417/241,034
Examples: 12 12TH GRADE
12 12TH GRADE
13 1ST YEAR COLLEGE
17 5TH YEAR COLLEGE
While the labels are similar to the categorical edu variable in the ACS, note how the values themselves represent the number of years of schooling the person has. That makes this a quantitative variable.
The simplest explanation for edu and income frequently being missing for the same observation is that they’re missing because the subject could not be found in that year. Tracking people for multiple years is hard, and almost always some people will be lost to attrition. Be mindful of how attrition will affect your analysis, especially since it’s usually not completely random who is lost. The NLSY is a bit unusual in that every subject has a row for every year, whether the subject could be found in that year or not, but the rows for years the subject could not be found contain missing values. You’ll also see that they kept trying to find people even after losing them for a year, and sometimes succeeded.
The fact that age is not missing tells you they did not ask the subject their age every year: they simply calculated it using year_of_birth.
Most of the techniques you learned for working with individuals in households carry over directly to panel data. For example, to find the total income earned during the study period, run:
by id: egen total_income = total(income)
To find the age of the subject the first time they appear in the study, run:
by id: egen starting_age = min(age)
This relies on the fact that age always increases over time. What if you wanted to know their income the first time they appear in the study? Recall that income[1] means “the value of income for the first observation.” When combined with by it means “the value of income for the first observation in the by group.” This is highly convenient, but you need to make sure that the observations for each subject are in chronological order so their first observation really is the first time they appear in the study:
sort id yearby id: gen startingIncome = income[1]
(109,953 missing values generated)
You need to be careful because Stata’s default sorting algorithm is not stable. This means it will put ties in whatever order will make it run fastest. So if you run sort id, or bysort id:, the observations for each person could be in any order. In practice, if the data are already sorted or mostly sorted the order that will make the sort run fastest is usually to leave things alone. But you can’t count on that. So if you’re going to run code that depends on the sort order, like assuming the first observation is the first chronologically, be sure the data are actually in the right order.
Exercise
Create ending_income, the subject’s income in the last year of the study. Recall that _N is the observation number of the last observation. (Yes, for many people ending_income will be missing.)
Solution
by id: gen ending_income = income[_N]
(97,128 missing values generated)
5.4.1 Events and Event History
Often with panel data you’ll need to identify particular events or sequences of events. For example, suppose you need to identify the year in which each subject graduated from high school. A subject graduated from high school in a given year if they have 12 years of education in that year and less than 12 years of education the year before.
Normally Stata works row by row. If you say:
gen y = x
you’re saying “set the value of y for each observation to the value of x for the same observation.” As it does so, Stata keeps track of which observation it is working on in a system variable called _n. So x[_n] is just the same as x: the value of x for the current observation. But x[_n-1] means “the value of x for the observation before the current observation.” And that’s extremely useful. In particular, you can detect the year a person graduates from high school using the rule “the person has 12 years of education this year and less than 12 years of education the year before with:
by id: gen grad = (edu==12 & edu[_n-1]<12)
To check your work, calculate the number of times this rule identified each person as having graduated from high school:
by id: egen times_grad = total(grad)tab times_grad
Note that the frequencies above do not indicate how many people graduated one or zero times, but how many person-year combinations belong to people who graduated one or zero times. Be careful about summary statistics with hierarchical data!
The zeroes are not surprising: some subjects graduated before the study began, or didn’t graduate at all, or missing data prevented you from identifying the year they graduated. Values of two or more would indicate a problem, either with your code or with the data, so it’s good you don’t have any.
Speaking of missing data, what does the code for creating grad do with it? Missing values are neither 12 nor less than 12, so the condition (edu==12 & edu[_n-1]<12) will be false if edu is missing for either the current observation or the observation before it. So grad==1 can be interpreted as “we know the person graduated this year” but grad==0 could mean either “we know the person didn’t graduate this year” or “we don’t know if the person graduated this year or not because of missing data.”
In the command:
by id: egen times_grad = total(grad)
the by id: prefix tells the total() function which observations it should aggregate when calculating the total: all the ones with the same value of id. But the command:
by id: gen grad = (edu==12 & edu[_n-1]<12)
doesn’t do any aggregating. So why do we need by id: there?
Consider the first observation for the second person. edu[_n-1] is the value of edu for the observation before it. Without by id:, that would be the last value of edu for person 1. That’s not what we want! by id: tells Stata to treat each group defined by id as if it were its own data set. Thus the first observation for each person has _n equal to 1 and there’s no observation before it–_n-1 is zero. Some languages would give you an error like “index out of bounds” if you asked for the value of edu for the zeroeth observation, but Stata just returns a missing value. That means grad is always 0 for the first observation for each person, which makes sense. If someone starts the study with 12 years of education we can’t tell if they just graduated from high school or have had 12 years of education for some time.
Exercise
Identify the year in which the subject started college (if they did). For simplicity, assume everyone starts college in the fall and only starts claiming to have completed a year of college in the following year. (Hint: if someone’s values of education for three years were 11, 12, and 13, they both graduated from high school and started college in the middle year.) Some people in the NLSY report that their education level increased by more than one year in a single year, so make sure your code allows for that.
Check your work by seeing how many times people started college. If you find people who started more than once, determine whether this is due to a problem with your code or with the data.
Tip
Now you’re looking for a year in which the subject’s education is 12 and their education the following year is greater than 12. But remember missing is essentially infinity and thus definitely greater than 12, so this time you have to deal with missing values explicitly:
by id: gen start_college = (edu==12 & (edu[_n+1]>12 & edu[_n+1]<.))
Now check your work:
by id: egen times_started = total(start_college)tab times_started
Now you have people who started college multiple times. Take a look to see why. (Best done in the data browser, but I’ll use list here even though it’s long.)
list id year edu start_college times_started if times_started>1, ab(30)
+----------------------------------------------------------------+
| id year edu start_college times_started |
|----------------------------------------------------------------|
64601. | 3401 1979 11TH GRADE 0 2 |
64602. | 3401 1980 12TH GRADE 0 2 |
64603. | 3401 1981 12TH GRADE 0 2 |
64604. | 3401 1982 12TH GRADE 0 2 |
64605. | 3401 1983 12TH GRADE 0 2 |
|----------------------------------------------------------------|
64606. | 3401 1984 12TH GRADE 0 2 |
64607. | 3401 1985 12TH GRADE 1 2 |
64608. | 3401 1986 1ST YEAR COLLEGE 0 2 |
64609. | 3401 1987 1ST YEAR COLLEGE 0 2 |
64610. | 3401 1988 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
64611. | 3401 1989 1ST YEAR COLLEGE 0 2 |
64612. | 3401 1990 1ST YEAR COLLEGE 0 2 |
64613. | 3401 1991 1ST YEAR COLLEGE 0 2 |
64614. | 3401 1992 1ST YEAR COLLEGE 0 2 |
64615. | 3401 1993 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
64616. | 3401 1994 12TH GRADE 0 2 |
64617. | 3401 1996 12TH GRADE 0 2 |
64618. | 3401 1998 12TH GRADE 1 2 |
64619. | 3401 2000 1ST YEAR COLLEGE 0 2 |
66881. | 3521 1979 9TH GRADE 0 2 |
|----------------------------------------------------------------|
66882. | 3521 1980 10TH GRADE 0 2 |
66883. | 3521 1981 11TH GRADE 0 2 |
66884. | 3521 1982 12TH GRADE 1 2 |
66885. | 3521 1983 1ST YEAR COLLEGE 0 2 |
66886. | 3521 1984 12TH GRADE 1 2 |
|----------------------------------------------------------------|
66887. | 3521 1985 1ST YEAR COLLEGE 0 2 |
66888. | 3521 1986 1ST YEAR COLLEGE 0 2 |
66889. | 3521 1987 2ND YEAR COLLEGE 0 2 |
66890. | 3521 1988 2ND YEAR COLLEGE 0 2 |
66891. | 3521 1989 2ND YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
66892. | 3521 1990 2ND YEAR COLLEGE 0 2 |
66893. | 3521 1991 2ND YEAR COLLEGE 0 2 |
66894. | 3521 1992 2ND YEAR COLLEGE 0 2 |
66895. | 3521 1993 2ND YEAR COLLEGE 0 2 |
66896. | 3521 1994 2ND YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
66897. | 3521 1996 2ND YEAR COLLEGE 0 2 |
66898. | 3521 1998 2ND YEAR COLLEGE 0 2 |
66899. | 3521 2000 1ST YEAR COLLEGE 0 2 |
152077. | 8005 1979 12TH GRADE 1 2 |
152078. | 8005 1980 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152079. | 8005 1981 12TH GRADE 0 2 |
152080. | 8005 1982 12TH GRADE 0 2 |
152081. | 8005 1983 12TH GRADE 1 2 |
152082. | 8005 1984 1ST YEAR COLLEGE 0 2 |
152083. | 8005 1985 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152084. | 8005 1986 1ST YEAR COLLEGE 0 2 |
152085. | 8005 1987 1ST YEAR COLLEGE 0 2 |
152086. | 8005 1988 1ST YEAR COLLEGE 0 2 |
152087. | 8005 1989 1ST YEAR COLLEGE 0 2 |
152088. | 8005 1990 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152089. | 8005 1991 1ST YEAR COLLEGE 0 2 |
152090. | 8005 1992 1ST YEAR COLLEGE 0 2 |
152091. | 8005 1993 1ST YEAR COLLEGE 0 2 |
152092. | 8005 1994 1ST YEAR COLLEGE 0 2 |
152093. | 8005 1996 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152094. | 8005 1998 1ST YEAR COLLEGE 0 2 |
152095. | 8005 2000 1ST YEAR COLLEGE 0 2 |
+----------------------------------------------------------------+
In all three cases the subject reported “1ST YEAR COLLEGE” for their education, then went back to reporting “12TH GRADE” for a while before reporting “1ST YEAR COLLEGE” again. That’s a problem with the data, not your code, though you may need to change your code to deal with it. How to deal with it will depend on what you plan to do with this start_college variable.
5.4.2 Level 2 Variables Based on a Special Level 1 Observation
Sometimes you need to create a level two variable which is just the value of a level one variable for a particular level one unit, such as the subject’s age at the time they graduated from high school. Here’s one way to do that:
sort id gradby id: gen age_at_grad = age[_N] if grad[_N]sort id year
(135,033 missing values generated)
Sorting by id and grad puts the observation in which the subject graduated last, so you can get the subject’s age in that year with age[_N]. However, recall that for many subjects you could not identify the year they graduated. For them, grad is always zero, so which observation is last after the sort is completely arbitrary (remember Stata’s default sort is not stable), and age_at_grad should be missing. The condition if grad[_N] ensures that age_at_grad is not set if the last observation is not a a year in which they graduated.
Don’t forget to put the data back in chronological order with sort id year if that’s how you’re used to using it.
Alternatively you can treat this as a subsetting problem and use the techniques described above for calculations on subsets:
gen age_if_grad = age if gradby id: egen age_at_grad2 = mean(age_if_grad)
Since people graduate at most once, the subset of observations where grad is 1 (true) has just one observation per person. The mean of one number is just that number. Thus we’re really using the mean function to “broadcast” that number to all the observations for the same person.
These methods give exactly the same result, so you can use whichever you prefer. The sorting method will be slower with large data sets, but there are many things you can do with a clever sort and I want you to have that in your bag of tricks.
Exercise
Create a person-level variable for “income at age 25.”
Solution
You have two options here. First, the sorting method:
gen age25 = (age==25)sort id age25by id: gen income25a = income[_N] if age25[_N]sort id year
(22,686 missing values generated)
Second, the subsetting method:
gen income_if_25 = income if age==25by id: egen income25b = mean(income_if_25)