5 Variable Transformation With Hierarchical (Grouped) Data

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is things are observed multiple times (panel/longitudinal data or repeated measures). These structures may seem very different, but the same concepts apply to both and often even the same code.

5.1 Hierarchical Data Concepts

In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data. Having a more-or-less formal way of talking and thinking about hierarchical data makes it much less confusing.

5.1.1 Levels

Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

EXAMPLE DATA STRUCTURE	LEVEL ONE UNIT	LEVEL TWO UNIT
Individuals living in households	Individuals	Households
States grouped into regions	States	Regions
Students in classes	Students	Classes
Tests taken by students	Tests	Students
Monthly observations of individuals	Monthly observations	Individuals
Visits to the doctor by patients	Doctor visits	Patients
Social media posts by individuals	Posts	Individuals

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

EXAMPLE DATA STRUCTURE	LEVEL ONE VARIABLES	LEVEL TWO VARIABLES
Individuals living in households	Age, Sex	Household income, Address
States grouped into regions	State population	Regional income per capita
Students in classes	Student’s race	Teacher’s race, Class size
Tests taken by students	Test Score	Free lunch eligibility
Monthly observations of individuals	Employment Status	Highest degree earned
Visits to the doctor by patients	BMI, Diagnosis	Race
Social media posts by individuals	Length, Topic	Sex

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (observations of the same individual) have different values for it.

5.1.2 Data Forms

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people once a month for three months:

person_id	month	years_edu	employed
1	1	16	True
1	2	16	True
1	3	16	True
2	1	12	False
2	2	12	True
2	3	12	True

Exercise

Identify the level two units, level one units, level two variable(s), and level one variable(s) in the above data set.

Solution

In this data set, a person is a level two unit, a person’s observation for a particular month is a level one unit, years_edu is a level two variable (it never changes for a given person), and employed is a level one variable (it changes for person 2 from month 1 to month 2).

In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

person_id	years_edu	employed1	employed2	employed3
1	16	True	True	True
2	12	False	True	True

In this form, each observation represents a person. Because years_edu is a level two variable, there’s just one value per person and thus one variable. However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables.

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more variables. In most cases the long form is easier to work with, so we’ll do most of our examples in this form.

Now consider the identifiers in this data set. In the long form, person_id and month were a compound identifier for the rows in the data set, while the variable names were a simple column identifier. In the wide form, person_id is a simple row identifier, but now the variable names for the level one variables are compound identifiers with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the person_id, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier.

5.1.3 Level 2 Units that Vary in Size

Suppose that in the previous example, person 2 dropped out of the study after month 2. In the long form, person 2 would typically have just two rows:

person_id	month	years_edu	employed
1	1	16	True
1	2	16	True
1	3	16	True
2	1	12	False
2	2	12	True

(As you’ll see later, some long form data sets would include a third row for person 2 that contains missing values, but this is less common.)

However, a data set cannot have different variables for different rows. So in wide form, person 2 must have an employed3 variable, but it will contain a missing value.

person_id	years_edu	employed1	employed2	employed3
1	16	True	True	True
2	12	False	True	Missing

A wide form data set must have enough columns to store all the data for the biggest level 2 unit in the data set. Our ACS sample has a household with 16 people in it, so in wide form it must have 16 columns for each level 1 variable. But most households do not have even close to 16 people, so you end up with a large number of missing values. Most of those missing values do not mean “We don’t know the value of this observation for this person” but rather “This person does not actually exist.”

5.2 Creating Level Two Variables in Long Form

A very common data wrangling task is creating level two variables based on the level one variables in the data set. If the data set is in long form, one critical tool for doing so is the by prefix, so that commands are executed separately for each level two unit. Another is egen, with its aggregate functions that act across observations.

In the ACS the level one units are people and the level 2 units are households, and in our example file there is one row per person. That makes it a long form data set.

Start a do file called level2.do that loads acs.dta. If you downloaded the example files for this book you can skip the URL.

capture log close
log using level2.log, replace

clear all
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs

-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/level2.log
  log type:  text
 opened on:  12 Dec 2025, 10:28:36

I suggest having the data browser open while you work on the examples so you can see the result of what you’re doing. The book will list the key variables for a few observations, but you can skip the list commands if you look at the data browser instead. (We will create a lot of variables in this chapter. You may want to reload the data every time you reach a new heading just to reduce the clutter in the data browser. The variables created in previous sections will not be needed again.)

I will increase the length of lines in the web book output so I have a bit more room for lists, but you don’t need to do that:

set linesize 90

5.2.1 Continuous Variables

We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. You can calculate it with a combination of by and the egen total() function:

by household: egen household_income = total(income)
list household person income household_income in 1/5, ab(30)


     +------------------------------------------------+
     | household   person   income   household_income |
     |------------------------------------------------|
  1. |        37        1    10000              20000 |
  2. |        37        2     5300              20000 |
  3. |        37        3     4700              20000 |
  4. |       241        1    32500              32500 |
  5. |       242        1    30000              30000 |
     +------------------------------------------------+

Now calculate the income per person. Since this is the same as the mean income for the household, you might think you can use:

by household: egen income_per_person_wrong = mean(income)

In fact this does not give the right answer, as you can see by examining household 484:

list household person income household_income income_per_person_wrong ///
    if household==484, ab(30)


       +--------------------------------------------------------------------------+
       | household   person   income   household_income   income_per_person_wrong |
       |--------------------------------------------------------------------------|
   10. |       484        1    16800              34800                     17400 |
   11. |       484        2    18000              34800                     17400 |
   12. |       484        3        .              34800                     17400 |
   13. |       484        4        .              34800                     17400 |
       +--------------------------------------------------------------------------+

The reason is that young children have missing values for income,which prompts egen to completely ignore them in the calculation. Thus for household 484, inc_per_person_wrong is the total income of the household divided by the number of people who have a non-missing value for income, two, when it should be total income divided by the total number of people in the household, four.

Fortunately, these egen functions can act on mathematical expressions, not just single variables. Since mean(x) = total(x)/N = total(x/N), we can instead use:

by household: egen income_per_person = total(income/_N)
list household person income household_income income_per_person ///
    if household==484, ab(30)


       +--------------------------------------------------------------------+
       | household   person   income   household_income   income_per_person |
       |--------------------------------------------------------------------|
   10. |       484        1    16800              34800                8700 |
   11. |       484        2    18000              34800                8700 |
   12. |       484        3        .              34800                8700 |
   13. |       484        4        .              34800                8700 |
       +--------------------------------------------------------------------+

_N is a system variable, or a variable Stata always keeps track of and you can use even though you never created it. It contains the number of observations in the data set, or, if you are using by, the number of observations in the by group. In this case that’s the number of people in the household. Sometimes that’s useful all by itself:

by household: gen household_size = _N
list household person household_size in 1/5, ab(30)


     +-------------------------------------+
     | household   person   household_size |
     |-------------------------------------|
  1. |        37        1                3 |
  2. |        37        2                3 |
  3. |        37        3                3 |
  4. |       241        1                1 |
  5. |       242        1                1 |
     +-------------------------------------+

Because Stata numbers observations starting from 1, _N is also the observation number of the last observation. That turns out to be useful as well.

Exercise

Create a variable for the mean age of all the individuals in the household. Recall that age is never missing, which makes this straightforward.

Solution

by household: egen mean_age = mean(age)
list household person age mean_age in 1/5


     +------------------------------------+
     | househ~d   person   age   mean_age |
     |------------------------------------|
  1. |       37        1    20   19.33333 |
  2. |       37        2    19   19.33333 |
  3. |       37        3    19   19.33333 |
  4. |      241        1    50         50 |
  5. |      242        1    29         29 |
     +------------------------------------+

5.2.2 Subsetting with egen

Sometimes the way egen ignores missing values can be useful. Consider trying to calculate the total income earned by the children of a household. You might think you could run:

by household: egen child_income_wrong = total(income) if age<18

(20,118 missing values generated)

But examine household 8787:

list household person age income child_income_wrong if household==8787, ab(30)


       +--------------------------------------------------------+
       | household   person   age   income   child_income_wrong |
       |--------------------------------------------------------|
  179. |      8787        1    45    38000                    . |
  180. |      8787        2    16     3600                 3800 |
  181. |      8787        3    15      200                 3800 |
       +--------------------------------------------------------+

child_income_wrong has the right number, but it’s not a proper household-level variable in that the parent has a missing value instead. That makes it unusable if, for example, you wanted to talk about the effect of children’s contribution to the household on the parent. The reason is that egen used the condition if age<18 for two different purposes, one that you want and one that you don’t. The first is that in calculating the total it only includes observations where age is less than 18, which is exactly what you want. The second is that it only stores the result in observations where age is less than 18, leaving the other observations with missing values. This is not what you want.

One solution is to first create a variable income_if_child containing only the values of income that you want to be included in the total (i.e. the children’s incomes). Observations that should not be included (i.e. the adults) get a missing value. Then you can add up all the values of that variable and store the result in all the observations. egen ignore the missing values, giving you exactly what you want:

gen income_if_child = income if age<18
by household: egen child_income = total(income_if_child)

(26,262 missing values generated)

list household person age income income_if_child child_income if household==8787, ab(30)


       +--------------------------------------------------------------------+
       | household   person   age   income   income_if_child   child_income |
       |--------------------------------------------------------------------|
  179. |      8787        1    45    38000                 .           3800 |
  180. |      8787        2    16     3600              3600           3800 |
  181. |      8787        3    15      200               200           3800 |
       +--------------------------------------------------------------------+

Now child_income is a proper household-level variable, with the same value for all the members of the household. income_if_child is no longer needed can be dropped at your convenience.

Exercise

Create a level two variable called mean_adult_age containing the mean age of the adults in the household. Make sure it has the same value for all the members of the household, including any children.

Solution

Now the subset of interest is the adults:

gen age_if_adult = age if age>=18
by household: egen mean_adult_age = mean(age_if_adult)
list household person age age_if_adult mean_adult_age if household==8787, ab(30)

(7,292 missing values generated)
(3 missing values generated)

       +----------------------------------------------------------+
       | household   person   age   age_if_adult   mean_adult_age |
       |----------------------------------------------------------|
  179. |      8787        1    45             45               45 |
  180. |      8787        2    16              .               45 |
  181. |      8787        3    15              .               45 |
       +----------------------------------------------------------+

5.2.3 Indicator Variables

When an indicator variable is coded one/zero, the egen functions used above take on new and very useful meanings. To see this in action, begin by creating an indicator called child which is 1 for people under 18 and 0 for everyone else:

gen child = (age<18)

The total of child, calculated by household, is the number of children in the household:

by household: egen num_children = total(child)
list household person age child num_children if household==484, ab(30)


       +-------------------------------------------------+
       | household   person   age   child   num_children |
       |-------------------------------------------------|
   10. |       484        1    33       0              2 |
   11. |       484        2    26       0              2 |
   12. |       484        3     4       1              2 |
   13. |       484        4     2       1              2 |
       +-------------------------------------------------+

More broadly, the total of an indicator variable is the number of observations for which the indicator is one. If the indicator variable corresponds to a condition (like “this person is a child”) then it is the number of observations for which the condition is true.

The mean of an indicator variable is the proportion of observations for which the indicator is one. Create a variable containing the proportion of each household that is below 18 with:

by household: egen prop_children = mean(child)
list household person age child prop_children if household==484, ab(30)


       +--------------------------------------------------+
       | household   person   age   child   prop_children |
       |--------------------------------------------------|
   10. |       484        1    33       0              .5 |
   11. |       484        2    26       0              .5 |
   12. |       484        3     4       1              .5 |
   13. |       484        4     2       1              .5 |
       +--------------------------------------------------+

Next consider the maximum value of an indicator variable:

by household: egen has_children = max(child)
list household person age child has_children if inlist(household, 37, 484), ab(30)


       +-------------------------------------------------+
       | household   person   age   child   has_children |
       |-------------------------------------------------|
    1. |        37        1    20       0              0 |
    2. |        37        2    19       0              0 |
    3. |        37        3    19       0              0 |
   10. |       484        1    33       0              1 |
   11. |       484        2    26       0              1 |
       |-------------------------------------------------|
   12. |       484        3     4       1              1 |
   13. |       484        4     2       1              1 |
       +-------------------------------------------------+

If a household has no children in it, then child is always zero and the maximum value is zero. If a household has any children in it, then those children have a one for child and the maximum value of child is one. Thus has_children is a household-level indicator variable for “this household has children in it.”

More generally, applying the max() function to an indicator variable creates a new indicator variable which is one for all observations (or all observations within a by group) if the original indicator variable is one for any observation. If the indicator variable corresponds to a condition, then the result is true for all observations if the condition is true for any observation.

You can use min() in a similar way: the result will be one if the indicator you apply it to is one for all observations, but the result will be zero if any observation has a zero. Use that to create an indicator variable for “all the people in this household are children”:

by household: egen all_children = min(child)
list household person age child all_children if household==484, ab(30)


       +-------------------------------------------------+
       | household   person   age   child   all_children |
       |-------------------------------------------------|
   10. |       484        1    33       0              0 |
   11. |       484        2    26       0              0 |
   12. |       484        3     4       1              0 |
   13. |       484        4     2       1              0 |
       +-------------------------------------------------+

Surprisingly enough, there are two households apparently consisting entirely of children:

list household person age child all_children if inlist(household, 73731, 80220), ab(30)


       +-------------------------------------------------+
       | household   person   age   child   all_children |
       |-------------------------------------------------|
 1490. |     73731        1    15       1              1 |
 1491. |     73731        2    15       1              1 |
 1623. |     80220        1    17       1              1 |
       +-------------------------------------------------+

All of these functions can act on an expression rather than a variable, including conditions. Thus we could have used total(age<18), max(age<18), etc. rather than creating the indicator variable child, and we’d have gotten the same results.

This data set has no missing values for age, but it’s worth thinking about what this code would do with them. The condition age<18 returns a zero for observations where age is missing (recall that you can think of missing as infinity), as if the person were known to be an adult. Thus in the presence of missing values num_children isn’t really the number of children in the household, it’s the number of people known to be children. That may or may not be what you want.

Exercise

Create variables containing the number of college graduates in each household (Associate’s degree or above) and an indicator variable for “this household contains at least one college graduate.” Then create an indicator variable for “all the adults in this household are college graduates.” Remember that edu has missing values, but since the people with missing values of edu are all less than three years old you can safely assume they’re not college graduates.

Solution

First remind yourself of the coding scheme for edu:

label list edu_label

edu_label:
           0 Not in universe
           1 None
           2 Nursery school-4th grade
           3 5th-6th grade
           4 7th-8th grade
           5 9th grade
           6 10th grade
           7 11th grade
           8 12th grade, no diploma
           9 High School graduate
          10 Some college, <1 year
          11 Some college, >=1 year
          12 Associate degree
          13 Bachelor's degree
          14 Master's degree
          15 Professional degree
          16 Doctorate degree

Thus someone is a college graduate if edu is greater than 11 but not missing:

gen grad = (edu>11 & edu<.)

You could also identify college graduates with inlist(edu, 12, 13, 14, 15, 16).

Now you’re ready to create the straightforward household-level variables:

by household: egen num_grads = total(grad)
by household: egen has_grad = max(grad)
list household person edu num_grads has_grad if household==4398, ab(30)


       +-----------------------------------------------------------------+
       | household   person                   edu   num_grads   has_grad |
       |-----------------------------------------------------------------|
   93. |      4398        1   Professional degree           2          1 |
   94. |      4398        2       Master's degree           2          1 |
   95. |      4398        3                  None           2          1 |
   96. |      4398        4                  None           2          1 |
   97. |      4398        5                     .           2          1 |
       +-----------------------------------------------------------------+

“All the adults in the household are college graduates” is a little trickier. One approach is to treat this as a subsetting problem:

gen grad_adult = grad if age>=18
by household: egen all_adults_grad1 = min(grad_adult)
list household person edu age grad_adult all_adults_grad1 if household==4398, ab(30)

(7,292 missing values generated)
(3 missing values generated)

       +--------------------------------------------------------------------------------+
       | household   person                   edu   age   grad_adult   all_adults_grad1 |
       |--------------------------------------------------------------------------------|
   93. |      4398        1   Professional degree    33            1                  1 |
   94. |      4398        2       Master's degree    34            1                  1 |
   95. |      4398        3                  None     5            .                  1 |
   96. |      4398        4                  None     3            .                  1 |
   97. |      4398        5                     .     0            .                  1 |
       +--------------------------------------------------------------------------------+

An alternative depends on combining conditions. “All the adults in the household are college graduates” requires that every individual be either a child (age<18) or a college graduate (grad). The min() function can act on that combination of conditions directly:

by household: egen all_adults_grad2 = min(age<18 | grad)
list household person age edu all_adults_grad2 if household==4398, ab(30)


       +-------------------------------------------------------------------+
       | household   person   age                   edu   all_adults_grad2 |
       |-------------------------------------------------------------------|
   93. |      4398        1    33   Professional degree                  1 |
   94. |      4398        2    34       Master's degree                  1 |
   95. |      4398        3     5                  None                  1 |
   96. |      4398        4     3                  None                  1 |
   97. |      4398        5     0                     .                  1 |
       +-------------------------------------------------------------------+

To wrap up your do file, close the log:

log close

      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/level2.log
  log type:  text
 closed on:  12 Dec 2025, 10:28:37
------------------------------------------------------------------------------------------

If this were real work, you would save the modified data set first.

5.3 Creating Level Two Variables in Wide Form

Start a new do file, level2_wide.do that loads acs_wide.dta:

capture log close
log using level2_wide.log, replace
clear all
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_wide

------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/level2_wide.log
  log type:  text
 opened on:  12 Dec 2025, 10:28:37

This is the exact same data set as before, but in wide form. Now there is just one row for each household, but it contains all the data about all the individuals in that household. The income1 variable contains the income of the first person in the household, income2 the second, etc.

There is one household in this ACS sample with 16 people in it, so every level one variable must have 16 columns (income1 through income16, for example). However, most households have fewer people. If a household only has three people, then for that household income1 through income3 will have values and all the other income variables will be missing.

list household income1 income2 income3 income4 income5 if household==37, ab(30)


       +-------------------------------------------------------------+
       | household   income1   income2   income3   income4   income5 |
       |-------------------------------------------------------------|
    1. |        37     10000      5300      4700         .         . |
       +-------------------------------------------------------------+

I only included income1 through income5 in this list partially so the output wouldn’t be too long, but partially because typing long lists of variables is tiresome. Unfortunately, working with wide form data frequently involves long lists of variables. Fortunately, Stata includes some easy shortcuts so you can specify lists of variables without having to type them all out. So let’s take a brief digression to learn about varlist syntax.

5.3.1 Shortcuts for Variable Lists

The most common shortcut is to use the asterisk (*) as a wildcard character. Try:

describe income*


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------
income1         long    %10.0g                1 income
income2         long    %10.0g                2 income
income3         long    %10.0g                3 income
income4         long    %10.0g                4 income
income5         long    %10.0g                5 income
income6         long    %10.0g                6 income
income7         long    %10.0g                7 income
income8         long    %10.0g                8 income
income9         long    %10.0g                9 income
income10        long    %10.0g                10 income
income11        long    %10.0g                11 income
income12        long    %10.0g                12 income
income13        long    %10.0g                13 income
income14        long    %10.0g                14 income
income15        long    %10.0g                15 income
income16        long    %10.0g                16 income

This tells the describe command to act on all variables that match the pattern “income followed by anything.” Note that “anything” can include nothing, so a variable just called income would be included as well. The wildcard can go anywhere:

describe *1


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------
age1            byte    %10.0g                1 age
race1           byte    %25.0g     race_label
                                              1 race
marital_status1 byte    %13.0g     marital_status_label
                                              1 marital_status
edu1            byte    %24.0g     edu_label
                                              1 edu
income1         long    %10.0g                1 income
female1         float   %9.0g                 1 female
hispanic1       float   %9.0g                 1 hispanic
age11           byte    %10.0g                11 age
race11          byte    %25.0g     race_label
                                              11 race
marital_stat~11 byte    %13.0g     marital_status_label
                                              11 marital_status
edu11           byte    %24.0g     edu_label
                                              11 edu
income11        long    %10.0g                11 income
female11        float   %9.0g                 11 female
hispanic11      float   %9.0g                 11 hispanic

This matches all the variables with information about the first individual in the household, but also the variables with information about the eleventh individual. Be careful your wildcards don’t match more than what you want!

A question mark (?) is also a wildcard, but it matches exactly one character:

describe income?


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------
income1         long    %10.0g                1 income
income2         long    %10.0g                2 income
income3         long    %10.0g                3 income
income4         long    %10.0g                4 income
income5         long    %10.0g                5 income
income6         long    %10.0g                6 income
income7         long    %10.0g                7 income
income8         long    %10.0g                8 income
income9         long    %10.0g                9 income

This matches income1 through income9, but not income10 because it is income followed by two characters. It also would not match just income.

Another shortcut is to put a dash (-) between two variables. This will give you all the variables in between them:

describe age1-hispanic1


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------
age1            byte    %10.0g                1 age
race1           byte    %25.0g     race_label
                                              1 race
marital_status1 byte    %13.0g     marital_status_label
                                              1 marital_status
edu1            byte    %24.0g     edu_label
                                              1 edu
income1         long    %10.0g                1 income
female1         float   %9.0g                 1 female
hispanic1       float   %9.0g                 1 hispanic

This gives you just the variables with information about the first individual. The order used in resolving this shortcut is the order in which the variables are listed in the variables window or a describe command. You can use the order command to change the order of the variables to something more convenient.

A varlist can mix multiple kinds of shortcuts as well as individual variable names:

describe household age1-female1 income*


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------
household       double  %8.0g                 Household serial number
age1            byte    %10.0g                1 age
race1           byte    %25.0g     race_label
                                              1 race
marital_status1 byte    %13.0g     marital_status_label
                                              1 marital_status
edu1            byte    %24.0g     edu_label
                                              1 edu
income1         long    %10.0g                1 income
female1         float   %9.0g                 1 female
income1         long    %10.0g                1 income
income2         long    %10.0g                2 income
income3         long    %10.0g                3 income
income4         long    %10.0g                4 income
income5         long    %10.0g                5 income
income6         long    %10.0g                6 income
income7         long    %10.0g                7 income
income8         long    %10.0g                8 income
income9         long    %10.0g                9 income
income10        long    %10.0g                10 income
income11        long    %10.0g                11 income
income12        long    %10.0g                12 income
income13        long    %10.0g                13 income
income14        long    %10.0g                14 income
income15        long    %10.0g                15 income
income16        long    %10.0g                16 income

5.3.2 Using Row Functions

Now consider calculating the household income. To calculate the household income in long form, we used:

by household: egen household_income = total(income)

This told Stata to add up all the values in the income column, grouping all the observations in each household.

In wide form, we don’t need by household: because each household is just one observation. But the individual incomes are not in a single column, they’re in the columns income1 through income16. This is a job for the egen function rowtotal(). The rowtotal() function takes a varlist as input, and adds up the value of all those variables across a single row. The varlist income* is a convenient way to tell it to add up all the income variables.

egen household_income = rowtotal(income*)
list household household_income income1 income2 income3 income4 if household==37, ab(30)


       +----------------------------------------------------------------------+
       | household   household_income   income1   income2   income3   income4 |
       |----------------------------------------------------------------------|
    1. |        37              20000     10000      5300      4700         . |
       +----------------------------------------------------------------------+

The egen functions we used in the section on working with long form data all have row equivalents: rowtotal(), rowmean(), rowmax(), and rowmin(). The tasks we carried out in long form can easily be done in wide form with just three changes:

Remove the by prefix.
Replace the egen function with its row equivalent.
Replace the variable to act on with a variable list matching all the corresponding level one variables.

Exercise

Crete a variable containing the mean age of each household.

Solution

egen mean_age = rowmean(age*)
list household age1 age2 age3 mean_age if household==37, ab(30)


       +-------------------------------------------+
       | household   age1   age2   age3   mean_age |
       |-------------------------------------------|
    1. |        37     20     19     19   19.33333 |
       +-------------------------------------------+

Now consider creating an indicator for “this household has children.” In long form you could do that with:

by household: egen has_children = max(age<18)

One big difference between row functions and their column equivalents is that row functions can’t act on expressions like age<18, only variables. But creating a single indicator variable for child won’t work either: we need child1 (person one is a child), child2 (person 2 is a child), etc. Typing out 16 gen commands would be even more tiresome than typing out 16 variable names, so it’s time for another digression.

5.3.3 Creating Level 1 Variables in Wide Form

In this section we’ll demonstrate very briefly how to write loops in Stata. For a broader and deeper introduction to looping, see Stata Macros and Loops. Definitely read it if you find this brief introduction confusing.

In long form, we ran:

gen child = (age < 18)

to create an indicator for “this person is a child.” The wide form equivalent would be:

gen child1 = (age1<18) if age1<.
gen child2 = (age2<18) if age2<.
gen child3 = (age3<18) if age3<.
...
gen child16 = (age16<18) if age16<.

Fortunately, the following loop will do that much more easily:

forvalues i = 1/16 {
    gen child`i' = (age`i' < 18) if age`i' < .
}

(2,779 missing values generated)
(6,191 missing values generated)
(7,876 missing values generated)
(9,355 missing values generated)
(10,115 missing values generated)
(10,388 missing values generated)
(10,484 missing values generated)
(10,531 missing values generated)
(10,544 missing values generated)
(10,554 missing values generated)
(10,557 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)

Consider each line in turn:

forvalues i = 1/16 {

forvalues tells Stata we’re going to carry out the following code for a set of values, meaning execute the code once for each value. 1/16 defines the set of values as the integers 1 through 16. (Stata knows from context that we’re not doing division here.) Each value is stored in the macro i. Finally, the { tells Stata this is where the code to be executed as part of the loop begins.

gen child`i' = (age`i' < 18) if age`i' < .

Most of this is familiar, except for `i'. This is how you use the values of a macro. When Stata sees a name that starts with a left single quote (backtick) and ends with a right single quote (apostrophe) it will replace the name with the corresponding value of the macro before it executes the code. In this case, the forvalues loop defines the macro `i' to contain first the number 1, then the number 2, and so forth all the way up to 16. So what Stata actually runs is:

gen child1 = (age1 < 18) if age1 < .
gen child2 = (age2 < 18) if age2 < .
...
gen child16 = (age16 < 18) if age16 < .

Note that in long form age has no missing values, but in wide form the age variables will have missing values for all the people who don’t exist (e.g. if a household has three people in it age4 and higher will be missing). It’s very important that child also be missing for people who don’t exist.

Finally, } tells Stata this is the end of the code to be executed as part of the loop. It must go on its own line.

Once you have a set of indicator variables for “this person is a child”, you’re ready to create a has_children indicator:

egen has_children = rowmax(child*)

Note how has_children is a single variables, not a set of variables like child1, child2 etc. That’s because child is a person-level variable with one value per person in the household, while has_children is a household-level variables with one value for the entire household.

Subsetting in wide form works the same way. Use a loop to create a set of variables that contain the value of interest for people who are in the subset and missing for people outside it, and then combine using row functions.

Exercise

Use a loop to create a set of person-level indicator variables for “this person is black.” Then create household-level indicator variables for “this household has at least one black member” and “all the members of this household are black.”

Solution

First, remind yourself how the race variables are coded:

label list race_label

race_label:
           1 White
           2 Black
           3 American Indian
           4 Alaska Native
           5 Indigenous, tribe unknown
           6 Asian
           7 Pacific Islander
           8 Other
           9 Multiracial

So a 2 means the person is black. Next create indicator variables for “this person is black” using a loop:

forvalues i = 1/16 {
    gen black`i' = (race`i'==2) if race`i' < .
}

(2,779 missing values generated)
(6,191 missing values generated)
(7,876 missing values generated)
(9,355 missing values generated)
(10,115 missing values generated)
(10,388 missing values generated)
(10,484 missing values generated)
(10,531 missing values generated)
(10,544 missing values generated)
(10,554 missing values generated)
(10,557 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)

Now you can use them to create the household-level indicator variables:

egen has_black_member = rowmax(black*)
egen all_black = rowmin(black*)

list household race1 race2 race3 race4 has_black_member all_black if inlist(household, 37, 465), ab(30)


       +--------------------------------------------------------------------------+
       | household   race1   race2   race3   race4   has_black_member   all_black |
       |--------------------------------------------------------------------------|
    1. |        37   White   White   Black       .                  1           0 |
    6. |       465   Black   Black       .       .                  1           1 |
       +--------------------------------------------------------------------------+

Exercise

Create child_income, the total income of all the children in each household.

Solution

forvalues i = 1/16 {
    gen income_if_child`i' = income`i' if age`i'<18
}

egen child_income = rowtotal(income_if_child*)

(10,559 missing values generated)
(10,383 missing values generated)
(9,981 missing values generated)
(10,304 missing values generated)
(10,495 missing values generated)
(10,538 missing values generated)
(10,555 missing values generated)
(10,561 missing values generated)
(10,562 missing values generated)
(10,564 missing values generated)
(10,565 missing values generated)
(10,565 missing values generated)
(10,565 missing values generated)
(10,565 missing values generated)
(10,565 missing values generated)
(10,565 missing values generated)

list household age1 income1 age2 income2 age3 income3 child_income if household==8787, ab(30)


       +-----------------------------------------------------------------------------+
       | household   age1   income1   age2   income2   age3   income3   child_income |
       |-----------------------------------------------------------------------------|
   71. |      8787     45     38000     16      3600     15       200           3800 |
       +-----------------------------------------------------------------------------+

To wrap up your do file for the section, close the log:

log close

      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/level2_wide.log
  log type:  text
 closed on:  12 Dec 2025, 10:28:38
------------------------------------------------------------------------------------------

If this were real work, you’d save the modified data set first.

5.4 Panel Data

Panel data, or longitudinal data, are data where subjects are observed repeatedly over time and the timing is important. (If the timing isn’t important then we call it repeated measures data.) The National Longitudinal Survey of Youth is an example of panel data, and we’ll use a small extract from it as an example. (Note that this extract combines income variables from different years with slightly different definitions into a single income variable, so you really wouldn’t want to use this extract for actual research.) Create a do file called panel.do that loads it:

capture log close
log using panel.txt, replace
clear all
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy

------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/panel.txt
  log type:  smcl
 opened on:  12 Dec 2025, 10:28:38

Exercise

As a review, spend some time getting familiar with this data set. In particular:

Identify the identifier variables. What is a level one unit in this data set? What is a level two unit?
Which variables are level one variables? Which are level two variables?
What type of variable is edu?
Why are income and edu are frequently missing for the same observation? What does it tell you about age that it is never missing?

Solution

Start by seeing what variables exist:

describe


Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy.dta
 Observations:       241,034                  
    Variables:             6                  27 Dec 2022 13:11
------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------
id              float   %9.0g                 ID# (1-12686) 79
year            int     %9.0g                 
year_of_birth   float   %16.0g                
edu             float   %24.0g     edulabel   
income          float   %9.0g                 
age             float   %9.0g                 
------------------------------------------------------------------------------------------
Sorted by: id  year

Obviously id is an identifier. But does it uniquerly identify observations?

duplicates report id


Duplicates in terms of id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
       19 |       241034        228348
--------------------------------------

No, we need at least one more variable to uniquely identify observations. Recall that identifiers generally contain information, which excludes all but year. So try that:

duplicates report id year


Duplicates in terms of id year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |       241034             0
--------------------------------------

If id and year are the identifiers, then we must have people observed over time (the fact that we’re in a section on panel data is another pretty good clue). Thus a level two unit is a person, and a level one unit is a person’s observation for a particular year.

Someone’s year of birth can never change (if it does it must be a data error) so year_of_birth is a level two variable. A quick look at the data browser will show that all the other variables can and do change over time, and are thus level one variables.

The codebook command will help you identify the type of edu:

codebook edu


------------------------------------------------------------------------------------------
edu                                                                            (unlabeled)
------------------------------------------------------------------------------------------

                  Type: Numeric (float)
                 Label: edulabel

                 Range: [0,20]                        Units: 1
         Unique values: 21                        Missing .: 43,417/241,034

              Examples: 12    12TH GRADE
                        12    12TH GRADE
                        13    1ST YEAR COLLEGE
                        17    5TH YEAR COLLEGE

While the labels are similar to the categorical edu variable in the ACS, note how the values themselves represent the number of years of schooling the person has. That makes this a quantitative variable.

The simplest explanation for edu and income frequently being missing for the same observation is that they’re missing because the subject could not be found in that year. Tracking people for multiple years is hard, and almost always some people will be lost to attrition. Be mindful of how attrition will affect your analysis, especially since it’s usually not completely random who is lost. The NLSY is a bit unusual in that every subject has a row for every year, whether the subject could be found in that year or not, but the rows for years the subject could not be found contain missing values. You’ll also see that they kept trying to find people even after losing them for a year, and sometimes succeeded.

The fact that age is not missing tells you they did not ask the subject their age every year: they simply calculated it using year_of_birth.

Most of the techniques you learned for working with individuals in households carry over directly to panel data. For example, to find the total income earned during the study period, run:

by id: egen total_income = total(income)

To find the age of the subject the first time they appear in the study, run:

by id: egen starting_age = min(age)

This relies on the fact that age always increases over time. What if you wanted to know their income the first time they appear in the study? Recall that income[1] means “the value of income for the first observation.” When combined with by it means “the value of income for the first observation in the by group.” This is highly convenient, but you need to make sure that the observations for each subject are in chronological order so their first observation really is the first time they appear in the study:

sort id year
by id: gen startingIncome = income[1]

(109,953 missing values generated)

You need to be careful because Stata’s default sorting algorithm is not stable. This means it will put ties in whatever order will make it run fastest. So if you run sort id, or bysort id:, the observations for each person could be in any order. In practice, if the data are already sorted or mostly sorted the order that will make the sort run fastest is usually to leave things alone. But you can’t count on that. So if you’re going to run code that depends on the sort order, like assuming the first observation is the first chronologically, be sure the data are actually in the right order.

Exercise

Create ending_income, the subject’s income in the last year of the study. Recall that _N is the observation number of the last observation. (Yes, for many people ending_income will be missing.)

Solution

by id: gen ending_income = income[_N]

(97,128 missing values generated)

5.4.1 Events and Event History

Often with panel data you’ll need to identify particular events or sequences of events. For example, suppose you need to identify the year in which each subject graduated from high school. A subject graduated from high school in a given year if they have 12 years of education in that year and less than 12 years of education the year before.

Normally Stata works row by row. If you say:

gen y = x

you’re saying “set the value of y for each observation to the value of x for the same observation.” As it does so, Stata keeps track of which observation it is working on in a system variable called _n. So x[_n] is just the same as x: the value of x for the current observation. But x[_n-1] means “the value of x for the observation before the current observation.” And that’s extremely useful. In particular, you can detect the year a person graduates from high school using the rule “the person has 12 years of education this year and less than 12 years of education the year before with:

by id: gen grad = (edu==12 & edu[_n-1]<12)

To check your work, calculate the number of times this rule identified each person as having graduated from high school:

by id: egen times_grad = total(grad)
tab times_grad


 times_grad |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    135,033       56.02       56.02
          1 |    106,001       43.98      100.00
------------+-----------------------------------
      Total |    241,034      100.00

Note that the frequencies above do not indicate how many people graduated one or zero times, but how many person-year combinations belong to people who graduated one or zero times. Be careful about summary statistics with hierarchical data!

The zeroes are not surprising: some subjects graduated before the study began, or didn’t graduate at all, or missing data prevented you from identifying the year they graduated. Values of two or more would indicate a problem, either with your code or with the data, so it’s good you don’t have any.

Speaking of missing data, what does the code for creating grad do with it? Missing values are neither 12 nor less than 12, so the condition (edu==12 & edu[_n-1]<12) will be false if edu is missing for either the current observation or the observation before it. So grad==1 can be interpreted as “we know the person graduated this year” but grad==0 could mean either “we know the person didn’t graduate this year” or “we don’t know if the person graduated this year or not because of missing data.”

In the command:

by id: egen times_grad = total(grad)

the by id: prefix tells the total() function which observations it should aggregate when calculating the total: all the ones with the same value of id. But the command:

by id: gen grad = (edu==12 & edu[_n-1]<12)

doesn’t do any aggregating. So why do we need by id: there?

Consider the first observation for the second person. edu[_n-1] is the value of edu for the observation before it. Without by id:, that would be the last value of edu for person 1. That’s not what we want! by id: tells Stata to treat each group defined by id as if it were its own data set. Thus the first observation for each person has _n equal to 1 and there’s no observation before it–_n-1 is zero. Some languages would give you an error like “index out of bounds” if you asked for the value of edu for the zeroeth observation, but Stata just returns a missing value. That means grad is always 0 for the first observation for each person, which makes sense. If someone starts the study with 12 years of education we can’t tell if they just graduated from high school or have had 12 years of education for some time.

Exercise

Identify the year in which the subject started college (if they did). For simplicity, assume everyone starts college in the fall and only starts claiming to have completed a year of college in the following year. (Hint: if someone’s values of education for three years were 11, 12, and 13, they both graduated from high school and started college in the middle year.) Some people in the NLSY report that their education level increased by more than one year in a single year, so make sure your code allows for that.

Check your work by seeing how many times people started college. If you find people who started more than once, determine whether this is due to a problem with your code or with the data.

Tip

Now you’re looking for a year in which the subject’s education is 12 and their education the following year is greater than 12. But remember missing is essentially infinity and thus definitely greater than 12, so this time you have to deal with missing values explicitly:

by id: gen start_college = (edu==12 & (edu[_n+1]>12 & edu[_n+1]<.))

Now check your work:

by id: egen times_started = total(start_college)
tab times_started


times_start |
         ed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    172,349       71.50       71.50
          1 |     68,628       28.47       99.98
          2 |         57        0.02      100.00
------------+-----------------------------------
      Total |    241,034      100.00

Now you have people who started college multiple times. Take a look to see why. (Best done in the data browser, but I’ll use list here even though it’s long.)

list id year edu start_college times_started if times_started>1, ab(30)


        +----------------------------------------------------------------+
        |   id   year                edu   start_college   times_started |
        |----------------------------------------------------------------|
 64601. | 3401   1979         11TH GRADE               0               2 |
 64602. | 3401   1980         12TH GRADE               0               2 |
 64603. | 3401   1981         12TH GRADE               0               2 |
 64604. | 3401   1982         12TH GRADE               0               2 |
 64605. | 3401   1983         12TH GRADE               0               2 |
        |----------------------------------------------------------------|
 64606. | 3401   1984         12TH GRADE               0               2 |
 64607. | 3401   1985         12TH GRADE               1               2 |
 64608. | 3401   1986   1ST YEAR COLLEGE               0               2 |
 64609. | 3401   1987   1ST YEAR COLLEGE               0               2 |
 64610. | 3401   1988   1ST YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
 64611. | 3401   1989   1ST YEAR COLLEGE               0               2 |
 64612. | 3401   1990   1ST YEAR COLLEGE               0               2 |
 64613. | 3401   1991   1ST YEAR COLLEGE               0               2 |
 64614. | 3401   1992   1ST YEAR COLLEGE               0               2 |
 64615. | 3401   1993   1ST YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
 64616. | 3401   1994         12TH GRADE               0               2 |
 64617. | 3401   1996         12TH GRADE               0               2 |
 64618. | 3401   1998         12TH GRADE               1               2 |
 64619. | 3401   2000   1ST YEAR COLLEGE               0               2 |
 66881. | 3521   1979          9TH GRADE               0               2 |
        |----------------------------------------------------------------|
 66882. | 3521   1980         10TH GRADE               0               2 |
 66883. | 3521   1981         11TH GRADE               0               2 |
 66884. | 3521   1982         12TH GRADE               1               2 |
 66885. | 3521   1983   1ST YEAR COLLEGE               0               2 |
 66886. | 3521   1984         12TH GRADE               1               2 |
        |----------------------------------------------------------------|
 66887. | 3521   1985   1ST YEAR COLLEGE               0               2 |
 66888. | 3521   1986   1ST YEAR COLLEGE               0               2 |
 66889. | 3521   1987   2ND YEAR COLLEGE               0               2 |
 66890. | 3521   1988   2ND YEAR COLLEGE               0               2 |
 66891. | 3521   1989   2ND YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
 66892. | 3521   1990   2ND YEAR COLLEGE               0               2 |
 66893. | 3521   1991   2ND YEAR COLLEGE               0               2 |
 66894. | 3521   1992   2ND YEAR COLLEGE               0               2 |
 66895. | 3521   1993   2ND YEAR COLLEGE               0               2 |
 66896. | 3521   1994   2ND YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
 66897. | 3521   1996   2ND YEAR COLLEGE               0               2 |
 66898. | 3521   1998   2ND YEAR COLLEGE               0               2 |
 66899. | 3521   2000   1ST YEAR COLLEGE               0               2 |
152077. | 8005   1979         12TH GRADE               1               2 |
152078. | 8005   1980   1ST YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
152079. | 8005   1981         12TH GRADE               0               2 |
152080. | 8005   1982         12TH GRADE               0               2 |
152081. | 8005   1983         12TH GRADE               1               2 |
152082. | 8005   1984   1ST YEAR COLLEGE               0               2 |
152083. | 8005   1985   1ST YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
152084. | 8005   1986   1ST YEAR COLLEGE               0               2 |
152085. | 8005   1987   1ST YEAR COLLEGE               0               2 |
152086. | 8005   1988   1ST YEAR COLLEGE               0               2 |
152087. | 8005   1989   1ST YEAR COLLEGE               0               2 |
152088. | 8005   1990   1ST YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
152089. | 8005   1991   1ST YEAR COLLEGE               0               2 |
152090. | 8005   1992   1ST YEAR COLLEGE               0               2 |
152091. | 8005   1993   1ST YEAR COLLEGE               0               2 |
152092. | 8005   1994   1ST YEAR COLLEGE               0               2 |
152093. | 8005   1996   1ST YEAR COLLEGE               0               2 |
        |----------------------------------------------------------------|
152094. | 8005   1998   1ST YEAR COLLEGE               0               2 |
152095. | 8005   2000   1ST YEAR COLLEGE               0               2 |
        +----------------------------------------------------------------+

In all three cases the subject reported “1ST YEAR COLLEGE” for their education, then went back to reporting “12TH GRADE” for a while before reporting “1ST YEAR COLLEGE” again. That’s a problem with the data, not your code, though you may need to change your code to deal with it. How to deal with it will depend on what you plan to do with this start_college variable.

5.4.2 Level 2 Variables Based on a Special Level 1 Observation

Sometimes you need to create a level two variable which is just the value of a level one variable for a particular level one unit, such as the subject’s age at the time they graduated from high school. Here’s one way to do that:

sort id grad
by id: gen age_at_grad = age[_N] if grad[_N]
sort id year

(135,033 missing values generated)

Sorting by id and grad puts the observation in which the subject graduated last, so you can get the subject’s age in that year with age[_N]. However, recall that for many subjects you could not identify the year they graduated. For them, grad is always zero, so which observation is last after the sort is completely arbitrary (remember Stata’s default sort is not stable), and age_at_grad should be missing. The condition if grad[_N] ensures that age_at_grad is not set if the last observation is not a a year in which they graduated.

Don’t forget to put the data back in chronological order with sort id year if that’s how you’re used to using it.

Alternatively you can treat this as a subsetting problem and use the techniques described above for calculations on subsets:

gen age_if_grad = age if grad
by id: egen age_at_grad2 = mean(age_if_grad)

(235,455 missing values generated)
(135,033 missing values generated)

Since people graduate at most once, the subset of observations where grad is 1 (true) has just one observation per person. The mean of one number is just that number. Thus we’re really using the mean function to “broadcast” that number to all the observations for the same person.

These methods give exactly the same result, so you can use whichever you prefer. The sorting method will be slower with large data sets, but there are many things you can do with a clever sort and I want you to have that in your bag of tricks.

Exercise

Create a person-level variable for “income at age 25.”

Solution

You have two options here. First, the sorting method:

gen age25 = (age==25)
sort id age25
by id: gen income25a = income[_N] if age25[_N]
sort id year

(22,686 missing values generated)

Second, the subsetting method:

gen income_if_25 = income if age==25
by id: egen income25b = mean(income_if_25)

(229,542 missing values generated)
(22,686 missing values generated)

They will give you exactly the same result, as you can verify by asserting that income25a is always the same as income25b:

assert income25a==income25b

Stata doesn’t object (i.e. give an error message), so the assertion is true.

To finish a proper do file for this section, add:

log close

      name:  <unnamed>
       log:  /home/d/dimond/kb/dws/panel.txt
  log type:  smcl
 closed on:  12 Dec 2025, 10:28:40
------------------------------------------------------------------------------------------

For real work you’d also save the modified data set.