5  Working With Hierarchical Data

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical structure is panel or longitudinal data (also known as repeated measures), where the same subjects are observed multiple times. These structures may seem very different, but the same concepts, and often even the same code, apply to both.

5.1 Hierarchical Data Concepts

In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data.

5.1.1 Levels

Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

EXAMPLE DATA STRUCTURE               LEVEL ONE UNIT         LEVEL TWO UNIT
Individuals living in households     Individuals            Households
States grouped into regions          States                 Regions
Students in classes                  Students               Classes
Tests taken by students              Tests                  Students
Monthly observations of individuals  Monthly observations   Individuals
Visits to the doctor by patients     Doctor visits          Patients
Social media posts by individuals    Posts                  Individuals

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

EXAMPLE DATA STRUCTURE               LEVEL ONE VARIABLES    LEVEL TWO VARIABLES
Individuals living in households     Age, Sex               Household income, Address
States grouped into regions          State population       Regional income per capita
Students in classes                  Student’s race         Teacher’s race, Class size
Tests taken by students              Test Score             Free lunch eligibility
Monthly observations of individuals  Employment Status      Highest degree earned
Visits to the doctor by patients     BMI, Diagnosis         Race
Social media posts by individuals    Length, Topic          Sex

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (monthly observations) have different values for it.
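You can check this definition in code: a variable is level two only if it never varies within a level two unit. Here is a minimal sketch on made-up data (the DataFrame `df` and the helper `is_level_two` are illustrations, not part of the example files):

```python
import pandas as pd

# Toy data: two households, one variable that varies within each
# household and one that is constant within each household
df = pd.DataFrame({
    'household': [1, 1, 2, 2],
    'age': [45, 12, 30, 28],                           # varies within households
    'household_income': [50000, 50000, 72000, 72000],  # constant within households
})

# A column is a level two variable if every level two unit has
# exactly one distinct value for it
def is_level_two(df, group_col, var):
    return bool((df.groupby(group_col)[var].nunique(dropna=False) == 1).all())

print(is_level_two(df, 'household', 'age'))               # False
print(is_level_two(df, 'household', 'household_income'))  # True
```

This is also a handy diagnostic when you’re not sure which level a variable in a new data set belongs to.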

5.1.2 Data Forms

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people for three months:

person_id  month  years_edu  employed
1          1      16         True
1          2      16         True
1          3      16         True
2          1      12         False
2          2      12         True
2          3      12         True
Exercise

Identify the level one units, level two units, level one variable(s), and level two variable(s) in the above data set.

In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

person_id  years_edu  employed_1  employed_2  employed_3
1          16         True        True        True
2          12         False       True        True

In this form, each observation represents a person. Because years_edu is a level two variable, there’s just one value per person and thus one variable (column). However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables (columns).

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more columns. In most cases the long form is easier to work with, so we’ll do most of our examples in this form.

Now consider the indexes in this data set. In the long form, person_id and month are a compound identifier and could be easily turned into a MultiIndex, while the variable names are a simple column identifier. In the wide form, person_id is a simple row identifier, but now the variable names for the level one variables are a compound identifier with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the person_id, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier. We’ll learn how to actually carry out that transformation in the next chapter.
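To make the index discussion concrete, here is a sketch that builds the two-person example above in long form, with the compound identifier (person_id, month) as a MultiIndex:

```python
import pandas as pd

# The two-person, three-month example in long form, with the
# compound identifier (person_id, month) as a MultiIndex
long_df = pd.DataFrame({
    'person_id': [1, 1, 1, 2, 2, 2],
    'month':     [1, 2, 3, 1, 2, 3],
    'years_edu': [16, 16, 16, 12, 12, 12],
    'employed':  [True, True, True, False, True, True],
}).set_index(['person_id', 'month'])

# A single value is identified by person_id, month, and a column name
print(long_df.loc[(2, 1), 'employed'])
```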

5.1.3 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Data_Practice.ipynb. Have it import Pandas and load acs_clean.pickle:

import pandas as pd
acs = pd.read_pickle('acs_clean.pickle')
acs
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master's degree 32500.0 True False
242 1 29 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master's degree 69800.0 True False
2 61 White Now married Master's degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

Exercise

Load the example extract from the National Longitudinal Survey of Youth contained in nlsy.pickle. Identify the level one and level two units in the data and the level one and level two variables. (We’re not so worried about having a streamlined data cleaning workflow in this chapter, so just use your Hierarchical_Data_Practice.ipynb Notebook.)

5.2 Long Form Data

If a hierarchical data set is in long form, for some purposes you don’t have to worry about the hierarchical structure at all. On the other hand, a very common data wrangling task is creating level two variables based on the level one variables in the data set, and then the hierarchical structure is critical. In long form the key tool for doing so is groupby(), which carries out calculations separately for each level two unit.

5.2.1 Using groupby()

The groupby() function creates a new object which contains the DataFrame split up into groups. Functions act on that object, carry out calculations on the groups, and can return results in various forms. For example, if you apply basic summary statistic functions, you’ll get one row per group:

acs['income'].groupby('household').mean()
household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...     
1236119    46600.000000
1236287    17800.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64

mean() is an example of an aggregate function: it takes multiple values and returns a single value. Useful aggregate functions include:

  • mean()
  • sum() (total)
  • max()
  • min()
  • count() (number of rows with non-missing values)
  • size() (number of rows)
  • first() (value for the first observation in the group)
  • last() (value for the last observation in the group)
  • nth() (value for the nth observation in the group)

The first five functions listed, mean() through count(), can be applied to a DataFrame, a Series, or the result of a groupby(). However, the last four, size() through nth(), only work with groupby().
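For example, first() and nth() pull out a single observation per group. A quick sketch on a toy Series (the values mirror the first two households in the ACS output above):

```python
import pandas as pd

incomes = pd.Series(
    [10000.0, 5300.0, 4700.0, 32500.0],
    index=pd.MultiIndex.from_tuples(
        [(37, 1), (37, 2), (37, 3), (241, 1)],
        names=['household', 'person']
    )
)

# first(): the income of the first person listed in each household
print(incomes.groupby('household').first())

# nth(1): the income of the second person (n is zero-based),
# for households that have a second person
print(incomes.groupby('household').nth(1))
```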

With the transform() function, you pass in the name of an aggregate function which is then applied to the groups, but the result is copied or broadcast to all the observations in the group. This is the primary tool you’ll use to add level two variables to a DataFrame. The result is a Series with the same number of rows as the original DataFrame, but the value for level one units in the same level two unit is always the same.

acs['income'].groupby('household').transform('mean')
household  person
37         1          6666.666667
           2          6666.666667
           3          6666.666667
241        1         32500.000000
242        1         30000.000000
                         ...     
1236624    1         31050.000000
           2         31050.000000
1236756    1         55300.000000
           2         55300.000000
1236779    1         22110.000000
Name: income, Length: 27410, dtype: float64

5.2.2 Continuous Variables

We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. Create a column called household_income containing the total income of everyone in the household with:

acs['household_income'] = (
    acs['income'].
    groupby('household').
    transform('sum')
)
acs
age race marital_status edu income female hispanic household_income
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0
241 1 50 White Never married Master's degree 32500.0 True False 32500.0
242 1 29 White Never married Bachelor's degree 30000.0 True False 30000.0
... ... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0
2 26 White Now married High School graduate 12000.0 True False 62100.0
1236756 1 58 White Now married Master's degree 69800.0 True False 110600.0
2 61 White Now married Master's degree 40800.0 False False 110600.0
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0

27410 rows × 8 columns

Now calculate the household size, or the number of people in the household. Since there is one row per person, we can use the size() function. You still need to have it act on a single column, but it makes no difference which column you choose:

acs['household_size'] = (
    acs['age'].
    groupby('household').
    transform('size')
)
acs
age race marital_status edu income female hispanic household_income household_size
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0 3
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0 3
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0 3
241 1 50 White Never married Master's degree 32500.0 True False 32500.0 1
242 1 29 White Never married Bachelor's degree 30000.0 True False 30000.0 1
... ... ... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0 2
2 26 White Now married High School graduate 12000.0 True False 62100.0 2
1236756 1 58 White Now married Master's degree 69800.0 True False 110600.0 2
2 61 White Now married Master's degree 40800.0 False False 110600.0 2
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0 1

27410 rows × 9 columns

Exercise

Create a variable mean_age containing the mean age of the people in the household.

Now calculate the income per person, spot-checking household 484. Since income per person is the same as the mean income for the household, you might think you can use:

acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform('mean')
)
acs.loc[484]
age race marital_status edu income female hispanic household_income household_size income_per_person
person
1 33 Asian Now married High School graduate 16800.0 False False 34800.0 4 17400.0
2 26 Asian Now married 11th grade 18000.0 True False 34800.0 4 17400.0
3 4 Asian NaN None NaN False False 34800.0 4 17400.0
4 2 Asian NaN NaN NaN True False 34800.0 4 17400.0

This does not give the right answer because of the way mean() handles missing values: it took the mean of the two observed values of income and ignored the two people with NaN. However, we know that NaN for income means the person is under 15 years old, and almost certainly their income is zero or close to it. What we want is the total income of the household divided by the number of people in the household, whether they have observed values of income or not.

Since you already created household_income and household_size, you can just use them:

acs['income_per_person'] = (
    acs['household_income']/acs['household_size']
)
acs.loc[484]
age race marital_status edu income female hispanic household_income household_size income_per_person
person
1 33 Asian Now married High School graduate 16800.0 False False 34800.0 4 8700.0
2 26 Asian Now married 11th grade 18000.0 True False 34800.0 4 8700.0
3 4 Asian NaN None NaN False False 34800.0 4 8700.0
4 2 Asian NaN NaN NaN True False 34800.0 4 8700.0

Note that we did not need to use groupby()! groupby() is relevant when you’re using an aggregation function and need to control which rows are aggregated (e.g. the people in the same household). This code does not use an aggregation function: income_per_person for each row depends solely on household_income and household_size for the same row.

What if you hadn’t already created household_income and household_size? Is there a way to calculate income_per_person directly? Absolutely. We just need a function to pass to transform() that first calculates the sum and then divides by the size. No such function exists, but we can create one.

5.2.3 Lambda Functions

Python allows you to create and name your own functions, but you most likely won’t need to for data wrangling. You’ll probably be able to do everything you need to do with temporary, unnamed functions, known as lambda functions. You define a lambda function with:

lambda x: (f(x))

where f(x) should be replaced by the function you need. x is a parameter that is passed in to the function; you can actually give it any name you want.
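If lambda functions are new to you, here is a quick sketch of one in action outside of transform() (the names double and s are made up for illustration):

```python
import pandas as pd

# A lambda is a function defined without a def statement.
# Here we give one a name just so we can call it directly:
double = lambda x: x * 2
print(double(5))  # 10

# More commonly you pass a lambda directly to a function that
# expects a function, without ever naming it:
s = pd.Series([1, 2, 3])
print(s.apply(lambda x: x * 2).tolist())  # [2, 4, 6]
```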

What seems like the obvious f(x) here is:

lambda x: (x.sum()/x.size())

but that doesn’t work. The reason is that the x passed in is a Series containing one group’s values, not a groupby() object, so you can’t use the size() function. However, Series and DataFrames have a size attribute which gives the total number of elements they contain. For a DataFrame this is the number of rows times the number of columns, which is rarely useful. But for a Series which is one column from a DataFrame, size is the number of rows, and that is exactly what you need:

acs['income'].size
27410

Thus the function is:

lambda x: (x.sum()/x.size)

and the full code is:

acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform(
        lambda x: 
        (x.sum()/x.size)
    )
)
acs.loc[484]
age race marital_status edu income female hispanic household_income household_size income_per_person
person
1 33 Asian Now married High School graduate 16800.0 False False 34800.0 4 8700.0
2 26 Asian Now married 11th grade 18000.0 True False 34800.0 4 8700.0
3 4 Asian NaN None NaN False False 34800.0 4 8700.0
4 2 Asian NaN NaN NaN True False 34800.0 4 8700.0

The data set is getting kind of wide, so remove the variables you just created (add mean_age if you created it):

acs = acs.drop(
    [
        'household_income',
        'household_size',
        'income_per_person'
    ],
    axis=1
)

5.2.4 Subsetting With Aggregate Functions

Suppose you’re interested in the income children contribute to their household (household 8787 is a good example), so you want to create a child_income containing the household’s total income earned by children. You might think this would be a straightforward process of adding up the income of everyone that’s younger than 18:

acs['child_income'] = (
    acs.loc[acs['age']<18, 'income'].
    groupby('household').
    transform('sum')
)
acs.loc[8787]
age race marital_status edu income female hispanic child_income
person
1 45 White Divorced Some college, >=1 year 38000.0 True False NaN
2 16 White Never married 9th grade 3600.0 False False 3800.0
3 15 White Never married 7th-8th grade 200.0 True False 3800.0

The trouble is that in excluding the adults from the sum(), you also excluded them from getting the result. Thus child_income is not a proper household-level variable, as it is not the same for all the observations in the same household. (And you’d really be in trouble if your research question involved the effect of children’s contributions on the adults in the household.) The solution is to create a column that contains only the quantities you want to add up, but can be added up for everyone:

acs['income_if_child'] = acs.loc[acs['age']<18, 'income']
acs.loc[8787]
age race marital_status edu income female hispanic child_income income_if_child
person
1 45 White Divorced Some college, >=1 year 38000.0 True False NaN NaN
2 16 White Never married 9th grade 3600.0 False False 3800.0 3600.0
3 15 White Never married 7th-8th grade 200.0 True False 3800.0 200.0

For children income_if_child contains their income, while for adults it contains NaN. If we now pass income_if_child to sum(), it will add up just the children’s incomes without us having to exclude the adults. Thus the result will be stored in the rows for the household’s adults too:

acs['child_income'] = (
    acs['income_if_child'].
    groupby('household').
    transform('sum')
)
acs.loc[8787]
age race marital_status edu income female hispanic child_income income_if_child
person
1 45 White Divorced Some college, >=1 year 38000.0 True False 3800.0 NaN
2 16 White Never married 9th grade 3600.0 False False 3800.0 3600.0
3 15 White Never married 7th-8th grade 200.0 True False 3800.0 200.0

Since all the aggregate functions ignore missing values, you can use this approach any time you need to use an aggregate function on a subset of the data:

  1. Create a new column containing the value of interest for the observations to be included and NaN for the values to be excluded
  2. Aggregate that column with no subsetting
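For instance, the same two steps compute each household’s total income earned by women. A sketch on toy data (the values mirror households 8787 and 241 from the output above):

```python
import pandas as pd

# Toy data mirroring the structure of acs
df = pd.DataFrame({
    'income': [38000.0, 3600.0, 200.0, 32500.0],
    'female': [True, False, True, True],
}, index=pd.MultiIndex.from_tuples(
    [(8787, 1), (8787, 2), (8787, 3), (241, 1)],
    names=['household', 'person']
))

# Step 1: income for women, NaN for everyone else
df['income_if_female'] = df.loc[df['female'], 'income']

# Step 2: aggregate with no subsetting; sum() ignores the NaNs,
# and every row in the household gets the result
df['female_income'] = (
    df['income_if_female'].groupby('household').transform('sum')
)
print(df['female_income'])
```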
Exercise

Find the mean age of the adults in each household.

Again, just to keep things neat drop the variables created in this section:

acs = acs.drop(
    [
        'child_income',
        'income_if_child'
    ],
    axis=1
)

5.2.5 Indicator Variables

As we’ve seen, if you do math with an indicator variable, it will be coerced into a number with True converted to 1 and False converted to 0. Aggregate functions take on new and very useful meanings when applied to indicator variables.

Start by creating an indicator variable for ‘this person is a child’:

acs['child'] = acs['age']<18
acs.loc[[37, 8787]]
age race marital_status edu income female hispanic child
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False False
2 19 White Never married Some college, >=1 year 5300.0 True False False
3 19 Black Never married Some college, >=1 year 4700.0 True False False
8787 1 45 White Divorced Some college, >=1 year 38000.0 True False False
2 16 White Never married 9th grade 3600.0 False False True
3 15 White Never married 7th-8th grade 200.0 True False True

Just to see the conversion in action, use the astype() function and pass in 'int32' (32-bit integer):

acs.loc[[37, 8787],'child'].astype('int32')
household  person
37         1         0
           2         0
           3         0
8787       1         0
           2         1
           3         1
Name: child, dtype: int32

The sum of child is the number of children in the household:

acs['num_children'] = (
    acs['child'].
    groupby('household').
    transform('sum')
)
acs.loc[[37, 8787, 73731]]
age race marital_status edu income female hispanic child num_children
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False False 0
2 19 White Never married Some college, >=1 year 5300.0 True False False 0
3 19 Black Never married Some college, >=1 year 4700.0 True False False 0
8787 1 45 White Divorced Some college, >=1 year 38000.0 True False False 2
2 16 White Never married 9th grade 3600.0 False False True 2
3 15 White Never married 7th-8th grade 200.0 True False True 2
73731 1 15 White Now married 9th grade 0.0 False False True 2
2 15 White Never married 7th-8th grade 0.0 True False True 2

The mean of child is the proportion of the household that is children:

acs['prop_children'] = (
    acs['child'].
    groupby('household').
    transform('mean')
)
acs.loc[[37, 8787, 73731]]
age race marital_status edu income female hispanic child num_children prop_children
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False False 0 0.000000
2 19 White Never married Some college, >=1 year 5300.0 True False False 0 0.000000
3 19 Black Never married Some college, >=1 year 4700.0 True False False 0 0.000000
8787 1 45 White Divorced Some college, >=1 year 38000.0 True False False 2 0.666667
2 16 White Never married 9th grade 3600.0 False False True 2 0.666667
3 15 White Never married 7th-8th grade 200.0 True False True 2 0.666667
73731 1 15 White Now married 9th grade 0.0 False False True 2 1.000000
2 15 White Never married 7th-8th grade 0.0 True False True 2 1.000000

The maximum value of child tells you if the household has any children or not:

acs['has_children'] = (
    acs['child'].
    groupby('household').
    transform('max')
)
acs.loc[[37, 8787, 73731]]
age race marital_status edu income female hispanic child num_children prop_children has_children
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False False 0 0.000000 False
2 19 White Never married Some college, >=1 year 5300.0 True False False 0 0.000000 False
3 19 Black Never married Some college, >=1 year 4700.0 True False False 0 0.000000 False
8787 1 45 White Divorced Some college, >=1 year 38000.0 True False False 2 0.666667 True
2 16 White Never married 9th grade 3600.0 False False True 2 0.666667 True
3 15 White Never married 7th-8th grade 200.0 True False True 2 0.666667 True
73731 1 15 White Now married 9th grade 0.0 False False True 2 1.000000 True
2 15 White Never married 7th-8th grade 0.0 True False True 2 1.000000 True

This merits some explanation. If a household has no children, everyone in the household has a 0 (False) for child and the maximum value is 0. If a household has children, then at least one person in the household has 1 (True) for child, and the maximum value is 1. pandas is clever enough to realize that the max() of a boolean variable is also a boolean variable, so it stores True/False in has_children rather than 1/0.

The minimum value of child tells you if the household consists entirely of children:

acs['all_children'] = (
    acs['child'].
    groupby('household').
    transform('min')
)
acs.loc[[37, 8787, 73731]]
age race marital_status edu income female hispanic child num_children prop_children has_children all_children
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False False 0 0.000000 False False
2 19 White Never married Some college, >=1 year 5300.0 True False False 0 0.000000 False False
3 19 Black Never married Some college, >=1 year 4700.0 True False False 0 0.000000 False False
8787 1 45 White Divorced Some college, >=1 year 38000.0 True False False 2 0.666667 True False
2 16 White Never married 9th grade 3600.0 False False True 2 0.666667 True False
3 15 White Never married 7th-8th grade 200.0 True False True 2 0.666667 True False
73731 1 15 White Now married 9th grade 0.0 False False True 2 1.000000 True True
2 15 White Never married 7th-8th grade 0.0 True False True 2 1.000000 True True

If everyone in the household is a child, everyone has a 1 (True) for child and the minimum value is 1 (True). If anyone in the household is not a child, then at least one person has a 0 (False) for child and the minimum is 0 (False).

Exercise

Create a variable containing the number of people in the household with income greater than zero and an indicator variable for ‘this household has someone with income>0’.

5.3 Wide Form Data

Up to this point we’ve worked with the American Community Survey in long form, with one observation per person. Some tasks are easier to do in long form than in wide form, but others are just as easy either way–once you know the tricks for working in wide form.

The file acs_wide.pickle contains the same ACS data we’ve been working with, but in wide form. Read it in as our new acs:

acs = pd.read_pickle('acs_wide.pickle')
acs
age_1 age_2 age_3 age_4 age_5 age_6 age_7 age_8 age_9 age_10 ... female_7 female_8 female_9 female_10 female_11 female_12 female_13 female_14 female_15 female_16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 112 columns

Before proceeding, let’s review what we’re seeing here. In wide form, there is one row per household rather than one row per person as in long form. Each person is described by a set of columns: person 1 is described by age_1, race_1, etc. The largest household has 16 people, so there are 16 copies of each column (they go up to age_16, race_16, etc.). Smaller households have NaN in the columns for the people they don’t have. For example, household 37 has three people, so age_1, age_2, and age_3 have values while all the others have NaN.

In long form, household and person made up a MultiIndex row index, while the variable names (age, race, etc.) made up the column index. In wide form household is the row index. We could use the variable names plus person as a MultiIndex column index, and you’ll see what that looks like in the next chapter. However, we have not found a good way to have a MultiIndex describe data that contain both level one and level two variables. Thus rather than using an official MultiIndex, the column names contain both the variable name and the person number. This seems to be the most common approach.

5.3.1 Subsetting in Wide Form

If we were using a MultiIndex for the columns we could use loc() for subsetting just like we did in long form. However, since the information we need is embedded in the column names, we’ll use filter() instead.

The filter() function allows you to subset based on patterns in the column names. A very simple pattern is like=. This will select all the columns that contain the string you pass in. Select all the income columns by passing in like='income':

acs.filter(like='income').dtypes
income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object

We’ll learn shortly how to use this to create level two variables like household_income. But that raises a problem: once you create a household_income variable, like='income' will select it too.

Now consider trying to select all the variables that describe person 1. You might think like='_1' would do the job, but try it:

acs.filter(like='_1').dtypes
age_1                 float64
age_10                float64
age_11                float64
age_12                float64
age_13                float64
age_14                float64
age_15                float64
age_16                float64
hispanic_1             object
hispanic_10            object
hispanic_11            object
hispanic_12            object
hispanic_13            object
hispanic_14            object
hispanic_15            object
hispanic_16            object
race_1               category
race_10              category
race_11              category
race_12              category
race_13              category
race_14              category
race_15              category
race_16              category
marital_status_1     category
marital_status_10    category
marital_status_11    category
marital_status_12    category
marital_status_13    category
marital_status_14    category
marital_status_15    category
marital_status_16    category
edu_1                category
edu_10               category
edu_11               category
edu_12               category
edu_13               category
edu_14               category
edu_15               category
edu_16               category
income_1              float64
income_10             float64
income_11             float64
income_12             float64
income_13             float64
income_14             float64
income_15             float64
income_16             float64
female_1               object
female_10              object
female_11              object
female_12              object
female_13              object
female_14              object
female_15              object
female_16              object
dtype: object

Instead we got the variables for persons 10 through 16 as well, because they also contain '_1'.

What we need is the ability to be much more specific about the columns to be selected. The solution is regular expressions: a syntax for describing a pattern in text so that a computer can detect text that matches it. Regular expressions have been around for more than 50 years, and most programming languages and statistical packages implement them. We won’t discuss working with text data (strings) here, but learning regular expressions now lays the foundation for doing so. Regular expressions can be extremely complex; we’ll introduce just a few of the most important features, but you can find many, many tutorials online.
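Before applying them with filter(), you can experiment with regular expressions directly using Python’s re module. A quick sketch of the '_1' problem from above:

```python
import re

# re.search() returns a match object if the pattern appears
# anywhere in the string, and None if it does not
print(bool(re.search('_1', 'age_1')))    # True
print(bool(re.search('_1', 'age_10')))   # True -- the same problem
                                         # filter(like='_1') had
# Anchoring the pattern to the end of the string (explained below)
# fixes the problem:
print(bool(re.search('_1$', 'age_10')))  # False
print(bool(re.search('_1$', 'age_1')))   # True
```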

You use a regular expression with filter() by passing in regex=. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as like=:

acs.filter(regex='income').dtypes
income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object

But a regular expression can do much more. Here are the tools we’ll use:

  • If a regular expression starts with ^, then the string only matches if it starts with the pattern.

  • If a regular expression ends with $, then the string only matches if it ends with the pattern.

  • \d matches any digit (number).

  • A * after something means the string can contain any number of that thing at that point. So \d* means the string can contain any number of digits at that point.

Putting this all together, the regular expression ^income_\d*$ will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (income_1, income_2, etc.) without also matching columns like household_income:

acs.filter(regex=r'^income_\d*$').dtypes
income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object

The regular expression _1$ will match strings that end with ‘_1’. This excludes variables that end with ‘_10’, ‘_11’, etc. and thus allows us to select just the columns describing person 1:

acs.filter(regex='_1$').dtypes
age_1                float64
hispanic_1            object
race_1              category
marital_status_1    category
edu_1               category
income_1             float64
female_1              object
dtype: object

Exercise

Use a regular expression to select all the race variables. Then use one to select all the variables for person 5.

5.3.2 Creating Level Two Variables

When we used the sum() function in the last section to add up household incomes, it was adding all the values in a column. However, with the axis=1 argument it will add up values across a row. You can use this along with filter() to calculate household income in wide form:

acs['household_income'] = (
    acs.filter(regex=r'^income_\d*$').
    sum(axis=1)
)
acs['household_income']
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: household_income, Length: 10565, dtype: float64

If you want to be able to see both the individual incomes and household_income so you can spot-check the results, go ahead and use like='income' (and scroll right):

acs.filter(like='income')
income_1 income_2 income_3 income_4 income_5 income_6 income_7 income_8 income_9 income_10 income_11 income_12 income_13 income_14 income_15 income_16 household_income
household
37 10000.0 5300.0 4700.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20000.0
241 32500.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32500.0
242 30000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 30000.0
377 51900.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 51900.0
418 12200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12200.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 93200.0
1236287 15000.0 27000.0 11400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 53400.0
1236624 50100.0 12000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 62100.0
1236756 69800.0 40800.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 110600.0
1236779 22110.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22110.0

10565 rows × 17 columns

This works for all of the aggregate functions from the last section, except those that only work with groupby(). Thus the code to carry out the tasks from the last section is almost the same: just use filter() instead of loc to select the right subsample, don’t bother with transform(), and pass axis=1 to the aggregate function itself.
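As a sketch of that pattern, using a made-up two-household DataFrame rather than the real ACS extract, any row-wise aggregate follows the same template as sum(axis=1):

```python
import numpy as np
import pandas as pd

# Hypothetical two-household wide-form data (not the real ACS extract)
toy = pd.DataFrame({
    'income_1': [10000.0, 32500.0],
    'income_2': [5300.0, np.nan],
    'income_3': [4700.0, np.nan],
})

# Select the person-level income columns, then aggregate across each row
mean_income = toy.filter(regex=r'^income_\d*$').mean(axis=1)
max_income = toy.filter(regex=r'^income_\d*$').max(axis=1)
print(mean_income)
print(max_income)
```

As with sum(), NaN values are skipped, so the second household's mean is just its one observed income.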

How can we find the size of each household (number of people)? Recall that count() gives the number of non-missing values. So as long as there’s at least one variable with no missing values for the people who actually exist, you can use count() to find the number of people. This data set has multiple candidates; we’ll use age:

acs['size'] = (
    acs.filter(regex=r'^age_\d*$').
    count(axis=1)
)
acs['size'].value_counts()
size
2     3412
1     2779
3     1685
4     1479
5      760
6      273
7       96
8       47
9       13
10      10
12       7
11       3
16       1
Name: count, dtype: int64

What about income per person? That’s actually easier than in long form. There’s no need to broadcast the result to all the rows in the same household, so you don’t need transform() or a lambda function. Just add up the household income and divide by the number of people in the household:

acs['income_per_person'] = (
    acs.filter(regex=r'^income_\d*$').
    sum(axis=1) / 
    acs.filter(regex=r'^age_\d*$')
    .count(axis=1)
)
acs.filter(like='income').loc[484]
income_1             16800.0
income_2             18000.0
income_3                 NaN
income_4                 NaN
income_5                 NaN
income_6                 NaN
income_7                 NaN
income_8                 NaN
income_9                 NaN
income_10                NaN
income_11                NaN
income_12                NaN
income_13                NaN
income_14                NaN
income_15                NaN
income_16                NaN
household_income     34800.0
income_per_person     8700.0
Name: 484, dtype: float64

Recall that household 484 has two children who have missing values for income because they are under 15.

Exercise

Find the mean age for each household. Then create an indicator variable for ‘everyone in this household is female’.

5.3.3 Creating Level One Variables

Creating level one variables in long form is straightforward. For example, to create an indicator for ‘this person is black’ you’d just run:

acs['black'] = (acs['race']=='Black')

But in wide form there are 16 race variables, and there will need to be 16 indicators for Black as well. If you need to do a lot of work with level one variables, this is a pretty good reason to convert your data to long form. We’ll learn how in the next chapter. But it’s not too hard to make all 16 variables as long as you use a for loop.

Once again, we’ll start by considering what it would take to create just one of the 16 new indicator variables:

acs['black_1'] = (acs['race_1']=='Black')

That would take care of person 1. What would need to change in order to take care of person 2? The number at the end of the variable names. That tells us that this time we need to loop over a range of numbers. You can do that with the range() function. In its simplest form it takes two arguments, a starting number and an ending number. In true Python style the ending number is not included. To see how this works in a loop, try:

for i in range(1, 17): print(i)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

So far so good, but how do we get those numbers into the variable names? If you add two strings together, they’ll be concatenated:

'Hello ' + 'world!'
'Hello world!'

The only wrinkle is that i is a number, but you can convert it to a string with the str() function and then add it:

'black_'+str(i)
'black_16'

This allows us to build the variable names we need, so now you’re ready to write the loop. One last complication: remember that any comparison involving NaN returns False. If a race variable is NaN the corresponding black variable should also be NaN, so select only the rows where race is not NaN:

for i in range(1, 17):
    acs.loc[
        acs['race_'+str(i)].notna(),
        'black_'+str(i)
    ] = (acs['race_'+str(i)]=='Black')
acs.filter(like='black')
black_1 black_2 black_3 black_4 black_5 black_6 black_7 black_8 black_9 black_10 black_11 black_12 black_13 black_14 black_15 black_16
household
37 False False True NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 False False False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 16 columns

Exercise

Create a set of indicator variables for ‘this person has experienced the end of a marriage’. This would include anyone who is divorced, widowed, or separated (we’ll call that an ‘end’, though you could argue otherwise), and not anyone who is currently married or has never been married.

5.4 Panel Data

Panel data or longitudinal data is just another form of hierarchical data, with subjects as level two units and times they were observed as level one units. With panel data, the timing of the observations or at least their order is important. If it’s not, then we refer to it as repeated measures data. Since panel data is just hierarchical data, a lot of the code we’ve already learned applies directly to panel data, but there are some panel-specific tricks too.

You should already have the NLSY extract in memory from the second exercise. If not, load it with:

nlsy = pd.read_pickle('nlsy.pickle')
nlsy
birth_year edu income age
id year
1.0 1979 58.0 12TH GRADE 4620.0 21.0
1980 58.0 NaN NaN 22.0
1981 58.0 12TH GRADE 5000.0 23.0
1982 58.0 NaN NaN 24.0
1983 58.0 NaN NaN 25.0
... ... ... ... ... ...
12686.0 1993 60.0 12TH GRADE 31000.0 33.0
1994 60.0 NaN NaN 34.0
1996 60.0 NaN NaN 36.0
1998 60.0 NaN NaN 38.0
2000 60.0 NaN NaN 40.0

241034 rows × 4 columns

5.4.1 Creating Level Two Variables

The National Longitudinal Survey of Youth tracks young people for many years, making it classic panel data. In the NLSY a level two unit is a person and a level one unit is a person-year combination. You can create level two variables using the same tools we used with the ACS. For example, create max_edu containing the person’s highest level of education with:

nlsy['max_edu'] = (
    nlsy['edu'].
    groupby('id').
    transform('max')
)
nlsy.loc[5]
birth_year edu income age max_edu
year
1979 59.0 1ST YEAR COLLEGE 2200.0 20.0 6TH YEAR COLLEGE
1980 59.0 2ND YEAR COLLEGE 2000.0 21.0 6TH YEAR COLLEGE
1981 59.0 3RD YEAR COLLEGE 3400.0 22.0 6TH YEAR COLLEGE
1982 59.0 4TH YEAR COLLEGE 2300.0 23.0 6TH YEAR COLLEGE
1983 59.0 5TH YEAR COLLEGE 2200.0 24.0 6TH YEAR COLLEGE
1984 59.0 6TH YEAR COLLEGE 10500.0 25.0 6TH YEAR COLLEGE
1985 59.0 NaN NaN 26.0 6TH YEAR COLLEGE
1986 59.0 6TH YEAR COLLEGE 43119.0 27.0 6TH YEAR COLLEGE
1987 59.0 NaN NaN 28.0 6TH YEAR COLLEGE
1988 59.0 6TH YEAR COLLEGE 57124.0 29.0 6TH YEAR COLLEGE
1989 59.0 6TH YEAR COLLEGE 62836.0 30.0 6TH YEAR COLLEGE
1990 59.0 NaN NaN 31.0 6TH YEAR COLLEGE
1991 59.0 NaN NaN 32.0 6TH YEAR COLLEGE
1992 59.0 NaN NaN 33.0 6TH YEAR COLLEGE
1993 59.0 NaN NaN 34.0 6TH YEAR COLLEGE
1994 59.0 NaN NaN 35.0 6TH YEAR COLLEGE
1996 59.0 NaN NaN 37.0 6TH YEAR COLLEGE
1998 59.0 NaN NaN 39.0 6TH YEAR COLLEGE
2000 59.0 NaN NaN 41.0 6TH YEAR COLLEGE

Note that this relies on edu being an ordered categorical variable.
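A minimal sketch (with made-up labels, not the real NLSY codes) of what being an ordered categorical buys us: max() compares by the order of the categories, not alphabetically.

```python
import pandas as pd

# Made-up example: an ordered categorical compares by category order,
# not alphabetically ('12TH GRADE' would sort before '9TH GRADE' as
# plain text because '1' < '9')
edu = pd.Series(
    pd.Categorical(
        ['9TH GRADE', '12TH GRADE', '9TH GRADE'],
        categories=['9TH GRADE', '12TH GRADE', '1ST YEAR COLLEGE'],
        ordered=True
    )
)
print(edu.max())
```

If edu were stored as plain strings, max() would return ‘9TH GRADE’ here, which is not the highest level of education.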

The first() and last() functions are particularly useful with panel data, but their meaning depends on how the data are sorted. This data set appears to be in chronological order, but you shouldn’t trust appearances when you’re about to write code that will give you the wrong answer if they’re deceiving. So before using the first() function to find each person’s age at the start of the study, sort by id and year:

nlsy = nlsy.sort_values(['id', 'year'])
nlsy['starting_age'] = (
    nlsy['age'].
    groupby('id').
    transform('first')
)
nlsy
birth_year edu income age max_edu starting_age
id year
1.0 1979 58.0 12TH GRADE 4620.0 21.0 12TH GRADE 21.0
1980 58.0 NaN NaN 22.0 12TH GRADE 21.0
1981 58.0 12TH GRADE 5000.0 23.0 12TH GRADE 21.0
1982 58.0 NaN NaN 24.0 12TH GRADE 21.0
1983 58.0 NaN NaN 25.0 12TH GRADE 21.0
... ... ... ... ... ... ... ...
12686.0 1993 60.0 12TH GRADE 31000.0 33.0 12TH GRADE 19.0
1994 60.0 NaN NaN 34.0 12TH GRADE 19.0
1996 60.0 NaN NaN 36.0 12TH GRADE 19.0
1998 60.0 NaN NaN 38.0 12TH GRADE 19.0
2000 60.0 NaN NaN 40.0 12TH GRADE 19.0

241034 rows × 6 columns

Exercise

Use the last() function to find each subject’s last value of income. Does this behave as you expected?

Before proceeding, remove the variables we’ve created thus far just to keep the data set neat:

nlsy = nlsy.drop(
    [
        'max_edu',
        'starting_age'
    ], axis=1
)

5.4.2 Filling In Gaps

The NLSY frequently has missing data because subjects could not be located in a given year. Somewhat unusually for panel data, the survey often was able to find people again in later years, so there are gaps as well as attrition. Consider what we know about the education of person 11:

nlsy.loc[11]
birth_year edu income age
year
1979 59.0 1ST YEAR COLLEGE 2900.0 20.0
1980 59.0 NaN NaN 21.0
1981 59.0 3RD YEAR COLLEGE 2000.0 22.0
1982 59.0 4TH YEAR COLLEGE 4002.0 23.0
1983 59.0 4TH YEAR COLLEGE 8000.0 24.0
1984 59.0 4TH YEAR COLLEGE 17000.0 25.0
1985 59.0 4TH YEAR COLLEGE 20000.0 26.0
1986 59.0 NaN NaN 27.0
1987 59.0 NaN NaN 28.0
1988 59.0 NaN NaN 29.0
1989 59.0 4TH YEAR COLLEGE 46000.0 30.0
1990 59.0 4TH YEAR COLLEGE 74283.0 31.0
1991 59.0 4TH YEAR COLLEGE 81481.0 32.0
1992 59.0 4TH YEAR COLLEGE 55000.0 33.0
1993 59.0 4TH YEAR COLLEGE 60000.0 34.0
1994 59.0 4TH YEAR COLLEGE 101653.0 35.0
1996 59.0 4TH YEAR COLLEGE 60000.0 37.0
1998 59.0 4TH YEAR COLLEGE 25000.0 39.0
2000 59.0 4TH YEAR COLLEGE 55000.0 41.0

They were not interviewed in 1986, 1987, and 1988, but given that their education level both before and after that gap is ‘4TH YEAR COLLEGE’ we know it was ‘4TH YEAR COLLEGE’ in those years as well. Meanwhile, while we might guess that in 1980 their education level was ‘2ND YEAR COLLEGE’, that is less certain, and if that were a two-year gap we’d have no idea. So let’s fill in gaps only where edu is the same before and after the gap.

To actually do that, we’ll fill in all the gaps in two ways: first by carrying past values forward with the ffill() function, and then by carrying future values back with the bfill() function. If the two methods agree, then that is a gap we can fill. While ffill() and bfill() are not aggregate functions, it’s still important to use groupby('id') because that ensures each person is considered separately and no values are carried from one person to another.

nlsy['edu_forward'] = (
    nlsy['edu'].
    groupby('id').
    ffill()
)
nlsy['edu_backward'] = (
    nlsy['edu'].
    groupby('id').
    bfill()
)
nlsy.loc[11]
birth_year edu income age edu_forward edu_backward
year
1979 59.0 1ST YEAR COLLEGE 2900.0 20.0 1ST YEAR COLLEGE 1ST YEAR COLLEGE
1980 59.0 NaN NaN 21.0 1ST YEAR COLLEGE 3RD YEAR COLLEGE
1981 59.0 3RD YEAR COLLEGE 2000.0 22.0 3RD YEAR COLLEGE 3RD YEAR COLLEGE
1982 59.0 4TH YEAR COLLEGE 4002.0 23.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1983 59.0 4TH YEAR COLLEGE 8000.0 24.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1984 59.0 4TH YEAR COLLEGE 17000.0 25.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1985 59.0 4TH YEAR COLLEGE 20000.0 26.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1986 59.0 NaN NaN 27.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1987 59.0 NaN NaN 28.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1988 59.0 NaN NaN 29.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1989 59.0 4TH YEAR COLLEGE 46000.0 30.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1990 59.0 4TH YEAR COLLEGE 74283.0 31.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1991 59.0 4TH YEAR COLLEGE 81481.0 32.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1992 59.0 4TH YEAR COLLEGE 55000.0 33.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1993 59.0 4TH YEAR COLLEGE 60000.0 34.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1994 59.0 4TH YEAR COLLEGE 101653.0 35.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1996 59.0 4TH YEAR COLLEGE 60000.0 37.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1998 59.0 4TH YEAR COLLEGE 25000.0 39.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
2000 59.0 4TH YEAR COLLEGE 55000.0 41.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE

Now we’ll replace edu if edu is missing and edu_forward and edu_backward are the same. Since they are the same it doesn’t matter which of them we set edu to:

nlsy.loc[
    nlsy['edu'].isna() & 
    (nlsy['edu_forward']==nlsy['edu_backward']),
    'edu'      
] = nlsy['edu_forward']
nlsy.loc[11]
birth_year edu income age edu_forward edu_backward
year
1979 59.0 1ST YEAR COLLEGE 2900.0 20.0 1ST YEAR COLLEGE 1ST YEAR COLLEGE
1980 59.0 NaN NaN 21.0 1ST YEAR COLLEGE 3RD YEAR COLLEGE
1981 59.0 3RD YEAR COLLEGE 2000.0 22.0 3RD YEAR COLLEGE 3RD YEAR COLLEGE
1982 59.0 4TH YEAR COLLEGE 4002.0 23.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1983 59.0 4TH YEAR COLLEGE 8000.0 24.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1984 59.0 4TH YEAR COLLEGE 17000.0 25.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1985 59.0 4TH YEAR COLLEGE 20000.0 26.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1986 59.0 4TH YEAR COLLEGE NaN 27.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1987 59.0 4TH YEAR COLLEGE NaN 28.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1988 59.0 4TH YEAR COLLEGE NaN 29.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1989 59.0 4TH YEAR COLLEGE 46000.0 30.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1990 59.0 4TH YEAR COLLEGE 74283.0 31.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1991 59.0 4TH YEAR COLLEGE 81481.0 32.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1992 59.0 4TH YEAR COLLEGE 55000.0 33.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1993 59.0 4TH YEAR COLLEGE 60000.0 34.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1994 59.0 4TH YEAR COLLEGE 101653.0 35.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1996 59.0 4TH YEAR COLLEGE 60000.0 37.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1998 59.0 4TH YEAR COLLEGE 25000.0 39.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
2000 59.0 4TH YEAR COLLEGE 55000.0 41.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE

Note that we will not attempt to fill in other gaps, where we are not perfectly certain what the missing value should be. There are more or less valid methods for filling in missing values, like multiple imputation, but they are complicated and a lot of work. Methods that are less complicated and less work, like filling in the mean or linear interpolation, can create serious problems for any statistical analysis. It’s unusual to be able to fill in gaps as we did here, but it’s worth doing when you can.

5.4.3 Variables Based On An Observation’s Neighbors

The shift() function can shift observations forward or backward: forward if you pass in a positive number, backward if you pass in a negative number. For easy viewing, let’s build a DataFrame by passing a dictionary to the Pandas DataFrame function. The dictionary must contain the column names as keys and Series (which we’ll create based on the edu Series in nlsy) as values. Our DataFrame will have one column for the original edu, one for shifting forward, and one for shifting backward:

pd.DataFrame(
    {
        'edu' : nlsy.loc[11, 'edu'],
        'edu_forward_1' : nlsy.loc[11, 'edu'].shift(1),
        'edu_backward_1' : nlsy.loc[11, 'edu'].shift(-1) 
    }
)
edu edu_forward_1 edu_backward_1
year
1979 1ST YEAR COLLEGE NaN NaN
1980 NaN 1ST YEAR COLLEGE 3RD YEAR COLLEGE
1981 3RD YEAR COLLEGE NaN 4TH YEAR COLLEGE
1982 4TH YEAR COLLEGE 3RD YEAR COLLEGE 4TH YEAR COLLEGE
1983 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1984 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1985 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1986 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1987 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1988 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1989 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1990 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1991 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1992 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1993 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1994 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1996 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1998 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
2000 4TH YEAR COLLEGE 4TH YEAR COLLEGE NaN

Note that when shifting forward you’ll have NaN at the beginning of the series, and when shifting backwards at the end. In this case we did not need groupby() because we extracted just one person, but in general you’ll need it to prevent values from being shifted between people.
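To see why groupby() matters when shifting, here is a sketch with a tiny made-up two-person panel (hypothetical numeric edu values for brevity):

```python
import pandas as pd

# A tiny made-up two-person panel (id and year as a MultiIndex)
panel = pd.DataFrame(
    {'edu': [9, 12, 10, 11]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1979), (1, 1980), (2, 1979), (2, 1980)],
        names=['id', 'year']
    )
)

# Without groupby(), person 2's 1979 row receives person 1's 1980 value
no_group = panel['edu'].shift(1)
# With groupby('id'), each person's first row gets NaN instead
by_person = panel['edu'].groupby('id').shift(1)
print(no_group)
print(by_person)
```

The grouped version puts NaN at the start of each person's series, which is what we want: a person's first observation has no prior observation of the same person.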

Now consider trying to identify the year in which a person graduates from high school. In that year their edu will be ‘12TH GRADE’ or higher and the year before their edu will be less than ‘12TH GRADE’. Note that for whatever reason the NLSY frequently records people as having gained multiple years of education in one year, and we won’t worry about that. For person 2, the year in which they graduated is 1985:

nlsy.loc[2]
birth_year edu income age edu_forward edu_backward
year
1979 59.0 9TH GRADE 4000.0 20.0 9TH GRADE 9TH GRADE
1980 59.0 9TH GRADE 5000.0 21.0 9TH GRADE 9TH GRADE
1981 59.0 9TH GRADE 6000.0 22.0 9TH GRADE 9TH GRADE
1982 59.0 9TH GRADE 10000.0 23.0 9TH GRADE 9TH GRADE
1983 59.0 9TH GRADE 11000.0 24.0 9TH GRADE 9TH GRADE
1984 59.0 9TH GRADE 11500.0 25.0 9TH GRADE 9TH GRADE
1985 59.0 12TH GRADE 11000.0 26.0 12TH GRADE 12TH GRADE
1986 59.0 12TH GRADE 14000.0 27.0 12TH GRADE 12TH GRADE
1987 59.0 12TH GRADE 16000.0 28.0 12TH GRADE 12TH GRADE
1988 59.0 12TH GRADE 20000.0 29.0 12TH GRADE 12TH GRADE
1989 59.0 12TH GRADE 19000.0 30.0 12TH GRADE 12TH GRADE
1990 59.0 12TH GRADE 20000.0 31.0 12TH GRADE 12TH GRADE
1991 59.0 12TH GRADE 20000.0 32.0 12TH GRADE 12TH GRADE
1992 59.0 12TH GRADE 22000.0 33.0 12TH GRADE 12TH GRADE
1993 59.0 12TH GRADE 25000.0 34.0 12TH GRADE 12TH GRADE
1994 59.0 12TH GRADE 0.0 35.0 12TH GRADE 12TH GRADE
1996 59.0 12TH GRADE 0.0 37.0 12TH GRADE 12TH GRADE
1998 59.0 12TH GRADE 0.0 39.0 12TH GRADE 12TH GRADE
2000 59.0 12TH GRADE 0.0 41.0 12TH GRADE 12TH GRADE

Now create an indicator variable for ‘graduated this year’ with:

nlsy['grad'] = (
    (nlsy['edu'] >= '12TH GRADE') & 
    (nlsy['edu'].groupby('id').shift(1) < '12TH GRADE')
)
nlsy.loc[2]
birth_year edu income age edu_forward edu_backward grad
year
1979 59.0 9TH GRADE 4000.0 20.0 9TH GRADE 9TH GRADE False
1980 59.0 9TH GRADE 5000.0 21.0 9TH GRADE 9TH GRADE False
1981 59.0 9TH GRADE 6000.0 22.0 9TH GRADE 9TH GRADE False
1982 59.0 9TH GRADE 10000.0 23.0 9TH GRADE 9TH GRADE False
1983 59.0 9TH GRADE 11000.0 24.0 9TH GRADE 9TH GRADE False
1984 59.0 9TH GRADE 11500.0 25.0 9TH GRADE 9TH GRADE False
1985 59.0 12TH GRADE 11000.0 26.0 12TH GRADE 12TH GRADE True
1986 59.0 12TH GRADE 14000.0 27.0 12TH GRADE 12TH GRADE False
1987 59.0 12TH GRADE 16000.0 28.0 12TH GRADE 12TH GRADE False
1988 59.0 12TH GRADE 20000.0 29.0 12TH GRADE 12TH GRADE False
1989 59.0 12TH GRADE 19000.0 30.0 12TH GRADE 12TH GRADE False
1990 59.0 12TH GRADE 20000.0 31.0 12TH GRADE 12TH GRADE False
1991 59.0 12TH GRADE 20000.0 32.0 12TH GRADE 12TH GRADE False
1992 59.0 12TH GRADE 22000.0 33.0 12TH GRADE 12TH GRADE False
1993 59.0 12TH GRADE 25000.0 34.0 12TH GRADE 12TH GRADE False
1994 59.0 12TH GRADE 0.0 35.0 12TH GRADE 12TH GRADE False
1996 59.0 12TH GRADE 0.0 37.0 12TH GRADE 12TH GRADE False
1998 59.0 12TH GRADE 0.0 39.0 12TH GRADE 12TH GRADE False
2000 59.0 12TH GRADE 0.0 41.0 12TH GRADE 12TH GRADE False

Note that if either edu or shifted edu is NaN, the result will be False. Thus this indicator variable is really for ‘we know the person graduated this year.’ We could set it to NaN for years with missing values if we wanted to, using the techniques we’ve used before.
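Here is one sketch of that idea on a made-up mini-panel, using the Series mask() method, which replaces values with NaN wherever a condition is True (a different but equivalent route to the loc-based assignments we’ve used before):

```python
import numpy as np
import pandas as pd

# A made-up mini-panel for one person with a gap in edu
toy = pd.DataFrame(
    {'edu': [11.0, 11.0, np.nan, 12.0],
     'grad': [False, False, False, True]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1979), (1, 1980), (1, 1981), (1, 1982)],
        names=['id', 'year']
    )
)

# We can't tell whether someone graduated in a year where edu or the
# previous year's edu is missing, so blank out grad there. mask() sets
# grad to NaN where the condition is True (upcasting grad to object).
unknown = toy['edu'].isna() | toy['edu'].groupby('id').shift(1).isna()
toy['grad'] = toy['grad'].mask(unknown)
print(toy['grad'])
```

In this toy example only 1980 keeps a definite False; 1979 (no prior year), 1981 (missing edu), and 1982 (missing prior edu) all become NaN.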

Logically each person should only graduate from high school once, but it’s not hard to imagine data entry errors that would make it look like they graduated multiple times. Recall that the sum of an indicator variable is the number of times it is true, so check with:

(
    nlsy['grad'].
    groupby('id').
    sum().
    value_counts()
)
grad
0    7033
1    5653
Name: count, dtype: int64

A little over half the subjects have True for grad zero times: either they never graduated from high school, they graduated outside the study period, or we could not identify their graduation year due to missing data. But no one graduated more than once.

Now let’s find their age at graduation. This is just a matter of subsetting with an aggregate function, except that the desired subset has just one observation per group. So start by creating age_if_grad containing the person’s age for the year they graduated and NaN for all others.

In the past we’ve done this with code like:

nlsy.loc[nlsy['grad'], 'age_if_grad'] = nlsy['age']

and that would absolutely work. But let’s learn an alternative method. The numpy where() function takes three arguments: a condition, the result to return if the condition is true, and the result to return if the condition is false. Build age_if_grad by importing numpy as np and then passing to where() the three arguments grad, age, and NaN:

import numpy as np
nlsy['age_if_grad'] = np.where(
    nlsy['grad'],
    nlsy['age'],
    np.nan
)

nlsy.loc[2]
birth_year edu income age edu_forward edu_backward grad age_if_grad
year
1979 59.0 9TH GRADE 4000.0 20.0 9TH GRADE 9TH GRADE False NaN
1980 59.0 9TH GRADE 5000.0 21.0 9TH GRADE 9TH GRADE False NaN
1981 59.0 9TH GRADE 6000.0 22.0 9TH GRADE 9TH GRADE False NaN
1982 59.0 9TH GRADE 10000.0 23.0 9TH GRADE 9TH GRADE False NaN
1983 59.0 9TH GRADE 11000.0 24.0 9TH GRADE 9TH GRADE False NaN
1984 59.0 9TH GRADE 11500.0 25.0 9TH GRADE 9TH GRADE False NaN
1985 59.0 12TH GRADE 11000.0 26.0 12TH GRADE 12TH GRADE True 26.0
1986 59.0 12TH GRADE 14000.0 27.0 12TH GRADE 12TH GRADE False NaN
1987 59.0 12TH GRADE 16000.0 28.0 12TH GRADE 12TH GRADE False NaN
1988 59.0 12TH GRADE 20000.0 29.0 12TH GRADE 12TH GRADE False NaN
1989 59.0 12TH GRADE 19000.0 30.0 12TH GRADE 12TH GRADE False NaN
1990 59.0 12TH GRADE 20000.0 31.0 12TH GRADE 12TH GRADE False NaN
1991 59.0 12TH GRADE 20000.0 32.0 12TH GRADE 12TH GRADE False NaN
1992 59.0 12TH GRADE 22000.0 33.0 12TH GRADE 12TH GRADE False NaN
1993 59.0 12TH GRADE 25000.0 34.0 12TH GRADE 12TH GRADE False NaN
1994 59.0 12TH GRADE 0.0 35.0 12TH GRADE 12TH GRADE False NaN
1996 59.0 12TH GRADE 0.0 37.0 12TH GRADE 12TH GRADE False NaN
1998 59.0 12TH GRADE 0.0 39.0 12TH GRADE 12TH GRADE False NaN
2000 59.0 12TH GRADE 0.0 41.0 12TH GRADE 12TH GRADE False NaN

You may find that syntax clearer. where() would definitely be easier to use than the first syntax if you needed to set the rows where grad was false to something other than NaN.
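For instance, a hypothetical variation (not part of the analysis above): if you wanted non-graduation years coded as 0 rather than NaN, you’d just change the third argument.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the grad and age columns
grad = pd.Series([False, True, False])
age = pd.Series([25.0, 26.0, 27.0])

# Third argument supplies the value for rows where grad is False
age_if_grad = np.where(grad, age, 0)
print(age_if_grad)
```

With the loc approach you’d need a second assignment for the False rows; with where() the False case is just another argument.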

age_if_grad for each person is at most a single number plus a bunch of NaNs. Many aggregate functions have the property that if you give them a single number they’ll return that number, so we could use any of them. We’ll go with mean:

nlsy['age_at_grad'] = (
    nlsy['age_if_grad'].
    groupby('id').
    transform('mean')
)
nlsy.loc[2]
birth_year edu income age edu_forward edu_backward grad age_if_grad age_at_grad
year
1979 59.0 9TH GRADE 4000.0 20.0 9TH GRADE 9TH GRADE False NaN 26.0
1980 59.0 9TH GRADE 5000.0 21.0 9TH GRADE 9TH GRADE False NaN 26.0
1981 59.0 9TH GRADE 6000.0 22.0 9TH GRADE 9TH GRADE False NaN 26.0
1982 59.0 9TH GRADE 10000.0 23.0 9TH GRADE 9TH GRADE False NaN 26.0
1983 59.0 9TH GRADE 11000.0 24.0 9TH GRADE 9TH GRADE False NaN 26.0
1984 59.0 9TH GRADE 11500.0 25.0 9TH GRADE 9TH GRADE False NaN 26.0
1985 59.0 12TH GRADE 11000.0 26.0 12TH GRADE 12TH GRADE True 26.0 26.0
1986 59.0 12TH GRADE 14000.0 27.0 12TH GRADE 12TH GRADE False NaN 26.0
1987 59.0 12TH GRADE 16000.0 28.0 12TH GRADE 12TH GRADE False NaN 26.0
1988 59.0 12TH GRADE 20000.0 29.0 12TH GRADE 12TH GRADE False NaN 26.0
1989 59.0 12TH GRADE 19000.0 30.0 12TH GRADE 12TH GRADE False NaN 26.0
1990 59.0 12TH GRADE 20000.0 31.0 12TH GRADE 12TH GRADE False NaN 26.0
1991 59.0 12TH GRADE 20000.0 32.0 12TH GRADE 12TH GRADE False NaN 26.0
1992 59.0 12TH GRADE 22000.0 33.0 12TH GRADE 12TH GRADE False NaN 26.0
1993 59.0 12TH GRADE 25000.0 34.0 12TH GRADE 12TH GRADE False NaN 26.0
1994 59.0 12TH GRADE 0.0 35.0 12TH GRADE 12TH GRADE False NaN 26.0
1996 59.0 12TH GRADE 0.0 37.0 12TH GRADE 12TH GRADE False NaN 26.0
1998 59.0 12TH GRADE 0.0 39.0 12TH GRADE 12TH GRADE False NaN 26.0
2000 59.0 12TH GRADE 0.0 41.0 12TH GRADE 12TH GRADE False NaN 26.0

Exercise

Assume for the moment that if someone’s income is zero they are unemployed and if their income is positive they are employed. Create an indicator variable for ‘person lost their job’ which is true if someone is unemployed (income is zero) in the current year and was employed (income is positive) in the previous year.