5 Working With Hierarchical Data

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is panel or longitudinal data and repeated measures, where things are observed multiple times. These structures may seem very different, but the same concepts apply to both and often even the same code.

5.1 Hierarchical Data Concepts

In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data.

5.1.1 Levels

Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

EXAMPLE DATA STRUCTURE	LEVEL ONE UNIT	LEVEL TWO UNIT
Individuals living in households	Individuals	Households
States grouped into regions	States	Regions
Students in classes	Students	Classes
Tests taken by students	Tests	Students
Monthly observations of individuals	Monthly observations	Individuals
Visits to the doctor by patients	Doctor visits	Patients
Social media posts by individuals	Posts	Individuals

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

EXAMPLE DATA STRUCTURE	LEVEL ONE VARIABLES	LEVEL TWO VARIABLES
Individuals living in households	Age, Sex	Household income, Address
States grouped into regions	State population	Regional income per capita
Students in classes	Student’s race	Teacher’s race, Class size
Tests taken by students	Test Score	Free lunch eligibility
Monthly observations of individuals	Employment Status	Highest degree earned
Visits to the doctor by patients	BMI, Diagnosis	Race
Social media posts by individuals	Length, Topic	Sex

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (monthly observations) have different values for it.

5.1.2 Data Forms

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people for three months:

person_id	month	years_edu	employed
1	1	16	True
1	2	16	True
1	3	16	True
2	1	12	False
2	2	12	True
2	3	12	True

Exercise

Identify the level one units, level two units, level one variable(s), and level two variable(s) in the above data set.

In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

person_id	years_edu	employed_1	employed_2	employed_3
1	16	True	True	True
2	12	False	True	True

In this form, each observation represents a person. Because years_edu is a level two variable, there’s just one value per person and thus one variable (column). However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables (columns).

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more columns. In most cases the long form is easier to work with, so we’ll do most of our examples in this form.

Now consider the indexes in this data set. In the long form, person_id and month are a compound identifier and could be easily turned into a MultiIndex, while the variable names are a simple column identifier. In the wide form, person_id is a simple row identifier, but now the variable names for the level one variables are a compound identifier with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the person_id, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier. We’ll learn how to actually carry out that transformation in the next chapter.

5.1.3 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Data_Practice.ipynb. Have it import Pandas and load acs_clean.pickle:

import pandas as pd
acs = pd.read_pickle('acs_clean.pickle')
acs

		age	race	marital_status	edu	income	female	hispanic
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False
241	1	50	White	Never married	Master's degree	32500.0	True	False
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False
...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False
1236624	2	26	White	Now married	High School graduate	12000.0	True	False
1236756	1	58	White	Now married	Master's degree	69800.0	True	False
1236756	2	61	White	Now married	Master's degree	40800.0	False	False
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False

27410 rows × 7 columns

Exercise

Load the example extract from the National Longitudinal Survey of Youth contained in nlsy.pickle. Identify the level one and level two units in the data and the level one and level two variables. (We’re not so worried about having a streamlined data cleaning workflow in this chapter, so just use your Hierarchical_Data_Practice.ipynb Notebook.)

5.2 Long Form Data

For some purposes, if a hierarchical data set is in long form you don’t have to worry about the hierarchical structure. On the other hand, a very common data wrangling task is creating level two variables based on the level one variables in the data set, and then the hierarchical structure is critical. In long form the critical tool for doing so is groupby(), so that the calculations are carried out separately for each level two unit.

5.2.1 Using groupby()

The groupby() function creates a new object which contains the DataFrame split up into groups. Functions act on that object, carry out calculations on the groups, and can return results in various forms. For example, if you apply basic summary statistic functions, you’ll get one row per group:

acs['income'].groupby('household').mean()

household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...     
1236119    46600.000000
1236287    17800.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64

mean() is an example of an aggregate function: it takes multiple values and returns a single value. Useful aggregate functions include:

mean()
sum() (total)
max()
min()
count() (number of rows with non-missing values)
size() (number of rows)
first() (value for the first observation in the group)
last() (value for the last observation in the group)
nth() (value for the nth observation in the group)

The first five functions listed, mean() through count(), can be applied to a DataFrame, a Series, or the result of a groupby(). However, the last four, size() through nth(), only work with groupby().

With the transform() function, you pass in the name of an aggregate function which is then applied to the groups, but the result is copied or broadcast to all the observations in the group. This is the primary tool you’ll use to add level two variables to a DataFrame. The result is a Series with the same number of rows as the original DataFrame, but the value for level one units in the same level two unit is always the same.

acs['income'].groupby('household').transform('mean')

household  person
37         1          6666.666667
           2          6666.666667
           3          6666.666667
241        1         32500.000000
242        1         30000.000000
                         ...     
1236624    1         31050.000000
           2         31050.000000
1236756    1         55300.000000
           2         55300.000000
1236779    1         22110.000000
Name: income, Length: 27410, dtype: float64

5.2.2 Continuous Variables

We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. Create a column called household_income containing the total income of everyone in the household with:

acs['household_income'] = (
    acs['income'].
    groupby('household').
    transform('sum')
)
acs

		age	race	marital_status	edu	income	female	hispanic	household_income
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0
...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0

27410 rows × 8 columns

Now calculate the household size, or the number of people in the household. Since there is one row per person, we can use the size() function. You still need to have it act on a single column, but it makes no difference which column you choose:

acs['household_size'] = (
    acs['age'].
    groupby('household').
    transform('size')
)
acs

		age	race	marital_status	edu	income	female	hispanic	household_income	household_size
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0	3
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0	3
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0	3
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0	1
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0	1
...	...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0	2
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0	2
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0	2
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0	2
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0	1

27410 rows × 9 columns

Exercise

Create a variable mean_age containg the mean age of the household.

Now calculate the income per person, spot-checking household 484. Since income per person is the same as the mean income for the household, you might think you can use:

acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform('mean')
)
acs.loc[484]

	age	race	marital_status	edu	income	female	hispanic	household_income	household_size	income_per_person
person
1	33	Asian	Now married	High School graduate	16800.0	False	False	34800.0	4	17400.0
2	26	Asian	Now married	11th grade	18000.0	True	False	34800.0	4	17400.0
3	4	Asian	NaN	None	NaN	False	False	34800.0	4	17400.0
4	2	Asian	NaN	NaN	NaN	True	False	34800.0	4	17400.0

This does not give the right answer because of the way mean() handled missing values: it took the mean of the two observed values of income and ignored the two people with NaN. However, we know that NaN for income means the person is under 15 years old, and almost certainly their income is zero or close to it. What we want is the total income of the household divided by the number of people in the household, whether they have observed values of income or not.

Since you already created household_income and household_size, you can just use them:

acs['income_per_person'] = (
    acs['household_income']/acs['household_size']
)
acs.loc[484]

	age	race	marital_status	edu	income	female	hispanic	household_income	household_size	income_per_person
person
1	33	Asian	Now married	High School graduate	16800.0	False	False	34800.0	4	8700.0
2	26	Asian	Now married	11th grade	18000.0	True	False	34800.0	4	8700.0
3	4	Asian	NaN	None	NaN	False	False	34800.0	4	8700.0
4	2	Asian	NaN	NaN	NaN	True	False	34800.0	4	8700.0

Note that we did not need to use groupby()! groupby() is relevant when you’re using an aggregation function and need to control which rows are aggregated (e.g. the people in the same household). This code does not use an aggregation function: income_per_person for each row depends solely on household_income and household_size for the same row.

What if you hadn’t already created household_income and household_size? Is there a way to calculate income_per_person directly? Absolutely. We just need a function to pass to transform() that first calculates the sum and then divides by the size. No such function exists–but we can create one.

5.2.3 Lambda Functions

Python allows you to create and name your own functions, but you most likely won’t need to for data wrangling. You’ll probably be able to do everything you need to do with temporary, unnamed functions, known as lambda functions. You define a lambda function with:

lamba x: (f(x))

where f(x) should be replaced by the function you need. x is a parameter that is being passed in to the function–you can actually give it any name you want.

What seems like the obvious f(x) here is:

lambda x: (x.sum()/x.size())

but that doesn’t work. The reason is that the passed in x is no longer a groupby() object so you can’t use the size() function. However, Series and DataFrames have a size attribute which gives the total number of elements they contain. For a DataFrame this is the number of rows times the number of columns and rarely useful. However, for a Series which is one column from a DataFrame size is the number of rows, and that is exactly what you need:

acs['income'].size

Thus the function is:

lambda x: (x.sum()/x.size)

and the full code is:

acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform(
        lambda x: 
        (x.sum()/x.size)
    )
)
acs.loc[484]

	age	race	marital_status	edu	income	female	hispanic	household_income	household_size	income_per_person
person
1	33	Asian	Now married	High School graduate	16800.0	False	False	34800.0	4	8700.0
2	26	Asian	Now married	11th grade	18000.0	True	False	34800.0	4	8700.0
3	4	Asian	NaN	None	NaN	False	False	34800.0	4	8700.0
4	2	Asian	NaN	NaN	NaN	True	False	34800.0	4	8700.0

The data set is getting kind of wide, so remove the variables you just created (add mean_age if you created it):

acs = acs.drop(
    [
        'household_income',
        'household_size',
        'income_per_person'
    ],
    axis=1
)

5.2.4 Subsetting With Aggregate Functions

Suppose you’re interested in the income children contribute to their household (household 8787 is a good example), so you want to create a child_income containing the household’s total income earned by children. You might think this would be a straightforward process of adding up the income of everyone that’s younger than 18:

acs['child_income'] = (
    acs.loc[acs['age']<18, 'income'].
    groupby('household').
    transform('sum')
)
acs.loc[8787]

	age	race	marital_status	edu	income	female	hispanic	child_income
person
1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	NaN
2	16	White	Never married	9th grade	3600.0	False	False	3800.0
3	15	White	Never married	7th-8th grade	200.0	True	False	3800.0

The trouble is that in excluding the adults from the sum(), you also excluded them from getting the result. Thus child_income is not a proper household-level variable, as it is not the same for all the observations in the same household–and you’d really be in trouble if your research question involved the effect of children’s contributions on the adults in the household. The solution is to create a column that only contains the quantities you want to add up, but can be added up for everyone:

acs['income_if_child'] = acs.loc[acs['age']<18, 'income']
acs.loc[8787]

	age	race	marital_status	edu	income	female	hispanic	child_income	income_if_child
person
1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	NaN	NaN
2	16	White	Never married	9th grade	3600.0	False	False	3800.0	3600.0
3	15	White	Never married	7th-8th grade	200.0	True	False	3800.0	200.0

For children income_if_child contains their income, while for adults it contains NaN. If we now pass income_if_child to sum(), it will add up just the children’s incomes without us having to exclude the adults. Thus the result will be stored in the rows for the household’s adults too:

acs['child_income'] = (
    acs['income_if_child'].
    groupby('household').
    transform('sum')
)
acs.loc[8787]

	age	race	marital_status	edu	income	female	hispanic	child_income	income_if_child
person
1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	3800.0	NaN
2	16	White	Never married	9th grade	3600.0	False	False	3800.0	3600.0
3	15	White	Never married	7th-8th grade	200.0	True	False	3800.0	200.0

Since all the aggregate functions ignore missing values, you can use this approach any time you need to use an aggregate function on a subset of the data:

Create a new column containing the value of interest for the observations to be included and NaN for the values to be excluded
Aggregate that column with no subsetting

Exercise

Find the mean age of the adults in each household.

Again, just to keep things neat drop the variables created in this section:

acs = acs.drop(
    [
        'child_income',
        'income_if_child'
    ],
    axis=1
)

5.2.5 Indicator Variables

As we’ve seen, if you do math with an indicator variable, it will be coerced into a number with True converted to 1 and False converted to 0. Aggregate functions take on new and very useful meanings when applied to indicator variables.

Start by creating an indicator variable for ‘this person is a child’:

acs['child'] = acs['age']<18
acs.loc[[37, 8787]]

		age	race	marital_status	edu	income	female	hispanic	child
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	False
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	False
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	False
8787	1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	False
	2	16	White	Never married	9th grade	3600.0	False	False	True
	3	15	White	Never married	7th-8th grade	200.0	True	False	True

Just to see the conversion in action, use the astype() function and pass in int32 (32-bit integer).

acs.loc[[37, 8787],'child'].astype('int32')

household  person
37         1         0
           2         0
           3         0
8787       1         0
           2         1
           3         1
Name: child, dtype: int32

The sum of child is the number of children in the household:

acs['num_children'] = (
    acs['child'].
    groupby('household').
    transform('sum')
)
acs.loc[[37, 8787, 73731]]

		age	race	marital_status	edu	income	female	hispanic	child	num_children
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	False	0
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	False	0
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	False	0
8787	1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	False	2
	2	16	White	Never married	9th grade	3600.0	False	False	True	2
	3	15	White	Never married	7th-8th grade	200.0	True	False	True	2
73731	1	15	White	Now married	9th grade	0.0	False	False	True	2
73731	2	15	White	Never married	7th-8th grade	0.0	True	False	True	2

The mean of child is the proportion of the household that is children:

acs['prop_children'] = (
    acs['child'].
    groupby('household').
    transform('mean')
)
acs.loc[[37, 8787, 73731]]

		age	race	marital_status	edu	income	female	hispanic	child	num_children	prop_children
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	False	0	0.000000
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	False	0	0.000000
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	False	0	0.000000
8787	1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	False	2	0.666667
	2	16	White	Never married	9th grade	3600.0	False	False	True	2	0.666667
	3	15	White	Never married	7th-8th grade	200.0	True	False	True	2	0.666667
73731	1	15	White	Now married	9th grade	0.0	False	False	True	2	1.000000
73731	2	15	White	Never married	7th-8th grade	0.0	True	False	True	2	1.000000

The maximum value of child tells you if the household has any children or not:

acs['has_children'] = (
    acs['child'].
    groupby('household').
    transform('max')
)
acs.loc[[37, 8787, 73731]]

		age	race	marital_status	edu	income	female	hispanic	child	num_children	prop_children	has_children
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	False	0	0.000000	False
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	False	0	0.000000	False
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	False	0	0.000000	False
8787	1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	False	2	0.666667	True
	2	16	White	Never married	9th grade	3600.0	False	False	True	2	0.666667	True
	3	15	White	Never married	7th-8th grade	200.0	True	False	True	2	0.666667	True
73731	1	15	White	Now married	9th grade	0.0	False	False	True	2	1.000000	True
73731	2	15	White	Never married	7th-8th grade	0.0	True	False	True	2	1.000000	True

This merits some explanation. If a household has no children, everyone in the household has a 0 (False) for child and the maximum value is 0. If a household has children, then at least one person in the household has 1 (True) for child, and the maximum value is 1. Python is clever enough to realize that the max() of a boolean variable is also a boolean variable, so it stores True/False in has_children rather than 1/0.

The minimum value of child tells you if the household consists entirely of children:

acs['all_children'] = (
    acs['child'].
    groupby('household').
    transform('min')
)
acs.loc[[37, 8787, 73731]]

		age	race	marital_status	edu	income	female	hispanic	child	num_children	prop_children	has_children	all_children
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	False	0	0.000000	False	False
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	False	0	0.000000	False	False
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	False	0	0.000000	False	False
8787	1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	False	2	0.666667	True	False
	2	16	White	Never married	9th grade	3600.0	False	False	True	2	0.666667	True	False
	3	15	White	Never married	7th-8th grade	200.0	True	False	True	2	0.666667	True	False
73731	1	15	White	Now married	9th grade	0.0	False	False	True	2	1.000000	True	True
73731	2	15	White	Never married	7th-8th grade	0.0	True	False	True	2	1.000000	True	True

If everyone in the household is a child, everyone has a 1 (True) for child and the minimum value is 1 (True). If anyone in the household is not a child, then at least one person has a 0 (False) for child and the minimum is 0 (False).

Exercise

Create a variable containing the number of people in the household with income greater than zero and an indicator variable for ‘this household has someone with income>0’.

5.3 Wide Form Data

Up to this point we’ve worked with the American Community Survey in long form, with one observation per person. Some tasks are easier to do in long form than in wide form, but others are just as easy either way–once you know the tricks for working in wide form.

The file acs_wide.pickle contains the same ACS data we’ve been working with, but in wide form. Read it in as our new acs:

acs = pd.read_pickle('acs_wide.pickle')
acs

	age_1	age_2	age_3	age_4	age_5	age_6	age_7	age_8	age_9	age_10	...	female_7	female_8	female_9	female_10	female_11	female_12	female_13	female_14	female_15	female_16
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 112 columns

Before proceeding, let’s review what we’re seeing here. In wide form, there is one row per household rather than one row per person as in long form. Each person is described by a set of columns. For example, person 1 is described by age_1, race_1, etc. There are 16 people in one household, so there are 16 of all the columns (i.e. they go up to age_16, race_16, etc.). Smaller households will have NaN in the columns for the people they don’t have. For example, household 37 has three people, so age_1, age_2, and age_3 have values while all the others have NaN.

In long form, household and person made up a MultiIndex row index, while the variable names (age, race, etc.) made up the column index. In wide form household is the row index. We could use the variable names plus person as a MultiIndex column index, and you’ll see what that looks like in the next chapter. However, we have not found a good way to have a MultiIndex describe data that contain both level 1 and level 2 variables. Thus rather than using an official MultiIndex, the column names contain both the variable name and the person number. This seems to be the most common approach.

5.3.1 Subsetting in Wide Form

If we were using a MultiIndex for the columns we could use loc() for subsetting just like we did in long form. However, since the information we need is embedded in the column names, we’ll use filter() instead.

The filter() function allows you to subset based on patterns in the column names. A very simple pattern is like=. This will select all the columns that contain the string you pass in. Select all the income columns by passing in like='income':

acs.filter(like='income').dtypes

income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object

We’ll learn shortly how to use this to create level two variables like household_income. But that raises a problem: once you create a household_income variable, like='income' will select it too.

Now consider trying to select all the variables that describe person 1. You might think like='_1' would do the job, but try it:

acs.filter(like='_1').dtypes

age_1                 float64
age_10                float64
age_11                float64
age_12                float64
age_13                float64
age_14                float64
age_15                float64
age_16                float64
hispanic_1             object
hispanic_10            object
hispanic_11            object
hispanic_12            object
hispanic_13            object
hispanic_14            object
hispanic_15            object
hispanic_16            object
race_1               category
race_10              category
race_11              category
race_12              category
race_13              category
race_14              category
race_15              category
race_16              category
marital_status_1     category
marital_status_10    category
marital_status_11    category
marital_status_12    category
marital_status_13    category
marital_status_14    category
marital_status_15    category
marital_status_16    category
edu_1                category
edu_10               category
edu_11               category
edu_12               category
edu_13               category
edu_14               category
edu_15               category
edu_16               category
income_1              float64
income_10             float64
income_11             float64
income_12             float64
income_13             float64
income_14             float64
income_15             float64
income_16             float64
female_1               object
female_10              object
female_11              object
female_12              object
female_13              object
female_14              object
female_15              object
female_16              object
dtype: object

Instead we got the variables for persons 10 through 16 as well, because they also contain ’_1’.

What we need is the ability to be much more specific about the columns to be selected. The solution is regular expressions. Regular expressions are a syntax for describing a pattern in text so that a computer can detect text that matches it. They’ve been around for more than 50 years and most programming languages and statistical packages implement them. They can be used for a variety of purposes; we won’t discuss working with text data (strings) but we’re laying the foundation by introducing you to regular expressions. Regular expressions can be extremely complex. We’ll introduce you to just a few of the most important features but you can find many, many tutorials online.

You use a regular expression with filter() by passing in regex=. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as like=:

acs.filter(regex='income').dtypes

income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object

But a regular expression can do much more. Here are the tools we’ll use:

If a regular expression starts with ^, then the string only matches if it starts with the pattern.
If a regular expression ends with $, then the string only matches if it ends with the pattern.
\d matches any digit (number).
A * after something means the string can contain any number of that thing at that point. So \d* means the string can contain any number of digits at that point.

Putting this all together, the regular expression ^income_\d*$ will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (income_1, income_2, etc.) without also matching columns like household_income:

acs.filter(regex='^income_\d*$').dtypes

income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object

The regular expression _1$ will match strings that end with ’_1’. This excludes variables that end with ’_10’, ’_11’, etc. and thus allows us to select just the columns describing person 1:

acs.filter(regex='_1$').dtypes

age_1                float64
hispanic_1            object
race_1              category
marital_status_1    category
edu_1               category
income_1             float64
female_1              object
dtype: object

Exercise

Use a regular expression to select all the race variables. Then use one to select all the variables for person 5.

5.3.2 Creating Level Two Variables

When we used the sum() function in the last section to add up household incomes, it was adding all the values in a column. However, with the axis=1 argument it will add up values across a row. You can use this along with filter() to calculate household income in wide form:

acs['household_income'] = (
    acs.filter(regex='^income_\d*$').
    sum(axis=1)
)
acs['household_income']

household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: household_income, Length: 10565, dtype: float64

If you want to be able to see both the individual incomes and household_income so you can spot-check the results, go ahead and use like='income' (and scroll right):

acs.filter(like='income')

	income_1	income_2	income_3	income_4	income_5	income_6	income_7	income_8	income_9	income_10	income_11	income_12	income_13	income_14	income_15	income_16	household_income
household
37	10000.0	5300.0	4700.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0
241	32500.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	32500.0
242	30000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	30000.0
377	51900.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	51900.0
418	12200.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	12200.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	62200.0	31000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	93200.0
1236287	15000.0	27000.0	11400.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	53400.0
1236624	50100.0	12000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	62100.0
1236756	69800.0	40800.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	110600.0
1236779	22110.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22110.0

10565 rows × 17 columns

This works for all of the aggregate functions from the last section, except those that only work with groupby(). Thus the code to carry out the tasks from the last section is almost the same: just use filter() instead of loc to select the right subsample, don’t bother with transform(), and pass axis=1 to the aggregate function itself.

How can we find the size of each household (number of people)? Recall that count() gives the number of non-missing values. So as long as there’s at least one variable with no missing values in the data set, you can use count() to find the number of people. This data set has multiple candidates, but use age:

acs['size'] = (
    acs.filter(regex='^age_\d*$').
    count(axis=1)
)
acs['size'].value_counts()

size
2     3412
1     2779
3     1685
4     1479
5      760
6      273
7       96
8       47
9       13
10      10
12       7
11       3
16       1
Name: count, dtype: int64

What about income per person? That’s actually easier than in long form. There’s no need to broadcast the result to all the rows in the same household, so you don’t need transform() or a lambda function. Just add up the household income and divide by the number of people in the household:

acs['income_per_person'] = (
    acs.filter(regex='^income_\d$').
    sum(axis=1) / 
    acs.filter(regex='^age_\d*$')
    .count(axis=1)
)
acs.filter(like='income').loc[484]

income_1             16800.0
income_2             18000.0
income_3                 NaN
income_4                 NaN
income_5                 NaN
income_6                 NaN
income_7                 NaN
income_8                 NaN
income_9                 NaN
income_10                NaN
income_11                NaN
income_12                NaN
income_13                NaN
income_14                NaN
income_15                NaN
income_16                NaN
household_income     34800.0
income_per_person     8700.0
Name: 484, dtype: float64

Recall that household 484 has two children who have missing values for income because they are under 15.

Exercise

Find the mean age for each household. Then create an indicator variable for ‘everyone in this household is female’.

5.3.3 Creating Level One Variables

Creating level one variables in long form is straightforward. For example, to create an indicator for ‘this person is black’ you’d just run:

acs['black'] = (acs['race']=='Black')

But in wide form there are 16 race variables, and there will need to be 16 indicators for Black as well. If you need to do a lot of work with level one variables, this is a pretty good reason to convert your data to long form. We’ll learn how in the next chapter. But it’s not too hard to make all 16 variables as long as you use a for loop.

Once again, we’ll start by considering what it would take to create just one of the 16 new indicator variables:

acs['black_1'] = (acs['race_1']=='Black')

That would take care of person 1. What would need to change in order to take care of person 2? The number at the end of the variable names. That tells us that this time we need to loop over a range of numbers. You can do that with the range() function. In its simplest form it takes two arguments, a starting number and an ending number. In true Python style the ending number is not included. To see how this works in a loop, try:

for i in range(1, 17): print(i)

So far so good, but how do we get those numbers into the variable names? If you add two strings together, they’ll be concatenated:

'Hello ' + 'world!'

'Hello world!'

The only wrinkle is that i is a number, But you can convert it to a string with the str() function and then add it:

'black_'+str(i)

'black_16'

This allows us to build the variable names we need, so now you’re ready to write the loop. One last complication: remember that any comparison involving NaN returns False. If a race variable is NaN the corresponding black variable should also be NaN, so select only the rows where race is not NaN:

for i in range(1,17): acs.loc[
    acs['race_'+str(i)].notna(),
    'black_'+str(i)
] = (acs['race_'+str(i)]=='Black')
acs.filter(like='black')

	black_1	black_2	black_3	black_4	black_5	black_6	black_7	black_8	black_9	black_10	black_11	black_12	black_13	black_14	black_15	black_16
household
37	False	False	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 16 columns

Exercise

Create a set of indicator variables for ‘this person has experienced the end of a marriage’. This would include anyone who is divorced, widowed, or separated (we’ll call that an ‘end’, though you could argue otherwise), and not anyone who is currently married or has never been married.

5.4 Panel Data

Panel data or longitudinal data is just another form of hierarchical data, with subjects as level two units and times they were observed as level one units. With panel data, the timing of the observations or at least their order is important. If it’s not, then we refer to it as repeated measures data. Since panel data is just hierarchical data, a lot of the code we’ve already learned applies directly to panel data, but there are some panel-specific tricks too.

You should already have the NLSY extract in memory from the second exercise. If not, load it with:

nlsy = pd.read_pickle('nlsy.pickle')
nlsy

		birth_year	edu	income	age
id	year
1.0	1979	58.0	12TH GRADE	4620.0	21.0
	1980	58.0	NaN	NaN	22.0
	1981	58.0	12TH GRADE	5000.0	23.0
	1982	58.0	NaN	NaN	24.0
	1983	58.0	NaN	NaN	25.0
...	...	...	...	...	...
12686.0	1993	60.0	12TH GRADE	31000.0	33.0
	1994	60.0	NaN	NaN	34.0
	1996	60.0	NaN	NaN	36.0
	1998	60.0	NaN	NaN	38.0
	2000	60.0	NaN	NaN	40.0

241034 rows × 4 columns

5.4.1 Creating Level Two Variables

The National Longitudinal Survey of Youth tracks young people for many years, making it classic panel data. In the NLSY a level two unit is a year and a level one unit is a person-year combination. You can create level two variables using the same tools we used with the ACS. For example, create max_edu containing the person’s highest level of education with:

nlsy['max_edu'] = (
    nlsy['edu'].
    groupby('id').
    transform('max')
)
nlsy.loc[5]

	birth_year	edu	income	age	max_edu
year
1979	59.0	1ST YEAR COLLEGE	2200.0	20.0	6TH YEAR COLLEGE
1980	59.0	2ND YEAR COLLEGE	2000.0	21.0	6TH YEAR COLLEGE
1981	59.0	3RD YEAR COLLEGE	3400.0	22.0	6TH YEAR COLLEGE
1982	59.0	4TH YEAR COLLEGE	2300.0	23.0	6TH YEAR COLLEGE
1983	59.0	5TH YEAR COLLEGE	2200.0	24.0	6TH YEAR COLLEGE
1984	59.0	6TH YEAR COLLEGE	10500.0	25.0	6TH YEAR COLLEGE
1985	59.0	NaN	NaN	26.0	6TH YEAR COLLEGE
1986	59.0	6TH YEAR COLLEGE	43119.0	27.0	6TH YEAR COLLEGE
1987	59.0	NaN	NaN	28.0	6TH YEAR COLLEGE
1988	59.0	6TH YEAR COLLEGE	57124.0	29.0	6TH YEAR COLLEGE
1989	59.0	6TH YEAR COLLEGE	62836.0	30.0	6TH YEAR COLLEGE
1990	59.0	NaN	NaN	31.0	6TH YEAR COLLEGE
1991	59.0	NaN	NaN	32.0	6TH YEAR COLLEGE
1992	59.0	NaN	NaN	33.0	6TH YEAR COLLEGE
1993	59.0	NaN	NaN	34.0	6TH YEAR COLLEGE
1994	59.0	NaN	NaN	35.0	6TH YEAR COLLEGE
1996	59.0	NaN	NaN	37.0	6TH YEAR COLLEGE
1998	59.0	NaN	NaN	39.0	6TH YEAR COLLEGE
2000	59.0	NaN	NaN	41.0	6TH YEAR COLLEGE

Note that this relies on edu being an ordered categorical variable.

The first() and last() functions are particularly useful with panel data. But their meaning depends on the order the data are sorted in. This data set appears to be in chronological order, but you shouldn’t trust that if you’re going to write code that will give you the wrong answer if it is not. So before using the first() function to find the person’s age at the start of the study, sort by id and year:

nlsy = nlsy.sort_values(['id', 'year'])
nlsy['starting_age'] = (
    nlsy['age'].
    groupby('id').
    transform('first')
)
nlsy

		birth_year	edu	income	age	max_edu	starting_age
id	year
1.0	1979	58.0	12TH GRADE	4620.0	21.0	12TH GRADE	21.0
	1980	58.0	NaN	NaN	22.0	12TH GRADE	21.0
	1981	58.0	12TH GRADE	5000.0	23.0	12TH GRADE	21.0
	1982	58.0	NaN	NaN	24.0	12TH GRADE	21.0
	1983	58.0	NaN	NaN	25.0	12TH GRADE	21.0
...	...	...	...	...	...	...	...
12686.0	1993	60.0	12TH GRADE	31000.0	33.0	12TH GRADE	19.0
	1994	60.0	NaN	NaN	34.0	12TH GRADE	19.0
	1996	60.0	NaN	NaN	36.0	12TH GRADE	19.0
	1998	60.0	NaN	NaN	38.0	12TH GRADE	19.0
	2000	60.0	NaN	NaN	40.0	12TH GRADE	19.0

241034 rows × 6 columns

Exercise

Use the last() function to find each subject’s last value of income. Does this behave as you expected?

Before proceeding, remove the variables we’ve created thus far just to keep the data set neat:

nlsy = nlsy.drop(
    [
        'max_edu',
        'starting_age'
    ], axis=1
)

5.4.2 Filling In Gaps

The NLSY frequently has missing data as subjects were frequently not located in a given year. Somewhat unusually for panel data, they often were able to find people in later years so there are gaps as well as attrition. Consider what we know about the education of person 5:

nlsy.loc[11]

	birth_year	edu	income	age
year
1979	59.0	1ST YEAR COLLEGE	2900.0	20.0
1980	59.0	NaN	NaN	21.0
1981	59.0	3RD YEAR COLLEGE	2000.0	22.0
1982	59.0	4TH YEAR COLLEGE	4002.0	23.0
1983	59.0	4TH YEAR COLLEGE	8000.0	24.0
1984	59.0	4TH YEAR COLLEGE	17000.0	25.0
1985	59.0	4TH YEAR COLLEGE	20000.0	26.0
1986	59.0	NaN	NaN	27.0
1987	59.0	NaN	NaN	28.0
1988	59.0	NaN	NaN	29.0
1989	59.0	4TH YEAR COLLEGE	46000.0	30.0
1990	59.0	4TH YEAR COLLEGE	74283.0	31.0
1991	59.0	4TH YEAR COLLEGE	81481.0	32.0
1992	59.0	4TH YEAR COLLEGE	55000.0	33.0
1993	59.0	4TH YEAR COLLEGE	60000.0	34.0
1994	59.0	4TH YEAR COLLEGE	101653.0	35.0
1996	59.0	4TH YEAR COLLEGE	60000.0	37.0
1998	59.0	4TH YEAR COLLEGE	25000.0	39.0
2000	59.0	4TH YEAR COLLEGE	55000.0	41.0

They were not interviewed in 1986, 1987, and 1988, but given that their education level both before and after that gap is ‘4TH YEAR COLLEGE’ we know it was ‘4TH YEAR COLLEGE’ in those years as well. Meanwhile, while we might think that in 1980 their education level was ‘2ND YEAR COLLEGE’ that is less certain–and if that were a two-year gap we’d have no idea. So let’s fill in gaps only where the edu is the same before and after the gap.

To actually do that, we’ll fill in all the gaps in two ways: first by carrying past values forward with the ffill() function, and then by carrying future values back with the bfill() function. If the two methods agree, then that is a gap we can fill. While ffill() and bfill() are not aggregate functions, it’s still important to use groupby('id') because that ensures each person is considered separately and no values are carried from one person to another.

nlsy['edu_forward'] = (
    nlsy['edu'].
    groupby('id').
    ffill()
)
nlsy['edu_backward'] = (
    nlsy['edu'].
    groupby('id').
    bfill()
)
nlsy.loc[11]

	birth_year	edu	income	age	edu_forward	edu_backward
year
1979	59.0	1ST YEAR COLLEGE	2900.0	20.0	1ST YEAR COLLEGE	1ST YEAR COLLEGE
1980	59.0	NaN	NaN	21.0	1ST YEAR COLLEGE	3RD YEAR COLLEGE
1981	59.0	3RD YEAR COLLEGE	2000.0	22.0	3RD YEAR COLLEGE	3RD YEAR COLLEGE
1982	59.0	4TH YEAR COLLEGE	4002.0	23.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1983	59.0	4TH YEAR COLLEGE	8000.0	24.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1984	59.0	4TH YEAR COLLEGE	17000.0	25.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1985	59.0	4TH YEAR COLLEGE	20000.0	26.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1986	59.0	NaN	NaN	27.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1987	59.0	NaN	NaN	28.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1988	59.0	NaN	NaN	29.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1989	59.0	4TH YEAR COLLEGE	46000.0	30.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1990	59.0	4TH YEAR COLLEGE	74283.0	31.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1991	59.0	4TH YEAR COLLEGE	81481.0	32.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1992	59.0	4TH YEAR COLLEGE	55000.0	33.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1993	59.0	4TH YEAR COLLEGE	60000.0	34.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1994	59.0	4TH YEAR COLLEGE	101653.0	35.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1996	59.0	4TH YEAR COLLEGE	60000.0	37.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1998	59.0	4TH YEAR COLLEGE	25000.0	39.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
2000	59.0	4TH YEAR COLLEGE	55000.0	41.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE

Now we’ll replace edu if edu is missing and edu_forward and edu_backward are the same. Since they are the same it doesn’t matter which of them we set edu to:

nlsy.loc[
    nlsy['edu'].isna() & 
    (nlsy['edu_forward']==nlsy['edu_backward']),
    'edu'      
] = nlsy['edu_forward']
nlsy.loc[11]

	birth_year	edu	income	age	edu_forward	edu_backward
year
1979	59.0	1ST YEAR COLLEGE	2900.0	20.0	1ST YEAR COLLEGE	1ST YEAR COLLEGE
1980	59.0	NaN	NaN	21.0	1ST YEAR COLLEGE	3RD YEAR COLLEGE
1981	59.0	3RD YEAR COLLEGE	2000.0	22.0	3RD YEAR COLLEGE	3RD YEAR COLLEGE
1982	59.0	4TH YEAR COLLEGE	4002.0	23.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1983	59.0	4TH YEAR COLLEGE	8000.0	24.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1984	59.0	4TH YEAR COLLEGE	17000.0	25.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1985	59.0	4TH YEAR COLLEGE	20000.0	26.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1986	59.0	4TH YEAR COLLEGE	NaN	27.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1987	59.0	4TH YEAR COLLEGE	NaN	28.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1988	59.0	4TH YEAR COLLEGE	NaN	29.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1989	59.0	4TH YEAR COLLEGE	46000.0	30.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1990	59.0	4TH YEAR COLLEGE	74283.0	31.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1991	59.0	4TH YEAR COLLEGE	81481.0	32.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1992	59.0	4TH YEAR COLLEGE	55000.0	33.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1993	59.0	4TH YEAR COLLEGE	60000.0	34.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1994	59.0	4TH YEAR COLLEGE	101653.0	35.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1996	59.0	4TH YEAR COLLEGE	60000.0	37.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1998	59.0	4TH YEAR COLLEGE	25000.0	39.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE
2000	59.0	4TH YEAR COLLEGE	55000.0	41.0	4TH YEAR COLLEGE	4TH YEAR COLLEGE

Note that we will not attempt to fill in other gaps where we are not perfectly certain what the missing value should be. There are more or less valid methods for filling in missing values like multiple imputation, but they are complicated and a lot of work. Methods that are less complicated and less work, like filling in the mean or linear interpolation, can create serious problems with any statistical analysis. It’s unusual to be able to fill in gaps like we were able to here but worth doing when you can.

5.4.3 Variables Based On An Observations Neighbors

The shift() function can shift observations forward or backwards: forward if you pass in a positive number, backwards if you pass in a negative number. For easy viewing, let’s build a DataFrame by passing a dictionary to the Pandas DataFrame function. The dictionary must contain the column names as keys and Series (which we’ll create based on the edu Series in nsly) as values. Our DataFrame will have one column for the original edu, one for shifting forward, and one for shifting backward:

pd.DataFrame(
    {
        'edu' : nlsy.loc[11, 'edu'],
        'edu_forward_1' : nlsy.loc[11, 'edu'].shift(1),
        'edu_backward_1' : nlsy.loc[11, 'edu'].shift(-1) 
    }
)

	edu	edu_forward_1	edu_backward_1
year
1979	1ST YEAR COLLEGE	NaN	NaN
1980	NaN	1ST YEAR COLLEGE	3RD YEAR COLLEGE
1981	3RD YEAR COLLEGE	NaN	4TH YEAR COLLEGE
1982	4TH YEAR COLLEGE	3RD YEAR COLLEGE	4TH YEAR COLLEGE
1983	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1984	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1985	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1986	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1987	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1988	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1989	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1990	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1991	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1992	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1993	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1994	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1996	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1998	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
2000	4TH YEAR COLLEGE	4TH YEAR COLLEGE	NaN

Note that when shifting forward you’ll have NaN at the beginning of the series, and when shifting backwards at the end. In this case we did not need groupby() because we extracted just one person, but in general you’ll need it to prevent values from being shifted between people.

Now consider trying to identify the year in which a person graduates from high school. In that year their edu will be ‘12TH GRADE’ or higher and the year before their edu will be less than ‘12TH GRADE’. Note that for whatever reason the NLSY frequently records people as having gained multiple years of education in one year, and we won’t worry about that. For person 2, the year in which they graduated is 1985:

nlsy.loc[2]

	birth_year	edu	income	age	edu_forward	edu_backward
year
1979	59.0	9TH GRADE	4000.0	20.0	9TH GRADE	9TH GRADE
1980	59.0	9TH GRADE	5000.0	21.0	9TH GRADE	9TH GRADE
1981	59.0	9TH GRADE	6000.0	22.0	9TH GRADE	9TH GRADE
1982	59.0	9TH GRADE	10000.0	23.0	9TH GRADE	9TH GRADE
1983	59.0	9TH GRADE	11000.0	24.0	9TH GRADE	9TH GRADE
1984	59.0	9TH GRADE	11500.0	25.0	9TH GRADE	9TH GRADE
1985	59.0	12TH GRADE	11000.0	26.0	12TH GRADE	12TH GRADE
1986	59.0	12TH GRADE	14000.0	27.0	12TH GRADE	12TH GRADE
1987	59.0	12TH GRADE	16000.0	28.0	12TH GRADE	12TH GRADE
1988	59.0	12TH GRADE	20000.0	29.0	12TH GRADE	12TH GRADE
1989	59.0	12TH GRADE	19000.0	30.0	12TH GRADE	12TH GRADE
1990	59.0	12TH GRADE	20000.0	31.0	12TH GRADE	12TH GRADE
1991	59.0	12TH GRADE	20000.0	32.0	12TH GRADE	12TH GRADE
1992	59.0	12TH GRADE	22000.0	33.0	12TH GRADE	12TH GRADE
1993	59.0	12TH GRADE	25000.0	34.0	12TH GRADE	12TH GRADE
1994	59.0	12TH GRADE	0.0	35.0	12TH GRADE	12TH GRADE
1996	59.0	12TH GRADE	0.0	37.0	12TH GRADE	12TH GRADE
1998	59.0	12TH GRADE	0.0	39.0	12TH GRADE	12TH GRADE
2000	59.0	12TH GRADE	0.0	41.0	12TH GRADE	12TH GRADE

Now create an indicator variable for ‘graduated this year’ with:

nlsy['grad'] = (
    (nlsy['edu'] >= '12TH GRADE') & 
    (nlsy['edu'].groupby('id').shift(1) < '12TH GRADE')
)
nlsy.loc[2]

	birth_year	edu	income	age	edu_forward	edu_backward	grad
year
1979	59.0	9TH GRADE	4000.0	20.0	9TH GRADE	9TH GRADE	False
1980	59.0	9TH GRADE	5000.0	21.0	9TH GRADE	9TH GRADE	False
1981	59.0	9TH GRADE	6000.0	22.0	9TH GRADE	9TH GRADE	False
1982	59.0	9TH GRADE	10000.0	23.0	9TH GRADE	9TH GRADE	False
1983	59.0	9TH GRADE	11000.0	24.0	9TH GRADE	9TH GRADE	False
1984	59.0	9TH GRADE	11500.0	25.0	9TH GRADE	9TH GRADE	False
1985	59.0	12TH GRADE	11000.0	26.0	12TH GRADE	12TH GRADE	True
1986	59.0	12TH GRADE	14000.0	27.0	12TH GRADE	12TH GRADE	False
1987	59.0	12TH GRADE	16000.0	28.0	12TH GRADE	12TH GRADE	False
1988	59.0	12TH GRADE	20000.0	29.0	12TH GRADE	12TH GRADE	False
1989	59.0	12TH GRADE	19000.0	30.0	12TH GRADE	12TH GRADE	False
1990	59.0	12TH GRADE	20000.0	31.0	12TH GRADE	12TH GRADE	False
1991	59.0	12TH GRADE	20000.0	32.0	12TH GRADE	12TH GRADE	False
1992	59.0	12TH GRADE	22000.0	33.0	12TH GRADE	12TH GRADE	False
1993	59.0	12TH GRADE	25000.0	34.0	12TH GRADE	12TH GRADE	False
1994	59.0	12TH GRADE	0.0	35.0	12TH GRADE	12TH GRADE	False
1996	59.0	12TH GRADE	0.0	37.0	12TH GRADE	12TH GRADE	False
1998	59.0	12TH GRADE	0.0	39.0	12TH GRADE	12TH GRADE	False
2000	59.0	12TH GRADE	0.0	41.0	12TH GRADE	12TH GRADE	False

Note that if either edu or shifted edu is NaN, the result will be false. Thus this indicator variable is really for ‘we know the person graduated this year.’ We could set it to NaN for years with missing values if we wanted to, using the techniques we’ve used before.

Logically each person should only graduate from high school once, but it’s not hard to image data entry errors that would make it look like they graduated multiple times. Recall that the sum of an indicator variable is the number of times it is true, so check with:

(
    nlsy['grad'].
    groupby('id').
    sum().
    value_counts()
)

grad
0    7033
1    5653
Name: count, dtype: int64

A little over half the subjects have True for grad zero times: either they never graduated from high school, they graduated outside the study period, or we could not identify their graduation year due to missing data. But no one graduated more than once.

Now let’s find their age at graduation. This is just a matter of subsetting with an aggregate function, except that the desired subset has just one observation per group. So start by creating age_if_grad containing the person’s age for the year they graduated and NaN for all others.

In the past we’ve done this with code like:

nlsy.loc[nlsy['grad'], 'age_if_grad'] = nlsy['age']

and that would absolutely work. But let’s learn an alternative method. The numpy where() function takes three arguments: a condition, the result to return if the condition is true, and the result to return if the condition is false. Build age_if_grad by importing numpy as np and then passing to where() the three arguments grad, age, and NaN:

import numpy as np
nlsy['age_if_grad'] = np.where(
    nlsy['grad'],
    nlsy['age'],
    np.NaN
)

nlsy.loc[2]

	birth_year	edu	income	age	edu_forward	edu_backward	grad	age_if_grad
year
1979	59.0	9TH GRADE	4000.0	20.0	9TH GRADE	9TH GRADE	False	NaN
1980	59.0	9TH GRADE	5000.0	21.0	9TH GRADE	9TH GRADE	False	NaN
1981	59.0	9TH GRADE	6000.0	22.0	9TH GRADE	9TH GRADE	False	NaN
1982	59.0	9TH GRADE	10000.0	23.0	9TH GRADE	9TH GRADE	False	NaN
1983	59.0	9TH GRADE	11000.0	24.0	9TH GRADE	9TH GRADE	False	NaN
1984	59.0	9TH GRADE	11500.0	25.0	9TH GRADE	9TH GRADE	False	NaN
1985	59.0	12TH GRADE	11000.0	26.0	12TH GRADE	12TH GRADE	True	26.0
1986	59.0	12TH GRADE	14000.0	27.0	12TH GRADE	12TH GRADE	False	NaN
1987	59.0	12TH GRADE	16000.0	28.0	12TH GRADE	12TH GRADE	False	NaN
1988	59.0	12TH GRADE	20000.0	29.0	12TH GRADE	12TH GRADE	False	NaN
1989	59.0	12TH GRADE	19000.0	30.0	12TH GRADE	12TH GRADE	False	NaN
1990	59.0	12TH GRADE	20000.0	31.0	12TH GRADE	12TH GRADE	False	NaN
1991	59.0	12TH GRADE	20000.0	32.0	12TH GRADE	12TH GRADE	False	NaN
1992	59.0	12TH GRADE	22000.0	33.0	12TH GRADE	12TH GRADE	False	NaN
1993	59.0	12TH GRADE	25000.0	34.0	12TH GRADE	12TH GRADE	False	NaN
1994	59.0	12TH GRADE	0.0	35.0	12TH GRADE	12TH GRADE	False	NaN
1996	59.0	12TH GRADE	0.0	37.0	12TH GRADE	12TH GRADE	False	NaN
1998	59.0	12TH GRADE	0.0	39.0	12TH GRADE	12TH GRADE	False	NaN
2000	59.0	12TH GRADE	0.0	41.0	12TH GRADE	12TH GRADE	False	NaN

You may find that syntax clearer. where() would definitely be easier to use than the first syntax if you needed to set the rows where grad was false to something other than NaN.

age_if_grad for each person is a single number plus a bunch of NaNs. Many aggregate functions have the property that if you give them a single number they’ll return that number, so we could use any of them. We’ll go with mean:

nlsy['age_at_grad'] = (
    nlsy['age_if_grad'].
    groupby('id').
    transform('mean')
)
nlsy.loc[2]

	birth_year	edu	income	age	edu_forward	edu_backward	grad	age_if_grad	age_at_grad
year
1979	59.0	9TH GRADE	4000.0	20.0	9TH GRADE	9TH GRADE	False	NaN	26.0
1980	59.0	9TH GRADE	5000.0	21.0	9TH GRADE	9TH GRADE	False	NaN	26.0
1981	59.0	9TH GRADE	6000.0	22.0	9TH GRADE	9TH GRADE	False	NaN	26.0
1982	59.0	9TH GRADE	10000.0	23.0	9TH GRADE	9TH GRADE	False	NaN	26.0
1983	59.0	9TH GRADE	11000.0	24.0	9TH GRADE	9TH GRADE	False	NaN	26.0
1984	59.0	9TH GRADE	11500.0	25.0	9TH GRADE	9TH GRADE	False	NaN	26.0
1985	59.0	12TH GRADE	11000.0	26.0	12TH GRADE	12TH GRADE	True	26.0	26.0
1986	59.0	12TH GRADE	14000.0	27.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1987	59.0	12TH GRADE	16000.0	28.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1988	59.0	12TH GRADE	20000.0	29.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1989	59.0	12TH GRADE	19000.0	30.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1990	59.0	12TH GRADE	20000.0	31.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1991	59.0	12TH GRADE	20000.0	32.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1992	59.0	12TH GRADE	22000.0	33.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1993	59.0	12TH GRADE	25000.0	34.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1994	59.0	12TH GRADE	0.0	35.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1996	59.0	12TH GRADE	0.0	37.0	12TH GRADE	12TH GRADE	False	NaN	26.0
1998	59.0	12TH GRADE	0.0	39.0	12TH GRADE	12TH GRADE	False	NaN	26.0
2000	59.0	12TH GRADE	0.0	41.0	12TH GRADE	12TH GRADE	False	NaN	26.0

Exercise

Assume for the moment that if someone’s income is zero they are unemployed and if their income is positive they are employed. Create an indicator variable for ‘person lost their job’ which is true if someone is unemployed (income is zero) in the current year and was employed (income is positive) in the previous year.