5 Working With Hierarchical Data in Long Form

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is panel or longitudinal data and repeated measures, where things are observed multiple times. These structures may seem very different, but the same concepts apply to both and often even the same code.

In this chapter we’ll focus on working with hierarchical data in long form, and the next will discuss working with it in wide form (we’ll define the long and wide forms momentarily).

5.1 Hierarchical Data Concepts

In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data.

5.1.1 Levels

Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

EXAMPLE DATA STRUCTURE	LEVEL ONE UNIT	LEVEL TWO UNIT
Individuals living in households	Individuals	Households
States grouped into regions	States	Regions
Students in classes	Students	Classes
Tests taken by students	Tests	Students
Monthly observations of individuals	Monthly observations	Individuals
Visits to the doctor by patients	Doctor visits	Patients
Social media posts by individuals	Posts	Individuals

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

EXAMPLE DATA STRUCTURE	LEVEL ONE VARIABLES	LEVEL TWO VARIABLES
Individuals living in households	Age, Sex	Household income, Address
States grouped into regions	State population	Regional income per capita
Students in classes	Student’s race	Teacher’s race, Class size
Tests taken by students	Test Score	Free lunch eligibility
Monthly observations of individuals	Employment status	Highest degree earned
Visits to the doctor by patients	BMI, Diagnosis	Race
Social media posts by individuals	Length, Topic	Sex

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (monthly observations) have different values for it.

5.1.2 Data Forms

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people for three months:

person_id	month	years_edu	employed
1	1	16	True
1	2	16	True
1	3	16	True
2	1	12	False
2	2	12	True
2	3	12	True

Exercise

Identify the level one units, level two units, level one variable(s), and level two variable(s) in the above data set.

Solution

In this data set, the level one unit is a month and the level two unit is a person. years_edu is a level two variable because all the observations for a person have the same value of years_edu, and employed is a level one variable because it changes.

In this form shown above, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

person_id	years_edu	employed_1	employed_2	employed_3
1	16	True	True	True
2	12	False	True	True

In this form, each observation represents a person. Because years_edu is a level two variable, there’s just one value per person and thus one variable (column). However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables (columns).

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more columns. In most cases the long form is easier to work with.

Now consider the indexes in this data set. In the long form, person_id and month are a compound identifier and could be easily turned into a MultiIndex, while the variable names are a simple column identifier. In the wide form, person_id is a simple row identifier, but now the variable names for the level one variables are a compound identifier with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the person_id, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier. We’ll learn how to actually carry out that transformation in the chapter on Restructuring Data Sets.

5.2 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Long_Practice.ipynb. Have it import Pandas and NumPy, then load acs.pickle:

import pandas as pd
import numpy as np
acs = pd.read_pickle('acs.pickle')
acs

		age	race	marital_status	edu	income	female	hispanic
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False
241	1	50	White	Never married	Master's degree	32500.0	True	False
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False
...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False
1236624	2	26	White	Now married	High School graduate	12000.0	True	False
1236756	1	58	White	Now married	Master's degree	69800.0	True	False
1236756	2	61	White	Now married	Master's degree	40800.0	False	False
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False

27410 rows × 7 columns

Now make a copy of acs called acs_original. We’ll make various changes to acs and then return to the original data several times in the course of this chapter.

acs_original = acs.copy(deep=True)

Exercise

Load the example extract from the National Longitudinal Survey of Youth contained in nlsy.pickle. Identify the level one and level two units in the data and the level one and level two variables.

Solution

nlsy = pd.read_pickle('nlsy.pickle')
nlsy

		birth_year	edu	income	age
id	year
1.0	1979	58.0	12TH GRADE	4620.0	21.0
	1980	58.0	NaN	NaN	22.0
	1981	58.0	12TH GRADE	5000.0	23.0
	1982	58.0	NaN	NaN	24.0
	1983	58.0	NaN	NaN	25.0
...	...	...	...	...	...
12686.0	1993	60.0	12TH GRADE	31000.0	33.0
	1994	60.0	NaN	NaN	34.0
	1996	60.0	NaN	NaN	36.0
	1998	60.0	NaN	NaN	38.0
	2000	60.0	NaN	NaN	40.0

241034 rows × 4 columns

This is longitudinal data (says it right on the tin) where people (youth) are observed over time. So the level two unit is a person, identified by id, and the level one unit is a year. birth_year does not change over time, so it’s a level two variable, but edu, income, and age do, so they are level one variables.

5.3 Creating Level Two Variables

For some purposes, if a hierarchical data set is in long form you don’t have to worry about the hierarchical structure. On the other hand, a very common data wrangling task is creating level two variables based on the level one variables in the data set, and then the hierarchical structure is critical. In long form the critical tool for doing so is groupby(), so that the calculations are carried out separately for each level two unit.

5.3.1 Using groupby()

The groupby() function creates a new object which contains the DataFrame split up into groups. Functions act on that object, carry out calculations on the groups, and can return results in various forms. For example, if you apply basic summary statistic functions, you’ll get one row per group:

acs['income'].groupby('household').mean()

household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...     
1236119    46600.000000
1236287    17800.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64

mean() is an example of an aggregate function: it takes multiple values and returns a single value. Useful aggregate functions include:

mean()
sum() (total)
max()
min()
count() (number of rows with non-missing values)
size() (number of rows)
first() (value for the first non-missing observation in the group)
last() (value for the last non-missing observation in the group)
nth() (value for the nth observation in the group)

The first five functions listed, mean() through count(), can be applied to a DataFrame, a Series, or the result of a groupby(). However, the last four, size() through nth(), only work with groupby().

With the transform() function, you pass in the name of an aggregate function which is then applied to the groups, but the result is copied or broadcast to all the observations in the group. This is the primary tool you’ll use to add level two variables to a DataFrame. The result is a Series with the same number of rows as the original DataFrame, but the value for level one units in the same level two unit is always the same.

acs['income'].groupby('household').transform('mean')

household  person
37         1          6666.666667
           2          6666.666667
           3          6666.666667
241        1         32500.000000
242        1         30000.000000
                         ...     
1236624    1         31050.000000
           2         31050.000000
1236756    1         55300.000000
           2         55300.000000
1236779    1         22110.000000
Name: income, Length: 27410, dtype: float64

Note how household 37 now has three rows, just like in the original data.

5.3.2 Continuous Variables

We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. Create a column called household_income containing the total income of everyone in the household with:

acs['household_income'] = (
    acs['income'].
    groupby('household').
    transform('sum')
)
acs

		age	race	marital_status	edu	income	female	hispanic	household_income
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0
...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0

27410 rows × 8 columns

Now calculate the household size, or the number of people in the household. Since there is one row per person, we can use the size() function. You still need to have it act on a single column, but it makes no difference which column you choose:

acs['household_size'] = (
    acs['age'].
    groupby('household').
    transform('size')
)
acs

		age	race	marital_status	edu	income	female	hispanic	household_income	household_size
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0	3
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0	3
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0	3
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0	1
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0	1
...	...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0	2
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0	2
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0	2
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0	2
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0	1

27410 rows × 9 columns

Exercise

Create a variable mean_age containg the mean age of the household.

Solution

acs['mean_age'] = (
    acs['age'].
    groupby('household').
    transform('mean')
)
acs

		age	race	marital_status	edu	income	female	hispanic	household_income	household_size	mean_age
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0	3	19.333333
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0	3	19.333333
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0	3	19.333333
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0	1	50.000000
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0	1	29.000000
...	...	...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0	2	27.500000
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0	2	27.500000
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0	2	59.500000
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0	2	59.500000
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0	1	30.000000

27410 rows × 10 columns

Now calculate the income per person, spot-checking household 484. Since income per person is the same as the mean income for the household, you might think you can use:

acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform('mean')
)
acs.loc[484, ['age', 'income', 'household_income', 'income_per_person']]

	age	income	household_income	income_per_person
person
1	33	16800.0	34800.0	17400.0
2	26	18000.0	34800.0	17400.0
3	4	NaN	34800.0	17400.0
4	2	NaN	34800.0	17400.0

This does not give the right answer because of the way mean() handled missing values: it took the mean of the two observed values of income and ignored the two people with NaN. What we want is the total income of the household divided by the number of people in the household, whether they have observed values of income or not.

Since you already created household_income and household_size, you can just use them:

acs['income_per_person'] = (
    acs['household_income']/acs['household_size']
)
acs.loc[484, ['age', 'income', 'household_income', 'income_per_person']]

	age	income	household_income	income_per_person
person
1	33	16800.0	34800.0	8700.0
2	26	18000.0	34800.0	8700.0
3	4	NaN	34800.0	8700.0
4	2	NaN	34800.0	8700.0

Note that we did not need to use groupby()! groupby() is relevant when you’re using an aggregation function and need to control which rows are aggregated (e.g. the people in the same household). This code does not use an aggregation function: income_per_person for each row depends solely on household_income and household_size for the same row.

What if you hadn’t already created household_income and household_size? Is there a way to calculate income_per_person directly? Absolutely. We just need a function to pass to transform() that first calculates the sum of income and then divides by the size of the household. No such function exists–but we can create one.

5.3.3 Transform and Lambda Functions

We’ll only need to use our household income divided by household size function once, which makes it a great candidate for a lambda function. Recall that the syntax for a lambda function is:

lamba x: f(x)

where f(x) should be replaced by a a single expresion that uses `x’.

What seems like the obvious f(x) here is:

lambda x: x.sum()/x.size()

but that doesn’t work. The reason is that the passed in x is a Series object, not a groupby object, so it doesn’t have a size() function. However, it does have a size attribute that contains the number of rows in the Series. groupby() creates a separate Series for each household, so that’s exactly what we want.

Thus the function is:

lambda x: x.sum()/x.size

and the full code is:

acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform(lambda x: x.sum()/x.size)
)
acs.loc[484, ['age', 'income', 'household_income', 'income_per_person']]

	age	income	household_income	income_per_person
person
1	33	16800.0	34800.0	8700.0
2	26	18000.0	34800.0	8700.0
3	4	NaN	34800.0	8700.0
4	2	NaN	34800.0	8700.0

Note how much longer the code takes to run with a lambda function rather than with a standard aggregate function.

Before proceeding to the next section, replace acs with a copy of acs_original. This will get rid of the variables we just created and keep the data set from becoming too wide to see all of it.

acs = acs_original.copy(deep=True)

5.3.4 Subsetting With Aggregate Functions

Suppose you’re interested in the income children contribute to their household (household 8787 is a good example), so you want to create a child_income variable containing the household’s total income earned by children. You might think this would be a straightforward process of adding up the income of everyone that’s younger than 18:

acs['child_income'] = (
    acs.loc[acs['age']<18, 'income'].
    groupby('household').
    transform('sum')
)
acs.loc[8787, ['age', 'income', 'child_income']]

	age	income	child_income
person
1	45	38000.0	NaN
2	16	3600.0	3800.0
3	15	200.0	3800.0

The trouble is that in excluding the adults from the sum(), you also excluded them from getting the result. Thus child_income is not a proper household-level variable, as it is not the same for all the observations in the same household–you’d especially be in trouble if your research question involved investigating the effect of children’s contributions on the adults in the household. The solution is to first create a series that only contains the quantities you want to add up, but that can be added up for everyone.

A Series has a where() function that has some elements in common with the NumPy where() function we used in chapter 2, but is designed to create a modified version of an existing Series. It takes a condition as its first argument, and observations where that condition is true are unchanged. Then the other argument specifies what the “other” observtions–observations where the condition is not true–should be set to. This allows you to easily create a Series that contains the actual value of income for the children but NaN for the adults:

income_if_child = acs['income'].where(acs['age']<18, other=np.NaN)
income_if_child.loc[8787]

person
1       NaN
2    3600.0
3     200.0
Name: income, dtype: float64

You could have stored income_if_child as a column in the acs DataFrame, but it’s just an intermediate result so there’s no need.

When you add up all of a household’s incomes income_if_child, you’ll get the correct value of child_income without having to exclude the adults:

acs['child_income'] = income_if_child.groupby('household').transform('sum')
acs.loc[8787, ['age', 'income', 'child_income']]

	age	income	child_income
person
1	45	38000.0	3800.0
2	16	3600.0	3800.0
3	15	200.0	3800.0

Since all the aggregate functions ignore missing values, you can use this approach any time you need to use an aggregate function on a subset of the data:

Create a new Series containing the value of interest for the observations to be included and NaN for the values to be excluded
Aggregate that column with no subsetting

Exercise

Find the mean age of the adults in each household.

Solution

age_if_adult = acs['age'].where(acs['age']>=18, other=np.NaN)
acs['mean_adult_age'] = age_if_adult.groupby('household').transform('mean')
acs.loc[484, ['age', 'mean_adult_age']]

	age	mean_adult_age
person
1	33	29.5
2	26	29.5
3	4	29.5
4	2	29.5

Again, just to keep things neat revert to the original data:

acs = acs_original.copy(deep=True)

5.3.5 Indicator Variables

As we’ve seen, if you do math with an indicator variable, it will be coerced into a number with True converted to 1 and False converted to 0. Aggregate functions take on new and very useful meanings when applied to indicator variables.

Start by creating an indicator variable for ‘this person is a child’:

acs['child'] = acs['age']<18
acs.loc[[37, 8787, 73731], ['age', 'child']]

		age	child
household	person
37	1	20	False
	2	19	False
	3	19	False
8787	1	45	False
	2	16	True
	3	15	True
73731	1	15	True
73731	2	15	True

Just to see the conversion in action, use the astype() function and pass in int32 (32-bit integer).

acs.loc[[37, 8787, 73731],'child'].astype('int32')

household  person
37         1         0
           2         0
           3         0
8787       1         0
           2         1
           3         1
73731      1         1
           2         1
Name: child, dtype: int32

The sum of child is the number of children in the household:

acs['num_children'] = (
    acs['child'].
    groupby('household').
    transform('sum')
)
acs.loc[[37, 8787, 73731], ['age', 'child', 'num_children']]

		age	child	num_children
household	person
37	1	20	False	0
	2	19	False	0
	3	19	False	0
8787	1	45	False	2
	2	16	True	2
	3	15	True	2
73731	1	15	True	2
73731	2	15	True	2

The mean of child is the proportion of the household that is children:

acs['prop_children'] = (
    acs['child'].
    groupby('household').
    transform('mean')
)
acs.loc[[37, 8787, 73731], ['age', 'child', 'prop_children']]

		age	child	prop_children
household	person
37	1	20	False	0.000000
	2	19	False	0.000000
	3	19	False	0.000000
8787	1	45	False	0.666667
	2	16	True	0.666667
	3	15	True	0.666667
73731	1	15	True	1.000000
73731	2	15	True	1.000000

The maximum value of child tells you if the household has any children or not:

acs['has_children'] = (
    acs['child'].
    groupby('household').
    transform('max')
)
acs.loc[[37, 8787, 73731], ['age', 'child', 'has_children']]

		age	child	has_children
household	person
37	1	20	False	False
	2	19	False	False
	3	19	False	False
8787	1	45	False	True
	2	16	True	True
	3	15	True	True
73731	1	15	True	True
73731	2	15	True	True

This merits some explanation. If a household has no children, everyone in the household has a 0 (False) for child and the maximum value is 0. If a household has any children, then at least one person in the household has 1 (True) for child, and the maximum value is 1. Python is clever enough to realize that the max() of a boolean variable is also a boolean variable, so it stores True/False in has_children rather than 1/0.

The minimum value of child tells you if the household consists entirely of children:

acs['all_children'] = (
    acs['child'].
    groupby('household').
    transform('min')
)
acs.loc[[37, 8787, 73731], ['age', 'child', 'all_children']]

		age	child	all_children
household	person
37	1	20	False	False
	2	19	False	False
	3	19	False	False
8787	1	45	False	False
	2	16	True	False
	3	15	True	False
73731	1	15	True	True
73731	2	15	True	True

If everyone in the household is a child, everyone has a 1 (True) for child and the minimum value is 1 (True). If anyone in the household is not a child, then at least one person has a 0 (False) for child and the minimum is 0 (False).

Given the number of times we used the indicator child it was convenient to calculate it once and store it. But you don’t need to. For example, you could also create all_children with:

acs['all_children'] = (
    (acs['age']<18).
    groupby('household').
    transform('min')
)
acs.loc[[37, 8787, 73731], ['age', 'child', 'all_children']]

		age	child	all_children
household	person
37	1	20	False	False
	2	19	False	False
	3	19	False	False
8787	1	45	False	False
	2	16	True	False
	3	15	True	False
73731	1	15	True	True
73731	2	15	True	True

Exercise

Create a variable containing the number of people in the household with income greater than zero and an indicator variable for ‘this household has someone with income>0’.

Solution

acs['num_incomes'] = (
    (acs['income']>0).
    groupby('household').
    transform('sum')
)

acs['has_income'] = (
    (acs['income']>0).
    groupby('household').
    transform('max')
)

acs.loc[[37, 8787, 73731], ['income', 'num_incomes', 'has_income']]

		income	num_incomes	has_income
household	person
37	1	10000.0	3	True
	2	5300.0	3	True
	3	4700.0	3	True
8787	1	38000.0	3	True
	2	3600.0	3	True
	3	200.0	3	True
73731	1	0.0	0	False
73731	2	0.0	0	False

5.4 Vectorizing Functions

Back in the first chapter, you defined a function that took a value of household income, compared it to the year 2000 poverty level for a family of four, and returned a string describing the result. Something like:

def poverty(income):
    if income<17050:
        return 'poor'
    if income< (2*17050):
        return 'low income'
    return 'not low income'

Now that we know how to calculate household incomes, we’ll put it to use.

Reset acs back to its original form, then calculate household income and household size. Recall that the size() function needs a single column to act on, but it can be any column since it just counts rows.

acs = acs_original.copy(deep=True)

acs['household_income'] = acs['income'].groupby('household').transform('sum')
acs['household_size'] = acs['age'].groupby('household').transform('size')

Before we actually use the poverty function we defined, let’s address three inadequacies. First, it does not handle missing values properly:

poverty(np.NaN)

'not low income'

We’re getting the ‘else’ result because any condition involving NaN returns false. Technically that’s okay, because household_income is never missing (a household where no one has a valid value of income gets zero). But you need to be in the habit of thinking about missing values and it’s an easy fix.

You might think you can use something like income.isna(), but isna() is a Series function and income is a simple number. Instead, there is a NumPy function isnan() that takes a number as an argument and tells you if it’s missing or not in the same way:

def poverty(income):
    if income<17050:
        return 'poor'
    if income< (2*17050):
        return 'low income'
    if np.isnan(income):
        return np.NaN
    return 'not low income'

poverty(np.NaN)

nan

Second, $17,050 is the poverty threshold for a family of four, and the families in the ACS are of all different sizes. The formula for the poverty threshold in the year 2000 was $8,350 for one person plus $2,900 for each additional person after the first. So our function needs to take two arguments, household income and household size, and calculate the appropriate threshold before using it.

def poverty(income, size):
    threshold = 8350 + 2900*(size-1)
    if income<threshold:
        return 'poor'
    if income< (2*threshold):
        return 'low income'
    if np.isnan(income):
        return np.NaN
    return 'not low income'

Third, the function acts on a single value of household income and household size. If you try to pass in acs['household_income'] and acs['household_size'] as arguments, you’ll get an error message that says “The truth value of a Series is ambiguous.” What’s that means is that when Python hits if income<threshold and income is a Series, sometimes it’s true and sometimes it is not, so Python doesn’t know if it’s supposed to execute the code in that block or not.

Some operators and functions work just fine with either a single number or a series. For example, you can do both 2*2 and acs['income']*2. That’s because the multiplication operator has been vectorized, meaning it has been set up to handle vectors of values as well as single values. Wouldn’t it be nice if our poverty() function were vectorized? It would, and it’s ridiculously easy to do:

@np.vectorize
def poverty(income, size):
    threshold = 8350 + 2900*(size-1)
    if income<threshold:
        return 'poor'
    if income< (2*threshold):
        return 'low income'
    if np.isnan(income):
        return np.NaN
    return 'not low income'

@np.vectorize is a decorator. Under the hood, there’s a NumPy function called vectorize() that takes a function you define as an argument and basically wraps it in a for loop so if it gets a vector of values it acts on them one at a time. But putting the decorator right before your function definition takes care of all of that for you.

acs['poverty'] = poverty(acs['household_income'], acs['household_size'])
acs[['household_income', 'household_size', 'poverty']]

		household_income	household_size	poverty
household	person
37	1	20000.0	3	low income
	2	20000.0	3	low income
	3	20000.0	3	low income
241	1	32500.0	1	not low income
242	1	30000.0	1	not low income
...	...	...	...	...
1236624	1	62100.0	2	not low income
1236624	2	62100.0	2	not low income
1236756	1	110600.0	2	not low income
1236756	2	110600.0	2	not low income
1236779	1	22110.0	1	not low income

27410 rows × 3 columns

Did we need to write a function to do this? No, there’s always more than one way to do things in Python (but note that the following method doesn’t handle missing values):

acs['poverty2'] = 'not low income'
acs.loc[
    acs['household_income'] < (8350 + 2900*(acs['household_size']-1)),
    'poverty2'
] = 'poor'
acs.loc[
    acs['household_income'] < 2*(8350 + 2900*(acs['household_size']-1)),
    'poverty2'
] = 'low income'
acs[['household_income', 'household_size', 'poverty', 'poverty2']]

		household_income	household_size	poverty	poverty2
household	person
37	1	20000.0	3	low income	low income
	2	20000.0	3	low income	low income
	3	20000.0	3	low income	low income
241	1	32500.0	1	not low income	not low income
242	1	30000.0	1	not low income	not low income
...	...	...	...	...	...
1236624	1	62100.0	2	not low income	not low income
1236624	2	62100.0	2	not low income	not low income
1236756	1	110600.0	2	not low income	not low income
1236756	2	110600.0	2	not low income	not low income
1236779	1	22110.0	1	not low income	not low income

27410 rows × 4 columns

5.5 Panel Data

Panel data or longitudinal data is just another form of hierarchical data, with subjects as level two units and times they were observed as level one units. With panel data, the timing of the observations or at least their order is important. If it’s not, then we refer to it as repeated measures data. Since panel data is just hierarchical data, a lot of the code we’ve already learned applies directly to panel data, but there are some panel-specific tricks too.

You should already have the NLSY extract in memory from the second exercise. If not, load it with:

nlsy = pd.read_pickle('nlsy.pickle')
nlsy

		birth_year	edu	income	age
id	year
1.0	1979	58.0	12TH GRADE	4620.0	21.0
	1980	58.0	NaN	NaN	22.0
	1981	58.0	12TH GRADE	5000.0	23.0
	1982	58.0	NaN	NaN	24.0
	1983	58.0	NaN	NaN	25.0
...	...	...	...	...	...
12686.0	1993	60.0	12TH GRADE	31000.0	33.0
	1994	60.0	NaN	NaN	34.0
	1996	60.0	NaN	NaN	36.0
	1998	60.0	NaN	NaN	38.0
	2000	60.0	NaN	NaN	40.0

241034 rows × 4 columns

Also make a copy so we can go back to the original:

nlsy_original = nlsy.copy(deep=True)

5.5.1 Creating Level Two Variables

The National Longitudinal Survey of Youth tracks young people for many years, making it classic panel data. In the NLSY a level two unit is a year and a level one unit is a person-year combination. You can create level two variables using the same tools we used with the ACS. For example, create max_edu containing the person’s highest level of education with:

nlsy['max_edu'] = (
    nlsy['edu'].
    groupby('id').
    transform('max')
)
nlsy.loc[5, ['edu', 'max_edu']]

	edu	max_edu
year
1979	1ST YEAR COLLEGE	6TH YEAR COLLEGE
1980	2ND YEAR COLLEGE	6TH YEAR COLLEGE
1981	3RD YEAR COLLEGE	6TH YEAR COLLEGE
1982	4TH YEAR COLLEGE	6TH YEAR COLLEGE
1983	5TH YEAR COLLEGE	6TH YEAR COLLEGE
1984	6TH YEAR COLLEGE	6TH YEAR COLLEGE
1985	NaN	6TH YEAR COLLEGE
1986	6TH YEAR COLLEGE	6TH YEAR COLLEGE
1987	NaN	6TH YEAR COLLEGE
1988	6TH YEAR COLLEGE	6TH YEAR COLLEGE
1989	6TH YEAR COLLEGE	6TH YEAR COLLEGE
1990	NaN	6TH YEAR COLLEGE
1991	NaN	6TH YEAR COLLEGE
1992	NaN	6TH YEAR COLLEGE
1993	NaN	6TH YEAR COLLEGE
1994	NaN	6TH YEAR COLLEGE
1996	NaN	6TH YEAR COLLEGE
1998	NaN	6TH YEAR COLLEGE
2000	NaN	6TH YEAR COLLEGE

Note that this relies on edu being an ordered categorical variable.

The first() and last() functions are particularly useful with panel data. But their meaning depends on the order the data are sorted in. This data set appears to be in chronological order, but you shouldn’t trust that if you’re going to write code that will give you the wrong answer if it is not. So before using the first() function to find the person’s age at the start of the study, sort by id and year:

nlsy = nlsy.sort_values(['id', 'year'])
nlsy['starting_age'] = (
    nlsy['age'].
    groupby('id').
    transform('first')
)
nlsy.loc[1, ['age', 'starting_age']]

	age	starting_age
year
1979	21.0	21.0
1980	22.0	21.0
1981	23.0	21.0
1982	24.0	21.0
1983	25.0	21.0
1984	26.0	21.0
1985	27.0	21.0
1986	28.0	21.0
1987	29.0	21.0
1988	30.0	21.0
1989	31.0	21.0
1990	32.0	21.0
1991	33.0	21.0
1992	34.0	21.0
1993	35.0	21.0
1994	36.0	21.0
1996	38.0	21.0
1998	40.0	21.0
2000	42.0	21.0

Exercise

Use the last() function to find each subject’s last value of income. Does this behave as you expected?

Solution

nlsy['last_income'] = nlsy['income'].groupby('id').transform('last')
nlsy.loc[1, ['income', 'last_income']]

	income	last_income
year
1979	4620.0	5000.0
1980	NaN	5000.0
1981	5000.0	5000.0
1982	NaN	5000.0
1983	NaN	5000.0
1984	NaN	5000.0
1985	NaN	5000.0
1986	NaN	5000.0
1987	NaN	5000.0
1988	NaN	5000.0
1989	NaN	5000.0
1990	NaN	5000.0
1991	NaN	5000.0
1992	NaN	5000.0
1993	NaN	5000.0
1994	NaN	5000.0
1996	NaN	5000.0
1998	NaN	5000.0
2000	NaN	5000.0

The last() function returns the last non-missing value. As of Pandas 2.2.1 you can pass in skipna=False to get the last value.

Before proceeding, go back to the original data just to keep the data set neat:

nlsy = nlsy_original.copy(deep=True)

5.5.2 Filling In Gaps

The NLSY frequently has missing data as subjects were frequently not located in a given year. Somewhat unusually for panel data, they often were able to find people in later years so there are gaps as well as attrition. Consider what we know about the education of person 11:

nlsy.loc[11]

	birth_year	edu	income	age
year
1979	59.0	1ST YEAR COLLEGE	2900.0	20.0
1980	59.0	NaN	NaN	21.0
1981	59.0	3RD YEAR COLLEGE	2000.0	22.0
1982	59.0	4TH YEAR COLLEGE	4002.0	23.0
1983	59.0	4TH YEAR COLLEGE	8000.0	24.0
1984	59.0	4TH YEAR COLLEGE	17000.0	25.0
1985	59.0	4TH YEAR COLLEGE	20000.0	26.0
1986	59.0	NaN	NaN	27.0
1987	59.0	NaN	NaN	28.0
1988	59.0	NaN	NaN	29.0
1989	59.0	4TH YEAR COLLEGE	46000.0	30.0
1990	59.0	4TH YEAR COLLEGE	74283.0	31.0
1991	59.0	4TH YEAR COLLEGE	81481.0	32.0
1992	59.0	4TH YEAR COLLEGE	55000.0	33.0
1993	59.0	4TH YEAR COLLEGE	60000.0	34.0
1994	59.0	4TH YEAR COLLEGE	101653.0	35.0
1996	59.0	4TH YEAR COLLEGE	60000.0	37.0
1998	59.0	4TH YEAR COLLEGE	25000.0	39.0
2000	59.0	4TH YEAR COLLEGE	55000.0	41.0

They were not interviewed in 1986, 1987, and 1988, but given that their education level both before and after that gap is ‘4TH YEAR COLLEGE’ we know it was ‘4TH YEAR COLLEGE’ in those years as well. Meanwhile, while we might think that in 1980 their education level was ‘2ND YEAR COLLEGE’ that is less certain–and if that were a two-year gap we’d have no idea. So let’s fill in gaps only where the edu is the same before and after the gap.

To actually do that, we’ll fill in all the gaps in two ways: first by carrying past values forward with the ffill() function, and then by carrying future values back with the bfill() function. If the two methods agree, then that is a gap we can fill. While ffill() and bfill() are not aggregate functions, it’s still essential to use groupby('id') because that ensures each person is considered separately and no values are carried from one person to another.

nlsy['edu_forward'] = (
    nlsy['edu'].
    groupby('id').
    ffill()
)
nlsy['edu_backward'] = (
    nlsy['edu'].
    groupby('id').
    bfill()
)
nlsy.loc[11, ['edu', 'edu_forward', 'edu_backward']]

	edu	edu_forward	edu_backward
year
1979	1ST YEAR COLLEGE	1ST YEAR COLLEGE	1ST YEAR COLLEGE
1980	NaN	1ST YEAR COLLEGE	3RD YEAR COLLEGE
1981	3RD YEAR COLLEGE	3RD YEAR COLLEGE	3RD YEAR COLLEGE
1982	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1983	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1984	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1985	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1986	NaN	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1987	NaN	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1988	NaN	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1989	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1990	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1991	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1992	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1993	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1994	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1996	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1998	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
2000	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE

Now we’ll replace edu if edu is missing and edu_forward and edu_backward are the same. Since they are the same it doesn’t matter which of them we set edu to:

nlsy.loc[
    nlsy['edu'].isna() & 
    (nlsy['edu_forward']==nlsy['edu_backward']),
    'edu'      
] = nlsy['edu_forward']
nlsy.loc[11, ['edu', 'edu_forward', 'edu_backward']]

	edu	edu_forward	edu_backward
year
1979	1ST YEAR COLLEGE	1ST YEAR COLLEGE	1ST YEAR COLLEGE
1980	NaN	1ST YEAR COLLEGE	3RD YEAR COLLEGE
1981	3RD YEAR COLLEGE	3RD YEAR COLLEGE	3RD YEAR COLLEGE
1982	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1983	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1984	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1985	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1986	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1987	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1988	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1989	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1990	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1991	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1992	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1993	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1994	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1996	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1998	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
2000	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE

Note that we will not attempt to fill in other gaps where we are not perfectly certain what the missing value should be. In a sense the values of edu in a gap where edu is the same before and after are not really missing, in that we can identify their true values. This is somewhat unusual, but always worth doing when you can.

Sometimes filling in missing data is presented as a routine part of data preparation, but it should not be. Simple methods, like filling in the mean or linear interpolation, can create serious problems for subsequent statistical analysis. If you want to fill in truly missing values, use valid methods like BLIMP or multiple imputation.

5.5.3 Variables Based On An Observations Neighbors

The shift() function can shift observations forward or backwards: forward if you pass in a positive number, backwards if you pass in a negative number. For easy viewing, let’s build a DataFrame by passing a dictionary to the Pandas DataFrame function. The dictionary must contain the column names as keys and Series (which we’ll create based on the edu Series in nsly) as values. Our DataFrame will have one column for the original edu, one for shifting forward, and one for shifting backward:

pd.DataFrame(
    {
        'edu' : nlsy.loc[11, 'edu'],
        'edu_forward_1' : nlsy.loc[11, 'edu'].shift(1),
        'edu_backward_1' : nlsy.loc[11, 'edu'].shift(-1) 
    }
)

	edu	edu_forward_1	edu_backward_1
year
1979	1ST YEAR COLLEGE	NaN	NaN
1980	NaN	1ST YEAR COLLEGE	3RD YEAR COLLEGE
1981	3RD YEAR COLLEGE	NaN	4TH YEAR COLLEGE
1982	4TH YEAR COLLEGE	3RD YEAR COLLEGE	4TH YEAR COLLEGE
1983	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1984	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1985	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1986	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1987	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1988	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1989	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1990	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1991	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1992	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1993	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1994	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1996	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
1998	4TH YEAR COLLEGE	4TH YEAR COLLEGE	4TH YEAR COLLEGE
2000	4TH YEAR COLLEGE	4TH YEAR COLLEGE	NaN

Note that when shifting forward you’ll have NaN at the beginning of the series, and when shifting backwards at the end. In this case we did not need groupby() because we extracted just one person, but in general you’ll need it to prevent values from being shifted between people.

Now consider trying to identify the year in which a person graduates from high school. In that year their edu will be ‘12TH GRADE’ or higher and the year before their edu will be less than ‘12TH GRADE’. Note that for whatever reason the NLSY frequently records people as having gained multiple years of education in one year, and we won’t worry about that. For person 2, the year in which they graduated is 1985:

nlsy.loc[2]

	birth_year	edu	income	age	edu_forward	edu_backward
year
1979	59.0	9TH GRADE	4000.0	20.0	9TH GRADE	9TH GRADE
1980	59.0	9TH GRADE	5000.0	21.0	9TH GRADE	9TH GRADE
1981	59.0	9TH GRADE	6000.0	22.0	9TH GRADE	9TH GRADE
1982	59.0	9TH GRADE	10000.0	23.0	9TH GRADE	9TH GRADE
1983	59.0	9TH GRADE	11000.0	24.0	9TH GRADE	9TH GRADE
1984	59.0	9TH GRADE	11500.0	25.0	9TH GRADE	9TH GRADE
1985	59.0	12TH GRADE	11000.0	26.0	12TH GRADE	12TH GRADE
1986	59.0	12TH GRADE	14000.0	27.0	12TH GRADE	12TH GRADE
1987	59.0	12TH GRADE	16000.0	28.0	12TH GRADE	12TH GRADE
1988	59.0	12TH GRADE	20000.0	29.0	12TH GRADE	12TH GRADE
1989	59.0	12TH GRADE	19000.0	30.0	12TH GRADE	12TH GRADE
1990	59.0	12TH GRADE	20000.0	31.0	12TH GRADE	12TH GRADE
1991	59.0	12TH GRADE	20000.0	32.0	12TH GRADE	12TH GRADE
1992	59.0	12TH GRADE	22000.0	33.0	12TH GRADE	12TH GRADE
1993	59.0	12TH GRADE	25000.0	34.0	12TH GRADE	12TH GRADE
1994	59.0	12TH GRADE	0.0	35.0	12TH GRADE	12TH GRADE
1996	59.0	12TH GRADE	0.0	37.0	12TH GRADE	12TH GRADE
1998	59.0	12TH GRADE	0.0	39.0	12TH GRADE	12TH GRADE
2000	59.0	12TH GRADE	0.0	41.0	12TH GRADE	12TH GRADE

Now create an indicator variable for ‘graduated this year’ with:

nlsy['grad'] = (
    (nlsy['edu'] >= '12TH GRADE') & 
    (nlsy['edu'].groupby('id').shift(1) < '12TH GRADE')
)
nlsy.loc[2, ['edu', 'grad']]

	edu	grad
year
1979	9TH GRADE	False
1980	9TH GRADE	False
1981	9TH GRADE	False
1982	9TH GRADE	False
1983	9TH GRADE	False
1984	9TH GRADE	False
1985	12TH GRADE	True
1986	12TH GRADE	False
1987	12TH GRADE	False
1988	12TH GRADE	False
1989	12TH GRADE	False
1990	12TH GRADE	False
1991	12TH GRADE	False
1992	12TH GRADE	False
1993	12TH GRADE	False
1994	12TH GRADE	False
1996	12TH GRADE	False
1998	12TH GRADE	False
2000	12TH GRADE	False

Keep in mind that if either edu or shifted edu is NaN, the result will be false. Thus this indicator variable is really for ‘we know the person graduated this year.’ We could set it to NaN for years with missing values if we wanted to, using the techniques we’ve used before.

Logically each person should only graduate from high school once, but it’s not hard to image data entry errors that would make it look like they graduated multiple times. Recall that the sum of an indicator variable is the number of times it is true, so check with:

(
    nlsy['grad'].
    groupby('id').
    sum().
    value_counts()
)

grad
0    7033
1    5653
Name: count, dtype: int64

A little over half the subjects have True for grad zero times: either they never graduated from high school, they graduated outside the study period, or we could not identify their graduation year due to missing data. But no one graduated more than once.

Exercise

Assume for the moment that if someone’s income is zero they are unemployed and if their income is positive they are employed. Create an indicator variable for ‘person lost their job’ which is true if someone is unemployed (income is zero) in the current year and was employed (income is positive) in the previous year.

Solution

nlsy['lost_job'] = (nlsy['income']==0) & (nlsy['income'].groupby('id').shift(1)>0)
nlsy.loc[2, ['income', 'lost_job']]

	income	lost_job
year
1979	4000.0	False
1980	5000.0	False
1981	6000.0	False
1982	10000.0	False
1983	11000.0	False
1984	11500.0	False
1985	11000.0	False
1986	14000.0	False
1987	16000.0	False
1988	20000.0	False
1989	19000.0	False
1990	20000.0	False
1991	20000.0	False
1992	22000.0	False
1993	25000.0	False
1994	0.0	True
1996	0.0	False
1998	0.0	False
2000	0.0	False

Now let’s find each person’s age at the time of their graduation. This is just a matter of subsetting with an aggregate function, except that the desired subset has just one observation per group. So start by creating age_if_grad containing the person’s age for the year they graduated and NaN for all others:

age_if_grad = nlsy['age'].where(nlsy['grad'], other=np.NaN)
age_if_grad.loc[2]

year
1979     NaN
1980     NaN
1981     NaN
1982     NaN
1983     NaN
1984     NaN
1985    26.0
1986     NaN
1987     NaN
1988     NaN
1989     NaN
1990     NaN
1991     NaN
1992     NaN
1993     NaN
1994     NaN
1996     NaN
1998     NaN
2000     NaN
Name: age, dtype: float32

age_if_grad for each person is a single number plus a bunch of NaNs. Many aggregate functions have the property that if you give them a single number they’ll return that number, so we could use any of them. We’ll go with mean:

nlsy['age_at_grad'] = (
    age_if_grad.
    groupby('id').
    transform('mean')
)
nlsy.loc[2,['edu', 'age', 'grad', 'age_at_grad']]

	edu	age	grad	age_at_grad
year
1979	9TH GRADE	20.0	False	26.0
1980	9TH GRADE	21.0	False	26.0
1981	9TH GRADE	22.0	False	26.0
1982	9TH GRADE	23.0	False	26.0
1983	9TH GRADE	24.0	False	26.0
1984	9TH GRADE	25.0	False	26.0
1985	12TH GRADE	26.0	True	26.0
1986	12TH GRADE	27.0	False	26.0
1987	12TH GRADE	28.0	False	26.0
1988	12TH GRADE	29.0	False	26.0
1989	12TH GRADE	30.0	False	26.0
1990	12TH GRADE	31.0	False	26.0
1991	12TH GRADE	32.0	False	26.0
1992	12TH GRADE	33.0	False	26.0
1993	12TH GRADE	34.0	False	26.0
1994	12TH GRADE	35.0	False	26.0
1996	12TH GRADE	37.0	False	26.0
1998	12TH GRADE	39.0	False	26.0
2000	12TH GRADE	41.0	False	26.0

Exercise

Compare the overall distribution of edu with the distribution of edu for person-years where the subject lost their job. (Keep in mind that in both cases, an observation represents a person-year combination.) Create a bar graph for each to do so.

Solution

Start by creating a DataFrame containing just the rows where someone lost their job. You can easily do this with loc or query():

lost_job = nlsy.query('lost_job==True')
lost_job

		birth_year	edu	income	age	edu_forward	edu_backward	grad	lost_job	age_at_grad
id	year
2.0	1994	59.0	12TH GRADE	0.0	35.0	12TH GRADE	12TH GRADE	False	True	26.0
3.0	1983	61.0	10TH GRADE	0.0	22.0	10TH GRADE	10TH GRADE	False	True	32.0
	1986	61.0	10TH GRADE	0.0	25.0	10TH GRADE	10TH GRADE	False	True	32.0
	1989	61.0	10TH GRADE	0.0	28.0	10TH GRADE	10TH GRADE	False	True	32.0
	1994	61.0	12TH GRADE	0.0	33.0	12TH GRADE	12TH GRADE	False	True	32.0
...	...	...	...	...	...	...	...	...	...	...
12675.0	1984	59.0	12TH GRADE	0.0	25.0	12TH GRADE	12TH GRADE	False	True	NaN
12677.0	1980	59.0	12TH GRADE	0.0	21.0	12TH GRADE	12TH GRADE	False	True	NaN
12680.0	1980	59.0	12TH GRADE	0.0	21.0	12TH GRADE	12TH GRADE	False	True	NaN
12681.0	1983	59.0	12TH GRADE	0.0	24.0	12TH GRADE	12TH GRADE	False	True	21.0
12682.0	1980	59.0	9TH GRADE	0.0	21.0	9TH GRADE	9TH GRADE	False	True	22.0

9917 rows × 9 columns

(Note that we didn’t use copy(), so it’s ambiguous if lost_job is a copy or a view of nlsy. That would be a problem if we planned to make any changes in it, but we don’t.)

Now create bar graphs of both nlsy['edu'] and lost_job['edu']:

import plotnine as p9

p9.ggplot(nlsy, p9.aes(x='edu')) + p9.geom_bar() + p9.coord_flip()

p9.ggplot(lost_job, p9.aes(x='edu')) + p9.geom_bar() + p9.coord_flip()

The biggest difference is the missing values: edu is never missing in the lost_job dataframe because if edu is missing, income is missing (they couldn’t find the person to interview them) and thus we can’t tell that they lost their job. But note how ‘4TH YEAR COLLEGE’ and above are a smaller fraction of lost_job than nlsy. Having a college education gives more job security, not just higher earnings.