# 5 Working With Hierarchical Data

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is panel or longitudinal data and repeated measures, where things are observed multiple times. These structures may seem very different, but the same concepts apply to both and often even the same code.

## 5.1 Hierarchical Data Concepts

In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data.

### 5.1.1 Levels

Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

| Example Data Structure | Level One Unit | Level Two Unit |
| --- | --- | --- |
| Individuals living in households | Individuals | Households |
| States grouped into regions | States | Regions |
| Students in classes | Students | Classes |
| Tests taken by students | Tests | Students |
| Monthly observations of individuals | Monthly observations | Individuals |
| Visits to the doctor by patients | Doctor visits | Patients |
| Social media posts by individuals | Posts | Individuals |

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

| Example Data Structure | Level One Variables | Level Two Variables |
| --- | --- | --- |
| Individuals living in households | Age, Sex | Household income, Address |
| States grouped into regions | State population | Regional income per capita |
| Students in classes | Student's race | Teacher's race, Class size |
| Tests taken by students | Test score | Free lunch eligibility |
| Monthly observations of individuals | Employment status | Highest degree earned |
| Visits to the doctor by patients | BMI, Diagnosis | Race |
| Social media posts by individuals | Length, Topic | Sex |

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (monthly observations) have different values for it.
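In practice, you can check which level a variable belongs to by counting its distinct values within each level two unit. A minimal sketch with made-up data (the variable names here are hypothetical):

```python
import pandas as pd

# Hypothetical data: individuals (level one) grouped into households (level two)
df = pd.DataFrame({
    'household': [1, 1, 2, 2, 2],
    'age':       [34, 33, 51, 49, 12],                 # varies within households
    'hh_income': [60000, 60000, 85000, 85000, 85000],  # constant within households
})

# A level two variable has exactly one distinct value within each level two unit
is_level_two = (df.groupby('household').nunique() == 1).all()
print(is_level_two)
```

Here `hh_income` passes the test (one value per household) while `age` does not, so `age` is a level one variable.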

### 5.1.2 Data Forms

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people for three months:

```
person_id  month  years_edu  employed
        1      1         16      True
        1      2         16      True
        1      3         16      True
        2      1         12     False
        2      2         12      True
        2      3         12      True
```
Exercise

Identify the level one units, level two units, level one variable(s), and level two variable(s) in the above data set.

In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

```
person_id  years_edu  employed_1  employed_2  employed_3
        1         16        True        True        True
        2         12       False        True        True
```

In this form, each observation represents a person. Because `years_edu` is a level two variable, there’s just one value per person and thus one variable (column). However, `employed` is a level one variable with three (potentially) different values per person. Thus it needs three variables (columns).

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more columns. In most cases the long form is easier to work with, so we’ll do most of our examples in this form.

Now consider the indexes in this data set. In the long form, `person_id` and `month` are a compound identifier and could be easily turned into a MultiIndex, while the variable names are a simple column identifier. In the wide form, `person_id` is a simple row identifier, but now the variable names for the level one variables are a compound identifier with two parts: the variable name (`employed`) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the `person_id`, `month`, and a variable name, but `month` has been converted from a row identifier to part of a compound column identifier. We’ll learn how to actually carry out that transformation in the next chapter.
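The long form above maps directly onto a MultiIndex; a minimal sketch using the values from the table (constructing the DataFrame by hand, since this toy data set isn't a file):

```python
import pandas as pd

# The two-person, three-month example in long form
long = pd.DataFrame({
    'person_id': [1, 1, 1, 2, 2, 2],
    'month':     [1, 2, 3, 1, 2, 3],
    'years_edu': [16, 16, 16, 12, 12, 12],
    'employed':  [True, True, True, False, True, True],
}).set_index(['person_id', 'month'])  # compound row identifier

# A specific value is identified by (person_id, month) plus a column name
print(long.loc[(2, 1), 'employed'])  # False
```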

### 5.1.3 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it `Hierarchical_Data_Practice.ipynb`. Have it import Pandas and load `acs_clean.pickle`:

```python
import pandas as pd
acs = pd.read_pickle('acs_clean.pickle')
acs
```
```
                   age             race marital_status                     edu   income female hispanic
household person
37        1         20            White  Never married  Some college, >=1 year  10000.0   True    False
          2         19            White  Never married  Some college, >=1 year   5300.0   True    False
          3         19            Black  Never married  Some college, >=1 year   4700.0   True    False
241       1         50            White  Never married         Master's degree  32500.0   True    False
242       1         29            White  Never married       Bachelor's degree  30000.0   True    False
...               ...              ...            ...                     ...      ...    ...      ...
1236624   1         29            White    Now married  Some college, >=1 year  50100.0  False    False
          2         26            White    Now married    High School graduate  12000.0   True    False
1236756   1         58            White    Now married         Master's degree  69800.0   True    False
          2         61            White    Now married         Master's degree  40800.0  False    False
1236779   1         30  American Indian       Divorced    High School graduate  22110.0  False    False

[27410 rows x 7 columns]
```

Exercise

Load the example extract from the National Longitudinal Survey of Youth contained in `nlsy.pickle`. Identify the level one and level two units in the data and the level one and level two variables. (We’re not so worried about having a streamlined data cleaning workflow in this chapter, so just use your `Hierarchical_Data_Practice.ipynb` Notebook.)

## 5.2 Long Form Data

For some purposes, if a hierarchical data set is in long form you don’t have to worry about the hierarchical structure. On the other hand, a very common data wrangling task is creating level two variables based on the level one variables in the data set, and then the hierarchical structure is critical. In long form the critical tool for doing so is `groupby()`, so that the calculations are carried out separately for each level two unit.

### 5.2.1 Using groupby()

The `groupby()` function creates a new object which contains the DataFrame split up into groups. Functions act on that object, carry out calculations on the groups, and can return results in various forms. For example, if you apply basic summary statistic functions, you’ll get one row per group:

```python
acs['income'].groupby('household').mean()
```
```
household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...
1236119    46600.000000
1236287    17800.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64
```

`mean()` is an example of an aggregate function: it takes multiple values and returns a single value. Useful aggregate functions include:

• `mean()`
• `sum()` (total)
• `max()`
• `min()`
• `count()` (number of rows with non-missing values)
• `size()` (number of rows)
• `first()` (value for the first observation in the group)
• `last()` (value for the last observation in the group)
• `nth()` (value for the nth observation in the group)

The first five functions listed, `mean()` through `count()`, can be applied to a DataFrame, a Series, or the result of a `groupby()`. However, the last four, `size()` through `nth()`, only work with `groupby()`.
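The difference between `count()` and `size()` matters whenever there are missing values; a quick sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

# One household with three members, one of whom has missing income
income = pd.Series([10000.0, np.nan, 30000.0],
                   index=pd.Index([37, 37, 37], name='household'))

print(income.groupby('household').count())  # 2: rows with non-missing values
print(income.groupby('household').size())   # 3: all rows
```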

With the `transform()` function, you pass in the name of an aggregate function which is then applied to the groups, but the result is copied or broadcast to all the observations in the group. This is the primary tool you’ll use to add level two variables to a DataFrame. The result is a Series with the same number of rows as the original DataFrame, but the value for level one units in the same level two unit is always the same.

```python
acs['income'].groupby('household').transform('mean')
```
```
household  person
37         1          6666.666667
           2          6666.666667
           3          6666.666667
241        1         32500.000000
242        1         30000.000000
                         ...
1236624    1         31050.000000
           2         31050.000000
1236756    1         55300.000000
           2         55300.000000
1236779    1         22110.000000
Name: income, Length: 27410, dtype: float64
```

### 5.2.2 Continuous Variables

We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. Create a column called `household_income` containing the total income of everyone in the household with:

```python
acs['household_income'] = (
    acs['income'].
    groupby('household').
    transform('sum')
)
acs
```
```
                   age             race marital_status                     edu   income female hispanic household_income
household person
37        1         20            White  Never married  Some college, >=1 year  10000.0   True    False          20000.0
          2         19            White  Never married  Some college, >=1 year   5300.0   True    False          20000.0
          3         19            Black  Never married  Some college, >=1 year   4700.0   True    False          20000.0
241       1         50            White  Never married         Master's degree  32500.0   True    False          32500.0
242       1         29            White  Never married       Bachelor's degree  30000.0   True    False          30000.0
...               ...              ...            ...                     ...      ...    ...      ...              ...
1236624   1         29            White    Now married  Some college, >=1 year  50100.0  False    False          62100.0
          2         26            White    Now married    High School graduate  12000.0   True    False          62100.0
1236756   1         58            White    Now married         Master's degree  69800.0   True    False         110600.0
          2         61            White    Now married         Master's degree  40800.0  False    False         110600.0
1236779   1         30  American Indian       Divorced    High School graduate  22110.0  False    False          22110.0

[27410 rows x 8 columns]
```

Now calculate the household size, or the number of people in the household. Since there is one row per person, we can use the `size()` function. You still need to have it act on a single column, but it makes no difference which column you choose:

```python
acs['household_size'] = (
    acs['age'].
    groupby('household').
    transform('size')
)
acs
```
```
                   age             race marital_status                     edu   income female hispanic household_income household_size
household person
37        1         20            White  Never married  Some college, >=1 year  10000.0   True    False          20000.0              3
          2         19            White  Never married  Some college, >=1 year   5300.0   True    False          20000.0              3
          3         19            Black  Never married  Some college, >=1 year   4700.0   True    False          20000.0              3
241       1         50            White  Never married         Master's degree  32500.0   True    False          32500.0              1
242       1         29            White  Never married       Bachelor's degree  30000.0   True    False          30000.0              1
...               ...              ...            ...                     ...      ...    ...      ...              ...            ...
1236624   1         29            White    Now married  Some college, >=1 year  50100.0  False    False          62100.0              2
          2         26            White    Now married    High School graduate  12000.0   True    False          62100.0              2
1236756   1         58            White    Now married         Master's degree  69800.0   True    False         110600.0              2
          2         61            White    Now married         Master's degree  40800.0  False    False         110600.0              2
1236779   1         30  American Indian       Divorced    High School graduate  22110.0  False    False          22110.0              1

[27410 rows x 9 columns]
```

Exercise

Create a variable `mean_age` containing the mean `age` of the household.

Now calculate the income per person, spot-checking household 484. Since income per person is the same as the mean income for the household, you might think you can use:

```python
acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform('mean')
)
acs.loc[484]
```
```
        age   race marital_status                   edu   income female hispanic household_income household_size income_per_person
person
1        33  Asian    Now married  High School graduate  16800.0  False    False          34800.0              4           17400.0
2        26  Asian    Now married            11th grade  18000.0   True    False          34800.0              4           17400.0
3         4  Asian            NaN                  None      NaN  False    False          34800.0              4           17400.0
4         2  Asian            NaN                   NaN      NaN   True    False          34800.0              4           17400.0
```

This does not give the right answer because of the way `mean()` handled missing values: it took the mean of the two observed values of `income` and ignored the two people with NaN. However, we know that NaN for `income` means the person is under 15 years old, and almost certainly their income is zero or close to it. What we want is the total income of the household divided by the number of people in the household, whether they have observed values of income or not.

Since you already created `household_income` and `household_size`, you can just use them:

```python
acs['income_per_person'] = (
    acs['household_income']/acs['household_size']
)
acs.loc[484]
```
```
        age   race marital_status                   edu   income female hispanic household_income household_size income_per_person
person
1        33  Asian    Now married  High School graduate  16800.0  False    False          34800.0              4            8700.0
2        26  Asian    Now married            11th grade  18000.0   True    False          34800.0              4            8700.0
3         4  Asian            NaN                  None      NaN  False    False          34800.0              4            8700.0
4         2  Asian            NaN                   NaN      NaN   True    False          34800.0              4            8700.0
```

Note that we did not need to use `groupby()`! `groupby()` is relevant when you’re using an aggregation function and need to control which rows are aggregated (e.g. the people in the same household). This code does not use an aggregation function: `income_per_person` for each row depends solely on `household_income` and `household_size` for the same row.

What if you hadn’t already created `household_income` and `household_size`? Is there a way to calculate `income_per_person` directly? Absolutely. We just need a function to pass to `transform()` that first calculates the sum and then divides by the size. No such function exists, but we can create one.

### 5.2.3 Lambda Functions

Python allows you to create and name your own functions, but you most likely won’t need to for data wrangling. You’ll probably be able to do everything you need to do with temporary, unnamed functions, known as lambda functions. You define a lambda function with:

```python
lambda x: (f(x))
```

where f(x) should be replaced by the function you need. `x` is a parameter that is being passed into the function; you can actually give it any name you want.

What seems like the obvious f(x) here is:

```python
lambda x: (x.sum()/x.size())
```

but that doesn’t work. The reason is that the `x` passed in is no longer a `groupby()` object, so you can’t use the `size()` function. However, Series and DataFrames have a `size` attribute which gives the total number of elements they contain. For a DataFrame this is the number of rows times the number of columns, which is rarely useful. However, for a Series that is one column of a DataFrame, `size` is the number of rows, and that is exactly what you need:

```python
acs['income'].size
```
```
27410
```

Thus the function is:

```python
lambda x: (x.sum()/x.size)
```

and the full code is:

```python
acs['income_per_person'] = (
    acs['income'].
    groupby('household').
    transform(
        lambda x:
        (x.sum()/x.size)
    )
)
acs.loc[484]
```
```
        age   race marital_status                   edu   income female hispanic household_income household_size income_per_person
person
1        33  Asian    Now married  High School graduate  16800.0  False    False          34800.0              4            8700.0
2        26  Asian    Now married            11th grade  18000.0   True    False          34800.0              4            8700.0
3         4  Asian            NaN                  None      NaN  False    False          34800.0              4            8700.0
4         2  Asian            NaN                   NaN      NaN   True    False          34800.0              4            8700.0
```

The data set is getting kind of wide, so remove the variables you just created (add `mean_age` if you created it):

```python
acs = acs.drop(
    [
        'household_income',
        'household_size',
        'income_per_person'
    ],
    axis=1
)
```

### 5.2.4 Subsetting With Aggregate Functions

Suppose you’re interested in the income children contribute to their household (household 8787 is a good example), so you want to create a column `child_income` containing the household’s total income earned by children. You might think this would be a straightforward process of adding up the `income` of everyone who’s younger than 18:

```python
acs['child_income'] = (
    acs.loc[acs['age']<18, 'income'].
    groupby('household').
    transform('sum')
)
acs.loc[8787]
```
```
        age   race marital_status                     edu   income female hispanic child_income
person
1        45  White       Divorced  Some college, >=1 year  38000.0   True    False          NaN
2        16  White  Never married               9th grade   3600.0  False    False       3800.0
3        15  White  Never married           7th-8th grade    200.0   True    False       3800.0
```

The trouble is that in excluding the adults from the `sum()`, you also excluded them from getting the result. Thus `child_income` is not a proper household-level variable, as it is not the same for all the observations in the same household, and you’d really be in trouble if your research question involved the effect of children’s contributions on the adults in the household. The solution is to create a column that only contains the quantities you want to add up, but can be added up for everyone:

```python
acs['income_if_child'] = acs.loc[acs['age']<18, 'income']
acs.loc[8787]
```
```
        age   race marital_status                     edu   income female hispanic child_income income_if_child
person
1        45  White       Divorced  Some college, >=1 year  38000.0   True    False          NaN             NaN
2        16  White  Never married               9th grade   3600.0  False    False       3800.0          3600.0
3        15  White  Never married           7th-8th grade    200.0   True    False       3800.0           200.0
```

For children `income_if_child` contains their income, while for adults it contains NaN. If we now pass `income_if_child` to `sum()`, it will add up just the children’s incomes without us having to exclude the adults. Thus the result will be stored in the rows for the household’s adults too:

```python
acs['child_income'] = (
    acs['income_if_child'].
    groupby('household').
    transform('sum')
)
acs.loc[8787]
```
```
        age   race marital_status                     edu   income female hispanic child_income income_if_child
person
1        45  White       Divorced  Some college, >=1 year  38000.0   True    False       3800.0             NaN
2        16  White  Never married               9th grade   3600.0  False    False       3800.0          3600.0
3        15  White  Never married           7th-8th grade    200.0   True    False       3800.0           200.0
```

Since all the aggregate functions ignore missing values, you can use this approach any time you need to use an aggregate function on a subset of the data:

1. Create a new column containing the value of interest for the observations to be included and NaN for the values to be excluded
2. Aggregate that column with no subsetting
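On made-up data the two steps look like this. This sketch uses `where()`, which keeps values where the condition holds and fills in NaN elsewhere, as an alternative to the `.loc` subsetting above:

```python
import pandas as pd

# Hypothetical households: one with two children, one with none
df = pd.DataFrame({
    'household': [1, 1, 1, 2, 2],
    'age':       [45, 16, 15, 40, 38],
    'income':    [38000.0, 3600.0, 200.0, 50000.0, 42000.0],
}).set_index('household')

# Step 1: keep the value of interest for included rows, NaN for excluded rows
df['income_if_child'] = df['income'].where(df['age'] < 18)

# Step 2: aggregate that column with no subsetting; sum() ignores the NaNs
df['child_income'] = df['income_if_child'].groupby('household').transform('sum')
print(df['child_income'])
```

Note that household 2 has no children, so its `income_if_child` is all NaN and its `child_income` is 0.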
Exercise

Find the mean age of the adults in each household.

Again, just to keep things neat drop the variables created in this section:

```python
acs = acs.drop(
    [
        'child_income',
        'income_if_child'
    ],
    axis=1
)
```

### 5.2.5 Indicator Variables

As we’ve seen, if you do math with an indicator variable, it will be coerced into a number with True converted to 1 and False converted to 0. Aggregate functions take on new and very useful meanings when applied to indicator variables.

Start by creating an indicator variable for ‘this person is a child’:

```python
acs['child'] = acs['age']<18
acs.loc[[37, 8787]]
```
```
                   age   race marital_status                     edu   income female hispanic  child
household person
37        1         20  White  Never married  Some college, >=1 year  10000.0   True    False  False
          2         19  Black  Never married  Some college, >=1 year   5300.0   True    False  False
          3         19  Black  Never married  Some college, >=1 year   4700.0   True    False  False
8787      1         45  White       Divorced  Some college, >=1 year  38000.0   True    False  False
          2         16  White  Never married               9th grade   3600.0  False    False   True
          3         15  White  Never married           7th-8th grade    200.0   True    False   True
```

Just to see the conversion in action, use the `astype()` function and pass in `'int32'` (32-bit integer):

```python
acs.loc[[37, 8787],'child'].astype('int32')
```
```
household  person
37         1         0
           2         0
           3         0
8787       1         0
           2         1
           3         1
Name: child, dtype: int32
```

The sum of `child` is the number of children in the household:

```python
acs['num_children'] = (
    acs['child'].
    groupby('household').
    transform('sum')
)
acs.loc[[37, 8787, 73731]]
```
```
                   age   race marital_status                     edu   income female hispanic  child num_children
household person
37        1         20  White  Never married  Some college, >=1 year  10000.0   True    False  False            0
          2         19  White  Never married  Some college, >=1 year   5300.0   True    False  False            0
          3         19  Black  Never married  Some college, >=1 year   4700.0   True    False  False            0
8787      1         45  White       Divorced  Some college, >=1 year  38000.0   True    False  False            2
          2         16  White  Never married               9th grade   3600.0  False    False   True            2
          3         15  White  Never married           7th-8th grade    200.0   True    False   True            2
73731     1         15  White    Now married               9th grade      0.0  False    False   True            2
          2         15  White  Never married           7th-8th grade      0.0   True    False   True            2
```

The mean of `child` is the proportion of the household that is children:

```python
acs['prop_children'] = (
    acs['child'].
    groupby('household').
    transform('mean')
)
acs.loc[[37, 8787, 73731]]
```
```
                   age   race marital_status                     edu   income female hispanic  child num_children prop_children
household person
37        1         20  White  Never married  Some college, >=1 year  10000.0   True    False  False            0      0.000000
          2         19  White  Never married  Some college, >=1 year   5300.0   True    False  False            0      0.000000
          3         19  Black  Never married  Some college, >=1 year   4700.0   True    False  False            0      0.000000
8787      1         45  White       Divorced  Some college, >=1 year  38000.0   True    False  False            2      0.666667
          2         16  White  Never married               9th grade   3600.0  False    False   True            2      0.666667
          3         15  White  Never married           7th-8th grade    200.0   True    False   True            2      0.666667
73731     1         15  White    Now married               9th grade      0.0  False    False   True            2      1.000000
          2         15  White  Never married           7th-8th grade      0.0   True    False   True            2      1.000000
```

The maximum value of `child` tells you if the household has any children or not:

```python
acs['has_children'] = (
    acs['child'].
    groupby('household').
    transform('max')
)
acs.loc[[37, 8787, 73731]]
```
```
                   age   race marital_status                     edu   income female hispanic  child num_children prop_children has_children
household person
37        1         20  White  Never married  Some college, >=1 year  10000.0   True    False  False            0      0.000000        False
          2         19  White  Never married  Some college, >=1 year   5300.0   True    False  False            0      0.000000        False
          3         19  Black  Never married  Some college, >=1 year   4700.0   True    False  False            0      0.000000        False
8787      1         45  White       Divorced  Some college, >=1 year  38000.0   True    False  False            2      0.666667         True
          2         16  White  Never married               9th grade   3600.0  False    False   True            2      0.666667         True
          3         15  White  Never married           7th-8th grade    200.0   True    False   True            2      0.666667         True
73731     1         15  White    Now married               9th grade      0.0  False    False   True            2      1.000000         True
          2         15  White  Never married           7th-8th grade      0.0   True    False   True            2      1.000000         True
```

This merits some explanation. If a household has no children, everyone in the household has a 0 (False) for `child` and the maximum value is 0. If a household has children, then at least one person in the household has 1 (True) for `child`, and the maximum value is 1. pandas is clever enough to realize that the `max()` of a boolean variable is also a boolean variable, so it stores True/False in `has_children` rather than 1/0.

The minimum value of `child` tells you if the household consists entirely of children:

```python
acs['all_children'] = (
    acs['child'].
    groupby('household').
    transform('min')
)
acs.loc[[37, 8787, 73731]]
```
```
                   age   race marital_status                     edu   income female hispanic  child num_children prop_children has_children all_children
household person
37        1         20  White  Never married  Some college, >=1 year  10000.0   True    False  False            0      0.000000        False        False
          2         19  White  Never married  Some college, >=1 year   5300.0   True    False  False            0      0.000000        False        False
          3         19  Black  Never married  Some college, >=1 year   4700.0   True    False  False            0      0.000000        False        False
8787      1         45  White       Divorced  Some college, >=1 year  38000.0   True    False  False            2      0.666667         True        False
          2         16  White  Never married               9th grade   3600.0  False    False   True            2      0.666667         True        False
          3         15  White  Never married           7th-8th grade    200.0   True    False   True            2      0.666667         True        False
73731     1         15  White    Now married               9th grade      0.0  False    False   True            2      1.000000         True         True
          2         15  White  Never married           7th-8th grade      0.0   True    False   True            2      1.000000         True         True
```

If everyone in the household is a child, everyone has a 1 (True) for `child` and the minimum value is 1 (True). If anyone in the household is not a child, then at least one person has a 0 (False) for `child` and the minimum is 0 (False).
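If you find `max()` and `min()` of booleans hard to read, note that `transform()` also accepts `'any'` and `'all'`, which give the same results under (arguably) clearer names. A sketch on made-up data mirroring households 8787 and 73731:

```python
import pandas as pd

# Hypothetical child indicators for two households
df = pd.DataFrame({
    'household': [8787, 8787, 8787, 73731, 73731],
    'child':     [False, True, True, True, True],
}).set_index('household')

# 'any' is equivalent to max() on a boolean: does the group have any children?
df['has_children'] = df['child'].groupby('household').transform('any')

# 'all' is equivalent to min(): is everyone in the group a child?
df['all_children'] = df['child'].groupby('household').transform('all')
print(df)
```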

Exercise

Create a variable containing the number of people in the household with `income` greater than zero and an indicator variable for ‘this household has someone with `income`>0’.

## 5.3 Wide Form Data

Up to this point we’ve worked with the American Community Survey in long form, with one observation per person. Some tasks are easier to do in long form than in wide form, but others are just as easy either way–once you know the tricks for working in wide form.

The file `acs_wide.pickle` contains the same ACS data we’ve been working with, but in wide form. Read it in as our new `acs`:

```python
acs = pd.read_pickle('acs_wide.pickle')
acs
```
```
           age_1  age_2  age_3  age_4  age_5  age_6  age_7  age_8  age_9  age_10  ...  female_7  female_8  female_9  female_10  female_11  female_12  female_13  female_14  female_15  female_16
household
37          20.0   19.0   19.0    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
241         50.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
242         29.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
377         69.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
418         59.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
...          ...    ...    ...    ...    ...    ...    ...    ...    ...     ...  ...       ...       ...       ...        ...        ...        ...        ...        ...        ...        ...
1236119     51.0   51.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1236287     41.0   42.0   23.0    4.0    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1236624     29.0   26.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1236756     58.0   61.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1236779     30.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN  ...       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN

[10565 rows x 112 columns]
```

Before proceeding, let’s review what we’re seeing here. In wide form, there is one row per household rather than one row per person as in long form. Each person is described by a set of columns: person 1 is described by `age_1`, `race_1`, etc. The largest household contains 16 people, so there are 16 copies of each column (i.e. they go up to `age_16`, `race_16`, etc.). Smaller households have NaN in the columns for the people they don’t have. For example, household 37 has three people, so `age_1`, `age_2`, and `age_3` have values while all the others have NaN.

In long form, `household` and `person` made up a MultiIndex row index, while the variable names (`age`, `race`, etc.) made up the column index. In wide form `household` is the row index. We could use the variable names plus `person` as a MultiIndex column index, and you’ll see what that looks like in the next chapter. However, we have not found a good way to have a MultiIndex describe data that contain both level one and level two variables. Thus, rather than using an official MultiIndex, the column names contain both the variable name and the person number. This seems to be the most common approach.

### 5.3.1 Subsetting in Wide Form

If we were using a MultiIndex for the columns we could use `loc` for subsetting just like we did in long form. However, since the information we need is embedded in the column names, we’ll use `filter()` instead.

The `filter()` function allows you to subset based on patterns in the column names. A very simple pattern is `like=`. This will select all the columns that contain the string you pass in. Select all the `income` columns by passing in `like='income'`:

```python
acs.filter(like='income').dtypes
```
```
income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object
```

We’ll learn shortly how to use this to create level two variables like `household_income`. But that raises a problem: once you create a `household_income` variable, `like='income'` will select it too.

Now consider trying to select all the variables that describe person 1. You might think `like='_1'` would do the job, but try it:

```python
acs.filter(like='_1').dtypes
```
```
age_1                 float64
age_10                float64
age_11                float64
age_12                float64
age_13                float64
age_14                float64
age_15                float64
age_16                float64
hispanic_1             object
hispanic_10            object
hispanic_11            object
hispanic_12            object
hispanic_13            object
hispanic_14            object
hispanic_15            object
hispanic_16            object
race_1               category
race_10              category
race_11              category
race_12              category
race_13              category
race_14              category
race_15              category
race_16              category
marital_status_1     category
marital_status_10    category
marital_status_11    category
marital_status_12    category
marital_status_13    category
marital_status_14    category
marital_status_15    category
marital_status_16    category
edu_1                category
edu_10               category
edu_11               category
edu_12               category
edu_13               category
edu_14               category
edu_15               category
edu_16               category
income_1              float64
income_10             float64
income_11             float64
income_12             float64
income_13             float64
income_14             float64
income_15             float64
income_16             float64
female_1               object
female_10              object
female_11              object
female_12              object
female_13              object
female_14              object
female_15              object
female_16              object
dtype: object
```

Instead we got the variables for persons 10 through 16 as well, because their names also contain `_1`.

What we need is the ability to be much more specific about the columns to be selected. The solution is regular expressions: a syntax for describing a pattern in text so that a computer can detect text that matches it. Regular expressions have been around for more than 50 years, and most programming languages and statistical packages implement them. We won’t discuss working with text data (strings), but learning regular expressions here lays the foundation for it. Regular expressions can be extremely complex; we’ll introduce just a few of the most important features, but you can find many, many tutorials online.

You use a regular expression with `filter()` by passing in `regex=`. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as `like=`:

```python
acs.filter(regex='income').dtypes
```
```
income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object
```

But a regular expression can do much more. Here are the tools we’ll use:

• If a regular expression starts with `^`, then the string only matches if it starts with the pattern.

• If a regular expression ends with `$`, then the string only matches if it ends with the pattern.

• `\d` matches any digit (number).

• A `*` after something means the string can contain any number of that thing at that point. So `\d*` means the string can contain any number of digits at that point.
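You can test patterns like these with Python’s built-in `re` module before using them in `filter()`. Here is a quick sketch using a small hypothetical sample of column names:

```python
import re

# A few column names like those in the ACS extract (hypothetical sample)
columns = ['income_1', 'income_10', 'household_income', 'age_1']

# '^income_' anchors at the start, '\d*' allows any digits, '$' anchors the end
print([c for c in columns if re.search(r'^income_\d*$', c)])
# ['income_1', 'income_10']

# '_1$' requires the name to end with '_1', so '_10' does not match
print([c for c in columns if re.search(r'_1$', c)])
# ['income_1', 'age_1']
```

Note the `r''` raw string prefix: it keeps Python from trying to interpret `\d` as an escape sequence.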

Putting this all together, the regular expression `^income_\d*$` will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (`income_1`, `income_2`, etc.) without also matching columns like `household_income`:

``acs.filter(regex=r'^income_\d*$').dtypes``
``````income_1     float64
income_2     float64
income_3     float64
income_4     float64
income_5     float64
income_6     float64
income_7     float64
income_8     float64
income_9     float64
income_10    float64
income_11    float64
income_12    float64
income_13    float64
income_14    float64
income_15    float64
income_16    float64
dtype: object``````

The regular expression `_1$` will match strings that end with ‘_1’. This excludes variables that end with ‘_10’, ‘_11’, etc. and thus allows us to select just the columns describing person 1:

``acs.filter(regex=r'_1$').dtypes``
``````age_1                float64
hispanic_1            object
race_1              category
marital_status_1    category
edu_1               category
income_1             float64
female_1              object
dtype: object``````
Exercise

Use a regular expression to select all the race variables. Then use one to select all the variables for person 5.

### 5.3.2 Creating Level Two Variables

When we used the `sum()` function in the last section to add up household incomes, it was adding all the values in a column. However, with the `axis=1` argument it will add up values across a row. You can use this along with `filter()` to calculate household income in wide form:

``````acs['household_income'] = (
    acs.filter(regex=r'^income_\d*$').
    sum(axis=1)
)
acs['household_income']``````
``````household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
...
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: household_income, Length: 10565, dtype: float64``````

If you want to be able to see both the individual incomes and `household_income` so you can spot-check the results, go ahead and use `like='income'` (and scroll right):

``acs.filter(like='income')``
income_1 income_2 income_3 income_4 income_5 income_6 income_7 income_8 income_9 income_10 income_11 income_12 income_13 income_14 income_15 income_16 household_income
household
37 10000.0 5300.0 4700.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20000.0
241 32500.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32500.0
242 30000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 30000.0
377 51900.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 51900.0
418 12200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12200.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 93200.0
1236287 15000.0 27000.0 11400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 53400.0
1236624 50100.0 12000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 62100.0
1236756 69800.0 40800.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 110600.0
1236779 22110.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22110.0

10565 rows × 17 columns

This works for all of the aggregate functions from the last section, except those that only work with `groupby()`. Thus the code to carry out the tasks from the last section is almost the same: just use `filter()` instead of `loc` to select the right subsample, don’t bother with `transform()`, and pass `axis=1` to the aggregate function itself.
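For instance, here is the same pattern with `max()` on a toy wide-form DataFrame (hypothetical values standing in for the ACS extract):

```python
import pandas as pd

# Toy wide-form data: one row per household, one income column per person
acs = pd.DataFrame(
    {'income_1': [10000.0, 32500.0], 'income_2': [5300.0, None]},
    index=pd.Index([37, 241], name='household')
)

# filter() selects the per-person columns; axis=1 aggregates across each row
acs['max_income'] = acs.filter(regex=r'^income_\d*$').max(axis=1)
print(acs['max_income'].tolist())  # [10000.0, 32500.0]
```

Note that `max(axis=1)` skips NaN by default, so households with fewer people are handled correctly.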

How can we find the size of each household (number of people)? Recall that `count()` gives the number of non-missing values. So as long as there’s at least one variable with no missing values in the data set, you can use `count()` to find the number of people. This data set has multiple candidates, but use `age`:

``````acs['size'] = (
    acs.filter(regex=r'^age_\d*$').
    count(axis=1)
)
acs['size'].value_counts()``````
``````size
2     3412
1     2779
3     1685
4     1479
5      760
6      273
7       96
8       47
9       13
10      10
12       7
11       3
16       1
Name: count, dtype: int64``````

What about income per person? That’s actually easier than in long form. There’s no need to broadcast the result to all the rows in the same household, so you don’t need `transform()` or a lambda function. Just add up the household income and divide by the number of people in the household:

``````acs['income_per_person'] = (
    acs.filter(regex=r'^income_\d*$').
    sum(axis=1) /
    acs.filter(regex=r'^age_\d*$').
    count(axis=1)
)
acs.filter(like='income').loc[484]``````
``````income_1             16800.0
income_2             18000.0
income_3                 NaN
income_4                 NaN
income_5                 NaN
income_6                 NaN
income_7                 NaN
income_8                 NaN
income_9                 NaN
income_10                NaN
income_11                NaN
income_12                NaN
income_13                NaN
income_14                NaN
income_15                NaN
income_16                NaN
household_income     34800.0
income_per_person     8700.0
Name: 484, dtype: float64``````

Recall that household 484 has two children who have missing values for income because they are under 15.

Exercise

Find the mean `age` for each household. Then create an indicator variable for ‘everyone in this household is female’.

### 5.3.3 Creating Level One Variables

Creating level one variables in long form is straightforward. For example, to create an indicator for ‘this person is black’ you’d just run:

``acs['black'] = (acs['race']=='Black')``

But in wide form there are 16 race variables, and there will need to be 16 indicators for Black as well. If you need to do a lot of work with level one variables, this is a pretty good reason to convert your data to long form. We’ll learn how in the next chapter. But it’s not too hard to make all 16 variables as long as you use a for loop.

Once again, we’ll start by considering what it would take to create just one of the 16 new indicator variables:

``acs['black_1'] = (acs['race_1']=='Black')``

That would take care of person 1. What would need to change in order to take care of person 2? The number at the end of the variable names. That tells us that this time we need to loop over a range of numbers. You can do that with the `range()` function. In its simplest form it takes two arguments, a starting number and an ending number. In true Python style the ending number is not included. To see how this works in a loop, try:

``for i in range(1, 17): print(i)``
``````1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16``````

So far so good, but how do we get those numbers into the variable names? If you add two strings together, they’ll be concatenated:

``'Hello ' + 'world!'``
``'Hello world!'``

The only wrinkle is that `i` is a number, but you can convert it to a string with the `str()` function and then add it:

``'black_'+str(i)``
``'black_16'``

This allows us to build the variable names we need, so now you’re ready to write the loop. One last complication: remember that any comparison involving NaN returns False. If a race variable is NaN the corresponding black variable should also be NaN, so select only the rows where race is not NaN:

``````for i in range(1, 17):
    acs.loc[
        acs['race_'+str(i)].notna(),
        'black_'+str(i)
    ] = (acs['race_'+str(i)]=='Black')
acs.filter(like='black')``````
black_1 black_2 black_3 black_4 black_5 black_6 black_7 black_8 black_9 black_10 black_11 black_12 black_13 black_14 black_15 black_16
household
37 False False True NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 False False False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 16 columns

Exercise

Create a set of indicator variables for ‘this person has experienced the end of a marriage’. This would include anyone who is divorced, widowed, or separated (we’ll call that an ‘end’, though you could argue otherwise), and not anyone who is currently married or has never been married.

## 5.4 Panel Data

Panel data or longitudinal data is just another form of hierarchical data, with subjects as level two units and times they were observed as level one units. With panel data, the timing of the observations or at least their order is important. If it’s not, then we refer to it as repeated measures data. Since panel data is just hierarchical data, a lot of the code we’ve already learned applies directly to panel data, but there are some panel-specific tricks too.

You should already have the NLSY extract in memory from the second exercise. If not, load it with:

``````nlsy = pd.read_pickle('nlsy.pickle')
nlsy``````
birth_year edu income age
id year
1.0 1979 58.0 12TH GRADE 4620.0 21.0
1980 58.0 NaN NaN 22.0
1981 58.0 12TH GRADE 5000.0 23.0
1982 58.0 NaN NaN 24.0
1983 58.0 NaN NaN 25.0
... ... ... ... ... ...
12686.0 1993 60.0 12TH GRADE 31000.0 33.0
1994 60.0 NaN NaN 34.0
1996 60.0 NaN NaN 36.0
1998 60.0 NaN NaN 38.0
2000 60.0 NaN NaN 40.0

241034 rows × 4 columns

### 5.4.1 Creating Level Two Variables

The National Longitudinal Survey of Youth tracks young people for many years, making it classic panel data. In the NLSY a level two unit is a person and a level one unit is a person-year combination. You can create level two variables using the same tools we used with the ACS. For example, create `max_edu` containing the person’s highest level of education with:

``````nlsy['max_edu'] = (
nlsy['edu'].
groupby('id').
transform('max')
)
nlsy.loc[5]``````
birth_year edu income age max_edu
year
1979 59.0 1ST YEAR COLLEGE 2200.0 20.0 6TH YEAR COLLEGE
1980 59.0 2ND YEAR COLLEGE 2000.0 21.0 6TH YEAR COLLEGE
1981 59.0 3RD YEAR COLLEGE 3400.0 22.0 6TH YEAR COLLEGE
1982 59.0 4TH YEAR COLLEGE 2300.0 23.0 6TH YEAR COLLEGE
1983 59.0 5TH YEAR COLLEGE 2200.0 24.0 6TH YEAR COLLEGE
1984 59.0 6TH YEAR COLLEGE 10500.0 25.0 6TH YEAR COLLEGE
1985 59.0 NaN NaN 26.0 6TH YEAR COLLEGE
1986 59.0 6TH YEAR COLLEGE 43119.0 27.0 6TH YEAR COLLEGE
1987 59.0 NaN NaN 28.0 6TH YEAR COLLEGE
1988 59.0 6TH YEAR COLLEGE 57124.0 29.0 6TH YEAR COLLEGE
1989 59.0 6TH YEAR COLLEGE 62836.0 30.0 6TH YEAR COLLEGE
1990 59.0 NaN NaN 31.0 6TH YEAR COLLEGE
1991 59.0 NaN NaN 32.0 6TH YEAR COLLEGE
1992 59.0 NaN NaN 33.0 6TH YEAR COLLEGE
1993 59.0 NaN NaN 34.0 6TH YEAR COLLEGE
1994 59.0 NaN NaN 35.0 6TH YEAR COLLEGE
1996 59.0 NaN NaN 37.0 6TH YEAR COLLEGE
1998 59.0 NaN NaN 39.0 6TH YEAR COLLEGE
2000 59.0 NaN NaN 41.0 6TH YEAR COLLEGE

Note that this relies on `edu` being an ordered categorical variable.
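To see why the ordering matters, compare an ordered categorical to plain strings (a toy example, not the NLSY’s actual coding):

```python
import pandas as pd

# An ordered categorical knows '12TH GRADE' outranks '9TH GRADE'
levels = ['9TH GRADE', '12TH GRADE', '1ST YEAR COLLEGE']
edu = pd.Series(
    pd.Categorical(['12TH GRADE', '9TH GRADE'],
                   categories=levels, ordered=True)
)

print(edu.max())                          # 12TH GRADE (category order)
print(max(['12TH GRADE', '9TH GRADE']))   # 9TH GRADE (alphabetical, wrong)
```

With plain strings, `max()` sorts alphabetically and `'9...'` beats `'1...'`, which is why the categories must be declared ordered.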

The `first()` and `last()` functions are particularly useful with panel data. But their meaning depends on the sort order of the data. This data set appears to be in chronological order, but you shouldn’t rely on that when writing code that will give you the wrong answer if it is not. So before using the `first()` function to find each person’s `age` at the start of the study, sort by `id` and `year`:

``````nlsy = nlsy.sort_values(['id', 'year'])
nlsy['starting_age'] = (
nlsy['age'].
groupby('id').
transform('first')
)
nlsy``````
birth_year edu income age max_edu starting_age
id year
1.0 1980 58.0 NaN NaN 22.0 12TH GRADE 21.0
1982 58.0 NaN NaN 24.0 12TH GRADE 21.0
1983 58.0 NaN NaN 25.0 12TH GRADE 21.0
... ... ... ... ... ... ... ...
12686.0 1994 60.0 NaN NaN 34.0 12TH GRADE 19.0
1996 60.0 NaN NaN 36.0 12TH GRADE 19.0
1998 60.0 NaN NaN 38.0 12TH GRADE 19.0
2000 60.0 NaN NaN 40.0 12TH GRADE 19.0

241034 rows × 6 columns

Exercise

Use the `last()` function to find each subject’s last value of `income`. Does this behave as you expected?

Before proceeding, remove the variables we’ve created thus far just to keep the data set neat:

``````nlsy = nlsy.drop(
[
'max_edu',
'starting_age'
], axis=1
)``````

### 5.4.2 Filling In Gaps

The NLSY frequently has missing data because subjects could not be located in a given year. Somewhat unusually for panel data, the survey often found those people again in later years, so there are gaps as well as attrition. Consider what we know about the education of person 11:

``nlsy.loc[11]``
birth_year edu income age
year
1979 59.0 1ST YEAR COLLEGE 2900.0 20.0
1980 59.0 NaN NaN 21.0
1981 59.0 3RD YEAR COLLEGE 2000.0 22.0
1982 59.0 4TH YEAR COLLEGE 4002.0 23.0
1983 59.0 4TH YEAR COLLEGE 8000.0 24.0
1984 59.0 4TH YEAR COLLEGE 17000.0 25.0
1985 59.0 4TH YEAR COLLEGE 20000.0 26.0
1986 59.0 NaN NaN 27.0
1987 59.0 NaN NaN 28.0
1988 59.0 NaN NaN 29.0
1989 59.0 4TH YEAR COLLEGE 46000.0 30.0
1990 59.0 4TH YEAR COLLEGE 74283.0 31.0
1991 59.0 4TH YEAR COLLEGE 81481.0 32.0
1992 59.0 4TH YEAR COLLEGE 55000.0 33.0
1993 59.0 4TH YEAR COLLEGE 60000.0 34.0
1994 59.0 4TH YEAR COLLEGE 101653.0 35.0
1996 59.0 4TH YEAR COLLEGE 60000.0 37.0
1998 59.0 4TH YEAR COLLEGE 25000.0 39.0
2000 59.0 4TH YEAR COLLEGE 55000.0 41.0

They were not interviewed in 1986, 1987, and 1988, but given that their education level both before and after that gap is ‘4TH YEAR COLLEGE’, we know it was ‘4TH YEAR COLLEGE’ in those years as well. We might guess that in 1980 their education level was ‘2ND YEAR COLLEGE’, but that is less certain, and if that were a two-year gap we’d have no idea. So let’s fill in gaps only where `edu` is the same before and after the gap.

To actually do that, we’ll fill in all the gaps in two ways: first by carrying past values forward with the `ffill()` function, and then by carrying future values back with the `bfill()` function. If the two methods agree, then that is a gap we can fill. While `ffill()` and `bfill()` are not aggregate functions, it’s still important to use `groupby('id')` because that ensures each person is considered separately and no values are carried from one person to another.

``````nlsy['edu_forward'] = (
nlsy['edu'].
groupby('id').
ffill()
)
nlsy['edu_backward'] = (
nlsy['edu'].
groupby('id').
bfill()
)
nlsy.loc[11]``````
birth_year edu income age edu_forward edu_backward
year
1979 59.0 1ST YEAR COLLEGE 2900.0 20.0 1ST YEAR COLLEGE 1ST YEAR COLLEGE
1980 59.0 NaN NaN 21.0 1ST YEAR COLLEGE 3RD YEAR COLLEGE
1981 59.0 3RD YEAR COLLEGE 2000.0 22.0 3RD YEAR COLLEGE 3RD YEAR COLLEGE
1982 59.0 4TH YEAR COLLEGE 4002.0 23.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1983 59.0 4TH YEAR COLLEGE 8000.0 24.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1984 59.0 4TH YEAR COLLEGE 17000.0 25.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1985 59.0 4TH YEAR COLLEGE 20000.0 26.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1986 59.0 NaN NaN 27.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1987 59.0 NaN NaN 28.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1988 59.0 NaN NaN 29.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1989 59.0 4TH YEAR COLLEGE 46000.0 30.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1990 59.0 4TH YEAR COLLEGE 74283.0 31.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1991 59.0 4TH YEAR COLLEGE 81481.0 32.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1992 59.0 4TH YEAR COLLEGE 55000.0 33.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1993 59.0 4TH YEAR COLLEGE 60000.0 34.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1994 59.0 4TH YEAR COLLEGE 101653.0 35.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1996 59.0 4TH YEAR COLLEGE 60000.0 37.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1998 59.0 4TH YEAR COLLEGE 25000.0 39.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
2000 59.0 4TH YEAR COLLEGE 55000.0 41.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE

Now we’ll replace `edu` if `edu` is missing and `edu_forward` and `edu_backward` are the same. Since they are the same it doesn’t matter which of them we set `edu` to:

``````nlsy.loc[
nlsy['edu'].isna() &
(nlsy['edu_forward']==nlsy['edu_backward']),
'edu'
] = nlsy['edu_forward']
nlsy.loc[11]``````
birth_year edu income age edu_forward edu_backward
year
1979 59.0 1ST YEAR COLLEGE 2900.0 20.0 1ST YEAR COLLEGE 1ST YEAR COLLEGE
1980 59.0 NaN NaN 21.0 1ST YEAR COLLEGE 3RD YEAR COLLEGE
1981 59.0 3RD YEAR COLLEGE 2000.0 22.0 3RD YEAR COLLEGE 3RD YEAR COLLEGE
1982 59.0 4TH YEAR COLLEGE 4002.0 23.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1983 59.0 4TH YEAR COLLEGE 8000.0 24.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1984 59.0 4TH YEAR COLLEGE 17000.0 25.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1985 59.0 4TH YEAR COLLEGE 20000.0 26.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1986 59.0 4TH YEAR COLLEGE NaN 27.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1987 59.0 4TH YEAR COLLEGE NaN 28.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1988 59.0 4TH YEAR COLLEGE NaN 29.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1989 59.0 4TH YEAR COLLEGE 46000.0 30.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1990 59.0 4TH YEAR COLLEGE 74283.0 31.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1991 59.0 4TH YEAR COLLEGE 81481.0 32.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1992 59.0 4TH YEAR COLLEGE 55000.0 33.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1993 59.0 4TH YEAR COLLEGE 60000.0 34.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1994 59.0 4TH YEAR COLLEGE 101653.0 35.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1996 59.0 4TH YEAR COLLEGE 60000.0 37.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1998 59.0 4TH YEAR COLLEGE 25000.0 39.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE
2000 59.0 4TH YEAR COLLEGE 55000.0 41.0 4TH YEAR COLLEGE 4TH YEAR COLLEGE

Note that we will not attempt to fill in other gaps where we are not perfectly certain what the missing value should be. There are more or less valid methods for filling in missing values, like multiple imputation, but they are complicated and a lot of work. Methods that are less complicated and less work, like filling in the mean or linear interpolation, can create serious problems for any statistical analysis. It’s unusual to be able to fill in gaps as we did here, but it’s worth doing when you can.

### 5.4.3 Variables Based On An Observation's Neighbors

The `shift()` function can shift observations forward or backwards: forward if you pass in a positive number, backwards if you pass in a negative number. For easy viewing, let’s build a DataFrame by passing a dictionary to the Pandas `DataFrame` function. The dictionary must contain the column names as keys and Series (which we’ll create based on the `edu` Series in `nlsy`) as values. Our DataFrame will have one column for the original `edu`, one for shifting forward, and one for shifting backward:

``````pd.DataFrame(
{
'edu' : nlsy.loc[11, 'edu'],
'edu_forward_1' : nlsy.loc[11, 'edu'].shift(1),
'edu_backward_1' : nlsy.loc[11, 'edu'].shift(-1)
}
)``````
edu edu_forward_1 edu_backward_1
year
1979 1ST YEAR COLLEGE NaN NaN
1980 NaN 1ST YEAR COLLEGE 3RD YEAR COLLEGE
1981 3RD YEAR COLLEGE NaN 4TH YEAR COLLEGE
1982 4TH YEAR COLLEGE 3RD YEAR COLLEGE 4TH YEAR COLLEGE
1983 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1984 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1985 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1986 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1987 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1988 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1989 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1990 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1991 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1992 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1993 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1994 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1996 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
1998 4TH YEAR COLLEGE 4TH YEAR COLLEGE 4TH YEAR COLLEGE
2000 4TH YEAR COLLEGE 4TH YEAR COLLEGE NaN

Note that when shifting forward you’ll have NaN at the beginning of the series, and when shifting backwards at the end. In this case we did not need `groupby()` because we extracted just one person, but in general you’ll need it to prevent values from being shifted between people.
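Here is the difference on a toy two-person panel (hypothetical values), showing why the grouped version matters:

```python
import pandas as pd

# Two people ('id'), two years each
idx = pd.MultiIndex.from_tuples(
    [(1, 1979), (1, 1980), (2, 1979), (2, 1980)], names=['id', 'year']
)
edu = pd.Series(['9TH', '10TH', '11TH', '12TH'], index=idx)

# Plain shift leaks person 1's last value into person 2's first year
print(edu.shift(1).tolist())                # [nan, '9TH', '10TH', '11TH']
# Grouping by id keeps each person's values separate
print(edu.groupby('id').shift(1).tolist())  # [nan, '9TH', nan, '11TH']
```

The plain `shift(1)` gives person 2 a 1979 value of '10TH', which actually belongs to person 1.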

Now consider trying to identify the year in which a person graduates from high school. In that year their `edu` will be ‘12TH GRADE’ or higher and the year before their `edu` will be less than ‘12TH GRADE’. Note that for whatever reason the NLSY frequently records people as having gained multiple years of education in one year, and we won’t worry about that. For person 2, the year in which they graduated is 1985:

``nlsy.loc[2]``
birth_year edu income age edu_forward edu_backward
year

Now create an indicator variable for ‘graduated this year’ with:

``````nlsy['grad'] = (
    (nlsy['edu']>='12TH GRADE') &
    (nlsy['edu'].groupby('id').shift(1)<'12TH GRADE')
)
nlsy.loc[2]``````
birth_year edu income age edu_forward edu_backward grad
year

Note that if either `edu` or shifted `edu` is NaN, the result will be False. Thus this indicator variable is really for ‘we know the person graduated this year.’ We could set it to NaN for years with missing values if we wanted to, using the techniques we’ve used before.
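One way to sketch that optional step, on a toy Series (the condition and the use of boolean-mask assignment are one possible approach, not code from the NLSY workflow):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): edu is missing in the third year
edu  = pd.Series(['11TH GRADE', '12TH GRADE', None, '12TH GRADE'])
grad = pd.Series([False, True, False, False], dtype='object')

# Where edu or the prior year's edu is missing, we can't tell, so set NaN
grad[edu.isna() | edu.shift(1).isna()] = np.nan
print(grad.tolist())  # [nan, True, nan, nan]
```

The first year is also set to NaN because there is no prior year to compare against.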

Logically each person should only graduate from high school once, but it’s not hard to imagine data entry errors that would make it look like they graduated multiple times. Recall that the sum of an indicator variable is the number of times it is True, so check with:

``````(
    nlsy['grad'].
    groupby('id').
    sum().
    value_counts()
)``````
``````grad
0    7033
1    5653
Name: count, dtype: int64``````

A little over half the subjects have True for `grad` zero times: either they never graduated from high school, they graduated outside the study period, or we could not identify their graduation year due to missing data. But no one graduated more than once.

Now let’s find their age at graduation. This is just a matter of subsetting with an aggregate function, except that the desired subset has just one observation per group. So start by creating `age_if_grad` containing the person’s age for the year they graduated and NaN for all others.

In the past we’ve done this with code like:

``nlsy.loc[nlsy['grad'], 'age_if_grad'] = nlsy['age']``

and that would absolutely work. But let’s learn an alternative method. The numpy `where()` function takes three arguments: a condition, the result to return if the condition is true, and the result to return if the condition is false. Build `age_if_grad` by importing numpy as np and then passing to `where()` the three arguments `grad`, `age`, and NaN:

``````import numpy as np
nlsy['age_if_grad'] = np.where(
    nlsy['grad'],
    nlsy['age'],
    np.nan
)
nlsy.loc[2]``````
year

You may find that syntax clearer. `where()` would definitely be easier to use than the first syntax if you needed to set the rows where `grad` was false to something other than NaN.
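For example, flagging graduation-year ages while replacing the rest with 0 rather than NaN (a toy illustration, not part of the NLSY workflow):

```python
import numpy as np
import pandas as pd

grad = pd.Series([False, True, False])
age  = pd.Series([17.0, 18.0, 19.0])

# The third argument to where() can be anything, not just NaN
print(np.where(grad, age, 0).tolist())  # [0.0, 18.0, 0.0]
```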

`age_if_grad` for each person is a single number plus a bunch of NaNs. Many aggregate functions have the property that if you give them a single number they’ll return that number, so we could use any of them. We’ll go with `mean`:

``````nlsy['age_at_grad'] = (
    nlsy['age_if_grad'].
    groupby('id').
    transform('mean')
)
nlsy.loc[2]``````
year