6  Working With Hierarchical Data in Wide Form

Wide form data almost always looks uglier than long form, and some tasks are always harder to do in wide form than in long form. Others can be done just as easily in wide form if you get comfortable working across columns rather than down rows. Some Python functions work exactly the same either way.

The Pythonic way to work with wide form data is with a column MultiIndex that reflects the structure of the data, just like we used a row MultiIndex in long form. However, column MultiIndexes are unique to Python and don’t have a great way to handle level 2 variables. So you’ll more commonly see wide form data where the variable name reflects both levels of the hierarchy, like age_1 for ‘age of person 1’. We’ll demonstrate both methods.

This chapter is meant to be read after Working with Hierarchical Data in Long Form. That chapter introduces the core concepts of hierarchical data (in any form), contains more example tasks, and explains them in more detail. This chapter mostly translates what you learned in the previous chapter to wide form.

6.1 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Wide_Practice.ipynb. Have it import Pandas and NumPy, then load acs_wide_mi.pickle as acs:

import pandas as pd
import numpy as np

acs = pd.read_pickle('acs_wide_mi.pickle')
acs
age race income female
person 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
household
37 20.0 19.0 19.0 NaN White White Black NaN 10000.0 5300.0 4700.0 NaN True True True NaN
241 50.0 NaN NaN NaN White NaN NaN NaN 32500.0 NaN NaN NaN True NaN NaN NaN
242 29.0 NaN NaN NaN White NaN NaN NaN 30000.0 NaN NaN NaN True NaN NaN NaN
377 69.0 NaN NaN NaN White NaN NaN NaN 51900.0 NaN NaN NaN True NaN NaN NaN
418 59.0 NaN NaN NaN White NaN NaN NaN 12200.0 NaN NaN NaN True NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN White White NaN NaN 62200.0 31000.0 NaN NaN False True NaN NaN
1236287 41.0 42.0 23.0 4.0 American Indian White White White 15000.0 27000.0 11400.0 NaN False True True False
1236624 29.0 26.0 NaN NaN White White NaN NaN 50100.0 12000.0 NaN NaN False True NaN NaN
1236756 58.0 61.0 NaN NaN White White NaN NaN 69800.0 40800.0 NaN NaN True False NaN NaN
1236779 30.0 NaN NaN NaN American Indian NaN NaN NaN 22110.0 NaN NaN NaN False NaN NaN NaN

10565 rows × 16 columns

This data set is a subset of the ACS data you’ve been working with. It contains fewer variables and only the first four people in each household. However, it is in wide form, meaning that there’s one row per level 2 unit (household) and one column per variable per level one unit (person). That makes it quite wide (which is why it’s a subset of the original).

The number of columns is determined by the number of people in the largest household. Household 1236287, for example, has four people in it and thus four columns are needed to store all their information. (The full data set has a household with 16 people in it, so it requires 16 columns per variable.) However, unlike long form where different households can have different numbers of rows, in wide form every household must have the same number of columns. Households with fewer people will have NaN in the columns for people that don’t exist in their household. For example, household 37 has three people, so the columns for person 4 are always NaN.

6.2 Wide Form With a Column MultiIndex

In order to pick out a particular number from this data set, I need a household number, a person number, and a variable. For example, 20 is the age of person 1 in household 37. In the long form, household and person identify a row, and a variable identifies a column. A row MultIndex allowed Python to understand the relationship between household and person.

In this version of wide format, household identifies a row and a combination of a variable and person identify a column. The column MultiIndex allows Python to understand the relationship between variable and person. Note that we have not officially named the first level of the MultiIndex variable (we could) so we’ll just refer to it as level 0. The second level is named person, or level 1.

6.2.1 Subsetting

Up to this point, all of our column indexes have been simple strings like age or income. Now, to identify a column you need to identify both the variable and the person of interest. This is done as a tuple, for example ('age', 1) for ‘age of person 1.’ Tuples like this can be used as column names in square brackets just like the strings we’ve used before:

acs[('age',1)]
household
37         20.0
241        50.0
242        29.0
377        69.0
418        59.0
           ... 
1236119    51.0
1236287    41.0
1236624    29.0
1236756    58.0
1236779    30.0
Name: (age, 1), Length: 10565, dtype: float64

They can also be used along with a row index in loc to pick out a single number:

acs.loc[37, ('age', 1)]
20.0

Note how the household number 37 now identifies a single row, since in wide form there’s just one observation per household.

The xs() function gives you more flexibility. To select all the age variables, use:

acs.xs('age', axis=1)
person 1 2 3 4
household
37 20.0 19.0 19.0 NaN
241 50.0 NaN NaN NaN
242 29.0 NaN NaN NaN
377 69.0 NaN NaN NaN
418 59.0 NaN NaN NaN
... ... ... ... ...
1236119 51.0 51.0 NaN NaN
1236287 41.0 42.0 23.0 4.0
1236624 29.0 26.0 NaN NaN
1236756 58.0 61.0 NaN NaN
1236779 30.0 NaN NaN NaN

10565 rows × 4 columns

The argument axis=1 clarifies that you’re talking about the column indexes. (The rows are axis 0.) You’ll be using axis=1 a lot in this chapter. xs() will assume you’re specifying level 0 (variable) unless you say otherwise.

Since all the columns are age, by default xs() drops that level from the results. You can tell it not to with drop_level=False:

acs.xs('age', axis=1, drop_level=False)
age
person 1 2 3 4
household
37 20.0 19.0 19.0 NaN
241 50.0 NaN NaN NaN
242 29.0 NaN NaN NaN
377 69.0 NaN NaN NaN
418 59.0 NaN NaN NaN
... ... ... ... ...
1236119 51.0 51.0 NaN NaN
1236287 41.0 42.0 23.0 4.0
1236624 29.0 26.0 NaN NaN
1236756 58.0 61.0 NaN NaN
1236779 30.0 NaN NaN NaN

10565 rows × 4 columns

You can also use xs() to select all the variables for a given person. In doing so, you need to specify that you’re selecting based on level 1 of the MultiIndex, which you can do by number or by name (person):

acs.xs(1, level=1, axis=1)
age race income female
household
37 20.0 White 10000.0 True
241 50.0 White 32500.0 True
242 29.0 White 30000.0 True
377 69.0 White 51900.0 True
418 59.0 White 12200.0 True
... ... ... ... ...
1236119 51.0 White 62200.0 False
1236287 41.0 American Indian 15000.0 False
1236624 29.0 White 50100.0 False
1236756 58.0 White 69800.0 True
1236779 30.0 American Indian 22110.0 False

10565 rows × 4 columns

This looks a lot like the long form data, but there are far fewer rows because only the first person in each household is included. If we had specified drop_level=False we’d be reminded of that.

6.2.2 Creating Level 1 Variables

In long form, creating level 1 variables is straightforward. For example, to create an indicator variable for ‘this person is Black’ you’d run something like:

acs['black'] = (acs['race']=='Black')

This acts on all the rows in the data set automatically. Since in long form the people in a household are rows, it thus takes care of all the people in a household automatically. The simple wide-form equivalent:

acs[('black', 1)] = (acs[('race', 1)]=='Black')

would only take care of person 1 in each household. Clearly this is a job for a for loop. But there’s another wrinkle: we now have lots of missing values for race, because that’s how wide form tells us a person doesn’t exist.

Let’s start by creating a single column properly, including the missing values. Creating column 4 will make it easy to see if we’ve succeeded. Recall that the np.where() function takes a condition, the result if the condition is true, and the result if the condition is false. That makes it easy to check if ('race', 4) is missing, set the new column to np.NaN if it is, and set it to the result of ('race', 4)=='Black' if it is not. Just remember that we must specify that the result is boolean with astype('boolean') or NumPy will return 1/0 instead of True/False.

acs[('black', 4)] = np.where(
    acs[('race', 4)].isna(), 
    np.NaN, 
    (acs[('race', 4)]=='Black').astype('boolean'))
acs.xs(4, level='person', axis=1, drop_level=False)
age race income female black
person 4 4 4 4 4
household
37 NaN NaN NaN NaN NaN
241 NaN NaN NaN NaN NaN
242 NaN NaN NaN NaN NaN
377 NaN NaN NaN NaN NaN
418 NaN NaN NaN NaN NaN
... ... ... ... ... ...
1236119 NaN NaN NaN NaN NaN
1236287 4.0 White NaN False 0.0
1236624 NaN NaN NaN NaN NaN
1236756 NaN NaN NaN NaN NaN
1236779 NaN NaN NaN NaN NaN

10565 rows × 5 columns

The result is NaN for all the rows we can see except 1236287, where it is correctly False, so now we’re ready to wrap that in a for loop. Before you do, drop ('black', 4) just because if you don’t it will be out of order.

acs = acs.drop(columns=('black', 4))

In order to make the code that created ('black', 4) do a different column, what needs to change is the person number in the tuples ('black', 4) and ('race', 4). That tells us we need to loop over person numbers. We can get the numbers 1 through 4 with range(1, 5) since range() leaves off the endpoint.

for person in range(1, 5):
    acs[('black', person)] = np.where(
        acs[('race', person)].isna(),
        np.NaN, 
        (acs[('race', person)]=='Black').astype('boolean')
    )

acs.xs('black', axis=1, drop_level=False)
black
person 1 2 3 4
household
37 0.0 0.0 1.0 NaN
241 0.0 NaN NaN NaN
242 0.0 NaN NaN NaN
377 0.0 NaN NaN NaN
418 0.0 NaN NaN NaN
... ... ... ... ...
1236119 0.0 0.0 NaN NaN
1236287 0.0 0.0 0.0 0.0
1236624 0.0 0.0 NaN NaN
1236756 0.0 0.0 NaN NaN
1236779 0.0 NaN NaN NaN

10565 rows × 4 columns

Exercise

Create an indicator variable called child for each person. Be sure it’s NaN for people who don’t exist.

for person in range(1,5):
    acs[('child', person)] = np.where(
        acs[('age', person)].isna(),
        np.NaN,
        (acs[('age', person)]<18).astype('boolean')
    )

acs.xs('child', axis=1, drop_level=False)
child
person 1 2 3 4
household
37 0.0 0.0 0.0 NaN
241 0.0 NaN NaN NaN
242 0.0 NaN NaN NaN
377 0.0 NaN NaN NaN
418 0.0 NaN NaN NaN
... ... ... ... ...
1236119 0.0 0.0 NaN NaN
1236287 0.0 0.0 0.0 1.0
1236624 0.0 0.0 NaN NaN
1236756 0.0 0.0 NaN NaN
1236779 0.0 NaN NaN NaN

10565 rows × 4 columns

6.2.3 Creating Level 2 Variables

Next we’ll learn how to make level 2 variables like household_income. In long form, you had to use groupby('household') to specify which rows should be aggregated. That’s not an issue in wide form, since each household has its own row. Instead you need to specify which columns should be aggregated. You can do that with xs(). For example, to calculate household_income you need all the income columns:

acs.xs('income', axis=1)
person 1 2 3 4
household
37 10000.0 5300.0 4700.0 NaN
241 32500.0 NaN NaN NaN
242 30000.0 NaN NaN NaN
377 51900.0 NaN NaN NaN
418 12200.0 NaN NaN NaN
... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN
1236287 15000.0 27000.0 11400.0 NaN
1236624 50100.0 12000.0 NaN NaN
1236756 69800.0 40800.0 NaN NaN
1236779 22110.0 NaN NaN NaN

10565 rows × 4 columns

So how do you add them up? With the sum() function as usual, but pass in axis=1 so it adds them up across columns instead of rows:

acs.xs('income', axis=1).sum(axis=1)
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Length: 10565, dtype: float64

In long form, we needed transform() to broadcast the result of sum() to all the rows in the household. That’s not necessary in wide form, since each household only has one row.

The trick is the column name. The columns in this dataset are identified by a MultiIndex–no exceptions. So even though household income is not a property of a person, the tuple that identifies the column must have two values. We’ll use ’‘, an empty string, to denote ’this is a level 2 variable.’ Thus, to create a variable for household income, run:

acs[('household_income', '')] = acs.xs('income', axis=1).sum(axis=1)
acs[('household_income', '')]
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: (household_income, ), Length: 10565, dtype: float64
Exercise

Create an indicator variable for ‘this household contains children.’

acs[('has_children', '')] = acs.xs('child', axis=1).max(axis=1)
acs[('has_children', '')]
household
37         0.0
241        0.0
242        0.0
377        0.0
418        0.0
          ... 
1236119    0.0
1236287    1.0
1236624    0.0
1236756    0.0
1236779    0.0
Name: (has_children, ), Length: 10565, dtype: float64

6.2.4 Subsetting With Aggregate Functions

So how do you calculate a level 2 variable based on a subset of the level 1 variables, like child_income? The same way as in long form: first create a level 1 variable that contains actual income for children and NaN for adults, then aggregate it. The difference is that in wide form creating a level 1 variable requires a loop over columns. It can still be easily done using the Series where() function, which takes a condition and then an other argument that specifies what the value should be changed to for rows that don’t meet the condition.

for person in range(1, 5):
    acs[('income_if_child', person)] = acs[('income', person)].where(
        acs[('age', person)]<18,
        other=np.NaN
    )

acs.xs('income_if_child', axis=1, drop_level=False).loc[8787]
                 person
income_if_child  1            NaN
                 2         3600.0
                 3          200.0
                 4            NaN
Name: 8787, dtype: float64

With an income_if_child variable in hand for each person, all you need to do is add them up:

acs[('child_income', '')] = acs.xs('income_if_child', axis=1).sum(axis=1)

acs.xs('child_income', axis=1, drop_level=False).loc[8787]
              person
child_income            3800.0
Name: 8787, dtype: float64
Exercise

Create a level 2 variable containing the mean age of the adults in each household.

for person in range(1, 5):
    acs[('age_if_adult', person)] = acs[('age', person)].where(
        acs[('age', person)]>=18,
        other=np.NaN
    )

acs[('mean_adult_age', '')] = acs.xs('age_if_adult', axis=1).mean(axis=1)
acs[('mean_adult_age', '')]
household
37         19.333333
241        50.000000
242        29.000000
377        69.000000
418        59.000000
             ...    
1236119    51.000000
1236287    35.333333
1236624    27.500000
1236756    59.500000
1236779    30.000000
Name: (mean_adult_age, ), Length: 10565, dtype: float64

6.3 Wide Form Data With Compound Variable Names

While a column MultiIndex has a certain elegance, in practice it’s more common to stick with strings for column identifiers. The string contains both levels of the index, such as age_1 (or age1) for ‘age of person 1’. But that structure is not enforced, so there’s no problem creating a variable like household_income.

The disadvantage is that it’s not as easy to specify subsets like ‘all the income variables.’ The good news is the tools you’ll learn for doing that also apply to data that’s stored as strings (i.e. text data).

Load the example data set acs_wide_strings as the new acs to see what this looks like in practice:

acs = pd.read_pickle('acs_wide_strings.pickle')
acs
age_1 age_2 age_3 age_4 race_1 race_2 race_3 race_4 income_1 income_2 income_3 income_4 female_1 female_2 female_3 female_4
household
37 20.0 19.0 19.0 NaN White White Black NaN 10000.0 5300.0 4700.0 NaN True True True NaN
241 50.0 NaN NaN NaN White NaN NaN NaN 32500.0 NaN NaN NaN True NaN NaN NaN
242 29.0 NaN NaN NaN White NaN NaN NaN 30000.0 NaN NaN NaN True NaN NaN NaN
377 69.0 NaN NaN NaN White NaN NaN NaN 51900.0 NaN NaN NaN True NaN NaN NaN
418 59.0 NaN NaN NaN White NaN NaN NaN 12200.0 NaN NaN NaN True NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN White White NaN NaN 62200.0 31000.0 NaN NaN False True NaN NaN
1236287 41.0 42.0 23.0 4.0 American Indian White White White 15000.0 27000.0 11400.0 NaN False True True False
1236624 29.0 26.0 NaN NaN White White NaN NaN 50100.0 12000.0 NaN NaN False True NaN NaN
1236756 58.0 61.0 NaN NaN White White NaN NaN 69800.0 40800.0 NaN NaN True False NaN NaN
1236779 30.0 NaN NaN NaN American Indian NaN NaN NaN 22110.0 NaN NaN NaN False NaN NaN NaN

10565 rows × 16 columns

6.3.1 Subsetting

Selecting a column or row and column now works just like before (no tuple required):

acs['age_1']
household
37         20.0
241        50.0
242        29.0
377        69.0
418        59.0
           ... 
1236119    51.0
1236287    41.0
1236624    29.0
1236756    58.0
1236779    30.0
Name: age_1, Length: 10565, dtype: float64
acs.loc[37, 'age_1']
20.0

Selecting all the columns for a variable or person gets more interesting. How can well tell Python to select all the variables that start with age or all the variables that end with 1? The answer is the DataFrame filter() function.

The filter() function allows you to subset based on patterns in the column names. A very simple pattern is like=. This will select all the columns that contain the string you pass in. Select all the income columns by passing in like='income':

acs.filter(like='income')
income_1 income_2 income_3 income_4
household
37 10000.0 5300.0 4700.0 NaN
241 32500.0 NaN NaN NaN
242 30000.0 NaN NaN NaN
377 51900.0 NaN NaN NaN
418 12200.0 NaN NaN NaN
... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN
1236287 15000.0 27000.0 11400.0 NaN
1236624 50100.0 12000.0 NaN NaN
1236756 69800.0 40800.0 NaN NaN
1236779 22110.0 NaN NaN NaN

10565 rows × 4 columns

This works fine for now, but if our data set had a household_income variable it would be included too.

You can select all the columns for person 1 by passing in `like=’_1’:

acs.filter(like='_1')
age_1 race_1 income_1 female_1
household
37 20.0 White 10000.0 True
241 50.0 White 32500.0 True
242 29.0 White 30000.0 True
377 69.0 White 51900.0 True
418 59.0 White 12200.0 True
... ... ... ... ...
1236119 51.0 White 62200.0 False
1236287 41.0 American Indian 15000.0 False
1236624 29.0 White 50100.0 False
1236756 58.0 White 69800.0 True
1236779 30.0 American Indian 22110.0 False

10565 rows × 4 columns

Again, this works fine for now. But remember the original data had 16 columns per variable. like='_1' would also match the columns for persons 10 through 16.

We need the ability to be much more specific about the columns to be selected. The solution is regular expressions. Regular expressions are a syntax for describing a pattern in text so that a computer can detect text that matches it. They’ve been around for more than 50 years and most programming languages and statistical packages implement them. They can be used for a variety of purposes; we won’t discuss working with text data (strings) but we’re laying the foundation by introducing you to regular expressions. Regular expressions can be extremely complex. We’ll introduce you to just a few of the most important features but you can find many, many tutorials online.

You use a regular expression with filter() by passing in regex=. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as like=:

acs.filter(regex='income')
income_1 income_2 income_3 income_4
household
37 10000.0 5300.0 4700.0 NaN
241 32500.0 NaN NaN NaN
242 30000.0 NaN NaN NaN
377 51900.0 NaN NaN NaN
418 12200.0 NaN NaN NaN
... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN
1236287 15000.0 27000.0 11400.0 NaN
1236624 50100.0 12000.0 NaN NaN
1236756 69800.0 40800.0 NaN NaN
1236779 22110.0 NaN NaN NaN

10565 rows × 4 columns

But a regular expression can do much more. Here are the tools we’ll use:

  • If a regular expression starts with ^, then the string only matches if it starts with the pattern.

  • If a regular expression ends with $, then the string only matches if it ends with the pattern.

  • \d matches any digit (number).

  • A * after something means the string can contain any number of that thing at that point. So \d* means the string can contain any number of digits at that point.

Putting this all together, the regular expression ^income_\d*$ will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (income_1, income_2, etc.) without also matching columns like household_income:

acs.filter(regex='^income_\d*$')
income_1 income_2 income_3 income_4
household
37 10000.0 5300.0 4700.0 NaN
241 32500.0 NaN NaN NaN
242 30000.0 NaN NaN NaN
377 51900.0 NaN NaN NaN
418 12200.0 NaN NaN NaN
... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN
1236287 15000.0 27000.0 11400.0 NaN
1236624 50100.0 12000.0 NaN NaN
1236756 69800.0 40800.0 NaN NaN
1236779 22110.0 NaN NaN NaN

10565 rows × 4 columns

The regular expression _1$ will match strings that end with ’_1’. This excludes variables that end with ’_10’, ’_11’, etc. and thus allows us to select just the columns describing person 1:

acs.filter(regex='_1$')
age_1 race_1 income_1 female_1
household
37 20.0 White 10000.0 True
241 50.0 White 32500.0 True
242 29.0 White 30000.0 True
377 69.0 White 51900.0 True
418 59.0 White 12200.0 True
... ... ... ... ...
1236119 51.0 White 62200.0 False
1236287 41.0 American Indian 15000.0 False
1236624 29.0 White 50100.0 False
1236756 58.0 White 69800.0 True
1236779 30.0 American Indian 22110.0 False

10565 rows × 4 columns

Exercise

Use a regular expression to select all the race columns. Then use one to select all the variables for person 4.

acs.filter(regex='^race_\d*$')
race_1 race_2 race_3 race_4
household
37 White White Black NaN
241 White NaN NaN NaN
242 White NaN NaN NaN
377 White NaN NaN NaN
418 White NaN NaN NaN
... ... ... ... ...
1236119 White White NaN NaN
1236287 American Indian White White White
1236624 White White NaN NaN
1236756 White White NaN NaN
1236779 American Indian NaN NaN NaN

10565 rows × 4 columns

acs.filter(regex='_4$')
age_4 race_4 income_4 female_4
household
37 NaN NaN NaN NaN
241 NaN NaN NaN NaN
242 NaN NaN NaN NaN
377 NaN NaN NaN NaN
418 NaN NaN NaN NaN
... ... ... ... ...
1236119 NaN NaN NaN NaN
1236287 4.0 White NaN False
1236624 NaN NaN NaN NaN
1236756 NaN NaN NaN NaN
1236779 NaN NaN NaN NaN

10565 rows × 4 columns

6.3.2 Creating Level 1 Variables

Creating level 1 variables calls for a for loop just like with a MultiIndex. Recall that adding strings concatenates them in Python, so the equivalent of the tuple ('race', person), where person is a numeric variable, is the string 'race_' + str(person). Thus the code to create indicator variables for ‘this person is Black’ becomes:

for person in range(1, 5):
    acs['black_' + str(person)] = np.where(
        acs['race_' +str(person)].isna(),
        np.NaN, 
        (acs['race_' + str(person)]=='Black').astype('boolean')
    )

acs.filter(regex='^black_\d*$')
black_1 black_2 black_3 black_4
household
37 0.0 0.0 1.0 NaN
241 0.0 NaN NaN NaN
242 0.0 NaN NaN NaN
377 0.0 NaN NaN NaN
418 0.0 NaN NaN NaN
... ... ... ... ...
1236119 0.0 0.0 NaN NaN
1236287 0.0 0.0 0.0 0.0
1236624 0.0 0.0 NaN NaN
1236756 0.0 0.0 NaN NaN
1236779 0.0 NaN NaN NaN

10565 rows × 4 columns

Exercise

Create an indicator variable for ‘this person is a child’.

for person in range(1, 5):
    acs['child_' + str(person)] = np.where(
        acs['age_' +str(person)].isna(),
        np.NaN, 
        (acs['age_' + str(person)]<18).astype('boolean')
    )

acs.filter(regex='^child_\d*$')
child_1 child_2 child_3 child_4
household
37 0.0 0.0 0.0 NaN
241 0.0 NaN NaN NaN
242 0.0 NaN NaN NaN
377 0.0 NaN NaN NaN
418 0.0 NaN NaN NaN
... ... ... ... ...
1236119 0.0 0.0 NaN NaN
1236287 0.0 0.0 0.0 1.0
1236624 0.0 0.0 NaN NaN
1236756 0.0 0.0 NaN NaN
1236779 0.0 NaN NaN NaN

10565 rows × 4 columns

6.3.3 Creating Level 2 Variables

Creating level 2 variables is just a matter of replacing the xs() we used previously with the appropriate filter(). To calculate the household income, use:

acs['household_income'] = acs.filter(regex='^income_\d*$').sum(axis=1)

acs.filter(like='income')
income_1 income_2 income_3 income_4 household_income
household
37 10000.0 5300.0 4700.0 NaN 20000.0
241 32500.0 NaN NaN NaN 32500.0
242 30000.0 NaN NaN NaN 30000.0
377 51900.0 NaN NaN NaN 51900.0
418 12200.0 NaN NaN NaN 12200.0
... ... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN 93200.0
1236287 15000.0 27000.0 11400.0 NaN 53400.0
1236624 50100.0 12000.0 NaN NaN 62100.0
1236756 69800.0 40800.0 NaN NaN 110600.0
1236779 22110.0 NaN NaN NaN 22110.0

10565 rows × 5 columns

Exercise

Create an indicator for ‘this household consists entirely of children’.

acs['all_children'] = acs.filter(regex='^child_\d*$').min(axis=1)

acs.filter(like='child').loc[73731]
child_1         1.0
child_2         1.0
child_3         NaN
child_4         NaN
all_children    1.0
Name: 73731, dtype: float64

6.3.4 Subsetting With Aggregate Functions

Finally, subsetting with an aggregate function is again a combination of creating level 1 variables containing just the values of interest with a loop and the aggregating them.

for person in range(1, 5):
    acs['income_if_child_' + str(person)] = acs['income_' + str(person)].where(
        acs['age_' + str(person)]<18,
        other=np.NaN
    )

acs.filter(regex='^income_if_child_\d*$').loc[8787]
income_if_child_1       NaN
income_if_child_2    3600.0
income_if_child_3     200.0
income_if_child_4       NaN
Name: 8787, dtype: float64
acs['child_income'] = acs.filter(regex='^income_if_child_\d*$').sum(axis=1)

acs.filter(like='income').loc[8787]
income_1             38000.0
income_2              3600.0
income_3               200.0
income_4                 NaN
household_income     41800.0
income_if_child_1        NaN
income_if_child_2     3600.0
income_if_child_3      200.0
income_if_child_4        NaN
child_income          3800.0
Name: 8787, dtype: float64
Exercise

Create a level 2 variable containing the mean age of the adults in each household.

for person in range(1, 5):
    acs['age_if_adult_' + str(person)] = acs['age_' + str(person)].where(
        acs['age_' + str(person)]>=18,
        other=np.NaN
    )

acs['mean_adult_age'] = acs.filter(regex='^age_if_adult_\d*$').mean(axis=1)
acs.filter(like='age')
age_1 age_2 age_3 age_4 age_if_adult_1 age_if_adult_2 age_if_adult_3 age_if_adult_4 mean_adult_age
household
37 20.0 19.0 19.0 NaN 20.0 19.0 19.0 NaN 19.333333
241 50.0 NaN NaN NaN 50.0 NaN NaN NaN 50.000000
242 29.0 NaN NaN NaN 29.0 NaN NaN NaN 29.000000
377 69.0 NaN NaN NaN 69.0 NaN NaN NaN 69.000000
418 59.0 NaN NaN NaN 59.0 NaN NaN NaN 59.000000
... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN 51.0 51.0 NaN NaN 51.000000
1236287 41.0 42.0 23.0 4.0 41.0 42.0 23.0 NaN 35.333333
1236624 29.0 26.0 NaN NaN 29.0 26.0 NaN NaN 27.500000
1236756 58.0 61.0 NaN NaN 58.0 61.0 NaN NaN 59.500000
1236779 30.0 NaN NaN NaN 30.0 NaN NaN NaN 30.000000

10565 rows × 9 columns