Wide form data almost always looks uglier than long form, and some tasks are always harder to do in wide form than in long form. Others can be done just as easily in wide form if you get comfortable working across columns rather than down rows. Some Python functions work exactly the same either way.
The Pythonic way to work with wide form data is with a column MultiIndex that reflects the structure of the data, just like we used a row MultiIndex in long form. However, column MultiIndexes are unique to Python and don’t have a great way to handle level 2 variables. So you’ll more commonly see wide form data where the variable name reflects both levels of the hierarchy, like age_1 for ‘age of person 1’. We’ll demonstrate both methods.
This chapter is meant to be read after Working with Hierarchical Data in Long Form. That chapter introduces the core concepts of hierarchical data (in any form), contains more example tasks, and explains them in more detail. This chapter mostly translates what you learned in the previous chapter to wide form.
6.1 Setting Up
Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Wide_Practice.ipynb. Have it import Pandas and NumPy, then load acs_wide_mi.pickle as acs:
import pandas as pdimport numpy as npacs = pd.read_pickle('acs_wide_mi.pickle')acs
age
race
income
female
person
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
household
37
20.0
19.0
19.0
NaN
White
White
Black
NaN
10000.0
5300.0
4700.0
NaN
True
True
True
NaN
241
50.0
NaN
NaN
NaN
White
NaN
NaN
NaN
32500.0
NaN
NaN
NaN
True
NaN
NaN
NaN
242
29.0
NaN
NaN
NaN
White
NaN
NaN
NaN
30000.0
NaN
NaN
NaN
True
NaN
NaN
NaN
377
69.0
NaN
NaN
NaN
White
NaN
NaN
NaN
51900.0
NaN
NaN
NaN
True
NaN
NaN
NaN
418
59.0
NaN
NaN
NaN
White
NaN
NaN
NaN
12200.0
NaN
NaN
NaN
True
NaN
NaN
NaN
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1236119
51.0
51.0
NaN
NaN
White
White
NaN
NaN
62200.0
31000.0
NaN
NaN
False
True
NaN
NaN
1236287
41.0
42.0
23.0
4.0
American Indian
White
White
White
15000.0
27000.0
11400.0
NaN
False
True
True
False
1236624
29.0
26.0
NaN
NaN
White
White
NaN
NaN
50100.0
12000.0
NaN
NaN
False
True
NaN
NaN
1236756
58.0
61.0
NaN
NaN
White
White
NaN
NaN
69800.0
40800.0
NaN
NaN
True
False
NaN
NaN
1236779
30.0
NaN
NaN
NaN
American Indian
NaN
NaN
NaN
22110.0
NaN
NaN
NaN
False
NaN
NaN
NaN
10565 rows × 16 columns
This data set is a subset of the ACS data you’ve been working with. It contains fewer variables and only the first four people in each household. However, it is in wide form, meaning that there’s one row per level 2 unit (household) and one column per variable per level one unit (person). That makes it quite wide (which is why it’s a subset of the original).
The number of columns is determined by the number of people in the largest household. Household 1236287, for example, has four people in it and thus four columns are needed to store all their information. (The full data set has a household with 16 people in it, so it requires 16 columns per variable.) However, unlike long form where different households can have different numbers of rows, in wide form every household must have the same number of columns. Households with fewer people will have NaN in the columns for people that don’t exist in their household. For example, household 37 has three people, so the columns for person 4 are always NaN.
6.2 Wide Form With a Column MultiIndex
In order to pick out a particular number from this data set, I need a household number, a person number, and a variable. For example, 20 is the age of person 1 in household 37. In the long form, household and person identify a row, and a variable identifies a column. A row MultIndex allowed Python to understand the relationship between household and person.
In this version of wide format, household identifies a row and a combination of a variable and person identify a column. The column MultiIndex allows Python to understand the relationship between variable and person. Note that we have not officially named the first level of the MultiIndex variable (we could) so we’ll just refer to it as level 0. The second level is named person, or level 1.
6.2.1 Subsetting
Up to this point, all of our column indexes have been simple strings like age or income. Now, to identify a column you need to identify both the variable and the person of interest. This is done as a tuple, for example ('age', 1) for ‘age of person 1.’ Tuples like this can be used as column names in square brackets just like the strings we’ve used before:
They can also be used along with a row index in loc to pick out a single number:
acs.loc[37, ('age', 1)]
20.0
Note how the household number 37 now identifies a single row, since in wide form there’s just one observation per household.
The xs() function gives you more flexibility. To select all the age variables, use:
acs.xs('age', axis=1)
person
1
2
3
4
household
37
20.0
19.0
19.0
NaN
241
50.0
NaN
NaN
NaN
242
29.0
NaN
NaN
NaN
377
69.0
NaN
NaN
NaN
418
59.0
NaN
NaN
NaN
...
...
...
...
...
1236119
51.0
51.0
NaN
NaN
1236287
41.0
42.0
23.0
4.0
1236624
29.0
26.0
NaN
NaN
1236756
58.0
61.0
NaN
NaN
1236779
30.0
NaN
NaN
NaN
10565 rows × 4 columns
The argument axis=1 clarifies that you’re talking about the column indexes. (The rows are axis 0.) You’ll be using axis=1 a lot in this chapter. xs() will assume you’re specifying level 0 (variable) unless you say otherwise.
Since all the columns are age, by default xs() drops that level from the results. You can tell it not to with drop_level=False:
acs.xs('age', axis=1, drop_level=False)
age
person
1
2
3
4
household
37
20.0
19.0
19.0
NaN
241
50.0
NaN
NaN
NaN
242
29.0
NaN
NaN
NaN
377
69.0
NaN
NaN
NaN
418
59.0
NaN
NaN
NaN
...
...
...
...
...
1236119
51.0
51.0
NaN
NaN
1236287
41.0
42.0
23.0
4.0
1236624
29.0
26.0
NaN
NaN
1236756
58.0
61.0
NaN
NaN
1236779
30.0
NaN
NaN
NaN
10565 rows × 4 columns
You can also use xs() to select all the variables for a given person. In doing so, you need to specify that you’re selecting based on level 1 of the MultiIndex, which you can do by number or by name (person):
acs.xs(1, level=1, axis=1)
age
race
income
female
household
37
20.0
White
10000.0
True
241
50.0
White
32500.0
True
242
29.0
White
30000.0
True
377
69.0
White
51900.0
True
418
59.0
White
12200.0
True
...
...
...
...
...
1236119
51.0
White
62200.0
False
1236287
41.0
American Indian
15000.0
False
1236624
29.0
White
50100.0
False
1236756
58.0
White
69800.0
True
1236779
30.0
American Indian
22110.0
False
10565 rows × 4 columns
This looks a lot like the long form data, but there are far fewer rows because only the first person in each household is included. If we had specified drop_level=False we’d be reminded of that.
6.2.2 Creating Level 1 Variables
In long form, creating level 1 variables is straightforward. For example, to create an indicator variable for ‘this person is Black’ you’d run something like:
acs['black'] = (acs['race']=='Black')
This acts on all the rows in the data set automatically. Since in long form the people in a household are rows, it thus takes care of all the people in a household automatically. The simple wide-form equivalent:
acs[('black', 1)] = (acs[('race', 1)]=='Black')
would only take care of person 1 in each household. Clearly this is a job for a for loop. But there’s another wrinkle: we now have lots of missing values for race, because that’s how wide form tells us a person doesn’t exist.
Let’s start by creating a single column properly, including the missing values. Creating column 4 will make it easy to see if we’ve succeeded. Recall that the np.where() function takes a condition, the result if the condition is true, and the result if the condition is false. That makes it easy to check if ('race', 4) is missing, set the new column to np.NaN if it is, and set it to the result of ('race', 4)=='Black' if it is not. Just remember that we must specify that the result is boolean with astype('boolean') or NumPy will return 1/0 instead of True/False.
The result is NaN for all the rows we can see except 1236287, where it is correctly False, so now we’re ready to wrap that in a for loop. Before you do, drop ('black', 4) just because if you don’t it will be out of order.
acs = acs.drop(columns=('black', 4))
In order to make the code that created ('black', 4) do a different column, what needs to change is the person number in the tuples ('black', 4) and ('race', 4). That tells us we need to loop over person numbers. We can get the numbers 1 through 4 with range(1, 5) since range() leaves off the endpoint.
for person inrange(1, 5): acs[('black', person)] = np.where( acs[('race', person)].isna(), np.NaN, (acs[('race', person)]=='Black').astype('boolean') )acs.xs('black', axis=1, drop_level=False)
black
person
1
2
3
4
household
37
0.0
0.0
1.0
NaN
241
0.0
NaN
NaN
NaN
242
0.0
NaN
NaN
NaN
377
0.0
NaN
NaN
NaN
418
0.0
NaN
NaN
NaN
...
...
...
...
...
1236119
0.0
0.0
NaN
NaN
1236287
0.0
0.0
0.0
0.0
1236624
0.0
0.0
NaN
NaN
1236756
0.0
0.0
NaN
NaN
1236779
0.0
NaN
NaN
NaN
10565 rows × 4 columns
Exercise
Create an indicator variable called child for each person. Be sure it’s NaN for people who don’t exist.
Solution
for person inrange(1,5): acs[('child', person)] = np.where( acs[('age', person)].isna(), np.NaN, (acs[('age', person)]<18).astype('boolean') )acs.xs('child', axis=1, drop_level=False)
child
person
1
2
3
4
household
37
0.0
0.0
0.0
NaN
241
0.0
NaN
NaN
NaN
242
0.0
NaN
NaN
NaN
377
0.0
NaN
NaN
NaN
418
0.0
NaN
NaN
NaN
...
...
...
...
...
1236119
0.0
0.0
NaN
NaN
1236287
0.0
0.0
0.0
1.0
1236624
0.0
0.0
NaN
NaN
1236756
0.0
0.0
NaN
NaN
1236779
0.0
NaN
NaN
NaN
10565 rows × 4 columns
6.2.3 Creating Level 2 Variables
Next we’ll learn how to make level 2 variables like household_income. In long form, you had to use groupby('household') to specify which rows should be aggregated. That’s not an issue in wide form, since each household has its own row. Instead you need to specify which columns should be aggregated. You can do that with xs(). For example, to calculate household_income you need all the income columns:
acs.xs('income', axis=1)
person
1
2
3
4
household
37
10000.0
5300.0
4700.0
NaN
241
32500.0
NaN
NaN
NaN
242
30000.0
NaN
NaN
NaN
377
51900.0
NaN
NaN
NaN
418
12200.0
NaN
NaN
NaN
...
...
...
...
...
1236119
62200.0
31000.0
NaN
NaN
1236287
15000.0
27000.0
11400.0
NaN
1236624
50100.0
12000.0
NaN
NaN
1236756
69800.0
40800.0
NaN
NaN
1236779
22110.0
NaN
NaN
NaN
10565 rows × 4 columns
So how do you add them up? With the sum() function as usual, but pass in axis=1 so it adds them up across columns instead of rows:
In long form, we needed transform() to broadcast the result of sum() to all the rows in the household. That’s not necessary in wide form, since each household only has one row.
The trick is the column name. The columns in this dataset are identified by a MultiIndex–no exceptions. So even though household income is not a property of a person, the tuple that identifies the column must have two values. We’ll use ’‘, an empty string, to denote ’this is a level 2 variable.’ Thus, to create a variable for household income, run:
So how do you calculate a level 2 variable based on a subset of the level 1 variables, like child_income? The same way as in long form: first create a level 1 variable that contains actual income for children and NaN for adults, then aggregate it. The difference is that in wide form creating a level 1 variable requires a loop over columns. It can still be easily done using the Series where() function, which takes a condition and then an other argument that specifies what the value should be changed to for rows that don’t meet the condition.
for person inrange(1, 5): acs[('income_if_child', person)] = acs[('income', person)].where( acs[('age', person)]<18, other=np.NaN )acs.xs('income_if_child', axis=1, drop_level=False).loc[8787]
person
income_if_child 1 NaN
2 3600.0
3 200.0
4 NaN
Name: 8787, dtype: float64
With an income_if_child variable in hand for each person, all you need to do is add them up:
While a column MultiIndex has a certain elegance, in practice it’s more common to stick with strings for column identifiers. The string contains both levels of the index, such as age_1 (or age1) for ‘age of person 1’. But that structure is not enforced, so there’s no problem creating a variable like household_income.
The disadvantage is that it’s not as easy to specify subsets like ‘all the income variables.’ The good news is the tools you’ll learn for doing that also apply to data that’s stored as strings (i.e. text data).
Load the example data set acs_wide_strings as the new acs to see what this looks like in practice:
Selecting all the columns for a variable or person gets more interesting. How can well tell Python to select all the variables that start with age or all the variables that end with 1? The answer is the DataFrame filter() function.
The filter() function allows you to subset based on patterns in the column names. A very simple pattern is like=. This will select all the columns that contain the string you pass in. Select all the income columns by passing in like='income':
acs.filter(like='income')
income_1
income_2
income_3
income_4
household
37
10000.0
5300.0
4700.0
NaN
241
32500.0
NaN
NaN
NaN
242
30000.0
NaN
NaN
NaN
377
51900.0
NaN
NaN
NaN
418
12200.0
NaN
NaN
NaN
...
...
...
...
...
1236119
62200.0
31000.0
NaN
NaN
1236287
15000.0
27000.0
11400.0
NaN
1236624
50100.0
12000.0
NaN
NaN
1236756
69800.0
40800.0
NaN
NaN
1236779
22110.0
NaN
NaN
NaN
10565 rows × 4 columns
This works fine for now, but if our data set had a household_income variable it would be included too.
You can select all the columns for person 1 by passing in `like=’_1’:
acs.filter(like='_1')
age_1
race_1
income_1
female_1
household
37
20.0
White
10000.0
True
241
50.0
White
32500.0
True
242
29.0
White
30000.0
True
377
69.0
White
51900.0
True
418
59.0
White
12200.0
True
...
...
...
...
...
1236119
51.0
White
62200.0
False
1236287
41.0
American Indian
15000.0
False
1236624
29.0
White
50100.0
False
1236756
58.0
White
69800.0
True
1236779
30.0
American Indian
22110.0
False
10565 rows × 4 columns
Again, this works fine for now. But remember the original data had 16 columns per variable. like='_1' would also match the columns for persons 10 through 16.
We need the ability to be much more specific about the columns to be selected. The solution is regular expressions. Regular expressions are a syntax for describing a pattern in text so that a computer can detect text that matches it. They’ve been around for more than 50 years and most programming languages and statistical packages implement them. They can be used for a variety of purposes; we won’t discuss working with text data (strings) but we’re laying the foundation by introducing you to regular expressions. Regular expressions can be extremely complex. We’ll introduce you to just a few of the most important features but you can find many, many tutorials online.
You use a regular expression with filter() by passing in regex=. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as like=:
acs.filter(regex='income')
income_1
income_2
income_3
income_4
household
37
10000.0
5300.0
4700.0
NaN
241
32500.0
NaN
NaN
NaN
242
30000.0
NaN
NaN
NaN
377
51900.0
NaN
NaN
NaN
418
12200.0
NaN
NaN
NaN
...
...
...
...
...
1236119
62200.0
31000.0
NaN
NaN
1236287
15000.0
27000.0
11400.0
NaN
1236624
50100.0
12000.0
NaN
NaN
1236756
69800.0
40800.0
NaN
NaN
1236779
22110.0
NaN
NaN
NaN
10565 rows × 4 columns
But a regular expression can do much more. Here are the tools we’ll use:
If a regular expression starts with ^, then the string only matches if it starts with the pattern.
If a regular expression ends with $, then the string only matches if it ends with the pattern.
\d matches any digit (number).
A * after something means the string can contain any number of that thing at that point. So \d* means the string can contain any number of digits at that point.
Putting this all together, the regular expression ^income_\d*$ will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (income_1, income_2, etc.) without also matching columns like household_income:
acs.filter(regex='^income_\d*$')
income_1
income_2
income_3
income_4
household
37
10000.0
5300.0
4700.0
NaN
241
32500.0
NaN
NaN
NaN
242
30000.0
NaN
NaN
NaN
377
51900.0
NaN
NaN
NaN
418
12200.0
NaN
NaN
NaN
...
...
...
...
...
1236119
62200.0
31000.0
NaN
NaN
1236287
15000.0
27000.0
11400.0
NaN
1236624
50100.0
12000.0
NaN
NaN
1236756
69800.0
40800.0
NaN
NaN
1236779
22110.0
NaN
NaN
NaN
10565 rows × 4 columns
The regular expression _1$ will match strings that end with ’_1’. This excludes variables that end with ’_10’, ’_11’, etc. and thus allows us to select just the columns describing person 1:
acs.filter(regex='_1$')
age_1
race_1
income_1
female_1
household
37
20.0
White
10000.0
True
241
50.0
White
32500.0
True
242
29.0
White
30000.0
True
377
69.0
White
51900.0
True
418
59.0
White
12200.0
True
...
...
...
...
...
1236119
51.0
White
62200.0
False
1236287
41.0
American Indian
15000.0
False
1236624
29.0
White
50100.0
False
1236756
58.0
White
69800.0
True
1236779
30.0
American Indian
22110.0
False
10565 rows × 4 columns
Exercise
Use a regular expression to select all the race columns. Then use one to select all the variables for person 4.
Solution
acs.filter(regex='^race_\d*$')
race_1
race_2
race_3
race_4
household
37
White
White
Black
NaN
241
White
NaN
NaN
NaN
242
White
NaN
NaN
NaN
377
White
NaN
NaN
NaN
418
White
NaN
NaN
NaN
...
...
...
...
...
1236119
White
White
NaN
NaN
1236287
American Indian
White
White
White
1236624
White
White
NaN
NaN
1236756
White
White
NaN
NaN
1236779
American Indian
NaN
NaN
NaN
10565 rows × 4 columns
acs.filter(regex='_4$')
age_4
race_4
income_4
female_4
household
37
NaN
NaN
NaN
NaN
241
NaN
NaN
NaN
NaN
242
NaN
NaN
NaN
NaN
377
NaN
NaN
NaN
NaN
418
NaN
NaN
NaN
NaN
...
...
...
...
...
1236119
NaN
NaN
NaN
NaN
1236287
4.0
White
NaN
False
1236624
NaN
NaN
NaN
NaN
1236756
NaN
NaN
NaN
NaN
1236779
NaN
NaN
NaN
NaN
10565 rows × 4 columns
6.3.2 Creating Level 1 Variables
Creating level 1 variables calls for a for loop just like with a MultiIndex. Recall that adding strings concatenates them in Python, so the equivalent of the tuple ('race', person), where person is a numeric variable, is the string 'race_' + str(person). Thus the code to create indicator variables for ‘this person is Black’ becomes:
for person inrange(1, 5): acs['black_'+str(person)] = np.where( acs['race_'+str(person)].isna(), np.NaN, (acs['race_'+str(person)]=='Black').astype('boolean') )acs.filter(regex='^black_\d*$')
black_1
black_2
black_3
black_4
household
37
0.0
0.0
1.0
NaN
241
0.0
NaN
NaN
NaN
242
0.0
NaN
NaN
NaN
377
0.0
NaN
NaN
NaN
418
0.0
NaN
NaN
NaN
...
...
...
...
...
1236119
0.0
0.0
NaN
NaN
1236287
0.0
0.0
0.0
0.0
1236624
0.0
0.0
NaN
NaN
1236756
0.0
0.0
NaN
NaN
1236779
0.0
NaN
NaN
NaN
10565 rows × 4 columns
Exercise
Create an indicator variable for ‘this person is a child’.
Solution
for person inrange(1, 5): acs['child_'+str(person)] = np.where( acs['age_'+str(person)].isna(), np.NaN, (acs['age_'+str(person)]<18).astype('boolean') )acs.filter(regex='^child_\d*$')
child_1
child_2
child_3
child_4
household
37
0.0
0.0
0.0
NaN
241
0.0
NaN
NaN
NaN
242
0.0
NaN
NaN
NaN
377
0.0
NaN
NaN
NaN
418
0.0
NaN
NaN
NaN
...
...
...
...
...
1236119
0.0
0.0
NaN
NaN
1236287
0.0
0.0
0.0
1.0
1236624
0.0
0.0
NaN
NaN
1236756
0.0
0.0
NaN
NaN
1236779
0.0
NaN
NaN
NaN
10565 rows × 4 columns
6.3.3 Creating Level 2 Variables
Creating level 2 variables is just a matter of replacing the xs() we used previously with the appropriate filter(). To calculate the household income, use:
child_1 1.0
child_2 1.0
child_3 NaN
child_4 NaN
all_children 1.0
Name: 73731, dtype: float64
6.3.4 Subsetting With Aggregate Functions
Finally, subsetting with an aggregate function is again a combination of creating level 1 variables containing just the values of interest with a loop and the aggregating them.
for person inrange(1, 5): acs['income_if_child_'+str(person)] = acs['income_'+str(person)].where( acs['age_'+str(person)]<18, other=np.NaN )acs.filter(regex='^income_if_child_\d*$').loc[8787]
income_if_child_1 NaN
income_if_child_2 3600.0
income_if_child_3 200.0
income_if_child_4 NaN
Name: 8787, dtype: float64