8 Restructuring Data Sets

In this chapter we’ll learn about two of the most common ways you might need to change the structure of a data set: aggregating observations, and switching between long form and wide form.

8.1 Aggregating Observations

Sometimes you’ll get a hierarchical data set with level one and level two units, but all you care about is the level two units. For example, you may need one observation per household and just household-level variables from the ACS. But those household-level variables are likely to be aggregates of the level one variables, like household_income. More broadly, there are many situations where you may want to combine observations that belong to the same group. Pandas has functions that makes this easy to do.

8.1.1 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Restructuring_Data_Practice.ipynb. Have it import Pandas and NumPy, and load acs.pickle. Then create a household_income variable so we’ll have a household-level variable to practice with.

import pandas as pd
import numpy as np
acs = pd.read_pickle('acs.pickle')
acs['household_income'] = acs['income'].groupby('household').transform('sum')
acs

		age	race	marital_status	edu	income	female	hispanic	household_income
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0
...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0

27410 rows × 8 columns

8.1.2 Aggregating With A Subset

If the data set you need consists of level two units and level two variables that already exist, all you need to do is create a subset selecting the level two variable(s) and just one row per level two unit. Since level two variables are always the same for all level one units in the same level two unit, it doesn’t matter which row you choose. But it’s easiest to choose the first row in each group since every group will have a first row. You can do that with xs():

acs['household_income'].xs(1, level='person')

household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: household_income, Length: 10565, dtype: float64

Exercise

Create a household-level variable containing the mean age of the children in the household. Then create a data set that contains one row per household and the mean age of the children. (Hint: to find the mean age of the children, first create a column containing the ages of just the children, with NaN for the adults.)

Solution

age_if_child = acs['age'].where(acs['age']<18, other=np.NaN)
acs['mean_child_age'] = age_if_child.groupby('household').transform('mean')
acs_households = acs.xs(1, level=1)['mean_child_age']
acs_households.loc[8787]

15.5

Compare with:

acs.loc[8787]

	age	race	marital_status	edu	income	female	hispanic	household_income	mean_child_age
person
1	45	White	Divorced	Some college, >=1 year	38000.0	True	False	41800.0	15.5
2	16	White	Never married	9th grade	3600.0	False	False	41800.0	15.5
3	15	White	Never married	7th-8th grade	200.0	True	False	41800.0	15.5

8.1.3 Aggregating With An Aggregate Function

If the data set you need consists of level two units and either one level two variable or multiple level two variables that can all be created with the same aggregate function, just apply the aggregate function to the variables you need. For example, you can create a data set of household incomes directly with:

acs['income'].groupby('household').sum()

household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: income, Length: 10565, dtype: float64

Create a data set containing the proportion of the household that is female and the proportion of the household that is hispanic with:

acs[['female', 'hispanic']].groupby('household').mean()

	female	hispanic
household
37	1.0	0.0
241	1.0	0.0
242	1.0	0.0
377	1.0	0.0
418	1.0	0.0
...	...	...
1236119	0.5	0.0
1236287	0.5	0.0
1236624	0.5	0.0
1236756	0.5	0.0
1236779	0.0	0.0

10565 rows × 2 columns

The apply() function allows you to aggregate using a lambda function, so you can create your own aggregate functions. Create a data set containing households and their income per person with:

acs['income'].groupby('household').apply(lambda x: x.sum()/x.size)

household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...     
1236119    46600.000000
1236287    13350.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64

Exercise

Create a data set of households containing the education level of the most educated person in the household. Recall that edu is an ordered categorical variable, so you can apply max() to it.

Solution

acs['edu'].groupby('household').max()

household
37         Some college, >=1 year
241               Master's degree
242             Bachelor's degree
377                          None
418        12th grade, no diploma
                    ...          
1236119         Bachelor's degree
1236287     Some college, <1 year
1236624    Some college, >=1 year
1236756           Master's degree
1236779      High School graduate
Name: edu, Length: 10565, dtype: category
Categories (17, object): ['Not in universe' < 'None' < 'Nursery school-4th grade' < '5th-6th grade' ... 'Bachelor's degree' < 'Master's degree' < 'Professional degree' < 'Doctorate degree']

8.1.4 Aggregating Using Named Aggregates

If you need to create a data set of level two units and level two variables that must be made using different aggregate functions, you can do so using named aggregates. A named aggregate allows you to specify the column to be aggregated, the aggregate function to use, and what the result should be called. This gives you complete control over the DataFrame to be created.

The Pandas NamedAgg() function takes two arguments: the column to act on and the aggregate function to use. To use them, pass them as key word arguments to the agg() function, with the name being the column to create in the resulting data set.

For example, suppose we need a data set of households with the variables ‘household income’, ‘number of people in the household’, and ‘proportion of the household that is female.’ We already have household_income as a level two variable, so we can just take its first value with the aggregate function first–but since household income will be the only income we care about in the data set of households, let’s just call it income. Number of people can be obtained from the aggregate function size, but remember it needs a single column, any column, to act on. And proportion female is just the mean of female. Thus the code is:

acs.groupby('household').agg(
    income=pd.NamedAgg('household_income', 'first'),
    size=pd.NamedAgg('income', 'size'),
    prop_female=pd.NamedAgg('female', 'mean')
)

	income	size	prop_female
household
37	20000.0	3	1.0
241	32500.0	1	1.0
242	30000.0	1	1.0
377	51900.0	1	1.0
418	12200.0	1	1.0
...	...	...	...
1236119	93200.0	2	0.5
1236287	53400.0	4	0.5
1236624	62100.0	2	0.5
1236756	110600.0	2	0.5
1236779	22110.0	1	0.0

10565 rows × 3 columns

The naming of the new columns is a rare instance where column names do not go in quotes. That’s because they are key word arguments for the function agg().

Exercise

Create a data set with one row per household and variables for ‘at least one member of this household is Hispanic’ and ‘all the members of this household are Hispanic.’

Solution

acs.groupby('household').agg(
    at_least_one_hispanic = pd.NamedAgg('hispanic', 'max'),
    all_hispanic = pd.NamedAgg('hispanic', 'min')
).loc[1214344]

at_least_one_hispanic     True
all_hispanic             False
Name: 1214344, dtype: bool

Compare with:

acs.loc[1214344]

	age	race	marital_status	edu	income	female	hispanic	household_income	mean_child_age
person
1	41	White	Divorced	High School graduate	28600.0	True	False	28600.0	13.0
2	13	Black	NaN	7th-8th grade	NaN	False	True	28600.0	13.0

8.2 Switching Between Long and Wide Form

Unlike aggregating data sets, switching between long and wide form keeps all of the level one data–no data are lost, they’re just organized differently. Doing so is extremely easy if your data set only contains level 1 variables, but not still too bad if you have to handle level 2 variables as well.

Reload the acs.pickle data set as acs:

acs = pd.read_pickle('acs.pickle')
acs

		age	race	marital_status	edu	income	female	hispanic
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False
241	1	50	White	Never married	Master's degree	32500.0	True	False
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False
...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False
1236624	2	26	White	Now married	High School graduate	12000.0	True	False
1236756	1	58	White	Now married	Master's degree	69800.0	True	False
1236756	2	61	White	Now married	Master's degree	40800.0	False	False
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False

27410 rows × 7 columns

8.2.1 Stack and Unstack

This data set is in the long form, so one way to describe it would be to say that people in the same household are stacked on top of each other. So to convert it to the wide form, all you need to do is unstack them:

acs_wide = acs.unstack()
acs_wide

	age										...	hispanic
person	1	2	3	4	5	6	7	8	9	10	...	7	8	9	10	11	12	13	14	15	16
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 112 columns

The result is the same dataset in wide form. Note how the MultiIndex for rows in long form was converted to a MultiIndex for columns in wide form.

To go back to long form, just stack it again:

acs_wide.stack()

		age	race	marital_status	edu	income	female	hispanic
household	person
37	1	20.0	White	Never married	Some college, >=1 year	10000.0	True	False
	2	19.0	White	Never married	Some college, >=1 year	5300.0	True	False
	3	19.0	Black	Never married	Some college, >=1 year	4700.0	True	False
241	1	50.0	White	Never married	Master's degree	32500.0	True	False
242	1	29.0	White	Never married	Bachelor's degree	30000.0	True	False
...	...	...	...	...	...	...	...	...
1236624	1	29.0	White	Now married	Some college, >=1 year	50100.0	False	False
1236624	2	26.0	White	Now married	High School graduate	12000.0	True	False
1236756	1	58.0	White	Now married	Master's degree	69800.0	True	False
1236756	2	61.0	White	Now married	Master's degree	40800.0	False	False
1236779	1	30.0	American Indian	Divorced	High School graduate	22110.0	False	False

27410 rows × 7 columns

You can now apply everything you learned about working with data in wide form with a column MultiIndex.

8.2.2 The Problem of Level 2 Variables

If you’re familiar with Stata’s reshape command or similar tools in R, you may be thinking “Wow, why can’t reshape be as easy as stack() and unstack()?” One reason is that you don’t have to tell stack() and unstack() about the structure of the data–they can get it from the MultiIndex. But the other is that stack() and unstack() don’t and can’t distinguish between level 1 and level 2 variables.

Let’s add a couple of level two variables to the mix, household_income and household_size:

acs['household_income'] = acs['income'].groupby('household').transform('sum')
acs['household_size'] = acs['age'].groupby('household').transform('size')
acs

		age	race	marital_status	edu	income	female	hispanic	household_income	household_size
household	person
37	1	20	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0	3
	2	19	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0	3
	3	19	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0	3
241	1	50	White	Never married	Master's degree	32500.0	True	False	32500.0	1
242	1	29	White	Never married	Bachelor's degree	30000.0	True	False	30000.0	1
...	...	...	...	...	...	...	...	...	...	...
1236624	1	29	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0	2
1236624	2	26	White	Now married	High School graduate	12000.0	True	False	62100.0	2
1236756	1	58	White	Now married	Master's degree	69800.0	True	False	110600.0	2
1236756	2	61	White	Now married	Master's degree	40800.0	False	False	110600.0	2
1236779	1	30	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0	1

27410 rows × 9 columns

Now see what happens to household_income when you unstack() the DataFrame:

acs.unstack()['household_income']

person	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
household
37	20000.0	20000.0	20000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	32500.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	30000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	51900.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	12200.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	93200.0	93200.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	53400.0	53400.0	53400.0	53400.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	62100.0	62100.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	110600.0	110600.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	22110.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 16 columns

unstack() does not know or care that household_income is a level two variable. It went ahead and created 16 copies of it and copied over the values on each row. The Pandas package has several other functions for reshaping data (pivot(), melt() etc.) but none of them handle level 2 variables properly.

8.2.3 Long to Wide

The solution is to break the problem up into pieces that Pandas tools handle very well. Specifically:

Break the data set into one DataFrame containing just the level 1 variables and one containing just the level 2 variables.
Reshape the level 1 variables to one observation per level 2 unit using unstack() like in the previous section.
Reshape the level 2 variables to one observation per level 2 unit by subsetting.
Merge the results

You’ve learned how to work with wide form data with both column MultiIndexes and compound strings as variable names, so we’ll learn how to create both. The process is the same right up to the final step.

We’ll need to refer to a list of level 2 variables several times during this process, so it will be convenient to create it ahead of time:

level2_vars = ['household_income', 'household_size']

With that in hand, create the data set of level 1 variables by dropping the variables in level2_vars, and the data set of level 2 variables by selecting them. Avoid any confusion about views and copies by explicitly making copies.

level1 = acs.drop(columns=level2_vars, axis=1).copy(deep=True)
level2 = acs[level2_vars].copy(deep=True)

Now reshape level1 to wide form with unstack():

level1 = level1.unstack()
level1

	age										...	hispanic
person	1	2	3	4	5	6	7	8	9	10	...	7	8	9	10	11	12	13	14	15	16
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 112 columns

level2 needs to have one observation per household as well. Do that by selecing the first person in each household:

level2 = level2.xs(1, level='person')
level2

	household_income	household_size
household
37	20000.0	3
241	32500.0	1
242	30000.0	1
377	51900.0	1
418	12200.0	1
...	...	...
1236119	93200.0	2
1236287	53400.0	4
1236624	62100.0	2
1236756	110600.0	2
1236779	22110.0	1

10565 rows × 2 columns

Visually it looks like these two data sets are ready to be merged, but there’s a complication: the column index for level1 is a MultiIndex, while the colum index for level2 is a string. How you combine them depends on whether you want the final result to have a MultiIndex or compound strings for the column identifiers.

8.2.3.1 Column MultiIndex

If you’re going for a column MultiIndex, level1 is ready to go but level2 needs to be converted to a MultiIndex. These are household-level variables, so the person level will be empty.

We’re going to need level2 in its current form when we make the string version of the data set, so make a copy before you do anything to it:

level2_original = level2.copy(deep=True)

Start the conversion by building a list of tuples based on level2_vars using a list comprehension:

level2_tuples = [(var, '') for var in level2_vars]
level2_tuples

[('household_income', ''), ('household_size', '')]

Now you can set the columns of the data set level2 to a new MultiIndex based on level2_tuples. This turns out to be very easy to do, as the Pandas MultiIndex class has a from_tuples() function that does it for you:

level2.columns = pd.MultiIndex.from_tuples(level2_tuples)
level2

	household_income	household_size

household
37	20000.0	3
241	32500.0	1
242	30000.0	1
377	51900.0	1
418	12200.0	1
...	...	...
1236119	93200.0	2
1236287	53400.0	4
1236624	62100.0	2
1236756	110600.0	2
1236779	22110.0	1

10565 rows × 2 columns

Now you’re ready to merge. Given that you created the data sets to merge and can be reasonably confident they’ll match, merge on the indexes and skip creating an indicator, but use validate to ensure you’ve got the data structures right.

acs_wide_mi = level1.merge(
    level2,
    left_index=True, right_index=True,
    validate='1:1'
)
acs_wide_mi

	age										...	hispanic								household_income	household_size
	1	2	3	4	5	6	7	8	9	10	...	9	10	11	12	13	14	15	16
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	32500.0	1
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	30000.0	1
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	51900.0	1
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	12200.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	93200.0	2
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	53400.0	4
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	62100.0	2
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	110600.0	2
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22110.0	1

10565 rows × 114 columns

8.2.3.2 Compound Strings

If you’re going for compound strings, then level2 is ready to go but level1 needs to be converted from a MultiIndex to compound strings. Start by replacing the MultiIndex with a flat index. The column indexes will still be tuples, but they will no longer correspond to levels (xs() will no longer work):

level1.columns = level1.columns.to_flat_index()
level1

	(age, 1)	(age, 2)	(age, 3)	(age, 4)	(age, 5)	(age, 6)	(age, 7)	(age, 8)	(age, 9)	(age, 10)	...	(hispanic, 7)	(hispanic, 8)	(hispanic, 9)	(hispanic, 10)	(hispanic, 11)	(hispanic, 12)	(hispanic, 13)	(hispanic, 14)	(hispanic, 15)	(hispanic, 16)
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 112 columns

Now convert the tuples to strings using a list comprehension. You can get the list of tuples from the columns attribute of the DataFrame and store the result as the new columns attribute:

level1.columns = [col[0] + '_' + str(col[1]) for col in level1.columns]
level1

	age_1	age_2	age_3	age_4	age_5	age_6	age_7	age_8	age_9	age_10	...	hispanic_7	hispanic_8	hispanic_9	hispanic_10	hispanic_11	hispanic_12	hispanic_13	hispanic_14	hispanic_15	hispanic_16
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10565 rows × 112 columns

Finally, merge it with level2_original:

acs_wide_string = level1.merge(
    level2_original,
    left_index=True, right_index=True,
    validate='1:1'
)
acs_wide_string

	age_1	age_2	age_3	age_4	age_5	age_6	age_7	age_8	age_9	age_10	...	hispanic_9	hispanic_10	hispanic_11	hispanic_12	hispanic_13	hispanic_14	hispanic_15	hispanic_16	household_income	household_size
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	32500.0	1
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	30000.0	1
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	51900.0	1
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	12200.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	93200.0	2
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	53400.0	4
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	62100.0	2
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	110600.0	2
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22110.0	1

10565 rows × 114 columns

8.2.4 Wide to Long

Switching from wide to long could be done the same way (split the data into level 1 and level 2 variables, reshape them separately, and combine) but it’s easier in this case to just use stack() and clean up the minor messes it leaves because it doesn’t know about level 2 variables.

stack() only works with a MultiIndex, so if you’re using strings as column indexes (like acs_wide_string) the first step is to convert them to a MultiIndex. If you’re already using a MultiIndex (like acs_wide_mi), you can skip the next section, though we’ll learn another useful function for working with strings and get some practice writing more complex loops.

8.2.4.1 Converting Strings to a MultiIndex

To convert a string like age_1 to a tuple like ('age', 1) we need to divide the string into the part that comes before the underscore and the part that comes after. The string function split() will do that for us:

'age_1'.split('_')

['age', '1']

The result is a list containing the pieces of the original string, broken up by the character you passed in. (If you don’t pass in a character it will divide the string by spaces, essentially giving you a list of words.) There are two complications. One is variable names that contain underscores, like marital_status:

'marital_status_1'.split('_')

['marital', 'status', '1']

The result we need is ['marital_status', 1], which we get if we tell split() to only make one split and start from the right. The maxsplit parameter will let us limit it to one split, but starting from the right turns out to be a different function, rsplit():

'marital_status_1'.rsplit('_', maxsplit=1)

['marital_status', '1']

The second complication is the level 2 variables, which should not be split up the same way. (It just so happens that our level 2 variables contain underscores, but they don’t represent a compound of variable and person number.) We’re going to loop over all the columns in acs_wide_string, but we need to detect the level 2 variables and handle them differently. That makes the process too complicated to do as a list comprehension.

Instead, we’ll define an empty list called new_cols, add the new column names to it one by one using append() inside a for loop, and make it the new columns of acs_wide_strings when we’re done.

Inside the loop, we’ll check if col (the current column) is in level2_vars. If it is, the tuple which will be its new name is just (col, '') since it’s a level 2 variable. If it is not, we’ll split the name using rsplit() and construct the tuple as above.

new_cols = []
for col in acs_wide_string.columns:
    if col in level2_vars:
        new_cols.append((col, ''))
    else:
        split = col.rsplit('_', maxsplit=1)
        new_cols.append((split[0], split[1]))

acs_wide_string.columns = new_cols
acs_wide_string

	(age, 1)	(age, 2)	(age, 3)	(age, 4)	(age, 5)	(age, 6)	(age, 7)	(age, 8)	(age, 9)	(age, 10)	...	(hispanic, 9)	(hispanic, 10)	(hispanic, 11)	(hispanic, 12)	(hispanic, 13)	(hispanic, 14)	(hispanic, 15)	(hispanic, 16)	(household_income, )	(household_size, )
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	32500.0	1
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	30000.0	1
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	51900.0	1
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	12200.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	93200.0	2
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	53400.0	4
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	62100.0	2
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	110600.0	2
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22110.0	1

10565 rows × 114 columns

Next we need to convert the column index tuples into a MultiIndex using MultiIndex.from_tuples() again.

acs_wide_string.columns = pd.MultiIndex.from_tuples(acs_wide_string.columns)
acs_wide_string

	age										...	hispanic								household_income	household_size
	1	2	3	4	5	6	7	8	9	10	...	9	10	11	12	13	14	15	16
household
37	20.0	19.0	19.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3
241	50.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	32500.0	1
242	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	30000.0	1
377	69.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	51900.0	1
418	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	12200.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	93200.0	2
1236287	41.0	42.0	23.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	53400.0	4
1236624	29.0	26.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	62100.0	2
1236756	58.0	61.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	110600.0	2
1236779	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22110.0	1

10565 rows × 114 columns

Now acs_wide_string is identical to acs_wide_mi, so we can convert it to long form the same way.

8.2.5 Converting MultiIndex Wide Form to Long

The stack() function does not understand level 2 variables any better than unstack(), but the problems that creates turn out to be pretty easy to fix.

acs_long = acs_wide_mi.stack()
acs_long.loc[37]

	age	race	marital_status	edu	income	female	hispanic	household_income	household_size
1	20.0	White	Never married	Some college, >=1 year	10000.0	True	False	NaN	NaN
2	19.0	White	Never married	Some college, >=1 year	5300.0	True	False	NaN	NaN
3	19.0	Black	Never married	Some college, >=1 year	4700.0	True	False	NaN	NaN
	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3.0

Looking at the first household, there are three problems:

stack() created a row for person '', the “person index” of the household level variables.
The level 2 variables are NaN for the actual people.
We lost the name of the person index. We don’t need it, but it will be convenient.

We’ll fix the problems in reverse order. Start by putting back person with rename_axis():

acs_long = acs_long.rename_axis(
    ['household', 'person'],
    axis='rows'
)
acs_long.loc[37]

	age	race	marital_status	edu	income	female	hispanic	household_income	household_size
person
1	20.0	White	Never married	Some college, >=1 year	10000.0	True	False	NaN	NaN
2	19.0	White	Never married	Some college, >=1 year	5300.0	True	False	NaN	NaN
3	19.0	Black	Never married	Some college, >=1 year	4700.0	True	False	NaN	NaN
	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3.0

(This table doesn’t have a column for household since we selected just one household, but it’s there in the full data set.)

Next we’ll fix the rows for actual people having NaN for household-level variables. The last observation for each household has the correct value of each level 2 variable, so you can use groupby() and transform('last') to set all of them to that value. Just wrap that in a loop over the level 2 variables:

for var in level2_vars:
    acs_long[var] = acs_long[var].groupby('household').transform('last')
acs_long.loc[37]

	age	race	marital_status	edu	income	female	hispanic	household_income	household_size
person
1	20.0	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0	3.0
2	19.0	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0	3.0
3	19.0	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0	3.0
	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20000.0	3.0

Now all we need to do is drop the extraneous rows. This is very easy to do using query().

We want to keep all the rows where person is not blank. We’ve previously used '' for blank, but recall that within a query we need to use double quotes since single quotes mark the beginning and end of the query string:

acs_long = acs_long.query('person != ""')
acs_long

		age	race	marital_status	edu	income	female	hispanic	household_income	household_size
household	person
37	1	20.0	White	Never married	Some college, >=1 year	10000.0	True	False	20000.0	3.0
	2	19.0	White	Never married	Some college, >=1 year	5300.0	True	False	20000.0	3.0
	3	19.0	Black	Never married	Some college, >=1 year	4700.0	True	False	20000.0	3.0
241	1	50.0	White	Never married	Master's degree	32500.0	True	False	32500.0	1.0
242	1	29.0	White	Never married	Bachelor's degree	30000.0	True	False	30000.0	1.0
...	...	...	...	...	...	...	...	...	...	...
1236624	1	29.0	White	Now married	Some college, >=1 year	50100.0	False	False	62100.0	2.0
1236624	2	26.0	White	Now married	High School graduate	12000.0	True	False	62100.0	2.0
1236756	1	58.0	White	Now married	Master's degree	69800.0	True	False	110600.0	2.0
1236756	2	61.0	White	Now married	Master's degree	40800.0	False	False	110600.0	2.0
1236779	1	30.0	American Indian	Divorced	High School graduate	22110.0	False	False	22110.0	1.0

27410 rows × 9 columns

And that’s the data set back in long form.