7  Restructuring Data Sets

In this chapter we’ll learn about two of the most common ways you might need to change the structure of a data set: aggregating observations, and switching between long form and wide form.

7.1 Aggregating Observations

Sometimes you’ll get a hierarchical data set with level one and level two units, but all you care about is the level two units. For example, you may need one observation per household and just household-level variables from the ACS. But those household-level variables are likely to be aggregates of the level one variables, like household_income. More broadly, there are many situations where you may want to combine observations that belong to the same group. Pandas has functions that makes this easy to do.

7.1.1 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Restructuring_Data_Practice.ipynb. Have it import Pandas and load acs_clean.pickle. Then create a household_income variable so we’ll have a household-level variable to practice with.

import pandas as pd
acs = pd.read_pickle('acs_clean.pickle')
acs['household_income'] = acs['income'].groupby('household').transform('sum')
acs
age race marital_status edu income female hispanic household_income
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0
241 1 50 White Never married Master's degree 32500.0 True False 32500.0
242 1 29 White Never married Bachelor's degree 30000.0 True False 30000.0
... ... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0
2 26 White Now married High School graduate 12000.0 True False 62100.0
1236756 1 58 White Now married Master's degree 69800.0 True False 110600.0
2 61 White Now married Master's degree 40800.0 False False 110600.0
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0

27410 rows × 8 columns

7.1.2 Aggregating With A Subset

If the data set you need consists of level two units and level two variables that already exist, all you need to do is create a subset selecting the level two variable(s) and just one row per level two unit. Since level two variables are always the same for all level one units in the same level two unit, it doesn’t matter which row you choose. But it’s easiest to choose the first row in each group since every group will have a first row. You can do that with xs():

acs['household_income'].xs(1, level='person')
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: household_income, Length: 10565, dtype: float64
Exercise

Create a household-level variable containing the mean age of the children in the household. Then create a data set that contains one row per household and the mean age of the children. (Hint: to find the mean age of the children, first create a column containing the ages of just the children, with NaN for the adults.)

7.1.3 Aggregating With An Aggregate Function

If the data set you need consists of level two units and either one level two variable or multiple level two variables that can all be created with the same aggregate function, just apply the aggregate function to the variables you need. For example, you can create a data set of household incomes directly with:

acs['income'].groupby('household').sum()
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: income, Length: 10565, dtype: float64

Create a data set containing the proportion of the household that is female and the proportion of the household that is hispanic with:

acs[['female', 'hispanic']].groupby('household').mean()
female hispanic
household
37 1.0 0.0
241 1.0 0.0
242 1.0 0.0
377 1.0 0.0
418 1.0 0.0
... ... ...
1236119 0.5 0.0
1236287 0.5 0.0
1236624 0.5 0.0
1236756 0.5 0.0
1236779 0.0 0.0

10565 rows × 2 columns

The apply() function allows you to aggregate using a lambda function, so you can create your own aggregate functions. Create a data set containing households and their income per person with:

acs['income'].groupby('household').apply(lambda x: x.sum()/x.size)
household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...     
1236119    46600.000000
1236287    13350.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64
Exercise

Create a data set of households containing the education level of the most educated person in the household. Recall that edu is an ordered categorical variable, so you can apply max() to it.

7.1.4 Aggregating Using Named Aggregates

If you need to create a data set of level two units and level two variables that must be made using different aggregate functions, you can do so using named aggregates. A named aggregate allows you to specify the column to be aggregated, the aggregate function to use, and what the result should be called. This gives you complete control over the DataFrame to be created.

The Pandas NamedAgg() function takes two arguments: the column to act on and the aggregate function to use. To use them, pass them as key word arguments to the agg() function, with the name being the column to create in the resulting data set.

For example, suppose we need a data set of households with the variables ‘household income’, ‘number of people in the household’, and ‘proportion of the household that is female.’ We already have household_income as a level two variable, so we can just take its first value with the aggregate function first–but since household income will be the only income we care about in the data set of households, let’s just call it income. Number of people can be obtained from the aggregate function size, but remember it needs a single column, any column, to act on. And proportion female is just the mean of female. Thus the code is:

acs.groupby('household').agg(
    income=pd.NamedAgg('household_income', 'first'),
    size=pd.NamedAgg('income', 'size'),
    prop_female=pd.NamedAgg('female', 'mean')
)
income size prop_female
household
37 20000.0 3 1.0
241 32500.0 1 1.0
242 30000.0 1 1.0
377 51900.0 1 1.0
418 12200.0 1 1.0
... ... ... ...
1236119 93200.0 2 0.5
1236287 53400.0 4 0.5
1236624 62100.0 2 0.5
1236756 110600.0 2 0.5
1236779 22110.0 1 0.0

10565 rows × 3 columns

The naming of the new columns is a rare instance where column names do not go in quotes. That’s because they are key word arguments for the function agg().

Exercise

Create a data set with one row per household and variables for ‘at least one member of this household is Hispanic’ and ‘all the members of this household are Hispanic.’

7.2 Switching Between Long and Wide Form

Unlike aggregating data sets, switching between long and wide form keeps all of the level one data–no data are lost, they’re just organized differently. Doing so is extremely easy if your data set only contains level 1 variables, but not still too bad if you have to handle level 2 variables as well.

Reload the acs_clean data set as acs:

acs = pd.read_pickle('acs_clean.pickle')
acs
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master's degree 32500.0 True False
242 1 29 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master's degree 69800.0 True False
2 61 White Now married Master's degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

7.2.1 Stack and Unstack

This data set is in the long form, so one way to describe it would be to say that people in the same household are stacked on top of each other. So to convert it to the wide form, all you need to do is unstack them:

acs_wide = acs.unstack()
acs_wide
age ... hispanic
person 1 2 3 4 5 6 7 8 9 10 ... 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 112 columns

The result is the same dataset in wide form. Note how the MultiIndex for rows in long form was converted to a MultiIndex for columns in wide form.

To go back to long form, just stack it again:

acs_wide.stack()
age race marital_status edu income female hispanic
household person
37 1 20.0 White Never married Some college, >=1 year 10000.0 True False
2 19.0 White Never married Some college, >=1 year 5300.0 True False
3 19.0 Black Never married Some college, >=1 year 4700.0 True False
241 1 50.0 White Never married Master's degree 32500.0 True False
242 1 29.0 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
1236624 1 29.0 White Now married Some college, >=1 year 50100.0 False False
2 26.0 White Now married High School graduate 12000.0 True False
1236756 1 58.0 White Now married Master's degree 69800.0 True False
2 61.0 White Now married Master's degree 40800.0 False False
1236779 1 30.0 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

Since we’ve never used a column MultiIndex before, let’s take a moment to see how they work.

To select a household, just use loc:

acs_wide.loc[37]
          person
age       1         20.0
          2         19.0
          3         19.0
          4          NaN
          5          NaN
                    ... 
hispanic  12         NaN
          13         NaN
          14         NaN
          15         NaN
          16         NaN
Name: 37, Length: 112, dtype: object

To select a particular column, you can use square brackets with a tuple containing the variable name and person number:

acs_wide[('race', 1)]
household
37                   White
241                  White
242                  White
377                  White
418                  White
                ...       
1236119              White
1236287    American Indian
1236624              White
1236756              White
1236779    American Indian
Name: (race, 1), Length: 10565, dtype: category
Categories (9, object): ['Alaska Native', 'American Indian', 'Asian', 'Black', ..., 'Other', 'Pacific Islander', 'Two or more races', 'White']

To subset all the variables for a particular person number (in all households), use xs() with level='person' and axis=1:

acs_wide.xs(1, level='person', axis=1)
age race marital_status edu income female hispanic
household
37 20.0 White Never married Some college, >=1 year 10000.0 True False
241 50.0 White Never married Master's degree 32500.0 True False
242 29.0 White Never married Bachelor's degree 30000.0 True False
377 69.0 White Never married None 51900.0 True False
418 59.0 White Widowed 12th grade, no diploma 12200.0 True False
... ... ... ... ... ... ... ...
1236119 51.0 White Now married Some college, >=1 year 62200.0 False False
1236287 41.0 American Indian Now married Some college, <1 year 15000.0 False False
1236624 29.0 White Now married Some college, >=1 year 50100.0 False False
1236756 58.0 White Now married Master's degree 69800.0 True False
1236779 30.0 American Indian Divorced High School graduate 22110.0 False False

10565 rows × 7 columns

You can use the same approach to select a variable for all the people in the household. The list of variables doesn’t have a name like person (we could give it one if we wanted to), but you can refer to it as level 0. To select all the age variables, use:

acs_wide.xs('age', level=0, axis=1)
person 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 16 columns

7.2.2 Handling Level 2 Variables

If you’re familiar with Stata’s reshape command or similar tools in R, you may be thinking “Wow, why can’t reshape be as easy as stack() and unstack()?” One reason is that you don’t have to tell stack() and unstack() about the structure of the data–they can get it from the MultiIndex. But the other is that stack() and unstack() don’t and can’t distinguish between level 1 and level 2 variables.

Let’s add a couple of level two variables to the mix, household_income and household_size:

acs['household_income'] = acs['income'].groupby('household').transform('sum')
acs['household_size'] = acs['age'].groupby('household').transform('size')
acs
age race marital_status edu income female hispanic household_income household_size
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0 3
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0 3
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0 3
241 1 50 White Never married Master's degree 32500.0 True False 32500.0 1
242 1 29 White Never married Bachelor's degree 30000.0 True False 30000.0 1
... ... ... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0 2
2 26 White Now married High School graduate 12000.0 True False 62100.0 2
1236756 1 58 White Now married Master's degree 69800.0 True False 110600.0 2
2 61 White Now married Master's degree 40800.0 False False 110600.0 2
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0 1

27410 rows × 9 columns

Now see what happens to household_income when you unstack() the DataFrame:

acs.unstack()['household_income']
person 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
household
37 20000.0 20000.0 20000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 32500.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 30000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 51900.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 12200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 93200.0 93200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 53400.0 53400.0 53400.0 53400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 62100.0 62100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 110600.0 110600.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 22110.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 16 columns

unstack() does not know or care that household_income is a level two variable. It went ahead and created 16 copies of it and copied over the values on each row. The pandas package has several other functions for reshaping data (pivot(), melt() etc.) but none of them handle level 2 variables properly.

7.2.2.1 Long to Wide

The solution is to break the problem up into pieces that Pandas tools handle very well. Specifically:

  1. Break the data set into one DataFrame containing just the level 1 variables and one containing just the level 2 variables.
  2. Reshape the level 1 variables to one observation per level 2 unit using unstack() like in the previous section.
  3. Reshape the level 2 variables to one observation per level 2 unit by subsetting.
  4. Merge the results

We’ll need to refer to a list of level 2 variables several times during this process, so it will be convenient to create it ahead of time:

level2_vars = ['household_income', 'household_size']

With that in hand, create the data set of level 1 variables by dropping the variables in level2_vars, and the data set of level 2 variables by selecting them. Avoid any confusion about views and copies by explicitly making copies.

level1 = acs.drop(columns=level2_vars, axis=1).copy(deep=True)
level2 = acs[level2_vars].copy(deep=True)

Now reshape level1 to wide form with unstack():

level1 = level1.unstack()
level1
age ... hispanic
person 1 2 3 4 5 6 7 8 9 10 ... 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 112 columns

level2 needs to have one observation per household as well. Do that by selecing the first person in each household:

level2 = level2.xs(1, level='person')
level2
household_income household_size
household
37 20000.0 3
241 32500.0 1
242 30000.0 1
377 51900.0 1
418 12200.0 1
... ... ...
1236119 93200.0 2
1236287 53400.0 4
1236624 62100.0 2
1236756 110600.0 2
1236779 22110.0 1

10565 rows × 2 columns

Visually it looks like these two data sets are ready to be merged, but there’s a complication: the column index for level1 is a MultiIndex, while the colum index for level2 is a simple index. We’ll fix that by giving level2 a MultiIndex as well, but with the person level set to blank ('', a string containing nothing). This is a minor abuse of MultiIndexing, and you should keep in mind that Python will think of this as a new person whose identifier is '' and just happens to only have values for the level 2 variables. But it will work.

MultiIndexes go naturally with tuples, so we’ll start by making a list of tuples based on level2_vars. This is easy to do with a for loop. First, create an empty list called level2_tuples. Then loop over level2_vars and append a new tuple containing the variable and a blank to level2_tuples.

level2_tuples = []
for var in level2_vars:
    level2_tuples.append((var, ''))
level2_tuples
[('household_income', ''), ('household_size', '')]

Now you can set the columns of the data set level2 to a new MultiIndex based on level2_tuples:

level2.columns = pd.MultiIndex.from_tuples(level2_tuples)
level2
household_income household_size
household
37 20000.0 3
241 32500.0 1
242 30000.0 1
377 51900.0 1
418 12200.0 1
... ... ...
1236119 93200.0 2
1236287 53400.0 4
1236624 62100.0 2
1236756 110600.0 2
1236779 22110.0 1

10565 rows × 2 columns

Now you’re ready to merge. Given that you created the data sets to merge, merge on the indexes and skip creating an indicator, but use validate to ensure you’ve got the data structures right.

acs_wide = pd.merge(
    level1, level2,
    left_index=True, right_index=True,
    validate='1:1'
)
acs_wide   
age ... hispanic household_income household_size
1 2 3 4 5 6 7 8 9 10 ... 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 20000.0 3
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 32500.0 1
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 30000.0 1
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 51900.0 1
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 12200.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 93200.0 2
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 53400.0 4
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 62100.0 2
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 110600.0 2
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 22110.0 1

10565 rows × 114 columns

Be sure to scroll right and see how neatly household_income and household_size were added to the data set!

7.2.2.2 Wide to Long

Switching from wide to long could be done the same way (split the data into level 1 and level 2 variables, reshape them separately, and combine) but it’s easier in this case to just use stack() and clean up the minor messes it leaves because it doesn’t know about level 2 variables.

acs_long = acs_wide.stack()
acs_long.loc[37]
age race marital_status edu income female hispanic household_income household_size
1 20.0 White Never married Some college, >=1 year 10000.0 True False NaN NaN
2 19.0 White Never married Some college, >=1 year 5300.0 True False NaN NaN
3 19.0 Black Never married Some college, >=1 year 4700.0 True False NaN NaN
NaN NaN NaN NaN NaN NaN NaN 20000.0 3.0

Looking at the first household, there are three problems:

  1. stack() created a row for person '', the “person index” of the household level variables.
  2. The level 2 variables are NaN for the actual people.
  3. We lost the name of the person index. We don’t need it, but it will be convenient.

We’ll fix the problems in reverse order. Start by putting back person with rename_axis():

acs_long = acs_long.rename_axis(
    ['household', 'person'],
    axis='rows'
)
acs_long.loc[37]
age race marital_status edu income female hispanic household_income household_size
person
1 20.0 White Never married Some college, >=1 year 10000.0 True False NaN NaN
2 19.0 White Never married Some college, >=1 year 5300.0 True False NaN NaN
3 19.0 Black Never married Some college, >=1 year 4700.0 True False NaN NaN
NaN NaN NaN NaN NaN NaN NaN 20000.0 3.0

Next we’ll fix the rows for actual people having NaN for household-level variables. The last observation for each household has the correct value of each level 2 variable, so you can use groupby() and transform('last') to set all of them to that value. Just wrap that in a loop over the level 2 variables:

for var in level2_vars:
    acs_long[var] = acs_long[var].groupby('household').transform('last')
acs_long.loc[37]
age race marital_status edu income female hispanic household_income household_size
person
1 20.0 White Never married Some college, >=1 year 10000.0 True False 20000.0 3.0
2 19.0 White Never married Some college, >=1 year 5300.0 True False 20000.0 3.0
3 19.0 Black Never married Some college, >=1 year 4700.0 True False 20000.0 3.0
NaN NaN NaN NaN NaN NaN NaN 20000.0 3.0

Now all we need to do is drop the extraneous rows. This is very easy to do using the query() function, an alternative way of subsetting rows. (Look for a future version of this book to introduce query() earlier and use it more.) With query() you pass in a string containing a condition, but the string can refer to variables in the data frame simply by name (e.g. age instead of acs['age']).

The one wrinkle is that string values need to go in quotes. We’re using single quotes to denote the beginning and end of the query string itself, so use double quotes to denote the beginning and end of string values inside the query. In particular, refer to a blank for person as "" rather than ''.

With that, a query to select just the rows representing actual people is very easy to write:

acs_long = acs_long.query('person != ""')
acs_long
age race marital_status edu income female hispanic household_income household_size
household person
37 1 20.0 White Never married Some college, >=1 year 10000.0 True False 20000.0 3.0
2 19.0 White Never married Some college, >=1 year 5300.0 True False 20000.0 3.0
3 19.0 Black Never married Some college, >=1 year 4700.0 True False 20000.0 3.0
241 1 50.0 White Never married Master's degree 32500.0 True False 32500.0 1.0
242 1 29.0 White Never married Bachelor's degree 30000.0 True False 30000.0 1.0
... ... ... ... ... ... ... ... ... ... ...
1236624 1 29.0 White Now married Some college, >=1 year 50100.0 False False 62100.0 2.0
2 26.0 White Now married High School graduate 12000.0 True False 62100.0 2.0
1236756 1 58.0 White Now married Master's degree 69800.0 True False 110600.0 2.0
2 61.0 White Now married Master's degree 40800.0 False False 110600.0 2.0
1236779 1 30.0 American Indian Divorced High School graduate 22110.0 False False 22110.0 1.0

27410 rows × 9 columns

And that’s the data set back in long form.

Exercise

Load the example extract from the National Longitudinal Survey of Youth contained in nlsy.pickle. Convert it to wide form. Then convert it back to long form.