7 Restructuring Data Sets

In this chapter we’ll learn about two of the most common ways you might need to change the structure of a data set: switching between long form and wide form, and creating a data set of aggregates.

7.1 Switching Between Long and Wide Form

Switching between long and wide form is very easy if you can use a MultiIndex for the columns in wide form. Unfortunately, this does not work if the data set contains level two variables. Also, most statistical packages do not have an equivalent of Python’s MultiIndex for columns. Thus the wide form we saw in the last chapter is much more common, with the level one identifiers (person number in the ACS) as part of the variable names.

7.1.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to Restructuring Data Sets. Then import pandas and use pd.read_pickle() to read in acs_clean.pickle as acs.

import pandas as pd
acs = pd.read_pickle('acs_clean.pickle')
acs
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master’s degree 32500.0 True False
242 1 29 White Never married Bachelor’s degree 30000.0 True False
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master’s degree 69800.0 True False
2 61 White Now married Master’s degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

7.1.2 Switching With A MultiIndex

This data set is in the long form, so one way to describe it would be to say that people in the same household are stacked on top of each other. So to convert it to the wide form, all you need to do is unstack them:

acs_wide = acs.unstack()
acs_wide
age hispanic
person 1 2 3 4 5 6 7 8 9 10 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 112 columns

The result is the same dataset in wide form (though you’d have to look at more of it to know that for sure). This was very easy to do because the MultiIndex tells Python everything it needs to know about the data structure (i.e. people grouped into households). All it needed to do was convert the row MultiIndex to a column MultiIndex.

To go back to long form, just stack it again:

acs_wide.stack()
age race marital_status edu income female hispanic
household person
37 1 20.0 White Never married Some college, >=1 year 10000.0 True False
2 19.0 White Never married Some college, >=1 year 5300.0 True False
3 19.0 Black Never married Some college, >=1 year 4700.0 True False
241 1 50.0 White Never married Master’s degree 32500.0 True False
242 1 29.0 White Never married Bachelor’s degree 30000.0 True False
1236624 1 29.0 White Now married Some college, >=1 year 50100.0 False False
2 26.0 White Now married High School graduate 12000.0 True False
1236756 1 58.0 White Now married Master’s degree 69800.0 True False
2 61.0 White Now married Master’s degree 40800.0 False False
1236779 1 30.0 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

Since we’ve never used a column MultiIndex before, let’s take a moment to see how they work.

To select a household, just use loc:

acs_wide.loc[37]
          person
age       1         20.0
          2         19.0
          3         19.0
          4          NaN
          5          NaN
                    ... 
hispanic  12         NaN
          13         NaN
          14         NaN
          15         NaN
          16         NaN
Name: 37, Length: 112, dtype: object

To select a particular column, you can use square brackets with a tuple containing the variable name and person number:

acs_wide[('race', 1)]
household
37                   White
241                  White
242                  White
377                  White
418                  White
                ...       
1236119              White
1236287    American Indian
1236624              White
1236756              White
1236779    American Indian
Name: (race, 1), Length: 10565, dtype: category
Categories (9, object): ['Alaska Native', 'American Indian', 'Asian', 'Black', ..., 'Other', 'Pacific Islander', 'Two or more races', 'White']

To subset all the variables for a particular person number (in all households), use xs() with level='person' and axis=1:

acs_wide.xs(1, level='person', axis=1)
age race marital_status edu income female hispanic
household
37 20.0 White Never married Some college, >=1 year 10000.0 True False
241 50.0 White Never married Master’s degree 32500.0 True False
242 29.0 White Never married Bachelor’s degree 30000.0 True False
377 69.0 White Never married None 51900.0 True False
418 59.0 White Widowed 12th grade, no diploma 12200.0 True False
1236119 51.0 White Now married Some college, >=1 year 62200.0 False False
1236287 41.0 American Indian Now married Some college, <1 year 15000.0 False False
1236624 29.0 White Now married Some college, >=1 year 50100.0 False False
1236756 58.0 White Now married Master’s degree 69800.0 True False
1236779 30.0 American Indian Divorced High School graduate 22110.0 False False

10565 rows × 7 columns

You can use the same approach to select a variable for all the people in the household. The list of variables doesn’t have a name like person (we could give it one if we really wanted to), but you can refer to it as level 0. To select all the age variables, use:

acs_wide.xs('age', level=0, axis=1)
person 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 16 columns

7.1.3 Switching From Long to Wide

So why didn’t we use a column MultiIndex all along? Well, let’s add a level two variable to the mix, household_income:

acs['household_income'] = (
    acs['income'].
    groupby('household').
    transform('sum')
)
acs
age race marital_status edu income female hispanic household_income
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0
241 1 50 White Never married Master’s degree 32500.0 True False 32500.0
242 1 29 White Never married Bachelor’s degree 30000.0 True False 30000.0
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0
2 26 White Now married High School graduate 12000.0 True False 62100.0
1236756 1 58 White Now married Master’s degree 69800.0 True False 110600.0
2 61 White Now married Master’s degree 40800.0 False False 110600.0
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0

27410 rows × 8 columns

Now see what happens to it when you unstack() the DataFrame:

acs_wide=acs.unstack()
acs_wide
age household_income
person 1 2 3 4 5 6 7 8 9 10 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 128 columns

unstack() does not know or care that household_income is a level two variable. It went ahead and created 16 copies of it and copied over the values on each row. Take a look at the values for household 37:

acs_wide.loc[37].filter(like='household_income')
                  person
household_income  1         20000.0
                  2         20000.0
                  3         20000.0
                  4             NaN
                  5             NaN
                  6             NaN
                  7             NaN
                  8             NaN
                  9             NaN
                  10            NaN
                  11            NaN
                  12            NaN
                  13            NaN
                  14            NaN
                  15            NaN
                  16            NaN
Name: 37, dtype: object

But the problem is not just unstack(): the column MultiIndex assumes every variable is associated with 16 values of person and we’re not aware of any way to change that. That’s why the wide form data set used in the last chapter put the person index into the variable names rather than using a MultiIndex.

The leaves us with the problem of putting the data set in that form. Given that long form is generally easier to use, converting to wide form is a somewhat uncommon task. We’ll do it anyway for two reasons: first, because doing so will provide a road map for the much more common transformation of wide to long. Second, we’ll learn a very useful tool along the way.

The columns attribute of the DataFrame has a function called to_flat_index() that eliminates the MultiIndex:

acs_wide.columns = acs_wide.columns.to_flat_index()
acs_wide
(age, 1) (age, 2) (age, 3) (age, 4) (age, 5) (age, 6) (age, 7) (age, 8) (age, 9) (age, 10) (household_income, 7) (household_income, 8) (household_income, 9) (household_income, 10) (household_income, 11) (household_income, 12) (household_income, 13) (household_income, 14) (household_income, 15) (household_income, 16)
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 128 columns

This replaced the MultiIndex with a single index where each column is labeled by a tuple. But tuples would be hard to work with, so we still want to convert them to a single string.

This is a job for a loop, so as usual we’ll start with just one example. Store the tuple ('age', 1) as name. Recall that you can access the elements of a tuple with square brackets, in this case name[0] and name[1]. So to convert ('age', 1) to age_1, use:

name = ('age', 1)
name[0] + "_" + str(name[1])
'age_1'

To do this for all the columns in acs_wide, we need them in a list we can loop over. Fortunately the DataFrame columns have a to_list() function that gives us exactly that:

acs_wide.columns.tolist()

Our goal is to replace the current acs_wide.columns with a new list of string names. We can do that very easily with a backwards sort of for loop called a list comprehension. A list comprehension looks almost the same as a regular for loop, except the whole thing goes in brackets, the code to execute comes first, and the for loop definition goes at the end. The big difference is that the result is a list. Start with a simple example:

my_list = [1, 2, 3]
[x**2 for x in my_list]
[1, 4, 9]

Think of a list comprehension as an easy way to carry out a transformation on each element in a list. Now use a list comprehension to transform all the column names in acs_wide from tuples to strings and store the result as the new column names:

acs_wide.columns = [
    name[0] + "_" + str(name[1]) 
    for name in acs_wide.columns.tolist()
]
acs_wide
age_1 age_2 age_3 age_4 age_5 age_6 age_7 age_8 age_9 age_10 household_income_7 household_income_8 household_income_9 household_income_10 household_income_11 household_income_12 household_income_13 household_income_14 household_income_15 household_income_16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 128 columns

We’re not done yet: we still have 16 copies of household income and we only want one: household_income_1 since it’s the one that’s never missing. Rename it with:

acs_wide = acs_wide.rename(
    columns = 
        {'household_income_1' : 'household_income'}
)
acs_wide.dtypes
age_1                  float64
age_2                  float64
age_3                  float64
age_4                  float64
age_5                  float64
                        ...   
household_income_12    float64
household_income_13    float64
household_income_14    float64
household_income_15    float64
household_income_16    float64
Length: 128, dtype: object

If we now use filter() with the regular expression '^household_income_\d*$, we’ll get all the household income variables except the one we want to keep:

acs_wide.filter(
    regex='^household_income_\d*$'
).columns.to_list()
['household_income_2',
 'household_income_3',
 'household_income_4',
 'household_income_5',
 'household_income_6',
 'household_income_7',
 'household_income_8',
 'household_income_9',
 'household_income_10',
 'household_income_11',
 'household_income_12',
 'household_income_13',
 'household_income_14',
 'household_income_15',
 'household_income_16']

Drop them from the data set by passing that list to the drop() function:

acs_wide = acs_wide.drop(
    acs_wide.filter(
        regex='^household_income_\d*$'
    ).columns.to_list(),
    axis=1
)
acs_wide.dtypes
age_1               float64
age_2               float64
age_3               float64
age_4               float64
age_5               float64
                     ...   
hispanic_13          object
hispanic_14          object
hispanic_15          object
hispanic_16          object
household_income    float64
Length: 113, dtype: object

7.1.4 Switching from Wide to Long

This data set is now in wide form with the indexes stored in the column names, which is a very common format to receive data in. We’ll next convert it to long form, a very common task since long form is usually easier to work with. To do so we’ll just reverse all the steps we took in converting it to wide form.

First, we need to create 16 household income variables. They should contain a copy of the existing household_income variable for the columns describing a person that actually exists, and NaN for the others. We know the age variables are never missing, so we’ll use age not being NaN as our indicator that a person actually exists.

This requires a loop from 1 to 16, but hopefully these are becoming familiar to you and this one’s not complicated so we won’t take the time to build it from its component parts. Check on household 37 (which has three people) to make sure it worked properly:

for i in range(1,17): 
    acs_wide.loc[
        ~acs_wide['age_'+str(i)].isna(),
        'household_income_'+str(i)
    ] = acs_wide['household_income']

acs_wide.filter(like='household_income').loc[37]
household_income       20000.0
household_income_1     20000.0
household_income_2     20000.0
household_income_3     20000.0
household_income_4         NaN
household_income_5         NaN
household_income_6         NaN
household_income_7         NaN
household_income_8         NaN
household_income_9         NaN
household_income_10        NaN
household_income_11        NaN
household_income_12        NaN
household_income_13        NaN
household_income_14        NaN
household_income_15        NaN
household_income_16        NaN
Name: 37, dtype: float64

Now that we know all is well, drop household_income:

acs_wide = acs_wide.drop(
    'household_income',
    axis=1
)

The next task is to convert the column names into a MultiIndex, but the first step is to convert them back into tuples. The column names are in the form variable_person, and we need to split those two components up. This is a common task in working with text data called parsing. In this case the split() function will make it very easy. It acts on a string, splits it into pieces based on a string you pass in, and returns the results in a list you can convert to a tuple with the tuple() function. In our case, it’s the underscore character, ’_’, that separates the variable from the person number:

name = 'age_1'
tuple(name.split('_'))
('age', '1')

There’s just one complication: household_income contains an underscore.

name = 'household_income_1'
tuple(name.split('_'))
('household', 'income', '1')

If we tried to make that into a MultiIndex we’d end up with three levels. The solution is a combination of two things. First, we’ll pass in a second argument, 1, that tells the function to only split the string based on the first underscore it sees. Second, we’ll switch from split() to rsplit(). The only difference is the rsplit() starts splitting from the right, so when we pass in 1 it will only split based on the last underscore in the string. Thus:

tuple(name.rsplit('_', 1))
('household_income', '1')

Applying this to all the column names is a job for a list comprehension:

acs_wide.columns = [
    tuple(name.rsplit('_', 1)) 
    for name in acs_wide.columns.tolist()
]
acs_wide
(age, 1) (age, 2) (age, 3) (age, 4) (age, 5) (age, 6) (age, 7) (age, 8) (age, 9) (age, 10) (household_income, 7) (household_income, 8) (household_income, 9) (household_income, 10) (household_income, 11) (household_income, 12) (household_income, 13) (household_income, 14) (household_income, 15) (household_income, 16)
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 128 columns

Now you’re ready to convert the tuples into a MultiIndex. Pandas has a MultiIndex.from_tuples() function that does exactly that:

acs_wide.columns = pd.MultiIndex.from_tuples(
    acs_wide.columns.tolist()
)
acs_wide
age household_income
1 2 3 4 5 6 7 8 9 10 7 8 9 10 11 12 13 14 15 16
household
37 20.0 19.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
377 69.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 59.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236119 51.0 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236287 41.0 42.0 23.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236624 29.0 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236756 58.0 61.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1236779 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10565 rows × 128 columns

With the column names now set up as a MultiIndex, the actual conversion to long form is easily done with stack():

acs_wide.stack()
age edu female hispanic household_income income marital_status race
household
37 1 20.0 Some college, >=1 year True False 20000.0 10000.0 Never married White
2 19.0 Some college, >=1 year True False 20000.0 5300.0 Never married White
3 19.0 Some college, >=1 year True False 20000.0 4700.0 Never married Black
241 1 50.0 Master’s degree True False 32500.0 32500.0 Never married White
242 1 29.0 Bachelor’s degree True False 30000.0 30000.0 Never married White
1236624 1 29.0 Some college, >=1 year False False 62100.0 50100.0 Now married White
2 26.0 High School graduate True False 62100.0 12000.0 Now married White
1236756 1 58.0 Master’s degree True False 110600.0 69800.0 Now married White
2 61.0 Master’s degree False False 110600.0 40800.0 Now married White
1236779 1 30.0 High School graduate False False 22110.0 22110.0 Divorced American Indian

27410 rows × 8 columns

7.1.4.1 Exercise

Load the example extract from the National Longitudinal Survey of Youth contained in nlsy.pickle. Convert it to wide form withe level one index stored in the column names. Then convert it back to long form.

7.2 Aggregating Data Sets

Previously in this chapter we’ve taken a data set with one row per person (level one unit) and converted it to one row per household (level two unit), but we made sure not to lose any information about the individual people in the process. Sometimes you need to create a data set that is all about the level two units, aggregating the level one variables in some way.

7.2.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to Aggregating Data Sets. Then import pandas and use pd.read_pickle() to read in acs_clean.pickle as acs. Create a household_income variable, just so we have a level two variable to work with:

import pandas as pd
acs = pd.read_pickle('acs_clean.pickle')
acs['household_income'] = acs['income'].groupby('household').transform('sum')
acs
age race marital_status edu income female hispanic household_income
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0
241 1 50 White Never married Master’s degree 32500.0 True False 32500.0
242 1 29 White Never married Bachelor’s degree 30000.0 True False 30000.0
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0
2 26 White Now married High School graduate 12000.0 True False 62100.0
1236756 1 58 White Now married Master’s degree 69800.0 True False 110600.0
2 61 White Now married Master’s degree 40800.0 False False 110600.0
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0

27410 rows × 8 columns

7.2.2 Aggregating With A Subset

If the data set you need consists of level two units and level two variables that already exist, all you need to do is create a subset selecting the level two variable(s) and just one row per level two unit. Since level two variables are always the same for all level one units in the same level two unit, it doesn’t matter which row you choose. But it’s easiest to choose the first row in each group since every group will have a first row. You can do that with xs():

acs['household_income'].xs(1, level='person')
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: household_income, Length: 10565, dtype: float64

7.2.2.1 Exercise

Create a household-level variable containing the mean age of the children in the household. Then create a data set that contains one row per household and the mean age of the children. (Hint: to find the mean age of the children, first create a column containing the ages of just the children, with NaN for the adults.)

7.2.3 Aggregating With An Aggregate Function

If the data set you need consists of level two units and either one level two variable or multiple level two variables that can all be created with the same aggregate function, just apply the aggregate function to the variables you need. For example, you can create a data set of household incomes directly with:

acs['income'].groupby('household').sum()
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: income, Length: 10565, dtype: float64

Create a data set containing the proportion of the household that is female and the proportion of the household that is hispanic with:

acs[['female', 'hispanic']].groupby('household').mean()
female hispanic
household
37 1.0 0.0
241 1.0 0.0
242 1.0 0.0
377 1.0 0.0
418 1.0 0.0
1236119 0.5 0.0
1236287 0.5 0.0
1236624 0.5 0.0
1236756 0.5 0.0
1236779 0.0 0.0

10565 rows × 2 columns

The apply() function allows you to aggregate using a lambda function, so you can create your own aggregate functions. Create a data set containing households and their income per person with:

acs['income'].groupby('household').apply(lambda x: x.sum()/x.size)
household
37          6666.666667
241        32500.000000
242        30000.000000
377        51900.000000
418        12200.000000
               ...     
1236119    46600.000000
1236287    13350.000000
1236624    31050.000000
1236756    55300.000000
1236779    22110.000000
Name: income, Length: 10565, dtype: float64

7.2.3.1 Exercise

Create a data set of households containing the education level of the most educated person in the household. Recall that edu is an ordered categorical variable, so you can apply max() to it.

7.2.4 Aggregating Using Named Aggregates

If you need to create a data set of level two units and level two variables that must be made using different aggregate functions, you can do so using named aggregates. A named aggregate allows you to specify the column to be aggregated, the aggregate function to use, and what the result should be called. This gives you complete control over the DataFrame to be created.

The Pandas NamedAgg() function takes two arguments: the column to act on and the aggregate function to use. To use them, pass them as key word arguments to the agg() function, with the name being the column to create in the resulting data set.

For example, suppose we need a data set of households with the variables ‘household income’, ‘number of people in the household’, and ‘proportion of the household that is female.’ We already have household_income as a level two variable, so we can just take its first value with the aggregate function first–but since household income will be the only income we care about in the data set of households, let’s just call it income. Number of people can be obtained from the aggregate function size, but remember it needs a single column, any column, to act on. And proportion female is just the mean of female. Thus the code is:

acs.groupby('household').agg(
    income=pd.NamedAgg('household_income', 'first'),
    size=pd.NamedAgg('income', 'size'),
    prop_female=pd.NamedAgg('female', 'mean')
)
income size prop_female
household
37 20000.0 3 1.0
241 32500.0 1 1.0
242 30000.0 1 1.0
377 51900.0 1 1.0
418 12200.0 1 1.0
1236119 93200.0 2 0.5
1236287 53400.0 4 0.5
1236624 62100.0 2 0.5
1236756 110600.0 2 0.5
1236779 22110.0 1 0.0

10565 rows × 3 columns

The naming of the new columns is a rare instance where column names do not go in quotes. That’s because they are key word arguments for the function agg().

7.2.4.1 Exercise

Create a data set with one row per household and variables for ‘at least one member of this household is Hispanic’ and ‘all the members of this household are Hispanic.’