2  DataFrames

Most data wrangling work in Python revolves around DataFrames, a class defined by the Pandas package. A DataFrame represents a classic data set organized as a table or matrix, with rows representing observations and columns representing variables or features. In this chapter we’ll learn the basics of working with DataFrames.

A Note on Nomenclature

In the statistics world, a variable is a property of an observation, such as a person’s age. As you’ve seen, in the programming world ‘variable’ has a much broader meaning. This has led the machine learning community to call variables features, or just columns. We will generally stick with variable, on the theory that it’s almost always clear from context whether we’re talking about a variable in the statistics sense or the programming sense.

2.1 DataFrame Fundamentals

We’ll begin by examining some of the fundamental properties of DataFrames.

2.1.1 Setting Up

Start up JupyterLab if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click File, New, Notebook and choose the Python 3 kernel. It will start out as Untitled.ipynb, so right-click on its tab and choose Rename Notebook to change it to DataFrames_Practice.ipynb.

We’ll start by importing the Pandas package and then reading in a simple data set from disk. acs_household.csv is one of the example files and contains a single household from the 2000 American Community Survey. It’s a comma-separated values (CSV) file, one of the most popular formats for data files. The read_csv() function in the Pandas package reads CSV files. The first argument for read_csv() is the file to be read, usually in the form of a string containing the file name. Since the file is in the same folder as your new Notebook, you don’t have to tell Python where to find it. read_csv() has many keyword arguments you can pass in to control how the file is read, and we’ll talk about how and why to use the most important ones in the next chapter. However, by default read_csv() will try to figure out the best way to read the file, and since this file is simple and has already been cleaned up it will be successful.

import pandas as pd
acs = pd.read_csv('acs_household.csv')
acs
household person age race marital_status edu income female hispanic
0 132531 1 47 Black Now married Associate degree 20000.0 0 0
1 132531 2 38 White Now married High School graduate 18500.0 1 0
2 132531 3 19 Black Never married 10th grade 5000.0 0 0
3 132531 4 18 Two or more races Never married 11th grade 2200.0 0 0
4 132531 5 16 Black Never married 9th grade 0.0 0 0
5 132531 6 1 Black NaN NaN NaN 0 0
6 132531 7 18 White Never married 9th grade 7900.0 1 0

The variables in this data set are:

  • household: identifies the household
  • person: identifies a person within the household
  • age: the person’s age in years
  • race: the person’s race
  • marital_status: the person’s marital status
  • edu: the person’s educational attainment
  • income: the person’s income
  • female: an indicator for whether the person is female or not, with 1 meaning yes, they are female
  • hispanic: an indicator for whether the person is Hispanic or not, with 1 meaning yes, they are Hispanic

2.1.2 Indexes

Note that I did not include in the list of variables the column of numbers you see on the left in the printout of acs. This column is the row index. One of the key features of a Pandas DataFrame is that it always has indexes for both the rows and columns. Like the index of a book, the indexes of a DataFrame tell you where to find a particular piece of information.
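If you ever want to look at a DataFrame’s indexes directly, they’re available as the index and columns attributes. A quick peek (no parentheses, since these are attributes rather than functions):

acs.index     # the row index: currently just the default row numbers
acs.columns   # the column index: the variable names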

Suppose I wanted to draw your attention to the number 19 in the acs DataFrame. One way I could do it would be to tell you to look at the number in the third row and the third column, what Python calls row 2 and column 2 since it numbers from 0. However, I could also tell you to look at the age variable of the row where person is 3. That’s because the variable names act as a column index, and in this data set person acts as a row index. While identifying rows and columns by number is sometimes useful, keep in mind that many operations will change those numbers. For example, if you dropped all the adults in the data set or sorted the data set by age, then person 3 would no longer be on row 2. Thus you’ll usually specify subsets of the data using indexes. We’ll talk more about identifying indexes and what they tell you about the structure of your data in First Steps With Your Data.

When we read in this data set using read_csv(), it created a default row index for us based on the row numbers. However, once the index is created we can change the order of the observations (using sort_values()) and the same index will still be associated with the same observation.

acs = acs.sort_values('age')
acs
household person age race marital_status edu income female hispanic
5 132531 6 1 Black NaN NaN NaN 0 0
4 132531 5 16 Black Never married 9th grade 0.0 0 0
3 132531 4 18 Two or more races Never married 11th grade 2200.0 0 0
6 132531 7 18 White Never married 9th grade 7900.0 1 0
2 132531 3 19 Black Never married 10th grade 5000.0 0 0
1 132531 2 38 White Now married High School graduate 18500.0 1 0
0 132531 1 47 Black Now married Associate degree 20000.0 0 0

This data set came with a row index: the person variable. We can tell Python to use it as the index with set_index():

acs = acs.set_index('person')
acs
household age race marital_status edu income female hispanic
person
6 132531 1 Black NaN NaN NaN 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
1 132531 47 Black Now married Associate degree 20000.0 0 0

Note that the original row numbers have disappeared, and person has been moved to the far left and its name moved down. That’s because it’s now the row index. Indexes are not variables: now that person is an index you can’t create a new variable that’s equal to person times 2, for example. That’s okay; identifiers like person generally don’t contain information. However, you can still do some things with an index, like sort by it.

acs = acs.sort_values('person')
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

This put the data back in the original order.

What if you really needed to do math with person? You can convert it from an index back to a variable by resetting the index:

acs.reset_index()
person household age race marital_status edu income female hispanic
0 1 132531 47 Black Now married Associate degree 20000.0 0 0
1 2 132531 38 White Now married High School graduate 18500.0 1 0
2 3 132531 19 Black Never married 10th grade 5000.0 0 0
3 4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
4 5 132531 16 Black Never married 9th grade 0.0 0 0
5 6 132531 1 Black NaN NaN NaN 0 0
6 7 132531 18 White Never married 9th grade 7900.0 1 0

This converts person to a variable and creates a new index based on row numbers, just like read_csv() did originally.
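As an aside, if you want to throw an index away entirely rather than turn it back into a variable, reset_index() takes a drop argument. We don’t need it here, but it would look like:

acs.reset_index(drop=True)   # person would be discarded, not turned into a column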

2.1.3 Assignment vs. Changing in Place

Hopefully you noticed that most of the code cells thus far have started with acs = but the last one did not. That’s why if I ask Python to print the acs DataFrame, person is still the index:

acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

When I ran acs.reset_index(), the reset_index() function returned a new copy of acs with the index reset. But I did not store this new DataFrame as acs like in the previous cells. Because I did not store it, JupyterLab carried out an implicit print and showed me the modified copy. But the original acs was unchanged.
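If I had wanted to keep the change, I would have assigned the result to a name, either a new one or acs itself. We won’t run this here, because we want person to stay the index, but it would look like:

acs = acs.reset_index()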

You can use the inplace=True argument to tell a function to change the DataFrame it acts on. To demonstrate, sort the acs DataFrame by age using inplace=True, and just to keep things interesting also add ascending=False so it’s sorted in descending order. Since this function call is getting long, put each argument on its own line:

acs.sort_values(
    'age', 
    ascending=False, 
    inplace=True
)
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

You might think that using inplace=True would save memory because the function can change the existing DataFrame rather than making a new copy. In reality, most of the time the function creates a new copy and then puts it in the place of the old one anyway. So we’ll rarely use inplace=True, though you may see it in other people’s code. But we will frequently use DataFrame functions without changing the original DataFrame.
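For the record, the assignment version of the sort we just ran looks like this (running it is harmless, since acs is already in this order):

acs = acs.sort_values(
    'age',
    ascending=False
)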

2.1.4 Copies vs. Views

If you create a new DataFrame by setting it equal to another DataFrame, the result may not be what you expect.

acs2 = acs
acs2['hispanic'] = 1
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

The line acs2['hispanic'] = 1 told Python to make everyone in the acs2 DataFrame Hispanic, but it made everyone in the original acs DataFrame Hispanic as well. Why?

The reason is that the line acs2 = acs did not actually create a new DataFrame. It just gave the DataFrame acs a second name, acs2. Both names point to the same location in the computer’s memory. Thus if you make a change while referring to the DataFrame by one name you’ll see it when you refer to it by the other name as well.

Usually this is not what you want! If you want to make a new DataFrame that is a copy of an existing DataFrame but separate from it, use the copy() function. You will almost always want to pass in the argument deep=True (it is actually the default, but being explicit makes your intent clear). This tells copy() to make a deep copy, meaning that it copies both the structure and the values of the original DataFrame.

To demonstrate, make acs3 a deep copy of the original acs DataFrame, change everyone in acs back to being not Hispanic, and observe that that did not change acs3:

acs3 = acs.copy(deep=True)
acs['hispanic'] = 0
acs3
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

That’s because acs and acs3 are now separate DataFrames. The change did affect acs2, because it’s still just another name for acs:

acs2
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

DataFrames like acs2 are sometimes called views: they’re another way of viewing the same data. Views can sometimes be useful because they save memory, but they’re also confusing. We mention them here mostly as something to avoid. The same problem will occur if you create a new DataFrame that is a subset of another DataFrame, so use .copy(deep=True) there as well.
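For example, here’s a minimal sketch of creating an independent subset. It uses condition-based subsetting, which we’ll cover later in this chapter, and acs_minors is just a name made up for this example:

acs_minors = acs[acs['age']<18].copy(deep=True)   # a separate DataFrame, not a view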

2.2 Specifying Subsets of DataFrames

A critical Pandas skill is specifying a subset of a DataFrame. For better or worse, you have lots of options for doing so.

2.2.1 Subsetting with Square Brackets

Recall that you could select items from lists, tuples, and dictionaries using square brackets. You can use them with DataFrames as well.

If you put the name of a column in the square brackets it will select that column:

acs['age']
person
1    47
2    38
3    19
4    18
7    18
5    16
6     1
Name: age, dtype: int64

Why acs['age'] rather than acs[age]? Because age without quotes would refer to a variable called age. You can use variables in subsets, and sometimes it’s very useful:

column_to_select = 'age'
acs[column_to_select]
person
1    47
2    38
3    19
4    18
7    18
5    16
6     1
Name: age, dtype: int64

But most of the time you’ll refer to a column name directly by putting it in quotes. Get used to typing lots of quotes in Python.

To select multiple columns, put them in a list:

acs[['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade
4 18 11th grade
7 18 9th grade
5 16 9th grade
6 1 NaN

Note that there’s no special meaning to double square brackets ([[ and ]]). The outer [] is for identifying a subset, and the inner [] is for defining a list.

Why does this look so much prettier than when we only selected age? Remember that each column in a DataFrame is a Series. If you select one column from a DataFrame, the result is a Series. When you select two or more columns, the result is a DataFrame. JupyterLab puts extra effort into printing DataFrames so they look nice, but it doesn’t bother with a Series. Note that both the DataFrame and the Series get an index that is copied from the original DataFrame. This allows Pandas to align the result with the original DataFrame if you use them together.
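If you’re ever unsure which kind of object you have, the type() function will tell you:

type(acs['age'])     # pandas.core.series.Series
type(acs[['age']])   # pandas.core.frame.DataFrame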

Exercise

Select and print the race, female, and hispanic columns from the acs DataFrame. (Yes, implicit print is fine.)

acs[['race', 'female', 'hispanic']]
race female hispanic
person
1 Black 0 0
2 White 1 0
3 Black 0 0
4 Two or more races 0 0
7 White 1 0
5 Black 0 0
6 Black 0 0

2.2.2 Subsetting with iloc

The iloc function allows you to select subsets by row and column number. It’s not quite a conventional function in that the arguments go in square brackets rather than parentheses.

If you give iloc a single number, it will select the corresponding row:

acs.iloc[0]
household                   132531
age                             47
race                         Black
marital_status         Now married
edu               Associate degree
income                     20000.0
female                           0
hispanic                         0
Name: 1, dtype: object

The result is a Series again, but since it’s going across a row the index of the Series is the column names from the original DataFrame.

If you give iloc two numbers, it will select the value of the corresponding row and column:

acs.iloc[0, 0]
132531

This gave us the household number of the first person, because that is the value in the first row and first column (remember Python counts from zero and does not count the row index).

You can also use slices with iloc:

acs.iloc[1:3, 0:4]
household age race marital_status
person
2 132531 38 White Now married
3 132531 19 Black Never married

Slices do not include their endpoint, so this gave us rows one and two (i.e. skipping the first row, row zero) and columns zero through three.
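iloc also follows the other usual Python slicing conventions. Two examples worth knowing:

acs.iloc[-1]      # negative numbers count from the end: the last row
acs.iloc[:, 2]    # a bare colon means 'everything': the race column for all rows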

Exercise

Select and print the race of the first person in the acs DataFrame using iloc.

acs.iloc[0, 2]
'Black'

race is the third column in acs, making it column 2 by Python counting.

2.2.3 Subsetting with loc

The loc function allows you to subset using the DataFrame’s indexes. If you give it a value of the row index it will select that row:

acs.loc[1]
household                   132531
age                             47
race                         Black
marital_status         Now married
edu               Associate degree
income                     20000.0
female                           0
hispanic                         0
Name: 1, dtype: object

Compare that with selecting row 1 (i.e. the second row) with iloc:

acs.iloc[1]
household                       132531
age                                 38
race                             White
marital_status             Now married
edu               High School graduate
income                         18500.0
female                               1
hispanic                             0
Name: 2, dtype: object

If you give it a value of the row index and a value of the column index, i.e. a column name, it will select that value:

acs.loc[1, 'household']
132531

You can select multiple rows and/or columns with lists:

acs.loc[[1, 2, 3], ['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade

If the row indexes are integers, as they often are, you can also use a slice. Just to make things confusing, with loc the endpoint of the slice is included:

acs.loc[1:3, 'age']
person
1    47
2    38
3    19
Name: age, dtype: int64

The reason slices behave differently with loc is that their endpoints are really index values, not positions. Try selecting persons 3 through 5:

acs.loc[3:5]
household age race marital_status edu income female hispanic
person
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0

Recall that we sorted the data by age (descending) not person number, so person 7 ended up between person 4 and person 5. To loc, the slice 3:5 means ‘from person 3 to person 5’ and that includes person 7.

To select all rows or all columns, use : without specifying a beginning or ending point:

acs.loc[:, ['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade
4 18 11th grade
7 18 9th grade
5 16 9th grade
6 1 NaN
acs.loc[[1, 2], :]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0

But remember that passing in just a row is equivalent to selecting all columns:

acs.loc[[1, 2]]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
Exercise

Select and print the income and marital_status of persons 3, 4, and 5 in the acs DataFrame using loc.

acs.loc[[3, 4, 5], ['income', 'marital_status']]
income marital_status
person
3 5000.0 Never married
4 2200.0 Never married
5 0.0 Never married

Since we did not want to include person 7, we had to use a list rather than a slice.

2.2.4 Subsetting by Conditions

Both plain square brackets and loc can select the rows that meet a condition. For example, to select the children in the household, use:

acs[acs['age']<18]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Let’s take that apart a bit. acs['age']<18 gives us a Series:

acs['age']<18
person
1    False
2    False
3    False
4    False
7    False
5     True
6     True
Name: age, dtype: bool

You could store and pass in that Series:

is_child = acs['age']<18
acs[is_child]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Pandas used the indexes to ensure the right rows of is_child were aligned with the right rows of acs.

loc works the same way:

acs.loc[acs['age']<18]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
acs.loc[is_child]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

However, with loc you can still add a second argument to select columns:

acs.loc[acs['age']<18, ['age', 'edu']]
age edu
person
5 16 9th grade
6 1 NaN

2.2.4.1 Comparison Operators

The comparison operators you can use to create conditions are:

== (equals)

!= (not equals)

> (greater than)

< (less than)

>= (greater than or equal to)

<= (less than or equal to)
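For example, != selects the people whose race is anything other than White:

acs[acs['race']!='White']   # persons 1, 3, 4, 5, and 6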

~ means ‘not’, and can be used to negate the following condition. Thus:

acs[~is_child]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Comparison operators have rules for which operators are evaluated first just like arithmetic operators, and sometimes they may not be what you expect. Take:

acs[~acs['age']<18]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Why isn’t ~acs['age']<18 equivalent to ~is_child? Because ~ is evaluated before <.

~is_child

is equivalent to:

~(acs['age']<18)

but:

~acs['age']<18

is equivalent to:

(~acs['age'])<18

Applying the ~ operator to a number rather than a condition gives results that are unintuitive to say the least: for integers, ~ is a bitwise operation, and ~x is equal to -x - 1.

~acs['age']
person
1   -48
2   -39
3   -20
4   -19
7   -19
5   -17
6    -2
Name: age, dtype: int64

Once age is negative, it’s always less than 18.

The solution is to use parentheses generously to make sure everything is evaluated in the order you want.

acs[~(acs['age']<18)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
Exercise

Select and print edu for the Black people in the acs DataFrame. (Hint: race is a string and values of a string variable, like ‘Black’, go in quotes.)

acs.loc[acs['race']=='Black', 'edu']
person
1    Associate degree
3          10th grade
5           9th grade
6                 NaN
Name: edu, dtype: object

2.2.4.2 Combining Conditions

To combine conditions, use logical and (&) or logical or (|). With logical and, the result is True if and only if both conditions are True; in terms of set theory, it is the intersection. With logical or, the result is True if either condition is True (or both); in terms of set theory, it is the union.

For example, to identify the people who are adults and have an income greater than $5,000, use:

acs[(acs['age']>=18) & (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Yes, the parentheses are essential.

To identify the people who are either adults or have an income greater than $5,000, use:

acs[(acs['age']>=18) | (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
Exercise

Select and print the adult females in the data set.

acs[(acs['female']==1) & (acs['age']>=18)]
household age race marital_status edu income female hispanic
person
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0

2.2.4.3 Queries

A DataFrame or Series has a query() function that allows you to pass in a string describing a condition, and it will return only those rows where the condition is True. This is essentially the same as using square brackets or loc with a condition. However, with query(), the condition can only involve columns from the DataFrame or Series itself. This limitation turns out to be an advantage: because all variables must be part of the DataFrame, you don’t have to specify the DataFrame name to use them. You don’t even have to put them in quotes.

Recall that we selected the people who were adults and had incomes greater than $5,000 with:

acs[(acs['age']>=18) & (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0

The query() equivalent is:

acs.query('(age>=18) & (income>5000)')
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0

The disadvantage of query() is that it’s usually slightly slower than square brackets or loc. That may be important in some applications. Also, you can’t use query() in an assignment statement to change the values of some rows and not others, because that leads to confusion about copies and views again. The advantage of query() is its less fussy syntax: less typing means fewer errors to correct.
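One refinement: if you really do need an outside variable in a query, you can reference it by prefixing its name with @. For example, with cutoff as an ordinary Python variable:

cutoff = 18
acs.query('age>=@cutoff')   # same rows as acs[acs['age']>=18]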

Now consider selecting all the people who are White with query(). The trouble is that the value of interest, 'White', is a string and needs to go in quotes. The query string 'race=='White'' won’t work because Python will assume the quote at the start of 'White' is the end of the query string. The solution is to put double quotes around "White". Python is fine with strings starting with either single or double quotes, but it will expect the string to end the same way. Thus if the query starts with a single quote, Python knows the double quotes around "White" do not mean the query is ending:

acs.query('race=="White"')
household age race marital_status edu income female hispanic
person
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0
Exercise

Select and print the adult males using query().

acs.query('age>=18 & female==0')
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0

2.2.5 Which Subsetting Method Should You Use?

Plain square brackets, loc, and iloc all do basically the same thing, so which should you use?

Subsets specified by iloc are always precise and unambiguous, but they make the code hard to read because you have to know exactly what the data set looks like to know what they mean. We’ll rarely use iloc.

Plain square brackets are simple and convenient, but they can’t do as much. Also, because they’re used with so many different kinds of objects, the underlying code isn’t optimized for DataFrames. Avoid them whenever speed is a major consideration.

loc is almost as easy to use as plain square brackets, is much more capable, and is optimized for performance. Plus, the resulting code is easier to understand than with iloc.

query() is very convenient for subsets based on a condition, but performance is a concern.

In this book we’ll normally use plain square brackets for simple subsets, especially selecting a single column, and loc for anything more complicated. We’ll also use query() occasionally so you get comfortable with both ways of subsetting based on conditions.

Important

When you use plain square brackets, the argument inside the brackets answers the question ‘Which columns?’ When you use loc or iloc with one argument, the argument answers the question ‘Which rows?’ If you use two arguments with loc or iloc they answer the questions ‘Which rows?’ and ‘Which columns?’ respectively.
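In code form:

acs['age']          # plain brackets: which columns?
acs.loc[3]          # loc with one argument: which rows?
acs.loc[3, 'age']   # loc with two arguments: which rows? which columns?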

2.2.6 Function Chaining

Suppose you wanted to create a new DataFrame containing just the teens from this data set. You can identify the correct subset with:

acs[(acs['age']>12) & (acs['age']<20)]
household age race marital_status edu income female hispanic
person
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0

If you assign this to a new variable, it’s actually ambiguous whether it’s a new DataFrame or a view of acs, and that will get you into trouble if you try to change it. The solution, as mentioned earlier, is to make a copy. But how do you call the copy function of a subset of a DataFrame?

The key insight is that the thing you get when you run acs[(acs['age']>12) & (acs['age']<20)] is a DataFrame and has all the functions of a DataFrame. Thus you can run:

teens = acs[(acs['age']>12) & (acs['age']<20)].copy(deep=True)

In fact, what copy(deep=True) returns is another DataFrame, and you can call its functions as well. And if a function returns a different kind of object, you can call its functions. That means that if you need to carry out multiple tasks that can all be done by calling functions, you can often combine them into a chain of function calls. For example, if you wanted to create teens as above and then sort it by age in ascending order, you can do that with:

teens = (
    acs[(acs['age']>12) & (acs['age']<20)].
    copy(deep=True).
    sort_values('age')
)
teens
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
Exercise

Create a DataFrame containing just the males in the household. Then sort it in descending order by income and reset its index so Python creates a new index. The new index will then be the person’s ranking by income (but starting from zero). Carry out all these tasks in a single function chain.

males = (
    acs[acs['female']==0].
    copy(deep=True).
    sort_values('income', ascending=False).
    reset_index()
)
males
person household age race marital_status edu income female hispanic
0 1 132531 47 Black Now married Associate degree 20000.0 0 0
1 3 132531 19 Black Never married 10th grade 5000.0 0 0
2 4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
3 5 132531 16 Black Never married 9th grade 0.0 0 0
4 6 132531 1 Black NaN NaN NaN 0 0

2.3 Changing DataFrames

You can change the data in a DataFrame by assigning new values to it. You’ll use the subsetting tools from the previous section to select which parts of the DataFrame will be changed. To create a new column, select it as a subset of the DataFrame as if it already existed and then assign a value to it.

2.3.1 Numeric Variables

The incomes in this data set date from the year 2000, so to correct for inflation and convert them to January 2023 dollars we need to multiply them by 1.77:

acs['income_2023'] = acs['income']*1.77
acs
household age race marital_status edu income female hispanic income_2023
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0
6 132531 1 Black NaN NaN NaN 0 0 NaN

Note how the income of person 6 is NaN, or ‘Not a Number’. This is how Python stores missing values. Attempting to multiply NaN by 1.77 results in NaN, so income_2023 is automatically missing as well. This is what we want and expect.

Exercise

Go to the BLS inflation calculator and see what you need to multiply incomes from 2000 by to get dollars in the most recent month that has data (assume for simplicity that the incomes are from January 2000). Create a column in the acs DataFrame to store income in current dollars.

As of this writing, if you put a 1 in the top box of the BLS calculator web page and click Calculate it will tell you that $1.00 in January 2000 had the same buying power as $1.85 in March 2024 (the most recent month for which CPI data are available). That makes 1.85 the current conversion factor.

acs['income_current'] = acs['income']*1.85
acs
household age race marital_status edu income female hispanic income_2023 income_current
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 37000.0
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 34225.0
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 9250.0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 4070.0
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 14615.0
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 0.0
6 132531 1 Black NaN NaN NaN 0 0 NaN NaN

Of course you’ll probably get a different number.

2.3.2 Indicator (Boolean) Variables

Now create an indicator variable for ‘this person is Black’. Since you’re setting the new variable equal to a condition, it will get True if the condition is true and False if it is false:

acs['black'] = (acs['race']=='Black')
acs[['race', 'black']]
race black
person
1 Black True
2 White False
3 Black True
4 Two or more races False
7 White False
5 Black True
6 Black True

In this case the parentheses are purely for readability.

Exercise

Create an indicator variable for ‘this person is an adult.’

acs['adult'] = (acs['age']>=18)
acs[['age', 'adult']]
age adult
person
1 47 True
2 38 True
3 19 True
4 18 True
7 18 True
5 16 False
6 1 False

2.3.3 Missing Values (NaN)

Next suppose we want to create an indicator variable has_income which tells us if the person’s income is greater than zero. This seems like a straightforward extension of what we just did:

acs['has_income'] = (acs['income'] > 0)

The problem is person 6 and their missing value for income. Because we don’t know their income, we don’t know what their value of has_income should be. (We could make a pretty good guess based on their age, but we don’t know.) So the right value of has_income for person 6 is NaN. There are just two problems with that: Python won’t automatically give us that answer, and it won’t automatically store it.

What happens if we do run that code?

acs['has_income'] = (acs['income'] > 0)
acs[['income', 'has_income']]
income has_income
person
1 20000.0 True
2 18500.0 True
3 5000.0 True
4 2200.0 True
7 7900.0 True
5 0.0 False
6 NaN False

has_income looks fine for all the observations except person 6, where we got False rather than NaN. The problem here is that the result of a comparison like (acs['income'] > 0) must be either True or False. NaN is not an option.

Different languages handle missing values in conditions in different ways, and they all have their advantages and disadvantages. Python’s is at least easy to remember: any condition involving NaN returns False.

NaN is defined by the numpy package, so to talk about it directly we need to import numpy. The standard abbreviation for it is np:

import numpy as np

Now compare acs['income'] to np.NaN and note the results:

acs['income']==np.NaN
person
1    False
2    False
3    False
4    False
7    False
5    False
6    False
Name: income, dtype: bool

income is NaN for person 6, but it doesn’t matter: the rule that any condition involving NaN returns False takes precedence. Even comparing NaN to itself returns False:

np.NaN==np.NaN
False

If you think of NaN as representing unknown quantities this makes some sense: two unknown quantities are very unlikely to be identical.

To detect missing values, instead use the isna() function:

acs['income'].isna()
person
1    False
2    False
3    False
4    False
7    False
5    False
6     True
Name: income, dtype: bool

You can also use notna() to identify valid values.
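For example, notna() gives you an easy way to keep just the rows with valid incomes:

acs[acs['income'].notna()]   # everyone except person 6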

With isna() in your toolkit, you can tell Pandas to change the value of has_income to np.NaN for all the rows where income is missing with code like:

acs.loc[acs['income'].isna(), 'has_income'] = np.NaN

There’s just one problem: a strictly Boolean variable can only store True or False, not NaN. The solution is to convert has_income from the default bool type to the boolean type. boolean is a nullable boolean, meaning it can store NaN as well.
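A quick way to see the difference is to build a small boolean Series by hand (just a sketch, with None standing in for a missing value):

pd.Series([True, False, None], dtype='boolean')   # stores True, False, <NA>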

In previous versions of Pandas, putting NaN in a variable of type bool would prompt Pandas to silently promote the variable to type boolean, but the Pandas developers decided this was convenient but dangerous. After all, bool and boolean sometimes behave differently. So they want you to explicitly convert has_income to boolean using the astype() function before you put NaN in it:

acs['has_income'] = acs['has_income'].astype('boolean')
acs.loc[acs['income'].isna(), 'has_income'] = np.NaN
acs[['income', 'has_income']]
income has_income
person
1 20000.0 True
2 18500.0 True
3 5000.0 True
4 2200.0 True
7 7900.0 True
5 0.0 False
6 NaN <NA>

(Technically you’ve ended up with the Pandas NA rather than the NumPy NaN, but that’s not a distinction you generally have to worry about. Pandas NA is officially experimental but almost always behaves the same as NumPy’s NaN.)

As of Pandas 2.1, skipping the conversion works (has_income will still be automatically promoted) but you’ll get a warning. In a future version it will crash.

There’s actually a similar issue with integer variable types, as np.NaN is technically a floating point number. However, converting integers to floating point numbers is not considered a promotion, so setting some values of an integer variable to NaN results in the variable being converted to floating point without any fuss.

Creating has_income with some incorrect values and then immediately correcting them is not a bad approach. But let’s talk about a couple of ways you could create it with all the right values from the start.

2.3.3.1 Subsets and New Variables

One way takes advantage of the fact that a brand new variable essentially starts out with all its values set to NaN. If you use a subset to only set values for some observations when you create it, the observations that are left out will get NaN:

acs.loc[acs['income'].notna(), 'has_income2'] = (acs['income']>0)
acs[['income', 'has_income2']]
income has_income2
person
1 20000.0 True
2 18500.0 True
3 5000.0 True
4 2200.0 True
7 7900.0 True
5 0.0 False
6 NaN NaN

Because the new variable is only set to the result of (acs['income']>0) for rows where income has a valid value (acs['income'].notna()), rows where income is missing get NaN for has_income2. Just remember that this only works for brand new variables. If has_income2 already exists, the subset means that rows where income is missing keep whatever value of has_income2 they had before.

2.3.3.2 The where() Function

Another way to do this relies on a very useful NumPy function called where(). It takes three arguments:

  1. A condition (something that is either true or false)
  2. The result if the condition is true
  3. The result if the condition is false

For example:

np.where(
    acs['age']>=18,
    'Adult',
    'Child'
)
array(['Adult', 'Adult', 'Adult', 'Adult', 'Adult', 'Child', 'Child'],
      dtype='<U5')

The result is an array rather than a Series because where() is a NumPy function and Series is a Pandas class, but Pandas will take care of converting arrays to Series as needed.

Our rules for has_income are:

  • if income is not missing, set it to the result of acs['income']>0
  • if income is missing, set it to np.NaN

We can do this using where():

acs['has_income3'] = np.where(
    acs['income'].notna(),
    acs['income']>0,
    np.NaN
)
acs[['income', 'has_income3']]
income has_income3
person
1 20000.0 1.0
2 18500.0 1.0
3 5000.0 1.0
4 2200.0 1.0
7 7900.0 1.0
5 0.0 0.0
6 NaN NaN

There’s just one wrinkle: has_income3 was set to 1.0 or 0.0 instead of True or False! It’s the problem of variable type again: NumPy sees that the where() function is going to return True, False, and NaN, but it doesn’t know about the Pandas boolean type. So it decides it can store all three as floating point numbers if it converts True to 1.0 and False to 0.0. Converting True and False to 1 and 0 is quite useful at times (we’ll use it a lot with aggregate functions and hierarchical data), but it’s not what we want here.
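As a taste of why that conversion is handy: because True counts as 1 and False as 0, you can count or average an indicator directly with the standard sum() and mean() functions:

(acs['age']>=18).sum()    # 5, the number of adults
(acs['age']>=18).mean()   # about 0.71, the proportion of adults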

The solution is to tell where() to use the boolean type:

acs['has_income3'] = np.where(
    acs['income'].notna(),
    (acs['income']>0).astype('boolean'),
    np.NaN
)
acs[['income', 'has_income3']]
income has_income3
person
1 20000.0 True
2 18500.0 True
3 5000.0 True
4 2200.0 True
7 7900.0 True
5 0.0 False
6 NaN NaN
Exercise

Create an indicator variable for ‘this person is married.’ For this exercise, do not assume that the one-year-old cannot possibly be married.

You only need to do this once, but we’ll do it both ways for practice. First using a subset on variable creation:

acs.loc[acs['marital_status'].notna(), 'married1'] = (acs['marital_status']=='Now married')

Then using where():

acs['married2'] = np.where(
    acs['marital_status'].notna(), 
    (acs['marital_status']=='Now married').astype('boolean'),
    np.NaN
)

acs[['marital_status', 'married1', 'married2']]
marital_status married1 married2
person
1 Now married True True
2 Now married True True
3 Never married False False
4 Never married False False
7 Never married False False
5 Never married False False
6 NaN NaN NaN

2.3.3.3 Three-Valued Logic

One difference between bool and boolean is that boolean uses three-valued logic (specifically, Kleene logic) for combining conditions that could include NaN. It’s three-valued because the result could be True, False, or NaN. Conceptually, you replace the NaN value with first True and then False. If the result is the same either way, then that’s the result. If the result is not the same, then the result is NaN. For example:

acs['black'] | acs['has_income']
person
1    True
2    True
3    True
4    True
7    True
5    True
6    True
dtype: boolean

The result is True for person 6 because black is True, and with | (logical or) only one of the two values needs to be True for the result to be True. Compare with:

acs['black'] & acs['has_income']
person
1     True
2    False
3     True
4    False
7    False
5    False
6     <NA>
dtype: boolean

Now the result for person 6 is NaN, because the result would be True if has_income were True, but False if has_income were False. Thus we don’t know the correct result.
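You can see the same rules at work with bare pd.NA values, Pandas’ own missing value marker:

True | pd.NA    # True: the result is True no matter what the missing value is
True & pd.NA    # <NA>: the result depends on the missing value
False & pd.NA   # False: the result is False no matter what the missing value is
False | pd.NA   # <NA>: the result depends on the missing value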