Most data wrangling work in Python revolves around DataFrames, a class defined by the Pandas package. A DataFrame represents a classic data set organized as a table or matrix, with rows representing observations and columns representing variables or features. In this chapter we’ll learn the basics of working with DataFrames.
A Note on Nomenclature
In the statistics world, a variable is a property of an observation, such as a person's age. As you've seen, in the programming world 'variable' has a much broader meaning. This has led the machine learning community to call variables features, or just columns. We will generally stick with variable, on the theory that it's almost always clear from context whether we're talking about a variable in the statistics sense or the programming sense.
2.1 DataFrame Fundamentals
We’ll begin by examining some of the fundamental properties of DataFrames.
2.1.1 Setting Up
Start up JupyterLab if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click File, New, Notebook and choose the Python 3 kernel. It will start out as Untitled.ipynb so right-click on its tab and choose Rename Notebook to change it to DataFrames_Practice.ipynb.
We'll start by importing the Pandas package and then reading in a simple data set from disk. acs_household.csv is one of the example files and contains a single household from the 2000 American Community Survey. It's a comma-separated values or CSV file, one of the most popular types of data files. The read_csv() function in the Pandas package reads CSV files. The first argument for read_csv() is the file to be read, usually in the form of a string containing the file name. Since the file is in the same folder as your new Notebook, you don't have to tell Python where to find it. read_csv() has many keyword arguments you can pass in to control how the file is read, and we'll talk about how and why to use the most important ones in the next chapter. However, by default read_csv() will try to figure out the best way to read the file, and since this file is simple and has already been cleaned up it will be successful.
import pandas as pd

acs = pd.read_csv('acs_household.csv')
acs
   household  person  age               race marital_status                   edu   income  female  hispanic
0     132531       1   47              Black    Now married      Associate degree  20000.0       0         0
1     132531       2   38              White    Now married  High School graduate  18500.0       1         0
2     132531       3   19              Black  Never married            10th grade   5000.0       0         0
3     132531       4   18  Two or more races  Never married            11th grade   2200.0       0         0
4     132531       5   16              Black  Never married             9th grade      0.0       0         0
5     132531       6    1              Black            NaN                   NaN      NaN       0         0
6     132531       7   18              White  Never married             9th grade   7900.0       1         0
The variables in this data set are:
household: identifies the household
person: identifies a person within the household
age: the person’s age in years
race: the person’s race
marital_status: the person’s marital status
edu: the person's educational attainment
income: the person's income
female: an indicator for whether the person is female or not, with 1 meaning yes, they are female
hispanic: an indicator for whether the person is Hispanic or not, with 1 meaning yes, they are Hispanic
2.1.2 Indexes
Note that I did not include in the list of variables the column of numbers you see on the left in the printout of acs. This column is the row index. One of the key features of a Pandas DataFrame is that it always has indexes for both the rows and columns. Like the index of a book, the indexes of a DataFrame tell you where to find a particular piece of information.
Suppose I wanted to draw your attention to the number 19 in the acs DataFrame. One way I could do it would be to tell you to look at the number in the third row and the third column, what Python calls row 2 and column 2 since it numbers from 0. However, I could also tell you to look at the age variable of the row where person is 3. That’s because the variable names act as a column index, and in this data set person acts as a row index. While identifying rows and columns by number is sometimes useful, keep in mind that many operations will change those numbers. For example, if you dropped all the adults in the data set or sorted the data set by age, then person 3 would no longer be on row 2. Thus you’ll usually specify subsets of the data using indexes. We’ll talk more about identifying indexes and what they tell you about the structure of your data in First Steps With Your Data.
When we read in this data set using read_csv(), it created a default row index for us based on the row numbers. However, once the index is created we can change the order of the observations (using sort_values()) and the same index will still be associated with the same observation.
acs = acs.sort_values('age')
acs
   household  person  age               race marital_status                   edu   income  female  hispanic
5     132531       6    1              Black            NaN                   NaN      NaN       0         0
4     132531       5   16              Black  Never married             9th grade      0.0       0         0
3     132531       4   18  Two or more races  Never married            11th grade   2200.0       0         0
6     132531       7   18              White  Never married             9th grade   7900.0       1         0
2     132531       3   19              Black  Never married            10th grade   5000.0       0         0
1     132531       2   38              White    Now married  High School graduate  18500.0       1         0
0     132531       1   47              Black    Now married      Associate degree  20000.0       0         0
This data set came with a row index: the person variable. We can tell Python to use it as the index with set_index():
acs = acs.set_index('person')
acs
        household  age               race marital_status                   edu   income  female  hispanic
person
6          132531    1              Black            NaN                   NaN      NaN       0         0
5          132531   16              Black  Never married             9th grade      0.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
Note that the original row numbers have disappeared, and person has been moved to the far left with its name on a line of its own. That's because it's now the row index. Indexes are not variables: now that person is an index you can't create a new variable that's equal to person times 2, for example. That's okay, since identifiers like person generally don't contain information. However, you can do some things with an index, like sort by it.
acs = acs.sort_values('person')
acs
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
5          132531   16              Black  Never married             9th grade      0.0       0         0
6          132531    1              Black            NaN                   NaN      NaN       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
This put the data back in the original order.
What if you really needed to do math with person? You can convert it from an index back to a variable by resetting the index:
acs.reset_index()
   person  household  age               race marital_status                   edu   income  female  hispanic
0       1     132531   47              Black    Now married      Associate degree  20000.0       0         0
1       2     132531   38              White    Now married  High School graduate  18500.0       1         0
2       3     132531   19              Black  Never married            10th grade   5000.0       0         0
3       4     132531   18  Two or more races  Never married            11th grade   2200.0       0         0
4       5     132531   16              Black  Never married             9th grade      0.0       0         0
5       6     132531    1              Black            NaN                   NaN      NaN       0         0
6       7     132531   18              White  Never married             9th grade   7900.0       1         0
This converts person to a variable and creates a new index based on row numbers, just like read_csv() did originally.
2.1.3 Assignment vs. Changing in Place
Hopefully you noticed that most of the code cells thus far have started with acs = but the last one did not. That’s why if I ask Python to print the acs DataFrame, person is still the index:
acs
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
5          132531   16              Black  Never married             9th grade      0.0       0         0
6          132531    1              Black            NaN                   NaN      NaN       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
When I ran acs.reset_index(), the reset_index() function returned a new copy of acs with the index reset. But I did not store this new DataFrame as acs like in the previous cells. Because I did not store it, JupyterLab carried out an implicit print and showed me the modified copy. But the original acs was unchanged.
You can use the inplace=True argument to tell a function to change the DataFrame it acts on. To demonstrate, sort the acs DataFrame by age using inplace=True, and just to keep things interesting also add ascending=False so it’s sorted in descending order. Since this function call is getting long, put each argument on its own line:
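acs.sort_values(
    'age',
    ascending=False,
    inplace=True
)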
You might think that using inplace=True would save memory, since the function could change the existing DataFrame rather than making a new copy. In reality, most of the time the function creates a new copy and then puts it in the place of the old one anyway. So we'll rarely use inplace=True, though you may see it in other people's code. But we will frequently use DataFrame functions without changing the original DataFrame.
2.1.4 Copies vs. Views
If you create a new DataFrame by setting it equal to another DataFrame, the result may not be what you expect.
acs2 = acs
acs2['hispanic'] = 1
acs
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         1
2          132531   38              White    Now married  High School graduate  18500.0       1         1
3          132531   19              Black  Never married            10th grade   5000.0       0         1
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         1
7          132531   18              White  Never married             9th grade   7900.0       1         1
5          132531   16              Black  Never married             9th grade      0.0       0         1
6          132531    1              Black            NaN                   NaN      NaN       0         1
The line acs2['hispanic'] = 1 told Python to make everyone in the acs2 DataFrame Hispanic, but it made everyone in the original acs DataFrame Hispanic as well. Why?
The reason is that the line acs2 = acs did not actually create a new DataFrame. It just gave the DataFrame acs a second name, acs2. Both names point to the same location in the computer’s memory. Thus if you make a change while referring to the DataFrame by one name you’ll see it when you refer to it by the other name as well.
Usually this is not what you want! If you want to make a new DataFrame that is a copy of an existing DataFrame but separate from it, use the copy() function. You will almost always want to pass in the argument deep=True. This tells copy() to make a deep copy, meaning that it should copy both the structure and values of the original DataFrame.
To demonstrate, make acs3 a deep copy of the original acs DataFrame, change everyone in acs back to being not Hispanic, and observe that that did not change acs3:
acs3 = acs.copy(deep=True)
acs['hispanic'] = 0
acs3
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         1
2          132531   38              White    Now married  High School graduate  18500.0       1         1
3          132531   19              Black  Never married            10th grade   5000.0       0         1
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         1
7          132531   18              White  Never married             9th grade   7900.0       1         1
5          132531   16              Black  Never married             9th grade      0.0       0         1
6          132531    1              Black            NaN                   NaN      NaN       0         1
That’s because acs and acs3 are now separate DataFrames. The change did affect acs2, because it’s still just another name for acs:
acs2
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
5          132531   16              Black  Never married             9th grade      0.0       0         0
6          132531    1              Black            NaN                   NaN      NaN       0         0
DataFrames like acs2 are sometimes called views: they’re another way of viewing the same data. Views can sometimes be useful because they save memory, but they’re also confusing. We mention them here mostly as something to avoid. The same problem will occur if you create a new DataFrame that is a subset of another DataFrame, so use .copy(deep=True) there as well.
2.2 Specifying Subsets of DataFrames
A critical Pandas skill is specifying a subset of a DataFrame. For better or worse, you have lots of options for doing so.
2.2.1 Subsetting with Square Brackets
Recall that you could select items from lists, tuples, and dictionaries using square brackets. You can use them with DataFrames as well.
If you put the name of a column in the square brackets it will select that column:
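acs['age']
person
1    47
2    38
3    19
4    18
7    18
5    16
6     1
Name: age, dtype: int64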
Why acs['age'] rather than acs[age]? Because age without quotes would refer to a variable called age. You can use variables in subsets, and sometimes it’s very useful:
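For example (a sketch; the name my_var is just for illustration):

my_var = 'age'   # store a column name in a Python variable
acs[my_var]      # selects the age column, same as acs['age']

The point is that the value inside the brackets can come from anywhere, including another variable.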
But most of the time you’ll refer to a column name directly by putting it in quotes. Get used to typing lots of quotes in Python.
To select multiple columns, put them in a list:
acs[['age', 'edu']]
        age                   edu
person
1        47      Associate degree
2        38  High School graduate
3        19            10th grade
4        18            11th grade
7        18             9th grade
5        16             9th grade
6         1                   NaN
Note that there’s no special meaning to double square brackets ([[ and ]]). The outer [] is for identifying a subset, and the inner [] is for defining a list.
Why does this look so much prettier than when we only selected age? Remember that each column in a DataFrame is a Series. If you select one column from a DataFrame, the result is a Series. When you select two or more columns, the result is a DataFrame. JupyterLab puts extra effort into printing DataFrames so they look nice, but it doesn’t bother with a Series. Note that both the DataFrame and the Series get an index that is copied from the original DataFrame. This allows Pandas to align the result with the original DataFrame if you use them together.
Exercise
Select and print the race, female, and hispanic columns from the acs DataFrame. (Yes, implicit print is fine.)
Solution
acs[['race', 'female', 'hispanic']]
                     race  female  hispanic
person
1                   Black       0         0
2                   White       1         0
3                   Black       0         0
4       Two or more races       0         0
7                   White       1         0
5                   Black       0         0
6                   Black       0         0
2.2.2 Subsetting with iloc
The iloc function allows you to select subsets by row and column number. It’s not quite a conventional function in that the arguments go in square brackets rather than parentheses.
If you give iloc a single number, it will select the corresponding row:
acs.iloc[0]
household 132531
age 47
race Black
marital_status Now married
edu Associate degree
income 20000.0
female 0
hispanic 0
Name: 1, dtype: object
The result is a Series again, but since it’s going across a row the index of the Series is the column names from the original DataFrame.
If you give iloc two numbers, it will select the value of the corresponding row and column:
acs.iloc[0, 0]
132531
This gave us the household number of the first person, because that is the value in the first row and first column (remember Python counts from zero and does not count the row index).
You can also use slices with iloc:
acs.iloc[1:3, 0:4]
        household  age   race marital_status
person
2          132531   38  White    Now married
3          132531   19  Black  Never married
Slices do not include their endpoint, so this gave us rows one and two (i.e. skipping the first row, row zero) and columns zero through three.
Exercise
Select and print the race of the first person in the acs DataFrame using iloc.
Solution
acs.iloc[0, 2]
'Black'
race is the third column in acs, making it column 2 by Python counting.
2.2.3 Subsetting with loc
The loc function allows you to subset using the DataFrame’s indexes. If you give it a value of the row index it will select that row:
acs.loc[1]
household 132531
age 47
race Black
marital_status Now married
edu Associate degree
income 20000.0
female 0
hispanic 0
Name: 1, dtype: object
Compare that with selecting row 1 (i.e. the second row) with iloc:
acs.iloc[1]
household 132531
age 38
race White
marital_status Now married
edu High School graduate
income 18500.0
female 1
hispanic 0
Name: 2, dtype: object
If you give it a value of the row index and a value of the column index, i.e. a column name, it will select that value:
acs.loc[1, 'household']
132531
You can select multiple rows and/or columns with lists:
acs.loc[[1, 2, 3], ['age', 'edu']]
        age                   edu
person
1        47      Associate degree
2        38  High School graduate
3        19            10th grade
If the row indexes are integers, as they often are, you can also use a slice. Just to make things confusing, with loc the endpoint of the slice is included:
acs.loc[1:3, 'age']
person
1 47
2 38
3 19
Name: age, dtype: int64
The reason slices behave differently with loc is that they’re really indexes, not numbers. Try selecting persons 3 through 5:
acs.loc[3:5]
        household  age               race marital_status         edu  income  female  hispanic
person
3          132531   19              Black  Never married  10th grade  5000.0       0         0
4          132531   18  Two or more races  Never married  11th grade  2200.0       0         0
7          132531   18              White  Never married   9th grade  7900.0       1         0
5          132531   16              Black  Never married   9th grade     0.0       0         0
Recall that we sorted the data by age (descending) not person number, so person 7 ended up between person 4 and person 5. To loc, the slice 3:5 means ‘from person 3 to person 5’ and that includes person 7.
To select all rows or all columns, use : without specifying a beginning or ending point:
acs.loc[:, ['age', 'edu']]
        age                   edu
person
1        47      Associate degree
2        38  High School graduate
3        19            10th grade
4        18            11th grade
7        18             9th grade
5        16             9th grade
6         1                   NaN
acs.loc[[1, 2], :]
        household  age   race marital_status                   edu   income  female  hispanic
person
1          132531   47  Black    Now married      Associate degree  20000.0       0         0
2          132531   38  White    Now married  High School graduate  18500.0       1         0
But remember that passing in just a row is equivalent to selecting all columns:
acs.loc[[1, 2]]
        household  age   race marital_status                   edu   income  female  hispanic
person
1          132531   47  Black    Now married      Associate degree  20000.0       0         0
2          132531   38  White    Now married  High School graduate  18500.0       1         0
Exercise
Select and print the income and marital_status of persons 3, 4, and 5 in the acs DataFrame using loc.
Solution
acs.loc[[3, 4, 5], ['income', 'marital_status']]
        income marital_status
person
3       5000.0  Never married
4       2200.0  Never married
5          0.0  Never married
Since we did not want to include person 7, we had to use a list rather than a slice.
2.2.4 Subsetting by Conditions
Both plain square brackets and loc can select the rows that meet a condition. For example, to select the children in the household, use:
acs[acs['age']<18]
        household  age   race marital_status        edu  income  female  hispanic
person
5          132531   16  Black  Never married  9th grade     0.0       0         0
6          132531    1  Black            NaN        NaN     NaN       0         0
Let’s take that apart a bit. acs['age']<18 gives us a Series:
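is_child = acs['age'] < 18
is_child
person
1    False
2    False
3    False
4    False
7    False
5     True
6     True
Name: age, dtype: bool

We stored the result as is_child so we can use it again below.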
Pandas used the indexes to ensure the right rows of is_child were aligned with the right rows of acs.
loc works the same way:
acs.loc[acs['age']<18]
        household  age   race marital_status        edu  income  female  hispanic
person
5          132531   16  Black  Never married  9th grade     0.0       0         0
6          132531    1  Black            NaN        NaN     NaN       0         0
acs.loc[is_child]
        household  age   race marital_status        edu  income  female  hispanic
person
5          132531   16  Black  Never married  9th grade     0.0       0         0
6          132531    1  Black            NaN        NaN     NaN       0         0
However, with loc you can still add a second argument to select columns:
acs.loc[acs['age']<18, ['age', 'edu']]
        age        edu
person
5        16  9th grade
6         1        NaN
2.2.4.1 Comparison Operators
The comparison operators you can use to create conditions are:
== (equals)
!= (not equals)
> (greater than)
< (less than)
>= (greater than or equal to)
<= (less than or equal to)
~ means ‘not’, and can be used to negate the following condition. Thus:
acs[~is_child]
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
Comparison and logical operators have precedence rules that determine which are evaluated first, just like arithmetic operators, and sometimes the order may not be what you expect. Take:
acs[~acs['age']<18]
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
5          132531   16              Black  Never married             9th grade      0.0       0         0
6          132531    1              Black            NaN                   NaN      NaN       0         0
Why isn’t ~acs['age']<18 equivalent to ~is_child? Because ~ is evaluated before <.
~is_child
is equivalent to:
~(acs['age']<18)
but:
~acs['age']<18
is equivalent to:
(~acs['age'])<18.
Applying the ~ operator to a number rather than a condition gives results that are unintuitive to say the least:
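~acs['age']
person
1   -48
2   -39
3   -20
4   -19
7   -19
5   -17
6    -2
Name: age, dtype: int64

For integers, ~ flips bits, so ~age equals -(age+1). Every value here is less than 18, which is why the condition above matched every row.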
The solution is to use parentheses generously to make sure everything is evaluated in the order you want.
acs[~(acs['age']<18)]
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
Exercise
Select and print edu for the Black people in the acs DataFrame. (Hint: race is a string and values of a string variable, like ‘Black’, go in quotes.)
Solution
acs.loc[acs['race']=='Black', 'edu']
person
1 Associate degree
3 10th grade
5 9th grade
6 NaN
Name: edu, dtype: object
2.2.4.2 Combining Conditions
To combine conditions, use logical and, &, or logical or, |. With logical and the result is true if and only if both conditions are true. In terms of set theory, it is the intersection. With logical or, the result is true if either condition is true (or both). In terms of set theory, it is the union.
For example, to identify the people who are adults and have an income greater than $5,000, use:
acs[(acs['age']>=18) & (acs['income']>5000)]
        household  age   race marital_status                   edu   income  female  hispanic
person
1          132531   47  Black    Now married      Associate degree  20000.0       0         0
2          132531   38  White    Now married  High School graduate  18500.0       1         0
7          132531   18  White  Never married             9th grade   7900.0       1         0
Yes, the parentheses are essential.
To identify the people who are either adults or have an income greater than $5,000, use:
acs[(acs['age']>=18) | (acs['income']>5000)]
        household  age               race marital_status                   edu   income  female  hispanic
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0
2          132531   38              White    Now married  High School graduate  18500.0       1         0
3          132531   19              Black  Never married            10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0
7          132531   18              White  Never married             9th grade   7900.0       1         0
Exercise
Select and print the adult females in the data set.
Solution
acs[(acs['female']==1) & (acs['age']>=18)]
        household  age   race marital_status                   edu   income  female  hispanic
person
2          132531   38  White    Now married  High School graduate  18500.0       1         0
7          132531   18  White  Never married             9th grade   7900.0       1         0
2.2.4.3 Queries
A DataFrame or Series has a query() function that allows you to pass in a string describing a condition, and it will return only those rows where the condition is True. This is essentially the same as using square brackets or loc with a condition. However, with query(), the condition can only involve columns from the DataFrame or Series itself. This limitation turns out to be an advantage: because all variables must be part of the DataFrame, you don’t have to specify the DataFrame name to use them. You don’t even have to put them in quotes.
Recall that we selected the people who were adults and had incomes greater than $5,000 with:
acs[(acs['age']>=18) & (acs['income']>5000)]
        household  age   race marital_status                   edu   income  female  hispanic
person
1          132531   47  Black    Now married      Associate degree  20000.0       0         0
2          132531   38  White    Now married  High School graduate  18500.0       1         0
7          132531   18  White  Never married             9th grade   7900.0       1         0
The query() equivalent is:
acs.query('(age>=18) & (income>5000)')
        household  age   race marital_status                   edu   income  female  hispanic
person
1          132531   47  Black    Now married      Associate degree  20000.0       0         0
2          132531   38  White    Now married  High School graduate  18500.0       1         0
7          132531   18  White  Never married             9th grade   7900.0       1         0
The disadvantage of query() is that it's usually slightly slower than square brackets or loc, which may matter in some applications. Also, you can't use query() in an assignment statement to change the values of some rows and not others, because that leads to confusion about copies and views again. The advantage of query() is its less fussy syntax: less to type and fewer errors to correct.
Now consider selecting all the people who are White with query(). The trouble is that the value of interest, 'White', is a string and needs to go in quotes. The query string 'race=='White'' won’t work because Python will assume the quote at the start of 'White' is the end of the query string. The solution is to put double quotes around "White". Python is fine with strings starting with either single or double quotes, but it will expect the string to end the same way. Thus if the query starts with a single quote, Python knows the double quotes around "White" do not mean the query is ending:
acs.query('race=="White"')
        household  age   race marital_status                   edu   income  female  hispanic
person
2          132531   38  White    Now married  High School graduate  18500.0       1         0
7          132531   18  White  Never married             9th grade   7900.0       1         0
Exercise
Select and print the adult males using query().
Solution
acs.query('age>=18 & female==0')
        household  age               race marital_status               edu   income  female  hispanic
person
1          132531   47              Black    Now married  Associate degree  20000.0       0         0
3          132531   19              Black  Never married        10th grade   5000.0       0         0
4          132531   18  Two or more races  Never married        11th grade   2200.0       0         0
Remember that 18-year-olds count as adults here, so the condition needs >= rather than >.
2.2.5 Which Subsetting Method Should You Use?
Plain square brackets, loc, and iloc all do basically the same thing, so which should you use?
Subsets specified by iloc are always precise and unambiguous, but they make the code hard to read because you have to know exactly what the data set looks like to know what they mean. We’ll rarely use iloc.
Plain square brackets are simple and convenient, but they can’t do as much. Also, because they’re used with so many different kinds of objects, the underlying code isn’t optimized for DataFrames. Avoid them whenever speed is a major consideration.
loc is almost as easy to use as plain square brackets, is much more capable, and is optimized for performance. Plus, the resulting code is easier to understand than with iloc.
query() is very convenient for subsets based on a condition, but performance is a concern.
In this book we’ll normally use plain square brackets for simple subsets, especially selecting a single column, and loc for anything more complicated. We’ll also use query() occasionally so you get comfortable with both ways of subsetting based on conditions.
Important
When you use plain square brackets, the argument inside the brackets answers the question ‘Which columns?’ When you use loc or iloc with one argument, the argument answers the question ‘Which rows?’ If you use two arguments with loc or iloc they answer the questions ‘Which rows?’ and ‘Which columns?’ respectively.
2.2.6 Function Chaining
Suppose you wanted to create a new DataFrame containing just the teens from this data set. You can identify the correct subset with:
acs[(acs['age']>12) & (acs['age']<20)]
        household  age               race marital_status         edu  income  female  hispanic
person
3          132531   19              Black  Never married  10th grade  5000.0       0         0
4          132531   18  Two or more races  Never married  11th grade  2200.0       0         0
7          132531   18              White  Never married   9th grade  7900.0       1         0
5          132531   16              Black  Never married   9th grade     0.0       0         0
If you assign this to a new variable, it's actually ambiguous whether it's a new DataFrame or a view of acs, and that will get you into trouble if you try to change it. The solution, as mentioned earlier, is to make a copy. But how do you call the copy() function of a subset of a DataFrame?
The key insight is that the thing you get when you run acs[(acs['age']>12) & (acs['age']<20)] is a DataFrame and has all the functions of a DataFrame. Thus you can run:
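teens = acs[(acs['age']>12) & (acs['age']<20)].copy(deep=True)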
In fact, what copy(deep=True) returns is another DataFrame, and you can call its functions as well. And if a function returns a different kind of object, you can call its functions. That means that if you need to carry out multiple tasks that can all be done by calling functions, you can often combine them into a chain of function calls. For example, if you wanted to create teens as above and then sort it by age in ascending order, you can do that with:
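teens = (
    acs[(acs['age']>12) & (acs['age']<20)]
    .copy(deep=True)
    .sort_values('age')
)
teens
        household  age               race marital_status         edu  income  female  hispanic
person
5          132531   16              Black  Never married   9th grade     0.0       0         0
4          132531   18  Two or more races  Never married  11th grade  2200.0       0         0
7          132531   18              White  Never married   9th grade  7900.0       1         0
3          132531   19              Black  Never married  10th grade  5000.0       0         0

(The line breaks inside the parentheses are optional; they just make the chain easier to read.)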
Exercise
Create a DataFrame containing just the males in the household. Then sort it in descending order by income and reset its index so Python creates a new index. The new index will then be the person's ranking by income (but starting from zero). Carry out all these tasks in a single function chain.
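Solution
One way to build the chain (a sketch: filter, copy, sort, then reset the index; the name males is just illustrative):

males = (
    acs[acs['female']==0]
    .copy(deep=True)
    .sort_values('income', ascending=False)
    .reset_index()
)
males
   person  household  age               race marital_status               edu   income  female  hispanic
0       1     132531   47              Black    Now married  Associate degree  20000.0       0         0
1       3     132531   19              Black  Never married        10th grade   5000.0       0         0
2       4     132531   18  Two or more races  Never married        11th grade   2200.0       0         0
3       5     132531   16              Black  Never married         9th grade      0.0       0         0
4       6     132531    1              Black            NaN               NaN      NaN       0         0

Note that sort_values() puts the NaN income last by default.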
2.3 Changing Data
You can change the data in a DataFrame by assigning new values to it. You'll use the subsetting tools from the previous section to select which parts of the DataFrame will be changed. To create a new column, select it as a subset of the DataFrame as if it already existed and then assign a value to it.
2.3.1 Numeric Variables
The incomes in this data set date from the year 2000, so to correct for inflation and convert them to January 2023 dollars we need to multiply them by 1.77:
acs['income_2023'] = acs['income']*1.77
acs
        household  age               race marital_status                   edu   income  female  hispanic  income_2023
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0      35400.0
2          132531   38              White    Now married  High School graduate  18500.0       1         0      32745.0
3          132531   19              Black  Never married            10th grade   5000.0       0         0       8850.0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0       3894.0
7          132531   18              White  Never married             9th grade   7900.0       1         0      13983.0
5          132531   16              Black  Never married             9th grade      0.0       0         0          0.0
6          132531    1              Black            NaN                   NaN      NaN       0         0          NaN
Note how the income of person 6 is NaN, or ‘Not a Number’. This is how Python stores missing values. Attempting to multiply NaN by 1.77 results in NaN, so income_2023 is automatically missing as well. This is what we want and expect.
Exercise
Go to the BLS inflation calculator and see what you need to multiply incomes from 2000 by to get dollars in the most recent month that has data (assume for simplicity that the incomes are from January 2000). Create a column in the acs DataFrame to store income in current dollars.
Solution
As of this writing, if you put a 1 in the top box of the BLS calculator web page and click Calculate it will tell you that $1.00 in January 2000 had the same buying power as $1.85 in March 2024 (the most recent month for which CPI data are available). That makes 1.85 the current conversion factor.
acs['income_current'] = acs['income']*1.85
acs
        household  age               race marital_status                   edu   income  female  hispanic  income_2023  income_current
person
1          132531   47              Black    Now married      Associate degree  20000.0       0         0      35400.0         37000.0
2          132531   38              White    Now married  High School graduate  18500.0       1         0      32745.0         34225.0
3          132531   19              Black  Never married            10th grade   5000.0       0         0       8850.0          9250.0
4          132531   18  Two or more races  Never married            11th grade   2200.0       0         0       3894.0          4070.0
7          132531   18              White  Never married             9th grade   7900.0       1         0      13983.0         14615.0
5          132531   16              Black  Never married             9th grade      0.0       0         0          0.0             0.0
6          132531    1              Black            NaN                   NaN      NaN       0         0          NaN             NaN
Of course you’ll probably get a different number.
2.3.2 Indicator (Boolean) Variables
Now create an indicator variable for ‘this person is black’. Since you’re setting the new variable equal to a condition, it will get True if the condition is true and False if it is false:
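acs['black'] = (acs['race']=='Black')
acs['black']
person
1     True
2    False
3     True
4    False
7    False
5     True
6     True
Name: black, dtype: bool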
Next suppose we want to create an indicator variable has_income which tells us if the person’s income is greater than zero. This seems like a straightforward extension of what we just did:
acs['has_income'] = (acs['income'] > 0)
The problem is person 6 and their missing value for income. Because we don’t know their income, we don’t know what their value of has_income should be. (We could make a pretty good guess based on their age, but we don’t know.) So the right value of has_income for person 6 is NaN. There are just two problems with that: Python won’t automatically give us that answer, and it won’t automatically store it.
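Printing has_income shows the problem:

acs['has_income']
person
1     True
2     True
3     True
4     True
7     True
5    False
6    False
Name: has_income, dtype: bool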
has_income looks fine for all the observations except person 6, where we got False rather than NaN. The problem here is that the result of a comparison like (acs['income'] > 0) must be either True or False. NaN is not an option.
Different languages handle missing values in conditions in different ways, and they all have their advantages and disadvantages. Python’s is at least easy to remember: any condition involving NaN returns False.
NaN is defined by the numpy package, so to talk about it directly we need to import numpy. The standard abbreviation for it is np:
import numpy as np
Now compare acs['income'] to np.NaN and note the results:
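acs['income'] == np.NaN
person
1    False
2    False
3    False
4    False
7    False
5    False
6    False
Name: income, dtype: bool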
income is NaN for person 6, but it doesn’t matter: the rule that any condition involving NaN returns False takes precedence. Even comparing NaN to itself returns False:
np.NaN==np.NaN
False
If you think of NaN as representing unknown quantities this makes some sense: two unknown quantities are very unlikely to be identical.
To detect missing values, instead use the isna() function:
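acs['income'].isna()
person
1    False
2    False
3    False
4    False
7    False
5    False
6     True
Name: income, dtype: bool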
There’s just one problem: a strictly Boolean variable can only store True or False, not NaN. The solution is to convert has_income from the default bool type to the boolean type. boolean is a nullable boolean, meaning it can store NaN as well.
In previous versions of Pandas, putting NaN in a variable of type bool would prompt Pandas to silently promote the variable to type boolean, but the Pandas developers decided this was convenient but dangerous. After all, bool and boolean sometimes behave differently. So they want you to explicitly convert has_income to boolean using the astype() function before you put NaN in it:
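A sketch of those two steps:

acs['has_income'] = acs['has_income'].astype('boolean')
acs.loc[acs['income'].isna(), 'has_income'] = np.NaN
acs['has_income']
person
1     True
2     True
3     True
4     True
7     True
5    False
6     <NA>
Name: has_income, dtype: boolean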
(Technically you’ve ended up with the Pandas NA rather than the NumPy NaN, but that’s not a distinction you generally have to worry about. Pandas NA is officially experimental but almost always behaves the same as NumPy’s NaN.)
As of Pandas 2.1, skipping the conversion works (has_income will still be automatically promoted) but you’ll get a warning. In a future version it will crash.
There’s actually a similar issue with integer variable types, as np.NaN is technically a floating point number. However, converting integers to floating point numbers is not considered a promotion, so setting some values of an integer variable to NaN results in the variable being converted to floating point without any fuss.
Creating has_income with some incorrect values and then immediately correcting them is not bad. But let’s talk about a couple of ways you could create it with all the right values immediately.
2.3.3.1 Subsets and New Variables
One takes advantage of the fact that a brand new variable essentially starts out with all its values set to NaN. If you use a subset to only set values for some observations when you create it, the observations that are left out will get NaN:
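acs.loc[acs['income'].notna(), 'has_income2'] = (acs['income'] > 0)
acs['has_income2']
person
1     True
2     True
3     True
4     True
7     True
5    False
6      NaN
Name: has_income2, dtype: object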
Because the new variable is only set to the result of (acs['income']>0) for rows where income has a valid value (acs['income'].notna()), rows where income is missing get NaN for has_income2. Just remember that this only works for brand new variables. If has_income2 already exists, the subset means that rows where income is missing keep whatever value of has_income2 they had before.
2.3.3.2 The where() Function
Another way to do this relies on a very useful NumPy function called where(). It takes three arguments:
A condition (something that is either true or false)
A value to use where the condition is true
A value to use where the condition is false
The result is an array rather than a Series because where() is a NumPy function while Series is a Pandas class, but Pandas will take care of converting arrays to Series as needed.
Our rules for has_income3 are:
if income is not missing, set it to the result of acs['income']>0
if income is missing, set it to NaN
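Translating those rules directly into where():

acs['has_income3'] = np.where(
    acs['income'].notna(),  # the condition
    acs['income'] > 0,      # value to use where the condition is true
    np.NaN                  # value to use where the condition is false
)
acs['has_income3']
person
1    1.0
2    1.0
3    1.0
4    1.0
7    1.0
5    0.0
6    NaN
Name: has_income3, dtype: float64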
There's just one wrinkle: has_income3 was set to 1.0 or 0.0 instead of True or False! It's the problem of variable type again: NumPy sees that the where() function is going to return True, False, and NaN, but it doesn't know about the Pandas boolean type. So it decides it can store all three as floating point numbers if it converts True to 1.0 and False to 0.0. Converting True and False to 1 and 0 is quite useful at times (we'll use it a lot with aggregate functions and hierarchical data), but it's not what we want here.
The solution is to convert the result of where() to the boolean type:
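np.where() itself has no dtype argument, so one way to do this (a sketch) is to wrap its result in a Series and convert it as you store it:

acs['has_income3'] = pd.Series(
    np.where(acs['income'].notna(), acs['income'] > 0, np.NaN),
    index=acs.index          # keep the person index so rows stay aligned
).astype('boolean')          # nullable Boolean: 1.0 -> True, 0.0 -> False, NaN -> <NA>
acs['has_income3']
person
1     True
2     True
3     True
4     True
7     True
5    False
6     <NA>
Name: has_income3, dtype: boolean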
One difference between bool and boolean is that boolean uses three-valued logic (specifically, Kleene logic) for combining conditions that could include NaN. It's three-valued because the result could be True, False, or NaN. Conceptually, you replace the NaN values with first True and then False. If the result is the same either way, then that's the result. If the result is not the same, then the result is NaN. For example:
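acs['black'] | acs['has_income']
person
1    True
2    True
3    True
4    True
7    True
5    True
6    True
dtype: boolean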
The result is True for person 6 because black is true and with | (logical or) only one of the two values needs to be True for the result to be True. Compare with:
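acs['black'] & acs['has_income']
person
1     True
2    False
3     True
4    False
7    False
5    False
6     <NA>
dtype: boolean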
Now the result for person 6 is NaN, because the result would be True if has_income were True, but False if has_income were False. Thus we don’t know the correct result.