4.6 Subsets of a data frame
4.6.1 Data Concepts - Subsetting
Subsetting is identifying either a single element of the data frame or
a group of elements.
Dropping columns in the prior sections was an example of subsetting.
The head
and tail
methods are examples of subsetting.
They subset on the rows of a data frame.
In this section we will consider subsetting
rows and columns together.
When subsetting on both rows and columns, a rectangle of values is returned. There is a row for each of the row of the subset and a column for each column of the subset. For example, if one subsetted the data frame from section 4.2.2, repeated here,
A,B,C,D
1,51,-2,Madison
-3,1,8,Sun Prairie
9,13,,Verona
using A
, B
, and D
for the columns and
the first and last rows,
the following would be the subset of the data frame.
A B D
1 51 Madison
9 13 Verona
This would identify two rows of three variables.
4.6.2 Programming skills
4.6.2.1 Subsetting
Subsetting can be used on the right hand side of an assignment command to extract values from a data frame or on the left hand side to identify a set of elements to be replaced.
4.6.2.2 Row index
A row index, also called row names, may have text names or only the integer position index.
4.6.2.3 Slicing
Slices identifying a range of an index to be used in subsetting.
Both the tidyverse and pandas use the colon, :
, operator to identify a slice
and specify a slice with start:end
, where start
and end
are the beginning and end of the range of values.
Slices can be specified using numeric position values or named index
values.
4.6.2.4 Conditional subsetting
Conditional subsetting uses a boolean index,
a boolean value for for each index position of the row or column
index.
The subset will include the rows or columns that are identified by the
true values in the boolean index.
The tidyverse filter()
and pandas querry()
methods used conditional
subsetting.
In this section we will other examples that use conditional subsetting.
4.6.3 Examples - R
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior section to load packages, import the csv file, and rename the variables. To allow us to identify rows by number in this section, we do not remove the observation number variable.
library(tidyverse)
airAccs_path <- file.path("..", "datasets", "airAccs.csv") air_accidents_in <- read_csv(airAccs_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
air_accidents_in <- air_accidents_in %>% rename( obs_num = 1, date = Date, plane_type = planeType, dead = Dead, aboard = Aboard, ground = Ground ) air_accidents <- air_accidents_in glimpse(air_accidents)
Observations: 5,666 Variables: 8 $ obs_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ... $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Select rows using numeric row index.
The
slice()
method is used to subset using the numeric row index values.This example uses
slice()
to display the beginning of theair_accidents
data frame similarly to head.This example makes use of the sequence operator
:
to create a slice. The sequence operator creates a range of numbers. The range of numbers starts at the integer value on the left of:
and includes all integers up to and including the integer on the right of:
. For example,9:12
would be the numbers 9, 10, 11, and 12. The sequence operator can be used outside of slicing.This example displays the first 5 fows of the data set.
air_accidents %>% slice(1:5)
# A tibble: 5 x 8 obs_num date location operator plane_type dead aboard ground <dbl> <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> 1 1 1908-09-17 Fort Myer,~ Military ~ Wright Fly~ 1 2 0 2 2 1912-07-12 Atlantic C~ Military ~ Dirigible 5 5 0 3 3 1913-08-06 Victoria, ~ Private Curtiss se~ 1 1 0 4 4 1913-09-09 Over the N~ Military ~ Zeppelin L~ 14 20 0 5 5 1913-10-17 Near Johan~ Military ~ Zeppelin L~ 30 30 0
Select rows from the end using numeric row index.
This example displays the last 5 rows of the data set.
The
n()
helper function is used to get the last row number. (This helper function function is only available inside of tidyverse functions that operate on rows likeslice()
.)air_accidents %>% slice((n() - 4):n())
# A tibble: 5 x 8 obs_num date location operator plane_type dead aboard ground <dbl> <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> 1 5662 2014-02-21 Grombalia,~ Libyan Ai~ Antonov 26 11 11 0 2 5663 2014-02-26 Near Lanai~ Maui Air Piper PA-3~ 3 6 0 3 5664 2014-03-07 South Indi~ Malaysia ~ Boeing 777~ 239 239 0 4 5665 2014-03-18 Seattle, W~ Helicopte~ urocopter ~ 2 2 0 5 5666 2014-03-22 Caboolture~ Skydive C~ Cessna U20~ 5 5 0
Excluding rows.
Exclusion of rows is done by preceding the row with
-
. This is the same way that columns are excluded.This example uses exclusion of rows to display the same beginning rows of the
air_accidents
data frame as in the prior example.slice(air_accidents, -6:-n())
# A tibble: 5 x 8 obs_num date location operator plane_type dead aboard ground <dbl> <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> 1 1 1908-09-17 Fort Myer,~ Military ~ Wright Fly~ 1 2 0 2 2 1912-07-12 Atlantic C~ Military ~ Dirigible 5 5 0 3 3 1913-08-06 Victoria, ~ Private Curtiss se~ 1 1 0 4 4 1913-09-09 Over the N~ Military ~ Zeppelin L~ 14 20 0 5 5 1913-10-17 Near Johan~ Military ~ Zeppelin L~ 30 30 0
Base R subsetting.
Base R has another method to do subsetting. It is using the
[]
method. For a two dimensional object like a data frame, the rows are provided as the first parameter and the columns as the second parameter. Note, to get the set of columns we want, they need to be wrapped byc()
This will be explained further in the next chapter.air_accidents[64:67, c("date", "location", "operator", "plane_type")]
# A tibble: 4 x 4 date location operator plane_type <date> <chr> <chr> <chr> 1 1922-04-08 Pao Ting Fou, China ? ? 2 1922-04-13 Brooklands, England Vickers Vickers Viki~ 3 1922-06-03 Off Folkestone, Kent, E~ Cie des Messageries Ae~ Bleriot Spad~ 4 1922-07-04 Fuhlsbuttel, Germany ? LVG C VI
The base R subsetting operator,
[]
, can be used on the left side of the assignment operator. This allow you to modify a part of a data frame.This example would be done as follows in the tidyverse.
air_accidents %>% select(date, location, operator, plane_type) %>% slice(64:67)
# A tibble: 4 x 4 date location operator plane_type <date> <chr> <chr> <chr> 1 1922-04-08 Pao Ting Fou, China ? ? 2 1922-04-13 Brooklands, England Vickers Vickers Viki~ 3 1922-06-03 Off Folkestone, Kent, E~ Cie des Messageries Ae~ Bleriot Spad~ 4 1922-07-04 Fuhlsbuttel, Germany ? LVG C VI
Create a test data frame.
When writing code, it is sometimes convenient to work with a smaller set of observations when you are developing and testing new code. The
slice()
method can be used to create a smaller data frame to test code.This example creates a new data frame from the first ten rows.
air_accidents_test <- slice(air_accidents, 1:10)
4.6.4 Examples - Python
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior section to load packages, import the csv file, and rename the variables. To allow us to identify rows by number in this section, we do not remove the observation number variable.
from pathlib import Path import pandas as pd import numpy as np
airAccs_path = Path('..') / 'datasets' / 'airAccs.csv' air_accidents_in = pd.read_csv(airAccs_path) air_accidents_in = ( air_accidents_in .rename( columns={ air_accidents_in.columns[0]: 'obs_num', 'Date': 'date', 'planeType': 'plane_type', 'Dead': 'dead', 'Aboard': 'aboard', 'Ground': 'ground'})) air_accidents = air_accidents_in.copy(deep=True) print(air_accidents.dtypes)
obs_num int64 date object location object operator object plane_type object dead float64 aboard float64 ground float64 dtype: object
Select rows using numeric row index.
The
iloc[]
attribute can subset on both rows and columns using numeric index values.An
iloc[]
slice does not include the end value, as is typical in Python. For example,9:12
would be the numbers 9, 10, and 11. Not providing either thestart
orend
will result in the slice starting at the first or last position, respectively. For example,:
would mean start at the begining and go to the end.The
iloc[]
attribute can subset on both rows and columns using named index values.Slices for
loc[]
are different from Python slices in two ways. The first isend
values are included in the range. The second is that the range is specified by names. The range is all the rows, or columns, between the namedstart
andend
inclusive, as ordered in the index.The
loc[]
attribute uses index names andiloc[]
only uses index numeric position. To mix using numbers and names requires either converting names to position values, or position values to names.Note,
loc[]
andiloc[]
should not be chained together. This would be considered bad coding practice.This example uses
iloc[]
to display the beginning of theair_accidents
data frame similarly to head.(air_accidents .iloc[:5, :] .pipe(print))
obs_num date ... aboard ground 0 1 1908-09-17 ... 2.0 0.0 1 2 1912-07-12 ... 5.0 0.0 2 3 1913-08-06 ... 1.0 0.0 3 4 1913-09-09 ... 20.0 0.0 4 5 1913-10-17 ... 30.0 0.0 [5 rows x 8 columns]
Select rows from the end using numeric row index.
Slices can be taken with respect to the end of the data frame. This is done by using negative values.
This example displays the last 5 rows of the data set.
(air_accidents .iloc[-5:, :] .pipe(print))
obs_num date ... aboard ground 5661 5662 2014-02-21 ... 11.0 0.0 5662 5663 2014-02-26 ... 6.0 0.0 5663 5664 2014-03-07 ... 239.0 0.0 5664 5665 2014-03-18 ... 2.0 0.0 5665 5666 2014-03-22 ... 5.0 0.0 [5 rows x 8 columns]
Subsetting using the subsetting operator,
[]
The
[]
operator will select either a single column or a list of columns. If a slice is used, then the subsetting is on rows. This operator is typically used as a short cut for selecting columns. It is not the preferred method for subsetting and is being shown since you will likely see code that uses it.print(air_accidents[64:68])
obs_num date location ... dead aboard ground 64 65 1922-04-13 Brooklands, England ... 3.0 3.0 0.0 65 66 1922-06-03 Off Folkestone, Kent, England ... 3.0 3.0 0.0 66 67 1922-07-04 Fuhlsbuttel, Germany ... 2.0 3.0 0.0 67 68 1922-07-16 Near Saverne, Lorraine, France ... 5.0 5.0 0.0 [4 rows x 8 columns]
Note,
[]
should not be chained withloc[]
andiloc[]
.Create a test data frame.
When writing code, it is sometimes convenient to work with a smaller set of observations when you are developing and testing new code. The
iloc[]
attribute can be used to create a smaller data frame to test code.This example creates a new data frame from the first ten rows.
air_accidents_test = air_accidents.iloc[:10, :] print(air_accidents_test)
obs_num date ... aboard ground 0 1 1908-09-17 ... 2.0 0.0 1 2 1912-07-12 ... 5.0 0.0 2 3 1913-08-06 ... 1.0 0.0 3 4 1913-09-09 ... 20.0 0.0 4 5 1913-10-17 ... 30.0 0.0 5 6 1915-03-05 ... 41.0 0.0 6 7 1915-09-03 ... 19.0 0.0 7 8 1916-07-28 ... 20.0 0.0 8 9 1916-09-24 ... 22.0 0.0 9 10 1916-10-01 ... 19.0 0.0 [10 rows x 8 columns]
Summary of subsetting
This is not an example. Instead it is a brief summary of the subsetting tools available for pandas data frames. Not all of these have been covered in the material and examples. The one that were not covered are left for you to explore on your own, as you need them.
There are two approaches to identifying rows or columns to be subsetted (indexed.)
Slicing is providing a
start:end<:step>
definition of the rows or columns to be subsetted on. A pandas slice can be used with either interger position or named index values. The:step
is optional and defaults to one. This is the number to count by from thestart
toend
.Providing a list,
[]
, of the desired rows or columns. This is called fancy indexing.
A series is returned when subsetting results in a single column. A single value is returned when subsetting results in a single row of a single column. If you intend to subset a single value, consider using the
at[]
attribute. It is intended to return a single value. A data frame is returned when a subset results in more than one column of data.Subsetting creates either a view (a reference to the subsetted data) or a copy (shallow copy.) Which of these two objects are returned depends on the context. If no changes are going to be made or further subsetting is to be done, then using either a view or copy is alright. If you plan to make any changes or further subset the object, then make an explicit copy of the subset using
copy()
.When an object is a view of a subset of another object, the view object is indexed using the same position integer values as the original data frame.
Indexing used on the left hand side of the assignment operator creates no view or copy. The indexing and assignment are done together in place.
4.6.5 Exercises
These exercises use the PSID.csv
data set
that was imported in the prior section.
Import the
PSID.csv
data set.Display the last three rows of the data frame using positional values to subset.
Display the first, third, fifth, and seventh rows of columns two and three.
Create a smaller data frame using the first 20 rows.