4.4 Dropping unneeded variables
4.4.1 Data Concepts - Removing unneeded variables
Data sets often contain variables that are not of interest for a project. Removing these unneeded variables makes it easier to focus on the variables that we need to wrangle.
There are two approaches to identifying the variables that you want to keep in a data frame. These approaches are inclusion, explicitly identifying which variables to retain, and exclusion, explicitly identifying the variables that are to be removed. Determining which of these two approaches to use depends on how easy it is define each of these two sets of variables. For example, if a data frame has over 15 variables and you want all of them except two, it is easier and clearer coding to list the two variables that are not needed.
Identifying what you want to operate on through inclusion and exclusion will be used in many of the later wrangling tasks.
4.4.2 Programming skills - Chaining/pipes
As a task gets more complex, the number of steps needed to complete the task goes up. These steps can produce a number of intermediate results that need to be passed from one step to the next. When an intermediate result will only be used by the following steps, saving and referencing the intermediate result can make your code more difficult to read. The tidyverse and Python both have operators that allow for intermediate results to be passed to following methods and functions without the need for explicitly naming the intermediate results.
R and Python have different forms for calling methods. Recall that Python uses
object.method_name(parms)
and R uses
method_name(object, parms).
Due to this difference in method calling approaches, there are different approaches to connecting method results together.
In the tidyverse two methods are connected using the
pipe operator, %>%
.
(The pipe operator is used in a number of other programming
languages, though the symbol for it is different.)
This would be done as follows
result <-
object %>%
method_1(parms_1) %>%
method_2(parms_2)
In python the methods would be chained together as follows
result = (object
.method_1(parms_1)
.method_2(parms_2))
The set of outer parenthesis in the Python example makes it easier to split the code over multiple lines.
In both the tidyverse and Python example, method_1
is run and
the result from method_1
is used as the object for
method_2
to work on.
Both of these method-chaining approaches place the most important information about what is being operated on and the operations that are being done on the left side of the page. The details of what is being done are found further to the right on the page. This is considered easier to read.
4.4.3 Examples - R
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior section to loading packages, import the csv file, and rename the variables.
library(tidyverse)
airAccs_path <- file.path("..", "datasets", "airAccs.csv") air_accidents_in <- read_csv(airAccs_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
air_accidents_in <- rename( air_accidents_in, obs_num = 1, date = Date, plane_type = planeType, dead = Dead, aboard = Aboard, ground = Ground ) air_accidents <- air_accidents_in glimpse(air_accidents)
Observations: 5,666 Variables: 8 $ obs_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ... $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4.4.3.1 Cleaning - Drop variables
The select()
method is used to reduce the number of columns and
order the columns.
The first parameter to the select()
method is the data frame on
which to select.
The remaining parameters are used to identify the columns to be
included or excluded.
The negation sign, "-", preceding a variable name indicates that the
variable is to be excluded.
We will use
select()
to keep all the variables except theobs_num
.air_accidents <- select(air_accidents, date, plane_type, dead, aboard, ground) glimpse(air_accidents)
Observations: 5,666 Variables: 5 $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
The variables will be ordered in the data set as they are given in
select()
.Using exclusion to keep all the variables except the
obs_num
.air_accidents <- air_accidents_in air_accidents <- select(air_accidents_in, -obs_num) glimpse(air_accidents)
Observations: 5,666 Variables: 7 $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Typically one uses either inclusion or exclusion to select the variables of a data frame, but not both.
Ordering the variables using
select()
. Thedead
column will be made the first variable in the data frame.This example makes use of the
everything()
function. This is one of the helper functions ofselect()
. See the documentation forselect
for the other helper functions.air_accidents <- select(air_accidents, dead, everything()) glimpse(air_accidents)
Observations: 5,666 Variables: 7 $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
This example uses pipes to rename and select the variables.
Note there are comments in this code that provide one possible way to read the code.
library(tidyverse)
# The comments in this block of code are to show you how the code might # be read. Theis kind of comments should not be put in real code. airAccs_path <- file.path("..", "datasets", "airAccs.csv") air_accidents_in <- read_csv(airAccs_path, col_types = cols()) air_accidents_in <- # air_accidents_in is the result of air_accidents_in %>% # taking the air_accident_in data, then rename( # renaming the following variables obs_num = 1, date = Date, plane_type = planeType, dead = Dead, aboard = Aboard, ground = Ground ) air_accidents <- # air_accidents is the result of air_accidents_in %>% # taking the air_accident_in data, then select(-obs_num) # selecting all of the variables except obs_num glimpse(air_accidents)
Observations: 5,666 Variables: 7 $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
The tidyverse includes support for a
compound assignment operator, %<>%
.
Its use would replace the assignment operator.
This is not recommend by Hadley Wickham (architect of the tidyverse.)
So while you will see examples of this on the web,
I recommend avoiding its use.
4.4.4 Examples - Python
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior section to load the packages, import the csv file, and rename the variables.
import pandas as pd import os
airAccs_path = os.path.join('..', 'datasets', 'airAccs.csv') air_accidents_in = pd.read_csv(airAccs_path) air_accidents_in.rename( columns={ 'Unnamed: 0': 'obs_num', 'Date': 'date', 'planeType': 'plane_type', 'Dead': 'dead', 'Aboard': 'aboard', 'Ground': 'ground'}, inplace=True) air_accidents = air_accidents_in.copy(deep=True) print(air_accidents.dtypes)
obs_num int64 date object location object operator object plane_type object dead float64 aboard float64 ground float64 dtype: object
4.4.4.1 Cleaning - Drop variables
The loc[]
attribute is used to reduce the number of columns and
order the columns.
(Note the use of square brackets with this attribute.)
The loc[]
attribute can be used on both rows and columns,
loc[<rows>, <columns>]
.
We are using only columns in these examples.
The row index will be set to :
to return all rows.
The :
operator will be explained further in a
following section.
We will use
loc[]
to keep all the variables except theobs_num
.air_accidents = air_accidents.loc[:, ['date', 'plane_type', 'dead', 'aboard', 'ground']] print(air_accidents.dtypes)
date object plane_type object dead float64 aboard float64 ground float64 dtype: object
Note that square brackets,
[
]
, are used inside of theloc[]
attribute to provide a set of columns.The assignment line of code has been split across two physical lines. It is recommended to keep the length of Python lines to 72 character. The line can be split as it is here due to the split being done inside of the square brackets. The open square bracket lets Python know that there is more to this line of code and it then ignores the end of the physical line.
The
drop()
method will drop variables from a data frame. We will use it to exclude theobs_num
variable.air_accidents = air_accidents_in.copy(deep=True) air_accidents.drop(columns='obs_num', inplace=True) print(air_accidents.dtypes)
date object location object operator object plane_type object dead float64 aboard float64 ground float64 dtype: object
The
drop()
method can be given a set of columns to drop in the same way as was done above withloc[]
, by including them in[
]
.Typically one uses either inclusion to keep variables or exclusion to drop variables, but not both.
Ordering the variables using
loc[]
.The
dead
column will be made the first variable in the data frame.air_accidents = air_accidents.loc[:, ['dead', 'date', 'plane_type', 'aboard', 'ground']] print(air_accidents.dtypes)
dead float64 date object plane_type object aboard float64 ground float64 dtype: object
This example uses chained methods to display the column names.
In prior sections the column index was saved as
air_accidents_columns
. Then the value attribute was applied toair_accidents_columns
. Bothcolumns
andvalues
are attributes and not methods. Chaining applies to attributes as well as methods. Here we retreive the name values without using an intermidiate named object.print(air_accidents.columns.values)
['dead' 'date' 'plane_type' 'aboard' 'ground']
Chained methods are often written across several lines of code. This helps keep the line length below the recommended length of 72 characters and allows the eye to quckly see all the methods being used. This improves the readability of the code. Two accepted mthods to split the chained methods across multiple lines are shown below.
print( air_accidents .columns .values)
['dead' 'date' 'plane_type' 'aboard' 'ground']
Chaining is a useful tool that generally makes code more readable. There are limits to its use. One of them is with the
inplace
equalsTrue
. There is no object returned from a method withinplace=True
and as such nothing to chain. Chaining and inplace changes are typically not used together. Another is chaining of subsetting methods (subsetting is discussed in the next section.) This is called chained indexing and is to be avoided.
4.4.5 Exercises
These exercises use the PSID.csv
data set
that was imported in the prior section.
Import the
PSID.csv
data set.Drop the first variable in the data frame. You may have renamed it after it was loaded.
Make the age variable the first variable in the data frame.