4.3 Copying data sets
4.3.1 Data concepts - Copies of the data
When cleaning and wrangling data, it can be helpful to retain the original imported data frame without changes. This allows you the opportunity to compare your changes to the original data frame to check your work as you clean the data.
When the code to clean a data set takes a meaningful amount of time to run, the cleaned data set can be written to a file. This cleaned data set can then be used for analysis or exploratory work without rerunning the code to clean the data. When this is done, the cleaning code is kept in its own script, separate from the other wrangling, exploratory, and analysis code. When the cleaning time is short, the cleaning code can be run every time the analysis is worked on. There may be no need for a saved clean data file.
When a cleaned data set is saved to a file, it should be given a new name and possibly stored in a different directory. It is a best practice to not overwrite the original data files.
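The naming practice described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `cleaned_path()`, the `_clean` suffix, and the `raw` and `clean` directory names are hypothetical choices and are not used elsewhere in this chapter.

```python
from pathlib import Path
import tempfile


def cleaned_path(original: Path, clean_dir: Path) -> Path:
    """Build a path for the cleaned file: different directory, different name."""
    # Hypothetical convention: append "_clean" to the original file stem.
    return clean_dir / f"{original.stem}_clean{original.suffix}"


# Demonstration in a throwaway directory so no real data is touched.
work_dir = Path(tempfile.mkdtemp())
raw_file = work_dir / "raw" / "example.csv"
raw_file.parent.mkdir(parents=True)
raw_file.write_text("Date,Dead\n1908-09-17,1\n")

# Write the "cleaned" text to the new location; the raw file is left as-is.
clean_file = cleaned_path(raw_file, work_dir / "clean")
clean_file.parent.mkdir(parents=True, exist_ok=True)
clean_file.write_text("date,dead\n1908-09-17,1\n")

print(clean_file.name)
print(raw_file.read_text())
```

Because the cleaned file always gets a different name and directory, rerunning the cleaning code can replace the cleaned file but never the original.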
4.3.2 Programming skills
4.3.2.1 Assignment copy versus reference
The assignment command section explained that an assignment gives a name to the object that is defined on the right hand side. In this section we expand on this and consider the case where the right hand side is an object that already has a name. For example, say the data frame returned by read_csv() is given the name df_in. Then we assign df_in to the name df. What objects are df and df_in referencing?
In Python, df = df_in would result in df and df_in referencing the same object. Changes made to the object referenced by df could be seen by displaying df_in and vice versa. Pandas provides a method to create a copy of an object when this is needed.
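The Python behavior can be checked with a small sketch. To keep it free of dependencies, a plain dictionary of lists stands in for the data frame and the standard library `copy.deepcopy()` stands in for the pandas copy method; the values are made up for illustration.

```python
import copy

# A stand-in for an imported data frame (hypothetical values).
df_in = {"date": ["1908-09-17", "1912-07-12"], "dead": [1, 5]}

df = df_in                      # a second name for the same object
df_copy = copy.deepcopy(df_in)  # an independent copy

df["dead"][0] = 99              # change made through the name df

same_object = df is df_in               # True: both names reference one object
seen_through_df_in = df_in["dead"][0]   # 99: the change is visible via df_in
copy_value = df_copy["dead"][0]         # 1: the copy is unaffected
print(same_object, seen_through_df_in, copy_value)
```

The `is` operator reports whether two names reference the same object, which makes it a quick way to check whether an assignment produced a copy.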
In R, df <- df_in would result in df and df_in behaving as different objects that have the same values and attributes (R copies the data when one of them is modified). Changes made to the object referenced by df would apply to only the df object and would not change the df_in object.
4.3.2.2 Parameter copy versus reference
The object and assignment command sections explained that names are references to an object. The functions and their parameters section explained how objects are associated with parameters to functions (and methods) and that functions (and methods) can return an object. In this section we dig a little deeper into how parameter objects are passed to functions and methods.
Python passes a reference to the object, rather than a copy of the object, for each parameter. Therefore, the code of methods and functions uses the original objects passed in and can modify these objects. As a result, objects can be modified without using the assignment command. This means you can write
<object>.<method>()
instead of
<object> = <object>.<method>()
to change a value or attribute of <object>. Not all methods do this. You will need to check the method documentation to determine if the object is modified by its use.
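The difference between a method that modifies its object and a function that returns a new object can be seen with Python's built-in list type. This sketch uses lists rather than pandas objects only to stay self-contained; the same check-the-documentation advice applies to pandas methods.

```python
values = [3, 1, 2]

# sorted() returns a new list and leaves the original unchanged,
# so the result must be assigned to a name to be kept.
result = sorted(values)
after_sorted = list(values)   # snapshot: still [3, 1, 2]

# list.sort() modifies the list in place and returns None,
# so no assignment is needed (or wanted).
returned = values.sort()
after_sort = list(values)     # snapshot: now [1, 2, 3]

print(result, after_sorted, returned, after_sort)
```

Writing `values = values.sort()` would replace the list with `None`, which is a common mistake when an in-place method is treated like one that returns a new object.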
In effect, R makes copies of parameters for use by the code of functions and methods. The copy exists inside the function or method, while the original object remains unchanged where the function was called. Modification of the values or attributes of a parameter object by the code of a function only changes the local copy of the object. Changing an object or its attributes therefore requires using the assignment operator to assign the value returned by the function.
Both of these approaches to parameters have advantages and disadvantages. As such, one approach is no better than the other. It is one of a few programming style differences between R and Python.
The programming concept of scope was introduced in the ggplot-layers section. It was explained that data and aesthetics defined in the ggplot() function could be used by all geom_*() functions associated with it (global scope). But data and aesthetics defined in a geom_*() are local to that geom_*(). This is similar to the pass a name reference versus pass a copy approaches. The pass a name reference approach is similar to global scope in that there is one and only one copy and other parts of the program are using the same copy. The pass a copy approach is similar to local scope in that a function uses a copy that is separate from the use of the name in other parts of the program.
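Both approaches can be imitated in Python, which may make the contrast easier to see. In this sketch the two function names and the dictionary are hypothetical; `rename_in_place()` works on the caller's object through the passed reference, while `rename_on_copy()` makes its own local copy first, which mimics the pass a copy approach.

```python
def rename_in_place(record):
    """Modify the caller's object through the passed reference."""
    record["changed_date_name"] = record.pop("date")


def rename_on_copy(record):
    """Work on a local copy and return it; the caller's object is untouched."""
    local = dict(record)
    local["changed_date_name"] = local.pop("date")
    return local


original = {"date": "1908-09-17", "dead": 1}

# Pass a copy style: the result must be assigned to be kept.
renamed = rename_on_copy(original)
keys_after_copy = sorted(original)       # original still has "date"

# Pass a reference style: the original changes without an assignment.
rename_in_place(original)
keys_after_in_place = sorted(original)   # original now has the new name

print(keys_after_copy, keys_after_in_place)
```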
4.3.3 Examples - R
These examples use the airAccs.csv data set.
We begin by loading the tidyverse, importing the csv file, and renaming the variables.
library(tidyverse)
airAccs_path <- file.path("..", "datasets", "airAccs.csv")
air_accidents_in <- read_csv(airAccs_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
air_accidents_in <- rename(
  air_accidents_in,
  obs_num = X1,
  date = Date,
  plane_type = planeType,
  dead = Dead,
  aboard = Aboard,
  ground = Ground
)
glimpse(air_accidents_in)
Observations: 5,666
Variables: 8
$ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
$ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
$ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
$ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
$ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
$ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
$ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Notice that the data frame was imported with _in appended to the data frame name used in the prior section.
Make a copy of the data frame.
air_accidents <- air_accidents_in
An assignment is used to copy the data frame to a name without the _in. This coding practice preserves the original data set and does not modify it. The original data set is then available for comparison as a reference while cleaning the data. This practice may not be possible if the data set is very large.
The following code demonstrates that an object is copied when assigning an object to a new name.
air_accidents_reference <- air_accidents_in
air_accidents_reference <- rename(air_accidents_reference, changed_date_name = date)
glimpse(air_accidents_reference)
Observations: 5,666
Variables: 8
$ obs_num           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ changed_date_name <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09...
$ location          <chr> "Fort Myer, Virginia", "Atlantic City, New J...
$ operator          <chr> "Military - U.S. Army", "Military - U.S. Nav...
$ plane_type        <chr> "Wright Flyer III", "Dirigible", "Curtiss se...
$ dead              <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20,...
$ aboard            <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20,...
$ ground            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
The following code displays the data frame as it was imported. From this display, you can see that the change to the column name was made only in the copy and the as-imported data frame remains unchanged.
glimpse(air_accidents_in)
Observations: 5,666
Variables: 8
$ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
$ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
$ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
$ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
$ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
$ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
$ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
The data set (with the new names) is output to a csv file. This is done for demonstration purposes. There is no need to save this mostly unmodified data set. This step of writing a data frame to a file would typically only be done if you have changes that require either a lot of time or a lot of code to run.
temp_data_path <- file.path("..", "datasets", "temp_data_to_be_deleted.csv")
write_csv(air_accidents, temp_data_path)
The variable names and values are saved. No type information is saved. When the file is input again, there may be a need for some type conversion. Type conversion is covered in the next chapter.
Note, this will overwrite the file at temp_data_path, if one already existed.
4.3.4 Examples - Python
These examples use the airAccs.csv data set.
We begin by loading the pandas and numpy packages and the Path class from pathlib, importing the csv file, and renaming the variables.
from pathlib import Path
import pandas as pd
import numpy as np
airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
air_accidents_in = pd.read_csv(airAccs_path)
air_accidents_in = (
    air_accidents_in
    .rename(
        columns={
            air_accidents_in.columns[0]: 'obs_num',
            'Date': 'date',
            'planeType': 'plane_type',
            'Dead': 'dead',
            'Aboard': 'aboard',
            'Ground': 'ground'}))
print(air_accidents_in.dtypes)
obs_num         int64
date           object
location       object
operator       object
plane_type     object
dead          float64
aboard        float64
ground        float64
dtype: object
Notice that the data frame was imported with _in appended to the data frame name used in the prior section.
Make a copy of the data frame.
We will make a copy of the data and name it air_accidents, removing the _in from the name it was input as. This coding practice preserves the original data set and does not modify it. The original data set is then available for comparison as a reference while cleaning the data. This practice may not be possible if the data set is very large.
The copy() method is used to create an independent copy of the data frame. (Recall that the assignment operator creates another reference to the same physical data.) The deep=True parameter is used to create a complete copy of the data. This is the default for copy() and is written out here to make the intent explicit. With deep=False, copy() will create what is called a shallow copy, and some elements and attributes of the new data frame may be shared with the copied-from data frame.
air_accidents = air_accidents_in.copy(deep=True)
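The general idea of shallow versus deep copies can be sketched with Python's standard `copy` module. A dictionary of lists stands in for a data frame here, with made-up values. The details of what a pandas shallow copy shares may differ from this sketch, so treat it as an illustration of the concept rather than of pandas internals.

```python
import copy

# A stand-in for a data frame: column names mapped to lists of values.
table = {"dead": [1, 5, 1], "aboard": [2, 5, 1]}

shallow = copy.copy(table)     # new dict, but the column lists are shared
deep = copy.deepcopy(table)    # new dict and new column lists

table["dead"][0] = 99          # modify a value in the original

shallow_value = shallow["dead"][0]                # 99: shared list shows the change
deep_value = deep["dead"][0]                      # 1: deep copy is independent
shares_column = shallow["dead"] is table["dead"]  # True: same list object
print(shallow_value, deep_value, shares_column)
```

A deep copy costs more memory and time, which is why it may not be practical for a very large data set.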
The following code demonstrates that without the use of the copy() method, the two object names reference the same object.
air_accidents_reference = air_accidents_in
air_accidents_reference.rename(
    columns={'date': 'changed_date_name'}, inplace=True)
print(air_accidents_in.dtypes)
obs_num                int64
changed_date_name     object
location              object
operator              object
plane_type            object
dead                 float64
aboard               float64
ground               float64
dtype: object
The change made to air_accidents_reference is seen in air_accidents_in.
The data set (with the new names) is output to a csv file. This is done for demonstration purposes. There is no need to save this mostly unmodified data set. This step of writing a data frame to a file would typically only be done if you have changes that require either a lot of time or a lot of code to run.
temp_data_path = Path('..') / 'datasets' / 'temp_data_to_be_deleted.csv'
air_accidents.to_csv(temp_data_path)
The variable names and values are saved. No type information is saved. When the file is input again, there may be a need for some type conversion. Type conversion is covered in the next chapter.
Note, this will overwrite the file at temp_data_path, if one already existed.
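The point that a csv file holds names and values but no type information can be illustrated with the standard library `csv` module. The rows below are made up, and an in-memory buffer stands in for a file. Note that, unlike the `csv` module, pandas infers types when it reads a csv file, so the types you get back there depend on that inference rather than on what was written.

```python
import csv
import io

# Hypothetical rows with an integer column and a float column.
rows = [{"obs_num": 1, "dead": 1.0}, {"obs_num": 2, "dead": 5.0}]

# Write the rows as csv text to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["obs_num", "dead"])
writer.writeheader()
writer.writerows(rows)

# Read the text back: every value is now a string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0])

# The types have to be restored explicitly after reading.
restored = [
    {"obs_num": int(r["obs_num"]), "dead": float(r["dead"])}
    for r in read_back
]
print(restored == rows)
```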
4.3.5 Exercises
These exercises use the PSID.csv data set that was imported in the prior section.
Import the PSID.csv data set. Set the variable names to something useful, if they are not already. Change at least one name.
Create a copy of the imported data frame that will be used for data cleaning.
Save the data frame as a csv to a file. Make sure to give the file a new name.