2.2 Reading csv files and other delimited data
2.2.1 Data concepts - Delimited data files
Delimited data files contain data organized as a table (rows and columns) with
each row on a separate line of the file and columns separated by a special character
known as the delimiter.
The only formatting information in delimited files is the delimiter and
the end of line character.
These files can be viewed using text editors such as Notepad and Notepad++.
Execl can often open .csv
files.
A csv file is a delimited file that uses commas as the deliminators. The following data is formatted as csv.
A,B,C,D
1,51,-2,Madison
-3,1,8,Sun Prairie
9,13,,Verona
The columns of csv files are formatted for easy storing and processing by computers. This results in the files that are not easily read by humans.
A few thing to note about this data file.
- The first three columns contain numbers in all rows except the first row. The first row in this data set looks like it has the names of the columns. Some data files have the column names in the file like this and others do not.
- The columns of the data may not appear vertically aligned in the data file.
- The space in "Sun Prairie" does not represent a column separator. The entire string of characters "Sun Prairie" will be read in as one value for the fourth column.
Other common separators in delimited files are tabs and pipes, "|".
They are the same as csv files except for the use of different separators.
These other non comma delimited files will often have a file type of
.txt
or .dat
.
But, other file types can be used.
2.2.2 Programming skills - Directory separator symbols in a path
Different operating systems use different directory separator symbols in the path to a file. To support writing code that is able to run on any operating system, R and Python have a function that constructs a path to a file. These functions fill in the correct directory separator between the directories and file name, given as parameters. The use of these functions is considered good programming style.
2.2.3 Examples - R
The function to construct a file path is file.path()
.
It takes as parameters the directories of the path and the file name
as character values and returns a character variable containing the path.
The following example creates the path to file named
Chile.csv
in the datasets folder of the project.Chile_path <- file.path("..", "datasets", "Chile.csv") print(Chile_path)
[1] "../datasets/Chile.csv"
The above path to the Chile data set uses forward slash
/
for the directory separator on this Windows machine.The
..
infile.path()
means to move up one folder.
2.2.3.1 Acquisition - Importing a csv file
The tidyverse function to read a csv file is read_csv()
.
The following are a few important parameters of read_csv()
.
file
, the path to the file to be imported.col_names
, setting this toFALSE
indicates the first row does not contains variable names.col_types
, setting this tocol()
uses guessed types for the columns. Alternatively, the parameters ofcol()
can be used to define the types of each column.na
, list of strings that indicate missing data.guess_max
, specifies the number of row to consider before making a guess of what type the columns are. The default value of 1000 works well on most csv files.skip
, number of lines at the front of the file to be ignored. This is used when a csv file contains metadata at the beginning of the file.
The read_*()
functions of the tidyverse use a common set
of parsers.
These parsers are used to format data such as numeric, factors,
date and time, etc.
The following example imports the
Chile.csv
file using the path created above.Chile <- read_csv(Chile_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
The results of
read_csv()
show that one of the column names is missing. How to correct this is covered in the cleaning chapter.The read functions from the
readr
package will truncate the display of the list of problems that may need to be addressed. When the list is truncated, theproblems()
function can be used to return the full list.
Note, there is a base R function, read.csv()
, that also reads in
a csv file.
While this function works,
it is best to use read_csv()
since it was designed to work with the
other tidyverse functions and this will reduce potential problems.
There are two other read functions in readr
that are useful.
These are read_tsv()
and read_delim()
.
They operate similarly to read_csv()
.
Once you have learned on these functions,
you will also know how to use the other two.
2.2.3.2 Explore - Initial look at a data frame
The following is a
glimpse()
of the Chile data.glimpse(Chile)
Observations: 2,700 Variables: 9 $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ... $ region <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "... $ population <dbl> 175000, 175000, 175000, 175000, 175000, 175000, 175... $ sex <chr> "M", "M", "F", "F", "F", "F", "M", "F", "F", "M", "... $ age <dbl> 65, 29, 38, 49, 23, 28, 26, 24, 41, 41, 64, 19, 27,... $ education <chr> "P", "PS", "P", "P", "S", "P", "PS", "S", "P", "P",... $ income <dbl> 35000, 7500, 15000, 35000, 35000, 7500, 35000, 1500... $ statusquo <dbl> 1.00820, -1.29617, 1.23072, -1.03163, -1.10496, -1.... $ vote <chr> "Y", "N", "Y", "N", "N", "N", "N", "N", "U", "N", "...
2.2.4 Examples - Python
The path construction function in Python is Path()
.
This function is in the os
package.
It takes a parameters a character string of a path.
Here we use ..
to move up a level in the folders.
The /
operator is used to add folders and files
to a path.
The following example creates the path to file named
Chile.csv
in thedatasets
folder of the project.from pathlib import Path import pandas as pd
chile_path = Path('..') / 'datasets' / 'Chile.csv' print(chile_path)
..\datasets\Chile.csv
The above path to the Chile data set uses back slash
\
for the directory separator on Windows.The
..
inPath()
means to move up one folder.
2.2.4.1 Acquisition - Importing a csv file
The pandas function to read a csv file is read_csv()
.
The following are a few important parameters of read_csv()
.
header
, boolean to indicate if the first row of the data contains the variable names.skiprows
, number of lines at the front of the file to be ignored.na_values
, list of characters that indicate missing data.
The following example imports the
Chile.csv
file using the path created above.chile = pd.read_csv(chile_path)
2.2.4.2 Explore - Initial look at a data frame
The following displays the class, column types, and the values of the first few rows.
print(chile.shape)
(2700, 9)
print(chile.dtypes)
Unnamed: 0 int64
region object
population int64
sex object
age float64
education object
income float64
statusquo float64
vote object
dtype: object
print(chile.head())
Unnamed: 0 region population sex age education income statusquo vote
0 1 N 175000 M 65.0 P 35000.0 1.00820 Y
1 2 N 175000 M 29.0 PS 7500.0 -1.29617 N
2 3 N 175000 F 38.0 P 15000.0 1.23072 Y
3 4 N 175000 F 49.0 P 35000.0 -1.03163 N
4 5 N 175000 F 23.0 S 35000.0 -1.10496 N
2.2.5 Exercises
Import the "mtcars.csv" data set.
What is the type of each variable of the mtcars data set?
How many observations are in the mtcars data set?
Import the "cane.csv" data set.
How many observations and variables are in the cane data set?