SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

2.2 Reading csv files and other delimited data

2.2.1 Data concepts - Delimited data files

Delimited data files contain data organized as a table (rows and columns) with each row on a separate line of the file and columns separated by a special character known as the delimiter. The only formatting information in delimited files is the delimiter and the end of line character. These files can be viewed using text editors such as Notepad and Notepad++. Execl can often open .csv files.

A csv file is a delimited file that uses commas as the deliminators. The following data is formatted as csv.

    A,B,C,D
    1,51,-2,Madison
    -3,1,8,Sun Prairie
    9,13,,Verona

The columns of csv files are formatted for easy storing and processing by computers. This results in the files that are not easily read by humans.

A few thing to note about this data file.

  • The first three columns contain numbers in all rows except the first row. The first row in this data set looks like it has the names of the columns. Some data files have the column names in the file like this and others do not.
  • The columns of the data may not appear vertically aligned in the data file.
  • The space in "Sun Prairie" does not represent a column separator. The entire string of characters "Sun Prairie" will be read in as one value for the fourth column.

Other common separators in delimited files are tabs and pipes, "|". They are the same as csv files except for the use of different separators. These other non comma delimited files will often have a file type of .txt or .dat. But, other file types can be used.

2.2.2 Programming skills - Directory separator symbols in a path

Different operating systems use different directory separator symbols in the path to a file. To support writing code that is able to run on any operating system, R and Python have a function that constructs a path to a file. These functions fill in the correct directory separator between the directories and file name, given as parameters. The use of these functions is considered good programming style.

2.2.3 Examples - R

The function to construct a file path is file.path(). It takes as parameters the directories of the path and the file name as character values and returns a character variable containing the path.

  1. The following example creates the path to file named Chile.csv in the datasets folder of the project.

    Chile_path <- file.path("..", "datasets", "Chile.csv")
    print(Chile_path)
    [1] "../datasets/Chile.csv"

    The above path to the Chile data set uses forward slash / for the directory separator on this Windows machine.

    The .. in file.path() means to move up one folder.

2.2.3.1 Acquisition - Importing a csv file

The tidyverse function to read a csv file is read_csv(). The following are a few important parameters of read_csv().

  • file, the path to the file to be imported.

  • col_names, setting this to FALSE indicates the first row does not contains variable names.

  • col_types, setting this to col() uses guessed types for the columns. Alternatively, the parameters of col() can be used to define the types of each column.

  • na, list of strings that indicate missing data.

  • guess_max, specifies the number of row to consider before making a guess of what type the columns are. The default value of 1000 works well on most csv files.

  • skip, number of lines at the front of the file to be ignored. This is used when a csv file contains metadata at the beginning of the file.

The read_*() functions of the tidyverse use a common set of parsers. These parsers are used to format data such as numeric, factors, date and time, etc.

  1. The following example imports the Chile.csv file using the path created above.

    Chile <- read_csv(Chile_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]

    The results of read_csv() show that one of the column names is missing. How to correct this is covered in the cleaning chapter.

    The read functions from the readr package will truncate the display of the list of problems that may need to be addressed. When the list is truncated, the problems() function can be used to return the full list.

Note, there is a base R function, read.csv(), that also reads in a csv file. While this function works, it is best to use read_csv() since it was designed to work with the other tidyverse functions and this will reduce potential problems.

There are two other read functions in readr that are useful. These are read_tsv() and read_delim(). They operate similarly to read_csv(). Once you have learned on these functions, you will also know how to use the other two.

2.2.3.2 Explore - Initial look at a data frame

  1. The following is a glimpse() of the Chile data.

    glimpse(Chile)
    Observations: 2,700
    Variables: 9
    $ X1         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ region     <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "...
    $ population <dbl> 175000, 175000, 175000, 175000, 175000, 175000, 175...
    $ sex        <chr> "M", "M", "F", "F", "F", "F", "M", "F", "F", "M", "...
    $ age        <dbl> 65, 29, 38, 49, 23, 28, 26, 24, 41, 41, 64, 19, 27,...
    $ education  <chr> "P", "PS", "P", "P", "S", "P", "PS", "S", "P", "P",...
    $ income     <dbl> 35000, 7500, 15000, 35000, 35000, 7500, 35000, 1500...
    $ statusquo  <dbl> 1.00820, -1.29617, 1.23072, -1.03163, -1.10496, -1....
    $ vote       <chr> "Y", "N", "Y", "N", "N", "N", "N", "N", "U", "N", "...

2.2.4 Examples - Python

The path construction function in Python is Path(). This function is in the os package. It takes a parameters a character string of a path. Here we use .. to move up a level in the folders. The / operator is used to add folders and files to a path.

  1. The following example creates the path to file named Chile.csv in the datasets folder of the project.

    from pathlib import Path
    import pandas as pd
    chile_path = Path('..') / 'datasets' / 'Chile.csv'
    print(chile_path)
    ..\datasets\Chile.csv

    The above path to the Chile data set uses back slash \ for the directory separator on Windows.

    The .. in Path() means to move up one folder.

2.2.4.1 Acquisition - Importing a csv file

The pandas function to read a csv file is read_csv(). The following are a few important parameters of read_csv().

  • header, boolean to indicate if the first row of the data contains the variable names.

  • skiprows, number of lines at the front of the file to be ignored.

  • na_values, list of characters that indicate missing data.

  1. The following example imports the Chile.csv file using the path created above.

    chile = pd.read_csv(chile_path)

2.2.4.2 Explore - Initial look at a data frame

The following displays the class, column types, and the values of the first few rows.

print(chile.shape)
(2700, 9)
print(chile.dtypes)
Unnamed: 0      int64
region         object
population      int64
sex            object
age           float64
education      object
income        float64
statusquo     float64
vote           object
dtype: object
print(chile.head())
   Unnamed: 0 region  population sex   age education   income  statusquo vote
0           1      N      175000   M  65.0         P  35000.0    1.00820    Y
1           2      N      175000   M  29.0        PS   7500.0   -1.29617    N
2           3      N      175000   F  38.0         P  15000.0    1.23072    Y
3           4      N      175000   F  49.0         P  35000.0   -1.03163    N
4           5      N      175000   F  23.0         S  35000.0   -1.10496    N

2.2.5 Exercises

  1. Import the "mtcars.csv" data set.

  2. What is the type of each variable of the mtcars data set?

  3. How many observations are in the mtcars data set?

  4. Import the "cane.csv" data set.

  5. How many observations and variables are in the cane data set?