SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.8 Coding missing values - part 2

4.8.1 Programming skills

4.8.1.1 Anonymous functions

An anonymous function is a function that does not have a name. This means that the function is for a single use and has no permanence. It can not be used latter without being created again.

Anonymous functions are a quick and simple approach to creating your own function. (The syntax for their specification is much simpler than a typical user-defined function.) They are a handy tool for data wranglers when a custom function (not already defined) is needed.

4.8.1.2 Repeating an operation on multiple variables

In the prior sections of this chapter we introduced the concept of vectorization and used it to repeat operations on all rows of a variable. There are times when the same (vectorized) operation needs to be applied to two or more variables of a data frame. For example, in the prior section, the ? values where changed to missing in two variables of the data frame. The code to make the change was written twice, once for each of the two variable. This is not hard if there are only two variables that the operation is being repeated on. When there are more, maybe many more, you will want to use a function/method to apply the operation to a set of variables. This will be demonstrated in the examples of this section.

4.8.2 Examples - R

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior sections of this chapter to load the tidyverse, import the csv file, and rename the variables.

    Note, the lubridate package is also loaded for use in these examples.

    library(tidyverse)
    library(lubridate)
    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    air_accidents_in <-
      air_accidents_in %>%
      rename(
        obs_num = 1,
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    air_accidents <- air_accidents_in
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 8
    $ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
  2. Anonymous functions.

    The tidyverse (purrr package) allows anonymous functions to be specified using what is known as formula notation. The anonymous function begins with ~ and the parameters are named .x, .y, and .z (defined in this order.) Not all three parameters need to be used in the function. The function does not need to use any parameters. If there is only one parameter, . can be used in place of .x. For example,

    ~(.x + .y)

    This would add the first and second parameter together and return the results. Another example is,

    ~ if_else(. == -9, NA_character_, .)

    This would test the rows of the first parameter for equality with -9. If TRUE, the row is NA, otherwise it is unchanged. (This is equivalent to the na_if(x, -9) function.)

    The case_when() function is a generalization of if_else() that can check more than one condition.

    Each condition is paired with a replacement value. The replacement value is specified as a function, possibly an anonymous function. If the replacement value is the same for all observations, an anonymous function with simply that value can be used.

    If a condition is found to match, the remaining conditions are not checked for these rows. Using TRUE as the last condition provides a default value to use if none of the other conditions are TRUE.

    We will change the operator values of ? and Private to NA using the case_when() function.

    # restore original values of the variables
    air_accidents <- 
      air_accidents %>%
      mutate(operator = pull(air_accidents_in, operator))
    
    air_accidents <- 
      mutate(
        air_accidents,
        operator = case_when(
          operator == "?" ~ NA_character_,
          operator == "Private" ~ NA_character_,
          TRUE ~ operator
          )
        )
    
    air_accidents %>%
      filter(operator == "?") %>%
      select("location", "operator", "plane_type")
    # A tibble: 0 x 3
    # ... with 3 variables: location <chr>, operator <chr>, plane_type <chr>

    In this example the case_when() function is used with only two test conditions. The recode() or na_if() methods could have been used here. The case_when() function was used to demonstrate its use.

  3. Operating on multiple columns.

    The mutate_at() function is a variant of mutate(). It is used when you want to apply the same function to a set of columns.

    We will use mutate_at() to apply the na_if() function to the operator and plane_type columns. The columns to apply the function to are given using the var() function.

    The na_if() function needs two parameters, the column name and "?". We will create an anonymous function to do that for us. The anonymous function will be, ~ na_if(., "?"), where the ., short for .x the first parameter, is passed in from the mutate_at() function.

    # restore original values of the variables
    air_accidents <- 
      air_accidents %>%
      mutate(operator = pull(air_accidents_in, operator))
    
    air_accidents <- 
      mutate_at(air_accidents, vars(operator, plane_type), ~ na_if(., "?"))
    
    air_accidents %>%
      filter(operator == "?" | plane_type == "?") %>%
      select("location", "operator", "plane_type")
    # A tibble: 0 x 3
    # ... with 3 variables: location <chr>, operator <chr>, plane_type <chr>

    The mutate_at() method can pass additional parameters to the called function, na_if() here. The following code uses this approach instead of the anonymous function approach.

    # restore original values of the variables
    air_accidents <- 
      air_accidents %>%
      mutate(
        operator = pull(air_accidents_in, operator),
        plane_type = pull(air_accidents_in, plane_type)
        )
    
    air_accidents <- 
      mutate_at(air_accidents, vars(operator, plane_type), na_if, "?")
    
    air_accidents %>%
      filter(operator == "?" | plane_type == "?") %>%
      select("location", "operator", "plane_type")
    # A tibble: 0 x 3
    # ... with 3 variables: location <chr>, operator <chr>, plane_type <chr>
    The parameters to mutate_at() are, .tbl, .vars, .funs, ....
    • The .tbl parameter is the data frame being operated on. This parameter is typically piped into mutate_at().
    • The .vars parameter is the list of variable in the data frame to be passed to .funs.
      The vars() function is one of several select() helper functions that can genereate lists of variables.
    • The .funs parameter is the name of the function to be called for each of the .vars. It is the name only, with out ().
    • The ... are a list of any additional parameter to be used by the .funs function. They can be named or not. The first parameter to funs. is the variable. The ... will follow this parameter.

4.8.3 Examples - Python

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior sections of this chapter to load the packages, import the csv file, and rename the variables.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
    air_accidents_in = pd.read_csv(airAccs_path)
    air_accidents_in = (
        air_accidents_in
            .rename(
                columns={
                    air_accidents_in.columns[0]: 'obs_num',
                    'Date': 'date', 
                    'planeType': 'plane_type',
                    'Dead': 'dead',
                    'Aboard': 'aboard',
                    'Ground': 'ground'}))
    
    air_accidents = air_accidents_in.copy(deep=True)
    
    print(air_accidents.dtypes)
    obs_num         int64
    date           object
    location       object
    operator       object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object
  2. Using anonymous functions

    Anonymous functions in python are called lambda functions. They are specified as,

    lambda <parameter list>: code

    For example,

    lambda x, y: (x + y)

    This would add the first and second parameters together and return the reslts. The calling function\method would need to supply the two parameters for this function.

    A lambda function called inside of the assign() method gets passed the data frame (that the assign() method is modifying) as its only parameter. An example of a lambda function that could be called inside of an assign() is,

    lambda df: df.loc[:, 'x'].mask(df.loc[:, 'x'] == -9, np.NaN)

    The data frame parameter of the lambda function is named df. The mask() method is used in this example uses a test condition to determine which rows to replace with a specified value, np.NAN here. This example tests the rows of the x variable for equality with -9. If True, the row is set to np.NaN, otherwise it is unchanged.

    # restore original values of the operator variable
    air_accidents.operator = air_accidents_in.operator
    
    air_accidents = (
        air_accidents
            .assign(operator=lambda df:
                df['operator'].mask(df['operator'] == '?', np.NaN)))
    
    (air_accidents
        .query('operator == "?"')
        .copy()
        .loc[:, ["location", "operator", "plane_type"]]
        .head(n=10)
        .pipe(print))
    Empty DataFrame
    Columns: [location, operator, plane_type]
    Index: []
  3. Change a variable with multiple conditions

    Python and Pandas do not have a switch, also called case when, type of function that can test multiple conditions. (This is a common control mechanism found in many programming languages.) If all the conditions are checks of equality, then replace() works. If one or more of the conditions is a range of values, then the solution is to use multiple comprehensions or multiple lambda fuctions. Sequential comprehensions can be used to achive most case when type functions. Nested comprehensions may sometimes be needed to implement case when or else if type conditions. Nest the comprehensions should only be used if sequential comprehensions will not accomplish what is needed.

    We will change the operator values of ? and Private to NAN using sequential mask() functions.

    # restore original values of the operator variable
    air_accidents['operator'] = air_accidents_in['operator']
    
    air_accidents = (
        air_accidents
            .assign(
                operator=(lambda df: df
                    ['operator']
                    .mask(df.operator == '?', np.NaN)
                    .mask(df.operator == 'Private', np.NaN))))
    
    (air_accidents  
        .query('operator == "?"')
        .copy()
        .loc[:, ["location", "operator", "plane_type"]]
        .head(n=10)
        .pipe(print))
    Empty DataFrame
    Columns: [location, operator, plane_type]
    Index: []
  4. Operating on multiple columns.

    The apply() method will iterate over the columns of the data frame, by default operation. (The apply() method can also iterate over rows, if needed.) This can be used to do the same operation on multiple columns. There are two parts to coding this.

    One of the parts is specifying the function to be applied. In this example we will use a lambda function to change ? to NaN. The apply() method is operating on columns. So the parameter to the lambda function is a column of the data frame.

    The second part is determining the columns that the operation will be applied to. For this we will use a comprehension. The comprehension will compare each column name in the data frame to a set of names. If the name is in the set, the lambda function is run for the column. Otherwise the original column is returned unchanged.

    # restore original values of the variables
    air_accidents = (
        air_accidents
            .assign(
                operator=air_accidents_in['operator'],
                plane_type=air_accidents_in['plane_type']))
    
    air_accidents = (
        air_accidents
            .apply(
                func=lambda x: x.mask(x == '?', np.NaN)
                if x.name in ['operator', 'plane_type'] else x))
    
    (air_accidents
        .query('operator == "?" | plane_type == "?"')
        .copy()
        .loc[:, ["location", "operator", "plane_type"]]
        .head(n=10)
        .pipe(print))
    Empty DataFrame
    Columns: [location, operator, plane_type]
    Index: []

4.8.4 Exercises

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

  2. Code NAs for the NA/DF and no histories values for the marriage status variable.

  3. Change the units on the earnings and hours variables to be thousands of dollars or hours. Use a method that operates on multiple columns.

    Hint, to do the unit change on a variable x, one would do x / 1000.