SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.6 Subsets of a data frame

4.6.1 Data Concepts - Subsetting

Subsetting is identifying either a single element of the data frame or a group of elements. Dropping columns in the prior sections was an example of subsetting. The head and tail methods are examples of subsetting. They subset on the rows of a data frame. In this section we will consider subsetting rows and columns together.

When subsetting on both rows and columns, a rectangle of values is returned. There is a row for each of the row of the subset and a column for each column of the subset. For example, if one subsetted the data frame from section 4.2.2, repeated here,

    A,B,C,D
    1,51,-2,Madison
    -3,1,8,Sun Prairie
    9,13,,Verona

using A, B, and D for the columns and the first and last rows, the following would be the subset of the data frame.

    A B  D
    1 51 Madison
    9 13 Verona

This would identify two rows of three variables.

4.6.2 Programming skills

4.6.2.1 Subsetting

Subsetting can be used on the right hand side of an assignment command to extract values from a data frame or on the left hand side to identify a set of elements to be replaced.

4.6.2.2 Row index

A row index, also called row names, may have text names or only the integer position index.

4.6.2.3 Slicing

Slices identifying a range of an index to be used in subsetting. Both the tidyverse and pandas use the colon, :, operator to identify a slice and specify a slice with start:end, where start and end are the beginning and end of the range of values. Slices can be specified using numeric position values or named index values.

4.6.2.4 Conditional subsetting

Conditional subsetting uses a boolean index, a boolean value for for each index position of the row or column index. The subset will include the rows or columns that are identified by the true values in the boolean index. The tidyverse filter() and pandas querry() methods used conditional subsetting. In this section we will other examples that use conditional subsetting.

4.6.3 Examples - R

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior section to load packages, import the csv file, and rename the variables. To allow us to identify rows by number in this section, we do not remove the observation number variable.

    library(tidyverse)
    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    air_accidents_in <-
      air_accidents_in %>%
      rename(
        obs_num = 1,
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    air_accidents <- air_accidents_in
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 8
    $ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
  2. Select rows using numeric row index.

    The slice() method is used to subset using the numeric row index values.

    This example uses slice() to display the beginning of the air_accidents data frame similarly to head.

    This example makes use of the sequence operator : to create a slice. The sequence operator creates a range of numbers. The range of numbers starts at the integer value on the left of : and includes all integers up to and including the integer on the right of :. For example, 9:12 would be the numbers 9, 10, 11, and 12. The sequence operator can be used outside of slicing.

    This example displays the first 5 fows of the data set.

    air_accidents %>%
      slice(1:5)
    # A tibble: 5 x 8
      obs_num date       location    operator   plane_type   dead aboard ground
        <dbl> <date>     <chr>       <chr>      <chr>       <dbl>  <dbl>  <dbl>
    1       1 1908-09-17 Fort Myer,~ Military ~ Wright Fly~     1      2      0
    2       2 1912-07-12 Atlantic C~ Military ~ Dirigible       5      5      0
    3       3 1913-08-06 Victoria, ~ Private    Curtiss se~     1      1      0
    4       4 1913-09-09 Over the N~ Military ~ Zeppelin L~    14     20      0
    5       5 1913-10-17 Near Johan~ Military ~ Zeppelin L~    30     30      0
  3. Select rows from the end using numeric row index.

    This example displays the last 5 rows of the data set.

    The n() helper function is used to get the last row number. (This helper function function is only available inside of tidyverse functions that operate on rows like slice().)

    air_accidents %>%
      slice((n() - 4):n())
    # A tibble: 5 x 8
      obs_num date       location    operator   plane_type   dead aboard ground
        <dbl> <date>     <chr>       <chr>      <chr>       <dbl>  <dbl>  <dbl>
    1    5662 2014-02-21 Grombalia,~ Libyan Ai~ Antonov 26     11     11      0
    2    5663 2014-02-26 Near Lanai~ Maui Air   Piper PA-3~     3      6      0
    3    5664 2014-03-07 South Indi~ Malaysia ~ Boeing 777~   239    239      0
    4    5665 2014-03-18 Seattle, W~ Helicopte~ urocopter ~     2      2      0
    5    5666 2014-03-22 Caboolture~ Skydive C~ Cessna U20~     5      5      0
  4. Excluding rows.

    Exclusion of rows is done by preceding the row with -. This is the same way that columns are excluded.

    This example uses exclusion of rows to display the same beginning rows of the air_accidents data frame as in the prior example.

    slice(air_accidents, -6:-n())
    # A tibble: 5 x 8
      obs_num date       location    operator   plane_type   dead aboard ground
        <dbl> <date>     <chr>       <chr>      <chr>       <dbl>  <dbl>  <dbl>
    1       1 1908-09-17 Fort Myer,~ Military ~ Wright Fly~     1      2      0
    2       2 1912-07-12 Atlantic C~ Military ~ Dirigible       5      5      0
    3       3 1913-08-06 Victoria, ~ Private    Curtiss se~     1      1      0
    4       4 1913-09-09 Over the N~ Military ~ Zeppelin L~    14     20      0
    5       5 1913-10-17 Near Johan~ Military ~ Zeppelin L~    30     30      0
  5. Base R subsetting.

    Base R has another method to do subsetting. It is using the [] method. For a two dimensional object like a data frame, the rows are provided as the first parameter and the columns as the second parameter. Note, to get the set of columns we want, they need to be wrapped by c() This will be explained further in the next chapter.

    air_accidents[64:67, c("date", "location", "operator", "plane_type")]
    # A tibble: 4 x 4
      date       location                 operator                plane_type   
      <date>     <chr>                    <chr>                   <chr>        
    1 1922-04-08 Pao Ting Fou, China      ?                       ?            
    2 1922-04-13 Brooklands, England      Vickers                 Vickers Viki~
    3 1922-06-03 Off Folkestone, Kent, E~ Cie des Messageries Ae~ Bleriot Spad~
    4 1922-07-04 Fuhlsbuttel, Germany     ?                       LVG C VI     

    The base R subsetting operator, [], can be used on the left side of the assignment operator. This allow you to modify a part of a data frame.

    This example would be done as follows in the tidyverse.

    air_accidents %>%
      select(date, location, operator, plane_type) %>%
      slice(64:67)
    # A tibble: 4 x 4
      date       location                 operator                plane_type   
      <date>     <chr>                    <chr>                   <chr>        
    1 1922-04-08 Pao Ting Fou, China      ?                       ?            
    2 1922-04-13 Brooklands, England      Vickers                 Vickers Viki~
    3 1922-06-03 Off Folkestone, Kent, E~ Cie des Messageries Ae~ Bleriot Spad~
    4 1922-07-04 Fuhlsbuttel, Germany     ?                       LVG C VI     
  6. Create a test data frame.

    When writing code, it is sometimes convenient to work with a smaller set of observations when you are developing and testing new code. The slice() method can be used to create a smaller data frame to test code.

    This example creates a new data frame from the first ten rows.

    air_accidents_test <- slice(air_accidents, 1:10)

4.6.4 Examples - Python

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior section to load packages, import the csv file, and rename the variables. To allow us to identify rows by number in this section, we do not remove the observation number variable.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
    air_accidents_in = pd.read_csv(airAccs_path)
    air_accidents_in = (
        air_accidents_in
            .rename(
                columns={
                    air_accidents_in.columns[0]: 'obs_num',
                    'Date': 'date', 
                    'planeType': 'plane_type',
                    'Dead': 'dead',
                    'Aboard': 'aboard',
                    'Ground': 'ground'}))
    
    air_accidents = air_accidents_in.copy(deep=True)
    
    print(air_accidents.dtypes)
    obs_num         int64
    date           object
    location       object
    operator       object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object
  2. Select rows using numeric row index.

    The iloc[] attribute can subset on both rows and columns using numeric index values.

    An iloc[] slice does not include the end value, as is typical in Python. For example, 9:12 would be the numbers 9, 10, and 11. Not providing either the start or end will result in the slice starting at the first or last position, respectively. For example, : would mean start at the begining and go to the end.

    The iloc[] attribute can subset on both rows and columns using named index values.

    Slices for loc[] are different from Python slices in two ways. The first is end values are included in the range. The second is that the range is specified by names. The range is all the rows, or columns, between the named start and end inclusive, as ordered in the index.

    The loc[] attribute uses index names and iloc[] only uses index numeric position. To mix using numbers and names requires either converting names to position values, or position values to names.

    Note, loc[] and iloc[] should not be chained together. This would be considered bad coding practice.

    This example uses iloc[] to display the beginning of the air_accidents data frame similarly to head.

    (air_accidents  
        .iloc[:5, :]
        .pipe(print))
       obs_num        date  ... aboard ground
    0        1  1908-09-17  ...    2.0    0.0
    1        2  1912-07-12  ...    5.0    0.0
    2        3  1913-08-06  ...    1.0    0.0
    3        4  1913-09-09  ...   20.0    0.0
    4        5  1913-10-17  ...   30.0    0.0
    
    [5 rows x 8 columns]
  3. Select rows from the end using numeric row index.

    Slices can be taken with respect to the end of the data frame. This is done by using negative values.

    This example displays the last 5 rows of the data set.

    (air_accidents  
        .iloc[-5:, :]
        .pipe(print))
          obs_num        date  ... aboard ground
    5661     5662  2014-02-21  ...   11.0    0.0
    5662     5663  2014-02-26  ...    6.0    0.0
    5663     5664  2014-03-07  ...  239.0    0.0
    5664     5665  2014-03-18  ...    2.0    0.0
    5665     5666  2014-03-22  ...    5.0    0.0
    
    [5 rows x 8 columns]
  4. Subsetting using the subsetting operator, []

    The [] operator will select either a single column or a list of columns. If a slice is used, then the subsetting is on rows. This operator is typically used as a short cut for selecting columns. It is not the preferred method for subsetting and is being shown since you will likely see code that uses it.

    print(air_accidents[64:68])
        obs_num        date                        location  ... dead aboard  ground
    64       65  1922-04-13             Brooklands, England  ...  3.0    3.0     0.0
    65       66  1922-06-03   Off Folkestone, Kent, England  ...  3.0    3.0     0.0
    66       67  1922-07-04            Fuhlsbuttel, Germany  ...  2.0    3.0     0.0
    67       68  1922-07-16  Near Saverne, Lorraine, France  ...  5.0    5.0     0.0
    
    [4 rows x 8 columns]

    Note, [] should not be chained with loc[] and iloc[].

  5. Create a test data frame.

    When writing code, it is sometimes convenient to work with a smaller set of observations when you are developing and testing new code. The iloc[] attribute can be used to create a smaller data frame to test code.

    This example creates a new data frame from the first ten rows.

    air_accidents_test = air_accidents.iloc[:10, :]
    
    print(air_accidents_test)
       obs_num        date  ... aboard ground
    0        1  1908-09-17  ...    2.0    0.0
    1        2  1912-07-12  ...    5.0    0.0
    2        3  1913-08-06  ...    1.0    0.0
    3        4  1913-09-09  ...   20.0    0.0
    4        5  1913-10-17  ...   30.0    0.0
    5        6  1915-03-05  ...   41.0    0.0
    6        7  1915-09-03  ...   19.0    0.0
    7        8  1916-07-28  ...   20.0    0.0
    8        9  1916-09-24  ...   22.0    0.0
    9       10  1916-10-01  ...   19.0    0.0
    
    [10 rows x 8 columns]
  6. Summary of subsetting

    This is not an example. Instead it is a brief summary of the subsetting tools available for pandas data frames. Not all of these have been covered in the material and examples. The one that were not covered are left for you to explore on your own, as you need them.

    There are two approaches to identifying rows or columns to be subsetted (indexed.)

    • Slicing is providing a start:end<:step> definition of the rows or columns to be subsetted on. A pandas slice can be used with either interger position or named index values. The :step is optional and defaults to one. This is the number to count by from the start to end.

    • Providing a list, [], of the desired rows or columns. This is called fancy indexing.

    A series is returned when subsetting results in a single column. A single value is returned when subsetting results in a single row of a single column. If you intend to subset a single value, consider using the at[] attribute. It is intended to return a single value. A data frame is returned when a subset results in more than one column of data.

    Subsetting creates either a view (a reference to the subsetted data) or a copy (shallow copy.) Which of these two objects are returned depends on the context. If no changes are going to be made or further subsetting is to be done, then using either a view or copy is alright. If you plan to make any changes or further subset the object, then make an explicit copy of the subset using copy().

    When an object is a view of a subset of another object, the view object is indexed using the same position integer values as the original data frame.

    Indexing used on the left hand side of the assignment operator creates no view or copy. The indexing and assignment are done together in place.

4.6.5 Exercises

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

  2. Display the last three rows of the data frame using positional values to subset.

  3. Display the first, third, fifth, and seventh rows of columns two and three.

  4. Create a smaller data frame using the first 20 rows.