18 Aggregating

Aggregation is the process of turning many datapoints into fewer datapoints, typically in the form of summary statistics. Examples include calculating the total income by family or the mean test score by state.

Load the airquality dataset.

air <- airquality

The tidyverse includes the dplyr package, which has functions for aggregating dataframes. Load the dplyr package now.

(Although most of what we will discuss is easier with dplyr, any of this can be accomplished with base R functions as well. These usually involve several steps, and intermediate data objects that may not be data frames. See the chapter on Aggregating with base R if you are interested in the details.)

library(dplyr)

dplyr’s summarize() function applies a function to the variables in a dataset. A simple use of summarize() is calculating the mean of a single column.

air |> 
  summarize(TempAvg = mean(Temp))

   TempAvg
1 77.88235

summarize() can also be used in conjunction with across() to apply a function to multiple variables. across() accepts select() syntax, so we can use the functions reviewed in the section on Subsetting by Columns. For now, we can just use everything() to apply the function to all of the columns.

air |> 
  summarize(across(everything(), 
                   mean, 
                   na.rm = T))

Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(everything(), mean, na.rm = T)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

     Ozone  Solar.R     Wind     Temp    Month      Day
1 42.12931 185.9315 9.957516 77.88235 6.993464 15.80392

Our usage of summarize() so far mirrors that of apply(). The summary() function, while less flexible, also accomplishes the task of calculating means alongside several other summary statistics, and it is more concise. In other words, the examples above are intended only to illustrate summarize()’s functionality rather than provide examples of efficient coding.

apply(air, 2, mean, na.rm = T)

     Ozone    Solar.R       Wind       Temp      Month        Day 
 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922

summary(air)

     Ozone           Solar.R           Wind             Temp           Month            Day      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0  
 NA's   :37       NA's   :7

18.1 Summarizing by Grouping Variables

The real power of summarize() exists when it is used to aggregate across grouping variables. We should first add some variables to aggregate along. We can create variables for the day of the week of each observation in airquality, and we can do this in two steps.

We covered Dates in a previous chapter, but it is never too late for a little more practice. The airquality dataset has the month and the day, but not the year. When in doubt, read the documentation. Reading the page for help(airquality), we see that the data is all from 1973. Standardize the date format with the ISO 8601 standard (YYYY-MM-DD).

air <- 
  air |> 
  mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")))

After each step, it is good to browse the dataset to confirm the function accomplished what you expected. It is important to catch problems early on, especially when you have a lot of wrangling to do. Click on air in your global environment, or run the command View(air) to open the viewer.

It is now possible to calculate which day of the week each datapoint comes from with the weekdays() function. Let’s also add an indicator variable for whether the date is a weekend, 1 for yes and 0 for no, with the help of the ifelse() function.

air <-
  air |> 
  mutate(DayOfWeek = weekdays(Date),
         Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0))

Now, we can aggregate the dataframe by any number of grouping variables. Let’s calculate some descriptive statistics by the Month and Weekend variables.

group_by() takes column names as arguments. After your dataset is grouped, use summarize() to perform calculations along the grouping variables.

In the example below, the data is grouped by Month and Weekend. TempAvg will be the mean of Temp by each value of Weekend for each value of Month. So, we should get an average temperature for the weekdays in May, the weekends in May, the weekdays in June, the weekends in June, and so on, for every combination of the grouping variables. n() counts the number of observations in each group.

Since the dataframe will be much smaller after summarizing, we can print it to the console.

air_summary <- 
  air |> 
  group_by(Month, Weekend) |>
  summarize(TempAvg = mean(Temp),
            TempSD = sd(Temp),
            TempMax = max(Temp),
            TempMin = min(Temp),
            N_days = n())

`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.

air_summary

# A tibble: 10 × 7
# Groups:   Month [5]
   Month Weekend TempAvg TempSD TempMax TempMin N_days
   <int>   <dbl>   <dbl>  <dbl>   <int>   <int>  <int>
 1     5       0    66.5   7.19      81      57     23
 2     5       1    62.8   5.20      69      56      8
 3     6       0    79.6   6.38      93      65     21
 4     6       1    78     7.35      90      67      9
 5     7       0    83.5   3.65      92      73     22
 6     7       1    84.9   5.78      92      74      9
 7     8       0    84.3   7.30      97      72     23
 8     8       1    83     4.11      86      76      8
 9     9       0    76.3   8.64      93      63     20
10     9       1    78.1   8.06      92      68     10

Of course, there is no limit on how many functions you can pipe, so all of the above steps could be combined together as follows:

air_summary <- 
  airquality |> 
  mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")),
         DayOfWeek = weekdays(Date),
         Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)) |> 
  group_by(Month, Weekend) |> 
  summarize(TempAvg = mean(Temp),
            TempSD = sd(Temp),
            TempMax = max(Temp),
            TempMin = min(Temp),
            N_days = n())

`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.

This is a good place to stop for a moment and be thankful for piping. The code below is the same as the code above, but instead of using pipes, it is in a nested format and without any line breaks. Of course, we could make it a little prettier by adding in some line breaks or by taking the output of each function and saving it as a new object to be used in the following function.

Looking at the code below, ask yourself a couple questions:

Is it easy to read from inside to outside?
How long did it take you to find the name of the dataset we are using?

air_summary <- summarize(group_by(mutate(airquality, Date = as.Date(paste("1973", Month, Day, sep = "-")), DayOfWeek = weekdays(Date), Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)), Month, Weekend), TempAvg = mean(Temp), TempSD = sd(Temp), TempMax = max(Temp), TempMin = min(Temp), N_days = n())

`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.

18.2 Adding Group-Level Information without Removing Rows

The code in the above section reduced the number of rows in the dataframe when calculating group-level variables. This is useful in situations where you want to create summary statistics for succinct tables, such as mean income by level of education.

However, if you want to keep your whole dataset but add information along a grouping variable, you can do that too. Just change summarize() to mutate() in the previous example:

air_summary2 <-
  airquality |> 
  mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")),
         DayOfWeek = weekdays(Date),
         Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)) |> 
  group_by(Month, Weekend) |> 
  mutate(TempAvg = mean(Temp),
         TempSD = sd(Temp),
         TempMax = max(Temp),
         TempMin = min(Temp),
         N_days = n())

air_summary2 |> select(-c(1:3, 7)) # drop a few columns to fit the others on the screen

# A tibble: 153 × 10
# Groups:   Month, Weekend [10]
    Temp Month   Day DayOfWeek Weekend TempAvg TempSD TempMax TempMin N_days
   <int> <int> <int> <chr>       <dbl>   <dbl>  <dbl>   <int>   <int>  <int>
 1    67     5     1 Tuesday         0    66.5   7.19      81      57     23
 2    72     5     2 Wednesday       0    66.5   7.19      81      57     23
 3    74     5     3 Thursday        0    66.5   7.19      81      57     23
 4    62     5     4 Friday          0    66.5   7.19      81      57     23
 5    56     5     5 Saturday        1    62.8   5.20      69      56      8
 6    66     5     6 Sunday          1    62.8   5.20      69      56      8
 7    65     5     7 Monday          0    66.5   7.19      81      57     23
 8    59     5     8 Tuesday         0    66.5   7.19      81      57     23
 9    61     5     9 Wednesday       0    66.5   7.19      81      57     23
10    69     5    10 Thursday        0    66.5   7.19      81      57     23
# ℹ 143 more rows

Note that, unlike in Stata where we need to sort our data with bysort in order to perform group-wise calculations, R does not require our dataframe to be sorted.

The group-level summary statistics are repeated alongside the individual cases.

We can check the grouping variables in a dataframe with group_keys().

air_summary2 |> group_vars()

[1] "Month"   "Weekend"

It is good practice to ungroup() your dataframe after carrying out operations. If you do not, you may forget and unknowingly perform group-wise calculations when you meant to perform global calculations.

air_summary2 <- 
  air_summary2 |> 
  ungroup()

The empty vector output from group_vars() confirms that our dataframe is no longer grouped.

air_summary2 |> group_vars()

character(0)

18.3 Exercises

Use the chickwts dataset.

Is this a balanced experiment (same number of individuals/chickens in each condition/feed)?
Which feed was associated with the largest variation (standard deviation) in weight?
Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.