18 Aggregating
Aggregation is the process of turning many datapoints into fewer datapoints, typically in the form of summary statistics. Examples include calculating the total income by family or the mean test score by state.
Load the airquality
dataset.
air <- airquality
The tidyverse includes the dplyr
package, which has functions for aggregating dataframes. Load the dplyr
package now.
(Although most of what we will discuss is easier with dplyr
, any of this
can be accomplished with base R functions as well. These usually involve
several steps, and intermediate data objects that may not be data frames.
See the chapter on Aggregating with base R if you are interested in the details.)
library(dplyr)
dplyr
’s summarize()
function applies a function to the variables in a dataset. A simple use of summarize()
is calculating the mean of a single column.
air |>
summarize(TempAvg = mean(Temp))
TempAvg
1 77.88235
summarize()
can also be used in conjunction with across()
to apply a function to multiple variables. across()
accepts select()
syntax, so we can use the functions reviewed in the section on Subsetting by Columns. For now, we can just use everything()
to apply the function to all of the columns.
air |>
summarize(across(everything(),
mean,
na.rm = T))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(everything(), mean, na.rm = T)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
Ozone Solar.R Wind Temp Month Day
1 42.12931 185.9315 9.957516 77.88235 6.993464 15.80392
Our usage of summarize()
so far mirrors that of apply()
. The summary()
function, while less flexible, also accomplishes the task of calculating means alongside several other summary statistics, and it is more concise. In other words, the examples above are intended only to illustrate summarize()
’s functionality rather than provide examples of efficient coding.
apply(air, 2, mean, na.rm = T)
Ozone Solar.R Wind Temp Month Day
42.129310 185.931507 9.957516 77.882353 6.993464 15.803922
summary(air)
Ozone Solar.R Wind Temp Month Day
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000 Min. : 1.0
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000 1st Qu.: 8.0
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000 Median :16.0
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993 Mean :15.8
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000 3rd Qu.:23.0
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000 Max. :31.0
NA's :37 NA's :7
18.1 Summarizing by Grouping Variables
The real power of summarize()
exists when it is used to aggregate across grouping variables. We should first add some variables to aggregate along. We can create variables for the day of the week of each observation in airquality
, and we can do this in two steps.
We covered Dates in a previous chapter, but it is never too late for a little more practice. The airquality
dataset has the month and the day, but not the year. When in doubt, read the documentation. Reading the page for help(airquality)
, we see that the data is all from 1973. Standardize the date format with the ISO 8601 standard (YYYY-MM-DD).
air <-
air |>
mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")))
After each step, it is good to browse the dataset to confirm the function accomplished what you expected. It is important to catch problems early on, especially when you have a lot of wrangling to do. Click on air
in your global environment, or run the command View(air)
to open the viewer.
It is now possible to calculate which day of the week each datapoint comes from with the weekdays()
function. Let’s also add an indicator variable for whether the date is a weekend, 1 for yes and 0 for no, with the help of the ifelse()
function.
air <-
air |>
mutate(DayOfWeek = weekdays(Date),
Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0))
Now, we can aggregate the dataframe by any number of grouping variables. Let’s calculate some descriptive statistics by the Month
and Weekend
variables.
group_by()
takes column names as arguments. After your dataset is grouped, use summarize()
to perform calculations along the grouping variables.
In the example below, the data is grouped by Month
and Weekend
. TempAvg
will be the mean of Temp
by each value of Weekend
for each value of Month.
So, we should get an average temperature for the weekdays in May, the weekends in May, the weekdays in June, the weekends in June, and so on, for every combination of the grouping variables. n()
counts the number of observations in each group.
Since the dataframe will be much smaller after summarizing, we can print it to the console.
air_summary <-
air |>
group_by(Month, Weekend) |>
summarize(TempAvg = mean(Temp),
TempSD = sd(Temp),
TempMax = max(Temp),
TempMin = min(Temp),
N_days = n())
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
air_summary
# A tibble: 10 × 7
# Groups: Month [5]
Month Weekend TempAvg TempSD TempMax TempMin N_days
<int> <dbl> <dbl> <dbl> <int> <int> <int>
1 5 0 66.5 7.19 81 57 23
2 5 1 62.8 5.20 69 56 8
3 6 0 79.6 6.38 93 65 21
4 6 1 78 7.35 90 67 9
5 7 0 83.5 3.65 92 73 22
6 7 1 84.9 5.78 92 74 9
7 8 0 84.3 7.30 97 72 23
8 8 1 83 4.11 86 76 8
9 9 0 76.3 8.64 93 63 20
10 9 1 78.1 8.06 92 68 10
Of course, there is no limit on how many functions you can pipe, so all of the above steps could be combined together as follows:
air_summary <-
airquality |>
mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")),
DayOfWeek = weekdays(Date),
Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)) |>
group_by(Month, Weekend) |>
summarize(TempAvg = mean(Temp),
TempSD = sd(Temp),
TempMax = max(Temp),
TempMin = min(Temp),
N_days = n())
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
This is a good place to stop for a moment and be thankful for piping. The code below is the same as the code above, but instead of using pipes, it is in a nested format and without any line breaks. Of course, we could make it a little prettier by adding in some line breaks or by taking the output of each function and saving it as a new object to be used in the following function.
Looking at the code below, ask yourself a couple questions:
- Is it easy to read from inside to outside?
- How long did it take you to find the name of the dataset we are using?
air_summary <- summarize(group_by(mutate(airquality, Date = as.Date(paste("1973", Month, Day, sep = "-")), DayOfWeek = weekdays(Date), Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)), Month, Weekend), TempAvg = mean(Temp), TempSD = sd(Temp), TempMax = max(Temp), TempMin = min(Temp), N_days = n())
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
18.2 Adding Group-Level Information without Removing Rows
The code in the above section reduced the number of rows in the dataframe when calculating group-level variables. This is useful in situations where you want to create summary statistics for succinct tables, such as mean income by level of education.
However, if you want to keep your whole dataset but add information along a grouping variable, you can do that too. Just change summarize()
to mutate()
in the previous example:
air_summary2 <-
airquality |>
mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")),
DayOfWeek = weekdays(Date),
Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)) |>
group_by(Month, Weekend) |>
mutate(TempAvg = mean(Temp),
TempSD = sd(Temp),
TempMax = max(Temp),
TempMin = min(Temp),
N_days = n())
air_summary2 |> select(-c(1:3, 7)) # drop a few columns to fit the others on the screen
# A tibble: 153 × 10
# Groups: Month, Weekend [10]
Temp Month Day DayOfWeek Weekend TempAvg TempSD TempMax TempMin N_days
<int> <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 67 5 1 Tuesday 0 66.5 7.19 81 57 23
2 72 5 2 Wednesday 0 66.5 7.19 81 57 23
3 74 5 3 Thursday 0 66.5 7.19 81 57 23
4 62 5 4 Friday 0 66.5 7.19 81 57 23
5 56 5 5 Saturday 1 62.8 5.20 69 56 8
6 66 5 6 Sunday 1 62.8 5.20 69 56 8
7 65 5 7 Monday 0 66.5 7.19 81 57 23
8 59 5 8 Tuesday 0 66.5 7.19 81 57 23
9 61 5 9 Wednesday 0 66.5 7.19 81 57 23
10 69 5 10 Thursday 0 66.5 7.19 81 57 23
# ℹ 143 more rows
Note that, unlike in Stata where we need to sort our data with bysort
in order to perform group-wise calculations, R does not require our dataframe to be sorted.
The group-level summary statistics are repeated alongside the individual cases.
We can check the grouping variables in a dataframe with group_keys()
.
air_summary2 |> group_vars()
[1] "Month" "Weekend"
It is good practice to ungroup()
your dataframe after carrying out operations. If you do not, you may forget and unknowingly perform group-wise calculations when you meant to perform global calculations.
air_summary2 <-
air_summary2 |>
ungroup()
The empty vector output from group_vars()
confirms that our dataframe is no longer grouped.
air_summary2 |> group_vars()
character(0)
18.3 Exercises
Use the chickwts
dataset.
Is this a balanced experiment (same number of individuals/chickens in each condition/feed)?
Which feed was associated with the largest variation (standard deviation) in weight?
Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.