For some analysis and graphing tasks you will need to work with summary data in a data frame. Occasionally this might be the summary statistics of an entire data set, but more often it will be measures that summarize groups of observations. The task of producing such a data frame may be called summarizing, aggregating, or collapsing data.
For some tasks, the summary data set itself may be a suitable endpoint of your data wrangling, but for other problems you will want to merge the summary data back onto the original data set. In base R, this is often a three step process:
- produce summary statistics
- ensure the result is a data frame
While it is useful to understand how to set this up
in base R, functions in the tidyverse package
can greatly simplify all this, especially where you
want to merge summary statistics back onto your
original data set. (See Chapter 18.)
What follows here is base R.
Let’s begin with a review of how to produce various summary statistics, but with the added emphasis of turning the results into data.frames.
Let’s work with a subset of the
cars <- mtcars[,1:5]
The usual summary statistics applied to vectors - mean, minimum, maximum, length (i.e. counts) - produce results that are essentially scalars. The only hard part here is remembering to consider missing values!
Applied to multiple columns in a data.frame using the
sapply functions, our summary
statistics are returned as vectors. Here, it may not be
obvious how to turn this into a data.frame. The key is
to realize that vectors are considered column vectors by
default. The matrix transpose function,
this for us.
means <- sapply(cars, mean) as.data.frame(t(means))
mpg cyl disp hp drat 1 20.09062 6.1875 230.7219 146.6875 3.596563
More often than grand summaries we will be interested in group
summaries. To produce these summaries with a vector and a grouping
variable, you might typically use the
tapply function, returning
a vector of results.
mpgmin <- tapply(cars$mpg, cars$cyl, min) mpgmin
4 6 8 21.4 17.8 10.4
And we can turn this into a data frame, with the grouping data values as row names.
mpgmin 4 21.4 6 17.8 8 10.4
Another useful function, which produces a data frame directly, is
aggregate. For summaries of a single variable, the formula
method of specification is perhaps easiest to write and labels
the result most nicely.
aggregate(mpg ~ cyl, cars, min)
cyl mpg 1 4 21.4 2 6 17.8 3 8 10.4
The aggregate function is also useful where you have multiple variables to summarize by the same groups, which tapply will not handle.
aggregate(cars, list(cars$cyl), min)
Group.1 mpg cyl disp hp drat 1 4 21.4 4 71.1 52 3.69 2 6 17.8 6 145.0 105 2.76 3 8 10.4 8 275.8 150 2.76
Often we want to include summary data alongside the original data. This entails merging the summary data with the original data frame.
Where we have a grouped summary statistic in a named vector, we can merge this with the original data directly, skipping the data framing step. We simply use the “by” variable from the original data as a vector of category names (after coercing to character) to construct a new vector in the data frame.
mpgmin <- tapply(cars$mpg, cars$cyl, min) cars$mpgmin <- mpgmin[as.character(cars$cyl)] head(cars)
mpg cyl disp hp drat mpgmin Mazda RX4 21.0 6 160 110 3.90 17.8 Mazda RX4 Wag 21.0 6 160 110 3.90 17.8 Datsun 710 22.8 4 108 93 3.85 21.4 Hornet 4 Drive 21.4 6 258 110 3.08 17.8 Hornet Sportabout 18.7 8 360 175 3.15 10.4 Valiant 18.1 6 225 105 2.76 17.8
See Chapter 13 for more on the basics of merging data frames. In this case, we especially need to think about how our summary statistic columns are names in order to avoid confusion in the merged data.
cars <- mtcars[,1:3] # simple example maxcars <- aggregate(cars, list(cars$cyl), max) # summarize # cleanup prior to merging names(maxcars) <- c("cyl", paste(names(cars), "max", sep="_")) maxcars$cyl_max <- NULL # drop carsplus <- merge(cars, maxcars, all=TRUE) carsplus[11:20,]
cyl mpg disp mpg_max disp_max 11 4 27.3 79.0 33.9 146.7 12 6 21.0 160.0 21.4 258.0 13 6 21.0 160.0 21.4 258.0 14 6 17.8 167.6 21.4 258.0 15 6 21.4 258.0 21.4 258.0 16 6 18.1 225.0 21.4 258.0 17 6 19.2 167.6 21.4 258.0 18 6 19.7 145.0 21.4 258.0 19 8 18.7 360.0 19.2 472.0 20 8 17.3 275.8 19.2 472.0
Is this a balanced experiment (are the same number of chickens given each feed)?
Which feed was associated with the largest variation (standard deviation) in weight?
Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.