2 Apply Functions
Writing for loops can often be abstracted into two related tasks:
- Writing down the steps to loop over, the loop “body”
- Setting up an object to collect the result, “initialization”
Where writing for loops itself becomes repetitive (where we have many
similar loops), the first task can be thought of as writing functions.
The second task, “initialization”, can be addressed through the use
of the apply family of functions. We can
- apply a function to columns of a data frame or matrix
- apply a function to rows of a data frame or matrix
- tapply a function to groups of values
- lapply a function to items in a list
(and more)
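To make the two tasks concrete, here is a minimal sketch that computes the column means of mtcars twice: once with an explicit for loop, where the result object has to be set up by hand, and once with apply, which takes care of collecting the results. The names loop_means and apply_means are illustrative only.

```r
# Loop version: we create and name the result object ourselves
loop_means <- numeric(ncol(mtcars))      # "initialization"
names(loop_means) <- names(mtcars)
for (v in names(mtcars)) {
  loop_means[v] <- mean(mtcars[[v]])     # the loop "body"
}

# Apply version: the result object is built for us
apply_means <- apply(mtcars, 2, mean)

# Both approaches should agree
all.equal(loop_means, apply_means)
```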
2.1 Apply a function to variables in a data frame
Returning to the table of means and standard deviations, we can use
apply, which has the form
apply(X, MARGIN, FUN, ...)
where X is a data frame or matrix, MARGIN determines whether
you are looping over columns (2) or rows (1), and FUN is the
function you wish to employ.
cmeans <- apply(mtcars, 2, mean)
csds <- apply(mtcars, 2, sd)
data.frame(means=cmeans, stddev=csds)
means stddev
mpg 20.090625 6.0269481
cyl 6.187500 1.7859216
disp 230.721875 123.9386938
hp 146.687500 68.5628685
drat 3.596563 0.5346787
wt 3.217250 0.9784574
qsec 17.848750 1.7869432
vs 0.437500 0.5040161
am 0.406250 0.4989909
gear 3.687500 0.7378041
carb 2.812500 1.6152000
Here, each use of apply returns a named vector automatically.
The ... (ellipsis) in our syntax diagram indicates that we can include
additional arguments, which are passed on to the function FUN.
So if we return to our simulated survey responses with the missing
values, we can write
qmeans <- apply(q, 2, mean, na.rm=TRUE)
qsds <- apply(q, 2, sd, na.rm=TRUE)
data.frame(means=qmeans, stddev=qsds)
means stddev
V1 4.400000 0.8944272
V2 3.428571 1.6183472
V3 2.571429 1.3972763
V4 2.750000 1.7078251
V5 3.000000 2.0000000
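Since the survey data q is defined elsewhere, the following self-contained sketch shows the same idea on a small made-up data frame. The object resp and its columns a and b are hypothetical and exist only for illustration.

```r
# Hypothetical survey-style data with some missing responses
resp <- data.frame(a = c(1, 2, NA, 4),
                   b = c(5, NA, 3, 1))

# Without na.rm, any column containing a missing value yields NA
apply(resp, 2, mean)

# na.rm = TRUE is not an argument of apply; it is handed on to mean()
apply(resp, 2, mean, na.rm = TRUE)

# Any named argument of FUN can be supplied the same way
apply(resp, 2, quantile, probs = 0.5, na.rm = TRUE)
```

Naming the extra arguments (na.rm = TRUE rather than a bare TRUE) keeps them from being confused with the arguments of apply itself.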
2.2 Apply a function to rows in a data frame
This is simply a matter of changing the MARGIN from 2 to 1.
qrowmeans <- apply(q, 1, mean, na.rm=TRUE)
qrowsds <- apply(q, 1, sd, na.rm=TRUE)
data.frame(means=qrowmeans, stddev=qrowsds)
The resulting table has one line per row of q, rather than one line
per variable.
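As a small self-contained illustration of what MARGIN changes, the sketch below uses a hypothetical matrix m (not the survey data q from the text) with three rows and four columns, and compares the shape of the two results.

```r
# Hypothetical responses: 3 respondents (rows) by 4 questions (columns)
m <- matrix(c(1, 2,  3, 4,
              5, NA, 5, 5,
              2, 2,  4, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), paste0("V", 1:4)))

by_col <- apply(m, 2, mean, na.rm = TRUE)  # one value per question
by_row <- apply(m, 1, mean, na.rm = TRUE)  # one value per respondent

length(by_col)  # equals ncol(m)
length(by_row)  # equals nrow(m)
names(by_row)   # taken from the row names of m
```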
2.3 Apply a function to groups of observations
Returning to the problem of means and standard deviations within groups
defined by cyl in mtcars, we switch to tapply.
tapply(X, INDEX, FUN, ...)
Here X is usually a vector, and the INDEX is a factor, something
that can be coerced into a factor, or a list of factors.
mean_bycyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
sd_bycyl <- tapply(mtcars$mpg, mtcars$cyl, sd)
data.frame(mpg_mean=mean_bycyl, mpg_sd=sd_bycyl)
mpg_mean mpg_sd
4 26.66364 4.509828
6 19.74286 1.453567
8 15.10000 2.560048
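The case where INDEX is a list of factors can be sketched as follows: grouping mpg by both cyl and am gives one cell per combination of levels. The object name mpg_by_cyl_am and the list names cyl and am are chosen here for illustration.

```r
# Mean mpg for every combination of cylinders and transmission type
mpg_by_cyl_am <- tapply(mtcars$mpg,
                        list(cyl = mtcars$cyl, am = mtcars$am),
                        mean)

# The result is a matrix: cylinder groups in rows, transmission groups in columns
dim(mpg_by_cyl_am)

# Cells are indexed by the factor levels, given as character strings
mpg_by_cyl_am["4", "1"]
```

With two grouping factors the result is a two-way table rather than a vector, so it would need reshaping before it could be combined into a data frame like the one above.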
2.4 Apply a function to a list
Here we have two functions: the aptly named lapply, and sapply.
The former always returns a list; the latter simplifies the result
where it can, and so will (usually) return a named vector.
lapply(X, FUN, ...)
sapply(X, FUN, ...)
means_list <- lapply(mtcars, mean)
means_vector <- sapply(mtcars, mean)
means_vector
mpg cyl disp hp drat wt qsec vs
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500
am gear carb
0.406250 3.687500 2.812500
means_list
$mpg
[1] 20.09062
$cyl
[1] 6.1875
$disp
[1] 230.7219
$hp
[1] 146.6875
$drat
[1] 3.596563
$wt
[1] 3.21725
$qsec
[1] 17.84875
$vs
[1] 0.4375
$am
[1] 0.40625
$gear
[1] 3.6875
$carb
[1] 2.8125
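The "(usually)" deserves a brief illustration. The sketch below shows how the shape of what sapply returns depends on what FUN returns for each element, and uses vapply, another member of the apply family, as one possible way to state the expected shape explicitly. The object names are illustrative.

```r
# One value per element: sapply simplifies to a named vector
means_vector <- sapply(mtcars, mean)

# Two values per element: sapply simplifies to a matrix instead
ranges <- sapply(mtcars, range)
dim(ranges)    # one row each for min and max, one column per variable

# Results of differing lengths cannot be simplified, so a list comes back
uniq <- sapply(mtcars, unique)
is.list(uniq)

# vapply declares the expected result (here a single number) up front
# and raises an error if FUN returns anything else
means_checked <- vapply(mtcars, mean, numeric(1))
```

Inside functions or scripts, where a change in the return type could go unnoticed, lapply or vapply may be the more predictable choice.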