1 Explicit Loops
It can be easiest to think about a repetition problem in explicit loops
(for
loops), because
all of the mechanics of statement execution remain visible in your code. We’ll
want to think about what steps are repeated (the body of a for
loop), what
data objects vary across repetitions and how we want to sequence them, and
if we are collecting results, where to place them.
- repeated statements
- varying data
- a results object
1.1 Loop over variables in a data frame
One of the most common situations you will encounter is where you want to repeat the same actions for several variables in a data frame.
Suppose you wanted a table with means, standard deviations,
and the number of non-missing observations
of all your analysis variables. As an example, let’s look at a pared down
version of the mtcars
data.
cars <- mtcars[, c("mpg", "wt", "disp", "hp")]
Calculating the individual values for a table is easy enough.
mean(cars$mpg)
[1] 20.09062
sd(cars$mpg)
[1] 6.026948
sum(!is.na(cars$mpg))
[1] 32
These are the statements we want to repeat, while the column in the data frame varies. We even have a function that will return all the means as a single data object … but nothing similar for standard deviations or counts.
colMeans(cars)
mpg wt disp hp
20.09062 3.21725 230.72188 146.68750
To get something similar for standard deviations and for counts
we can loop over the variables. We’ll use a for loop (see help("for")
)
for (var in seq) expression
The keywords for
and in
are required, as are the parentheses.
1.1.1 Loop over an index
Consider this code:
for (i in 1:4) {
print(i)
print(sd(cars[,i]))
}
[1] 1
[1] 6.026948
[1] 2
[1] 0.9784574
[1] 3
[1] 123.9387
[1] 4
[1] 68.56287
Here seq
is a vector (a “sequence”) of variable positions, 1:4
. The var
is i
,
a variable that will take on the value of each element in seq
. In this
example, each i
will be the position of a variable in cars
.
The expression
being evaluated is a compound expression, i.e. the two
expressions enclosed in braces, in order.
{
print(i)
print(sd(cars[,i]))
}
We make the print
explicit here, because data objects are only printed
automatically at the “top” level of execution.
To save our results as a table, we’ll want to give the table a name. One
way to do this here would be to place the standard deviations within a vector, v
.
In order to address position i
in v
efficiently, we need to setup
(“initialize”) the vector first. In order to make sense of the result,
we will also name the elements of our vector.
v <- rep(NA, 4)
names(v) <- names(cars)
for (i in 1:4) {
v[i] <- sd(cars[,i])
}
v
mpg wt disp hp
6.0269481 0.9784574 123.9386938 68.5628685
Because our expression is now a single statement, the braces are no longer necessary, and this could even be written on one line. (I think using braces makes it easier to read, even when written on one line. That makes it easier to debug if you make a mistake.)
for (i in 1:4) v[i] <- sd(cars[,i])
At this point we might take our vector of column means and our vector of standard deviations and combine them in a data frame or a matrix. To complete our table with observation counts we could write yet another loop. Or we could think about doing all of this work with a single loop. For the latter approach we want to collect our results in a matrix with named dimensions.
v <- matrix(NA, nrow=4, ncol=3)
rownames(v) <- names(cars) # variable names for row names
colnames(v) <- c("mean", "sd", "N") # statistics names for col names
for (i in 1:4) {
v[i, "mean"] <- mean(cars[,i])
v[i, "sd"] <- sd(cars[,i])
v[i, "N"] <- sum(!is.na(cars[,i]))
}
v
mean sd N
mpg 20.09062 6.0269481 32
wt 3.21725 0.9784574 32
disp 230.72188 123.9386938 32
hp 146.68750 68.5628685 32
1.1.2 Loop over variable names
Because a data frame always has named columns, we could just as easily use
names as positions to index a data frame. This can be especially helpful if
we are trying to analyze a few selected variables from a larger data frame
(for example, using mtcars
.)
Here, the sequence is a character vector of column names.
analysis_vars <- c("mpg", "wt", "disp", "hp")
for (i in analysis_vars) {
print(i)
print(sd(mtcars[,i]))
}
[1] "mpg"
[1] 6.026948
[1] "wt"
[1] 0.9784574
[1] "disp"
[1] 123.9387
[1] "hp"
[1] 68.56287
Our code is very similar to indexing by position, without needing to create a subset of our data frame or trying to count columns.
v <- matrix(NA, nrow=length(analysis_vars), ncol=3)
rownames(v) <- analysis_vars
colnames(v) <- c("mean", "sd", "N")
for (i in analysis_vars) {
v[i, "mean"] <- mean(mtcars[,i])
v[i, "sd"] <- sd(mtcars[,i])
v[i, "N"] <- sum(!is.na(mtcars[,i]))
}
v
mean sd N
mpg 20.09062 6.0269481 32
wt 3.21725 0.9784574 32
disp 230.72188 123.9386938 32
hp 146.68750 68.5628685 32
1.2 Loops over rows in a data frame
Because our binary operators and many other functions are “vectorized”, we often don’t need to think about looping over the rows of a data frame. Code like
dfr <- data.frame(x=rnorm(15), y=runif(15))
dfr$z <- dfr$x + dfr$y
is implicitly looping over the rows of dfr
. But occasionally you will
find you need to operate within each row, and simple vectorization will not
work.
As an example consider the situation where survey respondents have answered five related questions, each on a scale of 1 to 5. You want to construct a scale that is the mean of each person’s responses, but there are some missing answers.
# simulate data for this problem
set.seed(20210205)
q <- as.data.frame(matrix(sample(c(1:5,NA), 35, replace=TRUE), ncol=5))
q
V1 V2 V3 V4 V5
1 4 2 3 1 3
2 5 3 1 NA 5
3 NA 3 5 3 NA
4 5 5 3 2 5
5 3 5 2 NA NA
6 5 5 1 NA 1
7 NA 1 3 5 1
The vectorized approach will only work where there are no missing values.
(V1 + V2 + V3 + V4 + V5)/5
With a loop, we can make use of the mean
function and it’s na.rm
argument.
Vscale <- rep(NA, nrow(q))
names(Vscale) <- row.names(q)
for (i in 1:nrow(q)) {
v <- as.matrix(q[i,])
# an odd coercion problem here
# because mean() does not have a data.frame method
Vscale[i] <- mean(v, na.rm=TRUE)
}
Vscale
1 2 3 4 5 6 7
2.600000 3.500000 3.666667 4.000000 3.333333 3.000000 2.500000
(For this specific problem of row means, there are two more graceful
solutions. One is the rowMeans
function; the other is to apply
the mean
function, as discussed below.)
1.3 Loop over groups
When exploring relationships in our data, we often want to analyze one variable within groups denoted by values in another, categorical variable.
Suppose we wanted to see the means and standard deviations of mpg
within different levels of the cyl
variable of mtcars
.
As the data set is given, cyl
is a numeric variable, and we could
loop over a vector of unique numeric values
cylvals <- unique(mtcars$cyl)
cylvals
[1] 6 4 8
The value order here doesn’t matter for running a for
loop, but
we do need to pay attention in order to properly interpret our results!
And these values would not be convenient for indexing a results table.
We could also coerce cyl
to a factor, since that is how we are
using it in this analysis. Then we could loop over the levels
of the factor.
cars <- mtcars
cars$cyl <- factor(mtcars$cyl)
cyllevels <- levels(cars$cyl)
cyllevels
[1] "4" "6" "8"
Here, keep in mind that these levels are value labels, not numbers!
cyltable <- matrix(NA, nrow=nlevels(cars$cyl), ncol=2)
# dimnames() print more nicely than rownames() and colnames()
dimnames(cyltable) <- list(cyls=cyllevels, stats=c("mean", "sd"))
for (i in cyllevels) {
v <- cars$mpg[cars$cyl==i]
cyltable[i, "mean"] <- mean(v)
cyltable[i, "sd"] <- sd(v)
}
cyltable
stats
cyls mean sd
4 26.66364 4.509828
6 19.74286 1.453567
8 15.10000 2.560048
1.4 Loop over parameter values
We are not limited to sequences of integers, nor do the numbers have to represent column positions. The sequence can be any arbitrary vector, in any arbitrary order.
Suppose we wanted to simulate random samples of various sizes in order to better understand how the standard error of the mean behaves.
for (n in c(5, 10, 50, 100, 500, 1000)) {
cat(n, "\n") # use cat() instead of print() for simpler output
cat(sd(rnorm(n)/sqrt(n)), "\n")
cat("\n")
}
5
0.3710791
10
0.2181523
50
0.1489542
100
0.09735525
500
0.0420442
1000
0.03186828
While this example gives us the general idea that bigger samples yield smaller standard errors, we should really replicate these samples many times. For this we can use a loop within a loop.
samples <- c(5, 10, 50, 100, 500, 1000)
reps <- 200
results <- matrix(NA, ncol=length(samples), nrow=reps,
dimnames=list(rep=1:reps,
size=samples))
for (n in samples) {
for (r in 1:reps){
results[r, as.character(n)] <- (sd(rnorm(n))/sqrt(n))
}
}
colMeans(results)
5 10 50 100 500 1000
0.41996018 0.30445001 0.14096484 0.10018702 0.04454776 0.03158740
And we might notice that a 100-fold increase in sample size gives us an extra decimal place of precision in our estimates.
1.5 Loop over data objects
Our sequence does not have to be a vector. Lists are also ordered objects, so it is possible to loop over the items in a list. Understood as a list, a data frame is a list of vectors. So our standard deviation problem could be approached by
cars <- mtcars[, 1:4]
for (i in cars) {
print(sd(i))
}
[1] 6.026948
[1] 1.785922
[1] 123.9387
[1] 68.56287
This gives us very clean and simple code. However, this leaves us without a convenient index for collecting results - this technique is most useful when we aren’t saving results as a single data object.