# 1 Explicit Loops

It can be easiest to think about a repetition problem in explicit loops
(`for` loops), because
all of the mechanics of statement execution remain visible in your code. We’ll
want to think about what *steps* are repeated (the *body* of a `for` loop), what
data objects *vary* across repetitions and how we want to *sequence* them, and,
if we are *collecting results*, where to place them.

- repeated statements
- varying data
- a results object
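
As a minimal sketch of these three parts, the loop below squares each element of a small vector. The names `x` and `squares` and the values are made up for illustration.

```r
# varying data: the values we step through (hypothetical example values)
x <- c(2, 4, 6)

# results object: created before the loop, one slot per repetition
squares <- rep(NA, length(x))

# repeated statements: the body of the loop
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}
squares
```

Each of the examples that follow can be read the same way: find the body, the sequence, and the place where results are stored.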

## 1.1 Loop over variables in a data frame

One of the most common situations you will encounter is where you want to repeat the same actions for several variables in a data frame.

Suppose you wanted a table with means, standard deviations,
and the number of non-missing observations
of all your analysis variables. As an example, let’s look at a pared-down
version of the `mtcars` data.

`cars <- mtcars[, c("mpg", "wt", "disp", "hp")]`

Calculating the individual values for a table is easy enough.

```
mean(cars$mpg)
[1] 20.09062
sd(cars$mpg)
[1] 6.026948
sum(!is.na(cars$mpg))
[1] 32
```

These are the statements we want to repeat, while the column in the data frame varies. We even have a function that will return all the means as a single data object … but nothing similar for standard deviations or counts.

`colMeans(cars)`

```
      mpg        wt      disp        hp
 20.09062   3.21725 230.72188 146.68750
```

To get something similar for standard deviations and for counts
we can loop over the variables. We’ll use a *for* loop (see `help("for")`):

`for (var in seq) expression`

The keywords `for` and `in` are required, as are the parentheses.

### 1.1.1 Loop over an index

Consider this code:

```
for (i in 1:4) {
  print(i)
  print(sd(cars[,i]))
}
```

```
[1] 1
[1] 6.026948
[1] 2
[1] 0.9784574
[1] 3
[1] 123.9387
[1] 4
[1] 68.56287
```

Here `seq` is a vector (a “sequence”) of variable positions, `1:4`. The `var` is `i`,
a variable that will take on the value of each element in `seq`. In this
example, each `i` will be the position of a variable in `cars`.

The `expression` being evaluated is a *compound* expression, i.e. the two
expressions enclosed in braces, evaluated in order.

```
{
  print(i)
  print(sd(cars[,i]))
}
```

We make the `print` explicit here, because data objects are only printed
automatically at the “top” level of execution.
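
A small sketch of the difference, assuming the same `cars` subset as above: the first loop computes the standard deviations but shows nothing, while the second prints them.

```r
cars <- mtcars[, c("mpg", "wt", "disp", "hp")]

# At the top level, the value of an expression is printed automatically
sd(cars[, 1])

# Inside a loop, nothing is printed unless we ask for it
for (i in 1:2) {
  sd(cars[, i])         # computed, then silently discarded
}

for (i in 1:2) {
  print(sd(cars[, i]))  # printed
}
```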

To save our results as a table, we’ll want to give the table a name. One
way to do this here would be to place the standard deviations within a vector, `v`.

In order to address position `i` in `v` efficiently, we need to set up
(“initialize”) the vector first. In order to make sense of the result,
we will also name the elements of our vector.

```
v <- rep(NA, 4)
names(v) <- names(cars)
for (i in 1:4) {
  v[i] <- sd(cars[,i])
}
v
```

```
        mpg          wt        disp          hp
  6.0269481   0.9784574 123.9386938  68.5628685
```

Because our expression is now a single statement, the braces are no longer necessary, and this could even be written on one line. (I think keeping the braces makes a loop easier to read, even when it is written on one line, and easier to debug if you make a mistake.)

`for (i in 1:4) v[i] <- sd(cars[,i])`
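
The hard-coded `1:4` only works while `cars` has exactly four columns. One possible variation, sketched here rather than taken from the example above, is to derive both the length of the results vector and the sequence from the data frame itself with `ncol()` and `seq_along()`.

```r
cars <- mtcars[, c("mpg", "wt", "disp", "hp")]

v <- numeric(ncol(cars))      # a numeric vector of zeros, one per column
names(v) <- names(cars)
for (i in seq_along(cars)) {  # 1, 2, ..., up to the number of columns
  v[i] <- sd(cars[, i])
}
v
```

Written this way, the loop keeps working if columns are later added to or dropped from `cars`.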

At this point we might take our vector of column means and our vector of standard deviations and combine them in a data frame or a matrix. To complete our table with observation counts we could write yet another loop. Or we could think about doing all of this work with a single loop. For the latter approach we want to collect our results in a matrix with named dimensions.

```
v <- matrix(NA, nrow=4, ncol=3)
rownames(v) <- names(cars) # variable names for row names
colnames(v) <- c("mean", "sd", "N") # statistics names for col names
for (i in 1:4) {
  v[i, "mean"] <- mean(cars[,i])
  v[i, "sd"] <- sd(cars[,i])
  v[i, "N"] <- sum(!is.na(cars[,i]))
}
v
```

```
          mean          sd  N
mpg   20.09062   6.0269481 32
wt     3.21725   0.9784574 32
disp 230.72188 123.9386938 32
hp   146.68750  68.5628685 32
```
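
For comparison, here is a sketch of the other approach mentioned earlier: build separate vectors and combine them afterwards. The names `means`, `sds`, `counts`, and `tab` are chosen for illustration.

```r
cars <- mtcars[, c("mpg", "wt", "disp", "hp")]

means <- colMeans(cars)

sds <- rep(NA, ncol(cars))
counts <- rep(NA, ncol(cars))
for (i in seq_along(cars)) {
  sds[i] <- sd(cars[, i])
  counts[i] <- sum(!is.na(cars[, i]))
}

# cbind() takes the row names from the named vector `means`
# and the column names from the argument names
tab <- cbind(mean = means, sd = sds, N = counts)
tab
```

This should produce the same table, at the cost of keeping track of three intermediate objects instead of one.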

### 1.1.2 Loop over variable names

Because a data frame always has named columns, we could just as easily use
names instead of positions to index a data frame. This can be especially helpful if
we are trying to analyze a few selected variables from a larger data frame
(for example, the full `mtcars` data).

Here, the sequence is a character vector of column names.

```
analysis_vars <- c("mpg", "wt", "disp", "hp")
for (i in analysis_vars) {
  print(i)
  print(sd(mtcars[,i]))
}
```

```
[1] "mpg"
[1] 6.026948
[1] "wt"
[1] 0.9784574
[1] "disp"
[1] 123.9387
[1] "hp"
[1] 68.56287
```

Our code is very similar to indexing by position, without needing to create a subset of our data frame or trying to count columns.

```
v <- matrix(NA, nrow=length(analysis_vars), ncol=3)
rownames(v) <- analysis_vars
colnames(v) <- c("mean", "sd", "N")
for (i in analysis_vars) {
  v[i, "mean"] <- mean(mtcars[,i])
  v[i, "sd"] <- sd(mtcars[,i])
  v[i, "N"] <- sum(!is.na(mtcars[,i]))
}
v
```

```
          mean          sd  N
mpg   20.09062   6.0269481 32
wt     3.21725   0.9784574 32
disp 230.72188 123.9386938 32
hp   146.68750  68.5628685 32
```
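
The `mtcars` data have no missing values, so `N` is 32 everywhere and `mean()` and `sd()` never return `NA`. The sketch below assumes a copy of the data, `cars_na`, with a few values deliberately set to missing (purely hypothetical), to show how the same loop might be adapted with `na.rm = TRUE`.

```r
analysis_vars <- c("mpg", "wt", "disp", "hp")
cars_na <- mtcars[, analysis_vars]
cars_na[1:3, "wt"] <- NA   # hypothetical missing values, for illustration only

v <- matrix(NA, nrow = length(analysis_vars), ncol = 3)
rownames(v) <- analysis_vars
colnames(v) <- c("mean", "sd", "N")
for (i in analysis_vars) {
  v[i, "mean"] <- mean(cars_na[, i], na.rm = TRUE)  # drop NAs before averaging
  v[i, "sd"]   <- sd(cars_na[, i], na.rm = TRUE)
  v[i, "N"]    <- sum(!is.na(cars_na[, i]))         # count of non-missing values
}
v
```

Without `na.rm = TRUE`, the mean and standard deviation for `wt` would both be `NA`.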

## 1.2 Loops over rows in a data frame

Because our binary operators and many other functions are “vectorized”, we often don’t need to think about looping over the rows of a data frame. Code like

```
dfr <- data.frame(x=rnorm(15), y=runif(15))
dfr$z <- dfr$x + dfr$y
```

is *implicitly* looping over the rows of `dfr`. But occasionally you will
find you need to operate within each row, and simple vectorization will not
work.

As an example consider the situation where survey respondents have answered five related questions, each on a scale of 1 to 5. You want to construct a scale that is the mean of each person’s responses, but there are some missing answers.

```
# simulate data for this problem
set.seed(20210205)
q <- as.data.frame(matrix(sample(c(1:5,NA), 35, replace=TRUE), ncol=5))
q
```

```
  V1 V2 V3 V4 V5
1  4  2  3  1  3
2  5  3  1 NA  5
3 NA  3  5  3 NA
4  5  5  3  2  5
5  3  5  2 NA NA
6  5  5  1 NA  1
7 NA  1  3  5  1
```

The vectorized approach only works for rows with no missing values; any row that contains an `NA` yields `NA`.

`with(q, (V1 + V2 + V3 + V4 + V5)/5)`
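
To see the problem concretely, the sketch below re-enters the simulated responses by hand from the printed table above (so it does not depend on the random number generator) and applies the vectorized formula. The name `naive` is chosen for illustration.

```r
# the seven simulated respondents, typed in from the printed table
q <- data.frame(
  V1 = c(4, 5, NA, 5, 3, 5, NA),
  V2 = c(2, 3, 3, 5, 5, 5, 1),
  V3 = c(3, 1, 5, 3, 2, 1, 3),
  V4 = c(1, NA, 3, 2, NA, NA, 5),
  V5 = c(3, 5, NA, 5, NA, 1, 1)
)

# any NA in a row propagates through the sum
naive <- with(q, (V1 + V2 + V3 + V4 + V5) / 5)
naive
```

Only respondents 1 and 4 answered every question, so only those two rows get a scale value.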

With a loop, we can make use of the `mean` function and its `na.rm` argument.

```
Vscale <- rep(NA, nrow(q))
names(Vscale) <- row.names(q)
for (i in 1:nrow(q)) {
  # coerce the one-row data frame to a matrix first,
  # because mean() does not have a data.frame method
  v <- as.matrix(q[i,])
  Vscale[i] <- mean(v, na.rm=TRUE)
}
Vscale
```

```
       1        2        3        4        5        6        7
2.600000 3.500000 3.666667 4.000000 3.333333 3.000000 2.500000
```

(For this specific problem of row means, there are two more graceful
solutions. One is the `rowMeans` function; the other is to `apply` the `mean`
function, as discussed below.)
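
As a brief sketch of those two alternatives, again with the responses typed in from the printed table (the names `by_rowmeans` and `by_apply` are illustrative):

```r
q <- data.frame(
  V1 = c(4, 5, NA, 5, 3, 5, NA),
  V2 = c(2, 3, 3, 5, 5, 5, 1),
  V3 = c(3, 1, 5, 3, 2, 1, 3),
  V4 = c(1, NA, 3, 2, NA, NA, 5),
  V5 = c(3, 5, NA, 5, NA, 1, 1)
)

# rowMeans() averages across columns within each row
by_rowmeans <- rowMeans(q, na.rm = TRUE)

# apply() calls mean() once per row (MARGIN = 1), passing na.rm along
by_apply <- apply(q, 1, mean, na.rm = TRUE)

by_rowmeans
by_apply
```

Both should agree with the `Vscale` values computed by the explicit loop.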

## 1.3 Loop over groups

When exploring relationships in our data, we often want to analyze one variable within groups denoted by values in another, categorical variable.

Suppose we wanted to see the means and standard deviations of `mpg` within
different levels of the `cyl` variable of `mtcars`.

As the data set is given, `cyl` is a numeric variable, and we could
loop over a vector of unique numeric values.

```
cylvals <- unique(mtcars$cyl)
cylvals
```

`[1] 6 4 8`

The value order here doesn’t matter for running a `for` loop, but
we do need to pay attention in order to properly interpret our results!
And these values would not be convenient for indexing a results table.
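
The values are inconvenient as an index because `v[6]` would refer to the sixth position of a results vector, not to the "6 cylinders" entry. One workaround, sketched here as an assumption rather than as the approach taken next, is to loop over positions in the (sorted) vector of values and use the values only for selecting rows.

```r
cylvals <- sort(unique(mtcars$cyl))   # sorted so the results are easier to read
cylmeans <- rep(NA, length(cylvals))
names(cylmeans) <- cylvals            # names are stored as character labels

for (k in seq_along(cylvals)) {
  # k is the position in the results; cylvals[k] is the group value
  cylmeans[k] <- mean(mtcars$mpg[mtcars$cyl == cylvals[k]])
}
cylmeans
```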

We could also coerce `cyl` to a factor, since that is how we are
using it in this analysis. Then we could loop over the levels
of the factor.

```
cars <- mtcars
cars$cyl <- factor(mtcars$cyl)
cyllevels <- levels(cars$cyl)
cyllevels
```

`[1] "4" "6" "8"`

Here, keep in mind that these levels are value labels, not numbers!

```
cyltable <- matrix(NA, nrow=nlevels(cars$cyl), ncol=2)
# named dimnames print more nicely than plain rownames() and colnames()
dimnames(cyltable) <- list(cyls=cyllevels, stats=c("mean", "sd"))
for (i in cyllevels) {
  v <- cars$mpg[cars$cyl==i]
  cyltable[i, "mean"] <- mean(v)
  cyltable[i, "sd"] <- sd(v)
}
cyltable
```

```
    stats
cyls     mean       sd
   4 26.66364 4.509828
   6 19.74286 1.453567
   8 15.10000 2.560048
```

## 1.4 Loop over parameter values

We are not limited to sequences of integers, nor do the numbers have to represent column positions. The sequence can be any arbitrary vector, in any arbitrary order.

Suppose we wanted to simulate random samples of various sizes in order to better understand how the standard error of the mean behaves.

```
for (n in c(5, 10, 50, 100, 500, 1000)) {
  cat(n, "\n") # use cat() instead of print() for simpler output
  cat(sd(rnorm(n))/sqrt(n), "\n") # estimated standard error of the mean
  cat("\n")
}
```

```
5
0.3710791
10
0.2181523
50
0.1489542
100
0.09735525
500
0.0420442
1000
0.03186828
```

While this example gives us the general idea that bigger samples yield smaller standard errors, we should really replicate these samples many times. For this we can use a loop within a loop.

```
samples <- c(5, 10, 50, 100, 500, 1000)
reps <- 200
results <- matrix(NA, ncol=length(samples), nrow=reps,
                  dimnames=list(rep=1:reps,
                                size=samples))
for (n in samples) {
  for (r in 1:reps) {
    results[r, as.character(n)] <- sd(rnorm(n))/sqrt(n)
  }
}
colMeans(results)
```

```
         5         10         50        100        500       1000
0.41996018 0.30445001 0.14096484 0.10018702 0.04454776 0.03158740
```

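And we might notice that a 100-fold increase in sample size (compare sizes 5 and 500, or 10 and 1000) shrinks the standard error by a factor of about 10, which gives us an extra decimal place of precision in our estimates.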
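
This matches the usual formula for the standard error of a mean, the standard deviation divided by the square root of the sample size. As a rough check, and assuming the population standard deviation of 1 that `rnorm()` uses by default, the theoretical values can be computed directly. The name `theory` is chosen for illustration.

```r
samples <- c(5, 10, 50, 100, 500, 1000)

# theoretical standard error: sigma / sqrt(n), with sigma = 1
theory <- 1 / sqrt(samples)
names(theory) <- samples
theory

# ratio of standard errors for a 100-fold increase in sample size
theory["5"] / theory["500"]
```

The simulated averages should fall close to these values, with somewhat larger discrepancies at the smallest sample sizes.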

## 1.5 Loop over data objects

Our sequence does not have to be a vector. Lists are also ordered objects, so it is possible to loop over the items in a list. Understood as a list, a data frame is a list of vectors. So our standard deviation problem could be approached by

```
cars <- mtcars[, c("mpg", "wt", "disp", "hp")]
for (i in cars) {
  print(sd(i))
}
```

```
[1] 6.026948
[1] 0.9784574
[1] 123.9387
[1] 68.56287
```

This gives us very clean and simple code. However, it leaves us without a convenient index for collecting results, so this technique is most useful when we aren’t saving results as a single data object.
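
A sketch of that kind of use: looping over a list of whole data frames and reporting on each one without storing anything. The choice of data sets (`mtcars`, `iris`, and `airquality`, all shipped with R) and the name `datasets` are assumptions made for illustration.

```r
# a list whose elements are entire data frames
datasets <- list(mtcars, iris, airquality)

for (d in datasets) {
  # d is one data frame on each pass through the loop
  cat(nrow(d), "rows,", ncol(d), "columns\n")
}
```

If the results did need to be saved, looping over `seq_along(datasets)` or over the names of a named list would give back an index to store them by.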