3  Iteration

First, load the tidyverse.

library(tidyverse)

3.1 Iterating over Vectors and Lists

Base R supports traditional for loops, but two complications are they tend to involve a lot of indexing and setting up objects to store the results.

Instead, use map() from the purrr package, part of the tidyverse. Its basic usage involves two arguments:

  1. The list or vector we want to iterate over
  2. The function we want to apply to each element

map() returns a list as its output.

Let’s write another simple function that simply squares a number.

square <- function(x) {
  x^2
}

square(3)
[1] 9

Now if we want to apply this function to the numbers 1 through 5, we can just write

map(1:5, square)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

1:5 is a vector, so let’s now try a list. A dataframe is technically a list of variables, so let’s calculate the standard deviation of each column in mtcars:

map(mtcars, sd)
$mpg
[1] 6.026948

$cyl
[1] 1.785922

$disp
[1] 123.9387

$hp
[1] 68.56287

$drat
[1] 0.5346787

$wt
[1] 0.9784574

$qsec
[1] 1.786943

$vs
[1] 0.5040161

$am
[1] 0.4989909

$gear
[1] 0.7378041

$carb
[1] 1.6152

Notice again that the output is a list. map() works with lists, so we can use map() on the output again. Pass that output to square() to turn those standard deviations into variances:

map(mtcars, sd) |> 
  map(square)
$mpg
[1] 36.3241

$cyl
[1] 3.189516

$disp
[1] 15360.8

$hp
[1] 4700.867

$drat
[1] 0.2858814

$wt
[1] 0.957379

$qsec
[1] 3.193166

$vs
[1] 0.2540323

$am
[1] 0.2489919

$gear
[1] 0.5443548

$carb
[1] 2.608871

3.2 Anonymous functions

We could also calculate the standard error using our function from earlier:

se <- function(x) {
  sd(x) / sqrt(length(x))
}

map(mtcars, se)
$mpg
[1] 1.065424

$cyl
[1] 0.3157093

$disp
[1] 21.90947

$hp
[1] 12.12032

$drat
[1] 0.09451874

$wt
[1] 0.1729685

$qsec
[1] 0.3158899

$vs
[1] 0.08909831

$am
[1] 0.08820997

$gear
[1] 0.1304266

$carb
[1] 0.2855297

This works fine, but we may not always need a separate function for every task. If we only want to do a simple procedure once or twice, we can create an anonymous function.

To turn our named function into an anonymous function, first rewrite the function as one line, in which case we do not need braces:

se <- function(x) sd(x) / sqrt(length(x))

We said it was anonymous—without a name—so remove the assignment:

function(x) sd(x) / sqrt(length(x))

Technically, we could be finished, but we can optionally replace function with \ to further shorten our code:

\(x) sd(x) / sqrt(length(x))

Now we can use this as the second argument within map(), without ever defining a function:

map(mtcars, \(x) sd(x) / sqrt(length(x)))
$mpg
[1] 1.065424

$cyl
[1] 0.3157093

$disp
[1] 21.90947

$hp
[1] 12.12032

$drat
[1] 0.09451874

$wt
[1] 0.1729685

$qsec
[1] 0.3158899

$vs
[1] 0.08909831

$am
[1] 0.08820997

$gear
[1] 0.1304266

$carb
[1] 0.2855297

However, if we find ourselves copying and pasting this anonymous function more than once or twice, we should give it a name!

We will often use anonymous functions when we want to modify a function’s argument defaults. The default of sd() is na.rm = F, so it’ll return NA wherever there’s missing data. See:

map(airquality, sd)
$Ozone
[1] NA

$Solar.R
[1] NA

$Wind
[1] 3.523001

$Temp
[1] 9.46527

$Month
[1] 1.416522

$Day
[1] 8.86452

If we want to just change that argument, we could write a custom wrapper with different defaults:

sd2 <- function(x) {
  sd(x, na.rm = T)
}

map(airquality, sd2)
$Ozone
[1] 32.98788

$Solar.R
[1] 90.05842

$Wind
[1] 3.523001

$Temp
[1] 9.46527

$Month
[1] 1.416522

$Day
[1] 8.86452

And that would be a fine approach if we want to use it several times. For now, let’s write an anonymous function:

map(airquality, \(x) sd(x, na.rm = T))
$Ozone
[1] 32.98788

$Solar.R
[1] 90.05842

$Wind
[1] 3.523001

$Temp
[1] 9.46527

$Month
[1] 1.416522

$Day
[1] 8.86452

3.3 Simulation: Ignoring the Index

So far we have been applying some function to each element of the vector or list we gave to map(). For some vector or list with length n, we get output of length n where each element is the result of applying a function to the nth element.

What if we just want results of length n, where the results are not derived from the nth element? We can write a function (anonymous or named) that does not use that object. For reproducibility, set a seed first:

set.seed(123)
map(1:5, \(x) rnorm(3))
[[1]]
[1] -0.5604756 -0.2301775  1.5587083

[[2]]
[1] 0.07050839 0.12928774 1.71506499

[[3]]
[1]  0.4609162 -1.2650612 -0.6868529

[[4]]
[1] -0.4456620  1.2240818  0.3598138

[[5]]
[1]  0.4007715  0.1106827 -0.5558411

Notice how x is not used in our function:

  • When x is 1, it returns rnorm(3)
  • When x is 2, it returns rnorm(3)
  • When x is 3, it returns rnorm(3)

And so on. We get three random values five times.

Now let’s do an experiment. We want to know, if we flip n coins, the probability of getting all heads.

Simulate flips by sampling from the vector c(0, 1) with replacement, where 0 is tails and 1 is heads:

n <- 5
flips <- sample(0:1, n, replace = T)
flips
[1] 0 1 0 0 1

Then, one way to check if we have all heads is to check whether the sum of the 0s and 1s is equal to the length of the vector. Here, n is 5, so we would need five 1s for this to be true. We get back TRUE if all were heads and FALSE if not.

sum(flips) == n
[1] FALSE

Now wrap all this code into a function:

all_heads <- function(n) {
  flips <- sample(0:1, n, replace = T)
  sum(flips) == n
}

And do it 1000 times:

sims <- map(1:1000, \(x) all_heads(5))

Take a look at the first six values:

head(sims)
[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] TRUE

[[4]]
[1] FALSE

[[5]]
[1] FALSE

[[6]]
[1] FALSE

Our results are a list now, and we want to summarize it. Since each element in our list contains on a single value, we can simplify the list with list_simplify():

sims <- list_simplify(sims)
head(sims)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE

The mean of this vector should approximate the probability of all heads for a given number of coin flips. The expected value is \(\frac{1}{2^n}\), which is 0.03125 when n is 5.

mean(sims)
[1] 0.03

3.4 Wrangling Lists

map() returns a list, so we should know how to work with lists. We just saw how to use list_simplify() to turn a list into a vector.

Another skill we need is to turn a list into a dataframe. In Multiple Returns we learned how to use a list to return multiple values.

Let’s get the mean, standard deviation, and standard error of each column in mtcars:

mySummary <- function(x) {
  list(mean = mean(x),
       sd = sd(x),
       se = sd(x) / sqrt(length(x)))
}

descriptives <- map(mtcars, mySummary)

In the resulting object, note that each top-level elements are the column names the statistics were derived from:

descriptives[1:2]
$mpg
$mpg$mean
[1] 20.09062

$mpg$sd
[1] 6.026948

$mpg$se
[1] 1.065424


$cyl
$cyl$mean
[1] 6.1875

$cyl$sd
[1] 1.785922

$cyl$se
[1] 0.3157093

The bind_rows() function can be used not only to combine dataframes, but also lists. Setting .id to “var” makes an ID column in the resulting dataframe called “var” that takes its values from the list element names. (If there are no names, it will use position numbers: 1, 2, 3, etc.)

map(mtcars, mySummary) |> 
  bind_rows(.id = "var")
# A tibble: 11 × 4
   var      mean      sd      se
   <chr>   <dbl>   <dbl>   <dbl>
 1 mpg    20.1     6.03   1.07  
 2 cyl     6.19    1.79   0.316 
 3 disp  231.    124.    21.9   
 4 hp    147.     68.6   12.1   
 5 drat    3.60    0.535  0.0945
 6 wt      3.22    0.978  0.173 
 7 qsec   17.8     1.79   0.316 
 8 vs      0.438   0.504  0.0891
 9 am      0.406   0.499  0.0882
10 gear    3.69    0.738  0.130 
11 carb    2.81    1.62   0.286 

See the purrr package to learn more list wrangling skills, such as subsetting with logical conditions.

3.5 Exercises: Basic Iteration

  1. Iterate over a vector to produce a list where the first element has one random value between 0 and 1 (use runif()), the second element has two values, and the third has three.

  2. Calculate the mean of each column in airquality, ignoring missing values. Use an anoymous function.

  3. In a Poisson distribution, both the mean and variance are equal to lambda. Repeatedly simulate values from a Poisson distribution (rpois()) where the lambda argument is set to some value. For each sample, calculate the mean and variance (squared standard deviation). Then, check the sampling distribution.

3.6 Side Effects

In all our examples so far, we have been interested in the values that a function returns. At other times we are interested in what are called side effects. Side effects include reading and writing files, making plots, and printing stuff to the console.

walk() is used to produce side effects. It is a convenience function that will accomplish the same task as map() but without cluttering the console. Compare the output when iteratively creating plots of the cars dataset:

map(cars, plot)

$speed
NULL

$dist
NULL
walk(cars, plot)

map() returns empty output (NULL) along with the plots, but walk() does not.

map() and walk() each iterate over a single list. map2() and walk2() iterate across two lists, and pmap() and pwalk() iterate across lists of lists with any length.

3.6.1 Example: Saving Subsets to Disk

Let’s use walk2() to iterate across two lists to accomplish a task: save a list of dataframes (list 1) with certain filenames (list 2).

airquality has a Month column that ranges 5-9, so let’s split this dataframe into separate dataframes by month with the split() function:

air_months <- split(airquality, airquality$Month)
str(air_months)
List of 5
 $ 5:'data.frame':  31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 6:'data.frame':  30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
  ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
  ..$ Wind   : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
  ..$ Temp   : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
  ..$ Month  : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
 $ 7:'data.frame':  31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
  ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
  ..$ Wind   : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
  ..$ Temp   : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
  ..$ Month  : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 8:'data.frame':  31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
  ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
  ..$ Wind   : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
  ..$ Temp   : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
  ..$ Month  : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 9:'data.frame':  30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
  ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
  ..$ Wind   : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
  ..$ Temp   : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
  ..$ Month  : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...

Now make some filenames to save these as:

filenames <- paste0("air_", names(air_months), ".csv")
filenames
[1] "air_5.csv" "air_6.csv" "air_7.csv" "air_8.csv" "air_9.csv"

And then save each subset of airquality as a separate CSV:

walk2(air_months, filenames, write_csv)

The function (write_csv()) will take the first argument of walk2() (air_months) as its first argument and the second (filenames) as its second.

Now, what if we want to read all those files back into R? We could import them as a list of dataframes:

air_all <- 
  map(filenames, read.csv)
str(air_all)
List of 5
 $ :'data.frame':   31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ :'data.frame':   30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
  ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
  ..$ Wind   : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
  ..$ Temp   : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
  ..$ Month  : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
 $ :'data.frame':   31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
  ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
  ..$ Wind   : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
  ..$ Temp   : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
  ..$ Month  : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ :'data.frame':   31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
  ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
  ..$ Wind   : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
  ..$ Temp   : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
  ..$ Month  : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ :'data.frame':   30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
  ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
  ..$ Wind   : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
  ..$ Temp   : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
  ..$ Month  : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...

Or add a step to bind_rows() to combine them into a single dataframe:

air_all <- 
  map(filenames, read.csv) |> 
  bind_rows()
str(air_all)
'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

3.6.2 Example: Converting Data File Types

As another example, let’s convert the CSVs on disk to RDS files. First, write a regular expression to get a vector of those filenames:

myFiles <- list.files(pattern = "^air_[0-9]\\.csv$")
myFiles
[1] "air_5.csv" "air_6.csv" "air_7.csv" "air_8.csv" "air_9.csv"

And then write a function that (1) imports the CSV, (2) makes a new filename, and (3) writes it as an RDS file:

csv_to_rds <- function(paths) {
  dat <- read.csv(paths)
  rds_path <- paste0(tools::file_path_sans_ext(paths), ".rds")
  saveRDS(dat, rds_path)
}

This function only takes a single argument, so we can use walk():

walk(myFiles, csv_to_rds)

To see whether that worked, check if the expected files are in our working directory:

list.files(pattern = "^air_[0-9]\\.rds$")
[1] "air_5.rds" "air_6.rds" "air_7.rds" "air_8.rds" "air_9.rds"

Success!

3.7 Parallelization

Functions and loops make writing our code faster, but it still runs in about the same amount of time since our iteration is sequential. We apply a function to the first element, and then when that is done, we apply the function to the second element, and so on. Each step waits for the previous step to finish.

To speed up our code, we can parallelize it. This is useful for tasks like wrangling many large datasets or fitting models to many subsets of a dataset.

For this, we can use the furrr package, which has parallelized versions of the map() and walk() families.

For example, let’s fit models to subsets of mtcars, one for each value of cyl: 4, 6, and 8.

All we need to do is declare a multisession with some number of parallel processes (workers) and then prefix our iteration function with future_. furrr will take care of the rest: assigning chunks to workers and gathering all the results into a single list.

library(tidyverse)
library(furrr)

ncores <- 3
plan(multisession, workers = ncores)

cyls <- c(4, 6, 8)

myMods <- 
  future_map(cyls,
             \(x) lm(mpg ~ wt, mtcars |> filter(cyl == x)))

Let’s format the model output. Apply the summary() function to each element, and then apply the coef() function to get dataframes of coefficients with test statistics:

myMods |> 
  map(summary) |> 
  map(coef)
[[1]]
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 39.571196   4.346582  9.103980 7.771511e-06
wt          -5.647025   1.850119 -3.052251 1.374278e-02

[[2]]
             Estimate Std. Error   t value    Pr(>|t|)
(Intercept) 28.408845   4.184369  6.789278 0.001054844
wt          -2.780106   1.334917 -2.082605 0.091757660

[[3]]
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 23.868029  3.0054619  7.941551 4.052705e-06
wt          -2.192438  0.7392393 -2.965803 1.179281e-02

Or, pass the list of models to stargazer() for formatted tables of regression output:

library(stargazer)

stargazer(myMods, 
          type = "html", 
          column.labels = paste0(cyls, "-cyl"))
Dependent variable:
mpg
4-cyl 6-cyl 8-cyl
(1) (2) (3)
wt -5.647** -2.780* -2.192**
(1.850) (1.335) (0.739)
Constant 39.571*** 28.409*** 23.868***
(4.347) (4.184) (3.005)
Observations 11 7 14
R2 0.509 0.465 0.423
Adjusted R2 0.454 0.357 0.375
Residual Std. Error 3.332 (df = 9) 1.165 (df = 5) 2.024 (df = 12)
F Statistic 9.316** (df = 1; 9) 4.337* (df = 1; 5) 8.796** (df = 1; 12)
Note: p<0.1; p<0.05; p<0.01

3.7.1 Slurm

The SSCC Slurm cluster is built for tasks like these. Submit your parallelized jobs to Slurm to take advantage of up to 128 cores on a single server.

R itself uses one process, so your parallelized task can use one less core than you reserve. If you want your script to use 3 cores in parallel, you need to reserve 4. This means you cannot run a parallel task that uses all the cores on a computer, since R needs one core to run.

Do not use parallel::detectCores() on Slurm to determine the number of cores you have to use, since it will find all those on the server, not just those you have reserved. Instead, you can access the environmental variable with the number of cores you reserved, and then subtract one:

ncores <- as.numeric(Sys.getenv()[["SLURM_CPUS_PER_TASK"]]) - 1

Then you can continue with the example above: plan(multisession, workers = ncores), then future_map(), etc.

Congratulations! You are now able to parallelize your R loops!

3.7.2 Random Seeds

When you are doing any work that involves random numbers (e.g., fitting models, multiple imputation, simulation), it is important to set your random seed for reproducibility. We set the seed with regular map() as we would with any other R code, but with furrr, we need to specify it in the .options argument:

ncores <- 2
plan(multisession, workers = ncores)

future_map(1:2, 
           \(x) rnorm(1), 
           .options = furrr_options(seed = 123))
[[1]]
[1] -1.01425

[[2]]
[1] -0.03074715
future_map(1:2, 
           \(x) rnorm(1), 
           .options = furrr_options(seed = 123))
[[1]]
[1] -1.01425

[[2]]
[1] -0.03074715

Just setting the seed outside will not result in the same results every time:

ncores <- 2
plan(multisession, workers = ncores)

set.seed(123) # incorrect!

future_map(1:2, 
           \(x) rnorm(1))
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
[[1]]
[1] 0.65431

[[2]]
[1] -0.6810615
future_map(1:2, 
           \(x) rnorm(1))
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
[[1]]
[1] -1.350347

[[2]]
[1] -1.14498

See more details at ?furrr_options.

3.8 Exercises: Side Effects and Parallelization

  1. Use the diamonds dataset.

    A. Split it into separate dataframes by one of the categorical variables (cut, color, or clarity). For each subset, fit a regression model predicting price from any variables.

    B. For each model, plot the residuals against the fitted values with ggplot(). Save each plot to disk with ggsave().

    C. Parallelize the code where you fit the models. (You do not need to plot it.)

  2. Verify that your computer is actually running code in parallel.

    A. Check that you have at least four cores on your computer by running parallel::detectCores().

    B. Run the block of code below, which should take about 6 seconds.

    start <- Sys.time()
    walk(1:3, \(x) Sys.sleep(2))
    Sys.time() - start

    C. Parallelize the code to use 3 cores.

    D. Does the code take very close to 2 seconds, or does it take a little more? Why?