library(tidyverse)
3 Iteration
First, load the tidyverse
.
3.1 Iterating over Vectors and Lists
Base R supports traditional for
loops, but two complications are they tend to involve a lot of indexing and setting up objects to store the results.
Instead, use map()
from the purrr
package, part of the tidyverse. Its basic usage involves two arguments:
- The list or vector we want to iterate over
- The function we want to apply to each element
map()
returns a list as its output.
Let’s write another simple function that simply squares a number.
<- function(x) {
square ^2
x
}
square(3)
[1] 9
Now if we want to apply this function to the numbers 1 through 5, we can just write
map(1:5, square)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
1:5
is a vector, so let’s now try a list. A dataframe is technically a list of variables, so let’s calculate the standard deviation of each column in mtcars
:
map(mtcars, sd)
$mpg
[1] 6.026948
$cyl
[1] 1.785922
$disp
[1] 123.9387
$hp
[1] 68.56287
$drat
[1] 0.5346787
$wt
[1] 0.9784574
$qsec
[1] 1.786943
$vs
[1] 0.5040161
$am
[1] 0.4989909
$gear
[1] 0.7378041
$carb
[1] 1.6152
Notice again that the output is a list. map()
works with lists, so we can use map()
on the output again. Pass that output to square()
to turn those standard deviations into variances:
map(mtcars, sd) |>
map(square)
$mpg
[1] 36.3241
$cyl
[1] 3.189516
$disp
[1] 15360.8
$hp
[1] 4700.867
$drat
[1] 0.2858814
$wt
[1] 0.957379
$qsec
[1] 3.193166
$vs
[1] 0.2540323
$am
[1] 0.2489919
$gear
[1] 0.5443548
$carb
[1] 2.608871
3.2 Anonymous functions
We could also calculate the standard error using our function from earlier:
<- function(x) {
se sd(x) / sqrt(length(x))
}
map(mtcars, se)
$mpg
[1] 1.065424
$cyl
[1] 0.3157093
$disp
[1] 21.90947
$hp
[1] 12.12032
$drat
[1] 0.09451874
$wt
[1] 0.1729685
$qsec
[1] 0.3158899
$vs
[1] 0.08909831
$am
[1] 0.08820997
$gear
[1] 0.1304266
$carb
[1] 0.2855297
This works fine, but we may not always need a separate function for every task. If we only want to do a simple procedure once or twice, we can create an anonymous function.
To turn our named function into an anonymous function, first rewrite the function as one line, in which case we do not need braces:
<- function(x) sd(x) / sqrt(length(x)) se
We said it was anonymous—without a name—so remove the assignment:
function(x) sd(x) / sqrt(length(x))
Technically, we could be finished, but we can optionally replace function
with \
to further shorten our code:
sd(x) / sqrt(length(x)) \(x)
Now we can use this as the second argument within map()
, without ever defining a function:
map(mtcars, \(x) sd(x) / sqrt(length(x)))
$mpg
[1] 1.065424
$cyl
[1] 0.3157093
$disp
[1] 21.90947
$hp
[1] 12.12032
$drat
[1] 0.09451874
$wt
[1] 0.1729685
$qsec
[1] 0.3158899
$vs
[1] 0.08909831
$am
[1] 0.08820997
$gear
[1] 0.1304266
$carb
[1] 0.2855297
However, if we find ourselves copying and pasting this anonymous function more than once or twice, we should give it a name!
We will often use anonymous functions when we want to modify a function’s argument defaults. The default of sd()
is na.rm = F
, so it’ll return NA
wherever there’s missing data. See:
map(airquality, sd)
$Ozone
[1] NA
$Solar.R
[1] NA
$Wind
[1] 3.523001
$Temp
[1] 9.46527
$Month
[1] 1.416522
$Day
[1] 8.86452
If we want to just change that argument, we could write a custom wrapper with different defaults:
<- function(x) {
sd2 sd(x, na.rm = T)
}
map(airquality, sd2)
$Ozone
[1] 32.98788
$Solar.R
[1] 90.05842
$Wind
[1] 3.523001
$Temp
[1] 9.46527
$Month
[1] 1.416522
$Day
[1] 8.86452
And that would be a fine approach if we want to use it several times. For now, let’s write an anonymous function:
map(airquality, \(x) sd(x, na.rm = T))
$Ozone
[1] 32.98788
$Solar.R
[1] 90.05842
$Wind
[1] 3.523001
$Temp
[1] 9.46527
$Month
[1] 1.416522
$Day
[1] 8.86452
3.3 Simulation: Ignoring the Index
So far we have been applying some function to each element of the vector or list we gave to map()
. For some vector or list with length n, we get output of length n where each element is the result of applying a function to the nth element.
What if we just want results of length n, where the results are not derived from the nth element? We can write a function (anonymous or named) that does not use that object. For reproducibility, set a seed first:
set.seed(123)
map(1:5, \(x) rnorm(3))
[[1]]
[1] -0.5604756 -0.2301775 1.5587083
[[2]]
[1] 0.07050839 0.12928774 1.71506499
[[3]]
[1] 0.4609162 -1.2650612 -0.6868529
[[4]]
[1] -0.4456620 1.2240818 0.3598138
[[5]]
[1] 0.4007715 0.1106827 -0.5558411
Notice how x
is not used in our function:
- When
x
is 1, it returnsrnorm(3)
- When
x
is 2, it returnsrnorm(3)
- When
x
is 3, it returnsrnorm(3)
And so on. We get three random values five times.
Now let’s do an experiment. We want to know, if we flip n coins, the probability of getting all heads.
Simulate flips by sampling from the vector c(0, 1)
with replacement, where 0 is tails and 1 is heads:
<- 5
n <- sample(0:1, n, replace = T)
flips flips
[1] 0 1 0 0 1
Then, one way to check if we have all heads is to check whether the sum of the 0s and 1s is equal to the length of the vector. Here, n is 5, so we would need five 1s for this to be true. We get back TRUE
if all were heads and FALSE
if not.
sum(flips) == n
[1] FALSE
Now wrap all this code into a function:
<- function(n) {
all_heads <- sample(0:1, n, replace = T)
flips sum(flips) == n
}
And do it 1000 times:
<- map(1:1000, \(x) all_heads(5)) sims
Take a look at the first six values:
head(sims)
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] FALSE
[[6]]
[1] FALSE
Our results are a list now, and we want to summarize it. Since each element in our list contains on a single value, we can simplify the list with list_simplify()
:
<- list_simplify(sims)
sims head(sims)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
The mean of this vector should approximate the probability of all heads for a given number of coin flips. The expected value is \(\frac{1}{2^n}\), which is 0.03125 when n is 5.
mean(sims)
[1] 0.03
3.4 Wrangling Lists
map()
returns a list, so we should know how to work with lists. We just saw how to use list_simplify()
to turn a list into a vector.
Another skill we need is to turn a list into a dataframe. In Multiple Returns we learned how to use a list to return multiple values.
Let’s get the mean, standard deviation, and standard error of each column in mtcars
:
<- function(x) {
mySummary list(mean = mean(x),
sd = sd(x),
se = sd(x) / sqrt(length(x)))
}
<- map(mtcars, mySummary) descriptives
In the resulting object, note that each top-level elements are the column names the statistics were derived from:
1:2] descriptives[
$mpg
$mpg$mean
[1] 20.09062
$mpg$sd
[1] 6.026948
$mpg$se
[1] 1.065424
$cyl
$cyl$mean
[1] 6.1875
$cyl$sd
[1] 1.785922
$cyl$se
[1] 0.3157093
The bind_rows()
function can be used not only to combine dataframes, but also lists. Setting .id
to “var” makes an ID column in the resulting dataframe called “var” that takes its values from the list element names. (If there are no names, it will use position numbers: 1, 2, 3, etc.)
map(mtcars, mySummary) |>
bind_rows(.id = "var")
# A tibble: 11 × 4
var mean sd se
<chr> <dbl> <dbl> <dbl>
1 mpg 20.1 6.03 1.07
2 cyl 6.19 1.79 0.316
3 disp 231. 124. 21.9
4 hp 147. 68.6 12.1
5 drat 3.60 0.535 0.0945
6 wt 3.22 0.978 0.173
7 qsec 17.8 1.79 0.316
8 vs 0.438 0.504 0.0891
9 am 0.406 0.499 0.0882
10 gear 3.69 0.738 0.130
11 carb 2.81 1.62 0.286
See the purrr package to learn more list wrangling skills, such as subsetting with logical conditions.
3.5 Exercises: Basic Iteration
Iterate over a vector to produce a list where the first element has one random value between 0 and 1 (use
runif()
), the second element has two values, and the third has three.Calculate the mean of each column in
airquality
, ignoring missing values. Use an anoymous function.In a Poisson distribution, both the mean and variance are equal to lambda. Repeatedly simulate values from a Poisson distribution (
rpois()
) where the lambda argument is set to some value. For each sample, calculate the mean and variance (squared standard deviation). Then, check the sampling distribution.
3.6 Side Effects
In all our examples so far, we have been interested in the values that a function returns. At other times we are interested in what are called side effects. Side effects include reading and writing files, making plots, and printing stuff to the console.
walk()
is used to produce side effects. It is a convenience function that will accomplish the same task as map()
but without cluttering the console. Compare the output when iteratively creating plots of the cars
dataset:
map(cars, plot)
$speed
NULL
$dist
NULL
walk(cars, plot)
map()
returns empty output (NULL
) along with the plots, but walk()
does not.
map()
and walk()
each iterate over a single list. map2()
and walk2()
iterate across two lists, and pmap()
and pwalk()
iterate across lists of lists with any length.
3.6.1 Example: Saving Subsets to Disk
Let’s use walk2()
to iterate across two lists to accomplish a task: save a list of dataframes (list 1) with certain filenames (list 2).
airquality
has a Month column that ranges 5-9, so let’s split this dataframe into separate dataframes by month with the split()
function:
<- split(airquality, airquality$Month)
air_months str(air_months)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
Now make some filenames to save these as:
<- paste0("air_", names(air_months), ".csv")
filenames filenames
[1] "air_5.csv" "air_6.csv" "air_7.csv" "air_8.csv" "air_9.csv"
And then save each subset of airquality
as a separate CSV:
walk2(air_months, filenames, write_csv)
The function (write_csv()
) will take the first argument of walk2()
(air_months
) as its first argument and the second (filenames
) as its second.
Now, what if we want to read all those files back into R? We could import them as a list of dataframes:
<-
air_all map(filenames, read.csv)
str(air_all)
List of 5
$ :'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
Or add a step to bind_rows()
to combine them into a single dataframe:
<-
air_all map(filenames, read.csv) |>
bind_rows()
str(air_all)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
3.6.2 Example: Converting Data File Types
As another example, let’s convert the CSVs on disk to RDS files. First, write a regular expression to get a vector of those filenames:
<- list.files(pattern = "^air_[0-9]\\.csv$")
myFiles myFiles
[1] "air_5.csv" "air_6.csv" "air_7.csv" "air_8.csv" "air_9.csv"
And then write a function that (1) imports the CSV, (2) makes a new filename, and (3) writes it as an RDS file:
<- function(paths) {
csv_to_rds <- read.csv(paths)
dat <- paste0(tools::file_path_sans_ext(paths), ".rds")
rds_path saveRDS(dat, rds_path)
}
This function only takes a single argument, so we can use walk()
:
walk(myFiles, csv_to_rds)
To see whether that worked, check if the expected files are in our working directory:
list.files(pattern = "^air_[0-9]\\.rds$")
[1] "air_5.rds" "air_6.rds" "air_7.rds" "air_8.rds" "air_9.rds"
Success!
3.7 Parallelization
Functions and loops make writing our code faster, but it still runs in about the same amount of time since our iteration is sequential. We apply a function to the first element, and then when that is done, we apply the function to the second element, and so on. Each step waits for the previous step to finish.
To speed up our code, we can parallelize it. This is useful for tasks like wrangling many large datasets or fitting models to many subsets of a dataset.
For this, we can use the furrr
package, which has parallelized versions of the map()
and walk()
families.
For example, let’s fit models to subsets of mtcars
, one for each value of cyl
: 4, 6, and 8.
All we need to do is declare a multisession
with some number of parallel processes (workers
) and then prefix our iteration function with future_
. furrr
will take care of the rest: assigning chunks to workers and gathering all the results into a single list.
library(tidyverse)
library(furrr)
<- 3
ncores plan(multisession, workers = ncores)
<- c(4, 6, 8)
cyls
<-
myMods future_map(cyls,
lm(mpg ~ wt, mtcars |> filter(cyl == x))) \(x)
Let’s format the model output. Apply the summary()
function to each element, and then apply the coef()
function to get dataframes of coefficients with test statistics:
|>
myMods map(summary) |>
map(coef)
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.571196 4.346582 9.103980 7.771511e-06
wt -5.647025 1.850119 -3.052251 1.374278e-02
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.408845 4.184369 6.789278 0.001054844
wt -2.780106 1.334917 -2.082605 0.091757660
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.868029 3.0054619 7.941551 4.052705e-06
wt -2.192438 0.7392393 -2.965803 1.179281e-02
Or, pass the list of models to stargazer()
for formatted tables of regression output:
library(stargazer)
stargazer(myMods,
type = "html",
column.labels = paste0(cyls, "-cyl"))
Dependent variable: | |||
mpg | |||
4-cyl | 6-cyl | 8-cyl | |
(1) | (2) | (3) | |
wt | -5.647** | -2.780* | -2.192** |
(1.850) | (1.335) | (0.739) | |
Constant | 39.571*** | 28.409*** | 23.868*** |
(4.347) | (4.184) | (3.005) | |
Observations | 11 | 7 | 14 |
R2 | 0.509 | 0.465 | 0.423 |
Adjusted R2 | 0.454 | 0.357 | 0.375 |
Residual Std. Error | 3.332 (df = 9) | 1.165 (df = 5) | 2.024 (df = 12) |
F Statistic | 9.316** (df = 1; 9) | 4.337* (df = 1; 5) | 8.796** (df = 1; 12) |
Note: | p<0.1; p<0.05; p<0.01 |
3.7.1 Slurm
The SSCC Slurm cluster is built for tasks like these. Submit your parallelized jobs to Slurm to take advantage of up to 128 cores on a single server.
R itself uses one process, so your parallelized task can use one less core than you reserve. If you want your script to use 3 cores in parallel, you need to reserve 4. This means you cannot run a parallel task that uses all the cores on a computer, since R needs one core to run.
Do not use parallel::detectCores()
on Slurm to determine the number of cores you have to use, since it will find all those on the server, not just those you have reserved. Instead, you can access the environmental variable with the number of cores you reserved, and then subtract one:
<- as.numeric(Sys.getenv()[["SLURM_CPUS_PER_TASK"]]) - 1 ncores
Then you can continue with the example above: plan(multisession, workers = ncores)
, then future_map()
, etc.
Congratulations! You are now able to parallelize your R loops!
3.7.2 Random Seeds
When you are doing any work that involves random numbers (e.g., fitting models, multiple imputation, simulation), it is important to set your random seed for reproducibility. We set the seed with regular map()
as we would with any other R code, but with furrr
, we need to specify it in the .options
argument:
<- 2
ncores plan(multisession, workers = ncores)
future_map(1:2,
rnorm(1),
\(x) .options = furrr_options(seed = 123))
[[1]]
[1] -1.01425
[[2]]
[1] -0.03074715
future_map(1:2,
rnorm(1),
\(x) .options = furrr_options(seed = 123))
[[1]]
[1] -1.01425
[[2]]
[1] -0.03074715
Just setting the seed outside will not result in the same results every time:
<- 2
ncores plan(multisession, workers = ncores)
set.seed(123) # incorrect!
future_map(1:2,
rnorm(1)) \(x)
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
[[1]]
[1] 0.65431
[[2]]
[1] -0.6810615
future_map(1:2,
rnorm(1)) \(x)
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
numbers without specifying argument 'seed'. There is a risk that those random
numbers are not statistically sound and the overall results might be invalid.
To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
random numbers are produced via the L'Ecuyer-CMRG method. To disable this
check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
[[1]]
[1] -1.350347
[[2]]
[1] -1.14498
See more details at ?furrr_options
.
3.8 Exercises: Side Effects and Parallelization
Use the
diamonds
dataset.A. Split it into separate dataframes by one of the categorical variables (
cut
,color
, orclarity
). For each subset, fit a regression model predicting price from any variables.B. For each model, plot the residuals against the fitted values with
ggplot()
. Save each plot to disk withggsave()
.C. Parallelize the code where you fit the models. (You do not need to plot it.)
Verify that your computer is actually running code in parallel.
A. Check that you have at least four cores on your computer by running
parallel::detectCores()
.B. Run the block of code below, which should take about 6 seconds.
<- Sys.time() start walk(1:3, \(x) Sys.sleep(2)) Sys.time() - start
C. Parallelize the code to use 3 cores.
D. Does the code take very close to 2 seconds, or does it take a little more? Why?