library(tidyverse)
2 Functions
First, load the tidyverse
.
2.1 Turning Procedures into Functions
A procedure is one or more lines of code we run to do things like calculate summary statistics, transform variables, and read and write files. Our existing code serves as raw material we can transform into functions, which makes it portable and allows us to reuse it within and across scripts with minimal repetition.
If you want to reuse a procedure more than two or three times, turn it into a function.
2.1.1 Example: Calculating the Standard Error
In the code below, we calculate the standard error (\(\frac{SD_x}{\sqrt{n}}\) ) of several columns in mtcars
, which involves repeating the same procedure for several different R objects.
sd(mtcars$mpg) / sqrt(length(mtcars$mpg))
[1] 1.065424
sd(mtcars$wt) / sqrt(length(mtcars$wt))
[1] 0.1729685
sd(mtcars$hp) / sqrt(length(mtcars$hp))
[1] 12.12032
The functions are the same, but the objects vary.
Two things change in each line: the object inside of sd()
and the object inside of sqrt(length())
, which are the same object within each line. Fill them in with a placeholder, x
:
sd(x) / sqrt(length(x))
Now, to turn this procedure into a function, we need (1) a name and (2) the number of arguments, with suitable names.
For the name of the function, se
is concise and meaningful, but we should first check if it is the name of another existing function in a library we have loaded at the moment:
?se
No documentation for 'se' in specified packages and libraries:
you could try '??se'
That means it is not already used, so we will not have any naming conflicts if we were to use this name.
For the arguments, we only have one argument, which we can keep calling x
. If we had multiple arguments or planned on using a function more often, we should give the arguments more informative names. For now, x
is fine.
The structure of our function-making syntax is as follows:
<- function(argument) {
function_name
procedure }
If we replace function_name
, argument
, and procedure
with their values, we get this code:
<- function(x) {
se sd(x) / sqrt(length(x))
}
Running that line will add se()
to our global environment under the Functions header, so that is available for our use. Now let’s use it!
se(mtcars$mpg)
[1] 1.065424
se(mtcars$wt)
[1] 0.1729685
se(mtcars$hp)
[1] 12.12032
2.1.2 Example: Counting Missing Values
As another example, let’s get the number of missing values in a vector:
sum(is.na(airquality$Ozone))
[1] 37
sum(is.na(airquality$Solar.R))
[1] 7
What changes in each line? Replace that with x
:
sum(is.na(x))
Name the function nmiss
and fill in the function template:
<- function(x) {
nmiss sum(is.na(x))
}
Now use it:
nmiss(airquality$Ozone)
[1] 37
nmiss(airquality$Solar.R)
[1] 7
2.2 Exercises: Basic Functions
The code below returns the proportion of rows in a dataframe that do not have any missing data. Create a function called
prop_observed()
that does the same.nrow(na.omit(txhousing)) / nrow(txhousing)
[1] 0.828412
nrow(na.omit(storms)) / nrow(storms)
[1] 0.1110713
prop_observed(txhousing)
[1] 0.828412
prop_observed(storms)
[1] 0.1110713
The following line returns how many years have passed since some date given in year-month-day format. In this example, we use July 20, 1969, the Apollo 11 moon landing.
time_length(interval(ymd(19690720), Sys.Date()), unit = "years")
[1] 55.06575
Create a function called
years_since()
that does the same, and use it to find how many years have passed since some significant dates for you, like a birthday, wedding, or starting your graduate school. (Note theymd()
function works with dates in various year-month-day formats, everything from19690720
to1969-07-20
to"1969 July 20"
. For working with dates, see our chapter on date vectors.)years_since(19690720)
[1] 55.06575
The code below prints a greeting.
cat(paste0("Good morning, ", "Jason", ".\nToday is ", format(Sys.Date(), "%B %d, %Y"), ".\n"))
Good morning, Jason. Today is August 13, 2024.
Create a function called
morning()
that takes the user’s name.morning("World")
Good morning, World. Today is August 13, 2024.
2.3 Environments
Most of our work in R involves working with objects in the global environment. This is the one we see in the top-right Environment pane of RStudio, and the objects there are available to everything in our current session. Other environments exist, and functions actually execute in their own environment.
The technical details of environments are discussed elsewhere1, so here we will focus on the implications for our function writing. Keep these three things in mind:
- Objects created within functions with
<-
are not stored in the global environment. - Each time a function is run, it runs in a new environment.
- A function searches its own environment before it searches the global environment.
We will discuss each of these in turn.
2.3.1 Objects created within functions
Let’s create a somewhat useful function that reports the empirical standard deviation from a random sample drawn from a normal distribution of size n. The expected standard deviation, per the default of rnorm()
, is 1. The error tends to get smaller as the sample size increases.
<- function(n) {
empirical_sd <- rnorm(n)
sample_data sd(sample_data)
}
empirical_sd(10)
[1] 0.780586
empirical_sd(10000)
[1] 1.012406
empirical_sd(1000000)
[1] 1.000062
Take a look at the procedure inside the function. We make an object, sample_data
, and then calculate and print the standard deviation.
Does sample_data
exist in our global environment?
"sample_data" %in% ls()
[1] FALSE
No. It only exists within the function’s environment. Functions do not return intermediary assigned objects; instead, they return the result of the last expression.
2.3.2 Functions run in new environments
Not only do functions run in their own environments, they run in a new environment each time.
The function add_one()
has no arguments, and it simply adds one to the object i
. We initialize it with a value of 1. After the first call of add_one()
, i
is 2. We may expect that i
increases each time, but it does not.
<- function() {
add_one <- i + 1
i
i
}
<- 1
i
add_one()
[1] 2
add_one()
[1] 2
add_one()
[1] 2
The reason for this is that our global environment has i
with a value of 1. The first call of add_one()
creates an environment, gets i
from the global environment, and adds 1 to it, so that i
is now 2. The next call of add_one()
does not see the environment from the previous call of the function, so it gets i
with a value of 1 from the global environment, adds 1 to make 2, and so on.
The function will search its own environment before searching the global environment. The upside of this is that we do not need to worry about conflicts between function argument names and our global environment.
2.3.3 Environment Search Order
To illustrate the search order of functions, let’s modify the add_one()
code from above to include the line i <- 10
. The output is 11 because add_one()
finds the i
with a value of 10 in its own environment before finding the i
with a value of 1 in the global environment.
<- function() {
add_one <- 10
i <- i + 1
i
i
}
<- 1
i
add_one()
[1] 11
The upside of this is that we do not need to worry about conflicts between the names of objects in our function, and the names of objects in our global environment. That is, while in general we need to thoughtfully name our objects, we are also free inside our functions to call every vector x
or y
and every dataframe dat
or df
.
2.4 Multiple Returns
Given that assigning objects within functions does not assign objects in our global environment, and our last returned object is returned, how can we return multiple objects?
Let’s add to nmiss()
to also give the total number of values and the proportion of missing values:
<- function(x) {
miss_stats length(x)
sum(is.na(x))
mean(is.na(x))
}
miss_stats(airquality$Ozone)
[1] 0.2418301
Only the last value is returned, mean(is.na(x))
.
A solution is to make a (named) list that contains all the objects we want returned. Lists are excellent containers when you want to return a variety of objects since they can contain objects of different types and structures.
<- function(x) {
miss_stats list(n = length(x),
miss_count = sum(is.na(x)),
miss_prop = mean(is.na(x)))
}
miss_stats(airquality$Ozone)
$n
[1] 153
$miss_count
[1] 37
$miss_prop
[1] 0.2418301
2.5 Arguments and Defaults
If we use more informative argument names, our functions become easier to use and later understand.
The code below takes proportions (0-1) and turns them into percentages (0-100%) with a differing numbers of decimal places.
<- seq(0, 1, .1243)
x <- seq(0, 1, .2345) y
x
[1] 0.0000 0.1243 0.2486 0.3729 0.4972 0.6215 0.7458 0.8701 0.9944
y
[1] 0.0000 0.2345 0.4690 0.7035 0.9380
trimws(paste0(format(round(x * 100, 2), nsmall = 2), "%"))
[1] "0.00%" "12.43%" "24.86%" "37.29%" "49.72%" "62.15%" "74.58%" "87.01%"
[9] "99.44%"
trimws(paste0(format(round(y * 100, 1), nsmall = 1), "%"))
[1] "0.0%" "23.4%" "46.9%" "70.3%" "93.8%"
In these, we have two pieces that change: the vector (x
, y
) and the number of decimal places (1
, 2
), which occures twice. We can use x
for the vector argument name and n_decimal
for the number of decimal places argument name.
trimws(paste0(format(round(x * 100, n_decimal), nsmall = n_decimal), "%"))
The first line of our code to create that function will look like this:
<- function(x, n_decimal) { prop_to_perc
We may find that we often want to round to two decimal places, so we should set that as an argument default. Defaults are set with the pattern argument = value
:
<- function(x, n_decimal = 2) { prop_to_perc
Now, reformat the code inside the function with pipes to make it more readable.
<- function(x, n_decimal = 2) {
prop_to_perc round(x * 100, n_decimal) |>
format(nsmall = n_decimal) |>
paste0("%") |>
trimws()
}
Now use the function on x
and y
. For x
, we are fine with the default of 2 decimal places, so we do not need to specify anything.
prop_to_perc(x)
[1] "0.00%" "12.43%" "24.86%" "37.29%" "49.72%" "62.15%" "74.58%" "87.01%"
[9] "99.44%"
And for y
, we want just 1 decimal place, so we need to specify that:
prop_to_perc(y, n_decimal = 1)
[1] "0.0%" "23.4%" "46.9%" "70.3%" "93.8%"
Following our earlier discussion of environments, note how each function call finds the x
in its own environment instead of the x
in the global environment.
2.6 Conditionals
Sometimes we may want a single function that does different things with different types of input.
For example, take the function plot()
. If we give it a numeric vector, it gives us a scatterplot of values by index:
plot(chickwts$weight)
And if we give it a factor, it returns a bar chart:
plot(chickwts$feed)
But it fails if we give it a character vector:
plot(as.character(chickwts$feed))
Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Error in plot.window(...): need finite 'ylim' values
And if we give it a logical vector, it returns a plot where TRUE
and FALSE
have been coerced to 1 and 0, respectively:
plot(chickwts$weight > 250)
Let’s make a function that plots vectors of different types. It can return the default plots for numeric and factor vectors, but for character and logical vectors, it should coerce them to factor vectors first.
We should also limit the function to only work with vectors so that we do not get unexpected results with other data structures and objects. To do this, we can use stopifnot()
with a condition: check if it is a vector or a factor (which technically is not a vector…):
stopifnot(is.vector(x) | is.factor(x))
After that, we need branching logic. If it is a character or logical vector, make it a factor and then plot it. Otherwise, plot it as-is:
if (is.character(x) | is.logical(x)) {
plot(as.factor(x))
else {
} plot(x)
}
Put this all together inside a function which we can call plot_vec()
:
<- function(x) {
plot_vec stopifnot(is.vector(x) | is.factor(x))
if (is.character(x) | is.logical(x)) {
plot(as.factor(x))
else {
} plot(x)
} }
And now test it with a few inputs:
plot_vec(chickwts$weight) # numeric
plot_vec(chickwts$feed) # factor
plot_vec(as.character(chickwts$feed)) # character
plot_vec(chickwts$weight > 250) # logical
Now give it another data structure to see how it reacts:
plot_vec(mtcars)
Error in plot_vec(mtcars): is.vector(x) | is.factor(x) is not TRUE
It is working as expected, but we could improve that error message.
We can change from using stopifnot()
to using an if
statement with our condition along with stop()
, which can contain a more helpful error message. It can tell the user the structure of the provided object and instruct them to instead supply a vector.
<- function(x) {
plot_vec if (!(is.vector(x) | is.factor(x))) {
stop(paste("x is a", class(x)[[1]], "but it should be a vector"))
}
if (is.character(x) | is.logical(x)) {
plot(as.factor(x))
else {
} plot(x)
} }
Now try it out:
plot_vec(mtcars)
Error in plot_vec(mtcars): x is a data.frame but it should be a vector
2.7 Tidyverse Functions
tidyverse functions usually make code easier to read and write because we do not need to repeat the dataframe name each time2:
$var3 <-
df$var1 + df$var2
df
<-
df |>
df mutate(var3 = var1 + var2)
In the context of functions, however, using tidyverse functions requires extra work since we need to explicitly tell them to look within the dataframe to find the variables.
To illustrate the problem, here we make a function called wrong()
that tries to get the mean of a given variable, but it fails to find mpg
since it does not know to look inside mtcars
. Adding quotes around the variable name does not solve the issue either.
<- function(df, var) {
wrong |> summarize(mean = mean(var))
df
}wrong(mtcars, mpg)
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `mean = mean(var)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
mean
1 NA
wrong(mtcars, "mpg")
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `mean = mean(var)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
mean
1 NA
This issue can be solved in one of two ways within our function:
- Embrace the variable name with double curly braces:
{{ var }}
. Seeright1()
below. - Use the
.data
pronoun with a character vector:.data[[var]]
. Seeright2()
below.
<- function(df, var) {
right1 |> summarize(mean = mean({{ var }}))
df
}right1(mtcars, mpg)
mean
1 20.09062
<- function(df, var) {
right2 |> summarize(mean = mean(.data[[var]]))
df
}right2(mtcars, "mpg")
mean
1 20.09062
When would we use one over the other?
{{ var }}
is useful when we specify variable names in function arguments
.data[[var]]
is useful when we have a character vector returned by something else, like user input from a Shiny app or output from some other code like
<- colnames(mtcars)[[1]]
myVar myVar
[1] "mpg"
2.7.1 Embracing
We often need to calculate custom summary statistics for variables in our data. The code below takes a dataframe and variable as arguments, so we need to embrace the variable name in the function syntax:
<- function(df, var) {
mySummary5 |>
df summarize(n = n(),
mean = mean({{ var }}, na.rm = T),
median = median({{ var }}, na.rm = T),
sd = sd({{ var }}, na.rm = T),
se = sd({{ var }}, na.rm = T) / sqrt(n()))
}
mySummary5(airquality, Temp)
n mean median sd se
1 153 77.88235 79 9.46527 0.7652217
mySummary5(airquality, Ozone)
n mean median sd se
1 153 42.12931 31.5 32.98788 2.666912
(Note that we could have used .data[[var]]
within the function and then specified “Temp” and “Ozone” in the function calls.)
2.7.2 The .data Pronoun
In this example, let’s fit a regression model and then produce plots to assess the linearity assumption. Because we want to specify the variable names as character vectors, we need to specify .data[[var]]
within the function.
<- lm(mpg ~ wt + disp + drat, mtcars)
mod
<- function(df, mod, var) {
linearity_plot <-
df |>
df mutate(res = residuals(mod),
yhat = fitted(mod))
ggplot(df, aes(.data[[var]], res)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_smooth()
}
Now use the function to plot the fitted values and each predictor (just wt
here, but we should check all of them):
linearity_plot(mtcars, mod, "yhat")
linearity_plot(mtcars, mod, "wt")
(Note we could have used embracing ({{ var }}
) in the function and then referred to the vars as yhat
and wt
in the function calls.)
To show where we will go in the next chapter on Iteration, we can first programmatically extract the names of our predictors and then iteratively plot them with map()
:
<- colnames(model.frame(mod))[-1]
predictors predictors
[1] "wt" "disp" "drat"
map(predictors, \(x) linearity_plot(mtcars, mod, x))
[[1]]
[[2]]
[[3]]
2.8 Reusing Functions Across Scripts
We may find that some functions we write are actually quite useful, and we want to use them across projects. We know we should not copy and paste code again and again within a script, but what about between scripts? How can we make a function available to multiple projects?
We have two options: creating a script with functions, and writing a package. Package writing is outside the scope of this resource. Packages can be used locally without publishing them, but it would be quite an addition to your resume to publish packages to GitHub or even CRAN. See R Packages to learn more.
For now, let’s just put our functions inside a script somewhere on our computer. We then need to “source” that script. Just add a line to the beginning of your script where you load other libraries:
library(tidyverse)
source("path/to/file/myFunctions.R")
source()
will run a script. Any functions or objects created in that script will be added to the global environment, and then you can use them in your current project.
2.9 Exercises: Complex Functions
Revisit the time difference calculator from the previous exercises. Modify the function so that the units (years, months, days, etc.) can be specified.
time_since(19690720, "days")
[1] 20113
Sometimes, missing is coded with more than one missing code. Write a function that uses
ifelse()
and returns a vector after replacing specified values with NA. For example, if 8 and 9 are missing codes, this vector:set.seed(2) <- sample(c(1:3, 8, 9), 10, replace = T) x x
[1] 9 1 9 1 8 9 1 2 3 1
should be returned as:
to_na(x, 8:9)
[1] NA 1 NA 1 NA NA 1 2 3 1
Fix this function. The problem is that the output does not vary.
get_sample()
should return a sample of size n. Fix it so thatget_sample(3)
returns 3 values. Refer to?sample
.<- function(n) { get_sample sample(1:5, replace = T) } get_sample(3)
[1] 3 2 3 1 1
Fix this function. The problem is that there is no output.
fit_mod()
should return a fitted model object, but nothing is returned.<- function(df, outcome, predictors) { fit_mod <- form as.formula( paste(outcome, "~", paste(predictors, collapse = "+") ) )<- lm(form, data = df) mod } fit_mod(df = mtcars, outcome = "mpg", predictors = c("wt", "disp", "wt*disp"))
Fix this function. The problem is that the scatterplot only shows one point.
scatter()
should produce a scatterplot of a givenx
andy
from dataframedf
.<- function(df, x, y) { scatter ggplot(df, aes(x, y)) + geom_point() } scatter(mtcars, "wt", "mpg")
To learn more about environments in R, see the “Environments” chapters of Advanced R and Hands-On Programming with R.↩︎