Writing Functions
June 2021
1 Writing Functions - Basics
Eventually you will find that you want to write your own functions, and R is designed to make this very easy. For many of us this first comes up when we have several steps we want to do in sequence using one of the apply functions. These steps might be a series of R statements, or they might even be a nested sequence of functions that we want to refer to by a simple name.
As an example, consider a function to count the missing values (NAs) in a vector. We might want to apply this to the columns of a data frame, to the rows, or within groups specified by the values of some other vector.
# setup, a matrix with about 20% missing data
set.seed(20141117)
dm <- matrix(sample(0:9, 100, replace=TRUE), ncol=10)
dm[dm<2] <- NA
dm
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 2 9 5 6 NA 9 5 5 2
[2,] 2 6 2 6 3 7 7 9 9 4
[3,] NA 4 7 4 4 7 9 NA 5 8
[4,] NA NA 8 2 9 8 9 NA 5 7
[5,] 6 3 NA 7 3 NA 7 4 7 3
[6,] NA 5 3 3 9 NA 5 NA 6 NA
[7,] 3 8 3 NA 6 NA 7 NA NA 3
[8,] 7 7 NA 4 2 9 8 NA 8 9
[9,] 8 NA 6 5 3 6 8 NA NA NA
[10,] NA 3 9 5 3 2 9 5 5 6
We can count the total number of missing values in our matrix:
sum(is.na(dm))
[1] 23
But we have a problem using this code when we try to count NAs within each column:
apply(dm, 2, sum(is.na))
Error in sum(is.na): invalid 'type' (builtin) of argument
What we need is a function to use with apply.
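The error arises because sum(is.na) asks R to add up the function is.na itself, rather than handing apply a function it can call on each column. A minimal sketch of the distinction, using a small made-up vector x that is not part of the example above:

```r
x <- c(3, NA, 7, NA)           # hypothetical data for illustration

is.function(is.na)             # is.na on its own is a function object
sum(is.na(x))                  # calling is.na first gives sum() logical values to add up

# sum(is.na) passes the function itself to sum(), which is an error;
# tryCatch() lets us record that without stopping the script
res <- tryCatch(sum(is.na), error = function(e) "error")
res
```

The third argument of apply has to be a single function object, which is why the two steps need to be wrapped up under one name.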
1.1 Defining a New Function
Defining a function is pretty simple, really. Typically a function has a name, an argument list (parameters, or “formals”), a body (the expressions that act on the arguments), and a return value.
name <- function(arg1, arg2, ...) {
expression(arg1)
...
value <- expression
return(value)
}
(Basic documentation on writing functions is in Help in An Introduction to R, “Chapter 10: Writing your own functions”. See also help("function").)
In our example, sum(is.na()) will be our expression or body. As an argument we will use v to stand for an arbitrary data object, and we want to return a scalar count that we’ll call rv. We’ll give this function the name nmiss.
nmiss <- function( v) {
rv <- sum(is.na(v))
return(rv)
}
Then use the function with the whole matrix, as before.
nmiss(dm)
[1] 23
And finally use the function with apply:
apply(dm,1,nmiss) # missing per row
[1] 1 0 2 3 2 4 4 2 4 1
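The same function can also be applied by column or within groups. The sketch below is self-contained and uses a small made-up matrix m, vector x, and grouping vector g (none of them from the example above). Using tapply for the grouped case is one reasonable choice, not the only one.

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

# hypothetical 2 x 3 matrix, filled row by row
m <- matrix(c( 1, NA, 3,
              NA, NA, 6), nrow = 2, byrow = TRUE)

by_row <- apply(m, 1, nmiss)     # one count per row
by_col <- apply(m, 2, nmiss)     # one count per column

# hypothetical data vector with a grouping vector of the same length
x <- c(1, NA, 3, NA, NA, 6)
g <- c("a", "a", "a", "b", "b", "b")
by_group <- tapply(x, g, nmiss)  # one count per group in g
```

Only the second argument of apply changes between rows (1) and columns (2); nmiss itself is untouched.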
We can store the results in another object in the usual way.
dm.missing <- nmiss(dm)
dm.missing
[1] 23
First notice that a new object, nmiss, has been added to our workspace, the global environment. Notice also that the objects v and rv (two arbitrary names for data objects, i.e. two placeholders in our function definition) do not appear in our workspace. Think of them as local to the nmiss function: they exist only in the temporary environment that R creates each time nmiss is called.
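One way to check this is with exists(). The sketch below uses deliberately unusual, hypothetical names (count_na_demo, v_demo, rv_demo) so that they cannot clash with anything already in the workspace.

```r
# hypothetical copy of nmiss with distinctive names
count_na_demo <- function(v_demo) {
  rv_demo <- sum(is.na(v_demo))
  return(rv_demo)
}

result_demo <- count_na_demo(c(1, NA, 3))

exists("count_na_demo")   # the function itself is in the workspace
exists("result_demo")     # so is the object we assigned the result to
exists("v_demo")          # the argument name is not
exists("rv_demo")         # neither is the object created inside the function
```

Only names assigned outside the function body persist after the call has finished.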
As objects in our workspace, functions have class, and they can be printed (do not include the parentheses with the name), which shows us the details of how they were defined. You can do this with any function, not just those you define yourself!
class(nmiss) # functions have class
[1] "function"
nmiss # as an "object" it can be printed
function( v) {
rv <- sum(is.na(v))
return(rv)
}
<environment: 0x0000022fe1e01840>
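The parts of a function named earlier (argument list and body) can also be pulled out individually. A brief sketch, redefining nmiss so that the block is self-contained:

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

names(formals(nmiss))   # the argument names, here just "v"
body(nmiss)             # the expressions between the curly braces
class(sum)              # built-in functions are also of class "function"
```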
1.1.1 More About Returns
A function returns either the object passed to return() or, if return() is not called, the value produced by the last expression evaluated.
For example, this is a common way of specifying “rv” as the returned object:
nmiss <- function( v) {
rv <- sum(is.na(v))
rv
}
nmiss(dm)
[1] 23
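Notice that ending the function with an assignment appears NOT to work: the call prints nothing (not even an error!):
nmiss <- function( v) {
rv <- sum(is.na(v))
}
nmiss(dm)
The last expression evaluated was the assignment, "<-", a function used for its side effect. An assignment returns its value invisibly, so nothing is printed at the console, although the count is still returned and can be stored in an object or printed explicitly with print(nmiss(dm)).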
But the following example does print what we want. We don't need to name the object we want to return; we can simply return the value of the last expression evaluated.
nmiss <- function( v) {
sum(is.na(v))
}
nmiss(dm)
[1] 23
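Returning to the version whose last expression is an assignment, it can be useful to check what such a function actually hands back. The sketch below uses a hypothetical name, nmiss_assign, and a small made-up vector x. withVisible() is a base R function that reports whether a value would be auto-printed.

```r
# hypothetical variant of nmiss whose last expression is an assignment
nmiss_assign <- function(v) {
  rv <- sum(is.na(v))
}

x <- c(1, NA, 3, NA)                   # hypothetical data

nmiss_assign(x)                        # prints nothing at the console
count <- nmiss_assign(x)               # but the value can still be captured
count                                  # and printed once it is stored
withVisible(nmiss_assign(x))$visible   # FALSE: the value is returned invisibly
```

So the count is returned, just not printed. Ending the function with rv, return(rv), or the bare expression makes the result visible.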
1.1.2 One-liners and Anonymous Functions
Notice the last example could be written on one line:
nmiss <- function( v) { sum(is.na(v))}
nmiss(dm)
[1] 23
And because the body is a single expression, we don't actually need the curly braces, which are only required to group multiple statements:
nmiss <- function( v) sum(is.na(v))
nmiss(dm)
[1] 23
It is not uncommon to see simple functions both defined and used within the same expression:
apply(dm, 1, function( v) sum(is.na(v))) # number of NAs per row
[1] 1 0 2 3 2 4 4 2 4 1
The last example is often called an "anonymous function".
Here is another anonymous function (as an exercise, trace the order in which the various objects and functions are evaluated):
(function( v) sum(is.na(v)))(dm)
[1] 23
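Anonymous functions work the same way with the other apply functions. As a sketch, here sapply counts NAs in each column of a small made-up data frame (df and its columns are hypothetical, not part of the example above):

```r
# hypothetical data frame with a few missing values
df <- data.frame(
  age    = c(23, NA, 31, 40),
  income = c(NA, NA, 52, 48),
  id     = 1:4
)

# the anonymous function is called once per column
na_per_col <- sapply(df, function(v) sum(is.na(v)))
na_per_col
```

The result is a named vector, with one count per column of df.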
1.1.3 Style
Arguably, skipping the return makes one-liners and anonymous functions easier to read and debug, as long as they are short and sweet. But as you move on to more complicated functions, especially those that may conditionally have different sorts of return value, you will find them easier to read and debug if you do include the return expressions.
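As an illustration of the second case, here is a sketch of a function with several exit points. The name prop_missing and its behavior for empty input are assumptions made for this example only.

```r
# prop_missing is a hypothetical function, not part of the original text
prop_missing <- function(v) {
  if (length(v) == 0) {
    return(NA_real_)        # nothing to count: return a missing value
  }
  if (!anyNA(v)) {
    return(0)               # no missing values at all
  }
  return(mean(is.na(v)))    # otherwise, the proportion of NAs
}

prop_missing(c())               # empty input
prop_missing(c(1, 2, 3))        # complete data
prop_missing(c(1, NA, NA, 4))   # half the values missing
```

With an explicit return at each exit point, a reader can see at a glance which value each condition produces.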
1.2 Exercises
- In the Motor Trend car tests data, mtcars, vehicle weight, wt, is reported in thousands of pounds. Write a function that converts thousands-of-pounds to kilograms. Bonus: Use the result to show (by calculation) that this leaves the correlation between wt in kilos and mpg unaffected.
- The same data set reports fuel consumption in miles per gallon. Write a function that converts this to kilometers per liter. Bonus: show that this conversion leaves the correlation between the rescaled variables unaffected.
- R does not have a standard-error-of-the-mean function. Write one, then produce a table of means and standard errors for the variables in the mtcars data.
- When working with time-series data it is often useful to identify gaps in the series. Write a function that identifies gaps by indicating observations preceded by a gap. For example:
1990 1991 1992 1993 1994 1997 1998 1999 2000
NA FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Bonus: write a function that fills out the gap.
- In survey data, it is common for data to be coded with 8 = don't know and 9 = refused to answer. We need to convert these to NAs for most statistical work. Write a function that takes a vector or matrix as input, and returns the recoded vector/matrix. For example:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 2 9 5 6 NA 9 5 5 2
[2,] 2 6 2 6 3 7 7 9 9 4
[3,] NA 4 7 4 4 7 9 NA 5 8
[4,] NA NA 8 2 9 8 9 NA 5 7
[5,] 6 3 NA 7 3 NA 7 4 7 3
[6,] NA 5 3 3 9 NA 5 NA 6 NA
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 2 NA 5 6 NA NA 5 5 2
[2,] 2 6 2 6 3 7 7 NA NA 4
[3,] NA 4 7 4 4 7 NA NA 5 NA
[4,] NA NA NA 2 NA NA NA NA 5 7
[5,] 6 3 NA 7 3 NA 7 4 7 3
[6,] NA 5 3 3 NA NA 5 NA 6 NA
Note: This is a simplification of a couple of existing R functions. Functions such as the one you are asked to produce are often termed "convenience" functions or "wrappers". Depending on how you solve this problem, the function you simplify may itself be a convenience wrapper - take a look at the code inside the function you use!
- Average compounded growth. Given two vectors, one of starting values (say, starting salary) and another of ending values (current salary), we often want to characterize the growth rate that led from one to the other. If we additionally know how many growth periods there were between each pair of values, we can calculate an average growth rate that takes compounding into account.
Write a function that returns the average growth rate as a fraction. In other words, if y = ending value, x=starting value, t=number of time periods, and r = growth multiplier, return r-1. The fundamental relation is \[y=x*(r)^t\]
- Write a function returning the "degree of consensus", defined as 1 - (variance of respondents)/(max possible variance), where the respondents have given answers on some bounded scale (e.g. a Likert scale from 1 to 5). Complete consensus = 1, maximum disagreement = 0.