1 Writing Functions - Basics

Eventually you will find that you want to write your own functions, and R is designed to make this very easy. For many of us this first comes up when we have several steps we want to do in sequence using one of the apply functions. These steps might be a series of R statements, or they might even be a nested sequence of functions that we want to refer to by a simple name.

As an example, consider a function to count the missing values (NAs) in a vector. We might want to apply this to the columns of a data frame, to the rows, or within groups specified by the values of some other vector.

# setup, a matrix with about 20% missing data
set.seed(20141117)
dm <- matrix(sample(0:9, 100, replace=TRUE), ncol=10)
dm[dm<2] <- NA
dm

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    4    2    9    5    6   NA    9    5    5     2
 [2,]    2    6    2    6    3    7    7    9    9     4
 [3,]   NA    4    7    4    4    7    9   NA    5     8
 [4,]   NA   NA    8    2    9    8    9   NA    5     7
 [5,]    6    3   NA    7    3   NA    7    4    7     3
 [6,]   NA    5    3    3    9   NA    5   NA    6    NA
 [7,]    3    8    3   NA    6   NA    7   NA   NA     3
 [8,]    7    7   NA    4    2    9    8   NA    8     9
 [9,]    8   NA    6    5    3    6    8   NA   NA    NA
[10,]   NA    3    9    5    3    2    9    5    5     6

We can count the total number of missing values in our matrix:

sum(is.na(dm))

[1] 23

But we have a problem using this code when we try to count NAs within each column:

apply(dm, 2, sum(is.na))

Error in sum(is.na): invalid 'type' (builtin) of argument

What we need is a function to use with apply.
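To see why the call fails, it may help to look at what sum(is.na) asks R to do. The following is a minimal sketch (the object name attempt is ours, purely for illustration): without parentheses, is.na is the function itself, and sum() has no way to add up a function.

```r
# is.na by itself (no parentheses) is the function object, not a result
is.function(is.na)

# so sum(is.na) asks sum() to add up a function; try() captures the error
attempt <- try(sum(is.na), silent = TRUE)
inherits(attempt, "try-error")

# by contrast, sum() happily accepts the logical vector that is.na() returns
sum(is.na(c(1, NA, 3, NA)))
```

In other words, apply needs to be handed a single function that does both steps, rather than the result of trying to evaluate sum(is.na).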

1.1 Defining a New Function

Defining a function is pretty simple, really. Typically a function has a name, an argument list (parameters, or “formals”), a body (the expressions that act on the arguments), and a return value.

name <- function(arg1, arg2, ...) {
    expression(arg1)
    ...
    value <- expression
    return(value)
}

(Basic documentation on writing functions is available through Help, in An Introduction to R, “Chapter 10: Writing your own functions”. See also help("function").)
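As one concrete instance of the template, here is a sketch of a two-argument function. The name nbelow and its arguments v and cutoff are hypothetical, chosen only to illustrate the parts: a name, an argument list, a body, and a return value.

```r
# hypothetical example: count the values in v that fall below cutoff
nbelow <- function(v, cutoff) {
  below <- v < cutoff                 # logical vector; NA wherever v is NA
  value <- sum(below, na.rm = TRUE)   # TRUE counts as 1, NAs are skipped
  return(value)
}

nbelow(c(4, 2, 9, NA, 1), 3)
```

Here 2 and 1 are the only values below 3, so the call returns 2.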

In our example, sum(is.na()) will be our expression or body. As an argument we will use v to stand for an arbitrary data object, and we want to return a scalar count that we’ll call rv. We’ll give this function the name nmiss.

nmiss <- function( v) {
  rv <- sum(is.na(v))
  return(rv)
}

Then use the function with the whole matrix, as before.

nmiss(dm)

[1] 23

And finally, use the function with apply:

apply(dm,1,nmiss) # missing per row

 [1] 1 0 2 3 2 4 4 2 4 1
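The same function can be applied to columns, or within groups. The sketch below uses a small hand-made matrix m and vectors x and g (hypothetical data, not dm) so that it runs on its own; tapply is one base R way of applying a function within groups defined by another vector.

```r
# nmiss as defined above, repeated so this sketch is self-contained
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

# a small hand-made matrix (hypothetical data)
m <- matrix(c( 1, NA, 3,
              NA, NA, 6,
               7,  8, 9,
              NA,  2, 3), ncol = 3, byrow = TRUE)

by_row <- apply(m, 1, nmiss)   # margin 1: one count per row
by_col <- apply(m, 2, nmiss)   # margin 2: one count per column

# counts within groups: g labels each element of x
x <- c(NA, 1, NA, 4, 5, NA)
g <- c("a", "a", "b", "b", "b", "b")
by_group <- tapply(x, g, nmiss)

by_row
by_col
by_group
```

With this data the row counts are 1, 2, 0, 1, the column counts are 2, 2, 0, and the groups "a" and "b" have 1 and 2 missing values respectively.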

We can store the results in another object in the usual way.

dm.missing <- nmiss(dm)
dm.missing

[1] 23

First notice that a new object, nmiss, has been added to our workspace, the global environment.

Notice also that the objects v and rv (two arbitrary names for data objects, i.e. two placeholders in our function definition) do not appear in our workspace. Think of them as local to the nmiss function: they exist only in the temporary environment that R creates each time nmiss is called, and they disappear when the call finishes.
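One way to check this for yourself is with exists(). This is a small sketch that assumes a fresh session in which you have not created an object named rv yourself; the names count, has_nmiss, and has_rv are ours, for illustration only.

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}
count <- nmiss(c(1, NA, 3))   # rv is created and discarded during this call

# the function itself can be found from where we are working
has_nmiss <- exists("nmiss")

# but rv was never created in the global environment (assuming a fresh session)
has_rv <- exists("rv", envir = globalenv(), inherits = FALSE)

has_nmiss
has_rv
```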

As objects in our workspace, functions have a class, and they can be printed (type the name without the parentheses), which shows us the details of how they were defined. You can do this with any function, not just those you define yourself!

class(nmiss) # functions have class

[1] "function"

nmiss        # as an "object" it can be printed

function( v) {
  rv <- sum(is.na(v))
  return(rv)
}
<environment: 0x0000022fe1e01840>
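The parts of a function can also be pulled out individually. A brief sketch using the base R functions formals() and body(), applied first to nmiss and then to the built-in sd (the name arg_names is ours):

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

formals(nmiss)   # the argument list: a single argument, v
body(nmiss)      # the expressions between the curly braces

# the same tools work on functions you did not write
arg_names <- names(formals(sd))
arg_names

sd               # printing a built-in function shows its definition too
```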

1.1.1 More About Returns

A function returns either the object passed to return() or, if return() is never called, the value produced by the last expression evaluated.

For example, this is a common way of specifying “rv” as the returned object:

nmiss <- function( v) {
  rv <- sum(is.na(v))
  rv
}

nmiss(dm)

[1] 23

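Notice that ending the function with an assignment appears NOT to work: calling it prints nothing (not even an error!):

nmiss <- function( v) {
  rv <- sum(is.na(v))
}
nmiss(dm)

The last expression evaluated was the assignment, "<-", a function used for its side effect. An assignment returns its value invisibly, so the function does return the count (not NULL), but nothing is printed automatically; print(nmiss(dm)) or x <- nmiss(dm) would recover the 23.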
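Because an assignment returns its value invisibly, the count is still there even though nothing prints. A minimal sketch, using a short vector x in place of dm (x and stored are our own names); withVisible() is a base R function that reports whether a value would be auto-printed:

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))   # last expression is an assignment
}
x <- c(1, NA, NA)

nmiss(x)                # nothing is printed at the console

stored <- nmiss(x)      # but the value can be captured ...
stored

(nmiss(x))              # ... or revealed by wrapping the call in parentheses

withVisible(nmiss(x))$visible   # FALSE: the value is returned invisibly
```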

But the following example does return what we want. We don't need to name the object we want to return; we can simply return the value of the last expression evaluated.

nmiss <- function( v) {
  sum(is.na(v))
}
nmiss(dm)

[1] 23

1.1.2 One-liners and Anonymous Functions

Notice the last example could be written on one line:

nmiss <- function( v) { sum(is.na(v))}
nmiss(dm)

[1] 23

And because the body is a single expression, we don't actually need the curly braces, whose job is to group multiple statements:

nmiss <- function( v) sum(is.na(v))
nmiss(dm)

[1] 23

It is not uncommon to see simple functions both defined and used within the same expression:

apply(dm, 1, function( v) sum(is.na(v))) # number of NAs per row

 [1] 1 0 2 3 2 4 4 2 4 1

The last example is often called an "anonymous function".

Here is another anonymous function (as an exercise, trace the order in which the various objects and functions are evaluated):

(function( v) sum(is.na(v)))(dm)

[1] 23
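Anonymous functions are just as handy with the other apply functions. As a sketch, here is the same one-liner used with sapply to count missing values in each column of a small data frame (df and col_missing are hypothetical names, and the data are made up for illustration):

```r
# a small made-up data frame with missing values in two columns
df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, NA, "z"),
                 c = 1:3)

# sapply passes each column in turn to the anonymous function as v
col_missing <- sapply(df, function(v) sum(is.na(v)))
col_missing
```

The result is a named vector with one count per column: 1 for a, 2 for b, and 0 for c.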

1.1.3 Style

Arguably, skipping the return makes one-liners and anonymous functions easier to read and debug, as long as they are short and sweet. But as you move on to more complicated functions, especially those that may conditionally have different sorts of return values, you will find them easier to read and debug if you do include the return expressions.
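To illustrate, here is a sketch of a function with more than one exit point. The name nmiss2 and its proportion argument are hypothetical extensions of nmiss, invented for this example; the explicit return() calls mark each place where the function can finish.

```r
# hypothetical variant of nmiss: a count by default, a proportion on request
nmiss2 <- function(v, proportion = FALSE) {
  if (length(v) == 0) {
    return(NA_integer_)        # nothing to count
  }
  n <- sum(is.na(v))
  if (proportion) {
    return(n / length(v))      # fraction of values that are missing
  }
  return(n)                    # default: the raw count
}

nmiss2(c(1, NA, 3, NA))                      # count
nmiss2(c(1, NA, 3, NA), proportion = TRUE)   # proportion
nmiss2(numeric(0))                           # empty input
```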

1.2 Exercises

In the Motor Trend car tests data, mtcars, vehicle weight, wt, is reported in thousands of pounds. Write a function that converts thousands-of-pounds to kilograms. Bonus: Use the result to show (by calculation) that this leaves the correlation between wt in kilos and mpg unaffected.

The same data set reports fuel consumption in miles per gallon. Write a function that converts this to kilometers per liter. Bonus: show that this conversion leaves the correlation between the rescaled variables unaffected.

R does not have a standard-error-of-the-mean function. Write one, then produce a table of means and standard errors for the variables in the mtcars data.

When working with time-series data it is often useful to identify gaps in the series. Write a function that identifies gaps by indicating observations preceded by a gap. For example:

 1990  1991  1992  1993  1994  1997  1998  1999  2000 
   NA FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

Bonus: write a function that fills out the gap.

In survey data, it is common for data to be coded with 8 = don't know and 9 = refused to answer. We need to convert these to NAs for most statistical work. Write a function that takes a vector or matrix as input, and returns the recoded vector/matrix. For example:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    2    9    5    6   NA    9    5    5     2
[2,]    2    6    2    6    3    7    7    9    9     4
[3,]   NA    4    7    4    4    7    9   NA    5     8
[4,]   NA   NA    8    2    9    8    9   NA    5     7
[5,]    6    3   NA    7    3   NA    7    4    7     3
[6,]   NA    5    3    3    9   NA    5   NA    6    NA

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    2   NA    5    6   NA   NA    5    5     2
[2,]    2    6    2    6    3    7    7   NA   NA     4
[3,]   NA    4    7    4    4    7   NA   NA    5    NA
[4,]   NA   NA   NA    2   NA   NA   NA   NA    5     7
[5,]    6    3   NA    7    3   NA    7    4    7     3
[6,]   NA    5    3    3   NA   NA    5   NA    6    NA

Note: This is a simplification of a couple of existing R functions. Functions such as the one you are asked to produce are often termed "convenience" functions or "wrappers". Depending on how you solve this problem, the function you simplify may itself be a convenience wrapper - take a look at the code inside the function you use!

Average compounded growth. Given two vectors, one of starting values (say, starting salary) and another of ending values (current salary), we often want to characterize the growth rate that led from one to the other. If we additionally know how many growth periods there were between each pair of values, we can calculate an average growth rate that takes compounding into account.

Write a function that returns the average growth rate as a fraction. In other words, if y = ending value, x = starting value, t = number of time periods, and r = growth multiplier, return r - 1. The fundamental relation is \[y = x \cdot r^{t}\]

Write a function returning the "degree of consensus", defined as 1 - (variance of the responses)/(maximum possible variance), where the respondents have given answers on some bounded scale (e.g. a Likert scale from 1 to 5). Complete consensus = 1, maximum disagreement = 0.