During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

Of course, I had just forgotten to supply a function name to a command like apply(). The apply() function is a really useful way of extracting summary statistics from data sets. The basic format is

apply(array, margin, function, ...)
  • An array in R is a generic data type. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix…
  • The margin argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
  • The function can be any function that is built in or user defined (this is what I was missing when I got the error above).
  • The ... after the function refers to any other arguments that need to be passed to the function being applied to the data.
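
As a small sketch of how these four pieces fit together, the following applies a function over the rows and columns of a matrix and passes an extra argument through the ... slot. The matrix m and its values are made up purely for illustration.

```r
# A small illustrative matrix: 2 rows, 3 columns, with one missing value
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 2)

row_sums <- apply(m, 1, sum)   # margin 1: one result per row
col_sums <- apply(m, 2, sum)   # margin 2: one result per column

# na.rm = TRUE is not an argument of apply(); it is handed on to sum() via ...
col_sums_narm <- apply(m, 2, sum, na.rm = TRUE)

# A user-defined function is supplied in exactly the same way
col_ranges <- apply(m, 2, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

print(row_sums)
print(col_sums_narm)
print(col_ranges)
```

Leaving out the third argument, as in apply(m, 2), is what triggers the "FUN is missing" error quoted at the top.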

The apply function internally uses an R-level loop and, when given a data frame, first converts it to a matrix, so if time and efficiency are very important one of the other apply functions such as

lapply(list, function, ...)

would be a better choice. The lapply command is designed for lists. It is particularly useful for data frames as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that lapply doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.
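
One practical consequence of treating a data frame as a list rather than as a matrix shows up when the columns have different types. The sketch below uses a hypothetical data frame df (not part of the original example) to compare what each function sees.

```r
# A hypothetical data frame mixing a numeric and a character column
df <- data.frame(x = c(1.5, 2.5, 3.5), g = c("a", "b", "c"))

# lapply visits each column as it is stored, so x keeps its numeric type
types_lapply <- lapply(df, class)

# apply converts the data frame to a matrix first; a matrix holds a single
# type, so here every column is seen as character
types_apply <- apply(df, 2, class)

print(types_lapply$x)
print(types_apply)
```

This is one reason lapply tends to be the safer choice for data frames whose variables are not all numeric.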

You can see the difference in the example below. The data set cars is a data frame that comes with R.

mode(cars) # what data type is cars?
[1] "list"
head(cars) # output the first six entries in the data set
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
speed  dist
15.40 42.98
lapply(cars,mean) # same thing treating cars as a data frame (list)
$speed
[1] 15.4

$dist
[1] 42.98
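
The two calls return the same numbers in different containers: apply gives a named vector, while lapply gives a list. If a vector is what you want, a minimal sketch of two ways to get one from the list version is shown here; sapply is another member of the apply family in base R that is not discussed above.

```r
col_means_apply  <- apply(cars, 2, mean)      # named numeric vector
col_means_list   <- lapply(cars, mean)        # list with one element per column
col_means_unlist <- unlist(col_means_list)    # flatten the list into a named vector
col_means_sapply <- sapply(cars, mean)        # lapply plus simplification in one call

# All three vectors should hold the same values under the same names
print(all.equal(col_means_apply, col_means_unlist))
print(all.equal(col_means_apply, col_means_sapply))
```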

To show how much faster lapply is than apply, consider the following simulation:

X <- matrix(rnorm(10000000), ncol = 2) # 5,000,000 rows by 2 columns
X <- data.frame(X)
system.time(apply(X, 2, mean))
   user  system elapsed 
  0.573   0.394   0.965 
system.time(lapply(X, mean))
   user  system elapsed 
  0.072   0.049   0.121 

In this run, lapply was nearly 8 times faster than apply at the same operation (0.121 versus 0.965 seconds elapsed); exact timings will vary from machine to machine. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.
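
If you want to explore where the time goes on your own machine, one possible sketch is to time the matrix conversion separately and to compare apply on a matrix with apply on a data frame. The object names, the seed and the smaller size n are my own choices here, and no timings are shown because they depend on the machine and R version.

```r
set.seed(1)   # arbitrary seed so the simulated data are repeatable
n <- 1e5      # smaller than the simulation above; increase to taste

X_mat <- matrix(rnorm(2 * n), ncol = 2)
X_df  <- data.frame(X_mat)

# Cost of converting the data frame to a matrix, which apply does internally
t_convert   <- system.time(as.matrix(X_df))

t_apply_df  <- system.time(m_apply_df  <- apply(X_df, 2, mean))
t_apply_mat <- system.time(m_apply_mat <- apply(X_mat, 2, mean))
t_lapply    <- system.time(m_lapply    <- unlist(lapply(X_df, mean)))

# Elapsed times side by side
print(rbind(convert   = t_convert,
            apply_df  = t_apply_df,
            apply_mat = t_apply_mat,
            lapply    = t_lapply)[, "elapsed"])

# Whatever the timings, the answers should agree
print(all.equal(unname(m_apply_df), unname(m_lapply)))
```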

To find out more about any of these functions or data sets, use R's built-in help:

?apply
?lapply
?head
?cars