During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

Of course, I had just forgotten to supply a function name to a command like `apply()`. The `apply()` function is a really useful way of extracting summary statistics from data sets. The basic format is

apply(array, margin, function, ...)

- An `array` in R is a generic data type. A zero-dimensional array is a scalar or a point; a one-dimensional array is a vector; and a two-dimensional array is a matrix…
- The `margin` argument specifies which margin the function is applied over. If the array is a matrix, the margin can be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
- The `function` can be any function, built in or user defined (this is what I was missing when I got the error above).
- The `...` after the function refers to any other arguments that need to be passed to the function being applied to the data.
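As a small illustration of these arguments, here is a sketch using a made-up 2 x 3 matrix `m` (my own toy data, not from any real data set). It passes `na.rm = TRUE` through `...` to `mean`, applies a user-defined function, and reproduces the missing `FUN` error on purpose:

```r
# A small made-up matrix, purely for illustration
m <- matrix(c(1, 2, 3, 4, 5, NA), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4   NA

row_means <- apply(m, 1, mean)   # margin 1: one result per row    -> 3 NA
col_means <- apply(m, 2, mean)   # margin 2: one result per column

# na.rm = TRUE is not an argument of apply(); it is passed on to mean() via ...
row_means_narm <- apply(m, 1, mean, na.rm = TRUE)   # -> 3 3

# A user-defined function works the same way: count non-missing values per column
n_present <- apply(m, 2, function(x) sum(!is.na(x)))

# Forgetting the function triggers the error; tryCatch() captures its message
msg <- tryCatch(apply(m, 1), error = function(e) conditionMessage(e))
```

Wrapping the faulty call in `tryCatch()` is only there so the script keeps running; at the console you would simply see the error.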

The `apply` function uses a loop internally, and it converts a data frame to a matrix before it starts, so if time and efficiency are very important, one of the other apply functions such as

lapply(list, function, ...)

would be a better choice. The `lapply` command is designed for lists. It is particularly useful for data frames, as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that `lapply` doesn’t have a margin argument, as it simply applies the function to each of the variables in the data frame.

You can see the difference in the example below. The data set `cars` is a data frame that comes with R.

mode(cars) # what data type is cars?
[1] "list"
head(cars) # output the first six entries in the data set
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
speed dist
15.40 42.98
lapply(cars,mean) # same thing treating cars as a data frame (list)
$speed
[1] 15.4
$dist
[1] 42.98
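The two results hold the same numbers in different containers: `apply` returns a named numeric vector, while `lapply` returns a list. A quick way to check this, using base R's `unlist()` and `all.equal()` (my additions, not part of the example above), might look like:

```r
col_means_apply  <- apply(cars, 2, mean)   # named numeric vector
col_means_lapply <- lapply(cars, mean)     # list with one element per variable

is_a_list <- is.list(col_means_lapply)     # TRUE: lapply() always returns a list

# Flatten the list into a named vector and compare it with the apply() result
same_values <- all.equal(unlist(col_means_lapply), col_means_apply)
```

If a plain vector is more convenient downstream, flattening the `lapply` result this way is one option.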

To show how much faster `lapply` is than `apply`, consider the following simulation:

X = matrix(rnorm(10000000),ncol=2)
X=data.frame(X)
system.time(apply(X,2,mean))
user system elapsed
0.573 0.394 0.965
system.time(lapply(X,mean))
user system elapsed
0.072 0.049 0.121

To perform the same operation, the `lapply` function was nearly 8 times faster than the `apply` function in this run. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.
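If you want to repeat the comparison yourself, a slightly more careful sketch is below. It fixes a seed, uses a smaller simulated data set so it runs quickly, and checks that both calls give the same answer before looking at the timings. The seed, the data size and the variable names are arbitrary choices of mine, and the elapsed times will differ from machine to machine.

```r
set.seed(1)   # arbitrary seed, only so the simulated data are reproducible
X <- data.frame(matrix(rnorm(200000), ncol = 2))

# Time each call and keep its result for comparison
t_apply  <- system.time(res_apply  <- apply(X, 2, mean))
t_lapply <- system.time(res_lapply <- lapply(X, mean))

# Same numbers either way; only the container differs (vector vs list)
results_match <- isTRUE(all.equal(unlist(res_lapply), res_apply))

# Elapsed seconds for each call; these values depend on the machine
timings <- c(apply = t_apply[["elapsed"]], lapply = t_lapply[["elapsed"]])
```

With a data set this small the gap may be barely visible, which fits the point that the difference only matters once the data are reasonably large.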

To find out more about any of these functions or datasets use the help:

?apply
?lapply
?head
?cars