During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:

    Error in match.fun(FUN) : argument "FUN" is missing, with no default

Of course, I had just forgotten to supply a function name to a command like `apply()`. The `apply()` function is a really useful way of extracting summary statistics from data sets. The basic format is

    apply(array, margin, function, ...)

- An `array` in R is a generic data type. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix…
- The `margin` argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
- The `function` can be any function, whether built in or user defined (this is what I was missing when I got the error above).
- The `...` after the function refers to any other arguments that need to be passed to the function being applied to the data.
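To make the four arguments concrete, here is a minimal sketch on a small made-up matrix (the data and the variable names are my own, purely for illustration). It applies a function over rows and over columns, passes `na.rm = TRUE` through `...`, and then leaves the function out on purpose to reproduce the error quoted above.

```r
# A small 2 x 3 matrix with one missing value (made-up data for illustration)
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3   NA
# [2,]    2    4    6

# margin = 1: one result per row; na.rm = TRUE is passed on to sum() via ...
row_sums <- apply(m, 1, sum, na.rm = TRUE)    # 4 12

# margin = 2: one result per column
col_means <- apply(m, 2, mean, na.rm = TRUE)  # 1.5 3.5 6.0

# A user-defined function works the same way as a built-in one
col_range <- apply(m, 2, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

# Leaving out the function triggers the error; tryCatch() captures its message
err <- tryCatch(apply(m, 1), error = function(e) conditionMessage(e))
print(err)
```

Wrapping the failing call in `tryCatch()` is only there so the script keeps running; at the console you would simply see the error.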

The `apply()` function internally uses a loop, and it first converts a data frame to a matrix, so if time and efficiency are very important, one of the other apply functions such as

    lapply(list, function, ...)

would be a better choice. The `lapply` command is designed for lists. It is particularly useful for data frames as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that `lapply` doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.

You can see the difference in the example below. The data set `cars` is a data frame that comes with R.

    mode(cars) # what data type is cars?
    [1] "list"
    head(cars) # output the first six entries in the data set
      speed dist
    1     4    2
    2     4   10
    3     7    4
    4     7   22
    5     8   16
    6     9   10
    apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
    speed  dist
    15.40 42.98
    lapply(cars,mean) # same thing treating cars as a data frame (list)
    $speed
    [1] 15.4

    $dist
    [1] 42.98
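The two calls return the same numbers in different containers, which matters if you want to reuse the result. The sketch below (variable names are my own) checks the return types and uses `unlist()` to flatten the `lapply` result into a named vector; it also shows that extra arguments are passed through `...` just as with `apply`.

```r
# cars ships with R, so no data needs to be loaded
col_means_vec  <- apply(cars, 2, mean)   # named numeric vector
col_means_list <- lapply(cars, mean)     # named list, one element per variable

is.numeric(col_means_vec)   # TRUE
is.list(col_means_list)     # TRUE

# unlist() flattens the list into a named vector comparable to apply()'s result
same <- all.equal(unlist(col_means_list), col_means_vec)
print(same)

# Extra arguments after the function are passed on, here trim = 0.1 to mean()
trimmed <- lapply(cars, mean, trim = 0.1)
print(trimmed)
```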

To show how much faster `lapply` is than `apply`, consider the following simulation:

    X <- matrix(rnorm(10000000), ncol = 2)
    X <- data.frame(X)
    system.time(apply(X, 2, mean))
       user  system elapsed
      0.573   0.394   0.965
    system.time(lapply(X, mean))
       user  system elapsed
      0.072   0.049   0.121

To perform the same operation, the `lapply` function was nearly 8 times faster than the `apply` function (0.121 versus 0.965 seconds elapsed). You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.
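If you want to try the comparison yourself, the following is a rough sketch rather than a careful benchmark. It uses a smaller data set and an arbitrary seed (both my own choices), confirms that the two approaches agree, and also times `as.matrix()`, since `apply` converts a data frame to a matrix before looping and that conversion is likely part of the gap. Timings will vary from machine to machine.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible
X <- data.frame(matrix(rnorm(200000), ncol = 2))

# Time each approach while keeping the results for comparison
t_apply  <- system.time(res_apply  <- apply(X, 2, mean))
t_lapply <- system.time(res_lapply <- lapply(X, mean))

# Time the data frame to matrix conversion that apply() performs first
t_conv <- system.time(M <- as.matrix(X))

# Both approaches should give the same column means
agree <- all.equal(unlist(res_lapply), res_apply)
print(agree)

# Elapsed seconds for each step (machine dependent)
print(c(apply     = t_apply[["elapsed"]],
        lapply    = t_lapply[["elapsed"]],
        as.matrix = t_conv[["elapsed"]]))
```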

To find out more about any of these functions or datasets use the help:

    ?apply
    ?lapply
    ?head
    ?cars

