During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Of course, I just forgot to supply a function name to a command like apply().
The apply() function is a really useful way of extracting summary statistics from data sets. The basic format is
apply(array, margin, function, ...)
An array in R is a generic data type. A zero-dimensional array is a scalar or a point; a one-dimensional array is a vector; and a two-dimensional array is a matrix…
The margin argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
The function can be any function, built in or user defined (this is what I was missing when I got the error above).
The ... after the function refers to any other arguments that need to be passed to the function being applied to the data.
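To make the four arguments concrete, here is a minimal sketch on a tiny matrix. The names m, m_na, row_sums, col_sums, col_means and msg are made up for illustration and are not part of the original example.

```r
# A small 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # margin 1: one result per row    -> 9 12
col_sums <- apply(m, 2, sum)  # margin 2: one result per column -> 3 7 11

# Anything after the function is passed on to it through `...`.
m_na <- m
m_na[1, 1] <- NA
col_means <- apply(m_na, 2, mean, na.rm = TRUE)  # -> 2.0 3.5 5.5

# Leaving the function out reproduces the "missing FUN" error;
# tryCatch() captures the message instead of stopping the script.
msg <- tryCatch(apply(m, 1), error = function(e) conditionMessage(e))
print(msg)
```

Here na.rm = TRUE is not an argument of apply itself; it travels through ... to mean, which is the usual way to tweak the function being applied.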
The apply function internally uses a loop, so if time and efficiency are very important, one of the other apply functions such as
lapply(list, function, ...)
would be a better choice. The lapply command is designed for lists. It is particularly useful for data frames as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that lapply doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.
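The idea that a data frame is a list of variables can be checked directly. The sketch below uses a small invented data frame (df, with made-up height and weight columns) purely for illustration.

```r
# A toy data frame with two variables; the values are invented.
df <- data.frame(height = c(150, 160, 170),
                 weight = c(50, 60, NA))

is.list(df)  # TRUE: a data frame is stored as a list
length(df)   # 2: one list element per variable
names(df)    # "height" "weight"

# No margin argument: the function is applied to each variable in turn,
# and extra arguments are still passed along through `...`.
means <- lapply(df, mean, na.rm = TRUE)
means$height  # 160
means$weight  # 55
```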
You can see the difference in the example below. The data set cars is a data frame that comes with R.
mode(cars) # what data type is cars?
[1] "list"
head(cars) # output the first six entries in the data set
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
speed  dist
15.40 42.98
lapply(cars,mean) # same thing treating cars as a data frame (list)
$speed
[1] 15.4

$dist
[1] 42.98
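One practical difference visible in that output is the shape of the result: apply hands back a named numeric vector, while lapply hands back a list. A short sketch of how the two can be compared, using the same cars data (the variable names a, l and l_flat are only for illustration):

```r
a <- apply(cars, 2, mean)   # named numeric vector
l <- lapply(cars, mean)     # list with one element per variable

is.numeric(a)  # TRUE
is.list(l)     # TRUE

# unlist() flattens the list into a named vector so the two can be compared.
l_flat <- unlist(l)
all.equal(a, l_flat)  # TRUE
```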
To show how much faster lapply is than apply, consider the following simulation:
X = matrix(rnorm(10000000), ncol=2)
X = data.frame(X)
system.time(apply(X, 2, mean))
   user  system elapsed
  0.573   0.394   0.965
system.time(lapply(X, mean))
   user  system elapsed
  0.072   0.049   0.121
To perform the same operation, the lapply function was nearly 8 times faster than the apply function. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.
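If you want to try the comparison yourself, the sketch below is a smaller, seeded variant of the simulation. The sample size, the seed and the names t_apply and t_lapply are my own choices, and the timings will differ from machine to machine, so no expected numbers are shown. Part of the gap is likely because apply first converts a data frame to a matrix with as.matrix before looping, whereas lapply works on the columns as they are.

```r
set.seed(1)  # arbitrary seed so the data are reproducible

# 200,000 draws in two columns: much smaller than the 10 million above.
X <- data.frame(matrix(rnorm(2e5), ncol = 2))

# Elapsed wall-clock time in seconds for each approach.
t_apply  <- system.time(apply(X, 2, mean))[["elapsed"]]
t_lapply <- system.time(lapply(X, mean))[["elapsed"]]

print(c(apply = t_apply, lapply = t_lapply))
```

On a data set this small both calls may finish too quickly to measure reliably; increase the number of draws to see a clearer gap.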
To find out more about any of these functions or datasets use the help:
?apply
?lapply
?head
?cars