During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Of course, I had just forgotten to supply a function name to a command like apply(). The apply() function is a really useful way of extracting summary statistics from data sets. The basic format is
apply(array, margin, function, ...)
- An array in R is a generic data type. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix…
- The margin argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
- The function can be any function that is built in or user defined (this is what I was missing when I got the error above).
- The ... after the function refers to any other arguments that need to be passed to the function being applied to the data.
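To make the four arguments concrete, here is a minimal sketch on a small made-up matrix. The matrix m and the helper value_range are my own illustrative names, not part of R; the last line deliberately leaves out the function to reproduce the error quoted at the top.

```r
# A small 2 x 3 matrix with one missing value (made-up data for illustration)
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 2)

# Margin 1: apply the function to each row.
# na.rm = TRUE is an extra argument handed on to sum() through the ...
row_sums <- apply(m, 1, sum, na.rm = TRUE)

# Margin 2: apply the function to each column
col_max <- apply(m, 2, max, na.rm = TRUE)

# A user-defined function (hypothetical helper) works the same way
value_range <- function(x, ...) diff(range(x, ...))
col_range <- apply(m, 2, value_range, na.rm = TRUE)

print(row_sums)
print(col_max)
print(col_range)

# Forgetting the function triggers the "missing FUN" error; capture its message
msg <- tryCatch(apply(m, 2), error = function(e) conditionMessage(e))
print(msg)
```

Because the matrix has no row or column names, the results come back as plain unnamed vectors, one value per row or column.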
The apply function internally uses an R-level loop, so if time and efficiency are very important, one of the other apply functions such as
lapply(list, function, ...)
would be a better choice. The lapply command is designed for lists. It is particularly useful for data frames, as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that lapply doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.
You can see the difference in the example below. The data set cars is a data frame that comes with R.
mode(cars) # what data type is cars?
[1] "list"
head(cars) # output the first six entries in the data set
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
speed dist
15.40 42.98
lapply(cars,mean) # same thing treating cars as a data frame (list)
$speed
[1] 15.4
$dist
[1] 42.98
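The two calls return the same numbers in different containers: apply gives a named numeric vector, while lapply gives a named list. A short sketch of how to check that they agree, and of how the ... works with lapply too (the variable names and the choice of quantile are mine, purely for illustration):

```r
col_means_apply  <- apply(cars, 2, mean)    # named numeric vector
col_means_lapply <- lapply(cars, mean)      # named list, one element per variable

# unlist() flattens the list into a named numeric vector, so the two can be compared
same_result <- all.equal(col_means_apply, unlist(col_means_lapply))
print(same_result)

# Extra arguments after the function are passed through by lapply() as well
quartiles <- lapply(cars, quantile, probs = c(0.25, 0.75))
print(quartiles)
```

If you would rather get a vector back directly, unlist() on the lapply result is usually all that is needed for simple summaries like these.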
To show how much faster lapply is than apply, consider the following simulation:
X = matrix(rnorm(10000000),ncol=2)
X=data.frame(X)
system.time(apply(X,2,mean))
user system elapsed
0.573 0.394 0.965
system.time(lapply(X,mean))
user system elapsed
0.072 0.049 0.121
To perform the same operation, the lapply function was nearly 8 times faster than the apply function. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.
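If you want to repeat the comparison yourself, the sketch below uses a smaller simulated data set and also times as.matrix() on its own, since apply converts a data frame to a matrix before it starts looping and that conversion is likely part of the gap. The seed, the sample size and the variable names are arbitrary choices of mine, and the timings will differ from machine to machine.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility only
X <- data.frame(matrix(rnorm(1e6), ncol = 2))  # smaller than the simulation above

t_apply  <- system.time(res_apply  <- apply(X, 2, mean))
t_matrix <- system.time(M <- as.matrix(X))     # the conversion step on its own
t_lapply <- system.time(res_lapply <- lapply(X, mean))

# Both approaches should give the same column means
stopifnot(isTRUE(all.equal(res_apply, unlist(res_lapply))))

# Elapsed times in seconds, side by side
elapsed <- c(apply     = t_apply[["elapsed"]],
             as.matrix = t_matrix[["elapsed"]],
             lapply    = t_lapply[["elapsed"]])
print(elapsed)
```

On a data set this small the differences may be too close to zero to read much into, which fits the point above about needing a reasonably large data set.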
To find out more about any of these functions or datasets use the help:
?apply
?lapply
?head
?cars