Missing the FUN
During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Of course, I just forgot to supply a function name a command like apply()
. The apply()
function is really useful way of extracting summary statistics from data sets. The basic format is
apply(array, margin, function, ...)
- An
array
in R is a generic data type. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix… - The
margin
argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix). - The
function
can be any function that is built in or user defined (this is what I was missing when I got the error above). - The
...
after the function refers to any other arguments that needs to be passed to the function being applied to the data.
The apply function internally uses a loop so if time and efficiency is very important one of the other apply functions such as
lapply(list, function, ...)
would be a better choice. The lapply
command is designed for lists. It is particularly useful for data frames as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that lapply
doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.
You can see the difference in the example below. The data set cars
is a data frame that comes with R.
mode(cars) # what data type is cars? [1] "list" head(cars) # output the first six entries in the data set speed dist 1 4 2 2 4 10 3 7 4 4 7 22 5 8 16 6 9 10 apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array) speed dist 15.40 42.98 lapply(cars,mean) # same thing treating cars as a data frame (list) $speed [1] 15.4 $dist [1] 42.98
To show how much faster lapply
is than apply
, consider the following simulation:
X = matrix(rnorm(10000000),ncol=2) X=data.frame(X) system.time(apply(X,2,mean)) user system elapsed 0.573 0.394 0.965 system.time(lapply(X,mean)) user system elapsed 0.072 0.049 0.121
To perform the same operation, the lapply
function was nearly 8 times faster than the apply
function. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.
To find out more about any of these functions or datasets use the help:
?apply ?lapply ?head ?cars
RStudio = Awesome
RStudio is a IDE for R that makes it look (and work) more like Matlab. It’s a bit nicer and more user friendly than the standard R GUI application. In particular, it makes it easier to import data from text files, write scripts, autocomplete functions, store graphics and access help files. It is in active development – the developers continue to add additional functionality really quickly.
Check out the screencast on the homepage for a 2 minute overview of the awesomeness that is RStudio. When teaching with R in my units, I skip the default R GUI and go straight to RStudio. This often results in students not fully realising the it is RStudio is an IDE for the R software, but that’s a small price to pay for the ease of use and good scripting habits students develop from working within a fully developed IDE.
RStudio is available across all major operating systems and it’s free!
Selling Statistics
This video clip does a great job of selling statistics to a general audience (despite being created SAS). It’s only 2:30 mins – a good length for adding some interest at the start of a first year statistics unit.
“Statisticians help researchers keep children healthy”
Statistics: saving children’s lives since 1850.
On the Numerical Accuracy of Spreadsheets
I came across this journal article a couple of years ago. It’s very accessible (not at all difficult to understand as journal articles go). It provides some interesting results that may help inform your decision about whether or not to use Excel and some background as to why we use dedicated statistical/computational software such as Matlab/Scilab/R.
The failings of Excel might seem like they only occur in extreme cases, but it is the way Excel handles the errors that is most concerning. In many of the examples listed in the article, it will return an incorrect value rather than admit that it doesn’t know the answer to that level of precision. I.e. when beyond the ability of the function, Excel should return NAs.
The last paragraph:
“Finally, as a rule of the thumb, every user should be aware that spreadsheets have serious limitations. Other platforms are advisable, being currently R the most dependable FLOSS (Free/Libre Open Source Software, see Almiron et al. 2009).”
Almiron, M. G., Lopes, B., Oliveira, A. L. C., Medeiros, A. C., and Frery, A. C. (2010). On the numerical accuracy of spreadsheets. Journal of Statistical Software, 34(4):1–29.
Law of large numbers
The first two minutes of this video for a graphical representation of the law of large numbers (the physicist’s center of gravity is the statistician’s mean). It’s worth a look if only for the awesome 80’s styling and soundtrack.
Getting started with LaTeX
LaTeX is a document markup language and document preparation system for the TeX typesetting program.
It’s what I and most math/stats/econometrics teachers use to create the lecture slides and tutorials. If you’re thinking about doing honours, or postgraduate study in these fields it’s probably not a bad idea to start playing with LaTeX as it is the default standard for typesetting theses, journal articles and the like.
To get started you need to download and install TeX on your computer.
Once you’ve got TeX installed you need software to use it. For Windows I think TeXWorks is the best (though there are others) for Mac TeXShop is definitely the best.
It is a WYMIWYG (What You Mean Is What You Get) system for typesetting, as opposed to Word which is WYSIWYG (What You See Is What You Get). This means that it has a really steep learning curve and can be quite strange at first. The best way to learn is by adapting what others have already done. Here are some resources to get you started:
A more user-friendly way to get started is with LyX. It’s more of a Word-like front end to LaTeX. You still need a TeX distribution installed before you can use that software.
Best of all, all of it’s free!
Theory and Practice
“In theory, theory and practice are the same. In practice, they are not.”
Albert Einstein
How to do stuff in r in two minutes or less
The site twotutorials has more than 90 two minute video clips on various uses for R. The clips minimise prerequisite knowledge and are super entertaining to watch. They begin with installing R then work through a range of topics that will be of interest to statisticians and data analysts including:
- how to read and write excel files with the xlsx package
- what does the
%in%
operator do and how does it differ from the double equals (==
) sign? - how to merge stuff together using the
merge
,cbind
,rbind
, andrbind.fill
functions - how to run your first regression
“It’s easy, free and fun”
Data Communication with Hans Rosling
Hans Rosling does great things with data communication. Here’s an example showing how we can expect the world’s population to level out at 10 billion — contrary to popular opinion regarding exponential population growth.