Garth

Home/Garth

About Garth

This author has not yet filled in any details.
So far Garth has created 31 blog entries.

Missing the FUN

During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

Of course, I just forgot to supply a function name a command like apply(). The apply() function is really useful way of extracting summary statistics from data sets. The basic format is

apply(array, margin, function, ...)
  • An array in R is a generic data type. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix…
  • The margin argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
  • The function can be any function that is built in or user defined (this is what I was missing when I got the error above).
  • The ... after the function refers to any other arguments that needs to be passed to the function being applied to the data.

The apply function internally uses a loop so if time and efficiency is very important one of the other apply functions such as

lapply(list, function, ...)

would be a better choice. The lapply command is designed for lists. It is particularly useful for data frames as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that lapply doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.

You can see the difference in the example below.  The data set cars is a data frame that comes with R.

mode(cars) # what data type is cars?
[1] "list"
head(cars) # output the first six entries in the data set
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
speed  dist
15.40 42.98
lapply(cars,mean) # same thing treating cars as a data frame (list)
$speed
[1] 15.4

$dist
[1] 42.98

To show how much faster lapply is than apply, consider the following simulation:

X = matrix(rnorm(10000000),ncol=2)
X=data.frame(X)
system.time(apply(X,2,mean))
   user  system elapsed 
  0.573   0.394   0.965 
system.time(lapply(X,mean))
   user  system elapsed 
  0.072   0.049   0.121 

To perform the same operation, the lapply function was nearly 8 times faster than the apply function. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.

To find out more about any of these functions or datasets use the help:

?apply
?lapply
?head
?cars
By |2013-08-28T11:59:40+00:00December 14th, 2012|R, Teaching|Comments Off on Missing the FUN

RStudio = Awesome

RStudio is a IDE for R that makes it look (and work) more like Matlab.  It’s a bit nicer and more user friendly than the standard R GUI application.  In particular, it makes it easier to import data from text files, write scrirstudio-windowspts, autocomplete functions, store graphics and access help files.  It is in active development – the developers continue to add additional functionality really quickly.

Check out the screencast on the homepage for a 2 minute overview of the awesomeness that is RStudio.  When teaching with R in my units, I skip the default R GUI and go straight to RStudio.  This often results in students not fully realising the it is RStudio is an IDE for the R software, but that’s a small price to pay for the ease of use and good scripting habits students develop from working within a fully developed IDE.

RStudio is available across all major operating systems and it’s free!

By |2016-10-15T05:47:50+00:00December 13th, 2012|R|Comments Off on RStudio = Awesome

Selling Statistics

This video clip does a great job of selling statistics to a general audience (despite being created SAS). It’s only 2:30 mins – a good length for adding some interest at the start of a first year statistics unit.

“Statisticians help researchers keep children healthy”

Statistics: saving children’s lives since 1850.

By |2016-10-15T05:47:50+00:00December 12th, 2012|Statistics, Teaching|Comments Off on Selling Statistics

On the Numerical Accuracy of Spreadsheets

I came across this journal article a couple of years ago.  It’s very accessible (not at all difficult to understand as journal articles go).  It provides some interesting results that may help inform your decision about whether or not to use Excel and some background as to why we use dedicated statistical/computational software such as Matlab/Scilab/R.

The failings of Excel might seem like they only occur in extreme cases, but it is the way Excel handles the errors that is most concerning.  In many of the examples listed in the article, it will return an incorrect value rather than admit that it doesn’t know the answer to that level of precision.  I.e. when beyond the ability of the function, Excel should return NAs.

The last paragraph:

“Finally, as a rule of the thumb, every user should be aware that spreadsheets have serious limitations. Other platforms are advisable, being currently R the most dependable FLOSS (Free/Libre Open Source Software, see Almiron et al. 2009).”

Almiron, M. G., Lopes, B., Oliveira, A. L. C., Medeiros, A. C., and Frery, A. C. (2010). On the numerical accuracy of spreadsheets. Journal of Statistical Software, 34(4):1–29.

By |2013-08-28T11:59:40+00:00December 11th, 2012|R, Statistics|Comments Off on On the Numerical Accuracy of Spreadsheets

Law of large numbers

The first two minutes of this video for a graphical representation of the law of large numbers (the physicist’s center of gravity is the statistician’s mean).  It’s worth a look if only for the awesome 80’s styling and soundtrack.

By |2016-10-15T05:47:51+00:00December 10th, 2012|Statistics, Teaching|Comments Off on Law of large numbers

Getting started with LaTeX

100px-LaTeX_logo.svg
LaTeX is a document markup language and document preparation system for the TeX typesetting program.

It’s what I and most math/stats/econometrics teachers use to create the lecture slides and tutorials.  If you’re thinking about doing honours, or postgraduate study in these fields it’s probably not a bad idea to start playing with LaTeX as it is the default standard for typesetting theses, journal articles and the like.

To get started you need to download and install TeX on your computer.

Once you’ve got TeX installed you need software to use it.  For Windows I think TeXWorks is the best (though there are others) for Mac TeXShop is definitely the best.

It is a WYMIWYG (What You Mean Is What You Get) system for typesetting, as opposed to Word which is WYSIWYG (What You See Is What You Get).  This means that it has a really steep learning curve and can be quite strange at first.  The best way to learn is by adapting what others have already done.  Here are some resources to get you started:

A more user-friendly way to get started is with LyX.  It’s more of a Word-like front end to LaTeX.  You still need a TeX distribution installed before you can use that software.

Best of all, all of it’s free!

By |2016-10-15T05:47:51+00:00December 9th, 2012|LaTeX|Comments Off on Getting started with LaTeX

Theory and Practice

“In theory, theory and practice are the same. In practice, they are not.”

Albert Einstein

By |2013-08-28T12:00:18+00:00December 7th, 2012|Statistics|Comments Off on Theory and Practice

How to do stuff in r in two minutes or less

The site twotutorials has more than 90 two minute video clips on various uses for R.  The clips minimise prerequisite knowledge and are super entertaining to watch.  They begin with installing R then work through a range of topics that will be of interest to statisticians and data analysts including:

  • how to read and write excel files with the xlsx package
  • what does the %in% operator do and how does it differ from the double equals (==) sign?
  • how to merge stuff together using the merge, cbind, rbind, and rbind.fill functions
  • how to run your first regression

“It’s easy, free and fun”

By |2013-08-28T12:00:18+00:00November 28th, 2012|R|Comments Off on How to do stuff in r in two minutes or less

Data Communication with Hans Rosling

Hans Rosling does great things with data communication. Here’s an example showing how we can expect the world’s population to level out at 10 billion — contrary to popular opinion regarding exponential population growth.

By |2016-10-15T05:47:51+00:00November 27th, 2012|Statistics|Comments Off on Data Communication with Hans Rosling
Go to Top