R

Getting data into R

Note: This first appeared in Biometric Bulletin Volume 33, Issue 3, Software Corner.

One of the biggest hurdles that students and novice R users have is importing data into R. A number of recent packages have made this process easier and, importantly in the age of big data, faster. Most of the packages outlined below are part of Hadley Wickham’s tidyverse and owe their speed to calling C or C++ libraries from R.

The readr package

The readr package provides new functions for importing tabular data into R. Specifically, the functions read_table(), read_csv(), read_delim() are intended as fast (around 10 times faster) replacements for the base R read.table(), read.csv(), read.delim(). The readr functions do not convert strings to factors by default, are able to parse dates and times and can automatically determine the data types in each column (it does this by parsing the first 1000 rows). You can even import compressed files and they will be automatically decompressed and read into R. There is also file writing functionality, with write_csv(), write_tsv() and write_delim(). If you want even more speediness in your data importing, you might also consider the fread() function from the data.table package.

The readxl package

It used to be that the most reliable way to get data from Excel into R was to save it as a tab (or comma) delimited text file. While there are alternatives, such as the xlsx, XLConnect and gdata packages, that have other features (e.g. the ability to write excel files), if all you need is an easy way to import tabular data from xls and xlsx formats files, then the readxl package is for you. Importantly it has no external dependencies, so is very straightforward to install and use on all platforms. The syntax is very similar to the readr package functions, for example you specify the file name and the sheet of interest, read_excel("spreadsheet.xlsx", sheet = "data")

The haven package

The haven package provides functions for importing from SAS, SPSS and Stata file formats, read_sas(), read_sav()and read_dta(). This functionality is similar to that available in the base R foreign package but is often faster, can read SAS7BDAT files and formats, works with Stata 14 and 14 files. Following is the code to read SAS7BDAT.

install.packages("haven")
library(haven)
dat = read_sas("path to file", "path to formats catalog")

The returned object will be a data frame where SAS variable labels are attached as an attribute to each variable. When a variable is attached to a format in SAS and the formats are stored in a library, its path also needs to be supplied. Missing values in numeric variables should be seamlessly converted. Missing values in character variables are converted to the empty string. To convert empty strings to missing values, use zap_empty(), for example,

dat$x1 = zap_empty(dat$x1)

SAS, Stata and SPSS all have the notion of a “labelled”” variable. These are similar to categorical factor variables in R, but integer, numeric and character vectors can be labelled and not every value must be associated with a label. To turn a labelled variable into a standard factor R variable use the as_factor() function,

dat$facvar = as_factor(dat$facvar)

The haven package is under active development and becoming increasingly robust. If you have difficulties loading a file, try using the development version on GitHub:

devtools::install_github("hadley/haven")

For example, consider the National Youth Tobacco Survey (NYTS) from the CDC website. After downloading the files (and installing the development version from GitHub) the data can be imported into R using

devtools::install_github("hadley/haven")
require(haven)
x = read_sas("nyts2014_dataset.sas7bdat","nyts2014_formats.sas7bcat")
# convert qn1 to a factor:
x$qn1 = as_factor(x$qn1)
View(x)

The rio package

The rio package describes itself as “a Swiss army knife for data I/O”. It unifies many of the above methods by providing the wrapper function import() that takes as an input the path to a data file. It then uses the file extension to determine the file type and imports the data into R. The one function can be used to import standard text files, RData, JSON, Stata, SPSS, Excel, SAS, XML, Minitab and many more. There is an analogous export() function that allows users to similarly easily export data to various file types.

Using RStudio to import data

Recent versions of RStudio allow users to import various file types via a graphical user interface, which is perfect for novice users and experts alike as they get used to the new functions and customization options. Once you’ve clicked through the various options, it will output the required code at the console so that you can see exactly what was done to get the data in and edit the code as necessary for next time.

The team at RStudio have put together a webinar on getting data into R which is well worth watching.

Once you’ve got your data into R, you’ll probably need to restructure it in some way prior to analysis. To help with this, you may want to take a look at the tidyr package provides a suite of functions to get your data set in a standardized format, such that each observation is a row, each variable is a column and there are no data in the labels.

By Garth|2016-10-15T05:47:37+00:00October 15th, 2016|R, Statistics|2 Comments

Parallel computation in R

Note: This first appeared in Biometric Bulletin Volume 33, Issue 2, Software Corner.

Parallel computing refers to situations where calculations are carried out simultaneously, for example distributing the calculations across multiple cores of your computer’s processor, as opposed to having the calculations run sequentially on a single core. Parallel computing is particularly suitable for ‘single program, multiple data’ problems, for example in simulations and bootstrapping.

Parallel computation in R has come a long way over the last 10 years. If you tried to parallelise your R code a few years ago, you probably worked with the architecture specific snow (Windows) or multicore (Unix-like) packages. Since 2011 R has supported parallel computation as part of the base distribution with introduction of the parallel package (R version 2.14.0 released in October 2011). The parallel package builds on multicore and snow to provide a (mostly) platform agnostic method of leveraging multiple cores to speed up the computation ofembarrassingly parallel problems

This note discusses how incorporate parallel and associated packages, with little or no additional effort on the part of the statistical practitioner, to speed up data processing and statistical analysis pipelines.

Getting started

The parallel package is part of base R which means that it’s already installed and you can’t find it on CRAN. You can load it in the usual way library("parallel"). The first thing you’ll want to do is detectCores() which checks how many cores you have available wherever R is running (probably your laptop or desktop computer).

Parallel apply

The family of apply functions (apply, lapply, tapply, sapply, etc.) in R provide an extremely convenient way of applying a function to the margins of an array, matrix, list, etc. If you’re using apply on a data frame, e.g. apply(X,2,median) to compute the median of the columns of X, you should consider using lapply instead because it’s much faster, e.g. lapply(X,median) which computes the median of each variable in the data frame X (there’s no need for a margin argument).

The apply functions operate serially, i.e. they calculate each value in sequence and are constrained to use only one of your computer’s cores. If you’re on a Unix-like system, the mclapply function is an easy way to distribute the computation across the available cores. To use mclapply, first code your calculations as a function that can be called with lapply. Make sure it works serially using lapply and then use mclapply to perform the computations in parallel.

library("parallel")
X <- data.frame(matrix(rnorm(1e+07), ncol = 200))
mclapply(X, median)

In this simple example, there is only a relatively minor speed improvement. To get the most out of parallel processes, the functions to be run in parallel should be non-trivial. There is an overhead associated with forking the process, so it is possible to slow your code down with mclapply if the time taken to send the tasks out to various cores takes longer than performing the task serially.

Windows can’t “fork” processes in the same way that Unix-like systems do. This means that you won’t see any improvements in speed when using mclapply on a Windows machine. On the plus side, your code won’t break – it will work as if you were simply using lapply. Microsoft has announced plans to incorporate a Ubuntu image into future releases of Windows 10 through a new infrastructure they’re calling “Windows Subsystem for Linux”. This means you may soon be able to run R in a native Unix-like environment (which supports forking and hencemclapply) on Windows 10 machines.

Parallel loops

An alternative to mclapply is the foreach function which is a little more involved, but works on Windows and Unix-like systems, and allows you to use a loop structure rather than an apply structure. To use foreach you need to register a “parallel backend”, for example using thedoParallel package. The doParallel package acts as an interface between foreach and the parallel package. A simple example of how this works is given below where we calculate a percentile bootstrap confidence interval for a least absolute deviations (LAD) regression parameter. In the code below, we resample a dataset 10,000 times and each time generate LAD regression coefficients.

library("quantreg") # for quantile regression function rq()
data(engel) # the data set we'll use
# help(engel) 
# plot(foodexp ~ income, data = engel)
# fit1 <- rq(foodexp ~ income, tau = 0.5, data = engel)
# abline(fit1)
library("foreach")
library("doParallel")
cl <- makeCluster(2) # create a cluster with 2 cores
registerDoParallel(cl) # register the cluster
res = foreach(i = 1:10000, 
              .combine = "rbind", 
              .packages = "quantreg") %dopar% {
  # generate a bootstrap sample              
  boot_dat <- engel[sample(1:nrow(engel), replace = TRUE), ]
  # fit the model
  fit1 <- rq(foodexp ~ income, tau = 0.5, data = boot_dat)
  # return the coefficients
  fit1$coef
}
stopCluster(cl) # shut down the cluster

In the foreach() function we’ve specified that the results should be combined using rbind (i.e. the rows will be bound together, where the i $^{th}$ row is the LAD regression coefficients from the i $^{th}$ bootstrap sample) and we’ve indicated that the quantreg package needs to be loaded on each of the processes. The output is a matrix, that we’ve called res, which consists of two columns and 10,000 rows. We obtain a 95% percentile bootstrap by extracting the appropriate quantiles:

resdf <- as.data.frame(res)
quantile(resdf$income, probs = c(0.025,0.975), type = 1)

##      2.5%     97.5% 
## 0.4704510 0.6125974

The for loop syntax is very similar to a regular for loop in R, except it starts with foreach and uses the %dopar% function. If you use %do% instead of %dopar% evaluates the loop sequentially where one CPU will run at 100% until the job is finished. In the above example we asked for two cores to be used, hence with %dopar% both cores will run at 100% until the job is done. This means that a %dopar% loop running on two cores will finish in roughly half the time that it would have taken to run on a single core. As an example, the code above with 100,000 replications using%do% takes 3.0 minutes on my computer whereas using %dopar% takes only 1.6 minutes. It’s not exactly two times faster because of the computational overheads involved in sending the tasks out to the different cores and collating the results. For more details see the doParalleland foreach vignettes.

Distributed computing

The function mclapply can only use the cores of one machine, i.e. jobs can’t be distributed over several nodes of compute cluster. One way to do this in R is to use the parLapply function which can utilise the Message Passing Interface (MPI) system. For further details see the Rmpi package. The doMPI package provides an MPI parallel backend for the foreach package.

Other considerations

Parallel random number generation: When bootstrapping or performing simulation studies, it is desirable for each instance of R to generate independent, reproducible pseudo-random numbers. If there is an object .Random.seed in your R workspace that is then shared with the worker cores, all your instances of R may inavertently run identical simulations using identical “random” numbers. Alternatively, if .Random.seed is not in the workspace, then you will have independent streams of random numbers but it will not be reproducible. The parallel package includes a random number generator designed to overcome these issues. It can be enabled using RNGkind("L'Ecuyer-CMRG"). See the documentation of the parallel package for details.

Multi-threaded linear algebra computation: R can also be compiled against multi-threaded linear algebra libraries (BLAS, LAPACK) which can speed up calculations. One of the easiest ways to do this is to install Microsoft R Open (formerly known as Revolution R Open), however doing so ties you to a CRAN snapshot taken at a fixed point in time. You can still install the most up-to-date version of a package by manually specifying the repos argument.

Bioconductor: If you work with Bioconductor packages, you should look into BiocParallel which provides modified versions of functions optimised for parallel evaluation, tailored for use with Bioconductor objects.

So the next time you think about leaving a simulation running on your computer for the weekend, consider using mclapply instead of lapply or rewriting that for loop as a foreach loop and have it run overnight instead, or send it out to a supercomputer and have the results within a couple of hours!

By Garth|2016-10-15T05:47:42+00:00July 5th, 2016|R, Statistics|1 Comment

Raijin

Just put together a few notes on how to use R on Raijin (the NCI supercomputer). The guide is for people within the Mathematical Sciences Institute (MSI) at ANU but the general process should hold for others.

http://rpubs.com/garthtarr/raijin

By Garth|2014-10-29T00:33:34+00:00October 29th, 2014|R|Comments Off

Dotchart with error bars in R

A colleague recently presented the results of a survey we designed. Instead of a table with means, standard deviations, t statistics and p values I thought a series of dotcharts with error bars would be more appropriate and convey the required information in a more engaging manner. It wasn’t as straight forward as I thought it should be to make such plots comparing the results of different surveys, so I’m sharing my code.

Some difficulties I ran into:
– reordering the categories on the y axis (default is alphabetical order, which is fine when they’re called “Survey 1”, “Survey 2” and “Survey 3” but not so good with more meaningful labels).
– changing the text size, lattice graphics have some peculiar syntax compared with base or ggplot2
– getting the bars (with tips) to work nicely
– adjusting the spacing so that the top and bottom surveys weren’t so close to the edge of the plot
– labelling the x axis nicely
– lattice graphics require the print function wrapper when called in a loop

mypanel.Dotplot <- function(x, y, ...) {
  panel.Dotplot(x,y,...)
  tips = attr(x, "other")
  panel.arrows(x0 = tips[,1], y0 = y, 
               x1 = tips[,2], y1 = y, 
               length = 0.05, unit = "native",
               angle = 90, code = 3,lwd=2,col="blue")
}
library(Hmisc)
dcfn = function(means,sds,n,title){
  data = data.frame(ID=factor(c("Survey 1","Survey 2","Survey 3")),
                   means=means,
                   stderrs=2*sds/sqrt(n))
  data$lower = data$means-data$stderrs 
  data$upper = data$means+data$stderrs
  Dotplot(data$ID ~ Cbind(data$means,data$lower,data$upper), 
          col="blue", pch=20, panel = mypanel.Dotplot, ylim=c(0.5,3.5),
          xlab=list("",cex=1.5),
          ylab=list("",cex=1.5),xlim=c(-0.5,10.5),
          cex=2,
          scales=list(y=list(cex=1.5, at=1:3, 
                             # for reordering the y labels:
                             labels=levels(data$ID)[ c(1,3,2) ])),
          x = list(cex=1.5,at=c(0,5,10),
                   labels=c("0nNo knowledge","5nModerate knowledge",
                            "10nFull knowledge")),
          main=list(title,cex=1.5))
}
n = c(107,45,54)
means = c(3.4,7.1,6.6)
sds = c(2.8,2.6,3.4)
png("dotchart.png",width=1200,height=260)
dcfn(means,sds,n=n,title="Question goes here")
dev.off()
# if using a loop you'll need to use the print function:
# print(dcfn(means[i,],sds[i,],n=n,title=titles[i]))

The result:

By Garth|2016-10-15T05:47:43+00:00October 19th, 2014|R|Comments Off

Quandl and R

logo_rgb_300
I haven’t taught econometrics for over a year now, but the next time I do, I’ll be using Quandl! Quandl is a repository of data: “when a user clicks on a dataset on Quandl, the Quandl engine goes to the original publisher of that data, retrieves the freshest version of that data, and presents it to the user.” There’s a lot of data online, but it’s really nice that they’ve aggregated so much here and made it extremely easy to access.

They have a nice R package that hooks into the Quandl API, allowing you to seamlessly (once you have your authentication token) import data direct from their servers – circumventing one of the major issues for new R users – importing and getting the data structured correctly.

They’ve provided an extremely brief “econometrics” tutorial and their own R cheat sheet.

By Garth|2016-10-15T05:47:43+00:00November 7th, 2013|R, Teaching|Comments Off

Missing the FUN

During my undergraduate (and now postgraduate) years, I often spent my evenings and weekends toiling over statistics assignments. I was always amused when R seemed to know and would sometimes return my favourite error, reminding me that I was missing the fun:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

Of course, I just forgot to supply a function name a command like apply(). The apply() function is really useful way of extracting summary statistics from data sets. The basic format is

apply(array, margin, function, ...)

An array in R is a generic data type. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix…
The margin argument is used to specify which margin we want to apply the function to. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows of the matrix) or 2 (apply the function to the columns of the matrix).
The function can be any function that is built in or user defined (this is what I was missing when I got the error above).
The ... after the function refers to any other arguments that needs to be passed to the function being applied to the data.

The apply function internally uses a loop so if time and efficiency is very important one of the other apply functions such as

lapply(list, function, ...)

would be a better choice. The lapply command is designed for lists. It is particularly useful for data frames as each data frame is considered a list and the variables in the data frame are the elements of the list. Note that lapply doesn’t have a margin argument as it simply applies the function to each of the variables in the data frame.

You can see the difference in the example below. The data set cars is a data frame that comes with R.

mode(cars) # what data type is cars?
[1] "list"
head(cars) # output the first six entries in the data set
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
apply(cars,2,mean) # calculate column means treating cars as a matrix (2D array)
speed  dist
15.40 42.98
lapply(cars,mean) # same thing treating cars as a data frame (list)
$speed
[1] 15.4

$dist
[1] 42.98

To show how much faster lapply is than apply, consider the following simulation:

X = matrix(rnorm(10000000),ncol=2)
X=data.frame(X)
system.time(apply(X,2,mean))
   user  system elapsed 
  0.573   0.394   0.965 
system.time(lapply(X,mean))
   user  system elapsed 
  0.072   0.049   0.121

To perform the same operation, the lapply function was nearly 8 times faster than the apply function. You need a reasonably large data set for this to make a noticeable difference, but it’s worth keeping in mind regardless.

To find out more about any of these functions or datasets use the help:

?apply
?lapply
?head
?cars

By Garth|2013-08-28T11:59:40+00:00December 14th, 2012|R, Teaching|Comments Off

RStudio = Awesome

RStudio is a IDE for R that makes it look (and work) more like Matlab. It’s a bit nicer and more user friendly than the standard R GUI application. In particular, it makes it easier to import data from text files, write scripts, autocomplete functions, store graphics and access help files. It is in active development – the developers continue to add additional functionality really quickly.

Check out the screencast on the homepage for a 2 minute overview of the awesomeness that is RStudio. When teaching with R in my units, I skip the default R GUI and go straight to RStudio. This often results in students not fully realising the it is RStudio is an IDE for the R software, but that’s a small price to pay for the ease of use and good scripting habits students develop from working within a fully developed IDE.

RStudio is available across all major operating systems and it’s free!

By Garth|2016-10-15T05:47:50+00:00December 13th, 2012|R|Comments Off

On the Numerical Accuracy of Spreadsheets

I came across this journal article a couple of years ago. It’s very accessible (not at all difficult to understand as journal articles go). It provides some interesting results that may help inform your decision about whether or not to use Excel and some background as to why we use dedicated statistical/computational software such as Matlab/Scilab/R.

The failings of Excel might seem like they only occur in extreme cases, but it is the way Excel handles the errors that is most concerning. In many of the examples listed in the article, it will return an incorrect value rather than admit that it doesn’t know the answer to that level of precision. I.e. when beyond the ability of the function, Excel should return NAs.

The last paragraph:

“Finally, as a rule of the thumb, every user should be aware that spreadsheets have serious limitations. Other platforms are advisable, being currently R the most dependable FLOSS (Free/Libre Open Source Software, see Almiron et al. 2009).”

Almiron, M. G., Lopes, B., Oliveira, A. L. C., Medeiros, A. C., and Frery, A. C. (2010). On the numerical accuracy of spreadsheets. Journal of Statistical Software, 34(4):1–29.

By Garth|2013-08-28T11:59:40+00:00December 11th, 2012|R, Statistics|Comments Off

How to do stuff in r in two minutes or less

The site twotutorials has more than 90 two minute video clips on various uses for R. The clips minimise prerequisite knowledge and are super entertaining to watch. They begin with installing R then work through a range of topics that will be of interest to statisticians and data analysts including:

how to read and write excel files with the xlsx package
what does the %in% operator do and how does it differ from the double equals (==) sign?
how to merge stuff together using the merge, cbind, rbind, and rbind.fill functions
how to run your first regression

“It’s easy, free and fun”

By Garth|2013-08-28T12:00:18+00:00November 28th, 2012|R|Comments Off