Note: This first appeared in Biometric Bulletin Volume 33, Issue 2, Software Corner.

 

Parallel computing refers to situations where calculations are carried out simultaneously, for example distributing the calculations across multiple cores of your computer’s processor, as opposed to having the calculations run sequentially on a single core. Parallel computing is particularly suitable for ‘single program, multiple data’ problems, for example in simulations and bootstrapping.

Parallel computation in R has come a long way over the last 10 years. If you tried to parallelise your R code a few years ago, you probably worked with the architecture-specific snow (Windows) or multicore (Unix-like) packages. Since 2011 R has supported parallel computation as part of the base distribution with the introduction of the parallel package (R version 2.14.0, released in October 2011). The parallel package builds on multicore and snow to provide a (mostly) platform-agnostic method of leveraging multiple cores to speed up the computation of embarrassingly parallel problems.

This note discusses how to incorporate parallel and associated packages, with little or no additional effort on the part of the statistical practitioner, to speed up data processing and statistical analysis pipelines.

Getting started

The parallel package is part of base R, which means it’s already installed (you won’t find it on CRAN). You can load it in the usual way with library("parallel"). The first thing you’ll want to do is run detectCores(), which reports how many cores are available wherever R is running (probably your laptop or desktop computer).
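For example (the number of cores reported will of course differ between machines):

library("parallel")
detectCores() # how many cores R can see on this machine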

Parallel apply

The family of apply functions (apply, lapply, tapply, sapply, etc.) in R provides an extremely convenient way of applying a function to the margins of an array, matrix, list, etc. If you’re using apply on a data frame, e.g. apply(X, 2, median) to compute the median of the columns of X, you should consider using lapply instead, e.g. lapply(X, median), which computes the median of each variable in the data frame X (there’s no need for a margin argument) and is typically much faster.
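As a toy illustration (X here is just a small made-up data frame):

X <- data.frame(a = rnorm(100), b = rnorm(100))
apply(X, 2, median)  # coerces the data frame to a matrix before computing column medians
lapply(X, median)    # works on the columns directly, no margin argument needed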

The apply functions operate serially, i.e. they calculate each value in sequence and are constrained to use only one of your computer’s cores. If you’re on a Unix-like system, the mclapply function is an easy way to distribute the computation across the available cores. To use mclapply, first code your calculations as a function that can be called with lapply. Make sure it works serially using lapply and then use mclapply to perform the computations in parallel.

library("parallel")
X <- data.frame(matrix(rnorm(1e+07), ncol = 200))
mclapply(X, median)

In this simple example, there is only a relatively minor speed improvement. To get the most out of parallel processing, the functions to be run in parallel should be non-trivial. There is an overhead associated with forking the process, so it is possible to slow your code down with mclapply if the time taken to send the tasks out to the various cores exceeds the time taken to perform the tasks serially.
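A rough way to check whether parallelisation is paying off is to time both versions. The sketch below assumes the data frame X from the example above and uses a deliberately slowed-down function to stand in for a non-trivial task:

system.time(lapply(1:1000, sqrt))                   # trivial task, serial
system.time(mclapply(1:1000, sqrt, mc.cores = 2))   # trivial task, parallel: forking overhead can dominate
slow_median <- function(x) { Sys.sleep(0.01); median(x) } # made-up, artificially slow function
system.time(lapply(X, slow_median))                 # non-trivial task, serial
system.time(mclapply(X, slow_median, mc.cores = 2)) # non-trivial task, parallel should win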

Windows can’t “fork” processes in the same way that Unix-like systems do. This means that you won’t see any improvement in speed when using mclapply on a Windows machine. On the plus side, your code won’t break: it will work as if you were simply using lapply. Microsoft has announced plans to incorporate an Ubuntu image into future releases of Windows 10 through a new infrastructure they’re calling “Windows Subsystem for Linux”. This means you may soon be able to run R in a native Unix-like environment (which supports forking and hence mclapply) on Windows 10 machines.

Parallel loops

An alternative to mclapply is the foreach function, which is a little more involved but works on both Windows and Unix-like systems, and allows you to use a loop structure rather than an apply structure. To use foreach you need to register a “parallel backend”, for example using the doParallel package. The doParallel package acts as an interface between foreach and the parallel package. A simple example of how this works is given below, where we calculate a percentile bootstrap confidence interval for a least absolute deviations (LAD) regression parameter. In the code below, we resample a dataset 10,000 times and each time generate LAD regression coefficients.

library("quantreg") # for quantile regression function rq()
data(engel) # the data set we'll use
# help(engel) 
# plot(foodexp ~ income, data = engel)
# fit1 <- rq(foodexp ~ income, tau = 0.5, data = engel)
# abline(fit1)
library("foreach")
library("doParallel")
cl <- makeCluster(2) # create a cluster with 2 cores
registerDoParallel(cl) # register the cluster
res <- foreach(i = 1:10000, 
               .combine = "rbind", 
               .packages = "quantreg") %dopar% {
  # generate a bootstrap sample
  boot_dat <- engel[sample(1:nrow(engel), replace = TRUE), ]
  # fit the model
  fit1 <- rq(foodexp ~ income, tau = 0.5, data = boot_dat)
  # return the coefficients
  fit1$coef
}
stopCluster(cl) # shut down the cluster

In the foreach() function we’ve specified that the results should be combined using rbind (i.e. the rows will be bound together, where the i-th row contains the LAD regression coefficients from the i-th bootstrap sample) and we’ve indicated that the quantreg package needs to be loaded on each of the processes. The output is a matrix, which we’ve called res, consisting of two columns and 10,000 rows. We obtain a 95% percentile bootstrap confidence interval by extracting the appropriate quantiles:

resdf <- as.data.frame(res)
quantile(resdf$income, probs = c(0.025,0.975), type = 1)
##      2.5%     97.5% 
## 0.4704510 0.6125974

The foreach loop syntax is very similar to a regular for loop in R, except that it starts with foreach and uses the %dopar% operator. If you use %do% instead of %dopar%, the loop is evaluated sequentially and one CPU will run at 100% until the job is finished. In the above example we asked for two cores to be used, hence with %dopar% both cores will run at 100% until the job is done. This means that a %dopar% loop running on two cores will finish in roughly half the time it would have taken on a single core. As an example, the code above with 100,000 replications takes 3.0 minutes on my computer using %do%, whereas using %dopar% it takes only 1.6 minutes. It’s not exactly twice as fast because of the computational overhead involved in sending the tasks out to the different cores and collating the results. For more details see the doParallel and foreach vignettes.
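If you want to measure the speed-up on your own machine, you can wrap each version of the loop in system.time(). The sketch below uses a placeholder loop body and assumes a parallel backend is still registered:

system.time(foreach(i = 1:200, .combine = "c") %do% mean(rnorm(1e5)))    # serial
system.time(foreach(i = 1:200, .combine = "c") %dopar% mean(rnorm(1e5))) # parallel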

Distributed computing

The function mclapply can only use the cores of one machine, i.e. jobs can’t be distributed over several nodes of a compute cluster. One way to do this in R is to use the parLapply function, which can utilise the Message Passing Interface (MPI) system. For further details see the Rmpi package. The doMPI package provides an MPI parallel backend for the foreach package.
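A minimal local sketch using parLapply is below; the same call pattern carries over to an MPI cluster created with makeCluster(type = "MPI") (which requires Rmpi), and X is assumed to be the data frame from the earlier example:

library("parallel")
cl <- makeCluster(2)            # a local cluster here; use type = "MPI" on a multi-node system
res <- parLapply(cl, X, median) # parallel equivalent of lapply(X, median)
stopCluster(cl)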

Other considerations

Parallel random number generation: When bootstrapping or performing simulation studies, it is desirable for each instance of R to generate independent, reproducible pseudo-random numbers. If there is an object .Random.seed in your R workspace that is then shared with the worker cores, all your instances of R may inadvertently run identical simulations using identical “random” numbers. Alternatively, if .Random.seed is not in the workspace, then you will have independent streams of random numbers but they will not be reproducible. The parallel package includes a random number generator designed to overcome these issues. It can be enabled using RNGkind("L'Ecuyer-CMRG"). See the documentation of the parallel package for details.
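A minimal sketch of reproducible parallel random numbers with mclapply is below (for clusters made with makeCluster(), clusterSetRNGStream() plays the same role):

RNGkind("L'Ecuyer-CMRG")
set.seed(2016)
res1 <- mclapply(1:3, function(i) rnorm(2))
set.seed(2016)
res2 <- mclapply(1:3, function(i) rnorm(2))
identical(res1, res2) # TRUE: each worker draws from its own reproducible stream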

Multi-threaded linear algebra computation: R can also be compiled against multi-threaded linear algebra libraries (BLAS, LAPACK), which can speed up matrix calculations. One of the easiest ways to do this is to install Microsoft R Open (formerly known as Revolution R Open); however, doing so ties you to a CRAN snapshot taken at a fixed point in time. You can still install the most up-to-date version of a package by manually specifying the repos argument to install.packages().
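For example (the package name and repository URL here are only illustrative):

install.packages("quantreg", repos = "https://cloud.r-project.org")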

Bioconductor: If you work with Bioconductor packages, you should look into BiocParallel which provides modified versions of functions optimised for parallel evaluation, tailored for use with Bioconductor objects.

So the next time you think about leaving a simulation running on your computer for the weekend, consider using mclapply instead of lapply or rewriting that for loop as a foreach loop and have it run overnight instead, or send it out to a supercomputer and have the results within a couple of hours!