By default, R uses only one core for its computations. This can slow things down considerably when you run extensive analyses (e.g. bootstrapping over many groups).
However, you can set R up for parallel computing and boost speed considerably. Several R packages enable R to use multiple cores, but if you are new to the field, choosing the right one can be painful. Here, I show how a parallel R analysis can be set up in five easy steps.
The following code uses 'socket' mode, which works on all stand-alone operating systems. If you want to set it up on Windows, make sure your firewall does not block R. Your analysis must follow a split-apply-combine strategy; this simply means it must include some form of iteration, e.g. a model fitted for each region, species, etc.
Five steps to set up parallel R computation
1. Install and load packages into your workspace
doParallel and foreach register the cores and set up parallel computing; plyr provides the split-apply-combine functions that perform the analysis.
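A minimal sketch of this step, using the three packages named above:

```r
# install.packages(c("doParallel", "foreach", "plyr"))  # run once, if not yet installed

library(doParallel)  # parallel backend for foreach
library(foreach)     # looping construct used under the hood
library(plyr)        # split-apply-combine functions (e.g. dlply)
```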
2. Load your data
In my case, I am using Edgar Anderson's iris dataset. You might want to import your dataset via functions like read.table.
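For the iris example this is a single line; the read.table call below uses a hypothetical file name as a placeholder for your own data:

```r
data(iris)  # Edgar Anderson's iris data, shipped with base R
str(iris)   # 150 observations of 5 variables, 3 species

# For your own data, something like:
# mydata <- read.table("mydata.txt", header = TRUE)
```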
3. Specify the number of clusters to be used
detectCores() is not obligatory; it checks how many cores your PC/notebook/server has (in case you don't know). makeCluster() and registerDoParallel() set up the cores (also called registering). I am using six out of eight cores, keeping two free for my remaining applications, such as an e-mail client.
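A sketch of this step; the six-of-eight split follows the text, so adjust the number to your own hardware:

```r
library(doParallel)

detectCores()         # how many cores are available (not obligatory)
cl <- makeCluster(6)  # start six worker processes ('socket' cluster)
registerDoParallel(cl)
getDoParWorkers()     # confirms how many workers are registered
```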
4. Start your analysis
Here, I am using the iris dataset as input to produce a list with model parameters for each species. Via the .paropts argument, you pass all necessary data (and packages) into the workspace of each core. In my case, I pass only the iris dataset; no special packages are needed. Wrapping the call in proc.time() records how long the process runs (not obligatory).
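A sketch of the analysis step with plyr's dlply. The model formula is my own illustrative choice, not prescribed by the text, and the setup lines repeat steps 1 and 3 so the snippet runs on its own (with two cores for brevity):

```r
library(doParallel)
library(plyr)

cl <- makeCluster(2)          # cluster setup from step 3
registerDoParallel(cl)

ptm <- proc.time()            # start the timer (not obligatory)
models <- dlply(iris, .(Species), function(d) {
  lm(Sepal.Length ~ Petal.Length, data = d)   # one model per species
}, .parallel = TRUE,
   .paropts = list(.export = "iris"))          # pass needed objects to each core
proc.time() - ptm             # elapsed time of the parallel run

ldply(models, coef)           # combine: one row of coefficients per species
```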
5. Unregister your clusters
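Assuming the cluster object `cl` from step 3 (recreated here so the snippet is self-contained):

```r
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)   # (from step 3)

stopCluster(cl)   # shut down the worker processes
registerDoSEQ()   # fall back to sequential execution so foreach keeps working
```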
I am a plant ecologist and post-doctoral fellow at Masaryk University.