How to: Process data in parallel
Author: Henrik Bengtsson
Created on: 2016-01-11
Last updated: 2017-06-10
Parallel processing has been supported in Aroma since January 2016, with the release of aroma.affymetrix 3.0.0. By default, data is processed sequentially (synchronously), but with a single change in settings, data can be processed in parallel (asynchronously) on the current machine or on a cluster of compute nodes. The mechanism for synchronous and asynchronous processing is handled automatically by the future package.
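To illustrate how the future package separates what is computed from how it is computed, here is a minimal sketch. It assumes the future package is installed. The computation inside future() is a hypothetical stand-in for real work and is not taken from Aroma.

```r
library(future)

# Choose how futures are resolved. Swapping this one line is the only
# change needed to move from sequential to parallel processing.
plan(sequential)

# Hypothetical stand-in for a unit of work
f <- future({
  sum(1:10)
})

# Block until the result is available, then retrieve it
v <- value(f)
print(v)
```

The code that creates and collects the future stays the same whichever plan is in effect.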
As shown below, the future::plan() function can be used to control how data is processed. I suggest that you add this to your ~/.Rprofile file, or to a project-specific one, ./.Rprofile, in the working directory. This way you don't have to edit your scripts, so they should be able to run anywhere regardless of the available computational resources.
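A startup file along these lines is one possible sketch. The requireNamespace() guard and the choice of "multisession" are illustrative assumptions, not a recommendation from the Aroma documentation.

```r
# Contents of ~/.Rprofile or ./.Rprofile (a sketch)

# Only set a plan if the future package is installed, so that R still
# starts cleanly on machines without it.
if (requireNamespace("future", quietly = TRUE)) {
  future::plan("multisession")
}
```

With this in place, the analysis scripts themselves need no reference to the processing backend.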
The default is single-core processing via sequential futures. This can be explicitly set as:
future::plan("sequential")
To analyze data in parallel using multiple processes on the current machine, make sure to call the following first:
future::plan("multiprocess")
That's it! After this, methods in Aroma that support parallel processing will automatically process the data in parallel.
If supported, the above will process data using multiple forked R processes ("multicore"); otherwise, for instance on Microsoft Windows, it will process the data using multiple background R processes ("multisession").
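The fallback just described can also be spelled out explicitly. This may be useful because more recent releases of future deprecate the combined "multiprocess" strategy in favor of naming "multicore" or "multisession" directly. The sketch below assumes the future and parallelly packages are installed; parallelly is a dependency of recent versions of future.

```r
library(future)

# Use forked R processes where the platform supports them,
# otherwise fall back to background R sessions.
if (parallelly::supportsMulticore()) {
  plan(multicore)
} else {
  plan(multisession)
}

# Number of workers registered under the chosen plan
n_workers <- nbrOfWorkers()
print(n_workers)

# Return to sequential processing when done
plan(sequential)
```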
The number of parallel processes utilized is given by future::availableCores(). This function looks at a set of commonly used R options and system environment variables to infer the number of cores available or assigned to the R session. If no such settings are available, it will fall back to the total number of cores available on the machine as reported by parallel::detectCores(). The easiest way to control these settings is to use options(mc.cores = n). See help("availableCores", package = "future") for more details.
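As a quick check of what these settings resolve to on a given machine, something like the following can be run. It assumes the future package is installed, and the value 2 is only an illustration.

```r
library(future)

# Tell R that this session has been assigned two cores (illustrative value)
options(mc.cores = 2L)

# Number of cores that future will consider available to this session
n <- availableCores()
print(n)

# Breakdown of what each option or environment variable reports
print(availableCores(which = "all"))
```

The exact numbers depend on the machine and on any environment variables set by the system or a job scheduler.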
Ad hoc cluster processing
To process data using multiple R sessions running on different machines, use something like:
future::plan("cluster", workers = c("n1", "n4", "n4", "n6", "n7"))
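Before pointing the plan at remote machines, the same setup can be tried locally. The sketch below assumes the future package is installed and uses two "localhost" workers as stand-ins for real hostnames. Remote workers are typically launched over SSH, so password-less SSH access to each node is assumed in that case.

```r
library(future)

# Two workers on the local machine. Replace with real hostnames
# (e.g. "n1", "n4") once the setup works.
plan(cluster, workers = c("localhost", "localhost"))

# Number of workers registered under the current plan
n <- nbrOfWorkers()
print(n)

# Switch back to sequential processing, which also shuts the workers down
plan(sequential)
```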
Job scheduler processing
To process data on compute clusters via job schedulers such as Torque/PBS, install the future.batchtools package and specify:
future::plan(future.batchtools::batchtools_torque)
There are similar settings for other job schedulers, e.g. Slurm and SGE. For full details on how to configure batchtools, please see the future.batchtools vignette.
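For instance, a Slurm setup might look like the sketch below. It assumes that the future.batchtools package is installed and that a Slurm template file is available where batchtools looks for it. Template contents and resource settings are site-specific and are not shown.

```r
# Submit each future as a job via the Slurm scheduler
future::plan(future.batchtools::batchtools_slurm)

# The SGE counterpart follows the same pattern:
# future::plan(future.batchtools::batchtools_sge)
```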