How to: Process data in parallel
Author: Henrik Bengtsson
Created on: 2016-01-11
Last updated: 2017-06-10
Parallel processing has been supported in Aroma since January 2016, with the release of aroma.affymetrix 3.0.0. The default is to process data sequentially (synchronously), but with a single change of setting, it is possible to process data in parallel (asynchronously) on the current machine or on a cluster of compute nodes. The choice between synchronous and asynchronous processing is handled automatically by the future package.
As shown below, the future::plan() function can be used to control how data is processed. I suggest that you add this to your ~/.Rprofile file, or to a project-specific one, ./.Rprofile, in the working directory. This way you don't have to edit your scripts, and they should therefore be able to run anywhere regardless of computational resources.
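As a minimal sketch of what such a startup file might contain, the following sets the plan only when the future package is installed. The requireNamespace() guard is my own addition for illustration, not something Aroma requires.

```r
# Sketch of a ./.Rprofile (or ~/.Rprofile) entry.
# The requireNamespace() guard is an assumption of this example: it lets
# R start cleanly on machines where the 'future' package is not installed.
if (requireNamespace("future", quietly = TRUE)) {
  # Replace "sequential" with the plan that suits the machine at hand.
  future::plan("sequential")
}
```

Because the setting lives in the startup file, the analysis scripts themselves contain no reference to how the data is processed.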
Non-parallel processing
The default is single-core processing via sequential futures. This can be set explicitly as:
future::plan("sequential")
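To check that this setting has taken effect, a small sketch like the one below can be used. It is not Aroma-specific: it creates a single future with the future package's own future() and value() functions and compares process IDs.

```r
library(future)

plan("sequential")

# With sequential futures, the expression is evaluated in the current
# R process, so the process ID seen by the future is that of the main session.
f <- future(Sys.getpid())
same_process <- (value(f) == Sys.getpid())
print(same_process)
```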
Multiprocess processing
To analyze data in parallel using multiple processes on the current machine, make sure to call the following first:
future::plan("multiprocess")
That's it! After this, methods in Aroma that support parallel processing will automatically process the data in parallel.
If forking is supported, the above will process data using multiple forked R processes ("multicore"); otherwise, for instance on Microsoft Windows, it will process the data using multiple background R processes ("multisession").
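The effect can be illustrated outside of Aroma with a few futures that report their process IDs. This is only a sketch: it requests "multisession" explicitly, with two workers (both choices are assumptions of the example), because that backend works on all platforms and because later releases of the future package have, as far as I am aware, deprecated the "multiprocess" alias in favor of naming "multisession" or "multicore" directly.

```r
library(future)

# Two background R sessions; the number of workers is arbitrary here.
plan("multisession", workers = 2)

# Launch one future per task; each reports the ID of the process that ran it.
fs <- lapply(1:2, function(i) future(Sys.getpid()))
pids <- vapply(fs, value, integer(1))

# Check that none of the tasks ran in the main R session.
print(all(pids != Sys.getpid()))

# Return to the default, which also shuts down the background sessions.
plan("sequential")
```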
The number of parallel processes utilized is given by future::availableCores(). This function looks at a set of commonly used R options and system environment variables to infer the number of cores available / assigned to the R session. If no such settings are available, it will fall back to the total number of cores available on the machine as reported by parallel::detectCores(). The easiest way to control these settings is to use options(mc.cores = n). See help("availableCores", package = "future") for more details.
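The following sketch shows one way to inspect what availableCores() reports after setting the option. The value 2 is just an example, and the number returned may differ from it, since other settings on the system (and the version of the future package) also enter into the result.

```r
# Request two cores for this R session (the value 2 is just an example).
options(mc.cores = 2)

# The number of cores the future package will use for parallel plans.
n <- future::availableCores()
print(n)

# All settings that were consulted, one named element per source.
print(future::availableCores(which = "all"))
```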
Ad hoc cluster processing
To process data using multiple R sessions running on different machines, use something like:
future::plan("cluster", workers = c("n1", "n4", "n4", "n6", "n7"))
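A host name that is listed more than once gets one worker per entry, as far as I know. To try the cluster mechanism without access to several machines, a sketch such as the following can be run locally. Using "localhost" as the worker name is an assumption of this example; it simply starts the workers on the current machine.

```r
library(future)

# Two workers on the local machine stand in for remote hosts in this sketch.
plan("cluster", workers = c("localhost", "localhost"))

# Ask a worker which machine it is running on.
f <- future(Sys.info()[["nodename"]])
host <- value(f)
print(host)

# Return to the default, which also shuts down the workers.
plan("sequential")
```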
Job scheduler processing
To process data on compute clusters via job schedulers such as Torque/PBS, install the future.batchtools package and specify:
future::plan(future.batchtools::batchtools_torque)
There are similar settings for other job schedulers, e.g. Slurm and SGE. For full details on how to configure batchtools, please see the future.batchtools vignette.
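Before submitting to a real scheduler, it can help to try the same mechanism without one. The sketch below assumes that future.batchtools is installed and uses its batchtools_local backend, which, to my understanding, runs each future as a batchtools job on the current machine. The guard is only there so that the example is skipped where the package is missing.

```r
library(future)

if (requireNamespace("future.batchtools", quietly = TRUE)) {
  # Run futures as local batchtools jobs; no job scheduler is needed for this.
  plan(future.batchtools::batchtools_local)

  f <- future(Sys.getpid())
  print(value(f))

  # Return to the default.
  plan("sequential")
}
```

If this works, swapping batchtools_local for batchtools_torque should be the only change needed in the script, provided that batchtools has been configured for the scheduler.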