How to: Process data in parallel

Author: Henrik Bengtsson
Created on: 2016-01-11
Last updated: 2017-06-10

Parallel processing has been supported in Aroma since January 2016, with the release of aroma.affymetrix 3.0.0. The default is to process data sequentially (synchronously), but with a single change of setting it is possible to process data in parallel (asynchronously) on the current machine or on a cluster of compute nodes. The mechanism for synchronous/asynchronous processing is handled automatically by the future package.

As shown below, the future::plan() function can be used to control how data is processed. I suggest that you add the call to your ~/.Rprofile file, or to a project-specific ./.Rprofile in the working directory. This way you don't have to edit your scripts, so they should run anywhere regardless of the computational resources available.
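A minimal sketch of what such a startup file could contain, assuming the future package is installed. The requireNamespace() guard is an extra precaution rather than something Aroma requires; it only keeps R sessions starting cleanly on machines where future is missing:

```r
## ~/.Rprofile (or ./.Rprofile in the project's working directory)
## Select how futures are resolved; swap in another backend as needed.
if (requireNamespace("future", quietly = TRUE)) {
  future::plan("sequential")
}
```

Changing the backend then means editing this one line, not the analysis scripts.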

Non-parallel processing

The default is single-core processing via sequential futures. This can be set explicitly as:

future::plan("sequential")

Multiprocess processing

To analyze data in parallel using multiple processes on the current machine, make sure to call the following first:

future::plan("multiprocess")

That's it! After this, methods in Aroma that support parallel processing will automatically process the data in parallel.

If supported, the above will process data using multiple forked R processes ("multicore"); otherwise, for instance on Microsoft Windows, it will process the data using multiple background R processes ("multisession").
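To find out which of the two variants applies on a given machine, something like the following could be used. It is only a sketch and assumes a version of the future package that exports supportsMulticore(); the variable name is illustrative:

```r
library(future)

## TRUE where forked processing ("multicore") is supported, FALSE
## otherwise, e.g. on Microsoft Windows
forked <- supportsMulticore()
message("Forked processing supported: ", forked)
```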

The number of parallel processes used is given by future::availableCores(). This function looks at a set of commonly used R options and system environment variables to infer the number of cores available / assigned to the R session. If no such settings are available, it will fall back to the total number of cores available on the machine as reported by parallel::detectCores(). The easiest way to control this is to use options(mc.cores = n). See help("availableCores", package = "future") for more details.
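As a small illustration, the following sets the option and then prints what is inferred for the session next to the machine total. The value 2 and the variable names are arbitrary choices for this sketch, and the number reported for the session may depend on the version of future and on other settings in your environment:

```r
## Request two cores for this R session (2 is an arbitrary example value)
options(mc.cores = 2L)

## What future infers for this session, and what the machine reports in total
n_session <- future::availableCores()
n_machine <- parallel::detectCores()
message("Cores for this session: ", n_session,
        "; cores on machine: ", n_machine)
```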

Ad hoc cluster processing

To process data using multiple R sessions running on different machines, use something like:

future::plan("cluster", workers = c("n1", "n4", "n4", "n6", "n7"))
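Each entry in workers starts one R session, so a hostname that is listed twice, like "n4" above, gives two workers on that machine. To try this out without remote machines, a sketch along these lines could be used; here "localhost" stands in for real hostnames and is an assumption of the example, not part of the setup above:

```r
library(future)

## Two workers on the local machine; "localhost" stands in for real
## hostnames such as "n1" so that no SSH setup is needed
plan(cluster, workers = c("localhost", "localhost"))

## One R session per entry in 'workers'
n <- nbrOfWorkers()
message("Number of workers: ", n)

## Shut down the workers by returning to sequential processing
plan(sequential)
```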

Job scheduler processing

To process data on compute clusters via job schedulers such as Torque/PBS, install the future.batchtools package and specify:

future::plan(future.batchtools::batchtools_torque)

There are similar settings for other job schedulers, e.g. Slurm and SGE. For full details on how to configure batchtools, please see the future.batchtools vignette.
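For instance, assuming future.batchtools is installed and a batchtools template for the scheduler has been set up as described in that vignette, the corresponding settings for Slurm and SGE would look something like:

```r
## Pick the line that matches the job scheduler on your cluster
future::plan(future.batchtools::batchtools_slurm)   # Slurm
## future::plan(future.batchtools::batchtools_sge)  # SGE
```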