The Aroma Framework: How data files and data sets are located
Author: Henrik Bengtsson
Created on: 2011-03-07
Last updated on: 2011-04-08
This document gives detailed information on how the aroma framework locates data files and data sets. It also shows how to best share data on as hared file system between users.
FULLNAME, NAME AND TAGS:
A fullname consists of a name and optional comma-separated tags, e.g. the fullname 'HapMap270,6.0,CEU,testSet' has name 'HapMap270' and tags '6.0,CEU,testSet'. By this convention, neither names nor tags can themselves contain commas.
The default in the aroma framework is that fullnames (and therefore names and tags) are inferred from the filename of a file, or the data set directory name. For instance, a raw CEL file with filename 'NA06985,XX.CEL' hasfullname 'NA06985,XX' with name 'NA06985' and the single tag 'XX'. It's file extension is *.CEL. This file may be part of raw data set 'HapMap270,6.0,CEU,testSet' (fullname) with name 'HapMap270' and the three tags '6.0', 'CEU', and 'testSet' (short '6.0,CEU,testSet').
THE CURRENT DIRECTORY:
The current directory, aka the working directory, is the
directory given by
getwd(). This is often the directory in which R and
aroma was started.
A root directory is a aroma-specific directory that is available in the current directory and in turn contains subdirectories. Root directories can be recognized by their names ending with *Data/, e.g. annotationData/ and rawData/.
A main root directory is a root directory whose fullname has no tags. Whenever aroma writes results to file, it always does so under a main root directory, which is why there also must be sufficient write permissions in addition to read permissions.
A sibling root directory is a root directory whose fullname contains tags (Footnote #1), e.g. annotationData,shared/ and rawData,shared/. Whenever aroma is locating a data file or a data set, it always searches the main root directory before the sibling root directories.
Sibling root directories are useful for sharing data sets and data files in common location on the file system without having to give everyone write permissions to it. A sibling root directory is typically setup as a file-system link (as all root directories always in the current directory) pointing to a location elsewhere on the file system. Using sibling root directories this way saves overall disk space, minimizes redundancy, further simplifies structuring of data files, and, in support for reproducible research, minimizes the amount of reprocessing required by a group of users interested in the same data sets.
The aroma framework will never write to a sibling root directory (Footnote #2), which also means that it is sufficient that there are read permissions. Indeed, we suggest that write permissions are not shared for sibling root directories. Although main root directories in theory could be shared, they should always be treated as if they are private to the user. This minimizes the risk for race conditions where two users try to write to or update the same file at the same time (Footnote #3).
Whenever aroma needs to locate an annotation data file, it always searches the main root directory annotationData/ first. If the file cannot be found there, it will then search any available sibling root directories in (lexicographic) order, e.g. annotationData,lab/ and annotationData,shared/. If it fails to locate a file, it is often the case that an exception is thrown and the analysis is interrupted with an informative error message.
Annotation data files are identified by their type, their chip types and optionally by additional tags. The type is implicit from the class or method used, whereas the name and the tags are commonly arguments specified by the user, although in some cases also those are inferred indirectly.
# Explicit set up annotation data file
cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6", tags="Full")
ugp <- AromaUgpFile$byChipType("GenomeWideSNP_6", tags="Full")
# Implicit set up annotation data file, where the type is
# inferred by the method name and the name and the tags are
# inferred from the input object.
ugp <- getAromaUgpFile(cdf)
In the first case, aroma searches for an Affymetrix CDF file located in annotationData/chipTypes/GenomeWideSNP_6/,where annotationData/ means that it first searches main root directory annotationData/ and then any sibling directories annotationData,<tags>/. In the second case, aroma searches for an Aroma UGP file located in annotationData/chipTypes/GenomeWideSNP_6/. In the third case, aroma also searches for an Aroma UGP file located in annotationData/chipTypes/GenomeWideSNP_6/, where the chip type is inferred from the CDF via getChipType(cdf).
The convention in the aroma framework is that whenever annotation data files are created, they are written to the main root directory, that is, under annotationData/, e.g. annotationData/chipTypes/GenomeWideSNP_6/GenomeWideSNP_6,Full,monocell.CDF. It will never write to a sibling root directory (Footnote #2).
Raw data sets
Whenever aroma tries to locate a raw data set, it always searches the main root directory rawData/ first, and, if not found there, then any available sibling root directories, e.g. rawData,shared/.
Data sets are located by their type, their names and optional tags, and their chip types. The type is implicit from the class, whereas the name and the tags as well as the chip type are commonly specified explicitly via arguments, although in some cases also those are also inferred indirectly.
# Explicit specification of name, tags and chip type
csR <- AffymetrixCelSet$byName("HapMap270,6.0,CEU,testSet",
# Implicit specification of chip type (via the CDF)
csR <- AffymetrixCelSet$byName("HapMap270,6.0,CEU,testSet", cdf=cdf)
In both these cases, aroma searches for Affymetrix CEL files located in rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6/, where rawData/ means that it first searches main root directory rawData/ and then any sibling directories rawData,<tags>/.
Intermediate and final data sets
Currently sibling root directories are not supported for intermediate and final data sets and data files. This means that all such data needs to be located under main root directories, e.g. probeData/, plmData/ etc. Following the aroma-framework convention above, all intermediate and final data sets and data files are stored under a main root directories.
# Assume that the following raw data set is located in
csR <- AffymetrixCelSet$byName("HapMap270,6.0,CEU,testSet",
# If the average array signals are not available, then they are
# calculated and stored in a single file in
cfR <- getAverageFile(csR)
Note that, if the above two steps are repeated, the data set itself will still be located under rawData,shared/ (because the new data set directory under rawData/ is considered to be empty and hence ignored), and the average file will be located under the main root directory, although the data set itself is located under a sibling root directory. If it is believed that this average file is useful to others, then, given the correct file privileges, one could move the average file to the data set directory that contains the raw data. This way the same averaging will never have to be calculated again by any user who links to this data set directory.
Footnote #1: Sibling root directories are currently only possible for annotationData/ and rawData/.
Footnote #2: Currently, there is no protection in aroma against updating an existing data set/data file that is already located in a sibling root directory. This means that it is possible for a user to setup, for instance, an UGP file located under sibling root directory annotationData,shared/ and then update it via methods specific to this class of annotation files. Protection against such misusage can be obtained by making sure that the user, and hence the aroma framework, has no write privileges to sibling root directories.
Footnote #3: The aroma framework creates files atomically by first writing to a temporary file, which is then renamed. Writing files atomically lower the risk for race conditions, but still does not guarantee that there will not be any conflicts. For instance, on file system such as Unix NFS, it may take up to 30 seconds before a newly created file is visible to all computers on the file system leaving a short window for such race conditions to take place.