How to: Delete intermediate data files while keeping final ones
Author: Henrik Bengtsson
Created on: 2010-05-17
The aroma framework uses the file system to store intermediate and final data sets. This way, data is kept in memory (RAM) only when needed, making it possible to process data sets of virtually any size. The only limitation is the available disk space. Another advantage is that if a data set has already been processed by an algorithm, then the already available results are quickly retrieved from the file system instead of being regenerated. This is especially handy when working with the data interactively.
After completing an analysis it is possible to delete intermediate data files while keeping the final results. The only important thing to keep in mind is to never delete your raw/original data files, which typically are located in the rawData/ directory (or similar). As long as you have the original data files (together with the annotation data files), you can always regenerate the results by running the same analysis script.
After deleting intermediate files, one needs to retrieve the final results explicitly instead of rerunning the analysis script; otherwise the analysis will be redone.
Example
Say you run CRMA v2 on a GenomeWideSNP_6 data set named 'HapMap270,6.0,CEU,testSet'. This can be done by:
library("aroma.affymetrix")
dataSet <- "HapMap270,6.0,CEU,testSet"
chipType <- "GenomeWideSNP_6,Full"
csR <- AffymetrixCelSet$byName(dataSet, chipType=chipType)
print(csR)
## AffymetrixCelSet:
## Name: HapMap270
## Tags: 6.0,CEU,testSet
## Path: C:/Users/hb/Documents/My Data/rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6
## Platform: Affymetrix
## Chip type: GenomeWideSNP_6,Full
## Number of arrays: 6
## Names: NA06985, NA06991, ..., NA07019
## Time period: 2007-03-06 12:13:04 -- 2007-03-06 19:17:16
## Total file size: 395.13MB
## RAM: 0.01MB
ds <- doCRMAv2(csR, verbose=-10)
print(ds)
## AromaUnitTotalCnBinarySet:
## Name: HapMap270
## Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
## Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
## Number of files: 6
## Names: NA06985, NA06991, ..., NA07019 [6]
## Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
## Total file size: 43.06 MB
## RAM: 0.00MB
The raw data is located in directory rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6/ and the final data is located in directory totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6/.
What is not shown are the intermediate data sets generated by CRMA v2 (to see the intermediate steps, see the corresponding vignette). Those are stored under probeData/ and plmData/ and can be identified by their fullnames (names & tags) and chip types. More precisely, for the data set processed here, the intermediate results are located in directories:
- probeData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY/GenomeWideSNP_6/
- probeData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY/GenomeWideSNP_6/
- plmData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6/
- plmData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6/
These can safely be deleted, while retaining the final results (and the raw data files). The data files in probeData/ are similar in size to the ones in rawData/, whereas the data files in plmData/ are significantly smaller. Thus, the intermediate data sets take up roughly 2-3 times as much disk space as the original data set.
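The deletion itself can be done from within R. The following is a minimal sketch using only base R, assuming the working directory is the analysis directory that contains probeData/ and plmData/. The helpers dirSizeMB() and deleteIntermediate() are hypothetical names introduced here for illustration; they are not part of the aroma framework.

```r
# Intermediate CRMA v2 directories from the list above, relative to the
# analysis working directory (no trailing slashes).
intermediate <- c(
  "probeData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY/GenomeWideSNP_6",
  "probeData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY/GenomeWideSNP_6",
  "plmData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6",
  "plmData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6"
)

# Hypothetical helper: total size (in MB) of all files below a directory.
dirSizeMB <- function(path) {
  files <- list.files(path, recursive=TRUE, full.names=TRUE)
  sum(file.size(files)) / 1024^2
}

# Hypothetical helper: delete the given directories, but only if every one
# of them is located under probeData/ or plmData/. Directories that do not
# exist are skipped. With dryRun=TRUE nothing is deleted.
deleteIntermediate <- function(paths, dryRun=TRUE) {
  roots <- sub("/.*$", "", paths)
  stopifnot(all(roots %in% c("probeData", "plmData")))
  paths <- paths[dir.exists(paths)]
  action <- if (dryRun) "Would delete" else "Deleting"
  for (path in paths) {
    message(sprintf("%s %s (%.2f MB)", action, path, dirSizeMB(path)))
    if (!dryRun) unlink(path, recursive=TRUE)
  }
  invisible(paths)
}

# Dry run first; rerun with dryRun=FALSE once the listing looks right.
deleteIntermediate(intermediate, dryRun=TRUE)
```

The check on the top-level directory is a deliberate safeguard: a path under rawData/ or totalAndFracBData/ causes an error before anything is removed. Note that this sketch removes the chip type subdirectories only, so the (then empty) parent directories remain.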
After deleting intermediate data sets we can no longer use doCRMAv2() to get the results, because then CRMA v2 will be redone. Instead, we retrieve the final results as:
dataSet <- "HapMap270,6.0,CEU,testSet"
tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY"
chipType <- "GenomeWideSNP_6"
ds <- AromaUnitTotalCnBinarySet$byName(dataSet, tags=tags, chipType=chipType)
print(ds)
## AromaUnitTotalCnBinarySet:
## Name: HapMap270
## Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
## Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
## Number of files: 6
## Names: NA06985, NA06991, ..., NA07019 [6]
## Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
## Total file size: 43.06 MB
## RAM: 0.00MB
Note how we have simply appended the new tags to the original data set's full name, and how we now specify that the data set to be retrieved is of class AromaUnitTotalCnBinarySet.
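To make the naming explicit, the sketch below composes the full name from the data set name and the appended tags, and checks that the corresponding directory exists before attempting to load it. It uses base R only and assumes the directory layout shown in the output above; the variable names fullname, name, allTags and path are chosen here for illustration.

```r
dataSet <- "HapMap270,6.0,CEU,testSet"
tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY"
chipType <- "GenomeWideSNP_6"

# Full name = original data set full name + the tags added by CRMA v2
fullname <- paste(dataSet, tags, sep=",")

# The name is the part before the first comma; the remaining parts are tags
parts <- strsplit(fullname, split=",", fixed=TRUE)[[1]]
name <- parts[1]
allTags <- paste(parts[-1], collapse=",")

# Directory where the final results are expected to be located
path <- file.path("totalAndFracBData", fullname, chipType)
if (dir.exists(path)) {
  message("Final results found in: ", path)
} else {
  message("No final results in: ", path, " (the analysis has to be rerun)")
}
```

Here name and allTags correspond to the "Name" and "Tags" fields in the printed output, and path corresponds to the "Path (to the first file)" field.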