Replication: Reproducing SNPRMA in the oligo package
Author: Henrik Bengtsson
Created on: 2008-12-05
Last updated: 2008-12-06
This document show how to reproduce the chip effect estimates produced
by justSNPRMA()
as implemented in oligo.
One advantage of processing
data in aroma.affymetrix is that any number of arrays can be processed
in bounded memory (approx 1GB of RAM) on any platform. Moreover, since
the estimates are stored in persistent memory (on file), rerunning the
same analysis another day/in another R session will be instantaneous.
We show that it is possible to reproduce the estimates whether one normalize toward oligo's predefined HapMap reference (target quantiles) or not. This reference is actually located in the Platform Design (PD) package for the corresponding chip type/CDF. When using aroma.affymetrix and not normalizing toward this predefined reference, there is no need for this PD package (which can be huge or non-existing).
There are some minor discrepancies which are still to be explained.
Setup
We will use public HapMap CEL files for this example. Here we show how to do the comparison for Mapping50K_Hind240, Mapping250K_Nsp, and GenomeWideSNP_6 data.
annotationData/
chipTypes/
GenomeWideSNP_6/
GenomeWideSNP_6.cdf
Mapping250K_Nsp/
Mapping250K_Nsp.cdf
Mapping50K_Hind240/
Mapping50K_Hind240.cdf
rawData/
HapMap270,100K,CEU,testSet/
Mapping50K_Hind240/
NA06985,Hind,B5,3005533.CEL NA06994,Hind,A7,3005533.CEL
NA06991,Hind,B6,3005533.CEL NA07000,Hind,A8,3005533.CEL
NA06993,Hind,B4,4000092.CEL NA07019,Hind,A12,4000092.CEL
HapMap270,500K,CEU,testSet/
Mapping250K_Nsp/
NA06985.CEL NA06991.CEL NA06993.CEL
NA06994.CEL NA07000.CEL NA07019.CEL
HapMap270,6.0,CEU,testSet/
GenomeWideSNP_6/
NA06985.CEL NA06991.CEL NA06993.CEL
NA06994.CEL NA07000.CEL NA07019.CEL
Processing data
library("oligo")
library("aroma.affymetrix")
log <- Arguments$getVerbose(-8, timestamp=TRUE)
# Change this to FALSE if not normalizing toward the HapMap ref.
normalizeToHapmap <- TRUE
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Setup data set
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dataSet <- "HapMap270,100K,CEU,testSet"
chipType <- "Mapping50K_Hind240"
dataSet <- "HapMap270,500K,CEU,testSet"
chipType <- "Mapping250K_Nsp"
dataSet <- "HapMap270,6.0,CEU,testSet"
chipType <- "GenomeWideSNP_6"
cdf <- AffymetrixCdfFile$byChipType(chipType)
csR <- AffymetrixCelSet$byName(dataSet, cdf=cdf)
print(csR)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# SNPRMA according to aroma.affymetrix
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
eSet <- justSNPRMA(csR, normalizeToHapmap=normalizeToHapmap,
verbose=log)
print(eSet)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# SNPRMA according to oligo
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
library("oligo")
eSet2 <- justSNPRMA(getPathnames(csR),
normalizeToHapmap=normalizeToHapmap, verbose=TRUE)
print(eSet2)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Compare theta estimates
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
theta <- extractTheta(eSet)
theta2 <- extractTheta(eSet2)
# Assert that the dimensions are the same
stopifnot(identical(dim(theta), dim(theta2)))
# Assert that the ordering of units and arrays are the same
stopifnot(identical(dimnames(theta), dimnames(theta2)))
# Assert that the estimates are very similar
diff <- all.equal(theta, theta2)
print(diff)
tol <- 0.012
stopifnot(all.equal(theta, theta2, tolerance=tol))
# Plotting
array <- 1
nbrOfGroups <- dim(theta)[2]
subplots(nbrOfGroups, ncol=2)
par(mar=c(4.5,4,1,1))
lim <- c(0,16)
for (gg in 1:nbrOfGroups) {
x <- cbind(theta2[,gg,array], theta[,gg,array])
x <- log2(x)
plot(NA, xlim=lim, ylim=lim, xlab="oligo", ylab="aroma.affymetrix")
stext(side=3, pos=0, line=-1, sprintf("Group: #%d", gg), margin=0.5)
abline(a=0, b=1, lty=3, col="blue")
points(x, pch=".", cex=3)
}
array <- 1
lim <- c(0,1)
nbrOfStrands <- dim(theta)[2]/2
subplots(nbrOfStrands, nrow=1)
par(mar=c(4.5,4,1,1));
for (ss in 1:nbrOfStrands) {
iA <- 2*(ss-1) + 1
iB <- 2*(ss-1) + 2
freqB <- theta[,iB,array]/(theta[,iA,array]+theta[,iB,array])
freqB2 <- theta2[,iB,array]/(theta2[,iA,array]+theta2[,iB,array])
plot(NA, xlim=lim, ylim=lim, xlab="oligo", ylab="aroma.affymetrix")
stext(side=3, pos=0, line=-1, sprintf("Strand: #%d", ss), margin=0.5)
abline(a=0, b=1, lty=3, col="blue"); points(freqB2, freqB, pch=".", cex=3)
}
Results
Mapping50K_Hind240 - Toward HapMap reference
Figure: (thetaA+, thetaB+, thetaA-, thetaB-)
Figure: (freqB+, freqB-)
Mapping250K_Nsp - Toward HapMap reference
Figure: (thetaA+, thetaB+, thetaA-, thetaB-)
Figure: (freqB+, freqB-)
GenomeWideSNP_6 - Toward HapMap reference
Figure: (thetaA, thetaB)
Figure: (freqB)
Appendix
Session information
R version 2.8.0 Patched (2008-10-21 r46766)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MON
ETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United
States.1252
attached base packages:
[1] splines tools stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pd.mapping50k.hind240_0.4.1 pd.genomewidesnp.6_0.4.2
[3] oligoClasses_1.3.8 AnnotationDbi_1.3.12
[5] preprocessCore_1.3.4 RSQLite_0.7-0
[7] DBI_0.2-4 Biobase_2.1.7
[9] aroma.affymetrix.wrappers_0.1.0 oligo_1.5.9
[11] aroma.affymetrix_0.9.6 aroma.apd_0.1.4
[13] R.huge_0.1.6 affxparser_1.14.1
[15] aroma.core_0.9.6 aroma.light_1.9.2
[17] digest_0.3.1 matrixStats_0.1.3
[19] R.rsp_0.3.4 R.cache_0.1.7
[21] R.utils_1.1.1 R.oo_1.4.6
[23] R.methodsS3_1.0.3 pd.mapping250k.nsp_0.4.1