Vignette: CRLMM genotyping (100K and 500K)
Author: Henrik Bengtsson
Created on: 2009-01-15
Last updated: 2009-01-15
Introduction
This document shows how to estimate genotypes based on the CRLMM
algorithm (Carvalho, Bengtsson, Speed, and Irizarry, 2007). The setup is such that it tries to replicate the estimates
of the justCRLMMv2()
of the oligo package. Support for CRLMM in
aroma.affymetrix is currently for the 100K and the 500K chip types. The
GWS5 or GWS6 chip types are yet not supported.
Setup
If this is your first analysis in aroma.affymetrix, please make sure to first read the first few pages in the online 'Users Guide'. This will explain the importance of following a well defined directory structure and file names. Understanding this is important and will save you a lot of time.
Annotation data
annotationData/
chipTypes/
Mapping250K_Nsp/
Mapping250K_Nsp.cdf
Mapping250K_Nsp,na26,HB20080915.ugp
Raw data
rawData/
HapMap270,500K,CEU,testSet/
Mapping250K_Nsp/
NA06985.CEL NA06991.CEL NA06993.CEL
NA06994.CEL NA07000.CEL NA07019.CEL
TThere should be 6 CEL files in total. This data was downloaded from the HapMap website.
Analysis
library("aroma.affymetrix")
log <- verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
Setup raw data set
cdf <- AffymetrixCdfFile$byChipType("Mapping50K_Nsp")
print(cdf)
## AffymetrixCdfFile:
## Path: annotationData/chipTypes/Mapping250K_Nsp
## Filename: Mapping250K_Nsp.cdf
## Filesize: 185.45MB
## Chip type: Mapping250K_Nsp
## RAM: 11.01MB
## File format: v4 (binary; XDA)
## Number of cells: 6553600
## Number of units: 262338
## Cells per unit: 24.98
## Number of QC units: 6
csR <- AffymetrixCelSet$byName("HapMap270,500K,CEU,testSet", cdf=cdf)
print(csR)
## AffymetrixCelSet:
## Name: HapMap270
## Tags: 500K,CEU,testSet
## Path: rawData/HapMap270,500K,CEU,testSet/Mapping250K_Nsp
## Platform: Affymetrix
## Chip type: Mapping250K_Nsp
## Number of arrays: 6
## Names: NA06985, NA06991, ..., NA07019
## Time period: 2005-11-16 15:17:51 -- 2005-11-18 19:08:24
## Total file size: 375.88MB
## RAM: 0.01MB
SNPRMA (normalization and summarization)
ces <- justSNPRMA(csR, normalizeToHapmap=TRUE, returnESet=FALSE, verbose=log)
print(ces)
## SnpChipEffectSet:
## Name: HapMap270
## Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-
## Path: plmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-/Mapping250K_Nsp
## Platform: Affymetrix
## Chip type: Mapping250K_Nsp,monocell
## Number of arrays: 6
## Names: NA06985, NA06991, ..., NA07019
## Time period: 2009-01-10 17:34:45 -- 2009-01-10 17:34:46
## Total file size: 57.34MB
## RAM: 0.01MB
## Parameters: (probeModel: chr "pm", mergeStrands: logi FALSE)
Genotype calling using the CRLMM model
Setting up the model:
crlmm <- CrlmmModel(ces, tags="*,oligo")
print(crlmm)
## CrlmmModel:
## Data set: HapMap270
## Chip type: Mapping250K_Nsp,monocell
## Input tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-
## Output tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
## Parameters: (balance: num 1.5; minLLRforCalls: num [1:3] 5 1 5
## recalibrate: logi FALSE;flavor: chr "v2").
## Path: crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
## RAM: 0.00MB
Fitting the model:
units <- fit(crlmm, ram="oligo", verbose=log)
str(units)
## int [1:262264] 135666 261964 227992 3670 227993 159191 212754 24401 ...
Accessing individual genotype calls and confidence scores
Retrieving the genotype calls (without actually loading anything):
callSet <- getCallSet(crlmm)
print(callSet)
## AromaUnitGenotypeCallSet:
## Name: HapMap270
## Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
## Full name: HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
## Number of files: 6
## Names: NA06985, NA06991, ..., NA07019
## Path (to the first file): crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
## Total file size: 3.00MB
## RAM: 0.01MB
calls <- extractGenotypes(callSet, units=2001:2005)
print(calls)
## NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
## [1,] "AA" "AA" "AA" "AA" "AA" "AA"
## [2,] "AA" "AB" "AB" "BB" "AB" "AB"
## [3,] "AA" "AB" "AB" "AB" "AA" "AB"
## [4,] "BB" "AB" "AB" "BB" "BB" "BB"
## [5,] "AA" "AA" "AA" "AA" "AA" "AA"
Retrieving the confidence scores for the calls:
confSet <- getConfidenceScoreSet(crlmm)
print(confSet)
## AromaUnitSignalBinarySet:
## Name: HapMap270
## Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
## Full name: HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
## Number of files: 6
## Names: NA06985, NA06991, ..., NA07019
## Path (to the first file): crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
## Total file size: 6.01MB
## RAM: 0.01MB
scores <- extractMatrix(confSet, units=2001:2005)
print(scores)
## NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
## [1,] 0.9998 0.9998 0.9998 0.9999 0.9999 0.9998
## [2,] 0.9968 0.9932 0.9931 0.9998 0.9976 0.9974
## [3,] 0.9997 0.9993 0.9967 0.9988 0.9998 0.9975
## [4,] 0.9986 0.9961 0.9894 0.9994 0.9994 0.9989
## [5,] 0.9998 0.9999 0.9998 0.9999 0.9998 0.9998
Plotting raw CNs and raw genotypes annotated by genotypes
gi <- getGenomeInformation(cdf)
units <- getUnitsOnChromosome(gi, chromosome=2, region=c(75,90)*1e6)
pos <- getPositions(gi, units=units) / 1e6
Get data for the last array:
array <- 6
ce <- ces[[array]]
data <- extractTotalAndFreqB(ce, units=units)
str(data)
## num [1:1306, 1:2] 2312 2185 1994 3242 4156 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "total" "freqB"
Extract the (total) theta and the freqB columns:
theta <- data[,"total"]
freqB <- data[,"freqB"]
We calculate the copy-neutral reference as the robust average of all arrays:
ceR <- getAverageFile(ces)
dataR <- extractTotalAndFreqB(ceR, units=units)
thetaR <- dataR[,"total"]
Then we calculate the log2 copy-number ratios:
M <- log2(theta/thetaR)
Finally, we extract the corresponding genotype calls:
cf <- callSet[[array]]
calls <- extractGenotypes(cf, units=units)
Next, we plot the (x,C) and (x,freqB) along the genome annotated with colors according to genotype:
col <- c(AA=1, AB=2, BB=3)[calls]
xlim <- range(pos, na.rm=TRUE)
xlab <- "Physical position (Mb)"
Mlab <- expression(log[2](theta/theta[R]))
Blab <- expression(beta == theta[B]/theta)
subplots(2, ncol=1)
par(mar=c(3,4,1,1)+0.1, pch=".")
plot(pos, M, col=col, cex=3, xlim=xlim, ylim=c(-2,2), xlab=xlab, ylab=Mlab)
stext(side=3, pos=0, getName(ce))
stext(side=3, pos=1, "Chr 2")
plot(pos, freqB, col=col, cex=3, xlim=xlim, ylim=c(0,1), xlab=xlab, ylab=Blab)
Comparing with CRLMM in oligo
path <- file.path("oligoData", getFullName(csR), getChipType(csR, fullname=FALSE))
path <- Arguments$getWritablePathname(path)
if (!isDirectory(path)) {
mkdirs(getParent(path))
oligo:::justCRLMMv2(getPathnames(csR), tmpdir=path, recalibrate=FALSE, balance=1.5, verbose=TRUE)
}
Comparing genotype calls
Genotype calls according to oligo
calls0 <- readSummaries("calls", path)
dimnames(calls0) <- NULL
Genotype calls according to aroma.affymetrix
units <- indexOf(cdf, pattern="^SNP")
unitNames <- getUnitNames(cdf, units=units)
units <- units[order(unitNames)]
calls <- extractGenotypes(callSet, units=units, encoding="oligo")
dimnames(calls) <- NULL
Differences between aroma.affymetrix and oligo
count <- 0
for (cc in 1:ncol(calls)) {
idxs <- which(calls[,cc] != calls0[,cc])
count <- count + length(idxs)
printf("%s: ", getNames(callSet)[cc])
if (length(idxs) > 0) {
map <- c("AA", "AB", "BB")
cat(paste(map[calls[idxs,cc]], map[calls0[idxs,cc]], sep="!="), sep=",")
}
cat("\n")
}
We identify the following difference for each of the genotyped samples:
- NA06985: AB!=AA, AB!=BB, AB!=AA, BB!=AB, BB!=AB
- NA06991: AB!=AA
- NA06993: AA!=AB, BB!=AB
- NA06994: AB!=AA, BB!=AB
- NA07000: AB!=BB, AA!=AB
- NA07019: AA!=AB, AA!=AB, AA!=AB, AA!=AB
Note that these differences are only in one allele, that is, one implementation calls a SNP heterozygote whereas the other calls a homozygote.
printf("Averages number of discrepancies per array: %.1f\n", count/ncol(calls))
Averages number of discrepancies per array: 2.7
errorRate <- count/length(calls)
printf("Concordance rate: %.5f%%\n", 100*(1-errorRate))
Concordance rate: 99.99898%
Comparing confidence scores
Confidence scores according to oligo:
conf0 <- readSummaries("conf", path)
dimnames(conf) <- NULL
Confidence scores according to aroma.affymetrix:
conf <- extractMatrix(confSet, units=units)
dimnames(conf) <- NULL
Differences between aroma.affymetrix and oligo:
delta <- conf - conf0
avgDelta <- mean(abs(delta), na.rm=TRUE)
printf("Averages difference: %.2g\n", avgDelta)
Averages difference: 7.1e-06
Pairwise plots:
subplots(ncol(conf))
par(mar=c(3,2,1,1)+0.1)
lim <- c(0,1)
for (cc in 1:ncol(conf)) {
plot(NA, xlim=lim, ylim=lim)
abline(a=0, b=1, col="#999999")
points(conf[,cc], conf0[,cc], pch=".", cex=3)
rho <- cor(conf[,cc], conf0[,cc])
stext(side=3, pos=0, line=-1, sprintf("rho=%.4f", rho))
stext(side=3, pos=0, getNames(confSet)[cc])
printf("Array #%d: Correlation: %.4f\n", cc, rho)
}
References
[1] B. Carvalho, H. Bengtsson, T. P. Speed, et al. "Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data". Eng. In: Biostatistics (Oxford, England) 8.2 (Apr. 2007), pp. 485-99. ISSN: 1465-4644. DOI: 10.1093/biostatistics/kxl042. PMID: 17189563.
Appendix
Session information
> sessionInfo()
R version 2.8.1 Patched (2008-12-22 r47296)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
attached base packages:
[1] splines tools stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pd.mapping250k.nsp_0.4.1 oligoClasses_1.3.8
[4] AnnotationDbi_1.3.12 preprocessCore_1.3.4 RSQLite_0.7-0
[7] DBI_0.2-4 Biobase_2.1.7 aroma.affymetrix_1.0.0
[10] aroma.apd_0.1.4 R.huge_0.1.6 affxparser_1.15.1
[13] aroma.core_1.0.0 aroma.light_1.11.1 oligo_1.5.9
[16] digest_0.3.1 matrixStats_0.1.3 R.rsp_0.3.4
[19] R.cache_0.1.7 R.utils_1.1.3 R.oo_1.4.6
[22] R.methodsS3_1.0.3