Vignette: Empirical probe-signal densities and rank-based quantile normalization
Author: Henrik Bengtsson
Created on: 2009-09-17
Last updated: 2011-02-05
This document illustrated (i) how to create density plots of raw and normalized probe signals and (ii) the result of rank-based quantile normalization stratified by probe type.
Setup
Here we will use 10 public Mapping50K_Hind240 CEL files from the HapMap project.
Raw data
Download the following CEL files from the HapMap site (see the 'HapMap 100K' data set on Page Data Sets):
rawData/
HapMap,CEU,testSet/
Mapping50K_Hind240/
NA06985_Hind_B5_3005533.CEL
NA06991_Hind_B6_3005533.CEL
NA06993_Hind_B4_4000092.CEL
NA06994_Hind_A7_3005533.CEL
NA07000_Hind_A8_3005533.CEL
NA07019_Hind_A12_4000092.CEL
NA07022_Hind_A10_4000092.CEL
NA07029_Hind_A9_4000092.CEL
NA07034_Hind_B1_4000092.CEL
NA07048_Hind_B3_4000092.CEL
Annotation data
Download the following annotation files (Mapping50K_Hind240 & Mapping50K_Xba240):
annotationData/
chipTypes/
Mapping50K_Hind240/
Mapping50K_Hind240.CDF
Mapping50K_Hind240,na26,HB20080916.ufl
Mapping50K_Hind240,na26,HB20080916.ugp
Startup
library("aroma.affymetrix")
library("R.devices")
devOptions("png", width=1024)
# Use a nicer palette of colors
colors <- RColorBrewer::brewer.pal(12, "Paired")
palette(colors)
# Setup the Verbose object
verbose <- Arguments$getVerbose(-10, timestamp=TRUE)
Setup of raw data set
cdf <- AffymetrixCdfFile$byChipType("Mapping50K_Hind240")
print(cdf)
gi <- getGenomeInformation(cdf)
print(gi)
Setup of raw data set
csR <- AffymetrixCelSet$byName("HapMap,CEU,testSet", cdf=cdf)
print(getFullNames(csR))
## [1] "NA06985_Hind_B5_3005533" "NA06991_Hind_B6_3005533"
## [3] "NA06993_Hind_B4_4000092" "NA06994_Hind_A7_3005533"
## [5] "NA07000_Hind_A8_3005533" "NA07019_Hind_A12_4000092"
## [7] "NA07022_Hind_A10_4000092" "NA07029_Hind_A9_4000092"
## [9] "NA07034_Hind_B1_4000092" "NA07048_Hind_B3_4000092"
The CEL files downloaded from HapMap has file names such as NA07000_Hind_A8_3005533.CEL. In order for aroma.affymetrix to identify 'NA07000' as the sample name, and 'A8' and '3005533' as tags (ignore the 'Hind' part), we will utilize so called fullname translators that translates the full name to a comma-separated fullname, e.g. 'NA07000_Hind_A8_3005533' to 'NA07000,A8,3005533'.
setFullNamesTranslator(csR, function(names, ...) {
# Turn into comma-separated tags
names <- gsub("_", ",", names)
# Drop any Hind/Xba tags
names <- gsub(",(Hind|Xba)", "", names)
names
})
print(getFullNames(csR))
## [1] "NA06985,B5,3005533" "NA06991,B6,3005533"
## [3] "NA06993,B4,4000092" "NA06994,A7,3005533"
## [5] "NA07000,A8,3005533" "NA07019,A12,4000092"
## [7] "NA07022,A10,4000092" "NA07029,A9,4000092"
## [9] "NA07034,B1,4000092" "NA07048,B3,4000092"
print(csR)
## AffymetrixCelSet:
## Name: HapMap
## Tags: CEU,testSet
## Path: rawData/HapMap,CEU,testSet/Mapping50K_Hind240
## Platform: Affymetrix
## Chip type: Mapping50K_Hind240
## Number of arrays: 10
## Names: NA06985, NA06991, ..., NA07048
## Time period: 2004-01-14 14:02:08 -- 2004-02-13 11:51:01
## Total file size: 244.78MB
## RAM: 0.01MB
Brief about different types of probes
On Affymetrix arrays, there are different types of probes (cells). The most well known are perfect-match (PM) and mismatch (MM) probes. In addition to these, there are also QC probes, e.g. so called landing-light probes and chequered-flag probes etc.
One can query the CDF for the indices of the cells for each type. If one ask for "all", then all cells on the array are returned.
types <- c("all", "pmmm", "pm", "mm")
cells <- lapply(types, FUN=function(type) identifyCells(cdf, types=type))
names(cells) <- types
str(cells)
## $ all : int [1:2560000] 1 2 3 4 5 6 7 8 9 10 ...
## $ pmmm: int [1:2291560] 1606 1607 1609 1610 1611 1613 1614 1615
## $ pm : int [1:1145780] 1606 1607 1609 1610 1611 1613 1614 1615
## $ mm : int [1:1145780] 3206 3207 3209 3210 3211 3213 3214 3215
As we will see next, one can specify these types also when plotting the empirical densities and when doing quantile normalization.
Raw probe-signal densities
The plotDensity()
method for an AffymetrixCelSet estimates and draws
smooth empirical density functions for each array in set. It is possible
to stratify by probe type (see above) by setting the types
argument.
For instance, types="all"
uses all probes on the chip (regardless of
CDF used), types="pmmm"
uses all PM & MM probes, types="pm"
only the PM
probes etc. The following code illustrates how to use the types
argument with plotDensity()
. See the below figure of the result.
cols <- seq(along=types)
toPNG("plotDensity", tags=c(kk, "raw"), aspectRatio=0.618, {
for (kk in seq(along=types)) {
plotDensity(csR, types=types[kk], col=cols[kk], subset=NULL,
lwd=2, ylim=c(0,0.45), add=(kk > 1))
}
legend("topleft", col=cols, lwd=4, types)
title("Raw probe signals")
})
Rank-based quantile normalization
The QuantileNormalization class provide methods for doing rank-based
quantile normalization of probe signals. By specifying argument
typesToUpdates
one can specify what type of probes should be
normalized (and used in the model fitting). All other probe signals are
left unchanged. The default is typesToUpdates="all"
, but any of "pmmm"
,
"pm"
and "mm"
can also be used. The following code illustrates how the
different values produce different results. The results are depicted in
the four panels in the below Figure.
Figure 2A.
Figure 2B.
Figure 2C.
Figure 2D.
for (type in types) {
verbose && enter(verbose, "Rank-based quantile normalization on ", type, " probes")
verbose && cat(verbose, "typesToUpdate: ", type)
qn <- QuantileNormalization(csR, typesToUpdate=type, tags=c("*", type))
print(qn)
csN <- process(qn, verbose=verbose)
toPNG("plotDensity", tags=c("QN", type), aspectRatio=0.618, {
# Plot the type normalized-by last.
kks <- match(c(setdiff(types, type), type), types)
for (kk in kks) {
plotDensity(csN, types=types[kk], subset=NULL, col=cols[kk], lwd=2, ylim=c(0,0.45), add=(kk != kks[1]));
}
legend("topleft", col=cols, lwd=4, types)
title(sprintf("QuantileNormalization(..., typesToUpdate=\"%s\")", type))
})
verbose && exit(verbose)
}