Vignette: CalMaTe - A calibration method to improve allele-specific copy number of SNP arrays for downstream segmentation

Created on: 2010-07-30
Last updated: 2014-12-21

Introduction

CalMaTe calibrates preprocessed allele-specific copy-number estimates (ASCNs) from DNA microarrays by controlling for SNP-specific allelic crosstalk. The resulting ASCNs are on average more accurate, which increases the power of segmentation methods (e.g. PSCBS by Olshen, Bengtsson, Neuvial, Spellman, Olshen, and Seshan (2011)) for detecting changes between copy-number states in tumor studies including copy-neutral loss of heterozygosity (LOH). CalMaTe applies to any ASCNs regardless of preprocessing method and microarray technology, e.g. Affymetrix and Illumina. Contrary to the TumorBoost (Bengtsson, Neuvial, and Speed, 2010), which requires a single pair of tumor-normal samples, the CalMaTe methods don't require matched normals, but instead it requires a large number of reference samples. For more details, see Ortiz-Estevez, Aramburu, Bengtsson, Neuvial, and Rubio (2012).

Data

Below we will use the public GEO data set GSE12702:

rawData/  
   GSE12702/  
     Mapping250K_Nsp/  
       GSM318728.CEL, GSM318729.CEL, ..., GSM318767.CEL

Analysis

Allele-specific copy-number estimation (CRMA v2)

We will use an allele-specific version of the CRMA v2 method (Bengtsson, Wirapati, and Speed, 2009) to estimate allele-specific copy numbers (ASCNs) at each SNP. The CRMA v2 method is implemented in the aroma.affymetrix package and the CalMaTe method in the calmate package.

library("aroma.affymetrix")
library("calmate")
verbose <- Arguments$getVerbose(-8, timestamp=TRUE)

# Setting up data set
csR <- AffymetrixCelSet$byName("GSE12702", chipType="Mapping250K_Nsp")
print(csR)

## AffymetrixCelSet:
## Name: GSE12702
## Tags:
## Path: rawData/GSE12702/Mapping250K_Nsp
## Platform: Affymetrix
## Chip type: Mapping250K_Nsp
## Number of arrays: 40
## Names: GSM318728, GSM318729, ..., GSM318767
## Time period: 2008-04-03 13:52:53 -- 2008-04-10 20:08:49
## Total file size: 2506.11MB
## RAM: 0.06MB


# Allele-specific CRMA v2 with log-additive modeling
dsList <- doASCRMAv2(csR, plm="RmaCnPlm", verbose=verbose)
print(dsList)

## $total
## AromaUnitTotalCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.06MB##
## $fracB
## AromaUnitFracBCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.05MB

Calibration of ASCN estimates (CalMaTe)

cmt <- CalMaTeCalibration(dsList)
print(cmt)

## CalMaTeCalibration:
## Data sets (2):
## <Total>:
## AromaUnitTotalCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.06MB
## <FracB>:
## AromaUnitFracBCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.05MB

dsCList <- process(cmt, verbose=verbose)
print(dsCList)

## $total
## AromaUnitTotalCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.06MB
##
## $fracB
## AromaUnitFracBCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.05MB

CalMaTe calibration using only a subset of the data set as references

As this particular data sets consists of tumor and normal samples, it may be advantageous to use only normal samples for the calibration step (and then backtransform all arrays using the estimated calibration). First we identify the indices of the samples that should be used as references:

names <- getNames(dsList$total)

patientIDs <- c(24, 25, 27, 31, 45, 52, 58, 60, 75, 110, 115, 122, 128, 137, 138, 140, 154, 167, 80, 96)
sampleTypes <- c("tumor", "normal")

pids <- rep(patientIDs, each=length(sampleTypes))
types <- rep(sampleTypes, times=length(patientIDs))

mat <- cbind(names, pids, types)
str(mat)
## chr [1:40, 1:3] "GSM318728" "GSM318729" "GSM318730" "GSM318731" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "names" "pids" "types"

idxsN <- which(mat[, "types"] == "normal")

Next, when setting up the CalMaTe model, we specify the reference samples using argument references as follows:

cmtN <- CalMaTeCalibration(dsList, tags=c("*", "normalReferences"), references=idxsN)
print(cmtN)

## CalMaTeCalibration:
##   Data sets (2):
##   <Total>:
##   AromaUnitTotalCnBinarySet:
##   Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.06MB
## <FracB>:
##   AromaUnitFracBCnBinarySet:
##   Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.05MB

dsCNList <- process(cmtN, verbose=verbose)
print(dsCNList)

## $total
## AromaUnitTotalCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN,normalReferences
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN,normalReferences
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN,normalReferences/Mapping250K_Nsp
## Total file size: 40.05 MB
## RAM: 0.06MB

## $fracB
## AromaUnitFracBCnBinarySet:
## Name: GSE12702
## Tags: ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN,normalReferences
## Full name: GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN,normalReferences
## Number of files: 40
## Names: GSM318728, GSM318729, ..., GSM318767 [40]
## Path (to the first file): totalAndFracBData/GSE12702,ACC,-XY,BPN,-XY,RMA,FLN,-XY,CMTN,normalReferences/Mapping250K_Nsp
## Total file size: 40.05
## RAM: 0.05MB

Results

extractSignals <- function(dsList, sampleName, reference=c("none", "median"), refIdxs=NULL, ..., verbose=FALSE) {
    reference <- match.arg(reference)
      idx <- indexOf(dsList$total, sampleName)
      dfT <- dsList$total[[idx]]
      dfB <- dsList$fracB[[idx]]
      tcn <- extractRawCopyNumbers(dfT, logBase=NULL, ..., verbose=verbose)
      baf <- extractRawAlleleBFractions(dfB, ..., verbose=verbose)
      if (reference == "median") {
        if (!is.null(refIdxs)) {
          dsR <- dsList$total[refIdxs]
        } else {
          dsR <- dsList$total
        }
        dfTR <- getAverageFile(dsR, verbose=verbose)
        tcnR <- extractRawCopyNumbers(dfTR, logBase=NULL, ..., verbose=verbose)
        tcn <- divideBy(tcn, tcnR)
        setSignals(tcn, 2*getSignals(tcn))
      }
    list(tcn=tcn, baf=baf)
} # extractSignals()

sampleName <- "GSM318736"

pch <- 19
cex <- 0.8

snT <- sampleName
chr <- 8

for (normalRefs in c(TRUE, FALSE)) {
  if (normalRefs) {
    figName <- sprintf("%s,Chr%02d,CalMaTe,normalReferences", snT, chr)
    dataT <- extractSignals(dsList, sampleName=snT, chromosome=chr, reference="median", refIdxs=idxsN, verbose=verbose)
    dataTC <- extractSignals(dsCNList, sampleName=snT, chromosome=chr, verbose=verbose)
  } else {
    figName <- sprintf("%s,Chr%02d,CalMaTe", snT, chr)
    dataT <- extractSignals(dsList, sampleName=snT, chromosome=chr, reference="median", verbose=verbose)
    dataTC <- extractSignals(dsCList, sampleName=snT, chromosome=chr, verbose=verbose)
  }


  toPNG(figName, width=1200, {
    subplots(4, ncol=1)
    par(mar=c(2,5,1,1)+0.1, cex=cex, cex.lab=2.4, cex.axis=2.2)

    plot(dataT$tcn, ylim=c(0,4), pch=pch)
    plot(dataT$baf, pch=pch)
    plot(dataTC$tcn, ylim=c(0,4), pch=pch)
    plot(dataTC$baf, pch=pch)
  })
}

TCN and BAF before and after CalMaTe

Figure 1a: TCNs and BAFs along chromosome 8 of sample GSM318736 with and without CalMaTe when using all 40 samples are used for calibration. Data is from CRMA v2-processed Affymetrix Mapping250K_Nsp arrays. Top two panels show the TCNs and BAFs before CalMaTe. Bottom two panels show the TCNs and BAFs after CalMaTe. The non-calibrated BAFs are noisy and biased (homozygous clouds are notat 0 and 1) whereas the CalMaTe-calibrated BAFs are much less noisy and located closer to the expected locations. We find that there are three regions, a normal region at 0-20Mb with 2 copies and BAFs near 0, 1/2 and 1 (corresponding to AA, AB and BB genotypes), a deletion at 20-44Mb with1 copy and BAFs close to 0 and 1 (A and B), and a gain at 49-145Mb with 3 copies and BAFs near 0, 1/3, 2/3, and 1 (AAA, AAB, ABB and BBB). The BAFs in the deleted region are noisier because of weaker ASCNs.

TCN and BAF before and after CalMaTe using normals as references

Figure 1b: TCNs and BAFs along chromosome 8 of sample GSM318736 with and without CalMaTe when using all 40 samples are used for calibration. Same data as in Figure 1a. Top two panels show the TCNs and BAFs before CalMaTe. Bottom two panels show the TCNs and BAFs after CalMaTe.

(betaN, betaT) plots

Another way to visualize the effect of CalMaTe on allelic signals, and connect/contrast it with TumorBoost, is to look at tumor-normal pairs and compare allele B fraction in the tumor to allele B fraction in the normal, before and after CalMaTe and/or TumorBoost.

mm <- match(sampleName, names)
id <- pid[mm]

idxT <- which((pid==id) & (type=="tumor"))
snT <- names[idxT]
stopifnot(snT==sampleName)

idxN <- which((pid==id) & (type=="normal"))
snN <- names[idxN]

xlab <- "Normal BAF"
ylab <- "TumorBAF"
lim <- c(-0.1, 1.1)
cext <- 1.8

regions <- list("normal (1,1)"=c(0, 16), "loss (0,1)"=c(16.5, 45), "gain (1,2)"=c(45, 150))

for (rr in seq(along=regions)) {
  reg <- regions[[rr]]
  regLab <- paste(reg, collapse="-")
  cnLab <- names(regions)[rr]
  datT <- extractSignals(dsList, sampleName=snT, chromosome=chr, reg=reg*1e6, verbose=verbose)
  datN <- extractSignals(dsList, sampleName=snN, chromosome=chr, reg=reg*1e6, verbose=verbose)

  datCT <- extractSignals(dsCList, sampleName=snT, chromosome=chr, reg=reg*1e6, verbose=verbose)
  datCN <- extractSignals(dsCList, sampleName=snN, chromosome=chr, reg=reg*1e6, verbose=verbose)

  tbn <- normalizeTumorBoost(datT$baf$y, datN$baf$y, verbose=verbose)
  tbnC <- normalizeTumorBoost(datCT$baf$y, datCN$baf$y, verbose=verbose)

  figName <- sprintf("%s,betaNvsBetaT,Chr%02d,%s", snT, chr, regLab)
  toPNG(figName, width=1200, {
    subplots(4, ncol=2)
    par(mar=c(5,5,2,1)+0.1, cex=cex, cex.lab=2.4, cex.axis=2.2)
    plot(datN$baf$y, datT$baf$y, pch=pch, xlim=lim, ylim=lim, xlab=xlab, ylab=ylab)
    stext("ASCRMAv2", side=3, pos=1, cex=cext)
    stext(cnLab, side=3, pos=0, cex=cext)
    plot(datCN$baf$y, datCT$baf$y, pch=pch, xlim=lim, ylim=lim, xlab=xlab, ylab=ylab)
    stext("ASCRMAv2 + CalMaTe", side=3, pos=1, cex=cext)
    stext(cnLab, side=3, pos=0, cex=cext)
    plot(datN$baf$y, tbn, pch=pch, xlim=lim, ylim=lim, xlab=xlab, ylab=ylab)
    stext("ASCRMAv2 + TumorBoost", side=3, pos=1, cex=cext)
    stext(cnLab, side=3, pos=0, cex=cext)
    plot(datCN$baf$y, tbnC, pch=pch, xlim=lim, ylim=lim, xlab=xlab, ylab=ylab)
    stext("ASCRMAv2 + CalMaTe + TumorBoost", side=3, pos=1, cex=cext)
    stext(cnLab, side=3, pos=0, cex=cext)
  })
}

In Figures 2-4, paired tumor-normal BAFs for three (manually identified) copy number regions on chromosome 8 of sample GSM318736 are displayed for each of the following four combinations of preprocessing methods: ASCRMA v2 (top left), ASCRMA v2 + CalMaTe (top right), ASCRMA v2 + TumorBoost (bottom left), ASCRMA v2 + CalMaTe + TumorBoost (bottom right).

betaT vs betaN in a normal region

Figure 2: betaT vs betaN in a normal (1,1) region between 0 and 16 Mb on Chromosome 8. Preprocessing methods used: ASCRMA v2 (top left), ASCRMA v2 + CalMaTe (top right), ASCRMA v2 + TumorBoost (bottom left), ASCRMA v2 + CalMaTe + TumorBoost (bottom right).

betaT vs betaN in a region of homozygous deletion
(0,1)

Figure 3: betaT vs betaN in a region of homozygous deletion (0,1) between 16 and 45 Mb on Chromosome 8. Preprocessing methods used: ASCRMA v2 (top left), ASCRMA v2 + CalMaTe (top right), ASCRMA v2 + TumorBoost (bottom left), ASCRMA v2 + CalMaTe + TumorBoost (bottom right).

betaT vs betaN in a region of gain
(1,2)

Figure 4: betaT vs betaN in a region of gain (1,2) after 45 Mb on Chromosome 8. Preprocessing methods used: ASCRMA v2 (top left), ASCRMA v2 + CalMaTe (top right), ASCRMA v2 + TumorBoost (bottom left), ASCRMA v2 + CalMaTe + TumorBoost (bottom right).

Note: when comparing signals from TumorBoost to CalMaTe, it is useful to keep in mind that TumorBoost does not normalize the normal sample, whereas CalMaTe calibrates all samples regardless of their types. This means that only the y axes are comparable across the 4 plots. Also, one should focus on SNPs that are heterozygous in the normal samples (they correspond to normal BAFs close to 1/2) as in theory, only these SNPs are informative in terms of copy number changes.

References

[1] H. Bengtsson, P. Wirapati, and T. P. Speed. "A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6". Eng. In: Bioinformatics (Oxford, England) 25.17 (Sep. 2009), pp. 2149-56. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btp371. PMID: 19535535.

[2] H. Bengtsson, P. Neuvial, and T. P. Speed. "TumorBoost: normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays". Eng. In: BMC bioinformatics 11 (May. 2010), p. 245. ISSN: 1471-2105. DOI: 10.1186/1471-2105-11-245. PMID: 20462408.

[3] A. B. Olshen, H. Bengtsson, P. Neuvial, et al. "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation". Eng. In: Bioinformatics (Oxford, England) 27.15 (Aug. 2011), pp. 2038-46. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btr329. PMID: 21666266.

[4] M. Ortiz-Estevez, A. Aramburu, H. Bengtsson, et al. "CalMaTe: a method and software to improve allele-specific copy number of SNP arrays for downstream segmentation". Eng. In: Bioinformatics (Oxford, England) 28.13 (Jul. 2012), pp. 1793-4. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/bts248. PMID: 22576175.