Vignette: ACNE: Allele-specific copy numbers using non-negative matrix factorization
Author: Maria Ortiz (cleanup by Henrik Bengtsson)
Created on: 2009-11-18
Last updated: 2014-12-21
Figure: Allele-specific copy numbers (CA,CB) estimated using ACNE (left), AS-dChip (center) and ASCRMA v2 (right) in a normal region where three clouds are expected, located around (2,0), (1,1) and (0,2). Data are from a Mapping250K_Nsp data set.
This document describes how to estimate allele-specific copy numbers (ASCNs) in aroma.affymetrix using ACNE (Ortiz-Estevez, Bengtsson, and Rubio, 2010). ACNE is a summarization method that provides ASCNs from signals normalized in a way similar to CRMA v2 (Bengtsson, Wirapati, and Speed, 2009).
Eight Affymetrix GenomeWideSNP_6 arrays deposited in NCBI GEO under accession number GSE14996 (samples GSM374529 to GSM374536) are used to illustrate the steps needed to apply the ACNE summarization method in aroma.affymetrix.
Note: This analysis requires the ACNE package in addition to aroma.affymetrix.
Setup
Annotation data
annotationData/
chipTypes/
GenomeWideSNP_6/
GenomeWideSNP_6,Full.cdf
GenomeWideSNP_6,Full,na26,HB20080821.ugp
GenomeWideSNP_6,Full,na26,HB20080722.ufl
GenomeWideSNP_6,Full,HB20080710.acs
Note that the CDF file *.Full.cdf has to be renamed to *,Full.cdf (with a comma instead of the period before "Full").
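The renaming can also be done from R. The following is a minimal sketch using base R only; it works on an empty placeholder file in a temporary directory (an assumption made so that the example does not touch real annotation data), and the variable names are hypothetical.

```r
# Sketch: rename '<chipType>.Full.cdf' to '<chipType>,Full.cdf'.
# Runs in a temporary directory so that no real annotation data is touched.
path <- file.path(tempdir(), "annotationData", "chipTypes", "GenomeWideSNP_6")
dir.create(path, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the downloaded CDF (an empty placeholder file)
oldPathname <- file.path(path, "GenomeWideSNP_6.Full.cdf")
file.create(oldPathname)

# Replace the period before 'Full.cdf' with a comma
newFilename <- sub("[.]Full[.]cdf$", ",Full.cdf", basename(oldPathname))
newPathname <- file.path(path, newFilename)
ok <- file.rename(oldPathname, newPathname)
print(basename(newPathname))
```

For real data, `path` would point to annotationData/chipTypes/GenomeWideSNP_6/ in the working directory instead.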
Raw data
rawData/
GSE14996,testSet/
GenomeWideSNP_6/
GSM374529.CEL, GSM374530.CEL, GSM374531.CEL, GSM374532.CEL,
GSM374533.CEL, GSM374534.CEL, GSM374535.CEL, GSM374536.CEL
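Before starting the analysis it can help to confirm that all eight CEL files are where they are expected. The sketch below uses base R only and assumes that the working directory is the one containing rawData/; the variable `notFound` is a hypothetical helper name.

```r
# Sketch: list the expected CEL files and report any that are missing.
dataSet <- "GSE14996,testSet"
chipType <- "GenomeWideSNP_6"

# GSM374529, GSM374530, ..., GSM374536
sampleNames <- sprintf("GSM%d", 374529:374536)
pathnames <- file.path("rawData", dataSet, chipType, paste0(sampleNames, ".CEL"))

# Report files that are not found relative to the working directory
notFound <- pathnames[!file.exists(pathnames)]
if (length(notFound) > 0) {
  message("Missing CEL files:\n", paste(notFound, collapse = "\n"))
}
```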
Analysis
Setup
library("aroma.affymetrix")
library("ACNE")
verbose <- Arguments$getVerbose(-10, timestamp=TRUE)
dataSet <- "GSE14996,testSet"
chipType <- "GenomeWideSNP_6"
Annotation data
cdf <- AffymetrixCdfFile$byChipType(chipType, tags="Full")
print(cdf)
which gives:
AffymetrixCdfFile:
Path: annotationData/chipTypes/GenomeWideSNP_6
Filename: GenomeWideSNP_6,Full.cdf
Filesize: 470.44MB
Chip type: GenomeWideSNP_6,Full
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 2572x2680
Number of cells: 6892960
Number of units: 1881415
Cells per unit: 3.66
Number of QC units: 4
and
gi <- getGenomeInformation(cdf)
print(gi)
which gives:
UgpGenomeInformation:
Name: GenomeWideSNP_6
Tags: Full,na26,HB20080821
Full name: GenomeWideSNP_6,Full,na26,HB20080821
Pathname:
annotationData/chipTypes/GenomeWideSNP_6/GenomeWideSNP_6,Full,na26,HB20080821.ugp
File size: 8.97 MB (9407937 bytes)
RAM: 0.00 MB
Chip type: GenomeWideSNP_6,Full
and
si <- getSnpInformation(cdf)
print(si)
which gives:
UflSnpInformation:
Name: GenomeWideSNP_6
Tags: Full,na26,HB20080722
Full name: GenomeWideSNP_6,Full,na26,HB20080722
Pathname:
annotationData/chipTypes/GenomeWideSNP_6/GenomeWideSNP_6,Full,na26,HB20080722.ufl
File size: 7.18 MB (7526454 bytes)
RAM: 0.00 MB
Chip type: GenomeWideSNP_6,Full
Number of enzymes: 2
Then
acs <- AromaCellSequenceFile$byChipType(getChipType(cdf, fullname=FALSE))
print(acs)
which outputs:
AromaCellSequenceFile:
Name: GenomeWideSNP_6
Tags: Full,HB20080710
Full name: GenomeWideSNP_6,Full,HB20080710
Pathname:
annotationData/chipTypes/GenomeWideSNP_6/GenomeWideSNP_6,Full,HB20080710.acs
File size: 170.92 MB (179217531 bytes)
RAM: 0.00 MB
Number of data rows: 6892960
File format: v1
Dimensions: 6892960x26
Column classes: raw, raw, raw, raw, raw, raw, raw, raw, raw, raw, raw,
raw, raw, raw, raw, raw, raw, raw, raw, raw, raw, raw, raw, raw, raw,
raw
Number of bytes per column: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Footer: <createdOn>20080710 22:47:02 PDT</createdOn><platform>Affymetrix</platform><chipType>GenomeWideSNP_6</chipType><srcFile><filename>GenomeWideSNP_6.probe_tab</filename><filesize>341479928</filesize><checksum>2037c033c09fd8f7c06bd042a77aef15</checksum></srcFile><srcFile2><filename>GenomeWideSNP_6.CN_probe_tab</filename><filesize>96968290</filesize><checksum>3dc2d3178f5eafdbea9c8b6eca88a89c</checksum></srcFile2>
Chip type: GenomeWideSNP_6
Platform: Affymetrix
Raw data
cs <- AffymetrixCelSet$byName(dataSet, cdf=cdf)
print(cs)
AffymetrixCelSet:
Name: GSE14996
Tags: testSet
Path: rawData/GSE14996,testSet/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Number of arrays: 8
Names: GSM374529, GSM374530, ..., GSM374536
Time period: 2007-09-21 18:56:58 -- 2007-12-31 12:29:36
Total file size: 526.64MB
RAM: 0.01MB
Pre-processing
Cross-talk calibration
acc <- AllelicCrosstalkCalibration(cs, model="CRMAv2")
print(acc)
AllelicCrosstalkCalibration:
Data set: GSE14996
Input tags: testSet
User tags: *
Asterisk ('*') tags: ACC,ra,-XY
Output tags: testSet,ACC,ra,-XY
Number of files: 8 (526.64MB)
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Algorithm parameters: (rescaleBy: chr "all", targetAvg: num 2200,
subsetToAvg: chr "-XY", mergeShifts: logi TRUE, B: int 1, flavor: chr
"sfit", algorithmParameters:List of 3, ..$ alpha: num [1:8] 0.1 0.075
0.05 0.03 0.01 0.0025 0.001 0.0001, ..$ q: num 2, ..$ Q: num 98)
Output path: probeData/GSE14996,testSet,ACC,ra,-XY/GenomeWideSNP_6
Is done: FALSE
RAM: 0.00MB
csC <- process(acc, verbose=verbose)
print(csC)
AffymetrixCelSet:
Name: GSE14996
Tags: testSet,ACC,ra,-XY
Path: probeData/GSE14996,testSet,ACC,ra,-XY/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Number of arrays: 8
Names: GSM374529, GSM374530, ..., GSM374536
Time period: 2007-09-21 18:56:58 -- 2007-12-31 12:29:36
Total file size: 526.64MB
RAM: 0.01MB
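To give an intuition for what allelic crosstalk means, here is a toy illustration in base R. It is not the estimation procedure used by AllelicCrosstalkCalibration: the crosstalk matrix `S` and the `offset` are assumed to be known here (hypothetical values), whereas the real method has to estimate them from the data.

```r
# Toy model: observed = offset + S %*% true, where the off-diagonal
# elements of S let signal from one allele leak into the other.
S <- matrix(c(1.00, 0.15,
              0.20, 1.00), nrow = 2, byrow = TRUE,
            dimnames = list(c("yA", "yB"), c("xA", "xB")))
offset <- c(yA = 100, yB = 120)

# True allele signals for genotypes AA, AB and BB (one column each)
x <- rbind(xA = c(2000, 1000, 0),
           xB = c(0, 1000, 2000))
colnames(x) <- c("AA", "AB", "BB")

# Observed signals; the length-2 offset is recycled down each column
y <- S %*% x + offset

# Calibration: remove the offset and undo the crosstalk
xHat <- solve(S, y - offset)
print(round(xHat))
```

After calibration, homozygous samples fall back onto the axes of the (A,B) plane, which is what makes the downstream allele-specific estimates interpretable.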
Nucleotide-position sequence normalization
bpn <- BasePositionNormalization(csC, target="zero")
print(bpn)
BasePositionNormalization:
Data set: GSE14996
Input tags: testSet,ACC,ra,-XY
User tags: *
Asterisk ('*') tags: BPN,-XY
Output tags: testSet,ACC,ra,-XY,BPN,-XY
Number of files: 8 (526.64MB)
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Algorithm parameters: (unitsToFit: chr "-XY", typesToFit: chr "pm",
unitsToUpdate: NULL, typesToUpdate: chr "pm", shift: num 0, target: chr
"zero", model: chr "smooth.spline", df: int 5)
Output path:
probeData/GSE14996,testSet,ACC,ra,-XY,BPN,-XY/GenomeWideSNP_6
Is done: FALSE
RAM: 0.00MB
csN <- process(bpn, verbose=verbose)
print(csN)
AffymetrixCelSet:
Name: GSE14996
Tags: testSet,ACC,ra,-XY,BPN,-XY
Path: probeData/GSE14996,testSet,ACC,ra,-XY,BPN,-XY/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Number of arrays: 8
Names: GSM374529, GSM374530, ..., GSM374536
Time period: 2007-09-21 18:56:58 -- 2007-12-31 12:29:36
Total file size: 526.64MB
RAM: 0.01MB
Probe summarization using non-negative-matrix factorization (NMF)
plm <- NmfSnpPlm(csN, mergeStrands=TRUE)
print(plm)
NmfSnpPlm:
Data set: GSE14996
Chip type: GenomeWideSNP_6,Full
Input tags: testSet,ACC,ra,-XY,BPN,-XY
Output tags: testSet,ACC,ra,-XY,BPN,-XY,NMF,v4
Parameters: (probeModel: chr "pm"; shift: num 0; mergeStrands: logi TRUE).
Path: plmData/GSE14996,testSet,ACC,ra,-XY,BPN,-XY,NMF,v4/GenomeWideSNP_6
RAM: 0.00MB
if (length(findUnitsTodo(plm)) > 0) {
# Fit CN probes quickly (~5-10s/array + some overhead)
units <- fitCnProbes(plm, verbose=verbose)
str(units)
# int [1:945826] 935590 935591 935592 935593 935594 935595 ...
# Fit remaining units, i.e. SNPs (~5-10min/array)
units <- fit(plm, verbose=verbose)
str(units)
}
ces <- getChipEffectSet(plm)
print(ces)
SnpChipEffectSet:
Name: GSE14996
Tags: testSet,ACC,ra,-XY,BPN,-XY,NMF,v4
Path:
plmData/GSE14996,testSet,ACC,ra,-XY,BPN,-XY,NMF,v4/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full,monocell
Number of arrays: 8
Names: GSM374529, GSM374530, ..., GSM374536
Time period: 2009-11-19 10:51:15 -- 2009-11-19 10:51:16
Total file size: 215.59MB
RAM: 0.01MB
Parameters: (probeModel: chr "pm", mergeStrands: logi TRUE)
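For readers unfamiliar with non-negative matrix factorization, the following toy example in base R factorizes a small probe-by-array matrix Y into non-negative factors W (probe affinities) and H (per-array allele signals). It uses the generic multiplicative update rules of Lee and Seung and is only an illustration of the idea; it is not the algorithm implemented in the ACNE package, and the function `nmfToy` and all values are hypothetical.

```r
# Toy rank-2 NMF: Y (probes x arrays) ~ W (probes x 2) %*% H (2 x arrays)
nmfToy <- function(Y, rank = 2L, maxIter = 200L, eps = 1e-9) {
  I <- nrow(Y)
  J <- ncol(Y)
  # Random non-negative starting values
  W <- matrix(runif(I * rank), nrow = I)
  H <- matrix(runif(rank * J), nrow = rank)
  for (iter in seq_len(maxIter)) {
    # Multiplicative updates keep all elements non-negative
    H <- H * (t(W) %*% Y) / (t(W) %*% W %*% H + eps)
    W <- W * (Y %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

# Simulated SNP: four probes (two per allele) and six arrays (AA, AB, BB twice)
W0 <- rbind(c(1.0, 0.1), c(0.8, 0.2), c(0.1, 1.2), c(0.2, 0.9))
H0 <- rbind(CA = c(2, 1, 0, 2, 1, 0),
            CB = c(0, 1, 2, 0, 1, 2))
Y <- W0 %*% H0

set.seed(1)
fit <- nmfToy(Y)

# Relative reconstruction error of the fit
relErr <- sqrt(sum((Y - fit$W %*% fit$H)^2)) / sqrt(sum(Y^2))
print(relErr)
```

Note that a plain NMF is only identified up to scaling and permutation of the two components, so the rows of `fit$H` are not directly on the copy-number scale.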
Results
Extracting allele-specific CNs
Example: ASCNs for Chromosome 2
chromosome <- 2
units <- getUnitsOnChromosome(gi, chromosome=chromosome)
str(units)
## int [1:153663] 26048 26049 26050 26052 26053 26054 26055 26056 26057 26058 ...
pos <- getPositions(gi, units=units)
str(pos)
## int [1:153663] 102496 141464 155674 160576 160616 166395 179818 179972 214086 214192 ...
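The first position shown above (102496) is larger than the smallest position reported for this chromosome further down (2785), which suggests that the units are not returned in genomic order. If a downstream step needs ordered loci, units and positions can be reordered together. A minimal base-R sketch with hypothetical stand-in values:

```r
# Toy stand-ins for 'units' and 'pos' (hypothetical values)
units <- c(26048L, 26049L, 26050L, 26052L)
pos <- c(102496L, 141464L, 2785L, 160576L)

# Reorder both vectors by genomic position
o <- order(pos)
unitsSorted <- units[o]
posSorted <- pos[o]
print(cbind(unit = unitsSorted, position = posSorted))
```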
Example: ASCNs for Sample #1
cf <- ces[[1]]
data <- extractTotalAndFreqB(cf, units=units)
CT <- data[,"total"]
# NmfSnpPlm returns relative ASCNs (CA, CB) already standardized to the pool of all arrays.
C <- CT
cn <- RawCopyNumbers(C, pos, chromosome=chromosome)
print(cn)
RawCopyNumbers:
Name:
Chromosome: 2
Position range: [2785,2.42738e+08]
Number of loci: 153663
Mean distance between loci: 1579.67
Loci fields: x [153663xnumeric], y [153663xnumeric]
RAM: 1.76MB
beta <- data[,"freqB"]
fracB <- RawAlleleBFractions(beta, pos, chromosome=chromosome)
print(fracB)
RawAlleleBFractions:
Name:
Chromosome: 2
Position range: [2785,2.42738e+08]
Number of loci: 153663
Mean distance between loci: 1579.67
Loci fields: x [153663xnumeric], y [153663xnumeric]
RAM: 1.76MB
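The pair (total, freqB) can be converted back to allele-specific copy numbers (CA,CB). The sketch below assumes that total = CA + CB and freqB = CB / total, and uses a small hypothetical matrix in place of the output of extractTotalAndFreqB(); CN probes are represented by a missing freqB.

```r
# Toy stand-in for the matrix returned by extractTotalAndFreqB()
data <- cbind(total = c(2.0, 2.1, 1.9, 2.0),
              freqB = c(0.0, 0.5, 1.0, NA))  # NA: CN probe, no allele information
CT <- data[, "total"]
beta <- data[, "freqB"]

# Split the total into the two alleles (assumes freqB = CB / (CA + CB))
CB <- beta * CT
CA <- (1 - beta) * CT

# Keep SNPs only; CN probes have no defined (CA,CB)
isSnp <- !is.na(beta)
ascn <- cbind(CA = CA[isSnp], CB = CB[isSnp])
print(ascn)
```

In a normal region, plotting CB against CA for many SNPs should give the three clouds around (2,0), (1,1) and (0,2) described in the figure caption.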
Plotting TCN and BAF
xScale <- 1e-6
cn <- extractSubset(cn, which(!is.na(beta))) # drop CN probes, which have no allele B fraction
subplots(2, ncol=1)
plot(cn, xScale=xScale, ylim = c(0,6), cex = .3)
stext(side=3, pos=0, getName(cn))
stext(side=3, pos=1, sprintf("Chr%d", chromosome))
plot(fracB, xScale=xScale, cex = .3, ylim = c(0,1))
References
[1] H. Bengtsson, P. Wirapati, and T. P. Speed. "A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6". In: Bioinformatics (Oxford, England) 25.17 (Sep. 2009), pp. 2149-56. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btp371. PMID: 19535535.
[2] M. Ortiz-Estevez, H. Bengtsson, and A. Rubio. "ACNE: a summarization method to estimate allele-specific copy numbers for Affymetrix SNP arrays". In: Bioinformatics (Oxford, England) 26.15 (Aug. 2010), pp. 1827-33. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btq300. PMID: 20529889.