Vignette: Calculating raw total copy numbers manually
Author: Henrik Bengtsson
Created: 2007-12-12
Last modified: 2014-12-21
You can calculate raw copy numbers (CNs) from chip effect estimates by first "extracting" the chip effects and then taking the log2 ratios. Basically, first decide for which units you wish to calculate the raw CNs, extract the theta estimates across samples for these units, and calculate the log2 ratios.
Note, a more convenient way to extract raw copy numbers is by using extractRawCopyNumbers()
of CopyNumberChromosomalModel. See Vignettes 'CRMA v1: Total copy number analysis (10K, 100K, 500K)' and 'CRMA v2: Estimation of total copy numbers using the CRMA v2 method (10K-GWS6)' for examples.
To help support this work, please consider citing the following relevant references in your publications or talks whenever using their methods or results:
-
H. Bengtsson, K. Simpson, J. Bullard, et al. aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Tech. rep. 745. Department of Statistics, University of California, Berkeley, Feb. 2008.
-
H. Bengtsson, R. Irizarry, B. Carvalho, et al. “Estimation and assessment of raw copy numbers at the single locus level”. Eng. In: Bioinformatics (Oxford, England) 24.6 (Mar. 2008), pp. 759-67. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btn016. PMID: 18204055.
-
H. Bengtsson, P. Wirapati, and T. P. Speed. “A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6”. Eng. In: Bioinformatics (Oxford, England) 25.17 (Sep. 2009), pp. 2149-56. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btp371. PMID: 19535535.
Example
This example assumes you have got a normalized CnChipEffectSet object called
cesN
consisting of chip-effect estimates for total copy number
analysis. To obtain these, please see one of the total copy number
vignettes.
Identify all units on chromosome 19:
# Get the genome information file
cdf <- getCdf(cesN)
gi <- getGenomeInformation(cdf)
print(gi)
## UgpGenomeInformation:
## Name: GenomeWideSNP_6
## Tags: Full,na24,HB20080214
## Pathname: annotationData/chipTypes/GenomeWideSNP_6/
## GenomeWideSNP_6,Full,na24,HB20080214.ugp
## File size: 8.97MB
## RAM: 0.00MB
## Chip type: GenomeWideSNP_6,Full
Get all units on chromosome 19:
units <- getUnitsOnChromosome(gi, chromosome=19)
str(units)
## int [1:30362] 25670 25671 25672 25673 25674 25675 ...
Get the physical position for these units:
pos <- getPositions(gi, units=units)
str(pos)
## int [1:30362] 341341 2705548 2963883 3574534 4367411 4368845 ...
Get the names of the units:
unitNames <- getUnitNames(cdf, units=units)
str(unitNames)
## chr [1:30362] "SNP_A-1938296" "SNP_A-4259059" "SNP_A-1939610" ...
Extract the chip effects across samples for these units:
theta <- extractMatrix(cesN, units=units)
rownames(theta) <- unitNames
print(theta[1:5,])
## NA06985 NA06991 NA06993 NA06994 NA07000
NA07019
## SNP_A-1938296 2679.284 2830.551 2657.694 2677.040 3157.389 2847.244
## SNP_A-4259059 14688.840 12187.581 15538.014 11971.401 11002.526 14254.147
## SNP_A-1939610 4901.532 5198.730 6546.090 6711.787 8270.660 7745.191
## SNP_A-1940033 16756.266 16171.789 19242.760 16371.320 15542.183 19314.775
## SNP_A-4259154 2543.355 2915.118 3375.513 2695.686 3262.761 3523.717
Calculate the raw CNs as the log2-ratio over the robust average across samples:
# Robust average unit by unit across samples
thetaR <- rowMedians(theta, na.rm=TRUE)
str(thetaR)
## num [1:30362] 2755 13221 6629 16564 3089 ...
Calculate the raw CNs:
M <- log2(theta/thetaR)
print(M[1:5,])
## NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
## SNP_A-1938296 -0.0402 0.0391 -0.0518 -0.0414 0.1967 0.0476
## SNP_A-4259059 0.1519 -0.1174 0.2330 -0.1432 -0.2650 0.1086
## SNP_A-1939610 -0.4355 -0.3506 -0.0181 0.0179 0.3192 0.2245
## SNP_A-1940033 0.0167 -0.0346 0.2163 -0.0169 -0.0918 0.2217
## SNP_A-4259154 -0.2804 -0.0836 0.1280 -0.1965 0.0790 0.1900
In order to calculate paired raw CNs, you have to use a separate reference in the denominator for each column. I leave that as an exercise.
Plot the raw CNs along the chromosome for a given sample:
xlab <- "Position"
ylab <- expression(M==log[2](theta/theta[R]))
sample <- "NA06985"
plot(pos, M[,sample], pch=20, ylim=c(-3,3), xlab=xlab, ylab=ylab)
title(main=sample)