Title:  LightWeight Methods for Normalization and Visualization of Microarray Data using Only Basic R Data Types 

Description:  Methods for microarray analysis that take basic data types such as matrices and lists of vectors. These methods can be used standalone, be utilized in other packages, or be wrapped up in higherlevel classes. 
Authors:  Henrik Bengtsson [aut, cre, cph], Pierre Neuvial [ctb], Aaron Lun [ctb] 
Maintainer:  Henrik Bengtsson <[email protected]> 
License:  GPL (>= 2) 
Version:  3.30.0 
Built:  20230923 06:54:21 UTC 
Source:  https://github.com/HenrikBengtsson/aroma.light 
Methods for microarray analysis that take basic data types such as matrices and lists of vectors. These methods can be used standalone, be utilized in other packages, or be wrapped up in higherlevel classes.
To install this package, see https://bioconductor.org/packages/release/bioc/html/aroma.light.html.
For scanner calibration:
see calibrateMultiscan
()  scan the same array two or more times to calibrate for scanner effects and extended dynamical range.
To normalize multiple singlechannel arrays all with the same number of probes/spots:
normalizeAffine
()  normalizes, on the intensity scale, for differences in offset and scale between channels.
normalizeQuantileRank
(), normalizeQuantileSpline
()  normalizes, on the intensity scale, for differences in empirical distribution between channels.
To normalize multiple singlechannel arrays with varying number probes/spots:
normalizeQuantileRank
(), normalizeQuantileSpline
()  normalizes, on the intensity scale, for differences in empirical distribution between channels.
To normalize twochannel arrays:
normalizeAffine
()  normalizes, on the intensity scale, for differences in offset and scale between channels. This will also correct for intensitydependent affects on the log scale.
normalizeCurveFit
()  Classical intensitydependent normalization, on the log scale, e.g. lowess normalization.
To normalize three or more channels:
normalizeAffine
()  normalizes, on the intensity scale, for differences in offset and scale between channels. This will minimize the curvature on the log scale between any two channels.
Several of the normalization methods proposed in [1][7] are available in this package.
Whenever using this package, please cite one or more of [1][7].
Here is a list of features that would be useful, but which I have too little time to add myself. Contributions are appreciated.
At the moment, nothing.
If you consider to contribute, make sure it is not already implemented by downloading the latest "devel" version!
The releases of this package is licensed under GPL version 2 or newer.
NB: Except for the robustSmoothSpline()
method,
it is alright to distribute the rest of the package under
LGPL version 2.1 or newer.
The development code of the packages is under a private licence (where applicable) and patches sent to the author fall under the latter license, but will be, if incorporated, released under the "release" license above.
Henrik Bengtsson, Pierre Neuvial, Aaron Lun
Some of the reference below can be found at
https://www.aromaproject.org/publications/.
[1] H. Bengtsson, Identification and normalization of plate effects
in cDNA microarray data, Preprints in Mathematical Sciences,
2002:28, Mathematical Statistics, Centre for Mathematical Sciences,
Lund University, 2002.
[2] H. Bengtsson, The R.oo package  ObjectOriented Programming with References Using Standard R Code, In Kurt Hornik, Friedrich Leisch and Achim Zeileis, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), March 2022, Vienna, Austria. http://www.ci.tuwien.ac.at/Conferences/DSC2003/Proceedings/
[3] H. Bengtsson, aroma  An R Objectoriented Microarray
Analysis environment, Preprints in Mathematical Sciences (manuscript
in preparation), Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2004.
[4] H. Bengtsson, J. VallonChristersson and G. Jönsson, Calibration and assessment of channelspecific biases in microarray data with extended dynamical range, BMC Bioinformatics, 5:177, 2004.
[5] Henrik Bengtsson and Ola Hössjer, Methodological Study of Affine Transformations of Gene Expression Data, Methodological study of affine transformations of gene expression data with proposed robust nonparametric multidimensional normalization method, BMC Bioinformatics, 2006, 7:100.
[6] H. Bengtsson, R. Irizarry, B. Carvalho, and T. Speed, Estimation and assessment of raw copy numbers at the single locus level, Bioinformatics, 2008.
[7] H. Bengtsson, A. Ray, P. Spellman and T.P. Speed, A singlesample method for normalizing and combining fullresolutioncopy numbers from multiple platforms, labs and analysis methods, Bioinformatics, 2009.
[8] H. Bengtsson, P. Neuvial and T.P. Speed, TumorBoost: Normalization of allelespecific tumor copy numbers from a single pair of tumornormal genotyping microarrays, BMC Bioinformatics, 2010, 11:245. [PMID 20462408]
In this section we give our recommendation on how spotted twocolor (or multicolor) microarray data is best calibrated and normalized.
We do not recommend background subtraction in classical means where background is estimated by various image analysis methods. This means that we will only consider foreground signals in the analysis.
We estimate "background" by other means. In what is explain below, only a global background, that is, a global bias, is estimated and removed.
In Bengtsson et al (2004) we give evidence that microarray scanners can introduce a significant bias in data. This bias, which is about 1525 out of 65535, will introduce intensity dependency in the logratios, as explained in Bengtsson & Hössjer (2006).
In Bengtsson et al (2004) we find that this bias is stable across arrays (and a couple of months), but further research is needed in order to tell if this is true over a longer time period.
To calibrate signals for scanner biases, scan the same array at multiple PMTsettings at three or more (K >= 3) different PMT settings (preferably in decreasing order). While doing this, do not adjust the laser power settings. Also, do the multiscan without washing, cleaning or by other means changing the array between subsequent scans. Although not necessary, it is preferred that the array remains in the scanner between subsequent scans. This will simplify the image analysis since spot identification can be made once if images aligns perfectly.
After image analysis, read all K scans for the same array into the two matrices, one for the red and one for the green channel, where the K columns corresponds to scans and the N rows to the spots. It is enough to use foreground signals.
In order to multiscan calibrate the data, for each channel
separately call Xc < calibrateMultiscan(X)
where X
is the NxK matrix of signals for one channel across all scans. The
calibrated signals are returned in the Nx1 matrix Xc
.
Multiscan calibration may sometimes be skipped, especially if affine
normalization is applied immediately after, but we do recommend that
every lab check at least once if their scanner introduce bias.
If the offsets in a scanner is already estimated from earlier
multiscan analyses, or known by other means, they can readily be
subtracted from the signals of each channel. If arrays are still
multiscanned, it is possible to force the calibration method to
fit the model with zero intercept (assuming the scanner offsets
have been subtracted) by adding argument center=FALSE
.
In Bengtsson & Hössjer (2006), we carry out a detailed study on how biases in each channel introduce so called intensitydependent logratios among other systematic artifacts. Data with (additive) bias in each channel is said to be affinely transformed. Data without such bias, is said to be linearly (proportionally) transform. Ideally, observed signals (data) is a linear (proportional) function of true gene expression levels.
We do not assume proportional observations. The scanner bias is real evidence that assuming linearity is not correct. Affine normalization corrects for affine transformation in data. Without control spots it is not possible to estimate the bias in each of the channels but only the relative bias such that after normalization the effective bias are the same in all channels. This is why we call it normalization and not calibration.
In its simplest form, affine normalization is done by
Xn < normalizeAffine(X)
where X
is a Nx2 matrix with
the first column holds the foreground signals from the red channel and
the second holds the signals from the green channel. If three or
fourchannel data is used these are added the same way. The normalized
data is returned as a Nx2 matrix Xn
.
To normalize all arrays and all channels at once, one may put all
data into one big NxK matrix where the K columns hold the all channels
from the first array, then all channels from the second array and so
on. Then Xn < normalizeAffine(X)
will return the acrossarray
and acrosschannel normalized data in the NxK matrix Xn
where
the columns are stored in the same order as in matrix X
.
Equal effective bias in all channels is much better. First of all, any intensitydependent bias in the logratios is removed for all nondifferentially expressed genes. There is still an intensitydependent bias in the logratios for differentially expressed genes, but this is now symmetric around logratio zero.
Affine normalization will (by default and recommended) normalize all arrays together and at once. This will guarantee that all arrays are "on the same scale". Thus, it not recommended to apply a classical betweenarray scale normalization afterward. Moreover, the average logratio will be zero after an affine normalization.
Note that an affine normalization will only remove curvature in the logratios at lower intensities. If a strong intensitydependent bias at high intensities remains, this is most likely due to saturation effects, such as too high PMT settings or quenching.
Note that for a perfect affine normalization you should expect much higher noise levels in the logratios at lower intensities than at higher. It should also be approximately symmetric around zero logratio. In other words, a strong fanning effect is a good sign.
Due to different noise levels in red and green channels, different PMT settings in different channels, plus the fact that the minimum signal is zero, "odd shapes" may be seen in the logratio vs logintensity graphs at lower intensities. Typically, these show themselves as nonsymmetric in positive and negative logratios. Note that you should not see this at higher intensities.
If there is a strong intensitydependent effect left after the affine normalization, we recommend, for now, that a subsequent curvefit or quantile normalization is done. Which one, we do not know.
Why negative signals?
By default, 5% of the normalized signals will have a nonpositive
signal in one or both channels. This is on purpose, although
the exact number 5% is chosen by experience. The reason for
introducing negative signals is that they are indeed expected.
For instance, when measure a zero gene expression level, there is
a chance that the observed value is (should be) negative due to
measurement noise. (For this reason it is possible that the scanner
manufacturers have introduced scanner bias on purpose to avoid
negative signals, which then all would be truncated to zero.)
To adjust the ratio (or number) of negative signals allowed, use
for example normalizeAffine(X, constraint=0.01)
for 1%
negative signals. If set to zero (or "max"
) only as much
bias is removed such that no negative signals exist afterward.
Note that this is also true if there were negative signals on
beforehand.
Why not lowess normalization? Curvefit normalization methods such as lowess normalization are basically designed based on linearity assumptions and will for this reason not correct for channel biases. Curvefit normalization methods can by definition only be applied to one pair of channels at the time and do therefore require a subsequent betweenarray scale normalization, which is by the way very ad hoc.
Why not quantile normalization? Affine normalization can be though of a special case of quantile normalization that is more robust than the latter. See Bengtsson & Hössjer (2006) for details. Quantile normalization is probably better to apply than curvefit normalization methods, but less robust than affine normalization, especially at extreme (low and high) intensities. For this reason, we do recommend to use affine normalization first, and if this is not satisfactory, quantile normalization may be applied.
If the channel offsets are zero, already corrected for, or estimated
by other means, it is possible to normalize the data robustly by
fitting the above affine model without intercept, that is, fitting
a truly linear model. This is done adding argument center=FALSE
when calling normalizeAffine()
.
Henrik Bengtsson
Gets the average empirical distribution for a set of samples.
## S3 method for class 'list'
averageQuantile(X, ...)
## S3 method for class 'matrix'
averageQuantile(X, ...)
X 
A 
... 
Not used. 
Returns a numeric
vector
of length equal to the longest vector
in argument X
.
Missing values are excluded.
Parts adopted from Gordon Smyth (http://www.statsci.org/) in 2002 \& 2006. Original code by Ben Bolstad at Statistics Department, University of California.
normalizeQuantileRank
().
normalizeQuantileSpline
().
quantile
.
Reverse affine transformation.
## S3 method for class 'matrix'
backtransformAffine(X, a=NULL, b=NULL, project=FALSE, ...)
X 
An NxK 
a 
A scalar, 
b 
A scalar, 
project 
returned (K values per data point are returned).
If 
... 
Not used. 
The "(Xa)/b
" backtransformed NxK matrix
is returned.
If project
is TRUE
, an Nx1 matrix
is returned, because
all columns are identical anyway.
Missing values remain missing values. If projected, data points that contain missing values are projected without these.
X < matrix(1:8, nrow=4, ncol=2)
X[2,2] < NA
print(X)
# Returns a 4x2 matrix
print(backtransformAffine(X, a=c(1,5)))
# Returns a 4x2 matrix
print(backtransformAffine(X, b=c(1,1/2)))
# Returns a 4x2 matrix
print(backtransformAffine(X, a=matrix(1:4,ncol=1)))
# Returns a 4x2 matrix
print(backtransformAffine(X, a=matrix(1:3,ncol=1)))
# Returns a 4x2 matrix
print(backtransformAffine(X, a=matrix(1:2,ncol=1), b=c(1,2)))
# Returns a 4x1 matrix
print(backtransformAffine(X, b=c(1,1/2), project=TRUE))
# If the columns of X are identical, and a identity
# backtransformation is applied and projected, the
# same matrix is returned.
X < matrix(1:4, nrow=4, ncol=3)
Y < backtransformAffine(X, b=c(1,1,1), project=TRUE)
print(X)
print(Y)
stopifnot(sum(X[,1]Y) <= .Machine$double.eps)
# If the columns of X are identical, and a identity
# backtransformation is applied and projected, the
# same matrix is returned.
X < matrix(1:4, nrow=4, ncol=3)
X[,2] < X[,2]*2; X[,3] < X[,3]*3
print(X)
Y < backtransformAffine(X, b=c(1,2,3))
print(Y)
Y < backtransformAffine(X, b=c(1,2,3), project=TRUE)
print(Y)
stopifnot(sum(X[,1]Y) <= .Machine$double.eps)
Reverse transformation of principalcurve fit.
## S3 method for class 'matrix'
backtransformPrincipalCurve(X, fit, dimensions=NULL, targetDimension=NULL, ...)
## S3 method for class 'numeric'
backtransformPrincipalCurve(X, ...)
X 
An NxK 
fit 
An MxL principalcurve fit object of class

dimensions 
An (optional) subset of of D dimensions all in [1,L] to be returned (and backtransform). 
targetDimension 
An (optional) index specifying the dimension
in [1,L] to be used as the target dimension of the 
... 
Passed internally to 
Each column in X ("dimension") is backtransformed independently of the others.
The backtransformed NxK (or NxD) matrix
.
By default, the backtransform is such that afterward the signals are
approximately proportional to the (first) principal curve as fitted
by fitPrincipalCurve
(). This scale and origin of this
principal curve is not uniquely defined.
If targetDimension
is specified, then the backtransformed signals
are approximately proportional to the signals of the target dimension,
and the signals in the target dimension are unchanged.
Argument dimensions
can be used to backtransform a subset of
dimensions (K) based on a subset of the fitted dimensions (L).
If $K = L$
, then both X
and fit
is subsetted.
If $K <> L$
, then it is assumed that X
is already
subsetted/expanded and only fit
is subsetted.
# Consider the case where K=4 measurements have been done
# for the same underlying signals 'x'. The different measurements
# have different systematic variation
#
# y_k = f(x_k) + eps_k; k = 1,...,K.
#
# In this example, we assume nonlinear measurement functions
#
# f(x) = a + b*x + x^c + eps(b*x)
#
# where 'a' is an offset, 'b' a scale factor, and 'c' an exponential.
# We also assume heteroscedastic zeromean noise with standard
# deviation proportional to the rescaled underlying signal 'x'.
#
# Furthermore, we assume that measurements k=2 and k=3 undergo the
# same transformation, which may illustrate that the come from
# the same batch. However, when *fitting* the model below we
# will assume they are independent.
# Transforms
a < c(2, 15, 15, 3)
b < c(2, 3, 3, 4)
c < c(1, 2, 2, 1/2)
K < length(a)
# The true signal
N < 1000
x < rexp(N)
# The noise
bX < outer(b,x)
E < apply(bX, MARGIN=2, FUN=function(x) rnorm(K, mean=0, sd=0.1*x))
# The transformed signals with noise
Xc < t(sapply(c, FUN=function(c) x^c))
Y < a + bX + Xc + E
Y < t(Y)
#                                  
# Fit principal curve
#                                  
# Fit principal curve through Y = (y_1, y_2, ..., y_K)
fit < fitPrincipalCurve(Y)
# Flip direction of 'lambda'?
rho < cor(fit$lambda, Y[,1], use="complete.obs")
flip < (rho < 0)
if (flip) {
fit$lambda < max(fit$lambda, na.rm=TRUE)fit$lambda
}
L < ncol(fit$s)
#                                  
# Backtransform data according to model fit
#                                  
# Backtransform toward the principal curve (the "common scale")
YN1 < backtransformPrincipalCurve(Y, fit=fit)
stopifnot(ncol(YN1) == K)
# Backtransform toward the first dimension
YN2 < backtransformPrincipalCurve(Y, fit=fit, targetDimension=1)
stopifnot(ncol(YN2) == K)
# Backtransform toward the last (fitted) dimension
YN3 < backtransformPrincipalCurve(Y, fit=fit, targetDimension=L)
stopifnot(ncol(YN3) == K)
# Backtransform toward the third dimension (dimension by dimension)
# Note, this assumes that K == L.
YN4 < Y
for (cc in 1:L) {
YN4[,cc] < backtransformPrincipalCurve(Y, fit=fit,
targetDimension=1, dimensions=cc)
}
stopifnot(identical(YN4, YN2))
# Backtransform a subset toward the first dimension
# Note, this assumes that K == L.
YN5 < backtransformPrincipalCurve(Y, fit=fit,
targetDimension=1, dimensions=2:3)
stopifnot(identical(YN5, YN2[,2:3]))
stopifnot(ncol(YN5) == 2)
# Extract signals from measurement #2 and backtransform according
# its model fit. Signals are standardized to target dimension 1.
y6 < Y[,2,drop=FALSE]
yN6 < backtransformPrincipalCurve(y6, fit=fit, dimensions=2,
targetDimension=1)
stopifnot(identical(yN6, YN2[,2,drop=FALSE]))
stopifnot(ncol(yN6) == 1)
# Extract signals from measurement #2 and backtransform according
# the the model fit of measurement #3 (because we believe these
# two have undergone very similar transformations.
# Signals are standardized to target dimension 1.
y7 < Y[,2,drop=FALSE]
yN7 < backtransformPrincipalCurve(y7, fit=fit, dimensions=3,
targetDimension=1)
stopifnot(ncol(yN7) == 1)
stopifnot(cor(yN7, yN6) > 0.9999)
Weighted affine calibration of a multiple rescanned channel.
## S3 method for class 'matrix'
calibrateMultiscan(X, weights=NULL, typeOfWeights=c("datapoint"), method="L1",
constraint="diagonal", satSignal=2^16  1, ..., average=median, deviance=NULL,
project=FALSE, .fitOnly=FALSE)
X 
An NxK 
weights 
If 
typeOfWeights 
A 
method 
A 
constraint 
Constraint making the bias parameters identifiable.
See 
satSignal 
Signals equal to or above this threshold is considered saturated signals. 
... 
Other arguments passed to 
average 
A 
deviance 
A 
project 
If 
.fitOnly 
If 
Fitting is done by iterated reweighted principal component analysis (IWPCA).
If average
is specified or project
is TRUE
,
an Nx1 matrix
is returned, otherwise an NxK matrix
is returned.
If deviance
is specified, a deviance Nx1 matrix
is returned
as attribute deviance
.
In addition, the fitted model is returned as attribute modelFit
.
Affine multiscan calibration applies also to negative values, which are therefor also calibrated, if they exist.
Saturated signals in any scan are set to NA
. Thus, they will not be
used to estimate the calibration function, nor will they affect an
optional projection.
Only observations (rows) in X
that contain all finite values are
used in the estimation of the calibration functions. Thus,
observations can be excluded by setting them to NA
.
Each data point/observation, that is, each row in X
, which is a
vector of length K, can be assigned a weight in [0,1] specifying how much
it should affect the fitting of the calibration function.
Weights are given by argument weights
,
which should be a numeric
vector
of length N. Regardless of weights,
all data points are calibrated based on the fitted calibration
function.
By default, the model fit of multiscan calibration is done in $L_1$
(method="L1"
). This way, outliers affect the parameter estimates
less than ordinary leastsquare methods.
When calculating the average calibrated signal from multiple scans, by default the median is used, which further robustify against outliers.
For further robustness, downweight outliers such as saturated signals, if possible.
Tukey's biweight function is supported, but not used by default because
then a "bandwidth" parameter has to selected. This can indeed be done
automatically by estimating the standard deviation, for instance using
MAD. However, since scanner signals have heteroscedastic noise
(standard deviation is approximately proportional to the nonlogged
signal), Tukey's bandwidth parameter has to be a function of the
signal too, cf. loess
. We have experimented with this
too, but found that it does not significantly improve the robustness
compared to $L_1$
.
Moreover, using Tukey's biweight as is, that is, assuming homoscedastic
noise, seems to introduce a (scale dependent) bias in the estimates
of the offset terms.
If the scanner offsets can be assumed to be known, for instance,
from prior multiscan analyses on the scanner, then it is possible
to fit the scanner model with no (zero) offset by specifying
argument center=FALSE
.
Note that you cannot specify the offset. Instead, subtract it
from all signals before calibrating, e.g.
Xc < calibrateMultiscan(Xe, center=FALSE)
where e
is the scanner offset (a scalar).
You can assert that the model is fitted without offset by
stopifnot(all(attr(Xc, "modelFit")$adiag == 0))
.
Henrik Bengtsson
[1] H. Bengtsson, J. VallonChristersson and G. Jönsson, Calibration and assessment of channelspecific biases in microarray data with extended dynamical range, BMC Bioinformatics, 5:177, 2004.
1. Calibration and Normalization
.
normalizeAffine
().
## Not run: # For an example, see help(normalizeAffine).
Calls genotypes in a normal sample.
## S3 method for class 'numeric'
callNaiveGenotypes(y, cn=rep(2L, times = length(y)), ..., modelFit=NULL, verbose=FALSE)
y 
A 
cn 
An optional 
... 
Additional arguments passed to 
modelFit 
A optional model fit as returned
by 
verbose 
Returns a numeric
vector
of length J containing the genotype calls
in allele B fraction space, that is, in [0,1] where 1/2 corresponds
to a heterozygous call, and 0 and 1 corresponds to homozygous A
and B, respectively.
Non called genotypes have value NA
.
A missing value always gives a missing (NA
) genotype call.
Negative infinity (Inf
) always gives genotype call 0.
Positive infinity (+Inf
) always gives genotype call 1.
Henrik Bengtsson
Internally fitNaiveGenotypes
() is used to identify the thresholds.
layout(matrix(1:3, ncol=1))
par(mar=c(2,4,4,1)+0.1)
#                            
# A bimodal distribution
#                            
xAA < rnorm(n=10000, mean=0, sd=0.1)
xBB < rnorm(n=10000, mean=1, sd=0.1)
x < c(xAA,xBB)
fit < findPeaksAndValleys(x)
print(fit)
calls < callNaiveGenotypes(x, cn=rep(1,length(x)), verbose=20)
xc < split(x, calls)
print(table(calls))
xx < c(list(x),xc)
plotDensity(xx, adjust=1.5, lwd=2, col=seq_along(xx), main="(AA,BB)")
abline(v=fit$x)
#                            
# A trimodal distribution with missing values
#                            
xAB < rnorm(n=10000, mean=1/2, sd=0.1)
x < c(xAA,xAB,xBB)
x[sample(length(x), size=0.05*length(x))] < NA;
x[sample(length(x), size=0.01*length(x))] < Inf;
x[sample(length(x), size=0.01*length(x))] < +Inf;
fit < findPeaksAndValleys(x)
print(fit)
calls < callNaiveGenotypes(x)
xc < split(x, calls)
print(table(calls))
xx < c(list(x),xc)
plotDensity(xx, adjust=1.5, lwd=2, col=seq_along(xx), main="(AA,AB,BB)")
abline(v=fit$x)
#                            
# A trimodal distribution with clear separation
#                            
xAA < rnorm(n=10000, mean=0, sd=0.02)
xAB < rnorm(n=10000, mean=1/2, sd=0.02)
xBB < rnorm(n=10000, mean=1, sd=0.02)
x < c(xAA,xAB,xBB)
fit < findPeaksAndValleys(x)
print(fit)
calls < callNaiveGenotypes(x)
xc < split(x, calls)
print(table(calls))
xx < c(list(x),xc)
plotDensity(xx, adjust=1.5, lwd=2, col=seq_along(xx), main="(AA',AB',BB')")
abline(v=fit$x)
Finds the shortest distance between two lines.
Consider the two lines
$x(s) = a_x + b_x*s$
and $y(t) = a_y + b_y*t$
in an Kspace where the offset and direction vector
s are $a_x$
and $b_x$
(in $R^K$
) that define the line $x(s)$
($s$
is a scalar). Similar for the line $y(t)$
.
This function finds the point $(s,t)$
for which $x(s)x(t)$
is minimal.
## Default S3 method:
distanceBetweenLines(ax, bx, ay, by, ...)
ax , bx

Offset and direction 
ay , by

Offset and direction 
... 
Not used. 
Returns the a list
containing
ax , bx

The given line 
ay , by

The given line 
s , t

The values of 
xs , yt

The values of 
distance 
The distance between the lines, i.e. 
Henrik Bengtsson
[1] M. Bard and D. Himel, The Minimum Distance Between Two
Lines in nSpace, September 2001, Advisor Dennis Merino.
[2] Dan Sunday, Distance between 3D Lines and Segments,
Jan 2016, https://www.geomalgorithms.com/algorithms.html.
for (zzz in 0) {
# This example requires plot3d() in R.basic [http://www.braju.com/R/]
if (!require(pkgName < "R.basic", character.only=TRUE)) break
layout(matrix(1:4, nrow=2, ncol=2, byrow=TRUE))
############################################################
# Lines in twodimensions
############################################################
x < list(a=c(1,0), b=c(1,2))
y < list(a=c(0,2), b=c(1,1))
fit < distanceBetweenLines(ax=x$a, bx=x$b, ay=y$a, by=y$b)
xlim < ylim < c(1,8)
plot(NA, xlab="", ylab="", xlim=ylim, ylim=ylim)
# Highlight the offset coordinates for both lines
points(t(x$a), pch="+", col="red")
text(t(x$a), label=expression(a[x]), adj=c(1,0.5))
points(t(y$a), pch="+", col="blue")
text(t(y$a), label=expression(a[y]), adj=c(1,0.5))
v < c(1,1)*10;
xv < list(x=x$a[1]+x$b[1]*v, y=x$a[2]+x$b[2]*v)
yv < list(x=y$a[1]+y$b[1]*v, y=y$a[2]+y$b[2]*v)
lines(xv, col="red")
lines(yv, col="blue")
points(t(fit$xs), cex=2.0, col="red")
text(t(fit$xs), label=expression(x(s)), adj=c(+2,0.5))
points(t(fit$yt), cex=1.5, col="blue")
text(t(fit$yt), label=expression(y(t)), adj=c(1,0.5))
print(fit)
############################################################
# Lines in threedimensions
############################################################
x < list(a=c(0,0,0), b=c(1,1,1)) # The 'diagonal'
y < list(a=c(2,1,2), b=c(2,1,3)) # A 'fitted' line
fit < distanceBetweenLines(ax=x$a, bx=x$b, ay=y$a, by=y$b)
xlim < ylim < zlim < c(1,3)
dummy < t(c(1,1,1))*100;
# Coordinates for the lines in 3d
v < seq(10,10, by=1);
xv < list(x=x$a[1]+x$b[1]*v, y=x$a[2]+x$b[2]*v, z=x$a[3]+x$b[3]*v)
yv < list(x=y$a[1]+y$b[1]*v, y=y$a[2]+y$b[2]*v, z=y$a[3]+y$b[3]*v)
for (theta in seq(30,140,length.out=3)) {
plot3d(dummy, theta=theta, phi=30, xlab="", ylab="", zlab="",
xlim=ylim, ylim=ylim, zlim=zlim)
# Highlight the offset coordinates for both lines
points3d(t(x$a), pch="+", col="red")
text3d(t(x$a), label=expression(a[x]), adj=c(1,0.5))
points3d(t(y$a), pch="+", col="blue")
text3d(t(y$a), label=expression(a[y]), adj=c(1,0.5))
# Draw the lines
lines3d(xv, col="red")
lines3d(yv, col="blue")
# Draw the two points that are closest to each other
points3d(t(fit$xs), cex=2.0, col="red")
text3d(t(fit$xs), label=expression(x(s)), adj=c(+2,0.5))
points3d(t(fit$yt), cex=1.5, col="blue")
text3d(t(fit$yt), label=expression(y(t)), adj=c(1,0.5))
# Draw the distance between the two points
lines3d(rbind(fit$xs,fit$yt), col="purple", lwd=2)
}
print(fit)
} # for (zzz in 0)
rm(zzz)
Robust fit of linear subspace through multidimensional data.
## S3 method for class 'matrix'
fitIWPCA(X, constraint=c("diagonal", "baseline", "max"), baselineChannel=NULL, ...,
aShift=rep(0, times = ncol(X)), Xmin=NULL)
X 
NxK 
constraint 
A If If If 
baselineChannel 
Index of channel toward which all other
channels are conform.
This argument is required if 
... 
Additional arguments accepted by 
aShift , Xmin

For internal use only. 
This method uses reweighted principal component analysis (IWPCA)
to fit a the model $y_n = a + bx_n + eps_n$
where $y_n$
,
$a$
, $b$
, and $eps_n$
are vector of the K and $x_n$
is a scalar.
The algorithm is:
For iteration i:
1) Fit a line $L$
through the data close using weighted PCA
with weights $\{w_n\}$
. Let
$r_n = \{r_{n,1},...,r_{n,K}\}$
be the $K$
principal components.
2) Update the weights as
$w_n < 1 / \sum_{2}^{K} (r_{n,k} + \epsilon_r)$
where we have used the residuals of all but the first principal
component.
3) Find the point a on $L$
that is closest to the
line $D=(1,1,...,1)$
. Similarly, denote the point on D that is
closest to $L$
by $t=a*(1,1,...,1)$
.
Returns a list
that contains estimated parameters and algorithm
details;
a 
A 
b 
A 
adiag 
If identifiability constraint 
eigen 
A KxK 
converged 

nbrOfIterations 
The number of iterations for the algorithm to converge, or zero if it did not converge. 
t0 
Internal parameter estimates, which contains no more information than the above listed elements. 
t 
Always 
Henrik Bengtsson
This is an internal method used by the calibrateMultiscan
()
and normalizeAffine
() methods.
Internally the function iwpca
() is used to fit a line
through the data cloud and the function distanceBetweenLines
() to
find the closest point to the diagonal (1,1,...,1).
Fit naive genotype model from a normal sample.
## S3 method for class 'numeric'
fitNaiveGenotypes(y, cn=rep(2L, times = length(y)), subsetToFit=NULL,
flavor=c("density", "fixed"), adjust=1.5, ..., censorAt=c(0.1, 1.1), verbose=FALSE)
y 
A 
cn 
An optional 
subsetToFit 
An optional 
flavor 
A 
adjust 
A positive 
... 
Additional arguments passed to 
censorAt 
A 
verbose 
Henrik Bengtsson
To call genotypes see callNaiveGenotypes
().
Internally findPeaksAndValleys
() is used to identify the thresholds.
Fit a principal curve in K dimensions.
## S3 method for class 'matrix'
fitPrincipalCurve(X, ..., verbose=FALSE)
X 
An NxK 
... 
Other arguments passed to 
verbose 
Returns a principal_curve object (which is a list
).
See principal_curve
for more details.
The estimation of the normalization function will only be made
based on complete observations, i.e. observations that contains no NA
values in any of the channels.
Henrik Bengtsson
[1] Hastie, T. and Stuetzle, W, Principal Curves, JASA, 1989.
[2] H. Bengtsson, A. Ray, P. Spellman and T.P. Speed, A singlesample method for normalizing and combining fullresolutioncopy numbers from multiple platforms, labs and analysis methods, Bioinformatics, 2009.
backtransformPrincipalCurve
().
principal_curve
.
# Simulate data from the model y < a + bx + x^c + eps(bx)
J < 1000
x < rexp(J)
a < c(2,15,3)
b < c(2,3,4)
c < c(1,2,1/2)
bx < outer(b,x)
xc < t(sapply(c, FUN=function(c) x^c))
eps < apply(bx, MARGIN=2, FUN=function(x) rnorm(length(b), mean=0, sd=0.1*x))
y < a + bx + xc + eps
y < t(y)
# Fit principal curve through (y_1, y_2, y_3)
fit < fitPrincipalCurve(y, verbose=TRUE)
# Flip direction of 'lambda'?
rho < cor(fit$lambda, y[,1], use="complete.obs")
flip < (rho < 0)
if (flip) {
fit$lambda < max(fit$lambda, na.rm=TRUE)fit$lambda
}
# Backtransform (y_1, y_2, y_3) to be proportional to each other
yN < backtransformPrincipalCurve(y, fit=fit)
# Same backtransformation dimension by dimension
yN2 < y
for (cc in 1:ncol(y)) {
yN2[,cc] < backtransformPrincipalCurve(y, fit=fit, dimensions=cc)
}
stopifnot(identical(yN2, yN))
xlim < c(0, 1.04*max(x))
ylim < range(c(y,yN), na.rm=TRUE)
# Pairwise signals vs x before and after transform
layout(matrix(1:4, nrow=2, byrow=TRUE))
par(mar=c(4,4,3,2)+0.1)
for (cc in 1:3) {
ylab < substitute(y[c], env=list(c=cc))
plot(NA, xlim=xlim, ylim=ylim, xlab="x", ylab=ylab)
abline(h=a[cc], lty=3)
mtext(side=4, at=a[cc], sprintf("a=%g", a[cc]),
cex=0.8, las=2, line=0, adj=1.1, padj=0.2)
points(x, y[,cc])
points(x, yN[,cc], col="tomato")
legend("topleft", col=c("black", "tomato"), pch=19,
c("orignal", "transformed"), bty="n")
}
title(main="Pairwise signals vs x before and after transform", outer=TRUE, line=2)
# Pairwise signals before and after transform
layout(matrix(1:4, nrow=2, byrow=TRUE))
par(mar=c(4,4,3,2)+0.1)
for (rr in 3:2) {
ylab < substitute(y[c], env=list(c=rr))
for (cc in 1:2) {
if (cc == rr) {
plot.new()
next
}
xlab < substitute(y[c], env=list(c=cc))
plot(NA, xlim=ylim, ylim=ylim, xlab=xlab, ylab=ylab)
abline(a=0, b=1, lty=2)
points(y[,c(cc,rr)])
points(yN[,c(cc,rr)], col="tomato")
legend("topleft", col=c("black", "tomato"), pch=19,
c("orignal", "transformed"), bty="n")
}
}
title(main="Pairwise signals before and after transform", outer=TRUE, line=2)
Fitting a smooth curve through paired (x,y) data.
## S3 method for class 'matrix'
fitXYCurve(X, weights=NULL, typeOfWeights=c("datapoint"), method=c("loess", "lowess",
"spline", "robustSpline"), bandwidth=NULL, satSignal=2^16  1, ...)
X 
An Nx2 
weights 
If 
typeOfWeights 
A 
method 

bandwidth 
A 
satSignal 
Signals equal to or above this threshold will not be used in the fitting. 
... 
Not used. 
A named list
structure of class XYCurve
.
The estimation of the function will only be made based on complete
nonsaturated observations, i.e. observations that contains no NA
values nor saturated values as defined by satSignal
.
Each data point, that is, each row in X
, which is a
vector of length 2, can be assigned a weight in [0,1] specifying how much
it should affect the fitting of the normalization function.
Weights are given by argument weights
, which should be a numeric
vector
of length N.
Note that the lowess and the spline method only support zeroone {0,1} weights. For such methods, all weights that are less than a half are set to zero.
For loess
, the arguments family="symmetric"
,
degree=1
, span=3/4
,
control=loess.control(trace.hat="approximate"
,
iterations=5
, surface="direct")
are used.
Henrik Bengtsson
# Simulate data from the model y < a + bx + x^c + eps(bx)
x < rexp(1000)
a < c(2,15)
b < c(2,1)
c < c(1,2)
bx < outer(b,x)
xc < t(sapply(c, FUN=function(c) x^c))
eps < apply(bx, MARGIN=2, FUN=function(x) rnorm(length(x), mean=0, sd=0.1*x))
Y < a + bx + xc + eps
Y < t(Y)
lim < c(0,70)
plot(Y, xlim=lim, ylim=lim)
# Fit principal curve through a subset of (y_1, y_2)
subset < sample(nrow(Y), size=0.3*nrow(Y))
fit < fitXYCurve(Y[subset,], bandwidth=0.2)
lines(fit, col="red", lwd=2)
# Backtransform (y_1, y_2) keeping y_1 unchanged
YN < backtransformXYCurve(Y, fit=fit)
points(YN, col="blue")
abline(a=0, b=1, col="red", lwd=2)
Fits an Rdimensional hyperplane using iterative reweighted PCA.
## S3 method for class 'matrix'
iwpca(X, w=NULL, R=1, method=c("symmetric", "bisquare", "tricube", "L1"), maxIter=30,
acc=1e04, reps=0.02, fit0=NULL, ...)
X 
NtimesK 
w 
An N 
R 
Number of principal components to fit. By default a line is fitted. 
method 
If 
maxIter 
Maximum number of iterations. 
acc 
The (Euclidean) distance between two subsequent parameters fit for which the algorithm is considered to have converged. 
reps 
Small value to be added to the residuals before the the weights are calculated based on their inverse. This is to avoid infinite weights. 
fit0 
A 
... 
Additional arguments accepted by 
This method uses weighted principal component analysis (WPCA) to fit a
Rdimensional hyperplane through the data with initial internal
weights all equal.
At each iteration the internal weights are recalculated based on
the "residuals".
If method=="L1"
, the internal weights are 1 / sum(abs(r) + reps).
This is the same as method=function(r) 1/sum(abs(r)+reps)
.
The "residuals" are orthogonal Euclidean distance of the principal
components R,R+1,...,K.
In each iteration before doing WPCA, the internal weighted are
multiplied by the weights given by argument w
, if specified.
Returns the fit (a list
) from the last call to wpca
()
with the additional elements nbrOfIterations
and
converged
.
Henrik Bengtsson
Internally wpca
() is used for calculating the weighted PCA.
for (zzz in 0) {
# This example requires plot3d() in R.basic [http://www.braju.com/R/]
if (!require(pkgName < "R.basic", character.only=TRUE)) break
# Simulate data from the model y < a + bx + eps(bx)
x < rexp(1000)
a < c(2,15,3)
b < c(2,3,4)
bx < outer(b,x)
eps < apply(bx, MARGIN=2, FUN=function(x) rnorm(length(x), mean=0, sd=0.1*x))
y < a + bx + eps
y < t(y)
# Add some outliers by permuting the dimensions for 1/10 of the observations
idx < sample(1:nrow(y), size=1/10*nrow(y))
y[idx,] < y[idx,c(2,3,1)]
# Plot the data with fitted lines at four different view points
opar < par(mar=c(1,1,1,1)+0.1)
N < 4
layout(matrix(1:N, nrow=2, byrow=TRUE))
theta < seq(0,270,length.out=N)
phi < rep(20, length.out=N)
xlim < ylim < zlim < c(0,45);
persp < list();
for (kk in seq_along(theta)) {
# Plot the data
persp[[kk]] < plot3d(y, theta=theta[kk], phi=phi[kk], xlim=xlim, ylim=ylim, zlim=zlim)
}
# Weights on the observations
# Example a: Equal weights
w < NULL
# Example b: More weight on the outliers (uncomment to test)
w < rep(1, length(x)); w[idx] < 0.8
# ...and show all iterations too with different colors.
maxIter < c(seq(1,20,length.out=10),Inf)
col < topo.colors(length(maxIter))
# Show the fitted value for every iteration
for (ii in seq_along(maxIter)) {
# Fit a line using IWPCA through data
fit < iwpca(y, w=w, maxIter=maxIter[ii], swapDirections=TRUE)
ymid < fit$xMean
d0 < apply(y, MARGIN=2, FUN=min)  ymid
d1 < apply(y, MARGIN=2, FUN=max)  ymid
b < fit$vt[1,]
y0 < b * max(abs(d0))
y1 < b * max(abs(d1))
yline < matrix(c(y0,y1), nrow=length(b), ncol=2)
yline < yline + ymid
for (kk in seq_along(theta)) {
# Set pane to draw in
par(mfg=c((kk1) %/% 2, (kk1) %% 2) + 1);
# Set the viewpoint of the pane
options(persp.matrix=persp[[kk]]);
# Get the first principal component
points3d(t(ymid), col=col[ii])
lines3d(t(yline), col=col[ii])
# Highlight the last one
if (ii == length(maxIter))
lines3d(t(yline), col="red", lwd=3)
}
}
par(opar)
} # for (zzz in 0)
rm(zzz)
Median polish.
## S3 method for class 'matrix'
medianPolish(X, tol=0.01, maxIter=10L, na.rm=NA, ..., .addExtra=TRUE)
X 
NtimesK 
tol 
A 
maxIter 
Maximum number of iterations. 
na.rm 
If 
.addExtra 
If 
... 
Not used. 
The implementation of this method give identical estimates as
medpolish
, but is about 35 times more efficient when
there are no NA
values.
Returns a named list
structure with elements:
overall 
The fitted constant term. 
row 
The fitted row effect. 
col 
The fitted column effect. 
residuals 
The residuals. 
converged 
If 
Henrik Bengtsson
# Deaths from sport parachuting; from ABC of EDA, p.224:
deaths < matrix(c(14,15,14, 7,4,7, 8,2,10, 15,9,10, 0,2,0), ncol=3, byrow=TRUE)
rownames(deaths) < c("124", "2574", "75199", "200++", "NA")
colnames(deaths) < 1973:1975
print(deaths)
mp < medianPolish(deaths)
mp1 < medpolish(deaths, trace=FALSE)
print(mp)
ff < c("overall", "row", "col", "residuals")
stopifnot(all.equal(mp[ff], mp1[ff]))
# Validate decomposition:
stopifnot(all.equal(deaths, mp$overall+outer(mp$row,mp$col,"+")+mp$resid))
Weighted affine normalization between channels and arrays.
This method will remove curvature in the M vs A plots that are due to an affine transformation of the data. In other words, if there are (small or large) biases in the different (red or green) channels, biases that can be equal too, you will get curvature in the M vs A plots and this type of curvature will be removed by this normalization method.
Moreover, if you normalize all slides at once, this method will also bring the signals on the same scale such that the logratios for different slides are comparable. Thus, do not normalize the scale of the logratios between slides afterward.
It is recommended to normalize as many slides as possible in one run. The result is that if creating logratios between any channels and any slides, they will contain as little curvature as possible.
Furthermore, since the relative scale between any two channels on any two slides will be one if one normalizes all slides (and channels) at once it is possible to add or multiply with the same constant to all channels/arrays without introducing curvature. Thus, it is easy to rescale the data afterwards as demonstrated in the example.
## S3 method for class 'matrix'
normalizeAffine(X, weights=NULL, typeOfWeights=c("datapoint"), method="L1",
constraint=0.05, satSignal=2^16  1, ..., .fitOnly=FALSE)
X 
An NxK 
weights 
If 
typeOfWeights 
A 
method 
A 
constraint 
Constraint making the bias parameters identifiable.
See 
satSignal 
Signals equal to or above this threshold will not be used in the fitting. 
... 
Other arguments passed to 
.fitOnly 
If 
A line is fitted robustly through the $(y_R,y_G)$
observations
using an iterated reweighted principal component analysis (IWPCA),
which minimized the residuals that are orthogonal to the fitted line.
Each observation is downweighted by the inverse of the absolute
residuals, i.e. the fit is done in $L_1$
.
A NxK matrix
of the normalized channels.
The fitted model is returned as attribute modelFit
.
Affine normalization applies equally well to negative values. Thus,
contrary to normalization methods applied to logratios, such as curvefit
normalization methods, affine normalization, will not set these to NA
.
Data points that are saturated in one or more channels are not used to estimate the normalization function, but they are normalized.
The estimation of the affine normalization function will only be made
based on complete nonsaturated observations, i.e. observations that
contains no NA
values nor saturated values as defined by satSignal
.
Each data point/observation, that is, each row in X
, which is a
vector of length K, can be assigned a weight in [0,1] specifying how much
it should affect the fitting of the affine normalization function.
Weights are given by argument weights
,
which should be a numeric
vector
of length N. Regardless of weights,
all data points are normalized based on the fitted normalization
function.
By default, the model fit of affine normalization is done in $L_1$
(method="L1"
). This way, outliers affect the parameter estimates
less than ordinary leastsquare methods.
For further robustness, downweight outliers such as saturated signals, if possible.
We do not use Tukey's biweight function for reasons similar to those
outlined in calibrateMultiscan
().
If the channel offsets can be assumed to be known, then it is
possible to fit the affine model with no (zero) offset, which
formally is a linear (proportional) model, by specifying
argument center=FALSE
.
In order to do this, the channel offsets have to be subtracted
from the signals manually before normalizing, e.g.
Xa < t(t(X)a)
where e
is vector
of length
ncol(X)
. Then normalize by
Xn < normalizeAffine(Xa, center=FALSE)
.
You can assert that the model is fitted without offset by
stopifnot(all(attr(Xn, "modelFit")$adiag == 0))
.
Henrik Bengtsson
[1] Henrik Bengtsson and Ola Hössjer, Methodological Study of Affine Transformations of Gene Expression Data, Methodological study of affine transformations of gene expression data with proposed robust nonparametric multidimensional normalization method, BMC Bioinformatics, 2006, 7:100.
pathname < system.file("dataex", "PMTRGData.dat", package="aroma.light")
rg < read.table(pathname, header=TRUE, sep="\t")
nbrOfScans < max(rg$slide)
rg < as.list(rg)
for (field in c("R", "G"))
rg[[field]] < matrix(as.double(rg[[field]]), ncol=nbrOfScans)
rg$slide < rg$spot < NULL
rg < as.matrix(as.data.frame(rg))
colnames(rg) < rep(c("R", "G"), each=nbrOfScans)
layout(matrix(c(1,2,0,3,4,0,5,6,7), ncol=3, byrow=TRUE))
rgC < rg
for (channel in c("R", "G")) {
sidx < which(colnames(rg) == channel)
channelColor < switch(channel, R="red", G="green")
#                                
# The raw data
#                                
plotMvsAPairs(rg[,sidx])
title(main=paste("Observed", channel))
box(col=channelColor)
#                                
# The calibrated data
#                                
rgC[,sidx] < calibrateMultiscan(rg[,sidx], average=NULL)
plotMvsAPairs(rgC[,sidx])
title(main=paste("Calibrated", channel))
box(col=channelColor)
} # for (channel ...)
#                                
# The average calibrated data
#
# Note how the red signals are weaker than the green. The reason
# for this can be that the scale factor in the green channel is
# greater than in the red channel, but it can also be that there
# is a remaining relative difference in bias between the green
# and the red channel, a bias that precedes the scanning.
#                                
rgCA < rg
for (channel in c("R", "G")) {
sidx < which(colnames(rg) == channel)
rgCA[,sidx] < calibrateMultiscan(rg[,sidx])
}
rgCAavg < matrix(NA_real_, nrow=nrow(rgCA), ncol=2)
colnames(rgCAavg) < c("R", "G")
for (channel in c("R", "G")) {
sidx < which(colnames(rg) == channel)
rgCAavg[,channel] < apply(rgCA[,sidx], MARGIN=1, FUN=median, na.rm=TRUE)
}
# Add some "fake" outliers
outliers < 1:600
rgCAavg[outliers,"G"] < 50000
plotMvsA(rgCAavg)
title(main="Average calibrated (AC)")
#                                
# Normalize data
#                                
# Weightdown outliers when normalizing
weights < rep(1, nrow(rgCAavg))
weights[outliers] < 0.001
# Affine normalization of channels
rgCANa < normalizeAffine(rgCAavg, weights=weights)
# It is always ok to rescale the affine normalized data if its
# done on (R,G); not on (A,M)! However, this is only needed for
# esthetic purposes.
rgCANa < rgCANa *2^1.4
plotMvsA(rgCANa)
title(main="Normalized AC")
# Curvefit (lowess) normalization
rgCANlw < normalizeLowess(rgCAavg, weights=weights)
plotMvsA(rgCANlw, col="orange", add=TRUE)
# Curvefit (loess) normalization
rgCANl < normalizeLoess(rgCAavg, weights=weights)
plotMvsA(rgCANl, col="red", add=TRUE)
# Curvefit (robust spline) normalization
rgCANrs < normalizeRobustSpline(rgCAavg, weights=weights)
plotMvsA(rgCANrs, col="blue", add=TRUE)
legend(x=0,y=16, legend=c("affine", "lowess", "loess", "r. spline"), pch=19,
col=c("black", "orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")
plotMvsMPairs(cbind(rgCANa, rgCANlw), col="orange", xlab=expression(M[affine]))
title(main="Normalized AC")
plotMvsMPairs(cbind(rgCANa, rgCANl), col="red", add=TRUE)
plotMvsMPairs(cbind(rgCANa, rgCANrs), col="blue", add=TRUE)
abline(a=0, b=1, lty=2)
legend(x=6,y=6, legend=c("lowess", "loess", "r. spline"), pch=19,
col=c("orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")
Rescales channel vectors to get the same average.
## S3 method for class 'matrix'
normalizeAverage(x, baseline=1, avg=stats::median, targetAvg=2200, ...)
## S3 method for class 'list'
normalizeAverage(x, baseline=1, avg=stats::median, targetAvg=2200, ...)
x 

baseline 
An 
avg 
A 
targetAvg 
The average that each channel should have afterwards.
If 
... 
Additional arguments passed to the 
Returns a normalized numeric
NxK matrix
(or list
of length K).
Henrik Bengtsson
Weighted curvefit normalization between a pair of channels.
This method will estimate a smooth function of the dependency between the logratios and the logintensity of the two channels and then correct the logratios (only) in order to remove the dependency. This is method is also known as intensitydependent or lowess normalization.
The curvefit methods are by nature limited to pairedchannel data. There exist at least one method trying to overcome this limitation, namely the cycliclowess [1], which applies the paired curvefit method iteratively over all pairs of channels/arrays. Cycliclowess is not implemented here.
We recommend that affine normalization [2] is used instead of curvefit normalization.
## S3 method for class 'matrix'
normalizeCurveFit(X, weights=NULL, typeOfWeights=c("datapoint"),
method=c("loess", "lowess", "spline", "robustSpline"), bandwidth=NULL,
satSignal=2^16  1, ...)
## S3 method for class 'matrix'
normalizeLoess(X, ...)
## S3 method for class 'matrix'
normalizeLowess(X, ...)
## S3 method for class 'matrix'
normalizeSpline(X, ...)
## S3 method for class 'matrix'
normalizeRobustSpline(X, ...)
X 
An Nx2 
weights 
If 
typeOfWeights 
A 
method 

bandwidth 
A 
satSignal 
Signals equal to or above this threshold will not be used in the fitting. 
... 
Not used. 
A smooth function $c(A)$
is fitted through data in $(A,M)$
,
where $M=log_2(y_2/y_1)$
and $A=1/2*log_2(y_2*y_1)$
. Data is
normalized by $M < M  c(A)$
.
Loess is by far the slowest method of the four, then lowess, and then robust spline, which iteratively calls the spline method.
A Nx2 matrix
of the normalized two channels.
The fitted model is returned as attribute modelFit
.
Nonpositive values are set to notanumber (NaN
).
Data points that are saturated in one or more channels are not used
to estimate the normalization function, but they are normalized.
The estimation of the normalization function will only be made
based on complete nonsaturated observations, i.e. observations that
contains no NA
values nor saturated values as defined by satSignal
.
Each data point, that is, each row in X
, which is a
vector of length 2, can be assigned a weight in [0,1] specifying how much
it should affect the fitting of the normalization function.
Weights are given by argument weights
, which should be a numeric
vector
of length N. Regardless of weights, all data points are
normalized based on the fitted normalization function.
Note that the lowess and the spline method only support zeroone {0,1} weights. For such methods, all weights that are less than a half are set to zero.
For loess
, the arguments family="symmetric"
,
degree=1
, span=3/4
,
control=loess.control(trace.hat="approximate"
,
iterations=5
, surface="direct")
are used.
Henrik Bengtsson
[1] M. Åstrand,
Contrast Normalization of Oligonucleotide Arrays,
Journal Computational Biology, 2003, 10, 95102.
[2] Henrik Bengtsson and Ola Hössjer, Methodological Study of Affine Transformations of Gene Expression Data, Methodological study of affine transformations of gene expression data with proposed robust nonparametric multidimensional normalization method, BMC Bioinformatics, 2006, 7:100.
pathname < system.file("dataex", "PMTRGData.dat", package="aroma.light")
rg < read.table(pathname, header=TRUE, sep="\t")
nbrOfScans < max(rg$slide)
rg < as.list(rg)
for (field in c("R", "G"))
rg[[field]] < matrix(as.double(rg[[field]]), ncol=nbrOfScans)
rg$slide < rg$spot < NULL
rg < as.matrix(as.data.frame(rg))
colnames(rg) < rep(c("R", "G"), each=nbrOfScans)
layout(matrix(c(1,2,0,3,4,0,5,6,7), ncol=3, byrow=TRUE))
rgC < rg
for (channel in c("R", "G")) {
sidx < which(colnames(rg) == channel)
channelColor < switch(channel, R="red", G="green")
#                                
# The raw data
#                                
plotMvsAPairs(rg[,sidx])
title(main=paste("Observed", channel))
box(col=channelColor)
#                                
# The calibrated data
#                                
rgC[,sidx] < calibrateMultiscan(rg[,sidx], average=NULL)
plotMvsAPairs(rgC[,sidx])
title(main=paste("Calibrated", channel))
box(col=channelColor)
} # for (channel ...)
#                                
# The average calibrated data
#
# Note how the red signals are weaker than the green. The reason
# for this can be that the scale factor in the green channel is
# greater than in the red channel, but it can also be that there
# is a remaining relative difference in bias between the green
# and the red channel, a bias that precedes the scanning.
#                                
rgCA < rg
for (channel in c("R", "G")) {
sidx < which(colnames(rg) == channel)
rgCA[,sidx] < calibrateMultiscan(rg[,sidx])
}
rgCAavg < matrix(NA_real_, nrow=nrow(rgCA), ncol=2)
colnames(rgCAavg) < c("R", "G")
for (channel in c("R", "G")) {
sidx < which(colnames(rg) == channel)
rgCAavg[,channel] < apply(rgCA[,sidx], MARGIN=1, FUN=median, na.rm=TRUE)
}
# Add some "fake" outliers
outliers < 1:600
rgCAavg[outliers,"G"] < 50000
plotMvsA(rgCAavg)
title(main="Average calibrated (AC)")
#                                
# Normalize data
#                                
# Weightdown outliers when normalizing
weights < rep(1, nrow(rgCAavg))
weights[outliers] < 0.001
# Affine normalization of channels
rgCANa < normalizeAffine(rgCAavg, weights=weights)
# It is always ok to rescale the affine normalized data if its
# done on (R,G); not on (A,M)! However, this is only needed for
# esthetic purposes.
rgCANa < rgCANa *2^1.4
plotMvsA(rgCANa)
title(main="Normalized AC")
# Curvefit (lowess) normalization
rgCANlw < normalizeLowess(rgCAavg, weights=weights)
plotMvsA(rgCANlw, col="orange", add=TRUE)
# Curvefit (loess) normalization
rgCANl < normalizeLoess(rgCAavg, weights=weights)
plotMvsA(rgCANl, col="red", add=TRUE)
# Curvefit (robust spline) normalization
rgCANrs < normalizeRobustSpline(rgCAavg, weights=weights)
plotMvsA(rgCANrs, col="blue", add=TRUE)
legend(x=0,y=16, legend=c("affine", "lowess", "loess", "r. spline"), pch=19,
col=c("black", "orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")
plotMvsMPairs(cbind(rgCANa, rgCANlw), col="orange", xlab=expression(M[affine]))
title(main="Normalized AC")
plotMvsMPairs(cbind(rgCANa, rgCANl), col="red", add=TRUE)
plotMvsMPairs(cbind(rgCANa, rgCANrs), col="blue", add=TRUE)
abline(a=0, b=1, lty=2)
legend(x=6,y=6, legend=c("lowess", "loess", "r. spline"), pch=19,
col=c("orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")
Rescales channel vectors to get the same average.
## S3 method for class 'list'
normalizeDifferencesToAverage(x, baseline=1, FUN=median, ...)
x 

baseline 
An 
FUN 
A 
... 
Additional arguments passed to the 
Returns a normalized list
of length K.
Henrik Bengtsson
# Simulate three shifted tracks of different lengths with same profiles
ns < c(A=2, B=1, C=0.25)*1000
xx < lapply(ns, FUN=function(n) { seq(from=1, to=max(ns), length.out=n) })
zz < mapply(seq_along(ns), ns, FUN=function(z,n) rep(z,n))
yy < list(
A = rnorm(ns["A"], mean=0, sd=0.5),
B = rnorm(ns["B"], mean=5, sd=0.4),
C = rnorm(ns["C"], mean=5, sd=1.1)
)
yy < lapply(yy, FUN=function(y) {
n < length(y)
y[1:(n/2)] < y[1:(n/2)] + 2
y[1:(n/4)] < y[1:(n/4)]  4
y
})
# Shift all tracks toward the first track
yyN < normalizeDifferencesToAverage(yy, baseline=1)
# The baseline channel is not changed
stopifnot(identical(yy[[1]], yyN[[1]]))
# Get the estimated parameters
fit < attr(yyN, "fit")
# Plot the tracks
layout(matrix(1:2, ncol=1))
x < unlist(xx)
col < unlist(zz)
y < unlist(yy)
yN < unlist(yyN)
plot(x, y, col=col, ylim=c(10,10))
plot(x, yN, col=col, ylim=c(10,10))
Normalizes signals for PCR fragmentlength effects. Some or all signals are used to estimated the normalization function. All signals are normalized.
## Default S3 method:
normalizeFragmentLength(y, fragmentLengths, targetFcns=NULL, subsetToFit=NULL,
onMissing=c("ignore", "median"), .isLogged=TRUE, ..., .returnFit=FALSE)
y 
A 
fragmentLengths 

targetFcns 
An optional 
subsetToFit 
The subset of data points used to fit the
normalization function.
If 
onMissing 
Specifies how data points for which there is no
fragment length is normalized.
If 
.isLogged 
A 
... 
Additional arguments passed to 
.returnFit 
A 
Returns a numeric
vector
of the normalized signals.
It is assumed that the fragmentlength effects from multiple enzymes added (with equal weights) on the intensity scale. The fragmentlength effects are fitted for each enzyme separately based on units that are exclusively for that enzyme. If there are no or very such units for an enzyme, the assumptions of the model are not met and the fit will fail with an error. Then, from the above singleenzyme fits the average effect across enzymes is the calculated for each unit that is on multiple enzymes.
It is possible to specify custom target function effects for each
enzyme via argument targetFcns
. This argument has to be a
list
containing one function
per enzyme and ordered in the same
order as the enzyme are in the columns of argument
fragmentLengths
.
For instance, if one wish to normalize the signals such that their
mean signal as a function of fragment length effect is constantly
equal to 2200 (or the intensity scale), the use
targetFcns=function(fl, ...) log2(2200)
which completely
ignores fragmentlength argument 'fl' and always returns a
constant.
If two enzymes are used, then use
targetFcns=rep(list(function(fl, ...) log2(2200)), 2)
.
Note, if targetFcns
is NULL
, this corresponds to
targetFcns=rep(list(function(fl, ...) 0), ncol(fragmentLengths))
.
Alternatively, if one wants to only apply minimal corrections to the signals, then one can normalize toward target functions that correspond to the fragmentlength effect of the average array.
Henrik Bengtsson
[1] H. Bengtsson, R. Irizarry, B. Carvalho, and T. Speed, Estimation and assessment of raw copy numbers at the single locus level, Bioinformatics, 2008.
#                                  
# Example 1: Singleenzyme fragmentlength normalization of 6 arrays
#                                  
# Number samples
I < 9
# Number of loci
J < 1000
# Fragment lengths
fl < seq(from=100, to=1000, length.out=J)
# Simulate data points with unknown fragment lengths
hasUnknownFL < seq(from=1, to=J, by=50)
fl[hasUnknownFL] < NA
# Simulate data
y < matrix(0, nrow=J, ncol=I)
maxY < 12
for (kk in 1:I) {
k < runif(n=1, min=3, max=5)
mu < function(fl) {
mu < rep(maxY, length(fl))
ok < !is.na(fl)
mu[ok] < mu[ok]  fl[ok]^{1/k}
mu
}
eps < rnorm(J, mean=0, sd=1)
y[,kk] < mu(fl) + eps
}
# Normalize data (to a zero baseline)
yN < apply(y, MARGIN=2, FUN=function(y) {
normalizeFragmentLength(y, fragmentLengths=fl, onMissing="median")
})
# The correction factors
rho < yyN
print(summary(rho))
# The correction for units with unknown fragment lengths
# equals the median correction factor of all other units
print(summary(rho[hasUnknownFL,]))
# Plot raw data
layout(matrix(1:9, ncol=3))
xlim < c(0,max(fl, na.rm=TRUE))
ylim < c(0,max(y, na.rm=TRUE))
xlab < "Fragment length"
ylab < expression(log2(theta))
for (kk in 1:I) {
plot(fl, y[,kk], xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab)
ok < (is.finite(fl) & is.finite(y[,kk]))
lines(lowess(fl[ok], y[ok,kk]), col="red", lwd=2)
}
# Plot normalized data
layout(matrix(1:9, ncol=3))
ylim < c(1,1)*max(y, na.rm=TRUE)/2
for (kk in 1:I) {
plot(fl, yN[,kk], xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab)
ok < (is.finite(fl) & is.finite(y[,kk]))
lines(lowess(fl[ok], yN[ok,kk]), col="blue", lwd=2)
}
#                                  
# Example 2: Twoenzyme fragmentlength normalization of 6 arrays
#                                  
set.seed(0xbeef)
# Number samples
I < 5
# Number of loci
J < 3000
# Fragment lengths (two enzymes)
fl < matrix(0, nrow=J, ncol=2)
fl[,1] < seq(from=100, to=1000, length.out=J)
fl[,2] < seq(from=1000, to=100, length.out=J)
# Let 1/2 of the units be on both enzymes
fl[seq(from=1, to=J, by=4),1] < NA
fl[seq(from=2, to=J, by=4),2] < NA
# Let some have unknown fragment lengths
hasUnknownFL < seq(from=1, to=J, by=15)
fl[hasUnknownFL,] < NA
# Sty/Nsp mixing proportions:
rho < rep(1, I)
rho[1] < 1/3; # Less Sty in 1st sample
rho[3] < 3/2; # More Sty in 3rd sample
# Simulate data
z < array(0, dim=c(J,2,I))
maxLog2Theta < 12
for (ii in 1:I) {
# Common effect for both enzymes
mu < function(fl) {
k < runif(n=1, min=3, max=5)
mu < rep(maxLog2Theta, length(fl))
ok < is.finite(fl)
mu[ok] < mu[ok]  fl[ok]^{1/k}
mu
}
# Calculate the effect for each data point
for (ee in 1:2) {
z[,ee,ii] < mu(fl[,ee])
}
# Update the Sty/Nsp mixing proportions
ee < 2
z[,ee,ii] < rho[ii]*z[,ee,ii]
# Add random errors
for (ee in 1:2) {
eps < rnorm(J, mean=0, sd=1/sqrt(2))
z[,ee,ii] < z[,ee,ii] + eps
}
}
hasFl < is.finite(fl)
unitSets < list(
nsp = which( hasFl[,1] & !hasFl[,2]),
sty = which(!hasFl[,1] & hasFl[,2]),
both = which( hasFl[,1] & hasFl[,2]),
none = which(!hasFl[,1] & !hasFl[,2])
)
# The observed data is a mix of two enzymes
theta < matrix(NA_real_, nrow=J, ncol=I)
# Singleenzyme units
for (ee in 1:2) {
uu < unitSets[[ee]]
theta[uu,] < 2^z[uu,ee,]
}
# Bothenzyme units (sum on intensity scale)
uu < unitSets$both
theta[uu,] < (2^z[uu,1,]+2^z[uu,2,])/2
# Missing units (sample from the others)
uu < unitSets$none
theta[uu,] < apply(theta, MARGIN=2, sample, size=length(uu))
# Calculate target array
thetaT < rowMeans(theta, na.rm=TRUE)
targetFcns < list()
for (ee in 1:2) {
uu < unitSets[[ee]]
fit < lowess(fl[uu,ee], log2(thetaT[uu]))
class(fit) < "lowess"
targetFcns[[ee]] < function(fl, ...) {
predict(fit, newdata=fl)
}
}
# Fit model only to a subset of the data
subsetToFit < setdiff(1:J, seq(from=1, to=J, by=10))
# Normalize data (to a target baseline)
thetaN < matrix(NA_real_, nrow=J, ncol=I)
fits < vector("list", I)
for (ii in 1:I) {
lthetaNi < normalizeFragmentLength(log2(theta[,ii]), targetFcns=targetFcns,
fragmentLengths=fl, onMissing="median",
subsetToFit=subsetToFit, .returnFit=TRUE)
fits[[ii]] < attr(lthetaNi, "modelFit")
thetaN[,ii] < 2^lthetaNi
}
# Plot raw data
xlim < c(0, max(fl, na.rm=TRUE))
ylim < c(0, max(log2(theta), na.rm=TRUE))
Mlim < c(1,1)*4
xlab < "Fragment length"
ylab < expression(log2(theta))
Mlab < expression(M==log[2](theta/theta[R]))
layout(matrix(1:(3*I), ncol=I, byrow=TRUE))
for (ii in 1:I) {
plot(NA, xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab, main="raw")
# Singleenzyme units
for (ee in 1:2) {
# The raw data
uu < unitSets[[ee]]
points(fl[uu,ee], log2(theta[uu,ii]), col=ee+1)
}
# Bothenzyme units (use fragmentlength for enzyme #1)
uu < unitSets$both
points(fl[uu,1], log2(theta[uu,ii]), col=3+1)
for (ee in 1:2) {
# The true effects
uu < unitSets[[ee]]
lines(lowess(fl[uu,ee], log2(theta[uu,ii])), col="black", lwd=4, lty=3)
# The estimated effects
fit < fits[[ii]][[ee]]$fit
lines(fit, col="orange", lwd=3)
muT < targetFcns[[ee]](fl[uu,ee])
lines(fl[uu,ee], muT, col="cyan", lwd=1)
}
}
# Calculate logratios
thetaR < rowMeans(thetaN, na.rm=TRUE)
M < log2(thetaN/thetaR)
# Plot normalized data
for (ii in 1:I) {
plot(NA, xlim=xlim, ylim=Mlim, xlab=xlab, ylab=Mlab, main="normalized")
# Singleenzyme units
for (ee in 1:2) {
# The normalized data
uu < unitSets[[ee]]
points(fl[uu,ee], M[uu,ii], col=ee+1)
}
# Bothenzyme units (use fragmentlength for enzyme #1)
uu < unitSets$both
points(fl[uu,1], M[uu,ii], col=3+1)
}
ylim < c(0,1.5)
for (ii in 1:I) {
data < list()
for (ee in 1:2) {
# The normalized data
uu < unitSets[[ee]]
data[[ee]] < M[uu,ii]
}
uu < unitSets$both
if (length(uu) > 0)
data[[3]] < M[uu,ii]
uu < unitSets$none
if (length(uu) > 0)
data[[4]] < M[uu,ii]
cols < seq_along(data)+1
plotDensity(data, col=cols, xlim=Mlim, xlab=Mlab, main="normalized")
abline(v=0, lty=2)
}
Normalizes the empirical distribution of one of more samples to a target distribution.
The average sample distribution is calculated either robustly or not
by utilizing either weightedMedian()
or weighted.mean()
.
A weighted method is used if any of the weights are different from one.
## S3 method for class 'numeric'
normalizeQuantileRank(x, xTarget, ties=FALSE, ...)
## S3 method for class 'list'
normalizeQuantileRank(X, xTarget=NULL, ...)
## Default S3 method:
normalizeQuantile(x, ...)
x , X

a 
xTarget 
The target empirical distribution as a sorted

ties 
Should ties in 
... 
Not used. 
Returns an object of the same shape as the input argument.
Missing values are excluded when estimating the "common" (the baseline).
Values that are NA
remain NA
after normalization.
No new NA
s are introduced.
Currently only channel weights are support due to the way quantile normalization is done. If signal weights are given, channel weights are calculated from these by taking the mean of the signal weights in each channel.
Adopted from Gordon Smyth (http://www.statsci.org/) in 2002 \& 2006. Original code by Ben Bolstad at Statistics Department, University of California.
To calculate a target distribution from a set of samples, see
averageQuantile
().
For an alternative empirical density normalization methods, see
normalizeQuantileSpline
().
# Simulate ten samples of different lengths
N < 10000
X < list()
for (kk in 1:8) {
rfcn < list(rnorm, rgamma)[[sample(2, size=1)]]
size < runif(1, min=0.3, max=1)
a < rgamma(1, shape=20, rate=10)
b < rgamma(1, shape=10, rate=10)
values < rfcn(size*N, a, b)
# "Censor" values
values[values < 0  values > 8] < NA
X[[kk]] < values
}
# Add 20% missing values
X < lapply(X, FUN=function(x) {
x[sample(length(x), size=0.20*length(x))] < NA
x
})
# Normalize quantiles
Xn < normalizeQuantile(X)
# Plot the data
layout(matrix(1:2, ncol=1))
xlim < range(X, na.rm=TRUE)
plotDensity(X, lwd=2, xlim=xlim, main="The original distributions")
plotDensity(Xn, lwd=2, xlim=xlim, main="The normalized distributions")
Normalizes the empirical distribution of a set of samples to a common target distribution.
The average sample distribution is calculated either robustly or not
by utilizing either weightedMedian()
or weighted.mean()
.
A weighted method is used if any of the weights are different from one.
## S3 method for class 'matrix'
normalizeQuantileRank(X, ties=FALSE, robust=FALSE, weights=NULL,
typeOfWeights=c("channel", "signal"), ...)
X 
a numerical NxK 
robust 
If 
ties 
Should ties in 
weights 
If 
typeOfWeights 
A 
... 
Not used. 
Returns an object of the same shape as the input argument.
Missing values are excluded when estimating the "common" (the baseline).
Values that are NA
remain NA
after normalization.
No new NA
s are introduced.
Currently only channel weights are support due to the way quantile normalization is done. If signal weights are given, channel weights are calculated from these by taking the mean of the signal weights in each channel.
Adopted from Gordon Smyth (http://www.statsci.org/) in 2002 \& 2006. Original code by Ben Bolstad at Statistics Department, University of California. Support for calculating the average sample distribution using (weighted) mean or median was added by Henrik Bengtsson.
median
, weightedMedian
,
mean
() and weighted.mean
.
normalizeQuantileSpline
().
# Simulate three samples with on average 20% missing values
N < 10000
X < cbind(rnorm(N, mean=3, sd=1),
rnorm(N, mean=4, sd=2),
rgamma(N, shape=2, rate=1))
X[sample(3*N, size=0.20*3*N)] < NA
# Normalize quantiles
Xn < normalizeQuantile(X)
# Plot the data
layout(matrix(1:2, ncol=1))
xlim < range(X, Xn, na.rm=TRUE)
plotDensity(X, lwd=2, xlim=xlim, main="The three original distributions")
plotDensity(Xn, lwd=2, xlim=xlim, main="The three normalized distributions")
Normalizes the empirical distribution of one or more samples to a target distribution. After normalization, all samples have the same average empirical density distribution.
## S3 method for class 'numeric'
normalizeQuantileSpline(x, w=NULL, xTarget, sortTarget=TRUE, robust=TRUE, ...)
## S3 method for class 'matrix'
normalizeQuantileSpline(X, w=NULL, xTarget=NULL, sortTarget=TRUE, robust=TRUE, ...)
## S3 method for class 'list'
normalizeQuantileSpline(X, w=NULL, xTarget=NULL, sortTarget=TRUE, robust=TRUE, ...)
x , X

A single ( 
w 
An optional 
xTarget 
The target empirical distribution as a sorted

sortTarget 
If 
robust 
If 
... 
Arguments passed to ( 
Returns an object of the same type and dimensions as the input.
Both argument X
and xTarget
may contain nonfinite values.
These values do not affect the estimation of the normalization function.
Missing values and other nonfinite values in X
,
remain in the output as is. No new missing values are introduced.
Henrik Bengtsson
[1] H. Bengtsson, R. Irizarry, B. Carvalho, and T. Speed, Estimation and assessment of raw copy numbers at the single locus level, Bioinformatics, 2008.
The target distribution can be calculated as the average
using averageQuantile
().
Internally either
robustSmoothSpline
(robust=TRUE
) or
smooth.spline
(robust=FALSE
) is used.
An alternative normalization method that is also normalizing the
empirical densities of samples is normalizeQuantileRank
().
Contrary to this method, that method requires that all samples are
based on the exact same set of data points and it is also more likely
to overcorrect in the tails of the distributions.
# Simulate three samples with on average 20% missing values
N < 10000
X < cbind(rnorm(N, mean=3, sd=1),
rnorm(N, mean=4, sd=2),
rgamma(N, shape=2, rate=1))
X[sample(3*N, size=0.20*3*N)] < NA
# Plot the data
layout(matrix(c(1,0,2:5), ncol=2, byrow=TRUE))
xlim < range(X, na.rm=TRUE)
plotDensity(X, lwd=2, xlim=xlim, main="The three original distributions")
Xn < normalizeQuantile(X)
plotDensity(Xn, lwd=2, xlim=xlim, main="The three normalized distributions")
plotXYCurve(X, Xn, xlim=xlim, main="The three normalized distributions")
Xn2 < normalizeQuantileSpline(X, xTarget=Xn[,1], spar=0.99)
plotDensity(Xn2, lwd=2, xlim=xlim, main="The three normalized distributions")
plotXYCurve(X, Xn2, xlim=xlim, main="The three normalized distributions")
TumorBoost [1] is a normalization method that normalizes the allele B fractions of a tumor sample given the allele B fractions and genotypes of a matched normal. The method is a singlesample (singlepair) method. It does not require total copynumber estimates. The normalization is done such that the total copy number is unchanged afterwards.
## S3 method for class 'numeric'
normalizeTumorBoost(betaT, betaN, muN=callNaiveGenotypes(betaN), preserveScale=FALSE,
flavor=c("v4", "v3", "v2", "v1"), ...)
betaT , betaN

Two 
muN 
An optional 
preserveScale 
If 
flavor 
A 
... 
Not used. 
Allele B fractions are defined as the ratio between the allele B signal and the sum of both (all) allele signals at the same locus. Allele B fractions are typically within [0,1], but may have a slightly wider support due to for instance negative noise. This is typically also the case for the returned normalized allele B fractions.
Returns a numeric
vector
of length J containing the normalized
allele B fractions for the tumor.
Attribute modelFit
is a list
containing model fit parameters.
This method provides a few different "flavors" for normalizing the
data. The following values of argument flavor
are accepted:
v4: (default) The TumorBoost method, i.e. Eqns. (8)(9) in [1].
v3: Eqn (9) in [1] is applied to both heterozygous and homozygous SNPs, which effectively is v4 where the normalized allele B fractions for homozygous SNPs becomes 0 and 1.
v2: ...
v1: TumorBoost where correction factor is forced to one, i.e.
$\eta_j=1$
. As explained in [1], this is a suboptimal
normalization method. See also the discussion in the
paragraph following Eqn (12) in [1].
As of aroma.light v1.33.3 (March 30, 2014),
argument preserveScale
no longer has a default value and has
to be specified explicitly. This is done in order to change the
default to FALSE
in a future version, while minimizing the risk
for surprises.
Allele B fractions are more or less compressed toward a half, e.g. the signals for homozygous SNPs are slightly away from zero and one. The TumorBoost method decreases the correlation in allele B fractions between the tumor and the normal conditioned on the genotype. What it does not control for is the mean level of the allele B fraction conditioned on the genotype.
By design, most flavors of the method will correct the homozygous SNPs such that their mean levels get close to the expected zero and one levels. However, the heterozygous SNPs will typically keep the same mean levels as before. One possibility is to adjust the signals such as the mean levels of the heterozygous SNPs relative to that of the homozygous SNPs is the same after as before the normalization.
If argument preserveScale=TRUE
, then SNPs that are heterozygous
(in the matched normal) are corrected for signal compression using
an estimate of signal compression based on the amount of correction
performed by TumorBoost on SNPs that are homozygous
(in the matched normal).
The option of preserving the scale is not discussed in the
TumorBoost paper [1], which presents the preserveScale=FALSE
version.
Henrik Bengtsson, Pierre Neuvial
[1] H. Bengtsson, P. Neuvial and T.P. Speed, TumorBoost: Normalization of allelespecific tumor copy numbers from a single pair of tumornormal genotyping microarrays, BMC Bioinformatics, 2010, 11:245. [PMID 20462408]
library(R.utils)
# Load data
pathname < system.file("dataex/TumorBoost,fracB,exampleData.Rbin", package="aroma.light")
data < loadObject(pathname)
attachLocally(data)
pos < position/1e6
muN < genotypeN
layout(matrix(1:4, ncol=1))
par(mar=c(2.5,4,0.5,1)+0.1)
ylim < c(0.05, 1.05)
col < rep("#999999", length(muN))
col[muN == 1/2] < "#000000"
# Allele B fractions for the normal sample
plot(pos, betaN, col=col, ylim=ylim)
# Allele B fractions for the tumor sample
plot(pos, betaT, col=col, ylim=ylim)
# TumorBoost w/ naive genotype calls
betaTN < normalizeTumorBoost(betaT=betaT, betaN=betaN, preserveScale=FALSE)
plot(pos, betaTN, col=col, ylim=ylim)
# TumorBoost w/ external multisample genotype calls
betaTNx < normalizeTumorBoost(betaT=betaT, betaN=betaN, muN=muN, preserveScale=FALSE)
plot(pos, betaTNx, col=col, ylim=ylim)
Calculating tumornormal paired allelespecific copy number stratified on genotypes. The method is a singlesample (singlepair) method. It requires paired tumornormal parentspecific copy number signals.
## S3 method for class 'numeric'
pairedAlleleSpecificCopyNumbers(thetaT, betaT, thetaN, betaN,
muN=callNaiveGenotypes(betaN), ...)
thetaT , betaT

Theta and alleleB fraction signals for the tumor. 
thetaN , betaN

Total and alleleB fraction signals for the matched normal. 
muN 
An optional 
... 
Not used. 
Returns a data.frame
with elements CT
, betaT
and muN
.
Pierre Neuvial, Henrik Bengtsson
This definition of calculating tumornormal paired ASCN is related
to how the normalizeTumorBoost
() method calculates normalized
tumor BAFs.
Plots density distributions for a set of vectors.
## S3 method for class 'data.frame'
plotDensity(X, ..., xlab=NULL)
## S3 method for class 'matrix'
plotDensity(X, ..., xlab=NULL)
## S3 method for class 'numeric'
plotDensity(X, ..., xlab=NULL)
## S3 method for class 'list'
plotDensity(X, W=NULL, xlim=NULL, ylim=NULL, xlab=NULL,
ylab="density (integrates to one)", col=1:length(X), lty=NULL, lwd=NULL, ...,
add=FALSE)
X 
A single of 
W 
(optional) weights of similar data types and dimensions as

xlim , ylim


xlab , ylab


col 
The color(s) of the curves. 
lty 
The types of curves. 
lwd 
The width of curves. 
... 

add 
If 
Henrik Bengtsson
Internally, density
is used to estimate the
empirical density.
Plot logratios vs logintensities.
## S3 method for class 'matrix'
plotMvsA(X, Alab="A", Mlab="M", Alim=c(0, 16), Mlim=c(1, 1) * diff(Alim) * aspectRatio,
aspectRatio=1, pch=".", ..., add=FALSE)
X 
Nx2 
Alab , Mlab

Labels on the x and y axes. 
Alim , Mlim

Plot range on the A and M axes. 
aspectRatio 
Aspect ratio between 
pch 
Plot symbol used. 
... 
Additional arguments accepted by 
add 
If 
Red channel is assumed to be in column one and green in column two. Logratio are calculated taking channel one over channel two.
Returns nothing.
Henrik Bengtsson
Plot logratios/logintensities for all unique pairs of data vectors.
## S3 method for class 'matrix'
plotMvsAPairs(X, Alab="A", Mlab="M", Alim=c(0, 16), Mlim=c(1, 1) * diff(Alim), pch=".",
..., add=FALSE)
X 
NxK 
Alab , Mlab

Labels on the x and y axes. 
Alim , Mlim

Plot range on the A and M axes. 
pch 
Plot symbol used. 
... 
Additional arguments accepted by 
add 
If 
Logratios and logintensities are calculated for each neighboring pair of channels (columns) and plotted. Thus, in total there will be K1 data set plotted.
The colors used for the plotted pairs are 1, 2, and so on. To change the colors, use a different color palette.
Returns nothing.
Henrik Bengtsson
Plot logratios vs logratios for all pairs of columns.
## S3 method for class 'matrix'
plotMvsMPairs(X, xlab="M", ylab="M", xlim=c(1, 1) * 6, ylim=xlim, pch=".", ...,
add=FALSE)
X 
Nx2K 
xlab , ylab

Labels on the x and y axes. 
xlim , ylim

Plot range on the x and y axes. 
pch 
Plot symbol used. 
... 
Additional arguments accepted by 
add 
If 
Logratio are calculated by over paired columns, e.g. column 1 and 2, column 3 and 4, and so on.
Returns nothing.
Henrik Bengtsson
Plot the relationship between two variables as a smooth curve.
## S3 method for class 'numeric'
plotXYCurve(x, y, col=1L, lwd=2, dlwd=1, dcol=NA, xlim=NULL, ylim=xlim, xlab=NULL,
ylab=NULL, curveFit=smooth.spline, ..., add=FALSE)
## S3 method for class 'matrix'
plotXYCurve(X, Y, col=seq_len(nrow(X)), lwd=2, dlwd=1, dcol=NA, xlim=NULL, ylim=xlim,
xlab=NULL, ylab=NULL, curveFit=smooth.spline, ..., add=FALSE)
x , y , X , Y

Two 
col 
The color of each curve.
Either a scalar specifying the same value of all curves,
or a 
lwd 
The line width of each curve.
Either a scalar specifying the same value of all curves,
or a 
dlwd 
The width of each density curve. 
dcol 
The fill color of the interior of each density curve. 
xlim , ylim

The x and y plotting limits. 
xlab , ylab

The x and y labels. 
curveFit 
The 
... 
Additional arguments passed to 
add 
If 
Returns nothing.
Data points (x,y) with nonfinite values are excluded.
Henrik Bengtsson
Fits a smoothing spline robustly using the $L_1$
norm. Currently, the
algorithm is an iterative reweighted smooth spline algorithm which
calls smooth.spline(x,y,w,...)
at each iteration with the weights
w
equal to the inverse of the absolute value of the residuals for
the last iteration step.
## Default S3 method:
robustSmoothSpline(x, y=NULL, w=NULL, ..., minIter=3, maxIter=max(minIter, 50),
method=c("L1", "symmetric"), sdCriteria=2e04, reps=1e15, tol=1e06 * IQR(x),
plotCurves=FALSE)
x 
a 
y 
responses. If 
w 
a 
... 
Other arguments passed to 
minIter 
the minimum number of iterations used to fit the smoothing spline robustly. Default value is 3. 
maxIter 
the maximum number of iterations used to fit the smoothing spline robustly. Default value is 25. 
method 
the method used to compute robustness weights at each
iteration. Default value is 
sdCriteria 
Convergence criteria, which the difference between the standard deviation of the residuals between two consecutive iteration steps. Default value is 2e4. 
reps 
Small positive number added to residuals to avoid division by zero when calculating new weights for next iteration. 
tol 
Passed to 
plotCurves 
If 
Returns an object of class smooth.spline
.
Henrik Bengtsson
This implementation of this function was adopted from
smooth.spline
of the stats package.
Because of this, this function is also licensed under GPL v2.
data(cars)
attach(cars)
plot(speed, dist, main = "data(cars) & robust smoothing splines")
# Fit a smoothing spline using L_2 norm
cars.spl < smooth.spline(speed, dist)
lines(cars.spl, col = "blue")
# Fit a smoothing spline using L_1 norm
cars.rspl < robustSmoothSpline(speed, dist)
lines(cars.rspl, col = "red")
# Fit a smoothing spline using L_2 norm with 10 degrees of freedom
lines(smooth.spline(speed, dist, df=10), lty=2, col = "blue")
# Fit a smoothing spline using L_1 norm with 10 degrees of freedom
lines(robustSmoothSpline(speed, dist, df=10), lty=2, col = "red")
legend(5,120, c(
paste("smooth.spline [C.V.] => df =",round(cars.spl$df,1)),
paste("robustSmoothSpline [C.V.] => df =",round(cars.rspl$df,1)),
"standard with s( * , df = 10)", "robust with s( * , df = 10)"
), col = c("blue","red","blue","red"), lty = c(1,1,2,2), bg='bisque')
Calculates the correlation for random pairs of observations.
## S3 method for class 'matrix'
sampleCorrelations(X, MARGIN=1, pairs=NULL, npairs=max(5000, nrow(X)), ...)
X 
An NxK 
MARGIN 
The dimension (1 or 2) in which the observations are.
If 
pairs 
If a Lx2 
npairs 
The number of correlations to calculate. 
... 
Not used. 
Returns a double
vector
of length npairs
.
Henrik Bengtsson
[1] A. Ploner, L. Miller, P. Hall, J. Bergh & Y. Pawitan. Correlation test to assess lowlevel processing of highdensity oligonucleotide microarray data. BMC Bioinformatics, 2005, vol 6.
sample
().
# Simulate 20000 genes with 10 observations each
X < matrix(rnorm(n=20000), ncol=10)
# Calculate the correlation for 5000 random gene pairs
cor < sampleCorrelations(X, npairs=5000)
print(summary(cor))
Sample tuples of elements from a set. The elements within a sampled tuple are unique, i.e. no two elements are the same.
## Default S3 method:
sampleTuples(x, size, length, ...)
x 
A set of elements to sample from. 
size 
The number of tuples to sample. 
length 
The length of each tuple. 
... 
Additional arguments passed to 
Returns a NxK matrix
where N = size
and K = length
.
Henrik Bengtsson
sample
().
pairs < sampleTuples(1:10, size=5, length=2)
print(pairs)
triples < sampleTuples(1:10, size=5, length=3)
print(triples)
# Allow tuples with repeated elements
quadruples < sampleTuples(1:3, size=5, length=4, replace=TRUE)
print(quadruples)
Calculates the (weighted) principal components of a matrix, that is, finds a new coordinate system (not unique) for representing the given multivariate data such that i) all dimensions are orthogonal to each other, and ii) all dimensions have maximal variances.
## S3 method for class 'matrix'
wpca(x, w=NULL, center=TRUE, scale=FALSE, method=c("dgesdd", "dgesvd"),
swapDirections=FALSE, ...)
x 
An NxK 
w 
An N 
center 
If 
scale 
If 
method 
If 
swapDirections 
If 
... 
Not used. 
Returns a list
with elements:
pc 
An NxK 
d 
An K 
vt 
An KxK 
xMean 
The center coordinate. 
It holds that x == t(t(fit$pc %*% fit$vt) + fit$xMean)
.
A singular value decomposition (SVD) is carried out.
Let X=mat
, then the SVD of the matrix is $X = U D V'$
, where
$U$
and $V$
are orthogonal, and $D$
is a diagonal matrix
with singular values. The principal returned by this method are $U D$
.
Internally La.svd()
(or svd()
) of the base
package is used.
For a popular and well written introduction to SVD see for instance [2].
Henrik Bengtsson
[1] J. Demmel and J. Dongarra, DOE2000 Progress Report, 2004.
https://people.eecs.berkeley.edu/~demmel/DOE2000/Report0100.html
[2] Todd Will, Introduction to the Singular Value Decomposition,
UWLa Crosse, 2004. http://websites.uwlax.edu/twill/svd/
For a iterative reweighted PCA method, see iwpca
().
For Singular Value Decomposition, see svd
().
For other implementations of Principal Component Analysis functions see
(if they are installed):
prcomp
in package stats and pca()
in package
pcurve.
for (zzz in 0) {
# This example requires plot3d() in R.basic [http://www.braju.com/R/]
if (!require(pkgName < "R.basic", character.only=TRUE)) break
# 
# A first example
# 
# Simulate data from the model y < a + bx + eps(bx)
x < rexp(1000)
a < c(2,15,3)
b < c(2,3,15)
bx < outer(b,x)
eps < apply(bx, MARGIN=2, FUN=function(x) rnorm(length(x), mean=0, sd=0.1*x))
y < a + bx + eps
y < t(y)
# Add some outliers by permuting the dimensions for 1/3 of the observations
idx < sample(1:nrow(y), size=1/3*nrow(y))
y[idx,] < y[idx,c(2,3,1)]
# Downweight the outliers W times to demonstrate how weights are used
W < 10
# Plot the data with fitted lines at four different view points
N < 4
theta < seq(0,180,length.out=N)
phi < rep(30, length.out=N)
# Use a different color for each set of weights
col < topo.colors(W)
opar < par(mar=c(1,1,1,1)+0.1)
layout(matrix(1:N, nrow=2, byrow=TRUE))
for (kk in seq_along(theta)) {
# Plot the data
plot3d(y, theta=theta[kk], phi=phi[kk])
# First, same weights for all observations
w < rep(1, length=nrow(y))
for (ww in 1:W) {
# Fit a line using IWPCA through data
fit < wpca(y, w=w, swapDirections=TRUE)
# Get the first principal component
ymid < fit$xMean
d0 < apply(y, MARGIN=2, FUN=min)  ymid
d1 < apply(y, MARGIN=2, FUN=max)  ymid
b < fit$vt[1,]
y0 < b * max(abs(d0))
y1 < b * max(abs(d1))
yline < matrix(c(y0,y1), nrow=length(b), ncol=2)
yline < yline + ymid
points3d(t(ymid), col=col)
lines3d(t(yline), col=col)
# Downweight outliers only, because here we know which they are.
w[idx] < w[idx]/2
}
# Highlight the last one
lines3d(t(yline), col="red", lwd=3)
}
par(opar)
} # for (zzz in 0)
rm(zzz)
if (dev.cur() > 1) dev.off()
# 
# A second example
# 
# Data
x < c(1,2,3,4,5)
y < c(2,4,3,3,6)
opar < par(bty="L")
opalette < palette(c("blue", "red", "black"))
xlim < ylim < c(0,6)
# Plot the data and the center mass
plot(x,y, pch=16, cex=1.5, xlim=xlim, ylim=ylim)
points(mean(x), mean(y), cex=2, lwd=2, col="blue")
# Linear regression y ~ x
fit < lm(y ~ x)
abline(fit, lty=1, col=1)
# Linear regression y ~ x through without intercept
fit < lm(y ~ x  1)
abline(fit, lty=2, col=1)
# Linear regression x ~ y
fit < lm(x ~ y)
c < coefficients(fit)
b < 1/c[2]
a < b*c[1]
abline(a=a, b=b, lty=1, col=2)
# Linear regression x ~ y through without intercept
fit < lm(x ~ y  1)
b < 1/coefficients(fit)
abline(a=0, b=b, lty=2, col=2)
# Orthogonal linear "regression"
fit < wpca(cbind(x,y))
b < fit$vt[1,2]/fit$vt[1,1]
a < fit$xMean[2]b*fit$xMean[1]
abline(a=a, b=b, lwd=2, col=3)
# Orthogonal linear "regression" without intercept
fit < wpca(cbind(x,y), center=FALSE)
b < fit$vt[1,2]/fit$vt[1,1]
a < fit$xMean[2]b*fit$xMean[1]
abline(a=a, b=b, lty=2, lwd=2, col=3)
legend(xlim[1],ylim[2], legend=c("lm(y~x)", "lm(y~x1)", "lm(x~y)",
"lm(x~y1)", "pca", "pca w/o intercept"), lty=rep(1:2,3),
lwd=rep(c(1,1,2),each=2), col=rep(1:3,each=2))
palette(opalette)
par(opar)