<%@meta language="R-vignette" content="-------------------------------- DIRECTIVES FOR R: %\VignetteIndexEntry{Parent-specific copy-number segmentation using Paired PSCBS} %\VignetteAuthor{Henrik Bengtsson} %\VignetteKeyword{copy numbers} %\VignetteKeyword{allele specific} %\VignetteKeyword{parent specific} %\VignetteKeyword{genomic aberrations} %\VignetteEngine{R.rsp::rsp} --------------------------------------------------------------------"%> <% t0 <- Sys.time() %> <% R.utils::use("R.utils") # RSP specific R.rsp <- R.oo::Package("R.rsp") withCapture <- R.utils::withCapture options("withCapture/newline"=FALSE) options(str=strOptions(strict.width="cut")) options(width=85) options(digits=3) # Graphics use("R.devices") options("devEval/args/field"="fullname") # Preferred for LaTeX devOptions("png", width=840) # Analysis use("PSCBS") PSCBS <- R.oo::Package("PSCBS") fixLocations <- function(fit, ...) { for (key in grep("(end|start)$", colnames(fit$output))) { fit$output[[key]] <- as.integer(fit$output[[key]]) } fit } # fixLocations() %> \documentclass[letter]{article} \usepackage{xspace} \usepackage{alltt} \usepackage{xcolor} \usepackage{natbib} % \citep{}, \citet{} \usepackage{graphicx} \graphicspath{{figures/}} <%------------------------------------------------------------------- Assign PDF metadata -------------------------------------------------------------------%> % PDF metadata \usepackage{hyperref} % Ideally \hypersetup{hidelinks}, but for backward compatibility: \hypersetup{pdfborder={0 0 0}} \hypersetup{ pdfauthor={<%@meta name="author"%>}, pdftitle={<%@meta name="title"%>}, pdfsubject={}, pdfkeywords={<%@meta name="keywords"%>}, pdfproducer={R.rsp v<%=R.rsp$version%> by <%=R.rsp$author%>} } % Page margins \addtolength{\oddsidemargin}{-0.5in} \addtolength{\evensidemargin}{-0.5in} \addtolength{\textwidth}{1in} \addtolength{\topmargin}{-0.5in} \addtolength{\textheight}{1in} % Placement of floats \setcounter{bottomnumber}{2} \renewcommand{\topfraction}{1.0} \renewcommand{\bottomfraction}{1.0} \renewcommand{\textfraction}{0.0} \renewcommand{\floatpagefraction}{1.0} % Macros \newcommand{\keywords}[1]{{\footnotesize{\textbf{Keywords: }#1}}\xspace} \newcommand{\pkg}[1]{\textsl{#1}\xspace} \newcommand{\file}[1]{\textsl{#1}\xspace} \newcommand{\code}[1]{\texttt{#1}\xspace} \newcommand{\bs}{$\backslash$} \newcommand{\alphaTCN}{\alpha_{\textnormal{TCN}}\xspace} \newcommand{\alphaDH}{\alpha_{\textnormal{DH}}\xspace} \newcommand{\LOH}{\ensuremath{\textnormal{LOH}}\xspace} \newcommand{\AB}{\ensuremath{\textnormal{AB}}\xspace} \newenvironment{rspVerbatim}{\vspace{-\parskip}\begin{alltt}\color{blue}}{\end{alltt}} \newenvironment{escapeRspVerbatim}{\vspace{-\parskip}\begin{alltt}}{\end{alltt}} \title{<%@meta name="title"%>} \author{<%@meta name="author"%>} \date{<%=format(as.Date(PSCBS$date), format="%B %d, %Y")%>} \begin{document} \maketitle \begin{abstract} The Paired Parent-Specific Circular Binary Segmentation (Paired PSCBS) method partitions a tumor genome into segments of constant parent-specific copy numbers (PSCNs) based on SNP DNA microarray data from a matched tumor-normal pair. The method also calls segments with run of homozygosity (ROH), segments in allelic balance (AB) and segments with loss of heterozygosity (LOH). Paired PSCBS was designed to work with data from any SNP microarray technology and generation, including Affymetrix and Illumina. This document shows how to use the \pkg{PSCBS} package to run Paired PSCBS on a tumor-normal pair. \end{abstract} \keywords{<%@meta name="keywords"%>} \begin{center} \emph{This vignette is distributed as part of the \pkg{PSCBS} package, which is available on CRAN (\url{https://cran.r-project.org/}). The authors very much appreciate feedback on this document.} \end{center} \clearpage \tableofcontents \clearpage <%------------------------------------------------------------------- BACKGROUND -------------------------------------------------------------------%> \section{Background} \label{secBackground} We will here use a small example data set to illustrate how to setup the data in a format suitable for Paired PSCBS, how to identify segments, how to call them, and how to plot and export the segmentation results. The statistical model and the algorithm behind Paired PSCBS is explained in detail in~\citet{OlshenA_etal_2011}. <%------------------------------------------------------------------- EXAMPLE -------------------------------------------------------------------%> \section{Preparing data to be segmented} The Paired PSCBS~\citep{OlshenA_etal_2011} method requires tumor-normal paired parent-specific copy-number (PSCN) signals. More precisely, it requires total copy-number (TCN) estimates for the tumor relative to the matched normal ($C_T$), allele B fractions (BAFs) for the tumor ($\beta_T$) and BAFs for the matched normal ($\beta_N$). The genomic locations of the loci in form of chromosome and physical position are also required. \subsection{Locus-level SNP copy-number signals} \label{secData} In this example we will use a small example data set part of the \pkg{PSCBS} package. It can be loaded as: <% fullname <- "PairedPSCBS,exData,chr01" %> \begin{verbatim} <%=withCapture({ data <- PSCBS::exampleData("paired.chr01") str(data) })%> \end{verbatim} In additional to the mandatory fields (\code{chromosome}, \code{x}, \code{CT}, \code{betaT}, and \code{betaN}), this data set also contains TCNs for normal (\code{CN}) relative to a large pool of normal samples. The latter will not be used here. \subsection{Dropping TCN outliers} \label{secTCNOutliers} There may be some outliers among the TCNs. In CBS~\citep{OlshenA_etal_2004,VenkatramanOlshen_2007}, the authors propose a method for identifying outliers and then to shrink such values toward their neighbors ("smooth") before performing segmentation. At the time CBS was developed it made sense to not just to drop outliers because the resolution was low and every datapoint was valuable. With modern technologies the resolution is much higher and we can afford dropping such outliers, which can be done by: \begin{verbatim} <%=withCapture({ data <- dropSegmentationOutliers(data) })%> \end{verbatim} Dropping TCN outliers is optional. \section{Paired PSCBS segmentation} \subsection{Skipping centromeres and other large gaps} \label{secGaps} Like the CBS method, Paired PSCBS does not take the physical locations (in units of nucleotides) of the loci in to account when segmenting the data, only their relative ordering along the genome. This means that after having ordered the loci along genome, it will treat two "neighboring" loci that are on both sides of the centromere equally to two neighboring loci that are only a few hundred bases apart. This may introduce erroneous change points that appear to be inside the centromere. The same issues occur for other large gaps of the genome where there are no observed signals. To avoid this, although not mandatory, we will locate all gaps of the genome where there are no observed loci. As a threshold we will consider a region to be a "gap" if the distance between the two closest loci is greater than 1Mb. \begin{verbatim} <%=withCapture({ gaps <- findLargeGaps(data, minLength=1e6) gaps })%> \end{verbatim} which shows that there is a 20.5Mb long gap between 121.0Mb and 141.5Mb on Chromosome~1. This is the centromere of Chromosome~1. It is not possible to specify "gaps" to the segmentation function. Instead they need to be given as part of a set of "known" segments, which is done as: \begin{verbatim} <%=withCapture({ knownSegments <- gapsToSegments(gaps) knownSegments })%> \end{verbatim} Below, we will use this to tell Paired PSCBS to segment Chromosome~1 into three independent segments, where the first segment is from the beginning of the chromosome (hence '-Inf') to 120.1Mb, the second one from 120.1-141.5Mb (the above gap), and the third one is from 141.5Mb to the end of the chromosome (hence '+Inf'). Just as Paired PSCBS segments chromosomes independently of each other, it also segments priorly known segments independently of each other. Specifying known segments is optional. \subsection{Identifying PSCN segments} We are now ready to segment the locus-level PSCN signals. This is done by\footnote{We fix the random seed in order for the results of this vignette to be numerically reproducible.}: \begin{verbatim} <%=withCapture({ fit <- segmentByPairedPSCBS(data, knownSegments=knownSegments, seed=0xBEEF, verbose=-10) })%> \end{verbatim} Note that this may take several minutes when applied to whole-genome data. The above call will also normalize the tumor BAFs using the TumorBoost normalization method~\citep{BengtssonH_etal_2010} (without preserving the relative scale for homozygous and heterozygous BAFs; in a future version, this will be the default). If this has already been done or the tumor signals have been normalized by other means, the TumorBoost step can be skipped by setting argument \code{tbn=FALSE}. The result of the segmentation is a set of segments identified to have the same underlying PSCN levels. In this particular case, <%=nbrOfSegments(fit)%> PSCN segments were found: <% fit <- fixLocations(fit) %> \begin{verbatim} <%=withCapture({ getSegments(fit, simplify=TRUE) })%> \end{verbatim} <% segs <- getSegments(fit, simplify=TRUE) %> Note that Segment~\#<%=which(segs$tcnNbrOfLoci == 0)%> has no mean-level estimates. It is because it corresponds to the centromere (the gap) that was identified above. Paired PSCBS did indeed try to segment it, but since there are no data points, all estimates are missing values. Similarly, for Segment~\#<%=which(segs$dhNbrOfLoci == 0)%> the DH and minor and major CNs mean estimates are all missing values. This is because, Paired PSCBS identified that segment by first segmenting the TCN signals by themselves, and after which it tried to segment the DH signals within that segment. Since there are no heterozygous SNPs in the segment, there are no DH signals, and hence there is no DH mean estimate. \subsection{Displaying genomic PSCN profiles} To plot the PSCN segmentation results, do: \begin{verbatim} plotTracks(fit) \end{verbatim} which by default displays three panels containing TCN, decrease of heterozygosity (DH), and minor and major CNs as in Figure~\ref{figTracks}. To plot only one panel with TCN and minor and major CNs and zoom in on a particular region, do (not shown): \begin{verbatim} plotTracks(fit, tracks="tcn,c1,c2", xlim=c(120,244)*1e6) \end{verbatim} \begin{figure}[htp] \begin{center} \resizebox{0.96\textwidth}{!}{\includegraphics{<%=toPNG(fullname, tags=c(class(fit)[1L], "tracks"), aspectRatio=0.6, { plotTracks(fit) })%>}} \end{center} \caption{PSCN segments identified by Paired PSCBS. \textbf{Top}: The TCN signals with the TCN mean levels (purple). \textbf{Middle}: The DH signals with the DH mean levels (orange). \textbf{Bottom}: The TCN signals with the minor CN ($C_1$; blue), the major CN ($C_2$; red) and the TCN ($C=C_1+C_2$; purple) mean levels. } \label{figTracks} \end{figure} \section{Calling segments} The calling algorithms for allelic balance (AB) and loss of heterozygosity (LOH) are based on quantile estimates of the different mean levels. These estimates are obtained using non-parametric bootstrap techniques. For more details, see~\citet{OlshenA_etal_2011}. After the Paired PSCBS method was published, we have added methods for calling run of homozygosity (ROH) and neutral total copy number (NTCN). \emph{The ROH and NTCN calling methods should be considered under development until further notice, meaning they and their results may change without notice.} \subsection{Calling segments with run of homozygosity (ROH)*} \emph{Please note that this method is under development. This means that it may change without further notice.} A region has a run of homozygotes (ROH) if all of its SNPs are homozygous (in the normal). Since such a region has no heterozyous SNPs, its decrease in heterozygosity (DH) is undefined. Likewise, the minor and major copy numbers are unknown. However, if there are genotyping errors within an ROH region, we will obtain a non-missing DH mean level and hence also finite minor and major CNs. In order to adjust for these faulty estimates, we test if the identified segments are ROHs or not by: \begin{verbatim} <%=withCapture({ fit <- callROH(fit, verbose=-10) })%> \end{verbatim} This will also set the corresponding DH and minor and major CN mean levels to NA. The total CN mean levels are not affected by the ROH call. \subsubsection{Tuning parameters} For each segment, the test for ROH calculates the fraction of SNPs that are called heterozygous. If the fraction of heterozygotes is smaller than a threshold, the segmented is called ROH, otherwise not. The default threshold is 1/12. To use a different threshold, set argument \code{delta} to a scalar in $[0,1]$. For example, using \code{callROH(fit, delta=1/30)} makes the ROH caller more conservative.\footnote{It is our plan to replace this basic test with a bionomial test.} \subsection{Calling segments in allelic balance (AB)} The AB caller tests whether the DH level of a segment is small enough to be considered zero, which means that the minor and the major CNs are equal, i.e. the segment is in allelic balance. To call the AB state of all segments, do: \begin{verbatim} <%=withCapture({ fit <- callAB(fit, verbose=-10) })%> \end{verbatim} Because the caller utilizes bootstrapping techniques, calling AB may take some time if there is a large number of segments. Segments already called ROH will not be called for AB, and their AB statuses will have a missing value (\code{NA}). \subsubsection{Tuning parameters} The AB caller tests whether a segment's DH is zero or not, by comparing its DH level (or more precisely, the $5\%$ quantile of its bootstrapped DH mean level) to a threshold. This threshold will be a function of the noise level, because the noisier the BAF signals (and hence the DH signals), the greater the bias of the DH mean level for segments in AB will be. Because of this, the threshold is chosen from data by estimating the noise level of the DH signals near zero. Further rationales and details are given in~\citet{OlshenA_etal_2011}. The AB threshold can be estimated explicitly and used in the caller as \begin{verbatim} deltaAB <- estimateDeltaAB(fit, scale=1) fit <- callAB(fit, delta=deltaAB) \end{verbatim} By decreasing argument \code{scale} (a positive scalar), a smaller threshold will be obtained resulting in a more conservative AB caller. \subsection{Calling segments with loss of heterozygosity (LOH)} The LOH caller tests whether the minor CN level of a segment is large enough to be considered non-zero, which means that the segment is \emph{not} in LOH. To call the LOH state of all segments, do: \begin{verbatim} <%=withCapture({ fit <- callLOH(fit, verbose=-10) })%> \end{verbatim} Note that in order to call LOH, one has to call allelic balance first. Since the bootstrapping was already done in the AB caller, it is not repeated here, which is why calling LOH is faster than calling AB. Analogously to the AB caller, segments already called ROH will not be called for LOH, and their LOH statuses will have a missing value (\code{NA}). Segments already called AB will be called non-LOH, because AB and LOH are exclusively mutual states. \subsubsection{Tuning parameters} The LOH caller tests whether a segment's minor CN is non-zero or not, by comparing its minor CN level (or more precisely, the $95\%$ quantile of its bootstrapped minor CN mean level) to a threshold. This threshold will, among other components, be a function of normal contamination, i.e. the greater the fraction of normal cells is the greater the threshold needs to be in order to call LOH in the tumor cells. Because of this, the threshold is chosen from data as the midpoint of the estimated levels of minor CNs zero and one, which in turn is a function of the normal contamination. For details on the definition of LOH and details on the LOH caller, see~\citet{OlshenA_etal_2011}. The LOH threshold can be estimated explicitly and used as: \begin{verbatim} deltaLOH <- estimateDeltaLOH(fit, midpoint=1/2) fit <- callLOH(fit, delta=deltaLOH) \end{verbatim} By decreasing argument \code{midpoint} in $[0,1]$, a smaller threshold will be obtained resulting in a more conservative LOH caller. \subsection{Calling segments with neutral total copy number (NTCN)*} \emph{Please note that this method is under development. This means that it may change without further notice.} The neutral total copy number (NTCN) caller tests whether the total CN level of a segment is neutral (e.g. diploid) or not. To call the NTCN state of all segments, do: \begin{verbatim} <%=withCapture({ fit <- callNTCN(fit, verbose=-10) })%> \end{verbatim} \subsubsection{Tuning parameters} The NTCN caller identifies segments which TCN mean levels (or more precisely, which $95\%$ confidence intervals) are within a given "acceptance" region. This acceptance region is determined by the $95\%$ confidence interval of an initial set of AB segments identified to be copy neutral and then expanded by half a TCN unit length. The true length of half a total copy number unit is specified by the argument \code{delta} to \code{callNTCN()}. Its length should be a function of the overall background signals (which includes normal contamination and more), such that the width of the acceptance region becomes smaller when the background increases. The background signal (\code{kappa}) and \code{delta} can be estimated explicitly as: \begin{verbatim} kappa <- estimateKappa(fit) deltaCN <- estimateDeltaCN(fit, scale=1, kappa=kappa) fit <- callNTCN(fit, delta=deltaCN, verbose=-10) \end{verbatim} By decreasing the tuning parameter \code{scale} (a positive scalar), a smaller acceptance region will be obtained, which results in a more conservative CN caller. \subsection{Results from calling ROH, AB, LOH and NTCN} All calls are appended to the segmentation results as logical columns: \begin{verbatim} <%=withCapture({ getSegments(fit, simplify=TRUE) })%> \end{verbatim} <% segs <- getSegments(fit, simplify=TRUE) %> \section{Saving results} \subsection{Writing segments to a tab-delimited text file} To write the PSCN segmentation results to file, do: \begin{verbatim} pathname <- writeSegments(fit, name="MySample", simplify=TRUE) \end{verbatim} With \code{simplify=FALSE} (default) quantile estimates of the different mean levels will also be written, which roughly doubles the file size. \section{Experimental} In this section we illustrate some of the ongoing and future work that will be contained in the PSCBS package. Please be aware that these methods are very much under construction, possibly incomplete and in the worst case, even incorrect. \subsection{Less biased Decrease of Heterozygosity (DH) estimates} The DH mean levels of the segments are estimated as the sample mean of the DH signals within each segment. Since DH signals are by definition truncated at zero, the mean level estimates will always be greater than zero even if the true underlying DH is exactly zero. An additional bias is introduced because the distribution of the DH signals is skewed for small DHs, causing the sample mean estimate to be biased. A less biased estimate can be obtained by using the median estimator. To use the median estimator for DH mean levels, specify argument \code{avgDH="median"} to the \code{segmentByPairedPSCBS()} call. Because the DH mean levels will be less biased, the $(C_1,C_2)$ mean level estimates will also be less biased, e.g. they will be closer to each other for segments that are in allelic balance. \subsection{Pruning segmentation profile} By applying hierarchical clustering on the segment means, it is possible to prune the PSCN profile such that change points with very small absolute changes are dropped. If change points are dropped this way, this results in a smaller number of segments, which are hence longer on average. To prune Paired PSCBS segmentation results this way, do: \begin{verbatim} <%=withCapture({ fitP <- pruneByHClust(fit, h=0.25, verbose=-10) })%> \end{verbatim} The result of this is shown in Figure~\ref{figTracksPruned}. \begin{figure}[htp] \begin{center} \resizebox{0.96\textwidth}{!}{\includegraphics{<%=toPNG(fullname, tags=c(class(fitP)[1L], "pruned", "tracks"), aspectRatio=0.6, { plotTracks(fitP) })%>}} \end{center} \caption{Pruned PSCN segments plotted as in Figure~\ref{figTracks}.} \label{figTracksPruned} \end{figure} In the current implementation, any segment calls and quantile mean levels estimates previously are dropped when pruning. \subsection{Report generation} A multipage PDF report that contains both whole-genome and per-chromosome summaries and figures can be generated by: \begin{verbatim} > report(fit, sampleName="PairedPSCBS", studyName="PSCBS-Ex", verbose=-10) \end{verbatim} By default, the reports are written to directory \code{reports//} under the current working directory. In addition to the PDF, that directory also contains subdirectory \code{figures/} holding all generated figure files (e.g. PNGs and PDFs) for easy inclusion elsewhere. <%------------------------------------------------------------------- REFERENCES -------------------------------------------------------------------%> \bibliographystyle{natbib} \bibliography{PSCBS} <%------------------------------------------------------------------- APPENDIX -------------------------------------------------------------------%> \clearpage \section*{Appendix} \subsection*{Session information} <%=toLatex(sessionInfo())%> This report was automatically generated using \code{rfile()} of the R.rsp package. Total processing time after RSP-to-R translation was <%=dt <- round(Sys.time()-t0, digits=2)%> <%=attr(dt, "units")%>. \end{document}