<%@meta language="R-vignette" content="-------------------------------- DIRECTIVES FOR R: %\VignetteIndexEntry{Total copy-number segmentation using CBS} %\VignetteKeyword{copy numbers} %\VignetteKeyword{genomic aberrations} %\VignetteAuthor{Henrik Bengtsson} %\VignetteEngine{R.rsp::rsp} --------------------------------------------------------------------"%> <% t0 <- Sys.time() %> <% R.utils::use("R.utils") # RSP specific R.rsp <- R.oo::Package("R.rsp") withCapture <- R.utils::withCapture options("withCapture/newline"=FALSE) options(str=strOptions(strict.width="cut")) options(width=85) options(digits=3) # Graphics use("R.devices") options("devEval/args/field"="fullname") # Preferred for LaTeX devOptions("png", width=840) # Analysis use("PSCBS") PSCBS <- R.oo::Package("PSCBS") fixLocations <- function(fit, ...) { for (key in grep("(end|start)$", colnames(fit$output))) { fit$output[[key]] <- as.integer(fit$output[[key]]) } fit } # fixLocations() signalType <- "TCN" %> \documentclass[letter]{article} \usepackage{xspace} \usepackage{alltt} \usepackage{xcolor} \usepackage{natbib} % \citep{}, \citet{} \usepackage{graphicx} \graphicspath{{figures/}} <%------------------------------------------------------------------- Assign PDF metadata -------------------------------------------------------------------%> % PDF metadata \usepackage{hyperref} % Ideally \hypersetup{hidelinks}, but for backward compatibility: \hypersetup{pdfborder={0 0 0}} \hypersetup{ pdfauthor={<%@meta name="author"%>}, pdftitle={<%@meta name="title"%>}, pdfsubject={}, pdfkeywords={<%@meta name="keywords"%>}, pdfproducer={R.rsp v<%=R.rsp$version%> by <%=R.rsp$author%>} } % Page margins \addtolength{\oddsidemargin}{-0.5in} \addtolength{\evensidemargin}{-0.5in} \addtolength{\textwidth}{1in} \addtolength{\topmargin}{-0.5in} \addtolength{\textheight}{1in} % Placement of floats \setcounter{bottomnumber}{2} \renewcommand{\topfraction}{1.0} \renewcommand{\bottomfraction}{1.0} \renewcommand{\textfraction}{0.0} \renewcommand{\floatpagefraction}{1.0} % Macros \newcommand{\keywords}[1]{{\footnotesize{\textbf{Keywords: }#1}}\xspace} \newcommand{\pkg}[1]{\textsl{#1}\xspace} \newcommand{\file}[1]{\textsl{#1}\xspace} \newcommand{\code}[1]{\texttt{#1}\xspace} \newcommand{\bs}{$\backslash$} \newenvironment{rspVerbatim}{\vspace{-\parskip}\begin{alltt}\color{blue}}{\end{alltt}} \newenvironment{escapeRspVerbatim}{\vspace{-\parskip}\begin{alltt}}{\end{alltt}} \title{<%@meta name="title"%>} \author{<%@meta name="author"%>} \date{<%=format(as.Date(PSCBS$date), format="%B %d, %Y")%>} \begin{document} \maketitle \begin{abstract} The Circular Binary Segmentation (CBS) method partitions a genome into segments of constant total copy numbers (TCNs) based on DNA microarray data. The method also calls .... CBS was designed to work with data from any DNA microarray technology and generation, including Affymetrix and Illumina. This document shows how to use the \pkg{PSCBS} package to run CBS on a tumor sample. \end{abstract} \keywords{<%@meta name="keywords"%>} \begin{center} \emph{This vignette is distributed as part of the \pkg{PSCBS} package, which is available on CRAN (\url{https://cran.r-project.org/}). The authors very much appreciate feedback on this document.} \end{center} \clearpage \tableofcontents \clearpage <%------------------------------------------------------------------- BACKGROUND -------------------------------------------------------------------%> \section{Background} \label{secBackground} We will here use a small example data set to illustrate how to setup the data in a format suitable for CBS, how to identify segments, how to call them, and how to plot and export the segmentation results. The statistical model and the algorithm behind CBS is explained in detail in \citet{OlshenA_etal_2004, VenkatramanOlshen_2007}. <%------------------------------------------------------------------- EXAMPLE -------------------------------------------------------------------%> \section{Preparing data to be segmented} The CBS method requires total copy-number (TCN) estimates. More precisely, it requires TCN ratios for a sample of interest relative to a reference ($y$). The genomic location of the loci in form of chromosome and physical position are also required. \subsection{Locus-level total copy-number signals} \label{secData} In this example we will use a small example data set part of the \pkg{PSCBS} package. It can be loaded as: <% fullname <- "PairedPSCBS,exData,chr01" %> \begin{verbatim} <%=withCapture({ data <- PSCBS::exampleData("paired.chr01") data <- data[,c("chromosome", "x", "CT")] colnames(data)[3] <- "y" str(data) })%> \end{verbatim} In additional to the mandatory fields (\code{chromosome}, \code{x}, and \code{C} this data set also contains .... The latter will not be used here. \subsection{Dropping <%=signalType%> outliers} \label{secTCNOutliers} There may be some outliers among the <%=signalType%>s. In CBS~\citep{OlshenA_etal_2004,VenkatramanOlshen_2007}, the authors propose a method for identifying outliers and then to shrink such values toward their neighbors ("smooth") before performing segmentation. At the time CBS was developed it made sense to not just to drop outliers because the resolution was low and every datapoint was valuable. With modern technologies the resolution is much higher and we can afford dropping such outliers, which can be done by: \begin{verbatim} <%=withCapture({ data <- dropSegmentationOutliers(data) })%> \end{verbatim} Dropping <%=signalType%> outliers is optional. \section{CBS segmentation} \subsection{Skipping centromeres and other large gaps} \label{secGaps} The CBS method does not take the physical locations (in units of nucleotides) of the loci in to account when segmenting the data, only their relative ordering along the genome. This means that after having ordered the loci along genome, it will treat two "neighboring" loci that are on both sides of the centromere equally as two neighboring loci that are only few hundred bases apart. This may introduce erroneous change points that appears to be inside the centromere and biological impossible interpretation of the identified copy-number states. The same issues occur for other large gaps of the genome where there are no observed signals. To avoid this, although not mandatory, we will locate all gaps of the genome where there are no observed loci. As a threshold we will consider a region to be a "gap" if the distance between the two closest loci is greater than 1Mb. \begin{verbatim} <%=withCapture({ gaps <- findLargeGaps(data, minLength=1e6) gaps })%> \end{verbatim} which shows that there is a 20.5Mb long gap between 121.0Mb and 141.5Mb on Chromosome~1. This is the centromere of Chromosome~1. Gaps cannot be specified directly. Instead they need to be give as part of a set of "known" segments, which is done as: \begin{verbatim} <%=withCapture({ knownSegments <- gapsToSegments(gaps) knownSegments })%> \end{verbatim} Below, we will use this to tell CBS to segment Chromosome~1 in three independent segments, where the first segments is from the beginning of the chromosomes (hence '-Inf') to 120.1Mb, the second from 120.1-141.5Mb (the above gap), and the third is from 141.5Mb to the end of the chromosome (hence '+Inf'). Just as CBS segments chromosomes independently of each other, it also segments priorly known segments independently of each other. Specifying known segments is optional. \subsection{Identifying <%=signalType%> segments} We are now ready to segment the locus-level <%=signalType%> signals. This is done by\footnote{We fix the random seed in order for the results of this vignette to be exactly reproducible.}: \begin{verbatim} <%=withCapture({ fit <- segmentByCBS(data, knownSegments=knownSegments, seed=0xBEEF, verbose=-10) })%> \end{verbatim} Note that this may take several minutes when applied to whole-genome data. The result of the segmentation is a set of segments identified to have the same underlying <%=signalType%> levels. In this particular case, <%=nbrOfSegments(fit)%> <%=signalType%> segments were found: <% fit <- fixLocations(fit) %> \begin{verbatim} <%=withCapture({ getSegments(fit, simplify=TRUE) })%> \end{verbatim} <% segs <- getSegments(fit, simplify=TRUE) %> Note how Segment~\#<%=which(segs$nbrOfLoci == 0)%> has no mean-level estimates. It is because it corresponds to the centromere (the gap) that was identified above. CBS did indeed try to segment it, but since there are no data points, all estimates are missing values. \subsection{Displaying genomic <%=signalType%> profiles} To plot the <%=signalType%> segmentation results, do: \begin{verbatim} plotTracks(fit) \end{verbatim} which displays <%=signalType%> as in Figure~\ref{figTracks}. To zoom in on a particular region, do: \begin{verbatim} plotTracks(fit, xlim=c(120,244)*1e6) \end{verbatim} \begin{figure}[htp] \begin{center} \resizebox{0.96\textwidth}{!}{\includegraphics{<%=toPNG(fullname, tags=c(class(fit)[1L], "tracks"), aspectRatio=0.35, { plotTracks(fit) })%>}} \end{center} \caption{Segments identified by CBS. The <%=signalType%> signals with the <%=signalType%> mean levels (purple). } \label{figTracks} \end{figure} \section{Calling segments} TBA. <%--- \subsection{Results from calling <%=signalType%> states} All calls are appended to the segmentation results as logical columns: \begin{verbatim} <%=withCapture({ getSegments(fit, simplify=TRUE) })%> \end{verbatim} <% segs <- getSegments(fit, simplify=TRUE) %> ---%> \section{Saving results} \subsection{Writing segments to a tab-delimited text file} To write the <%=signalType%> segmentation results to file, do: \begin{verbatim} pathname <- writeSegments(fit, name="MySample", simplify=TRUE) \end{verbatim} \section{Experimental} In this section we illustrate some of the ongoing and future work of the PSCBS package. Please be aware that these methods are very much under construction, possibly incomplete and in worst case even incorrect. \subsection{Pruning segmentation profile} By using hierarchical cluster of the segment means it is possible to prune the <%=signalType%> profile such that change points with very small absolute changes are dropped. If change points are dropped this way, this results in a smaller number of segments, which are hence longer. \begin{verbatim} <%=withCapture({ fitP <- pruneByHClust(fit, h=0.25, verbose=-10) })%> \end{verbatim} \begin{figure}[htp] \begin{center} \resizebox{0.96\textwidth}{!}{\includegraphics{<%=toPNG(fullname, tags=c(class(fitP)[1L], "pruned", "tracks"), aspectRatio=0.35, { plotTracks(fitP) })%>}} \end{center} \caption{Pruned <%=signalType%> segments plotted as in Figure~\ref{figTracks}.} \label{figTracksPruned} \end{figure} \subsection{Report generation} A multipage PDF report that contains both whole-genome and per-chromosome summaries and figures can be generated by: \begin{verbatim} > report(fit, sampleName="CBS", studyName="CBS-Ex", verbose=-10) \end{verbatim} By default, the reports are written to directory \code{reports//} under the current working directory. In addition to the PDF, that directory also contains subdirectory \code{figures/} holding all generated figure files (e.g. PNGs and PDFs) for easy inclusion elsewhere. <%------------------------------------------------------------------- REFERENCES -------------------------------------------------------------------%> \bibliographystyle{natbib} \bibliography{PSCBS} <%------------------------------------------------------------------- APPENDIX -------------------------------------------------------------------%> \clearpage \section*{Appendix} \subsection*{Session information} <%=toLatex(sessionInfo())%> This report was automatically generated using \code{rfile()} of the R.rsp package. Total processing time after RSP-to-R translation was <%=dt <- round(Sys.time()-t0, digits=2)%> <%=attr(dt, "units")%>. \end{document}