diff --git a/vignettes/ExtendingGenomicRanges.Rmd b/vignettes/ExtendingGenomicRanges.Rmd
new file mode 100644
index 00000000..45bd9b69
--- /dev/null
+++ b/vignettes/ExtendingGenomicRanges.Rmd
@@ -0,0 +1,98 @@
+---
+title: "Extending *GenomicRanges*"
+author:
+  - name: "Michael Lawrence"
+  - name: "Bioconductor Team"
+date: "Edited: Oct 2014; Compiled: `r format(Sys.time(), '%d %B, %Y')`"
+package: GenomicRanges
+vignette: >
+  %\VignetteIndexEntry{Extending Genomic Ranges}
+  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+output:
+  BiocStyle::html_document:
+    number_sections: yes
+    toc: yes
+    toc_depth: 4
+---
+
+# Introduction
+
+The goal of `r Biocpkg("GenomicRanges")` is to provide general containers for
+genomic data. The central class, at least from the user perspective, is
+*GRanges*, which formalizes the notion of ranges, while allowing for arbitrary
+"metadata columns" to be attached to it. These columns offer the same
+flexibility as the venerable *data.frame* and permit users to adapt *GRanges* to
+a wide variety of *adhoc* use-cases.
+
+The more we encounter a particular problem, the better we understand it. We
+eventually develop a systematic approach for solving the most frequently
+encountered problems, and every systematic approach deserves a systematic
+implementation. For example, we might want to formally store genetic variants,
+with information on alleles and read depths. The metadata columns, which were so
+useful during prototyping, are inappropriate for extending the formal semantics
+of our data structure: for the sake of data integrity, we need to ensure that
+the columns are always present and that they meet certain constraints.
+
+We might also find that our prototype does not scale well to the increased data
+volume that often occurs when we advance past the prototype stage. *GRanges* is
+meant mostly for prototyping and stores its data in memory as simple R data
+structures. We may require something more specialized when the data are large;
+for example, we might store the data as a Tabix-indexed file, or in a database.
+
+The `r Biocpkg("GenomicRanges")` package does not directly solve either of these
+problems, because there are no general solutions. However, it is adaptable to
+specialized use cases.
+
+# The *GenomicRanges* abstraction
+
+Unbeknownst to many, most of the *GRanges* implementation is provided by methods
+on the *GenomicRanges* class, the virtual parent class of *GRanges*.
+*GenomicRanges* methods provide everything except for the actual data storage
+and retrieval, which *GRanges* implements directly using slots. For example, the
+ranges are retrieved like this:
+
+```{r granges-ranges, message=FALSE}
+library(GenomicRanges)
+selectMethod(ranges, "GRanges")
+```
+
+An alternative implementation is *DelegatingGenomicRanges*, which stores all of its data in a delegate *GenomicRanges* object:
+
+```{r delegating-granges-ranges}
+selectMethod(ranges, "DelegatingGenomicRanges")
+```
+
+This abstraction enables us to pursue more efficient implementations for
+particular tasks. One example is *GNCList*, which is indexed for fast range
+queries, we expose here:
+
+```{r gnclist-granges}
+getSlots("GNCList")["granges"]
+```
+
+The `r Biocpkg("MutableRanges")` package in svn provides other, untested
+examples.
+
+# Formalizing `mcols`: Extra column slots
+
+An orthogonal problem to data storage is adding semantics by the formalization
+of metadata columns, and we solve it using the "extra column slot" mechanism.
+Whenever *GenomicRanges* needs to operate on its metadata columns, it also
+delegates to the internal `extraColumnSlotNames` generic, methods of which
+should return a character vector, naming the slots in the *GenomicRanges*
+subclass that correspond to columns (i.e., they have one value per range). It
+extracts the slot values and manipulates them as it would a metadata column --
+except they are now formal slots, with formal types.
+
+An example is the *VRanges* class in `r Biocpkg("VariantAnnotation")`. It stores
+information on the variants by adding these column slots:
+
+```{r vranges, message=FALSE, warning=FALSE}
+GenomicRanges:::extraColumnSlotNames(VariantAnnotation:::VRanges())
+```
+
+Mostly for historical reasons, *VRanges* extends *GRanges*. However, since the
+data storage mechanism and the set of extra column slots are orthogonal, it is
+probably best practice to take a composition approach by extending
+*DelegatingGenomicRanges*.
diff --git a/vignettes/ExtendingGenomicRanges.Rnw b/vignettes/ExtendingGenomicRanges.Rnw
deleted file mode 100644
index 69b377c0..00000000
--- a/vignettes/ExtendingGenomicRanges.Rnw
+++ /dev/null
@@ -1,121 +0,0 @@
-% \VignetteIndexEntry{5. Extending GenomicRanges}
-% \VignetteDepends{GenomicRanges, VariantAnnotation}
-% \VignetteKeywords{ranges}
-% \VignettePackage{GenomicRanges}
-
-\documentclass{article}
-
-\usepackage[authoryear,round]{natbib}
-
-<>=
-BiocStyle::latex(use.unsrturl=FALSE)
-@
-
-\title{Extending \Biocpkg{GenomicRanges}}
-\author{Michael Lawrence, Bioconductor Team}
-\date{Edited: Oct 2014; Compiled: \today}
-
-\begin{document}
-
-\maketitle
-
-\tableofcontents
-
-<>=
-options(width=72)
-options(showHeadLines=3)
-options(showTailLines=3)
-@
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Introduction}
-
-The goal of \Biocpkg{GenomicRanges} is to provide general containers
-for genomic data. The central class, at least from the user
-perspective, is \Rclass{GRanges}, which formalizes the notion of
-ranges, while allowing for arbitrary ``metadata columns'' to be
-attached to it. These columns offer the same flexibility as the
-venerable \Rclass{data.frame} and permit users to adapt
-\Rclass{GRanges} to a wide variety of \textit{adhoc} use-cases.
-
-The more we encounter a particular problem, the better we understand
-it. We eventually develop a systematic approach for solving the most
-frequently encountered problems, and every systematic approach
-deserves a systematic implementation. For example, we might want to
-formally store genetic variants, with information on alleles and read
-depths. The metadata columns, which were so useful during prototyping,
-are inappropriate for extending the formal semantics of our data
-structure: for the sake of data integrity, we need to ensure that the
-columns are always present and that they meet certain constraints.
-
-We might also find that our prototype does not scale well to the
-increased data volume that often occurs when we advance past the
-prototype stage. \Rclass{GRanges} is meant mostly for prototyping and
-stores its data in memory as simple R data structures. We may require
-something more specialized when the data are large; for example, we
-might store the data as a Tabix-indexed file, or in a database.
-
-The \Biocpkg{GenomicRanges} package does not directly solve either of
-these problems, because there are no general solutions. However, it is
-adaptible to specialized use cases.
-
-\section{The \Rclass{GenomicRanges} abstraction}
-
-Unbeknownst to many, most of the \Rclass{GRanges} implementation is
-provided by methods on the \Rclass{GenomicRanges} class, the virtual
-parent class of \Rclass{GRanges}. \Rclass{GenomicRanges} methods
-provide everything except for the actual data storage and retrieval,
-which \Rclass{GRanges} implements directly using slots. For example,
-the ranges are retrieved like this:
-<>=
-library(GenomicRanges)
-selectMethod(ranges, "GRanges")
-@
-
-An alternative implementation is \Rclass{DelegatingGenomicRanges},
-which stores all of its data in a delegate \Rclass{GenomicRanges}
-object:
-<>=
-selectMethod(ranges, "DelegatingGenomicRanges")
-@
-
-This abstraction enables us to pursue more efficient implementations
-for particular tasks. One example is \Rclass{GNCList}, which is
-indexed for fast range queries, we expose here:
-<>=
-getSlots("GNCList")["granges"]
-@
-
-The \Biocpkg{MutableRanges} package in svn provides other, untested
-examples.
-
-\section{Formalizing \texttt{mcols}: Extra column slots}
-
-An orthogonal problem to data storage is adding semantics by the
-formalization of metadata columns, and we solve it using the ``extra
-column slot'' mechanism. Whenever \Rclass{GenomicRanges} needs to
-operate on its metadata columns, it also delegates to the internal
-\Rfunction{extraColumnSlotNames} generic, methods of which should
-return a character vector, naming the slots in the
-\Rclass{GenomicRanges} subclass that correspond to columns (i.e., they
-have one value per range). It extracts the slot values and manipulates
-them as it would a metadata column -- except they are now formal
-slots, with formal types.
-
-An example is the \Rclass{VRanges} class in
-\Biocpkg{VariantAnnotation}. It stores information on the variants by
-adding these column slots:
-<>=
-GenomicRanges:::extraColumnSlotNames(VariantAnnotation:::VRanges())
-@
-
-Mostly for historical reasons, \Rclass{VRanges} extends
-\Rclass{GRanges}. However, since the data storage mechanism and the
-set of extra column slots are orthogonal, it is probably best practice
-to take a composition approach by extending
-\Rclass{DelegatingGenomicRanges}.
-
-\end{document}