index.Rmd

---
title: "derfinder software paper Supplementary Website"
author: "L Collado-Torres"
date: "`r doc_date()`"
output: 
  BiocStyle::html_document
---

```{r citationsSetup, echo=FALSE, message=FALSE, warning=FALSE}
## Track time spent on making the report
startTime <- Sys.time()

## Bib setup
library('knitcitations')

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html')
# Note links won't show for now due to the following issue
# https://github.com/cboettig/knitcitations/issues/63

bibs <- c("knitcitations" = citation("knitcitations"),
    "derfinder" = citation("derfinder")[1],
    "regionReport" = citation("regionReport")[1],
    "GenomicRanges" = citation("GenomicRanges"),
    "DESeq2" = citation("DESeq2"),
    "edgeR" = citation("edgeR")[5],
    "BiocStyle" = citation("BiocStyle"),
    'rmarkdown' = citation('rmarkdown'),
    'knitr' = citation('knitr')[3],
    'eff' = RefManageR::BibEntry('manual', key = 'eff', title = 'Efficiency analysis of Sun Grid Engine batch jobs', author = 'Alyssa Frazee', year = 2014, url = 'http://dx.doi.org/10.6084/m9.figshare.878000'),
    'zhou2011' = RefManageR::BibEntry('article', key = 'zhou2011', author = "Zhou, Zhifeng and Yuan, Qiaoping and Mash, Deborah C and Goldman, David", title = "Substance-specific and shared transcription and epigenetic changes in the human hippocampus chronically exposed to cocaine and alcohol", journal = "Proceedings of the National Academy of Sciences of the United States of America", year = 2011, volume = "108", number = "16", pages = "6626-6631"),
    'rail' = RefManageR::BibEntry('article', key = 'rail', author = 'Abhinav Nellore and Leonardo Collado-Torres and Andrew E. Jaffe and José Alquicira-Hernández and Jacob Pritt and James Morton and Jeffrey T. Leek  and Ben Langmead', journal = 'bioRxiv', year = '2015', title = 'Rail-RNA: {Scalable} analysis of {RNA}-seq splicing and coverage'),
    'stringtie' = RefManageR::BibEntry('article', key = 'stringtie', author = ' Mihaela Pertea and Geo M. Pertea and Corina M. Antonescu and Tsung-Cheng Chang and Joshua T. Mendell and Steven L. Salzberg', journal = 'Nature Biotechnology', year = '2015', title = 'StringTie enables improved reconstruction of a transcriptome from RNA-seq reads'),
    'hisat' = RefManageR::BibEntry('article', key = 'hisat', author = 'Daehwan Kim and Ben Langmead and Steven L Salzberg', journal = 'Nature Methods', year = '2015', title = 'HISAT: a fast spliced aligner with low memory requirements'),
    'ballgown' = RefManageR::BibEntry('article', key = 'ballgown', author = 'Alyssa C. Frazee and Geo Pertea and Andrew E. Jaffe and Ben Langmead and Steven L. Salzberg  and Jeffrey T. Leek', journal = 'Nature Biotechnology', year = '2015', title = 'Ballgown bridges the gap between transcriptome assembly and expression analysis'))
    
write.bibtex(bibs, file = 'index.bib')
bib <- read.bibtex('index.bib')

## Assign short names
names(bib) <- names(bibs)
```

This page describes the supplementary material for the `derfinder` software paper. All the `bash`, `R` and `R Markdown` source files used to analyze the data for this project as well as generate the HTML reports are available in this website. However, it is easier to view them at [github.com/leekgroup/derSupplement](https://github.com/leekgroup/derSupplement).

# BrainSpan data set

This section of the website describes the code and reports associated with the BrainSpan data set that are referred to in the paper and Supplementary Methods and Results.


## Code to reproduce analyses


There are 9 main `bash` scripts named _step1-*_ through _step9-*_ for running the expressed regions-level and single base-level approaches. 

1. _fullCoverage_ loads the data from the raw files. See [step1-fullCoverage.sh](step1-fullCoverage.sh) and [step1-fullCoverage.R](step1-fullCoverage.R).
1. _makeModels_ creates the models used for the single-level base analysis.  See [step2-makeModels.sh](step2-makeModels.sh) and [step2-makeModels.R](step2-makeModels.R).
1. _analyzeChr_ runs the single base-level analysis by chromosome.  See [step3-analyzeChr.sh](step3-analyzeChr.sh) and [step3-analyzeChr.R](step3-analyzeChr.R).
1. _mergeResults_ merges the single base-level analysis results for all the chromosomes. See [step4-mergeResults.sh](step4-mergeResults.sh).
1. _derfinderReport_ generates a HTML report for the single base-level DERs. See [step5-derfinderReport.sh](step5-derfinderReport.sh).
1. _regionMatrix_ identifies the expressed regions for the expressed-regions level approach. See [step6-regionMatrix.sh](step6-regionMatrix.sh).
1. _regMatVsDERs_ creates a simple HTML report comparing the single base-level and expressed regions-level approaches. See [step7-regMatVsDERs.sh](step7-regMatVsDERs.sh) and [step7-regMatVsDERs.Rmd](step7-regMatVsDERs.Rmd).
1. _coverageToExon_ creates an exon count table using known annotation information. See [step8-coverageToExon.sh](step8-coverageToExon.sh) and [step8-coverageToExon.R](step8-coverageToExon.R).
1. _summaryInfo_ creates a HTML report with brief summary information for the given experiment. See [step9-summaryInfo.sh](step9-summaryInfo.sh), [step9-summaryInfo.R](step9-summaryInfo.R), and [step9-summaryInfo.Rmd](step9-summaryInfo.Rmd).


A final `bash` script, [run-all.sh](run-all.sh), can be used to run the main 9 steps (or a subset of them).

All scripts show at the beginning the way they were used. Some of them generate intermediate small `bash` scripts, for example one script per chromosome for the _analyzeChr_ step. For some steps, there is a companion `R` or `R Markdown` code file when the code is more involved or an HTML file is generated in the particular step.


The [check-analysis-time.R](check-analysis-time.R) script was useful for checking the progress of the _step3-analyzeChr_ jobs and detect whenever a node in the cluster was presenting problems.


We expect that these scripts will be useful to `derfinder` users who want to automate the single base-level and/or expressed regions-level analyses for several data sets and/or have the jobs run automatically without having to check if each step has finished running.


Note that all `bash` scripts are tailored for the cluster we have access to which administer job queues with Sun Grid Engine (SGE).


## Single base-level

### Quick overview HTML report

This HTML report contains basic information on the `derfinder` `r citep(bib[["derfinder"]])` results from the _BrainSpan_ data set. The report answers basic questions such as:

* What is the number of filtered bases?
* What is the number of candidate regions?
* How many candidate regions are significant?

It also illustrates three clusters of candidate differentially expressed regions (DERs) from the single base-level analysis. You can view the report by following this link:

* [BrainSpan](brainspan/summaryInfo/run4-v1.0.10/summaryInfo.html)

### CSV files and annotation comparison

This HTML report has the code for loading the R data files and generating the CSV files. The report also has Venn diagrams showing the number of candidate DERs from the single base-level analysis that overlap known exons, introns and intergenic regions using the UCSC hg19 annotation. It also includes a detailed description of the columns in the CSV file.

View the [venn](venn/venn.html) report or its `R Markdown` source file [venn.Rmd](venn/venn.Rmd). 


## Timing and memory information


This HTML report has code for reading and processing the time and memory information for each job extracted with [efficiency_analytics](https://github.com/alyssafrazee/efficiency_analytics) `r citep(bib[["eff"]])`. The report contains a detailed description of the analysis steps and tables summarizing the maximum memory and time for each analysis step if all the jobs for that particular step were running simultaneously. Finally, there is an interactive table with the timing results.

View the [timing](timing/timing.html) report or check the `R Markdown` file [timing.Rmd](timing/timing.Rmd).

# GTEx analysis

The script [mergeInfo.R](gtex/mergeInfo.R) takes several phenotype tables and merges them into a single one. This information is then used by the [select_samples.R](gtex/select_samples.R) script for choosing the 24 samples to analyze. These samples have a RIN greater than 7 and are from subjects that have samples from the heart (left ventricle), testis and liver. The script [create_meanCov.R](gtex/create_meanCov.R) creates a mean coverage BigWig file just as you would get from using `Rail-RNA` `r citep(bib[['rail']])` on only these 24 samples. The actual script for running `Rail-RNA` on the GTEx data are described at the [nellore/runs](https://github.com/nellore/runs) GitHub repository. The scripts [run-railMatrix.sh](gtex/run-railMatrix.sh) and [railMatrix.R](gtex/railMatrix.R) then run `railMatrix()` using derfinder version 1.5.19 to identify the expressed regions. The resulting set of regions is then analyzed with the [analyze_gtex.R](gtex/analyze_gtex.R) script.


# Simulation

## Generating reads

The code for generating the simulated RNA-seq reads and the chosen setup is described in the [generateReads](simulation/generateReads/generateReads.html) report. This report is generated by the `R Markdown` [generateReads.Rmd](simulation/generateReads/generateReads.Rmd) file.

## Processing reads

We analyzed the simulation reads with the following pipelines:

* Align with `HISAT` `r citep(bib[['hisat']])`, summarize with `Rsubread::featureCounts()` at the exon-level with and without the complete annotation, identify differentially expressed exons with `DESeq2` `r citep(bib[['DESeq2']])` or `edgeR`-robust `r citep(bib[['edgeR']])`.
* Align with `HISAT`, summarize transcripts with `StringTie` `r citep(bib[['stringtie']])`, and test at the transcript and exon levels with `ballgown`.
* Align with `HISAT`, summarize with `derfinder::regionMatrix()`, and test with `DESeq2` or `edgeR`-robust.
* Align with `Rail-RNA` `r citep(bib[['rail']])`, summarize with `derfinder::railMatrix()`, and test with `DESeq2` or `edgeR`-robust.

Here we list the role of different scripts.

* The code for aligning the reads to the genome with `HISAT` is in the [run-paired-hisat.sh](simulation/hisat/run-paired-hisat.sh) script while the code for aligning with `Rail-RNA` is in [prep-manifest.R](simulation/rail/prep-manifest.R) and [run-rail.sh](simulation/rail/run-rail.sh) scripts.
* [createGTF](simulation/gtf/createGTF.html) creates GTF file with the complete and incomplete annotation (source [createGTF.Rmd](simulation/gtf/createGTF.Rmd)).
* The scripts [run-featCounts.sh](simulation/featurecounts/run-featCounts.sh), [featureCounts.R](simulation/featurecounts/featureCounts.R) and [run-featCounts-inc.sh](simulation/featurecounts-inc/run-featCounts-inc.sh), [featureCounts-inc.R](simulation/featurecounts-inc/featureCounts-inc.R) run `Rsubread::featureCounts()` at the exon level with the complete and incomplete annotation respectively.
* Similarly, the scripts [run-stringtie.sh](simulation/stringtie/run-stringtie.sh) and [run-stringtie-inc.sh](simulation/stringtie-inc/run-stringtie-inc.sh) run `StringTie` with the complete and incomplete annotation creating the input needed to run `ballgown`. 
* The scripts [run_ballgown_analysis.sh](simulation/ballgown/run_ballgown_analysis.sh) and [ballgown_analysis.R](simulation/ballgown/ballgown_analysis.R) then perform the `ballgown` `r citep(bib[['ballgown']])` analyses at the transcript and exon levels.
* The scripts [run_regionMat.sh](simulation/regionMatrix/run_regionMat.sh) and [regionMat.R](simulation/regionMatrix/regionMat.R) run `derfinder::regionMatrix()` with the `HISAT` output.
* Similarly, the scripts [run_railMat.sh](simulation/railMatrix/run_railMat.sh) and [railMat.R](simulation/railMatrix/railMat.R) run `derfinder::railMatrix()` with the `Rail-RNA` output.
* The scripts [run_calc_stats.sh](simulation/deseq2-edger/run_calc_stats.sh) and [calc_stats.R](simulation/deseq2-edge/calc_stats.R) use the count matrices created by `regionMatrix()`, `railMatrix()` and `featureCounts()` to perform differential expression tests using `DESeq2` and `edgeR`-robust.


## Evaluating results

The report [evaluate](simulation/evaluate/evaluate.html) (source [evaluate.Rmd](simulation/evaluate/evaluate.Rmd)) defines different reference sets one could consider. It then takes the results from all the different pipelines and compares them against these reference sets. The report includes summary tables from these results showing the minimum and maximum empirical power, false positive rate and false discovery rate. The main results are highlighted in the paper. Finally [timing](simulation/timing/timing.html) (source [timing.Rmd](simulation/timing/timing.Rmd)) shows information about the timing and computer resources used by the different pipelines for the simulation analysis.


# Miscellaneous

## Expressed regions-level overview figure

The code used for generating the panels using in figure showing the expressed regions-level approach is available in the [figure-expressed-regions.R](figure-expressed-regions/figure-expressed-regions.R) file.


## Single base-level overview figure

The code used for generating the panels using in the figure showing the single base-level approach is available in the [figure-single-base.R](figure-single-base/figure-single-base.R) file.

## Additional analyses

The following `R` source files have the code for reproducing additional analyses described in the paper

* [analyze_brainspan.R](additional-analyses/analyze_brainspan.R) and [brainspan_regionLevel.R](additional-analyses/brainspan_regionLevel.R) are the scripts containing the analysis of BrainSpan expressed regions-level DERs.
* [characterize_brainspan_DERs.R](additional-analyses/characterize_brainspan_DERs.R) Analysis of BrainSpan single base-level DERs.

These scripts also include other exploratory code.


# Reproducibility

Date this page was generated.

```{r reproducibility1, echo=FALSE}
## Date the report was generated
Sys.time()
```

Wallclock time spent generating the report.

```{r "reproducibility2", echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits=3)
```

`R` session information.

```{r "reproducibility3", echo=FALSE}
## Session info
options(width=120)
devtools::session_info()
```

You can view the source `R Markdown` file for this page at [index.Rmd](index.Rmd).

# Bibliography

This report was generated using `BiocStyle` `r citep(bib[['BiocStyle']])`
with `knitr` `r citep(bib[['knitr']])` and `rmarkdown` `r citep(bib[['rmarkdown']])` running behind the scenes.  

Citations were made with  `knitcitations` `r citep(bib[['knitcitations']])`. Citation file: [index.bib](index.bib).

```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE}
## Print bibliography
bibliography()
```