---
title: "Bioinformatics Software"
author: "Brian High"
date: "February 24, 2015"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    keep_md: yes
    smaller: yes
---

## Languages, Environments and Tools

- Programming Languages
    * R: [Bioconductor](http://www.bioconductor.org/)
    * Python: [SciPy](http://www.scipy.org/),
      [Pandas](http://pandas.pydata.org/),
      [Biopython](http://biopython.org/wiki/Main_Page)
    * Perl: [BioPerl](http://www.bioperl.org/wiki/Main_Page)
    * Java: [BioJava](http://biojava.org/wiki/Main_Page)
    * Other: C, C++, etc.
- Development Environments
    * [RStudio](http://www.rstudio.com/)
    * [IPython](http://ipython.org/)
    * [BioJava3 eclipse](http://biojava.org/wiki/BioJava3_eclipse)
- Operating Environments
    * [Bio-Linux](http://environmentalomics.org/bio-linux/)
    * [bioknoppix](http://bioknoppix.hpcf.upr.edu/applications/)
- [Other](http://en.wikipedia.org/wiki/Category:Bioinformatics_software)
  [software](http://en.wikipedia.org/wiki/List_of_open-source_bioinformatics_software),
  [tools](http://www.ccmb.med.umich.edu/bioinf-core/tools),
  [websites](http://www.colorado.edu/chemistry/bioinfo/BioinformaticsLinks.htm)
  and [databases](http://www.hsls.pitt.edu/obrc/)

## Installing Software

- Free-standing ("binary") applications and utilities
    * Download from the developer (or use a package manager like
      [brew](http://brew.sh/))
    * These may be graphical or command-line programs
- Scripts and packages
    * First install the language interpreter or environment
    * Then install any additional language modules, packages, or libraries needed
    * Package managers ([biocLite](http://www.bioconductor.org/install/),
      [pip](http://en.wikipedia.org/wiki/Pip_%28package_manager%29),
      [cpan](http://www.cpan.org/modules/INSTALL.html), etc.) may install
      dependencies for you
    * You often install and run these from a command-line "shell" like
      [Bash](http://www.gnu.org/software/bash/)
- System administration issues
    * You may need administrative ("superuser") rights to install
    * You may need to move files or modify environment variables like `PATH`
    * You may need to use `git`, `svn`, or `hg` to pull from repositories
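As a minimal sketch of the `PATH` adjustment mentioned above, the helper below appends a directory to `PATH` only if it is not already listed. The function name `path_append` and the `~/biotools/bin` directory are assumptions made for illustration; they do not come from any tool's documentation.

```bash
# Append a directory to PATH unless it is already there.
# `path_append` is a hypothetical helper name used for illustration.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already on PATH: do nothing
        *) PATH="$PATH:$1" ;;      # otherwise add it at the end
    esac
    export PATH
}

# Example directory (an assumption); substitute wherever you installed tools.
path_append "$HOME/biotools/bin"

# Check whether a tool is visible to the shell.
command -v git || echo "git not found on PATH"
```

The change lasts only for the current shell session. To make it permanent, the same line would typically go in a shell startup file such as `~/.bashrc`.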

## Compiling Software

- Requirements
    * Programs written in languages like C and C++ must be compiled before use
    * If you can't download a "binary" of the program, you will have to compile it
    * Mac users will need a development environment like
      [Xcode](https://developer.apple.com/xcode/)
    * Windows users may need a [GNU](https://www.gnu.org/) environment like
      [Cygwin](https://www.cygwin.com/) or [MinGW](http://www.mingw.org/)
    * These include a compiler like [GCC](https://gcc.gnu.org/) and automation
      tools like [make](http://www.gnu.org/software/make/)
    * A package manager like [MacPorts](https://www.macports.org/) can
      automate the process
- Compilation steps are usually run from a command-line "shell" like
  [Bash](http://www.gnu.org/software/bash/)
    * Usually these are listed in a README file (text, markdown, or HTML)
    * Can be as simple as: `./configure`, `make` and `sudo make install`
- `make` is a tool commonly used to automate compilation and installation
    * `./configure` prepares the [Makefile](http://en.wikipedia.org/wiki/Makefile)
      and `make` processes it
- Tracking down and installing dependencies (libraries) may be tedious
    * Compile, fix errors, re-compile, fix errors, re-compile, etc.
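The three-step sequence above can be sketched as a small shell function. This is an illustration under stated assumptions, not a recipe for any particular package: `build_from_source`, `run`, and the `DRY_RUN` variable are hypothetical names, and a project's README always takes precedence over these generic steps.

```bash
# Sketch of the usual build sequence; `build_from_source`, `run`, and DRY_RUN
# are hypothetical names chosen for this example.
run() {
    # Print the command; execute it only when DRY_RUN is not set to 1.
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

build_from_source() {
    # $1 is the unpacked source directory.
    (
        cd "$1" || exit 1
        # Not every project ships a configure script; run it only if present.
        if [ -x ./configure ]; then
            run ./configure || exit 1
        fi
        run make || exit 1
        run sudo make install
    )
}

# Dry run: show the steps for the current directory without executing them.
DRY_RUN=1 build_from_source .
```

Running with `DRY_RUN=1` first is a cheap way to see what would happen before anything is compiled or installed with superuser rights.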

## Examples from Research Papers: #1

What software tools will you need to reproduce results from these papers?

1. Leek, J. T. & Storey, J. D. [Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161). PLoS Genet 3, e161 (2007).
2. Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. [Normalization of RNA-seq data using factor analysis of control genes or samples](http://www.nature.com/nbt/journal/v32/n9/full/nbt.2931.html). Nature Biotechnology 32, 896–902 (2014).
3. Finak, G. et al. [OpenCyto: an open source infrastructure for scalable, robust, reproducible, and automated, end-to-end flow cytometry data analysis](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148203/). PLoS Comput. Biol. 10, e1003806 (2014).
4. Zhong, Y. & Liu, Z. [Gene expression deconvolution in linear space](http://www.nature.com/nmeth/journal/v9/n1/full/nmeth.1830.html). Nat. Methods 9, 8–9; author reply 9 (2012).
5. Anders, S., Reyes, A. & Huber, W. [Detecting differential usage of exons from RNA-seq data](http://www.ncbi.nlm.nih.gov/pubmed/22722343). Genome Res. 22, 2008–2017 (2012).

```{r, eval=FALSE}
# Load the Bioconductor installer, then install the packages for the papers above
source("http://bioconductor.org/biocLite.R")
biocLite(c("sva", "RUVSeq", "openCyto", "csSAM", "DEXSeq"))
```

## Examples from Research Papers: #2

What software tools will you need to reproduce results from this paper?

- McDavid, A. et al. [Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3570210/). Bioinformatics 29, 461–467 (2013).

This uses an R package. Do we install it with `install.packages` or `biocLite`?

Neither, since the [SingleCellAssay](https://github.com/RGLab/SingleCellAssay) package is not yet in Bioconductor. Instead, the README recommends installing it from GitHub with `devtools`:

```{r, eval=FALSE}
install.packages('devtools')
library(devtools)
# Current devtools expects 'user/repo'; older releases used
# install_github('SingleCellAssay', 'RGLab')
install_github('RGLab/SingleCellAssay')
# *or* if you don't have a working LaTeX setup
install_github('RGLab/SingleCellAssay', build_vignettes=FALSE)
vignette('SingleCellAssay-intro')
```

## Examples from Research Papers: #3

What software tools will you need to reproduce results from this paper?

- Frazee, A. C., Sabunciyan, S., Hansen, K. D., Irizarry, R. A. & Leek, J. T. [Differential expression analysis of RNA-seq data at single-base resolution](http://biostatistics.oxfordjournals.org/content/early/2014/01/06/biostatistics.kxt053.full). Biostatistics kxt053 (2014). doi:10.1093/biostatistics/kxt053.

This study used a [prototype version](https://github.com/alyssafrazee/derfinder)
of an R Bioconductor [package](https://github.com/lcolladotor/derfinder). Some
[sample analysis code](https://github.com/alyssafrazee/derfinder/blob/master/analysis_code.R)
was provided. Since the sample code uses some packages no longer available in Bioconductor,
you will need R version 2.15.x (2.15.2 or later).

```{r, eval=FALSE}
# Install "beta" derfinder and dependencies
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("Genominator", "limma", "GenomicFeatures", "rtracklayer"))
install.packages(c("RSQLite.extfuns", "HiddenMarkov", "proto", "locfdr", "devtools"))
library(devtools)
install_github('derfinder', 'alyssafrazee') # beta version
library(derfinder)
```

The sample code also loads some "rda" files that may not be provided in the GitHub repo. Check for any open [issues](https://github.com/alyssafrazee/derfinder/issues). Be sure to read the README, especially the "reproducing the manuscript's results" section. You will also need [samtools](http://www.htslib.org/). The entire process takes several hours and a few GB of RAM.

## Examples from Research Papers: #4

What software tools will you need to reproduce results from this paper?

- Hansen, K. D., Langmead, B. & Irizarry, R. A. [BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491411/). Genome Biol. (2012).

This paper presents a "pipeline". How do we get it to work?

```{r, engine='bash', eval=FALSE}
mkdir -p ~/src && cd ~/src/ && export BSMOOTH_HOME=~/src/bsmooth-align
git clone https://github.com/BenLangmead/bsmooth-align.git
cd $BSMOOTH_HOME/merman/ && make
```

The build fails with several compiler errors in `merman.cpp` on
[Bio-Linux 8](http://environmentalomics.org/whats-new-in-bio-linux-8/) /
[Ubuntu 14.04 LTS](http://releases.ubuntu.com/14.04/) using GCC 4.8.2, and
also on OS X Mavericks (10.9) using Xcode with GCC 4.2.1. It did compile
correctly, however, with an older version of GCC (4.1.2) on a Red Hat Linux
5.11 system. All test systems were 64-bit.

You will also need [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
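Because the outcome of the `merman` build depended on the compiler version and platform, it helps to record both before reporting a build failure or comparing systems. The following is a small sketch; `report_toolchain` is a hypothetical helper name, and the list of tools checked is an assumption.

```bash
# Print the platform and compiler details worth quoting in a bug report.
# `report_toolchain` is a hypothetical helper name.
report_toolchain() {
    echo "OS:      $(uname -s) $(uname -r)"
    echo "Machine: $(uname -m)"        # e.g. x86_64 on a 64-bit system
    for tool in gcc g++ make; do
        if command -v "$tool" >/dev/null 2>&1; then
            # The first line of --version output usually names the version.
            echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
        else
            echo "$tool: not found"
        fi
    done
}

report_toolchain
```

Note that on OS X the `gcc` command may report a different underlying compiler than on Linux, which is one reason the same source can build on one system and fail on another.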

## Examples from Research Papers: #5

What software tools will you need to reproduce results from this paper?

- Amir, E.-A. D. et al. [viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4076922/). Nature Biotechnology 31, 545–552 (2013).

From within what environment do we use `viSNE`? How do we access it?

[viSNE](http://www.c2b2.columbia.edu/danapeerlab/html/cyt.html) runs within
[cyt](http://www.c2b2.columbia.edu/danapeerlab/html/cyt-download.html). `cyt`
[requires](http://www.c2b2.columbia.edu/danapeerlab/html/CYT/cyTutorial.ppt):

- [MATLAB](http://www.mathworks.com/products/matlab/index.html) 2010b or higher
  on Windows or Mac OS X
- [Parallel Computing Toolbox](http://www.mathworks.com/products/parallel-computing/)

For a fee, you can also
[run viSNE on Cytobank](http://blog.cytobank.org/2014/11/13/visne/) (a website).

## What about our RSEM example?

- [Using RSEM, a hands-on example](https://github.com/raphg/Biostat-578/blob/master/Using_RSEM.md)

The requirements are listed at the top of the article. How would we install them?

## What about our RSEM example?

Regarding data for the [example](https://github.com/raphg/Biostat-578/blob/master/Using_RSEM.md), Raphael posted three essential files in a Dropbox folder:

- [hg19.fa](https://www.biostars.org/p/1796/)
- [knownIsoforms](https://groups.google.com/forum/#!topic/rsem-users/oto_OJg5NcQ)
- [UCSC.gtf](https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/kyk7AAm4R-M/9LkE-CRjzioJ)

[What](http://bioinformatics.oxfordjournals.org/content/22/9/1036.full) are they? [Where](http://genome.ucsc.edu/) did they come from? Why not [just](https://github.com/raphg/Biostat-578/blob/master/using_rsem_prep_input.sh) [get](http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format) [them](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/) from [the](http://hgdownload.cse.ucsc.edu/downloads.html#human) [source](http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/)?

```{r rsem-test, engine="bash", eval=FALSE, results='hide'}
cd ./RSEM_test/Reference_Genome/
../../using_rsem_prep_input.sh
```

That [script](https://github.com/raphg/Biostat-578/blob/master/using_rsem_prep_input.sh) downloads the three files from UCSC, then does a little extra processing to extract, convert, or rename them.

The conversion of the GTF will not work on Windows, even in an environment like Cygwin, because a required tool (`genePredToGtf`) is not available there. How else can you get that file?

## Example: Trinity and RSEM Test

Assuming you have already installed
[bowtie](http://bowtie-bio.sourceforge.net/index.shtml) and
[bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml),
you can run this shell script to compile and test
[Trinity and RSEM](http://trinityrnaseq.sourceforge.net/analysis/abundance_estimation.html).

```{r trinity-test, engine="bash", eval=FALSE, results='hide'}
#!/bin/sh

# Test Trinity and RSEM

mkdir -p ~/biotools/
cd ~/biotools/

git clone 'https://github.com/bli25wisc/RSEM.git'
cd ./RSEM/
make
make ebseq
export PATH=$PATH:~/biotools/RSEM
cd ../

git clone 'https://github.com/trinityrnaseq/trinityrnaseq.git'
cd ./trinityrnaseq/
make clean
make
cd ./sample_data/test_Trinity_Assembly
./runMe.sh
```

You should see a lot of verbose output. Did the test run okay?

## Example: Bowtie, RSEM, and Detonate

We have another [example script](https://github.com/raphg/Biostat-578/blob/master/detonate_test.sh) which tests:

- [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
- [RSEM](https://github.com/bli25wisc/RSEM)
- [Detonate](https://github.com/deweylab/detonate)

Detonate requires [Blat](https://genome.ucsc.edu/FAQ/FAQblat.html).

The script assumes bowtie is already installed. The rest are downloaded and
compiled. In each case, the compile command is simply `make`.
Read the script's comments to learn about other dependencies for compiling.

```{r detonate-test, engine="bash", eval=FALSE, results='hide'}
./detonate_test.sh
```
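Since the script assumes some tools are already present, a quick pre-flight check can save a failed run. The sketch below lists which commands are missing from `PATH`. The helper name `missing_tools` is hypothetical, and the command names checked (`bowtie`, `blat`, `make`, `git`) are assumptions based on the text above rather than an authoritative list of what `detonate_test.sh` needs.

```bash
# List which of the named commands are not available on PATH.
# `missing_tools` is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names assumed here: bowtie, blat, make, and git.
missing=$(missing_tools bowtie blat make git)
if [ -n "$missing" ]; then
    echo "Install these before running the script:"
    echo "$missing"
else
    echo "All prerequisites found."
fi
```

A check like this only confirms that a command exists, not that its version is suitable, so the script's own comments remain the place to look for detailed requirements.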

## Example: Rsubread, limma, and edgeR

For a case study of using Rsubread, limma, and edgeR in a Bioconductor R
pipeline to analyze RNA-seq data, see: [rsubread_test.md](https://github.com/raphg/Biostat-578/blob/master/rsubread_test.md), based on the work of:

Wei Shi (shi at wehi dot edu dot au), Yang Liao and Gordon K Smyth
Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia

- [Case](http://bioinf.wehi.edu.au/RNAseqCaseStudy/)
- [Code](http://bioinf.wehi.edu.au/RNAseqCaseStudy/code.R)
- [Data](http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz)

Requirements:

- The Rsubread package should be version 1.12.1 or later.
- You should run R version 3.0.2 or later.
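Minimum-version requirements like these can be checked from the shell before starting. The sketch below compares dotted version strings with `sort -V`. The helper name `version_ge` is hypothetical, and the parsing of `R --version` assumes the first line has the form `R version X.Y.Z ...`, which may not hold for development builds.

```bash
# Return success if version $1 is at least version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
# `version_ge` is a hypothetical helper name.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="3.0.2"
if command -v R >/dev/null 2>&1; then
    # Assumes the first line looks like "R version 3.1.2 ..."
    installed=$(R --version | head -n 1 | awk '{print $3}')
    if version_ge "$installed" "$required"; then
        echo "R $installed is new enough"
    else
        echo "R $installed is older than $required"
    fi
else
    echo "R not found on PATH"
fi
```

A plain string comparison would get cases like `3.10.0` versus `3.9.1` wrong, which is why version sort is used here. The Rsubread version itself would need to be checked from within R.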