Commit e0964c0

Merge pull request #9 from hsmaan/release
v0.2 update
2 parents 493ab4a + eba1630 commit e0964c0

6 files changed, 93 additions and 39 deletions

R/global.R

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 library(Biostrings)
 library(DECIPHER)
-library(stringi)
 library(stringr)
 library(ape)
 library(igraph)

README.Rmd

Lines changed: 32 additions & 8 deletions
@@ -19,7 +19,7 @@ knitr::opts_chunk$set(

 ## Overview

-[The Covid-19 Genotyping Tool](https://hsmaan.shinyapps.io/CovidGenotyper/) (CGT) is an R-Shiny based web application that allows researchers to upload fasta sequences of Covid-19 viral genomes and compare with public sequence data available on [GISAID](https://www.gisaid.org/). Genomic distance is visualized using manifold projection and network analysis, and genotype information with respective to high-prevalence SNPs is determined.
+[The Covid-19 Genotyping Tool](covidgenotyper.app) (CGT) is an R-Shiny based web application that allows researchers to upload fasta sequences of Covid-19 viral genomes and compare with public sequence data available on [GISAID](https://www.gisaid.org/). Genomic distance is visualized using manifold projection and network analysis, and genotype information with respective to high-prevalence SNPs is determined.

 ## Details and methodology

@@ -29,11 +29,11 @@ The CGT application was developed using the `shiny` R package and framework. Vis

 #### Sequence and metadata retrieval

-Processed fasta files of Covid-19 viral genome sequence are retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database, which is a public database for sharing of viral genome sequence data. Metadata for GISAID viral genomes are obtained from [nextstrain's ncov build](https://github.com/nextstrain/ncov/blob/master/data/metadata.tsv). Viral genome data and metadata are updated on a weekly basis.
+Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis.

 #### Genome sequence alignment

-GISAID sequences are subset for those that have metadata from nextstrain. Public sequencing data is pre-aligned before being uploaded to the server. Fasta sequences are read and written using the `Biostrings` package. Gap removal and multiple-sequence alignment is performed using `DECIPHER`. Post alignment processing is done using `ape`. User uploaded fasta sequences are processed similarly, with the exception of complete alignment - the user sequence is aligned to the pre-aligned public data profile using `AlignProfiles` from `DECIPHER`.
+GISAID sequences are subset for those that have corresponding metadata. Public sequencing data is pre-aligned before being uploaded to the server. Fasta sequences are read and written using the `Biostrings` package. Gap removal and multiple-sequence alignment is performed using `DECIPHER`. Post alignment processing is done using `ape`. User uploaded fasta sequences are processed similarly, with the exception of complete alignment - the user sequence is aligned to the pre-aligned public data profile using `AlignProfiles` from `DECIPHER`.

 #### DNA distance

@@ -66,7 +66,7 @@ Genotype profiles of viral genomes are determined using high prevalence non-syno

 CGT can also be installed locally. Application deployment has currently only been tested on Linux systems including Ubuntu 18.04 LTS and Debian 9.0 LTS, thus we only provide installation instructions for Debian/Ubuntu systems.

-##### 1) Installing CGT dependencies
+#### 1) Installing CGT dependencies

 Clone the repository locally <br/>
 `git clone https://github.com/hsmaan/CovidGenotyper`
@@ -94,9 +94,9 @@ cd CovidGenotyper/bin
 Rscript --verbose packages_install.R
 ```

-##### 2) Run preprocessing scripts
+#### 2) Run preprocessing scripts

-CGT relies on pre-processing plot data prior to deployment to ensure visualizations can be loaded quickly. Fasta sequences should be downloaded from [GISAID's EpiCoV database](https://www.gisaid.org/) and saved as `gisaid_cov2020_sequences_[mmm_dd].fasta` in the `data` folder. Metadata from [nextstrain's ncov repository](https://github.com/nextstrain/ncov) should be saved as `gisaid_metadata_[mmm_dd].tsv`, also in the `data` folder.
+CGT relies on pre-processing plot data prior to deployment to ensure visualizations can be loaded quickly. Fasta sequences should be downloaded from [GISAID's EpiCoV database](https://www.gisaid.org/) and saved as `gisaid_cov2020_sequences_[mmm_dd].fasta` in the `data` folder. Metadata from GISAID should be saved as `gisaid_metadata_[mmm_dd].tsv`, also in the `data` folder.

 The order for processing scripts is the following:

@@ -109,7 +109,7 @@ Rscript --verbose maf_sites_out.R
 Rscript --verbose preprocess_plot_data.R
 ```

-##### 3) Deploy CGT
+#### 3) Deploy CGT

 Now that the shiny application dependencies have been installed and data has been preloaded, the shiny app can be deployed in a variety of ways, documented [here](https://shiny.rstudio.com/deploy/).

@@ -146,6 +146,17 @@ docker run --rm cgt/app
 * ggplot2 v3.3.0
 * ggnetwork v0.5.8
 * plotly v4.9.2.1
+* Cairo v1.5.11
+* intergraph v2.0.2
+* tidyverse v1.3.0
+* data.table v1.12.8
+* stringr v1.4.0
+* reshape2 v1.4.3
+* dplyr v0.8.5
+* parallel v3.6.3
+* ggthemes v4.2.0
+* RColorBrewer v1.1.2
+* GenomicRanges v1.38.0

 #### Command-line tools

@@ -155,7 +166,6 @@ docker run --rm cgt/app

 ## References

-* Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, et al. NextStrain: Real-time tracking of pathogen evolution. Bioinformatics. 2018;34(23):4121–3.
 * Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Challenges. 2017;1(1):33–46.
 * Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J. Shiny: Web Application Framework for R. R package version 1.4.0.2. 2020. Available from: https://CRAN.R-project.org/package=shiny
 * Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
@@ -173,6 +183,20 @@ and shiny. Chapman and Hall/CRC Florida, 2020.
 * Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb genomics. 2016;2(4):e000056.
 * Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012.
 * Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.
+* Simon Urbanek and Jeffrey Horner (2020). Cairo: R Graphics Device using Cairo Graphics
+Library for Creating High-Quality Bitmap (PNG, JPEG, TIFF), Vector (PDF, SVG,
+PostScript) and Display (X11 and Win32) Output. R package version 1.5-11.
+https://CRAN.R-project.org/package=Cairo
+* Bojanowski, Michal (2015) intergraph: Coercion Routines for Network Data Objects. R package version 2.0-2. http://mbojan.github.io/intergraph
+* Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
+* Matt Dowle and Arun Srinivasan (2019). data.table: Extension of `data.frame`. R package version 1.12.8. https://CRAN.R-project.org/package=data.table
+* Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
+* Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
+* Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.5. https://CRAN.R-project.org/package=dplyr
+* R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
+* Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for 'ggplot2'. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes
+* Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
+* Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118

 ## License

README.md

Lines changed: 61 additions & 24 deletions
@@ -5,8 +5,7 @@

 ## Overview

-[The Covid-19 Genotyping
-Tool](https://hsmaan.shinyapps.io/CovidGenotyper/) (CGT) is an R-Shiny
+[The Covid-19 Genotyping Tool](covidgenotyper.app) (CGT) is an R-Shiny
 based web application that allows researchers to upload fasta sequences
 of Covid-19 viral genomes and compare with public sequence data
 available on [GISAID](https://www.gisaid.org/). Genomic distance is
@@ -30,23 +29,21 @@ processing time (see below).

 #### Sequence and metadata retrieval

-Processed fasta files of Covid-19 viral genome sequence are retrieved
-from the [GISAID](https://www.gisaid.org/) EpiCoV database, which is a
-public database for sharing of viral genome sequence data. Metadata for
-GISAID viral genomes are obtained from [nextstrain’s ncov
-build](https://github.com/nextstrain/ncov/blob/master/data/metadata.tsv).
+Processed fasta files and metadata of Covid-19 viral genome sequence are
+retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database,
+which is a public database for sharing of viral genome sequence data.
 Viral genome data and metadata are updated on a weekly basis.

 #### Genome sequence alignment

-GISAID sequences are subset for those that have metadata from
-nextstrain. Public sequencing data is pre-aligned before being uploaded
-to the server. Fasta sequences are read and written using the
-`Biostrings` package. Gap removal and multiple-sequence alignment is
-performed using `DECIPHER`. Post alignment processing is done using
-`ape`. User uploaded fasta sequences are processed similarly, with the
-exception of complete alignment - the user sequence is aligned to the
-pre-aligned public data profile using `AlignProfiles` from `DECIPHER`.
+GISAID sequences are subset for those that have corresponding metadata.
+Public sequencing data is pre-aligned before being uploaded to the
+server. Fasta sequences are read and written using the `Biostrings`
+package. Gap removal and multiple-sequence alignment is performed using
+`DECIPHER`. Post alignment processing is done using `ape`. User uploaded
+fasta sequences are processed similarly, with the exception of complete
+alignment - the user sequence is aligned to the pre-aligned public data
+profile using `AlignProfiles` from `DECIPHER`.

 #### DNA distance

@@ -117,7 +114,7 @@ only been tested on Linux systems including Ubuntu 18.04 LTS and Debian
 9.0 LTS, thus we only provide installation instructions for
 Debian/Ubuntu systems.

-##### 1\) Installing CGT dependencies
+#### 1\) Installing CGT dependencies

 Clone the repository locally <br/> `git clone
 https://github.com/hsmaan/CovidGenotyper`
@@ -149,15 +146,14 @@ library depdendencies - check Rscript output and install <br/>
 cd CovidGenotyper/bin
 Rscript --verbose packages_install.R

-##### 2\) Run preprocessing scripts
+#### 2\) Run preprocessing scripts

 CGT relies on pre-processing plot data prior to deployment to ensure
 visualizations can be loaded quickly. Fasta sequences should be
 downloaded from [GISAID’s EpiCoV database](https://www.gisaid.org/) and
 saved as `gisaid_cov2020_sequences_[mmm_dd].fasta` in the `data` folder.
-Metadata from [nextstrain’s ncov
-repository](https://github.com/nextstrain/ncov) should be saved as
-`gisaid_metadata_[mmm_dd].tsv`, also in the `data` folder.
+Metadata from GISAID should be saved as `gisaid_metadata_[mmm_dd].tsv`,
+also in the `data` folder.

 The order for processing scripts is the following:

@@ -168,7 +164,7 @@ The order for processing scripts is the following:
 Rscript --verbose maf_sites_out.R
 Rscript --verbose preprocess_plot_data.R

-##### 3\) Deploy CGT
+#### 3\) Deploy CGT

 Now that the shiny application dependencies have been installed and data
 has been preloaded, the shiny app can be deployed in a variety of ways,
@@ -212,6 +208,17 @@ following <br/>
 - ggplot2 v3.3.0
 - ggnetwork v0.5.8
 - plotly v4.9.2.1
+- Cairo v1.5.11
+- intergraph v2.0.2
+- tidyverse v1.3.0
+- data.table v1.12.8
+- stringr v1.4.0
+- reshape2 v1.4.3
+- dplyr v0.8.5
+- parallel v3.6.3
+- ggthemes v4.2.0
+- RColorBrewer v1.1.2
+- GenomicRanges v1.38.0

 #### Command-line tools

@@ -221,9 +228,6 @@ following <br/>

 ## References

-- Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C,
-et al. NextStrain: Real-time tracking of pathogen evolution.
-Bioinformatics. 2018;34(23):4121–3.
 - Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s
 innovative contribution to global health. Glob Challenges.
 2017;1(1):33–46.
@@ -266,6 +270,39 @@ following <br/>
 - Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et
 al. The variant call format and VCFtools. Bioinformatics.
 2011;27(15):2156–8.
+- Simon Urbanek and Jeffrey Horner (2020). Cairo: R Graphics Device
+using Cairo Graphics Library for Creating High-Quality Bitmap (PNG,
+JPEG, TIFF), Vector (PDF, SVG, PostScript) and Display (X11 and
+Win32) Output. R package version 1.5-11.
+<https://CRAN.R-project.org/package=Cairo>
+- Bojanowski, Michal (2015) intergraph: Coercion Routines for Network
+Data Objects. R package version 2.0-2.
+<http://mbojan.github.io/intergraph>
+- Wickham et al., (2019). Welcome to the tidyverse. Journal of Open
+Source Software, 4(43), 1686, <https://doi.org/10.21105/joss.01686>
+- Matt Dowle and Arun Srinivasan (2019). data.table: Extension of
+`data.frame`. R package version 1.12.8.
+<https://CRAN.R-project.org/package=data.table>
+- Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for
+Common String Operations. R package version 1.4.0.
+<https://CRAN.R-project.org/package=stringr>
+- Hadley Wickham (2007). Reshaping Data with the reshape Package.
+Journal of Statistical Software, 21(12), 1-20. URL
+<http://www.jstatsoft.org/v21/i12/>.
+- Hadley Wickham, Romain François, Lionel Henry and Kirill Müller
+(2020). dplyr: A Grammar of Data Manipulation. R package version
+0.8.5. <https://CRAN.R-project.org/package=dplyr>
+- R Core Team (2020). R: A language and environment for statistical
+computing. R Foundation for Statistical Computing, Vienna, Austria.
+URL <https://www.R-project.org/>.
+- Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms
+for ‘ggplot2’. R package version 4.2.0.
+<https://CRAN.R-project.org/package=ggthemes>
+- Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package
+version 1.1-2. <https://CRAN.R-project.org/package=RColorBrewer>
+- Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, et al. (2013)
+Software for Computing and Annotating Genomic Ranges. PLoS Comput
+Biol 9(8): e1003118. <doi:10.1371/journal.pcbi.1003118>

 ## License

bin/gisaid_sequence_process.R

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@ library(ape)
 library(Biostrings)
 library(stringr)
 library(DECIPHER)
-library(stringi)
 library(data.table)
 library(tidyverse)
87

bin/maf_sites_out.R

Lines changed: 0 additions & 1 deletion
@@ -3,7 +3,6 @@ library(tidyverse)
 library(Biostrings)
 library(ape)
 library(GenomicRanges)
-library(GenomicFeatures)
 library(stringr)

 # Load data

bin/packages_install.R

Lines changed: 0 additions & 4 deletions
@@ -7,7 +7,6 @@ install.packages("shinyWidgets")
 install.packages("Cairo")
 install.packages("intergraph")
 install.packages("tidyverse")
-install.packages("tidyverse")
 install.packages("data.table")
 install.packages("stringr")
 install.packages("reshape2")
@@ -19,11 +18,8 @@ install.packages("ggnetwork")
 install.packages("RColorBrewer")
 install.packages("uwot")
 install.packages("igraph")
-install.packages("stringi")
-install.packages("igraph")
 install.packages("ape")
 BiocManager::install("GenomicRanges")
-BiocManager::install("GenomicFeatures")
 BiocManager::install("Biostrings")
 BiocManager::install("DECIPHER")
2925
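Among other removals, this hunk drops repeated install calls for `tidyverse` and `igraph` from `bin/packages_install.R`. A minimal sketch of how such repeats could be spotted before committing, assuming one install call per line and double-quoted package names; the function name `find_duplicate_installs` is hypothetical and not part of the repository:

```shell
#!/bin/sh
# Hypothetical helper (not part of CovidGenotyper): print every package name
# that appears in more than one install call in the given R script.
# Assumes calls look like install.packages("pkg") or BiocManager::install("pkg").
find_duplicate_installs() {
    # 1. keep only the install calls, 2. strip everything but the package name,
    # 3. sort so repeats are adjacent, 4. print names that occur more than once.
    grep -oE '(install\.packages|BiocManager::install)\("[^"]+"\)' "$1" \
        | sed -E 's/.*\("([^"]+)"\)/\1/' \
        | sort \
        | uniq -d
}

# Example call, run from the repository root:
#   find_duplicate_installs bin/packages_install.R
```

An empty result means no package is installed twice by name; the check does not detect packages that are installed but no longer loaded by any script.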
