Merge pull request #14 from hsmaan/release
0.2.2
hsmaan authored May 18, 2020
2 parents 22c0663 + e667929 commit 673fe6b
Showing 11 changed files with 11,265 additions and 12,543 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -22,4 +22,4 @@
*.html
gisaid_cron.sh
slurm*

manuscript
10 changes: 8 additions & 2 deletions R/global.R
@@ -15,7 +15,7 @@ library(parallel)

# Set core usage

-cores <- round(detectCores()/1.5, 0)
+cores <- detectCores()

# Load color palettes

@@ -61,7 +61,13 @@ align_get <- function(fasta, align) {

dist_get <- function(align) {

-dec_dist <- dist.dna(as.DNAbin(align), model = "K80", as.matrix = TRUE, pairwise.deletion = FALSE)
+mask_sites <- c(187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575)
+align_mat <- as.matrix(align)
+align_mat_sub <- align_mat[, -mask_sites]
+align_mat_bin <- as.DNAbin(align_mat_sub)
+align_masked <- align_mat_bin %>% as.list %>% as.character %>% lapply(., paste0, collapse = "") %>% unlist %>% DNAStringSet
+align_trim <- subseq(align_masked, start = 265, end = 29674)
+dec_dist <- dist.dna(as.DNAbin(align_trim), model = "K80", as.matrix = TRUE, pairwise.deletion = FALSE)
colnames(dec_dist) <- (str_split_fixed(colnames(dec_dist), fixed("."), 2)[,1])
rownames(dec_dist) <- (str_split_fixed(rownames(dec_dist), fixed("."), 2)[,1])
return(dec_dist)
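
One detail in the new `dist_get` body may be worth checking. The `mask_sites` columns are dropped before `subseq(start = 265, end = 29674)` is applied, so the trim bounds index an alignment that is already 30 columns shorter. Whether that matters depends on which coordinate system the bounds were chosen in, and the diff does not say. The following is a minimal base-R sketch, not the repository's code, that assumes both the mask and the trim window are meant as positions in the original alignment and applies them as one column selection. `mask_and_trim` is a hypothetical helper, and it operates on a plain character matrix rather than a `DNAbin` object.

```r
# Hypothetical helper (not in the repository): mask and trim using positions
# expressed in the ORIGINAL alignment coordinates.
mask_and_trim <- function(align_mat, mask_sites, start, end) {
  stopifnot(is.matrix(align_mat), start >= 1, end <= ncol(align_mat), start <= end)
  # Columns inside the window that are not masked, in their original order
  keep <- setdiff(seq.int(start, end), mask_sites)
  align_mat[, keep, drop = FALSE]
}

# Toy alignment: 2 sequences, 10 columns; column names record original positions.
toy <- matrix(rep(c("a", "c", "g", "t", "a", "c", "g", "t", "a", "c"), each = 2),
              nrow = 2, dimnames = list(c("seq1", "seq2"), as.character(1:10)))

# Mask columns 2 and 5, keep the window 3..8 of the original alignment.
res <- mask_and_trim(toy, mask_sites = c(2, 5), start = 3, end = 8)
colnames(res)

# For contrast: dropping masked columns first and trimming afterwards selects a
# different window, because the column indices have shifted.
shifted <- toy[, -c(2, 5)][, 3:8]
colnames(shifted)
```

Comparing the two sets of column names on the toy input shows how far the window moves when the mask is applied first. On the real alignment the same comparison would indicate whether the trim bounds need adjusting.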
6 changes: 5 additions & 1 deletion README.Rmd
@@ -29,7 +29,7 @@ The CGT application was developed using the `shiny` R package and framework. Vis

#### Sequence and metadata retrieval

-Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis.
+Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis. Sequences are filtered for completeness (>29000 nucleotides) and high coverage (<0.5% N's). Outlier sequences are also filtered out, defined by >0.05% unique amino acid substitutions compared to all GISAID sequences. This criteria is based on the mutation rate of SARS-CoV-2 and breadth of the GISAID database.
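
The completeness and coverage thresholds described above could be expressed roughly as follows. This is a base-R sketch under stated assumptions, not the filtering code used for the application: `passes_qc` and its argument names are hypothetical, sequences are assumed to be plain character strings, and the outlier criterion based on amino acid substitutions is not covered because the text does not describe how it is computed.

```r
# Hypothetical helper: TRUE for sequences that pass both thresholds.
# Defaults mirror the README text: more than 29000 nucleotides, under 0.5% N's.
passes_qc <- function(seqs, min_len = 29000, max_n_frac = 0.005) {
  len <- nchar(seqs)
  # Count ambiguous 'N' characters per sequence (case-insensitive)
  n_count <- lengths(regmatches(seqs, gregexpr("[Nn]", seqs)))
  len > min_len & (n_count / len) < max_n_frac
}

# Toy input with a scaled-down length threshold so the example stays small.
seqs <- c(ok    = strrep("ACGT", 50),
          short = strrep("ACGT", 10),
          gappy = paste0(strrep("ACGT", 45), strrep("N", 20)))
keep <- passes_qc(seqs, min_len = 100)
seqs_filtered <- seqs[keep]
names(seqs_filtered)
```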

#### Genome sequence alignment

@@ -39,6 +39,10 @@ GISAID sequences are subset for those that have corresponding metadata. Public s

For both pre-aligned and profile aligned data, DNA distance is determined using `ape` and the Kimura-80 model of nucleotide substitution. Currently only Kimura-80 is supported, but integrating other evolutionary distance metrics will be part of a future release.
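
For readers unfamiliar with the Kimura-80 (two-parameter) model, the distance between two aligned sequences is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion. The base-R sketch below illustrates that formula on a pair of strings. It is an illustration only and not a reproduction of how `ape::dist.dna` is implemented: `k80_distance` is a hypothetical name, and sites where either sequence has a character other than A, C, G or T are simply skipped.

```r
# Hypothetical helper: Kimura two-parameter distance between two aligned sequences.
k80_distance <- function(seq1, seq2) {
  a <- strsplit(toupper(seq1), "")[[1]]
  b <- strsplit(toupper(seq2), "")[[1]]
  stopifnot(length(a) == length(b))
  # Keep only sites where both sequences carry an unambiguous base
  bases <- c("A", "C", "G", "T")
  ok <- a %in% bases & b %in% bases
  a <- a[ok]
  b <- b[ok]
  purine <- c("A", "G")
  differs <- a != b
  same_class <- (a %in% purine) == (b %in% purine)
  P <- sum(differs & same_class) / length(a)   # transitions: A<->G, C<->T
  Q <- sum(differs & !same_class) / length(a)  # transversions: purine<->pyrimidine
  -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
}

s1 <- "ACGTACGTACGTACGTACGT"
s2 <- "GAGTACGTACGTACGTACGT"  # site 1: A->G (transition), site 2: C->A (transversion)
d_same <- k80_distance(s1, s1)
d_diff <- k80_distance(s1, s2)
```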

+The following nucleotide positions are masked post-alignment when determining DNA distance due to homoplasy, as per the recommendations from [this article](http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473):
+
+**187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575**
+
User-uploaded data is aligned and the distance between each user uploaded genome and the public genomes from GISAID are calculated first. If all user-uploaded sequences are significantly similar to a publicly uploaded genome (min dist < 1e-4, a lower bound based on distances between public genomes), then the distances for the user-uploaded genomes are imputed as the most similar publicly uploaded genome for each. Otherwise, if user-uploaded sequences do not meet this criteria (min dist > 1e-4 for any user-uploaded genome), then the entire distance matrix is recomputed. This heuristic allows for significantly improved computation time for user-uploaded data, and is based on properties of SARS-CoV-2 - Betacoronaviruses have low mutation rates, and most genomes sequenced within the current pandemic will be highly similar.
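
The impute-or-recompute decision described above can be sketched as below. This is a hedged base-R illustration, not the application's code. `impute_or_recompute` is a hypothetical function. Its input is assumed to be a matrix of distances with one named row per user-uploaded genome and one named column per public genome. A minimum distance exactly equal to the threshold is treated as "recompute", a case the text leaves unspecified.

```r
# Hypothetical helper: decide whether user genomes can reuse public distances.
impute_or_recompute <- function(user_to_public, threshold = 1e-4) {
  stopifnot(is.matrix(user_to_public), !anyNA(user_to_public))
  # Smallest distance from each user genome to any public genome
  min_dist <- apply(user_to_public, 1, min)
  if (all(min_dist < threshold)) {
    # Every user genome has a close public neighbour: reuse that neighbour's distances
    nearest <- colnames(user_to_public)[apply(user_to_public, 1, which.min)]
    list(action = "impute", nearest = setNames(nearest, rownames(user_to_public)))
  } else {
    # At least one user genome is too distant: the full matrix must be recomputed
    list(action = "recompute", nearest = NULL)
  }
}

d <- matrix(c(5e-5, 3e-4, 2e-4, 2e-5), nrow = 2,
            dimnames = list(c("u1", "u2"), c("p1", "p2")))
close_case <- impute_or_recompute(d)

d_far <- d
d_far["u2", "p2"] <- 5e-4  # u2 no longer has a neighbour under the threshold
far_case <- impute_or_recompute(d_far)
```

In the first call both user genomes have a public neighbour closer than the threshold, so each would inherit that neighbour's distances. In the second call a single distant genome forces the full recomputation.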

#### Uniform manifold projection and approximation (UMAP)
16 changes: 15 additions & 1 deletion README.md
@@ -32,7 +32,12 @@ processing time (see below).
Processed fasta files and metadata of Covid-19 viral genome sequence are
retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database,
which is a public database for sharing of viral genome sequence data.
-Viral genome data and metadata are updated on a weekly basis.
+Viral genome data and metadata are updated on a weekly basis. Sequences
+are filtered for completeness (\>29000 nucleotides) and high coverage
+(\<0.5% N’s). Outlier sequences are also filtered out, defined by
+\>0.05% unique amino acid substitutions compared to all GISAID
+sequences. This criteria is based on the mutation rate of SARS-CoV-2 and
+breadth of the GISAID database.

#### Genome sequence alignment

@@ -52,6 +57,15 @@ determined using `ape` and the Kimura-80 model of nucleotide
substitution. Currently only Kimura-80 is supported, but integrating
other evolutionary distance metrics will be part of a future release.

+The following nucleotide positions are masked post-alignment when
+determining DNA distance due to homoplasy, as per the recommendations
+from [this
+article](http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473):
+
+**187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408,
+14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681,
+28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575**
+
User-uploaded data is aligned and the distance between each user
uploaded genome and the public genomes from GISAID are calculated first.
If all user-uploaded sequences are significantly similar to a publicly