Merge pull request #14 from hsmaan/release

0.2.2
hsmaan · May 18, 2020 · 673fe6b
2 parents 22c0663 + e667929
commit 673fe6b
Showing 11 changed files with 11,265 additions and 12,543 deletions.
diff --git a/.gitignore b/.gitignore
@@ -22,4 +22,4 @@
 *.html
 gisaid_cron.sh
 slurm*
-
+manuscript
diff --git a/R/global.R b/R/global.R
@@ -15,7 +15,7 @@ library(parallel)
 
 # Set core usage
 
-cores <- round(detectCores()/1.5, 0)
+cores <- detectCores()
 
 # Load color palettes    
 
@@ -61,7 +61,13 @@ align_get <- function(fasta, align) {
 
 dist_get <- function(align) {
 
-  dec_dist <- dist.dna(as.DNAbin(align), model = "K80", as.matrix = TRUE, pairwise.deletion = FALSE)
+  mask_sites <- c(187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575)
+  align_mat <- as.matrix(align)
+  align_mat_sub <- align_mat[, -mask_sites]
+  align_mat_bin <- as.DNAbin(align_mat_sub)
+  align_masked <- align_mat_bin %>% as.list %>% as.character %>% lapply(., paste0, collapse = "") %>% unlist %>% DNAStringSet
+  align_trim <- subseq(align_masked, start = 265, end = 29674)
+  dec_dist <- dist.dna(as.DNAbin(align_trim), model = "K80", as.matrix = TRUE, pairwise.deletion = FALSE)
   colnames(dec_dist) <- (str_split_fixed(colnames(dec_dist), fixed("."), 2)[,1])
   rownames(dec_dist) <- (str_split_fixed(rownames(dec_dist), fixed("."), 2)[,1])
   return(dec_dist)
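
Note on the hunk above: the masked columns are dropped before `subseq(start = 265, end = 29674)` is applied, so the trim boundaries index the already shortened alignment. If the mask positions and the trim boundaries are meant to share one coordinate system (an assumption; the diff does not say), one option is to select the kept columns in a single step. A minimal base-R sketch follows; `mask_and_trim` and the toy matrix are hypothetical and not part of the repository.

```r
# Hypothetical helper: mask and trim an alignment stored as a character matrix
# (rows = sequences, columns = alignment positions). All positions are given
# in the coordinates of the original, unmasked alignment.
mask_and_trim <- function(align_mat, mask_sites, start, end) {
  stopifnot(is.matrix(align_mat), start >= 1, end <= ncol(align_mat), start <= end)
  # Keep the columns inside [start, end] that are not masked.
  keep <- setdiff(seq(start, end), mask_sites)
  align_mat[, keep, drop = FALSE]
}

# Toy alignment: two sequences of length 10 that differ only at position 5.
toy <- matrix(c("a", "c", "g", "t", "a", "c", "g", "t", "a", "c",
                "a", "c", "g", "t", "t", "c", "g", "t", "a", "c"),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("seq1", "seq2"), NULL))

# Mask positions 2 and 5, keep positions 3 to 8.
trimmed <- mask_and_trim(toy, mask_sites = c(2, 5), start = 3, end = 8)
```

Because position 5 is masked, the two toy sequences are identical in `trimmed`. A matrix like this could then be passed to `as.DNAbin()` before `dist.dna()`, as the committed code does.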

diff --git a/README.Rmd b/README.Rmd
@@ -29,7 +29,7 @@ The CGT application was developed using the `shiny` R package and framework. Vis
 
 #### Sequence and metadata retrieval 
 
-Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the  [GISAID](https://www.gisaid.org/) EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis.
+Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the  [GISAID](https://www.gisaid.org/) EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis. Sequences are filtered for completeness (>29000 nucleotides) and high coverage (<0.5% N's). Outlier sequences are also filtered out, defined by >0.05% unique amino acid substitutions compared to all GISAID sequences. This criteria is based on the mutation rate of SARS-CoV-2 and breadth of the GISAID database. 
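
Note on the added README text: the length and coverage filters it describes could be sketched as below. This is an illustration only. `passes_qc` is a hypothetical name, the N fraction is assumed to be taken relative to each sequence's own length, and the outlier filter on amino acid substitutions is not covered.

```r
# Hypothetical helper: TRUE for sequences that are longer than `min_len`
# and whose fraction of N characters is below `max_n_frac`.
# `seqs` is a character vector with one genome per element.
passes_qc <- function(seqs, min_len = 29000, max_n_frac = 0.005) {
  len <- nchar(seqs)
  # Count ambiguous bases (N or n) in each sequence.
  n_count <- lengths(regmatches(seqs, gregexpr("[Nn]", seqs)))
  len > min_len & (n_count / len) < max_n_frac
}

# Toy example with small thresholds so the strings stay readable.
toy_seqs <- c(ok    = strrep("ACGT", 5),
              short = "ACGT",
              gappy = paste0(strrep("ACGT", 4), "NNNN"))
qc <- passes_qc(toy_seqs, min_len = 10, max_n_frac = 0.1)
```

In the toy call only `ok` passes: `short` fails the length check and `gappy` has 20% N characters.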
 
 #### Genome sequence alignment
 
@@ -39,6 +39,10 @@ GISAID sequences are subset for those that have corresponding metadata. Public s
 
 For both pre-aligned and profile aligned data, DNA distance is determined using `ape` and the Kimura-80 model of nucleotide substitution. Currently only Kimura-80 is supported, but integrating other evolutionary distance metrics will be part of a future release.
 
+The following nucleotide positions are masked post-alignment when determining DNA distance due to homoplasy, as per the recommendations from [this article](http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473):
+
+**187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575**
+
 User-uploaded data is aligned and the distance between each user uploaded genome and the public genomes from GISAID are calculated first. If all user-uploaded sequences are significantly similar to a publicly uploaded genome (min dist < 1e-4, a lower bound based on distances between public genomes), then the distances for the user-uploaded genomes are imputed as the most similar publicly uploaded genome for each. Otherwise, if user-uploaded sequences do not meet this criteria (min dist > 1e-4 for any user-uploaded genome), then the entire distance matrix is recomputed. This heuristic allows for significantly improved computation time for user-uploaded data, and is based on properties of SARS-CoV-2 - Betacoronaviruses have low mutation rates, and most genomes sequenced within the current pandemic will be highly similar. 
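
Note on the paragraph above: the imputation heuristic it describes could look roughly like the following. This is a sketch under stated assumptions, not the application's code. `impute_or_recompute` is a hypothetical name, both inputs are assumed to be named numeric matrices, and a minimum distance exactly equal to the threshold is treated as "recompute" because the README does not specify that case.

```r
# Hypothetical helper.
# user_to_public: distances, one row per user genome, one column per public genome.
# public_dist:    pairwise distances between public genomes (named rows and columns).
# Returns imputed rows for the user genomes, or NULL to signal that the
# entire distance matrix should be recomputed.
impute_or_recompute <- function(user_to_public, public_dist, threshold = 1e-4) {
  min_dist <- apply(user_to_public, 1, min)
  if (any(min_dist >= threshold)) {
    return(NULL)  # at least one user genome is not close enough to any public genome
  }
  # For each user genome, copy the distance row of its nearest public genome.
  nearest <- colnames(user_to_public)[apply(user_to_public, 1, which.min)]
  imputed <- public_dist[nearest, , drop = FALSE]
  rownames(imputed) <- rownames(user_to_public)
  imputed
}

# Toy data: two public genomes and two separate uploads.
public_dist <- matrix(c(0, 0.002, 0.002, 0), nrow = 2,
                      dimnames = list(c("p1", "p2"), c("p1", "p2")))
user_close <- matrix(c(0.00005, 0.002), nrow = 1,
                     dimnames = list("u1", c("p1", "p2")))
user_far <- matrix(c(0.001, 0.003), nrow = 1,
                   dimnames = list("u2", c("p1", "p2")))

close_result <- impute_or_recompute(user_close, public_dist)  # rows copied from p1
far_result <- impute_or_recompute(user_far, public_dist)      # NULL, so recompute
```

Returning `NULL` keeps the decision to recompute with the caller, which is where the expensive `dist.dna()` call would sit.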
 
 #### Uniform manifold projection and approximation (UMAP)

diff --git a/README.md b/README.md
@@ -32,7 +32,12 @@ processing time (see below).
 Processed fasta files and metadata of Covid-19 viral genome sequence are
 retrieved from the [GISAID](https://www.gisaid.org/) EpiCoV database,
 which is a public database for sharing of viral genome sequence data.
-Viral genome data and metadata are updated on a weekly basis.
+Viral genome data and metadata are updated on a weekly basis. Sequences
+are filtered for completeness (\>29000 nucleotides) and high coverage
+(\<0.5% N’s). Outlier sequences are also filtered out, defined by
+\>0.05% unique amino acid substitutions compared to all GISAID
+sequences. This criteria is based on the mutation rate of SARS-CoV-2 and
+breadth of the GISAID database.
 
 #### Genome sequence alignment
 
@@ -52,6 +57,15 @@ determined using `ape` and the Kimura-80 model of nucleotide
 substitution. Currently only Kimura-80 is supported, but integrating
 other evolutionary distance metrics will be part of a future release.
 
+The following nucleotide positions are masked post-alignment when
+determining DNA distance due to homoplasy, as per the recommendations
+from [this
+article](http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473):
+
+**187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408,
+14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681,
+28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575**
+
 User-uploaded data is aligned and the distance between each user
 uploaded genome and the public genomes from GISAID are calculated first.
 If all user-uploaded sequences are significantly similar to a publicly