duotang is a collaborative effort involving members of CoVaRR-Net Pillar 6 (computational biology and modelling), but we welcome contributions from the SARS-CoV-2 research community as pull requests from a fork. All updates, modifications and feature additions to this project are subject to review by our partners in public health. You should also review the Canadian VirusSeq Data Portal data usage policy.
Resources (metadata, trees, mutation frequency tables) can be updated with new data releases from VirusSeq using the workflow described below.
The following dependencies should be installed system-wide (or at user level) if Conda is not being used. These must be discoverable in $PATH.
- Python 3.9+
- R 4.0.2+
- minimap2
- IQTREE2 - COVID-19 release
- Firefox [Optional] Required for getting the GSD data.
- GitHub CLI [Optional] Required for automatic updates.

The following dependencies are best installed via pip in a virtual environment if Conda is not being used. If installed at the system level, they must be discoverable in $PATH.
- BioPython
- dnastack
- Pandas
- Selenium [Optional] Required for getting the GSD data.
- Webdriver-Manager [Optional] Required for getting the GSD data.
- pycryptodome=3.16 (note: ONLY version 3.16 works)
- PBKDF2
- TreeTime
- Slack-Bolt [Optional] Required for slack integration.
A virtual environment can be created via `python -m venv /path/to/duotang/venv`. The packages above can then be installed with `pip install`.
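For instance, a minimal setup might look like the sketch below. Note that the exact PyPI package names are assumptions and may differ from the names listed above (e.g., TreeTime is published on PyPI as phylo-treetime, and the DNAstack client as dnastack-client-library):

```bash
# Create and activate a virtual environment for duotang
python -m venv /path/to/duotang/venv
source /path/to/duotang/venv/bin/activate

# Core Python dependencies (PyPI names are assumptions; adjust as needed)
pip install biopython pandas phylo-treetime pbkdf2 pycryptodome==3.16 slack-bolt dnastack-client-library

# Optional: only needed for retrieving the GSD data
pip install selenium webdriver-manager
```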
Alternatively, these dependencies can be installed with Conda from the environment file at the root of this repo: `conda env create -f conda-env.yaml`
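For example (the environment name `duotang` is an assumption; use whatever name the environment file defines):

```bash
# Create the Conda environment from the file at the repo root, then activate it
conda env create -f conda-env.yaml
conda activate duotang   # assumes the environment is named "duotang"
```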
The update script `update.sh` is available at the root of this repo. It attempts to automate the data download, data processing, and knitting steps of building CoVaRR-Net Duotang.
The Conda environment can be created using the environment file at the root of this repo. If a `duotang` Conda environment is available, run the following command from the root directory:
./update.sh
If Conda is not available but the dependencies are installed in a Python virtual environment, run the script with the `--noconda` flag and specify the venv the script should load via `--venvpath`:
./update.sh --noconda --venvpath /path/to/duotang/venv
If all dependencies are installed system-wide, use the `--noconda` flag only. Note: in this mode the script may throw confusing errors if a dependency is missing.
./update.sh --noconda
To download external data only (e.g., FASTA, metadata), use the `--downloadonly` flag. The dependency notes above still apply.
./update.sh --downloadonly [--noconda --venvpath /path/to/venv]
First we download new data:
./update.sh --downloadonly [--noconda --venvpath /path/to/venv]
Then, skip ahead to the step in which we knit Duotang:
./update.sh --gotostep knitduotang [--noconda --venvpath /path/to/venv]
Arguments can also be provided for custom build functions:
- `-d|--date`: String in the format "YYYY-MM-DD". This is the datestamp used throughout the build process (default: $CurrentUTCDate).
- `-s|--source`: String. Either `viralai` or `virusseq`; this is the genomic data source used (default: viralai).
- `-o|--outdir`: String. The output directory for everything except the HTML files (default: ./data_needed).
- `-f|--scriptdir`: String. The ABSOLUTE path to the scripts directory (default: ${PWD}/scripts).
- `--overwrite`: Flag to discard current checkpoints and restart the update from the beginning.
- `--downloadonly`: Flag to download data only. The script exits once all external resources have been downloaded.
- `--noconda`: Flag to run the update script without Conda. Note: the dependencies must exist in $PATH; the script makes no attempt to ensure that they do.
- `--venvpath`: String. The ABSOLUTE path to the venv containing the dependencies. Should be used with `--noconda`.
- `--includegsd`: Flag to include the GSD metadata download. Requires GSD secrets located in .secret/.
- `--liststeps`: Prints the available checkpoint steps in this script. Use this to find values for the `--gotostep` argument.
- `--gotostep`: Jumps to a checkpoint step in the script; specify it as '#StepName:'. You must include the # at the beginning and the : at the end. Use `--liststeps` to see all available checkpoints.
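For example, a build pinned to a specific date and data source (the values below are illustrative) could be invoked as:

```bash
./update.sh --date 2022-03-16 --source virusseq --outdir ./data_needed --scriptdir ${PWD}/scripts
```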
First, modify the duotang folder paths in the `scripts/UpdateMonitor.service` file.
Insert the following into your crontab:
30 * * * * /usr/bin/flock -n ~/duotang/updatemonitor.log -c "/bin/bash ~/duotang/scripts/UpdateMonitor.service > ~/duotang/updatemonitor.log 2>&1"
This will push changes into the `dev` branch. Once the changes are approved, open a PR from this branch into `main`.
(NOTE) Because `crond` uses a minimal environment, you may need to define the PATH variable in order for your programs to be found. For example, you can insert the following into your crontab before the line above:
PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
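Putting the two pieces together, a complete crontab entry might look like this (paths are illustrative and should match your checkout):

```
PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
30 * * * * /usr/bin/flock -n ~/duotang/updatemonitor.log -c "/bin/bash ~/duotang/scripts/UpdateMonitor.service > ~/duotang/updatemonitor.log 2>&1"
```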
Note: `$datetime` is a placeholder for the date and time associated with downloading VirusSeq data, e.g., `2022-03-16`.
Use the `update.sh` script located at the root of the repo. When run in download-only mode, it downloads all relevant metadata required to build Duotang.
./update.sh --downloadonly [--noconda --venvpath /path/to/venv] --date $datetime
(The following instructions still work but will be deprecated soon.)
Command | Description | Outputs | Expected time |
---|---|---|---|
`sh data_needed/download.sh <ViralAi>` | Download the data release from VirusSeq (or ViralAI if the argument is provided), separate and re-compress it, also download the ncov data from ViralAI, and add Pango designations | `ncov-open.$datestamp.fasta.xz`, `viralai.$datestamp.withalias.csv`, `virusseq.$datestamp.fasta.xz`, `ncov-open.$datestamp.withalias.tsv.gz`, `virusseq.$datestamp.metadata.tsv.gz` | ~20 minutes |
`datestamp=$(ls data_needed/virusseq.*fasta.xz \| tail -1 \| cut -d. -f2)` | Set the datestamp variable | | 1 second |
`python3 scripts/alignment.py data_needed/virusseq.$datestamp.fasta.xz data_needed/virusseq.metadata.csv.gz data_needed/ --samplenum 3 --reffile resources/NC_045512.fa` | Downsample genomes, use minimap2 to align them pairwise to the reference, and write the result to FASTA | `sample1.fasta`, `sample2.fasta`, `sample3.fasta` | ~2 minutes |
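As a sanity check of the `datestamp` extraction: the file names are dot-delimited, so field 2 is the date. A quick illustration (the file name below is an example):

```bash
# "virusseq.2022-03-16.fasta.xz" split on "." -> field 2 is the datestamp
echo "virusseq.2022-03-16.fasta.xz" | cut -d. -f2   # prints 2022-03-16
```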
The extractSequences.py script is located at `scripts/extractSequences.py`. It lets you use regular expressions to remove and/or keep sequences with a specific lineage or ID, and should be run from the root of the repo. An example use case is removing recombinants before constructing phylogenetic trees.
python scripts/extractSequences.py --infile /path/to/fasta --metadata /path/to/metadata --outfile /dir/of/output [--extractregex '^SomeRegex$' --keepregex '^SomeRegex1$' --keepregex '^SomeRegex2$' --extractbyid]
This script will then output at least 5 files:
- `Sequences_remainder.fasta.xz`: FASTA file containing all sequences not matching the `--extractregex` regex.
- `Sequences_regex_*.fasta.xz`: FASTA file containing all sequences matching the `--keepregex` regex.
- `SequenceMetadata_remainder.tsv.gz`: Metadata for the sequences in Sequences_remainder.fasta.xz.
- `SequenceMetadata_regex_*.tsv.gz`: Metadata for the sequences in Sequences_regex_*.fasta.xz.
- `SequenceMetadata_matched.tsv`: Metadata for the sequences that were extracted but not kept. There is no associated FASTA for these.
Optional arguments can also be provided:
- `--extractregex`: The regex used to remove sequences, e.g. '^X\S*$' will remove all lineages starting with X.
- `--keepregex`: The regex used to put sequences removed by `--extractregex` into a separate file, e.g. '^XBB\S*$' will keep the lineages starting with XBB and output them. This argument can be specified multiple times, resulting in one output file per regex.
- `--extractbyid`: By default, the regexes are applied to the lineage column; this flag applies the regex filtering to the sample ID instead.
Example: The following will remove all lineages starting with 'X' (i.e. recombinants) from the FASTA, but will keep the removed lineages starting with 'XBB' and 'XAC' in separate files.
python scripts/extractSequences.py --infile /path/to/fasta --metadata /path/to/metadata --outfile /dir/of/output --extractregex '^X\S*$' --keepregex '^XBB\S*$' --keepregex '^XAC\S*$'
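With `--extractbyid`, the same regexes are matched against sample IDs instead of lineages. For example, the following (using a purely hypothetical ID prefix) would extract every sequence whose ID starts with `hCoV-19/Canada`:

```bash
python scripts/extractSequences.py --infile /path/to/fasta --metadata /path/to/metadata --outfile /dir/of/output --extractregex '^hCoV-19/Canada\S*$' --extractbyid
```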
This step downsamples the genomes, uses `minimap2` to align them pairwise to the reference, and writes the result to FASTA.
python3 scripts/alignment.py data_needed/virusseq.$datestamp.fasta.xz data_needed/virusseq.metadata.csv.gz data_needed/ --samplenum 3 --reffile resources/NC_045512.fa
The output is three FASTA files containing aligned sequences (`sample1.fasta`, `sample2.fasta`, `sample3.fasta`).
The following steps should be applied to all three replicates from the preceding stage.
Command | Description | Outputs | Expected time |
---|---|---|---|
`for i in 1 2 3; do iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s data_needed/sample$i.fasta -m GTR -ninit 10 -n 4; done` | Use the COVID-19 release of IQ-TREE 2 to reconstruct the ML tree | Files containing Newick tree strings, `sample[123].fasta.treefile` | ~3 hours |
`for i in 1 2 3; do Rscript scripts/root2tip.R data_needed/sample$i.fasta.treefile data_needed/sample$i.rtt.nwk data_needed/sample$i.dates.tsv; done` | Root the ML tree using the reference genome as "outgroup", fit a root-to-tip regression, prune tips with outlying sequences (±4 s.d. of the molecular clock prediction) and export files for TreeTime | `sample[123].rtt.nwk` and `sample[123].dates.tsv` | ~1 minute |
`for i in 1 2 3; do treetime --tree data_needed/sample$i.rtt.nwk --dates data_needed/sample$i.dates.tsv --clock-filter 0 --sequence-length 29903 --keep-root --outdir data_needed/sample$i.treetime_dir; done` | Generate a time-scaled tree, allowing re-estimation of the root | Folders `data_needed/sample[123].treetime_dir`, each containing a `timetree.nexus` file | ~10 minutes |
`for i in 1 2 3; do python3 scripts/nex2nwk.py data_needed/sample$i.treetime_dir/timetree.nexus data_needed/sample$i.timetree.nwk; done` | Convert NEXUS to Newick format, excluding comment fields from internal nodes | Files containing Newick tree strings, `sample[123].timetree.nwk` | ~5 minutes |
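For reference, the four steps above can be chained into a single loop per replicate; the sketch below simply strings together the same commands and assumes the directory layout used throughout this section:

```bash
#!/bin/bash
# Sketch: full tree-building pipeline for each downsampled replicate,
# using the same commands as the table above.
for i in 1 2 3; do
  # Reconstruct the ML tree with the COVID-19 release of IQ-TREE 2
  iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s data_needed/sample$i.fasta -m GTR -ninit 10 -n 4
  # Root the tree, fit a root-to-tip regression, and export inputs for TreeTime
  Rscript scripts/root2tip.R data_needed/sample$i.fasta.treefile \
      data_needed/sample$i.rtt.nwk data_needed/sample$i.dates.tsv
  # Generate the time-scaled tree
  treetime --tree data_needed/sample$i.rtt.nwk --dates data_needed/sample$i.dates.tsv \
      --clock-filter 0 --sequence-length 29903 --keep-root --outdir data_needed/sample$i.treetime_dir
  # Convert the NEXUS output to plain Newick
  python3 scripts/nex2nwk.py data_needed/sample$i.treetime_dir/timetree.nexus data_needed/sample$i.timetree.nwk
done
```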
Command | Description | Outputs | Expected time |
---|---|---|---|
`for i in 1 2 4 5; do python3 scripts/get-mutations.py --pango rawlineage data_needed/virusseq.$datestamp.fasta.xz B.1.1.529.$i data_needed/virusseq.metadata.csv.gz data_needed/raphgraph/canada-BA$i.var; done` | Generate a frequency table of nucleotides at all positions for Canadian genomes of a user-specified lineage, aligned against the reference | `canada-BA[1245].var` | ~1 minute |
`for i in 1 2 4 5; do python3 scripts/get-mutations.py --seqname strain --delimiter "\t" --pango rawlineage data_needed/ncov-open.$datestamp.fasta.xz B.1.1.529.$i data_needed/ncov-open.$datestamp.withalias.tsv.gz data_needed/raphgraph/global-BA$i.var; done` | Generate the corresponding nucleotide frequency table for the global data set | `global-BA[1245].var` | |