Software tools to collect and analyze Coronavirus (COV) sequences, including code that extracts gene and protein sequences from COV genomes from GenBank, and creates gene and protein-specific sequence collections, alignments, and Hidden Markov Models (HMMs) for protein and nucleotide COV sequences. The code uses the cov_strains.yaml file that lists the expected genes for SARS, MERS, and COV2 so that collections are strain-specific.
The COV genes and proteins are parsed from the GenBank files as features and assigned standard names based on their product or mat_peptide tags. Possible synonyms for these standard names are listed in cov_features.yaml. A QC step compares the COV sequences to expected lengths listed in the cov_length_variants.yaml file.
mkdir COV2
download_gb_by_taxid.py
feature_to_gene_and_protein.py COV2/*.gb
seqs_to_aligns_and_hmms.py COV2/*.fa
Code is included that annotates COV genomes using the nucleotide HMMs, pyTMHMM, and Infernal for RNA motif prediction and creates BED files for genome visualization.
annotate_to_bed.py -rfam_file ../COV2/cov_allvirus.cm COV2/NC_045512.2.gb
- Get an NCBI API key
- Pass the key as a command-line argument or configure as an
env
variable (e.g.export NCBI_API_KEY=exampleexampleexampleexampleexample
)
You can specify a destination directory with -data_dir
, and the strain-specific directories (e.g. 'COV2') will be created
inside this directory. The data_dir
location is the current directory by default, and the default strain is COV2.
- GenBank sequence:
.gb
- Fasta sequence:
.fa
- Fasta index file:
.fai
- Stockholm alignment:
.sto
- If the aligner is
clustalo
: Fasta alignment:.fasta
- If the aligner is
mafft
: MAF alignment:.maf
- HMMER profile HMM:
.hmm
- hmmsearch table:
.tblout
- BED track file:
.bed