You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I wasn't able to test #1333 (which adds support for parsing complex CDSs using GenBank files) because I didn't have translated (AA) fasta files around. This made me wonder why we need the AA sequences if we already have nucleotide sequences & CDS coordinates? This behavior is getting close to augur translate but we are performing ancestral state inference on the AA sequences rather than simply translating the internal nuc sequences. (A codon-aware inference would be the best.)
This is pretty easy to accomplish via the following patch, although a few rough edges would need to be attended to if we want to use this.
I'm not proposing this as a PR as I think it'd be better to work our golden path for generating translations.
diff --git a/augur/ancestral.py b/augur/ancestral.py
index b8f9dbe9..980c8fc8 100644
--- a/augur/ancestral.py+++ b/augur/ancestral.py@@ -232,8 +232,8 @@ def register_parser(parent_subparsers):
def run(args):
# Validate arguments.
aa_arguments = (args.annotation, args.genes, args.translations)
- if any(aa_arguments) and not all(aa_arguments):- raise AugurError("For amino acid sequence reconstruction, you must provide an annotation file, a list of genes, and a template path to amino acid sequences.")+ # if any(aa_arguments) and not all(aa_arguments):+ # raise AugurError("For amino acid sequence reconstruction, you must provide an annotation file, a list of genes, and a template path to amino acid sequences.")
# check alignment type, set flags, read in if VCF
is_vcf = any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']])
@@ -302,19 +302,25 @@ def run(args):
if not is_vcf and args.genes:
from .utils import load_features
+ from Bio import AlignIO+ from Bio.Align import MultipleSeqAlignment+ from .translate import safe_translate
## load features; only requested features if genes given
features = load_features(args.annotation, args.genes)
if features is None:
print("ERROR: could not read features of reference sequence file")
return 1
print("Read in {} features from reference sequence file".format(len(features)))
+ msa = AlignIO.read(aln, "fasta")
for gene in args.genes:
print(f"Processing gene: {gene}")
- fname = args.translations.replace("%GENE", gene)
feat = features[gene]
+ cds_records = [feat.extract(seq) for seq in msa]+ for record in cds_records:+ record.seq = Seq(safe_translate(str(record.seq)))+ cds_alignment = MultipleSeqAlignment(cds_records)
root_seq = str(feat.extract(Seq(ref)).translate()) if ref else None
-- aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,+ aa_result = run_ancestral(T, cds_alignment, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
marginal=args.inference, infer_ambiguous=infer_ambiguous, alphabet='aa')
if aa_result['tt'].data.full_length*3 != len(feat):
print(f"ERROR: length of translated alignment for {gene} does not match length of reference feature."
For testing I changed nCoVs default genbank file's features block to:
FEATURES Location/Qualifiers
source 1..29903
CDS join(266..13468,13468..21555)
/gene="ORF1ab"
/note="translated by -1 ribosomal frameshift"
/codon_start=1
CDS join(13442..13468,13468..16263)
/gene="NSP12"
/note="RdRp"
/codon_start=1
The text was updated successfully, but these errors were encountered:
What happens in your implementation when there are deletions (or even frameshifts)? That's one thing that Nextclade does as well as translation, it aligns the translations as well.
What happens in your implementation when there are deletions (or even frameshifts)? That's one thing that Nextclade does as well as translation, it aligns the translations as well.
The behaviour is the same as augur translate (docstring with examples). The translation is 1:1 of (aligned) nucleotide codon to AA, with no re-alignment. (This preserves the ability to go between AA & nuc via the CDS coordinates, which is present in Auspice and I think would be expected by users?)
I wasn't able to test #1333 (which adds support for parsing complex CDSs using GenBank files) because I didn't have translated (AA) fasta files around. This made me wonder why we need the AA sequences if we already have nucleotide sequences & CDS coordinates? This behavior is getting close to
augur translate
but we are performing ancestral state inference on the AA sequences rather than simply translating the internal nuc sequences. (A codon-aware inference would be the best.)This is pretty easy to accomplish via the following patch, although a few rough edges would need to be attended to if we want to use this.
I'm not proposing this as a PR as I think it'd be better to work our golden path for generating translations.
For testing I changed nCoVs default genbank file's features block to:
The text was updated successfully, but these errors were encountered: