[ancestral] translate genes/CDSs via nuc seq + annotation #1336

jameshadfield · 2023-11-07T02:28:12Z

I wasn't able to test #1333 (which adds support for parsing complex CDSs using GenBank files) because I didn't have translated (AA) fasta files around. This made me wonder why we need the AA sequences if we already have nucleotide sequences & CDS coordinates? This behavior is getting close to augur translate but we are performing ancestral state inference on the AA sequences rather than simply translating the internal nuc sequences. (A codon-aware inference would be the best.)

This is pretty easy to accomplish via the following patch, although a few rough edges would need to be attended to if we want to use this.

I'm not proposing this as a PR as I think it'd be better to work our golden path for generating translations.

diff --git a/augur/ancestral.py b/augur/ancestral.py
index b8f9dbe9..980c8fc8 100644
--- a/augur/ancestral.py
+++ b/augur/ancestral.py
@@ -232,8 +232,8 @@ def register_parser(parent_subparsers):
 def run(args):
     # Validate arguments.
     aa_arguments = (args.annotation, args.genes, args.translations)
-    if any(aa_arguments) and not all(aa_arguments):
-        raise AugurError("For amino acid sequence reconstruction, you must provide an annotation file, a list of genes, and a template path to amino acid sequences.")
+    # if any(aa_arguments) and not all(aa_arguments):
+    #     raise AugurError("For amino acid sequence reconstruction, you must provide an annotation file, a list of genes, and a template path to amino acid sequences.")
 
     # check alignment type, set flags, read in if VCF
     is_vcf = any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']])
@@ -302,19 +302,25 @@ def run(args):
 
     if not is_vcf and args.genes:
         from .utils import load_features
+        from Bio import AlignIO
+        from Bio.Align import MultipleSeqAlignment
+        from .translate import safe_translate
         ## load features; only requested features if genes given
         features = load_features(args.annotation, args.genes)
         if features is None:
             print("ERROR: could not read features of reference sequence file")
             return 1
         print("Read in {} features from reference sequence file".format(len(features)))
+        msa = AlignIO.read(aln, "fasta")
         for gene in args.genes:
             print(f"Processing gene: {gene}")
-            fname = args.translations.replace("%GENE", gene)
             feat = features[gene]
+            cds_records = [feat.extract(seq) for seq in msa]
+            for record in cds_records:
+                record.seq = Seq(safe_translate(str(record.seq)))
+            cds_alignment = MultipleSeqAlignment(cds_records)
             root_seq = str(feat.extract(Seq(ref)).translate()) if ref else None
-
-            aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
+            aa_result = run_ancestral(T, cds_alignment, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
                                         marginal=args.inference, infer_ambiguous=infer_ambiguous, alphabet='aa')
             if aa_result['tt'].data.full_length*3 != len(feat):
                 print(f"ERROR: length of translated alignment for {gene} does not match length of reference feature."

For testing I changed nCoVs default genbank file's features block to:

FEATURES             Location/Qualifiers
     source          1..29903
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /note="translated by -1 ribosomal frameshift"
                     /codon_start=1
     CDS             join(13442..13468,13468..16263)
                     /gene="NSP12"
                     /note="RdRp"
                     /codon_start=1

The text was updated successfully, but these errors were encountered:

corneliusroemer · 2023-11-13T14:27:06Z

What happens in your implementation when there are deletions (or even frameshifts)? That's one thing that Nextclade does as well as translation, it aligns the translations as well.

jameshadfield · 2023-11-19T20:40:51Z

What happens in your implementation when there are deletions (or even frameshifts)? That's one thing that Nextclade does as well as translation, it aligns the translations as well.

The behaviour is the same as augur translate (docstring with examples). The translation is 1:1 of (aligned) nucleotide codon to AA, with no re-alignment. (This preserves the ability to go between AA & nuc via the CDS coordinates, which is present in Auspice and I think would be expected by users?)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

jameshadfield commented Nov 7, 2023 •

edited

Loading

corneliusroemer commented Nov 13, 2023

jameshadfield commented Nov 19, 2023

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

Comments

jameshadfield commented Nov 7, 2023 • edited Loading

corneliusroemer commented Nov 13, 2023

jameshadfield commented Nov 19, 2023

jameshadfield commented Nov 7, 2023 •

edited

Loading