Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

Open
jameshadfield opened this issue Nov 7, 2023 · 2 comments
Open

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

jameshadfield opened this issue Nov 7, 2023 · 2 comments

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Nov 7, 2023

I wasn't able to test #1333 (which adds support for parsing complex CDSs using GenBank files) because I didn't have translated (AA) fasta files around. This made me wonder why we need the AA sequences if we already have nucleotide sequences & CDS coordinates? This behavior is getting close to augur translate but we are performing ancestral state inference on the AA sequences rather than simply translating the internal nuc sequences. (A codon-aware inference would be the best.)

This is pretty easy to accomplish via the following patch, although a few rough edges would need to be attended to if we want to use this.

I'm not proposing this as a PR as I think it'd be better to work our golden path for generating translations.

diff --git a/augur/ancestral.py b/augur/ancestral.py
index b8f9dbe9..980c8fc8 100644
--- a/augur/ancestral.py
+++ b/augur/ancestral.py
@@ -232,8 +232,8 @@ def register_parser(parent_subparsers):
 def run(args):
     # Validate arguments.
     aa_arguments = (args.annotation, args.genes, args.translations)
-    if any(aa_arguments) and not all(aa_arguments):
-        raise AugurError("For amino acid sequence reconstruction, you must provide an annotation file, a list of genes, and a template path to amino acid sequences.")
+    # if any(aa_arguments) and not all(aa_arguments):
+    #     raise AugurError("For amino acid sequence reconstruction, you must provide an annotation file, a list of genes, and a template path to amino acid sequences.")
 
     # check alignment type, set flags, read in if VCF
     is_vcf = any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']])
@@ -302,19 +302,25 @@ def run(args):
 
     if not is_vcf and args.genes:
         from .utils import load_features
+        from Bio import AlignIO
+        from Bio.Align import MultipleSeqAlignment
+        from .translate import safe_translate
         ## load features; only requested features if genes given
         features = load_features(args.annotation, args.genes)
         if features is None:
             print("ERROR: could not read features of reference sequence file")
             return 1
         print("Read in {} features from reference sequence file".format(len(features)))
+        msa = AlignIO.read(aln, "fasta")
         for gene in args.genes:
             print(f"Processing gene: {gene}")
-            fname = args.translations.replace("%GENE", gene)
             feat = features[gene]
+            cds_records = [feat.extract(seq) for seq in msa]
+            for record in cds_records:
+                record.seq = Seq(safe_translate(str(record.seq)))
+            cds_alignment = MultipleSeqAlignment(cds_records)
             root_seq = str(feat.extract(Seq(ref)).translate()) if ref else None
-
-            aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
+            aa_result = run_ancestral(T, cds_alignment, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
                                         marginal=args.inference, infer_ambiguous=infer_ambiguous, alphabet='aa')
             if aa_result['tt'].data.full_length*3 != len(feat):
                 print(f"ERROR: length of translated alignment for {gene} does not match length of reference feature."

For testing I changed nCoVs default genbank file's features block to:

FEATURES             Location/Qualifiers
     source          1..29903
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /note="translated by -1 ribosomal frameshift"
                     /codon_start=1
     CDS             join(13442..13468,13468..16263)
                     /gene="NSP12"
                     /note="RdRp"
                     /codon_start=1
@corneliusroemer
Copy link
Member

What happens in your implementation when there are deletions (or even frameshifts)? That's one thing that Nextclade does as well as translation, it aligns the translations as well.

@jameshadfield
Copy link
Member Author

What happens in your implementation when there are deletions (or even frameshifts)? That's one thing that Nextclade does as well as translation, it aligns the translations as well.

The behaviour is the same as augur translate (docstring with examples). The translation is 1:1 of (aligned) nucleotide codon to AA, with no re-alignment. (This preserves the ability to go between AA & nuc via the CDS coordinates, which is present in Auspice and I think would be expected by users?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants