feat: complex CDSs within augur ancestral #1333

corneliusroemer · 2023-11-04T02:14:59Z

Auspice now supports multi-part CDS annotations
This PR makes augur translate output the correct annotation for complex CDSs

Complex CDSs are detected through type(feat.location) being SeqFeature.CompountLocation

Currently, this only works with genbank files, as our load_features utility function does not
deal with compound CDS in GFF files correctly

I tested this with MERSs complex ORF1ab

Partially resolves #1280

Description of proposed changes

Related issue(s)

Checklist

Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

Auspice now supports multi-part CDS annotations This PR makes `augur translate` output the correct annotation for complex CDSs Complex CDSs are detected through `type(feat.location)` being `SeqFeature.CompountLocation` Currently, this only works with genbank files, as our load_features utility function does not deal with compound CDS in GFF files correctly I tested this with MERSs complex ORF1ab Partially resolves #1280

codecov · 2023-11-04T02:29:48Z

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Files	Coverage Δ
augur/ancestral.py	`73.88% <28.57%> (-1.84%)`	⬇️

📢 Thoughts on this report? Let us know!.

jameshadfield · 2023-11-07T02:21:22Z

This PR makes augur translate output the correct annotation for complex CDSs

I think you mean augur ancestral? Changes here are to ancestral.py and as far as I can tell translate doesn't import from ancestral.

I'd be happy to test this out - would you mind making available a test dataset?

Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs are not modified in this commit. Supersedes #1333

Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333

jameshadfield · 2024-09-25T20:45:50Z

I think this is now done via #1438 but perhaps there are cases not covered. We (I?) should write docs.

huddlej · 2024-09-26T00:25:19Z

@jameshadfield #1438 did not fix the issue for my use case trying to implement an "epitope sites" gene annotation. Since epitope sites are not contiguous, I needed to define them like this (just showing the first two Koel et al. sites for H3N2 HA as an example here):

##gff-version 3
##sequence-region CY163680.1 1 1737
CY163680.1	feature	gene	18	65	.	+	.	gene_name="SigPep"
CY163680.1	feature	gene	66	1052	.	+	.	gene_name="HA1"
CY163680.1	feature	gene	501	533	.	+	.	gene=Koel;ID=gene-Koel
CY163680.1	feature	CDS	501	503	.	+	.	gene=Koel;ID=cds-Koel;Parent=gene-Koel
CY163680.1	feature	CDS	531	533	.	+	.	gene=Koel;ID=cds-Koel;Parent=gene-Koel
CY163680.1	feature	gene	1053	1715	.	+	.	gene_name="HA2"

This file is based on the example SARS-CoV-2 gene map from the Nextclade docs. Nextclade properly parses this GFF file and exports the correct translations for these two sites, but augur ancestral fails to load the GFF, because it tests whether the length of the amino acid sequences from TreeTime is 3 times the length of the full gene feature.

@corneliusroemer's implementation here seems like the right direction, but it also doesn't work for me with BioPython 1.84 an bcbio-gff 0.7.1 because the compound location gets parsed as a SimpleLocation with multiple subfeatures and the _read_gff function doesn't include CDS as a valid type. To bypass the length validation error locally, I had to make the following changes:

diff --git a/augur/ancestral.py b/augur/ancestral.py
index 932d8607..23e2b2b4 100644
--- a/augur/ancestral.py
+++ b/augur/ancestral.py
@@ -449,7 +449,13 @@ def run(args):

             aa_result = run_ancestral(T, fname, reference_sequence=reference_sequence, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
                                         marginal=args.inference, infer_ambiguous=infer_ambiguous, alphabet='aa')
-            if aa_result['tt'].data.full_length*3 != len(feat):
+
+            if len(feat.sub_features) > 0:
+                feature_length = sum(len(sub_feat) for sub_feat in feat.sub_features)
+            else:
+                feature_length = len(feat)
+
+            if (aa_result['tt'].data.full_length * 3) != feature_length:
                 raise AugurError(f"length of translated alignment for {gene} does not match length of reference feature."
                        " Please make sure that the annotation matches the translations.")

diff --git a/augur/utils.py b/augur/utils.py
index d794d256..827f23d6 100644
--- a/augur/utils.py
+++ b/augur/utils.py
@@ -289,7 +289,7 @@ def _read_gff(reference, feature_names):
         If a gene is found with the name 'nuc'
     """
     from BCBio import GFF
-    valid_types = ['gene', 'source', 'region']
+    valid_types = ['gene', 'source', 'region', 'CDS']
     features = {}

     with open_file(reference) as in_handle:

These changes alone do not fix the issue, though, since the downstream ancestral logic needs to know how to export multiple subfeatures as annotations instead of the full gene.

Before working on this more, though, I wonder if I'm thinking about the GFF implementation incorrectly? Would you approach this problem in a different way, @jameshadfield?

corneliusroemer mentioned this pull request Nov 4, 2023

Genome annotations for segmented CDSs etc #1280

Open

3 tasks

jameshadfield mentioned this pull request Nov 7, 2023

[ancestral] translate genes/CDSs via nuc seq + annotation #1336

Open

jameshadfield changed the title ~~feat: output genome annotations for complex CDS~~ feat: complex CDSs within augur ancestral Sep 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: complex CDSs within augur ancestral #1333

feat: complex CDSs within augur ancestral #1333

corneliusroemer commented Nov 4, 2023

codecov bot commented Nov 4, 2023

jameshadfield commented Nov 7, 2023

jameshadfield commented Sep 25, 2024

huddlej commented Sep 26, 2024

feat: complex CDSs within augur ancestral #1333

Are you sure you want to change the base?

feat: complex CDSs within augur ancestral #1333

Conversation

corneliusroemer commented Nov 4, 2023

Description of proposed changes

Related issue(s)

Checklist

codecov bot commented Nov 4, 2023

Codecov Report

jameshadfield commented Nov 7, 2023

jameshadfield commented Sep 25, 2024

huddlej commented Sep 26, 2024