-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: complex CDSs within augur ancestral #1333
base: master
Are you sure you want to change the base?
Conversation
Auspice now supports multi-part CDS annotations This PR makes `augur translate` output the correct annotation for complex CDSs Complex CDSs are detected through `type(feat.location)` being `SeqFeature.CompountLocation` Currently, this only works with genbank files, as our load_features utility function does not deal with compound CDS in GFF files correctly I tested this with MERSs complex ORF1ab Partially resolves #1280
Codecov ReportAttention:
📢 Thoughts on this report? Let us know!. |
I think you mean I'd be happy to test this out - would you mind making available a test dataset? |
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs are not modified in this commit. Supersedes #1333
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333
I think this is now done via #1438 but perhaps there are cases not covered. We (I?) should write docs. |
@jameshadfield #1438 did not fix the issue for my use case trying to implement an "epitope sites" gene annotation. Since epitope sites are not contiguous, I needed to define them like this (just showing the first two Koel et al. sites for H3N2 HA as an example here):
This file is based on the example SARS-CoV-2 gene map from the Nextclade docs. Nextclade properly parses this GFF file and exports the correct translations for these two sites, but @corneliusroemer's implementation here seems like the right direction, but it also doesn't work for me with BioPython 1.84 an bcbio-gff 0.7.1 because the compound location gets parsed as a diff --git a/augur/ancestral.py b/augur/ancestral.py
index 932d8607..23e2b2b4 100644
--- a/augur/ancestral.py
+++ b/augur/ancestral.py
@@ -449,7 +449,13 @@ def run(args):
aa_result = run_ancestral(T, fname, reference_sequence=reference_sequence, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
marginal=args.inference, infer_ambiguous=infer_ambiguous, alphabet='aa')
- if aa_result['tt'].data.full_length*3 != len(feat):
+
+ if len(feat.sub_features) > 0:
+ feature_length = sum(len(sub_feat) for sub_feat in feat.sub_features)
+ else:
+ feature_length = len(feat)
+
+ if (aa_result['tt'].data.full_length * 3) != feature_length:
raise AugurError(f"length of translated alignment for {gene} does not match length of reference feature."
" Please make sure that the annotation matches the translations.")
diff --git a/augur/utils.py b/augur/utils.py
index d794d256..827f23d6 100644
--- a/augur/utils.py
+++ b/augur/utils.py
@@ -289,7 +289,7 @@ def _read_gff(reference, feature_names):
If a gene is found with the name 'nuc'
"""
from BCBio import GFF
- valid_types = ['gene', 'source', 'region']
+ valid_types = ['gene', 'source', 'region', 'CDS']
features = {}
with open_file(reference) as in_handle: These changes alone do not fix the issue, though, since the downstream ancestral logic needs to know how to export multiple subfeatures as annotations instead of the full gene. Before working on this more, though, I wonder if I'm thinking about the GFF implementation incorrectly? Would you approach this problem in a different way, @jameshadfield? |
Auspice now supports multi-part CDS annotations
This PR makes
augur translate
output the correct annotation for complex CDSsComplex CDSs are detected through
type(feat.location)
beingSeqFeature.CompountLocation
Currently, this only works with genbank files, as our load_features utility function does not
deal with compound CDS in GFF files correctly
I tested this with MERSs complex ORF1ab
Partially resolves #1280
Description of proposed changes
Related issue(s)
Checklist