-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRAFT: add 'gene' feature capture when reading genbank files #1435
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1435 +/- ##
==========================================
+ Coverage 68.66% 68.70% +0.03%
==========================================
Files 69 69
Lines 7554 7608 +54
Branches 1851 1866 +15
==========================================
+ Hits 5187 5227 +40
- Misses 2089 2098 +9
- Partials 278 283 +5 ☔ View full report in Codecov by Sentry. |
@@ -432,6 +434,21 @@ def _read_genbank(reference, feature_names): | |||
if fname and (feature_names is None or fname in feature_names): | |||
features[fname] = feat | |||
|
|||
if feat.type=='gene': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce the chances of this introducing unexpected changes to current usage, I'd suggest an approach where we only use 'gene' features if the corresponding 'CDS' feature doesn't exist. There's a few different ways to implement this, but the simplest may be to use an extra loop through gb.features
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing this discussion based on nextstrain/rsv#55 (comment):
Theoretically CDSs are what we want, but we have GenBank files which use "gene" instead. I can't think of a situation where we want to extract both from the same genbank reference¹. One option is to look for 'gene' only if no CDSs are found. Another is to expose this as a command line argument to augur.
¹ I'm sure there are exceptions to this, in which case I think the onus is on the user to edit the GenBank file upstream of Augur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something where we need to poll users to vote for "CDS" or "gene"? Or it sounds like we're pretty sure it's "CDS"?
Another is to expose this as a command line argument to augur.
Ah, I could get behind having a command line argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're sure we want to use CDS, this is what we're moving towards as only CDSes allow complex annotations like ribosomal slippage etc. Nextclade v3 uses CDS moving forward, we've made similar changes for auspice
Description of proposed changes
In response to nextstrain/rsv#55 (comment), adds a 'gene' feature capture when reading GenBank files.
Please feel free to commit changes to this branch, this initial commit was a draft.
We should add a test of some sort to make sure this works as expected, perhaps in augur/tests/io?
The potential benefit of this functionality is to simplify the newreference.py script for Nextstrain gene tree builds.
Related issue(s)
Checklist