-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/support protein predictions #137
Conversation
855e041
to
e77a000
Compare
e77a000
to
0fa4cc4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it is hardcoded to use prodigal output as input with the option --use_proteins
. I suppose it is OK for MGnify internal purposes, but I think it is a good idea to specify the expected format of such input for external users.
else: | ||
self.logger.debug(f'---- Protein {prodigal_info[4:20]} is not in {prophage_info}') | ||
return False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was kind of difficult to read for me. You could evaluate whether the elements of two lists representing the ranges for prophage and protein intersect totally, partially, or not at all. The two lists are created using range and the intersection is checked:
def check_intersection(start_prophage, finish_prophage, start_protein, finish_protein):
prophage_set = set(range(start_prophage, finish_prophage + 1))
protein_set = set(range(start_protein, finish_protein + 1))
intersection = prophage_set.intersection(protein_set)
# Check for intersection size
if len(intersection) > 0:
if len(intersection) == len(protein_set):
return "Protein inside prophage"
else:
return "Protein intersects with prophage"
else:
return "Protein not in prophage"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the MAP we use a gene coverage threshold of 75% as a criterion to tag the genes as part of the MGE or not. I think it is a good idea to homologate the criteria in both pipelines. What do you think?
@@ -1,6 +1,6 @@ | |||
{ | |||
"$schema": "https://json-schema.org/draft/2020-12/schema", | |||
"$id": "https://raw.githubusercontent.com/ebi-metagenomics/miassembler/master/assets/schema_input.json", | |||
"$id": "https://raw.githubusercontent.com/ebi-metagenomics/emg-viral-pipeline/dev/assets/schema_input.json", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to remember to update this dev
-> master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, absolutely (bit now dev if quite different from master)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this. I will update the protein coverage length on the next version of the MAP (wip) to be on the same page.
prophage_cov = intersection / prophage_length | ||
protein_cov = intersection / protein_length | ||
|
||
if prophage_cov > 0.9 or protein_cov > 0.9: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Protein length will always be shorter that the prophage length. Evaluating the protein coverage should be enough. In the MAP we evaluate the MGE length because we can have small MGEs as short as 1 Kb, but in the case of prophages, that shouldn't be the case.
Added support of using already predicted proteins as pipeline input.