Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/support protein predictions #137

Merged
merged 2 commits into from
Feb 5, 2025

Conversation

KateSakharova
Copy link

@KateSakharova KateSakharova commented Jan 15, 2025

Added support of using already predicted proteins as pipeline input.

  • added script to split proteins.faa into HC, LC and PP files. Phophages processing is quite tricky because they have particular coords, not whole contig.
  • fixed gff generator not to change coords for already predicted proteins for prophages (because they were initially predicted in contig coords)

@KateSakharova KateSakharova changed the base branch from master to dev January 15, 2025 12:39
@KateSakharova KateSakharova force-pushed the feature/support-protein-predictions branch from 855e041 to e77a000 Compare January 29, 2025 12:53
@KateSakharova KateSakharova force-pushed the feature/support-protein-predictions branch from e77a000 to 0fa4cc4 Compare January 29, 2025 12:54
@KateSakharova KateSakharova marked this pull request as ready for review January 29, 2025 12:54
Copy link

@Ales-ibt Ales-ibt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it is hardcoded to use prodigal output as input with the option --use_proteins. I suppose it is OK for MGnify internal purposes, but I think it is a good idea to specify the expected format of such input for external users.

modules/local/help.nf Outdated Show resolved Hide resolved
else:
self.logger.debug(f'---- Protein {prodigal_info[4:20]} is not in {prophage_info}')
return False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was kind of difficult to read for me. You could evaluate whether the elements of two lists representing the ranges for prophage and protein intersect totally, partially, or not at all. The two lists are created using range and the intersection is checked:

def check_intersection(start_prophage, finish_prophage, start_protein, finish_protein):
    prophage_set = set(range(start_prophage, finish_prophage + 1))
    protein_set = set(range(start_protein, finish_protein + 1))

    intersection = prophage_set.intersection(protein_set)

    # Check for intersection size
    if len(intersection) > 0:
        if len(intersection) == len(protein_set):
            return "Protein inside prophage" 
        else:
            return "Protein intersects with prophage"
    else:
        return "Protein not in prophage"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the MAP we use a gene coverage threshold of 75% as a criterion to tag the genes as part of the MGE or not. I think it is a good idea to homologate the criteria in both pipelines. What do you think?

@@ -1,6 +1,6 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/ebi-metagenomics/miassembler/master/assets/schema_input.json",
"$id": "https://raw.githubusercontent.com/ebi-metagenomics/emg-viral-pipeline/dev/assets/schema_input.json",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to remember to update this dev -> master

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely (bit now dev if quite different from master)

Copy link

@Ales-ibt Ales-ibt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this. I will update the protein coverage length on the next version of the MAP (wip) to be on the same page.

prophage_cov = intersection / prophage_length
protein_cov = intersection / protein_length

if prophage_cov > 0.9 or protein_cov > 0.9:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protein length will always be shorter that the prophage length. Evaluating the protein coverage should be enough. In the MAP we evaluate the MGE length because we can have small MGEs as short as 1 Kb, but in the case of prophages, that shouldn't be the case.

@KateSakharova KateSakharova merged commit aa59e72 into dev Feb 5, 2025
1 check passed
@KateSakharova KateSakharova deleted the feature/support-protein-predictions branch February 5, 2025 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants