Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HMMER: additional seq motifs #5

Open
3 of 4 tasks
ptrebert opened this issue Mar 17, 2022 · 23 comments
Open
3 of 4 tasks

HMMER: additional seq motifs #5

ptrebert opened this issue Mar 17, 2022 · 23 comments

Comments

@ptrebert
Copy link
Member

ptrebert commented Mar 17, 2022

Add the following motifs to HMMER runs (various purposes)
FASTA files located under references

  • DYZ19_Yq: located in Yq euchromatin, not for contig identification, just annotate chrY contigs
  • DYZ3-prim_Ycentro: 171bp rep unit, not for contig identification, just annotate chrY contigs
  • DYZ3-sec_Ycentro: 5941bp rep unit, threshold hits >1700 and use for contig identification
  • "TSPY": to be defined
@ptrebert
Copy link
Member Author

config adapted; now distinguish between all motifs and those used for chrY contig identification:

- DYZ1_Yq
- DYZ18_Yq
- DYZ3-sec_Ycentro
- DYZ2_Yq

done c05672c

@pilleh
Copy link
Collaborator

pilleh commented Mar 23, 2022

@ptrebert
Just wondering - do you use the '-E' parameter when you run hmmer for DYZ19 and DYZ3-prim? I know for a fact that the DYZ19 region is present in at least some of the samples (from the old hifiasm assemblies) where hmmer now doesn't identify any, so maybe it's a question of using less stringent parameters?

@ptrebert
Copy link
Member Author

Well, of course I am using it as per your instructions for the HMMER run :-)
Since these two motifs you mentioned are not relevant for contig identification, changing the run parameters is not much of an issue, so which parameters do you suggest here (19 and 3-prim)?

@pilleh
Copy link
Collaborator

pilleh commented Mar 23, 2022

@ptrebert
For DYZ19 and DYZ3-prim I've been using '-E 1.60E-15'. This seems to pick up (most) of the correct chrY copies.
In some ways using DYZ3-prim might be a bit pointless - it picks up huge amounts of stuff from other chromosomes. But I guess since it's already in the pipeline, you might as well leave it in as it might still be useful at some point.

@ptrebert
Copy link
Member Author

ptrebert commented Mar 23, 2022

Ok, to have that info centrally available, this is set in the pipeline:

SCORE_THRESHOLDS_MOTIF = {
    'DYZ1_Yq': 2500,
    'DYZ18_Yq': 2100,
    'DYZ2_Yq': 1700,
    'DYZ3-sec_Ycentro': 1700
}
# plus HMMER E-value of: 1.60E-150

And now the E-value for the motifs DYZ19 and DYZ3-prim should be increased to -E 1.60E-15, correct?
@pilleh

@pilleh
Copy link
Collaborator

pilleh commented Mar 23, 2022

@ptrebert
Yes, it all sounds good to me.
You can also add a score threshold for DYZ3-prim of 90.

I also wonder - are you keeping the original hmmer output files? Would you be able to add the parse output to Globus as well, for the future runs at least?

@ptrebert
Copy link
Member Author

Ok, I adapted the thresholds and let this rerun now.

I take the "parse" output to be this one

  --tblout <f>       : save parseable table of hits to file <f>

in which case, yes, this will be copied for all runs (past and future) from now on.

@pilleh
Copy link
Collaborator

pilleh commented Mar 23, 2022

Yes, this is great, thanks!

@ptrebert
Copy link
Member Author

You find the tabular output in the .norm.tsv files

@ptrebert ptrebert added the Stalled Further steps need to be decided label Mar 24, 2022
@ptrebert
Copy link
Member Author

I received feedback from one of the HMMER devs, and they have now confirmed and identified the problem. They want to work on this tomorrow. Right now, no need to look for an alternative tool.

@pilleh
Copy link
Collaborator

pilleh commented Mar 30, 2022

@ptrebert
sounds good. Fingers crossed they will do as they say.
I've been trying to run hmmer with the Y sequence classes + repeats against hg02666 assembly to see if it would help with the annotation of Y regions. But even with super low -E threshold it does not complete in 2 days which is a bit disappointing. Anyway, just felt like complaining.

@ptrebert ptrebert removed the Stalled Further steps need to be decided label May 27, 2022
@ptrebert
Copy link
Member Author

@pilleh what's actually with the "TSPY" that is still unchecked in the list above? If that has been dropped, this issue could be closed

@pilleh
Copy link
Collaborator

pilleh commented May 27, 2022

@ptrebert
Yes, sorry, this is still undefined. I have a repeat unit for it, but I want to double-check it to make sure it is fine as is or should be slightly changed. I'll try to have this done today.
If it's easy for you to add this to the pipeline once I have an acceptable repeat unit, then we can do that. Or I can just run HMMER with it separately on the assemblies - not a problem either.

@pilleh
Copy link
Collaborator

pilleh commented May 27, 2022

I don't think I will make it today :(

@ptrebert
Copy link
Member Author

if it is just one more motif to add, it's no problem adding it later

@pilleh
Copy link
Collaborator

pilleh commented Jun 29, 2022

@ptrebert
Okay, I finally have some more repeat units for HMMER. Do you prefer to add these to the pipeline, or I can run them myself?

  1. Yqhet_2.7kb - 2.7kb repeat unit at the eu-Yqhet boundary, threshold hits >1500
  2. Yqhet_3.1kb - 3.1kb repeat unit at the eu-Yqhet boundary, threshold hits >1500
  3. TSPY - 20.2kb repeat unit, threshold hits >1000
    [the thresholds are very generous just to be on the safe side]

plus HMMER E-value of: 1.60E-200 [hopefully this will reduce the running time a bit]

I've added the fasta sequences to HHU/references:
Yqhet_2.7kb.fasta
Yqhet_3.1kb.fasta
TSPY.fasta

@ptrebert
Copy link
Member Author

I'll add these to the pipeline, unless it's so urgent that you need the results today.

@pilleh
Copy link
Collaborator

pilleh commented Jun 29, 2022

Thanks :)
Nope, I don't need them today. If you think the results could be ready some time let's say mid next week, then that would be awesome, as it would be nice to add something from these results to the Y update.
I'll spend most of today on the grant renewal :S

@ptrebert
Copy link
Member Author

the workflow engine was annoying me with name matching issues because of the "dots" (2.7kb and 3.1kb) in the motif names, and the easiest solution was to just rename your files:

Yqhet_2k7bp
Yqhet_3k1bp

@ptrebert
Copy link
Member Author

ptrebert commented Jul 5, 2022

for the record, the TSPY is so big that the HMMER runs have failed so far, so I need to submit those with much more resources. The other two are done and the results on the Globus share.

@pilleh
Copy link
Collaborator

pilleh commented Jul 5, 2022

@ptrebert
How about if you decrease the E-value to: 5.00E-300
This should help but not sure how much.

Another option which you might like less is running only against chrY.

@pilleh
Copy link
Collaborator

pilleh commented Jul 6, 2022

@ptrebert
I always feel bad to ask additional things but is it a lot of pain to run HMMER with all our repeat units and RepeatMasker also on the T2T HG002 Y sequence? I have run most of these separately on T2T Y, but probably would be better if it is done consistently, right?

@ptrebert
Copy link
Member Author

ptrebert commented Jul 7, 2022

I added the respective tasks to the pipeline for all HMMER motifs and RepeatMasker with T2T-HG002Y as target.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants