Bin QC Improvements #707

dialvarezs · 2024-10-27T21:31:08Z

This PR adds:

CheckM2 as an alternative for bin qc
A new BIN_QC subworkflow, integrating CHECKM, CHECKM2, BUSCO and GUNC.

Closes #607.

PR checklist

- Update modules - Update integration in mag and with other tools (bin_summary, gtdb-tk) - Update test - Update schema

jfy133 · 2024-10-27T21:40:34Z

@nf-core-bot fix linting

jfy133 · 2024-10-27T21:42:43Z

Before you continue (sorry this is a bit late):

I generally don't like to deprecate old version of tools for a while, but rather keep them as alternative tools.

In some cases people want to stick with the original version for compatibility with previous runs

Could you 'revert' (or reinstall) the old checkm module and wrap it in an if/else statement (but within the subworkflow :) )

@muabnezor did a similar thing when adding porechop_ABI hree: #674

dialvarezs · 2024-10-27T21:49:06Z

@jfy133 that makes sense, I will revert the CheckM removal. Bad for me for not asking before 😅.

dialvarezs · 2024-10-28T12:15:01Z

Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?

Also, simplify bin_summary regarding bin qc

jfy133 · 2024-10-29T09:47:00Z

Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?

Yes that would be perfect! We need to subworkflow the sh*t out of this monster 😅 thank you!!!

dialvarezs · 2024-11-01T08:44:22Z

It's ready for review now, @jfy133. I’ve added GUNC to the subworkflow to keep mag.nf cleaner.

The only issue I found is with GUNC_MERGECHECKM outputs not being saved correctly. Since the process runs per bin but uses the sample as ID, the output files are overwritten. However, once this PR is merged, we’ll be able to run GUNC with all bins simultaneously, which should solve this issue.

jfy133 · 2024-11-07T05:28:31Z

Hi @dialvarezs just FYI I'm at another conference since yesterday and didn't have a chance beginning of this week to look at your PRs, sorry about that!

My response time should speed up from next week :)

prototaxites · 2024-11-11T15:18:39Z

I think this looks good, though I don't have the latitude to run any manual tests of it myself at the moment - so probably wait for someone else to sign off on it!

My only question is whether it might be good to move all the binqc database preparation steps (e.g. CHECKM2_DATABASEDOWNLOAD, and the bits that initialise the files/channels) inside the BINQC workflow where they are consumed? I think keeping conceptually-related code together will probably help with maintenance down the line, and the main mag.nf workflow is already over-full. Plus, there's less subworkflow inputs and outputs to keep track of. Might be a job for another PR though!

dialvarezs · 2024-11-12T06:50:59Z

@prototaxites Absolutely, that makes sense to me. That would help to simplify the mag workflow a bit. What do you think about this @jfy133 ?

jfy133 · 2024-11-12T07:45:46Z

I've pondered that a few times, however many of these download steps take a very long time, and thus from a user PoV I think it makes sense to have it triggered right at the beginning of the pipeline so by the time assembly and binning is done it's ready to go rather than getting all the way to binning and then having to wait the same length of time again before you can start the binning QC.

In my mind to clear up the code I would rather have a dB download subworkflow for all DB Downloadsto make the code clearer. Maybe from a related module PoV it's not as efficient but functionally they are related

Any counter arguments?

dialvarezs · 2024-11-12T07:55:14Z

In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are in placed in the code. Or am I missing something?

prototaxites · 2024-11-12T09:24:31Z

In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are in placed in the code. Or am I missing something?

Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...).

Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like "if(params.some_db) { db = Channel.value(file(path)) } else { db = Channel.empty }" - there are a lot of these input databases, often just a zip or fasta file, and building these in a single WF would make something with a lot of outputs. Hence why I think it may be more maintainable to keep that code near to where the outputs are consumed (set the PhiX fasta in the "short read preprocessing" subWF, lambda in the "long read preprocessing", etc.). But either way cleaning up the main WF should help - this all just may be beyond the scope of this PR.

jfy133 · 2024-11-13T12:41:10Z

In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are in placed in the code. Or am I missing something?

Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...).

Hmm fair. I might be feeling over defensive due to the huge number of conditions mag has... so maybe this is indeed the case.

Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like "if(params.some_db) { db = Channel.value(file(path)) } else { db = Channel.empty }" - there are a lot of these input databases, often just a zip or fasta file, and building these in a single WF would make something with a lot of outputs. Hence why I think it may be more maintainable to keep that code near to where the outputs are consumed (set the PhiX fasta in the "short read preprocessing" subWF, lambda in the "long read preprocessing", etc.). But either way cleaning up the main WF should help - this all just may be beyond the scope of this PR.

Fair point.

Ok - if you're feeling to up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworklow(s) :)

Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline isntead.

I added the seaprte CheckM CI tests originally due to instable downloads when connecting to the servers in Australia, but as these databases are now on Zenodo this should rarely happen.

dialvarezs · 2024-11-14T08:12:38Z

@jfy133 I'll get back to this on Friday or Saturday.

Ok - if you're feeling to up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworklow(s) :)

I think this is a step in the right direction to improve the modularity of mag, so I will.

Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline isntead.

Got it, I will remove the checkm2 ci test I added. Should I remove the checkm ci test as well?

jfy133 · 2024-11-14T08:18:09Z

@jfy133 I'll get back to this on Friday or Saturday.

Ok - if you're feeling to up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworklow(s) :)

I think this is a step in the right direction to improve the modularity of mag, so I will.

Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline isntead.

Got it, I will remove the checkm2 ci test I added. Should I remove the checkm ci test as well?

Yes you probably can! Thank you!

feat: Migrate from CheckM to CheckM2

53864d2

- Update modules - Update integration in mag and with other tools (bin_summary, gtdb-tk) - Update test - Update schema

dialvarezs marked this pull request as draft October 27, 2024 21:31

dialvarezs added 3 commits October 27, 2024 19:05

fix: Bring back CheckM

ca2f97b

fix: One more thing

05e3393

fix: Linting

3e49dbf

dialvarezs changed the title ~~feat: Replace from CheckM to CheckM2~~ feat: Add CheckM2 Oct 27, 2024

fix: Option and checks in bin summary script

b1b6518

dialvarezs force-pushed the dev-checkm2 branch from c63e084 to b1b6518 Compare October 27, 2024 22:40

dialvarezs added 2 commits October 27, 2024 19:43

Fix: missing import

73b5794

docs: Output, changelog and citation for CheckM2

325310b

dialvarezs marked this pull request as ready for review October 28, 2024 10:30

dialvarezs added 3 commits October 28, 2024 07:32

docs: readme

b88302a

Merge remote-tracking branch 'upstream/dev' into dev-checkm2

d62a671

docs: Update readme

3da7441

dialvarezs added 2 commits October 28, 2024 09:57

refactor: Merge checkm and checkm2 subworkflows in a single one

3d98415

Also, simplify bin_summary regarding bin qc

fix: Restore mistakenly deleted code

8b4fcc7

dialvarezs added 4 commits October 31, 2024 17:26

Bin QC workflow

76dac5c

Merge remote-tracking branch 'upstream/dev' into dev-checkm2

0ebab07

Cleanup

38ce756

Final touches

3702329

dialvarezs force-pushed the dev-checkm2 branch 2 times, most recently from a4f42ef to da52285 Compare November 1, 2024 00:27

Integrate GUNC in BIN_QC subworkflow

0eb167a

dialvarezs force-pushed the dev-checkm2 branch from 4007932 to 0eb167a Compare November 1, 2024 04:13

Code style improvements

f0a6999

dialvarezs changed the title ~~feat: Add CheckM2~~ Bin QC Improvements Nov 1, 2024

harper357 mentioned this pull request Nov 7, 2024

Be able to run multiple bin QC metrics #713

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bin QC Improvements #707

Bin QC Improvements #707

dialvarezs commented Oct 27, 2024 •

edited

Loading

jfy133 commented Oct 27, 2024

jfy133 commented Oct 27, 2024

dialvarezs commented Oct 27, 2024

dialvarezs commented Oct 28, 2024 •

edited

Loading

jfy133 commented Oct 29, 2024

dialvarezs commented Nov 1, 2024 •

edited

Loading

jfy133 commented Nov 7, 2024

prototaxites commented Nov 11, 2024

dialvarezs commented Nov 12, 2024

jfy133 commented Nov 12, 2024 •

edited

Loading

dialvarezs commented Nov 12, 2024

prototaxites commented Nov 12, 2024 •

edited

Loading

jfy133 commented Nov 13, 2024

dialvarezs commented Nov 14, 2024

jfy133 commented Nov 14, 2024

Bin QC Improvements #707

Are you sure you want to change the base?

Bin QC Improvements #707

Conversation

dialvarezs commented Oct 27, 2024 • edited Loading

PR checklist

jfy133 commented Oct 27, 2024

jfy133 commented Oct 27, 2024

dialvarezs commented Oct 27, 2024

dialvarezs commented Oct 28, 2024 • edited Loading

jfy133 commented Oct 29, 2024

dialvarezs commented Nov 1, 2024 • edited Loading

jfy133 commented Nov 7, 2024

prototaxites commented Nov 11, 2024

dialvarezs commented Nov 12, 2024

jfy133 commented Nov 12, 2024 • edited Loading

dialvarezs commented Nov 12, 2024

prototaxites commented Nov 12, 2024 • edited Loading

jfy133 commented Nov 13, 2024

dialvarezs commented Nov 14, 2024

jfy133 commented Nov 14, 2024

dialvarezs commented Oct 27, 2024 •

edited

Loading

dialvarezs commented Oct 28, 2024 •

edited

Loading

dialvarezs commented Nov 1, 2024 •

edited

Loading

jfy133 commented Nov 12, 2024 •

edited

Loading

prototaxites commented Nov 12, 2024 •

edited

Loading