Longread only functionality #718

muabnezor · 2024-11-28T12:28:45Z

This PR adds long-read only functionality to mag.

closes #662, #659, #275

PR checklist

github-actions · 2024-11-28T12:28:58Z

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.1.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

d4straub

I only had a short look, sorry, but I just wanted to express my gratitude that you tackle long read assembly!

nextflow.config

conf/modules.config

muabnezor · 2024-11-28T14:18:45Z

> I only had a short look, sorry, but I just wanted to express my gratitude that you tackle long read assembly!

My pleasure hehe. Still WIP. I have to tidy up the code and run some validation on real data, but we're getting there!

jfy133

OK I'm not seeing anything obvious! But I would still like to run a test (can't do that today).

Once the comments (mostly questions) are addressed, I will start the manual tests :)

CHANGELOG.md

CITATIONS.md

jfy133 · 2025-01-20T11:58:58Z

subworkflows/local/binning_preparation.nf

-    BOWTIE2_ASSEMBLY_BUILD ( assemblies )
+    ch_versions       = Channel.empty()
+    ch_multiqc_files  = Channel.empty()
+        // multiple symlinks to the same assembly -> use first of sorted list


Suggested change

// multiple symlinks to the same assembly -> use first of sorted list

// multiple symlinks to the same assembly -> use first of sorted list

What is this comment referring to? It sounds a bit scary.

I did a lot of copy-pasting here from the original binning_preparation.nf and forgot to remove this comment :) The comment refers to line 47 in shortread_binning_preparation.nf, if you are interested. I did not write that line of code, though.

jfy133 · 2025-01-20T12:08:02Z

subworkflows/local/longread_binning_preparation.nf

+    ch_minimap2_input_idx = ch_minimap2_input
+        .map { meta_idx, index, meta, reads -> [ meta_idx, index ] }
+
+    MINIMAP2_ASSEMBLY_ALIGN ( ch_minimap2_input_reads, ch_minimap2_input_idx, true, 'bai', false, false )


What are the true/false/false values? Should something be parameterisable by the user?

First `true`: whether the output should be in BAM format. I think this can be fixed.
First `false`: if the output is set to PAF, a CIGAR string is generated. This only applies when the output format is not BAM, so it does not make sense for the user to change it.
Second `false`: makes the CIGAR string backward compatible with older tools, but there is no need for that here.

subworkflows/local/shortread_assembly.nf

subworkflows/local/utils_nfcore_mag_pipeline/main.nf

workflows/mag.nf

muabnezor · 2025-01-20T14:07:05Z

Thank you @jfy133. I'm away this week, but I'll try to find time to go through your comments asap.

muabnezor · 2025-01-24T13:08:30Z

Thanks for the review @jfy133. Sorry I have been away this week, but now I think I addressed all of your comments.

prototaxites

Hi @muabnezor - looks really good! A few thoughts I had going through the PR - some probably easy fixes, some a bit more involved.

Once it's ready to go, let me know and I can potentially run a test with some of our PacBio HiFi data, if that's of any interest to you.

prototaxites · 2025-02-17T13:02:52Z

conf/modules.config

@@ -580,6 +644,7 @@ process {
    }

    withName: METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS {
+        ext.args = { meta.assembler in ['FLYE', 'METAMDBG'] ? "--percentIdentity ${params.longread_percentidentity}" : '' }


Should this be more generally configurable for long and short reads?

Fixed this! I moved the logic to the binning.nf workflow instead, and also added a parameter --shortread_percent_identity.
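
The selection logic discussed in this thread can be sketched outside Nextflow. The Python below is only an illustration of the idea, not the pipeline's code: the function name `percent_identity_args` and the example threshold values are hypothetical, while the assembler names and the `--percentIdentity` flag are taken from the diff above.

```python
# Assembler names as they appear in the diff above.
LONGREAD_ASSEMBLERS = {"FLYE", "METAMDBG"}


def percent_identity_args(assembler, longread_pid, shortread_pid):
    """Return the extra argument string for the depth-summary step.

    Long-read assemblies get the long-read threshold; everything else
    gets the short-read threshold.
    """
    pid = longread_pid if assembler in LONGREAD_ASSEMBLERS else shortread_pid
    return f"--percentIdentity {pid}"


print(percent_identity_args("FLYE", 90, 97))
print(percent_identity_args("MEGAHIT", 90, 97))
```

Keeping both thresholds as explicit inputs mirrors the suggestion that the value should be configurable for long and short reads alike.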

prototaxites · 2025-02-17T13:04:09Z

docs/usage.md

@@ -43,14 +43,25 @@ sample2,0,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,data/sample2.fastq
 sample3,1,0,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,
 ```

+If only long read data is available, the columns `short_reads_1` and `short_reads_2` can be simply left empty:


This is processed with nf-schema, right? In which case, I think the short read columns can be left out in their entirety - don't need to be left empty

you are right!
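
To illustrate the point about omitting columns, here is a small Python sketch that treats a missing column and an empty cell the same way. It is not how nf-schema validates the samplesheet; the function name `classify_rows` and the column names `sample`, `group`, and `long_reads` are assumptions made for the example (only `short_reads_1` and `short_reads_2` are named in the docs diff above).

```python
import csv
import io


def classify_rows(samplesheet_text):
    """Label each sample as 'short', 'long', or 'hybrid'."""
    result = {}
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # A column that is absent and a cell that is empty are handled alike.
        has_short = bool((row.get("short_reads_1") or "").strip())
        has_long = bool((row.get("long_reads") or "").strip())
        if has_short and has_long:
            kind = "hybrid"
        elif has_long:
            kind = "long"
        elif has_short:
            kind = "short"
        else:
            raise ValueError(f"sample {row['sample']} has no reads")
        result[row["sample"]] = kind
    return result


# Long-read-only samplesheet with the short-read columns left out entirely.
long_only = "sample,group,long_reads\nsample1,0,data/sample1.fastq.gz\n"
print(classify_rows(long_only))
```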

prototaxites · 2025-02-17T13:04:57Z

modules/nf-core/metamdbg/asm/main.nf

@@ -0,0 +1,62 @@
+process METAMDBG_ASM {


I pushed updates to the metaMDBG module that correct languageserver errors and update the version to 1.1

OK, I updated the module.

prototaxites · 2025-02-17T13:09:48Z

nextflow.config

@@ -53,6 +53,7 @@ params {
    min_length_unbinned_contigs          = 1000000
    max_unbinned_contigs                 = 100
    skip_prokka                          = false
+    longread_percentidentity             = 90


Why 90% - seems low?

It's been removed from the VAMB documentation, but in an old readme file it states the following:

> In general, if reads are allowed to map to subjects with a nucleotide identity of X %, then it is not possible to distinguish genomes with a higher nucleotide identity than X % using co-abundance, and these genomes will be binned together. This means you want to tweak your alignment tool to only output alignments with the desired minimum query/subject identity - for example 97%.

So I wonder whether this should be set to 97% and be made configurable as a single parameter for both long and short reads? I appreciate this might be different for ONT, but AFAIK even ONT quality is good nowadays, up to 99%, and definitely not a 10% error rate!

https://github.com/RasmussenLab/vamb/blob/adeb157d3d16b77e240819b86e810f733d600b28/doc/tutorial.md

Yes, I was unsure what the default should be. I ran tests on some older ONT data, and for those reads 97% was too high. But as you say, ONT quality nowadays is probably good enough to run with the default of 97%.
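
The trade-off in this thread can be shown with a toy calculation. The sketch below uses a simplified definition of identity (matching bases divided by aligned length); the actual computation inside jgi_summarize_bam_contig_depths may differ, and the function names are hypothetical.

```python
def percent_identity(matches, alignment_length):
    """Simplified identity: matching bases as a percentage of aligned bases."""
    return 100.0 * matches / alignment_length


def passes(matches, alignment_length, min_identity):
    """True if an alignment meets the minimum identity threshold."""
    return percent_identity(matches, alignment_length) >= min_identity


# An alignment of 1000 bases with 50 mismatches is 95% identical. That could
# be an error-prone read from the right genome, or an accurate read from a
# closely related genome; the threshold cannot tell the two apart.
print(passes(950, 1000, 90))  # counted at a 90% threshold
print(passes(950, 1000, 97))  # discarded at a 97% threshold
```

A lower threshold retains noisier reads, but it also lets reads from genomes up to that similarity contribute to each other's coverage, which is the concern raised in the quoted VAMB text.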

prototaxites · 2025-02-17T13:11:11Z

nextflow.config

@@ -65,6 +66,12 @@ params {
    skip_megahit                         = false
    skip_quast                           = false
    skip_prodigal                        = false
+    skip_metamdbg                        = false
+    skip_flye                            = false
+    flye_mode                            = 'nano-raw'


Would it be better to figure this (and metamdbg) out "on-the-fly" - add a field to the samplesheet that states ONT or HiFi, and these are selected accordingly? You can add a check that they're all ONT/all HiFi per group.

(this might require changes to the modules if we wanted to do this per-group)

That does complicate things... I just don't like that if you wanted to run both assemblers you would have to remember to specify two flags in the pipeline call. Maybe there could be a column in the input spreadsheet, but if you specify flye_mode (default null) then that overrides whatever is read from the column input?

I think you are right. What's more, minimap2 is also optimized to run with specific long-read input. If we want to allow for the flexibility of several different long-read sources in the same run, I guess we need to add a column to the samplesheet (what should the allowed values for this column be?). Or maybe it would be sufficient to only allow one long-read technology, specified by a single parameter, --longread-technology?

Thinking about it, I'm not sure about allowing mixed input types, as I think you're right that the complexity comes with the mapping for binning (if you map ONT to a HiFi assembly and vice versa, and then try to compare co-abundance across assemblies, I think you might run into problems as the coverages may not be comparable...). So either we allow mixed input types but don't allow cross-mapping (single-sample only), which will require writing some complicated logic, or we force the user to specify a single parameter declaring the longread type, which sets a bunch of defaults that can be overridden by parameters that are null by default?

Or we could also just be conservative and say that the pipeline only supports uncorrected ONT and PacBio HiFi (no CLR)?

@jfy133 any thoughts?

Actually, do we need guard rails to ensure that we're not mixing short-read-only and long-read-only data within a single run?
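
One way to picture the "single parameter with overridable defaults" option, together with the guard rail asked about above, is the Python sketch below. It is a sketch under assumptions, not a proposal for the actual implementation: the technology keys `ont` and `hifi`, the function names, and the setting names are all hypothetical. The values `nano-raw`, `nano-hq`, and `pacbio-hifi` are Flye read-type modes, and `map-ont` and `map-hifi` are minimap2 presets.

```python
# Hypothetical defaults keyed by a single long-read technology parameter.
LONGREAD_DEFAULTS = {
    "ont": {"flye_mode": "nano-raw", "minimap2_preset": "map-ont"},
    "hifi": {"flye_mode": "pacbio-hifi", "minimap2_preset": "map-hifi"},
}


def resolve_longread_settings(technology, flye_mode=None, minimap2_preset=None):
    """Start from the technology defaults; explicit (non-null) values win."""
    if technology not in LONGREAD_DEFAULTS:
        raise ValueError(f"unsupported long-read technology: {technology!r}")
    settings = dict(LONGREAD_DEFAULTS[technology])
    overrides = {"flye_mode": flye_mode, "minimap2_preset": minimap2_preset}
    settings.update({k: v for k, v in overrides.items() if v is not None})
    return settings


def check_not_mixed(sample_kinds):
    """Reject runs that mix short-read-only and long-read-only samples."""
    if {"short", "long"} <= set(sample_kinds):
        raise ValueError("short-read-only and long-read-only samples in one run")


print(resolve_longread_settings("hifi"))
print(resolve_longread_settings("ont", flye_mode="nano-hq"))
```

With this shape, a user who wants both assemblers only sets the technology once, and the per-tool parameters stay null unless a default needs overriding.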

prototaxites · 2025-02-17T13:16:10Z

subworkflows/local/assembly.nf

+// MODULES
+include { POOL_SINGLE_READS as POOL_SHORT_SINGLE_READS          } from '../../modules/local/pool_single_reads'
+include { POOL_PAIRED_READS                                     } from '../../modules/local/pool_paired_reads'
+include { POOL_SINGLE_READS as POOL_LONG_READS                  } from '../../modules/local/pool_single_reads'


Could this local module be replaced with the nf-core CAT module (and called separately for R1 and R2 for paired reads)?

Maybe. I have not looked into the pool module at all. Maybe this is better fixed in another PR?

There's a CAT_FASTQ nf-core module that could replace all these, apparently: https://github.com/nf-core/modules/blob/master/modules/nf-core/cat/fastq/main.nf

But yes, maybe better for another PR!
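
For context on why a generic concatenation module could stand in for the local pooling modules: gzip files can be joined byte-wise and still decompress as one stream. The Python sketch below demonstrates that property on two tiny made-up FASTQ files; the function name `pool_reads` and the file names are hypothetical, and this is not the code of either module.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def pool_reads(inputs, output):
    """Concatenate gzipped FASTQ files byte-wise, like `cat a.gz b.gz > out.gz`."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


records = ["@r1\nACGT\n+\nIIII\n", "@r2\nTTGA\n+\nIIII\n"]
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    parts = []
    for i, rec in enumerate(records):
        part = tmp / f"run{i}.fastq.gz"
        with gzip.open(part, "wt") as fh:
            fh.write(rec)
        parts.append(part)
    pooled = tmp / "pooled.fastq.gz"
    pool_reads(parts, pooled)
    # Reading the pooled file yields the records of both inputs in order.
    with gzip.open(pooled, "rt") as fh:
        pooled_text = fh.read()

print(pooled_text)
```

For paired reads the same call would be made twice, once for the R1 files and once for the R2 files, keeping the input order identical so that mates stay in sync.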

prototaxites · 2025-02-17T13:22:55Z

subworkflows/local/longread_hostremoval.nf

+    SAMTOOLS_HOSTREMOVED_VIEW ( ch_minimap2_mapped , [[],[]], [] )
+    ch_versions      = ch_versions.mix( SAMTOOLS_HOSTREMOVED_VIEW.out.versions.first() )
+
+    SAMTOOLS_HOSTREMOVED_FASTQ ( SAMTOOLS_HOSTREMOVED_VIEW.out.bam, false )


This might be a place where a local module might be good - stream samtools view into samtools fastq. Otherwise you're essentially using extraneous storage in the work dir to hold a bam file you're never going to look at again, with the same data stored as fastq.

Good idea, I did something similar first, but later used the nf-core modules instead, as I thought this is preferred when possible :)
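
The streaming idea amounts to connecting the two samtools steps with a pipe so that no intermediate BAM is written. The Python sketch below shows that wiring with `subprocess`; it assumes samtools is on the PATH when the streaming function is called, and the `-f 4` filter (keep unmapped reads only) is an assumption about what host removal should retain, not something stated in this thread. The function names are hypothetical.

```python
import subprocess


def view_command(bam):
    """samtools view: emit uncompressed BAM (-u) of unmapped reads (-f 4)."""
    return ["samtools", "view", "-u", "-f", "4", str(bam)]


def fastq_command():
    """samtools fastq reading BAM from stdin ('-') and writing FASTQ to stdout."""
    return ["samtools", "fastq", "-"]


def stream_unmapped_to_fastq(bam, out_fastq):
    """Pipe `samtools view` into `samtools fastq` without a BAM on disk."""
    with open(out_fastq, "wb") as out:
        view = subprocess.Popen(view_command(bam), stdout=subprocess.PIPE)
        fastq = subprocess.Popen(fastq_command(), stdin=view.stdout, stdout=out)
        view.stdout.close()  # let `view` receive SIGPIPE if `fastq` exits early
        fastq_rc = fastq.wait()
        view_rc = view.wait()
    if view_rc != 0 or fastq_rc != 0:
        raise RuntimeError(f"samtools failed (view={view_rc}, fastq={fastq_rc})")


# Show the equivalent shell pipeline without running it.
print(" ".join(view_command("mapped.bam")), "|", " ".join(fastq_command()))
```

Inside a Nextflow process the same thing would normally be written as a plain shell pipe in the script block; the point here is only that the intermediate file disappears.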

prototaxites · 2025-02-17T13:25:14Z

subworkflows/local/shortread_binning_preparation.nf

+
+    }
+
+    BOWTIE2_ASSEMBLY_ALIGN ( ch_bowtie2_input )


Just a thought, but this change clashes with @jfy133's bowtie PR - might be worth thinking about which one to pull into the other.

I noticed that too

muabnezor · 2025-02-20T14:37:05Z

Thanks for the review @prototaxites! I will try to get to all your comments this week.

muabnezor added 12 commits November 25, 2024 11:34

Add longread only test config

5ffb327

Change samplesheet validation schema to allow for longread-only samplesheets

695da2b

Check if both short, and long reads are given

f67b6b3

fix validation schema for samplesheet

bb78aab

Merge remote-tracking branch 'upstream/dev' into longread_only

992b01e

Add separate wubworkflow for longread host removal

bfd404c

Add longread meta-assemblers metaflye and metamdbg

387284d

Add longread assemble config

6948316

Prepare binning for longread assemblies

762f12b

Fix config and how long reads are prepared for binning

c7a5cdf

change jgi_summarize_bam_contig_depths --percentidentity default from 97 to 90 when dealing with longreads

18ae00d

Fix longread binning preparation from all combination of assemblies, and reads

8a22939

muabnezor added the WIP Work in progress label Nov 28, 2024

muabnezor requested a review from jfy133 November 28, 2024 12:28

muabnezor changed the base branch from master to dev November 28, 2024 12:29

d4straub reviewed Nov 28, 2024

nextflow.config (outdated)

prototaxites reviewed Nov 28, 2024

conf/modules.config (outdated)

muabnezor added 11 commits November 29, 2024 07:16

format

038a87b

Fix longread hostremoval, and fix --longread_percentidentiy parameter in config

3ff8de2

Add test_longread to profiles, and ci testing

e277290

Fix validation schema, and fix assembly channels for longread and shortread assemblers, when assembly input is given

75a36f2

Change logic in samplesheet validation, ch_raw_short_reads should be empty if no files are given

67d44d0

fix custom samtools view module

ca4d101

Merge branch 'dev' into longread_only

cfd43e7

Make sure filtlong works without short reads. the join operator should also return remainder in case the ch_short_reads channel is empty. Change config for FILTLONG to only use '--trim' option if shortreads are passed

6765ab1

Make sure FILTLONG is not run when there are no long reads

e73975c

Fix grouping logic for channels in longreads_binning_preparation subworkflow

686fafe

make assembly into subworkflow

a0279fe

muabnezor added the ready to review label Dec 13, 2024

muabnezor added 2 commits January 7, 2025 08:49

Fix bug when running with --keep_phix, make ch_phix_db_file empty Channel

d9564da

Merge branch 'dev' into longread_only

9087f2f

muabnezor mentioned this pull request Jan 7, 2025

When running the pipeline with --skip_clipping and --keep_phix: ERROR ~ No such variable: ch_phix_db_file #740

Closed

muabnezor added 6 commits January 7, 2025 10:59

fix modules.config

ba0f831

Fix linting

d46aa6d

Use nf-core official module for samtools fastq

f90e5f5

Change modules config

85c2e82

change hybrid logic

7398ba3

fix samplesheet validation

ae1953d

jfy133 linked an issue Jan 20, 2025 that may be closed by this pull request

Add separate Nanopore input option #275

Open

jfy133 reviewed Jan 20, 2025

jfy133 mentioned this pull request Jan 21, 2025

Use channel empty when no phix reference supplied because skipping #749

Merged

11 tasks

muabnezor and others added 4 commits January 24, 2025 10:26

Apply suggestions from code review

d95b501

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Merge branch 'dev' into longread_only

855a299

changes from review

095b6fb

fix linting

00105b2

Fix how assemblies are grouped with aligned reads in longread pre binning

6ca7617

prototaxites reviewed Feb 17, 2025

muabnezor added 4 commits February 20, 2025 15:13

make local module of samtools for longread host removal

c074c72

make METABAT_JGISUMMARIZEBAMCONTIGS run separately for longreads and shortreads

0b632f1

add shortread_percentidentity parameter

7464ac5

fix depth output from binning subworkflow

ff1ef2b

muabnezor added 4 commits February 21, 2025 08:21

fix samtools_unmapped output name

e281a4a

update metamdbg/asm

681320d

add parameters for configuring minimap index making

ab441e1

Fix usage docs

abefb5f
