Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Longread only functionality #718

Open
wants to merge 48 commits into
base: dev
Choose a base branch
from
Open

Conversation

muabnezor
Copy link
Contributor

@muabnezor muabnezor commented Nov 28, 2024

This PR adds long-read only functionality to mag.

closes #662, #659, #275

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@muabnezor muabnezor added the WIP Work in progress label Nov 28, 2024
@muabnezor muabnezor requested a review from jfy133 November 28, 2024 12:28
Copy link

github-actions bot commented Nov 28, 2024

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.1.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@muabnezor muabnezor changed the base branch from master to dev November 28, 2024 12:29
Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only had a short look, sorry, but I just wanted to express my gratitude that you tackle long read assembly!

@muabnezor
Copy link
Contributor Author

I only had a short look, sorry, but I just wanted to express my gratitude that you tackle long read assembly!

My pleasure hehe. Still WIP. I have to tidy up the code and run some validation on real data, but we're getting there!

@jfy133 jfy133 linked an issue Jan 20, 2025 that may be closed by this pull request
Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'm not seeing anything obvious! But I would still like to run a test (can't do that today).

Once commetns (mostly questions) addressed, I will start the manual tests :)

BOWTIE2_ASSEMBLY_BUILD ( assemblies )
ch_versions = Channel.empty()
ch_multiqc_files = Channel.empty()
// multiple symlinks to the same assembly -> use first of sorted list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// multiple symlinks to the same assembly -> use first of sorted list
// multiple symlinks to the same assembly -> use first of sorted list

What is this comment referring to, it sounds a bit scary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a lot of copy-pasting here from original binning_preparation.nf, forgot to remove this comment :) the comment is referring to line 47 in shortread_binning_preparation.nf if you are interest. I did not write this line of code though.

ch_minimap2_input_idx = ch_minimap2_input
.map { meta_idx, index, meta, reads -> [ meta_idx, index ] }

MINIMAP2_ASSEMBLY_ALIGN ( ch_minimap2_input_reads, ch_minimap2_input_idx, true, 'bai', false, false )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the true/false/falses? something should be parameterasable by the user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first true: if the output should be in bam-format. I think this can be fixed
first false: If the output is set to paf, then a cigar string is generated. This only happens if output format is not bam, so does not make sense for it to be changed by user
second false: to make the cigar string backward compatible with older tools, but no need for that here.

@muabnezor
Copy link
Contributor Author

thank you @jfy133. I'm away this week, but I'll try to find time to go through your comments asap.

@muabnezor
Copy link
Contributor Author

Thanks for the review @jfy133. Sorry I have been away this week, but now I think I addressed all of your comments.

Copy link
Contributor

@prototaxites prototaxites left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @muabnezor - looks really good! A few thoughts I had going through the PR - some probably easy fixes, some a bit more involved.

Once it's ready to go, let me know and I can potentially run a test with some of our PacBio HiFi data, if that's of any interest to you.

@@ -580,6 +644,7 @@ process {
}

withName: METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS {
ext.args = { meta.assembler in ['FLYE', 'METAMDBG'] ? "--percentIdentity ${params.longread_percentidentity}" : '' }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be more generally configurable for long and short reads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this! moved the logic here to binning.nf workflow instead, and added a parameter --shortread_percent_identity also

docs/usage.md Outdated
@@ -43,14 +43,25 @@ sample2,0,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,data/sample2.fastq
sample3,1,0,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,
```

If only long read data is available, the columns `short_reads_1` and `short_reads_2` can be simply left empty:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is processed with nf-schema, right? In which case, I think the short read columns can be left out in their entirety - don't need to be left empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right!

@@ -0,0 +1,62 @@
process METAMDBG_ASM {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed updates to the metaMDBG module that correct languageserver errors and updates the version to 1.1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i updated the module

nextflow.config Outdated
@@ -53,6 +53,7 @@ params {
min_length_unbinned_contigs = 1000000
max_unbinned_contigs = 100
skip_prokka = false
longread_percentidentity = 90
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 90% - seems low?

It's been removed from the VAMB documentation, but in an old readme file it states the following:

In general, if reads are allowed to map to subjects with a nucleotide identity of X %, then it is not possible to distinguish genomes with a higher nucleotide identity than X % using co-abundance, and these genomes will be binned together. This means you want to tweak your alignment tool to only output alignments with the desired minimum query/subject identity - for example 97%.

So I wonder that this should be set to 97% and be made configurable as a single parameter for both long and short reads? Appreciate this might be different for ONT, but AFAIK even ONT quality is good nowadays, up to 99% - definitely not a 10% error rate!

https://github.com/RasmussenLab/vamb/blob/adeb157d3d16b77e240819b86e810f733d600b28/doc/tutorial.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I was unsure what the default for this should be, I ran tests on some older ont data, and for those reads 97% was to high. but as you say ONT quality nowadays is probably ok to run with the default of 97%

@@ -65,6 +66,12 @@ params {
skip_megahit = false
skip_quast = false
skip_prodigal = false
skip_metamdbg = false
skip_flye = false
flye_mode = 'nano-raw'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to figure this (and metamdbg) out "on-the-fly" - add a field to the samplesheet that states ONT or HiFi, and these are selected accordingly? You can add a check that they're all ONT/all HiFi per group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this might require changes to the modules if we wanted to do this per-group)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but this parameter can be any of (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw | --nano-corr | --nano-hq ), depending also on the quality of the data. I thought it is better for now to let the user specify the longread source instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does complicate things... I just don't like that if you wanted to run both assemblers you would have to remember to specify two flags in the pipeline call. Maybe there could be a column in the input spreadsheet, but if you specify flye_mode (default null) then that overrides whatever is read from the column input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. Whats more minimap2 is also optimized to run with specific longread input. If we want to allow for the flexibility of several different longread sources in the same run, I guess we need to add a column in the samplesheet (what should be the allowed values for this column?), or maybe it would be sufficient to only allow for one longread technology, which can be specified by one parameter, --longread-technology ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, I'm not sure about allowing mixed input types as I think you're right that the complexity comes with the mapping for binning (if you map ONT to a HiFi assembly and vice-versa, and then try to compare co-abundance across assemblies - I think you might run into problems as the coverages may not be comparable...). So either we allow mixed input types but don't allow cross-mapping (single-sample only) - which will require writing some some complicated logic! - or force the user to specify a single parameter declaring the longread type, which sets a bunch of defaults that can be overridden by parameters that are by default null?

Or we could also just be conservative and say that the pipeline only supports uncorrected ONT and PacBio HiFi (no CLR)?

@jfy133 any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, do we need rail guards to ensure that we’re not mixing short read only and long read only data within a single run?

// MODULES
include { POOL_SINGLE_READS as POOL_SHORT_SINGLE_READS } from '../../modules/local/pool_single_reads'
include { POOL_PAIRED_READS } from '../../modules/local/pool_paired_reads'
include { POOL_SINGLE_READS as POOL_LONG_READS } from '../../modules/local/pool_single_reads'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this local module be replaced with the nf-core CAT module (and called separately for R1 and R2 for paired reads)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I have not looked into the pool module at all. maybe this is better to fix in another PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a CAT_FASTQ nf-core module that could replace all these, apparently: https://github.com/nf-core/modules/blob/master/modules/nf-core/cat/fastq/main.nf

But yes, maybe better for another PR!

SAMTOOLS_HOSTREMOVED_VIEW ( ch_minimap2_mapped , [[],[]], [] )
ch_versions = ch_versions.mix( SAMTOOLS_HOSTREMOVED_VIEW.out.versions.first() )

SAMTOOLS_HOSTREMOVED_FASTQ ( SAMTOOLS_HOSTREMOVED_VIEW.out.bam, false )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a place where a local module might be good - stream samtools view into samtools fastq. Otherwise you're essentially using extraneous storage in the work dir to hold a bam file you're never going to look at again, with the same data stored as fastq.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I did something similar first, but later used the nf-core modules instead, as I thought this is preferred when possible :)


}

BOWTIE2_ASSEMBLY_ALIGN ( ch_bowtie2_input )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, but this change clashes with @jfy133's bowtie PR - might be worth thinking about which one to pull into the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that too

@muabnezor
Copy link
Contributor Author

Thanks for the review @prototaxites! I will try to get to all your comments this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add separate Nanopore input option
4 participants