-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Longread only functionality #718
base: dev
Are you sure you want to change the base?
Conversation
… 97 to 90 when dealing with longreads
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.1.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only had a short look, sorry, but I just wanted to express my gratitude that you tackle long read assembly!
My pleasure hehe. Still WIP. I have to tidy up the code and run some validation on real data, but we're getting there! |
…rtread assemblers, when assembly input is given
…empty if no files are given
…d also return remainder in case the ch_short_reads channel is empty. Change config for FILTLONG to only use '--trim' option if shortreads are passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'm not seeing anything obvious! But I would still like to run a test (can't do that today).
Once commetns (mostly questions) addressed, I will start the manual tests :)
BOWTIE2_ASSEMBLY_BUILD ( assemblies ) | ||
ch_versions = Channel.empty() | ||
ch_multiqc_files = Channel.empty() | ||
// multiple symlinks to the same assembly -> use first of sorted list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// multiple symlinks to the same assembly -> use first of sorted list | |
// multiple symlinks to the same assembly -> use first of sorted list |
What is this comment referring to, it sounds a bit scary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a lot of copy-pasting here from original binning_preparation.nf, forgot to remove this comment :) the comment is referring to line 47 in shortread_binning_preparation.nf if you are interest. I did not write this line of code though.
ch_minimap2_input_idx = ch_minimap2_input | ||
.map { meta_idx, index, meta, reads -> [ meta_idx, index ] } | ||
|
||
MINIMAP2_ASSEMBLY_ALIGN ( ch_minimap2_input_reads, ch_minimap2_input_idx, true, 'bai', false, false ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the true/false/falses? something should be parameterasable by the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first true: if the output should be in bam-format. I think this can be fixed
first false: If the output is set to paf, then a cigar string is generated. This only happens if output format is not bam, so does not make sense for it to be changed by user
second false: to make the cigar string backward compatible with older tools, but no need for that here.
thank you @jfy133. I'm away this week, but I'll try to find time to go through your comments asap. |
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Thanks for the review @jfy133. Sorry I have been away this week, but now I think I addressed all of your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @muabnezor - looks really good! A few thoughts I had going through the PR - some probably easy fixes, some a bit more involved.
Once it's ready to go, let me know and I can potentially run a test with some of our PacBio HiFi data, if that's of any interest to you.
conf/modules.config
Outdated
@@ -580,6 +644,7 @@ process { | |||
} | |||
|
|||
withName: METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS { | |||
ext.args = { meta.assembler in ['FLYE', 'METAMDBG'] ? "--percentIdentity ${params.longread_percentidentity}" : '' } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be more generally configurable for long and short reads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this! moved the logic here to binning.nf workflow instead, and added a parameter --shortread_percent_identity also
docs/usage.md
Outdated
@@ -43,14 +43,25 @@ sample2,0,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,data/sample2.fastq | |||
sample3,1,0,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz, | |||
``` | |||
|
|||
If only long read data is available, the columns `short_reads_1` and `short_reads_2` can be simply left empty: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is processed with nf-schema, right? In which case, I think the short read columns can be left out in their entirety - don't need to be left empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right!
@@ -0,0 +1,62 @@ | |||
process METAMDBG_ASM { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed updates to the metaMDBG module that correct languageserver errors and updates the version to 1.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i updated the module
nextflow.config
Outdated
@@ -53,6 +53,7 @@ params { | |||
min_length_unbinned_contigs = 1000000 | |||
max_unbinned_contigs = 100 | |||
skip_prokka = false | |||
longread_percentidentity = 90 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 90% - seems low?
It's been removed from the VAMB documentation, but in an old readme file it states the following:
In general, if reads are allowed to map to subjects with a nucleotide identity of X %, then it is not possible to distinguish genomes with a higher nucleotide identity than X % using co-abundance, and these genomes will be binned together. This means you want to tweak your alignment tool to only output alignments with the desired minimum query/subject identity - for example 97%.
So I wonder that this should be set to 97% and be made configurable as a single parameter for both long and short reads? Appreciate this might be different for ONT, but AFAIK even ONT quality is good nowadays, up to 99% - definitely not a 10% error rate!
https://github.com/RasmussenLab/vamb/blob/adeb157d3d16b77e240819b86e810f733d600b28/doc/tutorial.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I was unsure what the default for this should be, I ran tests on some older ont data, and for those reads 97% was to high. but as you say ONT quality nowadays is probably ok to run with the default of 97%
@@ -65,6 +66,12 @@ params { | |||
skip_megahit = false | |||
skip_quast = false | |||
skip_prodigal = false | |||
skip_metamdbg = false | |||
skip_flye = false | |||
flye_mode = 'nano-raw' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to figure this (and metamdbg) out "on-the-fly" - add a field to the samplesheet that states ONT or HiFi, and these are selected accordingly? You can add a check that they're all ONT/all HiFi per group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this might require changes to the modules if we wanted to do this per-group)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but this parameter can be any of (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw | --nano-corr | --nano-hq ), depending also on the quality of the data. I thought it is better for now to let the user specify the longread source instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does complicate things... I just don't like that if you wanted to run both assemblers you would have to remember to specify two flags in the pipeline call. Maybe there could be a column in the input spreadsheet, but if you specify flye_mode
(default null) then that overrides whatever is read from the column input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. Whats more minimap2 is also optimized to run with specific longread input. If we want to allow for the flexibility of several different longread sources in the same run, I guess we need to add a column in the samplesheet (what should be the allowed values for this column?), or maybe it would be sufficient to only allow for one longread technology, which can be specified by one parameter, --longread-technology ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, I'm not sure about allowing mixed input types as I think you're right that the complexity comes with the mapping for binning (if you map ONT to a HiFi assembly and vice-versa, and then try to compare co-abundance across assemblies - I think you might run into problems as the coverages may not be comparable...). So either we allow mixed input types but don't allow cross-mapping (single-sample only) - which will require writing some some complicated logic! - or force the user to specify a single parameter declaring the longread type, which sets a bunch of defaults that can be overridden by parameters that are by default null?
Or we could also just be conservative and say that the pipeline only supports uncorrected ONT and PacBio HiFi (no CLR)?
@jfy133 any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, do we need rail guards to ensure that we’re not mixing short read only and long read only data within a single run?
// MODULES | ||
include { POOL_SINGLE_READS as POOL_SHORT_SINGLE_READS } from '../../modules/local/pool_single_reads' | ||
include { POOL_PAIRED_READS } from '../../modules/local/pool_paired_reads' | ||
include { POOL_SINGLE_READS as POOL_LONG_READS } from '../../modules/local/pool_single_reads' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this local module be replaced with the nf-core CAT module (and called separately for R1 and R2 for paired reads)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I have not looked into the pool module at all. maybe this is better to fix in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a CAT_FASTQ nf-core module that could replace all these, apparently: https://github.com/nf-core/modules/blob/master/modules/nf-core/cat/fastq/main.nf
But yes, maybe better for another PR!
SAMTOOLS_HOSTREMOVED_VIEW ( ch_minimap2_mapped , [[],[]], [] ) | ||
ch_versions = ch_versions.mix( SAMTOOLS_HOSTREMOVED_VIEW.out.versions.first() ) | ||
|
||
SAMTOOLS_HOSTREMOVED_FASTQ ( SAMTOOLS_HOSTREMOVED_VIEW.out.bam, false ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a place where a local module might be good - stream samtools view into samtools fastq. Otherwise you're essentially using extraneous storage in the work dir to hold a bam file you're never going to look at again, with the same data stored as fastq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I did something similar first, but later used the nf-core modules instead, as I thought this is preferred when possible :)
|
||
} | ||
|
||
BOWTIE2_ASSEMBLY_ALIGN ( ch_bowtie2_input ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, but this change clashes with @jfy133's bowtie PR - might be worth thinking about which one to pull into the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that too
Thanks for the review @prototaxites! I will try to get to all your comments this week. |
This PR adds long-read only functionality to mag.
closes #662, #659, #275
PR checklist
nf-core pipelines lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).