Conversation
|
MatthiasZepper
left a comment
There was a problem hiding this comment.
Thank you for your contribution!
Ultimately, the pipeline will allow to flexibly choose the tools that are run, so having more tools in the pipeline is always great, but what exactly was your rationale here?
Admittedly, I just skimmed over the description, but to me, it seems that it is predominantly meant to run on FASTA files and for judging the quality of genome assemblies. Unless I missed some relevant arguments, its application on sequencing reads is in my opinion does not really yield too many meaningful statistics - at least for Illumina, for Nanopore probably yes.
Do you happen to have a Nanopore example? For Illumina, this is basically all you get:
file format type num_seqs sum_len min_len avg_len max_len
S11_L001_R1_001.fastq.gz FASTQ DNA 35875355 5417178605 151 151.0 151
S11_L001_R1_001.fastq.gz FASTQ DNA 35875355 322878195 9 9.0 9
S11_L001_R2_001.fastq.gz FASTQ DNA 35875355 5417178605 151 151.0 151
But with regard to the code, this already looks very good and tidy!
I am only missing a publishDir directive in the config, so that the reports from the stats output channel are published in a subfolder of the outdir.
| } | ||
|
|
||
| withName: SEQKIT_STATS { | ||
| ext.args = '' |
There was a problem hiding this comment.
Don't you not want to publish the results as well in the outdir? Or only MultiQC?
There was a problem hiding this comment.
yeah it is defined here - I guess you would only overwrite it if you dont want your output published in the standard way.
seqinspector/conf/modules.config
Line 15 in 9cd232c
There was a problem hiding this comment.
Oooh fancy - a central publishDir directive for all modules ?!? That's smart and has totally escaped me...sorry then for the false accusations!
| // | ||
| // MODULE: Run SEQKIT_STATS | ||
| // | ||
| SEQKIT_STATS ( |
There was a problem hiding this comment.
Currently, you are neither mixing the output into the MultiQC channel nor publishing it in the outdir. Unless I am missing something, the results of the run are therefore not used at all?
docs/output.md
Outdated
|
|
||
| </details> | ||
|
|
||
| [SeqkitStats](https://bioinf.shenwei.me/seqkit/usage/#stats) it gives general quality metrics about your sequenced reads including average read lengths, GC(%) and n50's. For further reading and documentation see the [Seqkit help pages]([Seqkit help](https://bioinf.shenwei.me/seqkit/)). |
There was a problem hiding this comment.
It is true that SeqkitStats computes some quality metrics, but to my best knowledge it is more useful for FASTA files and genomic assemblies than sequencing reads?
For example, for an Illumina run, an average read length prior to trimming should be known, because it corresponds to the number of cycles. For Nanopore, admittedly, such a statistic is more useful, so you may want to structure the documentation accordingly?
There was a problem hiding this comment.
Updated text a little, should i specifically mention this is more useful for nanopore data? for ilumina etc you do still get n50's etc but agree it is less useful - want me to add a nanopore test? :D
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%)
sample1_R1.fastq.gz FASTQ DNA 1377513 411872902 35 299.0 301 300.0 301.0 301.0 0 301 1 99.10 97.54 29.31 38.53
sample1_R2.fastq.gz FASTQ DNA 1377513 411840994 35 299.0 301 300.0 301.0 301.0 0 301 1 97.11 93.54 25.78 38.54
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.0.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
Add seqkit stats nf-core module
PR checklist
nf-core lint).nf-test test main.nf.test -profile test,docker).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).