With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formated sequence data. The results are the clean sequences and the sequences identified as contaminated.
19
-
Per default [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences but I recommend using `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), to clean short-read data (_--bbduk_).
18
+
With this workflow you can screen and clean your Illumina, Nanopore, or any FASTA-formatted sequence data. The results are the clean sequences and the sequences identified as contaminated. By default, [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences (with the `map-ont` settings for Nanopore data and `sr` settings for short-read data activated automatically). However, for short-read data, you may want to switch to [BWA](https://github.com/lh3/bwa) (`--bwa`). As another alternative, we provide `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), as a k-mer-based approach (`--bbduk`). However, no mapping file will be produced with `bbduk`, and thus some subsequent statistics are not calculated.
20
19
21
-
You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped against the specified host, control and user defined FASTA files. All reads that map are considered as contamination. In case of Illumina paired-end reads, both mates need to be aligned.
20
+
You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped (or compared by k-mers in the case of `bbduk`) against the specified host, control, and user-defined FASTA files. All reads that match are considered contamination. In the case of Illumina paired-end reads, both mates need to be aligned (singleton files will be produced otherwise).
22
21
23
-
If Nanopore (`--input_type nano`) and Illumina (`--input_type illumina`) reads and control(s) (`--control`) are set, the control is selectively concatenated with the host and own FASTA: `dcs` for Nanopore DNA-Seq, `eno` for Nanopore RNA-Seq and `phix` from Illumina data.
24
-
Else, specified host, control and user defined FASTA files are concatenated.
22
+
The read input is defined via `--input_type nano` for Nanopore and `--input_type illumina` or `--input_type illumina_single_end` for Illumina reads. Additional control(s) for decontamination can be defined via `--control`. If controls are defined, they are selectively concatenated with the host and any own FASTA files for decontamination. We provide auto-download for the following controls: `dcs` for Nanopore DNA-Seq, `eno` for Nanopore RNA-Seq, and `phix` for Illumina data. In general, specified host, control, and user-defined FASTA files are concatenated for decontamination.
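
As a rough illustration of what concatenating references means in practice, the following sketch merges several FASTA files (plain or gzip-compressed) into one combined reference. The function name `combine_references` and the file names in the usage line are hypothetical, and this is not the pipeline's own implementation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: concatenate FASTA files into a single reference.
# Usage: combine_references OUT.fa IN1.fa [IN2.fa.gz ...]
combine_references() {
  local out="$1"
  shift
  : > "$out"                                 # create or truncate the output file
  local fasta
  for fasta in "$@"; do
    case "$fasta" in
      *.gz) gzip -dc "$fasta" >> "$out" ;;   # decompress on the fly
      *)    cat "$fasta" >> "$out" ;;        # plain FASTA is appended as-is
    esac
  done
}
```

For example, `combine_references combined.fa host.fa control.fa own.fa.gz` would write all three inputs into `combined.fa`, assuming each input ends with a newline.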
25
23
26
24
### Filter soft-clipped contamination reads
27
25
28
-
We saw many soft-clipped reads after the mapping, that probably aren't contamination. With `--min_clip` the user can set a threshold for the number of soft-clipped positions (sum of both ends). If `--min_clip` is greater 1, the total number is considered, else the fraction of soft-clipped positions to the read length. The output consists of all mapped, soft-clipped and mapped reads passing the filer.
26
+
We saw many soft-clipped reads after the mapping that probably aren't contamination. With `--min_clip` the user can set a threshold for the number of soft-clipped positions (sum of both ends). If `--min_clip` is greater than 1, the total number is considered; otherwise, the fraction of soft-clipped positions relative to the read length is used. The output consists of all mapped reads, the soft-clipped reads, and the mapped reads passing the filter.
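
The threshold rule can be sketched in a few lines of bash. This is only an illustration under stated assumptions: the function names `softclip_total` and `exceeds_min_clip` are hypothetical, the read length is passed in explicitly, and treating a value equal to the threshold as "soft-clipped" (`>=`) is an assumption rather than something the pipeline documents.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the --min_clip rule; not the pipeline's code.

# Sum the soft-clipped bases at both ends of a SAM CIGAR string.
softclip_total() {
  local cigar="$1" total=0
  # Leading soft clip, optionally preceded by a hard clip (e.g. 3H7S90M).
  if [[ "$cigar" =~ ^([0-9]+H)?([0-9]+)S ]]; then
    total=$(( total + 10#${BASH_REMATCH[2]} ))
  fi
  # Trailing soft clip, optionally followed by a hard clip.
  if [[ "$cigar" =~ ([0-9]+)S([0-9]+H)?$ ]]; then
    total=$(( total + 10#${BASH_REMATCH[1]} ))
  fi
  echo "$total"
}

# Succeed (exit 0) if the read's soft clipping reaches the threshold.
# min_clip > 1: compare the absolute count of soft-clipped positions.
# Otherwise: compare the fraction of soft-clipped positions to the read length.
# Assumption: ">=" decides the boundary case.
exceeds_min_clip() {
  local cigar="$1" read_len="$2" min_clip="$3" clipped
  clipped="$(softclip_total "$cigar")"
  awk -v c="$clipped" -v len="$read_len" -v t="$min_clip" 'BEGIN {
    value = (t > 1) ? c : c / len
    if (value >= t) exit 0
    exit 1
  }'
}
```

A call such as `exceeds_min_clip 30S70M 100 0.2 && echo "soft-clipped"` would then flag a 100-base read with 30 clipped positions under a fractional threshold of 0.2.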
29
27
30
28
## Requirements
31
29
@@ -35,13 +33,20 @@ We saw many soft-clipped reads after the mapping, that probably aren't contamina
35
33
36
34
### Dependencies management
37
35
36
+
For dependency handling you have to use one of the following technologies:
By default, `docker` is used; to switch to another technology for dependency handling, e.g., `mamba`, use `-profile mamba`.
41
44
42
-
-[Docker](https://docs.docker.com/get-docker/)
45
+
### Run engine
43
46
44
-
In default `docker` is used; to switch to `conda` use `-profile conda`.
47
+
By default, we assume you are running the tool on a laptop or workstation (`local`). You can change the pipeline behaviour, for example when running on an HPC with the SLURM workload manager, via `-profile slurm`.
48
+
49
+
Dependencies and run engines can be combined, e.g., to run with Singularity on LSF use `-profile singularity,lsf`.
45
50
46
51
## Execution examples
47
52
@@ -57,28 +62,35 @@ Get help:
57
62
nextflow run rki-mf1/clean --help
58
63
```
59
64
65
+
We always recommend running a release version. **Check for latest releases!** In these examples we use release `-r v1.1.0`:
66
+
67
+
```bash
68
+
# check available release versions and branches
69
+
nextflow info rki-mf1/clean
70
+
# select a release and run it to show the help
71
+
nextflow run rki-mf1/clean -r v1.1.0 --help
72
+
```
73
+
60
74
Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.
61
75
62
76
```bash
63
77
# uses Docker by default
64
-
nextflow run rki-mf1/clean --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
78
+
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
65
79
--host eco --control dcs
66
80
67
-
# use conda instead of Docker
68
-
nextflow run rki-mf1/clean --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
69
-
--host eco --control dcs -profile conda
81
+
# use mamba instead of Docker
82
+
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
83
+
--host eco --control dcs -profile mamba
70
84
```
71
85
72
-
Clean Illumina paired-end data against your own reference FASTA using bbduk instead of minimap2.
86
+
Clean Illumina paired-end data against your own reference FASTA using `bbduk` instead of `minimap2`.
73
87
74
88
```bash
75
-
#enter your home dir!
76
-
nextflow run rki-mf1/clean --input_type illumina --input '/home/martin/.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
89
+
# quote the glob so that Nextflow, not the shell, expands it; since `~` is not expanded inside quotes, $HOME is given explicitly
90
+
nextflow run rki-mf1/clean -r v1.1.0 --input_type illumina --input $HOME/'.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
| dcs | ONT DNA-Seq reads |3.6 kb standard amplicon mapping the 3' end of the Lambda genome|https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5ug0kAQYoAg2Uk/159523e326b1b791e3b842c4791420a6/DNA_CS.txt|
102
114
| eno | ONT RNA-Seq reads |yeast ENO2 Enolase II of strain S288C, YHR174W|https://raw.githubusercontent.com/rki-mf1/clean/master/controls/S288C_YHR174W_ENO2_coding.fsa|
... for reasons. More can be easily added! Just write us, add an issue or make a pull request.
117
+
... for reasons. More can be easily added! Just write to us, open an issue, or make a pull request.
106
118
107
119
## Workflow
108
120
@@ -112,7 +124,7 @@ Included in this repository are:
112
124
113
125
## Results
114
126
115
-
Running the pipeline will create a directory called `results/` in the current directory with some or all of the following directories and files (plus additional failes for indices, ...):
127
+
Running the pipeline will create a directory called `results/` (can be changed via `--output`) in the current directory with some or all of the following directories and files (plus additional files for indices, ...):
116
128
117
129
```text
118
130
results/
@@ -157,20 +169,19 @@ results/
157
169
158
170
The most important files you are likely interested in are `results/clean/<sample_name>.fastq.gz`, which are the "cleaned" reads. These are the input reads that *do not* map to the host, control, own FASTA, or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the `--keep` option. Any reads that were removed from your input file are placed in `results/removed/<sample_name>.fastq.gz`.
159
171
160
-
For debugging purposes we also provide various intermediate results in the `intermediate/` folder.
172
+
For debugging purposes we also provide various intermediate results in the `intermediate/` folder. For mapping-based approaches (`minimap2`, `bwa`), you will also find a brief summary of mapped/unmapped reads and their proportions.
161
173
162
174
## Acknowledgements
163
175
164
-
Thanks to Matt Huska (@matthuska) for extensive testing of `CLEAN`, bug fixing, and reorganizing the output.
176
+
- Thanks to Matt Huska (@matthuska) for extensive testing of `CLEAN`, bug fixing, and reorganizing the output.
177
+
- Thanks to XXX for valuable feedback and a pull request adding a simple summary table for mapping-based approaches.
165
178
166
179
## Citations
167
180
168
181
If you use `CLEAN` in your work, please consider citing our preprint:
169
182
170
183
> Targeted decontamination of sequencing data with CLEAN
171
-
>
172
184
> Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer
Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
187
+
Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. **Please consider also citing these tools, because without them there would be no CLEAN!**