Commit 44e1d2f

update README
1 parent 06bdbc6 commit 44e1d2f

1 file changed

+37
-26
lines changed

README.md

Lines changed: 37 additions & 26 deletions
@@ -15,17 +15,15 @@ Technologies ([DNA CS (DCS)](https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5
1515

1616
## What this workflow does for you
1717

18-
With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formated sequence data. The results are the clean sequences and the sequences identified as contaminated.
19-
Per default [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences but I recommend using `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), to clean short-read data (_--bbduk_).
18+
With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formatted sequence data. The results are the clean sequences and the sequences identified as contaminated. Per default [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences (with the `map-ont` settings for Nanopore data and `sr` settings for short-read data activated automatically). However, for short-read data, you may want to switch to [BWA](https://github.com/lh3/bwa) (`--bwa`). As another alternative, we provide `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), as a kmer-based approach (`--bbduk`). However, no mapping file will be produced with `bbduk` and thus some subsequent statistics are not calculated.
2019

21-
You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped against the specified host, control and user defined FASTA files. All reads that map are considered as contamination. In case of Illumina paired-end reads, both mates need to be aligned.
20+
You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped (or kmer-based compared in case of `bbduk`) against the specified host, control, and user defined FASTA files. All reads that match are considered as contamination. In case of Illumina paired-end reads, both mates need to be aligned (singleton files will be produced otherwise).
2221

23-
If Nanopore (`--input_type nano`) and Illumina (`--input_type illumina`) reads and control(s) (`--control`) are set, the control is selectively concatenated with the host and own FASTA: `dcs` for Nanopore DNA-Seq, `eno` for Nanopore RNA-Seq and `phix` from Illumina data.
24-
Else, specified host, control and user defined FASTA files are concatenated.
22+
The read input is defined via `--input_type nano` for Nanopore and `--input_type illumina` or `--input_type illumina_single_end` for Illumina reads. Additional control(s) for decontamination can be defined via `--control`. If controls are defined, they are selectively concatenated with the host and potential own FASTA files for decontamination. We provide auto-download for the following controls: `dcs` for Nanopore DNA-Seq, `eno` for Nanopore RNA-Seq, and `phix` from Illumina data. In general, specified host, control, and user defined FASTA files are concatenated for decontamination.
2523

2624
### Filter soft-clipped contamination reads
2725

28-
We saw many soft-clipped reads after the mapping, that probably aren't contamination. With `--min_clip` the user can set a threshold for the number of soft-clipped positions (sum of both ends). If `--min_clip` is greater 1, the total number is considered, else the fraction of soft-clipped positions to the read length. The output consists of all mapped, soft-clipped and mapped reads passing the filer.
26+
We saw many soft-clipped reads after the mapping, that probably aren't contamination. With `--min_clip` the user can set a threshold for the number of soft-clipped positions (sum of both ends). If `--min_clip` is greater than 1, the total number is considered, else the fraction of soft-clipped positions to the read length. The output consists of all mapped, soft-clipped, and mapped reads passing the filter.
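
The threshold rule for `--min_clip` can be sketched in shell as follows. This is only an illustration and not code taken from CLEAN: the function name `clip_exceeds`, counting clipped bases from the CIGAR string, and the strict `>` comparison are all assumptions made for the example.

```bash
# Hypothetical helper (not part of CLEAN): report whether the soft clipping
# of one alignment exceeds a --min_clip style threshold.
#   $1 = threshold (values > 1 are absolute counts, otherwise a fraction)
#   $2 = CIGAR string of the alignment
# Exit status: 0 = exceeds threshold, 1 = does not, 2 = no read bases found.
clip_exceeds() {
  awk -v t="$1" -v cigar="$2" 'BEGIN {
    clipped = 0; len = 0
    # Walk the CIGAR string one operation at a time.
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      n  = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      # Soft-clipped bases from both ends are summed.
      if (op == "S") clipped += n
      # Operations that consume read bases add to the read length.
      if (op == "M" || op == "I" || op == "S" || op == "=" || op == "X") len += n
      cigar = substr(cigar, RLENGTH + 1)
    }
    if (len == 0) exit 2
    # Assumed rule: absolute count if threshold > 1, else fraction of read length.
    value = (t > 1) ? clipped : clipped / len
    exit ((value > t) ? 0 : 1)
  }'
}

# Example: 25 clipped bases in total, compared against an absolute threshold of 10.
if clip_exceeds 10 "20S80M5S"; then
  echo "soft clipping exceeds threshold"
fi
```

Whether a read that exceeds the threshold is kept or removed is decided by the pipeline itself; the sketch only shows how a single threshold value can be read as either a count or a fraction.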
2927

3028
## Requirements
3129

@@ -35,13 +33,20 @@ We saw many soft-clipped reads after the mapping, that probably aren't contamina
3533

3634
### Dependencies management
3735

36+
For dependency handling you have to use one of the following technologies:
37+
3838
- [Conda](https://docs.conda.io/en/latest/miniconda.html)
39+
- [Mamba](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html)
40+
- [Docker](https://docs.docker.com/get-docker/)
41+
- [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/installation.html)
3942

40-
and/or
43+
As default `docker` is used; to switch to another technology for dependency handling, e.g., `mamba`, use `-profile mamba`.
4144

42-
- [Docker](https://docs.docker.com/get-docker/)
45+
### Run engine
4346

44-
In default `docker` is used; to switch to `conda` use `-profile conda`.
47+
Per default we assume you are running the tool on a laptop or workstation (`local`). You can change the pipeline behaviour for example when running on an HPC with the SLURM workload manager via `-profile slurm`.
48+
49+
Dependencies and run engines can be combined, e.g., to run with Singularity on LSF use `-profile singularity,lsf`.
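
As a small sketch of how such a combined profile string might be assembled in a wrapper script: the helper `join_profiles` is hypothetical and not part of CLEAN or Nextflow, and the command is only printed here, not executed.

```bash
# Hypothetical helper: join profile names with commas for Nextflow's -profile option.
# The function body runs in a subshell so the IFS change does not leak out.
join_profiles() (
  IFS=,
  printf '%s\n' "$*"
)

# One dependency technology plus one run engine, as described above.
profiles=$(join_profiles singularity lsf)

# Print the resulting command instead of running it.
echo "nextflow run rki-mf1/clean -r v1.1.0 --help -profile $profiles"
```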
4550

4651
## Execution examples
4752

@@ -57,28 +62,35 @@ Get help:
5762
nextflow run rki-mf1/clean --help
5863
```
5964

65+
We always recommend running a release version. **Check for latest releases!** In these examples we use release `-r v1.1.0`:
66+
67+
```bash
68+
# check available release versions and branches
69+
nextflow info rki-mf1/clean
70+
# select a release and run it to show the help
71+
nextflow run rki-mf1/clean -r v1.1.0 --help
72+
```
73+
6074
Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.
6175

6276
```bash
6377
# uses Docker per default
64-
nextflow run rki-mf1/clean --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
78+
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
6579
--host eco --control dcs
6680

67-
# use conda instead of Docker
68-
nextflow run rki-mf1/clean --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
69-
--host eco --control dcs -profile conda
81+
# use mamba instead of Docker
82+
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
83+
--host eco --control dcs -profile mamba
7084
```
7185

72-
Clean Illumina paired-end data against your own reference FASTA using bbduk instead of minimap2.
86+
Clean Illumina paired-end data against your own reference FASTA using `bbduk` instead of `minimap2`.
7387

7488
```bash
75-
# enter your home dir!
76-
nextflow run rki-mf1/clean --input_type illumina --input '/home/martin/.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
89+
# we have to define the $HOME specifically here, not sure why
90+
nextflow run rki-mf1/clean -r v1.1.0 --input_type illumina --input $HOME/'.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
7791
--own ~/.nextflow/assets/rki-mf1/clean/test/ref.fasta.gz --bbduk
7892
```
7993

80-
Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.
81-
8294
## Supported species and control sequences
8395

8496
Currently supported are:
@@ -94,15 +106,15 @@ Currently supported are:
94106
|eco | _Escherichia coli_ | [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel] |
95107
|sc2 | _SARS-CoV-2_ | [ENA Sequence: MN908947.3 (Wuhan-Hu-1 complete genome) [web](https://www.ebi.ac.uk/ena/browser/view/MN908947.3) [fasta](https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3?download=true)] |
96108

97-
Included in this repository are:
109+
Controls included in this repository are:
98110

99111
|flag | recommended usage | control/spike | source |
100112
|-----|-|---------|-------|
101113
| dcs | ONT DNA-Seq reads |3.6 kb standard amplicon mapping the 3' end of the Lambda genome| https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5ug0kAQYoAg2Uk/159523e326b1b791e3b842c4791420a6/DNA_CS.txt |
102114
| eno | ONT RNA-Seq reads |yeast ENO2 Enolase II of strain S288C, YHR174W| https://raw.githubusercontent.com/rki-mf1/clean/master/controls/S288C_YHR174W_ENO2_coding.fsa |
103115
| phix| Illumina reads |enterobacteria_phage_phix174_sensu_lato_uid14015, NC_001422| ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/enterobacteria_phage_phix174_sensu_lato_uid14015/NC_001422.fna |
104116

105-
... for reasons. More can be easily added! Just write us, add an issue or make a pull request.
117+
... for reasons. More can be easily added! Just write us, add an issue, or make a pull request.
106118

107119
## Workflow
108120

@@ -112,7 +124,7 @@ Included in this repository are:
112124

113125
## Results
114126

115-
Running the pipeline will create a directory called `results/` in the current directory with some or all of the following directories and files (plus additional failes for indices, ...):
127+
Running the pipeline will create a directory called `results/` (can be changed via `--output`) in the current directory with some or all of the following directories and files (plus additional files for indices, ...):
116128

117129
```text
118130
results/
@@ -157,20 +169,19 @@ results/
157169

158170
The most important files you are likely interested in are `results/clean/<sample_name>.fastq.gz`, which are the "cleaned" reads. These are the input reads that *do not* map to the host, control, own fasta or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the `--keep` option. Any files that were removed from your input fasta file are placed in `results/removed/<sample_name>.fastq.gz`.
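
To get a quick impression of how many reads ended up in each of these files, something like the following sketch may help. It assumes gzip-compressed FASTQ with exactly four lines per record; the helper `count_fastq_reads` and the sample name `my_sample` are hypothetical, and file names may differ for your run (for example for paired-end data).

```bash
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes unwrapped FASTQ, i.e. every record spans exactly four lines.
count_fastq_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

sample="my_sample"  # hypothetical sample name, replace with your own

# Report read counts for the cleaned and the removed reads, if present.
for kind in clean removed; do
  f="results/$kind/$sample.fastq.gz"
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$kind" "$(count_fastq_reads "$f")"
  fi
done
```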
159171

160-
For debugging purposes we also provide various intermediate results in the `intermediate/` folder.
172+
For debugging purposes we also provide various intermediate results in the `intermediate/` folder. For mapping-based approaches (`minimap2`, `bwa`), you will also find a brief summary of mapped/unmapped reads and their proportions.
161173

162174
## Acknowledgements
163175

164-
Thanks to Matt Huska (@matthuska) for extensive testing of `CLEAN`, bug fixing, and reorganizing the output.
176+
- Thanks to Matt Huska (@matthuska) for extensive testing of `CLEAN`, bug fixing, and reorganizing the output.
177+
- Thanks to XXX for valuable feedback and a pull request adding a simple summary table for mapping-based approaches.
165178

166179
## Citations
167180

168181
If you use `CLEAN` in your work, please consider citing our preprint:
169182

170183
> Targeted decontamination of sequencing data with CLEAN
171-
>
172184
> Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer
173-
>
174185
> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089
175186
176-
Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
187+
Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. **Please consider also citing these tools because w/o them there would be no CLEAN!**
