Merge pull request #81 from rki-mf1/dev

Dev
rki-mf1 · Jan 4, 2024 · b20203e
2 parents e5dcc55 + 71464d2
Showing 35 changed files with 816 additions and 340 deletions.
diff --git a/.github/workflows/dryrun.yml b/.github/workflows/dryrun.yml
@@ -0,0 +1,77 @@
+name: dry run CI
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+    branches:
+      - master
+      - dev
+
+
+env:
+  NXF_ANSI_LOG: false
+
+jobs:
+  test:
+    name: Dry run ${{ matrix.inputtype }}-${{ matrix.profile }}-nf_${{ matrix.NXF_VER }} pipeline test
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        NXF_VER:
+          - "23.04.1"
+          - "latest"
+        inputtype:
+          - fasta
+          - illumina
+          - nanopore
+        profile: ["docker", "conda"] # TODO "singularity"]
+        python-version:
+          - "3.9"
+    steps:
+      - name: Check out pipeline code
+        uses: actions/checkout@v3
+
+      - uses: actions/cache@v3
+        with:
+          path: /usr/local/bin/nextflow
+          key: ${{ runner.os }}
+          restore-keys: |
+            ${{ runner.os }}-nextflow-
+
+      - name: Install Nextflow
+        uses: nf-core/setup-nextflow@v1
+        with:
+          version: "${{ matrix.NXF_VER }}"
+
+      - name: Install nf-test
+        run: |
+          wget -qO- https://code.askimed.com/install/nf-test | bash
+          sudo mv nf-test /usr/local/bin/
+
+      - name: Set up Singularity
+        if: matrix.profile == 'singularity'
+        uses: eWaterCycle/setup-singularity@v5
+        with:
+          singularity-version: 3.7.1
+
+      - name: Set up miniconda
+        if: matrix.profile == 'conda'
+        uses: conda-incubator/setup-miniconda@v3
+        with:
+          auto-update-conda: true
+          conda-solver: libmamba
+          channels: conda-forge,bioconda,defaults
+          python-version: ${{ matrix.python-version }}
+
+      - name: Conda clean
+        if: matrix.profile == 'conda'
+        run: conda clean -a
+
+      - name: Run nf-test
+        run: nf-test test --profile=${{ matrix.profile }} tests/${{ matrix.inputtype }}/*.nf.test --tap=test.tap
+
+      - uses: pcolby/tap-summary@v1
+        with:
+          path: >-
+            test.tap
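
The `strategy.matrix` block in `dryrun.yml` above expands into one CI job per combination of its axes. As a rough illustration of what that means for this workflow, the following Python sketch enumerates the combinations and rebuilds the job name and `nf-test` command from the templates in the YAML. The helper names `expand`, `job_name` and `nf_test_command` are hypothetical and not part of the repository; the axis values and templates are copied from the diff.

```python
from itertools import product

# Matrix axes copied from .github/workflows/dryrun.yml in this commit.
matrix = {
    "NXF_VER": ["23.04.1", "latest"],
    "inputtype": ["fasta", "illumina", "nanopore"],
    "profile": ["docker", "conda"],
    "python-version": ["3.9"],
}


def expand(matrix):
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


def job_name(job):
    """Mirror the `name:` template of the workflow's `test` job (hypothetical helper)."""
    return f"Dry run {job['inputtype']}-{job['profile']}-nf_{job['NXF_VER']} pipeline test"


def nf_test_command(job):
    """Mirror the `Run nf-test` step of the workflow (hypothetical helper)."""
    return (
        f"nf-test test --profile={job['profile']} "
        f"tests/{job['inputtype']}/*.nf.test --tap=test.tap"
    )


jobs = expand(matrix)
print(len(jobs))  # 2 Nextflow versions x 3 input types x 2 profiles x 1 Python version = 12
for job in jobs:
    print(job_name(job), "->", nf_test_command(job))
```

Enabling the `singularity` profile that is currently marked as TODO would, under the same reasoning, raise the job count from 12 to 18.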
diff --git a/.gitignore b/.gitignore
@@ -9,4 +9,6 @@ work/
 results/
 centrifuge-cloud/
 conda/
-.vscode/
+singularity/
+.vscode/
+.nf-test/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,21 @@
 # Changelog
 
+## [v1.0.0] - 2024-01-04
+
+### Changed
+
+- reorganized results directory and file names
+
+### Added
+
+- dry run CI tests
+- options to reduce disk usage
+  - `--cleanup_work_dir` and `--no_intermediate`
+
+### Fixed
+
+- fixed some issues on Mac OS
+
 ## [v1.0.0-beta.1] - 2023-10-11
 
 ### Changed

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -29,6 +29,10 @@
 
   > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
 
+- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/27706213/)
+
+  > Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.
+
 - [QUAST](https://pubmed.ncbi.nlm.nih.gov/23422339/)
 
   > Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.

diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ Technologies ([DNA CS (DCS)](https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5
 
 ## What this workflow does for you
 
-With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formated sequence date. The results are the clean sequences and the sequences identified as contaminated.
+With this workflow you can screen and clean your Illumina, Nanopore or any FASTA-formatted sequence data. The results are the clean sequences and the sequences identified as contaminated.
 Per default [minimap2](https://github.com/lh3/minimap2) is used for aligning your sequences to reference sequences but I recommend using `bbduk`, part of [BBTools](https://github.com/BioInfoTools/BBMap), to clean short-read data (_--bbduk_).
 
 You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped against the specified host, control and user defined FASTA files. All reads that map are considered as contamination. In case of Illumina paired-end reads, both mates need to be aligned.
@@ -35,7 +35,7 @@ We saw many soft-clipped reads after the mapping, that probably aren't contamina
 
 ### Dependencies management
 
-- [Conda](https://docs.conda.io/en/latest/miniconda.html) 
+- [Conda](https://docs.conda.io/en/latest/miniconda.html)
 
 and/or
 
@@ -48,36 +48,36 @@ In default `docker` is used; to switch to `conda` use `-profile conda`.
 Get or update the workflow:
 
 ```bash
-nextflow pull hoelzer/clean
+nextflow pull rki-mf1/clean
 ```
 
 Get help:
 
 ```bash
-nextflow run hoelzer/clean --help
+nextflow run rki-mf1/clean --help
 ```
 
-Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.  
+Clean Nanopore data by filtering against a combined reference of the _E. coli_ genome and the Nanopore DNA CS spike-in.
 
 ```bash
 # uses Docker per default
-nextflow run hoelzer/clean --input_type nano --input ~/.nextflow/assets/hoelzer/clean/test/nanopore.fastq.gz \
+nextflow run rki-mf1/clean --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
 --host eco --control dcs
 
 # use conda instead of Docker
-nextflow run hoelzer/clean --input_type nano --input ~/.nextflow/assets/hoelzer/clean/test/nanopore.fastq.gz \
+nextflow run rki-mf1/clean --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
 --host eco --control dcs -profile conda
 ```
 
 Clean Illumina paired-end data against your own reference FASTA using bbduk instead of minimap2.
 
 ```bash
 # enter your home dir!
-nextflow run hoelzer/clean --input_type illumina --input '/home/martin/.nextflow/assets/hoelzer/clean/test/illumina*.R{1,2}.fastq.gz' \
---own ~/.nextflow/assets/hoelzer/clean/test/ref.fasta.gz --bbduk
+nextflow run rki-mf1/clean --input_type illumina --input '/home/martin/.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
+--own ~/.nextflow/assets/rki-mf1/clean/test/ref.fasta.gz --bbduk
 ```
 
-Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.  
+Clean some Illumina, Nanopore, and assembly files against the mouse and phiX genomes.
 
 ## Supported species and control sequences
 
@@ -98,7 +98,7 @@ Included in this repository are:
 |flag | recommended usage | control/spike | source |
 |-----|-|---------|-------|
 | dcs | ONT DNA-Seq reads |3.6 kb standard amplicon mapping the 3' end of the Lambda genome| https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5ug0kAQYoAg2Uk/159523e326b1b791e3b842c4791420a6/DNA_CS.txt |
-| eno | ONT RNA-Seq reads |yeast ENO2 Enolase II of strain S288C, YHR174W| https://raw.githubusercontent.com/hoelzer/clean/master/controls/S288C_YHR174W_ENO2_coding.fsa |
+| eno | ONT RNA-Seq reads |yeast ENO2 Enolase II of strain S288C, YHR174W| https://raw.githubusercontent.com/rki-mf1/clean/master/controls/S288C_YHR174W_ENO2_coding.fsa |
 | phix| Illumina reads |enterobacteria_phage_phix174_sensu_lato_uid14015, NC_001422| ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/enterobacteria_phage_phix174_sensu_lato_uid14015/NC_001422.fna |
 
 ... for reasons. More can be easily added! Just write me, add an issue or make a pull request.
@@ -107,14 +107,69 @@ Included in this repository are:
 
 ![chart](figures/clean_workflow_latest.png)
 
+<sub><sub>The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).</sub></sub>
+
+## Results
+
+Running the pipeline will create a directory called `results/` in the current directory with some or all of the following directories and files (plus additional files for indices, ...):
+
+```text
+results/
+├── clean/
+│   └── <sample_name>.fastq.gz
+├── removed/
+│   └── <sample_name>.fastq.gz
+├── intermediate/
+│   ├── map-to-remove/
+│   │   ├── <sample_name>.mapped.fastq.gz
+│   │   ├── <sample_name>.unmapped.fastq.gz
+│   │   ├── <sample_name>.sorted.bam
+│   │   ├── <sample_name>.sorted.bam.bai
+│   │   ├── <sample_name>.sorted.flagstat.txt
+│   │   ├── <sample_name>.sorted.idxstats.tsv
+│   │   ├── strict-dcs/
+│   │   │   ├── <sample_name>.no-dcs.bam
+│   │   │   ├── <sample_name>.true-dcs.bam
+│   │   │   └── <sample_name>.false-dcs.bam
+│   │   └── soft-clipped/
+│   │       ├── <sample_name>.soft-clipped.bam
+│   │       └── <sample_name>.passed-clipped.bam
+│   ├── map-to-keep/
+│   │   ├── <sample_name>.mapped.fastq.gz
+│   │   ├── <sample_name>.unmapped.fastq.gz
+│   │   ├── <sample_name>.sorted.bam
+│   │   ├── <sample_name>.sorted.bam.bai
+│   │   ├── <sample_name>.sorted.flagstat.txt
+│   │   ├── <sample_name>.sorted.idxstats.tsv
+│   │   ├── strict-dcs/
+│   │   │   ├── <sample_name>.no-dcs.bam
+│   │   │   ├── <sample_name>.true-dcs.bam
+│   │   │   └── <sample_name>.false-dcs.bam
+│   │   └── soft-clipped/
+│   │       ├── <sample_name>.soft-clipped.bam
+│   │       └── <sample_name>.passed-clipped.bam
+│   ├── host.fa.fai
+│   └── host.fa.gz
+├── logs/*.html
+└── qc/multiqc_report.html
+```
+
+The most important files are `results/clean/<sample_name>.fastq.gz`, which contain the "cleaned" reads. These are the input reads that *do not* map to the host, control, own FASTA or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the `--keep` option. Any reads that were removed from your input are placed in `results/removed/<sample_name>.fastq.gz`.
+
+For debugging purposes we also provide various intermediate results in the `intermediate/` folder.
+
+## Acknowledgements
+
+Thanks to Matt Huska (@matthuska) for extensive testing of `CLEAN`, bug fixing, and reorganizing the output.
+
 ## Citations
 
 If you use `CLEAN` in your work, please consider citing our preprint:
- 
+
 > Targeted decontamination of sequencing data with CLEAN
 >
 > Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer
 >
-> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089 
+> bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089
 
 Additionally, an extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
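
The Results section added to the README above names `results/clean/<sample_name>.fastq.gz` and `results/removed/<sample_name>.fastq.gz` as the two main outputs. A quick sanity check after a run is to count the reads in both files and see what fraction was removed. The sketch below is not part of the pipeline; it assumes single-file, gzipped FASTQ output with exactly four lines per record, and the function names `count_fastq_reads` and `summarize_sample` are hypothetical.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def summarize_sample(results_dir, sample_name):
    """Compare clean and removed read counts for one sample.

    Paths follow the layout described in the README's Results section;
    the helper itself is an illustration, not pipeline code.
    """
    results = Path(results_dir)
    clean = count_fastq_reads(results / "clean" / f"{sample_name}.fastq.gz")
    removed = count_fastq_reads(results / "removed" / f"{sample_name}.fastq.gz")
    total = clean + removed
    return {
        "clean": clean,
        "removed": removed,
        "removed_fraction": removed / total if total else 0.0,
    }
```

For example, `summarize_sample("results", "nanopore")` would report how many reads of a sample named `nanopore` were kept and how many were flagged as contamination, provided both files exist.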
diff --git a/assets/EMPTY_FILE b/assets/EMPTY_FILE
diff --git a/assets/EMPTY_FILE_R1.fastq b/assets/EMPTY_FILE_R1.fastq
diff --git a/assets/EMPTY_FILE_R2.fastq b/assets/EMPTY_FILE_R2.fastq