Release 0.8.3

metagenlab · May 24, 2024 · 4587e12
2 parents 8e02ed1 + b06c09f
commit 4587e12
Showing 65 changed files with 1,924 additions and 1,948 deletions.
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,20 @@
+docs/
+
+mess.egg-info/
+mess/__pycache__
+build/
+
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+tests/__pycache__
+.pytest_cache
+
+.snakemake
+mess/workflow/conda
+mess/workflow/taxonkit
diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
@@ -0,0 +1,68 @@
+name: Docker publish
+
+on:
+  push:
+    branches: ["main"]
+    tags: ["v*.*.*"]
+  pull_request:
+    branches: ["main"]
+
+env:
+  REGISTRY: ghcr.io
+  IMAGE_NAME: ${{ github.repository }}
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+      id-token: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install cosign
+        if: github.event_name != 'pull_request'
+        uses: sigstore/cosign-installer@v3.5.0
+        with:
+          cosign-release: "v2.2.4"
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Log into registry ${{ env.REGISTRY }}
+        if: github.event_name != 'pull_request'
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract Docker metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
+
+      - name: Build and push Docker image
+        id: build-and-push
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
+
+      - name: Sign the published Docker image
+        if: ${{ github.event_name != 'pull_request' }}
+        env:
+          TAGS: ${{ steps.meta.outputs.tags }}
+          DIGEST: ${{ steps.build-and-push.outputs.digest }}
+        run: echo "${TAGS}" | xargs -I {} cosign sign --yes {}@${DIGEST}
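
Review note on the signing step: the metadata action emits one tag per line, and `xargs -I {}` runs `cosign sign` once per input line, each time pinned to the same image digest. A minimal Python sketch of the references that end up being signed is shown here. The function name `signing_refs` and the example tag and digest values are hypothetical and do not come from this commit.

```python
def signing_refs(tags: str, digest: str) -> list[str]:
    """Build one `tag@digest` reference per non-empty line in `tags`."""
    return [f"{line.strip()}@{digest}" for line in tags.splitlines() if line.strip()]


# Hypothetical values, for illustration only.
example_tags = "ghcr.io/example/image:main\nghcr.io/example/image:v0.0.0"
example_digest = "sha256:" + "0" * 64

refs = signing_refs(example_tags, example_digest)
for ref in refs:
    print(ref)
```

Signing `tag@digest` rather than the bare tag ties the signature to the exact image that was pushed, even if the tag is moved later.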
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,17 @@
+FROM mambaorg/micromamba
+LABEL org.opencontainers.image.source=https://github.com/metagenlab/MeSS
+LABEL org.opencontainers.image.description="Snakemake pipeline for simulating shotgun metagenomic samples"
+LABEL org.opencontainers.image.licenses=MIT
+ADD . /tmp/repo
+WORKDIR /tmp/repo
+ENV LANG=C.UTF-8
+ENV SHELL=/bin/bash
+USER root
+
+RUN micromamba install -q -y -c bioconda -c conda-forge -n base \
+    mess --only-deps && \
+    micromamba install -q -y -c conda-forge -n base mamba && \
+    micromamba clean -afy
+
+ENV PATH=/opt/conda/bin:${PATH}
+RUN pip install .
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
diff --git a/docs/benchmarks/index.md b/docs/benchmarks/index.md
@@ -0,0 +1,9 @@
+# Benchmarks
+We benchmarked MeSS against CAMISIM, the state-of-the-art metagenome simulator, in terms of species composition and resource usage.
+
+We show that MeSS generates the same species composition as CAMISIM while being 10x faster.
+
+## [Species composition](species-composition.md)
+
+## [Resource usage](resource-usage.md)
+
diff --git a/docs/benchmarks/resource-usage.md b/docs/benchmarks/resource-usage.md
@@ -0,0 +1,29 @@
+16 samples were used to benchmark the resource usage of MeSS and CAMISIM.
+
+Samples were created by subsampling 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40, 80, 160, 320 and 640 genomes from a total of 2000 complete bacterial genomes (downloaded with [assembly_finder](https://github.com/metagenlab/assembly_finder)).
+
+Each genome was covered at 1x using art_illumina with CAMISIM's custom MBARC error model.
+
+See [this nextflow pipeline](https://github.com/farchaab/benchmark-MeSS-CAMISIM) to run the benchmark.
+## Results
+### Physical RAM usage
+
+![ram](../images/ram-usage.svg)
+
+### CPU usage
+
+![cpu-usage](../images/cpu-usage.svg)
+
+### CPU time
+
+![cpu-time](../images/cpu-time.svg)
+
+!!! warning
+    To simulate a sample with 2.4G base pairs, using one CPU, CAMISIM takes 32 hours, while MeSS takes 3 hours.
+
+## Conclusions
+MeSS vs CAMISIM on average:
+
+- [x] 5x more parallel (CPU usage)
+- [x] 10x faster using one CPU (CPU time)
+- [x] Uses 16.7x less memory (physical RAM)
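
Review note on the benchmark design described in this file: a minimal Python sketch of how 16 samples of increasing size could be drawn from a pool of 2000 genomes. This is not the linked nextflow pipeline. The function name `subsample_genomes`, the placeholder genome identifiers, the fixed seed, and the choice to draw each sample independently are all assumptions made for illustration.

```python
import random

# Sample sizes listed in the benchmark description (16 in total).
SAMPLE_SIZES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40, 80, 160, 320, 640]


def subsample_genomes(genome_ids, sizes, seed=0):
    """Draw one random subset of genomes (without replacement) per size."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {size: rng.sample(genome_ids, size) for size in sizes}


# Placeholder identifiers standing in for the 2000 complete bacterial genomes.
genomes = [f"genome_{i:04d}" for i in range(2000)]
samples = subsample_genomes(genomes, SAMPLE_SIZES)

# Single-CPU speedup implied by the 2.4G base pair example: 32 h versus 3 h.
speedup = 32 / 3
print(f"{len(samples)} samples, speedup of about {speedup:.1f}x")
```

The 32 h versus 3 h figures give a ratio of roughly 10.7 for that one sample, which is in line with the 10x average reported in the conclusions.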
diff --git a/docs/benchmarks/species-composition.md b/docs/benchmarks/species-composition.md
@@ -0,0 +1,61 @@
+5 samples from the [Human Microbiome Project](https://www.hmpdacc.org/hmp/) were classified with [kraken2](https://github.com/DerrickWood/kraken2) and [bracken](https://github.com/jenniferlu717/Bracken). Taxa with at least 200 reads were kept and used as input to both MeSS and CAMISIM.
+
+Use [this nextflow pipeline](https://github.com/farchaab/benchmark-MeSS-CAMISIM) to generate the fastqs. 
+
+## Results
+
+[microViz](https://github.com/david-barnett/microViz/) was used for the ordination plots and statistical tests.
+
+### Bray-Curtis
+
+![bray](../images/species-bray-NMDS.svg)
+
+:material-arrow-right: Samples from the same body site cluster together. In addition, simulated samples cluster well with real samples (gold_standard and gs_filtered).
+
+### PERMANOVA
+
+:simple-hypothesis: **Null hypothesis** : No significant difference in species composition between simulated and non-simulated samples
+
+??? info "**Code**"
+    ```R
+    perm <- dist_permanova(mdist,
+        variables = "origin:simulated+body_site",
+        n_perms = 999, 
+        n_processes = 3
+    )
+    ```
+
+```R
+                 Df SumOfSqs      R2       F Pr(>F)
+body_site         3   12.153 0.37843 15.6933  0.001 ***
+origin:simulated  3    1.117 0.03479  1.4429  0.067 .
+Residual         73   18.844 0.58678
+Total            79   32.115 1.00000
+```
+
+:material-arrow-right: Significant difference between body sites (p = 0.001). No significant difference between simulated and real samples (p = 0.067).
+
+### Beta dispersion
+
+:simple-hypothesis: **Null hypothesis** : No significant difference in dispersion between samples of different origin
+
+```R
+Fit: aov(formula = distances ~ group, data = df)
+
+$group
+                                   diff         lwr        upr     p adj
+gs_filtered-gold_standard  2.249163e-03 -0.03593552 0.04043384 0.9986690
+camisim-gold_standard     -2.310968e-02 -0.06129435 0.01507500 0.3905351
+mess-gold_standard        -2.308946e-02 -0.06127414 0.01509522 0.3913195
+camisim-gs_filtered       -2.535884e-02 -0.06354352 0.01282584 0.3082419
+mess-gs_filtered          -2.533862e-02 -0.06352330 0.01284606 0.3089344
+mess-camisim               2.021632e-05 -0.03816446 0.03820490 1.0000000
+```
+
+:material-arrow-right: No significant difference in dispersion between filtered and non-filtered samples, or between simulated and real samples (all adjusted p > 0.3).
+
+## Conclusions
+
+- [x] Same species composition between original and filtered samples
+- [x] Same species composition between MeSS and CAMISIM
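
Review note on the comparison described in this file: a minimal Python sketch of the two steps behind the ordination, namely keeping taxa with at least 200 reads and computing the Bray-Curtis dissimilarity between two samples. The actual analysis used microViz in R, and the docs do not say whether counts were transformed first, so this uses the textbook definition on raw counts. The function names and the taxon counts are hypothetical.

```python
def filter_taxa(counts, min_reads=200):
    """Keep only taxa with at least `min_reads` assigned reads."""
    return {taxon: n for taxon, n in counts.items() if n >= min_reads}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    taxa = set(a) | set(b)  # a taxon missing from one sample counts as 0
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    if total == 0:
        raise ValueError("both samples are empty")
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    return diff / total


# Hypothetical read counts, for illustration only.
real = filter_taxa({"taxon_a": 500, "taxon_b": 300, "taxon_c": 150})
simulated = {"taxon_a": 450, "taxon_b": 350}

distance = bray_curtis(real, simulated)
print(f"Bray-Curtis dissimilarity: {distance:.4f}")
```

A value of 0 means identical composition and 1 means no shared taxa, which is why simulated samples that sit close to their real counterparts in the NMDS plot support the conclusions above.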
+
diff --git a/docs/citation.md b/docs/citation.md
@@ -1,3 +1,3 @@
 # Citation
 
-![`mess citation`](docs/images/mess-citation.svg)
+![`mess citation`](images/mess-citation.svg)