Bin QC Improvements #707

Open. Wants to merge 18 commits into base: dev
37 changes: 37 additions & 0 deletions .github/workflows/ci.yml
Expand Up @@ -163,3 +163,40 @@ jobs:
- name: Run pipeline with ${{ matrix.profile }} test profile
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results --binqc_tool checkm --checkm_db databases/checkm

checkm2:
name: Run single test for CheckM2 due to database download
# Only run on push if this is the nf-core dev branch (merged PRs)
if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/mag') }}
runs-on: ubuntu-latest

steps:
- name: Free some space
run: |
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"

- name: Check out pipeline code
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

- name: Install Nextflow
run: |
wget -qO- get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

- name: Clean up Disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Download and prepare CheckM2 database
run: |
mkdir -p databases/checkm2
wget https://zenodo.org/records/5571251/files/checkm2_database.tar.gz -P databases/checkm2
tar xzvf databases/checkm2/checkm2_database.tar.gz -C databases/checkm2/

- name: Run pipeline with test profile and CheckM2
run: |
nextflow run ${GITHUB_WORKSPACE} \
-profile test,docker \
--outdir ./results \
--binqc_tool checkm2 \
--checkm2_db databases/checkm2/CheckM2_database/uniref100.KO.1.dmnd
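Reviewer note: the job above hardcodes the path to the Diamond file inside the extracted archive. A small Python sketch of how one might locate that file instead of assuming its name. The helper `find_dmnd` is hypothetical, and the demo builds a temporary directory that imitates the layout implied by the `--checkm2_db` argument rather than touching a real download.

```python
import tempfile
from pathlib import Path


def find_dmnd(db_root: Path) -> Path:
    """Return the single .dmnd file under db_root; raise if none or several exist."""
    matches = sorted(db_root.rglob("*.dmnd"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one .dmnd under {db_root}, found {len(matches)}"
        )
    return matches[0]


# Demo on a throwaway directory that mimics the extracted archive layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "databases" / "checkm2"
    (root / "CheckM2_database").mkdir(parents=True)
    (root / "CheckM2_database" / "uniref100.KO.1.dmnd").touch()
    found = find_dmnd(root).relative_to(root)

print(found)
```

Failing loudly when zero or several `.dmnd` files are present would surface a changed archive layout in CI instead of passing a stale path to the pipeline.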
8 changes: 7 additions & 1 deletion CHANGELOG.md
Expand Up @@ -7,16 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Added`

- [#707](https://github.com/nf-core/mag/pull/707) - Make Bin QC a subworkflow (added by @dialvarezs)
- [#707](https://github.com/nf-core/mag/pull/707) - Added CheckM2 as an alternative bin completeness and QC tool (added by @dialvarezs)
- [#708](https://github.com/nf-core/mag/pull/708) - Added `--exclude_unbins_from_postbinning` parameter to exclude unbinned contigs from post-binning processes, speeding up Prokka in some cases (added by @dialvarezs)

### `Changed`

### `Fixed`

- [#708](https://github.com/nf-core/mag/pull/708) - Fixed channel passed as GUNC input (added by @dialvarezs)
- [#707](https://github.com/nf-core/mag/pull/708) - Fixed channel passed as GUNC input (added by @dialvarezs)

### `Dependencies`

| Tool | Previous version | New version |
| ------- | ---------------- | ----------- |
| CheckM2 | | 1.0.2 |

### `Deprecated`

## 3.2.1 [2024-10-30]
Expand Down
4 changes: 4 additions & 0 deletions CITATIONS.md
Expand Up @@ -40,6 +40,10 @@

> Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. doi: 10.1101/gr.186072.114

- [CheckM2](https://doi.org/10.1038/s41592-023-01940-w)

> Chklovski, A., Parks, D. H., Woodcroft, B. J., & Tyson, G. W. (2023). CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods, 20(8), 1203–1212. doi: 10.1038/s41592-023-01940-w

- [CONCOCT](https://doi.org/10.1038/nmeth.3103)

> Alneberg, J., Bjarnason, B. S., de Bruijn, I., Schirmer, M., Quick, J., Ijaz, U. Z., Lahti, L., Loman, N. J., Andersson, A. F., & Quince, C. (2014). Binning metagenomic contigs by coverage and composition. Nature Methods, 11(11), 1144–1146. doi: 10.1038/nmeth.3103
Expand Down
3 changes: 2 additions & 1 deletion README.md
Expand Up @@ -35,7 +35,7 @@ The pipeline then:
- performs assembly using [MEGAHIT](https://github.com/voutcn/megahit) and [SPAdes](http://cab.spbu.ru/software/spades/), and checks their quality using [Quast](http://quast.sourceforge.net/quast)
- (optionally) performs ancient DNA assembly validation using [PyDamage](https://github.com/maxibor/pydamage) and contig consensus sequence recalling with [Freebayes](https://github.com/freebayes/freebayes) and [BCFtools](http://samtools.github.io/bcftools/bcftools.html)
- predicts protein-coding genes for the assemblies using [Prodigal](https://github.com/hyattpd/Prodigal), and bins with [Prokka](https://github.com/tseemann/prokka) and optionally [MetaEuk](https://www.google.com/search?channel=fs&client=ubuntu-sn&q=MetaEuk)
- performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and/or with [CONCOCT](https://github.com/BinPro/CONCOCT), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/), or [CheckM](https://ecogenomics.github.io/CheckM/), and optionally [GUNC](https://grp-bork.embl-community.io/gunc/).
- performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and/or with [CONCOCT](https://github.com/BinPro/CONCOCT), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/), [CheckM](https://ecogenomics.github.io/CheckM/), or [CheckM2](https://github.com/chklovski/CheckM2), and optionally [GUNC](https://grp-bork.embl-community.io/gunc/).
- Performs ancient DNA validation and repair with [pyDamage](https://github.com/maxibor/pydamage) and [freebayes](https://github.com/freebayes/freebayes)
- optionally refines bins with [DAS Tool](https://github.com/cmks/DAS_Tool)
- assigns taxonomy to bins using [GTDB-Tk](https://github.com/Ecogenomics/GTDBTk) and/or [CAT](https://github.com/dutilh/CAT) and optionally identifies viruses in assemblies using [geNomad](https://github.com/apcamargo/genomad), or Eukaryotes with [Tiara](https://github.com/ibe-uw/tiara)
Expand Down Expand Up @@ -90,6 +90,7 @@ Other code contributors include:
- [Phil Palmer](https://github.com/PhilPalmer)
- [@willros](https://github.com/willros)
- [Adam Rosenbaum](https://github.com/muabnezor)
- [Diego Alvarez](https://github.com/dialvarezs)

Long read processing was inspired by [caspargross/HybridAssembly](https://github.com/caspargross/HybridAssembly) written by Caspar Gross [@caspargross](https://github.com/caspargross)

Expand Down
82 changes: 43 additions & 39 deletions bin/combine_tables.py
Expand Up @@ -3,10 +3,9 @@
## Originally written by Daniel Straub and Sabrina Krakau and released under the MIT license.
## See git repository (https://github.com/nf-core/mag) for full license text.


import sys
import argparse
import os.path
import sys

import pandas as pd


Expand All @@ -19,19 +18,14 @@ def parse_args(args=None):
metavar="FILE",
help="Bin depths summary file.",
)
parser.add_argument("-b", "--binqc_summary", metavar="FILE", help="Bin QC summary file (BUSCO, CheckM or CheckM2).")
parser.add_argument("-q", "--quast_summary", metavar="FILE", help="QUAST BINS summary file.")
parser.add_argument("-g", "--gtdbtk_summary", metavar="FILE", help="GTDB-Tk summary file.")
parser.add_argument("-a", "--cat_summary", metavar="FILE", help="CAT table file.")
parser.add_argument(
"-b", "--busco_summary", metavar="FILE", help="BUSCO summary file."
)
parser.add_argument(
"-c", "--checkm_summary", metavar="FILE", help="CheckM summary file."
)
parser.add_argument(
"-q", "--quast_summary", metavar="FILE", help="QUAST BINS summary file."
)
parser.add_argument(
"-g", "--gtdbtk_summary", metavar="FILE", help="GTDB-Tk summary file."
"-t", "--binqc_tool", help="Bin QC tool used", choices=["busco", "checkm", "checkm2"]
)
parser.add_argument("-a", "--cat_summary", metavar="FILE", help="CAT table file.")

parser.add_argument(
"-o",
"--out",
Expand Down Expand Up @@ -81,9 +75,7 @@ def parse_cat_table(cat_table):
)
# merge all rank columns into a single column
df["CAT_rank"] = (
df.filter(regex="rank_\d+")
.apply(lambda x: ";".join(x.dropna()), axis=1)
.str.lstrip()
df.filter(regex=r"rank_\d+").apply(lambda x: ";".join(x.dropna()), axis=1).str.lstrip()
)
# remove rank_* columns
df.drop(df.filter(regex=r"rank_\d+").columns, axis=1, inplace=True)
Expand All @@ -95,39 +87,36 @@ def main(args=None):
args = parse_args(args)

if (
not args.busco_summary
and not args.checkm_summary
not args.binqc_summary
and not args.quast_summary
and not args.gtdbtk_summary
):
sys.exit(
"No summary specified! Please specify at least BUSCO, CheckM or QUAST summary."
"No summary specified! "
"Please specify at least BUSCO, CheckM, CheckM2 or QUAST summary."
)

# GTDB-Tk can only be run in combination with BUSCO or CheckM
if args.gtdbtk_summary and not (args.busco_summary or args.checkm_summary):
# GTDB-Tk can only be run in combination with BUSCO, CheckM or CheckM2
if args.gtdbtk_summary and not args.binqc_summary:
sys.exit(
"Invalid parameter combination: GTDB-TK summary specified, but no BUSCO or CheckM summary!"
"Invalid parameter combination: "
"GTDB-TK summary specified, but no BUSCO, CheckM or CheckM2 summary!"
)

# handle bin depths
results = pd.read_csv(args.depths_summary, sep="\t")
results.columns = [
"Depth " + str(col) if col != "bin" else col for col in results.columns
]
results.columns = ["Depth " + str(col) if col != "bin" else col for col in results.columns]
bins = results["bin"].sort_values().reset_index(drop=True)

if args.busco_summary:
busco_results = pd.read_csv(args.busco_summary, sep="\t")
if not bins.equals(
busco_results["GenomeBin"].sort_values().reset_index(drop=True)
):
if args.binqc_summary and args.binqc_tool == "busco":
busco_results = pd.read_csv(args.binqc_summary, sep="\t")
if not bins.equals(busco_results["GenomeBin"].sort_values().reset_index(drop=True)):
sys.exit("Bins in BUSCO summary do not match bins in bin depths summary!")
results = pd.merge(
results, busco_results, left_on="bin", right_on="GenomeBin", how="outer"
) # assuming depths for all bins are given

if args.checkm_summary:
if args.binqc_summary and args.binqc_tool == "checkm":
use_columns = [
"Bin Id",
"Marker lineage",
Expand All @@ -147,22 +136,37 @@ def main(args=None):
"4",
"5+",
]
checkm_results = pd.read_csv(args.checkm_summary, usecols=use_columns, sep="\t")
checkm_results = pd.read_csv(args.binqc_summary, usecols=use_columns, sep="\t")
checkm_results["Bin Id"] = checkm_results["Bin Id"] + ".fa"
if not bins.equals(
checkm_results["Bin Id"].sort_values().reset_index(drop=True)
):
if not bins.equals(checkm_results["Bin Id"].sort_values().reset_index(drop=True)):
sys.exit("Bins in CheckM summary do not match bins in bin depths summary!")
results = pd.merge(
results, checkm_results, left_on="bin", right_on="Bin Id", how="outer"
) # assuming depths for all bins are given
results["Bin Id"] = results["Bin Id"].str.removesuffix(".fa")

if args.binqc_summary and args.binqc_tool == "checkm2":
use_columns = [
"Name",
"Completeness",
"Contamination",
"Completeness_Model_Used",
"Coding_Density",
"Translation_Table_Used",
"Total_Coding_Sequences",
]
checkm2_results = pd.read_csv(args.binqc_summary, usecols=use_columns, sep="\t")
checkm2_results["Name"] = checkm2_results["Name"] + ".fa"
if not set(checkm2_results["Name"]).issubset(set(bins)):
sys.exit("Bins in CheckM2 summary are not a subset of the bins in bin depths summary!")
results = pd.merge(
results, checkm2_results, left_on="bin", right_on="Name", how="outer"
) # assuming depths for all bins are given
results["Name"] = results["Name"].str.removesuffix(".fa")

if args.quast_summary:
quast_results = pd.read_csv(args.quast_summary, sep="\t")
if not bins.equals(
quast_results["Assembly"].sort_values().reset_index(drop=True)
):
if not bins.equals(quast_results["Assembly"].sort_values().reset_index(drop=True)):
sys.exit("Bins in QUAST summary do not match bins in bin depths summary!")
results = pd.merge(
results, quast_results, left_on="bin", right_on="Assembly", how="outer"
Expand Down
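Reviewer note: the CheckM2 branch above uses a subset check rather than the strict equality applied to BUSCO, CheckM and QUAST, so bins without a CheckM2 row survive the outer merge with empty QC columns. A minimal sketch of that behaviour, assuming pandas 1.4 or later; the bin names and values are made up for illustration and do not come from the PR.

```python
import io

import pandas as pd

# Hypothetical bin depths table: three bins, one depth column.
depths = pd.read_csv(
    io.StringIO("bin\tsample1\nbin.1.fa\t12.5\nbin.2.fa\t3.1\nbin.3.fa\t7.8\n"),
    sep="\t",
)

# Hypothetical CheckM2 summary in which bin.3 is absent
# (for example because its CheckM2 task exited with 1 and was ignored).
checkm2 = pd.read_csv(
    io.StringIO("Name\tCompleteness\tContamination\nbin.1\t98.2\t1.1\nbin.2\t45.0\t0.3\n"),
    sep="\t",
)
checkm2["Name"] = checkm2["Name"] + ".fa"

# Subset check: every CheckM2 bin must be known, but not every bin needs a CheckM2 row.
is_subset = set(checkm2["Name"]).issubset(set(depths["bin"]))

# The outer merge keeps bin.3.fa, leaving NaN in the CheckM2 columns.
merged = pd.merge(depths, checkm2, left_on="bin", right_on="Name", how="outer")
merged["Name"] = merged["Name"].str.removesuffix(".fa")

# Bins that have depths but no QC values.
missing_qc = merged.loc[merged["Completeness"].isna(), "bin"].tolist()
print(is_subset, missing_qc)
```

Downstream consumers of the combined table may therefore need to handle missing completeness and contamination values when CheckM2 is the selected tool.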
6 changes: 4 additions & 2 deletions conf/base.config
Expand Up @@ -160,12 +160,14 @@ process {
cpus = { 8 * task.attempt }
memory = { 20.GB * task.attempt }
}

withName: MAXBIN2 {
errorStrategy = { task.exitStatus in [1, 255] ? 'ignore' : 'retry' }
}

withName: DASTOOL_DASTOOL {
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : task.exitStatus == 1 ? 'ignore' : 'finish' }
}
//CheckM2 returns exit code 1 when Diamond doesn't find any hits
withName: CHECKM2_PREDICT {
errorStrategy = { task.exitStatus in (130..145) ? 'retry' : task.exitStatus == 1 ? 'ignore' : 'finish' }
}
}
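Reviewer note: the `CHECKM2_PREDICT` closure above maps exit codes to three outcomes. The same decision logic is restated below as a plain Python function, purely for illustration. The function name `error_strategy` is hypothetical, and this is not how Nextflow evaluates the closure.

```python
def error_strategy(exit_status: int) -> str:
    """Mirror the CHECKM2_PREDICT errorStrategy closure shown in the hunk above."""
    # The Groovy range (130..145) is inclusive on both ends.
    if 130 <= exit_status <= 145:
        return "retry"
    # Exit code 1 is ignored, so the pipeline continues without this bin set.
    if exit_status == 1:
        return "ignore"
    # Any other failure lets running tasks complete and then stops the run.
    return "finish"


for code in (1, 137, 2):
    print(code, error_strategy(code))
```

Under this rule every exit code 1 is ignored, not only the Diamond no-hits case named in the comment, which is worth keeping in mind when a CheckM2 report is unexpectedly absent.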
33 changes: 29 additions & 4 deletions conf/modules.config
Expand Up @@ -351,7 +351,11 @@ process {
withName: CHECKM_LINEAGEWF {
tag = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}_wf" }
publishDir = [path: { "${params.outdir}/GenomeBinning/QC/CheckM" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: CHECKM_QA {
Expand All @@ -364,9 +368,30 @@ process {
]
}

withName: COMBINE_CHECKM_TSV {
ext.prefix = { "checkm_summary" }
publishDir = [path: { "${params.outdir}/GenomeBinning/QC" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
withName: COMBINE_BINQC_TSV {
ext.prefix = { "${params.binqc_tool}_summary" }
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: CHECKM2_DATABASEDOWNLOAD {
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM2/checkm2_downloads" },
mode: params.publish_dir_mode,
overwrite: false,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
enabled: params.save_checkm2_data
]
}

withName: CHECKM2_PREDICT {
ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM2" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: GUNC_DOWNLOADDB {
Expand Down
30 changes: 28 additions & 2 deletions docs/output.md
Expand Up @@ -540,7 +540,7 @@ Besides the reference files or output files created by BUSCO, the following summ

#### CheckM

[CheckM](https://ecogenomics.github.io/CheckM/) CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage
[CheckM](https://ecogenomics.github.io/CheckM/) provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.

By default, nf-core/mag runs CheckM with the `lineage_wf` workflow, which places genome bins on a reference tree to define lineage-specific marker sets and estimates completeness and contamination from those marker genes. It then runs `qa` to generate the summary files.

Expand All @@ -550,7 +550,8 @@ By default, nf-core/mag runs CheckM with the `check_lineage` workflow that place
- `GenomeBinning/QC/CheckM/`
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_qa.txt`: Detailed statistics about bins informing completeness and contamination scores (output of `checkm qa`). This should normally be your main file to use to evaluate your results.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_wf.tsv`: Overall summary file for completeness and contamination (output of `checkm lineage_wf`).
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: Intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
- `GenomeBinning/QC/`
- `checkm_summary.tsv`: A summary table of the CheckM results for all bins (output of `checkm qa`).

</details>
Expand All @@ -566,6 +567,31 @@ If the parameter `--save_checkm_reference` is set, additionally the used the Che

</details>

#### CheckM2

[CheckM2](https://github.com/chklovski/CheckM2) is a tool for assessing the quality of metagenome-derived genomes. It uses a machine learning approach to predict the completeness and contamination of a genome regardless of its taxonomic lineage.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/CheckM2/`
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/quality_report.tsv`: Detailed statistics about bins informing completeness and contamination scores. This should normally be your main file to use to evaluate your results.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: Intermediate files for CheckM2 results, including CheckM2 generated annotations, log, and Diamond alignment results.
- `GenomeBinning/QC/`
- `checkm2_summary.tsv`: A summary table of the CheckM2 results for all bins.

</details>

If the parameter `--save_checkm2_data` is set, the CheckM2 reference datasets will be stored in the output directory.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/CheckM2/`
- `checkm2_downloads/CheckM2_database/*.dmnd`: Diamond database used by CheckM2.

</details>
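
As an illustration of how the CheckM2 summary could be consumed downstream, the sketch below filters bins by completeness and contamination using only the Python standard library. The column names `Name`, `Completeness` and `Contamination` are the ones read by `bin/combine_tables.py` in this PR. The sample rows, the function name `passing_bins` and the 90 % / 5 % thresholds are assumptions chosen for the example, not pipeline defaults.

```python
import csv
import io

# Hypothetical excerpt of a CheckM2 summary; only the three columns used here are shown.
REPORT = (
    "Name\tCompleteness\tContamination\n"
    "bin.1\t98.2\t1.1\n"
    "bin.2\t45.0\t0.3\n"
    "bin.3\t92.4\t7.9\n"
)


def passing_bins(handle, min_completeness=90.0, max_contamination=5.0):
    """Return the names of bins that meet both (assumed) thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if float(row["Completeness"]) >= min_completeness
        and float(row["Contamination"]) <= max_contamination
    ]


print(passing_bins(io.StringIO(REPORT)))
```

To run this against real output, replace the `io.StringIO` handle with an open file handle on `checkm2_summary.tsv` and adjust the thresholds to your own quality criteria.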

#### GUNC

[Genome UNClutterer (GUNC)](https://grp-bork.embl-community.io/gunc/index.html) is a tool for detection of chimerism and contamination in prokaryotic genomes resulting from mis-binning of genomic contigs from unrelated lineages. It does so by applying an entropy based score on taxonomic assignment and contig location of all genes in a genome. It is generally considered as an additional complement to CheckM results.
Expand Down
10 changes: 10 additions & 0 deletions modules.json
Expand Up @@ -62,6 +62,16 @@
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"checkm2/databasedownload": {
"branch": "master",
"git_sha": "e17652681c856afaf2e240ba4c98bf4631a0fe2d",
"installed_by": ["modules"]
},
"checkm2/predict": {
"branch": "master",
"git_sha": "e17652681c856afaf2e240ba4c98bf4631a0fe2d",
"installed_by": ["modules"]
},
"concoct/concoct": {
"branch": "master",
"git_sha": "baa30accc6c50ea8a98662417d4f42ed18966353",
Expand Down