NanoMark

DNA Assembly Benchmark for Nanopore long reads

A system for benchmarking DNA assembly tools, based on 3rd generation sequencers.

Setup

git clone https://github.com/kkrizanovic/NanoMark.git
cd NanoMark
src/nanomark.py setup

This will download and compile all tools necessary to run a benchmark.

Cgmemtime to measure runtime andmaximum memory consumption
Quast to estimate assembly quality
Everyhting needed to run each denovo assembly tool for whom a wrapper is defined in folder: ¨¨¨NanoMark/wrappers```

All tools will be downloaded to folder NanoMark/tools.

Running a benchmark

Run benchmark with a following command: NanoMark.py --benchmark <reads_file> <reference_file>

Benchmark results will be stored in folder: /intermediate Each benchmark will have its own folder with a randomly generated name, e.g. benchmark_129eed95-377d-4e8e-a5bf-309631e3df3d. Inside will be a folder for each assembler and a folder for Quast results (containing quast results for each assembler).

Summary Quast and cgmemtime results for each assembler are stored in a benchmark folder in a .tsv file benchmark_summary.tsv. The file contains the following values for each assembler:

Name : Assembler name

Obtained from cgmemtime:

Real time : real execution time
CPU time : CPU execution time
Maximum RSS : maximum memory consumption

Obtained from Quast:

contigs : number of contigs larger then 500bp generated by an assembler
Largest contig : largest contig generated by an assembler
Total length : total contig length generated by an assembler
GC (%) GC content of contigs generated by an assembler, can be compared to reference GC content to estimate assembly quality
N50 : N50 values for generated contigs
NG50 : NG50 value for generated contigs
misassemblies : number of missasembiled generated by an assembler
Genome fraction (%) : genome fraction covered by generated contigs
Duplication ratio : how many time is each reference base covered in generated contigs
N's per 100 kbp
mismatches per 100 kbp
indels per 100 kbp

Quast generated fields present in summarized results are represented in a list "qfields" in function "summarize_results" in NanoMark.py. New field can be added by modifying that list.

Including new assemblers in the benchmark

The benchmarking tool currently includes following assembly tools:

Loman, Quick and Simpson assembly pipeline (http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html)
PBcR
FALCON
SPAdes
ALLPATHS-LG

Additional assembly tools can be included by writting a wrapper script in Python. Each assembler that needs to be included in the benchmark must have a corresponding wrapper in folder: /wrappers. Wrapper script filenames must start with "wrapper_"

Each wrapper must define three varibales:

Installation path (ASSEMBLER_PATH)
Assembler name written to benchmark summary .tsv file (ASSEMBLER_NAME)
Results file filename (must include relative path from assembler installation path) (ASSEMBLER_RESULTS)

Each wrapper must also define two functions:

download_and_install() : installs the assembler and makes it ready to run
run(reads_file, reference_file, machine_name, output_path, output_suffix='') : runs the assembler on given reads and reference files, results are stored in a given folder: output_folder. Attributes machine_name and output_suffix are currently not used, but must be included in function header for compatibility.

Included wrappers are good examples of wrapper implementation.

Usage of wrapper scripts

Installing an assembler (often requires sudo access):

wrapper_falcon.py install

Running the assembly process consists of specifying all reads files in the form:

wrapper_?.py run output_folder reads_type,<reads_path>[<reads_path_b,frag_len,frag_stddev]

More than one dataset can be described in the same command line, simply by listing them in a space-separated manner.
Reads_type can be one of: nanopore/pacbio/single/paired/mate. If reads_type != "paired" or "mate", last three parameters can be omitted.
If reads_type == "paired" or "mate", other end of the pair needs to be in another file provided by reads_path_b.

Examples of usage:

wrapper_falcon.py run output_folder nanopore,reads.fa
wrapper_allpathslg.py run output_folder paired,datasets/frag_reads.Solexa-25396.A.fastq,datasets/frag_reads.Solexa-25396.B.fastq,180,10 mate,datasets/jump_reads.Solexa-42866.A.fastq,datasets/jump_reads.Solexa-42866.B.fastq,3000,500 mate,datasets/jump_reads.Solexa-44956.A.fastq,datasets/jump_reads.Solexa-44956.B.fastq,3000,500 nanopore,datasets/reads.fastq

Acknowledgement

This work has been supported in part by Croatian Science Fundation under the project UIP-11-2013-7353.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

NanoMark

Setup

Running a benchmark

contigs : number of contigs larger then 500bp generated by an assembler

misassemblies : number of missasembiled generated by an assembler

N's per 100 kbp

mismatches per 100 kbp

indels per 100 kbp

Including new assemblers in the benchmark

Usage of wrapper scripts

Acknowledgement

Files

README.md

Latest commit

History

README.md

File metadata and controls

NanoMark

Setup

Running a benchmark

contigs : number of contigs larger then 500bp generated by an assembler

misassemblies : number of missasembiled generated by an assembler

N's per 100 kbp

mismatches per 100 kbp

indels per 100 kbp

Including new assemblers in the benchmark

Usage of wrapper scripts

Acknowledgement