DNA Assembly Benchmark for Nanopore long reads
A system for benchmarking DNA assembly tools, based on 3rd generation sequencers.
git clone https://github.com/kkrizanovic/NanoMark.git
cd NanoMark
src/nanomark.py setup
This will download and compile all tools necessary to run a benchmark.
- Cgmemtime to measure runtime andmaximum memory consumption
- Quast to estimate assembly quality
- Everyhting needed to run each denovo assembly tool for whom a wrapper is defined in folder: ¨¨¨NanoMark/wrappers```
All tools will be downloaded to folder NanoMark/tools
.
Run benchmark with a following command: NanoMark.py --benchmark <reads_file> <reference_file>
Benchmark results will be stored in folder: /intermediate Each benchmark will have its own folder with a randomly generated name, e.g. benchmark_129eed95-377d-4e8e-a5bf-309631e3df3d. Inside will be a folder for each assembler and a folder for Quast results (containing quast results for each assembler).
Summary Quast and cgmemtime results for each assembler are stored in a benchmark folder in a .tsv file benchmark_summary.tsv. The file contains the following values for each assembler:
- Name : Assembler name
Obtained from cgmemtime:
- Real time : real execution time
- CPU time : CPU execution time
- Maximum RSS : maximum memory consumption
Obtained from Quast:
- Largest contig : largest contig generated by an assembler
- Total length : total contig length generated by an assembler
- GC (%) GC content of contigs generated by an assembler, can be compared to reference GC content to estimate assembly quality
- N50 : N50 values for generated contigs
- NG50 : NG50 value for generated contigs
- Genome fraction (%) : genome fraction covered by generated contigs
- Duplication ratio : how many time is each reference base covered in generated contigs
Quast generated fields present in summarized results are represented in a list "qfields" in function "summarize_results" in NanoMark.py. New field can be added by modifying that list.
The benchmarking tool currently includes following assembly tools:
- Loman, Quick and Simpson assembly pipeline (http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html)
- PBcR
- FALCON
- SPAdes
- ALLPATHS-LG
Additional assembly tools can be included by writting a wrapper script in Python. Each assembler that needs to be included in the benchmark must have a corresponding wrapper in folder: /wrappers. Wrapper script filenames must start with "wrapper_"
Each wrapper must define three varibales:
- Installation path (ASSEMBLER_PATH)
- Assembler name written to benchmark summary .tsv file (ASSEMBLER_NAME)
- Results file filename (must include relative path from assembler installation path) (ASSEMBLER_RESULTS)
Each wrapper must also define two functions:
- download_and_install() : installs the assembler and makes it ready to run
- run(reads_file, reference_file, machine_name, output_path, output_suffix='') : runs the assembler on given reads and reference files, results are stored in a given folder: output_folder. Attributes machine_name and output_suffix are currently not used, but must be included in function header for compatibility.
Included wrappers are good examples of wrapper implementation.
Installing an assembler (often requires sudo access):
wrapper_falcon.py install
Running the assembly process consists of specifying all reads files in the form:
wrapper_?.py run output_folder reads_type,<reads_path>[<reads_path_b,frag_len,frag_stddev]
More than one dataset can be described in the same command line, simply by listing them in a space-separated manner.
Reads_type can be one of: nanopore/pacbio/single/paired/mate. If reads_type != "paired" or "mate", last three parameters can be omitted.
If reads_type == "paired" or "mate", other end of the pair needs to be in another file provided by reads_path_b.
Examples of usage:
wrapper_falcon.py run output_folder nanopore,reads.fa
wrapper_allpathslg.py run output_folder paired,datasets/frag_reads.Solexa-25396.A.fastq,datasets/frag_reads.Solexa-25396.B.fastq,180,10 mate,datasets/jump_reads.Solexa-42866.A.fastq,datasets/jump_reads.Solexa-42866.B.fastq,3000,500 mate,datasets/jump_reads.Solexa-44956.A.fastq,datasets/jump_reads.Solexa-44956.B.fastq,3000,500 nanopore,datasets/reads.fastq
This work has been supported in part by Croatian Science Fundation under the project UIP-11-2013-7353.