Automatic read mapping and genome size estimation from coverage.
Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an assembly with bwa mem or minimap2, execution of qualimap bamqc, and estimation of genome size from the number of mapped nucleotides divided by the mode of the coverage distribution (>0). This method was first published in Schell et al. (2017). To demonstrate the accuracy and reliability of the method across the tree of life, Pfenninger et al. (2021) published a study comparing different estimators. Currently, only the estimator Nbm/m (number of back-mapped bases divided by the modal value of the sequencing depth distribution) is implemented in this script.
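The Nbm/m estimator described above can be illustrated with a minimal Python sketch. It is not the script's own code: the function name `estimate_genome_size` and the input format (a mapping from sequencing depth to the number of assembly positions at that depth) are assumptions made for illustration, and the number of back-mapped bases is approximated as the sum of depth times positions.

```python
def estimate_genome_size(depth_hist):
    """Estimate genome size as mapped bases / modal depth (depth > 0).

    depth_hist is a hypothetical input: {depth: number of positions
    covered at that depth}.
    """
    # Uncovered positions are ignored: the mode is taken over depths > 0.
    covered = {d: n for d, n in depth_hist.items() if d > 0 and n > 0}
    if not covered:
        raise ValueError("no covered positions in histogram")
    # Assumption: mapped bases approximated as sum(depth * positions).
    mapped_bases = sum(d * n for d, n in covered.items())
    # Modal depth: the depth seen at the most positions (ties: lowest depth).
    mode = min(covered, key=lambda d: (-covered[d], d))
    return mapped_bases / mode, mapped_bases, mode


hist = {0: 50, 9: 100, 10: 300, 11: 100}
size, nbm, mode = estimate_genome_size(hist)
print(f"N_bm={nbm} mode={mode} estimate={size:.0f}")
# prints: N_bm=5000 mode=10 estimate=500
```

In this toy histogram the 50 uncovered positions do not influence the result, which mirrors the ">0" restriction on the mode.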
The tools samtools, bwa and/or minimap2 need to be in your $PATH. The tools qualimap, multiqc, bedtools and Rscript are optional but needed to create the mapping quality report, the coverage histogram as well as the genome size estimate, and the plot of the coverage distribution, respectively.
The script needs the following Perl modules and will search for executables in your $PATH:
- Number::FormatEng
- Parallel::Loops
- samtools
Short read mapping:
- bwa (mem)
Long read mapping:
- minimap2
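Before running the script, it can help to confirm that the tools listed above are on your $PATH. The following Python sketch is an illustration only, not part of the script: the function name `check_tools` and the grouping into required, mapper and optional tools are assumptions based on the description above (samtools required, at least one of bwa or minimap2, the rest optional).

```python
import shutil

REQUIRED = ["samtools"]
MAPPERS = ["bwa", "minimap2"]          # assumption: at least one is needed
OPTIONAL = ["qualimap", "multiqc", "bedtools", "Rscript"]


def check_tools(which=shutil.which):
    """Return (problems, missing_optional) for the tools named above.

    `which` is injectable so the lookup can be replaced in tests.
    """
    names = REQUIRED + MAPPERS + OPTIONAL
    found = {name: which(name) is not None for name in names}
    problems = [f"missing required tool: {n}" for n in REQUIRED if not found[n]]
    if not any(found[n] for n in MAPPERS):
        problems.append("need at least one mapper: bwa or minimap2")
    missing_optional = [n for n in OPTIONAL if not found[n]]
    return problems, missing_optional


if __name__ == "__main__":
    problems, missing_optional = check_tools()
    for line in problems:
        print(line)
    if missing_optional:
        print("optional tools not found:", ", ".join(missing_optional))
```

The Perl modules are not covered by this check; they would have to be tested from Perl itself.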
Mandatory: [-a <assembly.fa> {-p <paired_1.fq>,<paired_2.fq> | -u <unpaired.fq> |
-pb <clr.fq> | -hifi <hifi.fq> | -ont <ont.fq>} | -b <mapping.bam>]
-a STR Assembly to which reads should be mapped, in fasta format
-p STR Two files with paired Illumina reads, comma-separated
-u STR Fastq file with unpaired Illumina reads
-pb STR Fasta or fastq file with PacBio CLR reads
-hifi STR Fasta or fastq file with PacBio HiFi reads
-ont STR Fasta or fastq file with Nanopore reads
-b STR Bam file to calculate coverage from
Skips read mapping
Overrides -nh
Technologies will be recognized correctly if filenames end with
.pb(.sort).bam, .hifi(.sort).bam or .ont(.sort).bam for PacBio CLR,
PacBio HiFi and Nanopore respectively. Otherwise they are assumed to
be from Illumina.
All mandatory options except -a can be specified multiple times
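The filename rule described for -b can be sketched as follows. This is a hypothetical Python illustration of the stated suffix convention, not the script's implementation; the function name `guess_technology` and the returned labels are assumptions.

```python
import re

# Suffix patterns taken from the help text: .pb(.sort).bam, .hifi(.sort).bam,
# .ont(.sort).bam. The optional ".sort" part is matched literally.
_SUFFIXES = [
    (re.compile(r"\.pb(\.sort)?\.bam$"), "PacBio CLR"),
    (re.compile(r"\.hifi(\.sort)?\.bam$"), "PacBio HiFi"),
    (re.compile(r"\.ont(\.sort)?\.bam$"), "Nanopore"),
]


def guess_technology(bam_path):
    """Guess the sequencing technology from a bam filename suffix."""
    for pattern, tech in _SUFFIXES:
        if pattern.search(bam_path):
            return tech
    # Anything else is assumed to be Illumina, as stated in the help text.
    return "Illumina"


for name in ["asm.pb.bam", "asm.hifi.sort.bam", "asm.ont.bam", "asm.bam"]:
    print(name, "->", guess_technology(name))
```

Under this convention, a file such as `asm.pb.sorted.bam` would fall through to Illumina, so renaming bam files to one of the listed suffixes is the safer choice.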
Options: [default]
-o STR Output directory [.]
Will be created if it does not exist
-t INT Number of processes executed in parallel [1]
Affects bwa mem, samtools sort/index/view/stats, qualimap bamqc
-pre STR Prefix of output files if -a is used [filename of -a]
-sort Sort the bam file(s) (-b) [off]
-nq Do not run qualimap bamqc [off]
-nh Do not create coverage histogram [off]
Implies -ne
-ne Do not estimate genome size [off]
-kt Keep temporary bam files [off]
-bo STR Options passed to bwa [-a -c 10000]
-mo STR Options passed to minimap2 [CLR: -H -x map-pb; HiFi: -x asm20 (minimap2 <= 2.18)
or -x map-hifi (minimap2 > 2.18); ONT: -x map-ont]
-qo STR Options passed to qualimap [none]
Pass options with quotes e.g. -bo "<options>"
-v Print executed commands to STDERR [off]
-dry-run Only print commands to STDERR instead of executing [off]
-h or -help Print this help and exit
-version Print version number and exit
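The version-dependent defaults listed for -mo can be sketched in Python. This is an illustration under assumptions, not the script's code: the names `parse_version` and `default_minimap2_options`, the technology keys `clr`, `hifi` and `ont`, and the version string format (a leading `major.minor`, for example `2.24-r1122`) are all hypothetical.

```python
import re


def parse_version(text):
    """Extract (major, minor) from a version string such as '2.24-r1122'."""
    match = re.match(r"(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def default_minimap2_options(tech, version):
    """Return the default minimap2 options from the help text for one technology."""
    if tech == "clr":
        return ["-H", "-x", "map-pb"]
    if tech == "ont":
        return ["-x", "map-ont"]
    if tech == "hifi":
        # map-hifi is used only for versions newer than 2.18; otherwise asm20.
        if parse_version(version) > (2, 18):
            return ["-x", "map-hifi"]
        return ["-x", "asm20"]
    raise ValueError(f"unknown technology: {tech!r}")


print(default_minimap2_options("hifi", "2.18-r1015"))  # ['-x', 'asm20']
print(default_minimap2_options("hifi", "2.24-r1122"))  # ['-x', 'map-hifi']
```

Comparing versions as integer tuples rather than floats avoids treating a version such as 2.9 as newer than 2.18.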
Pfenninger M, Schönenbeck P & Schell T (2021). ModEst: Accurate estimation of genome size from next generation sequencing data. Molecular Ecology Resources, 00, 1–11.
Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for Radix auricularia (Gastropoda, Mollusca). Genome Biology and Evolution, 9(3):585–592.
