Hadrien Gourlé edited this page Jul 21, 2017 · 2 revisions

Choosing k

MEGAHIT uses a multiple k-mer strategy. The minimum k, the maximum k and the iteration step can be set with the options --k-min, --k-max and --k-step respectively. Every k must be an odd number, while the step must be an even number.

  • for ultra-complex metagenomics data such as soil, a larger --k-min, say 27, is recommended to reduce the complexity of the de Bruijn graph. Quality trimming is also recommended
  • for high-depth generic data, a large --k-min (25 to 31) is recommended
  • a smaller --k-step, say 10, is friendlier to low-coverage datasets
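
The parity rule above (odd k, even step) can be checked before launching a run. The following is a minimal sketch, not MEGAHIT's own code: the function name k_values is hypothetical, and the sketch assumes the k values are simply k-min, k-min + step, and so on, stopping at the last value that does not exceed k-max. How MEGAHIT itself treats a k-max that the step does not land on is not covered here.

```python
def k_values(k_min, k_max, k_step):
    """Enumerate candidate k values, enforcing the parity rules.

    Hypothetical helper for illustration; not part of MEGAHIT.
    """
    if k_min % 2 == 0 or k_max % 2 == 0:
        raise ValueError("k-min and k-max must be odd")
    if k_step <= 0 or k_step % 2 != 0:
        raise ValueError("k-step must be a positive even number")
    if k_min > k_max:
        raise ValueError("k-min must not exceed k-max")
    # Odd start plus an even step keeps every k odd.
    return list(range(k_min, k_max + 1, k_step))


print(k_values(27, 87, 10))  # [27, 37, 47, 57, 67, 77, 87]
```

Because an odd number plus an even number is always odd, checking k-min and the step is enough to guarantee that every intermediate k is odd as well.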

Filtering (kmin+1)-mers

(kmin+1)-mers with multiplicity lower than d (default 2, set with the --min-count option) are discarded. Be cautious about setting d below 2, which leads to a much larger and noisier graph. We recommend the default value of 2 for metagenomics assembly. If you use MEGAHIT for generic assemblies, adjust this value according to the sequencing depth (--min-count 3 is recommended for >40x).
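
To make the filtering rule concrete, here is a toy sketch of discarding (kmin+1)-mers whose multiplicity is below d. It only illustrates the idea on in-memory strings and is not how MEGAHIT implements it; the function name filter_kmers is hypothetical, and the sketch assumes plain forward-strand substrings without reverse-complement handling.

```python
from collections import Counter


def filter_kmers(reads, k_min, min_count=2):
    """Keep (k_min+1)-mers seen at least min_count times.

    Toy illustration only; reverse complements are ignored.
    """
    size = k_min + 1
    # Count every substring of length k_min + 1 across all reads.
    counts = Counter(
        read[i:i + size]
        for read in reads
        for i in range(len(read) - size + 1)
    )
    # Multiplicity lower than min_count means the k-mer is discarded.
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


reads = ["ACGTAC", "ACGTAG"]
print(filter_kmers(reads, k_min=3))  # {'ACGT': 2, 'CGTA': 2}
```

With min_count=1 nothing is discarded, so the two singleton 4-mers (GTAC and GTAG) stay in the result, which shows on a small scale why a threshold below 2 yields a larger, noisier graph.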

Mercy k-mer

Mercy k-mers are specially designed for metagenomics assembly to recover low-coverage sequences. For generic datasets with a depth of >= 30x, MEGAHIT may generate better results with the --no-mercy option.

k-min 1pass mode

This mode is activated with the --kmin-1pass option. It is more memory-efficient for ultra-low-depth datasets, such as soil metagenomics data.
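
The options discussed on this page can be combined into one argument list when a run is scripted. The sketch below is only an illustration: the helper name megahit_k_options is hypothetical, it emits only the flags named on this page, and the input and output arguments that a real command line also needs are deliberately left out.

```python
def megahit_k_options(k_min=None, k_max=None, k_step=None,
                      min_count=None, no_mercy=False, kmin_1pass=False):
    """Build the k-related part of a MEGAHIT command line.

    Hypothetical helper; input/output arguments are not handled here.
    """
    args = []
    valued = (
        ("--k-min", k_min),
        ("--k-max", k_max),
        ("--k-step", k_step),
        ("--min-count", min_count),
    )
    for flag, value in valued:
        if value is not None:  # leave unset options at their defaults
            args += [flag, str(value)]
    if no_mercy:
        args.append("--no-mercy")
    if kmin_1pass:
        args.append("--kmin-1pass")
    return args


# Soil metagenome: larger k-min, 1-pass mode.
print(megahit_k_options(k_min=27, kmin_1pass=True))
# Generic data above 40x: higher min-count, no mercy k-mers.
print(megahit_k_options(min_count=3, no_mercy=True))
```

Returning a list rather than a single string keeps each option as a separate argument, so it can be appended to the rest of a command without shell quoting issues.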