Setting parameters appropriately

Colin Davenport edited this page Aug 15, 2022 · 5 revisions

How to set parameters

This page explains how to get the most out of nf_wochenende and its output. Because the tool is very flexible, the right settings depend on your data.

Wochenende can analyze long reads and short reads against many different metagenomes. These data types have very different characteristics:

Long reads - mostly great to good mapping quality due to length, 1-10% error rate (mainly homopolymer indels), widely divergent read lengths from about 100 bp to more than 1 million bp

Short reads - good to very poor mapping quality depending on the repetitiveness of the genomic region, 0-1% error rate (mainly SNPs), fairly uniform read lengths (typically about 50-300 bp, depending on the sequencing platform)

Wochenende has been extensively tested on short reads, but much less so on long reads.

Parameters which most strongly affect results

Mapping quality: try checking your results with and without a mapping quality filter (e.g. MQ30).

  • To increase specificity (the bacteria reported are really in the sample), use an MQ filter.
  • To increase sensitivity (bacteria which are really in the sample are detected, in the proportions in which they were present), do not use an MQ filter. Some false positives may occur.
  • MQ20 and MQ30 results tend to be very similar for the metagenomes we have analyzed.
  • For long reads, a mapping quality filter should not make much difference, but it might exclude a few false positive hits.
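To make the effect of an MQ filter concrete, here is a minimal Python sketch that applies a MAPQ threshold to SAM-format lines. It is an illustration only, not how nf_wochenende implements its filter: the function names and the example reads are hypothetical, and the only assumption taken from the SAM format is that mapping quality sits in the fifth tab-separated column.

```python
def passes_mapq(sam_line, min_mapq=30):
    """Return True if an alignment line has MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    # SAM column 5 (index 4) holds the mapping quality.
    return int(fields[4]) >= min_mapq


def filter_by_mapq(sam_lines, min_mapq=30):
    """Keep header lines (starting with '@') and alignments that pass the filter."""
    return [
        line for line in sam_lines
        if line.startswith("@") or passes_mapq(line, min_mapq)
    ]


# Hypothetical example data: one confident and one ambiguous alignment.
example = [
    "@SQ\tSN:genomeA\tLN:5000",
    "read1\t0\tgenomeA\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tgenomeA\t200\t3\t50M\t*\t0\t0\t*\t*",
]

kept = filter_by_mapq(example, min_mapq=30)   # specificity: drops read2
unfiltered = filter_by_mapq(example, min_mapq=0)  # sensitivity: keeps both reads
```

Comparing `kept` with `unfiltered` mirrors the advice above: run the analysis both ways and see which species appear only without the filter.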

Number of mismatches

  • Reads with too many mismatches are removed (filtered out). Otherwise, an accumulation of poorly matching reads leads to false positive bacterial assignments.
  • The optimal mismatch threshold is different for long and short reads.
  • Try adjusting this parameter if too many reads are being filtered out.
  • 3-5 might be appropriate for short reads.
  • 5000-10000 might be appropriate for long reads (depending on your library - please test these settings to avoid excluding reads).
  • Both long and short reads are sensitive to this parameter.
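The following Python sketch shows why one threshold cannot serve both data types. It assumes the aligner writes the standard SAM `NM:i:` tag (edit distance, which counts mismatches and indels together) and uses that as a stand-in for the mismatch count; whether nf_wochenende counts mismatches in exactly this way is an assumption. The function names and example reads are hypothetical.

```python
def edit_distance(sam_line):
    """Return the value of the NM:i: tag, or None if the tag is absent."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NM:i:"):
            return int(field[5:])
    return None


def filter_by_mismatches(sam_lines, max_mismatches):
    """Keep headers and alignments with NM <= max_mismatches."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)
            continue
        nm = edit_distance(line)
        # Design choice for this sketch: reads without an NM tag are dropped.
        if nm is not None and nm <= max_mismatches:
            kept.append(line)
    return kept


# Hypothetical alignments: two short reads and one long, error-rich read.
reads = [
    "short_ok\t0\tgenomeA\t100\t60\t100M\t*\t0\t0\t*\t*\tNM:i:2",
    "short_bad\t0\tgenomeA\t300\t60\t100M\t*\t0\t0\t*\t*\tNM:i:7",
    "long_read\t0\tgenomeA\t500\t60\t20000M\t*\t0\t0\t*\t*\tNM:i:800",
]

short_setting = filter_by_mismatches(reads, max_mismatches=3)
long_setting = filter_by_mismatches(reads, max_mismatches=5000)
short_names = [line.split("\t")[0] for line in short_setting]
long_names = [line.split("\t")[0] for line in long_setting]
```

With the short-read setting the long read is discarded even though 800 differences over 20,000 bases is only a 4% error rate, which is within the range expected for long reads.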

Read mapping tool

  • Use minimap2long for long reads
  • We advise bwa mem for short reads (it is the best tested and performed best in our tests of mock communities). To our knowledge, minimap2short produces very different output for short reads, with many secondary mappings.

PCR duplicate reads

  • We advise removing PCR duplicates for all counting applications, such as metagenomics.
  • Removing duplicates is advisable for many genomics applications as well.
  • If duplicates are not removed, stacks of identical reads at identical genomic positions, arising from PCR amplification, are each counted separately. This can lead to false positive species identifications.
  • Long reads are unlikely to have many duplicates, so are insensitive to this parameter.
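The effect of duplicate stacks on read counts can be sketched in a few lines of Python. This is a simplified illustration, not the duplicate-removal method used by the pipeline: it treats reads sharing the same reference, start position and strand as duplicates, whereas real duplicate-marking tools apply additional criteria. All names and data below are hypothetical.

```python
from collections import Counter


def remove_duplicates(alignments):
    """Keep the first read seen for each (reference, position, strand)."""
    seen = set()
    unique = []
    for ref, pos, strand, name in alignments:
        key = (ref, pos, strand)
        if key not in seen:
            seen.add(key)
            unique.append((ref, pos, strand, name))
    return unique


def reads_per_reference(alignments):
    """Count reads assigned to each reference genome."""
    return Counter(ref for ref, _pos, _strand, _name in alignments)


# Hypothetical data: genomeB is supported only by a stack of four
# identical reads, genomeA by two reads at distinct positions.
alignments = [
    ("genomeA", 100, "+", "r1"),
    ("genomeA", 950, "-", "r2"),
    ("genomeB", 500, "+", "r3"),
    ("genomeB", 500, "+", "r4"),
    ("genomeB", 500, "+", "r5"),
    ("genomeB", 500, "+", "r6"),
]

before = reads_per_reference(alignments)
after = reads_per_reference(remove_duplicates(alignments))
```

Before duplicate removal genomeB appears to have twice the support of genomeA; afterwards it is backed by a single read, which is the kind of signal that would otherwise be mistaken for a real species.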