Setting parameters appropriately

How to set parameters

This page is about getting the most out of nf_wochenende and its output, since the tool is very flexible.

Wochenende can analyze long reads and short reads against many different metagenomes. These data types have very different characteristics:

Long reads - mostly great to good mapping quality due to length, 1-10% error rate (mainly homopolymer indels), widely divergent read lengths from about 100 bp to more than 1 million bp

Short reads - good to very poor mapping quality depending on repetitiveness of genomic region, 0-1% error rate (mainly SNPs), short and fairly uniform read lengths (typically tens to a few hundred bp)

Wochenende has been extensively tested on short reads, but much less on long reads.

Parameters which most strongly affect results

Mapping quality: try comparing your results with and without a mapping quality (e.g. MQ30) filter.

To increase specificity (resulting bacteria are really in the sample), use an MQ filter.
To increase sensitivity (bacteria which are really in the sample are detected, in the proportions in which they were present), do not use an MQ filter. Some false positives may occur.
MQ20 and MQ30 results tend to be very similar for the metagenomes we have analyzed.
For long reads, using a mapping quality filter shouldn't make much difference, but it might exclude a few false positive hits.
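
To make the effect of an MQ filter concrete, here is a minimal Python sketch that keeps only alignments at or above a mapping quality cutoff. It is not the code nf_wochenende runs; it only assumes SAM-formatted text input, where MAPQ is the fifth tab-separated column. The function name `filter_by_mapq` and the example records are hypothetical.

```python
# Illustrative sketch only, not the nf_wochenende implementation.
def filter_by_mapq(sam_lines, min_mapq=30):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ and are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line


# Hypothetical example records: read1 maps confidently, read2 does not.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t10\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t20\t3\t50M\t*\t0\t0\t*\t*",
]

kept_mq30 = list(filter_by_mapq(example, min_mapq=30))
kept_unfiltered = list(filter_by_mapq(example, min_mapq=0))
```

Comparing `kept_mq30` with `kept_unfiltered` mirrors the advice above: the filtered set favours specificity, the unfiltered set favours sensitivity.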

Number of mismatches

Reads with too many mismatches are removed (filtered out). Otherwise the accumulation of poorly matching reads leads to false positive bacterial assignments.
The optimal maximum number of mismatches is (obviously) different for long and short reads.
Try adjusting this parameter if too many reads are being filtered out.
3-5 might be appropriate for short reads.
5000-10000 might be appropriate for long reads (depending on your library - please test these settings to avoid excluding reads).
Both long and short reads are sensitive to this parameter.
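
The mismatch filter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's own code: it assumes SAM text input and uses the standard `NM:i:` optional tag (edit distance to the reference, which counts indels as well as substitutions) as the mismatch count. Whether nf_wochenende uses exactly this tag is an assumption here, and the helper names are hypothetical.

```python
# Illustrative sketch only, not the nf_wochenende implementation.
def edit_distance(sam_line):
    """Return the NM tag value of a SAM alignment line, or None if absent."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NM:i:"):
            return int(field[len("NM:i:"):])
    return None


def filter_by_mismatches(sam_lines, max_nm):
    """Keep header lines and alignments whose NM value is <= max_nm."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)
            continue
        nm = edit_distance(line)
        # Alignments without an NM tag are dropped in this sketch.
        if nm is not None and nm <= max_nm:
            kept.append(line)
    return kept


# Hypothetical short-read records with 2 and 7 mismatches.
records = [
    "read1\t0\tchr1\t10\t60\t50M\t*\t0\t0\t*\t*\tNM:i:2",
    "read2\t0\tchr1\t20\t60\t50M\t*\t0\t0\t*\t*\tNM:i:7",
]

kept_strict = filter_by_mismatches(records, max_nm=3)
kept_loose = filter_by_mismatches(records, max_nm=10)
```

Running the same input with several `max_nm` values and comparing how many reads survive is one way to check that a threshold is not excluding too many reads.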

Read mapping tool

Use minimap2long for long reads.
We advise bwa mem for short reads (best tested, and it performed best in tests of mock communities). To our knowledge, minimap2short produces very different output for short reads, with many secondary mappings.

PCR duplicate reads

We advise removing PCR duplicates for all counting applications, such as in metagenomics.
Removing duplicates is advisable for many genomics applications as well.
If duplicates are not removed, stacks of identical reads at identical genomic positions arising from PCR amplification are counted separately. This leads to false positive species identifications.
Long reads are unlikely to contain many duplicates, so they are largely insensitive to this parameter.
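
The effect of duplicates on counts can be shown with a small Python sketch. It is a simplification, not the duplicate-marking method the pipeline uses: reads are treated as duplicates when they share reference, start position and strand. The function name, species labels and positions below are hypothetical example data.

```python
from collections import Counter


# Illustrative sketch only: duplicates are defined here as reads sharing
# the same (reference, position, strand), which is a simplification.
def count_reads_per_reference(alignments, remove_duplicates=True):
    """Count reads per reference from (reference, position, strand) tuples."""
    if remove_duplicates:
        # A set collapses each stack of identical alignments to one entry.
        alignments = set(alignments)
    return Counter(ref for ref, _pos, _strand in alignments)


# Hypothetical data: three identical reads stacked at E_coli position 100.
alignments = [
    ("E_coli", 100, "+"),
    ("E_coli", 100, "+"),
    ("E_coli", 100, "+"),
    ("E_coli", 250, "-"),
    ("S_aureus", 40, "+"),
]

with_dups = count_reads_per_reference(alignments, remove_duplicates=False)
without_dups = count_reads_per_reference(alignments, remove_duplicates=True)
```

In this toy input, the stack of identical reads doubles the apparent E_coli count when duplicates are kept, which is the kind of inflation that can push a species over a detection threshold.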
