Setting parameters appropriately
Colin Davenport edited this page Aug 15, 2022 · 5 revisions
This page is about getting the most out of nf_wochenende and its output, since the tool is very flexible.
Wochenende can analyze long reads and short reads against many different metagenomes. These data types have very different characteristics:
Long reads
- mapping quality is mostly good to excellent because of read length, 1-10% error rate (mainly homopolymer indels), widely divergent read lengths from 100 bp to more than 1 million bp
Short reads
- mapping quality ranges from good to very poor depending on how repetitive the genomic region is, 0-1% error rate (mainly SNPs), short and fairly uniform read lengths (typically up to a few hundred bp)
Wochenende has been extensively tested on short reads, but not so much on long reads.
- To increase specificity (the bacteria reported are really in the sample), use a mapping quality (MQ) filter.
- To increase sensitivity (bacteria which are really in the sample are detected, in the proportions in which they were present), do not use an MQ filter. Some false positives may occur.
- MQ20 and MQ30 results tend to be very similar for the metagenomes we have analyzed.
- For long reads, a mapping quality filter should not make much difference, but it might exclude a few false positive hits.
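The effect of an MQ threshold can be illustrated with a minimal sketch. It assumes alignments are available as plain SAM text lines, where the fifth tab-separated column is MAPQ as defined by the SAM format. The function names `passes_mapq` and `filter_by_mapq` are hypothetical and only illustrate the idea; this is not how Wochenende itself implements the filter.

```python
def passes_mapq(sam_line, min_mapq=20):
    """Return True if a SAM alignment line has MAPQ >= min_mapq.

    MAPQ is the 5th tab-separated column of a SAM alignment line.
    """
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq


def filter_by_mapq(sam_lines, min_mapq=20):
    """Keep header lines and alignments that reach the MAPQ threshold."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ and are always kept.
            kept.append(line)
        elif passes_mapq(line, min_mapq):
            kept.append(line)
    return kept
```

Calling `filter_by_mapq(lines, 20)` and `filter_by_mapq(lines, 30)` on the same input and comparing the number of surviving reads per reference is one way to check how similar MQ20 and MQ30 are for your own data.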
- Reads with too many mismatches are removed (filtered out). Otherwise, the accumulation of poor reads leads to false positive bacterial assignments.
- The optimal number of mismatches to filter out is (obviously) different for long and short reads.
- Try adjusting this parameter if too many reads are being filtered out.
- 3-5 mismatches might be appropriate for short reads
- 5000-10000 mismatches for long reads (depending on your library; please test these settings to avoid excluding reads)
- Both long and short reads are sensitive to this parameter.
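As a rough illustration of a mismatch filter, the sketch below assumes each SAM alignment line carries the optional `NM:i:` tag, which the SAM specification defines as the edit distance to the reference (mismatches plus inserted and deleted bases). The helper names `edit_distance` and `keep_read` are hypothetical, and Wochenende's own filter may count mismatches differently.

```python
def edit_distance(sam_line):
    """Return the NM tag value of a SAM alignment line, or None if absent.

    Optional tags start at the 12th tab-separated column.
    """
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NM:i:"):
            return int(field[len("NM:i:"):])
    return None


def keep_read(sam_line, max_mismatches):
    """Keep a read if its edit distance does not exceed max_mismatches.

    Assumption: reads without an NM tag cannot be judged and are kept.
    """
    nm = edit_distance(sam_line)
    return nm is None or nm <= max_mismatches
```

Counting how many reads `keep_read` rejects at several thresholds (for example 3, 4 and 5 for short reads) gives a quick view of how sensitive a library is to this parameter before settling on a value.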
- Use minimap2long for long reads
- We advise bwa mem for short reads (best tested, and it performed best in tests of mock communities). To our knowledge, minimap2short produces very different output for short reads, with many secondary mappings.
- We advise removing PCR duplicates for all counting applications, such as metagenomics.
- Removing duplicates is advisable for many genomics applications as well.
- If duplicates are not removed, stacks of identical reads at identical genomic positions, which arise during PCR amplification, are each counted separately. This can lead to false positive species identifications.
- Long reads are unlikely to have many duplicates, so they are largely insensitive to this parameter.
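To see why duplicate stacks inflate counts, here is a deliberately simplified sketch. It assumes plain SAM text lines and treats two alignments as duplicates when they share reference name, start position and strand (bit 0x10 of the SAM FLAG). Dedicated duplicate-marking tools use more careful criteria, and the function name `count_per_reference` is hypothetical.

```python
from collections import Counter


def count_per_reference(sam_lines, remove_duplicates):
    """Count alignments per reference, optionally collapsing duplicates.

    Simplifying assumption: a duplicate is any alignment with the same
    reference name, start position and strand as one already seen.
    """
    seen = set()
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        key = (rname, pos, bool(flag & 0x10))
        if remove_duplicates:
            if key in seen:
                continue  # same position and strand: count only once
            seen.add(key)
        counts[rname] += 1
    return counts
```

Comparing the counts with `remove_duplicates=False` and `remove_duplicates=True` shows how a single stack of identical reads can make a reference look far better supported than it is.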