2024.09 #3433
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -71,17 +71,22 @@ References: | |
Shotgun sequencing | ||
------------------ | ||
|
||
Qiita currently has one active shotgun metagenomics data analysis pipeline: a per sample | ||
Qiita currently has one active shotgun metagenomics data analysis pipeline: a per sample, paired-end | ||
bowtie2 alignment step with Woltka classification using either the WoLr2 (default) or RS210 databases. | ||
Below you will find more information about each of these options. | ||
|
||
.. note:: | ||
The bowtie2 settings are maximum and minimum mismatch penalties (mp=[1,1]), a | ||
penalty for ambiguities (np=1; default), read and reference gap open- and | ||
The bowtie2 settings are set for interleaved processing with maximum and minimum mismatch | ||
penalties (mp=[1,1]), a penalty for ambiguities (np=1; default), read and reference gap open- and | ||
Reviewer comment: "reference gap open- and" seems odd. Perhaps it should be "open-and"? or maybe the hyphen should be removed? |
||
extend penalties (rdg=[0,1], rfg=[0,1]), a minimum alignment score for an | ||
alignment to be considered valid (score-min=[L,0,-0.05]), a defined number of | ||
distinct, valid alignments (k=16), and the suppression of SAM records for | ||
unaligned reads, as well as SAM headers (no-unal, no-hd). | ||
unaligned reads, as well as SAM headers (no-unal, no-hd), and skipping the exact and 1-mismatch end-to-end | ||
alignment attempts that normally precede the multiseed heuristic (no-exact-upfront, no-1mm-upfront). For more information, visit: | ||
|
||
.. toctree:: | ||
|
||
woltka_pairedend.rst | ||
|
||
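As a rough illustration of the note above, the listed settings can be written out as a Bowtie2 argument list. This is only a sketch: the function name `qiita_bowtie2_flags` is hypothetical, and how Qiita actually assembles its command (including input, index, and interleaving options) is not shown in this document.

```python
def qiita_bowtie2_flags():
    """Return the Bowtie2 flags named in the note as an argument list.

    Hypothetical helper for illustration; input, index and output
    options are deliberately left out because the note does not give them.
    """
    return [
        "--mp", "1,1",                # maximum and minimum mismatch penalties
        "--np", "1",                  # penalty for ambiguous characters
        "--rdg", "0,1",               # read gap open and extend penalties
        "--rfg", "0,1",               # reference gap open and extend penalties
        "--score-min", "L,0,-0.05",   # minimum score for a valid alignment
        "-k", "16",                   # report up to 16 distinct, valid alignments
        "--no-unal",                  # suppress SAM records for unaligned reads
        "--no-hd",                    # suppress SAM header lines
        "--no-exact-upfront",         # skip the exact end-to-end attempt
        "--no-1mm-upfront",           # skip the 1-mismatch end-to-end attempt
    ]


print(" ".join(["bowtie2"] + qiita_bowtie2_flags()))
```

Keeping the flags in a list rather than a single string makes it straightforward to pass them to `subprocess.run` without shell quoting issues.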
The current workflow is as follows: | ||
|
||
|
@@ -110,10 +115,9 @@ For more information about the versions in this plugin, visit: | |
|
||
qp-fastp-minimap2.rst | ||
|
||
Note that the command produces up to 6 output artifacts based on the aligner and database selected: | ||
Note that the command produces up to 5 output artifacts based on the aligner and database selected: | ||
|
||
- Alignment Profile: contains the raw alignment file and the no rank classification BIOM table | ||
- Per genome Predictions: contains the per genome level predictions BIOM table | ||
- Per genome Predictions: contains the raw alignment file and the per genome level predictions BIOM table | ||
- Per gene Predictions: Only WoLr2, contains the per gene level predictions BIOM table | ||
- KEGG Pathways: Only WoLr2, contains the functional profile | ||
- KEGG Ontology (KO): Only WoLr2, contains the functional profile | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
Woltka and Bowtie2 using Read Pairing Schemes | ||
Reviewer comment: @qiyunzhu; could you take a look? Thank you. |
||
============================================= | ||
|
||
Benchmarks created by Qiyun Zhu (@qiyunzhu) on Aug 1, 2024. | ||
|
||
Summary | ||
------- | ||
|
||
I tested alternative read pairing schemes in the analysis of shotgun metagenomic sequencing data. Sequencing reads were aligned against a reference microbial genome database as unpaired or paired, with or without singleton and/or discordant alignments suppressed. A series of synthetic datasets was used in the analysis. | ||
Reviewer comment: Here I tested -> I tested |
||
|
||
The results reveal that treating reads as paired is consistently advantageous over treating them as unpaired. Suppressing singleton alignments further increases the accuracy of results, albeit at the cost of a lower mapping rate. Suppressing discordant alignments has no obvious impact on the result. Regardless of accuracy, the downstream community ecology analyses are not obviously impacted by the choice of parameters. | ||
Reviewer comment: despite at the cost -> despite the cost |
||
|
||
Therefore, I recommend the general adoption of paired alignments as a standard procedure. I also endorse suppressing singleton and discordant alignments, but further tests are warranted on whether this reduces sensitivity with complex communities. | ||
|
||
Alignment parameters | ||
-------------------- | ||
|
||
Sequencing data were aligned using Bowtie2 v2.5.1 in the “very sensitive” mode against the WoL2 database. They were treated as either unpaired or paired-end: | ||
Reviewer comment: Double quotes may be replaced with |
||
|
||
- SE: Reads are treated as unpaired (Bowtie2 input: -U merged.fq) | ||
- PE: Reads are treated as paired (Bowtie2 input: -1 fwd.fq, -2 rev.fq) | ||
|
||
Under the paired mode, certain flags were applied: | ||
|
||
- ND: Discordant alignments are suppressed (Bowtie2 flag: --no-discordant). For example, if a pair of alignments are not pointing toward each other or are too far apart from each other in the reference genome, then both alignments are discarded. See the Bowtie2 manual for a discussion. | ||
- NM: Singleton alignments are suppressed (Bowtie2 flag: --no-mixed). For example, if a read is aligned but its mate is not, then the alignment is discarded. | ||
- NDM: Both flags were applied. | ||
|
||
Extra parameters tested: | ||
- PE.NU: flags ``--no-exact-upfront --no-1mm-upfront``. | ||
|
||
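The schemes above can be summarized as a mapping from scheme label to Bowtie2 arguments. The sketch below is an illustration, not the benchmark's actual script: the function name `bowtie2_args` and the index name `wol2_index` are hypothetical, and PE.NU is assumed to be plain paired mode plus the two extra flags.

```python
# Extra Bowtie2 flags per scheme label, as described in the text.
# Assumption: PE.NU adds only its two flags on top of plain paired mode.
SCHEME_FLAGS = {
    "SE": [],
    "PE": [],
    "ND": ["--no-discordant"],
    "NM": ["--no-mixed"],
    "NDM": ["--no-discordant", "--no-mixed"],
    "PE.NU": ["--no-exact-upfront", "--no-1mm-upfront"],
}


def bowtie2_args(scheme, index="wol2_index", fwd="fwd.fq", rev="rev.fq",
                 merged="merged.fq"):
    """Build a Bowtie2 argument list for one pairing scheme.

    The index name is a placeholder; the file names follow the text.
    """
    if scheme not in SCHEME_FLAGS:
        raise ValueError(f"unknown scheme: {scheme}")
    args = ["bowtie2", "--very-sensitive", "-x", index]
    if scheme == "SE":
        args += ["-U", merged]            # reads treated as unpaired
    else:
        args += ["-1", fwd, "-2", rev]    # reads treated as paired
    return args + SCHEME_FLAGS[scheme]


for name in SCHEME_FLAGS:
    print(name, " ".join(bowtie2_args(name)))
```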
Resulting alignment files (SAM format) were processed by Woltka v0.1.6 using default parameters to generate OGU tables. | ||
|
||
Synthetic data | ||
-------------- | ||
|
||
Five synthetic datasets were generated, each with 25 samples consisting of randomly selected WoL2 genomes. CAMISIM was executed to simulate 500 Mbp of 150 bp paired-end Illumina sequencing reads (approx. 3.3 million reads) per sample. The five datasets have different taxon counts and distribution patterns. The result for one of the five datasets is displayed below. It consists of 400 taxa (more than the others) and is therefore presumably the most realistic. However, all five results largely shared the same pattern. | ||
|
||
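The read count quoted above follows directly from the simulated volume and read length. A quick check, assuming 1 Mbp = 10^6 bp and that forward and reverse reads are counted individually (the helper name `reads_per_sample` is hypothetical):

```python
def reads_per_sample(total_bp, read_len):
    """Number of reads needed to reach total_bp at a fixed read length."""
    return total_bp / read_len


n_reads = reads_per_sample(500_000_000, 150)
print(f"{n_reads / 1e6:.2f} million reads")      # individual reads
print(f"{n_reads / 2 / 1e6:.2f} million pairs")  # read pairs
```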
The results of the five Bowtie2 parameter sets were compared using nine metrics: | ||
|
||
Three metrics that rely only on each result: | ||
|
||
- Mapping rate (%) | ||
- Number of taxa | ||
- Entropy (i.e., Shannon index, but without subsampling) | ||
Reviewer comment: This line (37) can be removed. Also replace "three" with "two" in line 33. |
||
|
||
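The three result-only metrics can be computed from a single taxon profile. The following is a minimal sketch, assuming the profile is a dictionary of taxon to assigned read count, that the mapping rate is assigned reads over total input reads, and that entropy uses the natural logarithm; the benchmark's exact definitions are not given here, and `profile_summary` is a hypothetical name.

```python
import math


def profile_summary(counts, total_reads):
    """Mapping rate (%), number of taxa and Shannon entropy of one profile.

    counts: dict of taxon -> assigned read count (assumed input format).
    total_reads: number of input reads for the sample.
    """
    present = {t: c for t, c in counts.items() if c > 0}
    mapped = sum(present.values())
    entropy = 0.0
    for c in present.values():
        p = c / mapped
        entropy -= p * math.log(p)   # natural log; the base is an assumption
    return {
        "mapping_rate": 100.0 * mapped / total_reads,
        "n_taxa": len(present),
        "entropy": entropy,
    }


print(profile_summary({"G1": 50, "G2": 50, "G3": 0}, total_reads=200))
```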
Six metrics that rely on comparing each result against the ground truth (higher is better): | ||
|
||
- Presence/absence-based: | ||
  - Precision (fraction of discovered taxa that are true) | ||
  - Recall (sensitivity) (fraction of true taxa that are discovered) | ||
  - F1 score (combination of precision and recall) | ||
- Abundance-based: | ||
|
||
  - Spearman correlation coefficient | ||
  - Bray-Curtis similarity * | ||
  - Weighted UniFrac similarity * | ||
|
||
* Note: Bray-Curtis and weighted UniFrac similarities were calculated after subsampling to a constant sum of taxon frequencies per sample. | ||
|
||
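To make the ground-truth metrics concrete, here is a small sketch of the presence/absence scores and the Bray-Curtis similarity. It assumes F1 is the harmonic mean of precision and recall, that similarity is one minus the Bray-Curtis dissimilarity, and that the two profiles have already been subsampled to the same total. The function names are hypothetical; Spearman correlation and weighted UniFrac are omitted.

```python
def presence_absence_scores(observed, truth):
    """Precision, recall and F1 from sets of observed and true taxa."""
    observed, truth = set(observed), set(truth)
    true_pos = len(observed & truth)
    precision = true_pos / len(observed) if observed else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bray_curtis_similarity(a, b):
    """1 - Bray-Curtis dissimilarity between two taxon -> count dicts."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 1.0


print(presence_absence_scores({"A", "B", "C", "D"}, {"A", "B", "E"}))
print(bray_curtis_similarity({"A": 6, "B": 4}, {"A": 4, "B": 4, "C": 2}))
```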
.. figure:: woltka_synthetic.png | ||
:align: center | ||
|
||
|
||
The results revealed: | ||
|
||
#. PE outperforms SE in all metrics. Most importantly, it reduces the false positive rate (higher precision) while retaining the mapping rate. Meanwhile, the sensitivity (recall) of identifying true taxa is not obviously compromised (note the y-axis scale). | ||
#. Suppressing singleton alignments (no mixing; NM) further significantly reduces the false positive rate (higher precision), while not obviously reducing sensitivity (recall). However, the mapping rate is also significantly reduced. | ||
#. Suppressing discordant alignments (no discordance; ND) seems to have little to no effect on the outcome. In fact, most profiles are identical with or without ND (but a few are not). | ||
#. PE.NU: the two additional parameters had minimal effect on the result and made the alignment step faster. This may suggest that the additional parameters are safe to use. | ||
|
||
Therefore, I would recommend adopting paired alignment in preference to unpaired alignment. I would tentatively suggest no mixing because it improves accuracy, but the potential adverse effect of a lower mapping rate should be explored further before making a firm recommendation. Although it has no visible effect, no discordance may be added for logical consistency. | ||
Reviewer comment: Can remove "no mixing... no discordance". |
Reviewer comment: Should upgrade Woltka to v0.1.6. (!important!)