Fix citations and bibliography
michaelosthege committed Sep 29, 2024
1 parent 9304f26 commit 0600888
Showing 5 changed files with 22 additions and 15 deletions.
4 changes: 4 additions & 0 deletions docs/source/conf.py
@@ -37,6 +37,7 @@
"numpydoc",
"myst_nb",
"sphinx_book_theme",
+"sphinxcontrib.bibtex",
"sphinxcontrib.mermaid",
]
myst_enable_extensions = [
@@ -48,6 +49,9 @@
"tasklist",
]
nb_execution_mode = "off"
+bibtex_bibfiles = ["literature.bib"]
+bibtex_default_style = "plain"
+bibtex_reference_style = "super"

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
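
Taken together, the two hunks above amount to the following bibliography setup in `conf.py`. This is a consolidated sketch rather than the full file: only the extension entry and the three `bibtex_*` values come from the diff, while the comments are added explanation and all unrelated settings are omitted.

```python
# Excerpt of a Sphinx conf.py (sketch; unrelated settings omitted).
extensions = [
    # Provides the {cite} role and the {bibliography} directive.
    "sphinxcontrib.bibtex",
]

# BibTeX database file(s), resolved relative to the Sphinx source directory.
bibtex_bibfiles = ["literature.bib"]

# Style used to format the entries of the rendered bibliography.
bibtex_default_style = "plain"

# Style of in-text citations; "super" renders superscript numbers.
bibtex_reference_style = "super"
```

With this in place, a MyST markdown page cites an entry with the `{cite}` role and renders the reference list wherever a `{bibliography}` directive appears, which is what the remaining files in this commit switch to.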
@@ -1,4 +1,4 @@
-@softmisc{nutpie,
+@misc{nutpie,
author = {Seyboldt, Adrian and {PyMC Developers}},
keywords = {Software},
license = {MIT},
@@ -44,7 +44,7 @@ @article{matplotlib
year = 2007
}

-@softmisc{matplotlibzenodo,
+@misc{matplotlibzenodo,
author = {{The Matplotlib Development Team}},
title = {Matplotlib: Visualization with Python},
keywords = {software},
@@ -83,6 +83,7 @@ @book{RN162
author = {Kruschke, John K.},
title = {Doing Bayesian Data Analysis},
edition = {1st Edition},
+publisher={Academic Press},
isbn = {9780123814852},
year = {2010},
type = {Book}
5 changes: 4 additions & 1 deletion docs/source/markdown/PeakPerformance_validation.md
@@ -4,7 +4,7 @@
Several stages of validation were employed to prove the suitability of `PeakPerformance` for chromatographic peak data analysis.
The goals were to showcase the efficacy of `PeakPerformance` utilizing noisy synthetic data, to investigate cases where a peak could reasonably be fit with either of the single peak models, and finally to use experimental data to compare results obtained with `PeakPerformance` to those from the commercial vendor software Sciex MultiQuant.

-For the first test, 500 random data sets were generated with the NumPy random module [@harris2020array] by drawing from the normal distributions detailed in Table 1 except for the mean parameter which was held constant at a value of 6.
+For the first test, 500 random data sets were generated with the NumPy random module {cite}`harris2020array` by drawing from the normal distributions detailed in Table 1 except for the mean parameter which was held constant at a value of 6.
Subsequently, normally distributed random noise ($\mathcal{N}(0, 0.6)$ or $\mathcal{N}(0, 1.2)$ for data sets with the tag "higher noise") was added to each data point.
The amount of data points per time was chosen based on an LC-MS/MS method routinely utilized by the authors and accordingly set to one data point per 1.8 s.

@@ -75,3 +75,6 @@ By showing not only the mean area ratio of all peaks but also the ones for the s
In case of this data set, two low quality double peaks in particular inflated the variance significantly which may not be representative for other data sets.
It has to be stated, too, that the prevalence of manual re-integration of double peaks in MQ might have introduced a user-specific bias, thereby increasing the final variance.
Nevertheless, it could be shown that `PeakPerformance` yields comparable peak area results to a commercially available vendor software.
+```{bibliography}
+```
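
The synthetic data generation described in the first hunk of this file can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the distributions for the peak's standard deviation and height are placeholders for the Table 1 values (not shown in this diff), the time axis is assumed to be in minutes, the second argument of $\mathcal{N}(0, 0.6)$ is read as a standard deviation, and `make_dataset` is a hypothetical helper. Only the fixed mean of 6, the noise scales 0.6 and 1.2, the 1.8 s sampling interval, and the count of 500 data sets come from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

DT_MIN = 1.8 / 60.0  # one data point per 1.8 s, in minutes (assumed unit)
MEAN = 6.0           # peak mean, held constant as stated in the text


def make_dataset(rng, noise_sd=0.6, window=(4.0, 8.0)):
    """Return (time, intensity) for one noisy normal-shaped peak.

    The distributions for `std` and `height` are placeholders, NOT the
    Table 1 values; the time window is likewise an assumption.
    """
    time = np.arange(window[0], window[1], DT_MIN)
    std = abs(rng.normal(0.5, 0.1))      # placeholder distribution
    height = abs(rng.normal(15.0, 3.0))  # placeholder distribution
    signal = height * np.exp(-0.5 * ((time - MEAN) / std) ** 2)
    # Normally distributed noise added to every data point.
    noise = rng.normal(0.0, noise_sd, size=time.size)
    return time, signal + noise


# 500 data sets at the lower noise level.
datasets = [make_dataset(rng) for _ in range(500)]
# One example of a "higher noise" data set.
higher_noise = make_dataset(rng, noise_sd=1.2)
```

Passing a seeded `Generator` into the helper keeps the whole batch reproducible, which matters when the same synthetic data is later used to compare models.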
16 changes: 6 additions & 10 deletions docs/source/markdown/PeakPerformance_workflow.md
@@ -1,8 +1,3 @@
----
-bibliography:
-- literature.bib
----
-
# PeakPerformance workflow
`PeakPerformance` accommodates the use of a pre-manufactured data pipeline for standard applications as well as the creation of custom data pipelines using only its core functions.
The provided data analysis pipeline was designed in a user-friendly way and requires minimal programming knowledge ([Fig. 1](#fig_w1)).
@@ -35,7 +30,7 @@ If e.g. a standard mixture containing all targets was measured, this would be co
An additional feature lets the user exclude specific model types to save computation time and improve the accuracy of model selection by for example excluding double peak models when a single peak was observed.
Upon provision of the required information, the automated model selection can be started using the `model_selection()` function from the pipeline module and will be performed successively for each mass trace.
Essentially, every type of model which has not been excluded by the user needs to be instantiated, sampled, and the log-likelihood needs to be calculated.
-Subsequently, the results for each model are ranked with the `compare()` function of the ArviZ package based on Pareto-smoothed importance sampling leave-one-out cross-validation (LOO-PIT) [@RN146; @RN145].
+Subsequently, the results for each model are ranked with the `compare()` function of the ArviZ package based on Pareto-smoothed importance sampling leave-one-out cross-validation (LOO-PIT) {cite}`RN146,RN145`.
This function returns a DataFrame showing the results of the models in order of their placement on the ranking which is decided by the expected log pointwise predictive density.
The best model for each mass trace is then written to the Excel template file.

@@ -44,21 +39,22 @@ The first step consists of parsing the information from the Excel sheet.
Since the pipeline, just like model selection, acts successively, a time series is read from its data file next and the information contained in the name of the file according to the naming convention is parsed.
All this information is combined in an instance of `PeakPerformance`'s `UserInput` class acting as a centralized source of data for the program.
Depending on whether the "pre-filtering" option was selected, an optional filtering step will be executed to reject signals where clearly no peak is present before sampling, thus saving computation time.
-This filtering step uses the `find_peaks()` function from the SciPy package [@scipy] which simply checks for data points directly neighboured by points with lower intensity values.
+This filtering step uses the `find_peaks()` function from the SciPy package {cite}`scipy` which simply checks for data points directly neighboured by points with lower intensity values.
If no data points within a certain range around the expected retention time of an analyte fulfill this most basic requirement of a peak, the signal is rejected.
Furthermore, if none of the candidate data points exceed a signal-to-noise ratio threshold defined by the user, the signal will also be discarded.
Depending on the origin of the samples, this step may reject a great many signals before sampling saving potentially hours of computation time across a batch run of the `PeakPerformance` pipeline.
For instance, in bioreactor cultivations, a product might be quantified but if it is only produced during the stationary growth phase, it will not show up in early samples.
Another pertinent example of such a use case are isotopic labeling experiments for which every theoretically achievable mass isotopomer needs to be investigated, yet depending on the input labeling mixture, the majority of them might not be present in actuality.
-Upon passing the first filter, a Markov chain Monte Carlo (MCMC) simulation is conducted using a No-U-Turn Sampler (NUTS) [@RN173], preferably - if installed in the Python environment - the nutpie sampler [@nutpie] due to its highly increased performance compared to the default sampler of PyMC.
+Upon passing the first filter, a Markov chain Monte Carlo (MCMC) simulation is conducted using a No-U-Turn Sampler (NUTS) {cite}`RN173`, preferably - if installed in the Python environment - the nutpie sampler {cite}`nutpie` due to its highly increased performance compared to the default sampler of PyMC.
Before sampling from the posterior distribution, a prior predictive check is performed the results of which can be accessed and evaluated after the fact.
When a posterior distribution has been obtained, the main filtering step is next in line.
-The first criterion is constituted by checking the convergence of the Markov chains towards a common solution for the posterior represented by the potential scale reduction factor [@RN152], also referred to as the $\hat{R}$ statistic or Gelman-Rubin diagnostic.
+The first criterion is constituted by checking the convergence of the Markov chains towards a common solution for the posterior represented by the potential scale reduction factor {cite}`RN152`, also referred to as the $\hat{R}$ statistic or Gelman-Rubin diagnostic.
If this factor is above 1.05 for any parameter, convergence was not reached and the sampling will be repeated once with a much higher number of tuning samples.
If the filter is not passed a second time, the pertaining signal is rejected.
Harnessing the advantages of the uncertainty quantification, a second criterion calculates the ratio of the resulting standard deviation of a peak parameter to its mean and discards signals exceeding a threshold.
Usually, false positives passing the first criterion are rather noisy signals where a fit was achieved but the uncertainty on the peak parameters is extremely high.
These signals will then be rejected by the second criterion, ultimately reducing the number of false positive peaks significantly if not eliminating them.
If a signal was accepted as a peak, the final simulation step is a posterior predictive check which is added to the inference data object resulting from the model simulation.

# Bibliography
+```{bibliography}
+```
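
The filtering steps described in the last hunk of this file can be sketched as below. This is a minimal illustration, not `PeakPerformance`'s implementation: the helper names `passes_prefilter`, `rhat`, and `uncertainty_ok` are hypothetical, the noise level is assumed to be supplied by the caller, and the classic (non-split) Gelman-Rubin formula is used for brevity where the package may rely on a library implementation. Only the use of `scipy.signal.find_peaks`, the signal-to-noise threshold, the $\hat{R}$ limit of 1.05, and the standard-deviation-to-mean ratio come from the text.

```python
import numpy as np
from scipy.signal import find_peaks


def passes_prefilter(time, intensity, expected_rt, rt_window, noise, sn_min):
    """Pre-filter: require a local maximum near the expected retention
    time whose signal-to-noise ratio reaches the user-defined threshold."""
    # With default arguments, find_peaks returns indices of points that
    # are higher than both direct neighbours.
    peak_idx, _ = find_peaks(intensity)
    near = peak_idx[np.abs(time[peak_idx] - expected_rt) <= rt_window]
    if near.size == 0:
        return False
    return bool(np.any(intensity[near] / noise >= sn_min))


def rhat(chains):
    """Classic potential scale reduction factor.

    `chains` has shape (n_chains, n_draws) for a single parameter.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (n - 1) / n * within + between / n
    return float(np.sqrt(var_hat / within))


def converged(posterior, rhat_max=1.05):
    """First criterion: every parameter must have R-hat <= 1.05.

    `posterior` maps parameter names to arrays of shape (n_chains, n_draws).
    """
    return all(rhat(draws) <= rhat_max for draws in posterior.values())


def uncertainty_ok(draws, rel_sd_max):
    """Second criterion: ratio of posterior standard deviation to mean
    must not exceed `rel_sd_max` (threshold value chosen by the caller)."""
    draws = np.asarray(draws, dtype=float)
    return bool(draws.std() / abs(draws.mean()) <= rel_sd_max)
```

In a pipeline following the text, a failed `converged` check would trigger one resampling run with more tuning samples before the signal is rejected, and `uncertainty_ok` would then screen out noisy signals that converged but have very wide posteriors.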
7 changes: 5 additions & 2 deletions docs/source/markdown/Peak_model_composition.md
@@ -59,7 +59,7 @@ Starting with single peak models, the normal-shaped model ([Figure 1a](fig_c1))
```{figure-md} fig_c1
![](./Fig1_model_single_peak.png)
-__Figure 1:__ The intensity functions of normal (**a**) and skew normal peak models (**b**) as well as the prior probability distributions of their parameters are shown in the style of a Kruschke diagram [@RN162]. Connections with $\sim$ imply stochastic and with $=$ deterministic relationships. In case of variables with multiple occurrences in one formula, the prior was only connected to one such instance to preserve visual clarity. The variables $M_{i}$ and $O_{i}$ describe mean values and $T_{i}$, $R$, and $S$ standard deviations.
+__Figure 1:__ The intensity functions of normal (**a**) and skew normal peak models (**b**) as well as the prior probability distributions of their parameters are shown in the style of a Kruschke diagram {cite}`RN162`. Connections with $\sim$ imply stochastic and with $=$ deterministic relationships. In case of variables with multiple occurrences in one formula, the prior was only connected to one such instance to preserve visual clarity. The variables $M_{i}$ and $O_{i}$ describe mean values and $T_{i}$, $R$, and $S$ standard deviations.
```
The mean value $\mu$ has a normally distributed prior with the center of the selected time frame $\mathrm{min}(t) + \frac{\Delta t}{2}$ as its mean and $\frac{\Delta t}{2}$ as the standard deviation where $\Delta t$ corresponds to the length of the time frame.
@@ -106,7 +106,7 @@ $$ (eqn:param_mumu)
```{figure-md} fig_c2
![](./Fig2_model_double_peak.png)
-__Figure 2:__ The intensity functions of double normal (**a**) and double skew normal peak models (**b**) as well as the prior probability distributions of their parameters are shown in the style of a Kruschke diagram [@RN162]. Connections with $\sim$ imply stochastic and with $=$ deterministic relationships. In case of variables with multiple occurrences in one formula, the prior was only connected to one such instance to preserve visual clarity. The variables $M_{i}$ and $O_{i}$ describe mean values and $T_{i}$, $S_{i}$, $P_{i}$, and $V_{i}$ standard deviations.
+__Figure 2:__ The intensity functions of double normal (**a**) and double skew normal peak models (**b**) as well as the prior probability distributions of their parameters are shown in the style of a Kruschke diagram {cite}`RN162`. Connections with $\sim$ imply stochastic and with $=$ deterministic relationships. In case of variables with multiple occurrences in one formula, the prior was only connected to one such instance to preserve visual clarity. The variables $M_{i}$ and $O_{i}$ describe mean values and $T_{i}$, $S_{i}$, $P_{i}$, and $V_{i}$ standard deviations.
```
While all aforementioned parameters are necessary for the models, not all are of equal relevance for the user.
@@ -129,3 +129,6 @@ Examples for deterministic model variables in addition to peak area or height ar
$$
\mathrm{sn} = \frac{h}{\mathrm{noise}}
$$ (eqn:sn)
+```{bibliography}
+```
