Merge pull request #394 from beckyfisher/joss-paper-rf
joethorley authored Dec 10, 2024
2 parents e70d2d8 + df9fdf6 commit 6ae6245
Showing 3 changed files with 27 additions and 32 deletions.
8 changes: 4 additions & 4 deletions vignettes/additional-technical-details.Rmd
@@ -66,15 +66,15 @@ A total of 20 unique datasets were extracted from ssddata and used to define the

The individual ssdtools fits are shown below for each of the 20 simulation datasets from ssddata, for the six recommended default distributions, as well as the model averaged CDF (black line):

![](images/fitted_dists.png)
![ An aggregate plot of all 20 datasets from ssddata, showing how the six recommended default distributions fit each, as well as the model-averaged fit. ](images/fitted_dists.png) \

This simulation process was repeated a minimum of 1,000 times for each dataset, and the results collated across all iterations.
For each simulated dataset the true HCx values were obtained directly from the parameter estimates of the data-generating distribution.
From these, relative bias was calculated as the scaled difference between the estimated HCx values and the true HCx value, i.e. $$\frac{\widehat {HC}x-HCx}{HCx}$$ where $\widehat {HC}x$ is the estimate of the true value, $HCx$; coverage was calculated as the proportion of simulations where the true $HCx$ value fell within the lower and upper 95% confidence limits; and the scaled confidence interval width was calculated as $$\frac{UL-LL}{HCx}$$ where $UL$ and $LL$ are the upper and lower limits respectively.
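
These three summary statistics are simple to compute once the true and estimated HCx values are in hand. A minimal sketch in R (illustrative only, not the vignette's actual simulation code; all numeric values are hypothetical):

```r
# Hypothetical values for a single simulation iteration.
true_hcx <- 1.5   # true HCx from the data-generating distribution
est_hcx <- 1.8    # estimated HCx from the fitted model average
ll <- 0.9         # lower 95% confidence limit
ul <- 3.1         # upper 95% confidence limit

relative_bias <- (est_hcx - true_hcx) / true_hcx
covered <- ll <= true_hcx & true_hcx <= ul  # averaged over iterations to give coverage
scaled_ci_width <- (ul - ll) / true_hcx
```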

Bias, confidence interval width and coverage as a function of sample size across ~1000 simulations of 20 datasets using the multi-model averaging method and the weighted sample method for estimating confidence intervals via ssdtools are shown below:

![](images/ssdata_sims_collated.png)
![ An aggregate plot showing how bias and confidence interval width decrease, and coverage increases as the number of data points in the SSD increases. The first three bars are coloured to highlight that the biggest changes are from five to six and then seven data points. ](images/ssdata_sims_collated.png) \

The simulation results showed significant gains in terms of reduced bias from N=5 to N=6, as well as a substantial improvement in coverage over the same range.
There is also a small additional gain in coverage at N=7, where the median of the simulations reaches the expected 95%, although this is only the case for HC1.
@@ -88,7 +88,7 @@ Estimates of coverage and confidence interval widths were not obtained for this

Bias as a function of sample size across ~1000 simulations of 353 datasets using the multi-model averaging method via ssdtools, for HC5, 10 and 20 (0.05, 0.1, and 0.2), is shown below:

![](images/all_sims_bias.png)
![ An aggregate plot showing how bias increases as the number of data points in the SSD increases. Columns are for different proportions of the community potentially affected. The biggest changes are for 5% of the community affected. Data generated from a Gamma distribution show the greatest bias, followed by the Weibull and then the lognormal-lognormal mixture distribution. ](images/all_sims_bias.png) \

Note that simulation results are shown separately according to which of the six distributions served as the underlying (data-generating) source distribution.

@@ -105,7 +105,7 @@ This was examined using the simulation study across the larger combined ssddata
Below is a plot of the mean AICc weights as a function of sample size (N) obtained for data simulated using the best fit distribution to 353 datasets.
Results are shown separately for the six different source distributions, with the upper plot (A) showing the AICc weight of the source (data generating) distributions, when fit using the default set of six distributions using ssdtools; and the lower plot (B) showing the AICc weight of the lognormal-lognormal mixture distribution for each of the source (data generating) distributions.

![](images/weights_collated.png)
![A two panel plot, with both showing how AICc based weights increase as the number of data points in the SSD increases. The top panel shows that when there is a large enough sample size the AICc weights for the true source distribution are high. The bottom panel shows the weights for the lognormal-lognormal mixture distribution and highlights that these remain low except when the lognormal-lognormal mixture is the source distribution. ](images/weights_collated.png) \

We found that the AICc weights for the five unimodal distributions were relatively similar (~0.2) for very low N.
This is because for small N it is difficult to discern differences between the distributions in the candidate list.
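
The AICc weights themselves are easy to inspect for a single fit. A minimal sketch (assuming the `ssddata` package is installed; the exact goodness-of-fit output may differ between `ssdtools` versions):

```r
library(ssdtools)

# Fit the default candidate set to an example dataset from ssddata.
fits <- ssd_fit_dists(ssddata::ccme_boron)

# ssd_gof() tabulates goodness-of-fit statistics, including the
# information-criterion-based weights, for each candidate distribution.
ssd_gof(fits)
```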
33 changes: 14 additions & 19 deletions vignettes/distributions.Rmd
@@ -89,7 +89,7 @@ The probability density function, ${f_X}(x;b,c,k)$ and cumulative distribution f

<br>

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5, fig.cap="Sample Burr probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE, fig.align='center', fig.width=9,fig.height=5, fig.cap="Sample Burr probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the Burr type III probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
f <- function(x, b, c, k) {
@@ -150,7 +150,7 @@ These can be fitted using `ssdtools` by supplying the strings "llogis_llogis" an
The underlying code for these mixtures has three components: the likelihood function required for TMB; exported R functions to allow the usual methods for a distribution to be called (p, q and r); and a set of supporting R functions (see @fox_methodologies_2022 Appendix D for more details).
Both mixtures have five parameters - two parameters for each of the component distributions and a mixing parameter (pmix) that defines the weighting of the two distributions in the ‘mixture.’

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5, fig.cap="Sample lognormal lognormal mixture probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5, fig.cap="Sample lognormal lognormal mixture probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the lognormal lognormal mixture probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
f <- function(x, m1, s1, m2, s2, p) {
@@ -216,7 +216,7 @@ The default list of candidate distributions in `ssdtools` is comprised of the fo

The default distributions are plotted below with a mean of 2 and standard deviation of 2 on the (natural) log concentration scale or around 7.4 on the concentration scale.
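
The stated equivalence between the two scales is just the exponential back-transformation; as a quick check:

```r
# A log-scale mean of 2 corresponds to a median concentration of exp(2)
# on the raw concentration scale.
exp(2)
#> [1] 7.389056
```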

```{r, message=FALSE, echo=FALSE, fig.width=7,fig.height=5, fig.cap="Currently recommended default distributions."}
```{r, message=FALSE, echo=FALSE, fig.width=7,fig.height=5, fig.cap="Currently recommended default distributions.", fig.alt="The cumulative distribution function of the six default distributions using a mean and standard deviation of 2 for all. The distributions included are the gamma, lgumbel, llogis, lnorm, lnorm_lnorm and weibull."}
library(ssdtools)
library(ggplot2)
@@ -266,7 +266,7 @@ A random variable, $X$ is lognormally distributed if the *logarithm* of $X$ is n
The log-normal distribution was selected as the starting distribution given the data are for effect concentrations.
The log-normal distribution can be fitted using `ssdtools` supplying the string `lnorm` to the `dists` argument in the `ssd_fit_dists` call.
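
As a minimal sketch of such a call (using the `ccme_boron` example dataset from `ssddata`; argument names may vary between `ssdtools` versions):

```r
library(ssdtools)

# Fit only the log-normal distribution.
fit <- ssd_fit_dists(ssddata::ccme_boron, dists = "lnorm")

# Estimate the HC5 (the concentration affecting 5% of species).
ssd_hc(fit, proportion = 0.05)
```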

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample lognormal probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample lognormal probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the lognormal probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
conc <- seq(0, 10, by = 0.005)
@@ -326,7 +326,7 @@ The log-logistic distribution is often used as a candidate SSD primarily because
We included it because it has wider tails than the log-normal and because it is a specific case of the more general Burr family of distributions [@shao_estimation_2000; @burr_cumulative_1942].
The log-logistic distribution can be fitted using `ssdtools` by supplying the string `llogis` to the `dists` argument in the `ssd_fit_dists` call.

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Log logistic probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Log logistic probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the log logistic probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
f <- function(x, a, b) {
@@ -376,7 +376,7 @@ For use in modeling species sensitivity data, the gamma distribution has two key

The gamma distribution can be fitted using `ssdtools` by supplying the string `gamma` to the `dists` argument in the `ssd_fit_dists` call.

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample gamma probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample gamma probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the Gamma density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
conc <- seq(0, 10, by = 0.005)
@@ -411,7 +411,7 @@ The two-parameter log-gumbel distribution has the following *pdf* and *cdf*:

<br>

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Log-Gumbel probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Log-Gumbel probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the log Gumbel probability density function on the left panel and the cumulative density function of distribution on the right panel. The log Gumbel is also known as the inverse Weibull distribution."}
par(mfrow = c(1, 2))
f <- function(x, a, b) {
@@ -471,7 +471,7 @@ The second parameterisation in which the *product* $b\eta$ in the formulae above

<br>

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Gompertz probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Gompertz probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the Gompertz probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
f <- function(x, n, b) {
@@ -521,7 +521,7 @@ The two-parameter [Weibull](https://en.wikipedia.org/wiki/Weibull_distribution)

The Weibull distribution can be fitted in `ssdtools` by supplying the string `weibull` to the `dists` argument in the `ssd_fit_dists` call.

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Weibull probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Weibull probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the Weibull probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
conc <- seq(0, 10, by = 0.005)
@@ -593,7 +593,7 @@ Thus, it doesn't matter whether you're fitting a Pareto or inverse Pareto distri

Because it is *bounded*, the North American version of the (inverse) Pareto distribution is not useful as a stand-alone SSD - particularly the *inverse Pareto* distribution, since it is bounded from *above*.

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample North American Pareto probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample North American Pareto probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the North American Pareto probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
conc <- seq(0, 10, by = 0.005)
@@ -610,14 +610,9 @@ lines(conc, extraDistr::ppareto(conc, 4, 4), col = "#1CADF2")
lines(conc, extraDistr::ppareto(conc, 10, 7), col = "#1F1CF2")
legend("topleft", "(B)", bty = "n")
conc <- seq(0, 10, by = 0.005)
plot(conc, extraDistr::ppareto(conc, 3, 2), type = "l", ylab = "Cumulative probability", xlab = "Concentration", col = "#FF5733", ylim = c(0, 1))
lines(conc, extraDistr::ppareto(conc, 0.838, 0.911), col = "#F2A61C")
lines(conc, extraDistr::ppareto(conc, 4, 4), col = "#1CADF2")
lines(conc, extraDistr::ppareto(conc, 10, 7), col = "#1F1CF2")
```

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample Sample North American inverse Pareto probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample North American inverse Pareto probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the North American inverse Pareto probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
f <- function(x, a, b) {
@@ -669,7 +664,7 @@ We see from the *pdf* plots that the alternative, *European* version of the inve
We note in passing that *both* versions of these Pareto and inverse Pareto distributions are available in `R`.
For example, the `R` package `extraDistr` has North American versions, while the `actuar` package has European versions.
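
A quick way to see that the two parameterisations differ is to evaluate both densities at the same point (a sketch; consult each package's help pages for the exact parameter meanings):

```r
x <- 2

# North American parameterisation from extraDistr (support x >= b).
extraDistr::dpareto(x, a = 3, b = 1)

# European parameterisation from actuar (support x > 0, shifted by the scale).
actuar::dpareto(x, shape = 3, scale = 1)
```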

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample European Pareto probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample European Pareto probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the European Pareto probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
conc <- seq(0, 10, by = 0.005)
@@ -687,7 +682,7 @@ lines(conc, actuar::ppareto(conc, 10.5, 6.5), col = "#1F1CF2")
legend("topleft", "(B)", bty = "n")
```

```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample European inverse Pareto probability density (A) and cumulative probability (B) functions."}
```{r echo=FALSE,fig.align='center',fig.width=9,fig.height=5,fig.cap="Sample European inverse Pareto probability density (A) and cumulative probability (B) functions.", fig.alt="A two panel plot showing several realisations of the European inverse Pareto probability density function on the left panel and the cumulative density function of distribution on the right panel."}
par(mfrow = c(1, 2))
conc <- seq(0, 10, by = 0.001)
@@ -725,7 +720,7 @@ The inverse Weibull (log-Gumbel) distribution can be fitted in `ssdtools` by sup

<hr>

![](images/relationships.png){width=100%}
![ Flow diagram showing the relationship between the various statistical distributions available in ssdtools. ](images/relationships.png){width=100%} \

## References
