| title | Occupancy Models with Stan (with help from R) | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| author | Richard Erickson | ||||||||||||||||
| date | 5/7/2018 | ||||||||||||||||
| output |
|
||||||||||||||||
| bibliography | bib.bib |
This document provides an introduction to occupancy models in Stan. The terms occupancy models and occurrence models can be used synonymous from a statistical and programming perspective, but can have important ecological differences from a scientific perspective. Both terms are used interchangeably within this document unless specifically indicated that one term is appropriate for a specific use.
This document was created help collaborators learn how program occupancy models in Stan. Occupancy models can be difficult to program in Stan because they have discrete latent variables (i.e., the site-level occupancy). However, compared to other Bayesian programs, Stan can offer better performance (e.g., more rapidly fitting models, converging when other models could not, fewer iterations required).
This tutorials assumes the reader is familiar with R, occupancy models, and Bayesian statistics. Some base knowledge of Stan would be helpful, even if it is just completing the schools "hello world" example. This tutorial builds upon a previous example tutorial, which I helped to update. Another tutorial I found did not use current Stan syntax. Last, a multi-species tutorial exists, but is beyond the (current) scope of this tutorial
The introduction covers a brief overview of occupancy models, logistic regression in Stan and R, and matrix notation in R. Next, two-level occupancy models are introduced. Following this, three-level occupancy models are introduced. More topics may be included (e.g., changing variance structures or correlated estimates) as I have time or need to program these models.
This section provides a crash course on useful topics and code for using occupancy models with Stan in R.
Occupancy models allow for imperfect detection. For example, if I look out my kitchen window and do not see a robin, why did I not see it? Maybe the bird was not there or maybe I missed it. If I look out my kitchen window on a regular basis, I could use an occupancy model to estimate my probability of detection. Occupancy models can be extended to other situations with imperfect detection as well.
In a world with perfect detection (i.e., glm() function to fit this model using a maximum likelihood approach:
glm(y ~ x, family = 'binomial')Note, that if we want, we can fit the model using a probit link function rather than a logit:
glm( y ~ x, family = binomial(link = 'probit'))Many good tutorials exist on the differences between probits and logits. Usually, the models produce similar results, although the logit has a slightly fatter tail. Personally, I prefer logits if for no other reason than that is what my PhD advisor used. The reason for mentioning the logit versus probit is that we need to be explicit about our modeling assumption when moving on to Stan and knowing other options exist help see why.
The logit model may be also easily coded in Stan as well.
I would save this script as a logistic_stan.stan (Important note: Stan models must end with .stan with a lower case s, otherwise the model will not be complied):
data {
int<lower=0> N;
vector[N] x;
int<lower=0,upper=1> y[N];
}
parameters {
real alpha;
real beta;
}
model {
y ~ bernoulli_logit(alpha + beta * x);
}
After saving this code as logistic_stan.stan, we can call the model from R. We'll simulate some data and format it to work with Stan here as well.
library(rstan)
options(mc.cores = parallel::detectCores())
### simulate and examine raw data
n_obs = 20
x_label <- factor(rep(c("a", "b"), times = n_obs)) # used to Id variables
x_prob <- rep(c(0.25, 0.75), times = n_obs) # used to simulate data
y <- rbinom(n = n_obs * 2, size = 1, p = x_prob)
aggregate(y, by = list(x_label), FUN = mean)
### First, we fit a GLM using base R
glm_out <- glm(y ~ x_label, family = 'binomial')
summary(glm_out)
### Second, we fit the model using Stan:
stan_data <- list(N = length(y),
x = as.numeric(x_label) - 1,
y = y)
stan_model <- stan_model(file = "logistic_stan.stan")
stan_out <- sampling(stan_model, data = stan_data, chains = 4, iter = 500)
### Useful functions for looking at Stan outputs include:
print(stan_out)
traceplot(stan_out)
traceplot(stan_out, inc_warmup = TRUE)
plot(stan_out)
pairs(stan_out)A few things to point out.
- Unlike base R, rstan does not have a nice wrapper function. This means we must use integers as the id variable each observation rather than characters or factors.
- You will want to increase the number of iterations when actually running the model for use. When troubleshooting, I use 100s. When running for publication, I use 1,000s to 10,000s or more.
- Notice the Bayesian model diagnostics. If these are unfamiliar to you, I suggest working through BDA3 by Gelman et al. or other introductory Bayesian stats book. The Shinystan interface provides many different Bayesian diagnostics tools that work with Stan.
However, when working with occupancy, we must account for imperfect detection.
Because of this, and the details of how Stan works under the hood, we cannot use the default probability functions for Stan.
To introduce this topic, we will next look at using the probability target in Stan.
The += opperator is a programming shortcut.
For example x = x + 2, can be written as x += 2.
Our code now looks like this and is saved in the file logistic_target_stan.stan:
data {
int<lower=0> N; // Number of samples
vector[N] x; // predictor vector
int<lower=0,upper=1> y[N];//response vector
}
parameters {
real alpha; //intercept
real beta; ///slope
}
model {
for( index in 1:N){
target += binomial_logit_lpmf( y[index] | 1, alpha + beta * x[index]);
}
}
From both of the above example, there are some important features of Stan to note:
First, unlike R or JAGS, Stan requires us to explicitly declare variables like C++. A downside is that the language can be less forgiving. An upside is that the language makes us be precise and makes it harder to "programming by coincidence". Rather than simply estimate parameters using a model, we must understand their data structure within the model. That being said, I've spend many an hour refreshing my linear algebra skills to understand my Stan code and dimensions of objects. But, the end products were code that I now trust and run quickly.
Second, Stan uses code blocks.
These are defined in the Stan manual.
In a nutshell, the require us to declare the input data, the estimated parameters, and our model.
There are also other types of blocks you can look up in the manual and we might see some of them later.
Third, we can include comments with // or blocks of code with /* comment here */.
## Complie the model
stan_model_target <- stan_model("logistic_target_stan.stan")
## Sample from the model
stan_out_target <- sampling(stan_model_target,
data = stan_data, chains = 4,
iter = 1000)
## Useful functions for looking at Stan outputs include:
print(stan_out_target, pars = c("alpha", "beta", "lp__"))
traceplot(stan_out_target, pars = c("alpha", "beta", "lp__"))
traceplot(stan_out_target, inc_warmup = TRUE, pars = c("alpha", "beta", "lp__"))
plot(stan_out_target, pars = c("alpha", "beta", "lp__"))Exercise: Using the code above, fit a probit regression. You'll need to go to the Stan documentation to figure this out. This exercise will also likely take you 1 hr to a couple of days depending upon your abilities with Stan.
Another important concept for occupancy modeling is data structure and probability distributions.
The Bernoulli distribution models one sampling event (often denoted as
For both general data analysis and occupancy modeling, I use both distributions.
When fitting a simple binomial GLM in R, my modeling choice depends upon the structure of the data.
I use a Bernoulli style input (a vector of 0s and 1s for y) if my data has coefficients for each observation or the data was given to me in that format.
I use a binomial style input if my data has been aggregated or I want to avoid pseudoreplication (e.g., the tank is the level of replication rather than the individual).
R has two methods for inputting binomial style data.
First, a matrix of "successes" and "failures" maybe used for y.
Second, a vector of probabilities may be used for y and a weight = option specified.
Closely relate to these distributions are the data concepts of "wide versus long data in R" and "aggregate versus raw data".
During this code example, you will see how to fit a model using all three methods as well as how to convert code between wide and long formats.
## Simulate data in long format
set.seed(1223)
xSim <- rep(c(0.25, 0.75), each = 14)
x <- rep(c("a", "b"), each = 14)
y <- rbinom(n = length(xSim), size = 1, prob = xSim)
data_long <-
data.frame(x = x,
y = factor(y, labels = c("fail", "success")))
### cast the data to wide format using tidyverse
library(tidyverse)
data_wide <-
data_long %>%
group_by(x,y) %>%
summarize(N = n()) %>%
spread( y, N) %>%
mutate(Total = success + fail,
success_proportion = success / (success + fail))
data_wide
### Compare the three methods for fitting a logistic regression in R
glm(y ~ x, family = 'binomial', data = data_long)
glm(cbind(fail, success) ~ x, family = 'binomial', data = data_wide)
glm(success_proportion ~ x, family = 'binomial', data = data_wide,
weights = Total)For occurrence models in Stan, we must use the Bernoulli distribution (or Binomial with K = 1) for the latent variables because we cannot aggregate the data.
Specifically, we need details about each replicate at a lower level.
For example, We cannot aggregate and say that 3 sites had Robins and 2 sites did not. Instead, we need a vector of of these site-level detentions, for example c(0, 1, 1, 0, 1, 1, 1).
For the lowest level of the occurrence model, I often do use a Bernoulli distribution when I do not have coefficients at the observation-level because there are fewer data entries to aggregate over.
We will see these in later chapters.
Models in R such as lm() and glm() allow users to input formuals.
Formulas allow users to input factors and have R convert them to a matrix of dummy variables.
model.matrix() allows us to create these same type matrices of input variables.
There are several benefits to using model.matrix() to pre-process inputs for Stan.
First, it allows us to easily turn factors into dummy variables.
Second, it allows us to easily have matrices of predictors, which in turn allows us to us matrix algebra within Stan.
This section introduces model.matrix() so that it will be familiar to us later.
Note: I use shorter matrices than most real applications to save screen space.
model.matrix() use the ~ (shift-` on US keyboards) for its input.
In statistical English, this could be read as "predicted by", for example y ~ x could be spoken or read as y predicted by x.
The follow example demonstrates how it may be used on a simple factor data.frame:
df <- data.frame(city = c("La Crosse", "St. Louis", "Cairo"))
model.matrix( ~ city, data = df)## (Intercept) cityLa Crosse citySt. Louis
## 1 1 1 0
## 2 1 0 1
## 3 1 0 0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$city
## [1] "contr.treatment"
Things to notice about model.matrix():
model.matrix()converted city to an alphabetical order factor.- The first factor is the first in alphabetically. This order may be changing the factor order in R.
- The first factor become a global intercept and the other two levels are compared to this. In the next section, we'll see how to change this.
If we want an intercept for each factor level, we use a - 1 in the notation.
If we have multiple factors, we can only estimate intercepts for all of one of the factors. For example, if we have months and city, we would need a reference month or reference city. Also, notice how order matters. Most advanced book on regression analysis explains this in greater detail (e.g., Harrell or Gelman and Hill).
df <- expand.grid(city = c("La Crosse", "St. Louis", "Cairo"),
month = c("May", "June"))
model.matrix( ~ city + month -1, data = df)
model.matrix( ~ month + city -1, data = df)We can also use numeric inputs with model.matrix()
For example, if we input month as a numeric vector, R creates a matrix with month as a numeric column.
If we were using the matrix in a regression, this new column would correspond to a slope estimate.
df1 <- expand.grid(month = c( 5, 6, 7))
model.matrix( ~ month, data = df1)## (Intercept) month
## 1 1 5
## 2 1 6
## 3 1 7
## attr(,"assign")
## [1] 0 1
Conversely, if we input month as a factor, we get similar results as before.
df2 <- expand.grid(month = factor(c( 5, 6, 7)))
model.matrix( ~ month, data = df2)## (Intercept) month6 month7
## 1 1 0 0
## 2 1 1 0
## 3 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$month
## [1] "contr.treatment"
The purpose of this example is to demonstrate how R can sometimes produce unexpected results, especially if we want a measure of time to correspond to an intercept estimate rather than a slope estimate.
Closely related to the above point is a problem I have run into when creating binary response variables in R for use with Stan. For example, let's say we want to model occupancy for a lake and a river:
set.seed(12351513)
df_occ <-
data.frame(occ = factor(sample(rep(c("yes", "no"), each = 10))),
site = factor(rep(c("lake", "river"), each = 10)))Using Base R, we could just run glm() on this data:
summary(glm(occ ~ site, data = df_occ, family = 'binomial'))But, look at what happens if we try and convert occ to a binary response:
### base line
as.numeric(factor(df_occ$occ))
### need -1 to create a vecor of zeros and ones
as.numeric(factor(df_occ$occ)) - 1Now that you've had a crash course on R topics, let's build our first occupancy model with Stan.
As you will see in the upcoming chapters, Stan requires marginalizing out discrete latent variables.
This requires working with probabilities.
Probabilities are often log transformed to increase numerical stability (or, informally: make the equations easier for the computer to solve) AND to change from multiplication to addition.
Here are some quick refreshers of log rules:
$\text{log}(xy) = \text{log}(x) + \text{log}(y)$ $\text{log}(0) = 1$
For example, let's say we flip a coin once.
The coin has probability
We get a heads, which we call 1.
If we flip the coin 4 times, our outcome could be 1001.
The probability of this occurring would be:
We have
Taking the log of this gives
Two key takeaways.
First, notice how the product now became a summation.
Second,
MacKenzie et al. covers these calculations for occupancy models in chapter 4 of their book.
This chapter builds up a series of two-level occupancy models in Stan. The models start simple and then build to more complicated models. The purpose of this is twofold. First, it provides an introduction and overview of these models. Second, the transformations buildup to allow for more complicated models such as those used for eDNA or other structures. For the purpose of this chapter we use the term sampling unit to refer to the observation. Sampling units can be replicated across space, time, or both.
This chapter usesa simple occupancy model that has one site with multiple observation points (e.g., cameras or traps). This model is presented in four different forms:
- The first and simplest model directly estimates probabilities of detection and occupancy. This model takes a site-by-visit (row of sites, columns of visits or surveys to each site) matrix as the input.
- The second model then gets changed to estimate parameters on the logit-scale.
- The third model uses an input vector of the sum of number visits with detentions.
- The fourth model uses an input vector of binary observations for each visit to a site.
We will start by building a simple occupancy model that only includes a global intercept at each level. This example is based upon the Stan example provided by Bob Carpenter. The model assumes the same probability of detection for each observation.
The probability that any site is occupied is
The model in Stan contains three code blocks.
The first is the data block:
data {
int<lower=0> n_sampling_units
int<lower=0> n_surveys;
int<lower=0,upper=1> y[n_sampling_units, n_surveys];
}
which defines the input data into Stan.
The second is the parameters block:
parameters {
real<lower=0,upper=1> psi;
real<lower=0,upper=1> p;
}
which defines the parameters being estimated.
The third block is the model block.
This block includes local variables to increase the computational efficient of the code.
The code also includes priors.
If a user does not specify priors with Stan, then Stan has default priors that account for the constraints of parameters.
The third part of the block is the likelihood function.
This includes an if-else statement to account for detection (if) or non-detection (else) at a site.
model {
// local variables to avoid recomputing log(psi) and log(1 - psi)
real log_psi;
real log1m_psi;
log_psi = log(psi);
log1m_psi = log1m(psi);
// priors
psi ~ uniform(0,1);
p ~ uniform(0,1);
// likelihood
for (r in 1:n_sampling_units) {
if (sum(y[r]) > 0)
target += log_psi + bernoulli_lpmf(y[r] | p);
else
target += log_sum_exp(log_psi + bernoulli_lpmf(y[r] | p),
log1m_psi);;
}
}
The data simulation process also can provide us with insight into the how the model works and how we assume our data generation process works.
For this case, we are specifying the number of sites and visits (or surveys) per site.
We also specify our simulated probability of detection p, and site occupancy probability psi.
The next simulation step is to simulate if the sites are occupied and then simulation the surveys at each site.
Last, the data are wrapped into a list for Stan.
## Simulate true occupancy states
set.seed(1234)
## number of sampling units
n_sampling_units <- 250
## visits per sampling unit
n_surveys <- 10
p <- 0.6
psi <- 0.4
## simulate site-level occupancy
z <- rbinom( n_sampling_units, 1, psi)
## simulate sample-level detections
y <- matrix( NA, n_sampling_units, n_surveys);
for (site in 1:n_sampling_units){
y[site,] <- rbinom( n_surveys, 1, z[site] * p)
}
simple_model_input <- list(n_sampling_units = n_sampling_units,
n_surveys = n_surveys,
y = y)**Modeling building tip: ** When building a Stan model, I examine the raw data and make sure the Stan model matches the data. Although, this can be an iterative process, because I often discover my data requires wrangling to work with Stan. In this process, I usually start the iterative process with the data because it is easier to work with.
Now that we've simulated the data, we can use it with our model.
First, we compile the model with stan_model().
Second, we sample from the model using sampling().
## Fit the model in stan
model_wide <- stan_model('./occupancy.stan')
fit_wide <- sampling(model_wide, simple_model_input)
fit_wideAfter fitting the model, we can look at the outputs.
The default print() option for a stanfit object shows a summary of the model's fit.
In this case, typing fit_wide prints these results:
Inference for Stan model: occupancy.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
psi 0.37 0.00 0.03 0.32 0.35 0.37 0.39 0.43 3587 1
p 0.60 0.00 0.02 0.56 0.58 0.60 0.61 0.63 4000 1
lp__ -796.37 0.02 1.01 -799.03 -796.74 -796.08 -795.65 -795.38 1996 1
Samples were drawn using NUTS(diag_e) at Thu Aug 30 13:03:08 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).If these results do not make sense, I strongly recommend either reviewing/learning Stan and Bayesian statistics.
Exercise: Simulate different datasets. Using different occupancy probabilities and sample sizes. Figure out when the model works well and when it does not.
We can take the model from the previous section and make it more numerically efficient by calculating on the logit scale. This new formulation also allows the inclusion of predictor variables, much like a logistic regression.
This model is different because we are now estimating
parameters {
real muPsi;
real muP;
}
The model code block is also written differently now. Key differences include
- Slightly different likelihood functions are used; and
- the
muparameters are now real number that can range from minus infinity to positive infinity.
However, the model is still very similar to the previously defined model.
model {
real log_muPsi;
real log1m_muPsi;
log_muPsi = log_inv_logit(muPsi);
log1m_muPsi = log1m_inv_logit(muPsi);
muP ~ normal(0, 2);
muPsi ~ normal(0, 2);
// likelihood
for (r in 1:n_sampling_units) {
if (sum(y[r]) > 0)
target +=
log_muPsi + bernoulli_logit_lpmf(y[r] | muP);
else
target +=
log_sum_exp(log_muPsi +
bernoulli_logit_lpmf(y[r] | muP), log1m_muPsi);
}
}
We can transform the parameters to the logit scale to the probability scale in Stan using the generated quantities code block.
This code block is complied so it is quick, but does not get run through MCMC algorithm and slow down the model.
generated quantities{
real<lower = 0, upper = 1> p;
real<lower = 0, upper = 1> psi;
p = inv_logit(muP);
psi = inv_logit(muPsi);
}
This model is fit the same as previous model.
## Fit the model using parameters on the mu scale
model_wide_mu <- stan_model('./occupancy_mu.stan')
fit_wide_mu <- sampling(model_wide_mu, data = simple_model_input)
fit_wide_muThe outputs are similar, other than having small numerical differences due to Monte Carlo variability.
The output also include the muP and muPsi parameters.
Inference for Stan model: occupancyMu.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
muPsi -0.53 0.00 0.13 -0.78 -0.62 -0.53 -0.44 -0.27 3111 1
muP 0.39 0.00 0.07 0.26 0.34 0.39 0.43 0.52 3136 1
p 0.60 0.00 0.02 0.56 0.58 0.60 0.61 0.63 3139 1
psi 0.37 0.00 0.03 0.31 0.35 0.37 0.39 0.43 3125 1
lp__ -793.54 0.02 1.01 -796.18 -793.90 -793.21 -792.83 -792.56 1700 1
Samples were drawn using NUTS(diag_e) at Thu Aug 30 13:26:44 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).The previous example assumed that we had the same number of observations per site.
One method to side-step this assumption is to use a binomial input rather than a Bernoulli.
Now our input y is the number of surveys or visits where detections occurred and we need a new input k, which corresponds to the number of surveys per sites.
We can summarize our data to be in the needed format and put it into a list for Stan.
## Now use summizations of the data
y_sum <- apply(y, 1, sum)
k <- rep(dim(y)[2], dim(y)[1])
stan_sumation_data <-
list(
n_sampling_units = n_sampling_units,
y = y_sum,
k = k)This formulation also changes the data block of the model:
data {
int<lower=0> n_sampling_units;
int<lower=0> y[n_sampling_units];
int<lower=0> k[n_sampling_units];
}
As well as the likelihood portion of the model:
// likelihood
for (r in 1:n_sampling_units) {
if (y[r] > 0)
target +=
log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP);
else
target +=
log_sum_exp(log_muPsi +
binomial_logit_lpmf(y[r] | k[r], muP), log1m_muPsi);
}
Before we can fit the model, we need to sum and reformat our data our data:
## Now use summizations of the data
y_sum <- apply(y, 1, sum)
k <- rep(dim(y)[2], dim(y)[1])
stan_sumation_data <-
list(
n_sampling_units = n_sampling_units,
y = y_sum,
k = k)We can then fit this model using Rstan and look at the results, which are similar to the previous results:
stan_model_mu_binomial <- stan_model('./occupancy_mu_binomial.stan')
fit_mu_binomial <- sampling(stan_model_mu_binomial,
data = stan_sumation_data)
fit_mu_binomialInference for Stan model: occupancyMuBinomial.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
muPsi -0.52 0.00 0.13 -0.78 -0.61 -0.52 -0.43 -0.27 3619 1
muP 0.39 0.00 0.07 0.26 0.35 0.39 0.43 0.52 3589 1
p 0.60 0.00 0.02 0.56 0.59 0.60 0.61 0.63 3589 1
psi 0.37 0.00 0.03 0.31 0.35 0.37 0.39 0.43 3619 1
lp__ -342.89 0.02 1.02 -345.71 -343.27 -342.57 -342.15 -341.90 1709 1
Samples were drawn using NUTS(diag_e) at Thu Aug 30 14:31:42 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).What if our data is in a "long format" with one row per survey and sampling unit?
Then, we can build a model, but it is trickier.
I received help from Max Joseph on the Stan user forum who wrote this model.
Before diving into the Stan code, the data needs to mutated to be the correct shape.
I use the Tidyverse for this.
One key part of this code is creating the index of start and ending numbers numbers for each survey.
## Convert to long format
y_df <-
y %>%
as.data.frame() %>%
mutate(site = factor(1:n_sampling_units),
z_obs = ifelse(rowSums(y) >0, 1, 0))
y_long <-
y_df %>%
gather(survey, y, -site, -z_obs) %>%
arrange(site) %>%
mutate(index = 1:nrow(y_long))
head(y_long, 12)
tail(y_long, 12)
y_long_summary <-
y_long %>%
group_by(site) %>%
summarize(any_seen = ifelse(sum(y) >0, 1, 0),
N = n())
X_psi <- model.matrix( ~ 1, data = y_long_summary)
n_site <- y_long %>% pull(site) %>% unique() %>% length()
m_psi <- ncol(X_psi)
total_surveys <- nrow(y_long)
X_p <- model.matrix( ~ 1, data = y_long)
m_p <- ncol(X_p)
start_end <-
y_long %>%
group_by(site) %>%
summarize(start_idx = min(index),
end_idx = max(index))
site = y_long %>% pull(site) %>% as.numeric()
stan_d <- list(
n_site = n_site,
m_psi = m_psi,
X_psi = X_psi,
total_surveys = total_surveys,
m_p = m_p,
X_p = X_p,
site = site,
y = y_long %>% pull(y),
start_idx = start_end %>% pull( start_idx),
end_idx = start_end %>% pull( end_idx),
any_seen = y_long_summary %>% pull(any_seen),
n_survey = y_long_summary %>% pull(N)
)The code for this Stan model become more complicated.
First, we cannot simply loop over raw observations.
Instead, we need to "cut" out the observations for each survey (site visit).
This model also includes the ability to include coefficients at different levels, but these will not be discussed until the next section.
These coefficient include matrices of predictors.
These are same type that are created by model.matrix().
data {
// site-level occupancy covariates
int<lower = 1> n_sampling_units;
int<lower = 1> nPsiCoef;
matrix[n_sampling_units, nPsiCoef] Xpsi;
// survey-level detection covariates
int<lower = 1> totalSurveys;
int<lower = 1> nPCoef;
matrix[totalSurveys, nPCoef] Vp;
// survey level information
int<lower = 1, upper = n_sampling_units> site[totalSurveys];
int<lower = 0, upper = 1> y[totalSurveys];
int<lower = 0, upper = totalSurveys> startIndex[n_sampling_units];
int<lower = 0, upper = totalSurveys> endIndex[n_sampling_units];
// summary of whether species is known to be present at each site
int<lower = 0, upper = 1> z[n_sampling_units];
// number of surveys at each site
int<lower = 0> nSurveys[n_sampling_units];
}
parameters {
vector[nPsiCoef] beta_psi;
vector[nPCoef] beta_p;
}
transformed parameters {
vector[totalSurveys] logit_p = Vp * beta_p;
vector[n_sampling_units] logit_psi = Xpsi * beta_psi;
}
model {
vector[n_sampling_units] log_psi = log_inv_logit(logit_psi);
vector[n_sampling_units] log1m_psi = log1m_inv_logit(logit_psi);
beta_psi ~ normal(0, 1);
beta_p ~ normal(0, 1);
for (i in 1:n_sampling_units) {
if (nSurveys[i] > 0) {
if (z[i]) {
// site is occupied
target += log_psi[i]
+ bernoulli_logit_lpmf(y[startIndex[i]:endIndex[i]] |
logit_p[startIndex[i]:endIndex[i]]);
} else {
// site may or may not be occupied
target += log_sum_exp(
log_psi[i] + bernoulli_logit_lpmf(y[startIndex[i]:endIndex[i]] |
logit_p[startIndex[i]:endIndex[i]]),
log1m_psi[i]
);
}
}
}
}
After building this model, it may be fit like any other model.
Notice the use of the pars option in print to only print the parameters of interest.
Also, this model did not calculate probabilities because of the coefficients.
## Fit long form model
stan_model_bern <- stan_model("./occupancy_bernoulli.stan")
fit_bern <- sampling(stan_model_bern,
data= stan_d,
chains = 4, iter = 2000)
print(fit_bern, pars = c("beta_psi", "beta_p"))
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta_psi[1] -0.52 0 0.13 -0.78 -0.60 -0.51 -0.43 -0.27 3909 1
## beta_p[1] 0.39 0 0.07 0.25 0.34 0.39 0.43 0.51 3393 1And the outputs from the model are similar to the other models.
Inference for Stan model: bernoulli-occupancy.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta_psi[1] -0.52 0 0.13 -0.78 -0.60 -0.51 -0.43 -0.27 3909 1
beta_p[1] 0.39 0 0.07 0.25 0.34 0.39 0.43 0.51 3393 1
Samples were drawn using NUTS(diag_e) at Thu Aug 30 15:01:05 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).Converting these values to probabilities with the plogis() function shows the results are similar to the other modeling methods:
plogis(-0.52)## [1] 0.3728522
plogis(0.39)## [1] 0.5962827
This chapter provided an introduction to two-level occupancy models in Stan. The formatting and coding of the models became progressively more difficult, but in doing so, gives us the opportunity to do more with the models.
This chapter covers occupancy models with subsampling. The motivating example are eDNA-based occupancy models. For examples, see Erickson et al. (2019) or Mize et al. (2019) and the application of these models to a USFWS monitoring program. These two examples are described in the next chapter. Specifically, those with three levels: Site-level occupancy; Sample-level capture probabilities; and Molecular-level detection probabilities. However, these models originated with ecological models with three-levels (e.g. sites, samples, and sub-samples) and this chapter could readily be applied to these models as well.
The chapter includes a brief overview and background on eDNA. Then, the chapter covers the statistical challenges of occupancy models. Next, the chapter covers a Stan model for three-levels. Last, the chapter covers simulating data and fitting a Stan model to this data.
Most if not all organisms excrete or shed DNA into their surroundings or environment. This DNA is called environmental or eDNA. Microbial and soil scientist have long used this DNA as a tool for detecting microbes in the soil because the method allows the detection of difficult to capture and culture species. More recently, macro biologists discovered the power of sampling DNA. We (the authors of this tutorial) have been motivated to examine the use eDNA as a monitoring tool for natural resource management, primarily driven to study and monitor aquatic invasive species.
When sampling for eDNA, occupancy models must account for at least one extra level of detection. Most occupancy models examine the probability a site is occupied and the probability of detecting eDNA within that site. In contrast, eDNA-based sampling possesses three levels:
- The site-level occupancy probability of eDNA (e.g., is eDNA present at a site?);
- The sample-level capture probability of eDNA (e.g., does this water sample contain eDNA?);
- The molecular-level detection probability of eDNA (e.g., did the PCR replicate detect eDNA?).
Currently no standard terminology exists for eDNA-based occupancy models. We based our terminology upon Dorazio and Erickson (2017), who also provide a brief overview of the eDNA literature. In general, we do not include indexing to simplify this tutorial, but this indexing allows for multiple seasons or sites or both to be sampled (e.g., one site may be visited multiple times, multiple sites may be visited once, or multiple sites may be visited multiple times). Additionally, we use an inconsistent mix of scalar and matrix/vector notation.
The first level of the model examines the occurrence of eDNA at a location.
For example if we either visit the same site multiple times or visit multiple sites once, what's the probability eDNA will occur at the site?
The binary state variable,
The second level of the model examines the capture of eDNA in a sample.
For example, if we collect multiple water samples at a site, what's the probability a sample contains eDNA?
Capturing eDNA within a sample requires eDNA to be present a site.
If
Like
In this case,
The third level of he model examines the detection of eDNA in a sample.
For example, did our PCR assay detect eDNA in a sample?
Detecting eDNA within a sample requires eDNA to present within a sample.
If
Like the previous two probabilities,
Here,
Fitting the model presents statically challenges. From a numerical perspective, the model contains 2 levels of latent variables. These present challenges for Stan, which can be overcome by taking the discritizing the model. Overcoming these are covered in the next section. Detecting eDNA at a site also presents philosophical challenges. What does the detection mean? Is the species present or only its DNA? Likewise, does the amount of eDNA correspond to the bioass or number of organisms present? What about the detection probabilities? These questions are beyond the scope of this tutorial, but are important as eDNA becomes an increasingly used tool.
Define input variales, describe here
data {
// site-level occupancy covariates
int<lower = 1> nSites;
int<lower = 1> nPsiCoef;
matrix[nSites, nPsiCoef] Xpsi;
// sample-level detection covariates
int<lower = 1> totalSamples;
int<lower = 1> nPCoef;
int<lower = 1> nThetaCoef;
matrix[totalSamples, nPCoef] Vp;
matrix[totalSamples, nThetaCoef] Wtheta;
// sample level information
int<lower = 0> y[totalSamples];
int<lower = 0, upper = 1> aObs[totalSamples];
int<lower = 0> k[totalSamples];
int<lower = 0, upper = totalSamples> startIndex[nSites];
int<lower = 0, upper = totalSamples> endIndex[nSites];
// summary of whether species is known to be present at each site
int<lower = 0, upper = 1> zObs[nSites];
// number of samples at each site
int<lower = 0> nSamples[nSites];
}
parameters {
vector[nPsiCoef] beta_psi;
vector[nPCoef] delta_p;
vector[nThetaCoef] alpha_theta;
}
transformed parameters {
vector[totalSamples] logit_p = Vp * delta_p;
vector[totalSamples] logit_theta = Wtheta * alpha_theta;
vector[nSites] logit_psi = Xpsi * beta_psi;
}
library(rstan)## Loading required package: StanHeaders
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## rstan (Version 2.18.9, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
library(data.table)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)set.seed(12345)
nSites = 100
nSamples = 10
k = 8
totalSamples = nSites * nSamples
psi = 0.8
theta = 0.6
p = 0.3siteLevelData <- data.table(
site = 1:nSites,
zSim = rbinom(nSites, 1, psi))
sampleLevelData <- data.table(
site = rep(1:nSites, each = nSamples),
sample = rep(1:nSamples, times = nSites),
theta = theta,
p = p,
k = k
)
setkey(siteLevelData, "site")
setkey(sampleLevelData, "site")
dataUse <- siteLevelData[ sampleLevelData]## was eDNA captured in sample?
dataUse[ , aSim := zSim * rbinom(n = nrow(dataUse), size = 1, p = theta)]
## was eDNA detected in sample?
dataUse[ , y := rbinom(n = nrow(dataUse), size = k, p = aSim * p)]
## Calculate observed a and z
dataUse[ , aObs := ifelse(y > 0, 1, 0)]
dataUse[ , zObs := ifelse(sum(aSim) > 0, 1, 0), by = .(site)]psi## [1] 0.8
dataUse[ , mean(zSim)]## [1] 0.8
dataUse[ , mean(zObs)]## [1] 0.8
theta## [1] 0.6
dataUse[ zObs > 0, mean(aSim)]## [1] 0.56375
dataUse[ zObs > 0, mean(aObs)]## [1] 0.54125
p## [1] 0.3
dataUse[ aObs > 0, mean(y/k)]## [1] 0.3262125
siteSummary <- dataUse[ , .(a = ifelse(sum(y) >0, 1, 0), .N, sumY = sum(y),
z = ifelse(sum(zObs) >0, 1, 0)), by = site]
head(siteSummary)## site a N sumY z
## 1: 1 1 10 10 1
## 2: 2 0 10 0 0
## 3: 3 1 10 17 1
## 4: 4 0 10 0 0
## 5: 5 1 10 12 1
## 6: 6 1 10 12 1
Xpsi <- model.matrix( ~ 1, data = siteSummary)
nPsiCoef = dim(Xpsi)[2]
Vp <- model.matrix( ~ 1, data = dataUse)
nPCoef <- dim(Vp)[2]
dataUse[ , index:=1:nrow(dataUse)]startStop <- dataUse[ , .(start_idx = min(index),
end_idx = max(index)), by = site]y <- dataUse[ , y]
aObs <- dataUse[ , aObs]
k <- dataUse[ , k]
zObs = siteSummary[ , z]
startIndex <- startStop[ , start_idx]
endIndex <- startStop[ , end_idx]
stanData <- list(nSites = nSites,
nPsiCoef = nPsiCoef,
Xpsi = Xpsi,
totalSamples = nrow(dataUse),
nPCoef = nPCoef,
nThetaCoef = nPCoef,
Vp = Vp,
Wtheta = Vp,
y = y,
k = k,
zObs = siteSummary[ , z],
startIndex = startIndex,
endIndex = endIndex,
nSamples = siteSummary[ , N],
aObs = dataUse[, aObs]
)fit <- stan('../ChaptersCode/eDNAOccupancy/eDNAoccupancy.stan',
data= stanData,
chains = 4, iter = 200)
fitSummary <- summary(fit)$summarytraceplot(fit, pars = c("beta_psi", "alpha_theta", "delta_p", "lp__"), inc_warmup = FALSE)plot(fit, pars = c("beta_psi", "alpha_theta", "delta_p")) ## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
print(fit, pars = c("beta_psi", "alpha_theta", "delta_p", "lp__"))## Inference for Stan model: eDNAoccupancy.
## 4 chains, each with iter=200; warmup=100; thin=1;
## post-warmup draws per chain=100, total post-warmup draws=400.
##
## mean se_mean sd 2.5% 25% 50% 75%
## beta_psi[1] 1.32 0.02 0.24 0.84 1.15 1.33 1.48
## alpha_theta[1] 0.43 0.01 0.15 0.13 0.33 0.43 0.53
## delta_p[1] -0.77 0.00 0.07 -0.92 -0.82 -0.78 -0.73
## lp__ -356.44 0.07 1.11 -359.66 -357.00 -356.08 -355.64
## 97.5% n_eff Rhat
## beta_psi[1] 1.82 140 1.01
## alpha_theta[1] 0.70 324 1.00
## delta_p[1] -0.62 498 0.99
## lp__ -355.19 221 1.01
##
## Samples were drawn using NUTS(diag_e) at Fri Jun 14 11:43:28 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(dataUse[ , mean(zSim)])## [1] 0.8
print(dataUse[ , mean(zObs)])## [1] 0.8
print(round(plogis(fitSummary[grep("beta_psi", rownames(fitSummary)), c(4,1,8)] ), 2))## 2.5% mean 97.5%
## 0.70 0.79 0.86
print(theta)## [1] 0.6
print(dataUse[ zObs > 0, mean(aSim)])## [1] 0.56375
print(dataUse[ zObs > 0, mean(aObs)])## [1] 0.54125
print(round(plogis(fitSummary[grep("alpha_theta", rownames(fitSummary)), c(4,1,8)] ), 2))## 2.5% mean 97.5%
## 0.53 0.61 0.67
print(p)## [1] 0.3
print(dataUse[ aObs > 0, mean(y/k)])## [1] 0.3262125
print(round(plogis(fitSummary[grep("delta_p", rownames(fitSummary)), c(4,1,8)] ), 2))## 2.5% mean 97.5%
## 0.28 0.32 0.35
This chapter describes the application of eDNA-based sampling methods to monitoring for two invasive species, bighead carp and silver carp. These two species are closely related, and, in fact, hybridize in North America. Sometimes both species are called bigheaded carp. The species were initially introduced to control algae in aquaculture and industrial settings (e.g., power plan and waste water treatment ponds) and since escaped into the wild. The species are currently spreading throughout the Mississippi River basin. One method the USFWS uses to monitor for these species are eDNA-based sampling.
This case study builds upon the previous presented occupancy model because of the sampling design. The USFWS Quality Assurance Project Plan (QAPP) required two eDNA qPCR assays (ACTM1 and ACTM3) to both be positive for detection of "Asian carp" (i.e., Silver carp, bighead carp, or both). The QAPP also used a second round of four assays (two for each species) to determine which species were present. With the second assay, only one assay needed to be positive for a species to be detected. For the purpose of my collaboration with the USFWS, we only considered the first round of assays.
My involvement with this project resulted in two papers. An initial study used simulations and mathematical calculations examined different sampling designs for the project (Erickson et al. 2019). This study approach is analogous to "power analysis" for null-hypothesis significance testing. The first part of this study used mathematical probability calculations to give numbers of samples required for detection probabilities. These calculations are described in this chapter. The second part of this study used simulated datasets to evaluate an occupancy model from the chapter. These efforts are not describe in this chapter.
The second paper (Mize et al. 2019) created a new occupancy model with two different detection method. The inspiration for this came from aerial surveys for dugongs which used two observers (Pollock et al. 2006). However, with this model, two detection probabilities are estimated. The results probabilities are used to calculate a "probability of positive detection", which means both assays detected the species. A walk through of this model is presented is presented as part of this chapter.
This next section of text and code are adapted from an online supplement to Erickson et al. (2019). Major changes include fixing typos and switching from the data.table package to the Tidyverse. Additionally, the I reformatted the source code.
Calculating the probability of eDNA occurring given two levels of sampling is not straight forward. Rather than calculating the probability of detecting eDNA, we calculate the probability of non-detecting DNA and then subtract it from 1. The probability of eDNA occurring is a sample is (\theta). The probability of detecting eDNA within a sample given DNA is present within the sample is (p). The probability of not detecting eDNA within a sample give multiple samples (K) may be written as (1 - (1 -p)^K). This calculation for non-detecting eDNA is broken down into two parts. First, the probability of not detecting eDNA because it truly is not in the sample needs to be calculated, which is (1-\theta), for a given sample. Second, the probability of missing eDNA even though the eDNA is present within the sample needs to be calculated as well: (\theta (1 -p)^K).
For the case where only 1 sample is take (i.e., (J =1 )), the probability of not detecting eDNA in any sample of subsample may be written:
(P(y_{j,k}=0 |\theta, p, k) = 1-\theta + \theta (1 - p)^k).
For the case when 2 samples are taken (i.e, (J = 2)), the probability of not detecting eDNA in any of the subsampels may be written as:
(P(y_{j,k}=0 | \theta, p, k) = (1-\theta)^2 + 2 (1-\theta)(\theta (1 - p)^k ) + (\theta (1 - p)^k)^2).
For (J=3), it follows that:
(P(y_{j,k}=0 | \theta, p, k) = (1-\theta)^3 + 3 (1-\theta)^2(\theta (1 - p)^k) + 3 (1-\theta)(\theta (1 - p)^k )^2 + (\theta (1 - p)^k)^3).
In turn, this generalizes to be
(P(y_{j,k} =0|\theta, p,k) = \sum_{j=1}^{J} = {{J}\choose{j}} (1 - \theta)^j (\theta (1 - p)^k)^{J-j}).
The observation and detection probabilities are based upon ranges found in the literature and described in our corresponding manuscript.
The first analysis we run estimates the probability of detecting a species at site. This does not allow us to distinguish different densities. Rather it simply informs if a species is present at a site.
We first write a function that estimates the probability of detecting a species assuming different numbers of samples, J; probabilities of samples containing DNA, theta; different numbers of assay replicates, K; and different detection probabilities for the assay, pDetection (we choose to use pDetection rather than p to have a variable that was easier to find in our code). We derived this relationship in a previous section of the document. We also define a helper functions, combo().
comb = function(n, r){ factorial(n)/(factorial(r) * factorial(n -r ))}
sample_detect_one <-
Vectorize(function(J = 50,
K = 8,
theta = 0.06,
p_detection = 0.3
){
j_index = J:0
prob = sum(
comb( J, j_index) *
(1 - theta) ^ j_index *
(theta * (1 - p_detection)^K ) ^ rev(j_index))
return(1 - prob)
})Next, we explore different sample numbers, (J \in 1, 2, \ldots 120); different assay detection probabilities, (\theta \in {0.05, 0.1, 0.2, 0.4, 0.8, 1.0}); different sample detection probabilities, (p \in {0.05, 0.1, 0.2, 0.4, 0.8, 1.0}); and different numbers of molecular replicates (K \in {2, 4, 8, 16}).
We use the tidyverse packages for storing and manipulating my data.
results <-
expand.grid(J = 1:120,
theta = c(0.05, 0.1, 0.2, 0.4, 0.8, 1.0),
p_detection = c(0.05, 0.1, 0.2, 0.4, 0.8, 1.0),
K = c(2, 4, 8, 16)) %>%
as_tibble()
results <-
results %>%
mutate(prob_detect = sample_detect_one(J = J, K = K,
theta = theta,
p_detection = p_detection),
theta_plot = factor(paste0("theta = ", theta)),
p_detection_plot = factor(paste0("p = ", p_detection)),
K_plot = factor(paste0("K = ", K))) %>%
mutate(K_plot = fct_reorder(K_plot, K))Last, we plot the results using ggplot2 (which is loaded as part of the Tidyverse).
detect_one <-
ggplot(data = results, aes(x = J, y = prob_detect, color = K_plot)) +
geom_line() +
facet_grid( p_detection_plot ~ theta,
labeller = label_bquote(cols = theta == .(theta))) +
theme_minimal() +
ylab("Probabiltiy of detecting species at site") +
xlab("Number of samples per site (J)") +
scale_color_manual("Molecular\nreplicates",
values = c("red", "blue", "black", "seagreen",
"orange", "grey50")) +
scale_x_continuous(breaks = seq(0,125, by = 30)) +
scale_y_continuous(breaks = seq(0,1, by = .5)) +
theme(axis.text.x = element_text(angle = -90, hjust = 0),
panel.spacing = unit(1, "lines"))
print(detect_one)A more interesting question than simply detecting species at a site using eDNA is "Can eDNA detect different levels of sample occurrence at sites?". To do this, we conduct a simulation study.
First, we simulate the probability of a water sample (e.g., grabbing water from the environment, extracting eDNA, etc.) capturing eDNA. To avoid confusion, consider this capturing eDNA even through this statistically and biologically this is a type of sampling. To simulate this, we tell R to draw a sample for each row from binomial distribution. The number of samples taken is (J) and the probability a sample contains eDNA is (\theta).
Second, we sample again from a binomial distribution again to account for imperfect detection of the molecular method.
The number of samples drawn is the number of samples containing eDNA from the previous simulation.
The probability eDNA is detected in a sample based upon the probability of detection, (p), and the number of molecular replicates, (K): prob = 1 - (1 - p)^K.
samples_detect <-
function(
n_sims = 2,
J_in = c(10, 100),
theta_in = c(0.1, 0.9),
K_in = 8,
p_detection_in = c(0.25, 0.75)
){
results <-
expand.grid(simulation = 1:n_sims,
J = J_in, theta = theta_in,
K = K_in, p_detection = p_detection_in) %>%
as_tibble()
results <-
results %>%
mutate(samples_DNA = rbinom(n = nrow(results),
size = J,
prob = theta)) %>%
mutate(samples_detect = rbinom(n = nrow(results),
size = samples_DNA,
prob = 1 - (1 - p_detection)^K) ) %>%
mutate(samples_p_positive = samples_detect / J,
Samples_plot = factor(paste0("J = ", J)),
theta_plot = factor(paste0("theta = ", theta)),
p_detection_plot = factor(paste0("p = ", p_detection)),
K_plot = paste0("K = ", K)) %>%
mutate(Samples_plot = fct_reorder(Samples_plot, J),
theta_plot = fct_reorder(theta_plot, theta),
p_detection_plot = fct_reorder(p_detection_plot, p_detection),
K_plot = fct_reorder(K_plot, K))
return(results)
}We explore different sample numbers, (n \in {5, 10, 20, 40, 80, 120}); different assay detection probabilities, (p \in {0.05, 0.1, 0.2, 0.4, 0.8, 1.0}); and different sample detection probabilities, (\theta \in {0.05, 0.1, 0.2, 0.4, 0.8, 1.0}) by running 500 simulations (The original code ran 4,000, but fewer are run here to speed up the creation of this document).
sample_results <- samples_detect(n_sims = 500,
theta = c(0.05, 0.1, 0.2, 0.4, 0.8, 1.0),
p_detection = c(0.05, 0.1, 0.2, 0.4, 0.8, 1.0),
J = c(5, 10, 20, 40, 80, 120),
K = c(2, 4, 8, 16))
compare_sites <-
ggplot(sample_results,
aes(x = K_plot, y = samples_p_positive, color = factor(theta))) +
geom_boxplot(outlier.size = 0.5) +
facet_grid( Samples_plot ~ p_detection_plot ) +
theme_minimal() +
ylab(expression("Simulated "*theta~bgroup("(",
over("Number of simulated positive samples",
"Total number of simulated samples"), ")"))) +
xlab("Number of molecular replicates (K)") +
scale_color_manual(expression("Generating "* theta),
values = c("red", "blue", "black", "seagreen", "orange", "grey50"))
print(compare_sites)Last, we plot the results using ggplot2
As mentioned in the introduction, a new occupancy model was developed for the eDNA sampling design used by the USFWS. During this section, I will first review the equations for this model. Next, I will go through the code for the model.
We developed equations to match our sampling methods that are described in the main manuscript.
To do this, we adapted equations from Erickson et al. [-@erickson2019sampling] to include two qPCR assays rather than the one.
Broadly, we visited each site multiple times, collected multiple samples samples per site/visit combination, and used two qPCR assays on each sample.
We ran each qPCR assay in replicate.
To begin our equation documentation, we define our different indexing used for replication. The equations include the following levels indexing (denoted using subscripts):
- Sites are indexed using (i), (i \in 1, 2, \ldots N_{\text{sites}}) or denoted using the site name;
- Visits to a site are indexed using (v), (v \in 1, 2, \ldots N_{\text{visit}}) or denoted using the sampling month;
- Collection samples at a specific site (i) during visit (v) are indexed using (j), (j \in 1, 2, \ldots J);
- qPCR replicates from site (i) during visit (v) and collection sample (j) are indexed using (k), (k \in 1, 2, \ldots K).
The possible values for each level are included in a table: Possible values for each index levels.
Table: Possible values for each index level:
| Level | Value Name |
|---|---|
| Site | Boston Bay, Dam 17 spillway, Dam 18 spillway, Iowa River tributary |
| Visit | April, May, November |
| Collection sample | Not named |
| 105 samples for the dam spillways | |
| 114 samples for the other sites | |
| qPCR replicates | Not names |
| 9 replicates for all samples |
The state variables for the model were the detection or non-detection at each level.
For example, if one qPCR replicate detected the target eDNA, than all of the higher levels (i.e., collection sample, and site) would be a detection or 1.
If eDNA was detected within site (i) during visit (v), (\textbf{Z}{i,v}) would be 1, otherwise 0.
If eDNA was detected within a collection sample using either qPCR assay at site (i) during visit (v) within sample (j), (\textbf{A}{i,v, j}) would be 1, otherwise 0.
If eDNA was detected using the ACTM1 qPCR assay at site (i) during visit (v) within sample (j), (\textbf{Y}{i,v, j}^{\text{ACTM1}}) would be the number of positive replicates, 0, 1, 2, \ldots (K).
If eDNA was detected using the ACTM3 qPCR assay at site (i) during visit (v) within sample (j), (\textbf{Y}{i,v, j}^{\text{ACTM3}}) would be the number of positive replicates, 0, 1, 2, \ldots (K).
The last two variables differ from Erickson et al. [-@erickson2019sampling] because they only examined a single assay, which was denoted using a \textbf{Y}.
We based our superscript notation upon Nichols et al. [-@nichols2008multi], who presented two-level occupancy models with multiple observers.
The state variables also had corresponding probabilities. The probability eDNA occurred at site (i) was estimated across visits: (\psi_{i}). The probability of eDNA being captured within samples was estimated for each visit (v) to site (i): (\theta_{i, v}). The probability of detecting eDNA at site (i) during visit (v) using ACTM1 qPCR assay was (p_{i,v}^{\text{ACTM1}}). The probability of detecting eDNA at site (i) during visit (v) using ACTM3 qPCR assay was (p_{i,v}^{\text{ACTM3}}).
The highest level of the model is the relationship between the site-level occurrence probability and detection during visits to a site:
(\textbf{Z}{i,v} &\sim& \text{Bernoulli}( \psi{i}).)
The next level of the model is the relationship between the sample-level capture probability and capture during a visit to a site, which is conditional upon target eDNA being present at the site:
(\textbf{A}{i,v,j} | \textbf{Z}{i,v} &\sim& \text{Bernoulli}(\textbf{Z}{i,v} \theta{i,v}).)
For this level, we also estimated regression coefficients for collection sample temperature and depth on the logit scale: (\mu_{\theta_{i,v}} &=& \text{logit}(\theta_{i,v}) )
(\mu_{\theta_{i,v}} &=& \textbf{w}^{\prime}_{i,v} \alpha)
We specifically estimated an intercept for each site visit, a slope parameter for sample depth, and a slope parameter for sample water temperature. The lowest level of the model is the relationship between the assays detection probabilities and detection within an assay. These use a binomial distribution with (K = 8) draws, which is a generalization of a Bernoulli, which has an implicit single draw (i.e., binomial(\ldots, (K = 1))). This levels is conditional upon eDNA being detected within a sample:
(\textbf{Y}{i,v,j}^{\text{ACTM1}} | \textbf{A}{i,v,j} &\sim& \text{binomial}(\textbf{A}{i,v,j} p{i,v}^{\text{ACTM1}}, K))
(\textbf{Y}{i,v,j}^{\text{ACTM3}} | \textbf{A}{i,v,j} &\sim& \text{binomial}(\textbf{A}{i,v,j} p{i,v}^{\text{ACTM3}}, K).)
For this level, we also estimated regression coefficients for collection sample temperature and depth on the logit scale:
(\mu_{p_{i,v}^{\text{ACTM1}}} &=& \text{logit}(p_{i,v}^{\text{ACTM1}}),)
(\mu_{p_{i,v}^{\text{ACTM1}}} &=& \textbf{v}^{\prime}_{i,v} \delta^{\text{ACTM1}},)
(\mu_{p_{i,v}^{\text{ACTM3}}} &=& \text{logit}(p_{i,v}^{\text{ACTM3}}), \ \text{and} )
(\mu_{p_{i,v}^{\text{ACTM3}}} &=& \textbf{v}^{\prime}_{i,v} \delta^{\text{ACTM3}}.)
We specifically estimated an intercept for each site visit, a slope parameter for sample depth, and a slope parameter for sample water temperature for each qPCR assay.
Using parameter estimates from the above equations, we were able to calculate the probability of detecting eDNA using both ACTM1 and ACTM3 using our sampling design of 8 molecular replicates, (p_{i,v}^{\ast}).
We refer to these positive samples as a ``positive sample''.
These calculations are similar to the work of \citet{pollock2006estimating}.
Also, we assumed that when a sample had one assay detect eDNA, but the other did not, the non-detect was a false negative.
Compared to traditionally occupancy models, this would be the equivalent of one observer (e.g., bird watcher) or detection method (e.g., electrofishing vs gill netting) detecting a species and the other one missing the species.
For each site visit, these calculations require the probability of sample capture as well as the probability of detection for each assay
(p_{i,v}^{\ast}= \theta_{i, v} ( 1 - (1 - p_{i,v}^{\text{ACTM1}})^K ) ( 1 - (1 - p_{i,v}^{\text{ACTM3}})^K ))




