-
Notifications
You must be signed in to change notification settings - Fork 38
/
Exercise_11_StateSpace.Rmd
87 lines (51 loc) · 11.2 KB
/
Exercise_11_StateSpace.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
title: 'Exercise 11: State Space'
author: "EE 509"
output: html_document
---
The objective of this lab is to explore state-space time-series models. State-space models are a flexible framework that treats the true time-series as a latent variable, which separates the process model (which describes the transition from one time to the next) and the data model (which describes the observation error). This separation allows us to treat the observed data for each time as conditionally independent from every other observation given the underlying latent state variable. This separation also allows us to model the data and process on different time scales and to easily accommodate missing data or irregular sampling, which is difficult in traditional time-series models. The basic state-space framework presented here is easily extensible to multiple data sets, multiple interacting time series, and more complex (e.g. multivariate) process models.
##Case Study: Bialowieza moose
In this lab we will be working with data on the observed density of moose (Alces alces) in the Bialowieza Primeval Forest, which straddles the border of Poland and Belarus (Jedrzejewska et al 1997). Our data consists of annual censuses collected by a combination of snow tracking, hunting, drive censuses, and counts at baiting sites. Data have been corrected for known biases but some amount of observation error likely remains. The data set is fairly long by ecological standards—a 48-year data set spanning 1946-1993—however, there are gaps of missing data from 1963-1966 and in 1989. Data are available in the text file “alcesdata.txt”. For further information, this analysis closely mirrors the example from Clark and Bjornstad (2004).
In this lab we will confront this data with the two simplest discrete-time population models in ecology, exponential growth and logistic growth. By comparing these two models we will be able to determine whether the growth of this population is density-dependent or density-independent. We will also be able to estimate the intrinsic rate of population growth and the carrying capacity for the species, both of which are important for management and for understanding the basic biology of the species. Finally, we will be able to partition the variability in the time-series into observation error vs. process error, which in this case is the inter-annual variability in population growth. Knowing the inter-annual variability in population growth helps us understand to what extent population dynamics are driven by internal biological factors (autogenic) vs. external environmental factors (exogenous). While we cannot attribute all of the residual process variance to exogenous factors because there are always biotic interactions that are not accounted for, a large amount of process error is often suggestive of environmental factors.
The two process models we will use for this lab are the discrete-time exponential growth model
$$N_{t+1} = N_t e^{r+\epsilon_t}$$
and the Ricker model, which is the discrete-time equivalent to logistic growth
$$N_{t+1} = N_t e^{r (1 - N_t / K) + \epsilon_t}$$
In both models, $N_t$ is the population density at time t, and r is the population growth rate (per-capita in/decrease in population from time t to time t+1). In the second model, $K$ is the carrying capacity of the habitat (the maximum sustainable population size). Finally, for both models we will assume that the process error, $\epsilon_t$, is normally distributed and multiplicative:
$$\epsilon_t \sim N(0,\sigma^2)$$
If we take the log of both sides and express the process model in terms of $X = ln(N)$, the two models then become:
$X_{t+1} = X_t + r + \epsilon_t$ Exponential
$X_{t+1} = X_t + r(1-N_t/K) + \epsilon_t$ Ricker
Next, lets assume that the observation error on the population densities is lognormally distributed. This is a sensible choice because we know that the observed population density is bound at 0 (we can't have negative densities), and can reasonably expect the absolute error to increase with population density. If we define $Y = ln(observed density)$ then this is equivalent to saying that Y has Normal error with a mean of X:
$$Y_t \sim N(X_t,\tau^2)$$
![Figure 1](images/Lab10.1.jpg)
Figure 1: Graph representation of the state space model (priors not shown). Under parameters, b represents the process model coefficients (r for the exponential, r and K for the Ricker), $\tau^2$ is the observation error, and $\sigma^2$ is the process error.. As defined above, X and Y are the latent “true” density and the observed values, respectively. The letters A, B, and C indicate the three terms that contribute to the posterior for Xt.
Finally, we will want to specify priors for the model parameters:
$r \sim N(r_0,V_r)$ ## prior in intrinsic growth rate
$K \sim LN(K_0,V_k)$ ## prior on carrying capacity
$\tau^2 \sim IG(t1,t2)$ ## prior on PROCESS variance
$\sigma^2 \sim IG(s1,s2)$ ## prior on Observation error
$X_1 \sim N(\mu_0,V_0)$ ## prior on initial time point
The implementation of this model in JAGS is fairly straightforward, and thus the specification of the data model (observation error) and the priors are left to you. The process model is not complex either:
```
### Process model for exponential growth
for(t in 2:N) {
mu[t] <- x[t-1] + r
x[t] ~ dnorm(mu[t],sigma)
}
```
In this code you will notice a few things that differ slightly from models you've implemented in the past. First is that there is error, $\sigma$, in the process model itself. This part is more familiar if you realize the same model could easily be rewritten as having a random effect on r. Second is that in the process model, the calculation of the X for the current time step depends upon the previous time step, which occurs because our process model describes the change in the state from one time step to the next. Because of this you will note that there is one “trick” to the state-space model that has to do with the indexing of the variables. While there are n datapoints, and thus up to n comparisons between the latent X and the observations Y, there are only n-1 transitions between time steps. This means that the process model only shows up n-1 times (and thus is indexed from 2 to N instead of 1 to N). If we had looped starting from 1 this would have caused a reference to X[0], which doesn't exist. By contrast, the data model shows up n times and thus should be indexed from 1 to N in a *different loop* from the process model. You'll also note that in the list of priors, in addition to putting priors on sigma and tau we need to put a prior on this first time point, X[1], which falls outside the process model loop.
The second major difference in the state-space model is in how the CI is generated. Up to now we've been using the modeled parameters to simulate the CI/PI after the fact. Here, since the latent variable X is inferred as part of the model, you'll want to make sure that's part of the JAGS output and then use that latent X estimate itself to draw your CI. As in the last lab, `grep` can be handy for finding the columns in the output that contain X. That said, because the Y_t is used to estimate the X_t, you'll find that even a model with no process (e.g. random walk) will generate a predicted (X_t) vs observed (Y_t) plot that looks pretty good. In practice, a more useful assessment of a state space model's performance is to perform a one-step-ahead prediction (or, more generally, n-step-ahead prediction). Like with our previous model CI/PI, these predictive diagnositics *are* frequently generated after the fact in R.
Another thing that is different in this lab from what you have done before is the presence of NA values in the data. The NA character is used to indicate missing data, and both R and JAGS are capable of handling this. Since these values are missing from the observation model linking the latent variable, X, to the observation, Y, JAGS will automatically generate predictive estimates for them. Make sure to specify Y as one of your tracked variables if you want to assess the predicted values!
Finally, unlike many of your previous labs that converge quickly and are insensitive to the choice of prior and initial conditions, you will want to give thought to both the priors and initial conditions on parameters and **state variables** in this lab otherwise your chains may not converge over any reasonable timescale. Also remember that initial conditions can (and in many cases should) be chosen based on the data, but priors can't just be 'tuned' to give the 'right' answer (i.e. clamping down on a prior without reference to external information is not the correct way to solve convergence problems). In particular, it can be very helpful to **initialize the the latent X's** at values near or at the observed Y's.
## Lab Report Tasks
Lab Report Task 1
1. Write out and run the JAGS code for the exponential growth model. Include the JAGS code and standard MCMC diagnostics in your lab report. Also include the following plots:
a) Plots of the time series data on both the LOG and LINEAR scales that include the model mean and credible intervals. Note: the linear scale plots are just a transform of the outputs from the log-scale model, NOT a different model. Also remember that because of Jensen's inequality you shouldn't transform the summary statistics, but you CAN transform the MCMC samples themselves
b) Density plots for the predicted values for the missing data. Note: make sure to track these variables as part of your coda.samples variable list.
c) Density plots of the intrinsic growth rate, the observation error variance, and the process model error variance
2. Modify the exponential growth process model in the JAGS code to instead be the Ricker growth model. Rerun including your JAGS code and the same figures as in part 1 plus plots for both the prior and posterior density on the carrying capacity.
Hint: when implementing the Ricker model, don't forget that N and X are on different scales.
3. Construct a summary table that includes the parameters in both models, their 95% CI, and model selection scores (your choice between DIC, WAIC, or predictive loss).
4. Briefly give an interpretation of your results. Be sure to comment on which model is a better fit, what you can interpret about the importance of density dependence in regulating this population, and whether the population is at its carrying capacity. What can you infer about the relative importance of autogenic (internal) vs exogenous (external) factors in regulating this population?
5. **Extra Credit**: An alternative to fitting this model as a time-series would have been to fit the data to the analytical solution for the model (e.g. for the exponential growth model, this would have been $N[1]*exp[r*t]$ ), which would have ignored the process varibility. To approximate the difference in the resulting prediction, generate a time-series plot for the best fitting model (Exponential or Ricker) that contains: (1) the posterior model from your Bayesian fit (mean and CI on X), (2) the CI from the process model using the posterior parameter distributions (r, K, N[1]) but ignoring process error, (3) the observed data. Comment on the difference between these lines, and between the lines and the data.