
Commit

Updating the chlorophyll - daily average flow models with the cleaned up flow data
mountaindboz committed Dec 12, 2023
1 parent 62cc55b commit 8dcf510
Showing 4 changed files with 3,555 additions and 3,731 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -8,7 +8,7 @@ layout: default

* [GAM Model using Categorical Predictors - Initial analysis](rtm_chl_gam_categorical_initial.html)
* [Models using Categorical Predictors - Revised analysis](rtm_chl_models_categorical_revised.html)
- * [Models using Flow as a Continuous Predictor - Daily Averages](rtm_chl_models_flow.html)
+ * [Models using Flow as a Continuous Predictor - Daily Averages](rtm_chl_models_flow_daily_avg.html)
* [Models using Flow as a Continuous Predictor - Weekly Averages](rtm_chl_models_flow_weekly_avg.html)
* [Explore relationship between continuous chlorophyll and percent flow pulse water](rtm_chl_analysis_perc_flow_pulse.html)

3,659 changes: 0 additions & 3,659 deletions docs/rtm_chl_models_flow.html

This file was deleted.

3,515 changes: 3,515 additions & 0 deletions docs/rtm_chl_models_flow_daily_avg.html


@@ -138,9 +138,9 @@ df_chla_c2_lag <- df_chla_c2 %>%

# Plots

- Let's explore the data with some plots. First, lets plot the data in scatterplots of chlorophyll and flow facetted by Station and grouping all years together.
+ Let's explore the data with some plots. First, let's plot the data in scatter plots of chlorophyll and flow faceted by Station and grouping all years together.

- ```{r chla scatterplot all yrs, message = FALSE, fig.height = 6, fig.width = 6}
+ ```{r chla scatterplot all yrs, message = FALSE, fig.height = 6}
df_chla_c2 %>%
ggplot(aes(x = Flow, y = Chla)) +
geom_point() +
@@ -153,7 +153,7 @@ At first glance, I'm not sure how well flow is going to be able to predict chlor

Let's break these scatterplots apart by year to see how these patterns vary annually.

- ```{r chla scatterplot facet yrs, message = FALSE, fig.height = 8, fig.width = 8}
+ ```{r chla scatterplot facet yrs, message = FALSE, fig.height = 8, fig.width = 8.5}
df_chla_c2 %>%
ggplot(aes(x = Flow, y = Chla)) +
geom_point() +
@@ -272,7 +272,7 @@ anova(
)
```

- It looks like the AR(3) model has the best fit according to the AIC values. However, BIC prefers the AR(1) model. Let's take a closer look at the AR(1) model.
+ It looks like both AIC and BIC prefer the AR(1) model. Let's take a closer look at the AR(1) model.

## AR(1) Model
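
The fitting code for this model sits in a collapsed part of the diff. As a rough sketch only, an AR(1) GAMM like `m_gamm3_ar1` is typically specified with `mgcv::gamm()` and an `nlme::corARMA()` correlation structure; the formula, the `DOY` smooth term, and the grouping below are assumptions, not the committed code.

```r
# Hypothetical sketch (not the committed code): 3-way interaction GAM with an AR(1)
# correlation structure on the residuals, grouped by station and year.
library(mgcv)   # gamm()
library(nlme)   # corARMA()

m_gamm3_ar1 <- gamm(
  Chla_log ~ Year_fct * Flow * StationCode + s(DOY),  # DOY = day of year (assumed name)
  data = df_chla_c2,
  correlation = corARMA(form = ~ 1 | StationCode/Year_fct, p = 1),
  method = "REML"
)
summary(m_gamm3_ar1$gam)  # the $gam component holds the smooth-model part of the fit
```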

@@ -313,7 +313,7 @@ plt_gamm3_ar1_obs_fit

Let's look at each Station-Year combination separately.

- ```{r gam3 ar1 obs vs fit facet sta yr, fig.height = 8, fig.width = 8}
+ ```{r gam3 ar1 obs vs fit facet sta yr, fig.height = 8, fig.width = 8.5}
plt_gamm3_ar1_obs_fit +
facet_wrap(
vars(StationCode, Year_fct),
@@ -390,7 +390,7 @@ acf(residuals(m_lm3_lag3))
Box.test(residuals(m_lm3_lag3), lag = 20, type = 'Ljung-Box')
```

- The model with 3 lag terms looks to have slightly better ACF and Box-Ljung test results than the other models. Let's compare the 3 models using AIC and BIC to see which model those prefer.
+ The model with 3 lag terms looks like it has better ACF and Box-Ljung test results than the other models. Let's compare the 3 models using AIC and BIC to see which model they prefer.
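
The definitions of `m_lm3_lag1`, `m_lm3_lag2`, and `m_lm3_lag3` are in a collapsed part of the diff. A minimal sketch of this kind of lagged-term linear model, assuming the 3-way interaction structure discussed in this section (the exact formulas are assumptions):

```r
# Hypothetical sketch (not the committed code): linear models with 3-way interactions
# plus 1-3 lagged chlorophyll terms to soak up residual autocorrelation.
m_lm3_lag1 <- lm(Chla_log ~ Year_fct * Flow * StationCode + lag1, data = df_chla_c2_lag)
m_lm3_lag2 <- lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2, data = df_chla_c2_lag)
m_lm3_lag3 <- lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2 + lag3, data = df_chla_c2_lag)
```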

#### Compare models

@@ -399,30 +399,30 @@ AIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
BIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
```

- Both AIC and BIC prefer the model with 1 lag term. Let's take a closer look at that one.
+ Both AIC and BIC prefer the model with 3 lag terms. Let's take a closer look at that one.

- ### Lag 1 model summary
+ ### Lag 3 model summary

- ```{r lm3 lag1 summary and diag}
- summary(m_lm3_lag1)
+ ```{r lm3 lag3 summary and diag}
+ summary(m_lm3_lag3)
df_chla_c2_lag %>%
-   drop_na(Chla_log, lag1) %>%
-   plot_lm_diag(Chla_log, m_lm3_lag1)
+   drop_na(Chla_log, lag1, lag2, lag3) %>%
+   plot_lm_diag(Chla_log, m_lm3_lag3)
- shapiro.test(residuals(m_lm3_lag1))
+ shapiro.test(residuals(m_lm3_lag3))
```

The residuals deviate from a normal distribution, particularly at both tails. I'm not so sure about using this model. Let's take a closer look at how the back-transformed fitted values from the model match the observed values.

- ```{r lm3 lag1 obs vs fit}
- df_lm3_lag1_fit <- df_chla_c2_lag %>%
-   drop_na(Chla, lag1) %>%
-   mutate(Fitted = as_tibble(predict(m_lm3_lag1, interval = "confidence"))) %>%
+ ```{r lm3 lag3 obs vs fit}
+ df_lm3_lag3_fit <- df_chla_c2_lag %>%
+   drop_na(Chla, lag1, lag2, lag3) %>%
+   mutate(Fitted = as_tibble(predict(m_lm3_lag3, interval = "confidence"))) %>%
unnest_wider(Fitted) %>%
mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
- plt_lm3_lag1_obs_fit <- df_lm3_lag1_fit %>%
+ plt_lm3_lag3_obs_fit <- df_lm3_lag3_fit %>%
ggplot(aes(x = fit, y = Chla)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
@@ -432,13 +432,13 @@ plt_lm3_lag1_obs_fit <- df_lm3_lag1_fit %>%
y = "Observed Values"
)
- plt_lm3_lag1_obs_fit
+ plt_lm3_lag3_obs_fit
```

Let's look at each Station-Year combination separately.

- ```{r lm3 lag1 obs vs fit facet sta yr, fig.height = 8, fig.width = 8}
- plt_lm3_lag1_obs_fit +
+ ```{r lm3 lag1 obs vs fit facet sta yr, fig.height = 8, fig.width = 8.5}
+ plt_lm3_lag3_obs_fit +
facet_wrap(
vars(StationCode, Year_fct),
ncol = 5,
@@ -449,22 +449,6 @@ plt_lm3_lag1_obs_fit +

The red lines are the 1:1 ratio between fitted and observed values, and we would expect most of the points to be near this line if the model has a good fit to the observed data. The fitted and observed values appear to match pretty well at the lower end of the range of values, but this deteriorates at the mid- and higher-range values. This pattern holds for some of the separate Station-Year combinations. I'm not sure this linear model is a good candidate to use for our analysis.

- ### Lag 3 model summary
-
- Let's take a closer look at the model with 3 lag terms since that had the least autocorrelation.
-
- ```{r lm3 lag3 summary and diag}
- summary(m_lm3_lag3)
- df_chla_c2_lag %>%
-   drop_na(Chla_log, lag1, lag2, lag3) %>%
-   plot_lm_diag(Chla_log, m_lm3_lag3)
- shapiro.test(residuals(m_lm3_lag3))
- ```
-
- The diagnostic plots show that the lag 3 model has similar problems as the lag 1 model, so this doesn't look like a good candidate for our analysis as well.

# GAM Model with 2-way interactions

Because the models using 3-way interactions don't seem to fit the data well, we will attempt to fit a generalized additive model (GAM) to the data set using all two-way interactions between Year, Daily Average Flow, and Station, with a smooth term for day of year to account for seasonality. Initially, we'll run the GAM without accounting for serial autocorrelation.
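
The chunk that fits this model is collapsed in the diff. A minimal sketch of a two-way interaction GAM of this kind in `mgcv`, assuming a log-transformed response and a day-of-year column (`DOY`), both of which are illustrative names here:

```r
# Hypothetical sketch (not the committed code): GAM with all two-way interactions
# between Year, daily average flow, and Station, plus a seasonal smooth on day of year.
library(mgcv)

m_gam2 <- gam(
  Chla_log ~ (Year_fct + Flow + StationCode)^2 +  # main effects + all two-way interactions
    s(DOY, bs = "cc"),                            # cyclic smooth for day of year (assumed name)
  data = df_chla_c2,
  method = "REML"
)
summary(m_gam2)
```

In an R formula, `(Year_fct + Flow + StationCode)^2` expands to the three main effects plus all two-way interactions, which matches the model structure described above.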
@@ -659,7 +643,7 @@ acf(residuals(m_lm2_lag3))
Box.test(residuals(m_lm2_lag3), lag = 20, type = 'Ljung-Box')
```

- The model with 3 lag terms looks to have slightly better ACF and Box-Ljung test results than the other models. Let's compare the 3 models using AIC and BIC to see which model those prefer.
+ The model with 3 lag terms looks like it has better ACF and Box-Ljung test results than the other models. Let's compare the 3 models using AIC and BIC to see which model they prefer.

#### Compare models

@@ -668,30 +652,30 @@ AIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
BIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
```

- Both AIC and BIC prefer the model with 1 lag term. Let's take a closer look at that one.
+ AIC prefers the model with 3 lag terms while BIC slightly prefers the model with 1 lag term. Let's take a closer look at the model with 3 lag terms since it had better ACF and Box-Ljung test results.

- ### Lag 1 model summary
+ ### Lag 3 model summary

- ```{r lm2 lag1 summary and diag}
- summary(m_lm2_lag1)
+ ```{r lm2 lag3 summary and diag}
+ summary(m_lm2_lag3)
df_chla_c2_lag %>%
-   drop_na(Chla_log, lag1) %>%
-   plot_lm_diag(Chla_log, m_lm2_lag1)
+   drop_na(Chla_log, lag1, lag2, lag3) %>%
+   plot_lm_diag(Chla_log, m_lm2_lag3)
- shapiro.test(residuals(m_lm2_lag1))
+ shapiro.test(residuals(m_lm2_lag3))
```

As with the model with 3-way interactions, the residuals deviate from a normal distribution, particularly at both tails. I'm not so sure about using this model. Let's take a closer look at how the back-transformed fitted values from the model match the observed values.

- ```{r lm2 lag1 obs vs fit}
- df_lm2_lag1_fit <- df_chla_c2_lag %>%
-   drop_na(Chla, lag1) %>%
-   mutate(Fitted = as_tibble(predict(m_lm2_lag1, interval = "confidence"))) %>%
+ ```{r lm2 lag3 obs vs fit}
+ df_lm2_lag3_fit <- df_chla_c2_lag %>%
+   drop_na(Chla, lag1, lag2, lag3) %>%
+   mutate(Fitted = as_tibble(predict(m_lm2_lag3, interval = "confidence"))) %>%
unnest_wider(Fitted) %>%
mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
- plt_lm2_lag1_obs_fit <- df_lm2_lag1_fit %>%
+ plt_lm2_lag3_obs_fit <- df_lm2_lag3_fit %>%
ggplot(aes(x = fit, y = Chla)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
@@ -701,35 +685,19 @@ plt_lm2_lag1_obs_fit <- df_lm2_lag1_fit %>%
y = "Observed Values"
)
- plt_lm2_lag1_obs_fit
+ plt_lm2_lag3_obs_fit
```

Let's look at each Station and Year separately.

- ```{r lm2 lag1 obs vs fit facet sta yr}
- plt_lm2_lag1_obs_fit + facet_wrap(vars(StationCode), scales = "free")
+ ```{r lm2 lag3 obs vs fit facet sta yr}
+ plt_lm2_lag3_obs_fit + facet_wrap(vars(StationCode), scales = "free")
- plt_lm2_lag1_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
+ plt_lm2_lag3_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
```

The red lines are the 1:1 ratio between fitted and observed values, and we would expect most of the points to be near this line if the model has a good fit to the observed data. The fitted and observed values appear to match pretty well at the lower end of the range of values, but this deteriorates at the mid- and higher-range values. This pattern holds for some of the individual Stations and Years. I'm not sure this linear model is a good candidate to use for our analysis.

- ### Lag 3 model summary
-
- Let's take a closer look at the model with 3 lag terms since that had the least autocorrelation.
-
- ```{r lm2 lag3 summary and diag}
- summary(m_lm2_lag3)
- df_chla_c2_lag %>%
-   drop_na(Chla_log, lag1, lag2, lag3) %>%
-   plot_lm_diag(Chla_log, m_lm2_lag3)
- shapiro.test(residuals(m_lm2_lag3))
- ```
-
- The diagnostic plots show that the lag 3 model has similar problems as the lag 1 model, so this doesn't look like a good candidate for our analysis as well.

# Linear Model with single station - STTD

Since none of the models we tried so far seemed to fit the observed data very well, it may be interesting to run separate models for each of the four stations - RD22, STTD, LIB, RVB - to see if that helps improve model fit. We'll try STTD as a test case.
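
The station-specific fitting code is collapsed in the diff. A minimal sketch of the general approach, assuming the same lagged-term setup as the earlier models (the formula and the `df_chla_sttd` name are illustrative, not the committed code):

```r
# Hypothetical sketch (not the committed code): subset to STTD and refit the
# lagged-term linear model for that single station (the Station term drops out).
library(dplyr)

df_chla_sttd <- df_chla_c2_lag %>% filter(StationCode == "STTD")

m_lm2_sttd_lag3 <- lm(
  Chla_log ~ Year_fct * Flow + lag1 + lag2 + lag3,
  data = df_chla_sttd
)
```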
@@ -799,7 +767,7 @@ acf(residuals(m_lm2_sttd_lag3))
Box.test(residuals(m_lm2_sttd_lag3), lag = 20, type = 'Ljung-Box')
```

- The model with 2 lag terms seems to take care of the autocorrelation. Let's compare the 3 models using AIC and BIC to see which model those prefer.
+ The models with 2 and 3 lag terms seem to take care of the autocorrelation. Let's compare the 3 models using AIC and BIC to see which model they prefer.

#### Compare models
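
The comparison chunk is collapsed in the diff; following the pattern of the earlier comparisons, it presumably looks something like the sketch below, where `m_lm2_sttd_lag1` and `m_lm2_sttd_lag2` are assumed names for the 1- and 2-lag STTD fits.

```r
# Hypothetical sketch mirroring the earlier AIC/BIC comparison chunks.
AIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
BIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
```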

