diff --git a/docs/index.md b/docs/index.md index c9357ea..c51d6d3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,7 +8,7 @@ layout: default * [GAM Model using Categorical Predictors - Initial analysis](rtm_chl_gam_categorical_initial.html) * [Models using Categorical Predictors - Revised analysis](rtm_chl_models_categorical_revised.html) -* [Models using Flow as a Continuous Predictor - Daily Averages](rtm_chl_models_flow.html) +* [Models using Flow as a Continuous Predictor - Daily Averages](rtm_chl_models_flow_daily_avg.html) * [Models using Flow as a Continuous Predictor - Weekly Averages](rtm_chl_models_flow_weekly_avg.html) * [Explore relationship between continuous chlorophyll and percent flow pulse water](rtm_chl_analysis_perc_flow_pulse.html) diff --git a/docs/rtm_chl_models_flow.html b/docs/rtm_chl_models_flow.html deleted file mode 100644 index 6fd0ba6..0000000 --- a/docs/rtm_chl_models_flow.html +++ /dev/null @@ -1,3659 +0,0 @@ - - - - -
- - - - - - - - - - -Explore and analyze the continuous chlorophyll data to be included in -the NDFS synthesis manuscript. We will attempt to fit various models to -the data set using daily average flow as a continuous predictor which -replaces the categorical predictor - flow action period - in the -original analysis. These models will only include representative -stations for 4 habitat types - upstream (RD22), lower Yolo Bypass -(STTD), Cache Slough complex (LIB), and downstream (RVB).
-# Load packages
-library(tidyverse)
-library(scales)
-library(knitr)
-library(mgcv)
-library(car)
-library(gratia)
-library(here)
-library(conflicted)
-
-# Source functions
-source(here("manuscript_synthesis/src/global_functions.R"))
-
-# Declare package conflict preferences
-conflicts_prefer(dplyr::filter(), dplyr::lag())
-Display current versions of R and packages used for this -analysis:
-devtools::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
-## setting value
-## version R version 4.2.3 (2023-03-15 ucrt)
-## os Windows 10 x64 (build 19044)
-## system x86_64, mingw32
-## ui RTerm
-## language (EN)
-## collate English_United States.utf8
-## ctype English_United States.utf8
-## tz America/Los_Angeles
-## date 2023-11-21
-## pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
-##
-## ─ Packages ───────────────────────────────────────────────────────────────────
-## package * version date (UTC) lib source
-## abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
-## bslib 0.4.2 2022-12-16 [1] CRAN (R 4.2.2)
-## cachem 1.0.8 2023-05-01 [1] CRAN (R 4.2.3)
-## callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.2)
-## car * 3.1-2 2023-03-30 [1] CRAN (R 4.2.3)
-## carData * 3.0-5 2022-01-06 [1] CRAN (R 4.2.1)
-## cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.3)
-## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.2)
-## conflicted * 1.2.0 2023-02-01 [1] CRAN (R 4.2.2)
-## crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.1)
-## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.1)
-## digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.3)
-## dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.2.3)
-## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
-## evaluate 0.21 2023-05-05 [1] CRAN (R 4.2.3)
-## fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.2)
-## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.2)
-## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.2)
-## fs * 1.6.3 2023-07-20 [1] CRAN (R 4.2.3)
-## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
-## ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.2.3)
-## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
-## gratia * 0.8.1 2023-02-02 [1] CRAN (R 4.2.2)
-## gtable 0.3.4 2023-08-21 [1] CRAN (R 4.2.3)
-## here * 1.0.1 2020-12-13 [1] CRAN (R 4.2.1)
-## hms 1.1.3 2023-03-21 [1] CRAN (R 4.2.3)
-## htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.3)
-## htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.2.3)
-## httpuv 1.6.9 2023-02-14 [1] CRAN (R 4.2.2)
-## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.1)
-## jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.2.3)
-## knitr * 1.42 2023-01-25 [1] CRAN (R 4.2.2)
-## later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
-## lattice 0.21-8 2023-04-05 [1] CRAN (R 4.2.3)
-## lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.1)
-## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.2.3)
-## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
-## Matrix 1.5-4 2023-04-04 [1] CRAN (R 4.2.3)
-## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
-## mgcv * 1.8-42 2023-03-02 [1] CRAN (R 4.2.2)
-## mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
-## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
-## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
-## mvnfast 0.2.8 2023-02-23 [1] CRAN (R 4.2.2)
-## nlme * 3.1-162 2023-01-31 [2] CRAN (R 4.2.3)
-## patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.1)
-## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.3)
-## pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.2.3)
-## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
-## pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.2.3)
-## prettyunits 1.2.0 2023-09-24 [1] CRAN (R 4.2.3)
-## processx 3.8.2 2023-06-30 [1] CRAN (R 4.2.3)
-## profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.1)
-## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
-## ps 1.7.5 2023-04-18 [1] CRAN (R 4.2.3)
-## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.2.3)
-## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
-## Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.2.3)
-## readr * 2.1.4 2023-02-10 [1] CRAN (R 4.2.2)
-## remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.1)
-## rlang * 1.1.1 2023-04-28 [1] CRAN (R 4.2.3)
-## rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.2.3)
-## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.1)
-## rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
-## sass 0.4.6 2023-05-03 [1] CRAN (R 4.2.3)
-## scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.1)
-## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
-## shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.2)
-## stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.2)
-## stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.2)
-## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.3)
-## tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.2)
-## tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.1)
-## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.2.2)
-## timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.2)
-## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.2.3)
-## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.1)
-## usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
-## utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.2)
-## vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.3)
-## withr 2.5.1 2023-09-26 [1] CRAN (R 4.2.3)
-## xfun 0.39 2023-04-20 [1] CRAN (R 4.2.3)
-## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
-## yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
-##
-## [1] C:/R/win-library/4.2
-## [2] C:/Program Files/R/R-4.2.3/library
-##
-## ──────────────────────────────────────────────────────────────────────────────
-# Define file path for processed data
-fp_data <- here("manuscript_synthesis/data/processed")
-
-# Import daily average chlorophyll data
-df_chla <- readRDS(file.path(fp_data, "chla_wt_daily_avg_2013-2019.rds")) %>% select(-WaterTemp)
-
-# Import daily average flow data
-df_flow <- readRDS(file.path(fp_data, "flow_daily_avg_2013-2019.rds"))
-# Create a vector for the factor order of StationCode
-sta_order <- c("RD22", "STTD", "LIB", "RVB")
-
-# We will use LIS flow data as a proxy for STTD
-df_flow_c <- df_flow %>% mutate(StationCode = if_else(StationCode == "LIS", "STTD", StationCode))
-
-# Prepare chlorophyll and flow data for exploration and analysis
-df_chla_c1 <- df_chla %>%
- # Filter to only include representative stations for 4 habitat types - RD22, STTD, LIB, RVB
- filter(StationCode %in% sta_order) %>%
- # Join flow data to chlorophyll data
- left_join(df_flow_c, by = join_by(StationCode, Date)) %>%
- # Remove all NA flow values
- drop_na(Flow) %>%
- mutate(
- # Scale and log transform chlorophyll values
- Chla_log = log(Chla * 1000),
- # Add Year and DOY variables
- Year = year(Date),
- DOY = yday(Date),
- # Apply factor order to StationCode
- StationCode = factor(StationCode, levels = sta_order),
- # Add a column for Year as a factor for the model
- Year_fct = factor(Year)
- ) %>%
- arrange(StationCode, Date)
-df_chla_c1 %>%
- summarize(
- min_date = min(Date),
- max_date = max(Date),
- num_samples = n(),
- .by = c(StationCode, Year)
- ) %>%
- arrange(StationCode, Year) %>%
- kable()
-StationCode | -Year | -min_date | -max_date | -num_samples | -
---|---|---|---|---|
RD22 | -2014 | -2014-09-30 | -2014-11-08 | -40 | -
RD22 | -2015 | -2015-07-24 | -2015-11-10 | -110 | -
RD22 | -2016 | -2016-06-23 | -2016-09-16 | -86 | -
RD22 | -2017 | -2017-07-14 | -2017-11-03 | -113 | -
RD22 | -2018 | -2018-07-13 | -2018-11-11 | -122 | -
RD22 | -2019 | -2019-07-11 | -2019-11-06 | -119 | -
STTD | -2013 | -2013-08-15 | -2013-11-04 | -78 | -
STTD | -2014 | -2014-07-25 | -2014-10-19 | -87 | -
STTD | -2015 | -2015-07-27 | -2015-11-16 | -108 | -
STTD | -2016 | -2016-06-23 | -2016-09-16 | -83 | -
STTD | -2017 | -2017-07-14 | -2017-09-25 | -56 | -
STTD | -2018 | -2018-07-20 | -2018-10-11 | -84 | -
STTD | -2019 | -2019-07-26 | -2019-11-06 | -100 | -
LIB | -2014 | -2014-10-01 | -2014-11-08 | -36 | -
LIB | -2015 | -2015-07-06 | -2015-11-16 | -125 | -
LIB | -2016 | -2016-05-29 | -2016-09-16 | -101 | -
LIB | -2017 | -2017-07-14 | -2017-11-02 | -96 | -
LIB | -2018 | -2018-08-14 | -2018-10-31 | -36 | -
LIB | -2019 | -2019-07-11 | -2019-10-25 | -31 | -
RVB | -2013 | -2013-07-07 | -2013-11-17 | -134 | -
RVB | -2014 | -2014-07-25 | -2014-11-08 | -105 | -
RVB | -2015 | -2015-07-06 | -2015-11-16 | -134 | -
RVB | -2016 | -2016-05-29 | -2016-09-16 | -80 | -
RVB | -2017 | -2017-07-14 | -2017-11-03 | -113 | -
RVB | -2018 | -2018-07-13 | -2018-11-11 | -122 | -
RVB | -2019 | -2019-07-11 | -2019-11-06 | -102 | -
Looking at the sample counts and date ranges, we’ll only include -years 2015-2019 for the analysis.
-df_chla_c2 <- df_chla_c1 %>%
- filter(Year %in% 2015:2019) %>%
- mutate(Year_fct = fct_drop(Year_fct))
-
-# Create another dataframe that has up to 3 lag variables for chlorophyll to be
- # used in the linear models
-df_chla_c2_lag <- df_chla_c2 %>%
- # Fill in missing days for each StationCode-Year combination
- group_by(StationCode, Year) %>%
- complete(Date = seq.Date(min(Date), max(Date), by = "1 day")) %>%
- # Create lag variables of scaled log transformed chlorophyll values
- mutate(
- lag1 = lag(Chla_log),
- lag2 = lag(Chla_log, 2),
- lag3 = lag(Chla_log, 3)
- ) %>%
- ungroup()
-Let’s explore the data with some plots. First, lets plot the data in -scatterplots of chlorophyll and flow facetted by Station and grouping -all years together.
-df_chla_c2 %>%
- ggplot(aes(x = Flow, y = Chla)) +
- geom_point() +
- geom_smooth(formula = "y ~ x") +
- facet_wrap(vars(StationCode), scales = "free") +
- theme_bw()
-
-At first glance, I’m not sure how well flow is going to be able to -predict chlorophyll concentrations. At the furthest upstream station - -RD22 - chlorophyll appears to be highest at the lowest flows, but the -variation is at its maximum at the lowest flows. There may be some -dilution effect going on here at the higher flows. At STTD, there does -seem to be a modest increase in chlorophyll concentrations at the -mid-range flows. This pattern is even more obvious at LIB. There appears -to be no effect of flow on chlorophyll at RVB, but the range of -chlorophyll concentrations is narrow at this station (between 0 and -5).
-Let’s break these scatterplots apart by year to see how these -patterns vary annually.
-df_chla_c2 %>%
- ggplot(aes(x = Flow, y = Chla)) +
- geom_point() +
- geom_smooth(formula = "y ~ x") +
- facet_wrap(
- vars(StationCode, Year),
- ncol = 5,
- scales = "free",
- labeller = labeller(.multi_line = FALSE)
- ) +
- theme_bw()
-
-The patterns appear to vary annually at each station, which may -justify using a 3-way interaction.
-First, we will attempt to fit a generalized additive model (GAM) to -the data set to help account for seasonality in the data. We’ll try -running a GAM using a three-way interaction between Year, Daily Average -Flow, and Station, and a smooth term for day of year to account for -seasonality. Initially, we’ll run the GAM without accounting for serial -autocorrelation.
-m_gam3 <- gam(
- Chla_log ~ Year_fct * Flow * StationCode + s(DOY),
- data = df_chla_c2,
- method = "REML"
-)
-Lets look at the model summary and diagnostics:
-summary(m_gam3)
-##
-## Family: gaussian
-## Link function: identity
-##
-## Formula:
-## Chla_log ~ Year_fct * Flow * StationCode + s(DOY)
-##
-## Parametric coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.745e+00 4.283e-02 227.546 < 2e-16 ***
-## Year_fct2016 -8.341e-01 6.570e-02 -12.696 < 2e-16 ***
-## Year_fct2017 -9.211e-01 9.375e-02 -9.825 < 2e-16 ***
-## Year_fct2018 -3.811e-01 5.645e-02 -6.750 1.96e-11 ***
-## Year_fct2019 -1.553e-01 5.595e-02 -2.775 0.00557 **
-## Flow -1.869e-03 1.825e-04 -10.240 < 2e-16 ***
-## StationCodeSTTD -1.075e+00 5.393e-02 -19.926 < 2e-16 ***
-## StationCodeLIB -2.137e+00 1.437e-01 -14.875 < 2e-16 ***
-## StationCodeRVB -2.195e+00 1.005e-01 -21.842 < 2e-16 ***
-## Year_fct2016:Flow 2.209e-03 2.990e-04 7.388 2.24e-13 ***
-## Year_fct2017:Flow 2.950e-02 4.852e-03 6.080 1.45e-09 ***
-## Year_fct2018:Flow 4.676e-04 2.482e-04 1.884 0.05977 .
-## Year_fct2019:Flow -7.187e-05 2.176e-04 -0.330 0.74126
-## Year_fct2016:StationCodeSTTD 1.082e+00 8.210e-02 13.181 < 2e-16 ***
-## Year_fct2017:StationCodeSTTD 1.048e+00 1.242e-01 8.441 < 2e-16 ***
-## Year_fct2018:StationCodeSTTD -2.895e-01 7.742e-02 -3.739 0.00019 ***
-## Year_fct2019:StationCodeSTTD -8.838e-01 7.574e-02 -11.668 < 2e-16 ***
-## Year_fct2016:StationCodeLIB 2.203e+00 2.546e-01 8.652 < 2e-16 ***
-## Year_fct2017:StationCodeLIB 9.576e-01 1.991e-01 4.810 1.63e-06 ***
-## Year_fct2018:StationCodeLIB -1.551e+00 2.409e-01 -6.436 1.55e-10 ***
-## Year_fct2019:StationCodeLIB -3.510e-01 2.135e-01 -1.644 0.10033
-## Year_fct2016:StationCodeRVB 1.212e+00 2.309e-01 5.248 1.71e-07 ***
-## Year_fct2017:StationCodeRVB 9.736e-01 2.001e-01 4.866 1.23e-06 ***
-## Year_fct2018:StationCodeRVB 1.274e-01 1.557e-01 0.818 0.41349
-## Year_fct2019:StationCodeRVB -8.231e-01 1.848e-01 -4.454 8.94e-06 ***
-## Flow:StationCodeSTTD 4.740e-03 2.936e-04 16.147 < 2e-16 ***
-## Flow:StationCodeLIB 1.812e-03 2.039e-04 8.883 < 2e-16 ***
-## Flow:StationCodeRVB 1.891e-03 1.838e-04 10.290 < 2e-16 ***
-## Year_fct2016:Flow:StationCodeSTTD -3.737e-03 4.362e-04 -8.566 < 2e-16 ***
-## Year_fct2017:Flow:StationCodeSTTD -2.723e-02 5.249e-03 -5.187 2.37e-07 ***
-## Year_fct2018:Flow:StationCodeSTTD -7.936e-04 3.882e-04 -2.044 0.04105 *
-## Year_fct2019:Flow:StationCodeSTTD -1.449e-03 3.440e-04 -4.213 2.64e-05 ***
-## Year_fct2016:Flow:StationCodeLIB -1.816e-03 3.421e-04 -5.307 1.25e-07 ***
-## Year_fct2017:Flow:StationCodeLIB -2.955e-02 4.855e-03 -6.087 1.39e-09 ***
-## Year_fct2018:Flow:StationCodeLIB -6.475e-04 3.801e-04 -1.703 0.08865 .
-## Year_fct2019:Flow:StationCodeLIB 2.046e-04 3.072e-04 0.666 0.50555
-## Year_fct2016:Flow:StationCodeRVB -2.268e-03 3.003e-04 -7.551 6.70e-14 ***
-## Year_fct2017:Flow:StationCodeRVB -2.955e-02 4.852e-03 -6.089 1.37e-09 ***
-## Year_fct2018:Flow:StationCodeRVB -4.848e-04 2.495e-04 -1.943 0.05219 .
-## Year_fct2019:Flow:StationCodeRVB 9.204e-05 2.197e-04 0.419 0.67527
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Approximate significance of smooth terms:
-## edf Ref.df F p-value
-## s(DOY) 7.487 8.411 40.88 <2e-16 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## R-sq.(adj) = 0.875 Deviance explained = 87.8%
-## -REML = 898.16 Scale est. = 0.12072 n = 1921
-appraise(m_gam3)
-
-shapiro.test(residuals(m_gam3))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_gam3)
-## W = 0.97102, p-value < 2.2e-16
-k.check(m_gam3)
-## k' edf k-index p-value
-## s(DOY) 9 7.486841 0.9418805 0.005
-draw(m_gam3, select = 1, residuals = TRUE, rug = FALSE)
-
-plot(m_gam3, pages = 1, all.terms = TRUE)
-
-acf(residuals(m_gam3))
-
-Box.test(residuals(m_gam3), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_gam3)
-## X-squared = 3735.8, df = 20, p-value < 2.2e-16
-The model definitely has residual autocorrelation as indicated by the -ACF plot and the Box-Ljung test. Even though the p-value for the k-check -is less than 0.05, the smooth term for day of year appears to be -overfitted. Let’s try a smaller k-value for the smooth first, then lets -try to address the residual autocorrelation.
-m_gam3_k5 <- gam(
- Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5),
- data = df_chla_c2,
- method = "REML"
-)
-
-summary(m_gam3_k5)
-##
-## Family: gaussian
-## Link function: identity
-##
-## Formula:
-## Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)
-##
-## Parametric coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.733e+00 4.260e-02 228.453 < 2e-16 ***
-## Year_fct2016 -7.901e-01 6.461e-02 -12.229 < 2e-16 ***
-## Year_fct2017 -9.026e-01 9.409e-02 -9.593 < 2e-16 ***
-## Year_fct2018 -3.787e-01 5.676e-02 -6.672 3.32e-11 ***
-## Year_fct2019 -1.526e-01 5.625e-02 -2.714 0.006711 **
-## Flow -1.808e-03 1.803e-04 -10.029 < 2e-16 ***
-## StationCodeSTTD -1.069e+00 5.408e-02 -19.760 < 2e-16 ***
-## StationCodeLIB -2.132e+00 1.438e-01 -14.828 < 2e-16 ***
-## StationCodeRVB -2.191e+00 1.009e-01 -21.726 < 2e-16 ***
-## Year_fct2016:Flow 2.031e-03 2.952e-04 6.881 8.06e-12 ***
-## Year_fct2017:Flow 2.882e-02 4.852e-03 5.941 3.37e-09 ***
-## Year_fct2018:Flow 4.505e-04 2.496e-04 1.805 0.071304 .
-## Year_fct2019:Flow -9.132e-05 2.188e-04 -0.417 0.676427
-## Year_fct2016:StationCodeSTTD 1.069e+00 8.243e-02 12.964 < 2e-16 ***
-## Year_fct2017:StationCodeSTTD 1.036e+00 1.248e-01 8.298 < 2e-16 ***
-## Year_fct2018:StationCodeSTTD -2.965e-01 7.785e-02 -3.809 0.000144 ***
-## Year_fct2019:StationCodeSTTD -8.854e-01 7.604e-02 -11.644 < 2e-16 ***
-## Year_fct2016:StationCodeLIB 2.201e+00 2.555e-01 8.615 < 2e-16 ***
-## Year_fct2017:StationCodeLIB 9.597e-01 1.992e-01 4.817 1.57e-06 ***
-## Year_fct2018:StationCodeLIB -1.511e+00 2.415e-01 -6.257 4.84e-10 ***
-## Year_fct2019:StationCodeLIB -3.715e-01 2.144e-01 -1.733 0.083277 .
-## Year_fct2016:StationCodeRVB 1.158e+00 2.306e-01 5.021 5.62e-07 ***
-## Year_fct2017:StationCodeRVB 9.786e-01 1.999e-01 4.895 1.07e-06 ***
-## Year_fct2018:StationCodeRVB 1.267e-01 1.564e-01 0.810 0.417994
-## Year_fct2019:StationCodeRVB -8.344e-01 1.853e-01 -4.504 7.09e-06 ***
-## Flow:StationCodeSTTD 4.756e-03 2.952e-04 16.111 < 2e-16 ***
-## Flow:StationCodeLIB 1.749e-03 2.024e-04 8.641 < 2e-16 ***
-## Flow:StationCodeRVB 1.832e-03 1.816e-04 10.087 < 2e-16 ***
-## Year_fct2016:Flow:StationCodeSTTD -3.760e-03 4.388e-04 -8.570 < 2e-16 ***
-## Year_fct2017:Flow:StationCodeSTTD -2.633e-02 5.239e-03 -5.026 5.48e-07 ***
-## Year_fct2018:Flow:StationCodeSTTD -8.012e-04 3.906e-04 -2.051 0.040386 *
-## Year_fct2019:Flow:StationCodeSTTD -1.475e-03 3.460e-04 -4.262 2.13e-05 ***
-## Year_fct2016:Flow:StationCodeLIB -1.620e-03 3.392e-04 -4.776 1.93e-06 ***
-## Year_fct2017:Flow:StationCodeLIB -2.885e-02 4.853e-03 -5.946 3.27e-09 ***
-## Year_fct2018:Flow:StationCodeLIB -5.940e-04 3.818e-04 -1.556 0.119930
-## Year_fct2019:Flow:StationCodeLIB 2.056e-04 3.088e-04 0.666 0.505630
-## Year_fct2016:Flow:StationCodeRVB -2.087e-03 2.962e-04 -7.045 2.59e-12 ***
-## Year_fct2017:Flow:StationCodeRVB -2.887e-02 4.852e-03 -5.950 3.19e-09 ***
-## Year_fct2018:Flow:StationCodeRVB -4.685e-04 2.510e-04 -1.867 0.062082 .
-## Year_fct2019:Flow:StationCodeRVB 1.118e-04 2.207e-04 0.506 0.612687
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Approximate significance of smooth terms:
-## edf Ref.df F p-value
-## s(DOY) 3.897 3.994 82.03 <2e-16 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## R-sq.(adj) = 0.874 Deviance explained = 87.7%
-## -REML = 905.16 Scale est. = 0.12223 n = 1921
-appraise(m_gam3_k5)
-
-shapiro.test(residuals(m_gam3_k5))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_gam3_k5)
-## W = 0.96838, p-value < 2.2e-16
-k.check(m_gam3_k5)
-## k' edf k-index p-value
-## s(DOY) 4 3.897303 0.9289271 0
-draw(m_gam3_k5, select = 1, residuals = TRUE, rug = FALSE)
-
-plot(m_gam3_k5, pages = 1, all.terms = TRUE)
-
-acf(residuals(m_gam3_k5))
-
-Box.test(residuals(m_gam3_k5), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_gam3_k5)
-## X-squared = 3775.8, df = 20, p-value < 2.2e-16
-Changing the k-value to 5 seems to help reduce the “wiggliness” of -the smooth term for DOY, but the p-value for the k-check is still less -than 0.05. Despite this, lets proceed with a k-value of 5.
-Now, we’ll try to deal with the residual autocorrelation. We’ll run -AR(1), AR(2), AR(3), and AR(4) models and compare them using AIC.
-# Define model formula as an object
-f_gam3 <- as.formula("Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)")
-
-# Fit original model with k = 5 and three successive AR(p) models
-m_gamm3 <- gamm(
- f_gam3,
- data = df_chla_c2,
- method = "REML"
-)
-
-m_gamm3_ar1 <- gamm(
- f_gam3,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 1), # grouped by Year_fct and StationCode
- method = "REML"
-)
-
-m_gamm3_ar2 <- gamm(
- f_gam3,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 2),
- method = "REML"
-)
-
-m_gamm3_ar3 <- gamm(
- f_gam3,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 3),
- method = "REML"
-)
-
-m_gamm3_ar4 <- gamm(
- f_gam3,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 4),
- method = "REML"
-)
-
-# Compare models
-anova(
- m_gamm3$lme,
- m_gamm3_ar1$lme,
- m_gamm3_ar2$lme,
- m_gamm3_ar3$lme,
- m_gamm3_ar4$lme
-)
-## Model df AIC BIC logLik Test L.Ratio p-value
-## m_gamm3$lme 1 43 1896.3152 2134.4933 -905.1576
-## m_gamm3_ar1$lme 2 44 -761.4514 -517.7342 424.7257 1 vs 2 2659.7665 <.0001
-## m_gamm3_ar2$lme 3 45 -761.1459 -511.8897 425.5730 2 vs 3 1.6945 0.1930
-## m_gamm3_ar3$lme 4 46 -762.6776 -507.8824 427.3388 3 vs 4 3.5317 0.0602
-## m_gamm3_ar4$lme 5 47 -760.6894 -500.3552 427.3447 4 vs 5 0.0118 0.9133
-It looks like the AR(3) model has the best fit according to the AIC -values. However, BIC prefers the AR(1) model. Let’s take a closer look -at the AR(1) model.
-# Pull out the residuals and the GAM model
-resid_gamm3_ar1 <- residuals(m_gamm3_ar1$lme, type = "normalized")
-m_gamm3_ar1_gam <- m_gamm3_ar1$gam
-
-summary(m_gamm3_ar1_gam)
-##
-## Family: gaussian
-## Link function: identity
-##
-## Formula:
-## Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)
-##
-## Parametric coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.518e+00 2.152e-01 44.217 < 2e-16 ***
-## Year_fct2016 -6.228e-01 3.255e-01 -1.913 0.05587 .
-## Year_fct2017 -4.890e-01 3.314e-01 -1.476 0.14016
-## Year_fct2018 -2.959e-01 2.987e-01 -0.991 0.32201
-## Year_fct2019 -3.006e-02 2.998e-01 -0.100 0.92015
-## Flow -6.681e-04 2.914e-04 -2.293 0.02197 *
-## StationCodeSTTD -8.105e-01 3.024e-01 -2.680 0.00743 **
-## StationCodeLIB -1.712e+00 3.119e-01 -5.490 4.57e-08 ***
-## StationCodeRVB -1.847e+00 2.957e-01 -6.247 5.17e-10 ***
-## Year_fct2016:Flow 1.008e-03 5.864e-04 1.719 0.08578 .
-## Year_fct2017:Flow 1.318e-02 8.063e-03 1.635 0.10232
-## Year_fct2018:Flow 6.137e-05 5.312e-04 0.116 0.90802
-## Year_fct2019:Flow -3.428e-04 4.481e-04 -0.765 0.44432
-## Year_fct2016:StationCodeSTTD 9.596e-01 4.533e-01 2.117 0.03439 *
-## Year_fct2017:StationCodeSTTD 3.666e-01 4.833e-01 0.758 0.44827
-## Year_fct2018:StationCodeSTTD -4.601e-01 4.375e-01 -1.052 0.29313
-## Year_fct2019:StationCodeSTTD -1.078e+00 4.296e-01 -2.508 0.01221 *
-## Year_fct2016:StationCodeLIB 1.892e+00 4.690e-01 4.035 5.69e-05 ***
-## Year_fct2017:StationCodeLIB 5.850e-01 4.679e-01 1.250 0.21127
-## Year_fct2018:StationCodeLIB -1.588e+00 4.994e-01 -3.181 0.00149 **
-## Year_fct2019:StationCodeLIB -7.228e-01 5.097e-01 -1.418 0.15634
-## Year_fct2016:StationCodeRVB 9.792e-01 4.745e-01 2.064 0.03919 *
-## Year_fct2017:StationCodeRVB 2.970e-01 4.728e-01 0.628 0.52996
-## Year_fct2018:StationCodeRVB 1.256e-02 4.359e-01 0.029 0.97701
-## Year_fct2019:StationCodeRVB -5.866e-01 4.441e-01 -1.321 0.18667
-## Flow:StationCodeSTTD 1.563e-03 5.696e-04 2.744 0.00614 **
-## Flow:StationCodeLIB 7.331e-04 3.001e-04 2.443 0.01467 *
-## Flow:StationCodeRVB 6.702e-04 2.917e-04 2.298 0.02167 *
-## Year_fct2016:Flow:StationCodeSTTD -1.668e-03 9.315e-04 -1.791 0.07345 .
-## Year_fct2017:Flow:StationCodeSTTD -7.118e-03 8.237e-03 -0.864 0.38762
-## Year_fct2018:Flow:StationCodeSTTD 4.618e-04 8.986e-04 0.514 0.60738
-## Year_fct2019:Flow:StationCodeSTTD 7.969e-04 7.567e-04 1.053 0.29244
-## Year_fct2016:Flow:StationCodeLIB -7.348e-04 5.979e-04 -1.229 0.21925
-## Year_fct2017:Flow:StationCodeLIB -1.306e-02 8.064e-03 -1.620 0.10537
-## Year_fct2018:Flow:StationCodeLIB 1.411e-04 5.571e-04 0.253 0.80007
-## Year_fct2019:Flow:StationCodeLIB 3.178e-04 4.756e-04 0.668 0.50408
-## Year_fct2016:Flow:StationCodeRVB -1.024e-03 5.868e-04 -1.746 0.08101 .
-## Year_fct2017:Flow:StationCodeRVB -1.319e-02 8.063e-03 -1.635 0.10212
-## Year_fct2018:Flow:StationCodeRVB -7.120e-05 5.316e-04 -0.134 0.89346
-## Year_fct2019:Flow:StationCodeRVB 3.206e-04 4.485e-04 0.715 0.47479
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Approximate significance of smooth terms:
-## edf Ref.df F p-value
-## s(DOY) 3.745 3.745 9.993 5.3e-07 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## R-sq.(adj) = 0.82
-## Scale est. = 0.22667 n = 1921
-appraise(m_gamm3_ar1_gam)
-
-shapiro.test(resid_gamm3_ar1)
-##
-## Shapiro-Wilk normality test
-##
-## data: resid_gamm3_ar1
-## W = 0.84032, p-value < 2.2e-16
-k.check(m_gamm3_ar1_gam)
-## k' edf k-index p-value
-## s(DOY) 4 3.74514 0.8115987 0
-draw(m_gamm3_ar1_gam, select = 1, residuals = TRUE, rug = FALSE)
-
-plot(m_gamm3_ar1_gam, pages = 1, all.terms = TRUE)
-
-acf(resid_gamm3_ar1)
-
-Box.test(resid_gamm3_ar1, lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: resid_gamm3_ar1
-## X-squared = 50.995, df = 20, p-value = 0.0001593
-The AR(1) model has much less residual autocorrelation, and the -diagnostics plots look okay. Let’s take a closer look at how the -back-transformed fitted values from the model match the observed -values.
-df_gamm3_ar1_fit <- df_chla_c2 %>%
- fitted_values(m_gamm3_ar1_gam, data = .) %>%
- mutate(fitted_bt = exp(fitted) / 1000)
-
-plt_gamm3_ar1_obs_fit <- df_gamm3_ar1_fit %>%
- ggplot(aes(x = fitted_bt, y = Chla)) +
- geom_point() +
- geom_abline(slope = 1, intercept = 0, color = "red") +
- theme_bw() +
- labs(
- x = "Back-transformed Fitted Values",
- y = "Observed Values"
- )
-
-plt_gamm3_ar1_obs_fit
-
-Let’s look at each Station-Year combination separately.
-plt_gamm3_ar1_obs_fit +
- facet_wrap(
- vars(StationCode, Year_fct),
- ncol = 5,
- scales = "free",
- labeller = labeller(.multi_line = FALSE)
- )
-
-The red lines are the 1:1 ratio between fitted and observed values, -and we would expect for most of the points to be near this line if the -model has a good fit to the observed data. However, the fitted values -from the model don’t appear to match the observed values that well, and -there are some unusual patterns when we look at each Station-Year -combination separately. I don’t think this GAM model is a good candidate -to use for our analysis.
-Since the GAM model didn’t seem to fit the observed data that well, -let’s try a linear model using a three-way interaction between Year, -Daily Average Flow, and Station. Initially, we’ll run the model without -accounting for serial autocorrelation.
-m_lm3 <- df_chla_c2_lag %>%
- drop_na(Chla_log) %>%
- lm(Chla_log ~ Year_fct * Flow * StationCode, data = .)
-Lets look at the model summary and diagnostics:
-summary(m_lm3)
-##
-## Call:
-## lm(formula = Chla_log ~ Year_fct * Flow * StationCode, data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -1.53824 -0.22584 -0.03527 0.19362 1.75991
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.6844243 0.0452509 214.016 < 2e-16 ***
-## Year_fct2016 -0.6624673 0.0679526 -9.749 < 2e-16 ***
-## Year_fct2017 -1.1724759 0.0999868 -11.726 < 2e-16 ***
-## Year_fct2018 -0.3418296 0.0613064 -5.576 2.82e-08 ***
-## Year_fct2019 -0.1194239 0.0606175 -1.970 0.048971 *
-## Flow -0.0018894 0.0001902 -9.935 < 2e-16 ***
-## StationCodeSTTD -1.0665077 0.0583396 -18.281 < 2e-16 ***
-## StationCodeLIB -2.2759582 0.1501355 -15.159 < 2e-16 ***
-## StationCodeRVB -2.2211596 0.1087675 -20.421 < 2e-16 ***
-## Year_fct2016:Flow 0.0023191 0.0003112 7.451 1.40e-13 ***
-## Year_fct2017:Flow 0.0459260 0.0050786 9.043 < 2e-16 ***
-## Year_fct2018:Flow 0.0003299 0.0002697 1.223 0.221489
-## Year_fct2019:Flow -0.0001500 0.0002357 -0.636 0.524638
-## Year_fct2016:StationCodeSTTD 1.0883876 0.0888628 12.248 < 2e-16 ***
-## Year_fct2017:StationCodeSTTD 1.4828385 0.1306889 11.346 < 2e-16 ***
-## Year_fct2018:StationCodeSTTD -0.3044581 0.0836642 -3.639 0.000281 ***
-## Year_fct2019:StationCodeSTTD -0.9412845 0.0821169 -11.463 < 2e-16 ***
-## Year_fct2016:StationCodeLIB 2.0498183 0.2674351 7.665 2.85e-14 ***
-## Year_fct2017:StationCodeLIB 1.3182346 0.2107896 6.254 4.95e-10 ***
-## Year_fct2018:StationCodeLIB -1.4934167 0.2582457 -5.783 8.58e-09 ***
-## Year_fct2019:StationCodeLIB -0.3466158 0.2273610 -1.525 0.127548
-## Year_fct2016:StationCodeRVB 1.9708585 0.2227786 8.847 < 2e-16 ***
-## Year_fct2017:StationCodeRVB 1.4604311 0.2117602 6.897 7.25e-12 ***
-## Year_fct2018:StationCodeRVB -0.0019108 0.1643456 -0.012 0.990725
-## Year_fct2019:StationCodeRVB -0.7693190 0.1962948 -3.919 9.20e-05 ***
-## Flow:StationCodeSTTD 0.0046051 0.0003182 14.474 < 2e-16 ***
-## Flow:StationCodeLIB 0.0017121 0.0002100 8.153 6.42e-16 ***
-## Flow:StationCodeRVB 0.0019282 0.0001915 10.070 < 2e-16 ***
-## Year_fct2016:Flow:StationCodeSTTD -0.0037137 0.0004737 -7.840 7.46e-15 ***
-## Year_fct2017:Flow:StationCodeSTTD -0.0407758 0.0055375 -7.364 2.66e-13 ***
-## Year_fct2018:Flow:StationCodeSTTD -0.0007383 0.0004198 -1.759 0.078768 .
-## Year_fct2019:Flow:StationCodeSTTD -0.0012540 0.0003732 -3.360 0.000794 ***
-## Year_fct2016:Flow:StationCodeLIB -0.0020650 0.0003575 -5.777 8.89e-09 ***
-## Year_fct2017:Flow:StationCodeLIB -0.0459271 0.0050805 -9.040 < 2e-16 ***
-## Year_fct2018:Flow:StationCodeLIB -0.0004184 0.0004122 -1.015 0.310185
-## Year_fct2019:Flow:StationCodeLIB 0.0001287 0.0003310 0.389 0.697449
-## Year_fct2016:Flow:StationCodeRVB -0.0024657 0.0003128 -7.884 5.33e-15 ***
-## Year_fct2017:Flow:StationCodeRVB -0.0460030 0.0050786 -9.058 < 2e-16 ***
-## Year_fct2018:Flow:StationCodeRVB -0.0003456 0.0002710 -1.275 0.202416
-## Year_fct2019:Flow:StationCodeRVB 0.0001493 0.0002375 0.629 0.529618
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.378 on 1881 degrees of freedom
-## Multiple R-squared: 0.8556, Adjusted R-squared: 0.8526
-## F-statistic: 285.8 on 39 and 1881 DF, p-value: < 2.2e-16
-df_chla_c2_lag %>%
- drop_na(Chla_log) %>%
- plot_lm_diag(Chla_log, m_lm3)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm3))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm3)
-## W = 0.96994, p-value < 2.2e-16
-acf(residuals(m_lm3))
-
-Box.test(residuals(m_lm3), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm3)
-## X-squared = 4218.2, df = 20, p-value < 2.2e-16
-The residuals deviate from a normal distribution according to visual -inspection and the Shapiro-Wilk normality test. Also, model definitely -has residual autocorrelation as indicated by the ACF plot and the -Box-Ljung test.
-Now, we’ll try to deal with the residual autocorrelation and the -non-normal residuals. We’ll run a series of linear models adding 1, 2, -and 3 lag terms and compare how well they correct for -autocorrelation.
-m_lm3_lag1 <- df_chla_c2_lag %>%
- drop_na(Chla_log, lag1) %>%
- lm(Chla_log ~ Year_fct * Flow * StationCode + lag1, data = .)
-
-acf(residuals(m_lm3_lag1))
-
-Box.test(residuals(m_lm3_lag1), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm3_lag1)
-## X-squared = 59.583, df = 20, p-value = 8.262e-06
-m_lm3_lag2 <- df_chla_c2_lag %>%
- drop_na(Chla_log, lag1, lag2) %>%
- lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2, data = .)
-
-acf(residuals(m_lm3_lag2))
-
-Box.test(residuals(m_lm3_lag2), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm3_lag2)
-## X-squared = 36.166, df = 20, p-value = 0.01471
-m_lm3_lag3 <- df_chla_c2_lag %>%
- drop_na(Chla_log, lag1, lag2, lag3) %>%
- lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2 + lag3, data = .)
-
-acf(residuals(m_lm3_lag3))
-
-Box.test(residuals(m_lm3_lag3), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm3_lag3)
-## X-squared = 29.8, df = 20, p-value = 0.07317
-The model with 3 lag terms looks to have slightly better ACF and -Box-Ljung test results than the other models. Let’s compare the 3 models -using AIC and BIC to see which model those prefer.
-AIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
-## df AIC
-## m_lm3_lag1 42 -1332.474
-## m_lm3_lag2 43 -1320.991
-## m_lm3_lag3 44 -1301.195
-BIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
-## df BIC
-## m_lm3_lag1 42 -1099.969
-## m_lm3_lag2 43 -1083.949
-## m_lm3_lag3 44 -1059.514
-Both AIC and BIC prefer the model with 1 lag term. Let’s take a -closer look at that one.
-summary(m_lm3_lag1)
-##
-## Call:
-## lm(formula = Chla_log ~ Year_fct * Flow * StationCode + lag1,
-## data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -1.11307 -0.05814 -0.00595 0.05217 1.06385
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 1.001e+00 1.029e-01 9.728 < 2e-16 ***
-## Year_fct2016 -9.809e-02 3.103e-02 -3.161 0.001598 **
-## Year_fct2017 -1.525e-01 4.616e-02 -3.303 0.000975 ***
-## Year_fct2018 -5.790e-02 2.756e-02 -2.101 0.035792 *
-## Year_fct2019 -2.174e-02 2.708e-02 -0.803 0.422137
-## Flow -3.367e-04 8.645e-05 -3.895 0.000102 ***
-## StationCodeSTTD -1.182e-01 2.830e-02 -4.177 3.10e-05 ***
-## StationCodeLIB -1.924e-01 7.091e-02 -2.713 0.006729 **
-## StationCodeRVB -2.181e-01 5.374e-02 -4.059 5.14e-05 ***
-## lag1 8.985e-01 1.044e-02 86.090 < 2e-16 ***
-## Year_fct2016:Flow 3.634e-04 1.401e-04 2.594 0.009552 **
-## Year_fct2017:Flow 5.522e-03 2.327e-03 2.373 0.017739 *
-## Year_fct2018:Flow 2.083e-04 1.199e-04 1.738 0.082447 .
-## Year_fct2019:Flow 1.407e-04 1.048e-04 1.342 0.179847
-## Year_fct2016:StationCodeSTTD 1.270e-01 4.128e-02 3.078 0.002115 **
-## Year_fct2017:StationCodeSTTD 2.685e-01 6.059e-02 4.432 9.90e-06 ***
-## Year_fct2018:StationCodeSTTD -1.769e-02 3.753e-02 -0.471 0.637527
-## Year_fct2019:StationCodeSTTD -9.963e-02 3.804e-02 -2.619 0.008883 **
-## Year_fct2016:StationCodeLIB 2.879e-01 1.205e-01 2.388 0.017030 *
-## Year_fct2017:StationCodeLIB 1.367e-01 9.785e-02 1.397 0.162477
-## Year_fct2018:StationCodeLIB -1.636e-01 1.329e-01 -1.231 0.218468
-## Year_fct2019:StationCodeLIB -2.590e-02 1.048e-01 -0.247 0.804794
-## Year_fct2016:StationCodeRVB 2.448e-01 1.028e-01 2.382 0.017302 *
-## Year_fct2017:StationCodeRVB 1.841e-01 9.545e-02 1.929 0.053871 .
-## Year_fct2018:StationCodeRVB 3.772e-02 7.309e-02 0.516 0.605905
-## Year_fct2019:StationCodeRVB -3.273e-02 8.873e-02 -0.369 0.712219
-## Flow:StationCodeSTTD 6.188e-04 1.508e-04 4.103 4.26e-05 ***
-## Flow:StationCodeLIB 3.573e-04 9.469e-05 3.774 0.000166 ***
-## Flow:StationCodeRVB 3.341e-04 8.712e-05 3.835 0.000130 ***
-## Year_fct2016:Flow:StationCodeSTTD -5.296e-04 2.150e-04 -2.463 0.013871 *
-## Year_fct2017:Flow:StationCodeSTTD -2.463e-03 2.544e-03 -0.968 0.333192
-## Year_fct2018:Flow:StationCodeSTTD -2.088e-04 1.882e-04 -1.109 0.267388
-## Year_fct2019:Flow:StationCodeSTTD -2.486e-04 1.680e-04 -1.480 0.139048
-## Year_fct2016:Flow:StationCodeLIB -2.807e-04 1.601e-04 -1.753 0.079803 .
-## Year_fct2017:Flow:StationCodeLIB -5.534e-03 2.328e-03 -2.377 0.017554 *
-## Year_fct2018:Flow:StationCodeLIB -1.611e-04 2.125e-04 -0.758 0.448487
-## Year_fct2019:Flow:StationCodeLIB -1.318e-04 1.537e-04 -0.857 0.391346
-## Year_fct2016:Flow:StationCodeRVB -3.780e-04 1.410e-04 -2.681 0.007399 **
-## Year_fct2017:Flow:StationCodeRVB -5.527e-03 2.327e-03 -2.375 0.017656 *
-## Year_fct2018:Flow:StationCodeRVB -2.086e-04 1.205e-04 -1.732 0.083497 .
-## Year_fct2019:Flow:StationCodeRVB -1.420e-04 1.056e-04 -1.345 0.178840
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.1677 on 1833 degrees of freedom
-## Multiple R-squared: 0.9709, Adjusted R-squared: 0.9703
-## F-statistic: 1531 on 40 and 1833 DF, p-value: < 2.2e-16
-df_chla_c2_lag %>%
- drop_na(Chla_log, lag1) %>%
- plot_lm_diag(Chla_log, m_lm3_lag1)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm3_lag1))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm3_lag1)
-## W = 0.86196, p-value < 2.2e-16
-The residuals deviate from a normal distribution particularly at the -both tails of the distribution. I’m not so sure about using this model. -Let’s take a closer look at how the back-transformed fitted values from -the model match the observed values.
-df_lm3_lag1_fit <- df_chla_c2_lag %>%
- drop_na(Chla, lag1) %>%
- mutate(Fitted = as_tibble(predict(m_lm3_lag1, interval = "confidence"))) %>%
- unnest_wider(Fitted) %>%
- mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
-
-plt_lm3_lag1_obs_fit <- df_lm3_lag1_fit %>%
- ggplot(aes(x = fit, y = Chla)) +
- geom_point() +
- geom_abline(slope = 1, intercept = 0, color = "red") +
- theme_bw() +
- labs(
- x = "Back-transformed Fitted Values",
- y = "Observed Values"
- )
-
-plt_lm3_lag1_obs_fit
-
-Let’s look at each Station-Year combination separately.
-plt_lm3_lag1_obs_fit +
- facet_wrap(
- vars(StationCode, Year_fct),
- ncol = 5,
- scales = "free",
- labeller = labeller(.multi_line = FALSE)
- )
-
-The red lines are the 1:1 ratio between fitted and observed values, -and we would expect for most of the points to be near this line if the -model has a good fit to the observed data. The fitted and observed -values appear to match pretty well at the lower end of the range of -values, but this deteriorates at the mid and higher range values. This -pattern holds for some of the separate Station-Year combinations. I’m -not sure this linear model is a good candidate to use for our -analysis.
-Let’s take a closer look at the model with 3 lag terms since that had -the least autocorrelation.
-summary(m_lm3_lag3)
-##
-## Call:
-## lm(formula = Chla_log ~ Year_fct * Flow * StationCode + lag1 +
-## lag2 + lag3, data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -1.10205 -0.05729 -0.00689 0.05130 1.10315
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 1.157e+00 1.107e-01 10.454 < 2e-16 ***
-## Year_fct2016 -9.661e-02 3.126e-02 -3.091 0.002027 **
-## Year_fct2017 -1.610e-01 4.643e-02 -3.467 0.000540 ***
-## Year_fct2018 -5.585e-02 2.771e-02 -2.016 0.043975 *
-## Year_fct2019 -1.043e-02 2.720e-02 -0.383 0.701433
-## Flow -3.249e-04 8.626e-05 -3.766 0.000171 ***
-## StationCodeSTTD -1.216e-01 2.874e-02 -4.232 2.43e-05 ***
-## StationCodeLIB -2.218e-01 7.148e-02 -3.102 0.001950 **
-## StationCodeRVB -2.404e-01 5.410e-02 -4.444 9.37e-06 ***
-## lag1 9.520e-01 2.404e-02 39.604 < 2e-16 ***
-## lag2 -6.912e-03 3.300e-02 -0.209 0.834141
-## lag3 -6.404e-02 2.396e-02 -2.672 0.007602 **
-## Year_fct2016:Flow 3.667e-04 1.395e-04 2.629 0.008641 **
-## Year_fct2017:Flow 6.434e-03 2.364e-03 2.722 0.006556 **
-## Year_fct2018:Flow 1.808e-04 1.194e-04 1.514 0.130087
-## Year_fct2019:Flow 9.141e-05 1.045e-04 0.875 0.381899
-## Year_fct2016:StationCodeSTTD 1.392e-01 4.182e-02 3.329 0.000889 ***
-## Year_fct2017:StationCodeSTTD 2.677e-01 6.087e-02 4.399 1.15e-05 ***
-## Year_fct2018:StationCodeSTTD -3.326e-02 3.782e-02 -0.880 0.379189
-## Year_fct2019:StationCodeSTTD -1.314e-01 3.855e-02 -3.408 0.000669 ***
-## Year_fct2016:StationCodeLIB 2.940e-01 1.214e-01 2.422 0.015517 *
-## Year_fct2017:StationCodeLIB 1.346e-01 1.007e-01 1.337 0.181412
-## Year_fct2018:StationCodeLIB -2.818e-01 1.426e-01 -1.976 0.048279 *
-## Year_fct2019:StationCodeLIB -6.092e-02 1.303e-01 -0.467 0.640243
-## Year_fct2016:StationCodeRVB 2.601e-01 1.100e-01 2.364 0.018204 *
-## Year_fct2017:StationCodeRVB 1.906e-01 9.562e-02 1.993 0.046416 *
-## Year_fct2018:StationCodeRVB 2.043e-02 7.272e-02 0.281 0.778735
-## Year_fct2019:StationCodeRVB -1.232e-01 9.296e-02 -1.326 0.185077
-## Flow:StationCodeSTTD 6.575e-04 1.545e-04 4.254 2.21e-05 ***
-## Flow:StationCodeLIB 3.407e-04 9.466e-05 3.599 0.000328 ***
-## Flow:StationCodeRVB 3.220e-04 8.693e-05 3.704 0.000219 ***
-## Year_fct2016:Flow:StationCodeSTTD -5.712e-04 2.169e-04 -2.633 0.008528 **
-## Year_fct2017:Flow:StationCodeSTTD -3.888e-03 2.608e-03 -1.491 0.136168
-## Year_fct2018:Flow:StationCodeSTTD -2.004e-04 1.903e-04 -1.053 0.292443
-## Year_fct2019:Flow:StationCodeSTTD -2.358e-04 1.704e-04 -1.384 0.166560
-## Year_fct2016:Flow:StationCodeLIB -2.891e-04 1.602e-04 -1.805 0.071255 .
-## Year_fct2017:Flow:StationCodeLIB -6.458e-03 2.365e-03 -2.731 0.006386 **
-## Year_fct2018:Flow:StationCodeLIB -2.967e-04 2.335e-04 -1.271 0.203981
-## Year_fct2019:Flow:StationCodeLIB -1.055e-04 1.901e-04 -0.555 0.578996
-## Year_fct2016:Flow:StationCodeRVB -3.828e-04 1.405e-04 -2.725 0.006496 **
-## Year_fct2017:Flow:StationCodeRVB -6.439e-03 2.364e-03 -2.724 0.006523 **
-## Year_fct2018:Flow:StationCodeRVB -1.798e-04 1.200e-04 -1.498 0.134302
-## Year_fct2019:Flow:StationCodeRVB -8.592e-05 1.054e-04 -0.816 0.414863
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.1663 on 1752 degrees of freedom
-## Multiple R-squared: 0.9706, Adjusted R-squared: 0.9699
-## F-statistic: 1378 on 42 and 1752 DF, p-value: < 2.2e-16
-df_chla_c2_lag %>%
- drop_na(Chla_log, lag1, lag2, lag3) %>%
- plot_lm_diag(Chla_log, m_lm3_lag3)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm3_lag3))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm3_lag3)
-## W = 0.8528, p-value < 2.2e-16
-The diagnostic plots show that the lag 3 model has similar problems -as the lag 1 model, so this doesn’t look like a good candidate for our -analysis as well.
-Because the models using 3-way interactions don’t seem to fit the -data well, we will attempt to fit a generalized additive model (GAM) to -the data set using all two-way interactions between Year, Daily Average -Flow, and Station, with a smooth term for day of year to account for -seasonality. Initially, we’ll run the GAM without accounting for serial -autocorrelation.
-m_gam2 <- gam(
- Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5),
- data = df_chla_c2,
- method = "REML"
-)
-Lets look at the model summary and diagnostics:
-summary(m_gam2)
-##
-## Family: gaussian
-## Link function: identity
-##
-## Formula:
-## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)
-##
-## Parametric coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.688e+00 3.752e-02 258.201 < 2e-16 ***
-## Year_fct2016 -5.462e-01 5.415e-02 -10.087 < 2e-16 ***
-## Year_fct2017 -3.975e-01 5.038e-02 -7.891 5.02e-15 ***
-## Year_fct2018 -3.166e-01 4.815e-02 -6.576 6.24e-11 ***
-## Year_fct2019 -1.727e-01 4.848e-02 -3.562 0.000377 ***
-## Flow -1.460e-03 9.154e-05 -15.953 < 2e-16 ***
-## StationCodeSTTD -9.919e-01 5.113e-02 -19.400 < 2e-16 ***
-## StationCodeLIB -1.930e+00 1.079e-01 -17.897 < 2e-16 ***
-## StationCodeRVB -2.140e+00 9.850e-02 -21.721 < 2e-16 ***
-## Year_fct2016:Flow -4.570e-05 3.126e-05 -1.462 0.143979
-## Year_fct2017:Flow -4.495e-05 2.527e-05 -1.779 0.075411 .
-## Year_fct2018:Flow -1.826e-05 2.555e-05 -0.715 0.474786
-## Year_fct2019:Flow 4.051e-06 2.720e-05 0.149 0.881632
-## Year_fct2016:StationCodeSTTD 7.454e-01 7.498e-02 9.941 < 2e-16 ***
-## Year_fct2017:StationCodeSTTD 3.922e-01 7.886e-02 4.974 7.16e-07 ***
-## Year_fct2018:StationCodeSTTD -3.423e-01 7.218e-02 -4.743 2.26e-06 ***
-## Year_fct2019:StationCodeSTTD -9.528e-01 7.064e-02 -13.488 < 2e-16 ***
-## Year_fct2016:StationCodeLIB 1.293e+00 8.638e-02 14.969 < 2e-16 ***
-## Year_fct2017:StationCodeLIB 3.824e-01 8.524e-02 4.487 7.66e-06 ***
-## Year_fct2018:StationCodeLIB -1.577e+00 1.060e-01 -14.875 < 2e-16 ***
-## Year_fct2019:StationCodeLIB -5.273e-01 1.082e-01 -4.875 1.18e-06 ***
-## Year_fct2016:StationCodeRVB 8.304e-01 2.334e-01 3.557 0.000384 ***
-## Year_fct2017:StationCodeRVB 4.753e-01 1.859e-01 2.556 0.010665 *
-## Year_fct2018:StationCodeRVB 7.054e-02 1.538e-01 0.459 0.646619
-## Year_fct2019:StationCodeRVB -6.698e-01 1.828e-01 -3.664 0.000255 ***
-## Flow:StationCodeSTTD 3.336e-03 1.263e-04 26.406 < 2e-16 ***
-## Flow:StationCodeLIB 1.501e-03 1.106e-04 13.572 < 2e-16 ***
-## Flow:StationCodeRVB 1.483e-03 9.028e-05 16.427 < 2e-16 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Approximate significance of smooth terms:
-## edf Ref.df F p-value
-## s(DOY) 3.883 3.992 93.07 <2e-16 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## R-sq.(adj) = 0.863 Deviance explained = 86.5%
-## -REML = 906.44 Scale est. = 0.13309 n = 1921
-appraise(m_gam2)
-
-shapiro.test(residuals(m_gam2))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_gam2)
-## W = 0.97201, p-value < 2.2e-16
-k.check(m_gam2)
-## k' edf k-index p-value
-## s(DOY) 4 3.883314 0.9303233 0
-draw(m_gam2, select = 1, residuals = TRUE, rug = FALSE)
-
-plot(m_gam2, pages = 1, all.terms = TRUE)
-
-acf(residuals(m_gam2))
-
-Box.test(residuals(m_gam2), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_gam2)
-## X-squared = 4292.7, df = 20, p-value < 2.2e-16
-Now, we’ll try to deal with the residual autocorrelation. We’ll run -AR(1), AR(2), AR(3), and AR(4) models and compare them using AIC.
-# Define model formula as an object
-f_gam2 <- as.formula("Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)")
-
-# Fit original model with k = 5 and three successive AR(p) models
-m_gamm2 <- gamm(
- f_gam2,
- data = df_chla_c2,
- method = "REML"
-)
-
-m_gamm2_ar1 <- gamm(
- f_gam2,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 1), # grouped by Year_fct and StationCode
- method = "REML"
-)
-
-m_gamm2_ar2 <- gamm(
- f_gam2,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 2),
- method = "REML"
-)
-
-m_gamm2_ar3 <- gamm(
- f_gam2,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 3),
- method = "REML"
-)
-
-m_gamm2_ar4 <- gamm(
- f_gam2,
- data = df_chla_c2,
- correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 4),
- method = "REML"
-)
-
-# Compare models
-anova(
- m_gamm2$lme,
- m_gamm2_ar1$lme,
- m_gamm2_ar2$lme,
- m_gamm2_ar3$lme,
- m_gamm2_ar4$lme
-)
-## Model df AIC BIC logLik Test L.Ratio p-value
-## m_gamm2$lme 1 31 1874.8752 2046.7822 -906.4376
-## m_gamm2_ar1$lme 2 32 -917.8475 -740.3950 490.9238 1 vs 2 2794.7227 <.0001
-## m_gamm2_ar2$lme 3 33 -916.3375 -733.3396 491.1687 2 vs 3 0.4900 0.4839
-## m_gamm2_ar3$lme 4 34 -917.2007 -728.6574 492.6003 3 vs 4 2.8632 0.0906
-## m_gamm2_ar4$lme 5 35 -915.4855 -721.3969 492.7428 4 vs 5 0.2848 0.5936
-It looks like the AR(1) model has the best fit according to the AIC -and BIC values. Let’s take a closer look at the AR(1) model.
-# Pull out the residuals and the GAM model
-resid_gamm2_ar1 <- residuals(m_gamm2_ar1$lme, type = "normalized")
-m_gamm2_ar1_gam <- m_gamm2_ar1$gam
-
-summary(m_gamm2_ar1_gam)
-##
-## Family: gaussian
-## Link function: identity
-##
-## Formula:
-## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)
-##
-## Parametric coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.499e+00 2.554e-01 37.194 < 2e-16 ***
-## Year_fct2016 -5.044e-01 3.805e-01 -1.325 0.185176
-## Year_fct2017 -2.380e-01 3.574e-01 -0.666 0.505493
-## Year_fct2018 -2.842e-01 3.516e-01 -0.808 0.418971
-## Year_fct2019 -4.756e-02 3.534e-01 -0.135 0.892969
-## Flow -5.814e-04 1.880e-04 -3.093 0.002013 **
-## StationCodeSTTD -7.963e-01 3.606e-01 -2.208 0.027339 *
-## StationCodeLIB -1.541e+00 3.567e-01 -4.320 1.64e-05 ***
-## StationCodeRVB -1.802e+00 3.510e-01 -5.133 3.14e-07 ***
-## Year_fct2016:Flow -9.464e-06 2.004e-05 -0.472 0.636788
-## Year_fct2017:Flow -2.725e-06 2.032e-05 -0.134 0.893321
-## Year_fct2018:Flow -5.029e-06 2.185e-05 -0.230 0.818003
-## Year_fct2019:Flow -2.100e-05 2.047e-05 -1.026 0.304988
-## Year_fct2016:StationCodeSTTD 8.188e-01 5.352e-01 1.530 0.126195
-## Year_fct2017:StationCodeSTTD -9.049e-02 5.460e-01 -0.166 0.868392
-## Year_fct2018:StationCodeSTTD -4.750e-01 5.186e-01 -0.916 0.359835
-## Year_fct2019:StationCodeSTTD -1.039e+00 5.094e-01 -2.039 0.041548 *
-## Year_fct2016:StationCodeLIB 1.407e+00 5.202e-01 2.705 0.006900 **
-## Year_fct2017:StationCodeLIB 1.655e-01 5.081e-01 0.326 0.744633
-## Year_fct2018:StationCodeLIB -1.832e+00 5.647e-01 -3.244 0.001197 **
-## Year_fct2019:StationCodeLIB -7.660e-01 5.742e-01 -1.334 0.182305
-## Year_fct2016:StationCodeRVB 8.858e-01 5.518e-01 1.605 0.108565
-## Year_fct2017:StationCodeRVB 1.627e-02 5.257e-01 0.031 0.975309
-## Year_fct2018:StationCodeRVB -3.484e-02 5.084e-01 -0.069 0.945361
-## Year_fct2019:StationCodeRVB -5.837e-01 5.181e-01 -1.127 0.260065
-## Flow:StationCodeSTTD 1.710e-03 3.001e-04 5.698 1.40e-08 ***
-## Flow:StationCodeLIB 7.500e-04 1.930e-04 3.885 0.000106 ***
-## Flow:StationCodeRVB 5.801e-04 1.879e-04 3.087 0.002052 **
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Approximate significance of smooth terms:
-## edf Ref.df F p-value
-## s(DOY) 3.815 3.815 13.05 <2e-16 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## R-sq.(adj) = 0.787
-## Scale est. = 0.27723 n = 1921
-appraise(m_gamm2_ar1_gam)
-
-shapiro.test(resid_gamm2_ar1)
-##
-## Shapiro-Wilk normality test
-##
-## data: resid_gamm2_ar1
-## W = 0.82416, p-value < 2.2e-16
-k.check(m_gamm2_ar1_gam)
-## k' edf k-index p-value
-## s(DOY) 4 3.814541 0.7173524 0
-draw(m_gamm2_ar1_gam, select = 1, residuals = TRUE, rug = FALSE)
-
-plot(m_gamm2_ar1_gam, pages = 1, all.terms = TRUE)
-
-acf(resid_gamm2_ar1)
-
-Box.test(resid_gamm2_ar1, lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: resid_gamm2_ar1
-## X-squared = 48.902, df = 20, p-value = 0.0003175
-The AR(1) model has much less residual autocorrelation, and the -diagnostics plots look okay. Let’s take a closer look at how the -back-transformed fitted values from the model match the observed -values.
-df_gamm2_ar1_fit <- df_chla_c2 %>%
- fitted_values(m_gamm2_ar1_gam, data = .) %>%
- mutate(fitted_bt = exp(fitted) / 1000)
-
-plt_gamm2_ar1_obs_fit <- df_gamm2_ar1_fit %>%
- ggplot(aes(x = fitted_bt, y = Chla)) +
- geom_point() +
- geom_abline(slope = 1, intercept = 0, color = "red") +
- theme_bw() +
- labs(
- x = "Back-transformed Fitted Values",
- y = "Observed Values"
- )
-
-plt_gamm2_ar1_obs_fit
-
-Let’s look at each Station and Year separately.
-plt_gamm2_ar1_obs_fit + facet_wrap(vars(StationCode), scales = "free")
-
-plt_gamm2_ar1_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
-
-The red lines are the 1:1 ratio between fitted and observed values, -and we would expect for most of the points to be near this line if the -model has a good fit to the observed data. However, the fitted values -from the model don’t appear to match the observed values that well. -Again, I don’t think this GAM model is a good candidate to use for our -analysis.
-Since the models using 3-way interactions don’t seem to fit the -observed data that well, let’s try a linear model using all two-way -interactions between Year, Daily Average Flow, and Station. Initially, -we’ll run the model without accounting for serial autocorrelation.
-m_lm2 <- df_chla_c2_lag %>%
- drop_na(Chla_log) %>%
- lm(Chla_log ~ (Year_fct + Flow + StationCode)^2, data = .)
-Lets look at the model summary and diagnostics:
-summary(m_lm2)
-##
-## Call:
-## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2, data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -1.5367 -0.2393 -0.0399 0.2061 1.9457
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.633e+00 4.043e-02 238.279 < 2e-16 ***
-## Year_fct2016 -3.556e-01 5.747e-02 -6.188 7.43e-10 ***
-## Year_fct2017 -3.852e-01 5.474e-02 -7.038 2.71e-12 ***
-## Year_fct2018 -2.924e-01 5.249e-02 -5.569 2.92e-08 ***
-## Year_fct2019 -1.411e-01 5.280e-02 -2.671 0.007619 **
-## Flow -1.529e-03 9.697e-05 -15.764 < 2e-16 ***
-## StationCodeSTTD -9.934e-01 5.570e-02 -17.835 < 2e-16 ***
-## StationCodeLIB -2.106e+00 1.139e-01 -18.485 < 2e-16 ***
-## StationCodeRVB -2.159e+00 1.073e-01 -20.117 < 2e-16 ***
-## Year_fct2016:Flow -1.249e-04 3.167e-05 -3.945 8.28e-05 ***
-## Year_fct2017:Flow -7.386e-05 2.731e-05 -2.704 0.006909 **
-## Year_fct2018:Flow -1.076e-05 2.737e-05 -0.393 0.694152
-## Year_fct2019:Flow -1.445e-05 2.928e-05 -0.494 0.621605
-## Year_fct2016:StationCodeSTTD 7.414e-01 8.173e-02 9.072 < 2e-16 ***
-## Year_fct2017:StationCodeSTTD 5.161e-01 8.550e-02 6.037 1.89e-09 ***
-## Year_fct2018:StationCodeSTTD -3.377e-01 7.825e-02 -4.316 1.67e-05 ***
-## Year_fct2019:StationCodeSTTD -9.968e-01 7.700e-02 -12.945 < 2e-16 ***
-## Year_fct2016:StationCodeLIB 1.200e+00 9.360e-02 12.823 < 2e-16 ***
-## Year_fct2017:StationCodeLIB 4.159e-01 9.208e-02 4.517 6.67e-06 ***
-## Year_fct2018:StationCodeLIB -1.560e+00 1.125e-01 -13.868 < 2e-16 ***
-## Year_fct2019:StationCodeLIB -3.847e-01 1.150e-01 -3.346 0.000837 ***
-## Year_fct2016:StationCodeRVB 1.489e+00 2.227e-01 6.686 3.02e-11 ***
-## Year_fct2017:StationCodeRVB 6.563e-01 1.998e-01 3.285 0.001037 **
-## Year_fct2018:StationCodeRVB -8.074e-02 1.636e-01 -0.494 0.621625
-## Year_fct2019:StationCodeRVB -6.171e-01 1.962e-01 -3.145 0.001686 **
-## Flow:StationCodeSTTD 3.308e-03 1.377e-04 24.020 < 2e-16 ***
-## Flow:StationCodeLIB 1.427e-03 1.140e-04 12.513 < 2e-16 ***
-## Flow:StationCodeRVB 1.565e-03 9.550e-05 16.386 < 2e-16 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.3979 on 1893 degrees of freedom
-## Multiple R-squared: 0.8389, Adjusted R-squared: 0.8366
-## F-statistic: 365.2 on 27 and 1893 DF, p-value: < 2.2e-16
-df_chla_c2_lag %>%
- drop_na(Chla_log) %>%
- plot_lm_diag(Chla_log, m_lm2)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm2))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm2)
-## W = 0.97046, p-value < 2.2e-16
-acf(residuals(m_lm2))
-
-Box.test(residuals(m_lm2), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2)
-## X-squared = 4904.7, df = 20, p-value < 2.2e-16
-The residuals deviate from a normal distribution according to visual -inspection and the Shapiro-Wilk normality test. Also, model definitely -has residual autocorrelation as indicated by the ACF plot and the -Box-Ljung test.
-Now, we’ll try to deal with the residual autocorrelation and the -non-normal residuals. We’ll run a series of linear models adding 1, 2, -and 3 lag terms and compare how well they correct for -autocorrelation.
-m_lm2_lag1 <- df_chla_c2_lag %>%
- drop_na(Chla_log, lag1) %>%
- lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1, data = .)
-
-acf(residuals(m_lm2_lag1))
-
-Box.test(residuals(m_lm2_lag1), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_lag1)
-## X-squared = 58.308, df = 20, p-value = 1.297e-05
-m_lm2_lag2 <- df_chla_c2_lag %>%
- drop_na(Chla_log, lag1, lag2) %>%
- lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2, data = .)
-
-acf(residuals(m_lm2_lag2))
-
-Box.test(residuals(m_lm2_lag2), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_lag2)
-## X-squared = 34.749, df = 20, p-value = 0.02148
-m_lm2_lag3 <- df_chla_c2_lag %>%
- drop_na(Chla_log, lag1, lag2, lag3) %>%
- lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2 + lag3, data = .)
-
-acf(residuals(m_lm2_lag3))
-
-Box.test(residuals(m_lm2_lag3), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_lag3)
-## X-squared = 28.624, df = 20, p-value = 0.09542
-The model with 3 lag terms looks to have slightly better ACF and -Box-Ljung test results than the other models. Let’s compare the 3 models -using AIC and BIC to see which model those prefer.
-AIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
-## df AIC
-## m_lm2_lag1 30 -1331.263
-## m_lm2_lag2 31 -1317.974
-## m_lm2_lag3 32 -1299.688
-BIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
-## df BIC
-## m_lm2_lag1 30 -1165.189
-## m_lm2_lag2 31 -1147.083
-## m_lm2_lag3 32 -1123.919
-Both AIC and BIC prefer the model with 1 lag term. Let’s take a -closer look at that one.
-summary(m_lm2_lag1)
-##
-## Call:
-## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1,
-## data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -1.10990 -0.05742 -0.00637 0.05263 1.08891
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 8.621e-01 9.706e-02 8.882 < 2e-16 ***
-## Year_fct2016 -4.466e-02 2.467e-02 -1.810 0.070432 .
-## Year_fct2017 -3.920e-02 2.357e-02 -1.664 0.096380 .
-## Year_fct2018 -2.570e-02 2.248e-02 -1.143 0.253192
-## Year_fct2019 4.008e-04 2.248e-02 0.018 0.985774
-## Flow -1.624e-04 4.370e-05 -3.717 0.000208 ***
-## StationCodeSTTD -8.197e-02 2.571e-02 -3.188 0.001456 **
-## StationCodeLIB -1.210e-01 5.403e-02 -2.239 0.025270 *
-## StationCodeRVB -1.584e-01 5.049e-02 -3.137 0.001736 **
-## lag1 9.106e-01 9.917e-03 91.816 < 2e-16 ***
-## Year_fct2016:Flow -7.998e-06 1.364e-05 -0.586 0.557696
-## Year_fct2017:Flow -1.197e-06 1.167e-05 -0.103 0.918296
-## Year_fct2018:Flow 3.085e-06 1.166e-05 0.265 0.791304
-## Year_fct2019:Flow 4.247e-07 1.252e-05 0.034 0.972946
-## Year_fct2016:StationCodeSTTD 6.232e-02 3.561e-02 1.750 0.080298 .
-## Year_fct2017:StationCodeSTTD 7.283e-02 3.689e-02 1.974 0.048527 *
-## Year_fct2018:StationCodeSTTD -3.915e-02 3.345e-02 -1.170 0.242020
-## Year_fct2019:StationCodeSTTD -1.145e-01 3.422e-02 -3.345 0.000838 ***
-## Year_fct2016:StationCodeLIB 9.269e-02 4.175e-02 2.220 0.026522 *
-## Year_fct2017:StationCodeLIB 2.497e-02 3.947e-02 0.632 0.527175
-## Year_fct2018:StationCodeLIB -2.167e-01 5.218e-02 -4.152 3.45e-05 ***
-## Year_fct2019:StationCodeLIB -6.106e-02 5.164e-02 -1.182 0.237206
-## Year_fct2016:StationCodeRVB 1.449e-01 9.716e-02 1.491 0.136079
-## Year_fct2017:StationCodeRVB 5.911e-02 8.512e-02 0.694 0.487457
-## Year_fct2018:StationCodeRVB -5.716e-03 6.935e-02 -0.082 0.934320
-## Year_fct2019:StationCodeRVB -4.787e-02 8.441e-02 -0.567 0.570736
-## Flow:StationCodeSTTD 3.526e-04 6.676e-05 5.281 1.43e-07 ***
-## Flow:StationCodeLIB 1.992e-04 5.072e-05 3.928 8.89e-05 ***
-## Flow:StationCodeRVB 1.567e-04 4.326e-05 3.622 0.000300 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.1682 on 1845 degrees of freedom
-## Multiple R-squared: 0.9706, Adjusted R-squared: 0.9701
-## F-statistic: 2172 on 28 and 1845 DF, p-value: < 2.2e-16
-df_chla_c2_lag %>%
- drop_na(Chla_log, lag1) %>%
- plot_lm_diag(Chla_log, m_lm2_lag1)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm2_lag1))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm2_lag1)
-## W = 0.86339, p-value < 2.2e-16
-As with the model with 3-way interactions, the residuals deviate from -a normal distribution particularly at the both tails of the -distribution. I’m not so sure about using this model. Let’s take a -closer look at how the back-transformed fitted values from the model -match the observed values.
-df_lm2_lag1_fit <- df_chla_c2_lag %>%
- drop_na(Chla, lag1) %>%
- mutate(Fitted = as_tibble(predict(m_lm2_lag1, interval = "confidence"))) %>%
- unnest_wider(Fitted) %>%
- mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
-
-plt_lm2_lag1_obs_fit <- df_lm2_lag1_fit %>%
- ggplot(aes(x = fit, y = Chla)) +
- geom_point() +
- geom_abline(slope = 1, intercept = 0, color = "red") +
- theme_bw() +
- labs(
- x = "Back-transformed Fitted Values",
- y = "Observed Values"
- )
-
-plt_lm2_lag1_obs_fit
-
-Let’s look at each Station and Year separately.
-plt_lm2_lag1_obs_fit + facet_wrap(vars(StationCode), scales = "free")
-
-plt_lm2_lag1_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
-
-The red lines are the 1:1 ratio between fitted and observed values, -and we would expect for most of the points to be near this line if the -model has a good fit to the observed data. The fitted and observed -values appear to match pretty well at the lower end of the range of -values, but this deteriorates at the mid and higher range values. This -pattern holds for some of the individual Stations and Years. I’m not -sure this linear model is a good candidate to use for our analysis.
-Let’s take a closer look at the model with 3 lag terms since that had -the least autocorrelation.
-summary(m_lm2_lag3)
-##
-## Call:
-## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 +
-## lag2 + lag3, data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -1.11082 -0.05607 -0.00690 0.05255 1.12323
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 9.973e-01 1.037e-01 9.616 < 2e-16 ***
-## Year_fct2016 -4.170e-02 2.478e-02 -1.682 0.092657 .
-## Year_fct2017 -3.659e-02 2.365e-02 -1.547 0.122072
-## Year_fct2018 -2.643e-02 2.255e-02 -1.172 0.241268
-## Year_fct2019 4.395e-03 2.253e-02 0.195 0.845349
-## Flow -1.748e-04 4.378e-05 -3.994 6.77e-05 ***
-## StationCodeSTTD -8.615e-02 2.611e-02 -3.299 0.000988 ***
-## StationCodeLIB -1.599e-01 5.592e-02 -2.858 0.004309 **
-## StationCodeRVB -1.834e-01 5.086e-02 -3.607 0.000319 ***
-## lag1 9.620e-01 2.396e-02 40.159 < 2e-16 ***
-## lag2 -7.617e-03 3.309e-02 -0.230 0.817974
-## lag3 -5.869e-02 2.395e-02 -2.450 0.014368 *
-## Year_fct2016:Flow -9.558e-06 1.419e-05 -0.673 0.500724
-## Year_fct2017:Flow -2.255e-06 1.164e-05 -0.194 0.846434
-## Year_fct2018:Flow 3.211e-06 1.161e-05 0.277 0.782103
-## Year_fct2019:Flow 5.258e-06 1.279e-05 0.411 0.680965
-## Year_fct2016:StationCodeSTTD 7.111e-02 3.603e-02 1.973 0.048606 *
-## Year_fct2017:StationCodeSTTD 7.658e-02 3.753e-02 2.041 0.041445 *
-## Year_fct2018:StationCodeSTTD -4.966e-02 3.363e-02 -1.477 0.139987
-## Year_fct2019:StationCodeSTTD -1.379e-01 3.461e-02 -3.984 7.05e-05 ***
-## Year_fct2016:StationCodeLIB 1.011e-01 4.269e-02 2.369 0.017957 *
-## Year_fct2017:StationCodeLIB 2.753e-02 3.966e-02 0.694 0.487626
-## Year_fct2018:StationCodeLIB -2.237e-01 5.530e-02 -4.045 5.46e-05 ***
-## Year_fct2019:StationCodeLIB -6.390e-02 5.763e-02 -1.109 0.267669
-## Year_fct2016:StationCodeRVB 1.545e-01 1.041e-01 1.483 0.138162
-## Year_fct2017:StationCodeRVB 5.969e-02 8.535e-02 0.699 0.484390
-## Year_fct2018:StationCodeRVB -1.333e-02 6.896e-02 -0.193 0.846691
-## Year_fct2019:StationCodeRVB -1.159e-01 8.851e-02 -1.310 0.190487
-## Flow:StationCodeSTTD 3.796e-04 6.738e-05 5.634 2.04e-08 ***
-## Flow:StationCodeLIB 1.999e-04 5.141e-05 3.888 0.000105 ***
-## Flow:StationCodeRVB 1.698e-04 4.333e-05 3.919 9.23e-05 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.1669 on 1764 degrees of freedom
-## Multiple R-squared: 0.9702, Adjusted R-squared: 0.9697
-## F-statistic: 1914 on 30 and 1764 DF, p-value: < 2.2e-16
-df_chla_c2_lag %>%
- drop_na(Chla_log, lag1, lag2, lag3) %>%
- plot_lm_diag(Chla_log, m_lm2_lag3)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm2_lag3))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm2_lag3)
-## W = 0.85552, p-value < 2.2e-16
-The diagnostic plots show that the lag 3 model has similar problems -as the lag 1 model, so this doesn’t look like a good candidate for our -analysis as well.
-Since none of the models we tried so far seemed to fit the observed -data very well, it may be interesting to run separate models for each of -the four stations - RD22, STTD, LIB, RVB - to see if that helps improve -model fit. We’ll try STTD as a test case.
-Let’s try a linear model using a two-way interaction between Year and -Daily Average Flow. Initially, we’ll run the model without accounting -for serial autocorrelation.
-df_sttd <- df_chla_c2_lag %>% filter(StationCode == "STTD")
-
-m_lm2_sttd <- df_sttd %>%
- drop_na(Chla_log) %>%
- lm(Chla_log ~ Year_fct * Flow, data = .)
-Lets look at the model summary and diagnostics:
-summary(m_lm2_sttd)
-##
-## Call:
-## lm(formula = Chla_log ~ Year_fct * Flow, data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -0.78556 -0.17235 -0.02937 0.13767 1.54418
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 8.6179166 0.0295372 291.765 < 2e-16 ***
-## Year_fct2016 0.4259203 0.0459339 9.272 < 2e-16 ***
-## Year_fct2017 0.3103625 0.0675065 4.598 5.66e-06 ***
-## Year_fct2018 -0.6462877 0.0456684 -14.152 < 2e-16 ***
-## Year_fct2019 -1.0607084 0.0444363 -23.870 < 2e-16 ***
-## Flow 0.0027158 0.0002046 13.273 < 2e-16 ***
-## Year_fct2016:Flow -0.0013946 0.0002864 -4.869 1.59e-06 ***
-## Year_fct2017:Flow 0.0051503 0.0017706 2.909 0.00382 **
-## Year_fct2018:Flow -0.0004084 0.0002580 -1.583 0.11417
-## Year_fct2019:Flow -0.0014040 0.0002321 -6.051 3.19e-09 ***
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.3032 on 421 degrees of freedom
-## Multiple R-squared: 0.8091, Adjusted R-squared: 0.805
-## F-statistic: 198.2 on 9 and 421 DF, p-value: < 2.2e-16
-df_sttd %>%
- drop_na(Chla_log) %>%
- plot_lm_diag(Chla_log, m_lm2_sttd)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm2_sttd))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm2_sttd)
-## W = 0.94653, p-value = 2.342e-11
-acf(residuals(m_lm2_sttd))
-
-Box.test(residuals(m_lm2_sttd), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_sttd)
-## X-squared = 463.96, df = 20, p-value < 2.2e-16
-The residuals deviate from a normal distribution according to visual -inspection and the Shapiro-Wilk normality test. Also, model definitely -has residual autocorrelation as indicated by the ACF plot and the -Box-Ljung test.
-Now, we’ll try to deal with the residual autocorrelation and the -non-normal residuals. We’ll run a series of linear models adding 1, 2, -and 3 lag terms and compare how well they correct for -autocorrelation.
-m_lm2_sttd_lag1 <- df_sttd %>%
- drop_na(Chla_log, lag1) %>%
- lm(Chla_log ~ Year_fct * Flow + lag1, data = .)
-
-acf(residuals(m_lm2_sttd_lag1))
-
-Box.test(residuals(m_lm2_sttd_lag1), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_sttd_lag1)
-## X-squared = 37.182, df = 20, p-value = 0.01113
-m_lm2_sttd_lag2 <- df_sttd %>%
- drop_na(Chla_log, lag1, lag2) %>%
- lm(Chla_log ~ Year_fct * Flow + lag1 + lag2, data = .)
-
-acf(residuals(m_lm2_sttd_lag2))
-
-Box.test(residuals(m_lm2_sttd_lag2), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_sttd_lag2)
-## X-squared = 18.739, df = 20, p-value = 0.5389
-m_lm2_sttd_lag3 <- df_sttd %>%
- drop_na(Chla_log, lag1, lag2, lag3) %>%
- lm(Chla_log ~ Year_fct * Flow + lag1 + lag2 + lag3, data = .)
-
-acf(residuals(m_lm2_sttd_lag3))
-
-Box.test(residuals(m_lm2_sttd_lag3), lag = 20, type = 'Ljung-Box')
-##
-## Box-Ljung test
-##
-## data: residuals(m_lm2_sttd_lag3)
-## X-squared = 10.814, df = 20, p-value = 0.9509
-The model with 2 lag terms seems to take care of the autocorrelation. -Let’s compare the 3 models using AIC and BIC to see which model those -prefer.
-AIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
-## df AIC
-## m_lm2_sttd_lag1 12 -328.3102
-## m_lm2_sttd_lag2 13 -350.6190
-## m_lm2_sttd_lag3 14 -340.5244
-BIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
-## df BIC
-## m_lm2_sttd_lag1 12 -279.7701
-## m_lm2_sttd_lag2 13 -298.3142
-## m_lm2_sttd_lag3 14 -284.4699
-Both AIC and BIC prefer the model with 2 lag terms. Let’s take a -closer look at that one.
-summary(m_lm2_sttd_lag2)
-##
-## Call:
-## lm(formula = Chla_log ~ Year_fct * Flow + lag1 + lag2, data = .)
-##
-## Residuals:
-## Min 1Q Median 3Q Max
-## -0.88419 -0.06597 -0.00902 0.05115 0.81167
-##
-## Coefficients:
-## Estimate Std. Error t value Pr(>|t|)
-## (Intercept) 1.613e+00 2.285e-01 7.059 7.43e-12 ***
-## Year_fct2016 6.865e-02 2.667e-02 2.574 0.010415 *
-## Year_fct2017 1.177e-01 3.644e-02 3.230 0.001339 **
-## Year_fct2018 -1.292e-01 2.907e-02 -4.445 1.14e-05 ***
-## Year_fct2019 -2.085e-01 3.616e-02 -5.767 1.61e-08 ***
-## Flow 5.136e-04 1.324e-04 3.878 0.000123 ***
-## lag1 1.109e+00 4.925e-02 22.525 < 2e-16 ***
-## lag2 -2.955e-01 4.819e-02 -6.132 2.07e-09 ***
-## Year_fct2016:Flow -2.826e-04 1.555e-04 -1.818 0.069803 .
-## Year_fct2017:Flow 2.695e-03 1.001e-03 2.692 0.007402 **
-## Year_fct2018:Flow -5.969e-05 1.375e-04 -0.434 0.664474
-## Year_fct2019:Flow -2.504e-04 1.302e-04 -1.922 0.055260 .
-## ---
-## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-##
-## Residual standard error: 0.1556 on 401 degrees of freedom
-## Multiple R-squared: 0.9497, Adjusted R-squared: 0.9483
-## F-statistic: 687.9 on 11 and 401 DF, p-value: < 2.2e-16
-df_sttd %>%
- drop_na(Chla_log, lag1, lag2) %>%
- plot_lm_diag(Chla_log, m_lm2_sttd_lag2)
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-
-shapiro.test(residuals(m_lm2_sttd_lag2))
-##
-## Shapiro-Wilk normality test
-##
-## data: residuals(m_lm2_sttd_lag2)
-## W = 0.86609, p-value < 2.2e-16
-As with all the other linear models, the residuals deviate from a -normal distribution particularly at the both tails of the distribution. -I’m not so sure about using this model. Let’s take a closer look at how -the back-transformed fitted values from the model match the observed -values.
-df_lm2_sttd_lag2_fit <- df_sttd %>%
- drop_na(Chla, lag1, lag2) %>%
- mutate(Fitted = as_tibble(predict(m_lm2_sttd_lag2, interval = "confidence"))) %>%
- unnest_wider(Fitted) %>%
- mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
-
-plt_lm2_sttd_lag2_obs_fit <- df_lm2_sttd_lag2_fit %>%
- ggplot(aes(x = fit, y = Chla)) +
- geom_point() +
- geom_abline(slope = 1, intercept = 0, color = "red") +
- theme_bw() +
- labs(
- x = "Back-transformed Fitted Values",
- y = "Observed Values"
- )
-
-plt_lm2_sttd_lag2_obs_fit
-
-Let’s look at each Year separately.
-plt_lm2_sttd_lag2_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
-
-The red lines are the 1:1 ratio between fitted and observed values, -and we would expect for most of the points to be near this line if the -model has a good fit to the observed data. The fitted and observed -values appear to match pretty well at the lower end of the range of -values, but this deteriorates at the higher values. This pattern holds -for some of the individual Years. I’m not sure this linear model is a -good candidate to use for our analysis.
-None of the models we tried in this analysis seemed to fit the -observed data very well. It looks like daily average flow has an effect -on chlorophyll concentrations at some of the stations, but flow doesn’t -seem to be a good predictor overall. Unfortunately, I don’t think we -should use any of these models in our analysis.
-Explore and analyze the continuous chlorophyll data to be included in +the NDFS synthesis manuscript. We will attempt to fit various models to +the data set using daily average flow as a continuous predictor which +replaces the categorical predictor - flow action period - in the +original analysis. These models will only include representative +stations for 4 habitat types - upstream (RD22), lower Yolo Bypass +(STTD), Cache Slough complex (LIB), and downstream (RVB).
+# Load packages
+library(tidyverse)
+library(scales)
+library(knitr)
+library(mgcv)
+library(car)
+library(gratia)
+library(here)
+library(conflicted)
+
+# Source functions
+source(here("manuscript_synthesis/src/global_functions.R"))
+
+# Declare package conflict preferences
+conflicts_prefer(dplyr::filter(), dplyr::lag())
+Display current versions of R and packages used for this +analysis:
+devtools::session_info()
+## ─ Session info ───────────────────────────────────────────────────────────────
+## setting value
+## version R version 4.2.3 (2023-03-15 ucrt)
+## os Windows 10 x64 (build 19044)
+## system x86_64, mingw32
+## ui RTerm
+## language (EN)
+## collate English_United States.utf8
+## ctype English_United States.utf8
+## tz America/Los_Angeles
+## date 2023-12-11
+## pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
+##
+## ─ Packages ───────────────────────────────────────────────────────────────────
+## package * version date (UTC) lib source
+## abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
+## bslib 0.4.2 2022-12-16 [1] CRAN (R 4.2.2)
+## cachem 1.0.8 2023-05-01 [1] CRAN (R 4.2.3)
+## callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.2)
+## car * 3.1-2 2023-03-30 [1] CRAN (R 4.2.3)
+## carData * 3.0-5 2022-01-06 [1] CRAN (R 4.2.1)
+## cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.3)
+## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.2)
+## conflicted * 1.2.0 2023-02-01 [1] CRAN (R 4.2.2)
+## crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.1)
+## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.1)
+## digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.3)
+## dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.2.3)
+## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
+## evaluate 0.21 2023-05-05 [1] CRAN (R 4.2.3)
+## fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.2)
+## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.2)
+## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.2)
+## fs * 1.6.3 2023-07-20 [1] CRAN (R 4.2.3)
+## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
+## ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.2.3)
+## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
+## gratia * 0.8.1 2023-02-02 [1] CRAN (R 4.2.2)
+## gtable 0.3.4 2023-08-21 [1] CRAN (R 4.2.3)
+## here * 1.0.1 2020-12-13 [1] CRAN (R 4.2.1)
+## hms 1.1.3 2023-03-21 [1] CRAN (R 4.2.3)
+## htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.3)
+## htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.2.3)
+## httpuv 1.6.9 2023-02-14 [1] CRAN (R 4.2.2)
+## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.1)
+## jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.2.3)
+## knitr * 1.42 2023-01-25 [1] CRAN (R 4.2.2)
+## later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
+## lattice 0.21-8 2023-04-05 [1] CRAN (R 4.2.3)
+## lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.1)
+## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.2.3)
+## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
+## Matrix 1.5-4 2023-04-04 [1] CRAN (R 4.2.3)
+## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
+## mgcv * 1.8-42 2023-03-02 [1] CRAN (R 4.2.2)
+## mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
+## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
+## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
+## mvnfast 0.2.8 2023-02-23 [1] CRAN (R 4.2.2)
+## nlme * 3.1-162 2023-01-31 [2] CRAN (R 4.2.3)
+## patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.1)
+## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.3)
+## pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.2.3)
+## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
+## pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.2.3)
+## prettyunits 1.2.0 2023-09-24 [1] CRAN (R 4.2.3)
+## processx 3.8.2 2023-06-30 [1] CRAN (R 4.2.3)
+## profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.1)
+## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
+## ps 1.7.5 2023-04-18 [1] CRAN (R 4.2.3)
+## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.2.3)
+## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
+## Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.2.3)
+## readr * 2.1.4 2023-02-10 [1] CRAN (R 4.2.2)
+## remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.1)
+## rlang * 1.1.1 2023-04-28 [1] CRAN (R 4.2.3)
+## rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.2.3)
+## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.1)
+## rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
+## sass 0.4.6 2023-05-03 [1] CRAN (R 4.2.3)
+## scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.1)
+## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
+## shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.2)
+## stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.2)
+## stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.2)
+## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.3)
+## tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.2)
+## tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.1)
+## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.2.2)
+## timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.2)
+## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.2.3)
+## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.1)
+## usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
+## utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.2)
+## vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.3)
+## withr 2.5.1 2023-09-26 [1] CRAN (R 4.2.3)
+## xfun 0.39 2023-04-20 [1] CRAN (R 4.2.3)
+## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
+## yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
+##
+## [1] C:/R/win-library/4.2
+## [2] C:/Program Files/R/R-4.2.3/library
+##
+## ──────────────────────────────────────────────────────────────────────────────
+# Define file path for processed data
+fp_data <- here("manuscript_synthesis/data/processed")
+
+# Import daily average chlorophyll data
+df_chla <- readRDS(file.path(fp_data, "chla_wt_daily_avg_2013-2019.rds")) %>% select(-WaterTemp)
+
+# Import daily average flow data
+df_flow <- readRDS(file.path(fp_data, "flow_daily_avg_2013-2019.rds"))
+# Create a vector for the factor order of StationCode
+sta_order <- c("RD22", "STTD", "LIB", "RVB")
+
+# We will use LIS flow data as a proxy for STTD
+df_flow_c <- df_flow %>% mutate(StationCode = if_else(StationCode == "LIS", "STTD", StationCode))
+
+# Prepare chlorophyll and flow data for exploration and analysis
+df_chla_c1 <- df_chla %>%
+ # Filter to only include representative stations for 4 habitat types - RD22, STTD, LIB, RVB
+ filter(StationCode %in% sta_order) %>%
+ # Join flow data to chlorophyll data
+ left_join(df_flow_c, by = join_by(StationCode, Date)) %>%
+ # Remove all NA flow values
+ drop_na(Flow) %>%
+ mutate(
+ # Scale and log transform chlorophyll values
+ Chla_log = log(Chla * 1000),
+ # Add Year and DOY variables
+ Year = year(Date),
+ DOY = yday(Date),
+ # Apply factor order to StationCode
+ StationCode = factor(StationCode, levels = sta_order),
+ # Add a column for Year as a factor for the model
+ Year_fct = factor(Year)
+ ) %>%
+ arrange(StationCode, Date)
+df_chla_c1 %>%
+ summarize(
+ min_date = min(Date),
+ max_date = max(Date),
+ num_samples = n(),
+ .by = c(StationCode, Year)
+ ) %>%
+ arrange(StationCode, Year) %>%
+ kable()
+StationCode | +Year | +min_date | +max_date | +num_samples | +
---|---|---|---|---|
RD22 | +2014 | +2014-10-01 | +2014-11-08 | +39 | +
RD22 | +2015 | +2015-07-24 | +2015-11-10 | +110 | +
RD22 | +2016 | +2016-06-23 | +2016-09-16 | +86 | +
RD22 | +2017 | +2017-07-14 | +2017-11-03 | +101 | +
RD22 | +2018 | +2018-07-13 | +2018-11-11 | +122 | +
RD22 | +2019 | +2019-07-11 | +2019-11-06 | +118 | +
STTD | +2013 | +2013-08-15 | +2013-11-04 | +78 | +
STTD | +2014 | +2014-07-25 | +2014-10-19 | +87 | +
STTD | +2015 | +2015-07-27 | +2015-11-16 | +108 | +
STTD | +2016 | +2016-06-23 | +2016-09-16 | +83 | +
STTD | +2017 | +2017-07-14 | +2017-09-25 | +56 | +
STTD | +2018 | +2018-07-20 | +2018-10-11 | +84 | +
STTD | +2019 | +2019-07-26 | +2019-11-06 | +100 | +
LIB | +2014 | +2014-10-01 | +2014-11-08 | +35 | +
LIB | +2015 | +2015-07-06 | +2015-11-16 | +123 | +
LIB | +2016 | +2016-05-29 | +2016-09-16 | +99 | +
LIB | +2017 | +2017-07-14 | +2017-11-01 | +89 | +
LIB | +2018 | +2018-08-22 | +2018-10-22 | +26 | +
LIB | +2019 | +2019-07-11 | +2019-09-21 | +21 | +
RVB | +2013 | +2013-07-07 | +2013-11-17 | +134 | +
RVB | +2014 | +2014-07-25 | +2014-11-08 | +105 | +
RVB | +2015 | +2015-07-06 | +2015-11-16 | +134 | +
RVB | +2016 | +2016-05-29 | +2016-09-16 | +80 | +
RVB | +2017 | +2017-07-14 | +2017-11-03 | +113 | +
RVB | +2018 | +2018-07-13 | +2018-11-11 | +122 | +
RVB | +2019 | +2019-07-11 | +2019-11-06 | +99 | +
Looking at the sample counts and date ranges, we’ll only include +years 2015-2019 for the analysis.
+df_chla_c2 <- df_chla_c1 %>%
+ filter(Year %in% 2015:2019) %>%
+ mutate(Year_fct = fct_drop(Year_fct))
+
+# Create another dataframe that has up to 3 lag variables for chlorophyll to be
+ # used in the linear models
+df_chla_c2_lag <- df_chla_c2 %>%
+ # Fill in missing days for each StationCode-Year combination
+ group_by(StationCode, Year) %>%
+ complete(Date = seq.Date(min(Date), max(Date), by = "1 day")) %>%
+ # Create lag variables of scaled log transformed chlorophyll values
+ mutate(
+ lag1 = lag(Chla_log),
+ lag2 = lag(Chla_log, 2),
+ lag3 = lag(Chla_log, 3)
+ ) %>%
+ ungroup()
+Let’s explore the data with some plots. First, lets plot the data in +scatter plots of chlorophyll and flow faceted by Station and grouping +all years together.
+df_chla_c2 %>%
+ ggplot(aes(x = Flow, y = Chla)) +
+ geom_point() +
+ geom_smooth(formula = "y ~ x") +
+ facet_wrap(vars(StationCode), scales = "free") +
+ theme_bw()
+
+At first glance, I’m not sure how well flow is going to be able to +predict chlorophyll concentrations. At the furthest upstream station - +RD22 - chlorophyll appears to be highest at the lowest flows, but the +variation is at its maximum at the lowest flows. There may be some +dilution effect going on here at the higher flows. At STTD, there does +seem to be a modest increase in chlorophyll concentrations at the +mid-range flows. This pattern is even more obvious at LIB. There appears +to be no effect of flow on chlorophyll at RVB, but the range of +chlorophyll concentrations is narrow at this station (between 0 and +5).
+Let’s break these scatterplots apart by year to see how these +patterns vary annually.
+df_chla_c2 %>%
+ ggplot(aes(x = Flow, y = Chla)) +
+ geom_point() +
+ geom_smooth(formula = "y ~ x") +
+ facet_wrap(
+ vars(StationCode, Year),
+ ncol = 5,
+ scales = "free",
+ labeller = labeller(.multi_line = FALSE)
+ ) +
+ theme_bw()
+
+The patterns appear to vary annually at each station, which may +justify using a 3-way interaction.
+First, we will attempt to fit a generalized additive model (GAM) to +the data set to help account for seasonality in the data. We’ll try +running a GAM using a three-way interaction between Year, Daily Average +Flow, and Station, and a smooth term for day of year to account for +seasonality. Initially, we’ll run the GAM without accounting for serial +autocorrelation.
+m_gam3 <- gam(
+ Chla_log ~ Year_fct * Flow * StationCode + s(DOY),
+ data = df_chla_c2,
+ method = "REML"
+)
+Lets look at the model summary and diagnostics:
+summary(m_gam3)
+##
+## Family: gaussian
+## Link function: identity
+##
+## Formula:
+## Chla_log ~ Year_fct * Flow * StationCode + s(DOY)
+##
+## Parametric coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.734e+00 4.242e-02 229.451 < 2e-16 ***
+## Year_fct2016 -8.100e-01 6.510e-02 -12.442 < 2e-16 ***
+## Year_fct2017 -9.446e-01 9.333e-02 -10.120 < 2e-16 ***
+## Year_fct2018 -3.798e-01 5.586e-02 -6.800 1.42e-11 ***
+## Year_fct2019 -1.569e-01 5.549e-02 -2.827 0.004748 **
+## Flow -1.795e-03 1.810e-04 -9.916 < 2e-16 ***
+## StationCodeSTTD -1.068e+00 5.338e-02 -20.010 < 2e-16 ***
+## StationCodeLIB -2.164e+00 1.427e-01 -15.166 < 2e-16 ***
+## StationCodeRVB -2.181e+00 9.945e-02 -21.927 < 2e-16 ***
+## Year_fct2016:Flow 2.101e-03 2.965e-04 7.087 1.95e-12 ***
+## Year_fct2017:Flow 2.980e-02 4.868e-03 6.122 1.13e-09 ***
+## Year_fct2018:Flow 4.690e-04 2.456e-04 1.909 0.056383 .
+## Year_fct2019:Flow -8.769e-05 2.155e-04 -0.407 0.684079
+## Year_fct2016:StationCodeSTTD 1.073e+00 8.125e-02 13.208 < 2e-16 ***
+## Year_fct2017:StationCodeSTTD 1.077e+00 1.234e-01 8.724 < 2e-16 ***
+## Year_fct2018:StationCodeSTTD -2.829e-01 7.662e-02 -3.692 0.000229 ***
+## Year_fct2019:StationCodeSTTD -8.837e-01 7.505e-02 -11.774 < 2e-16 ***
+## Year_fct2016:StationCodeLIB 2.103e+00 2.559e-01 8.217 3.92e-16 ***
+## Year_fct2017:StationCodeLIB 9.232e-01 2.051e-01 4.501 7.20e-06 ***
+## Year_fct2018:StationCodeLIB -8.344e-01 2.888e-01 -2.889 0.003910 **
+## Year_fct2019:StationCodeLIB -2.348e-01 2.275e-01 -1.032 0.302235
+## Year_fct2016:StationCodeRVB 1.200e+00 2.292e-01 5.235 1.84e-07 ***
+## Year_fct2017:StationCodeRVB 9.410e-01 1.984e-01 4.743 2.27e-06 ***
+## Year_fct2018:StationCodeRVB 8.428e-02 1.543e-01 0.546 0.584870
+## Year_fct2019:StationCodeRVB -8.860e-01 1.847e-01 -4.797 1.74e-06 ***
+## Flow:StationCodeSTTD 4.763e-03 2.905e-04 16.396 < 2e-16 ***
+## Flow:StationCodeLIB 1.721e-03 2.024e-04 8.501 < 2e-16 ***
+## Flow:StationCodeRVB 1.817e-03 1.823e-04 9.964 < 2e-16 ***
+## Year_fct2016:Flow:StationCodeSTTD -3.755e-03 4.317e-04 -8.699 < 2e-16 ***
+## Year_fct2017:Flow:StationCodeSTTD -2.776e-02 5.251e-03 -5.285 1.40e-07 ***
+## Year_fct2018:Flow:StationCodeSTTD -8.361e-04 3.842e-04 -2.176 0.029658 *
+## Year_fct2019:Flow:StationCodeSTTD -1.479e-03 3.405e-04 -4.345 1.47e-05 ***
+## Year_fct2016:Flow:StationCodeLIB -1.778e-03 3.405e-04 -5.222 1.97e-07 ***
+## Year_fct2017:Flow:StationCodeLIB -2.991e-02 4.870e-03 -6.143 9.93e-10 ***
+## Year_fct2018:Flow:StationCodeLIB 2.326e-04 4.499e-04 0.517 0.605220
+## Year_fct2019:Flow:StationCodeLIB 3.155e-04 3.227e-04 0.978 0.328296
+## Year_fct2016:Flow:StationCodeRVB -2.159e-03 2.977e-04 -7.252 6.04e-13 ***
+## Year_fct2017:Flow:StationCodeRVB -2.983e-02 4.868e-03 -6.129 1.08e-09 ***
+## Year_fct2018:Flow:StationCodeRVB -4.803e-04 2.469e-04 -1.945 0.051881 .
+## Year_fct2019:Flow:StationCodeRVB 1.161e-04 2.175e-04 0.534 0.593549
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Approximate significance of smooth terms:
+## edf Ref.df F p-value
+## s(DOY) 7.386 8.342 40.88 <2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## R-sq.(adj) = 0.872 Deviance explained = 87.5%
+## -REML = 860.35 Scale est. = 0.11819 n = 1874
+appraise(m_gam3)
+
+shapiro.test(residuals(m_gam3))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_gam3)
+## W = 0.96852, p-value < 2.2e-16
+k.check(m_gam3)
+## k' edf k-index p-value
+## s(DOY) 9 7.385659 0.952863 0.02
+draw(m_gam3, select = 1, residuals = TRUE, rug = FALSE)
+
+plot(m_gam3, pages = 1, all.terms = TRUE)
+
+acf(residuals(m_gam3))
+
+Box.test(residuals(m_gam3), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_gam3)
+## X-squared = 3426.2, df = 20, p-value < 2.2e-16
+The model definitely has residual autocorrelation as indicated by the +ACF plot and the Box-Ljung test. Even though the p-value for the k-check +is less than 0.05, the smooth term for day of year appears to be +overfitted. Let’s try a smaller k-value for the smooth first, then lets +try to address the residual autocorrelation.
+m_gam3_k5 <- gam(
+ Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5),
+ data = df_chla_c2,
+ method = "REML"
+)
+
+summary(m_gam3_k5)
+##
+## Family: gaussian
+## Link function: identity
+##
+## Formula:
+## Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)
+##
+## Parametric coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.7278768 0.0421697 230.684 < 2e-16 ***
+## Year_fct2016 -0.7776765 0.0639758 -12.156 < 2e-16 ***
+## Year_fct2017 -0.9217677 0.0936371 -9.844 < 2e-16 ***
+## Year_fct2018 -0.3780679 0.0561613 -6.732 2.23e-11 ***
+## Year_fct2019 -0.1554310 0.0557794 -2.787 0.005383 **
+## Flow -0.0017682 0.0001785 -9.904 < 2e-16 ***
+## StationCodeSTTD -1.0661259 0.0535105 -19.924 < 2e-16 ***
+## StationCodeLIB -2.1691650 0.1428317 -15.187 < 2e-16 ***
+## StationCodeRVB -2.1846232 0.0998032 -21.889 < 2e-16 ***
+## Year_fct2016:Flow 0.0019713 0.0002923 6.743 2.07e-11 ***
+## Year_fct2017:Flow 0.0286811 0.0048738 5.885 4.73e-09 ***
+## Year_fct2018:Flow 0.0004480 0.0002470 1.814 0.069834 .
+## Year_fct2019:Flow -0.0000991 0.0002165 -0.458 0.647254
+## Year_fct2016:StationCodeSTTD 1.0643578 0.0815638 13.049 < 2e-16 ***
+## Year_fct2017:StationCodeSTTD 1.0619864 0.1239959 8.565 < 2e-16 ***
+## Year_fct2018:StationCodeSTTD -0.2896681 0.0770322 -3.760 0.000175 ***
+## Year_fct2019:StationCodeSTTD -0.8811363 0.0753268 -11.698 < 2e-16 ***
+## Year_fct2016:StationCodeLIB 2.1057016 0.2567401 8.202 4.41e-16 ***
+## Year_fct2017:StationCodeLIB 0.9481491 0.2051177 4.622 4.06e-06 ***
+## Year_fct2018:StationCodeLIB -0.7855803 0.2899198 -2.710 0.006798 **
+## Year_fct2019:StationCodeLIB -0.2662514 0.2284866 -1.165 0.244056
+## Year_fct2016:StationCodeRVB 1.1701753 0.2287826 5.115 3.47e-07 ***
+## Year_fct2017:StationCodeRVB 0.9672348 0.1981715 4.881 1.15e-06 ***
+## Year_fct2018:StationCodeRVB 0.0946581 0.1548241 0.611 0.541017
+## Year_fct2019:StationCodeRVB -0.8809824 0.1850921 -4.760 2.09e-06 ***
+## Flow:StationCodeSTTD 0.0047745 0.0002921 16.346 < 2e-16 ***
+## Flow:StationCodeLIB 0.0016888 0.0002007 8.416 < 2e-16 ***
+## Flow:StationCodeRVB 0.0017912 0.0001799 9.959 < 2e-16 ***
+## Year_fct2016:Flow:StationCodeSTTD -0.0037733 0.0004341 -8.692 < 2e-16 ***
+## Year_fct2017:Flow:StationCodeSTTD -0.0262643 0.0052481 -5.005 6.14e-07 ***
+## Year_fct2018:Flow:StationCodeSTTD -0.0008384 0.0003865 -2.169 0.030185 *
+## Year_fct2019:Flow:StationCodeSTTD -0.0015024 0.0003424 -4.388 1.21e-05 ***
+## Year_fct2016:Flow:StationCodeLIB -0.0016335 0.0003378 -4.836 1.44e-06 ***
+## Year_fct2017:Flow:StationCodeLIB -0.0287590 0.0048744 -5.900 4.32e-09 ***
+## Year_fct2018:Flow:StationCodeLIB 0.0002943 0.0004521 0.651 0.515145
+## Year_fct2019:Flow:StationCodeLIB 0.0002993 0.0003244 0.923 0.356341
+## Year_fct2016:Flow:StationCodeRVB -0.0020283 0.0002934 -6.914 6.48e-12 ***
+## Year_fct2017:Flow:StationCodeRVB -0.0287229 0.0048739 -5.893 4.50e-09 ***
+## Year_fct2018:Flow:StationCodeRVB -0.0004615 0.0002483 -1.859 0.063211 .
+## Year_fct2019:Flow:StationCodeRVB 0.0001259 0.0002185 0.576 0.564661
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Approximate significance of smooth terms:
+## edf Ref.df F p-value
+## s(DOY) 3.892 3.993 81.76 <2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## R-sq.(adj) = 0.87 Deviance explained = 87.3%
+## -REML = 867.13 Scale est. = 0.11965 n = 1874
+appraise(m_gam3_k5)
+
+shapiro.test(residuals(m_gam3_k5))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_gam3_k5)
+## W = 0.96615, p-value < 2.2e-16
+k.check(m_gam3_k5)
+## k' edf k-index p-value
+## s(DOY) 4 3.892262 0.9395093 0.0025
+draw(m_gam3_k5, select = 1, residuals = TRUE, rug = FALSE)
+
+plot(m_gam3_k5, pages = 1, all.terms = TRUE)
+
+acf(residuals(m_gam3_k5))
+
+Box.test(residuals(m_gam3_k5), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_gam3_k5)
+## X-squared = 3449.2, df = 20, p-value < 2.2e-16
+Changing the k-value to 5 seems to help reduce the “wiggliness” of +the smooth term for DOY, but the p-value for the k-check is still less +than 0.05. Despite this, lets proceed with a k-value of 5.
+Now, we’ll try to deal with the residual autocorrelation. We’ll run +AR(1), AR(2), AR(3), and AR(4) models and compare them using AIC.
+# Define model formula as an object
+f_gam3 <- as.formula("Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)")
+
+# Fit original model with k = 5 and three successive AR(p) models
+m_gamm3 <- gamm(
+ f_gam3,
+ data = df_chla_c2,
+ method = "REML"
+)
+
+m_gamm3_ar1 <- gamm(
+ f_gam3,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 1), # grouped by Year_fct and StationCode
+ method = "REML"
+)
+
+m_gamm3_ar2 <- gamm(
+ f_gam3,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 2),
+ method = "REML"
+)
+
+m_gamm3_ar3 <- gamm(
+ f_gam3,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 3),
+ method = "REML"
+)
+
+m_gamm3_ar4 <- gamm(
+ f_gam3,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 4),
+ method = "REML"
+)
+
+# Compare models
+anova(
+ m_gamm3$lme,
+ m_gamm3_ar1$lme,
+ m_gamm3_ar2$lme,
+ m_gamm3_ar3$lme,
+ m_gamm3_ar4$lme
+)
+## Model df AIC BIC logLik Test L.Ratio p-value
+## m_gamm3$lme 1 43 1820.2656 2057.3551 -867.1328
+## m_gamm3_ar1$lme 2 44 -725.2482 -482.6450 406.6241 1 vs 2 2547.5138 <.0001
+## m_gamm3_ar2$lme 3 45 -723.7345 -475.6176 406.8672 2 vs 3 0.4862 0.4856
+## m_gamm3_ar3$lme 4 46 -724.9241 -471.2935 408.4621 3 vs 4 3.1896 0.0741
+## m_gamm3_ar4$lme 5 47 -723.7127 -464.5684 408.8564 4 vs 5 0.7886 0.3745
+It looks like both AIC and BIC prefer the AR(1) model. Let’s take a +closer look at the AR(1) model.
+# Pull out the residuals and the GAM model
+resid_gamm3_ar1 <- residuals(m_gamm3_ar1$lme, type = "normalized")
+m_gamm3_ar1_gam <- m_gamm3_ar1$gam
+
+summary(m_gamm3_ar1_gam)
+##
+## Family: gaussian
+## Link function: identity
+##
+## Formula:
+## Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)
+##
+## Parametric coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.525e+00 3.065e-01 31.077 < 2e-16 ***
+## Year_fct2016 -6.067e-01 4.585e-01 -1.323 0.185863
+## Year_fct2017 -4.935e-01 4.661e-01 -1.059 0.289818
+## Year_fct2018 -2.740e-01 4.248e-01 -0.645 0.519064
+## Year_fct2019 1.705e-02 4.273e-01 0.040 0.968183
+## Flow -6.516e-04 2.937e-04 -2.219 0.026619 *
+## StationCodeSTTD -7.842e-01 4.322e-01 -1.815 0.069757 .
+## StationCodeLIB -1.671e+00 4.339e-01 -3.852 0.000121 ***
+## StationCodeRVB -1.779e+00 4.201e-01 -4.235 2.4e-05 ***
+## Year_fct2016:Flow 1.011e-03 5.967e-04 1.694 0.090507 .
+## Year_fct2017:Flow 1.339e-02 8.870e-03 1.510 0.131306
+## Year_fct2018:Flow 5.601e-05 5.438e-04 0.103 0.917981
+## Year_fct2019:Flow -3.069e-04 4.586e-04 -0.669 0.503503
+## Year_fct2016:StationCodeSTTD 9.683e-01 6.423e-01 1.508 0.131843
+## Year_fct2017:StationCodeSTTD 2.722e-01 6.782e-01 0.401 0.688202
+## Year_fct2018:StationCodeSTTD -5.477e-01 6.227e-01 -0.880 0.379244
+## Year_fct2019:StationCodeSTTD -1.146e+00 6.127e-01 -1.870 0.061617 .
+## Year_fct2016:StationCodeLIB 1.710e+00 6.473e-01 2.641 0.008333 **
+## Year_fct2017:StationCodeLIB 5.028e-01 6.538e-01 0.769 0.441911
+## Year_fct2018:StationCodeLIB -1.599e+00 7.123e-01 -2.244 0.024924 *
+## Year_fct2019:StationCodeLIB -7.410e-01 7.159e-01 -1.035 0.300801
+## Year_fct2016:StationCodeRVB 1.035e+00 6.537e-01 1.584 0.113479
+## Year_fct2017:StationCodeRVB 2.833e-01 6.466e-01 0.438 0.661318
+## Year_fct2018:StationCodeRVB -3.375e-02 6.055e-01 -0.056 0.955553
+## Year_fct2019:StationCodeRVB -6.441e-01 6.186e-01 -1.041 0.297865
+## Flow:StationCodeSTTD 1.370e-03 5.784e-04 2.369 0.017938 *
+## Flow:StationCodeLIB 7.042e-04 3.021e-04 2.331 0.019857 *
+## Flow:StationCodeRVB 6.539e-04 2.939e-04 2.225 0.026221 *
+## Year_fct2016:Flow:StationCodeSTTD -1.540e-03 9.500e-04 -1.621 0.105090
+## Year_fct2017:Flow:StationCodeSTTD -7.559e-03 9.022e-03 -0.838 0.402250
+## Year_fct2018:Flow:StationCodeSTTD 4.260e-04 9.270e-04 0.460 0.645894
+## Year_fct2019:Flow:StationCodeSTTD 8.869e-04 7.767e-04 1.142 0.253651
+## Year_fct2016:Flow:StationCodeLIB -8.492e-04 6.085e-04 -1.396 0.162993
+## Year_fct2017:Flow:StationCodeLIB -1.333e-02 8.871e-03 -1.502 0.133191
+## Year_fct2018:Flow:StationCodeLIB 5.668e-04 5.819e-04 0.974 0.330165
+## Year_fct2019:Flow:StationCodeLIB 4.780e-04 4.908e-04 0.974 0.330269
+## Year_fct2016:Flow:StationCodeRVB -1.030e-03 5.971e-04 -1.724 0.084817 .
+## Year_fct2017:Flow:StationCodeRVB -1.340e-02 8.870e-03 -1.511 0.131071
+## Year_fct2018:Flow:StationCodeRVB -6.689e-05 5.443e-04 -0.123 0.902196
+## Year_fct2019:Flow:StationCodeRVB 2.798e-04 4.591e-04 0.609 0.542337
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Approximate significance of smooth terms:
+## edf Ref.df F p-value
+## s(DOY) 3.806 3.806 14.49 <2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## R-sq.(adj) = 0.748
+## Scale est. = 0.3387 n = 1874
+appraise(m_gamm3_ar1_gam)
+
+shapiro.test(resid_gamm3_ar1)
+##
+## Shapiro-Wilk normality test
+##
+## data: resid_gamm3_ar1
+## W = 0.81636, p-value < 2.2e-16
+k.check(m_gamm3_ar1_gam)
+## k' edf k-index p-value
+## s(DOY) 4 3.806167 0.6413984 0
+draw(m_gamm3_ar1_gam, select = 1, residuals = TRUE, rug = FALSE)
+
+plot(m_gamm3_ar1_gam, pages = 1, all.terms = TRUE)
+
+acf(resid_gamm3_ar1)
+
+Box.test(resid_gamm3_ar1, lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: resid_gamm3_ar1
+## X-squared = 52.893, df = 20, p-value = 8.426e-05
+The AR(1) model has much less residual autocorrelation, and the +diagnostics plots look okay. Let’s take a closer look at how the +back-transformed fitted values from the model match the observed +values.
+df_gamm3_ar1_fit <- df_chla_c2 %>%
+ fitted_values(m_gamm3_ar1_gam, data = .) %>%
+ mutate(fitted_bt = exp(fitted) / 1000)
+
+plt_gamm3_ar1_obs_fit <- df_gamm3_ar1_fit %>%
+ ggplot(aes(x = fitted_bt, y = Chla)) +
+ geom_point() +
+ geom_abline(slope = 1, intercept = 0, color = "red") +
+ theme_bw() +
+ labs(
+ x = "Back-transformed Fitted Values",
+ y = "Observed Values"
+ )
+
+plt_gamm3_ar1_obs_fit
+
+Let’s look at each Station-Year combination separately.
+plt_gamm3_ar1_obs_fit +
+ facet_wrap(
+ vars(StationCode, Year_fct),
+ ncol = 5,
+ scales = "free",
+ labeller = labeller(.multi_line = FALSE)
+ )
+
+The red lines are the 1:1 ratio between fitted and observed values, +and we would expect for most of the points to be near this line if the +model has a good fit to the observed data. However, the fitted values +from the model don’t appear to match the observed values that well, and +there are some unusual patterns when we look at each Station-Year +combination separately. I don’t think this GAM model is a good candidate +to use for our analysis.
+Since the GAM model didn’t seem to fit the observed data that well, +let’s try a linear model using a three-way interaction between Year, +Daily Average Flow, and Station. Initially, we’ll run the model without +accounting for serial autocorrelation.
+m_lm3 <- df_chla_c2_lag %>%
+ drop_na(Chla_log) %>%
+ lm(Chla_log ~ Year_fct * Flow * StationCode, data = .)
+Lets look at the model summary and diagnostics:
+summary(m_lm3)
+##
+## Call:
+## lm(formula = Chla_log ~ Year_fct * Flow * StationCode, data = .)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -1.63942 -0.22564 -0.03365 0.19208 1.75991
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.6844243 0.0448372 215.991 < 2e-16 ***
+## Year_fct2016 -0.6624673 0.0673312 -9.839 < 2e-16 ***
+## Year_fct2017 -1.1888491 0.0996337 -11.932 < 2e-16 ***
+## Year_fct2018 -0.3418296 0.0607458 -5.627 2.11e-08 ***
+## Year_fct2019 -0.1257666 0.0602074 -2.089 0.036855 *
+## Flow -0.0018894 0.0001884 -10.027 < 2e-16 ***
+## StationCodeSTTD -1.0664657 0.0578059 -18.449 < 2e-16 ***
+## StationCodeLIB -2.2872235 0.1491890 -15.331 < 2e-16 ***
+## StationCodeRVB -2.2211596 0.1077729 -20.610 < 2e-16 ***
+## Year_fct2016:Flow 0.0023191 0.0003084 7.520 8.51e-14 ***
+## Year_fct2017:Flow 0.0462944 0.0050980 9.081 < 2e-16 ***
+## Year_fct2018:Flow 0.0003299 0.0002673 1.234 0.217261
+## Year_fct2019:Flow -0.0001397 0.0002337 -0.598 0.549941
+## Year_fct2016:StationCodeSTTD 1.0883529 0.0880500 12.361 < 2e-16 ***
+## Year_fct2017:StationCodeSTTD 1.4991116 0.1299212 11.539 < 2e-16 ***
+## Year_fct2018:StationCodeSTTD -0.3045441 0.0828996 -3.674 0.000246 ***
+## Year_fct2019:StationCodeSTTD -0.9349727 0.0814723 -11.476 < 2e-16 ***
+## Year_fct2016:StationCodeLIB 1.9616137 0.2700048 7.265 5.49e-13 ***
+## Year_fct2017:StationCodeLIB 1.3396481 0.2174563 6.161 8.89e-10 ***
+## Year_fct2018:StationCodeLIB -0.7670441 0.3102772 -2.472 0.013521 *
+## Year_fct2019:StationCodeLIB -0.2261882 0.2427049 -0.932 0.351487
+## Year_fct2016:StationCodeRVB 1.9708585 0.2207415 8.928 < 2e-16 ***
+## Year_fct2017:StationCodeRVB 1.4768042 0.2100894 7.029 2.91e-12 ***
+## Year_fct2018:StationCodeRVB -0.0019108 0.1628429 -0.012 0.990639
+## Year_fct2019:StationCodeRVB -0.7680363 0.1962837 -3.913 9.45e-05 ***
+## Flow:StationCodeSTTD 0.0046042 0.0003152 14.606 < 2e-16 ***
+## Flow:StationCodeLIB 0.0017082 0.0002082 8.206 4.25e-16 ***
+## Flow:StationCodeRVB 0.0019282 0.0001897 10.163 < 2e-16 ***
+## Year_fct2016:Flow:StationCodeSTTD -0.0037128 0.0004693 -7.911 4.39e-15 ***
+## Year_fct2017:Flow:StationCodeSTTD -0.0411458 0.0055472 -7.417 1.82e-13 ***
+## Year_fct2018:Flow:StationCodeSTTD -0.0007371 0.0004159 -1.772 0.076508 .
+## Year_fct2019:Flow:StationCodeSTTD -0.0012635 0.0003698 -3.417 0.000648 ***
+## Year_fct2016:Flow:StationCodeLIB -0.0021272 0.0003563 -5.970 2.83e-09 ***
+## Year_fct2017:Flow:StationCodeLIB -0.0463035 0.0051002 -9.079 < 2e-16 ***
+## Year_fct2018:Flow:StationCodeLIB 0.0005066 0.0004884 1.037 0.299825
+## Year_fct2019:Flow:StationCodeLIB 0.0001596 0.0003482 0.458 0.646694
+## Year_fct2016:Flow:StationCodeRVB -0.0024657 0.0003099 -7.957 3.07e-15 ***
+## Year_fct2017:Flow:StationCodeRVB -0.0463714 0.0050981 -9.096 < 2e-16 ***
+## Year_fct2018:Flow:StationCodeRVB -0.0003456 0.0002686 -1.287 0.198288
+## Year_fct2019:Flow:StationCodeRVB 0.0001395 0.0002354 0.593 0.553553
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 0.3745 on 1834 degrees of freedom
+## Multiple R-squared: 0.8512, Adjusted R-squared: 0.848
+## F-statistic: 268.9 on 39 and 1834 DF, p-value: < 2.2e-16
+df_chla_c2_lag %>%
+ drop_na(Chla_log) %>%
+ plot_lm_diag(Chla_log, m_lm3)
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+shapiro.test(residuals(m_lm3))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_lm3)
+## W = 0.96914, p-value < 2.2e-16
+acf(residuals(m_lm3))
+
+Box.test(residuals(m_lm3), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm3)
+## X-squared = 3897.6, df = 20, p-value < 2.2e-16
+The residuals deviate from a normal distribution according to visual +inspection and the Shapiro-Wilk normality test. Also, model definitely +has residual autocorrelation as indicated by the ACF plot and the +Box-Ljung test.
+Now, we’ll try to deal with the residual autocorrelation and the +non-normal residuals. We’ll run a series of linear models adding 1, 2, +and 3 lag terms and compare how well they correct for +autocorrelation.
+m_lm3_lag1 <- df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1) %>%
+ lm(Chla_log ~ Year_fct * Flow * StationCode + lag1, data = .)
+
+acf(residuals(m_lm3_lag1))
+
+Box.test(residuals(m_lm3_lag1), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm3_lag1)
+## X-squared = 59.819, df = 20, p-value = 7.595e-06
+m_lm3_lag2 <- df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1, lag2) %>%
+ lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2, data = .)
+
+acf(residuals(m_lm3_lag2))
+
+Box.test(residuals(m_lm3_lag2), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm3_lag2)
+## X-squared = 46.699, df = 20, p-value = 0.0006458
+m_lm3_lag3 <- df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1, lag2, lag3) %>%
+ lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2 + lag3, data = .)
+
+acf(residuals(m_lm3_lag3))
+
+Box.test(residuals(m_lm3_lag3), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm3_lag3)
+## X-squared = 26.866, df = 20, p-value = 0.1391
+The model with 3 lag terms looks like it has better ACF and Box-Ljung +test results than the other models. Let’s compare the 3 models using AIC +and BIC to see which model those prefer.
+AIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
+## df AIC
+## m_lm3_lag1 42 -1343.149
+## m_lm3_lag2 43 -1328.212
+## m_lm3_lag3 44 -1352.918
+BIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
+## df BIC
+## m_lm3_lag1 42 -1111.573
+## m_lm3_lag2 43 -1092.047
+## m_lm3_lag3 44 -1112.153
+Both AIC and BIC prefer the model with 3 lag terms. Let’s take a +closer look at that one.
+summary(m_lm3_lag3)
+##
+## Call:
+## lm(formula = Chla_log ~ Year_fct * Flow * StationCode + lag1 +
+## lag2 + lag3, data = .)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -0.89652 -0.05739 -0.00662 0.05039 1.11198
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 1.191e+00 1.103e-01 10.799 < 2e-16 ***
+## Year_fct2016 -9.858e-02 3.059e-02 -3.223 0.001294 **
+## Year_fct2017 -1.589e-01 4.601e-02 -3.454 0.000566 ***
+## Year_fct2018 -5.669e-02 2.710e-02 -2.092 0.036584 *
+## Year_fct2019 -4.812e-03 2.687e-02 -0.179 0.857864
+## Flow -3.297e-04 8.442e-05 -3.905 9.78e-05 ***
+## StationCodeSTTD -1.253e-01 2.819e-02 -4.443 9.44e-06 ***
+## StationCodeLIB -2.304e-01 7.029e-02 -3.277 0.001069 **
+## StationCodeRVB -2.481e-01 5.311e-02 -4.672 3.21e-06 ***
+## lag1 9.650e-01 2.412e-02 40.016 < 2e-16 ***
+## lag2 -3.050e-02 3.321e-02 -0.919 0.358431
+## lag3 -5.694e-02 2.389e-02 -2.383 0.017260 *
+## Year_fct2016:Flow 3.732e-04 1.365e-04 2.735 0.006310 **
+## Year_fct2017:Flow 6.200e-03 2.353e-03 2.635 0.008500 **
+## Year_fct2018:Flow 1.793e-04 1.167e-04 1.536 0.124654
+## Year_fct2019:Flow 7.924e-05 1.024e-04 0.774 0.439015
+## Year_fct2016:StationCodeSTTD 1.428e-01 4.096e-02 3.486 0.000503 ***
+## Year_fct2017:StationCodeSTTD 2.660e-01 6.001e-02 4.433 9.90e-06 ***
+## Year_fct2018:StationCodeSTTD -3.452e-02 3.698e-02 -0.933 0.350725
+## Year_fct2019:StationCodeSTTD -1.405e-01 3.789e-02 -3.708 0.000215 ***
+## Year_fct2016:StationCodeLIB 2.629e-01 1.208e-01 2.175 0.029743 *
+## Year_fct2017:StationCodeLIB 1.362e-01 9.987e-02 1.364 0.172741
+## Year_fct2018:StationCodeLIB 1.543e-02 1.486e-01 0.104 0.917311
+## Year_fct2019:StationCodeLIB -6.836e-02 2.610e-01 -0.262 0.793414
+## Year_fct2016:StationCodeRVB 2.661e-01 1.077e-01 2.472 0.013541 *
+## Year_fct2017:StationCodeRVB 1.889e-01 9.380e-02 2.014 0.044143 *
+## Year_fct2018:StationCodeRVB 1.962e-02 7.109e-02 0.276 0.782619
+## Year_fct2019:StationCodeRVB -1.364e-01 9.245e-02 -1.476 0.140206
+## Flow:StationCodeSTTD 6.714e-04 1.514e-04 4.434 9.84e-06 ***
+## Flow:StationCodeLIB 3.450e-04 9.266e-05 3.724 0.000203 ***
+## Flow:StationCodeRVB 3.269e-04 8.507e-05 3.843 0.000126 ***
+## Year_fct2016:Flow:StationCodeSTTD -5.817e-04 2.122e-04 -2.741 0.006181 **
+## Year_fct2017:Flow:StationCodeSTTD -3.665e-03 2.588e-03 -1.416 0.156920
+## Year_fct2018:Flow:StationCodeSTTD -2.008e-04 1.860e-04 -1.079 0.280640
+## Year_fct2019:Flow:StationCodeSTTD -2.288e-04 1.667e-04 -1.373 0.170039
+## Year_fct2016:Flow:StationCodeLIB -3.224e-04 1.577e-04 -2.044 0.041063 *
+## Year_fct2017:Flow:StationCodeLIB -6.223e-03 2.355e-03 -2.643 0.008297 **
+## Year_fct2018:Flow:StationCodeLIB 2.158e-05 2.375e-04 0.091 0.927610
+## Year_fct2019:Flow:StationCodeLIB -1.214e-04 3.530e-04 -0.344 0.730859
+## Year_fct2016:Flow:StationCodeRVB -3.897e-04 1.375e-04 -2.835 0.004629 **
+## Year_fct2017:Flow:StationCodeRVB -6.205e-03 2.354e-03 -2.636 0.008455 **
+## Year_fct2018:Flow:StationCodeRVB -1.783e-04 1.173e-04 -1.520 0.128795
+## Year_fct2019:Flow:StationCodeRVB -7.338e-05 1.032e-04 -0.711 0.477174
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 0.1626 on 1715 degrees of freedom
+## Multiple R-squared: 0.9709, Adjusted R-squared: 0.9702
+## F-statistic: 1361 on 42 and 1715 DF, p-value: < 2.2e-16
+df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1, lag2, lag3) %>%
+ plot_lm_diag(Chla_log, m_lm3_lag3)
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+shapiro.test(residuals(m_lm3_lag3))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_lm3_lag3)
+## W = 0.85329, p-value < 2.2e-16
+The residuals deviate from a normal distribution particularly at the +both tails of the distribution. I’m not so sure about using this model. +Let’s take a closer look at how the back-transformed fitted values from +the model match the observed values.
+df_lm3_lag3_fit <- df_chla_c2_lag %>%
+ drop_na(Chla, lag1, lag2, lag3) %>%
+ mutate(Fitted = as_tibble(predict(m_lm3_lag3, interval = "confidence"))) %>%
+ unnest_wider(Fitted) %>%
+ mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
+
+plt_lm3_lag3_obs_fit <- df_lm3_lag3_fit %>%
+ ggplot(aes(x = fit, y = Chla)) +
+ geom_point() +
+ geom_abline(slope = 1, intercept = 0, color = "red") +
+ theme_bw() +
+ labs(
+ x = "Back-transformed Fitted Values",
+ y = "Observed Values"
+ )
+
+plt_lm3_lag3_obs_fit
+
+Let’s look at each Station-Year combination separately.
+plt_lm3_lag3_obs_fit +
+ facet_wrap(
+ vars(StationCode, Year_fct),
+ ncol = 5,
+ scales = "free",
+ labeller = labeller(.multi_line = FALSE)
+ )
+
+The red lines are the 1:1 ratio between fitted and observed values, +and we would expect for most of the points to be near this line if the +model has a good fit to the observed data. The fitted and observed +values appear to match pretty well at the lower end of the range of +values, but this deteriorates at the mid and higher range values. This +pattern holds for some of the separate Station-Year combinations. I’m +not sure this linear model is a good candidate to use for our +analysis.
+Because the models using 3-way interactions don’t seem to fit the +data well, we will attempt to fit a generalized additive model (GAM) to +the data set using all two-way interactions between Year, Daily Average +Flow, and Station, with a smooth term for day of year to account for +seasonality. Initially, we’ll run the GAM without accounting for serial +autocorrelation.
+m_gam2 <- gam(
+ Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5),
+ data = df_chla_c2,
+ method = "REML"
+)
+Lets look at the model summary and diagnostics:
+summary(m_gam2)
+##
+## Family: gaussian
+## Link function: identity
+##
+## Formula:
+## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)
+##
+## Parametric coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.687e+00 3.720e-02 260.381 < 2e-16 ***
+## Year_fct2016 -5.422e-01 5.369e-02 -10.098 < 2e-16 ***
+## Year_fct2017 -4.293e-01 5.124e-02 -8.377 < 2e-16 ***
+## Year_fct2018 -3.173e-01 4.772e-02 -6.650 3.86e-11 ***
+## Year_fct2019 -1.779e-01 4.815e-02 -3.694 0.000227 ***
+## Flow -1.438e-03 9.087e-05 -15.824 < 2e-16 ***
+## StationCodeSTTD -9.908e-01 5.067e-02 -19.553 < 2e-16 ***
+## StationCodeLIB -1.959e+00 1.108e-01 -17.684 < 2e-16 ***
+## StationCodeRVB -2.136e+00 9.771e-02 -21.860 < 2e-16 ***
+## Year_fct2016:Flow -4.743e-05 3.105e-05 -1.528 0.126764
+## Year_fct2017:Flow -4.247e-05 2.509e-05 -1.693 0.090640 .
+## Year_fct2018:Flow -1.286e-05 2.536e-05 -0.507 0.612267
+## Year_fct2019:Flow 1.000e-05 2.711e-05 0.369 0.712110
+## Year_fct2016:StationCodeSTTD 7.459e-01 7.431e-02 10.038 < 2e-16 ***
+## Year_fct2017:StationCodeSTTD 4.306e-01 7.903e-02 5.449 5.76e-08 ***
+## Year_fct2018:StationCodeSTTD -3.358e-01 7.154e-02 -4.694 2.88e-06 ***
+## Year_fct2019:StationCodeSTTD -9.471e-01 7.007e-02 -13.516 < 2e-16 ***
+## Year_fct2016:StationCodeLIB 1.303e+00 8.601e-02 15.152 < 2e-16 ***
+## Year_fct2017:StationCodeLIB 4.316e-01 8.611e-02 5.012 5.92e-07 ***
+## Year_fct2018:StationCodeLIB -1.431e+00 1.135e-01 -12.609 < 2e-16 ***
+## Year_fct2019:StationCodeLIB -4.886e-01 1.182e-01 -4.135 3.71e-05 ***
+## Year_fct2016:StationCodeRVB 8.502e-01 2.321e-01 3.663 0.000257 ***
+## Year_fct2017:StationCodeRVB 4.864e-01 1.847e-01 2.633 0.008531 **
+## Year_fct2018:StationCodeRVB 3.244e-02 1.527e-01 0.212 0.831772
+## Year_fct2019:StationCodeRVB -7.110e-01 1.831e-01 -3.884 0.000106 ***
+## Flow:StationCodeSTTD 3.329e-03 1.253e-04 26.579 < 2e-16 ***
+## Flow:StationCodeLIB 1.465e-03 1.114e-04 13.151 < 2e-16 ***
+## Flow:StationCodeRVB 1.460e-03 8.962e-05 16.292 < 2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Approximate significance of smooth terms:
+## edf Ref.df F p-value
+## s(DOY) 3.88 3.992 93.39 <2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## R-sq.(adj) = 0.858 Deviance explained = 86.1%
+## -REML = 869.89 Scale est. = 0.13071 n = 1874
+appraise(m_gam2)
+
+shapiro.test(residuals(m_gam2))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_gam2)
+## W = 0.96845, p-value < 2.2e-16
+k.check(m_gam2)
+## k' edf k-index p-value
+## s(DOY) 4 3.880408 0.9386901 0
+draw(m_gam2, select = 1, residuals = TRUE, rug = FALSE)
+
+plot(m_gam2, pages = 1, all.terms = TRUE)
+
+acf(residuals(m_gam2))
+
+Box.test(residuals(m_gam2), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_gam2)
+## X-squared = 3957.6, df = 20, p-value < 2.2e-16
+Now, we’ll try to deal with the residual autocorrelation. We’ll run +AR(1), AR(2), AR(3), and AR(4) models and compare them using AIC.
+# Define model formula as an object
+f_gam2 <- as.formula("Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)")
+
+# Fit original model with k = 5 and three successive AR(p) models
+m_gamm2 <- gamm(
+ f_gam2,
+ data = df_chla_c2,
+ method = "REML"
+)
+
+m_gamm2_ar1 <- gamm(
+ f_gam2,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 1), # grouped by Year_fct and StationCode
+ method = "REML"
+)
+
+m_gamm2_ar2 <- gamm(
+ f_gam2,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 2),
+ method = "REML"
+)
+
+m_gamm2_ar3 <- gamm(
+ f_gam2,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 3),
+ method = "REML"
+)
+
+m_gamm2_ar4 <- gamm(
+ f_gam2,
+ data = df_chla_c2,
+ correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 4),
+ method = "REML"
+)
+
+# Compare models
+anova(
+ m_gamm2$lme,
+ m_gamm2_ar1$lme,
+ m_gamm2_ar2$lme,
+ m_gamm2_ar3$lme,
+ m_gamm2_ar4$lme
+)
+## Model df AIC BIC logLik Test L.Ratio p-value
+## m_gamm2$lme 1 31 1801.7891 1972.9163 -869.8945
+## m_gamm2_ar1$lme 2 32 -880.0401 -703.3926 472.0201 1 vs 2 2683.8292 <.0001
+## m_gamm2_ar2$lme 3 33 -878.0514 -695.8836 472.0257 2 vs 3 0.0112 0.9156
+## m_gamm2_ar3$lme 4 34 -878.0388 -690.3508 473.0194 3 vs 4 1.9874 0.1586
+## m_gamm2_ar4$lme 5 35 -877.7177 -684.5095 473.8589 4 vs 5 1.6789 0.1951
+It looks like the AR(1) model has the best fit according to the AIC +and BIC values. Let’s take a closer look at the AR(1) model.
+# Pull out the residuals and the GAM model
+resid_gamm2_ar1 <- residuals(m_gamm2_ar1$lme, type = "normalized")
+m_gamm2_ar1_gam <- m_gamm2_ar1$gam
+
+summary(m_gamm2_ar1_gam)
+##
+## Family: gaussian
+## Link function: identity
+##
+## Formula:
+## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)
+##
+## Parametric coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.524e+00 4.705e-01 20.245 < 2e-16 ***
+## Year_fct2016 -4.745e-01 6.930e-01 -0.685 0.493589
+## Year_fct2017 -1.946e-01 6.718e-01 -0.290 0.772141
+## Year_fct2018 -2.427e-01 6.521e-01 -0.372 0.709814
+## Year_fct2019 7.054e-02 6.557e-01 0.108 0.914341
+## Flow -5.558e-04 1.911e-04 -2.908 0.003687 **
+## StationCodeSTTD -7.488e-01 6.649e-01 -1.126 0.260262
+## StationCodeLIB -1.444e+00 6.549e-01 -2.204 0.027614 *
+## StationCodeRVB -1.664e+00 6.463e-01 -2.575 0.010103 *
+## Year_fct2016:Flow -1.617e-05 1.991e-05 -0.812 0.416926
+## Year_fct2017:Flow -4.077e-06 2.028e-05 -0.201 0.840700
+## Year_fct2018:Flow -2.836e-06 2.191e-05 -0.129 0.897009
+## Year_fct2019:Flow -2.267e-05 2.043e-05 -1.110 0.267297
+## Year_fct2016:StationCodeSTTD 8.515e-01 9.770e-01 0.872 0.383585
+## Year_fct2017:StationCodeSTTD -2.845e-01 9.961e-01 -0.286 0.775213
+## Year_fct2018:StationCodeSTTD -6.349e-01 9.530e-01 -0.666 0.505336
+## Year_fct2019:StationCodeSTTD -1.204e+00 9.406e-01 -1.280 0.200857
+## Year_fct2016:StationCodeLIB 1.355e+00 9.566e-01 1.417 0.156753
+## Year_fct2017:StationCodeLIB 6.824e-02 9.518e-01 0.072 0.942850
+## Year_fct2018:StationCodeLIB -2.291e+00 1.025e+00 -2.234 0.025580 *
+## Year_fct2019:StationCodeLIB -1.047e+00 1.039e+00 -1.008 0.313616
+## Year_fct2016:StationCodeRVB 9.658e-01 9.787e-01 0.987 0.323854
+## Year_fct2017:StationCodeRVB -5.525e-02 9.433e-01 -0.059 0.953298
+## Year_fct2018:StationCodeRVB -1.545e-01 9.189e-01 -0.168 0.866506
+## Year_fct2019:StationCodeRVB -7.854e-01 9.363e-01 -0.839 0.401700
+## Flow:StationCodeSTTD 1.554e-03 3.081e-04 5.044 5.01e-07 ***
+## Flow:StationCodeLIB 7.168e-04 1.967e-04 3.645 0.000275 ***
+## Flow:StationCodeRVB 5.545e-04 1.911e-04 2.902 0.003758 **
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Approximate significance of smooth terms:
+## edf Ref.df F p-value
+## s(DOY) 3.864 3.864 27.52 <2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## R-sq.(adj) = 0.643
+## Scale est. = 0.56753 n = 1874
+appraise(m_gamm2_ar1_gam)
+
+shapiro.test(resid_gamm2_ar1)
+##
+## Shapiro-Wilk normality test
+##
+## data: resid_gamm2_ar1
+## W = 0.80125, p-value < 2.2e-16
+k.check(m_gamm2_ar1_gam)
+## k' edf k-index p-value
+## s(DOY) 4 3.864181 0.4865552 0
+draw(m_gamm2_ar1_gam, select = 1, residuals = TRUE, rug = FALSE)
+
+plot(m_gamm2_ar1_gam, pages = 1, all.terms = TRUE)
+
+acf(resid_gamm2_ar1)
+
+Box.test(resid_gamm2_ar1, lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: resid_gamm2_ar1
+## X-squared = 47.503, df = 20, p-value = 0.0004993
+The AR(1) model has much less residual autocorrelation, and the +diagnostics plots look okay. Let’s take a closer look at how the +back-transformed fitted values from the model match the observed +values.
+df_gamm2_ar1_fit <- df_chla_c2 %>%
+ fitted_values(m_gamm2_ar1_gam, data = .) %>%
+ mutate(fitted_bt = exp(fitted) / 1000)
+
+plt_gamm2_ar1_obs_fit <- df_gamm2_ar1_fit %>%
+ ggplot(aes(x = fitted_bt, y = Chla)) +
+ geom_point() +
+ geom_abline(slope = 1, intercept = 0, color = "red") +
+ theme_bw() +
+ labs(
+ x = "Back-transformed Fitted Values",
+ y = "Observed Values"
+ )
+
+plt_gamm2_ar1_obs_fit
+
+Let’s look at each Station and Year separately.
+plt_gamm2_ar1_obs_fit + facet_wrap(vars(StationCode), scales = "free")
+
+plt_gamm2_ar1_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
+
+The red lines are the 1:1 ratio between fitted and observed values, +and we would expect for most of the points to be near this line if the +model has a good fit to the observed data. However, the fitted values +from the model don’t appear to match the observed values that well. +Again, I don’t think this GAM model is a good candidate to use for our +analysis.
+Since the models using 3-way interactions don’t seem to fit the +observed data that well, let’s try a linear model using all two-way +interactions between Year, Daily Average Flow, and Station. Initially, +we’ll run the model without accounting for serial autocorrelation.
+m_lm2 <- df_chla_c2_lag %>%
+ drop_na(Chla_log) %>%
+ lm(Chla_log ~ (Year_fct + Flow + StationCode)^2, data = .)
+Lets look at the model summary and diagnostics:
+summary(m_lm2)
+##
+## Call:
+## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2, data = .)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -1.65329 -0.23920 -0.03958 0.20123 1.96767
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 9.632e+00 4.015e-02 239.900 < 2e-16 ***
+## Year_fct2016 -3.555e-01 5.707e-02 -6.228 5.83e-10 ***
+## Year_fct2017 -4.068e-01 5.586e-02 -7.284 4.79e-13 ***
+## Year_fct2018 -2.925e-01 5.213e-02 -5.611 2.31e-08 ***
+## Year_fct2019 -1.467e-01 5.255e-02 -2.791 0.005313 **
+## Flow -1.525e-03 9.636e-05 -15.827 < 2e-16 ***
+## StationCodeSTTD -9.929e-01 5.532e-02 -17.948 < 2e-16 ***
+## StationCodeLIB -2.094e+00 1.172e-01 -17.863 < 2e-16 ***
+## StationCodeRVB -2.154e+00 1.067e-01 -20.194 < 2e-16 ***
+## Year_fct2016:Flow -1.258e-04 3.148e-05 -3.996 6.70e-05 ***
+## Year_fct2017:Flow -7.326e-05 2.717e-05 -2.696 0.007076 **
+## Year_fct2018:Flow -8.557e-06 2.722e-05 -0.314 0.753278
+## Year_fct2019:Flow -1.301e-05 2.923e-05 -0.445 0.656205
+## Year_fct2016:StationCodeSTTD 7.413e-01 8.117e-02 9.133 < 2e-16 ***
+## Year_fct2017:StationCodeSTTD 5.377e-01 8.588e-02 6.261 4.75e-10 ***
+## Year_fct2018:StationCodeSTTD -3.377e-01 7.771e-02 -4.345 1.47e-05 ***
+## Year_fct2019:StationCodeSTTD -9.913e-01 7.655e-02 -12.950 < 2e-16 ***
+## Year_fct2016:StationCodeLIB 1.196e+00 9.326e-02 12.829 < 2e-16 ***
+## Year_fct2017:StationCodeLIB 4.439e-01 9.307e-02 4.769 1.99e-06 ***
+## Year_fct2018:StationCodeLIB -1.446e+00 1.205e-01 -12.006 < 2e-16 ***
+## Year_fct2019:StationCodeLIB -3.061e-01 1.257e-01 -2.436 0.014943 *
+## Year_fct2016:StationCodeRVB 1.500e+00 2.213e-01 6.779 1.62e-11 ***
+## Year_fct2017:StationCodeRVB 6.775e-01 1.990e-01 3.404 0.000679 ***
+## Year_fct2018:StationCodeRVB -9.430e-02 1.626e-01 -0.580 0.561995
+## Year_fct2019:StationCodeRVB -6.203e-01 1.969e-01 -3.151 0.001653 **
+## Flow:StationCodeSTTD 3.304e-03 1.368e-04 24.145 < 2e-16 ***
+## Flow:StationCodeLIB 1.434e-03 1.151e-04 12.458 < 2e-16 ***
+## Flow:StationCodeRVB 1.560e-03 9.489e-05 16.443 < 2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 0.3952 on 1846 degrees of freedom
+## Multiple R-squared: 0.8332, Adjusted R-squared: 0.8308
+## F-statistic: 341.5 on 27 and 1846 DF, p-value: < 2.2e-16
+df_chla_c2_lag %>%
+ drop_na(Chla_log) %>%
+ plot_lm_diag(Chla_log, m_lm2)
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+shapiro.test(residuals(m_lm2))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_lm2)
+## W = 0.96829, p-value < 2.2e-16
+acf(residuals(m_lm2))
+
+Box.test(residuals(m_lm2), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2)
+## X-squared = 4598.1, df = 20, p-value < 2.2e-16
+The residuals deviate from a normal distribution according to visual +inspection and the Shapiro-Wilk normality test. Also, model definitely +has residual autocorrelation as indicated by the ACF plot and the +Box-Ljung test.
+Now, we’ll try to deal with the residual autocorrelation and the +non-normal residuals. We’ll run a series of linear models adding 1, 2, +and 3 lag terms and compare how well they correct for +autocorrelation.
+m_lm2_lag1 <- df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1) %>%
+ lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1, data = .)
+
+acf(residuals(m_lm2_lag1))
+
+Box.test(residuals(m_lm2_lag1), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_lag1)
+## X-squared = 59.714, df = 20, p-value = 7.884e-06
+m_lm2_lag2 <- df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1, lag2) %>%
+ lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2, data = .)
+
+acf(residuals(m_lm2_lag2))
+
+Box.test(residuals(m_lm2_lag2), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_lag2)
+## X-squared = 45.379, df = 20, p-value = 0.00098
+m_lm2_lag3 <- df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1, lag2, lag3) %>%
+ lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2 + lag3, data = .)
+
+acf(residuals(m_lm2_lag3))
+
+Box.test(residuals(m_lm2_lag3), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_lag3)
+## X-squared = 25.678, df = 20, p-value = 0.1767
+The model with 3 lag terms looks like it has better ACF and Box-Ljung +test results than the other models. Let’s compare the 3 models using AIC +and BIC to see which model those prefer.
+AIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
+## df AIC
+## m_lm2_lag1 30 -1341.836
+## m_lm2_lag2 31 -1328.270
+## m_lm2_lag3 32 -1351.047
+BIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
+## df BIC
+## m_lm2_lag1 30 -1176.425
+## m_lm2_lag2 31 -1158.012
+## m_lm2_lag3 32 -1175.945
+AIC prefers the model with 3 lag terms while BIC slightly prefers the +model with 1 lag term. Let’s take a closer look at the model with 3 lag +terms since it had better ACF and Box-Ljung test results.
+summary(m_lm2_lag3)
+##
+## Call:
+## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 +
+## lag2 + lag3, data = .)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -0.88352 -0.05588 -0.00615 0.05095 1.13186
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 1.021e+00 1.030e-01 9.917 < 2e-16 ***
+## Year_fct2016 -4.245e-02 2.424e-02 -1.751 0.080137 .
+## Year_fct2017 -3.954e-02 2.399e-02 -1.648 0.099540 .
+## Year_fct2018 -2.729e-02 2.205e-02 -1.237 0.216141
+## Year_fct2019 8.278e-03 2.223e-02 0.372 0.709665
+## Flow -1.826e-04 4.293e-05 -4.254 2.21e-05 ***
+## StationCodeSTTD -8.930e-02 2.560e-02 -3.489 0.000497 ***
+## StationCodeLIB -1.633e-01 5.564e-02 -2.936 0.003371 **
+## StationCodeRVB -1.892e-01 4.989e-02 -3.793 0.000154 ***
+## lag1 9.766e-01 2.404e-02 40.624 < 2e-16 ***
+## lag2 -3.105e-02 3.327e-02 -0.933 0.350706
+## lag3 -5.219e-02 2.385e-02 -2.189 0.028766 *
+## Year_fct2016:Flow -1.045e-05 1.389e-05 -0.752 0.452144
+## Year_fct2017:Flow -2.358e-06 1.139e-05 -0.207 0.836010
+## Year_fct2018:Flow 3.587e-06 1.136e-05 0.316 0.752176
+## Year_fct2019:Flow 5.525e-06 1.265e-05 0.437 0.662376
+## Year_fct2016:StationCodeSTTD 7.295e-02 3.526e-02 2.069 0.038693 *
+## Year_fct2017:StationCodeSTTD 7.972e-02 3.727e-02 2.139 0.032553 *
+## Year_fct2018:StationCodeSTTD -5.016e-02 3.289e-02 -1.525 0.127423
+## Year_fct2019:StationCodeSTTD -1.443e-01 3.397e-02 -4.249 2.26e-05 ***
+## Year_fct2016:StationCodeLIB 1.040e-01 4.189e-02 2.482 0.013175 *
+## Year_fct2017:StationCodeLIB 3.142e-02 3.948e-02 0.796 0.426219
+## Year_fct2018:StationCodeLIB -1.231e-01 5.723e-02 -2.152 0.031553 *
+## Year_fct2019:StationCodeLIB -4.936e-02 6.070e-02 -0.813 0.416244
+## Year_fct2016:StationCodeRVB 1.633e-01 1.019e-01 1.602 0.109254
+## Year_fct2017:StationCodeRVB 6.267e-02 8.373e-02 0.748 0.454265
+## Year_fct2018:StationCodeRVB -1.640e-02 6.745e-02 -0.243 0.807882
+## Year_fct2019:StationCodeRVB -1.252e-01 8.814e-02 -1.420 0.155702
+## Flow:StationCodeSTTD 3.908e-04 6.614e-05 5.909 4.14e-09 ***
+## Flow:StationCodeLIB 2.095e-04 5.088e-05 4.118 4.00e-05 ***
+## Flow:StationCodeRVB 1.777e-04 4.250e-05 4.181 3.05e-05 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 0.1632 on 1727 degrees of freedom
+## Multiple R-squared: 0.9704, Adjusted R-squared: 0.9699
+## F-statistic: 1890 on 30 and 1727 DF, p-value: < 2.2e-16
+df_chla_c2_lag %>%
+ drop_na(Chla_log, lag1, lag2, lag3) %>%
+ plot_lm_diag(Chla_log, m_lm2_lag3)
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+shapiro.test(residuals(m_lm2_lag3))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_lm2_lag3)
+## W = 0.85565, p-value < 2.2e-16
+As with the model with 3-way interactions, the residuals deviate from +a normal distribution particularly at the both tails of the +distribution. I’m not so sure about using this model. Let’s take a +closer look at how the back-transformed fitted values from the model +match the observed values.
+df_lm2_lag3_fit <- df_chla_c2_lag %>%
+ drop_na(Chla, lag1, lag2, lag3) %>%
+ mutate(Fitted = as_tibble(predict(m_lm2_lag3, interval = "confidence"))) %>%
+ unnest_wider(Fitted) %>%
+ mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
+
+plt_lm2_lag3_obs_fit <- df_lm2_lag3_fit %>%
+ ggplot(aes(x = fit, y = Chla)) +
+ geom_point() +
+ geom_abline(slope = 1, intercept = 0, color = "red") +
+ theme_bw() +
+ labs(
+ x = "Back-transformed Fitted Values",
+ y = "Observed Values"
+ )
+
+plt_lm2_lag3_obs_fit
+
+Let’s look at each Station and Year separately.
+plt_lm2_lag3_obs_fit + facet_wrap(vars(StationCode), scales = "free")
+
+plt_lm2_lag3_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
+
+The red lines are the 1:1 ratio between fitted and observed values, +and we would expect for most of the points to be near this line if the +model has a good fit to the observed data. The fitted and observed +values appear to match pretty well at the lower end of the range of +values, but this deteriorates at the mid and higher range values. This +pattern holds for some of the individual Stations and Years. I’m not +sure this linear model is a good candidate to use for our analysis.
+Since none of the models we tried so far seemed to fit the observed +data very well, it may be interesting to run separate models for each of +the four stations - RD22, STTD, LIB, RVB - to see if that helps improve +model fit. We’ll try STTD as a test case.
+Let’s try a linear model using a two-way interaction between Year and +Daily Average Flow. Initially, we’ll run the model without accounting +for serial autocorrelation.
+df_sttd <- df_chla_c2_lag %>% filter(StationCode == "STTD")
+
+m_lm2_sttd <- df_sttd %>%
+ drop_na(Chla_log) %>%
+ lm(Chla_log ~ Year_fct * Flow, data = .)
+Lets look at the model summary and diagnostics:
+summary(m_lm2_sttd)
+##
+## Call:
+## lm(formula = Chla_log ~ Year_fct * Flow, data = .)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -0.7854 -0.1724 -0.0296 0.1377 1.5439
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 8.6179586 0.0295402 291.736 < 2e-16 ***
+## Year_fct2016 0.4258856 0.0459390 9.271 < 2e-16 ***
+## Year_fct2017 0.3102626 0.0675111 4.596 5.71e-06 ***
+## Year_fct2018 -0.6463737 0.0456741 -14.152 < 2e-16 ***
+## Year_fct2019 -1.0607392 0.0444413 -23.868 < 2e-16 ***
+## Flow 0.0027149 0.0002046 13.269 < 2e-16 ***
+## Year_fct2016:Flow -0.0013937 0.0002865 -4.865 1.62e-06 ***
+## Year_fct2017:Flow 0.0051487 0.0017705 2.908 0.00383 **
+## Year_fct2018:Flow -0.0004072 0.0002580 -1.578 0.11523
+## Year_fct2019:Flow -0.0014032 0.0002321 -6.047 3.26e-09 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 0.3032 on 421 degrees of freedom
+## Multiple R-squared: 0.809, Adjusted R-squared: 0.805
+## F-statistic: 198.2 on 9 and 421 DF, p-value: < 2.2e-16
+df_sttd %>%
+ drop_na(Chla_log) %>%
+ plot_lm_diag(Chla_log, m_lm2_sttd)
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+shapiro.test(residuals(m_lm2_sttd))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_lm2_sttd)
+## W = 0.94651, p-value = 2.328e-11
+acf(residuals(m_lm2_sttd))
+
+Box.test(residuals(m_lm2_sttd), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_sttd)
+## X-squared = 463.9, df = 20, p-value < 2.2e-16
+The residuals deviate from a normal distribution according to visual +inspection and the Shapiro-Wilk normality test. Also, model definitely +has residual autocorrelation as indicated by the ACF plot and the +Box-Ljung test.
+Now, we’ll try to deal with the residual autocorrelation and the +non-normal residuals. We’ll run a series of linear models adding 1, 2, +and 3 lag terms and compare how well they correct for +autocorrelation.
+m_lm2_sttd_lag1 <- df_sttd %>%
+ drop_na(Chla_log, lag1) %>%
+ lm(Chla_log ~ Year_fct * Flow + lag1, data = .)
+
+acf(residuals(m_lm2_sttd_lag1))
+
+Box.test(residuals(m_lm2_sttd_lag1), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_sttd_lag1)
+## X-squared = 37.18, df = 20, p-value = 0.01113
+m_lm2_sttd_lag2 <- df_sttd %>%
+ drop_na(Chla_log, lag1, lag2) %>%
+ lm(Chla_log ~ Year_fct * Flow + lag1 + lag2, data = .)
+
+acf(residuals(m_lm2_sttd_lag2))
+
+Box.test(residuals(m_lm2_sttd_lag2), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_sttd_lag2)
+## X-squared = 18.734, df = 20, p-value = 0.5392
+m_lm2_sttd_lag3 <- df_sttd %>%
+ drop_na(Chla_log, lag1, lag2, lag3) %>%
+ lm(Chla_log ~ Year_fct * Flow + lag1 + lag2 + lag3, data = .)
+
+acf(residuals(m_lm2_sttd_lag3))
+
+Box.test(residuals(m_lm2_sttd_lag3), lag = 20, type = 'Ljung-Box')
+##
+## Box-Ljung test
+##
+## data: residuals(m_lm2_sttd_lag3)
+## X-squared = 10.813, df = 20, p-value = 0.9509
+The models with 2 and 3 lag terms seem to take care of the +autocorrelation. Let’s compare the 3 models using AIC and BIC to see +which model those prefer.
+AIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
+## df AIC
+## m_lm2_sttd_lag1 12 -328.2932
+## m_lm2_sttd_lag2 13 -350.6079
+## m_lm2_sttd_lag3 14 -340.5105
+BIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
+## df BIC
+## m_lm2_sttd_lag1 12 -279.7532
+## m_lm2_sttd_lag2 13 -298.3031
+## m_lm2_sttd_lag3 14 -284.4561
+Both AIC and BIC prefer the model with 2 lag terms. Let’s take a +closer look at that one.
+summary(m_lm2_sttd_lag2)
+##
+## Call:
+## lm(formula = Chla_log ~ Year_fct * Flow + lag1 + lag2, data = .)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -0.88426 -0.06597 -0.00901 0.05116 0.81176
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 1.612e+00 2.284e-01 7.058 7.50e-12 ***
+## Year_fct2016 6.863e-02 2.667e-02 2.573 0.010442 *
+## Year_fct2017 1.177e-01 3.644e-02 3.230 0.001339 **
+## Year_fct2018 -1.292e-01 2.907e-02 -4.445 1.14e-05 ***
+## Year_fct2019 -2.085e-01 3.616e-02 -5.766 1.62e-08 ***
+## Flow 5.131e-04 1.324e-04 3.875 0.000124 ***
+## lag1 1.109e+00 4.925e-02 22.527 < 2e-16 ***
+## lag2 -2.956e-01 4.819e-02 -6.133 2.07e-09 ***
+## Year_fct2016:Flow -2.822e-04 1.555e-04 -1.816 0.070186 .
+## Year_fct2017:Flow 2.695e-03 1.001e-03 2.693 0.007380 **
+## Year_fct2018:Flow -5.917e-05 1.375e-04 -0.430 0.667197
+## Year_fct2019:Flow -2.500e-04 1.302e-04 -1.920 0.055609 .
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 0.1557 on 401 degrees of freedom
+## Multiple R-squared: 0.9497, Adjusted R-squared: 0.9483
+## F-statistic: 687.9 on 11 and 401 DF, p-value: < 2.2e-16
+df_sttd %>%
+ drop_na(Chla_log, lag1, lag2) %>%
+ plot_lm_diag(Chla_log, m_lm2_sttd_lag2)
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+shapiro.test(residuals(m_lm2_sttd_lag2))
+##
+## Shapiro-Wilk normality test
+##
+## data: residuals(m_lm2_sttd_lag2)
+## W = 0.86609, p-value < 2.2e-16
+As with all the other linear models, the residuals deviate from a +normal distribution particularly at the both tails of the distribution. +I’m not so sure about using this model. Let’s take a closer look at how +the back-transformed fitted values from the model match the observed +values.
+df_lm2_sttd_lag2_fit <- df_sttd %>%
+ drop_na(Chla, lag1, lag2) %>%
+ mutate(Fitted = as_tibble(predict(m_lm2_sttd_lag2, interval = "confidence"))) %>%
+ unnest_wider(Fitted) %>%
+ mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))
+
+plt_lm2_sttd_lag2_obs_fit <- df_lm2_sttd_lag2_fit %>%
+ ggplot(aes(x = fit, y = Chla)) +
+ geom_point() +
+ geom_abline(slope = 1, intercept = 0, color = "red") +
+ theme_bw() +
+ labs(
+ x = "Back-transformed Fitted Values",
+ y = "Observed Values"
+ )
+
+plt_lm2_sttd_lag2_obs_fit
+
+Let’s look at each Year separately.
+plt_lm2_sttd_lag2_obs_fit + facet_wrap(vars(Year_fct), scales = "free")
+
+The red lines are the 1:1 ratio between fitted and observed values, +and we would expect for most of the points to be near this line if the +model has a good fit to the observed data. The fitted and observed +values appear to match pretty well at the lower end of the range of +values, but this deteriorates at the higher values. This pattern holds +for some of the individual Years. I’m not sure this linear model is a +good candidate to use for our analysis.
+None of the models we tried in this analysis seemed to fit the +observed data very well. It looks like daily average flow has an effect +on chlorophyll concentrations at some of the stations, but flow doesn’t +seem to be a good predictor overall. Unfortunately, I don’t think we +should use any of these models in our analysis.
+