R scripts to generate city-level forecasts
This repository contains R code to generate city-level forecasts for submission to the flu-metrocast hub. All models are exploratory/preliminary, though we will regularly update this document to describe the latest mathematical model used in the submission.
All outputs submitted to the Hub will be archived in this repository, along with additional model metadata (such as the model definition associated with a submission and details on any additional data sources used or decisions made in the submission process). If significant changes to the model are made during submission, we will rename the model in the submission file.
Initially, we plan to fit the data from each state independently, using hierarchical partial pooling to jointly fit the cities within a state. The initial forecast targets are:
Forecast target | Location |
---|---|
ED visits due to ILI | New York City (5 boroughs, unknown, citywide) |
Percent of ED visits due to flu | Texas (5 metro areas) |
We plan to use the same latent model structure for both forecast targets, modifying the observation model for count data (NYC) vs proportion data (Texas).
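To illustrate the idea of one latent structure with two observation models, here is a minimal base-R sketch. The latent values, the intercept `offset_log_visits`, and the Beta precision `phi` are made-up numbers for illustration, and the mean-precision form of the Beta distribution is an assumption rather than something taken from the submitted model.

```r
# Hypothetical latent values on the linear-predictor scale (not fitted output).
set.seed(123)
latent_x <- c(-3.0, -2.5, -2.0, -2.2, -2.8)

# Count target (NYC): log link with a Poisson observation model.
# `offset_log_visits` is an assumed intercept that puts counts on a plausible scale.
offset_log_visits <- 6
expected_counts <- exp(offset_log_visits + latent_x)
sim_counts <- rpois(length(latent_x), lambda = expected_counts)

# Proportion target (Texas): logit link with a Beta observation model.
# Mean-precision form: shape1 = mu * phi, shape2 = (1 - mu) * phi (phi is assumed).
phi <- 200
mu <- plogis(latent_x)
sim_prop <- rbeta(length(latent_x), shape1 = mu * phi, shape2 = (1 - mu) * phi)

# Report the proportion as a percent, matching the Texas target's units.
print(data.frame(latent_x, sim_counts, sim_percent = round(100 * sim_prop, 2)))
```

Only the link function and the sampling distribution change between the two targets; the latent vector is shared.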
Because all data is available publicly, the forecasts generated should be completely reproducible from the specified configuration file.
We start by using the `mvgam` package, an R package that leverages the formula interfaces of both `mgcv` and `brms` to fit Bayesian Dynamic Generalized Additive Models (GAMs).
These packages use metaprogramming to produce Stan files, and we also include the Stan code generated by the package.
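As a rough sketch of how a model specification turns into Stan code, the example below builds a toy two-series data set and asks `mvgam` to generate the model without fitting it. The data, the series names, and the formula are hypothetical and much simpler than the hierarchical model used for submissions. The sketch assumes an `mvgam` version that provides the `AR()` trend constructor and stores the generated program in the `model_file` element of the returned object.

```r
# Toy data only: two hypothetical series observed weekly (not the Hub data).
set.seed(1)
toy <- expand.grid(time = 1:60, series = c("city_a", "city_b"))
toy$series <- factor(toy$series)
toy$season <- (toy$time - 1) %% 52 + 1  # week-of-season index, 1..52
toy$y <- rpois(nrow(toy), lambda = exp(2 + sin(2 * pi * toy$season / 52)))

# Only attempt the mvgam call if the package is installed.
if (requireNamespace("mvgam", quietly = TRUE)) {
  library(mvgam)
  mod <- mvgam(
    formula = y ~ s(season, bs = "cc", k = 6) + series,  # cyclic seasonal smooth
    data = toy,
    family = poisson(),
    trend_model = AR(),   # AR(1) latent trend per series
    run_model = FALSE     # generate the model code without sampling
  )
  # Show the first few lines of the generated Stan program.
  cat(head(mod$model_file, 10), sep = "\n")
}
```

Setting `run_model = FALSE` is a convenient way to inspect the generated code before committing to a full fit.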
To produce forecasts each week, we use the following workflow:
- Modify the configuration file in `input/config.toml`.
- In the command line, run `Rscript preprocess_data.R input/config.toml {index}`, where `index` identifies the individual model run. In this case, the runs also differ in their pre-processing because they use different data sources.
- Next, run `Rscript models.R input/config.toml {index}`.
- Lastly, run `Rscript postprocess_forecasts.R input/{forecast_date}/config.toml`. This will populate the `output/cityforecasts/{forecast_date}` folder with a csv file formatted following the Hub submission guidelines.
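The command-line steps above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of the repository: it only builds the command strings, and the `index` and `forecast_date` values shown are placeholders (the accepted values for `index` are an assumption).

```r
# Hypothetical helper: build the three Rscript commands for one model run.
# It does not execute anything; pass each string to system() to run it.
build_forecast_commands <- function(index, forecast_date,
                                    config = "input/config.toml") {
  c(
    preprocess = sprintf("Rscript preprocess_data.R %s %s", config, index),
    model = sprintf("Rscript models.R %s %s", config, index),
    postprocess = sprintf(
      "Rscript postprocess_forecasts.R input/%s/config.toml", forecast_date
    )
  )
}

# Placeholder values for illustration only.
cmds <- build_forecast_commands(index = 1, forecast_date = "2025-01-25")
print(cmds)
```

Keeping the commands as data makes it straightforward to loop over several `index` values before running the post-processing step once.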
Eventually, steps 2-4 will be automated with the GitHub Action `.git/workflows/generate_forecasts` and scheduled to run after 12 pm CST, corresponding to the time that the `target_data` is updated on the Hub.
The preliminary model is described below.
For the forecasts of counts of ED visits due to ILI, we assume a Poisson observation process.
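One way to write a Poisson observation process of this kind is sketched below. The notation is an assumption introduced here for illustration: $y_{l,t}$ is the observed ED visit count in location $l$ at time $t$, $\lambda_{l,t}$ is its expected value, and $x_{l,t}$ is the latent state on the log scale.

```latex
y_{l,t} \sim \mathrm{Poisson}(\lambda_{l,t}), \qquad \log \lambda_{l,t} = x_{l,t}
```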
For the forecasts of the percent of ED visits due to flu, we assume a Beta observation process on the proportion of ED visits due to flu.
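A Beta observation process on a proportion is commonly written in mean-precision form, as sketched below. The notation is assumed for illustration rather than taken from the model code: $p_{l,t}$ is the observed proportion (the reported percent divided by 100) in location $l$ at time $t$, $\mu_{l,t}$ is its mean, $\phi$ is a precision parameter, and $x_{l,t}$ is the latent state on the logit scale.

```latex
p_{l,t} \sim \mathrm{Beta}\big(\mu_{l,t}\,\phi,\ (1 - \mu_{l,t})\,\phi\big), \qquad \operatorname{logit}(\mu_{l,t}) = x_{l,t}
```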
We model latent admissions with a hierarchical GAM component to capture shared seasonality and weekday effects and a univariate autoregressive component to capture trends in the dynamics within each location.
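A latent model with this structure could look like the sketch below. It is one plausible form under stated assumptions, not the exact specification used in submissions. All symbols are introduced here for illustration: $x_{l,t}$ is the latent state for location $l$ at time $t$, $\beta_l$ is a partially pooled location intercept, $f_{\mathrm{shared}}$ and $f_l$ are shared and location-specific seasonal smooths of the within-season time index $s_t$, and $z_{l,t}$ is an AR(1) process. For daily data, a weekday effect would enter as one more additive term.

```latex
\begin{aligned}
x_{l,t} &= \beta_l + f_{\mathrm{shared}}(s_t) + f_l(s_t) + z_{l,t}, \\
\beta_l &\sim \mathcal{N}(\mu_\beta, \tau^2), \\
z_{l,t} &= \rho_l\, z_{l,t-1} + \epsilon_{l,t}, \qquad \epsilon_{l,t} \sim \mathcal{N}(0, \sigma_l^2)
\end{aligned}
```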
For the NYC data, we have count data on a daily scale, so we add a weekday component to the latent model.
The model above is a hierarchical dynamic GAM, which contains both a GAM component and an autoregressive component. We can additionally fit a more traditional hierarchical GAM (with no autoregression but with tensor product splines to jointly estimate across location and time) as well as a vector ARIMA without a spline component. Eventually, we can also combine these approaches and estimate a hierarchical GAM with a multivariate vector autoregression. These will be areas of future work.
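To make the two components concrete, the base-R sketch below simulates daily counts from a latent process that adds a smooth seasonal term, a weekday effect, and a per-location AR(1) process. A sine curve stands in for the GAM smooth, and every number (intercept, effect sizes, `rho`, `sigma`, location names) is a made-up assumption for illustration, not an estimate from the fitted model.

```r
set.seed(42)
n_days <- 112                      # 16 weeks of daily data
locations <- c("loc_a", "loc_b")   # hypothetical location names
day <- seq_len(n_days)

# Shared seasonal term: a sine curve as a stand-in for the GAM smooth.
seasonal <- 0.5 * sin(2 * pi * day / 364)

# Weekday effect on the log scale, constrained to sum to zero.
weekday_effect <- c(0.2, 0.1, 0, 0, -0.05, -0.1, -0.15)
weekday <- weekday_effect[(day - 1) %% 7 + 1]

# Simulate a stationary AR(1) process of length n (n >= 2).
simulate_ar1 <- function(n, rho, sigma) {
  z <- numeric(n)
  z[1] <- rnorm(1, mean = 0, sd = sigma / sqrt(1 - rho^2))
  for (t in 2:n) z[t] <- rho * z[t - 1] + rnorm(1, mean = 0, sd = sigma)
  z
}

# Latent log-scale state per location: intercept + seasonal + weekday + AR(1).
latent <- sapply(locations, function(loc) {
  3 + seasonal + weekday + simulate_ar1(n_days, rho = 0.7, sigma = 0.1)
})

# Poisson observations around the exponentiated latent state.
counts <- matrix(rpois(length(latent), lambda = exp(latent)),
                 nrow = n_days, dimnames = dimnames(latent))
print(head(counts))
```

Dropping the `simulate_ar1` term gives the plain GAM-style structure, while dropping `seasonal` and `weekday` leaves only the autoregressive part.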