This repository contains code to run the Ada Poll of Polls model using Stan.
To run the models and (re)produce the output:
- You need to have Stan and RStan installed. To install RStan (and Stan), see the RStan installation information.
- Install the SwedishPolls R package. See the install instructions here.
- Install the ada R package (see below).
- Run either run_ada/run_ada.R in R or run_ada/run_ad_bash.sh.
- Play around with the resulting model object in R.
The whole process (steps 1-4 above) can be seen in the GitHub Actions workflow here.
We continuously develop and improve the model. The model currently in use is set in run_ada/ada_config.yml (the model argument). The same model exists as a Stan file in the R package, under rpackage/inst/stan_models/.
The hyperparameters are either set in the config file (run_ada/ada_config.yml) or left at their default values. The default values are printed when running the model in R.
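To check which model and settings a run will use, the config file can be read directly in R. This is a minimal sketch, assuming the file is plain YAML with a top-level model key and that the yaml package is installed; read_ada_config() is a hypothetical helper, not part of the ada package.

```r
# Sketch only: assumes ada_config.yml is plain YAML with a top-level `model`
# key. The real file layout may differ. Requires the yaml package.
read_ada_config <- function(path = "run_ada/ada_config.yml") {
  stopifnot(file.exists(path))
  yaml::read_yaml(path)
}

cfg_path <- "run_ada/ada_config.yml"
if (file.exists(cfg_path)) {
  cfg <- read_ada_config(cfg_path)
  print(cfg$model)   # name of the Stan model, if `model` is a top-level key
  print(names(cfg))  # all entries set in the config file
}
```

Any hyperparameter that does not appear in the config would then fall back to the defaults printed at fit time.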
Unfortunately, we do not have a better description of the model right now. We know that the code can be cumbersome to read, so if you have any questions, feel free to reach out on Twitter or open an issue here on GitHub.
All functionality, along with its tests, is implemented in the R package ada.
To install it, build the package from a local clone of the repo:
devtools::install_local("rpackage")
We can access two types of data through the ada package. First, we can access real polling data from Sweden (Spain and Germany).
To access the polls data from the R package, use:
library(ada)
data("swedish_polls_curated")
data("swedish_elections")
Based on the data, we create a polls_data object.
pd <- polls_data(y = swedish_polls_curated[, 3:10],
                 house = swedish_polls_curated$Company,
                 publish_date = swedish_polls_curated$PublDate,
                 start_date = swedish_polls_curated$collectPeriodFrom,
                 end_date = swedish_polls_curated$collectPeriodTo,
                 n = swedish_polls_curated$n)
pd
## A polls_data object with 1660 polls from 28 houses
## that range from 2000-01-03 to 2021-06-29.
## # A tibble: 1,660 × 14
## .poll_id M L C KD S V MP SD .house .publish_date
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <date>
## 1 1 0.221 0.034 0.106 0.068 0.234 0.113 0.033 0.182 Demoskop 2021-06-29
## 2 2 0.218 0.025 0.079 0.05 0.255 0.112 0.035 0.214 Novus 2021-06-29
## 3 3 0.212 0.028 0.079 0.051 0.252 0.102 0.033 0.194 Sentio 2021-06-29
## 4 4 0.22 0.02 0.1 0.06 0.24 0.12 0.03 0.19 Ipsos 2021-06-22
## 5 5 0.224 0.023 0.099 0.051 0.258 0.096 0.035 0.202 Sifo 2021-06-18
## 6 6 0.226 0.023 0.094 0.053 0.261 0.096 0.042 0.192 Novus 2021-06-11
## 7 7 0.229 0.036 0.107 0.056 0.242 0.085 0.035 0.192 Demoskop 2021-06-05
## 8 8 0.224 0.025 0.095 0.045 0.282 0.089 0.038 0.189 SCB 2021-06-02
## 9 9 0.193 0.033 0.07 0.052 0.255 0.103 0.05 0.21 Sentio 2021-05-31
## 10 10 0.22 0.03 0.09 0.05 0.26 0.1 0.04 0.2 Ipsos 2021-05-29
## # … with 1,650 more rows, and 3 more variables: .start_date <date>,
## # .end_date <date>, .n <int>
To see all available datasets, use:
data(package = "ada")
The real polling data comes as both an original dataset and a curated dataset that has been cleaned to be internally consistent. See rpackage/data-raw for details on how the curation was done. Every file there without the suffix _functions creates datasets stored in the R package.
Documentation on each dataset can be found in the R package docs or in rpackage/R/docs_data.R.
Second, to simplify modelling, we can simulate polls data. Simulation is done with simulate_polls() as follows.
library(ada)
data(x_test)
set.seed(4711)
spd <- simulate_polls(x = x_test[[1]],
                      pd = pd_test,
                      npolls = 150,
                      time_scale = "week",
                      start_date = "2010-01-01")
We can also plot the simulated polls with:
plot(spd, y = "x")
In the next step, we add information on the true underlying latent state. In the case of real data, this is election data (see above). Below we create a dataset in which the state at weeks 3, 44, and 72 is known.
true_idx <- c(3, 44, 72)
known_state <- tibble::tibble(
  date = as.Date("2010-01-01") + lubridate::weeks(true_idx),
  x = x_test[[1]][true_idx]
)
known_state
## # A tibble: 3 × 2
## date x
## <date> <dbl>
## 1 2010-01-22 0.25
## 2 2010-11-05 0.25
## 3 2011-05-20 0.3
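The dates in known_state are simply the start date shifted by whole weeks. As a quick check, the same arithmetic can be done in base R without lubridate; week_to_date() is a hypothetical helper used only for illustration.

```r
# Map week indices to calendar dates: start date plus 7 days per week.
# week_to_date() is an illustrative helper, not part of the ada package.
week_to_date <- function(week_idx, start_date = "2010-01-01") {
  as.Date(start_date) + 7 * week_idx
}

week_to_date(c(3, 44, 72))
# [1] "2010-01-22" "2010-11-05" "2011-05-20"
```

These match the date column printed above, which confirms that the known weeks are 3, 44, and 72.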
To fit the model, we only need to call poll_of_polls().
output <- capture.output(suppressWarnings(
  pop <- poll_of_polls(y = "x",
                       model = "model8h3",
                       polls_data = spd,
                       time_scale = "week",
                       known_state = known_state,
                       warmup = 1000,
                       iter = 2000,
                       chains = 4)
))
## Default value(s) set:
## sigma_kappa_hyper = 0.005
## kappa_1_sigma_hyper = 0.02
## g_scale = 0.46986301369863
## use_industry_bias = 0
## use_house_bias = 0
## use_design_effects = 0
## use_constrained_party_house_bias = 0
## use_constrained_house_house_bias = 0
## use_constrained_party_kappa = 0
## use_ar_kappa = 0
## use_latent_state_version = 0
## use_t_dist_industry_bias = 0
## use_multivariate_version = 0
## use_softmax = 1
## estimate_alpha_kappa = 0
## estimate_alpha_beta_mu = 0
## estimate_alpha_beta_sigma = 0
## alpha_kappa_known = 1
## alpha_beta_mu_known = 1
## alpha_beta_sigma_known = 1
## beta_mu_1_sigma_hyper = 0.02
## sigma_beta_mu_sigma_hyper = 0.01
## beta_sigma_1_sigma_hyper = 1
## sigma_beta_sigma_sigma_hyper = 1
## kappa_sum_sigma_hyper = 0.01
## beta_mu_sum_party_sigma_hyper = 0.01
## beta_mu_sum_house_sigma_hyper = 0.01
## estimate_kappa_next = 1
## nu_kappa_raw_alpha = 6.5
## nu_kappa_raw_beta = 1
## alpha_kappa_mean = 0
## alpha_kappa_sd = 1
## alpha_beta_mu_mean = 0
## alpha_beta_mu_sd = 1
## alpha_beta_sigma_mean = 0
## alpha_beta_sigma_sd = 1
## nu_lkj = 1
## x1_prior_p = 0.268781302170284, 0.731218697829716
## x1_prior_alpha0 = 100
Printing the fitted object shows some basic information about the model, the data, and the sampler diagnostics.
pop
## ==== Poll of Polls Model (33.8 MB) ====
## Model is fit during the period 2010-01-01--2011-11-27
## Stan model: model8h3.stan
## Number of parameters: 506
## Parties: x
## Time scale: week
##
## == Data ==
## A polls_data object with 150 polls from 1 houses
## that range from 2010-01-01 to 2011-11-27.
##
## A known state object with 3 known states
## that range from 2010-01-22 to 2011-05-20.
##
## == Stan arguments ==
## warmup: 1000.0
## iter: 2000.0
## chains: 4.0
##
## == Model arguments ==
## ~
##
## == Model diagnostics ==
## no_divergent_transistions: 0
## no_max_treedepth: 0
## no_low_bfmi_chains: 0
## no_Rhat_above_1_1: 0
## no_Rhat_is_NA: 14
## mean_no_leapfrog_steps: 108
## mean_chain_step_size: 0.0484539
## mean_chain_inv_mass_matrix_min: 0.0416008
## mean_chain_inv_mass_matrix_max: 1.0767375
## mean_chain_warmup_time: 18
## mean_chain_sampling_time: 21
##
## == Git ==
## git sha: fc74bb2aec2b4e1900f59ee477b131f2384e2882
##
## == Cache ==
## sha: 0bc8c0f14ea23e05ef4e65e8fdd28fa462712280
## cache directory: /var/folders/8x/bgssdq5n6dx1_ydrhq1zgrym0000gn/T//Rtmp8bFLr7/pop_cache
The Stan fit object is stored in pop$stan_fit:
head(rstan::summary(pop$stan_fit)$summary)[,1:3]
## mean se_mean sd
## x_pred[1,1] 0.2548916 1.683106e-04 0.010130557
## x_pred[1,2] 0.7451084 1.683106e-04 0.010130557
## x_pred[2,1] 0.2546602 1.219312e-04 0.008128779
## x_pred[2,2] 0.7453398 1.219312e-04 0.008128779
## x_pred[3,1] 0.2533631 8.510874e-05 0.005537453
## x_pred[3,2] 0.7466369 8.510874e-05 0.005537453
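Because pop$stan_fit works with rstan::summary(), the posterior draws can also be pulled out with rstan::extract(). The following is a minimal sketch that summarises the x_pred draws per time point; latent_summary() is a hypothetical helper, and the layout (draws x time points x categories) is what rstan::extract() returns for a two-dimensional parameter such as x_pred[t, k].

```r
# Summarise a 3-D array of posterior draws (draws x time points x categories).
# latent_summary() is an illustrative helper, not part of the ada package.
latent_summary <- function(draws, category = 1, probs = c(0.05, 0.95)) {
  stopifnot(length(dim(draws)) == 3, length(probs) == 2)
  # Keep a draws-by-time matrix even when there is a single time point
  x <- matrix(draws[, , category], nrow = dim(draws)[1])
  q <- apply(x, 2, stats::quantile, probs = probs)
  data.frame(t = seq_len(ncol(x)),
             mean = colMeans(x),
             lower = unname(q[1, ]),
             upper = unname(q[2, ]))
}

# Usage with the fitted object from above, if it exists in the session
if (exists("pop")) {
  draws <- rstan::extract(pop$stan_fit, pars = "x_pred")$x_pred
  head(latent_summary(draws, category = 1))
}
```

The mean column should agree with the mean column of the rstan::summary() output for the same category.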
We can also quickly visualize the latent series with:
plot(pop, "x")
All Stan code can be found in rpackage/inst/stan_models. The Stan files are kept inside the R package so that they can be tested together with it. Local Stan models can, however, be run directly with:
pop <- poll_of_polls(...,
                     model = "path/to/my/stan/model.stan",
                     ...)
In this way, we can edit a model quickly without needing to rebuild the R package.
The local file must have the same filename as one of the available models, because the package uses the name to decide how to parse the data.
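One way to keep the filename intact is to copy a shipped model to a scratch directory and edit the copy. This is a minimal sketch, assuming the ada package is installed so that its inst/stan_models directory is reachable through system.file(); copy_model_for_editing() is a hypothetical helper, not part of the package.

```r
# Copy a Stan model to another directory without changing its filename.
# copy_model_for_editing() is an illustrative helper, not part of ada.
copy_model_for_editing <- function(model_file, from, to = tempdir()) {
  src <- file.path(from, model_file)
  stopifnot(file.exists(src))
  dest <- file.path(to, basename(model_file))
  file.copy(src, dest, overwrite = TRUE)
  dest
}

# system.file() returns "" if the package or directory is not found
model_dir <- system.file("stan_models", package = "ada")
if (nzchar(model_dir)) {
  print(list.files(model_dir, pattern = "\\.stan$"))  # available model names
  local_model <- copy_model_for_editing("model8h3.stan", from = model_dir)
  # local_model can now be edited and passed as the model argument
}
```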