---
title: "Rethinking Chapter 9"
author: "Gregor Mathes"
date: "2021-02-11"
slug: Rethinking Chapter 9
categories: []
tags: [Rethinking, Bayes, Statistics]
subtitle: ''
summary: 'Markov Chain Monte Carlo algorithms to sample from the posterior'
authors: [Gregor Mathes]
lastmod: '2021-03-11T12:07:04+02:00'
featured: no
projects: [Rethinking]
output:
  html_document:
    toc: true
    toc_depth: 1
    number_sections: false
    fig_width: 6
    mathjax: "default"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.dim=c(7,4), warning=FALSE, message = FALSE)
library(tidyverse)
library(rethinking)
library(knitr)
library(kableExtra)
map <- purrr::map
```
# Introduction
This is the eighth part of a series where I work through the practice questions of the second edition of Richard McElreath's [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/).\
Each post covers a new chapter, and you can find the posts on previous chapters [here](https://gregor-mathes.netlify.app/tags/rethinking/). This chapter introduces Markov chain Monte Carlo algorithms to approximate the posterior distribution.
You can find the lectures and homework accompanying the book [here](https://github.com/rmcelreath/stat_rethinking_2020).
The colours for this blog post are:
```{r colour setup}
blue <- "#337677"
red <- "#C4977F"
grey <- "#ECDED2"
brown <- "#50473D"
```
```{r colour plot, echo=FALSE}
tibble(colours = c(blue, red, grey, brown),
colourname = c("blue", "red", "grey", "brown")) %>%
arrange(colours) %>%
mutate(colourname = fct_reorder(colourname, colours),
colourname = paste0(colours, "/ ", colourname)) %>%
ggplot() +
geom_bar(aes(y = colours, fill = colours)) +
scale_fill_identity() +
geom_text(aes(x = 0.5, y = colours, label = colourname),
size = 6, colour = "white") +
theme_void()
```
Note that I did not have time for practice questions 9H6 and 9H7, but I will probably manage to add these soon(-ish).
# Easy practices
## 9E1
> Which of the following is a requirement of the simple Metropolis algorithm?
>
> * (1) The parameters must be discrete.
> * (2) The likelihood function must be Gaussian.
> * (3) The proposal distribution must be symmetric.
A quick look into chapter 9.2 shows that parameters are allowed to be continuous and that the likelihood function can take any form. The only requirement is (3). Or, keeping with the island metaphor: the chance of proposing a move from island A to island B must equal the chance of proposing a move from B to A.
## 9E2
> Gibbs sampling is more efficient than the Metropolis algorithm. How does it achieve this extra efficiency? Are there any limitations to the Gibbs sampling strategy?
The Metropolis algorithm jumps from proposal to proposal in a random way, which can take a long time to explore the whole posterior distribution. Using *conjugate priors*, Gibbs sampling makes more intelligent proposals. In other words, it makes smart jumps through the joint posterior distribution. This way, you need fewer samples because fewer proposals get rejected. The disadvantage of the Gibbs sampler is that it relies on conjugate priors, which you sometimes don't want to use, or which sometimes don't even exist. And similar to the Metropolis algorithm, it gets stuck in a valley of the joint posterior when there is high correlation between parameters.
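To make the idea of conditional updates concrete, here is a minimal Gibbs sketch of my own (not from the book) for a standard bivariate normal with correlation `rho` (a value chosen purely for illustration). Because each full conditional is itself a known normal distribution, every draw is accepted:
```{r}
set.seed(42)
rho <- 0.6
n_samples <- 1e3
x <- y <- numeric(n_samples)
for (i in 2:n_samples) {
  # the full conditionals of a standard bivariate normal are normal themselves
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i], sd = sqrt(1 - rho^2))
}
cor(x, y) # close to rho
```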
## 9E3
> Which sort of parameters can Hamiltonian Monte Carlo not handle? Can you explain why?
Hamiltonian Monte Carlo cannot handle discrete parameters. This is because it requires a smooth surface to glide its imaginary particle over while sampling from the posterior distribution.
## 9E4
> Explain the difference between the effective number of samples, *n_eff* as calculated by Stan, and the actual number of samples.
*n_eff* corresponds to the number of independent samples with the same estimation power as our autocorrelated samples. It is a measure of how much independent information there is in autocorrelated chains. It is usually smaller than the actual number of samples.
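As a rough back-of-the-envelope illustration of my own (not Stan's actual estimator), we can approximate *n_eff* for a deliberately autocorrelated chain from its estimated autocorrelations:
```{r}
set.seed(42)
# an AR(1) chain with strong autocorrelation
chain <- as.numeric(arima.sim(list(ar = 0.9), n = 1e4))
# n_eff is roughly N / (1 + 2 * sum of positive autocorrelations)
rho_k <- acf(chain, lag.max = 100, plot = FALSE)$acf[-1]
length(chain) / (1 + 2 * sum(rho_k[rho_k > 0])) # far below the 10,000 draws
```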
## 9E5
> Which value should *Rhat* approach, when a chain is sampling the posterior distribution correctly?
*Rhat* should approach one. Values above one indicate that the chains have not yet converged on the same target distribution.
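To see where the value of one comes from, here is a simplified Gelman-Rubin sketch of my own (not Stan's current rank-normalised estimator), comparing between-chain and within-chain variance:
```{r}
rhat_simple <- function(chains) { # one column per chain
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var)) # mean within-chain variance
  B <- n * var(colMeans(chains))   # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}
set.seed(42)
rhat_simple(cbind(rnorm(1e3), rnorm(1e3)))    # chains agree: close to 1
rhat_simple(cbind(rnorm(1e3), rnorm(1e3, 3))) # chains disagree: well above 1
```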
## 9E6
> Sketch a good trace plot for a Markov chain, one that is effectively sampling from the posterior distribution. What is good about its shape? Then sketch a trace plot for a malfunctioning Markov chain. What about its shape indicates malfunction?
So first, a good chain. We want good mixing and the chain to be stationary, meaning that we should see only minor horizontal noise around a fixed mean.
```{r 9E6 Figure 1, fig.cap="A good Markov chain showing good mixing and stationarity."}
tibble(chain = rnorm(1e3, 2, 0.25),
steps = 1:1e3) %>%
ggplot(aes(steps, chain)) +
geom_line(colour = red) +
coord_cartesian(ylim = c(0, 3)) +
labs(x = "Steps", y = "Sample") +
theme_minimal()
```
Note how the chain converges around the mean of 2 and shows only small, random deviations from that mean.
Now the bad chain:
```{r 9E6 Figure 2, fig.cap="A bad Markov chain showing bad mixing and non-stationarity."}
tibble(mean.val = c(seq(2, 6, length.out = 400),
seq(6, 1, length.out = 400),
seq(1, 0.5, length.out = 200)),
steps = 1:1e3,
noise = rlogis(1e3, 0, 0.15)) %>%
mutate(chain = mean.val + noise) %>%
ggplot(aes(steps, chain)) +
geom_line(colour = red) +
labs(x = "Steps", y = "Sample") +
theme_minimal()
```
The chain is not stationary around a mean. Instead, it gets stuck at some pretty unrealistic values.
# Medium practices
## 9M1
> Re-estimate the terrain ruggedness model from the chapter, but now using a uniform prior and an exponential prior for the standard deviation, sigma. The uniform prior should be `dunif(0,10)` and the exponential should be `dexp(1)`. Do the different priors have any detectable influence on the posterior distribution?
First, we fit the model from the chapter to compare the differences later on. Note that this model has the `dexp(1)` prior on `sigma`.
```{r 9M1 model m9.1, results='hide'}
data("rugged")
dat_rugged <- rugged %>%
as_tibble() %>%
drop_na(rgdppc_2000) %>%
transmute(log_gdp = log(rgdppc_2000),
log_gdp_std = log_gdp/ mean(log_gdp),
rugged_std = rugged / max(rugged),
cid = if_else(cont_africa == 1, 1, 2)) %>%
as.list()
m9.1 <- alist(
log_gdp_std ~ dnorm(mu, sigma),
mu <- a[cid] + b[cid] * (rugged_std - 0.215),
a[cid] ~ dnorm(1, 0.1),
b[cid] ~ dnorm(0, 0.3),
sigma ~ dexp(1)) %>%
ulam(data = dat_rugged, chains = 1)
```
And now the new model with the updated priors.
```{r 9M1 model m9.1.upd, results='hide'}
m9.1.upd <- alist(
log_gdp_std ~ dnorm(mu, sigma),
mu <- a[cid] + b[cid] * (rugged_std - 0.215),
a[cid] ~ dnorm(1, 0.1),
b[cid] ~ dnorm(0, 0.3),
sigma ~ dunif(0, 10)) %>%
ulam(data = dat_rugged, chains = 1)
```
Let's plot the prior distributions against each other.
```{r 9M1 Figure 3, fig.cap="Prior distributions on sigma."}
tibble(exponential = rexp(1e3, 1),
uniform = runif(1e3, 0, 10)) %>%
pivot_longer(cols = everything(),
names_to = "type",
values_to = "Sigma") %>%
ggplot(aes(Sigma, fill = type)) +
geom_density(colour = grey,
alpha = 0.6) +
scale_fill_manual(values = c(blue, red)) +
labs(y = NULL, fill = NULL) +
theme_minimal()
```
Now we can look at the posteriors for `sigma` and compare the models.
```{r 9M1 Figure 4, fig.cap="Posterior distributions for sigma."}
tibble(exponential = extract.samples(m9.1) %>%
pluck("sigma"),
uniform = extract.samples(m9.1.upd) %>%
pluck("sigma")) %>%
pivot_longer(cols = everything(),
names_to = "model",
values_to = "Sigma") %>%
ggplot(aes(Sigma, fill = model)) +
geom_density(colour = grey,
alpha = 0.6) +
scale_fill_manual(values = c(blue, red)) +
labs(y = NULL, fill = NULL) +
theme_minimal()
```
The posteriors are nearly identical; there is so much data that the prior is simply overwhelmed. If anything, the uniform prior results in a slightly broader and less peaked posterior, expressing a bit more uncertainty.
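We can back this up numerically with a quick check of my own, comparing the posterior standard deviation of `sigma` under both priors:
```{r}
tibble(exponential = extract.samples(m9.1) %>% pluck("sigma"),
       uniform = extract.samples(m9.1.upd) %>% pluck("sigma")) %>%
  summarise(across(everything(), sd))
```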
## 9M2
> Modify the terrain ruggedness model again. This time, change the prior for `b[cid]` to `dexp(0.3)`. What does this do to the posterior distribution? Can you explain it?
First, let's take a look at the modified prior distribution:
```{r 9M2 Figure 5, fig.cap="Prior distributions on b[cid]"}
tibble(exponential = rexp(1e3, 0.3)) %>%
ggplot(aes(exponential)) +
geom_density(colour = grey,
alpha = 0.8, fill = brown) +
labs(y = NULL, fill = NULL) +
theme_minimal()
```
And now refit the model.
```{r 9m2 m9.1.mod}
m9.1.mod <- alist(
log_gdp_std ~ dnorm(mu, sigma),
mu <- a[cid] + b[cid] * (rugged_std - 0.215),
a[cid] ~ dnorm(1, 0.1),
b[cid] ~ dexp(0.3),
sigma ~ dexp(1)) %>%
ulam(data = dat_rugged, chains = 1)
```
Looking at the coefficient estimates, we can already see that something happened to `b[2]`.
```{r 9m2 coefficient}
precis(m9.1.mod, depth = 2) %>%
as_tibble(rownames = "coefficient") %>%
knitr::kable(digits = 2)
```
Let's plot the posterior of `b[2]`:
```{r Figure 6, fig.cap="Posterior distributions for b[2]"}
extract.samples(m9.1.mod) %>%
pluck("b") %>%
.[, 2] %>%
as_tibble_col(column_name = "b2") %>%
ggplot(aes(b2)) +
geom_density(colour = grey,
fill = brown,
alpha = 0.8) +
labs(y = NULL) +
theme_minimal()
```
We can clearly see that the posterior gets cut off at zero and only shows positive values. This is because the prior does not allow any values below zero.
## 9M3
> Re-estimate one of the Stan models from the chapter, but at different numbers of `warmup` iterations. Be sure to use the same number of sampling iterations in each case. Compare the `n_eff` values. How much warmup is enough?
Let's stick to the model `m9.1` and define a function that refits it with a given number of `warmup` iterations and returns the effective number of samples. Similar to the `update()` function from the *brms* package, we can call `ulam()` on an already fitted model, which makes updating much easier. This is described in the help page `?ulam`.
```{r 9M3 refit_m9.1}
refit_m9.1 <- function(N){
ulam(m9.1, chains = 1, warmup = N, iter = 1000,
cores = parallel::detectCores()) %>%
precis() %>%
as_tibble() %>%
pull(n_eff)
}
```
Now we can iterate over the `warmup` numbers using `purrr::map`.
```{r 9M3 refit_m9.1 tibble, results='hide'}
n_effective <- seq(1, 500, length.out = 100) %>%
round(0) %>%
map_dbl(refit_m9.1)
```
We just refitted 100 new models with warmups between 1 and 500. Let's plot the results.
```{r 9M3 Figure 7, fig.cap="Effective number of samples (ESS) as a function of warmup."}
n_effective %>%
as_tibble_col(column_name = "n_eff") %>%
add_column(warmup = seq(1, 500, length.out = 100)) %>%
ggplot(aes(warmup, n_eff)) +
geom_line(colour = red, size = 0.9) +
labs(x = "Warm-up", y = "ESS") +
theme_minimal()
```
With such a simple model, and such low correlation between parameters, the effective number of samples increases quickly. We get a robust estimate for warmups larger than about 20. But note that this is not the case for all models.
# Hard practices
## 9H1
> Run the model below and then inspect the posterior distribution and explain what it is accomplishing.
```{r 9H1 setup, results='hide'}
mp <- map2stan(
alist(
a ~ dnorm(0,1),
b ~ dcauchy(0,1)),
data=list(y=1),
start=list(a=0,b=0),
iter=1e4,
warmup=100,
WAIC=FALSE )
```
> Compare the samples for the parameters a and b. Can you explain the different trace plots, using what you know about the Cauchy distribution?
Let's look at the priors first.
```{r 9H1 Figure 8, fig.cap="Prior simulation for a and b."}
tibble(a = rnorm(1e5, 0, 1),
b = rcauchy(1e5, 0, 1)) %>%
pivot_longer(cols = everything(),
names_to = "Parameter",
values_to = "Estimate") %>%
ggplot(aes(Estimate, fill = Parameter)) +
geom_density(colour = brown) +
scale_fill_manual(values = c(red, blue)) +
facet_wrap(~ Parameter, scales = "free") +
labs(y = NULL, fill = NULL) +
theme_minimal()
```
The normal prior on `a` is pretty straightforward. But looking at `b` and the Cauchy prior paints a different picture: the Cauchy distribution places most of its probability mass in the same area as `a`, but produces some very extreme outliers.
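A quick numeric check of my own makes those heavy tails explicit; compare the upper quantiles of the standard normal and the standard Cauchy:
```{r}
qnorm(c(0.975, 0.999))   # roughly 1.96 and 3.09
qcauchy(c(0.975, 0.999)) # roughly 12.7 and 318.3
```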
Now let's look at the posterior:
```{r 9H1 Figure 9, fig.cap="Posterior distribution for a and b."}
extract.samples(mp) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "Parameter",
values_to = "Estimate") %>%
ggplot(aes(Estimate, fill = Parameter)) +
geom_density(colour = brown) +
scale_fill_manual(values = c(red, blue)) +
facet_wrap(~ Parameter, scales = "free") +
labs(y = NULL, fill = NULL) +
theme_minimal()
```
Notice a pattern? The posteriors look exactly like the priors. That's simply because we didn't specify any likelihood, any data to update our beliefs, so the MCMC just samples from the priors. Now what do the trace plots look like?
```{r 9H1 Figure 10, fig.cap="Trace plots for model mp."}
plot(mp, n_col = 2)
```
The chain for `a` looks like a beautiful, hairy caterpillar, so that's fine. However, `b` looks worrying: the chain diverges from the centre to some extreme values from time to time, which normally indicates convergence problems. I think it is fine in this case, though, as it closely mirrors the Cauchy distribution, with a lot of probability mass around the centre and some very extreme outliers.
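The occasional extreme excursions have a deeper cause: the Cauchy distribution has no defined mean, so even the running mean of its samples never settles down. A quick demonstration of my own:
```{r}
set.seed(42)
tibble(step = 1:1e4, b = rcauchy(1e4)) %>%
  mutate(running_mean = cumsum(b) / step) %>%
  ggplot(aes(step, running_mean)) +
  geom_line(colour = red) +
  labs(x = "Sample", y = "Running mean") +
  theme_minimal()
```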
## 9H2
> Recall the divorce rate example from Chapter 5. Repeat that analysis, using `ulam()` this time, fitting models m5.1, m5.2, and m5.3. Use `compare` to compare the models on the basis of WAIC or PSIS. Explain the results.
Ok, we need the data and the processing steps as in Chapter 5.
```{r 9H2 data}
data(WaffleDivorce)
dat_waffel <- WaffleDivorce %>%
as_tibble() %>%
transmute(across(c(Divorce, Marriage, MedianAgeMarriage), standardize)) %>%
select(D = Divorce, M = Marriage, A = MedianAgeMarriage)
```
And here are the models, now refit with `ulam()`. Note that we need to set `log_lik = TRUE` to enable the calculation of WAIC and LOO.
```{r 9H2 models, results='hide', message=FALSE}
m5.1 <- alist(
D ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 0.2),
bA ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
ulam(data = dat_waffel, cores = 8, log_lik = TRUE)
m5.2 <- alist(
D ~ dnorm(mu, sigma),
mu <- a + bM * M,
a ~ dnorm(0, 0.2),
bM ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
ulam(data = dat_waffel, cores = 8, log_lik = TRUE)
m5.3 <- alist(
D ~ dnorm(mu, sigma),
mu <- a + bM * M + bA * A,
a ~ dnorm(0, 0.2),
c(bM, bA) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
ulam(data = dat_waffel, cores = 8, log_lik = TRUE)
```
We should check the MCMC performance.
```{r 9H2 Figure 11, fig.cap="Trace plots for all models from chapter 5."}
c(m5.1, m5.2, m5.3) %>%
walk(traceplot)
```
They all look fine, and the number of effective samples is fine for all of them as well. However, it is substantially lower for the parameters in `m5.3`, indicating correlation. Now let's check the Rhat values.
```{r 9H2 rhat}
get_rhat <- function(model.input) {
model.input %>%
precis() %>%
as_tibble(rownames = "estimate") %>%
select(estimate, rhat = Rhat4)
}
c(m5.1, m5.2, m5.3) %>%
map_dfr(get_rhat) %>%
add_column(model = c(rep("5.1", 3),
rep("5.2", 3),
rep("5.3", 4)), .before = "estimate") %>%
knitr::kable(digits = 3)
```
They all look good. But note that they are slightly elevated in model `m5.3`, probably due to the correlation between parameters.
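We can verify that suspicion directly with a quick check of my own on the posterior correlation between `bM` and `bA` in `m5.3`:
```{r}
extract.samples(m5.3) %>%
  as_tibble() %>%
  summarise(cor_bM_bA = cor(bM, bA))
```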
Now we can compare the models via the `compare()` function.
```{r 9H2 compare}
compare(m5.1, m5.2, m5.3, func = PSIS) %>%
as_tibble(rownames = "Model") %>%
knitr::kable(digits = 2)
```
Model `m5.1`, with only median age at marriage as a predictor, performs best, but it is not really distinguishable from model `m5.3`. However, the model with marriage rate only, `m5.2`, clearly performs worse than both.
## 9H3
> Sometimes changing a prior for one parameter has unanticipated effects on other parameters. This is because when a parameter is highly correlated with another parameter in the posterior, the prior influences both parameters. Here’s an example to work and think through. Go back to the leg length example in Chapter 5. Here is the code again, which simulates height and leg lengths for 100 imagined individuals:
```{r 9H3 set up 1}
N <- 100 # number of individuals
height <- rnorm(N,10,2) # sim total height of each
leg_prop <- runif(N,0.4,0.5) # leg as proportion of height
leg_left <- leg_prop*height + # sim left leg as proportion + error
rnorm( N , 0 , 0.02 )
leg_right <- leg_prop*height + # sim right leg as proportion + error
rnorm( N , 0 , 0.02 )
# combine into data frame
d <- data.frame(height,leg_left,leg_right)
```
> And below is the model you fit before, resulting in a highly correlated posterior for the two beta parameters. This time, fit the model using `ulam()`:
```{r 9H3 set up 2, results='hide'}
m5.8s <- ulam(
alist(
height ~ dnorm( mu , sigma ),
mu <- a + bl*leg_left + br*leg_right,
a ~ dnorm( 10 , 100 ),
bl ~ dnorm( 2 , 10 ),
br ~ dnorm( 2 , 10 ),
sigma ~ dexp( 1 )),
data=d, chains=4, start=list(a=10,bl=0,br=0.1,sigma=1),
log_lik = TRUE)
```
> Compare the posterior distribution produced by the code above to the posterior distribution produced when you change the prior for `br` so that it is strictly positive:
```{r 9H3 set up 3, results='hide'}
m5.8s2 <- ulam(
alist(
height ~ dnorm( mu , sigma ),
mu <- a + bl*leg_left + br*leg_right,
a ~ dnorm( 10 , 100 ),
bl ~ dnorm( 2 , 10 ),
br ~ dnorm( 2 , 10 ),
sigma ~ dexp( 1 )),
data=d, chains=4,
constraints=list(br="lower=0"),
start=list(a=10,bl=0,br=0.1,sigma=1),
log_lik = TRUE)
```
> Note the `constraints` list. What this does is constrain the prior distribution of `br` so that it has positive probability only above zero. In other words, that prior ensures that the posterior distribution for `br` will have no probability mass below zero. Compare the two posterior distributions for `m5.8s` and `m5.8s2`. What has changed in the posterior distribution of both beta parameters? Can you explain the change induced by the change in prior?
First let's take a look at both model summaries:
```{r 9H3 summaries}
precis(m5.8s) %>%
as_tibble(rownames = "estimate") %>%
add_column(model = "m5.8s", .before = "estimate") %>%
full_join(
precis(m5.8s2) %>%
as_tibble(rownames = "estimate") %>%
add_column(model = "m5.8s2", .before = "estimate")
) %>%
arrange(estimate) %>%
knitr::kable(digits = 3)
```
We can already see that changing the prior on one parameter (`br`) results in a change in the other parameter (`bl`).
We can and should plot the whole posterior distributions:
```{r 9H3 Figure 12, fig.cap="Posterior distributions for all estimates of both models."}
extract.samples(m5.8s) %>%
as_tibble() %>%
add_column(model = "m5.8s") %>%
full_join(
extract.samples(m5.8s2) %>%
as_tibble() %>%
add_column(model = "m5.8s2")
) %>%
pivot_longer(cols = -model,
names_to = "estimate") %>%
ggplot(aes(value, fill = model)) +
geom_density(colour = grey, alpha = 0.7) +
facet_wrap(~estimate, scales = "free") +
scale_fill_manual(values = c(red, blue)) +
scale_y_continuous(breaks = 0, labels = NULL) +
theme_minimal() +
labs(y = NULL, fill = "Model",
x = "Posterior estimate") +
theme(legend.position = c(0.9, 0.9),
legend.background = element_rect(colour = "white"))
```
We can see that the posteriors for the intercept and sigma remain constant. But forcing `br` to be positive shifts the posterior for `bl` to the left. That's because `br` and `bl` are strongly negatively correlated in the likelihood (the data), and the model keeps this information. The only way to express it in the posterior is to shift the estimate for `bl` to the left when `br` is shifted to the right. It's just a necessity arising from the extreme negative correlation in the data.
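To make this concrete, here is a quick check of my own on the posterior correlation between `bl` and `br` in the unconstrained model; it should come out strongly negative:
```{r}
extract.samples(m5.8s) %>%
  as_tibble() %>%
  summarise(cor_bl_br = cor(bl, br))
```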
## 9H4
> For the two models fit in the previous problem, use WAIC or PSIS to compare the effective numbers of parameters for each model. You will need to use log_lik=TRUE to instruct ulam() to compute the terms that both WAIC and PSIS need. Which model has more effective parameters? Why?
We can just throw both models into the `compare()` function with Pareto smoothed importance sampling (PSIS).
```{r 9H4 compare}
compare(m5.8s, m5.8s2, func = "PSIS") %>%
as_tibble(rownames = "Estimate") %>%
knitr::kable(digits = 2)
```
The important column here is `pPSIS` containing the effective number of parameters given by PSIS, or "the overfitting penalty". There is not much difference in these models, but `m5.8s2` is a bit less flexible as the prior is more informative. At the same time, the model fits the data slightly worse as indicated by the other PSIS estimates.
## 9H5
> Modify the Metropolis algorithm code from the chapter to handle the case that the island populations have a different distribution than the island labels. This means the island’s number will not be the same as its population.
Here's the code from the chapter:
```{r 9H5 setup, eval=FALSE}
num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for (i in 1:num_weeks) {
# record current position
positions[i] <- current
# flip coin to generate proposal
proposal <- current + sample(c(-1, 1) , size = 1)
# now make sure he loops around the archipelago
if (proposal < 1)
proposal <- 10
if (proposal > 10)
proposal <- 1
# move?
prob_move <- proposal / current
current <- ifelse(runif(1) < prob_move , proposal , current)
}
```
Let's set up a data frame with the number of each island (1 to 10) and the population of each island.
```{r 9H5 island data}
set.seed(123)
island <- tibble(current = 1:10,
population = rpois(10, 10))
island %>%
arrange(desc(population)) %>%
knitr::kable()
```
Island 6 has the highest population, and island 9 the lowest.
Now, to modify the code, we simply make the probability of moving, `prob_move`, depend on the population rather than on the island's number (which was equal to the population in the original code).
```{r 9H5 new loop}
num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for (i in 1:num_weeks) {
# record current position
positions[i] <- island$current[current]
# flip coin to generate proposal
proposal <- island$current[current] + sample(c(-1, 1), size = 1)
# now make sure he loops around the archipelago
if (proposal < 1)
proposal <- 10
if (proposal > 10)
proposal <- 1
# move?
prob_move <- island$population[proposal] / island$population[current]
current <- ifelse(runif(1) < prob_move , island$current[proposal] , island$current[current])
}
```
Now we can plot the results. We start by plotting the first 100 weeks:
```{r 9H5 Figure 13, fig.cap="Walk-through the first 100 island visits."}
tibble(positions = positions[1:100], week = 1:100) %>%
ggplot(aes(week, positions)) +
geom_point(shape = 21, size = 2,
stroke = 1.1, alpha = 0.8,
fill = blue, colour = brown) +
labs(y = "Island", x = "Week") +
scale_y_continuous(breaks = seq(2, 10, by = 2)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
```
The movement looks good; now let's take a look at the total number of visits for each island.
```{r 9H5 Figure 14, fig.cap="The long-run behaviour of the algorithm."}
tibble(positions = positions) %>%
count(positions, name = "n_weeks") %>%
mutate(positions = as_factor(positions),
positions = fct_reorder(positions, n_weeks)) %>%
ggplot(aes(positions, n_weeks)) +
geom_segment(aes(xend = positions, y = 0, yend = n_weeks),
size = 2, colour = grey) +
geom_point(shape = 21, size = 4,
stroke = 1.1, alpha = 0.8,
fill = blue, colour = brown) +
labs(x = "Island", y = "Number of weeks") +
theme_minimal() +
theme(panel.grid = element_blank(),
axis.ticks.y = element_line())
```
That's a perfect result, with island 9 having the fewest visits and island 6 the most. Another way to visualise this is to plot the number of weeks on each island against its population size.
```{r 9H5 Figure 15, fig.cap="Population size of each island plotted against the number of weeks each island was visited."}
island %>%
rename(positions = current) %>%
left_join(
tibble(positions = positions) %>%
count(positions, name = "n_weeks")
) %>%
ggplot(aes(n_weeks, population,
label = positions)) +
geom_label(colour = brown, size = 6) +
labs(x = "Number of weeks", y = "Population") +
theme_minimal()
```
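We can also confirm this numerically: in the long run, the number of weeks spent on each island should be almost perfectly correlated with its population (a quick check of my own):
```{r}
tibble(positions = positions) %>%
  count(positions, name = "n_weeks") %>%
  left_join(island, by = c("positions" = "current")) %>%
  summarise(correlation = cor(n_weeks, population))
```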
So with these three plots at hand, and some `magick` as well as `gganimate`, we can make a nice GIF showing the build-up of the Markov chain for the first 100 steps.
```{r, eval = FALSE}
library(magick)
library(gganimate)
# first plot
walk_through <- tibble(positions = positions[1:100],
week = 1:100) %>%
ggplot(aes(week, positions)) +
geom_point(aes(week2, positions2),
shape = 21, size = 2,
stroke = 1.1, alpha = 0.3,
fill = grey, colour = brown,
data = tibble(positions2 = positions[1:100],
week2 = 1:100)) +
geom_point(shape = 21, size = 4,
stroke = 1.1, alpha = 0.8,
fill = blue, colour = brown) +
labs(y = "Island", x = "Week") +
scale_y_continuous(breaks = seq(2, 10, by = 2)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
transition_reveal(week) +
ease_aes()
walk_through_gif <- animate(walk_through, width = 240, height = 240)
# second plot
week_counts <- tibble(positions = positions[1:100], week = 1:100) %>%
count(positions, week) %>%
group_by(positions) %>%
arrange(week) %>%
mutate(n_weeks = cumsum(n)) %>%
ungroup()
lollipop <- week_counts %>%
expand(positions, week) %>%
left_join(week_counts) %>%
replace_na(list(n_weeks = 0)) %>%
group_by(positions) %>%
mutate(n_weeks = cumsum(n_weeks),
positions = as_factor(positions)) %>%
ggplot(aes(positions, n_weeks)) +
geom_segment(aes(xend = positions, y = 0, yend = n_weeks),
size = 2, colour = grey) +
geom_point(shape = 21, size = 4,
stroke = 1.1, alpha = 0.8,
fill = blue, colour = brown) +
labs(x = "Island", y = "Number of weeks") +
theme_minimal() +
theme(panel.grid = element_blank(),
axis.ticks.y = element_line()) +
transition_time(week)
lollipop_gif <- animate(lollipop, width = 240/2, height = 240)
# third plot
corr_plot <- island %>%
rename(positions = current) %>%
mutate(positions = as_factor(positions)) %>%
full_join(
week_counts %>%
expand(positions, week) %>%
left_join(week_counts) %>%
replace_na(list(n_weeks = 0)) %>%
group_by(positions) %>%
mutate(n_weeks = cumsum(n_weeks),
positions = as_factor(positions))
) %>%
drop_na(week) %>%
ggplot(aes(n_weeks, population)) +
geom_smooth(aes(x, y), colour = brown,
method = "lm", se = FALSE,
data = tibble(x = 1:250,
y = seq(1, 15, length.out = 250))) +
geom_label(aes(label = positions),
colour = brown, size = 6) +
labs(x = "Number of weeks", y = "Population") +
theme_minimal() +
transition_time(week)
corr_plot_gif <- animate(corr_plot, width = 240/2, height = 240)
```
![](/home/gregor/Dokumente/Random/statistical-rethinking/walk_through_gif.gif){width=30%}
![](/home/gregor/Dokumente/Random/statistical-rethinking/lollipop_gif.gif){width=15%} ![](/home/gregor/Dokumente/Random/statistical-rethinking/corr_plot_gif.gif){width=15%}
Here we see the problem of the Metropolis algorithm. After 100 weeks, you start to see a pattern, but it does not really reflect the true posterior. It just needs more time to converge.
## 9H6
> Modify the Metropolis algorithm code from the chapter to write your own simple MCMC estimator for globe tossing data and model from Chapter 2.
So first let's take a look at the model from chapter 2. We were tossing the globe 9 times (*N*) and it landed on water (*w*) 6 times out of these 9. We assume that every toss is independent of the others and that the probability of *w* is the same on every toss. So we can describe *w* as binomially distributed: $$w \sim \text{Binomial}(N, p)$$
where *N* is the total number of tosses and *p* is the probability of water, an unobserved parameter: $$p \sim \text{Uniform}(0, 1)$$
We give *p* a uniform prior, ranging from zero to one, which is obviously not the best choice, but we will stick to the chapter. Let's define some parameters: we will run the algorithm for 10,000 steps. Our data is *w* = 6 and *n* = 9. Our starting point for *p* will be 0.5, the middle of our prior on *p*. The proposal step size will be 0.1.
Let's transform this into R code:
```{r, results='hide'}
# set seed
set.seed(123)
# iteration
n_iter <- 1e4
# water count
w <- 6
# total tosses
n <- 9
# starting point
current <- 0.5
# empty vector to record data
positions <- rep(NA, n_iter)
sample_metro <- function() {
for (i in 1:n_iter) {
# record current position
positions[i] <- current
# generate proposal
proposal <- current + runif(1, -0.1, 0.1)
# make sure that the posterior is bounded between zero and one
if (proposal < 0) proposal <- abs(proposal)
if (proposal > 1) proposal <- 1-(proposal - 1)
# compute likelihoods
lkhd_current <- dbinom(w, n, prob = current)
lkhd_proposal <- dbinom(w, n, prob = proposal)
# compute posteriors
prob_current <- lkhd_current * dunif(current)
prob_proposal <- lkhd_proposal * dunif(proposal)
# move
prob_move <- prob_proposal/prob_current
current <- if_else(runif(1) < prob_move, proposal, current)
}
# return positions
positions
}
# run function
results <- sample_metro()
```
Let's recap what we did. We first generated a proposal that sits a bit (up to 0.1) to the left or to the right of the current position (which was 0.5 at the start). Think of it like a straight line of islands placed next to each other: when the King is somewhere in the middle, he can move a little to the left, a little to the right, or stay where he is. What happens when a proposal falls below zero or above one? He simply bounces back in the other direction, which is what the section *make sure that the posterior is bounded between zero and one* does. So instead of a circle of islands as in the chapter, we have a straight line of islands. We then compute the likelihood for both the current position and the proposal, which we defined as binomially distributed. We transform each likelihood into a posterior, as always, by multiplying it with the prior; as the prior is uniform, we just multiply the likelihood by one. We then calculate the ratio of the posterior at the proposal to the posterior at the current position. If this ratio is larger than a uniformly sampled number between zero and one, we move to the proposal; otherwise we stay at the current position. I moved the for loop into a function so we can call it repeatedly later and collect the results.
Now let's take a look at the results. Here is each sample from the posterior over time:
```{r}
tibble(results = results) %>%
ggplot() +
geom_line(aes(x = 1:1e4, results),
colour = red) +
labs(y = "Probability water",
x = "Step") +
theme_minimal()
```
We can see that the chain is quite autocorrelated, which shouldn't be a big problem for this easy example; we still explore the whole posterior and don't get stuck anywhere.
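To quantify that impression, here is a quick check of my own using the `coda` package (which comes along with *rethinking*'s dependencies):
```{r}
# lag-1 autocorrelation of the chain
acf(results, lag.max = 1, plot = FALSE)$acf[2]
# rough effective sample size
coda::effectiveSize(results)
```
Let's see what the posterior looks like: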
```{r}
tibble(results = results) %>%
ggplot() +
geom_density(aes(results),
colour = red, fill = red,
alpha = 0.8) +
theme_minimal() +
labs(x = "Probability water",
y = NULL) +
scale_y_continuous(breaks = NULL) +
geom_function(fun = dbeta, args = c(shape1 = 7, shape2 = 4),
colour = blue, size = 1.5)
```
I added the analytically derived posterior for this model in blue: with a flat prior, the posterior is proportional to $p^w (1 - p)^{n - w}$, which is a Beta($w + 1$, $n - w + 1$) = Beta(7, 4) distribution. We can see that we get quite a nice fit. We can improve it even further by sampling from more chains:
```{r}
results2 <- sample_metro()
results3 <- sample_metro()
results4 <- sample_metro()
multi_chains <- tibble(p_samples = c(results, results2,
results3, results4),
chain = as_factor(rep(1:4, each = 1e4)))
multi_chains %>%
add_column(steps = rep(1:1e4, 4)) %>%
ggplot() +
geom_line(aes(steps, p_samples,
colour = chain),
alpha = 0.8) +
scale_colour_manual(values = c(red, blue, brown, grey)) +
labs(y = "Probability water",
x = "Step") +
theme_minimal()
```
And here's the posterior with the samples from all 4 chains:
```{r}
multi_chains %>%
ggplot() +
geom_density(aes(p_samples),
colour = red, fill = red,
alpha = 0.8) +
theme_minimal() +
geom_function(fun = dbeta, args = c(shape1 = 7, shape2 = 4),
colour = blue, size = 1.5) +
scale_y_continuous(breaks = NULL) +
labs(x = "Probability water",
y = NULL)
```
## 9H7
> Can you write your own Hamiltonian Monte Carlo algorithm for the globe tossing data, using the R code in the chapter? You will have to write your own functions for the likelihood and gradient, but you can use the `HMC2` function.
So we need to start with a function `neg_log_prob` that returns the negative log-probability of the data at the current position (parameter values). We get it by adding the log-likelihood to the log-prior, giving the (unnormalised) log posterior probability at the current position *q*.
```{r}
neg_log_prob <- function(q) {
# calculate log-probability at q
U <- dbinom(w, n, q, log = TRUE) + dunif(q, log = TRUE)
# return negative probability
-U
}
```
Then we need the gradient function, which is the derivative of the binomial log-probability. I simply stole this one from McElreath:
```{r}
neg_log_gradient <- function(q) {
# calculate the derivative of the binomial log-probability
G <- (w - n*q) / (q*(1 - q))
# return the negative
-G
}
```
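For reference, this gradient follows directly from differentiating the binomial log-probability (the flat prior contributes nothing):
$$\frac{d}{dq}\Big[w \log q + (n - w)\log(1 - q)\Big] = \frac{w}{q} - \frac{n - w}{1 - q} = \frac{w - nq}{q(1 - q)}$$
and `neg_log_gradient()` simply returns the negative of this.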
Now we can simply modify the code from the chapter to run the Hamiltonian Monte Carlo simulation.
```{r}
# the data from chapter 2
w <- 6
n <- 9
# set up Q for HMC2
Q <- list()
Q$q <- 0.5
# the number of samples
n_samples <- 1000
# size of the leapfrog step
epsilon <- 0.03
# number of leapfrog steps
L <- 10
# initialize empty vectors to store data
samples <- rep(NA, n_samples)
accepted <- rep(NA, n_samples)
# start the simulation
for (i in 1:n_samples) {
Q <- HMC2(U = neg_log_prob,
grad_U = neg_log_gradient,
epsilon = epsilon,
L = L,
current_q = Q$q)
samples[i] <- Q$q
accepted[i] <- Q$accept