// estimate_binary_vector.stan
// Authors: Rob Hinch (method) and Chris Wymant (pedagogy)
//
// We have a set of observations of a value y.
// Each observation was generated by either a 'signal' process or a 'noise'
// process; which of the two it was is unknown.
// The distributions of y for the two processes differ.
// We want to estimate the parameters of the two processes and their relative
// frequencies, i.e. the population-level parameters. However, we also want to
// estimate whether each individual observation was signal or noise.
//
// You could think of there being a latent binary parameter for each of the N
// observations - signal or noise - and collecting these parameters into a
// binary-valued vector of length N, called V say. Including V with the other
// parameters in the likelihood, P(data | V, other parameters), each observation
// would only contribute one of the two possibilities to the likelihood - the
// one for if they're noise or the one for if they're signal. However, we'd then
// have to explore all 2^N possible discrete values of V as part of the
// exploration of this parameter space, each of which has its own continuous
// parameter space for the values of all the other continuous parameters, i.e.
// we'd need to explore 2^N separate continuous parameter spaces. That's
// computationally very challenging, and indeed Stan doesn't allow this.
//
// So instead of calculating the likelihood as P(data | V, other parameters),
// we analytically marginalise over all 2^N possible values of V:
// P(data | other parameters) = sum_{all possible V} P(data | V, other parameters) P(V)
// where P(V) is the prior probability of the assignment V, with each
// observation independently being signal with probability fraction_signal.
// Hence each observation contributes to the likelihood a factor
// P(their data | signal) fraction_signal + P(their data | noise) (1 - fraction_signal)
// The posterior distribution for the probability that an observation is
// noise is given by the posterior distribution of [the second of those two
// terms divided by the sum of both terms].
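// As a purely hypothetical numeric illustration: if fraction_signal = 0.7
// and, for some observation, the lognormal densities happen to be
// P(their data | signal) = 0.10 and P(their data | noise) = 0.02, then that
// observation contributes 0.10 * 0.7 + 0.02 * 0.3 = 0.076 to the likelihood,
// and its probability of being noise given those parameter values is
// 0.006 / 0.076, roughly 0.08.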
//
// Conceptually, we're treating [the probability that any given
// observation was signal rather than noise] as a parameter in its own right.
// Think about that parameter as being the fraction of observations that would
// be signal, if you obtained a large number of observations with that precise
// value. We calculate the posterior probability density for that parameter,
// i.e. the posterior probability for the probability that it's signal. This
// sounds like a tricky concept - the probability of a probability. To make
// sense of that, imagine that we have so much data that we can estimate all the
// parameters of the model with very high precision, and we encounter one
// observation that is in the middle of the overlap between the distribution for
// signal observations and the distribution for noise observations ('in the
// middle' should take into account the weighting of the two distributions, i.e.
// the relative frequency of signal vs noise). Given our precise estimate of the
// parameters we can say with confidence that this observation has a 50:50 chance
// of being signal or noise, or in other words, if we observed many observations
// with that precise value, we're confident that 50% of them would be signal and
// 50% would be noise. Then introduce some uncertainty into our estimate of
// the parameters, and we might still have 50:50 as our best guess for a given
// observation's probability, but there's now epistemological uncertainty
// layered onto that in addition to the ontological uncertainty due to our model
// being intrinsically stochastic (the signal and noise distributions overlap,
// so you can never know for sure).
//
// We enforce that the noise distribution has lower mean than the signal
// distribution, otherwise the problem is symmetric under an exchange of what we
// call signal and what we call noise, and there is a two-fold redundancy in
// all/part of parameter space that is likely to mess up inference.
//
// The specific distributions used for signal and noise are lognormal, with
// meanlog parameter mu and sdlog parameter sigma.
data {
// First, actual data
int<lower = 0> num_observations;
array[num_observations] real<lower = 0> y;
// Second, things that are not actually data but should be kept fixed over one
// whole round of inference
// Upper and lower bounds for parameter priors (assumed uniformly distributed)
real prior_mu_signal_min;
real<lower = prior_mu_signal_min> prior_mu_signal_max;
real prior_mu_noise_min;
real<lower = prior_mu_noise_min> prior_mu_noise_max;
real<lower = 0> prior_sigma_signal_min;
real<lower = prior_sigma_signal_min> prior_sigma_signal_max;
real<lower = 0> prior_sigma_noise_min;
real<lower = prior_sigma_noise_min> prior_sigma_noise_max;
real<lower = 0, upper = 1> prior_fraction_signal_min;
real<lower = prior_fraction_signal_min, upper = 1> prior_fraction_signal_max;
// A boolean switch for whether we sample from the
// posterior or the prior, to see how and how much the data are updating our
// beliefs about the parameters and the kind of data they generate.
// 0 for prior, 1 for posterior
int<lower = 0, upper = 1> get_posterior_not_prior;
}
parameters {
real<lower = prior_mu_signal_min, upper = prior_mu_signal_max> mu_signal;
real<lower = prior_mu_noise_min, upper = fmin(prior_mu_noise_max, mu_signal)> mu_noise;
real<lower = prior_sigma_signal_min, upper = prior_sigma_signal_max> sigma_signal;
real<lower = prior_sigma_noise_min, upper = prior_sigma_noise_max> sigma_noise;
real<lower = prior_fraction_signal_min, upper = prior_fraction_signal_max> fraction_signal;
}
// For each observation, calculate a variable lp_signal which is
// log(Prob(data | params, this observation is signal) * Prob(this observation is signal))
// and a variable lp_noise which is
// log(Prob(data | params, this observation is noise) * Prob(this observation is noise))
// For a given observation, exponentiating these two terms and adding them
// gives Prob(their data | params), by the law of total probability, since
// P(signal) + P(noise) = 1.
transformed parameters {
array[num_observations] real lp_signal;
array[num_observations] real lp_noise;
for (observation in 1:num_observations) {
lp_signal[observation] = lognormal_lpdf(y[observation] | mu_signal, sigma_signal) + log(fraction_signal);
lp_noise[observation] = lognormal_lpdf(y[observation] | mu_noise, sigma_noise) + log(1 - fraction_signal);
}
}
model {
// Priors should be defined here.
// Uniform priors whose boundaries are fixed do not need to be stated because
// they are implicitly understood.
// Uniform priors whose boundaries depend on other parameters, like mu_noise
// here, must be stated explicitly, as discussed at
// https://htmlpreview.github.io/?https://github.com/ChrisHIV/teaching/blob/main/other_topics/Stan_example_dependent_uniform_priors.html
mu_noise ~ uniform(prior_mu_noise_min, fmin(prior_mu_noise_max, mu_signal));
// Calculate the likelihood if we're sampling from the posterior.
// The hypotheses of signal and noise are mutually exclusive, so we should add
// their likelihoods weighted by P(signal) and P(noise)
// To increment the overall log probability density by log(a + b), where a is
// the likelihood assuming signal times the probability of signal, i.e.
// exp(lp_signal), and b is the same for noise, we use
// log_sum_exp(lp_signal, lp_noise), defined such that
// log_sum_exp(log(a), log(b)) = log(a + b)
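// For example, log_sum_exp(log(0.2), log(0.3)) = log(0.5). Working on the
// log scale like this avoids numerical underflow when the densities a and b
// are tiny.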
if (get_posterior_not_prior) {
for (observation in 1:num_observations) {
target += log_sum_exp(lp_signal[observation], lp_noise[observation]);
}
}
}
// The posterior probability that a given observation is noise, conditioning on
// specific parameter values for quantities below but temporarily suppressing
// this for clarity, is
// Prob(noise | data) = Prob(data | noise) P(noise) / P(data)
// = Prob(data | noise) P(noise) / [
// P(data | noise) P(noise) + P(data | signal) P(signal) ]
// = exp(lp_noise) / [exp(lp_noise) + exp(lp_signal)]
// = exp(lp_noise) / exp(log_sum_exp(lp_signal, lp_noise))
// = exp(lp_noise - log_sum_exp(lp_signal, lp_noise))
// We calculate this per observation below.
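// As a sanity check on the formula: if lp_signal equals lp_noise, then
// log_sum_exp(lp_signal, lp_noise) = lp_noise + log(2), so the expression
// evaluates to exp(-log(2)) = 1/2, as expected for equally likely hypotheses.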
// No longer suppressing the conditioning on parameter values, this quantity is
// Prob(noise | data, params); Stan will return the posterior for this over
// parameter space, i.e. sampling in proportion to P(params | data), and so that
// posterior for this quantity defines the posterior for the probability of
// being noise.
generated quantities {
array[num_observations] real<lower = 0, upper = 1> prob_is_noise_given_params;
for (observation in 1:num_observations) {
prob_is_noise_given_params[observation] = exp(
lp_noise[observation] - log_sum_exp(lp_noise[observation], lp_signal[observation]));
}
}