// estimate_binary_vector.stan
// Authors: Rob Hinch (method) and Chris Wymant (pedagogy)
//
// We have a set of observations of a value y.
// Each observation was generated by either a 'signal' process or a 'noise'
// process; which of the two it was is unknown.
// The distributions of y for the two processes differ.
// We want to estimate the parameters of the two processes and their relative
// frequencies, i.e. the population-level parameters. However, we also want to
// estimate whether each individual observation was signal or noise.
//
// You could think of there being a latent binary parameter for each of the N
// observations - signal or noise - and collecting these parameters into a
// binary-valued vector of length N, called V say. Including V with the other
// parameters in the likelihood, P(data | V, other parameters), each observation
// would only contribute one of the two possibilities to the likelihood - the
// one for if they're noise or the one for if they're signal. However, we'd then
// have to explore all 2^N possible discrete values of V as part of the
// exploration of this parameter space, each of which has its own continuous
// parameter space for the values of all the other continuous parameters, i.e.
// we'd need to explore 2^N separate continuous parameter spaces. That's
// computationally very challenging, and indeed Stan doesn't allow this.
//
// So instead of calculating the likelihood as P(data | V, other parameters),
// we analytically marginalise over all 2^N possible values of V:
// P(data | other parameters) = sum_{all possible V} P(data | V, other parameters) P(V)
// where P(V) is the prior probability of the assignment V, with each
// observation independently being signal with probability fraction_signal.
// Hence each observation contributes to the likelihood a factor
// P(their data | signal) fraction_signal + P(their data | noise) (1 - fraction_signal)
// The posterior distribution for the probability that an observation is
// noise is given by the posterior distribution of [the second of those two
// terms divided by the sum of both terms].
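// As a purely hypothetical numeric illustration: if fraction_signal = 0.7
// and, for some observation, the lognormal densities happen to be
// P(their data | signal) = 0.10 and P(their data | noise) = 0.02, then that
// observation contributes 0.10 * 0.7 + 0.02 * 0.3 = 0.076 to the likelihood,
// and its probability of being noise given those parameter values is
// 0.006 / 0.076, roughly 0.08.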
//
// Conceptually, we're treating [the probability that any given
// observation was signal rather than noise] as a parameter in its own right.
// Think about that parameter as being the fraction of observations that would
// be signal, if you obtained a large number of observations with that precise
// value. We calculate the posterior probability density for that parameter,
// i.e. the posterior probability for the probability that it's signal. This
// sounds like a tricky concept - the probability of a probability. To make
// sense of that, imagine that we have so much data that we can estimate all the
// parameters of the model with very high precision, and we encounter one
// observation that is in the middle of the overlap between the distribution for
// signal observations and the distribution for noise observations ('in the
// middle' should take into account the weighting of the two distributions, i.e.
// the relative frequency of signal vs noise). Given our precise estimate of the
// parameters we can say with confidence that this observation has a 50:50 chance
// of being signal or noise, or in other words, if we observed many observations
// with that precise value, we're confident that 50% of them would be signal and
// 50% would be noise. Then introduce some uncertainty into our estimate of
// the parameters, and we might still have 50:50 as our best guess for a given
// observation's probability, but there's now epistemological uncertainty
// layered onto that in addition to the ontological uncertainty due to our model
// being intrinsically stochastic (the signal and noise distributions overlap,
// so you can never know for sure).
//
// We enforce that the noise distribution has lower mean than the signal
// distribution, otherwise the problem is symmetric under an exchange of what we
// call signal and what we call noise, and there is a two-fold redundancy in
// all/part of parameter space that is likely to mess up inference.
//
// The specific distributions used for signal and noise are lognormal, with
// meanlog parameter mu and sdlog parameter sigma.
data {
// First, actual data
int<lower = 0> num_observations;
array[num_observations] real<lower = 0> y;
// Second, things that are not actually data but should be kept fixed over one
// whole round of inference
// Upper and lower bounds for parameter priors (assumed uniformly distributed)
real prior_mu_signal_min;
real<lower = prior_mu_signal_min> prior_mu_signal_max;
real prior_mu_noise_min;
real<lower = prior_mu_noise_min> prior_mu_noise_max;
real<lower = 0> prior_sigma_signal_min;
real<lower = prior_sigma_signal_min> prior_sigma_signal_max;
real<lower = 0> prior_sigma_noise_min;
real<lower = prior_sigma_noise_min> prior_sigma_noise_max;
real<lower = 0, upper = 1> prior_fraction_signal_min;
real<lower = prior_fraction_signal_min, upper = 1> prior_fraction_signal_max;
// A boolean switch for whether we sample from the
// posterior or the prior, to see how and how much the data are updating our
// beliefs about the parameters and the kind of data they generate.
// 0 for prior, 1 for posterior
int<lower = 0, upper = 1> get_posterior_not_prior;
}
parameters {
real<lower = prior_mu_signal_min, upper = prior_mu_signal_max> mu_signal;
real<lower = prior_mu_noise_min, upper = fmin(prior_mu_noise_max, mu_signal)> mu_noise;
real<lower = prior_sigma_signal_min, upper = prior_sigma_signal_max> sigma_signal;
real<lower = prior_sigma_noise_min, upper = prior_sigma_noise_max> sigma_noise;
real<lower = prior_fraction_signal_min, upper = prior_fraction_signal_max> fraction_signal;
}
// For each observation, calculate a variable lp_signal which is
// log(Prob(data | params, this observation is signal) * Prob(this observation is signal))
// and a variable lp_noise which is
// log(Prob(data | params, this observation is noise) * Prob(this observation is noise))
// For a given observation, exponentiating these two terms and adding them
// gives Prob(their data | params), by the law of total probability, since
// P(signal) + P(noise) = 1.
transformed parameters {
array[num_observations] real lp_signal;
array[num_observations] real lp_noise;
for (observation in 1:num_observations) {
lp_signal[observation] = lognormal_lpdf(y[observation] | mu_signal, sigma_signal) + log(fraction_signal);
lp_noise[observation] = lognormal_lpdf(y[observation] | mu_noise, sigma_noise) + log(1 - fraction_signal);
}
}
model {
// Priors should be defined here.
// Uniform priors whose boundaries are fixed do not need to be stated because
// they are implicitly understood.
// Uniform priors whose boundaries depend on other parameters, like mu_noise
// here, must be stated explicitly, as discussed at
// https://htmlpreview.github.io/?https://github.com/ChrisHIV/teaching/blob/main/other_topics/Stan_example_dependent_uniform_priors.html
mu_noise ~ uniform(prior_mu_noise_min, fmin(prior_mu_noise_max, mu_signal));
// Calculate the likelihood if we're sampling from the posterior.
// The hypotheses of signal and noise are mutually exclusive, so we should add
// their likelihoods weighted by P(signal) and P(noise)
// To increment the overall log probability density by log(a + b), where a is
// the likelihood assuming signal times the probability of signal, i.e.
// exp(lp_signal), and b is the same for noise, we use
// log_sum_exp(lp_signal, lp_noise), defined such that
// log_sum_exp(log(a), log(b)) = log(a + b)
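// For example, log_sum_exp(log(0.2), log(0.3)) = log(0.5). Working on the
// log scale like this avoids numerical underflow when the densities a and b
// are tiny.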
if (get_posterior_not_prior) {
for (observation in 1:num_observations) {
target += log_sum_exp(lp_signal[observation], lp_noise[observation]);
}
}
}
// The posterior probability that a given observation is noise, conditioning on
// specific parameter values for quantities below but temporarily suppressing
// this for clarity, is
// Prob(noise | data) = Prob(data | noise) P(noise) / P(data)
// = Prob(data | noise) P(noise) / [
// P(data | noise) P(noise) + P(data | signal) P(signal) ]
// = exp(lp_noise) / [exp(lp_noise) + exp(lp_signal)]
// = exp(lp_noise) / exp(log_sum_exp(lp_signal, lp_noise))
// = exp(lp_noise - log_sum_exp(lp_signal, lp_noise))
// We calculate this per observation below.
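// As a sanity check on the formula: if lp_signal equals lp_noise, then
// log_sum_exp(lp_signal, lp_noise) = lp_noise + log(2), so the expression
// evaluates to exp(-log(2)) = 1/2, as expected for equally likely hypotheses.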
// No longer suppressing the conditioning on parameter values, this quantity is
// Prob(noise | data, params); Stan will return the posterior for this over
// parameter space, i.e. sampling in proportion to P(params | data), and so that
// posterior for this quantity defines the posterior for the probability of
// being noise.
generated quantities {
array[num_observations] real<lower = 0, upper = 1> prob_is_noise_given_params;
for (observation in 1:num_observations) {
prob_is_noise_given_params[observation] = exp(
lp_noise[observation] - log_sum_exp(lp_noise[observation], lp_signal[observation]));
}
}