Primer on Statistical Inference

Luke Western edited this page Jul 3, 2024 · 2 revisions

The term inverse modelling is widely used in the atmospheric sciences for methods that estimate some physical quantity, in our case the emissions from some source, using indirect measurements. This is a statistical problem, requiring a physical relationship to be turned into a statistical model. The methods used to infer useful information from this statistical model rely on probability theory -- that is, the output from a simulation of a physical process informs metrics such as the most probable emissions, or the expected emissions with some degree of uncertainty.

This document discusses probability, the choice of statistical models and how to make inferences from them.

Basic probability

Basic concepts of probability are intuitive. A basic question such as 'if there are 5 apples in a bag, 3 red and 2 green, what is the probability of pulling out a green apple?' is well understood, and the answer is clearly $2/5$. Conditional probabilities, where extra information is added, can be just as intuitive. What if there are two bags of apples -- bag A, with 3 red and 2 green, and bag B, with 2 red and 3 green? With no knowledge of which bag we have, there is a $50/50$ chance of pulling out a red or a green apple since, in total, 50% of the apples are green and 50% are red. With some extra information, e.g. given that we have bag A, we have a $2/5$ chance of picking green. The notation for this is as follows,

$$\text{p}(\text{green}) = 1/2$$ $$\text{p}(\text{green} \mid A) = 2/5.$$

Here the dividing line between green and $A$ (you read $\text{p}(\text{green} \mid A)$ as 'the probability of green given $A$') indicates that we are calculating the probability of pulling out a green apple under the assumption that we have bag $A$.
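These counting probabilities can be checked with a short enumeration in Python, using the apple counts from the example above:

```python
from fractions import Fraction

# Enumerate the apples in the example: bag A has 3 red and 2 green,
# bag B has 2 red and 3 green.
apples = [("A", "red")] * 3 + [("A", "green")] * 2 \
       + [("B", "red")] * 2 + [("B", "green")] * 3

# p(green): fraction of green apples over all apples
p_green = Fraction(sum(colour == "green" for _, colour in apples), len(apples))

# p(green | A): restrict the sample space to bag A before counting
bag_a = [colour for bag, colour in apples if bag == "A"]
p_green_given_a = Fraction(sum(colour == "green" for colour in bag_a), len(bag_a))

print(p_green)          # 1/2
print(p_green_given_a)  # 2/5
```

Conditioning is just counting within the restricted sample space: the denominator shrinks from all 10 apples to the 5 in bag A.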

Frequentist inference

This concept becomes a little harder for continuous variables. Let's instead think of a thermometer -- we use thermometers to measure temperature. However, thermometers never measure temperature directly. Instead they measure some other quantity, such as volume or resistance, from which we infer temperature. There will be some error in the measurement of temperature, which we can assume is random about the true value. Let $x$ be temperature, $y$ be our measure of resistance in an electric thermometer, $F(\cdot)$ be the function that converts temperature to resistance and $\epsilon$ be the error. We can then express this as a mathematical model

$$y = F(x) + \epsilon.$$

This means that the measurement of resistance is some function of the true temperature plus a bit of error. If the relationship $F(\cdot)$ is linear between resistance and temperature we can say $F(x) = hx$, where $h$ is the gradient.

As we have some error involved we won't know the true temperature, so instead we need to infer something about it. This means that we have to look at the probability of the temperature given a measurement of resistance, i.e. $\text{p}(x \mid y)$. We said earlier that the error is random about the true value. This sort of random error can be described by a normal distribution: if we repeated the measurement many many times, the true temperature would be our most likely outcome, with a decreasing, symmetric probability of readings larger or smaller than the true value (this comes from something called the Central Limit Theorem). Therefore we can say the probability of a particular measurement of resistance given a temperature is described by the statistical model,

$$\text{p}(y \mid x) = \frac{1}{{\sigma \sqrt {2\pi } }} \exp \left(- \frac{(y - hx)^2}{2\sigma^2} \right),$$

where $\sigma$ is the standard deviation of the readings at a given temperature if we were to repeat the measurement many many times (i.e. the error in the resistance measurement). This is simply the normal distribution.
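As a quick sketch, this likelihood can be evaluated directly; the gradient $h = 0.5$ and error $\sigma = 0.1$ below are made-up numbers, not from the text:

```python
import math

def likelihood(y, x, h=0.5, sigma=0.1):
    """Normal density of a resistance reading y about the prediction h*x."""
    return math.exp(-(y - h * x) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The density is largest when the reading matches the prediction (y = h*x)
# and falls off symmetrically either side of it.
peak = likelihood(10.0, 20.0)   # reading exactly matches h*x = 10
off = likelihood(10.2, 20.0)    # reading 0.2 away from the prediction
print(peak, off)
```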

In the above case we are talking about the probability of a measured resistance given a temperature, but we are often more interested in the temperature given a measured resistance. Conceptually (and mathematically) these things are very similar, and we call the expression above the likelihood of $x$, or in our example the likelihood of a temperature given a resistance reading. As these two things are conceptually the same, we can say $\mathcal{L}(x) = \text{p}(y \mid x)$. This is what is known as frequentist statistics: the statistical model describes the physical process well, and we know through repeating the process many many times that it holds true.

Unfortunately with only one measurement we can't do much better than say that our best estimate is the temperature that our resistance measurement gives, i.e. $x = y/h$. The hope is that with more measurements we can improve our estimate.

Let's say that we make $m$ measurements of temperature using lots of different thermometers. We can collect these measurements in the vector $\mathbf{y}$. All of these thermometers will have different errors, i.e. their values of $\sigma$ differ. We can collect their errors in a covariance matrix $\boldsymbol\Sigma$. If all these thermometers are independent, then the variances $\sigma^2$ of the different thermometers make up the diagonal of this matrix. The gradients relating resistance to temperature for each thermometer are collected in the vector $\mathbf{h}$, and we can describe this probability distribution using a multivariate normal, i.e.

$$\text{p}(\mathbf{y} \mid x)=\frac{1}{\sqrt{(2\pi)^m \lvert \boldsymbol\Sigma \rvert}} \exp\left(-\frac{1}{2}(\mathbf{y}-\mathbf{h}x)^T{\boldsymbol\Sigma}^{-1}(\mathbf{y}-\mathbf{h}x)\right)$$

where $^T$ denotes the matrix transpose and ${\boldsymbol\Sigma}^{-1}$ the matrix inverse. Now we want to find the best estimate of $x$ using these various measurements, also known as the maximum likelihood estimate. It is often easier to work with the log of probability distributions and to drop the constants that don't depend on $x$ or $y$. This gives

$$-2\ln\text{p}(\mathbf{y} \mid x) \propto (\mathbf{y}-\mathbf{h}x)^T{\boldsymbol\Sigma}^{-1}(\mathbf{y}-\mathbf{h}x).$$

To find the maximum likelihood estimate (the most likely value) of $x$, which we denote $\hat{x}$, we can use the above equation and complete the square, giving the closed-form expression

$$\hat{x} = \left(\mathbf{h}^T \boldsymbol\Sigma^{-1} \mathbf{h}\right)^{-1} \mathbf{h}^T \boldsymbol\Sigma^{-1} \mathbf{y},$$

and the uncertainty in this estimate, from which a confidence interval can be constructed, is summarised by its variance

$$\hat{\sigma}_x^2 = \left(\mathbf{h}^T \boldsymbol\Sigma^{-1} \mathbf{h}\right)^{-1}.$$
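The two closed-form expressions above can be sketched in numpy; the gradients, error variances and true temperature below are made-up illustrative numbers:

```python
import numpy as np

# Made-up numbers: three thermometers with gradients h (resistance per
# degree) and independent error variances on the diagonal of Sigma.
h = np.array([0.5, 0.45, 0.55])
Sigma = np.diag([0.1**2, 0.2**2, 0.05**2])
x_true = 20.0  # temperature we pretend is the truth

# Simulate one noisy reading from each thermometer: y = h*x + eps
rng = np.random.default_rng(0)
y = h * x_true + rng.multivariate_normal(np.zeros(3), Sigma)

# x_hat = (h^T Sigma^-1 h)^-1 h^T Sigma^-1 y, and its variance
Sigma_inv = np.linalg.inv(Sigma)
precision = h @ Sigma_inv @ h
x_hat = (h @ Sigma_inv @ y) / precision
var_x = 1.0 / precision

print(x_hat, var_x)
```

Note how the most precise thermometer (smallest variance) dominates the weighted average, and how the variance of the estimate shrinks as thermometers are added.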

This approach holds when we have a robust statistical model that describes the physical system. That is, we can justify our statistical model on the understanding of how our thermometer works in reality if we repeat the measurement many many times. Unfortunately in the geosciences we often don't have this luxury. The statistical models aren't robust in the same way: e.g. we don't have a solid physical understanding of the error characteristics of the relationship between the emissions of carbon dioxide from a power plant in Italy and a measurement of carbon dioxide in the atmosphere made in Germany. Needless to say, it is unlikely that these errors truthfully fit a simple common statistical model. For these occasions we need Bayesian statistics.

Bayesian statistics

Bayesian statistics recognises that extra information is often needed to complement the measurements when making inference. Take the example at the end of the previous section of estimating carbon dioxide emissions from measurements. Our statistical model based on measurements alone is probably not good enough to robustly estimate the emissions from a power plant $x$ from the measurements $y$, so we need to guide the inference using our general understanding of emissions or any auxiliary information that may be available (e.g. inventory reports). This auxiliary information, or knowledge prior to inference, is called the prior probability $\text{p}(x)$. Note here that it is *the* prior probability and not *a* prior probability. That is, the prior probability is a distribution and not a single value (unless we know for certain that $\text{p}(x)$ takes one, and only one, value). Using Bayes's theorem, we can use this prior probability to inform our inference,

$$\text{p}(\mathbf{x} \mid \mathbf{y}) = \frac{\text{p}(\mathbf{y} \mid \mathbf{x})\text{p}(\mathbf{x})}{\text{p}(\mathbf{y})},$$

where we have generalised this to a vector of desired quantities $\mathbf{x}$; $\text{p}(\mathbf{y} \mid \mathbf{x})$ is the statistical model describing the probability of the measurements given the desired quantities, referred to as the likelihood (as in the previous section); and $\text{p}(\mathbf{y})$ is simply a normalisation that ensures the resulting probability distribution integrates to 1.

Does this intuitively make sense? Think about our thermometer again, and suppose we want to know the temperature outside. Let's say that our thermometer is broken, i.e. the error variance is infinite, and the weather forecast tells us that the temperature is $20 \pm 2^\circ$C, which we use as our prior knowledge. Bayes's theorem then says that our prior knowledge best informs us about the temperature outside: there is a 68% probability that the temperature is between 18-22$^\circ$C, with the most probable temperature being 20$^\circ$C. Likewise, if we have absolutely no idea what the temperature is outside, but our thermometer is working fine, then the likelihood, or measurements, dominates and we believe the most probable temperature given by the thermometer.

There is a philosophical difference from the last section here. Taking the example of the broken thermometer, we said that the temperature is between 18-22$^\circ$C, with the most probable temperature being 20$^\circ$C. That is, *we believe* that the most probable temperature is 20$^\circ$C based on the information we've been given, not that the physical workings of the world dictate that this must be the temperature outside. To this end, Bayesian statistics is often concerned with subjective probabilities: Bayesian probability concerns itself with a measure of belief, whereas the previous section was concerned not with what we believe, but only with what is observed.

Now for making inference using Bayesian statistics. We usually don't have to worry about the normalisation constant and we can just say that

$$\text{p}(\mathbf{x} \mid \mathbf{y}) \propto \text{p}(\mathbf{y} \mid \mathbf{x})\text{p}(\mathbf{x}).$$

If both probability distributions on the right-hand side of the proportionality (the likelihood and prior probability) are normal, then much like before it is often easier to work with log-probabilities,

$$-2 \log ( \text{p}(\mathbf{x} \mid \mathbf{y})) \propto (\mathbf{y - Hx})^T \boldsymbol\Sigma_\epsilon^{-1} (\mathbf{y - Hx}) + (\mathbf{x- x}_a)^T \boldsymbol\Sigma_a^{-1} (\mathbf{x- x}_a),$$

where here $\mathbf{H}$ is the linear mapping between $\mathbf{y}$ and $\mathbf{x}$ ($\mathbf{y} = \mathbf{Hx} + \boldsymbol\epsilon$), $\boldsymbol\Sigma_\epsilon$ is the known model-measurement uncertainty (analogous to $\boldsymbol\Sigma$ in the previous example), $\mathbf{x}_a$ is the mean estimate of $\mathbf{x}$ before we do the inversion (the a priori estimate of $\mathbf{x}$) and $\boldsymbol\Sigma_a$ is the covariance matrix describing the uncertainty in $\mathbf{x}_a$. We can get an analytical solution for the most probable value of $\mathbf{x}$,

$$\mathbf{\hat{x}} = \left(\mathbf{H}^T \boldsymbol\Sigma_\epsilon^{-1} \mathbf{H} + \boldsymbol\Sigma_a^{-1} \right)^{-1} \left( \mathbf{H}^T \boldsymbol\Sigma_\epsilon^{-1} \mathbf{y} + \boldsymbol\Sigma_a^{-1} \mathbf{x}_a \right),$$

and the uncertainty in this estimate is,

$$\hat{\boldsymbol\Sigma}_x = \left(\mathbf{H}^T \boldsymbol\Sigma_\epsilon^{-1} \mathbf{H} + \boldsymbol\Sigma_a^{-1} \right)^{-1}.$$

Note here that the posterior distribution, that is our estimate of $\mathbf{x}$, is also a normal distribution with mean $\mathbf{\hat{x}}$ and covariance $\hat{\boldsymbol\Sigma}_x$. This is a special case (known as conjugacy), but its analytical solution makes estimation very efficient.
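A minimal numpy sketch of these two posterior expressions, with hypothetical numbers for $\mathbf{H}$, the covariances and the prior:

```python
import numpy as np

H = np.array([[0.5], [0.45]])            # linear mapping y = H x + eps
Sigma_eps = np.diag([0.1**2, 0.2**2])    # model-measurement uncertainty
x_a = np.array([18.0])                   # a priori estimate of x
Sigma_a = np.array([[2.0**2]])           # prior uncertainty in x_a

y = np.array([10.1, 8.9])                # observed data (made up)

Se_inv = np.linalg.inv(Sigma_eps)
Sa_inv = np.linalg.inv(Sigma_a)

# Posterior covariance and mean from the closed-form expressions above
Sigma_post = np.linalg.inv(H.T @ Se_inv @ H + Sa_inv)
x_hat = Sigma_post @ (H.T @ Se_inv @ y + Sa_inv @ x_a)

print(x_hat, Sigma_post)
```

The posterior mean sits between the prior ($18$) and what the measurements alone imply (about $20$), weighted by their respective precisions, and the posterior variance is smaller than the prior variance.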

Regularisation under the Bayesian umbrella

The use of 'Bayesian inference' is widespread in the atmospheric sciences and inverse theory. Often, however, what is being done isn't Bayesian at all but a regularisation, i.e. guiding values towards some optimum using sensible constraints. I won't dwell on this -- many would likely disagree -- but it can be seen in many inverse procedures. One telltale sign is an arbitrary choice of likelihood and prior distributions, perhaps made for computational convenience, or to guide some solver to a 'better' value. If the distributions are chosen arbitrarily then any posterior probabilities will likely also be incorrect, and caution should be taken in interpreting what these 'probabilities' and uncertainties actually mean.

Hierarchical modelling

The proportionality in Bayes's theorem above infers the desired quantity $\mathbf{x}$ using the known data $\mathbf{y}$, with all errors known before any inference takes place. It might not always be the case that the full distribution of the errors is known -- often we know, or impose, a statistical distribution on these errors, but do not know its parameters. Errors in the geosciences are mostly considered to be additive noise, and so the imposed likelihood model is assumed to be a normal distribution, random about zero. It may be the case that even though we know, or assume, the errors fit a particular distribution, we do not know how large they are. This is an example of a hierarchical problem -- the estimate of $\mathbf{x}$ relies on the estimation of a different variable further down the hierarchy. This estimation can be built into Bayes's theorem as before by conditioning the desired variables on the variables further down the hierarchy. Let's assume that the errors in $\boldsymbol\Sigma_\epsilon$ are independent and identically distributed, that is the variance $\sigma_\epsilon^2$ is the same for all elements of $\mathbf{y}$, such that $\boldsymbol\Sigma_\epsilon = \sigma_\epsilon^2\mathbf{I}$, where $\mathbf{I}$ is the identity matrix. The value of $\sigma_\epsilon$ is unknown, and so must be built into a hierarchical structure

$$\text{p}(\mathbf{x}, \sigma_\epsilon \mid \mathbf{y}) \propto \text{p}(\mathbf{y} \mid \mathbf{x}, \sigma_\epsilon)\text{p}(\mathbf{x} \mid \sigma_\epsilon) \text{p}(\sigma_\epsilon).$$

In this equation the comma inside the probability just means 'and', so for example $\text{p}(\mathbf{y} \mid \mathbf{x}, \sigma_\epsilon)$ reads 'the probability of $\mathbf{y}$ given $\mathbf{x}$ and $\sigma_\epsilon$'. This hierarchy can be extended to include as many members as needed. Notation often lumps them all together into a common variable $\theta$, and these parameters lower down the hierarchy are often referred to as hyperparameters. When hyperparameters are introduced into the probabilistic hierarchy, $\mathbf{x}$ (and the hyperparameters) can no longer be solved for in a single closed-form equation, as was the case with a normally distributed likelihood and prior probability with known errors. In fact, analytical solutions are rare in the geosciences for probabilistic inference beyond linear models and normal distributions.
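As a rough illustration of why such hierarchies need numerical methods, below is a toy random-walk Metropolis sampler (one common approach, not prescribed by this page) for $\text{p}(x, \sigma_\epsilon \mid \mathbf{y})$ with a scalar $x$; all numbers and priors are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y = h*x + iid Normal noise with unknown sigma
h = np.array([0.5, 0.45, 0.55])
x_true, sigma_true = 20.0, 0.2
y = h * x_true + rng.normal(0.0, sigma_true, size=3)

x_a, sigma_a = 18.0, 2.0  # Normal prior on x: mean and standard deviation

def log_post(x, log_sigma):
    """Unnormalised log posterior of (x, log sigma); flat prior on log sigma."""
    sigma = np.exp(log_sigma)
    log_like = -0.5 * np.sum((y - h * x) ** 2) / sigma**2 - len(y) * log_sigma
    log_prior = -0.5 * (x - x_a) ** 2 / sigma_a**2
    return log_like + log_prior

# Random-walk Metropolis over (x, log sigma): propose a jump, accept it
# with probability min(1, posterior ratio), otherwise keep the old state.
samples = []
x, ls = x_a, 0.0
lp = log_post(x, ls)
for _ in range(5000):
    x_new, ls_new = x + rng.normal(0, 0.2), ls + rng.normal(0, 0.3)
    lp_new = log_post(x_new, ls_new)
    if np.log(rng.random()) < lp_new - lp:
        x, ls, lp = x_new, ls_new, lp_new
    samples.append((x, np.exp(ls)))

xs = np.array([s[0] for s in samples[1000:]])  # discard burn-in
print(xs.mean(), xs.std())
```

The sampler returns draws from the joint posterior, so the uncertainty in $\sigma_\epsilon$ is automatically propagated into the spread of the samples of $x$; no closed-form expression for the posterior is needed.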