---
title: "Statistical Methods 1"
output:
  html_document:
    toc: true
    toc_depth: 4
    css: style.css
    theme: united
    highlight: tango
---
<style>
body .main-container {
max-width: 1800px;
}
</style>
<script>
$(document).ready(function() {
  $items = $('div#TOC li');
  $items.each(function(idx) {
    num_ul = $(this).parentsUntil('#TOC').length;
    $(this).css({'text-indent': num_ul * 10, 'padding-left': 0});
  });
});
</script>
<!--- For HTML Only --->
`r if (!knitr:::is_latex_output()) '
$\\DeclareMathOperator*{\\argmin}{arg\\,min}$
$\\DeclareMathOperator*{\\argmax}{arg\\,max}$
$\\newcommand{\\var}{\\mathrm{Var}}$
'`
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{amsthm}
\usepackage[colorinlistoftodos]{todonotes}
\usepackage{listings}
\usepackage{color}
\usepackage[T1]{fontenc}
\usepackage{fullpage}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{bm}
\usepackage{mathtools}
\usepackage{algorithm2e}
\newcommand{\defeq}{\overset{\text{def}}{:=}}
\newcommand{\lop}{\mathcal{L}}
\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}
\newcommand{\norm}[1]{||#1||}
\newcommand{\vx}{\vect{x}}
\newcommand{\vb}{\vect{b}}
\newcommand{\vy}{\vect{y}}
\newcommand{\vz}{\vect{z}}
\newcommand{\ve}{\vect{e}}
\newcommand{\yhat}{\widehat{y}}
\newcommand{\xhat}{\widehat{x}}
\newcommand{\vc}{\vect{c}}
\newcommand{\vr}{\vect{r}}
\newcommand{\vphi}{\vect{\phi}}
\newcommand{\vf}{\vect{f}}
\newcommand{\vY}{\vect{Y}}
\newcommand{\vX}{\vect{X}}
\newcommand{\vw}{\vect{w}}
\newcommand{\thetahatis}{\widehat{\theta}^{(s)}_i}
\newcommand{\thetahat}[1]{\widehat{\theta}^{(#1)}_i}
\newcommand{\vm}{\vect{m}}
\newcommand{\redmath}[1]{\mathbin{\textcolor{red}{\vect{#1}}}}
\newcommand{\redtext}[1]{\textcolor{red}{\vect{#1}}}
\newcommand{\vzero}{\vect{0}}
\newcommand{\vt}{\vect{t}}
\newcommand{\linearpredictor}{\vx_i^T\vbeta}
\newcommand{\vmu}{\vect{\mu}}
\newcommand{\vnu}{\vect{\nu}}
\newcommand{\Var}{\text{Var}}
\newcommand{\veta}{\vect{\eta}}
\newcommand{\vbeta}{\vect{\beta}}
\newcommand{\vepsilon}{\vect{\epsilon}}
\newcommand{\Pbb}{\mathbb{P}}
\newcommand{\Ebb}{\mathbb{E}}
\newcommand{\vdelta}{\vect{\delta}}
\newcommand{\Rbb}{\mathbb{R}}
\newcommand{\vxi}{\vect{\xi}}
\newcommand{\vu}{\vect{u}}
\newcommand{\vW}{\vect{W}}
\newcommand{\vlambda}{\vect{\lambda}}
\newcommand{\dataset}{\mathcal{D}}
\newcommand{\vS}{\vect{S}}
\newcommand{\sample}{\vz^{(l)}}
\newcommand{\MYhref}[3][blue]{\href{#2}{\color{#1}{#3}}}
\newcommand{\sol}[1]{\vx^{(#1)}}
\newcommand{\qtext}[1]{\quad\quad \text{#1}}
\newcommand{\vtheta}{\vect{\theta}}
\newcommand{\bi}[1]{\textbf{\textit{#1}}}
\newcommand{\iid}{\overset{\text{i.i.d.}}{\sim}}
\newcommand{\uniform}{\mathcal{U}(0, 1)}
\newcommand{\qimplies}{\quad\Longrightarrow\quad}
\newcommand{\tp}{\tilde{p}}
\newcommand{\nul}{\Theta^{(0)}}
\newcommand{\alter}{\Theta^{(1)}}
\newcommand{\const}{\mathcal{Z}}
\newcommand{\tq}{\tilde{q}}
\newcommand{\vxhat}{\widehat{\vx}}
\newcommand{\tvx}{\widetilde{\vx}}
\newcommand{\tr}{\tilde{r}}
\newcommand{\like}{\mathcal{L}}
\newcommand{\kl}[2]{\text{KL}(#1\,\,||\,\,#2)}
\newcommand{\logit}[1]{\log\left(\frac{#1}{1-#1}\right)}
\newcommand{\elbo}[1]{\text{elbo}(#1)}
\newcommand{\eval}{\biggr\rvert}
\newcommand{\normal}{\mathcal{N}}
\newcommand{\class}{\mathcal{C}}
\newcommand{\infor}{\mathcal{I}}
\newcommand{\variance}{\text{Var}}
\newcommand{\vSigma}{\vect{\Sigma}}
\newcommand{\lp}[3]{
\begin{equation*}
\begin{alignat}{2}
&\!\min &\qquad& #1\\
&\text{s.t.} & & #2\\
& & & #3
\end{alignat}
\end{equation*}
}
\newcommand{\lpmax}[3]{
\begin{equation*}
\begin{alignat}{2}
&\!\max &\qquad& #1\\
&\text{s.t.} & & #2\\
& & & #3
\end{alignat}
\end{equation*}
}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{js, echo=FALSE}
function myFunction() {
  var x = document.getElementById("myDIV");
  if (x.style.display === "none") {
    x.style.display = "block";
  } else {
    x.style.display = "none";
  }
}
```
## Course Description
Repository containing summary of notes for the [Statistical Methods 1](https://www.bris.ac.uk/unit-programme-catalogue/UnitDetails.jsa;jsessionid=7912981ABC434B62AA99EB75652E7E8A?ayrCode=19%2F20&unitCode=MATHM0041) course in the [Computational Statistics and Data Science (COMPASS)](http://www.bris.ac.uk/maths/postgraduate/compass/) Center for Doctoral Training (CDT) PhD at the [University of Bristol](https://www.bristol.ac.uk).
## Regression Problems
### Problem Statement and Assumptions
Given a **training** dataset
$$D_0:= \left\{(\vx_i, y_i)\right\}_{i=1}^n$$
of $n$ pairs of inputs $$\vx_i\ = (x_{1i}, \ldots, x_{di})\in\Rbb^{d\times 1}$$ and outputs $y_i\in \Rbb$ we want to **predict** the output $\yhat$ for a new input $\vxhat$ by _inferring_ the **parameters** $\vw\in \Rbb^{(d+1)\times 1}$ of a **prediction function** $f(\vx, \vw)$.
### Measuring Performance and Model Selection
We want $f(\vx, \vw)$ to perform well on **unseen testing data** $D_1$. In other words, we want our model to **generalize** well. We define the **error** on a dataset $D'$ given the inferred parameters $\vw^*$ as
$$E(D', \vw^*):=\sum_{i\in D'} \left(y_i - f(\vx_i, \vw^*)\right)^2$$
This gives two **performance metrics**:
* _Training Error_ : $E(D_0, \vw^*)$
* _Testing Error_ : $E(D_1, \vw^*)$
We could split our dataset into a training set $D_0$ and a testing set $D_1$; however, data is often scarce, so we want to use as much of it as possible. To fully exploit our dataset we can use **Cross Validation**:
* Split the dataset into $s$ groups $D_1, \ldots, D_s$ called **folds**. We then call it **s-fold cross validation**. In the extreme case in which $s=n$, i.e. each group consists of one data point, we call it **leave-one-out** cross validation.
* Use $s-1$ groups to infer $\vw^*$. Test performance of $f(\vx, \vw^*)$ on the remaining unseen group, and calculate $E(D_i, \vw^{*^{(i)}})$. Repeat this until all groups have been used for testing exactly once.
* At the end, calculate the mean test error $$E_{\text{cv}} = \frac{1}{s}\sum_{i=1}^s E(D_i, \vw^{*^{(i)}})$$
Cross validation can be used to choose between **competing models**. Thus, it is often used for **model selection** when a model has one or more **hyperparameters** that need to be estimated; a minimal sketch is given at the end of this subsection. The main problems with cross validation are:
* The dataset needs to be IID, but this is often not the case, for instance with time-dependent data.
* The computational cost of the cross validation algorithm increases by a factor of $s$ for each hyperparameter.
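As a concrete illustration, here is a minimal base-R sketch of $s$-fold cross validation for choosing the degree of a polynomial regression; the simulated data, the helper name `cv_error`, and the candidate degrees are illustrative choices rather than anything prescribed by the course.
```{r}
# Minimal s-fold cross validation for selecting a polynomial degree (illustrative).
set.seed(1)
n <- 100
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)              # toy training data

cv_error <- function(degree, s = 5) {
  folds <- sample(rep(1:s, length.out = n))   # random assignment of points to folds
  errs <- sapply(1:s, function(k) {
    train_df <- data.frame(x = x[folds != k], y = y[folds != k])
    test_df  <- data.frame(x = x[folds == k], y = y[folds == k])
    fit  <- lm(y ~ poly(x, degree, raw = TRUE), data = train_df)
    pred <- predict(fit, newdata = test_df)
    sum((test_df$y - pred)^2)                 # E(D_k, w*^(k)) on the held-out fold
  })
  mean(errs)                                  # E_cv
}

sapply(1:6, cv_error)                         # compare competing model complexities
```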
### Non-Bayesian Inference Methods
We adopt the following notation: each column of the matrix $X$ represents a single observation of all the explanatory variable features, and each row represents all observations of a single feature.
$$X =
\begin{pmatrix}
\vx_1 & \vx_2 & \cdots & \vx_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{d1} & x_{d2} & \cdots & x_{dn} \\
1 & 1 & \cdots & 1
\end{pmatrix}
\in \Rbb^{(d+1)\times n}
$$
The **response vector** is a _row_ vector containing all $n$ observations of the target variable.
$$\vy = (y_1, \ldots, y_n)\in \Rbb^{1\times n}$$
We call the function
$$\phi:\Rbb^d\to\Rbb^b$$
a **feature transform**; it will be used to make inference more flexible. Sometimes we will write $\phi(X)$, which is short-hand notation for the following matrix
$$
\phi(X) =
\begin{pmatrix}
\phi(\vx_1) & \phi(\vx_2) & \cdots & \phi(\vx_n) \\
1 & 1 & \cdots & 1
\end{pmatrix}
\in \Rbb^{(b+1)\times n}
$$
There are two main non-Bayesian inference methods that we look at:
* **Linear Least Squares**: Finds parameters $\vw_{\text{LS}}\in\Rbb^{(b+1)\times 1}$ that minimize the _sum of squares residuals_
$$\vw_{\text{LS}}:= \argmin_{\vw} \sum_{i\in D_0} \left(y_i - f(\phi(\vx_i), \vw)\right)^2 = \left(\phi(X)\phi(X)^\top\right)^{-1}\phi(X)\vy^\top$$
where $f(\phi(\vx_i), \vw)$ is the _dot product_ between the $i^{\text{th}}$ column of $\phi(X)$ and $\vw$. It is important to understand that it's called _linear_ because it is linear in the _unknown parameters_ $\vw$, but can be _non-linear_ in the inputs if we apply a *non-linear* feature transform $\phi$.
* **Maximum Likelihood Estimation**: Finds parameters $\vw_{\text{ML}}$ describing a joint probability density function for $\vy$ that has _maximum density_ values at our data points
$$
\vw_{\text{ML}} = \argmax_{\vw} \,\,\,p(\vy \mid \phi(\vx_1), \ldots, \phi(\vx_n), \vw, \sigma)
$$
Assuming $y_i$ are **IID** and **normally distributed**, Maximum likelihood and Least Squares lead to the same solution $\vw_{\text{LS}} = \vw_{\text{ML}}$. The advantage of Maximum Likelihood over Least Squares is that by finding $\sigma^2_{\text{ML}}$ we obtain a **measure of uncertainty** on our prediction.
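To make the conventions above concrete, the following is a minimal base-R sketch of the closed-form least squares solution, using a simple polynomial feature transform $\phi(x) = (x, x^2)$ on simulated data; the variable names (`phiX`, `w_ls`, and so on) are illustrative.
```{r}
# Closed-form least squares with the column-per-observation convention used above.
set.seed(2)
n <- 50
x <- runif(n, 0, 1)
y <- matrix(2 * x - 1 + rnorm(n, sd = 0.1), nrow = 1)  # 1 x n response row vector

# phi(X): one column per observation, feature transform (x, x^2) plus a row of ones
phiX <- rbind(x, x^2, 1)                               # (b + 1) x n with b = 2

w_ls   <- solve(phiX %*% t(phiX), phiX %*% t(y))       # (phi(X) phi(X)^T)^{-1} phi(X) y^T
fitted <- t(w_ls) %*% phiX                             # f(phi(x_i), w_LS) for every i
sum((y - fitted)^2)                                    # training error E(D_0, w_LS)
```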
### Issues with Non-Bayesian Inference Methods
* They provide **point estimates** for our predictions, but we want a **predictive distribution** instead.
* Using the feature transform $\phi$ in the model leads both inference methods to possible **overfitting**, i.e. learning the _noise_ rather than the _structure_ in the data, when the **complexity** of the model, governed by $b$, increases. Cross validation can be used to select the output dimensionality $b$ of $\phi$ to combat overfitting, but the dataset is often not IID. A better choice is **regularized least squares**:
$$
\vw_{\text{LS-R}}:= \argmin_{\vw} \sum_{i\in D_0} \left(y_i - f(\phi(\vx_i), \vw)\right)^2 + \lambda || \vw ||_2^2 = \left(\phi(X)\phi(X)^\top + \lambda I \right)^{-1} \phi(X)\vy^\top
$$
Regularization combats overfitting because overfitting is often accompanied by a large increase in the magnitude of the parameters, and the penalty term discourages this. To choose $\lambda$ we can use **cross-validation** or a **probabilistic perspective**, as we'll see later. Notice that the penalty above is the $L^2$ norm, but it can be any $p$-norm. A sketch of the closed-form solution is given after this list.
* They incur the **curse of dimensionality**: as we increase the complexity $b$, we include more features, whose number can grow exponentially even just in the case of $\phi$ being a polynomial transform. This requires a dataset size that is **exponential** in $b$.
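Continuing the sketch above (and reusing its illustrative `phiX`, `y` and `w_ls`), regularized least squares only adds $\lambda I$ before inverting; the value of $\lambda$ below is an arbitrary illustrative choice.
```{r}
# Regularized (ridge) least squares in closed form; lambda is the regularization strength.
ridge_fit <- function(phiX, y, lambda) {
  p <- nrow(phiX)
  solve(phiX %*% t(phiX) + lambda * diag(p), phiX %*% t(y))
}

w_ridge <- ridge_fit(phiX, y, lambda = 0.1)

# Regularization shrinks the magnitude of the parameters, which is how it combats overfitting:
c(unregularized = sqrt(sum(w_ls^2)), regularized = sqrt(sum(w_ridge^2)))
```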
### Semi and Fully Bayesian Inference Methods
Here we assume $y_i$ are IID and normally distributed. Moreover, we assign a _normal prior_ to the parameters.
* **Maximum-A-Posteriori** (MAP): Finds parameter $\vw_{\text{MAP}}$ that maximizes the **posterior parameter distribution** given our data $D$:
$$
\vw_{\text{MAP}} := \argmax_{\vw} p(\vw\mid D) = \argmax_{\vw} \left[\prod_{i\in D} N_{y_i}\left(f(\phi(\vx_i), \vw), \sigma^2 \right)\right]N_{\vw}(\vect{0}, \sigma^2_{\vw}I) = \vw_{\text{LS-R}}\left(\frac{\sigma^2}{\sigma^2_{\vw}}\right)
$$
* **Fully Bayesian**: Finds a **predictive distribution** for $\yhat$ given the new $\vxhat$
$$
p(\yhat\mid \vxhat, D) = \int p(\yhat\mid \vxhat, \vw)p(\vw\mid D)d\vw = N_{\yhat}\left(f\left(\phi(\vxhat), \vw_{\text{LS-R}}\left(\frac{\sigma^2}{\sigma^2_{\vw}}\right)\right), \sigma^2 + \phi(\vxhat)^\top \sigma^2\left(\phi(X)\phi(X)^\top + \frac{\sigma^2}{\sigma^2_\vw} I \right)^{-1}\phi(\vxhat)\right)
$$
which is a normal distribution whose mean is the prediction given by regularized least squares.
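The predictive mean and variance above can be computed directly. Below is a minimal sketch that reuses the illustrative `phiX` and `y` from the least squares sketch and assumes the noise variance $\sigma^2$ and the prior variance $\sigma^2_{\vw}$ are known; the chosen values and the new input are arbitrary.
```{r}
# Predictive distribution of the fully Bayesian linear model at a new input x_hat.
sigma2   <- 0.1^2                                       # observation noise variance (assumed known)
sigma2_w <- 1                                           # prior variance of the weights (assumed known)

lam <- sigma2 / sigma2_w
S   <- solve(phiX %*% t(phiX) + lam * diag(nrow(phiX))) # shared matrix in mean and variance
w_r <- S %*% phiX %*% t(y)                              # = w_LS-R(sigma^2 / sigma_w^2)

phi_new   <- c(0.5, 0.5^2, 1)                           # phi(x_hat) for x_hat = 0.5
pred_mean <- as.numeric(t(phi_new) %*% w_r)
pred_var  <- sigma2 + sigma2 * as.numeric(t(phi_new) %*% S %*% phi_new)
c(mean = pred_mean, variance = pred_var)
```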
## Classification
### Problem Statement and Optimal Bayes Classifier
Now we consider $y\in \left\{+1, -1\right\}$. We are looking for a **decision function** $f(\vx)$ whose level set $f(\vx) = 0$, called the **decision boundary**, separates feature space into two regions:
* $R_{+}$, where $f(\vx)\geq 0$
* $R_{-}$, where $f(\vx) < 0$
The **optimal Bayes classifier**
$$f(\vx) = p(\vx, y=+1) - p(\vx, y= -1)$$ minimizes the probability of making a mistake given the decision function $f$, that is, it minimizes
$$p(\text{mistake} \mid f) = \int_{R_{+}} p(\vx, y=-1)d\vx + \int_{R_{-}} p(\vx, y=+1)d\vx$$
However, this classifier is **impractical** because in reality
* We don't have access to $p(\vx, y)$
* Different mistakes bear different losses.
### Expected Loss Minimization
Therefore, we introduce a **loss function** $L(y, y_0)$ which is described by a **loss matrix** whose entries give the loss of predicting $y_0$ when the actual value is $y$. Our goal then becomes to **minimize expected loss**
$$\yhat = \argmin_{y_0} \Ebb_{p(y\mid \vx, D)}\left[L(y, y_0) \mid \vx\right]$$
where $p(y \mid \vx, D)$ can be found either with the fully Bayesian approach, or by first finding $\vw_{\text{ML}}$ or $\vw_{\text{MAP}}$ and then plugging them in to obtain the distribution.
Notice that in general there are two strategies to find $p(y \mid \vx, D)$:
* **Discriminative Approach**: Infer $p(y \mid \vx, D)$ directly using $D$. This method is best used when our only goal is **prediction**.
* **Generative Approach**: Infer
$$
p(y \mid \vx, D) \propto p(\vx \mid y, D) p(y)
$$
using Bayes rule, where $p(y)$ is simply the proportion of $+$ and $-$ in $D$. This gives us a way of _generating_ new data points, but at an extra cost because $p(\vx \mid y , D)$ can be very high-dimensional.
Once we have $p(y \mid \vx, D)$ we can start making decisions. Sometimes, we might be uncertain about our decision. In such cases, we can reject a data point $\vx$ when both probabilities are lower than a **threshold** $\theta$
$$\max\left\{ p(y = +1\mid \vx), \, p(y=-1\mid \vx)\right\} < \theta$$
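As a small illustration of expected loss minimization with a reject option, the sketch below picks the class with the smaller expected loss under an asymmetric loss matrix; the loss values, the threshold $\theta$, and the function name `decide` are illustrative choices.
```{r}
# Decision rule minimizing expected loss for a binary classifier, with a reject option.
decide <- function(p_plus,
                   L = matrix(c(0, 5,    # rows: actual y in (+1, -1); columns: predicted y_0
                                1, 0), 2, 2, byrow = TRUE),
                   theta = 0.6) {
  p <- c(p_plus, 1 - p_plus)               # p(y = +1 | x), p(y = -1 | x)
  if (max(p) < theta) return("reject")     # too uncertain: reject the data point
  expected_loss <- as.numeric(t(p) %*% L)  # expected loss of predicting +1 and -1
  c("+1", "-1")[which.min(expected_loss)]
}

sapply(c(0.9, 0.55, 0.2), decide)
```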
### Decision Making in Regression Tasks
When the target $y$ is continuous, the optimal decision depends on the chosen loss:
$$\yhat \approx
\begin{cases}
\Ebb_{p(y\mid \vx, D)}\left[y \mid \vx\right] & \text{if } L=(y-y_0)^2 \\
\text{Predictive median of } p(y \mid \vx, D) & \text{if } L=|y-y_0|
\end{cases}$$
That is, under a squared loss our prediction $\yhat$ is the predictive mean, which is the same prediction found with the fully Bayesian approach or the MAP approach; under an absolute loss it is the predictive median.
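The claim can be checked empirically: for any predictive distribution, the value minimizing the average squared loss is (approximately) the mean, and the value minimizing the average absolute loss is the median. The skewed distribution below is only an illustrative stand-in for $p(y \mid \vx, D)$.
```{r}
# Empirical check: the mean minimizes expected squared loss, the median absolute loss.
set.seed(3)
y_samples <- rgamma(1e4, shape = 2, rate = 1)   # samples from some skewed p(y | x, D)
grid <- seq(0, 6, by = 0.01)
sq_loss  <- sapply(grid, function(y0) mean((y_samples - y0)^2))
abs_loss <- sapply(grid, function(y0) mean(abs(y_samples - y0)))
c(argmin_squared  = grid[which.min(sq_loss)],  mean   = mean(y_samples),
  argmin_absolute = grid[which.min(abs_loss)], median = median(y_samples))
```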
## Probability Review: Gaussian Identities
### Multivariate Gaussian Definition
We say that a random variable $X\sim N_x(\mu, \sigma^2)$ has a **univariate** Gaussian distribution if
$$p(x) = N_x(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
We say that a random vector $\vX = (X_1, \ldots, X_d)^\top$ has a **multivariate** Gaussian distribution if
$$p(\vx) = N_{\vx}(\vmu, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\vx-\vmu)^\top \Sigma^{-1} (\vx - \vmu)\right)$$
where $\vmu\in\Rbb^{d}$ and $\Sigma\in \Rbb^{d\times d}$ and the variance-covariance matrix needs to be **positive definite** $\Sigma \succ 0$. Issues with MVN:
* The number of parameters grows **quadratically** with the dimensionality $d$.
* It is **unimodal**.
A very useful identity, used to derive the formulae below, concerns the _kernel_ (the exponent) of the MVN:
$$-\frac{1}{2}(\vx - \vmu)^\top \Sigma^{-1}(\vx - \vmu) = -\frac{1}{2}\vx^\top \Sigma^{-1}\vx + \vx^\top \Sigma^{-1} \vmu + \text{const}$$
### Geometry of the Multivariate Normal Distribution
**Eigendecomposing** $\Sigma$ and using the change of variable $\vy = U^\top (\vx - \vmu)$, corresponding to centering the observations and rotating them using the eigenvector matrix $U$, we find that _every MVN under this linear coordinate transform is a product of $d$ univariate normals_:
$$p(\vy) = \prod_{i=1}^d N_{y_i}(0, \lambda_i) \qquad \text{where } \lambda_i \text{ are the eigenvalues of } \Sigma$$
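A quick numerical sanity check of this fact, with an arbitrary $2\times 2$ covariance matrix: after rotating the centred samples into the eigenbasis of $\Sigma$, the empirical covariance is (approximately) diagonal with the eigenvalues on the diagonal.
```{r}
# Rotating into the eigenbasis of Sigma decorrelates the components.
set.seed(4)
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), 2, 2)
mu <- c(1, -1)
X  <- t(mu + t(chol(Sigma)) %*% matrix(rnorm(2 * 5000), 2))  # 5000 samples from N(mu, Sigma)

U <- eigen(Sigma)$vectors
Y <- t(t(U) %*% (t(X) - mu))          # y = U^T (x - mu) for every sample
round(cov(Y), 2)                      # approximately diag(lambda_1, lambda_2)
eigen(Sigma)$values
```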
### Conditional Gaussian Distributions
If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. That is if we partition
$$
\vx =
\begin{pmatrix}
\vx_a \\
\vx_b
\end{pmatrix}
\qquad \vmu =
\begin{pmatrix}
\vmu_a \\
\vmu_b
\end{pmatrix}
\qquad
\Sigma = \begin{pmatrix}
\Sigma_{aa} & \Sigma_{ab}\\
\Sigma_{ba} & \Sigma_{bb}
\end{pmatrix}
\qquad
\Lambda = \Sigma^{-1} = \begin{pmatrix}
\Lambda_{aa} & \Lambda_{ab} \\
\Lambda_{ba} & \Lambda_{bb}
\end{pmatrix}
\qquad \text{where } \Lambda \text{ is the $\textbf{precision}$ matrix}
$$
Then we have the mean and covariance matrix of $p(\vx_a\mid\vx_b)$ are given by
$$
\begin{align}
\vmu_{a\mid b} &= \vmu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\vx_b - \vmu_b) = \vmu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\vx_b - \vmu_b) \qquad \textbf{linear} \text{ function of } \vx_b\\
\Sigma_{a\mid b} &= \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Lambda_{aa}^{-1} \qquad \textbf{independent} \text{ of } \vx_b
\end{align}
$$
The general form $p(\vy\mid \vx)$ with mean being a linear function of $\vx$ and covariance independent of $\vx$ is called a **linear Gaussian model**.
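The conditional formulas translate directly into code. Below is a minimal sketch on an arbitrary $3$-dimensional Gaussian, conditioning the first two components on the third; all numbers are illustrative.
```{r}
# Conditional mean and covariance of a partitioned Gaussian.
mu    <- c(0, 1, 2)
Sigma <- matrix(c(2.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.5), 3, 3)
a <- 1:2                                # indices of x_a
b <- 3                                  # indices of x_b
x_b <- 2.5                              # observed value of x_b

S_ab <- Sigma[a, b, drop = FALSE]
S_bb <- Sigma[b, b, drop = FALSE]
mu_cond    <- mu[a] + S_ab %*% solve(S_bb) %*% (x_b - mu[b])                    # linear in x_b
Sigma_cond <- Sigma[a, a] - S_ab %*% solve(S_bb) %*% Sigma[b, a, drop = FALSE]  # independent of x_b
mu_cond
Sigma_cond
```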
### Marginal Distribution
The marginal of a MVN is again a Gaussian:
$$
p(\vx_a) = \int p(\vx_a, \vx_b) d\vx_b = N(\vmu_a, \Sigma_{aa})
$$
### Gaussian Linear Model
Suppose we have a
* **Prior**: $p(\vx) =N_{\vx}(\vmu, \Lambda^{-1})$
* **Likelihood**: $p(\vy\mid \vx) = N_{\vy}(A\vx + \vb, L^{-1})$
Then we can find:
* **Marginal**: $p(\vy) = N_{\vy}(A\vmu + \vb, L^{-1} + A\Lambda^{-1}A^\top)$
* **Posterior**: $p(\vx\mid \vy) = N_{\vx}(\Sigma[A^\top L(\vy - \vb) + \Lambda \vmu], \Sigma)$ where $\Sigma=(\Lambda + A^\top L A)^{-1}$ (a numerical sketch follows)
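A minimal numerical sketch of these two results, with arbitrary illustrative dimensions ($\vx$ two-dimensional, $\vy$ three-dimensional) and arbitrary values for $A$, $\vb$, $L$ and $\Lambda$:
```{r}
# Marginal of y and posterior of x in the linear Gaussian model above.
Lambda <- diag(2)                        # prior precision of x
mu     <- c(0, 0)                        # prior mean of x
A      <- matrix(c(1, 2,
                   0, 1,
                   1, 1), 3, 2, byrow = TRUE)
b      <- c(0.5, -0.5, 0)
L      <- diag(3)                        # likelihood precision of y
y_obs  <- c(1, 0, 2)                     # an observed y

marg_mean <- A %*% mu + b                            # mean of p(y)
marg_cov  <- solve(L) + A %*% solve(Lambda) %*% t(A) # covariance of p(y)

S         <- solve(Lambda + t(A) %*% L %*% A)                    # posterior covariance
post_mean <- S %*% (t(A) %*% L %*% (y_obs - b) + Lambda %*% mu)  # posterior mean
```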
### Maximum Likelihood Estimation of Multivariate Gaussian
Suppose we have $N$ IID multivariate Gaussian observations $\left\{\vx_1, \ldots, \vx_N\right\}$, where $\vx_i\sim N(\vmu, \Sigma)$. Then
$$
\begin{align}
\vmu_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N \vx_i \\
\Sigma_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N (\vx_i - \vmu_{\text{ML}})(\vx_i - \vmu_{\text{ML}})^\top
\end{align}
$$
Notice that while the maximum likelihood estimate of the mean is **unbiased**, $\Ebb[\vmu_{\text{ML}}] = \vmu$, the maximum likelihood estimate of the covariance matrix is **biased**
$$\Ebb\left[\Sigma_{\text{ML}}\right] = \frac{N - 1}{N}\Sigma$$
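A short simulation, with an arbitrary true mean and covariance, showing the two estimates and the usual bias correction (multiplying $\Sigma_{\text{ML}}$ by $N/(N-1)$, which is what `cov` computes):
```{r}
# ML estimates of the mean and covariance of IID multivariate Gaussian samples.
set.seed(5)
N          <- 200
mu_true    <- c(0, 1)
Sigma_true <- matrix(c(1.0, 0.3,
                       0.3, 0.5), 2, 2)
X <- t(mu_true + t(chol(Sigma_true)) %*% matrix(rnorm(2 * N), 2))  # N x 2 samples

mu_ml    <- colMeans(X)
centered <- sweep(X, 2, mu_ml)
Sigma_ml <- t(centered) %*% centered / N    # biased: E[Sigma_ml] = (N - 1)/N * Sigma
Sigma_ml * N / (N - 1)                      # bias-corrected estimate, identical to cov(X)
```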
## Bias-Variance Decomposition
The **training error** $E(D, \vw_{\text{LS}})$ is evaluated only on the specific dataset $D$. Writing $y_i = g(\vx_i) + \epsilon_i$, where $g$ is the true regression function and $\epsilon_i$ is the noise, we call **in-sample error** the expected training error over all possible datasets
$$\Ebb_D\left[E(D, \vw_{\text{LS}})\right] = \sum_{i=1}^n \Ebb_D\left[(y_i - f(\vx_i; \vw_{\text{LS}}))^2\mid\vx_i\right]$$
where $\Ebb_D\left[(y_i - f(\vx_i; \vw_{\text{LS}}))^2\mid\vx_i\right]$ is called the _expected error_ and can be decomposed as follows
$$
\Ebb_D\left[(y_i - f(\vx_i; \vw_{\text{LS}}))^2\mid\vx_i\right] = \underbrace{\Var_{\epsilon}[\epsilon_i]}_{\text{irreducible error}} + \underbrace{(g(\vx_i) - \Ebb_{\epsilon}[f(\vx_i; \vw_{\text{LS}})])^2}_{\text{bias}^2} + \underbrace{\Var_{\epsilon}[f(\vx_i; \vw_{\text{LS}})]}_{\text{variance}}
$$
This cannot be calculated because we don't know the distribution of $\epsilon$ and we don't know $g(\vx_i)$. Thus we actually use the **out-sample error**:
$$
\Ebb_{\vx}\left[\Ebb_{\epsilon}\left[(y - f(\vx; \vw_{\text{LS}}))^2\right]\right] = \Ebb_{p(y, \vx)}\left[(y - f(\vx; \vw_{\text{LS}}))^2\right]
$$
If we have a _new batch of data_ $D':=\left\{(y_i', \vx_i')\right\}_{i=1}^{n'}$ we see that the out-sample error can be approximated by the **testing error** thanks to the **Law of Large Numbers**:
$$\frac{1}{n'}\sum_{i=1}^{n'} (y_i' - f(\vx_i'; \vw_{\text{LS}}))^2 \quad \overset{\text{LLN}}{\longrightarrow} \quad \Ebb_{p(y, \vx)}\left[(y - f(\vx; \vw_{\text{LS}}))^2\right]$$
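A Monte Carlo sketch of this decomposition: refit a polynomial of a given degree on many simulated datasets and look at the squared bias and the variance of the prediction at a fixed input $x_0$. The true function $g$, the noise level, and the degrees compared are illustrative choices; low degrees show more bias, high degrees more variance.
```{r}
# Monte Carlo illustration of the bias-variance trade-off at a fixed input x0.
set.seed(6)
g      <- function(x) sin(2 * pi * x)   # true regression function (known here by construction)
x0     <- 0.3
sd_eps <- 0.3
n      <- 30

preds_at_x0 <- function(degree, reps = 500) {
  replicate(reps, {
    x <- runif(n)
    y <- g(x) + rnorm(n, sd = sd_eps)
    fit <- lm(y ~ poly(x, degree))
    as.numeric(predict(fit, newdata = data.frame(x = x0)))
  })
}

for (d in c(1, 3, 9)) {
  p <- preds_at_x0(d)
  cat(sprintf("degree %d: bias^2 = %.4f, variance = %.4f\n",
              d, (g(x0) - mean(p))^2, var(p)))
}
```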