Update cross_validation.rst.
mbaudin47 committed Jun 29, 2023
1 parent 8b7a66f commit 5a80637
Showing 2 changed files with 76 additions and 40 deletions.
2 changes: 1 addition & 1 deletion python/doc/bibliography.rst
@@ -323,7 +323,7 @@ Bibliography
discrete random variates*. Journal of Computational and Applied Mathematics,
vol. 31, no. 1, pp. 181-189, 1990.
`pdf <https://openturns.github.io/openturns/papers/stadlober1990.pdf>`__
.. [stone1974] Stone, M. (1974). *Cross‐validatory choice and assessment of statistical predictions.*
Journal of the Royal Statistical Society: Series B (Methodological), 36 (2), 111-133.
.. [stoer1993] Stoer, J., Bulirsch, R. *Introduction to Numerical
Analysis*, Second Edition, Springer-Verlag, 1993.
114 changes: 75 additions & 39 deletions python/doc/theory/meta_modeling/cross_validation.rst
@@ -114,16 +114,14 @@ where :math:`\hat{\sigma}^2(Y)` is the sample variance of the random output:
.. math::
\hat{\sigma}^2(Y)
= \frac{1}{n - 1} \sum_{j = 1}^n \left( y^{(j)} - \bar{y} \right)^2
where :math:`\bar{y}` is the sample mean of the output:

.. math::
\bar{y} = \frac{1}{n} \sum_{j = 1}^n y^{(j)}.
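
As a quick illustration, the two estimators above can be computed in a few lines of numpy.
This is a minimal sketch with a hypothetical output sample ``y_sample``; it is not OpenTURNS code.

.. code-block:: python

    # Sample mean and unbiased sample variance as defined above;
    # ``y_sample`` is a hypothetical vector of output observations.
    import numpy as np

    y_sample = np.array([1.2, 0.7, 1.9, 1.1, 0.4])
    n = y_sample.size
    y_bar = y_sample.sum() / n                            # sample mean
    sigma2 = ((y_sample - y_bar) ** 2).sum() / (n - 1)    # sample variance
    # Equivalent calls: np.mean(y_sample) and np.var(y_sample, ddof=1)
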
Naive and fast cross-validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -144,7 +142,7 @@ also known as jackknife in statistics.
Let :math:`\metamodel^{(-j)}` be the surrogate model estimated from the
leave-one-out experimental design :math:`\set{D} \setminus \{\vect{x}^{(j)}\}`.
This is the experimental design where the :math:`j`-th observation
:math:`\vect{x}^{(j)}` is set aside.
The corresponding set of observation indices is:

.. math::
@@ -192,21 +190,21 @@ shown in the next section.
Fast leave-one-out
~~~~~~~~~~~~~~~~~~

In this section, we present the fast leave-one-out error estimator
of a linear least squares regression model.
In the special case of a linear regression model, [stone1974]_ (eq. 3.13 page 121)
showed that the leave-one-out residuals have an expression which depends on the diagonal
of the projection matrix.
In this case, the evaluation of the leave-one-out mean squared error involves the
multiplication of the raw residuals by a correction which involves the leverages
of the model.
This method makes it possible to evaluate the mean squared error directly, without
estimating the coefficients of :math:`n` different leave-one-out regression models.
It is then much faster than the naive leave-one-out method.
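
The following numpy sketch illustrates this speed-up on a toy linear model: the naive
estimator refits :math:`n` models, while the fast estimator corrects the raw residuals by
the diagonal of the projection ("hat") matrix introduced below.
The arrays ``D`` and ``y`` are hypothetical; this is an illustration, not the library
implementation.

.. code-block:: python

    # Naive versus fast leave-one-out for a linear least squares model.
    # Toy data: the design matrix ``D`` and the outputs ``y`` are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 3
    D = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
    y = D @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

    # Naive LOO: fit n models, each with one observation set aside.
    naive_sq_residuals = np.empty(n)
    for j in range(n):
        mask = np.arange(n) != j
        a_minus_j, *_ = np.linalg.lstsq(D[mask], y[mask], rcond=None)
        naive_sq_residuals[j] = (y[j] - D[j] @ a_minus_j) ** 2
    mse_loo_naive = naive_sq_residuals.mean()

    # Fast LOO: one fit, then correct the raw residuals by 1 - h_jj.
    a_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    H = D @ np.linalg.solve(D.T @ D, D.T)   # projection ("hat") matrix
    h = np.diag(H)                          # leverages
    mse_loo_fast = np.mean(((y - D @ a_hat) / (1.0 - h)) ** 2)

    assert np.isclose(mse_loo_naive, mse_loo_fast)
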

Let :math:`m \in \Nset` be an integer representing the number of
parameters in the model.
Assume that the physical model is linear:

.. math::
@@ -246,14 +244,15 @@ Consider the linear surrogate model:
\metamodel(\vect{x}) = \boldsymbol{D} \hat{\vect{a}}
for any :math:`\vect{x} \in \set{X}` where :math:`\hat{\vect{a}}` is the
estimate from linear least squares.
We substitute the estimator in the previous equation and
get the value of the surrogate linear regression model:

.. math::
\metamodel(\vect{x})
= \boldsymbol{D} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}^T \vect{y}
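
As a sanity check of this formula, the following numpy sketch (with a hypothetical design
matrix ``D`` and output vector ``y``) verifies that the least squares prediction is the
orthogonal projection of the observations, which is the role played by the hat matrix
introduced next.

.. code-block:: python

    # Sketch: the least squares prediction is a projection of ``y``.
    # ``D`` and ``y`` are hypothetical; this is not the library implementation.
    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 15, 4
    D = rng.normal(size=(n, m))
    y = rng.normal(size=n)

    a_hat = np.linalg.solve(D.T @ D, D.T @ y)     # least squares estimate
    H = D @ np.linalg.solve(D.T @ D, D.T)         # projection ("hat") matrix
    y_hat = D @ a_hat

    assert np.allclose(y_hat, H @ y)                       # H puts the "hat" on y
    assert np.allclose(H, H @ H) and np.allclose(H, H.T)   # orthogonal projector
    assert np.isclose(np.trace(H), m)                      # trace equals number of parameters
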
Let :math:`\boldsymbol{H}` be the projection ("hat") matrix ([wang2012]_ eq. 16.8 page 472):

@@ -282,8 +281,8 @@ also known as the *leverage*.
In other words, the residual error of the LOO surrogate model is equal to the
residual of the full surrogate model corrected by :math:`1 - h_{jj}`.
This avoids actually building the LOO surrogate model.
We substitute the previous expression in the definition of the leave-one-out
mean squared error estimator and get the fast leave-one-out cross-validation
error ([wang2012]_ eq. 16.25 page 487):

.. math::
@@ -331,7 +330,7 @@ The corresponding set of indices:

.. math::
\set{S}_1 \; \dot{\cup} \; \cdots \; \dot{\cup} \; \set{S}_k
= \{1, ..., n\}
and the corresponding set of input observations is:
@@ -361,7 +360,7 @@ The approximation error is estimated on the set-aside sample :math:`\set{D}_{\ell}`

.. math::
\widehat{\operatorname{MSE}}^{(\ell)}
= \dfrac{1}{n_\ell} \sum_{j \in \set{S}_\ell}
\left[ \physicalmodel\left(\vect{x}^{(j)}\right) - \metamodel^{(-\set{D}_{\ell})} \left(\vect{x}^{(j)}\right) \right]^2.
where :math:`n_\ell` is the number of observations in
@@ -391,15 +390,49 @@ sample mean ([deisenroth2020]_ eq. 8.13 page 264):
The K-Fold error estimate can be obtained
with a single split of the data :math:`\set{D}` into :math:`k` folds.
One can repeat the cross-validation multiple
times using different divisions into folds to obtain a better Monte
Carlo estimate.
This obviously comes at an additional computational cost.
The *leave-one-out* (LOO) cross-validation is a special case of
K-Fold cross-validation where the number of folds :math:`k` is
equal to :math:`n`, the sample size of the experimental design
:math:`\set{D}`.
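
As an illustration, the following numpy sketch (hypothetical code, not the library
implementation) generates repeated random divisions into :math:`k` folds; the helper
``random_folds`` is only one possible way to do this.
With :math:`k = n`, every fold reduces to a single observation, which is the
leave-one-out case.

.. code-block:: python

    # Sketch: random division of n observations into k folds, repeated to
    # average the K-Fold estimate over several divisions (hypothetical code).
    import numpy as np

    def random_folds(n, k, rng):
        """Shuffle the indices {0, ..., n-1} and split them into k folds."""
        indices = rng.permutation(n)
        return np.array_split(indices, k)

    rng = np.random.default_rng(2)
    n, k, n_repetitions = 12, 3, 10
    for _ in range(n_repetitions):
        folds = random_folds(n, k, rng)
        # ... estimate the K-Fold mean squared error with this division,
        #     then average the estimates over the repetitions ...

    # Leave-one-out is the special case k = n: every fold is a single index.
    loo_folds = random_folds(n, n, rng)
    assert all(fold.size == 1 for fold in loo_folds)
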

If the folds have equal sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assume that the number of folds divides the sample size.
Mathematically, this means that :math:`k` divides :math:`n`.
In this special case, each fold has the same number of observations:

.. math::
n_\ell = \frac{n}{k}
for :math:`\ell = 1, ..., k`.
This implies that the K-Fold mean squared error has a particularly simple expression:

.. math::
\widehat{\operatorname{MSE}}_{KFold}
= \frac{1}{n} \sum_{\ell = 1}^k \sum_{j \in \mathcal{S}_\ell}
\left[y^{(j)} - \hat{g}^{(-j)} \left(\boldsymbol{x}^{(j)}\right)\right]^2.
Since the subsets :math:`\mathcal{S}_1, ..., \mathcal{S}_k` are, by hypothesis, a disjoint
partition of the full set of indices, all observations are involved in the
previous equation with the same weight :math:`\frac{1}{n}`.
This implies that the previous equation is the sample mean of the K-Fold squared residuals:

.. math::
\widehat{\operatorname{MSE}}_{KFold}
= \frac{1}{n} \sum_{j = 1}^n
\left[y^{(j)} - \hat{g}^{(-j)} \left(\boldsymbol{x}^{(j)}\right)\right]^2.
If :math:`k` does not divide :math:`n`, the previous equation does not
hold: the squared residuals in different folds do not
necessarily have the same weight in the estimator of the mean squared error.
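
The following numpy sketch checks this identity on hypothetical squared residuals: when
:math:`k` divides :math:`n`, the sample mean of the per-fold mean squared errors equals
the pooled mean over all observations.

.. code-block:: python

    # Sketch: when k divides n, averaging the per-fold MSEs is the same as
    # averaging all n squared residuals with weight 1/n (hypothetical data).
    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 12, 3
    squared_residuals = rng.uniform(size=n)     # one value per observation
    folds = np.array_split(np.arange(n), k)     # equal folds since k divides n

    per_fold_mse = [squared_residuals[fold].mean() for fold in folds]
    kfold_mse = np.mean(per_fold_mse)           # mean over folds
    pooled_mse = squared_residuals.mean()       # pooled mean over observations

    assert np.isclose(kfold_mse, pooled_mse)    # identical when folds have equal size
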

Fast K-Fold of a linear model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -410,41 +443,40 @@ Just as the LOO error involves the division by :math:`1 - h_{jj}`,
the K-Fold error involves the resolution of a linear system of equations.
The equations, introduced by [shao1993]_, are presented in [suzuki2020]_ (proposition 14, page 71).

For any :math:`\ell \in \{1, ..., k\}`, let :math:`\boldsymbol{D}_{\ell} \in \Rset^{n_\ell \times m}`
be the rows of the design matrix :math:`\boldsymbol{D}` corresponding to the indices of the observations
involved in the :math:`\ell`-th fold:

.. math::
\boldsymbol{D}_{\ell}
=\begin{pmatrix}
d_{j_1, 1} & \ldots & d_{j_1, m} \\
\vdots & & \vdots \\
d_{j_{n_\ell}, 1} & \ldots & d_{j_{n_\ell}, m}
\end{pmatrix}
where :math:`j_1, ..., j_{n_\ell} \in \mathcal{S}_\ell` are the
indices of the observations involved in the :math:`\ell`-th fold.
For any :math:`\ell \in \{1, ..., k\}`,
let :math:`\boldsymbol{H}_{\ell} \in \Rset^{n_{\ell} \times n_{\ell}}` be the sub-matrix of
the hat matrix corresponding to the indices of the observations in the
:math:`\ell`-th fold:

.. math::
\boldsymbol{H}_{\ell}
= \boldsymbol{D}_{\ell} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}_{\ell}^T
It is not necessary to evaluate the previous expression in order to obtain
the corresponding hat matrix.
Indeed, the matrix :math:`\boldsymbol{H}_{\ell}` can be computed by extracting the corresponding
rows and columns from the full hat matrix :math:`\boldsymbol{H}`:

.. math::
\boldsymbol{H}_{\ell}
=
\begin{pmatrix}
h_{j_1, j_1} & \ldots & h_{j_1, j_{n_\ell}} \\
\vdots & & \vdots \\
@@ -456,29 +488,33 @@ corrected K-Fold residuals:

.. math::
(\boldsymbol{I}_{\ell} - \boldsymbol{H}_{\ell}) \hat{\boldsymbol{r}}_{\ell}
= \boldsymbol{y}_{\ell} - \hat{\boldsymbol{y}}_{\ell}
where :math:`\boldsymbol{I}_{\ell} \in \Rset^{n_\ell \times n_\ell}` is the identity matrix,
:math:`\boldsymbol{y}_{\ell} \in \Rset^{n_\ell}` is the vector of output observations in the
:math:`\ell`-th fold:

.. math::
\boldsymbol{y}_{\ell}
= \left(y^{(j)}\right)^T_{j \in \set{S}_{\ell}},
and :math:`\hat{\boldsymbol{y}}_{\ell} \in \Rset^{n_\ell}` is the corresponding
vector of output predictions from the linear least squares surrogate model:

.. math::
\hat{\boldsymbol{y}}_{\ell}
= \left(\metamodel\left(\vect{x}^{(j)}\right)\right)^T_{j \in \set{S}_{\ell}}.
Then the mean squared error of the :math:`\ell`-th fold is:

.. math::
\widehat{\operatorname{MSE}}^{(\ell)}
= \frac{1}{n_{\ell}} \sum_{j = 1}^{n_{\ell}}
\left(\hat{\boldsymbol{r}}_{\ell}\right)_j^2.
Finally, the K-Fold mean squared error is evaluated from equation :eq:`kfoldMean`.
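
The following numpy sketch illustrates these formulas on a toy linear model: the corrected
residuals of each fold are obtained by solving the linear system above with the sub-matrix
of the hat matrix, and they match the residuals obtained by actually refitting the model
without the fold. The data and variable names are hypothetical, and the per-fold errors
are aggregated here by their sample mean.

.. code-block:: python

    # Sketch of the fast K-Fold formulas above for a linear least squares
    # model, checked against the naive per-fold refit (hypothetical data).
    import numpy as np

    rng = np.random.default_rng(4)
    n, m, k = 24, 3, 4
    D = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
    y = D @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=n)
    folds = np.array_split(rng.permutation(n), k)

    # One global fit gives the raw residuals and the full hat matrix H.
    a_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    raw_residuals = y - D @ a_hat
    H = D @ np.linalg.solve(D.T @ D, D.T)

    fast_mse, naive_mse = np.empty(k), np.empty(k)
    for ell, fold in enumerate(folds):
        # Fast: solve (I_l - H_l) r_l = y_l - y_hat_l, with H_l extracted from H.
        H_ell = H[np.ix_(fold, fold)]
        I_ell = np.eye(fold.size)
        r_ell = np.linalg.solve(I_ell - H_ell, raw_residuals[fold])
        fast_mse[ell] = np.mean(r_ell ** 2)

        # Naive: refit without the fold and predict the set-aside observations.
        mask = np.ones(n, dtype=bool)
        mask[fold] = False
        a_minus, *_ = np.linalg.lstsq(D[mask], y[mask], rcond=None)
        naive_mse[ell] = np.mean((y[fold] - D[fold] @ a_minus) ** 2)

    assert np.allclose(fast_mse, naive_mse)
    kfold_mse = np.mean(fast_mse)   # aggregate the per-fold errors
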

