Update cross_validation.rst.
mbaudin47 committed Jun 29, 2023
1 parent 8b7a66f commit 5a80637
Showing 2 changed files with 76 additions and 40 deletions.
2 changes: 1 addition & 1 deletion python/doc/bibliography.rst
@@ -323,7 +323,7 @@ Bibliography
discrete random variates*. Journal of Computational and Applied Mathematics,
vol. 31, no. 1, pp. 181-189, 1990.
`pdf <https://openturns.github.io/openturns/papers/stadlober1990.pdf>`__
.. [stone1974] Stone, M. (1974). *Cross‐validatory choice and assessment of statistical predictions.*
Journal of the Royal Statistical Society: Series B (Methodological), 36 (2), 111-133.
.. [stoer1993] Stoer, J., Bulirsch, R. *Introduction to Numerical
Analysis*, Second Edition, Springer-Verlag, 1993.
114 changes: 75 additions & 39 deletions python/doc/theory/meta_modeling/cross_validation.rst
@@ -114,16 +114,14 @@ where :math:`\hat{\sigma}^2(Y)` is the sample variance of the random output:
.. math::
\hat{\sigma}^2(Y)
= \frac{1}{n - 1} \sum_{j = 1}^n \left( y^{(j)} - \bar{y} \right)^2
where :math:`\bar{y}` is the sample mean of the output:

.. math::
\bar{y} = \frac{1}{n} \sum_{j = 1}^n y^{(j)}.
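
As a quick illustration, the two estimators above can be computed in a few lines of numpy.
This is a minimal sketch with a hypothetical output sample ``y_sample``; it is not OpenTURNS code.

.. code-block:: python

    # Sample mean and unbiased sample variance as defined above;
    # ``y_sample`` is a hypothetical vector of output observations.
    import numpy as np

    y_sample = np.array([1.2, 0.7, 1.9, 1.1, 0.4])
    n = y_sample.size
    y_bar = y_sample.sum() / n                            # sample mean
    sigma2 = ((y_sample - y_bar) ** 2).sum() / (n - 1)    # sample variance
    # Equivalent calls: np.mean(y_sample) and np.var(y_sample, ddof=1)
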
Naive and fast cross-validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -144,7 +142,7 @@ also known as jackknife in statistics.
Let :math:`\metamodel^{(-j)}` be the surrogate model estimated from the
leave-one-out experimental design :math:`\set{D} \setminus \{\vect{x}^{(j)}\}`.
This is the experimental design where the :math:`j`-th observation
:math:`\vect{x}^{(j)}` is set aside.
The corresponding set of observation indices is:

.. math::
@@ -192,21 +190,21 @@ shown in the next section.
Fast leave-one-out
~~~~~~~~~~~~~~~~~~

In this section, we present the fast leave-one-out error estimator
of a linear least squares regression model.
In the special case of a linear regression model, [stone1974]_ (eq. 3.13 page 121)
showed that the leave-one-out residuals have an expression which depends on the diagonal
of the projection matrix.
In this case, the evaluation of the leave-one-out mean squared error involves the
multiplication of the raw residuals by a correction which involves the leverages
of the model.
This method makes it possible to evaluate the mean squared error directly, without
estimating the coefficients of :math:`n` different leave-one-out regression models.
It is then much faster than the naive leave-one-out method.
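
The following numpy sketch illustrates this speed-up on a toy linear model: the naive
estimator refits :math:`n` models, while the fast estimator corrects the raw residuals by
the diagonal of the projection ("hat") matrix introduced below.
The arrays ``D`` and ``y`` are hypothetical; this is an illustration, not the library
implementation.

.. code-block:: python

    # Naive versus fast leave-one-out for a linear least squares model.
    # Toy data: the design matrix ``D`` and the outputs ``y`` are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 3
    D = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
    y = D @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

    # Naive LOO: fit n models, each with one observation set aside.
    naive_sq_residuals = np.empty(n)
    for j in range(n):
        mask = np.arange(n) != j
        a_minus_j, *_ = np.linalg.lstsq(D[mask], y[mask], rcond=None)
        naive_sq_residuals[j] = (y[j] - D[j] @ a_minus_j) ** 2
    mse_loo_naive = naive_sq_residuals.mean()

    # Fast LOO: one fit, then correct the raw residuals by 1 - h_jj.
    a_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    H = D @ np.linalg.solve(D.T @ D, D.T)   # projection ("hat") matrix
    h = np.diag(H)                          # leverages
    mse_loo_fast = np.mean(((y - D @ a_hat) / (1.0 - h)) ** 2)

    assert np.isclose(mse_loo_naive, mse_loo_fast)
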

Let :math:`m \in \Nset` be an integer representing the number of
parameters in the model.
Assume that the physical model is linear:

.. math::
@@ -246,14 +244,15 @@ Consider the linear surrogate model:
\metamodel(\vect{x}) = \boldsymbol{D} \hat{\vect{a}}
for any :math:`\vect{x} \in \set{X}` where :math:`\hat{\vect{a}}` is the
estimate from linear least squares.
We substitute the estimator in the previous equation and
get the value of the surrogate linear regression model:

.. math::
\metamodel(\vect{x})
= \boldsymbol{D} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}^T \vect{y}
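
As a sanity check of this formula, the following numpy sketch (with a hypothetical design
matrix ``D`` and output vector ``y``) verifies that the least squares prediction is the
orthogonal projection of the observations, which is the role played by the hat matrix
introduced next.

.. code-block:: python

    # Sketch: the least squares prediction is a projection of ``y``.
    # ``D`` and ``y`` are hypothetical; this is not the library implementation.
    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 15, 4
    D = rng.normal(size=(n, m))
    y = rng.normal(size=n)

    a_hat = np.linalg.solve(D.T @ D, D.T @ y)     # least squares estimate
    H = D @ np.linalg.solve(D.T @ D, D.T)         # projection ("hat") matrix
    y_hat = D @ a_hat

    assert np.allclose(y_hat, H @ y)                       # H puts the "hat" on y
    assert np.allclose(H, H @ H) and np.allclose(H, H.T)   # orthogonal projector
    assert np.isclose(np.trace(H), m)                      # trace equals number of parameters
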
Let :math:`\boldsymbol{H}` be the projection ("hat") matrix ([wang2012]_ eq. 16.8 page 472):

@@ -282,8 +281,8 @@ also known as the *leverage*.
In other words, the residual error of the LOO surrogate model is equal to the
residual of the full surrogate model corrected by :math:`1 - h_{jj}`.
This avoids actually building the LOO surrogate model.
We substitute the previous expression in the definition of the leave-one-out
mean squared error estimator and get the fast leave-one-out cross-validation
error ([wang2012]_ eq. 16.25 page 487):

.. math::
@@ -331,7 +330,7 @@ The corresponding set of indices:

.. math::
\set{S}_1 \; \dot{\cup} \; \cdots \; \dot{\cup} \; \set{S}_k
= \{1, ..., n\}
and the corresponding set of input observations is:
@@ -361,7 +360,7 @@ The approximation error is estimated on the set-aside sample :math:`\set{D}_{\ell}`

.. math::
\widehat{\operatorname{MSE}}^{(\ell)}
= \dfrac{1}{n_\ell} \sum_{j \in \set{S}_\ell}
\left[ \physicalmodel\left(\vect{x}^{(j)}\right) - \metamodel^{(-\set{D}_{\ell})} \left(\vect{x}^{(j)}\right) \right]^2.
where :math:`n_\ell` is the number of observations in
@@ -391,15 +390,49 @@ sample mean ([deisenroth2020]_ eq. 8.13 page 264):
The K-Fold error estimate can be obtained
with a single split of the data :math:`\set{D}` into :math:`k` folds.
One can repeat the cross-validation multiple
times using different divisions into folds to obtain a better Monte
Carlo estimate.
This obviously comes at an additional computational cost.
The *leave-one-out* (LOO) cross-validation is a special case of
K-Fold cross-validation where the number of folds :math:`k` is
equal to :math:`n`, the sample size of the experimental design
:math:`\set{D}`.
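
As an illustration, the following numpy sketch (hypothetical code, not the library
implementation) generates repeated random divisions into :math:`k` folds; the helper
``random_folds`` is only one possible way to do this.
With :math:`k = n`, every fold reduces to a single observation, which is the
leave-one-out case.

.. code-block:: python

    # Sketch: random division of n observations into k folds, repeated to
    # average the K-Fold estimate over several divisions (hypothetical code).
    import numpy as np

    def random_folds(n, k, rng):
        """Shuffle the indices {0, ..., n-1} and split them into k folds."""
        indices = rng.permutation(n)
        return np.array_split(indices, k)

    rng = np.random.default_rng(2)
    n, k, n_repetitions = 12, 3, 10
    for _ in range(n_repetitions):
        folds = random_folds(n, k, rng)
        # ... estimate the K-Fold mean squared error with this division,
        #     then average the estimates over the repetitions ...

    # Leave-one-out is the special case k = n: every fold is a single index.
    loo_folds = random_folds(n, n, rng)
    assert all(fold.size == 1 for fold in loo_folds)
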

If the folds have equal sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assume that the number of folds divides the sample size.
Mathematically, this means that :math:`k` divides :math:`n`.
In this special case, each fold has the same number of observations:

.. math::
n_\ell = \frac{n}{k}
for :math:`\ell = 1, ..., k`.
This implies that the K-Fold mean squared error has a particularly simple expression:

.. math::
\widehat{\operatorname{MSE}}_{KFold}
= \frac{1}{n} \sum_{\ell = 1}^k \sum_{j \in \mathcal{S}_\ell}
\left[y^{(j)} - \hat{g}^{(-j)} \left(\boldsymbol{x}^{(j)}\right)\right]^2.
Since the subsets :math:`\mathcal{S}_1, ..., \mathcal{S}_k` are, by hypothesis, a disjoint
partition of the full set of indices, all observations are involved in the
previous equation with the same weight :math:`\frac{1}{n}`.
This implies that the previous equation is the sample mean of the K-Fold squared residuals:

.. math::
\widehat{\operatorname{MSE}}_{KFold}
= \frac{1}{n} \sum_{j = 1}^n
\left[y^{(j)} - \hat{g}^{(-j)} \left(\boldsymbol{x}^{(j)}\right)\right]^2.
If :math:`k` does not divide :math:`n`, the previous equation does not
hold: the squared residuals in different folds do not
necessarily have the same weight in the estimator of the mean squared error.
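
The following numpy sketch checks this identity on hypothetical squared residuals: when
:math:`k` divides :math:`n`, the sample mean of the per-fold mean squared errors equals
the pooled mean over all observations.

.. code-block:: python

    # Sketch: when k divides n, averaging the per-fold MSEs is the same as
    # averaging all n squared residuals with weight 1/n (hypothetical data).
    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 12, 3
    squared_residuals = rng.uniform(size=n)     # one value per observation
    folds = np.array_split(np.arange(n), k)     # equal folds since k divides n

    per_fold_mse = [squared_residuals[fold].mean() for fold in folds]
    kfold_mse = np.mean(per_fold_mse)           # mean over folds
    pooled_mse = squared_residuals.mean()       # pooled mean over observations

    assert np.isclose(kfold_mse, pooled_mse)    # identical when folds have equal size
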

Fast K-Fold of a linear model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -410,41 +443,40 @@ Just as the LOO error involves the division by :math:`1 - h_{jj}`,
the K-Fold error involves the resolution of a linear system of equations.
The equations, introduced by [shao1993]_, are presented in [suzuki2020]_ (proposition 14, page 71).

For any :math:`\ell \in \{1, ..., k\}`, let :math:`\boldsymbol{D}_{\ell} \in \Rset^{n_\ell \times m}`
be the rows of the design matrix :math:`\boldsymbol{D}` corresponding to the indices of the observations
involved in the :math:`\ell`-th fold:

.. math::
\boldsymbol{D}_{\ell}
=\begin{pmatrix}
d_{j_1, 1} & \ldots & d_{j_1, m} \\
\vdots & & \vdots \\
d_{j_{n_\ell}, 1} & \ldots & d_{j_{n_\ell}, m}
\end{pmatrix}
where :math:`j_1, ..., j_{n_\ell} \in \mathcal{S}_\ell` are the
indices of the observations involved in the :math:`\ell`-th fold.
For any :math:`\ell \in \{1, ..., k\}`,
let :math:`\boldsymbol{H}_{\ell} \in \Rset^{n_{\ell} \times n_{\ell}}` be the sub-matrix of
the hat matrix corresponding to the indices of the observations in the
:math:`\ell`-th fold:

.. math::
\boldsymbol{H}_{\ell}
= \boldsymbol{D}_{\ell} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}_{\ell}^T
It is not necessary to evaluate the previous expression in order to obtain
the corresponding hat matrix.
Indeed, the matrix :math:`\boldsymbol{H}_{\ell}` can be computed by extracting the corresponding
rows and columns from the full hat matrix :math:`\boldsymbol{H}`:

.. math::
\boldsymbol{H}_{\ell}
=
\begin{pmatrix}
h_{j_1, j_1} & \ldots & h_{j_1, j_{n_\ell}} \\
\vdots & & \vdots \\
@@ -456,29 +488,33 @@ corrected K-Fold residuals:

.. math::
(\boldsymbol{I}_{\ell} - \boldsymbol{H}_{\ell}) \hat{\boldsymbol{r}}_{\ell}
= \boldsymbol{y}_{\ell} - \hat{\boldsymbol{y}}_{\ell}
where :math:`\boldsymbol{I}_{\ell} \in \Rset^{n_\ell \times n_\ell}` is the identity matrix,
:math:`\boldsymbol{y}_{\ell} \in \Rset^{n_\ell}` is the vector of output observations in the
:math:`\ell`-th fold:

.. math::
\boldsymbol{y}_{\ell}
= \left(y^{(j)}\right)^T_{j \in \set{S}_{\ell}},
and :math:`\hat{\boldsymbol{y}}_{\ell} \in \Rset^{n_\ell}` is the corresponding
vector of output predictions from the linear least squares surrogate model:

.. math::
\hat{\boldsymbol{y}}_{\ell}
= \left(\metamodel\left(\vect{x}^{(j)}\right)\right)^T_{j \in \set{S}_{\ell}}.
Then the mean squared error of the :math:`\ell`-th fold is:

.. math::
\widehat{\operatorname{MSE}}^{(\ell)}
= \frac{1}{n_{\ell}} \sum_{j = 1}^{n_{\ell}}
\left(\hat{\boldsymbol{r}}_{\ell}\right)_j^2.
Finally, the K-Fold mean squared error is evaluated from equation :eq:`kfoldMean`.
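
The following numpy sketch illustrates these formulas on a toy linear model: the corrected
residuals of each fold are obtained by solving the linear system above with the sub-matrix
of the hat matrix, and they match the residuals obtained by actually refitting the model
without the fold. The data and variable names are hypothetical, and the per-fold errors
are aggregated here by their sample mean.

.. code-block:: python

    # Sketch of the fast K-Fold formulas above for a linear least squares
    # model, checked against the naive per-fold refit (hypothetical data).
    import numpy as np

    rng = np.random.default_rng(4)
    n, m, k = 24, 3, 4
    D = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
    y = D @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=n)
    folds = np.array_split(rng.permutation(n), k)

    # One global fit gives the raw residuals and the full hat matrix H.
    a_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    raw_residuals = y - D @ a_hat
    H = D @ np.linalg.solve(D.T @ D, D.T)

    fast_mse, naive_mse = np.empty(k), np.empty(k)
    for ell, fold in enumerate(folds):
        # Fast: solve (I_l - H_l) r_l = y_l - y_hat_l, with H_l extracted from H.
        H_ell = H[np.ix_(fold, fold)]
        I_ell = np.eye(fold.size)
        r_ell = np.linalg.solve(I_ell - H_ell, raw_residuals[fold])
        fast_mse[ell] = np.mean(r_ell ** 2)

        # Naive: refit without the fold and predict the set-aside observations.
        mask = np.ones(n, dtype=bool)
        mask[fold] = False
        a_minus, *_ = np.linalg.lstsq(D[mask], y[mask], rcond=None)
        naive_mse[ell] = np.mean((y[fold] - D[fold] @ a_minus) ** 2)

    assert np.allclose(fast_mse, naive_mse)
    kfold_mse = np.mean(fast_mse)   # aggregate the per-fold errors
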

