diff --git a/python/doc/bibliography.rst b/python/doc/bibliography.rst
index fdddca93d1d..baa0b4daa4c 100644
--- a/python/doc/bibliography.rst
+++ b/python/doc/bibliography.rst
@@ -323,7 +323,7 @@ Bibliography
     discrete random variates*. Journal of Computational and Applied Mathematics,
     vol. 31, no. 1, pp. 181-189, 1990.
     `pdf `__
-.. [stone1974] Stone, M. (1974). *Cross‐validatory choice and assessment of statistical predictions.*
+.. [stone1974] Stone, M. (1974). *Cross‐validatory choice and assessment of statistical predictions.*
     Journal of the royal statistical society: Series B (Methodological), 36 (2), 111-133.
 .. [stoer1993] Stoer, J., Bulirsch, R. *Introduction to Numerical Analysis*, Second Edition, Springer-Verlag, 1993.
diff --git a/python/doc/theory/meta_modeling/cross_validation.rst b/python/doc/theory/meta_modeling/cross_validation.rst
index bf981f4d9cf..03862468bce 100644
--- a/python/doc/theory/meta_modeling/cross_validation.rst
+++ b/python/doc/theory/meta_modeling/cross_validation.rst
@@ -114,7 +114,7 @@ where :math:`\hat{\sigma}^2(Y)` is the sample variance of the random output:

 .. math::

     \hat{\sigma}^2(Y)
-    = \frac{1}{n - 1} \sum_{j = 1}^n \left[ y^{(j)} - \bar{y} \right]^2
+    = \frac{1}{n - 1} \sum_{j = 1}^n \left( y^{(j)} - \bar{y} \right)^2

 where :math:`\bar{y}` is the sample mean of the output:
@@ -122,8 +122,6 @@ where :math:`\bar{y}` is the sample mean of the output:

     \bar{y} = \frac{1}{n} \sum_{j = 1}^n y^{(j)}.

-
-
 Naive and fast cross-validation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -144,7 +142,7 @@ also known as jackknife in statistics.
 Let :math:`\metamodel^{(-j)}` be the surrogate model estimated from the
 leave-one-out experimental design :math:`\set{D} \setminus \{\vect{x}^{(j)}\}`.
 This is the experimental design where the :math:`j`-th observation
-:math:`\vect{x}^{(j)}` is set aside.
+:math:`\vect{x}^{(j)}` is set aside.
 The corresponding set of observation indices is:

 .. math::
@@ -192,21 +190,21 @@ shown in the next section.

 Fast leave-one-out
 ~~~~~~~~~~~~~~~~~~

-In this section, we present the fast leave-one-out error estimator
+In this section, we present the fast leave-one-out error estimator
 of a linear regression least squares model.
-In the special case of of a linear regression model, [stone1974]_ (eq. 3.13 page 121)
-showed that the leave-one-out residuals have an expression which depends on the diagonal
-of the projection matrix.
-In this case, the evaluation of the leave-one-out mean squared error involves the
-multiplication of the raw residuals by a correction which involves the leverages
-of the model.
-This method makes it possible to evaluation directly the mean squared error without
-necessarily estimating the coefficients of $n$ different leave-one-out regression model.
+In the special case of a linear regression model, [stone1974]_ (eq. 3.13 page 121)
+showed that the leave-one-out residuals have an expression which depends on the diagonal
+of the projection matrix.
+In this case, the evaluation of the leave-one-out mean squared error involves the
+multiplication of the raw residuals by a correction which involves the leverages
+of the model.
+This method makes it possible to evaluate the mean squared error directly without
+estimating the coefficients of :math:`n` different leave-one-out regression models.
 It is then much faster than the naive leave-one-out method.

 Let :math:`m \in \Nset` be an integer representing the number of parameters in the model.
-Assume that the physical model is linear :
+Assume that the physical model is linear:

 .. math::

@@ -246,14 +244,15 @@ Consider the linear surrogate model:

     \metamodel(\vect{x}) = \boldsymbol{D} \hat{\vect{a}}

-for any :math:`\vect{x} \in \set{X}` where :math:`\hat{\vect{a}}` is the
-estimate from linear least squares.
-We substitute the estimator in the previous equation and
+for any :math:`\vect{x} \in \set{X}` where :math:`\hat{\vect{a}}` is the
+estimate from linear least squares.
+We substitute the estimator in the previous equation and
 get the value of the surrogate linear regression model:

 .. math::

-    \metamodel(\vect{x}) = \boldsymbol{D} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}^T \vect{y}
+    \metamodel(\vect{x})
+    = \boldsymbol{D} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}^T \vect{y}

 Let :math:`\boldsymbol{H}` be the projection ("hat") matrix ([wang2012]_ eq. 16.8 page 472):
@@ -282,8 +281,8 @@ also known as the *leverage*.
 In other words, the residual error of the LOO surrogate model is equal
 to the residual of the full surrogate model corrected by :math:`1 - h_{jj}`.
 This avoids to actually build the LOO surrogate.
-We substitute the previous expression in the definition of the leave-one-out
-mean squared error estimator and get the fast leave-one-out cross validation
+We substitute the previous expression in the definition of the leave-one-out
+mean squared error estimator and get the fast leave-one-out cross validation
 error ([wang2012]_ eq. 16.25 page 487):

 .. math::
@@ -331,7 +330,7 @@ The corresponding set of indices:

 .. math::

-    \set{S}_1 \; \dot{\cup} \; \cdots \; \dot{\cup} \; \set{S}_k
+    \set{S}_1 \; \dot{\cup} \; \cdots \; \dot{\cup} \; \set{S}_k
     = \{1, ..., n\}

 and the corresponding set of input observations is:
@@ -361,7 +360,7 @@ The approximation error is estimated on the set-aside sample :math:`\set{D}_{\el
 .. math::

     \widehat{\operatorname{MSE}}^{(\ell)}
-    = \dfrac{1}{n_\ell} \sum_{j \in \set{S}_\ell}
+    = \dfrac{1}{n_\ell} \sum_{j \in \set{S}_\ell}
     \left[ \physicalmodel\left(\vect{x}^{(j)}\right) - \metamodel^{(-\set{D}_{\ell})} \left(\vect{x}^{(j)}\right) \right]^2.

 where :math:`n_\ell` is the number of observations in
@@ -391,15 +390,49 @@ sample mean ([deisenroth2020]_ eq. 8.13 page 264):

 The K-Fold error estimate can be obtained with a single split of the
 data :math:`\set{D}` into :math:`k` folds.
-It is worth noting that one can repeat the cross-validation multiple
+One can repeat the cross-validation multiple
 times using different divisions into folds to obtain better
 Monte Carlo estimate.
 This comes obviously with an additional computational cost.
 The *leave-one-out* (LOO) cross-validation is a special case of
 K-Fold cross-validation where the number of folds :math:`k` is
-chosen equal to the cardinality :math:`n` of the experimental design
+equal to :math:`n`, the sample size of the experimental design
 :math:`\set{D}`.

+If the folds have equal sizes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Assume that the number of folds divides the sample size.
+Mathematically, this means that :math:`k` divides :math:`n`.
+In this special case, each fold has the same number of observations:
+
+.. math::
+
+    n_\ell = \frac{n}{k}
+
+for :math:`\ell = 1, ..., k`.
+This implies that the K-Fold mean squared error has a particularly simple expression:
+
+.. math::
+
+    \widehat{\operatorname{MSE}}_{KFold}
+    = \frac{1}{n} \sum_{\ell = 1}^k \sum_{j \in \mathcal{S}_\ell}
+    \left[y^{(j)} - \hat{g}^{(-j)} \left(\boldsymbol{x}^{(j)}\right)\right]^2.
+
+Since the subsets :math:`\mathcal{S}_1, ..., \mathcal{S}_k` are, by hypothesis, a disjoint
+partition of the full set of indices, all observations are involved in the
+previous equation with the same weight :math:`\frac{1}{n}`.
+This implies that the previous equation is the sample mean of the K-Fold squared residuals:
+
+.. math::
+
+    \widehat{\operatorname{MSE}}_{KFold}
+    = \frac{1}{n} \sum_{j = 1}^n
+    \left[y^{(j)} - \hat{g}^{(-j)} \left(\boldsymbol{x}^{(j)}\right)\right]^2.
+
+If :math:`k` does not divide :math:`n`, the previous equation does not
+hold: the squared residuals in different folds do not
+necessarily have the same weight in the estimator of the mean squared error.

 Fast K-Fold of a linear model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -410,14 +443,14 @@ Just as the LOO error involves the division by :math:`1 - h_{jj}`,
 the K-Fold error involves the resolution of a linear system of equations.
 The equations introduced by [shao1993]_, are presented in [suzuki2020]_ (proposition 14 page 71).
-For any :math:`\ell \in \{1, ..., k\}`, let :math:`\boldsymbol{D}_{\ell} \in \Rset^{n_\ell \times m}`
+For any :math:`\ell \in \{1, ..., k\}`, let :math:`\boldsymbol{D}_{\ell} \in \Rset^{n_\ell \times m}`
 be the rows of the design matrix :math:`\boldsymbol{D}` corresponding to the indices of the observations
 involved in the :math:`\ell`-th fold:

 .. math::

-    \boldsymbol{D}_{\ell} =
-    \begin{pmatrix}
+    \boldsymbol{D}_{\ell}
+    = \begin{pmatrix}
     d_{j_1, 1} & \ldots & d_{j_1, m} \\
     \vdots & & \vdots \\
     d_{j_{n_\ell}, 1} & \ldots & d_{j_{n_\ell}, m}
@@ -425,26 +458,25 @@ involved in the :math:`\ell`-th fold:
 where :math:`j_1, ..., j_{n_\ell} \in \mathcal{S}_\ell` are the indices of the observations
 involved in the :math:`\ell`-th fold.
-For any :math:`\ell \in \{1, ..., k\}`,
+For any :math:`\ell \in \{1, ..., k\}`,
 let :math:`\boldsymbol{H}_{\ell} \in \Rset^{n_{\ell} \times n_{\ell}}`
 be the sub-matrix of the hat matrix corresponding to the indices of the
 observations in the :math:`\ell`-th fold:

 .. math::

-    \boldsymbol{H}_{\ell}
+    \boldsymbol{H}_{\ell}
     = \boldsymbol{D}_{\ell} \left(\boldsymbol{D}^T \boldsymbol{D} \right)^{-1} \boldsymbol{D}_{\ell}^T

-
-It is not necessary to evaluate the previous expression in order to evaluate
-the corresponding hat matrix.
+It is not necessary to evaluate the previous expression in order to obtain
+the sub-matrix :math:`\boldsymbol{H}_{\ell}`.
 Indeed, the matrix :math:`\boldsymbol{H}_{\ell}` can be computed by extracting
 the corresponding rows and columns from the full hat matrix :math:`\boldsymbol{H}`:

 .. math::

-    \boldsymbol{H}_{\ell}
-    =
+    \boldsymbol{H}_{\ell}
+    =
     \begin{pmatrix}
     h_{j_1, j_1} & \ldots & h_{j_1, j_{n_\ell}} \\
     \vdots & & \vdots \\
@@ -456,7 +488,8 @@ corrected K-Fold residuals:

 .. math::

-    (\boldsymbol{I}_{\ell} - \boldsymbol{H}_{\ell}) \hat{\boldsymbol{r}}_{\ell} = \boldsymbol{y}_{\ell} - \hat{\boldsymbol{y}}_{\ell}
+    (\boldsymbol{I}_{\ell} - \boldsymbol{H}_{\ell}) \hat{\boldsymbol{r}}_{\ell}
+    = \boldsymbol{y}_{\ell} - \hat{\boldsymbol{y}}_{\ell}

 where :math:`\boldsymbol{I}_{\ell} \in \Rset^{\ell \times \ell}` is the identity matrix,
 :math:`\boldsymbol{y}_{\ell} \in \Rset^{\ell}` is the vector of output observations in the

 .. math::

-    \boldsymbol{y}_{\ell} & = \left(y^{(j)}\right)^T_{j \in \set{S}_{\ell}},
+    \boldsymbol{y}_{\ell}
+    = \left(y^{(j)}\right)^T_{j \in \set{S}_{\ell}},

 and :math:`\hat{\boldsymbol{y}}_{\ell} \in \Rset^{\ell}` is the corresponding vector of output
 predictions from the linear least squares surrogate model:

 .. math::

-    \hat{\boldsymbol{y}}_{\ell} & = \left(g\left(\vect{x}^{(j)}\right)\right)^T_{j \in \set{S}_{\ell}}.
+    \hat{\boldsymbol{y}}_{\ell}
+    = \left(g\left(\vect{x}^{(j)}\right)\right)^T_{j \in \set{S}_{\ell}}.

 Then the mean squared error of the :math:`\ell`-th fold is:

 .. math::

-    \widehat{MSE}^{(\ell)}
-    = \frac{1}{n_{\ell}} \sum_{\ell = 1}^{n_{\ell}} \hat{\boldsymbol{r}}_{\ell}^2.
+    \widehat{\operatorname{MSE}}^{(\ell)}
+    = \frac{1}{n_{\ell}} \sum_{j = 1}^{n_{\ell}}
+    \left(\hat{\boldsymbol{r}}_{\ell}\right)_j^2.

 Then the K-Fold mean squared error is evaluated from equation :eq:`kfoldMean`.
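The two fast formulas documented in this patch can be checked numerically.
The following is a minimal NumPy sketch (not part of the patch and not the OpenTURNS API;
the toy design matrix, outputs, number of folds and variable names are illustrative
assumptions only): the raw least squares residuals corrected by :math:`1 - h_{jj}`
reproduce the naive leave-one-out estimate, and the per-fold linear system
:math:`(\boldsymbol{I}_{\ell} - \boldsymbol{H}_{\ell}) \hat{\boldsymbol{r}}_{\ell} = \boldsymbol{y}_{\ell} - \hat{\boldsymbol{y}}_{\ell}`
reproduces the naive K-Fold estimate::

    import numpy as np

    # Toy data: n observations, m parameters, a linear model with additive noise.
    rng = np.random.default_rng(0)
    n, m, k = 20, 3, 5
    D = rng.normal(size=(n, m))                      # design matrix
    y = D @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Full least squares fit, raw residuals and hat matrix H = D (D^T D)^{-1} D^T.
    a_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    raw_residuals = y - D @ a_hat
    H = D @ np.linalg.solve(D.T @ D, D.T)
    h = np.diag(H)                                   # leverages h_jj

    # Fast LOO: correct the raw residuals by 1 - h_jj.
    mse_loo_fast = np.mean((raw_residuals / (1.0 - h)) ** 2)

    # Naive LOO: refit the model n times, leaving one observation out each time.
    mse_loo_naive = 0.0
    for j in range(n):
        keep = np.arange(n) != j
        a_j, *_ = np.linalg.lstsq(D[keep], y[keep], rcond=None)
        mse_loo_naive += (y[j] - D[j] @ a_j) ** 2
    mse_loo_naive /= n
    print(mse_loo_fast, mse_loo_naive)               # the two estimates coincide

    # Fast K-Fold (equal-sized folds, k divides n): solve the per-fold system
    # (I_l - H_l) r_l = y_l - y_hat_l, where H_l is extracted from H.
    folds = np.array_split(rng.permutation(n), k)
    y_hat = D @ a_hat
    mse_kfold_fast = 0.0
    for idx in folds:
        H_l = H[np.ix_(idx, idx)]
        r_l = np.linalg.solve(np.eye(len(idx)) - H_l, y[idx] - y_hat[idx])
        mse_kfold_fast += np.sum(r_l ** 2)
    mse_kfold_fast /= n

    # Naive K-Fold: refit the model k times, setting one fold aside each time.
    mse_kfold_naive = 0.0
    for idx in folds:
        keep = np.setdiff1d(np.arange(n), idx)
        a_l, *_ = np.linalg.lstsq(D[keep], y[keep], rcond=None)
        mse_kfold_naive += np.sum((y[idx] - D[idx] @ a_l) ** 2)
    mse_kfold_naive /= n
    print(mse_kfold_fast, mse_kfold_naive)           # the two estimates coincide

For a linear least squares model the agreement is exact up to floating point error,
which is why a single full fit (plus the hat matrix) is enough to evaluate both
cross-validation estimators.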