Pretty-print of LinearModelAnalysis
mbaudin47 committed Aug 22, 2023
1 parent 1d861ea commit a1f37df
Showing 7 changed files with 466 additions and 95 deletions.
6 changes: 6 additions & 0 deletions python/doc/bibliography.rst
@@ -25,6 +25,7 @@ Bibliography
*OpenTURNS: An Industrial Software for Uncertainty Quantification in Simulation.*
In: Ghanem R., Higdon D., Owhadi H. (eds) Handbook of Uncertainty Quantification. Springer
`pdf <https://arxiv.org/pdf/1501.05242>`__
.. [baron2014] Baron, M. (2014). *Probability and statistics for computer scientists*. CRC Press.
.. [beirlant2004] Beirlant J., Goegebeur Y., Teugels J., Segers J.,
*Statistics of extremes: theory and applications*, Wiley, 2004
.. [bhattacharyya1997] Bhattacharyya G.K., and R.A. Johnson, *Statistical
@@ -39,6 +40,8 @@ Bibliography
.. [burnham2002] Burnham, K.P., and Anderson, D.R. *Model Selection and
Multimodel Inference: A Practical Information Theoretic Approach*, Springer,
2002.
.. [bingham2010] Bingham, N. H., & Fry, J. M. (2010).
*Regression: Linear models in statistics*. Springer.
.. [cambou2017] Mathieu Cambou, Marius Hofert, Christiane Lemieux, *Quasi-Random numbers for copula models*, Stat. Comp., 2017, 27(5), 1307-1329.
`pdf <https://arxiv.org/pdf/1508.03483.pdf>`__
.. [caniou2012] Caniou, Y. *Global sensitivity analysis for nested and
@@ -87,6 +90,7 @@ Bibliography
`pdf <https://tel.archives-ouvertes.fr/tel-00697026v2/document>`__
.. [fang2006] K-T. Fang, R. Li, and A. Sudjianto. *Design and modeling for
computer experiments.* Chapman & Hall CRC, 2006.
.. [faraway2014] Faraway, J. J. (2014). *Linear models with R*. Second edition, CRC Press.
.. [freedman1981] David Freedman, Persi Diaconis, *On the histogram as a density
estimator: L2 theory*, December 1981, Probability Theory and Related Fields.
57 (4): 453–476.
@@ -293,6 +297,8 @@ Bibliography
.. [ScottStewart2011] W. F. Scott & B. Stewart.
*Tables for the Lilliefors and Modified Cramer–von Mises Tests of Normality*,
Communications in Statistics - Theory and Methods. Volume 40, 2011 - Issue 4. Pages 726-730.
.. [sen1990] Sen, A., & Srivastava, M. (1990). *Regression analysis: theory, methods, and applications*.
Springer.
.. [simard2011] Simard, R. & L'Ecuyer, P. *Computing the Two-Sided Kolmogorov-
Smirnov Distribution.* Journal of Statistical Software, 2011, 39(11), 1-18.
`pdf <https://www.jstatsoft.org/article/view/v039i11/v39i11.pdf>`__
@@ -1,11 +1,43 @@
"""
Create a linear model
=====================
In this example we create a surrogate model using a linear model
approximation with the :class:`~openturns.LinearModelAlgorithm` class.
We show how the :class:`~openturns.LinearModelAnalysis` class
can be used to produce the statistical analysis of the least squares
regression model.
"""
# %%
# Introduction
# ~~~~~~~~~~~~
#
# The following 2-dimensional function is used in this example:
#
# .. math::
#
# \model(x,y) = 3 + 2x - y
#
# for any :math:`x, y \in \Rset`.
#
# Notice that this model is linear:
#
# .. math::
#
# \model(x,y) = \beta_1 + \beta_2 x + \beta_3 y
#
# where :math:`\beta_1 = 3`, :math:`\beta_2 = 2` and :math:`\beta_3 = -1`.
#
# We consider noisy observations of the output:
#
# .. math::
#
# Y = \model(x,y) + \epsilon
#
# where :math:`\epsilon \sim \cN(0, \sigma^2)` and :math:`\sigma > 0`
# is the standard deviation.
# In our example, we use :math:`\sigma = 0.1`.
#
# Finally, we use :math:`n = 1000` independent observations of the model.
#

# %%
@@ -14,73 +46,96 @@

# %%
# Generation of the data set
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# We first generate the data and add noise to the output observations:

# %%
ot.RandomGenerator.SetSeed(0)
distribution = ot.Normal(2)
distribution.setDescription(["x", "y"])
sampleSize = 1000
func = ot.SymbolicFunction(["x", "y"], ["3 + 2 * x - y"])
input_sample = distribution.getSample(sampleSize)
epsilon = ot.Normal(0, 0.1).getSample(sampleSize)
output_sample = func(input_sample) + epsilon
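
# %%
# As a quick sanity check, we can look at the empirical mean of the noisy
# output: it should be close to 3, which is the expectation of
# :math:`3 + 2x - y` when :math:`x` and :math:`y` are independent standard
# normal variables.
print(output_sample.computeMean())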

# %%
# Linear regression
# ~~~~~~~~~~~~~~~~~
#
# Let us run the linear model algorithm using the
# :class:`~openturns.LinearModelAlgorithm` class and get the associated
# results:

# %%
algo = ot.LinearModelAlgorithm(input_sample, output_sample)
result = algo.getResult()

# %%
# Residuals analysis
# ~~~~~~~~~~~~~~~~~~
#
# We can now analyse the residuals of the regression on the training data.
# For clarity, only the first 5 residual values are shown.

# %%
residuals = result.getSampleResiduals()
residuals[:5]

# %%
# Alternatively, the `standardized` or `studentized` residuals can be used:

# %%
stdresiduals = result.getStandardizedResiduals()
stdresiduals[:5]
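
# %%
# As a hedged illustration (the classical convention, which should match the
# values above, and assuming the `getLeverages()` accessor of the
# :class:`~openturns.LinearModelResult` class), each raw residual
# :math:`\epsilon_i` is rescaled by :math:`\hat{\sigma} \sqrt{1 - h_i}`,
# where :math:`h_i` is the leverage of observation :math:`i` and
# :math:`\hat{\sigma}^2` is the unbiased estimate of the residual variance:
import math

n = residuals.getSize()
p = 3  # number of coefficients: intercept, x and y
rss = sum(residuals[i, 0] ** 2 for i in range(n))
sigmaHat = math.sqrt(rss / (n - p))
h = result.getLeverages()
manual = [residuals[i, 0] / (sigmaHat * math.sqrt(1.0 - h[i])) for i in range(5)]
print(manual)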

# %%
# Similarly, we can also obtain the underlying distribution characterizing
# the residuals:

# %%
result.getNoiseDistribution()
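
# %%
# The estimated standard deviation of the noise can be compared with the
# exact value :math:`\sigma = 0.1` used to generate the data (a simple
# check; we assume the noise distribution is returned as a
# :class:`~openturns.Normal`):
print(result.getNoiseDistribution().getStandardDeviation())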


# %%
# Analysis of the results
# ~~~~~~~~~~~~~~~~~~~~~~~
#
# In order to post-process the linear regression results, the
# :class:`~openturns.LinearModelAnalysis` class can be used:

# %%
analysis = ot.LinearModelAnalysis(result)
analysis

# %%
# The results seem to indicate that the linear model is satisfactory.
#
# - The basis uses the three functions :math:`1` (i.e. the intercept),
#   :math:`x` and :math:`y`.
# - The table of coefficients tests whether each single coefficient is zero.
#   The probability of observing a large value of the T statistic is close to
#   zero for all coefficients.
#   Hence, we can reject the hypothesis that any coefficient is zero.
#   In other words, all the coefficients are significantly nonzero.
# - The `R-Squared` value is close to 1.
#   Furthermore, the adjusted value, which takes into account the size of
#   the data set and the number of coefficients, is similar to `R-Squared`.
#   Most of the variance is explained by the linear model.
# - The F-test checks whether all the coefficients are simultaneously zero.
#   The `Fisher-Snedecor` p-value is lower than 1%, so we reject this
#   hypothesis: the linear model is significant.
# - The normality test checks that the residuals have a normal distribution.
#   The normality assumption can be rejected if the p-value is close to zero
#   (say, lower than 0.05).
#   Here, all the p-values are larger than 0.05: the normality assumption
#   is not rejected.
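
# %%
# A hedged sketch of how these quantities might be retrieved
# programmatically; the accessors below are assumed to be provided by
# :class:`~openturns.LinearModelAnalysis`:
print(analysis.getCoefficientsTScores())  # T statistic of each coefficient
print(analysis.getFisherPValue())  # p-value of the F-test
print(analysis.getNormalityTestResultAndersonDarling())  # one normality test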

# %%
# Graphical analyses
# ~~~~~~~~~~~~~~~~~~
#
# Let us compare model and fitted values:

@@ -89,7 +144,8 @@
view = viewer.View(graph)

# %%
# The previous figure seems to indicate that the linearity hypothesis
# is reasonable.

# %%
# Residuals can be plotted against the fitted values.
@@ -107,9 +163,11 @@
view = viewer.View(graph)

# %%
# In this case, the two distributions are very close: there is no obvious
# outlier.
#
# Cook's distance measures the impact of every individual data point on the
# linear regression, and can be plotted as follows:

# %%
graph = analysis.drawCookDistance()
@@ -118,27 +176,32 @@
# %%
# This graph shows the indices of the points with disproportionate influence.
#
# One of the components of the computation of Cook's distance at a given
# point is that point's *leverage*.
# High-leverage points are far from their closest neighbors, so the fitted
# linear regression model must pass close to them.
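
# %%
# As a hedged cross-check (assuming the `getLeverages()` accessor used
# above and the classical formula), Cook's distance of observation
# :math:`i` can be written in terms of the leverage :math:`h_i` and the
# standardized residual :math:`r_i` as
# :math:`D_i = r_i^2 h_i / (p (1 - h_i))`, where :math:`p` is the number
# of coefficients:
leverages = result.getLeverages()
p = 3  # number of coefficients: intercept, x and y
cookManual = [
    stdresiduals[i, 0] ** 2 * leverages[i] / (p * (1.0 - leverages[i]))
    for i in range(5)
]
print(cookManual)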

# %%
graph = analysis.drawResidualsVsLeverages()
view = viewer.View(graph)

# %%
# In this case, there seems to be no obvious influential outlier characterized
# by large leverage and residual values.
#
# Similarly, we can also plot Cook's distances as a function of the sample
# leverages:

# %%
graph = analysis.drawCookVsLeverages()
view = viewer.View(graph)

# %%
# Finally, we give the intervals for each estimated coefficient (95% confidence
# interval):

# %%
alpha = 0.95
interval = analysis.getCoefficientsConfidenceInterval(alpha)
print("confidence intervals with level=%1.2f : " % (alpha))
print("confidence intervals with level=%1.2f: " % (alpha))
print("%s" % (interval))