diff --git a/HERS-example.qmd b/HERS-example.qmd index 257b0c4..b83ec12 100644 --- a/HERS-example.qmd +++ b/HERS-example.qmd @@ -28,7 +28,7 @@ hers = haven::read_dta( ```{r} #| include: false library(haven) -hers = read_stata("data/hersdata.dta") +hers = read_stata("Data/hersdata.dta") ``` ```{r} diff --git a/Linear-models-overview.qmd b/Linear-models-overview.qmd index a4e16fb..07d656c 100644 --- a/Linear-models-overview.qmd +++ b/Linear-models-overview.qmd @@ -2240,7 +2240,7 @@ hers = read_dta(url) ```{r} #| include: false library(haven) -hers = read_stata("data/hersdata.dta") +hers = read_stata("Data/hersdata.dta") ``` ```{r} diff --git a/_book/search.json b/_book/search.json index ffac8a2..266b939 100644 --- a/_book/search.json +++ b/_book/search.json @@ -193,7 +193,7 @@ "href": "Linear-models-overview.html#model-selection-1", "title": "\n2  Linear (Gaussian) Models\n", "section": "\n2.9 Model selection", - "text": "2.9 Model selection\n(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).\n\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\nThere are a few possible metrics to consider for choosing a “best” model.\n\n\n2.9.1 Mean squared error\nWe might want to minimize the mean squared error, \\(\\text E[(y-\\hat y)^2]\\), for new observations that weren’t in our data set when we fit the model.\nUnfortunately, \\[\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2\\] gives a biased estimate of \\(\\text E[(y-\\hat y)^2]\\) for new data. If we want an unbiased estimate, we will have to be clever.\n\nCross-validation\n\nShow R codedata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . 
- age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\n\n\n\nShow R coderesults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\ncomparing metrics\n\nShow R code\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n\n\n\nmodel\ncvRMSE\nr.squared\nadj.r.squared\ntrainRMSE\nloglik\n\n\n\nfull\n6.887\n0.4805\n0.3831\n5.956\n-61.84\n\n\nreduced\n6.483\n0.4454\n0.3802\n5.971\n-62.49\n\n\n\n\n\n\n\nShow R codeanova(full_model, reduced_model)\n\n\n\nRes.Df\nRSS\nDf\nSum of Sq\nF\nPr(>F)\n\n\n\n16\n567.7\nNA\nNA\nNA\nNA\n\n\n17\n606.0\n-1\n-38.36\n1.081\n0.3139\n\n\n\n\n\nstepwise regression\n\nShow R codelibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. 
R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 30.301 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. \n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. 
Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n\n\nLasso\n\\[\\arg min_{\\theta} \\ell(\\theta) + \\lambda \\sum_{j=1}^p|\\beta_j|\\]\n\nShow R codelibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n\n\n\n\nShow R codeautoplot(fit, xvar = 'lambda')\n\n\n\nFigure 2.19: Lasso selection\n\n\n\n\n\n\n\n\n\nShow R codecvfit = cv.glmnet(x,y)\nplot(cvfit)\n\n\n\n\n\n\n\n\n\nShow R codecoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 33.8049\n#> age . \n#> weight -0.1406\n#> protein 1.2176", + "text": "2.9 Model selection\n(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).\n\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\nThere are a few possible metrics to consider for choosing a “best” model.\n\n\n2.9.1 Mean squared error\nWe might want to minimize the mean squared error, \\(\\text E[(y-\\hat y)^2]\\), for new observations that weren’t in our data set when we fit the model.\nUnfortunately, \\[\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2\\] gives a biased estimate of \\(\\text E[(y-\\hat y)^2]\\) for new data. 
If we want an unbiased estimate, we will have to be clever.\n\nCross-validation\n\nShow R codedata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\n\n\n\nShow R coderesults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\ncomparing metrics\n\nShow R code\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n\n\n\nmodel\ncvRMSE\nr.squared\nadj.r.squared\ntrainRMSE\nloglik\n\n\n\nfull\n6.906\n0.4805\n0.3831\n5.956\n-61.84\n\n\nreduced\n6.586\n0.4454\n0.3802\n5.971\n-62.49\n\n\n\n\n\n\n\nShow R codeanova(full_model, reduced_model)\n\n\n\nRes.Df\nRSS\nDf\nSum of Sq\nF\nPr(>F)\n\n\n\n16\n567.7\nNA\nNA\nNA\nNA\n\n\n17\n606.0\n-1\n-38.36\n1.081\n0.3139\n\n\n\n\n\nstepwise regression\n\nShow R codelibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. 
R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 30.301 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. \n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. 
Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n\n\nLasso\n\\[\\arg min_{\\theta} \\ell(\\theta) + \\lambda \\sum_{j=1}^p|\\beta_j|\\]\n\nShow R codelibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n\n\n\n\nShow R codeautoplot(fit, xvar = 'lambda')\n\n\n\nFigure 2.19: Lasso selection\n\n\n\n\n\n\n\n\n\nShow R codecvfit = cv.glmnet(x,y)\nplot(cvfit)\n\n\n\n\n\n\n\n\n\nShow R codecoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.2044\n#> age . \n#> weight -0.0926\n#> protein 0.8582", "crumbs": [ "Generalized Linear Models", "2  Linear (Gaussian) Models" @@ -215,7 +215,7 @@ "href": "Linear-models-overview.html#ordinal-covariates", "title": "\n2  Linear (Gaussian) Models\n", "section": "\n2.11 Ordinal covariates", - "text": "2.11 Ordinal covariates\n(c.f. 
Dobson and Barnett (2018) §2.4.4)\n\n\nWe can create ordinal variables in R using the ordered() function4.\n\n\nExample 2.3  \n\nShow R codeurl = paste0(\n \"https://regression.ucsf.edu/sites/g/files/tkssra6706/\",\n \"f/wysiwyg/home/data/hersdata.dta\")\nlibrary(haven)\nhers = read_dta(url)\n\n\n\nShow R codehers |> head()\n\n\nTable 2.18: HERS dataset\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHT\nage\nraceth\nnonwhite\nsmoking\ndrinkany\nexercise\nphysact\nglobrat\npoorfair\nmedcond\nhtnmeds\nstatins\ndiabetes\ndmpills\ninsulin\nweight\nBMI\nwaist\nWHR\nglucose\nweight1\nBMI1\nwaist1\nWHR1\nglucose1\ntchol\nLDL\nHDL\nTG\ntchol1\nLDL1\nHDL1\nTG1\nSBP\nDBP\nage10\n\n\n\n0\n70\n2\n1\n0\n0\n0\n5\n3\n0\n0\n1\n1\n0\n0\n0\n73.8\n23.69\n96.0\n0.932\n84\n73.6\n23.63\n93.0\n0.912\n94\n189\n122.4\n52\n73\n201\n137.6\n48\n77\n138\n78\n7.0\n\n\n0\n62\n2\n1\n0\n0\n0\n1\n3\n0\n1\n1\n0\n0\n0\n0\n70.9\n28.62\n93.0\n0.964\n111\n73.4\n28.89\n95.0\n0.964\n78\n307\n241.6\n44\n107\n216\n150.6\n48\n87\n118\n70\n6.2\n\n\n1\n69\n1\n0\n0\n0\n0\n3\n3\n0\n0\n1\n0\n1\n0\n0\n102.0\n42.51\n110.2\n0.782\n114\n96.1\n40.73\n103.0\n0.774\n98\n254\n166.2\n57\n154\n254\n156.0\n66\n160\n134\n78\n6.9\n\n\n0\n64\n1\n0\n1\n1\n0\n1\n3\n0\n1\n1\n0\n0\n0\n0\n64.4\n24.39\n87.0\n0.877\n94\n58.6\n22.52\n77.0\n0.802\n93\n204\n116.2\n56\n159\n207\n122.6\n57\n137\n152\n72\n6.4\n\n\n0\n65\n1\n0\n0\n0\n0\n2\n3\n0\n0\n0\n0\n0\n0\n0\n57.9\n21.90\n77.0\n0.794\n101\n58.9\n22.28\n76.5\n0.757\n92\n214\n150.6\n42\n107\n235\n172.2\n35\n139\n175\n95\n6.5\n\n\n1\n68\n2\n1\n0\n1\n0\n3\n3\n0\n0\n0\n0\n0\n0\n0\n60.9\n29.05\n96.0\n1.000\n116\n57.7\n27.52\n86.0\n0.910\n115\n212\n137.8\n52\n111\n202\n126.6\n53\n112\n174\n98\n6.8\n\n\n\n\n\n\n\n\n\n\n\nShow R code\n# C(contr = codingMatrices::contr.diff)", + "text": "2.11 Ordinal covariates\n(c.f. 
Dobson and Barnett (2018) §2.4.4)\n\n\nWe can create ordinal variables in R using the ordered() function4.\n\n\nExample 2.3  \n\nShow R codeurl = paste0(\n \"https://regression.ucsf.edu/sites/g/files/tkssra6706/\",\n \"f/wysiwyg/home/data/hersdata.dta\")\nlibrary(haven)\nhers = read_dta(url)\n\n\n\nShow R codehers |> head()\n\n\nTable 2.18: HERS dataset\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHT\nage\nraceth\nnonwhite\nsmoking\ndrinkany\nexercise\nphysact\nglobrat\npoorfair\nmedcond\nhtnmeds\nstatins\ndiabetes\ndmpills\ninsulin\nweight\nBMI\nwaist\nWHR\nglucose\nweight1\nBMI1\nwaist1\nWHR1\nglucose1\ntchol\nLDL\nHDL\nTG\ntchol1\nLDL1\nHDL1\nTG1\nSBP\nDBP\nage10\n\n\n\n0\n70\n2\n1\n0\n0\n0\n5\n3\n0\n0\n1\n1\n0\n0\n0\n73.8\n23.69\n96.0\n0.932\n84\n73.6\n23.63\n93.0\n0.912\n94\n189\n122.4\n52\n73\n201\n137.6\n48\n77\n138\n78\n7.0\n\n\n0\n62\n2\n1\n0\n0\n0\n1\n3\n0\n1\n1\n0\n0\n0\n0\n70.9\n28.62\n93.0\n0.964\n111\n73.4\n28.89\n95.0\n0.964\n78\n307\n241.6\n44\n107\n216\n150.6\n48\n87\n118\n70\n6.2\n\n\n1\n69\n1\n0\n0\n0\n0\n3\n3\n0\n0\n1\n0\n1\n0\n0\n102.0\n42.51\n110.2\n0.782\n114\n96.1\n40.73\n103.0\n0.774\n98\n254\n166.2\n57\n154\n254\n156.0\n66\n160\n134\n78\n6.9\n\n\n0\n64\n1\n0\n1\n1\n0\n1\n3\n0\n1\n1\n0\n0\n0\n0\n64.4\n24.39\n87.0\n0.877\n94\n58.6\n22.52\n77.0\n0.802\n93\n204\n116.2\n56\n159\n207\n122.6\n57\n137\n152\n72\n6.4\n\n\n0\n65\n1\n0\n0\n0\n0\n2\n3\n0\n0\n0\n0\n0\n0\n0\n57.9\n21.90\n77.0\n0.794\n101\n58.9\n22.28\n76.5\n0.757\n92\n214\n150.6\n42\n107\n235\n172.2\n35\n139\n175\n95\n6.5\n\n\n1\n68\n2\n1\n0\n1\n0\n3\n3\n0\n0\n0\n0\n0\n0\n0\n60.9\n29.05\n96.0\n1.000\n116\n57.7\n27.52\n86.0\n0.910\n115\n212\n137.8\n52\n111\n202\n126.6\n53\n112\n174\n98\n6.8\n\n\n\n\n\n\n\n\n\n\n\nShow R code\n# C(contr = codingMatrices::contr.diff)\n\n\n\n\n\n\n\n\nAnderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bulletin of American Iris Society 59: 2–5.\n\n\nChatterjee, Samprit, and Ali S Hadi. 2015. 
Regression Analysis by Example. John Wiley & Sons. https://www.wiley.com/en-us/Regression+Analysis+by+Example%2C+4th+Edition-p-9780470055458.\n\n\nDobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.\n\n\nDunn, Peter K, and Gordon K Smyth. 2018. Generalized Linear Models with Examples in r. Vol. 53. Springer. https://link.springer.com/book/10.1007/978-1-4419-0118-7.\n\n\nFaraway, Julian J. 2025. Linear Models with R. https://www.routledge.com/Linear-Models-with-R/Faraway/p/book/9781032583983.\n\n\nHarrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2nd ed. Springer. https://doi.org/10.1007/978-3-319-19425-7.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer. https://www.statlearning.com/.\n\n\nKleinbaum, David G, and Mitchel Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-1742-3.\n\n\n———. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.\n\n\nKleinbaum, David G, Lawrence L Kupper, Azhar Nizam, K Muller, and ES Rosenberg. 2014. Applied Regression Analysis and Other Multivariable Methods. 5th ed. Cengage Learning. https://www.cengage.com/c/applied-regression-analysis-and-other-multivariable-methods-5e-kleinbaum/9781285051086/.\n\n\nKutner, Michael H, Christopher J Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models. McGraw-Hill.\n\n\nPolin, Richard A, William W Fox, and Steven H Abman. 2011. Fetal and Neonatal Physiology. 4th ed. Elsevier health sciences.\n\n\nSeber, George AF, and Alan J Lee. 2012. Linear Regression Analysis. 2nd ed. John Wiley & Sons. 
https://www.wiley.com/en-us/Linear+Regression+Analysis%2C+2nd+Edition-p-9781118274422.\n\n\nVenables, Bill. 2023. codingMatrices: Alternative Factor Coding Matrices for Linear Model Formulae (version 0.4.0). https://CRAN.R-project.org/package=codingMatrices.\n\n\nVittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.\n\n\nWeisberg, Sanford. 2005. Applied Linear Regression. Vol. 528. John Wiley & Sons.", "crumbs": [ "Generalized Linear Models", "2  Linear (Gaussian) Models" @@ -501,7 +501,7 @@ "href": "intro-multilevel-models.html", "title": "5  Introduction to multi-level models for correlated data", "section": "", - "text": "For more, see EVE 225 — Linear Mixed Modeling in Ecology & Evolution", + "text": "For more, see EVE 225: Linear Mixed Modeling in Ecology & Evolution", "crumbs": [ "Generalized Linear Models", "5  Introduction to multi-level models for correlated data" diff --git a/_freeze/Linear-models-overview/execute-results/html.json b/_freeze/Linear-models-overview/execute-results/html.json index 86de8cd..c6e3c0d 100644 --- a/_freeze/Linear-models-overview/execute-results/html.json +++ b/_freeze/Linear-models-overview/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "cf0077fda43f6716c6ac1ffa5069d436", + "hash": "32ab51f7bf136708900c866450f81545", "result": { "engine": "knitr", - "markdown": "---\ndf-print: paged\n---\n\n\n\n\n# Linear (Gaussian) Models\n\n---\n\n\n\n\n---\n\n### Configuring R {.unnumbered}\n\nFunctions from these packages will be used throughout this document:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(conflicted) # check for conflicting function definitions\n# library(printr) # inserts help-file output into markdown output\nlibrary(rmarkdown) # Convert R Markdown documents into a variety of formats.\nlibrary(pander) # format tables for 
markdown\nlibrary(ggplot2) # graphics\nlibrary(ggeasy) # help with graphics\nlibrary(ggfortify) # help with graphics\nlibrary(dplyr) # manipulate data\nlibrary(tibble) # `tibble`s extend `data.frame`s\nlibrary(magrittr) # `%>%` and other additional piping tools\nlibrary(haven) # import Stata files\nlibrary(knitr) # format R output for markdown\nlibrary(tidyr) # Tools to help to create tidy data\nlibrary(plotly) # interactive graphics\nlibrary(dobson) # datasets from Dobson and Barnett 2018\nlibrary(parameters) # format model output tables for markdown\nlibrary(haven) # import Stata files\nlibrary(latex2exp) # use LaTeX in R code (for figures and tables)\nlibrary(fs) # filesystem path manipulations\nlibrary(survival) # survival analysis\nlibrary(survminer) # survival analysis graphics\nlibrary(KMsurv) # datasets from Klein and Moeschberger\nlibrary(parameters) # format model output tables for\nlibrary(webshot2) # convert interactive content to static for pdf\nlibrary(forcats) # functions for categorical variables (\"factors\")\nlibrary(stringr) # functions for dealing with strings\nlibrary(lubridate) # functions for dealing with dates and times\n```\n:::\n\n\n\nHere are some R settings I use in this document:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(list = ls()) # delete any data that's already loaded into R\n\nconflicts_prefer(dplyr::filter)\nggplot2::theme_set(\n ggplot2::theme_bw() + \n # ggplot2::labs(col = \"\") +\n ggplot2::theme(\n legend.position = \"bottom\",\n text = ggplot2::element_text(size = 12, family = \"serif\")))\n\nknitr::opts_chunk$set(message = FALSE)\noptions('digits' = 4)\n\npanderOptions(\"big.mark\", \",\")\npander::panderOptions(\"table.emphasize.rownames\", FALSE)\npander::panderOptions(\"table.split.table\", Inf)\nconflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default\nlegend_text_size = 
9\n```\n:::\n\n\n\n\n\n\n\n\\providecommand{\\cbl}[1]{\\left\\{#1\\right.}\n\\providecommand{\\cb}[1]{\\left\\{#1\\right\\}}\n\\providecommand{\\paren}[1]{\\left(#1\\right)}\n\\providecommand{\\sb}[1]{\\left[#1\\right]}\n\\def\\pr{\\text{p}}\n\\def\\am{\\arg \\max}\n\\def\\argmax{\\arg \\max}\n\\def\\p{\\text{p}}\n\\def\\P{\\text{P}}\n\\def\\ph{\\hat{\\text{p}}}\n\\def\\hp{\\hat{\\text{p}}}\n\\def\\ga{\\alpha}\n\\def\\b{\\beta}\n\\providecommand{\\floor}[1]{\\left \\lfloor{#1}\\right \\rfloor}\n\\providecommand{\\ceiling}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\providecommand{\\ceil}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\def\\Ber{\\text{Ber}}\n\\def\\Bernoulli{\\text{Bernoulli}}\n\\def\\Pois{\\text{Pois}}\n\\def\\Poisson{\\text{Poisson}}\n\\def\\Gaus{\\text{Gaussian}}\n\\def\\Normal{\\text{N}}\n\\def\\NB{\\text{NegBin}}\n\\def\\NegBin{\\text{NegBin}}\n\\def\\vbeta{\\vec \\beta}\n\\def\\vb{\\vec \\b}\n\\def\\v0{\\vec{0}}\n\\def\\gb{\\beta}\n\\def\\gg{\\gamma}\n\\def\\gd{\\delta}\n\\def\\eps{\\varepsilon}\n\\def\\om{\\omega}\n\\def\\m{\\mu}\n\\def\\s{\\sigma}\n\\def\\l{\\lambda}\n\\def\\gs{\\sigma}\n\\def\\gm{\\mu}\n\\def\\M{\\text{M}}\n\\def\\gM{\\text{M}}\n\\def\\Mu{\\text{M}}\n\\def\\cd{\\cdot}\n\\def\\cds{\\cdots}\n\\def\\lds{\\ldots}\n\\def\\eqdef{\\stackrel{\\text{def}}{=}}\n\\def\\defeq{\\stackrel{\\text{def}}{=}}\n\\def\\hb{\\hat \\beta}\n\\def\\hl{\\hat \\lambda}\n\\def\\hy{\\hat y}\n\\def\\yh{\\hat y}\n\\def\\V{{\\text{Var}}}\n\\def\\hs{\\hat \\sigma}\n\\def\\hsig{\\hat \\sigma}\n\\def\\hS{\\hat \\Sigma}\n\\def\\hSig{\\hat \\Sigma}\n\\def\\hSigma{\\hat \\Sigma}\n\\def\\hSurv{\\hat{S}}\n\\providecommand{\\hSurvf}[1]{\\hat{S}\\paren{#1}}\n\\def\\dist{\\ \\sim \\ }\n\\def\\ddist{\\ \\dot{\\sim} \\ }\n\\def\\dsim{\\ \\dot{\\sim} \\ }\n\\def\\za{z_{1 - \\frac{\\alpha}{2}}}\n\\def\\cirad{\\za \\cdot \\hse{\\hb}}\n\\def\\ci{\\hb {\\color{red}\\pm} 
\\cirad}\n\\def\\th{\\theta}\n\\def\\Th{\\Theta}\n\\def\\xbar{\\bar{x}}\n\\def\\hth{\\hat\\theta}\n\\def\\hthml{\\hth_{\\text{ML}}}\n\\def\\ba{\\begin{aligned}}\n\\def\\ea{\\end{aligned}}\n\\def\\ind{⫫}\n\\def\\indpt{⫫}\n\\def\\all{\\forall}\n\\def\\iid{\\text{iid}}\n\\def\\ciid{\\text{ciid}}\n\\def\\simind{\\ \\sim_{\\ind}\\ }\n\\def\\siid{\\ \\sim_{\\iid}\\ }\n\\def\\simiid{\\siid}\n\\def\\distiid{\\siid}\n\\def\\tf{\\therefore}\n\\def\\Lik{\\mathcal{L}}\n\\def\\llik{\\ell}\n\\providecommand{\\llikf}[1]{\\llik \\paren{#1}}\n\\def\\score{\\ell'}\n\\providecommand{\\scoref}[1]{\\score \\paren{#1}}\n\\def\\hess{\\ell''}\n\\def\\hessian{\\ell''}\n\\providecommand{\\hessf}[1]{\\hess \\paren{#1}}\n\\providecommand{\\hessianf}[1]{\\hess \\paren{#1}}\n\\providecommand{\\starf}[1]{#1^*}\n\\def\\lik{\\ell}\n\\providecommand{\\est}[1]{\\widehat{#1}}\n\\providecommand{\\esttmp}[1]{{\\widehat{#1}}^*}\n\\def\\esttmpl{\\esttmp{\\lambda}}\n\\def\\cR{\\mathcal{R}}\n\\def\\range{\\mathcal{R}}\n\\def\\Range{\\mathcal{R}}\n\\providecommand{\\rangef}[1]{\\cR(#1)}\n\\def\\~{\\approx}\n\\def\\dapp{\\dot\\approx}\n\\providecommand{\\red}[1]{{\\color{red}#1}}\n\\providecommand{\\deriv}[1]{\\frac{\\partial}{\\partial #1}}\n\\providecommand{\\derivf}[2]{\\frac{\\partial #1}{\\partial 
#2}}\n\\providecommand{\\blue}[1]{{\\color{blue}#1}}\n\\providecommand{\\green}[1]{{\\color{green}#1}}\n\\providecommand{\\hE}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hExp}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hmu}[1]{\\hat{\\mu}\\sb{#1}}\n\\def\\Expp{\\mathbb{E}}\n\\def\\Ep{\\mathbb{E}}\n\\def\\expit{\\text{expit}}\n\\providecommand{\\expitf}[1]{\\expit\\cb{#1}}\n\\providecommand{\\dexpitf}[1]{\\expit'\\cb{#1}}\n\\def\\logit{\\text{logit}}\n\\providecommand{\\logitf}[1]{\\logit\\cb{#1}}\n\\providecommand{\\E}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Ef}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Exp}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Expf}[1]{\\mathbb{E}\\sb{#1}}\n\\def\\Varr{\\text{Var}}\n\\providecommand{\\var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\varf}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Varf}[1]{\\text{Var}\\paren{#1}}\n\\def\\Covt{\\text{Cov}}\n\\providecommand{\\covh}[1]{\\widehat{\\text{Cov}}\\paren{#1}}\n\\providecommand{\\Cov}[1]{\\Covt \\paren{#1}}\n\\providecommand{\\Covf}[1]{\\Covt \\paren{#1}}\n\\def\\varht{\\widehat{\\text{Var}}}\n\\providecommand{\\varh}[1]{\\varht\\paren{#1}}\n\\providecommand{\\varhf}[1]{\\varht\\paren{#1}}\n\\providecommand{\\vc}[1]{\\boldsymbol{#1}}\n\\providecommand{\\sd}[1]{\\text{sd}\\paren{#1}}\n\\providecommand{\\SD}[1]{\\text{SD}\\paren{#1}}\n\\providecommand{\\hSD}[1]{\\widehat{\\text{SD}}\\paren{#1}}\n\\providecommand{\\se}[1]{\\text{se}\\paren{#1}}\n\\providecommand{\\hse}[1]{\\hat{\\text{se}}\\paren{#1}}\n\\providecommand{\\SE}[1]{\\text{SE}\\paren{#1}}\n\\providecommand{\\HSE}[1]{\\widehat{\\text{SE}}\\paren{#1}}\n\\renewcommand{\\log}[1]{\\text{log}\\cb{#1}}\n\\providecommand{\\logf}[1]{\\text{log}\\cb{#1}}\n\\def\\dlog{\\text{log}'}\n\\providecommand{\\dlogf}[1]{\\dlog 
\\cb{#1}}\n\\renewcommand{\\exp}[1]{\\text{exp}\\cb{#1}}\n\\providecommand{\\expf}[1]{\\exp{#1}}\n\\def\\dexp{\\text{exp}'}\n\\providecommand{\\dexpf}[1]{\\dexp \\cb{#1}}\n\\providecommand{\\e}[1]{\\text{e}^{#1}}\n\\providecommand{\\ef}[1]{\\text{e}^{#1}}\n\\providecommand{\\inv}[1]{\\paren{#1}^{-1}}\n\\providecommand{\\invf}[1]{\\paren{#1}^{-1}}\n\\def\\oinf{I}\n\\def\\Nat{\\mathbb{N}}\n\\providecommand{\\oinff}[1]{\\oinf\\paren{#1}}\n\\def\\einf{\\mathcal{I}}\n\\providecommand{\\einff}[1]{\\einf\\paren{#1}}\n\\def\\heinf{\\hat{\\einf}}\n\\providecommand{\\heinff}[1]{\\heinf \\paren{#1}}\n\\providecommand{\\1}[1]{\\mathbb{1}_{#1}}\n\\providecommand{\\set}[1]{\\cb{#1}}\n\\providecommand{\\pf}[1]{\\p \\paren{#1}}\n\\providecommand{\\Bias}[1]{\\text{Bias}\\paren{#1}}\n\\providecommand{\\bias}[1]{\\text{Bias}\\paren{#1}}\n\\def\\ss{\\sigma^2}\n\\providecommand{\\ssqf}[1]{\\sigma^2\\paren{#1}}\n\\providecommand{\\mselr}[1]{\\text{MSE}\\paren{#1}}\n\\providecommand{\\maelr}[1]{\\text{MAE}\\paren{#1}}\n\\providecommand{\\abs}[1]{\\left|#1\\right|}\n\\providecommand{\\sqf}[1]{\\paren{#1}^2}\n\\providecommand{\\sq}{^2}\n\\def\\err{\\eps}\n\\providecommand{\\erf}[1]{\\err\\paren{#1}}\n\\renewcommand{\\vec}[1]{\\tilde{#1}}\n\\providecommand{\\v}[1]{\\vec{#1}}\n\\providecommand{\\matr}[1]{\\mathbf{#1}}\n\\def\\mX{\\matr{X}}\n\\def\\mx{\\matr{x}}\n\\def\\vx{\\vec{x}}\n\\def\\vX{\\vec{X}}\n\\def\\vy{\\vec{y}}\n\\def\\vY{\\vec{Y}}\n\\def\\vpi{\\vec{\\pi}}\n\\providecommand{\\mat}[1]{\\mathbf{#1}}\n\\providecommand{\\dsn}[1]{#1_1, \\ldots, #1_n}\n\\def\\X1n{\\dsn{X}}\n\\def\\Xin{\\dsn{X}}\n\\def\\x1n{\\dsn{x}}\n\\def\\'{^{\\top}}\n\\def\\dpr{\\cdot}\n\\def\\Xx1n{X_1=x_1, \\ldots, X_n = x_n}\n\\providecommand{\\dsvn}[2]{#1_1=#2_1, \\ldots, #1_n = #2_n}\n\\providecommand{\\sumn}[1]{\\sum_{#1=1}^n}\n\\def\\sumin{\\sum_{i=1}^n}\n\\def\\sumi1n{\\sum_{i=1}^n}\n\\def\\prodin{\\prod_{i=1}^n}\n\\def\\prodi1n{\\prod_{i=1}^n}\n\\providecommand{\\lp}[2]{#1 \\' 
\\beta}\n\\def\\odds{\\omega}\n\\def\\OR{\\text{OR}}\n\\def\\logodds{\\eta}\n\\def\\oddst{\\text{odds}}\n\\def\\probst{\\text{probs}}\n\\def\\probt{\\text{probt}}\n\\def\\probit{\\text{probit}}\n\\providecommand{\\oddsf}[1]{\\oddst\\cb{#1}}\n\\providecommand{\\doddsf}[1]{{\\oddst}'\\cb{#1}}\n\\def\\oddsinv{\\text{invodds}}\n\\providecommand{\\oddsinvf}[1]{\\oddsinv\\cb{#1}}\n\\def\\invoddsf{\\oddsinvf}\n\\providecommand{\\doddsinvf}[1]{{\\oddsinv}'\\cb{#1}}\n\\def\\dinvoddsf{\\doddsinvf}\n\\def\\haz{h}\n\\def\\cuhaz{H}\n\\def\\incidence{\\bar{\\haz}}\n\\def\\phaz{\\Expf{\\haz}}\n\n\n\n\n\n```{=html}\n\n```\n\n\n\n\n\n---\n\n:::{.callout-note}\nThis content is adapted from:\n\n- @dobson4e, Chapters 2-6\n- @dunn2018generalized, Chapters 2-3\n- @vittinghoff2e, Chapter 4\n\nThere are numerous textbooks specifically for linear regression, including:\n\n- @kutner2005applied: used for UCLA Biostatistics MS level linear models class\n- @chatterjee2015regression: used for Stanford MS-level linear models class\n- @seber2012linear: used for UCLA Biostatistics PhD level linear models class and UC Davis STA 108.\n- @kleinbaum2014applied: same first author as @kleinbaum2010logistic and @kleinbaum2012survival\n- @weisberg2005applied\n- *Linear Models with R* [@Faraway2025-io]\n\n\n## Overview\n\n### Why this course includes linear regression {.smaller}\n\n:::{.fragment .fade-in-then-semi-out}\n* This course is about *generalized linear models* (for non-Gaussian outcomes)\n:::\n\n:::{.fragment .fade-in-then-semi-out}\n* UC Davis STA 108 (\"Applied Statistical Methods: Regression Analysis\") is a prerequisite for this course, so everyone here should have some understanding of linear regression already.\n:::\n\n:::{.fragment .fade-in}\n* We will review linear regression to:\n - make sure everyone is caught up\n - to provide an epidemiological perspective on model interpretation.\n:::\n\n### Chapter overview\n\n* @sec-understand-LMs: how to interpret linear regression models\n\n* 
@sec-est-LMs: how to estimate linear regression models\n\n* @sec-infer-LMs: how to quantify uncertainty about our estimates\n\n* @sec-diagnose-LMs: how to tell if your model is insufficiently complex\n\n\n## Understanding Gaussian Linear Regression Models {#sec-understand-LMs}\n\n### Motivating example: birthweights and gestational age {.smaller}\n\nSuppose we want to learn about the distributions of birthweights (*outcome* $Y$) for (human) babies born at different gestational ages (*covariate* $A$) and with different chromosomal sexes (*covariate* $S$) (@dobson4e Example 2.2.2).\n\n::::: {.panel-tabset}\n\n#### Data as table\n\n\n\n\n::: {#tbl-birthweight-data1 .cell tbl-cap='`birthweight` data (@dobson4e Example 2.2.2)'}\n\n```{.r .cell-code}\nlibrary(dobson)\ndata(\"birthweight\", package = \"dobson\")\nbirthweight |> knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n| boys gestational age| boys weight| girls gestational age| girls weight|\n|--------------------:|-----------:|---------------------:|------------:|\n| 40| 2968| 40| 3317|\n| 38| 2795| 36| 2729|\n| 40| 3163| 40| 2935|\n| 35| 2925| 38| 2754|\n| 36| 2625| 42| 3210|\n| 37| 2847| 39| 2817|\n| 41| 3292| 40| 3126|\n| 40| 3473| 37| 2539|\n| 37| 2628| 36| 2412|\n| 38| 3176| 38| 2991|\n| 40| 3421| 39| 2875|\n| 38| 2975| 40| 3231|\n\n\n:::\n:::\n\n\n\n\n#### Reshape data for graphing\n\n\n\n\n::: {#tbl-birthweight-data2 .cell tbl-cap='`birthweight` data reshaped'}\n\n```{.r .cell-code}\nbw = \n birthweight |> \n pivot_longer(\n cols = everything(),\n names_to = c(\"sex\", \".value\"),\n names_sep = \"s \"\n ) |> \n rename(age = `gestational age`) |> \n mutate(\n sex = sex |> \n case_match(\n \"boy\" ~ \"male\",\n \"girl\" ~ \"female\") |> \n factor(levels = c(\"female\", \"male\")))\n\nbw\n```\n\n::: {.cell-output-display}\n`````{=html}\n
\n \n
\n`````\n:::\n:::\n\n\n\n\n#### Data as graph\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot1 = bw |> \n ggplot(aes(\n x = age, \n y = weight,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"Birthweight (grams)\") +\n theme(legend.position = \"bottom\") +\n # expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot1 + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![`birthweight` data (@dobson4e Example 2.2.2)](Linear-models-overview_files/figure-html/fig-plot-birthweight1-1.png){#fig-plot-birthweight1 width=672}\n:::\n:::\n\n\n\n\n:::::\n\n---\n\n#### Data notation\n\nLet's define some notation to represent this data.\n\n- $Y$: birthweight (measured in grams)\n- $S$: chromosomal sex: \"male\" (XY) or \"female\" (XX)\n- $M$: indicator variable for $S$ = \"male\"^[$M$ is implicitly a deterministic function of $S$]\n- $M = 0$ if female (XX)\n- $M = 1$ if male (XY)\n- $F$: indicator variable for $S$ = \"female\"^[$F$ is implicitly a deterministic function of $S$]\n- $F = 1$ if female (XX)\n- $F = 0$ if male (XY)\n\n- $A$: estimated gestational age at birth (measured in weeks).\n\n::: callout-note\nFemale is the **reference level** for the categorical variable $S$ \n(chromosomal sex) and corresponding indicator variable $M$ . 
\nThe choice of a reference level is arbitrary and does not limit what \nwe can do with the resulting model; \nit only makes it more computationally convenient to make inferences \nabout comparisons involving that reference group.\n:::\n\n### Parallel lines regression\n\nWe don't have enough data to model the distribution of birth weight \nseparately for each combination of gestational age and sex, \nso let's instead consider a (relatively) simple model for how that \ndistribution varies with gestational age and sex:\n\n$$p(Y=y|A=a,S=s) \\siid N(\\mu(a,s), \\sigma^2)$$\n\n$$\n\\ba\n\\mu(a,s)\n&\\eqdef \\Exp{Y|A=a, S=s} \\\\\n&= \\beta_0 + \\beta_A a+ \\beta_M m\n\\ea\n$$ {#eq-lm-parallel}\n\n:::{.notes}\n\n@tbl-lm-parallel shows the parameter estimates from R.\n@fig-parallel-fit1 shows the estimated model, superimposed on the data.\n\n:::\n\n::: {.column width=40%}\n\n\n\n\n::: {#tbl-lm-parallel .cell tbl-cap='Estimate of [Model @eq-lm-parallel] for `birthweight` data'}\n\n```{.r .cell-code}\nbw_lm1 = lm(\n formula = weight ~ sex + age, \n data = bw)\n\nbw_lm1 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------|:--------:|\n|(Intercept) | -1773.32 |\n|sex (female) | 0.00 |\n|sex (male) | 163.04 |\n|age | 120.89 |\n\n\n:::\n:::\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=50%}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(`E[Y|X=x]` = fitted(bw_lm1)) |> \n arrange(sex, age)\n\nplot2 = \n plot1 %+% bw +\n geom_line(aes(y = `E[Y|X=x]`))\n\nprint(plot2)\n\n```\n\n::: {.cell-output-display}\n![Parallel-slopes model of birthweight](Linear-models-overview_files/figure-html/fig-parallel-fit1-1.png){#fig-parallel-fit1 width=672}\n:::\n:::\n\n\n\n\n:::\n\n---\n\n#### Model assumptions and predictions\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n::: 
{#exr-pred-fem-parallel}\n\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n\n\n\n\n::: {#tbl-coef-model1 .cell tbl-cap='Estimated coefficients for [model @eq-lm-parallel]'}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm1)[\"(Intercept)\"] + coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n# print(pred_female)\n### built-in prediction: \n# predict(bw_lm1, newdata = tibble(sex = \"female\", age = 36))\n```\n:::\n\n\n\n\n$$\n\\ba\nE[Y|M = 0, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 \\\\\n&= 2578.8739\n\\ea\n$$\n:::\n\n---\n\n:::{#exr-pred-male-parallel}\n\nWhat's the mean birthweight for a male born at 36 weeks?\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm1)[\"(Intercept)\"] + \n coef(bw_lm1)[\"sexmale\"] + \n coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 \\\\\n&= 2741.9132\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-sex-parallel-1}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= \n2741.9132 - 2578.8739\\\\\n&=\n163.0393\n\\end{aligned}\n$$\n\nShortcut:\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36) - \n(\\beta_0 + 
\\beta_M \\cdot 0+ \\beta_A \\cdot 36) \\\\\n&= \\beta_M \\\\ \n&= 163.0393\n\\end{aligned}\n$$\n\n:::\n\n:::{.notes}\n\nNote that age doesn't show up in this difference: in other words, according to this model, the difference between females and males with the same gestational age is the same for every age.\n\nThat's an assumption of the model; it's built-in to the parametric structure, even before we plug in the estimated values of those parameters.\n\nThat's why the lines are parallel.\n\n:::\n\n### Interactions {.smaller}\n\n:::{.notes}\nWhat if we don't like that parallel lines assumption?\n\nThen we need to allow an \"interaction\" between age $A$ and sex $S$:\n:::\n\n$$\nE[Y|A=a, S=s] = \\beta_0 + \\beta_A a+ \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$ {#eq-BW-lm-interact}\n\n::: notes\nNow, the slope of mean birthweight $E[Y|A,S]$ with respect to gestational age $A$ depends on the value of sex $S$.\n:::\n\n::: {.column width=40% .smaller}\n\n\n\n\n::: {#tbl-bw-model-coefs-interact .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n:::\n\n:::{.column width=5%}\n:::\n\n:::{.column width=55%}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-html/fig-bw-interaction-1.png){#fig-bw-interaction width=672}\n:::\n:::\n\n\n\n\n:::\n\n::: {.notes}\nNow we 
can see that the lines aren't parallel.\n:::\n\n---\n\nHere's another way we could rewrite this model (by collecting terms involving $A$):\n\n$$\nE[Y|A, M] = \\beta_0 + \\beta_M M+ (\\beta_A + \\beta_{AM} M) A\n$$\n\n::: callout-note\nIf you want to understand a coefficient in a model with interactions, collect terms for the corresponding variable, and you will see what other variables are interacting with the variable you are interested in.\n:::\n\n:::{.notes}\nIn this case, the variable $M$ is interacting with $A$. So the slope of $Y$ with respect to $A$ depends on the value of $M$.\n\nAccording to this model, there is no such thing as \"*the* slope of birthweight with respect to age\". There are two slopes, one for each sex.^[using the definite article \"the\" would mean there is only one slope.] We can only talk about \"the slope of birthweight with respect to age among males\" and \"the slope of birthweight with respect to age among females\".\n\nThen: that coefficient is the difference in means per unit change in its corresponding variable, when the other collected variables are set to 0.\n:::\n\n---\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n:::{#exr-pred-fem-interact}\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n:::\n\n---\n\n::: {.solution}\n\\ \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm2)[\"(Intercept)\"] + coef(bw_lm2)[\"age\"]*36\n```\n:::\n\n\n\n\n$$\nE[Y|M = 0, A = 36] = \n\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36 \n= 2552.7333\n$$ \n\n:::\n\n---\n\n:::{#exr-pred-interact-male_36}\nWhat's the mean birthweight for a male born at 36 weeks?\n\n:::\n\n---\n\n::: solution\n\\ \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm2)[\"(Intercept)\"] + \n coef(bw_lm2)[\"sexmale\"] + \n coef(bw_lm2)[\"age\"]*36 + \n coef(bw_lm2)[\"sexmale:age\"] * 36\n```\n:::\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, 
A = 36]\n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36\\\\\n&= 2762.7069\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-gender-interact}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\ \n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36)\\\\ \n&\\ \\ \\ \\ \\ -(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36) \\\\\n&= \\beta_M + \\beta_{AM}\\cdot 36\\\\\n&= 209.9736\n\\end{aligned}\n$$\n:::\n\n:::{.notes}\nNote that age now does show up in the difference: in other words, according to this model, the difference in mean birthweights between females and males with the same gestational age can vary by gestational age.\n\nThat's how the lines in the graph ended up non-parallel.\n\n:::\n\n### Stratified regression {.smaller}\n\n:::{.notes}\nWe could re-write the interaction model as a stratified model, with a slope and intercept for each sex:\n:::\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_M m + \\beta_{AM} (a \\cdot m) + \n\\beta_F f + \\beta_{AF} (a \\cdot f)\n$$ {#eq-model-strat}\n\nCompare this stratified model with our interaction model, @eq-BW-lm-interact:\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_0 + \\beta_A a + \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$\n\n::: notes\n\nIn the stratified model, the intercept term $\\beta_0$ has been relabeled as $\\beta_F$.\n\n:::\n\n::: {.column width=45%}\n\n\n\n::: {#tbl-bw-model-coefs-interact2 .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 
0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=45%}\n\n\n\n\n::: {#tbl-bw-model-coefs-strat .cell tbl-cap='Birthweight model - stratified betas'}\n\n```{.r .cell-code}\nbw_lm_strat = \n bw |> \n lm(\n formula = weight ~ sex + sex:age - 1, \n data = _)\n\nbw_lm_strat |> \n parameters() |>\n print_md(\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------------|:--------:|\n|sex (female) | -2141.67 |\n|sex (male) | -1268.67 |\n|sex (female) × age | 130.40 |\n|sex (male) × age | 111.98 |\n\n\n:::\n:::\n\n\n\n\n:::\n\n### Curved-line regression\n\n::: notes\nIf we transform some of our covariates ($X$s) and plot the resulting model on the original covariate scale, we end up with curved regression lines:\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm3 = lm(weight ~ sex:log(age) - 1, data = bw)\nlibrary(palmerpenguins)\n\nggpenguins <- \n palmerpenguins::penguins |> \n dplyr::filter(species == \"Adelie\") |> \n ggplot(\n aes(x = bill_length_mm , y = body_mass_g)) +\n geom_point() + \n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\nggpenguins2 = ggpenguins +\n stat_smooth(\n method = \"lm\",\n formula = y ~ log(x),\n geom = \"smooth\") +\n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\n\nggpenguins2 |> print()\n```\n\n::: {.cell-output-display}\n![`palmerpenguins` model with `bill_length` entering on log scale](Linear-models-overview_files/figure-html/fig-penguins-log-x-1.png){#fig-penguins-log-x width=672}\n:::\n:::\n\n\n\n\n## Estimating Linear Models via Maximum Likelihood {#sec-est-LMs}\n\n### Likelihood, log-likelihood, and score functions for linear regression {.smaller}\n\n:::{.notes}\n\nIn EPI 203 and @sec-intro-MLEs, we learned how to fit outcome-only models of the form $p(X=x|\\theta)$ to iid data $\\vx = (x_1,…,x_n)$ using maximum likelihood 
estimation.\n\nNow, we apply the same procedure to linear regression models:\n\n:::\n\n$$\n\\mathcal L(\\vec y|\\mat x,\\beta, \\sigma^2) = \n\\prod_{i=1}^n (2\\pi\\sigma^2)^{-1/2} \n\\exp{-\\frac{1}{2\\sigma^2}(y_i - \\vec{x_i}'\\beta)^2}\n$$ {#eq-linreg-lik}\n\n$$\n\\ell(\\vec y|\\mat x,\\beta, \\sigma^2) \n= -\\frac{n}{2}\\log{2\\pi\\sigma^2} - \n\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i - \\vec{x_i}' \\beta)^2\n$$ {#eq-linreg-loglik}\n\n$$\n\\ell'_{\\beta}(\\vec y|\\mat x,\\beta, \\sigma^2) \n= - \n\\frac{1}{2\\sigma^2}\\deriv{\\beta}\n\\paren{\\sum_{i=1}^n (y_i - \\vec{x_i}' \\beta)^2}\n$$ {#eq-linreg-score}\n\n---\n\n::: notes\nLet's switch to matrix-vector notation:\n:::\n\n$$\n\\sum_{i=1}^n (y_i - \\vx_i' \\vb)^2 \n= (\\vy - \\mX\\vb)'(\\vy - \\mX\\vb)\n$$\n\n---\n\nSo\n\n$$\n\\begin{aligned}\n(\\vy - \\mX\\vb)'(\\vy - \\mX\\vb) \n&= (\\vy' - \\vb'X')(\\vy - \\mX\\vb)\n\\\\ &= y'y - \\vb'X'y - y'\\mX\\vb +\\vb'\\mX'\\mX\\beta\n\\\\ &= y'y - 2y'\\mX\\beta +\\beta'\\mX'\\mX\\beta\n\\end{aligned}\n$$\n\n### Deriving the linear regression score function\n\n::: notes\nWe will use some results from [vector calculus](math-prereqs.qmd#sec-vector-calculus):\n:::\n\n$$\n\\begin{aligned}\n\\deriv{\\beta}\\paren{\\sum_{i=1}^n (y_i - x_i' \\beta)^2} \n &= \\deriv{\\beta}(\\vy - X\\beta)'(\\vy - X\\beta)\n\\\\ &= \\deriv{\\beta} (y'y - 2y'X\\beta +\\beta'X'X\\beta)\n\\\\ &= (- 2X'y +2X'X\\beta)\n\\\\ &= - 2X'(y - X\\beta)\n\\\\ &= - 2X'(y - \\Expp[y])\n\\\\ &= - 2X' \\err(y)\n\\end{aligned}\n$${#eq-scorefun-linreg}\n\n---\n\nSo if the score function $\\ell'_{\\beta}(\\beta,\\sigma^2) = 0$, then\n\n$$\n\\begin{aligned}\n0 &= (- 2X'y +2X'X\\beta)\\\\\n2X'y &= 2X'X\\beta\\\\\nX'y &= X'X\\beta\\\\\n(X'X)^{-1}X'y &= \\beta\n\\end{aligned}\n$$\n\n---\n\nThe second derivative matrix $\\ell_{\\beta, \\beta'} ''(\\beta, \\sigma^2;\\mathbf X,\\vy)$ is negative definite at $\\beta = (X'X)^{-1}X'y$, so $\\hat \\beta_{ML} = (X'X)^{-1}X'y$ is the MLE for $\\beta$.\n\n---\n\nSimilarly (not 
shown):\n\n$$\n\\hat\\sigma^2_{ML} = \\frac{1}{n} (Y-X\\hat\\beta)'(Y-X\\hat\\beta)\n$$\n\nAnd\n\n$$\n\\begin{aligned}\n\\mathcal I_{\\beta} &= E[-\\ell_{\\beta, \\beta'} ''(Y|X,\\beta, \\sigma^2)]\\\\\n&= \\frac{1}{\\sigma^2}X'X\n\\end{aligned}\n$$\n\n---\n\nSo:\n\n$$\nVar(\\hat \\beta) \\approx (\\mathcal I_{\\beta})^{-1} = \\sigma^2 (X'X)^{-1}\n$$\n\nand\n\n$$\n\\hat\\beta \\dot \\sim N(\\beta, \\mathcal I_{\\beta}^{-1})\n$$ \n\n:::{.notes}\n\nThese are all results you have hopefully seen before.\n\n:::\n\n---\n\nIn the Gaussian linear regression case, we also have exact results:\n\n$$\n\\frac{\\hat\\beta_j}{\\hse{\\hat\\beta_j}} \\dist t_{n-p}\n$$ \n\n---\n\nIn our model 2 above, $\\heinf(\\beta)$ is:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> vcov()\n#> (Intercept) sexmale age sexmale:age\n#> (Intercept) 1353968 -1353968 -34871.0 34871.0\n#> sexmale -1353968 2596387 34871.0 -67211.0\n#> age -34871 34871 899.9 -899.9\n#> sexmale:age 34871 -67211 -899.9 1743.5\n```\n:::\n\n\n\n\nIf we take the square roots of the diagonals, we get the standard errors listed in the model output:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> vcov() |> diag() |> sqrt()\n#> (Intercept) sexmale age sexmale:age \n#> 1163.60 1611.33 30.00 41.76\n```\n:::\n\n::: {#tbl-mod-intx .cell tbl-cap='Estimated model for `birthweight` data with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\nSo we can do confidence intervals, hypothesis tests, and p-values exactly as in the 
one-variable case we looked at previously.\n\n### Residual Standard Deviation\n\n::: notes\n$\\hs$ represents an *estimate* of the *Residual Standard Deviation* parameter, $\\s$. \nWe can extract $\\hs$ from the fitted model, using the `sigma()` function:\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsigma(bw_lm2)\n#> [1] 180.6\n```\n:::\n\n\n\n\n---\n\n#### $\\s$ is NOT \"Residual standard error\"\n\n::: notes\nIn the `summary.lm()` output, this estimate is labeled as `\"Residual standard error\"`:\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsummary(bw_lm2)\n#> \n#> Call:\n#> lm(formula = weight ~ sex + age + sex:age, data = bw)\n#> \n#> Residuals:\n#> Min 1Q Median 3Q Max \n#> -246.7 -138.1 -39.1 176.6 274.3 \n#> \n#> Coefficients:\n#> Estimate Std. Error t value Pr(>|t|) \n#> (Intercept) -2141.7 1163.6 -1.84 0.08057 . \n#> sexmale 873.0 1611.3 0.54 0.59395 \n#> age 130.4 30.0 4.35 0.00031 ***\n#> sexmale:age -18.4 41.8 -0.44 0.66389 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> Residual standard error: 181 on 20 degrees of freedom\n#> Multiple R-squared: 0.643,\tAdjusted R-squared: 0.59 \n#> F-statistic: 12 on 3 and 20 DF, p-value: 0.000101\n```\n:::\n\n\n\n\n---\n\n::: notes\nHowever, this is a misnomer:\n:::\n\n\n\n\n::: {.cell printr.help.sections='[\"description\",\"note\"]'}\n\n```{.r .cell-code code-fold=\"show\"}\nlibrary(printr) # captures ? documentation\n?stats::sigma\n```\n\n::: {.cell-output-display}\n```{=html}\n
\n\n
sigma R Documentation
\n\n

Extract Residual Standard Deviation 'Sigma'

\n\n

Description

\n\n

Extract the estimated standard deviation of the errors, the\n“residual standard deviation” (misnamed also\n“residual standard error”, e.g., in\nsummary.lm()'s output, from a fitted model).\n

\n

Many classical statistical models have a scale parameter,\ntypically the standard deviation of a zero-mean normal (or Gaussian)\nrandom variable which is denoted as \\sigma.\nsigma(.) extracts the estimated parameter from a fitted\nmodel, i.e., \\hat\\sigma.\n

\n\n\n

Note

\n\n

The misnomer “Residual standard error” has been part of\ntoo many R (and S) outputs to be easily changed there.\n

\n\n
\n\n
\n
\n```\n:::\n:::\n\n\n\n\n## Inference about Gaussian Linear Regression Models {#sec-infer-LMs}\n\n### Motivating example: `birthweight` data\n\nResearch question: is there really an interaction between sex and age?\n\n$H_0: \\beta_{AM} = 0$\n\n$H_A: \\beta_{AM} \\neq 0$\n\n$P(|\\hat\\beta_{AM}| > |-18.4172| \\mid H_0)$ = ?\n\n### Wald tests and CIs {.smaller}\n\nR can give you Wald tests for single coefficients and corresponding CIs:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (female) | 0.00 | | | | |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\nTo understand what's happening, let's replicate these results by hand for the interaction term.\n\n### P-values {.smaller}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Estimate\"]\nse_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Std. 
Error\"]\ndfresid = bw_lm2$df.residual\nt_stat = abs(beta_hat)/se_hat\npval_t = \n pt(-t_stat, df = dfresid, lower.tail = TRUE) +\n pt(t_stat, df = dfresid, lower.tail = FALSE)\n```\n:::\n\n\n\n\n$$\n\\begin{aligned}\n&P\\paren{\n| \\hat \\beta_{AM} | > \n| -18.4172| \\middle| H_0\n} \n\\\\\n&= \\Pr \\paren{\n\\abs{ \\frac{\\hat\\beta_{AM}}{\\hat{SE}(\\hat\\beta_{AM})} } > \n\\abs{ \\frac{-18.4172}{41.7558} } \\middle| H_0\n}\\\\ \n&= \\Pr \\paren{\n\\abs{ T_{20} } > 0.4411 \\middle| H_0\n}\\\\\n&= 0.6639\n\\end{aligned}\n$$ \n\n::: notes\nThis matches the result in the table above.\n:::\n\n### Confidence intervals\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# 97.5th percentile of the t distribution (positive multiplier for a 95% CI)\nq_t = qt(\n p = 0.975, \n df = dfresid, \n lower.tail = TRUE)\n\nconfint_radius_t = \n se_hat * q_t\n\nconfint_t = beta_hat + c(-1,1) * confint_radius_t\n\nprint(confint_t)\n#> [1] -105.52 68.68\n```\n:::\n\n\n\n\n::: notes\nThis also matches.\n:::\n\n### Gaussian approximations\n\nHere are the asymptotic (Gaussian approximation) equivalents:\n\n### P-values {.smaller}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npval_z = pnorm(abs(t_stat), lower = FALSE) * 2\n\nprint(pval_z)\n#> [1] 
0.6592\n```\n:::\n\n\n\n\n### Confidence intervals {.smaller}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nconfint_radius_z = se_hat * qnorm(0.975, lower = TRUE)\nconfint_z = \n beta_hat + c(-1,1) * confint_radius_z\nprint(confint_z)\n#> [1] -100.26 63.42\n```\n:::\n\n\n\n\n### Likelihood ratio statistics\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlogLik(bw_lm2)\n#> 'log Lik.' -156.6 (df=5)\nlogLik(bw_lm1)\n#> 'log Lik.' -156.7 (df=4)\n\nlLR = (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()\ndelta_df = (bw_lm1$df.residual - df.residual(bw_lm2))\n\n\nx_max = 1\n\n```\n:::\n\n\n\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd_lLR = function(x, df = delta_df) dchisq(x, df = df)\n\nchisq_plot = \n ggplot() + \n geom_function(fun = d_lLR) +\n stat_function( fun = d_lLR, xlim = c(lLR, x_max), geom = \"area\", fill = \"gray\") +\n geom_segment(aes(x = lLR, xend = lLR, y = 0, yend = d_lLR(lLR)), col = \"red\") + \n xlim(0.0001,x_max) + \n ylim(0,4) + \n ylab(\"p(X=x)\") + \n xlab(\"log(likelihood ratio) statistic [x]\") +\n theme_classic()\nchisq_plot |> print()\n```\n\n::: {.cell-output-display}\n![Chi-square distribution](Linear-models-overview_files/figure-html/fig-chisq-plot-1.png){#fig-chisq-plot width=672}\n:::\n:::\n\n\n\n\n---\n\nNow we can get the p-value:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(\n q = 2*lLR, \n df = delta_df, \n lower = FALSE) |> \n print()\n#> [1] 0.6298\n```\n:::\n\n\n\n\n\n---\n\nIn practice you don't have to do this by hand; there are functions to do it for you:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# built 
in\nlibrary(lmtest)\nlrtest(bw_lm2, bw_lm1)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|------:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 4| -156.7| -1| 0.2323| 0.6298|\n:::\n:::\n\n\n\n\n## Goodness of fit\n\n### AIC and BIC\n\n::: notes\nWhen we use likelihood ratio tests, we are comparing how well different models fit the data.\n\nLikelihood ratio tests require \"nested\" models: one must be a special case of the other.\n\nIf we have non-nested models, we can instead use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC):\n:::\n\n- AIC = $-2 * \\ell(\\hat\\theta) + 2 * p$\n\n- BIC = $-2 * \\ell(\\hat\\theta) + p * \\text{log}(n)$\n\nwhere $\\ell$ is the log-likelihood of the data evaluated using the parameter estimates $\\hat\\theta$, $p$ is the number of estimated parameters in the model (including $\\hat\\sigma^2$), and $n$ is the number of observations.\n\nYou can calculate these criteria using the `logLik()` function, or use the built-in R functions:\n\n#### AIC in R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n 2*(length(coef(bw_lm2))+1) # sigma counts as a parameter here\n#> [1] 323.2\n\nAIC(bw_lm2)\n#> [1] 323.2\n```\n:::\n\n\n\n\n#### BIC in R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n (length(coef(bw_lm2))+1) * log(nobs(bw_lm2))\n#> [1] 329\n\nBIC(bw_lm2)\n#> [1] 329\n```\n:::\n\n\n\n\nLarge values of AIC and BIC are worse than small values. 
There are no hypothesis tests or p-values associated with these criteria.\n\n### (Residual) Deviance\n\nLet $q$ be the number of distinct covariate combinations in a data set.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique = \n bw |> \n count(sex, age)\n\nn_unique.bw = nrow(bw.X.unique)\n```\n:::\n\n\n\n\nFor example, in the `birthweight` data, there are $q = 12$ unique patterns (@tbl-bw-x-combos).\n\n\n\n\n::: {#tbl-bw-x-combos .cell tbl-cap='Unique covariate combinations in the `birthweight` data, with replicate counts'}\n\n```{.r .cell-code}\nbw.X.unique\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 36| 2|\n|female | 37| 1|\n|female | 38| 2|\n|female | 39| 2|\n|female | 40| 4|\n|female | 42| 1|\n|male | 35| 1|\n|male | 36| 1|\n|male | 37| 2|\n|male | 38| 3|\n|male | 40| 4|\n|male | 41| 1|\n:::\n:::\n\n\n\n\n---\n\n::: {#def-replicates}\n#### Replicates\nIf a given covariate pattern has more than one observation in a dataset, those observations are called **replicates**.\n:::\n\n---\n\n::: {#exm-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nIn the `birthweight` dataset, there are 2 replicates of the combination \"female, age 36\" (@tbl-bw-x-combos).\n\n:::\n\n---\n\n::: {#exr-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nWhich covariate pattern(s) in the `birthweight` data has the most replicates?\n\n:::\n\n---\n\n::: {#sol-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nTwo covariate patterns are tied for most replicates: males at age 40 weeks \nand females at age 40 weeks.\n40 weeks is the usual length for human pregnancy (@polin2011fetal), so this result makes sense.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique |> dplyr::filter(n == max(n))\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 40| 4|\n|male | 40| 4|\n:::\n:::\n\n\n\n\n:::\n\n---\n\n#### Saturated models {.smaller}\n\nThe most complicated model we could fit 
would have one parameter (a mean) for each covariate pattern, plus a variance parameter:\n\n\n\n\n::: {#tbl-bw-model-sat .cell tbl-cap='Saturated model for the `birthweight` data'}\n\n```{.r .cell-code}\nlm_max = \n bw |> \n mutate(age = factor(age)) |> \n lm(\n formula = weight ~ sex:age - 1, \n data = _)\n\nlm_max |> \n parameters() |> \n print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(12) | p |\n|:--------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|sex (male) × age35 | 2925.00 | 187.92 | (2515.55, 3334.45) | 15.56 | < .001 |\n|sex (female) × age36 | 2570.50 | 132.88 | (2280.98, 2860.02) | 19.34 | < .001 |\n|sex (male) × age36 | 2625.00 | 187.92 | (2215.55, 3034.45) | 13.97 | < .001 |\n|sex (female) × age37 | 2539.00 | 187.92 | (2129.55, 2948.45) | 13.51 | < .001 |\n|sex (male) × age37 | 2737.50 | 132.88 | (2447.98, 3027.02) | 20.60 | < .001 |\n|sex (female) × age38 | 2872.50 | 132.88 | (2582.98, 3162.02) | 21.62 | < .001 |\n|sex (male) × age38 | 2982.00 | 108.50 | (2745.60, 3218.40) | 27.48 | < .001 |\n|sex (female) × age39 | 2846.00 | 132.88 | (2556.48, 3135.52) | 21.42 | < .001 |\n|sex (female) × age40 | 3152.25 | 93.96 | (2947.52, 3356.98) | 33.55 | < .001 |\n|sex (male) × age40 | 3256.25 | 93.96 | (3051.52, 3460.98) | 34.66 | < .001 |\n|sex (male) × age41 | 3292.00 | 187.92 | (2882.55, 3701.45) | 17.52 | < .001 |\n|sex (female) × age42 | 3210.00 | 187.92 | (2800.55, 3619.45) | 17.08 | < .001 |\n\n\n:::\n:::\n\n\n\n\nWe call this model the **full**, **maximal**, or **saturated** model for this dataset.\n\nWe can calculate the log-likelihood of this model as usual:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm_max)\n#> 'log Lik.' 
-151.4 (df=13)\n```\n:::\n\n\n\n\nWe can compare this model to our other models using chi-square tests, as usual:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, bw_lm2)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 5| -156.6| -8| 10.36| 0.241|\n:::\n:::\n\n\n\n\nThe likelihood ratio statistic for this test is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell) = 10.3554$$ where:\n\n- $\\ell_{\\text{full}}$ is the log-likelihood of the full model: -151.4016\n- $\\ell$ is the log-likelihood of our comparison model (two slopes, two intercepts): -156.5793\n\nThis statistic is called the **deviance** or **residual deviance** for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.\n\nThe corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us if there's evidence that we missed any important patterns. 
(Remember, a nonsignificant p-value could mean that we didn't miss anything and a more complicated model is unnecessary, or it could mean we just don't have enough data to tell the difference between these models.)\n\n### Null Deviance\n\nSimilarly, the *least* complicated model we could fit would have only one mean parameter, an intercept:\n\n$$\\text E[Y|X=x] = \\beta_0$$ We can fit this model in R like so:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm0 = lm(weight ~ 1, data = bw)\n\nlm0 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(23) | p |\n|:-----------|:-----------:|:-----:|:------------------:|:-----:|:------:|\n|(Intercept) | 2967.67 | 57.58 | (2848.56, 3086.77) | 51.54 | < .001 |\n\n\n:::\n:::\n\n\n\n\nThis model also has a likelihood:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm0)\n#> 'log Lik.' -169 (df=2)\n```\n:::\n\n\n\n\nAnd we can compare it to more complicated models using a likelihood ratio test:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlrtest(bw_lm2, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 2| -169.0| -3| 24.75| 0|\n:::\n:::\n\n\n\n\nThe likelihood ratio statistic for the test comparing the null model to the maximal model is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell_{0}) = 35.1067$$ where:\n\n- $\\ell_{\\text{0}}$ is the log-likelihood of the null model: -168.955\n- $\\ell_{\\text{full}}$ is the log-likelihood of the maximal model: -151.4016\n\nIn R, this test is:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|---:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 2| -169.0| -11| 35.11| 2e-04|\n:::\n:::\n\n\n\n\nThis log-likelihood ratio statistic is called the **null deviance**. 
It tells us whether we have enough data to detect a difference between the null and full models.\n\n## Rescaling\n\n### Rescale age {.smaller}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |>\n mutate(\n `age - mean` = age - mean(age),\n `age - 36wks` = age - 36\n )\n\nlm1c = lm(weight ~ sex + `age - 36wks`, data = bw)\n\nlm2c = lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)\n\nparameters(lm2c, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:------------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|(Intercept) | 2552.73 | 97.59 | (2349.16, 2756.30) | 26.16 | < .001 |\n|sex (male) | 209.97 | 129.75 | (-60.68, 480.63) | 1.62 | 0.121 |\n|age - 36wks | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age - 36wks | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\nCompare with what we got without rescaling:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparameters(bw_lm2, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n## Prediction\n\n### Prediction for linear models\n\n:::{#def-predicted-value}\n#### Predicted value\n\nIn a regression model $\\p(y|x)$, the **predicted value** of $y$ given $x$ is the estimated mean of $Y$ given $X$:\n\n$$\\hat y \\eqdef \\hE{Y|X=x}$$\n:::\n\n---\n\nFor linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value $x_j$ by its corresponding coefficient 
$\\beta_j$ and adding up the results:\n\n$$\n\\begin{aligned}\n\\hat Y &= \\hat E[Y|X=x] \\\\\n&= x'\\hat\\beta \\\\\n&= \\hat\\beta_0\\cdot 1 + \\hat\\beta_1 x_1 + ... + \\hat\\beta_p x_p\n\\end{aligned}\n$$\n\n---\n\n### Example: prediction for the `birthweight` data\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nX = c(1,1,40)\nsum(X * coef(bw_lm1))\n#> [1] 3225\n```\n:::\n\n\n\n\nR has built-in functions for prediction:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = tibble(age = 40, sex = \"male\")\nbw_lm1 |> predict(newdata = x)\n#> 1 \n#> 3225\n```\n:::\n\n\n\n\nIf you don't provide `newdata`, R will use the covariate values from the original dataset:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredict(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\nThese special predictions are called the *fitted values* of the dataset:\n\n:::{#def-fitted-value}\n\nFor a given dataset $(\\vY, \\mX)$ and corresponding fitted model $\\p_{\\hb}(\\vy|\\mx)$, the **fitted value** of $y_i$ is the predicted value of $y$ when $\\vX=\\vx_i$ using the estimated parameters $\\hb$.\n\n:::\n\nR has an extra function to get these values:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfitted(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n### Quantifying uncertainty in predictions\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE)\n#> $fit\n#> 1 \n#> 3225 \n#> \n#> $se.fit\n#> [1] 61.46\n#> \n#> $df\n#> [1] 21\n#> \n#> $residual.scale\n#> [1] 177.1\n```\n:::\n\n\n\n\nThis is a `list()`; you can extract the elements with `$` or `magrittr::use_series()`:\n\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE) |> \n use_series(se.fit)\n#> [1] 61.46\n```\n:::\n\n\n\n\nYou can get **confidence intervals** for $\\E{Y|X=x}$:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> predict(\n newdata = x,\n interval = \"confidence\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 3098| 3353|\n:::\n:::\n\n\n\n\nYou can also get **prediction intervals** for the value of an individual outcome $Y$:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(newdata = x, interval = \"predict\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 2836| 3615|\n:::\n:::\n\n\n\n\nThe warning from the last command is: \"predictions on current data refer to *future* responses\" (since you already know what happened to the current data, and thus don't need to predict it).\n\nSee `?predict.lm` for more.\n\n## Diagnostics {#sec-diagnose-LMs}\n\n:::{.callout-tip}\nThis section is adapted from @dobson4e [§6.2-6.3] and \n@dunn2018generalized [Chapter 3](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3).\n:::\n### Assumptions in linear regression models {.smaller .scrollable}\n\n$$Y|\\vX \\simind N(\\vX'\\b,\\ss)$$\n\n1. Normality: The distribution conditional on a given $X$ value is normal\n\n2. Correct Functional Form: The conditional means have the structure \n\n$$E[Y|\\vec X = \\vec x] = \\vec x'\\beta$$\n3. Homoskedasticity: The variance $\\ss$ is constant (with respect to $\\vx$)\n\n4. 
Independence: The observations are statistically independent\n\n### Direct visualization\n\n::: notes\nThe most direct way to examine the fit of a model is to compare it to the raw observed data.\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-html/fig-bw-interaction2-1.png){#fig-bw-interaction2 width=672}\n:::\n:::\n\n\n\n\n::: notes\nIt's not easy to assess these assumptions from this model.\nIf there are multiple continuous covariates, it becomes even harder to visualize the raw data.\n:::\n\n### Residuals\n\n::: notes\nMaybe we can transform the data and model in some way to make it easier to inspect.\n:::\n:::{#def-resid-noise}\n#### Residual noise\n\nThe **residual noise** in a probabilistic model $p(Y)$ is the difference between an observed value $y$ and its distributional mean:\n\n$$\\eps(y) \\eqdef y - \\Exp{Y}$$ {#eq-def-resid}\n:::\n\n:::{.notes}\nWe use the same notation for residual noise that we used for [errors](estimation.qmd#def-error). \n$\\Exp{Y}$ can be viewed as an estimate of $Y$, before $y$ is observed.\nConversely, each observation $y$ can be viewed as an estimate of $\\Exp{Y}$ (albeit an imprecise one, individually, since $n=1$). 
\n\n:::\n\nWe can rearrange @eq-def-resid to view $y$ as the sum of its mean plus the residual noise:\n\n$$y = \\Exp{Y} + \\eps(y)$$\n\n---\n\n:::{#thm-gaussian-resid-noise}\n#### Residuals in Gaussian models\n\nIf $Y$ has a Gaussian distribution, then $\\err(Y)$ also has a Gaussian distribution, and vice versa.\n:::\n\n:::{.proof}\nLeft to the reader.\n:::\n\n---\n\n:::{#def-resid-fitted}\n#### Residual errors of a fitted model value\n\nThe **residual of a fitted value $\\hat y$** (shorthand: \"residual\") is its [error](estimation.qmd#def-error):\n$$\n\\ba\ne(\\hat y) &\\eqdef \\erf{\\hat y}\n\\\\&= y - \\hat y\n\\ea\n$$\n:::\n\n$e(\\hat y)$ can be seen as the maximum likelihood estimate of the residual noise:\n\n$$\n\\ba\ne(\\hy) &= y - \\hat y\n\\\\ &= \\hat\\eps_{ML}\n\\ea\n$$\n\n---\n\n#### General characteristics of residuals\n\n:::{#thm-resid-unbiased}\nFor [unbiased](estimation.qmd#sec-unbiased-estimators) estimators $\\hth$:\n\n$$\\E{e(y)} = 0$$ {#eq-mean-resid-unbiased}\n$$\\Var{e(y)} \\approx \\ss$$ {#eq-var-resid-unbiased}\n\n:::\n\n:::{.proof}\n\\ \n\n@eq-mean-resid-unbiased:\n\n$$\n\\ba\n\\Ef{e(y)} &= \\Ef{y - \\hat y}\n\\\\ &= \\Ef{y} - \\Ef{\\hat y}\n\\\\ &= \\Ef{y} - \\Ef{y}\n\\\\ &= 0\n\\ea\n$$\n\n@eq-var-resid-unbiased:\n\n$$\n\\ba\n\\Var{e(y)} &= \\Var{y - \\hy}\n\\\\ &= \\Var{y} + \\Var{\\hy} - 2 \\Cov{y, \\hy}\n\\\\ &{\\dot{\\approx}} \\Var{y} + 0 - 2 \\cdot 0\n\\\\ &= \\Var{y}\n\\\\ &= \\ss\n\\ea\n$$\n:::\n\n---\n\n#### Characteristics of residuals in Gaussian models\n\nWith enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance $\\sigma^2$, which we can estimate using $\\hat\\sigma^2$; that is:\n\n$$\ne_i \\siid N(0, \\hat\\sigma^2)\n$$\n\n---\n\n:::{#exm-resid-bw}\n#### residuals in `birthweight` data\n\nR provides a function for residuals:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 242.67 
-139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n:::\n\n:::{#exr-calc-resids}\nCheck R's output by computing the residuals directly.\n:::\n\n:::{.solution}\n\\ \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw$weight - fitted(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 242.67 -139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\nThis matches R's output!\n:::\n\n---\n\n#### Graph the residuals\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = bw |> \n mutate(resids_intxn = \n weight - fitted(bw_lm2))\n\nplot_bw_resid =\n bw |> \n ggplot(aes(\n x = age, \n y = resids_intxn,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"residuals (grams)\") +\n theme(legend.position = \"bottom\") +\n # expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot_bw_resid + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![Residuals of interaction model for `birthweight` data](Linear-models-overview_files/figure-html/fig-resids-intxn-1.png){#fig-resids-intxn width=672}\n:::\n:::\n\n\n\n\n---\n\n:::{#def-stred}\n\n#### Standardized residuals\n\n$$r_i = \\frac{e_i}{\\widehat{SD}(e_i)}$$\n\n:::\n\nHence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,\n\n$$\nr_i \\siid N(0,1)\n$$\n\n### Marginal distributions of residuals\n\nTo look for problems with our model, we can check whether the residuals $e_i$ and standardized residuals $r_i$ look like they have the distributions that they are supposed to have, according to the model.\n\n---\n\n#### Standardized residuals in R\n\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\n\nrstandard(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 1.15982 -0.92601 -0.87479 -0.34723 1.03507 -0.73473 -0.39901 1.43752 \n#> 9 10 11 12 13 14 15 16 \n#> -0.82539 0.30606 0.92807 -0.87616 1.91428 -0.86559 -0.16430 -1.46376 \n#> 17 18 19 20 21 22 23 24 \n#> -1.11016 1.09658 -0.06761 -1.46159 -0.28696 1.58040 1.26717 -0.19805\nresid(bw_lm2)/sigma(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 0.97593 -0.77920 -0.79802 -0.32962 0.98258 -0.70279 -0.38166 1.34357 \n#> 9 10 11 12 13 14 15 16 \n#> -0.77144 0.28606 0.86741 -0.69282 1.51858 -0.76244 -0.15331 -1.36584 \n#> 17 18 19 20 21 22 23 24 \n#> -1.06123 1.04825 -0.06463 -1.34341 -0.26376 1.45262 1.16471 -0.16954\n```\n:::\n\n\n\n\n::: notes\nThese are not quite the same, because R is doing something more complicated and precise to get the standard errors. Let's not worry about those details for now; the difference is pretty small in this case:\n\n:::\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nrstandard_compare_plot = \n tibble(\n x = resid(bw_lm2)/sigma(bw_lm2), \n y = rstandard(bw_lm2)) |> \n ggplot(aes(x = x, y = y)) +\n geom_point() + \n theme_bw() +\n coord_equal() + \n xlab(\"resid(bw_lm2)/sigma(bw_lm2)\") +\n ylab(\"rstandard(bw_lm2)\") +\n geom_abline(\n aes(\n intercept = 0,\n slope = 1, \n col = \"x=y\")) +\n labs(colour=\"\") +\n scale_colour_manual(values=\"red\")\n\nprint(rstandard_compare_plot)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-65-1.png){width=672}\n:::\n:::\n\n\n\n\n---\n\nLet's add these residuals to the `tibble` of our dataset:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n fitted_lm2 = fitted(bw_lm2),\n \n resid_lm2 = resid(bw_lm2),\n # resid_lm2 = weight - fitted_lm2,\n \n std_resid_lm2 = rstandard(bw_lm2),\n # std_resid_lm2 = resid_lm2 / sigma(bw_lm2)\n )\n\nbw |> \n select(\n sex,\n age,\n weight,\n fitted_lm2,\n resid_lm2,\n std_resid_lm2\n )\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| weight| 
fitted_lm2| resid_lm2| std_resid_lm2|\n|:------|---:|------:|----------:|---------:|-------------:|\n|female | 36| 2729| 2553| 176.27| 1.1598|\n|female | 36| 2412| 2553| -140.73| -0.9260|\n|female | 37| 2539| 2683| -144.13| -0.8748|\n|female | 38| 2754| 2814| -59.53| -0.3472|\n|female | 38| 2991| 2814| 177.47| 1.0351|\n|female | 39| 2817| 2944| -126.93| -0.7347|\n|female | 39| 2875| 2944| -68.93| -0.3990|\n|female | 40| 3317| 3074| 242.67| 1.4375|\n|female | 40| 2935| 3074| -139.33| -0.8254|\n|female | 40| 3126| 3074| 51.67| 0.3061|\n|female | 40| 3231| 3074| 156.67| 0.9281|\n|female | 42| 3210| 3335| -125.13| -0.8762|\n|male | 35| 2925| 2651| 274.28| 1.9143|\n|male | 36| 2625| 2763| -137.71| -0.8656|\n|male | 37| 2847| 2875| -27.69| -0.1643|\n|male | 37| 2628| 2875| -246.69| -1.4638|\n|male | 38| 2795| 2987| -191.67| -1.1102|\n|male | 38| 3176| 2987| 189.33| 1.0966|\n|male | 38| 2975| 2987| -11.67| -0.0676|\n|male | 40| 2968| 3211| -242.64| -1.4616|\n|male | 40| 3163| 3211| -47.64| -0.2870|\n|male | 40| 3473| 3211| 262.36| 1.5804|\n|male | 40| 3421| 3211| 210.36| 1.2672|\n|male | 41| 3292| 3323| -30.62| -0.1981|\n:::\n:::\n\n\n\n\n---\n\n::: notes\n\nNow let's build histograms:\n\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid_marginal_hist = \n bw |> \n ggplot(aes(x = resid_lm2)) +\n geom_histogram()\n\nprint(resid_marginal_hist)\n```\n\n::: {.cell-output-display}\n![Marginal distribution of (nonstandardized) residuals](Linear-models-overview_files/figure-html/fig-marg-dist-resid-1.png){#fig-marg-dist-resid width=672}\n:::\n:::\n\n\n\n\n::: notes\nHard to tell with this small amount of data, but I'm a bit concerned that the histogram doesn't show a bell-curve shape.\n\n:::\n\n---\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstd_resid_marginal_hist = \n bw |> \n ggplot(aes(x = std_resid_lm2)) +\n geom_histogram()\n\nprint(std_resid_marginal_hist)\n```\n\n::: {.cell-output-display}\n![Marginal distribution of standardized 
residuals](Linear-models-overview_files/figure-html/fig-marg-stresd-1.png){#fig-marg-stresd width=672}\n:::\n:::\n\n\n\n\n::: notes\nThis looks similar, although the scale of the x-axis got narrower, because we divided by $\\hat\\sigma$ (roughly speaking).\n\nStill hard to tell if the distribution is Gaussian.\n\n:::\n\n---\n\n### QQ plot of standardized residuals\n\n::: notes\nAnother way to assess normality is the QQ plot of the standardized residuals versus normal quantiles:\n\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlibrary(ggfortify) \n# needed to make ggplot2::autoplot() work for `lm` objects\n\nqqplot_lm2_auto = \n bw_lm2 |> \n autoplot(\n which = 2, # options are 1:6; can do multiple at once\n ncol = 1) +\n theme_classic()\n\nprint(qqplot_lm2_auto)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-69-1.png){width=672}\n:::\n:::\n\n\n\n\n::: notes\nIf the Gaussian model were correct, these points should follow the dotted line.\n\nFig 2.4 panel (c) in @dobson4e is a little different; they didn't specify how they produced it, but other statistical analysis systems do things differently from R.\n\nSee also @dunn2018generalized [§3.5.4](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3#Sec14:~:text=3.5.4%20Q%E2%80%93Q%20Plots%20and%20Normality).\n\n:::\n\n---\n\n#### QQ plot - how it's built\n\n::: notes\nLet's construct it by hand:\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = bw |> \n mutate(\n p = (rank(std_resid_lm2) - 1/2)/n(), # plotting positions (i - 1/2)/n; Blom's method would be (i - 3/8)/(n + 1/4)\n expected_quantiles_lm2 = qnorm(p)\n )\n\nqqplot_lm2 = \n bw |> \n ggplot(\n aes(\n x = expected_quantiles_lm2, \n y = std_resid_lm2, \n col = sex, \n shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n theme(legend.position='none') + # removing the plot legend\n ggtitle(\"Normal Q-Q\") +\n xlab(\"Theoretical Quantiles\") + \n ylab(\"Standardized residuals\")\n\n# find the expected line:\n\nps <- c(.25, .75) # reference 
probabilities\na <- quantile(rstandard(bw_lm2), ps) # empirical quantiles\nb <- qnorm(ps) # theoretical quantiles\n\nqq_slope = diff(a)/diff(b)\nqq_intcpt = a[1] - b[1] * qq_slope\n\nqqplot_lm2 = \n qqplot_lm2 +\n geom_abline(slope = qq_slope, intercept = qq_intcpt)\n\nprint(qqplot_lm2)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-70-1.png){width=672}\n:::\n:::\n\n\n\n\n---\n\n### Conditional distributions of residuals\n\nIf our Gaussian linear regression model is correct, the residuals $e_i$ and standardized residuals $r_i$ should have:\n\n- an approximately Gaussian distribution, with:\n- a mean of 0\n- a constant variance\n\nThis should be true **for every** value of $x$.\n\n---\n\nIf we didn't correctly guess the functional form of the linear component of the mean, \n$$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 X_1 + ... + \\beta_p X_p$$\n\nthen the residuals might have nonzero mean.\n\nRegardless of whether we guessed the mean function correctly, the variance of the residuals might differ between values of $x$.\n\n---\n\n#### Residuals versus fitted values\n\n::: notes\nTo look for these issues, we can plot the residuals $e_i$ against the fitted values $\\hat y_i$ (@fig-bw_lm2-resid-vs-fitted).\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 1, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model (@eq-BW-lm-interact): residuals versus fitted values](Linear-models-overview_files/figure-html/fig-bw_lm2-resid-vs-fitted-1.png){#fig-bw_lm2-resid-vs-fitted width=672}\n:::\n:::\n\n\n\n\n::: notes\nIf the model is correct, the blue line should stay flat and close to 0, and the cloud of dots should have the same vertical spread regardless of the fitted value.\n\nIf not, we probably need to change the functional form of the linear component of the mean, $$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 X_1 + ... 
+ \\beta_p X_p$$\n\n:::\n\n---\n\n\n#### Example: PLOS Medicine title length data\n\n(Adapted from @dobson4e, §6.7.1)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(PLOS, package = \"dobson\")\nlibrary(ggplot2)\nfig1 = \n PLOS |> \n ggplot(\n aes(x = authors,\n y = nchar)\n ) +\n geom_point() +\n theme(legend.position = \"bottom\") +\n labs(col = \"\") +\n guides(col=guide_legend(ncol=3))\nfig1\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine* articles](Linear-models-overview_files/figure-html/fig-plos-1.png){#fig-plos width=672}\n:::\n:::\n\n\n\n---\n\n##### Linear fit\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_linear = lm(\n formula = nchar ~ authors, \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig2 = fig1 +\n geom_smooth(\n method = \"lm\", \n fullrange = TRUE,\n aes(col = \"lm(y ~ x)\"))\nfig2\n\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-lm-1.png){#fig-plos-lm-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-lm-2.png){#fig-plos-lm-2 width=672}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with linear model fit\n:::\n\n\n\n---\n\n##### Quadratic fit {.smaller}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_quad = lm(\n formula = nchar ~ authors + I(authors^2), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-quad .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig3 = \n fig2 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2),\n aes(col = \"lm(y ~ x + I(x^2))\")\n )\nfig3\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-lm-quad-1.png){#fig-plos-lm-quad-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs 
fitted](Linear-models-overview_files/figure-html/fig-plos-lm-quad-2.png){#fig-plos-lm-quad-2 width=672}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with quadratic model fit\n:::\n\n\n\n---\n\n##### Linear versus quadratic fits\n\n\n\n::: {#fig-plos-lm-resid2 .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Linear](Linear-models-overview_files/figure-html/fig-plos-lm-resid2-1.png){#fig-plos-lm-resid2-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Quadratic](Linear-models-overview_files/figure-html/fig-plos-lm-resid2-2.png){#fig-plos-lm-resid2-2 width=672}\n:::\n\nResiduals versus fitted plot for linear and quadratic fits to `PLOS` data\n:::\n\n\n\n---\n\n##### Cubic fit\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_cub = lm(\n formula = nchar ~ authors + I(authors^2) + I(authors^3), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-cubic .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig4 = \n fig3 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2) + I(x ^ 3),\n aes(col = \"lm(y ~ x + I(x^2) + I(x ^ 3))\")\n )\nfig4\n\nautoplot(lm_PLOS_cub, which = 1, ncol = 1)\n\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-lm-cubic-1.png){#fig-plos-lm-cubic-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-lm-cubic-2.png){#fig-plos-lm-cubic-2 width=672}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with cubic model fit\n:::\n\n\n\n---\n\n##### Logarithmic fit\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_log = lm(nchar ~ log(authors), data = PLOS)\n```\n:::\n\n::: {#fig-plos-log .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig5 = fig4 + \n geom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ log(x),\n 
aes(col = \"lm(y ~ log(x))\")\n )\nfig5\n\nautoplot(lm_PLOS_log, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-log-1.png){#fig-plos-log-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-log-2.png){#fig-plos-log-2 width=672}\n:::\n\nlogarithmic fit\n:::\n\n\n\n---\n\n##### Model selection {.smaller}\n\n\n\n::: {#tbl-plos-lin-quad-anova .cell tbl-cap='linear vs quadratic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_linear, lm_PLOS_quad)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|----:|------:|\n| 876| 947502| NA| NA| NA| NA|\n| 875| 880950| 1| 66552| 66.1| 0|\n:::\n:::\n\n::: {#tbl-plos-quad-cub-anova .cell tbl-cap='quadratic vs cubic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_quad, lm_PLOS_cub)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|-----:|------:|\n| 875| 880950| NA| NA| NA| NA|\n| 874| 865933| 1| 15018| 15.16| 1e-04|\n:::\n:::\n\n\n\n---\n\n##### AIC/BIC {.smaller}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_quad)\n#> [1] 8568\nAIC(lm_PLOS_cub)\n#> [1] 8555\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_cub)\n#> [1] 8555\nAIC(lm_PLOS_log)\n#> [1] 8544\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nBIC(lm_PLOS_cub)\n#> [1] 8578\nBIC(lm_PLOS_log)\n#> [1] 8558\n```\n:::\n\n\n\n---\n\n##### Extrapolation is dangerous\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfig_all = fig5 +\n xlim(0, 60)\nfig_all\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine*](Linear-models-overview_files/figure-html/fig-plos-multifit-1.png){#fig-plos-multifit width=672}\n:::\n:::\n\n\n\n\n\n---\n\n#### Scale-location plot\n\n::: notes\nWe can also plot the square roots of the absolute values 
of the standardized residuals against the fitted values (@fig-bw-scale-loc).\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 3, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![Scale-location plot of `birthweight` data](Linear-models-overview_files/figure-html/fig-bw-scale-loc-1.png){#fig-bw-scale-loc width=672}\n:::\n:::\n\n\n\n::: notes\nHere, the blue line doesn't need to be near 0, \nbut it should be flat. \nIf not, the residual variance $\\sigma^2$ might not be constant, \nand we might need to transform our outcome $Y$ \n(or use a model that allows non-constant variance).\n:::\n\n---\n\n\n#### Residuals versus leverage\n\n::: notes\n\nWe can also plot our standardized residuals against \"leverage\", which roughly speaking is a measure of how unusual each $x_i$ value is. Very unusual $x_i$ values can have extreme effects on the model fit, so we might want to remove those observations as outliers, particularly if they have large residuals.\n\n:::\n\n\n\n\n::: {.cell labels='fig-bw_lm2_resid-vs-leverage'}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 5, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model with interactions (@eq-BW-lm-interact): residuals versus leverage](Linear-models-overview_files/figure-html/unnamed-chunk-89-1.png){width=672}\n:::\n:::\n\n\n\n\n::: notes\nThe blue line should be relatively flat and close to 0 here.\n:::\n\n---\n\n### Diagnostics constructed by hand\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2),\n residlm2 = weight - predlm2,\n std_resid = residlm2 / sigma(bw_lm2),\n # std_resid_builtin = rstandard(bw_lm2), # uses leverage\n sqrt_abs_std_resid = std_resid |> abs() |> sqrt()\n \n )\n\n```\n:::\n\n\n\n\n##### Residuals vs fitted\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nresid_vs_fit = bw |> \n ggplot(\n aes(x = predlm2, y = residlm2, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n 
geom_hline(yintercept = 0)\n\n```\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-92-1.png){width=672}\n:::\n:::\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-93-1.png){width=672}\n:::\n:::\n\n\n\n:::\n\n##### Standardized residuals vs fitted\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw |> \n ggplot(\n aes(x = predlm2, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-94-1.png){width=672}\n:::\n:::\n\n\n\n\n##### Standardized residuals vs gestational age\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw |> \n ggplot(\n aes(x = age, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-95-1.png){width=672}\n:::\n:::\n\n\n\n\n##### `sqrt(abs(rstandard()))` vs fitted\n\nCompare with `autoplot(bw_lm2, 3)`\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n\nbw |> \n ggplot(\n aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-96-1.png){width=672}\n:::\n:::\n\n\n\n\n## Model selection\n\n(adapted from @dobson4e §6.3.3; for more information on prediction, see @james2013introduction and @rms2e).\n\n::: notes\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\n\nThere are a few possible metrics to consider for 
choosing a \"best\" model.\n:::\n\n### Mean squared error\n\nWe might want to minimize the **mean squared error**, $\\text E[(y-\\hat y)^2]$, for new observations that weren't in our data set when we fit the model.\n\nUnfortunately, $$\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2$$ gives a biased estimate of $\\text E[(y-\\hat y)^2]$ for new data. If we want an unbiased estimate, we will have to be clever.\n\n---\n\n#### Cross-validation\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n```\n:::\n\n\n\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png){width=672}\n:::\n:::\n\n\n\n\n---\n\n##### comparing metrics\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n```\n\n::: {.cell-output-display}\n\n\n|model | cvRMSE| r.squared| adj.r.squared| trainRMSE| 
loglik|\n|:-------|------:|---------:|-------------:|---------:|------:|\n|full | 6.887| 0.4805| 0.3831| 5.956| -61.84|\n|reduced | 6.483| 0.4454| 0.3802| 5.971| -62.49|\n:::\n:::\n\n\n\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(full_model, reduced_model)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|-----:|--:|---------:|-----:|------:|\n| 16| 567.7| NA| NA| NA| NA|\n| 17| 606.0| -1| -38.36| 1.081| 0.3139|\n:::\n:::\n\n\n\n\n---\n\n#### stepwise regression\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 30.301 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. 
\n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n```\n:::\n\n\n\n\n---\n\n#### Lasso\n\n$$\\arg \\min_{\\theta} \\left\\{ -\\llik(\\th) + \\lambda \\sum_{j=1}^p|\\beta_j| \\right\\}$$\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n```\n:::\n\n\n\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(fit, xvar = 'lambda')\n\n```\n\n::: {.cell-output-display}\n![Lasso selection](Linear-models-overview_files/figure-html/fig-carbs-lasso-1.png){#fig-carbs-lasso width=672}\n:::\n:::\n\n\n\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncvfit = cv.glmnet(x,y)\nplot(cvfit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png){width=672}\n:::\n:::\n\n\n\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 33.8049\n#> age . 
\n#> weight -0.1406\n#> protein 1.2176\n```\n:::\n\n\n\n\n\n## Categorical covariates with more than two levels\n\n### Example: `birthweight`\n\nIn the birthweight example, the variable `sex` had only two observed values:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nunique(bw$sex)\n#> [1] female male \n#> Levels: female male\n```\n:::\n\n\n\n\nIf there are more than two observed values, we can't just use a single variable with 0s and 1s.\n\n### \n\n:::{.notes}\nFor example, @tbl-iris-data shows the \n[(in)famous](https://www.meganstodel.com/posts/no-to-iris/) \n`iris` data (@anderson1935irises), \nand @tbl-iris-summary provides summary statistics. \nThe data include three species: \"setosa\", \"versicolor\", and \"virginica\".\n:::\n\n\n\n\n::: {#tbl-iris-data .cell tbl-cap='The `iris` data'}\n\n```{.r .cell-code}\nhead(iris)\n```\n\n::: {.cell-output-display}\n\n\n| Sepal.Length| Sepal.Width| Petal.Length| Petal.Width|Species |\n|------------:|-----------:|------------:|-----------:|:-------|\n| 5.1| 3.5| 1.4| 0.2|setosa |\n| 4.9| 3.0| 1.4| 0.2|setosa |\n| 4.7| 3.2| 1.3| 0.2|setosa |\n| 4.6| 3.1| 1.5| 0.2|setosa |\n| 5.0| 3.6| 1.4| 0.2|setosa |\n| 5.4| 3.9| 1.7| 0.4|setosa |\n:::\n:::\n\n::: {#tbl-iris-summary .cell tbl-cap='Summary statistics for the `iris` data'}\n\n```{.r .cell-code}\nlibrary(table1)\ntable1(\n x = ~ . | Species,\n data = iris,\n overall = FALSE\n)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
setosa
(N=50)
versicolor
(N=50)
virginica
(N=50)
Sepal.Length
Mean (SD)5.01 (0.352)5.94 (0.516)6.59 (0.636)
Median [Min, Max]5.00 [4.30, 5.80]5.90 [4.90, 7.00]6.50 [4.90, 7.90]
Sepal.Width
Mean (SD)3.43 (0.379)2.77 (0.314)2.97 (0.322)
Median [Min, Max]3.40 [2.30, 4.40]2.80 [2.00, 3.40]3.00 [2.20, 3.80]
Petal.Length
Mean (SD)1.46 (0.174)4.26 (0.470)5.55 (0.552)
Median [Min, Max]1.50 [1.00, 1.90]4.35 [3.00, 5.10]5.55 [4.50, 6.90]
Petal.Width
Mean (SD)0.246 (0.105)1.33 (0.198)2.03 (0.275)
Median [Min, Max]0.200 [0.100, 0.600]1.30 [1.00, 1.80]2.00 [1.40, 2.50]
\n
\n```\n\n:::\n:::\n\n\n\n\n---\n\nIf we want to model `Sepal.Length` by species, we could create a variable $X$ that represents \"setosa\" as $X=1$, \"virginica\" as $X=2$, and \"versicolor\" as $X=3$.\n\n\n\n\n::: {#tbl-numeric-coding .cell tbl-cap='`iris` data with numeric coding of species'}\n\n```{.r .cell-code}\ndata(iris) # this step is not always necessary, but ensures you're starting \n# from the original version of a dataset stored in a loaded package\n\niris = \n iris |> \n tibble() |>\n mutate(\n X = case_when(\n Species == \"setosa\" ~ 1,\n Species == \"virginica\" ~ 2,\n Species == \"versicolor\" ~ 3\n )\n )\n\niris |> \n distinct(Species, X)\n```\n\n::: {.cell-output-display}\n\n\n|Species | X|\n|:----------|--:|\n|setosa | 1|\n|versicolor | 3|\n|virginica | 2|\n:::\n:::\n\n\n\n\nThen we could fit a model like:\n\n\n\n\n::: {#tbl-iris-numeric-species .cell tbl-cap='Model of `iris` data with numeric coding of `Species`'}\n\n```{.r .cell-code}\niris_lm1 = lm(Sepal.Length ~ X, data = iris)\niris_lm1 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(148) | p |\n|:-----------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 4.91 | 0.16 | (4.60, 5.23) | 30.83 | < .001 |\n|X | 0.47 | 0.07 | (0.32, 0.61) | 6.30 | < .001 |\n\n\n:::\n:::\n\n\n\n\n### Let's see how that model looks:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot1 = iris |> \n ggplot(\n aes(\n x = X, \n y = Sepal.Length)\n ) +\n geom_point(alpha = .1) +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) +\n theme_bw(base_size = 18)\nprint(iris_plot1)\n\n```\n\n::: {.cell-output-display}\n![Model of `iris` data with numeric coding of `Species`](Linear-models-overview_files/figure-html/fig-iris-numeric-species-model-1.png){#fig-iris-numeric-species-model width=672}\n:::\n:::\n\n\n\n\nWe have forced the model to use a straight line for the three estimated means. 
That is probably not a good idea: the numeric codes are arbitrary labels, so there is no reason the three means should fall on a line.\n\n### Let's see what R does with categorical variables by default:\n\n\n\n\n::: {#tbl-iris-model-factor1 .cell tbl-cap='Model of `iris` data with `Species` as a categorical variable'}\n\n```{.r .cell-code}\niris_lm2 = lm(Sepal.Length ~ Species, data = iris)\niris_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 0.93 | 0.10 | (0.73, 1.13) | 9.03 | < .001 |\n|Species (virginica) | 1.58 | 0.10 | (1.38, 1.79) | 15.37 | < .001 |\n\n\n:::\n:::\n\n\n\n\n### Re-parametrize with no intercept\n\nIf you don't want the default parametrization (a reference-level intercept plus offsets), you can remove the intercept with \"- 1\", as we've seen previously:\n\n\n\n\n::: {#tbl-iris-no-intcpt .cell}\n\n```{.r .cell-code}\niris.lm2b = lm(Sepal.Length ~ Species - 1, data = iris)\niris.lm2b |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|Species (setosa) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 5.94 | 0.07 | (5.79, 6.08) | 81.54 | < .001 |\n|Species (virginica) | 6.59 | 0.07 | (6.44, 6.73) | 90.49 | < .001 |\n\n\n:::\n:::\n\n\n\n\n### Let's see what these new models look like:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot2 = \n iris |> \n mutate(\n predlm2 = predict(iris_lm2)) |> \n arrange(X) |> \n ggplot(aes(x = X, y = Sepal.Length)) +\n geom_point(alpha = .1) +\n geom_line(aes(y = predlm2), col = \"red\") +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) + \n theme_bw(base_size = 18)\n\nprint(iris_plot2)\n\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/fig-iris-no-intcpt-1.png){#fig-iris-no-intcpt 
width=672}\n:::\n:::\n\n\n\n\n### Let's see how R did that:\n\n\n\n\n::: {#tbl-iris-model-matrix-factor .cell}\n\n```{.r .cell-code}\nformula(iris_lm2)\n#> Sepal.Length ~ Species\nmodel.matrix(iris_lm2) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| (Intercept)| Speciesversicolor| Speciesvirginica|\n|-----------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 1| 1| 0|\n| 1| 0| 1|\n:::\n:::\n\n\n\n\nThis is called a \"corner point parametrization\": the intercept is the mean for the reference level (setosa), and each other coefficient is an offset from that mean.\n\n\n\n\n::: {#tbl-iris-group-point-parameterization .cell}\n\n```{.r .cell-code}\nformula(iris.lm2b)\n#> Sepal.Length ~ Species - 1\nmodel.matrix(iris.lm2b) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| Speciessetosa| Speciesversicolor| Speciesvirginica|\n|-------------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 0| 1| 0|\n| 0| 0| 1|\n:::\n:::\n\n\n\n\nThis can be called a \"group point parametrization\": each coefficient is the mean for one species.\n\nThere are more options; see @dobson4e §6.4.1 and the \n[`codingMatrices` package](https://CRAN.R-project.org/package=codingMatrices) \n[vignette](https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf) \n(@venablescodingMatrices).\n\n## Ordinal covariates\n\n(cf. 
@dobson4e §2.4.4)\n\n---\n\n::: notes\nWe can create ordinal variables in R using the `ordered()` function^[or equivalently, `factor(ordered = TRUE)`].\n:::\n\n:::{#exm-ordinal-variable}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl = paste0(\n \"https://regression.ucsf.edu/sites/g/files/tkssra6706/\",\n \"f/wysiwyg/home/data/hersdata.dta\")\nlibrary(haven)\nhers = read_dta(url)\n```\n:::\n\n\n\n::: {#tbl-HERS .cell tbl-cap='HERS dataset'}\n\n```{.r .cell-code}\nhers |> head()\n```\n\n::: {.cell-output-display}\n\n\n| HT| age| raceth| nonwhite| smoking| drinkany| exercise| physact| globrat| poorfair| medcond| htnmeds| statins| diabetes| dmpills| insulin| weight| BMI| waist| WHR| glucose| weight1| BMI1| waist1| WHR1| glucose1| tchol| LDL| HDL| TG| tchol1| LDL1| HDL1| TG1| SBP| DBP| age10|\n|--:|---:|------:|--------:|-------:|--------:|--------:|-------:|-------:|--------:|-------:|-------:|-------:|--------:|-------:|-------:|------:|-----:|-----:|-----:|-------:|-------:|-----:|------:|-----:|--------:|-----:|-----:|---:|---:|------:|-----:|----:|---:|---:|---:|-----:|\n| 0| 70| 2| 1| 0| 0| 0| 5| 3| 0| 0| 1| 1| 0| 0| 0| 73.8| 23.69| 96.0| 0.932| 84| 73.6| 23.63| 93.0| 0.912| 94| 189| 122.4| 52| 73| 201| 137.6| 48| 77| 138| 78| 7.0|\n| 0| 62| 2| 1| 0| 0| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 70.9| 28.62| 93.0| 0.964| 111| 73.4| 28.89| 95.0| 0.964| 78| 307| 241.6| 44| 107| 216| 150.6| 48| 87| 118| 70| 6.2|\n| 1| 69| 1| 0| 0| 0| 0| 3| 3| 0| 0| 1| 0| 1| 0| 0| 102.0| 42.51| 110.2| 0.782| 114| 96.1| 40.73| 103.0| 0.774| 98| 254| 166.2| 57| 154| 254| 156.0| 66| 160| 134| 78| 6.9|\n| 0| 64| 1| 0| 1| 1| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 64.4| 24.39| 87.0| 0.877| 94| 58.6| 22.52| 77.0| 0.802| 93| 204| 116.2| 56| 159| 207| 122.6| 57| 137| 152| 72| 6.4|\n| 0| 65| 1| 0| 0| 0| 0| 2| 3| 0| 0| 0| 0| 0| 0| 0| 57.9| 21.90| 77.0| 0.794| 101| 58.9| 22.28| 76.5| 0.757| 92| 214| 150.6| 42| 107| 235| 172.2| 35| 139| 175| 95| 6.5|\n| 1| 68| 2| 1| 0| 1| 0| 3| 3| 0| 0| 0| 0| 0| 0| 0| 60.9| 29.05| 
96.0| 1.000| 116| 57.7| 27.52| 86.0| 0.910| 115| 212| 137.8| 52| 111| 202| 126.6| 53| 112| 174| 98| 6.8|\n:::\n:::\n\n\n\n\n\n:::\n\n---\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# C(contr = codingMatrices::contr.diff)\n\n```\n:::\n", + "markdown": "---\ndf-print: paged\n---\n\n\n\n\n\n\n# Linear (Gaussian) Models\n\n---\n\n\n\n\n---\n\n### Configuring R {.unnumbered}\n\nFunctions from these packages will be used throughout this document:\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(conflicted) # check for conflicting function definitions\n# library(printr) # inserts help-file output into markdown output\nlibrary(rmarkdown) # Convert R Markdown documents into a variety of formats.\nlibrary(pander) # format tables for markdown\nlibrary(ggplot2) # graphics\nlibrary(ggeasy) # help with graphics\nlibrary(ggfortify) # help with graphics\nlibrary(dplyr) # manipulate data\nlibrary(tibble) # `tibble`s extend `data.frame`s\nlibrary(magrittr) # `%>%` and other additional piping tools\nlibrary(haven) # import Stata files\nlibrary(knitr) # format R output for markdown\nlibrary(tidyr) # Tools to help to create tidy data\nlibrary(plotly) # interactive graphics\nlibrary(dobson) # datasets from Dobson and Barnett 2018\nlibrary(parameters) # format model output tables for markdown\nlibrary(haven) # import Stata files\nlibrary(latex2exp) # use LaTeX in R code (for figures and tables)\nlibrary(fs) # filesystem path manipulations\nlibrary(survival) # survival analysis\nlibrary(survminer) # survival analysis graphics\nlibrary(KMsurv) # datasets from Klein and Moeschberger\nlibrary(parameters) # format model output tables for\nlibrary(webshot2) # convert interactive content to static for pdf\nlibrary(forcats) # functions for categorical variables (\"factors\")\nlibrary(stringr) # functions for dealing with strings\nlibrary(lubridate) # functions for dealing with dates and times\n```\n:::\n\n\n\n\n\nHere are some R settings I use in this document:\n\n\n\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\nrm(list = ls()) # delete any data that's already loaded into R\n\nconflicts_prefer(dplyr::filter)\nggplot2::theme_set(\n ggplot2::theme_bw() + \n # ggplot2::labs(col = \"\") +\n ggplot2::theme(\n legend.position = \"bottom\",\n text = ggplot2::element_text(size = 12, family = \"serif\")))\n\nknitr::opts_chunk$set(message = FALSE)\noptions('digits' = 4)\n\npanderOptions(\"big.mark\", \",\")\npander::panderOptions(\"table.emphasize.rownames\", FALSE)\npander::panderOptions(\"table.split.table\", Inf)\nconflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default\nlegend_text_size = 9\n```\n:::\n\n\n\n\n\n\n\n\n\n\\providecommand{\\cbl}[1]{\\left\\{#1\\right.}\n\\providecommand{\\cb}[1]{\\left\\{#1\\right\\}}\n\\providecommand{\\paren}[1]{\\left(#1\\right)}\n\\providecommand{\\sb}[1]{\\left[#1\\right]}\n\\def\\pr{\\text{p}}\n\\def\\am{\\arg \\max}\n\\def\\argmax{\\arg \\max}\n\\def\\p{\\text{p}}\n\\def\\P{\\text{P}}\n\\def\\ph{\\hat{\\text{p}}}\n\\def\\hp{\\hat{\\text{p}}}\n\\def\\ga{\\alpha}\n\\def\\b{\\beta}\n\\providecommand{\\floor}[1]{\\left \\lfloor{#1}\\right \\rfloor}\n\\providecommand{\\ceiling}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\providecommand{\\ceil}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\def\\Ber{\\text{Ber}}\n\\def\\Bernoulli{\\text{Bernoulli}}\n\\def\\Pois{\\text{Pois}}\n\\def\\Poisson{\\text{Poisson}}\n\\def\\Gaus{\\text{Gaussian}}\n\\def\\Normal{\\text{N}}\n\\def\\NB{\\text{NegBin}}\n\\def\\NegBin{\\text{NegBin}}\n\\def\\vbeta{\\vec \\beta}\n\\def\\vb{\\vec 
\\b}\n\\def\\v0{\\vec{0}}\n\\def\\gb{\\beta}\n\\def\\gg{\\gamma}\n\\def\\gd{\\delta}\n\\def\\eps{\\varepsilon}\n\\def\\om{\\omega}\n\\def\\m{\\mu}\n\\def\\s{\\sigma}\n\\def\\l{\\lambda}\n\\def\\gs{\\sigma}\n\\def\\gm{\\mu}\n\\def\\M{\\text{M}}\n\\def\\gM{\\text{M}}\n\\def\\Mu{\\text{M}}\n\\def\\cd{\\cdot}\n\\def\\cds{\\cdots}\n\\def\\lds{\\ldots}\n\\def\\eqdef{\\stackrel{\\text{def}}{=}}\n\\def\\defeq{\\stackrel{\\text{def}}{=}}\n\\def\\hb{\\hat \\beta}\n\\def\\hl{\\hat \\lambda}\n\\def\\hy{\\hat y}\n\\def\\yh{\\hat y}\n\\def\\V{{\\text{Var}}}\n\\def\\hs{\\hat \\sigma}\n\\def\\hsig{\\hat \\sigma}\n\\def\\hS{\\hat \\Sigma}\n\\def\\hSig{\\hat \\Sigma}\n\\def\\hSigma{\\hat \\Sigma}\n\\def\\hSurv{\\hat{S}}\n\\providecommand{\\hSurvf}[1]{\\hat{S}\\paren{#1}}\n\\def\\dist{\\ \\sim \\ }\n\\def\\ddist{\\ \\dot{\\sim} \\ }\n\\def\\dsim{\\ \\dot{\\sim} \\ }\n\\def\\za{z_{1 - \\frac{\\alpha}{2}}}\n\\def\\cirad{\\za \\cdot \\hse{\\hb}}\n\\def\\ci{\\hb {\\color{red}\\pm} \\cirad}\n\\def\\th{\\theta}\n\\def\\Th{\\Theta}\n\\def\\xbar{\\bar{x}}\n\\def\\hth{\\hat\\theta}\n\\def\\hthml{\\hth_{\\text{ML}}}\n\\def\\ba{\\begin{aligned}}\n\\def\\ea{\\end{aligned}}\n\\def\\ind{⫫}\n\\def\\indpt{⫫}\n\\def\\all{\\forall}\n\\def\\iid{\\text{iid}}\n\\def\\ciid{\\text{ciid}}\n\\def\\simind{\\ \\sim_{\\ind}\\ }\n\\def\\siid{\\ \\sim_{\\iid}\\ }\n\\def\\simiid{\\siid}\n\\def\\distiid{\\siid}\n\\def\\tf{\\therefore}\n\\def\\Lik{\\mathcal{L}}\n\\def\\llik{\\ell}\n\\providecommand{\\llikf}[1]{\\llik \\paren{#1}}\n\\def\\score{\\ell'}\n\\providecommand{\\scoref}[1]{\\score \\paren{#1}}\n\\def\\hess{\\ell''}\n\\def\\hessian{\\ell''}\n\\providecommand{\\hessf}[1]{\\hess \\paren{#1}}\n\\providecommand{\\hessianf}[1]{\\hess 
\\paren{#1}}\n\\providecommand{\\starf}[1]{#1^*}\n\\def\\lik{\\ell}\n\\providecommand{\\est}[1]{\\widehat{#1}}\n\\providecommand{\\esttmp}[1]{{\\widehat{#1}}^*}\n\\def\\esttmpl{\\esttmp{\\lambda}}\n\\def\\cR{\\mathcal{R}}\n\\def\\range{\\mathcal{R}}\n\\def\\Range{\\mathcal{R}}\n\\providecommand{\\rangef}[1]{\\cR(#1)}\n\\def\\~{\\approx}\n\\def\\dapp{\\dot\\approx}\n\\providecommand{\\red}[1]{{\\color{red}#1}}\n\\providecommand{\\deriv}[1]{\\frac{\\partial}{\\partial #1}}\n\\providecommand{\\derivf}[2]{\\frac{\\partial #1}{\\partial #2}}\n\\providecommand{\\blue}[1]{{\\color{blue}#1}}\n\\providecommand{\\green}[1]{{\\color{green}#1}}\n\\providecommand{\\hE}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hExp}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hmu}[1]{\\hat{\\mu}\\sb{#1}}\n\\def\\Expp{\\mathbb{E}}\n\\def\\Ep{\\mathbb{E}}\n\\def\\expit{\\text{expit}}\n\\providecommand{\\expitf}[1]{\\expit\\cb{#1}}\n\\providecommand{\\dexpitf}[1]{\\expit'\\cb{#1}}\n\\def\\logit{\\text{logit}}\n\\providecommand{\\logitf}[1]{\\logit\\cb{#1}}\n\\providecommand{\\E}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Ef}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Exp}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Expf}[1]{\\mathbb{E}\\sb{#1}}\n\\def\\Varr{\\text{Var}}\n\\providecommand{\\var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\varf}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Varf}[1]{\\text{Var}\\paren{#1}}\n\\def\\Covt{\\text{Cov}}\n\\providecommand{\\covh}[1]{\\widehat{\\text{Cov}}\\paren{#1}}\n\\providecommand{\\Cov}[1]{\\Covt \\paren{#1}}\n\\providecommand{\\Covf}[1]{\\Covt 
\\paren{#1}}\n\\def\\varht{\\widehat{\\text{Var}}}\n\\providecommand{\\varh}[1]{\\varht\\paren{#1}}\n\\providecommand{\\varhf}[1]{\\varht\\paren{#1}}\n\\providecommand{\\vc}[1]{\\boldsymbol{#1}}\n\\providecommand{\\sd}[1]{\\text{sd}\\paren{#1}}\n\\providecommand{\\SD}[1]{\\text{SD}\\paren{#1}}\n\\providecommand{\\hSD}[1]{\\widehat{\\text{SD}}\\paren{#1}}\n\\providecommand{\\se}[1]{\\text{se}\\paren{#1}}\n\\providecommand{\\hse}[1]{\\hat{\\text{se}}\\paren{#1}}\n\\providecommand{\\SE}[1]{\\text{SE}\\paren{#1}}\n\\providecommand{\\HSE}[1]{\\widehat{\\text{SE}}\\paren{#1}}\n\\renewcommand{\\log}[1]{\\text{log}\\cb{#1}}\n\\providecommand{\\logf}[1]{\\text{log}\\cb{#1}}\n\\def\\dlog{\\text{log}'}\n\\providecommand{\\dlogf}[1]{\\dlog \\cb{#1}}\n\\renewcommand{\\exp}[1]{\\text{exp}\\cb{#1}}\n\\providecommand{\\expf}[1]{\\exp{#1}}\n\\def\\dexp{\\text{exp}'}\n\\providecommand{\\dexpf}[1]{\\dexp \\cb{#1}}\n\\providecommand{\\e}[1]{\\text{e}^{#1}}\n\\providecommand{\\ef}[1]{\\text{e}^{#1}}\n\\providecommand{\\inv}[1]{\\paren{#1}^{-1}}\n\\providecommand{\\invf}[1]{\\paren{#1}^{-1}}\n\\def\\oinf{I}\n\\def\\Nat{\\mathbb{N}}\n\\providecommand{\\oinff}[1]{\\oinf\\paren{#1}}\n\\def\\einf{\\mathcal{I}}\n\\providecommand{\\einff}[1]{\\einf\\paren{#1}}\n\\def\\heinf{\\hat{\\einf}}\n\\providecommand{\\heinff}[1]{\\heinf \\paren{#1}}\n\\providecommand{\\1}[1]{\\mathbb{1}_{#1}}\n\\providecommand{\\set}[1]{\\cb{#1}}\n\\providecommand{\\pf}[1]{\\p 
\\paren{#1}}\n\\providecommand{\\Bias}[1]{\\text{Bias}\\paren{#1}}\n\\providecommand{\\bias}[1]{\\text{Bias}\\paren{#1}}\n\\def\\ss{\\sigma^2}\n\\providecommand{\\ssqf}[1]{\\sigma^2\\paren{#1}}\n\\providecommand{\\mselr}[1]{\\text{MSE}\\paren{#1}}\n\\providecommand{\\maelr}[1]{\\text{MAE}\\paren{#1}}\n\\providecommand{\\abs}[1]{\\left|#1\\right|}\n\\providecommand{\\sqf}[1]{\\paren{#1}^2}\n\\providecommand{\\sq}{^2}\n\\def\\err{\\eps}\n\\providecommand{\\erf}[1]{\\err\\paren{#1}}\n\\renewcommand{\\vec}[1]{\\tilde{#1}}\n\\providecommand{\\v}[1]{\\vec{#1}}\n\\providecommand{\\matr}[1]{\\mathbf{#1}}\n\\def\\mX{\\matr{X}}\n\\def\\mx{\\matr{x}}\n\\def\\vx{\\vec{x}}\n\\def\\vX{\\vec{X}}\n\\def\\vy{\\vec{y}}\n\\def\\vY{\\vec{Y}}\n\\def\\vpi{\\vec{\\pi}}\n\\providecommand{\\mat}[1]{\\mathbf{#1}}\n\\providecommand{\\dsn}[1]{#1_1, \\ldots, #1_n}\n\\def\\X1n{\\dsn{X}}\n\\def\\Xin{\\dsn{X}}\n\\def\\x1n{\\dsn{x}}\n\\def\\'{^{\\top}}\n\\def\\dpr{\\cdot}\n\\def\\Xx1n{X_1=x_1, \\ldots, X_n = x_n}\n\\providecommand{\\dsvn}[2]{#1_1=#2_1, \\ldots, #1_n = #2_n}\n\\providecommand{\\sumn}[1]{\\sum_{#1=1}^n}\n\\def\\sumin{\\sum_{i=1}^n}\n\\def\\sumi1n{\\sum_{i=1}^n}\n\\def\\prodin{\\prod_{i=1}^n}\n\\def\\prodi1n{\\prod_{i=1}^n}\n\\providecommand{\\lp}[2]{#1 \\' \\beta}\n\\def\\odds{\\omega}\n\\def\\OR{\\text{OR}}\n\\def\\logodds{\\eta}\n\\def\\oddst{\\text{odds}}\n\\def\\probst{\\text{probs}}\n\\def\\probt{\\text{probt}}\n\\def\\probit{\\text{probit}}\n\\providecommand{\\oddsf}[1]{\\oddst\\cb{#1}}\n\\providecommand{\\doddsf}[1]{{\\oddst}'\\cb{#1}}\n\\def\\oddsinv{\\text{invodds}}\n\\providecommand{\\oddsinvf}[1]{\\oddsinv\\cb{#1}}\n\\def\\invoddsf{\\oddsinvf}\n\\providecommand{\\doddsinvf}[1]{{\\oddsinv}'\\cb{#1}}\n\\def\\dinvoddsf{\\doddsinvf}\n\\def\\haz{h}\n\\def\\cuhaz{H}\n\\def\\incidence{\\bar{\\haz}}\n\\def\\phaz{\\Expf{\\haz}}\n\n\n\n\n\n\n\n```{=html}\n\n```\n\n\n\n\n\n\n\n---\n\n:::{.callout-note}\nThis content is adapted from:\n\n- @dobson4e, Chapters 2-6\n- 
@dunn2018generalized, Chapters 2-3\n- @vittinghoff2e, Chapter 4\n\nThere are numerous textbooks specifically for linear regression, including:\n\n- @kutner2005applied: used for UCLA Biostatistics MS level linear models class\n- @chatterjee2015regression: used for Stanford MS-level linear models class\n- @seber2012linear: used for UCLA Biostatistics PhD level linear models class and UC Davis STA 108.\n- @kleinbaum2014applied: same first author as @kleinbaum2010logistic and @kleinbaum2012survival\n- @weisberg2005applied\n- *Linear Models with R* [@Faraway2025-io]\n\n:::\n\n## Overview\n\n### Why this course includes linear regression {.smaller}\n\n:::{.fragment .fade-in-then-semi-out}\n* This course is about *generalized linear models* (for non-Gaussian outcomes)\n:::\n\n:::{.fragment .fade-in-then-semi-out}\n* UC Davis STA 108 (\"Applied Statistical Methods: Regression Analysis\") is a prerequisite for this course, so everyone here should have some understanding of linear regression already.\n:::\n\n:::{.fragment .fade-in}\n* We will review linear regression to:\n - make sure everyone is caught up\n - to provide an epidemiological perspective on model interpretation.\n:::\n\n### Chapter overview\n\n* @sec-understand-LMs: how to interpret linear regression models\n\n* @sec-est-LMs: how to estimate linear regression models\n\n* @sec-infer-LMs: how to quantify uncertainty about our estimates\n\n* @sec-diagnose-LMs: how to tell if your model is insufficiently complex\n\n\n## Understanding Gaussian Linear Regression Models {#sec-understand-LMs}\n\n### Motivating example: birthweights and gestational age {.smaller}\n\nSuppose we want to learn about the distributions of birthweights (*outcome* $Y$) for (human) babies born at different gestational ages (*covariate* $A$) and with different chromosomal sexes (*covariate* $S$) (@dobson4e Example 2.2.2).\n\n::::: {.panel-tabset}\n\n#### Data as table\n\n\n\n\n\n\n::: {#tbl-birthweight-data1 .cell tbl-cap='`birthweight` data 
(@dobson4e Example 2.2.2)'}\n\n```{.r .cell-code}\nlibrary(dobson)\ndata(\"birthweight\", package = \"dobson\")\nbirthweight |> knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n| boys gestational age| boys weight| girls gestational age| girls weight|\n|--------------------:|-----------:|---------------------:|------------:|\n| 40| 2968| 40| 3317|\n| 38| 2795| 36| 2729|\n| 40| 3163| 40| 2935|\n| 35| 2925| 38| 2754|\n| 36| 2625| 42| 3210|\n| 37| 2847| 39| 2817|\n| 41| 3292| 40| 3126|\n| 40| 3473| 37| 2539|\n| 37| 2628| 36| 2412|\n| 38| 3176| 38| 2991|\n| 40| 3421| 39| 2875|\n| 38| 2975| 40| 3231|\n\n\n:::\n:::\n\n\n\n\n\n\n#### Reshape data for graphing\n\n\n\n\n\n\n::: {#tbl-birthweight-data2 .cell tbl-cap='`birthweight` data reshaped'}\n\n```{.r .cell-code}\nbw = \n birthweight |> \n pivot_longer(\n cols = everything(),\n names_to = c(\"sex\", \".value\"),\n names_sep = \"s \"\n ) |> \n rename(age = `gestational age`) |> \n mutate(\n sex = sex |> \n case_match(\n \"boy\" ~ \"male\",\n \"girl\" ~ \"female\") |> \n factor(levels = c(\"female\", \"male\")))\n\nbw\n```\n\n::: {.cell-output-display}\n`````{=html}\n
\n \n
\n`````\n:::\n:::\n\n\n\n\n\n\n#### Data as graph\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot1 = bw |> \n ggplot(aes(\n x = age, \n y = weight,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"Birthweight (grams)\") +\n theme(legend.position = \"bottom\") +\n # expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot1 + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![`birthweight` data (@dobson4e Example 2.2.2)](Linear-models-overview_files/figure-html/fig-plot-birthweight1-1.png){#fig-plot-birthweight1 width=672}\n:::\n:::\n\n\n\n\n\n\n:::::\n\n---\n\n#### Data notation\n\nLet's define some notation to represent this data.\n\n- $Y$: birthweight (measured in grams)\n- $S$: chromosomal sex: \"male\" (XY) or \"female\" (XX)\n- $M$: indicator variable for $S$ = \"male\"^[$M$ is implicitly a deterministic function of $S$]\n- $M = 0$ if female (XX)\n- $M = 1$ if male (XY)\n- $F$: indicator variable for $S$ = \"female\"^[$F$ is implicitly a deterministic function of $S$]\n- $F = 1$ if female (XX)\n- $F = 0$ if male (XY)\n\n- $A$: estimated gestational age at birth (measured in weeks).\n\n::: callout-note\nFemale is the **reference level** for the categorical variable $S$ \n(chromosomal sex) and corresponding indicator variable $M$ . 
\nThe choice of a reference level is arbitrary and does not limit what \nwe can do with the resulting model; \nit only makes it more computationally convenient to make inferences \nabout comparisons involving that reference group.\n:::\n\n### Parallel lines regression\n\nWe don't have enough data to model the distribution of birthweight \nseparately for each combination of gestational age and sex, \nso let's instead consider a (relatively) simple model for how that \ndistribution varies with gestational age and sex:\n\n$$(Y|A=a,S=s) \\simind N(\\mu(a,s), \\sigma^2)$$\n\n$$\n\\ba\n\\mu(a,s)\n&\\eqdef \\Exp{Y|A=a, S=s} \\\\\n&= \\beta_0 + \\beta_A a+ \\beta_M m\n\\ea\n$$ {#eq-lm-parallel}\n\n:::{.notes}\n\n@tbl-lm-parallel shows the parameter estimates from R.\n@fig-parallel-fit1 shows the estimated model, superimposed on the data.\n\n:::\n\n::: {.column width=40%}\n\n\n\n\n\n\n::: {#tbl-lm-parallel .cell tbl-cap='Estimate of [Model @eq-lm-parallel] for `birthweight` data'}\n\n```{.r .cell-code}\nbw_lm1 = lm(\n formula = weight ~ sex + age, \n data = bw)\n\nbw_lm1 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------|:--------:|\n|(Intercept) | -1773.32 |\n|sex (female) | 0.00 |\n|sex (male) | 163.04 |\n|age | 120.89 |\n\n\n:::\n:::\n\n\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=50%}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(`E[Y|X=x]` = fitted(bw_lm1)) |> \n arrange(sex, age)\n\nplot2 = \n plot1 %+% bw +\n geom_line(aes(y = `E[Y|X=x]`))\n\nprint(plot2)\n\n```\n\n::: {.cell-output-display}\n![Parallel-slopes model of birthweight](Linear-models-overview_files/figure-html/fig-parallel-fit1-1.png){#fig-parallel-fit1 width=672}\n:::\n:::\n\n\n\n\n\n\n:::\n\n---\n\n#### Model assumptions and predictions\n\n::: notes\nTo learn what this model is assuming, let's plug in a few 
values.\n:::\n\n::: {#exr-pred-fem-parallel}\n\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n\n\n\n\n\n\n::: {#tbl-coef-model1 .cell tbl-cap='Estimated coefficients for [model @eq-lm-parallel]'}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm1)[\"(Intercept)\"] + coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n# print(pred_female)\n### built-in prediction: \n# predict(bw_lm1, newdata = tibble(sex = \"female\", age = 36))\n```\n:::\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 0, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 \\\\\n&= 2578.8739\n\\ea\n$$\n:::\n\n---\n\n:::{#exr-pred-male-parallel}\n\nWhat's the mean birthweight for a male born at 36 weeks?\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm1)[\"(Intercept)\"] + \n coef(bw_lm1)[\"sexmale\"] + \n coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 \\\\\n&= 2741.9132\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-sex-parallel-1}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= \n2741.9132 - 2578.8739\\\\\n&=\n163.0393\n\\end{aligned}\n$$\n\nShortcut:\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= (\\beta_0 + 
\\beta_M \\cdot 1+ \\beta_A \\cdot 36) - \n(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36) \\\\\n&= \\beta_M \\\\ \n&= 163.0393\n\\end{aligned}\n$$\n\n:::\n\n:::{.notes}\n\nNote that age doesn't show up in this difference: in other words, according to this model, the difference between females and males with the same gestational age is the same for every age.\n\nThat's an assumption of the model; it's built-in to the parametric structure, even before we plug in the estimated values of those parameters.\n\nThat's why the lines are parallel.\n\n:::\n\n### Interactions {.smaller}\n\n:::{.notes}\nWhat if we don't like that parallel lines assumption?\n\nThen we need to allow an \"interaction\" between age $A$ and sex $S$:\n:::\n\n$$\nE[Y|A=a, S=s] = \\beta_0 + \\beta_A a+ \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$ {#eq-BW-lm-interact}\n\n::: notes\nNow, the slope of mean birthweight $E[Y|A,S]$ with respect to gestational age $A$ depends on the value of sex $S$.\n:::\n\n::: {.column width=40% .smaller}\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-interact .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n\n\n:::\n\n:::{.column width=5%}\n:::\n\n:::{.column width=55%}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction 
term](Linear-models-overview_files/figure-html/fig-bw-interaction-1.png){#fig-bw-interaction width=672}\n:::\n:::\n\n\n\n\n\n\n:::\n\n::: {.notes}\nNow we can see that the lines aren't parallel.\n:::\n\n---\n\nHere's another way we could rewrite this model (by collecting terms involving $A$):\n\n$$\nE[Y|A, M] = \\beta_0 + \\beta_M M+ (\\beta_A + \\beta_{AM} M) A\n$$\n\n::: callout-note\nIf you want to understand a coefficient in a model with interactions, collect terms for the corresponding variable, and you will see what other variables are interacting with the variable you are interested in.\n:::\n\n:::{.notes}\nIn this case, the variable $M$ is interacting with $A$. So the slope of $Y$ with respect to $A$ depends on the value of $M$.\n\nAccording to this model, there is no such thing as \"*the* slope of birthweight with respect to age\". There are two slopes, one for each sex.^[using the definite article \"the\" would mean there is only one slope.] We can only talk about \"the slope of birthweight with respect to age among males\" and \"the slope of birthweight with respect to age among females\".\n\nMore generally, each coefficient is the difference in means per unit change in its corresponding variable, when the other variables collected with it are set to 0.\n:::\n\n---\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n:::{#exr-pred-fem-interact}\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n:::\n\n---\n\n::: {.solution}\n\\ \n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm2)[\"(Intercept)\"] + coef(bw_lm2)[\"age\"]*36\n```\n:::\n\n\n\n\n\n\n$$\nE[Y|M = 0, A = 36] = \n\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot (0 \\cdot 36) \n= 2552.7333\n$$ \n\n:::\n\n---\n\n:::{#exr-pred-interact-male_36}\nWhat's the mean birthweight for a male born at 36 weeks?\n\n:::\n\n---\n\n::: solution\n\\ \n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n 
coef(bw_lm2)[\"(Intercept)\"] + \n coef(bw_lm2)[\"sexmale\"] + \n coef(bw_lm2)[\"age\"]*36 + \n coef(bw_lm2)[\"sexmale:age\"] * 36\n```\n:::\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36]\n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36\\\\\n&= 2762.7069\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-gender-interact}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\ \n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36)\\\\ \n&\\ \\ \\ \\ \\ -(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36) \\\\\n&= \\beta_{M} + \\beta_{AM}\\cdot 36\\\\\n&= 209.9736\n\\end{aligned}\n$$\n:::\n\n:::{.notes}\nNote that age now does show up in the difference: in other words, according to this model, the difference in mean birthweights between females and males with the same gestational age can vary by gestational age.\n\nThat's how the lines in the graph ended up non-parallel.\n\n:::\n\n### Stratified regression {.smaller}\n\n:::{.notes}\nWe could re-write the interaction model as a stratified model, with a slope and intercept for each sex:\n:::\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_M m + \\beta_{AM} (a \\cdot m) + \n\\beta_F f + \\beta_{AF} (a \\cdot f)\n$$ {#eq-model-strat}\n\nCompare this stratified model with our interaction model, @eq-BW-lm-interact:\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_0 + \\beta_A a + \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$\n\n::: notes\n\nIn the stratified model, the intercept term $\\beta_0$ has been relabeled as $\\beta_F$, and the age slope $\\beta_A$ as $\\beta_{AF}$. The stratified $\\beta_M$ and $\\beta_{AM}$ are not the same parameters as in the interaction model: they equal the interaction model's $\\beta_0 + \\beta_M$ and $\\beta_A + \\beta_{AM}$, respectively.\n\n:::\n\n::: {.column width=45%}\n\n\n\n\n\n::: {#tbl-bw-model-coefs-interact2 .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma 
= TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=45%}\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-strat .cell tbl-cap='Birthweight model - stratified betas'}\n\n```{.r .cell-code}\nbw_lm_strat = \n bw |> \n lm(\n formula = weight ~ sex + sex:age - 1, \n data = _)\n\nbw_lm_strat |> \n parameters() |>\n print_md(\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------------|:--------:|\n|sex (female) | -2141.67 |\n|sex (male) | -1268.67 |\n|sex (female) × age | 130.40 |\n|sex (male) × age | 111.98 |\n\n\n:::\n:::\n\n\n\n\n\n\n:::\n\n### Curved-line regression\n\n::: notes\nIf we transform some of our covariates ($X$s) and plot the resulting model on the original covariate scale, we end up with curved regression lines:\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm3 = lm(weight ~ sex:log(age) - 1, data = bw)\nlibrary(palmerpenguins)\n\nggpenguins <- \n palmerpenguins::penguins |> \n dplyr::filter(species == \"Adelie\") |> \n ggplot(\n aes(x = bill_length_mm , y = body_mass_g)) +\n geom_point() + \n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\nggpenguins2 = ggpenguins +\n stat_smooth(\n method = \"lm\",\n formula = y ~ log(x),\n geom = \"smooth\") +\n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\n\nggpenguins2 |> print()\n```\n\n::: {.cell-output-display}\n![`palmerpenguins` model with `bill_length` entering on log scale](Linear-models-overview_files/figure-html/fig-penguins-log-x-1.png){#fig-penguins-log-x width=672}\n:::\n:::\n\n\n\n\n\n\n## Estimating Linear Models via Maximum Likelihood {#sec-est-LMs}\n\n### Likelihood, log-likelihood, and score functions for linear regression 
{.smaller}\n\n:::{.notes}\n\nIn EPI 203 and @sec-intro-MLEs, we learned how to fit outcome-only models of the form $p(X=x|\\theta)$ to iid data $\\vx = (x_1,…,x_n)$ using maximum likelihood estimation.\n\nNow, we apply the same procedure to linear regression models:\n\n:::\n\n$$\n\\mathcal L(\\vec y|\\mat x,\\beta, \\sigma^2) = \n\\prod_{i=1}^n (2\\pi\\sigma^2)^{-1/2} \n\\exp{-\\frac{1}{2\\sigma^2}(y_i - \\vec{x_i}'\\beta)^2}\n$$ {#eq-linreg-lik}\n\n$$\n\\ell(\\vec y|\\mat x,\\beta, \\sigma^2) \n= -\\frac{n}{2}\\log{2\\pi\\sigma^2} - \n\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i - \\vec{x_i}' \\beta)^2\n$$ {#eq-linreg-loglik}\n\n$$\n\\ell'_{\\beta}(\\vec y|\\mat x,\\beta, \\sigma^2) \n= - \n\\frac{1}{2\\sigma^2}\\deriv{\\beta}\n\\paren{\\sum_{i=1}^n (y_i - \\vec{x_i}\\' \\beta)^2}\n$$ {#eq-linreg-score}\n\n---\n\n::: notes\nLet's switch to matrix-vector notation:\n:::\n\n$$\n\\sum_{i=1}^n (y_i - \\vx_i\\' \\vb)^2 \n= (\\vy - \\mX\\vb)'(\\vy - \\mX\\vb)\n$$\n\n---\n\nSo\n\n$$\n\\begin{aligned}\n(\\vy - \\mX\\vb)'(\\vy - \\mX\\vb) \n&= (\\vy' - \\vb'\\mX')(\\vy - \\mX\\vb)\n\\\\ &= \\vy'\\vy - \\vb'\\mX'\\vy - \\vy'\\mX\\vb +\\vb'\\mX'\\mX\\vb\n\\\\ &= \\vy'\\vy - 2\\vy'\\mX\\vb +\\vb'\\mX'\\mX\\vb\n\\end{aligned}\n$$\n\n### Deriving the linear regression score function\n\n::: notes\nWe will use some results from [vector calculus](math-prereqs.qmd#sec-vector-calculus):\n:::\n\n$$\n\\begin{aligned}\n\\deriv{\\beta}\\paren{\\sum_{i=1}^n (y_i - x_i' \\beta)^2} \n &= \\deriv{\\beta}(\\vy - X\\beta)'(\\vy - X\\beta)\n\\\\ &= \\deriv{\\beta} (y'y - 2y'X\\beta +\\beta'X'X\\beta)\n\\\\ &= (- 2X'y +2X'X\\beta)\n\\\\ &= - 2X'(y - X\\beta)\n\\\\ &= - 2X'(y - \\Expp[y])\n\\\\ &= - 2X' \\err(y)\n\\end{aligned}\n$$ {#eq-scorefun-linreg}\n\n---\n\nSo if the score $\\ell'_{\\beta}(\\beta,\\sigma^2) = 0$, then\n\n$$\n\\begin{aligned}\n0 &= (- 2X'y +2X'X\\beta)\\\\\n2X'y &= 2X'X\\beta\\\\\nX'y &= X'X\\beta\\\\\n(X'X)^{-1}X'y &= \\beta\n\\end{aligned}\n$$\n\n---\n\nThe second derivative matrix $\\ell_{\\beta, \\beta'} ''(\\beta, 
\\sigma^2;\\mathbf X,\\vy)$ is negative definite at $\\beta = (X'X)^{-1}X'y$, so $\\hat \\beta_{ML} = (X'X)^{-1}X'y$ is the MLE for $\\beta$.\n\n---\n\nSimilarly (not shown):\n\n$$\n\\hat\\sigma^2_{ML} = \\frac{1}{n} (Y-X\\hat\\beta)'(Y-X\\hat\\beta)\n$$\n\nAnd\n\n$$\n\\begin{aligned}\n\\mathcal I_{\\beta} &= E[-\\ell_{\\beta, \\beta'} ''(Y|X,\\beta, \\sigma^2)]\\\\\n&= \\frac{1}{\\sigma^2}X'X\n\\end{aligned}\n$$\n\n---\n\nSo:\n\n$$\nVar(\\hat \\beta) \\approx (\\mathcal I_{\\beta})^{-1} = \\sigma^2 (X'X)^{-1}\n$$\n\nand\n\n$$\n\\hat\\beta \\dot \\sim N(\\beta, \\mathcal I_{\\beta}^{-1})\n$$ \n\n:::{.notes}\n\nThese are all results you have hopefully seen before.\n\n:::\n\n---\n\nIn the Gaussian linear regression case, we also have exact results:\n\n$$\n\\frac{\\hat\\beta_j}{\\hse{\\hat\\beta_j}} \\dist t_{n-p}\n$$ \n\n---\n\nIn our model 2 above, $\\heinf(\\beta)$ is:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> vcov()\n#> (Intercept) sexmale age sexmale:age\n#> (Intercept) 1353968 -1353968 -34871.0 34871.0\n#> sexmale -1353968 2596387 34871.0 -67211.0\n#> age -34871 34871 899.9 -899.9\n#> sexmale:age 34871 -67211 -899.9 1743.5\n```\n:::\n\n\n\n\n\n\nIf we take the square roots of the diagonals, we get the standard errors listed in the model output:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> vcov() |> diag() |> sqrt()\n#> (Intercept) sexmale age sexmale:age \n#> 1163.60 1611.33 30.00 41.76\n```\n:::\n\n::: {#tbl-mod-intx .cell tbl-cap='Estimated model for `birthweight` data with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 
|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\nSo we can do confidence intervals, hypothesis tests, and p-values exactly as in the one-variable case we looked at previously.\n\n### Residual Standard Deviation\n\n::: notes\n$\\hs$ represents an *estimate* of the *Residual Standard Deviation* parameter, $\\s$. \nWe can extract $\\hs$ from the fitted model, using the `sigma()` function:\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsigma(bw_lm2)\n#> [1] 180.6\n```\n:::\n\n\n\n\n\n\n---\n\n#### $\\s$ is NOT \"Residual standard error\"\n\n::: notes\nIn the `summary.lm()` output, this estimate is labeled as `\"Residual standard error\"`:\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsummary(bw_lm2)\n#> \n#> Call:\n#> lm(formula = weight ~ sex + age + sex:age, data = bw)\n#> \n#> Residuals:\n#> Min 1Q Median 3Q Max \n#> -246.7 -138.1 -39.1 176.6 274.3 \n#> \n#> Coefficients:\n#> Estimate Std. Error t value Pr(>|t|) \n#> (Intercept) -2141.7 1163.6 -1.84 0.08057 . \n#> sexmale 873.0 1611.3 0.54 0.59395 \n#> age 130.4 30.0 4.35 0.00031 ***\n#> sexmale:age -18.4 41.8 -0.44 0.66389 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> Residual standard error: 181 on 20 degrees of freedom\n#> Multiple R-squared: 0.643,\tAdjusted R-squared: 0.59 \n#> F-statistic: 12 on 3 and 20 DF, p-value: 0.000101\n```\n:::\n\n\n\n\n\n\n---\n\n::: notes\nHowever, this is a misnomer:\n:::\n\n\n\n\n\n\n::: {.cell printr.help.sections='[\"description\",\"note\"]'}\n\n```{.r .cell-code code-fold=\"show\"}\nlibrary(printr) # captures ? documentation\n?stats::sigma\n```\n\n::: {.cell-output-display}\n```{=html}\n
\n\nsigma (R Documentation)\n\n
Extract Residual Standard Deviation 'Sigma'\n\n
Description\n\n
Extract the estimated standard deviation of the errors, the\n“residual standard deviation” (misnamed also\n“residual standard error”, e.g., in\nsummary.lm()'s output, from a fitted model).\n\n
Many classical statistical models have a scale parameter,\ntypically the standard deviation of a zero-mean normal (or Gaussian)\nrandom variable which is denoted as \\sigma.\nsigma(.) extracts the estimated parameter from a fitted\nmodel, i.e., \\hat\\sigma.\n\n
Note\n\n
The misnomer “Residual standard error” has been part of\ntoo many R (and S) outputs to be easily changed there.\n\n
\n```\n:::\n:::\n\n\n\n\n\n\n## Inference about Gaussian Linear Regression Models {#sec-infer-LMs}\n\n### Motivating example: `birthweight` data\n\nResearch question: is there really an interaction between sex and age?\n\n$H_0: \\beta_{AM} = 0$\n\n$H_A: \\beta_{AM} \\neq 0$\n\n$P(|\\hat\\beta_{AM}| > |-18.4172| \\mid H_0)$ = ?\n\n### Wald tests and CIs {.smaller}\n\nR can give you Wald tests for single coefficients and corresponding CIs:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (female) | 0.00 | | | | |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\nTo understand what's happening, let's replicate these results by hand for the interaction term.\n\n### P-values {.smaller}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Estimate\"]\nse_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Std. 
Error\"]\ndfresid = bw_lm2$df.residual\nt_stat = abs(beta_hat)/se_hat\npval_t = \n pt(-t_stat, df = dfresid, lower.tail = TRUE) +\n pt(t_stat, df = dfresid, lower.tail = FALSE)\n```\n:::\n\n\n\n\n\n\n$$\n\\begin{aligned}\n&P\\paren{\n| \\hat \\beta_{AM} | > \n| -18.4172| \\middle| H_0\n} \n\\\\\n&= \\Pr \\paren{\n\\abs{ \\frac{\\hat\\beta_{AM}}{\\hat{SE}(\\hat\\beta_{AM})} } > \n\\abs{ \\frac{-18.4172}{41.7558} } \\middle| H_0\n}\\\\ \n&= \\Pr \\paren{\n\\abs{ T_{20} } > 0.4411 \\middle| H_0\n}\\\\\n&= 0.6639\n\\end{aligned}\n$$ \n\n::: notes\nThis matches the result in the table above.\n:::\n\n### Confidence intervals\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# upper 97.5% quantile of the t distribution (positive), \n# so that the interval comes out as (lower, upper)\nq_t = qt(\n p = 0.975, \n df = dfresid, \n lower.tail = TRUE)\n\nconfint_radius_t = \n se_hat * q_t\n\nconfint_t = beta_hat + c(-1,1) * confint_radius_t\n\nprint(confint_t)\n#> [1] -105.52 68.68\n```\n:::\n\n\n\n\n\n\n::: notes\nThis also matches.\n:::\n\n### Gaussian approximations\n\nHere are the asymptotic (Gaussian approximation) equivalents:\n\n### P-values {.smaller}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npval_z = pnorm(abs(t_stat), lower.tail = FALSE) * 2\n\nprint(pval_z)\n#> 
[1] 0.6592\n```\n:::\n\n\n\n\n\n\n### Confidence intervals {.smaller}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nconfint_radius_z = se_hat * qnorm(0.975, lower = TRUE)\nconfint_z = \n beta_hat + c(-1,1) * confint_radius_z\nprint(confint_z)\n#> [1] -100.26 63.42\n```\n:::\n\n\n\n\n\n\n### Likelihood ratio statistics\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlogLik(bw_lm2)\n#> 'log Lik.' -156.6 (df=5)\nlogLik(bw_lm1)\n#> 'log Lik.' -156.7 (df=4)\n\nlLR = (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()\ndelta_df = (bw_lm1$df.residual - df.residual(bw_lm2))\n\n\nx_max = 1\n\n```\n:::\n\n\n\n\n\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd_lLR = function(x, df = delta_df) dchisq(x, df = df)\n\nchisq_plot = \n ggplot() + \n geom_function(fun = d_lLR) +\n stat_function( fun = d_lLR, xlim = c(lLR, x_max), geom = \"area\", fill = \"gray\") +\n geom_segment(aes(x = lLR, xend = lLR, y = 0, yend = d_lLR(lLR)), col = \"red\") + \n xlim(0.0001,x_max) + \n ylim(0,4) + \n ylab(\"p(X=x)\") + \n xlab(\"log(likelihood ratio) statistic [x]\") +\n theme_classic()\nchisq_plot |> print()\n```\n\n::: {.cell-output-display}\n![Chi-square distribution](Linear-models-overview_files/figure-html/fig-chisq-plot-1.png){#fig-chisq-plot width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\nNow we can get the p-value:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(\n q = 2*lLR, \n df = delta_df, \n lower = FALSE) |> \n print()\n#> [1] 0.6298\n```\n:::\n\n\n\n\n\n\n\n---\n\nIn practice you don't have to do this by hand; there are functions to do it for you:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\n\n# built in\nlibrary(lmtest)\nlrtest(bw_lm2, bw_lm1)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|------:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 4| -156.7| -1| 0.2323| 0.6298|\n:::\n:::\n\n\n\n\n\n\n## Goodness of fit\n\n### AIC and BIC\n\n::: notes\nWhen we use likelihood ratio tests, we are comparing how well different models fit the data.\n\nLikelihood ratio tests require \"nested\" models: one must be a special case of the other.\n\nIf we have non-nested models, we can instead use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC):\n:::\n\n- AIC = $-2 * \\ell(\\hat\\theta) + 2 * p$\n\n- BIC = $-2 * \\ell(\\hat\\theta) + p * \\text{log}(n)$\n\nwhere $\\ell$ is the log-likelihood of the data evaluated using the parameter estimates $\\hat\\theta$, $p$ is the number of estimated parameters in the model (including $\\hat\\sigma^2$), and $n$ is the number of observations.\n\nYou can calculate these criteria using the `logLik()` function, or use the built-in R functions:\n\n#### AIC in R\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n 2*(length(coef(bw_lm2))+1) # sigma counts as a parameter here\n#> [1] 323.2\n\nAIC(bw_lm2)\n#> [1] 323.2\n```\n:::\n\n\n\n\n\n\n#### BIC in R\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n (length(coef(bw_lm2))+1) * log(nobs(bw_lm2))\n#> [1] 329\n\nBIC(bw_lm2)\n#> [1] 329\n```\n:::\n\n\n\n\n\n\nLarge values of AIC and BIC are worse than small values. 
There are no hypothesis tests or p-values associated with these criteria.\n\n### (Residual) Deviance\n\nLet $q$ be the number of distinct covariate combinations in a data set.\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique = \n bw |> \n count(sex, age)\n\nn_unique.bw = nrow(bw.X.unique)\n```\n:::\n\n\n\n\n\n\nFor example, in the `birthweight` data, there are $q = 12$ unique patterns (@tbl-bw-x-combos).\n\n\n\n\n\n\n::: {#tbl-bw-x-combos .cell tbl-cap='Unique covariate combinations in the `birthweight` data, with replicate counts'}\n\n```{.r .cell-code}\nbw.X.unique\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 36| 2|\n|female | 37| 1|\n|female | 38| 2|\n|female | 39| 2|\n|female | 40| 4|\n|female | 42| 1|\n|male | 35| 1|\n|male | 36| 1|\n|male | 37| 2|\n|male | 38| 3|\n|male | 40| 4|\n|male | 41| 1|\n:::\n:::\n\n\n\n\n\n\n---\n\n::: {#def-replicates}\n#### Replicates\nIf a given covariate pattern has more than one observation in a dataset, those observations are called **replicates**.\n:::\n\n---\n\n::: {#exm-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nIn the `birthweight` dataset, there are 2 replicates of the combination \"female, age 36\" (@tbl-bw-x-combos).\n\n:::\n\n---\n\n::: {#exr-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nWhich covariate pattern(s) in the `birthweight` data has the most replicates?\n\n:::\n\n---\n\n::: {#sol-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nTwo covariate patterns are tied for most replicates: males at age 40 weeks \nand females at age 40 weeks.\n40 weeks is the usual length for human pregnancy (@polin2011fetal), so this result makes sense.\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique |> dplyr::filter(n == max(n))\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 40| 4|\n|male | 40| 4|\n:::\n:::\n\n\n\n\n\n\n:::\n\n---\n\n#### Saturated models {.smaller}\n\nThe most 
complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:\n\n\n\n\n\n\n::: {#tbl-bw-model-sat .cell tbl-cap='Saturated model for the `birthweight` data'}\n\n```{.r .cell-code}\nlm_max = \n bw |> \n mutate(age = factor(age)) |> \n lm(\n formula = weight ~ sex:age - 1, \n data = _)\n\nlm_max |> \n parameters() |> \n print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(12) | p |\n|:--------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|sex (male) × age35 | 2925.00 | 187.92 | (2515.55, 3334.45) | 15.56 | < .001 |\n|sex (female) × age36 | 2570.50 | 132.88 | (2280.98, 2860.02) | 19.34 | < .001 |\n|sex (male) × age36 | 2625.00 | 187.92 | (2215.55, 3034.45) | 13.97 | < .001 |\n|sex (female) × age37 | 2539.00 | 187.92 | (2129.55, 2948.45) | 13.51 | < .001 |\n|sex (male) × age37 | 2737.50 | 132.88 | (2447.98, 3027.02) | 20.60 | < .001 |\n|sex (female) × age38 | 2872.50 | 132.88 | (2582.98, 3162.02) | 21.62 | < .001 |\n|sex (male) × age38 | 2982.00 | 108.50 | (2745.60, 3218.40) | 27.48 | < .001 |\n|sex (female) × age39 | 2846.00 | 132.88 | (2556.48, 3135.52) | 21.42 | < .001 |\n|sex (female) × age40 | 3152.25 | 93.96 | (2947.52, 3356.98) | 33.55 | < .001 |\n|sex (male) × age40 | 3256.25 | 93.96 | (3051.52, 3460.98) | 34.66 | < .001 |\n|sex (male) × age41 | 3292.00 | 187.92 | (2882.55, 3701.45) | 17.52 | < .001 |\n|sex (female) × age42 | 3210.00 | 187.92 | (2800.55, 3619.45) | 17.08 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\nWe call this model the **full**, **maximal**, or **saturated** model for this dataset.\n\nWe can calculate the log-likelihood of this model as usual:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm_max)\n#> 'log Lik.' 
-151.4 (df=13)\n```\n:::\n\n\n\n\n\n\nWe can compare this model to our other models using chi-square tests, as usual:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, bw_lm2)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 5| -156.6| -8| 10.36| 0.241|\n:::\n:::\n\n\n\n\n\n\nThe likelihood ratio statistic for this test is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell) = 10.3554$$ where:\n\n- $\\ell_{\\text{full}}$ is the log-likelihood of the full model: -151.4016\n- $\\ell$ is the log-likelihood of our comparison model (two slopes, two intercepts): -156.5793\n\nThis statistic is called the **deviance** or **residual deviance** for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.\n\nThe corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us if there's evidence that we missed any important patterns. 
(Remember, a nonsignificant p-value could mean that we didn't miss anything and a more complicated model is unnecessary, or it could mean we just don't have enough data to tell the difference between these models.)\n\n### Null Deviance\n\nSimilarly, the *least* complicated model we could fit would have only one mean parameter, an intercept:\n\n$$\\text E[Y|X=x] = \\beta_0$$\n\nWe can fit this model in R like so:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm0 = lm(weight ~ 1, data = bw)\n\nlm0 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(23) | p |\n|:-----------|:-----------:|:-----:|:------------------:|:-----:|:------:|\n|(Intercept) | 2967.67 | 57.58 | (2848.56, 3086.77) | 51.54 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\nThis model also has a likelihood:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm0)\n#> 'log Lik.' -169 (df=2)\n```\n:::\n\n\n\n\n\n\nAnd we can compare it to more complicated models using a likelihood ratio test:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlrtest(bw_lm2, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 2| -169.0| -3| 24.75| 0|\n:::\n:::\n\n\n\n\n\n\nThe likelihood ratio statistic for the test comparing the null model to the maximal model is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell_{0}) = 35.1067$$ where:\n\n- $\\ell_{0}$ is the log-likelihood of the null model: -168.955\n- $\\ell_{\\text{full}}$ is the log-likelihood of the maximal model: -151.4016\n\nIn R, this test is:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|---:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 2| -169.0| -11| 35.11| 2e-04|\n:::\n:::\n\n\n\n\n\n\nThis log-likelihood ratio statistic is called the **null deviance**. 
It tells us whether we have enough data to detect a difference between the null and full models.\n\n## Rescaling\n\n### Rescale age {.smaller}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |>\n mutate(\n `age - mean` = age - mean(age),\n `age - 36wks` = age - 36\n )\n\nlm1c = lm(weight ~ sex + `age - 36wks`, data = bw)\n\nlm2c = lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)\n\nparameters(lm2c, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:------------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|(Intercept) | 2552.73 | 97.59 | (2349.16, 2756.30) | 26.16 | < .001 |\n|sex (male) | 209.97 | 129.75 | (-60.68, 480.63) | 1.62 | 0.121 |\n|age - 36wks | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age - 36wks | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\nCompare with what we got without rescaling:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparameters(bw_lm2, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n## Prediction\n\n### Prediction for linear models\n\n:::{#def-predicted-value}\n#### Predicted value\n\nIn a regression model $\\p(y|x)$, the **predicted value** of $y$ given $x$ is the estimated mean of $Y$ given $X$:\n\n$$\\hat y \\eqdef \\hE{Y|X=x}$$\n:::\n\n---\n\nFor linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value $x_j$ by its corresponding 
estimated coefficient $\\hat\\beta_j$ and adding up the results:\n\n$$\n\\begin{aligned}\n\\hat Y &= \\hat E[Y|X=x] \\\\\n&= x'\\hat\\beta \\\\\n&= \\hat\\beta_0\\cdot 1 + \\hat\\beta_1 x_1 + \\cdots + \\hat\\beta_p x_p\n\\end{aligned}\n$$\n\n---\n\n### Example: prediction for the `birthweight` data\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# covariate vector: intercept = 1, sexmale = 1, age = 40\nX = c(1,1,40)\nsum(X * coef(bw_lm1))\n#> [1] 3225\n```\n:::\n\n\n\n\n\n\nR has built-in functions for prediction:\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = tibble(age = 40, sex = \"male\")\nbw_lm1 |> predict(newdata = x)\n#> 1 \n#> 3225\n```\n:::\n\n\n\n\n\n\nIf you don't provide `newdata`, R will use the covariate values from the original dataset:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredict(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n\n\nThese special predictions are called the *fitted values* of the dataset:\n\n:::{#def-fitted-value}\n\nFor a given dataset $(\\vY, \\mX)$ and corresponding fitted model $\\p_{\\hb}(\\vy|\\mx)$, the **fitted value** of $y_i$ is the predicted value of $y$ when $\\vX=\\vx_i$ using the estimated parameters $\\hb$.\n\n:::\n\nR has an extra function to get these values:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfitted(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n\n\n### Quantifying uncertainty in predictions\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE)\n#> $fit\n#> 1 \n#> 3225 \n#> \n#> $se.fit\n#> [1] 61.46\n#> \n#> $df\n#> [1] 21\n#> \n#> $residual.scale\n#> [1] 177.1\n```\n:::\n\n\n\n\n\n\nThis is a `list()`; you can extract the elements with `$` or 
`magrittr::use_series()`:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE) |> \n use_series(se.fit)\n#> [1] 61.46\n```\n:::\n\n\n\n\n\n\nYou can get **confidence intervals** for $\\E{Y|X=x}$:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> predict(\n newdata = x,\n interval = \"confidence\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 3098| 3353|\n:::\n:::\n\n\n\n\n\n\nYou can also get **prediction intervals** for the value of an individual outcome $Y$:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(newdata = x, interval = \"predict\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 2836| 3615|\n:::\n:::\n\n\n\n\n\n\nThe warning from the last command is: \"predictions on current data refer to *future* responses\" (since you already know what happened to the current data, and thus don't need to predict it).\n\nSee `?predict.lm` for more.\n\n## Diagnostics {#sec-diagnose-LMs}\n\n:::{.callout-tip}\nThis section is adapted from @dobson4e [§6.2-6.3] and \n@dunn2018generalized [Chapter 3](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3).\n:::\n### Assumptions in linear regression models {.smaller .scrollable}\n\n$$Y|\\vX \\simind N(\\vX'\\b,\\ss)$$\n\n1. Normality: The distribution conditional on a given $X$ value is normal\n\n2. Correct Functional Form: The conditional means have the structure \n\n$$E[Y|\\vec X = \\vec x] = \\vec x'\\beta$$\n3. Homoskedasticity: The variance $\\ss$ is constant (with respect to $\\vx$)\n\n4. 
Independence: The observations are statistically independent\n\n### Direct visualization\n\n::: notes\nThe most direct way to examine the fit of a model is to compare it to the raw observed data.\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-html/fig-bw-interaction2-1.png){#fig-bw-interaction2 width=672}\n:::\n:::\n\n\n\n\n\n\n::: notes\nIt's not easy to assess these assumptions from this model.\nIf there are multiple continuous covariates, it becomes even harder to visualize the raw data.\n:::\n\n### Residuals\n\n::: notes\nMaybe we can transform the data and model in some way to make it easier to inspect.\n:::\n:::{#def-resid-noise}\n#### Residual noise\n\nThe **residual noise** in a probabilistic model $p(Y)$ is the difference between an observed value $y$ and its distributional mean:\n\n$$\\eps(y) \\eqdef y - \\Exp{Y}$$ {#eq-def-resid}\n:::\n\n:::{.notes}\nWe use the same notation for residual noise that we used for [errors](estimation.qmd#def-error). \n$\\Exp{Y}$ can be viewed as an estimate of $Y$, before $y$ is observed.\nConversely, each observation $y$ can be viewed as an estimate of $\\Exp{Y}$ (albeit an imprecise one, individually, since $n=1$). 
\n\n:::\n\nWe can rearrange @eq-def-resid to view $y$ as the sum of its mean plus the residual noise:\n\n$$y = \\Exp{Y} + \\eps(y)$$\n\n---\n\n:::{#thm-gaussian-resid-noise}\n#### Residuals in Gaussian models\n\nIf $Y$ has a Gaussian distribution, then $\\err(Y)$ also has a Gaussian distribution, and vice versa.\n:::\n\n:::{.proof}\nLeft to the reader.\n:::\n\n---\n\n:::{#def-resid-fitted}\n#### Residual errors of a fitted model value\n\nThe **residual of a fitted value $\\hat y$** (shorthand: \"residual\") is its [error](estimation.qmd#def-error):\n$$\n\\ba\ne(\\hat y) &\\eqdef \\erf{\\hat y}\n\\\\&= y - \\hat y\n\\ea\n$$\n:::\n\n$e(\\hat y)$ can be seen as the maximum likelihood estimate of the residual noise:\n\n$$\n\\ba\ne(\\hy) &= y - \\hat y\n\\\\ &= \\hat\\eps_{ML}\n\\ea\n$$\n\n---\n\n#### General characteristics of residuals\n\n:::{#thm-resid-unbiased}\nFor [unbiased](estimation.qmd#sec-unbiased-estimators) estimators $\\hth$:\n\n$$\\E{e(y)} = 0$$ {#eq-mean-resid-unbiased}\n\n$$\\Var{e(y)} \\approx \\ss$$ {#eq-var-resid-unbiased}\n\n:::\n\n:::{.proof}\n\\ \n\n@eq-mean-resid-unbiased:\n\n$$\n\\ba\n\\Ef{e(y)} &= \\Ef{y - \\hat y}\n\\\\ &= \\Ef{y} - \\Ef{\\hat y}\n\\\\ &= \\Ef{y} - \\Ef{y}\n\\\\ &= 0\n\\ea\n$$\n\n@eq-var-resid-unbiased:\n\n$$\n\\ba\n\\Var{e(y)} &= \\Var{y - \\hy}\n\\\\ &= \\Var{y} + \\Var{\\hy} - 2 \\Cov{y, \\hy}\n\\\\ &{\\dot{\\approx}} \\Var{y} + 0 - 2 \\cdot 0\n\\\\ &= \\Var{y}\n\\\\ &= \\ss\n\\ea\n$$\n:::\n\n---\n\n#### Characteristics of residuals in Gaussian models\n\nWith enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance $\\sigma^2$, which we can estimate using $\\hat\\sigma^2$; that is,\n\n$$\ne_i \\siid N(0, \\hat\\sigma^2)\n$$\n\n---\n\n:::{#exm-resid-bw}\n#### Residuals in `birthweight` data\n\nR provides a function for residuals:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 242.67 
-139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n\n\n:::\n\n:::{#exr-calc-resids}\nCheck R's output by computing the residuals directly.\n:::\n\n:::{.solution}\n\\ \n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw$weight - fitted(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 242.67 -139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n\n\nThis matches R's output!\n:::\n\n---\n\n#### Graph the residuals\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = bw |> \n mutate(resids_intxn = \n weight - fitted(bw_lm2))\n\nplot_bw_resid =\n bw |> \n ggplot(aes(\n x = age, \n y = resids_intxn,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"residuals (grams)\") +\n theme(legend.position = \"bottom\") +\n # expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot_bw_resid + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![Residuals of interaction model for `birthweight` data](Linear-models-overview_files/figure-html/fig-resids-intxn-1.png){#fig-resids-intxn width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\n:::{#def-stred}\n\n#### Standardized residuals\n\n$$r_i = \\frac{e_i}{\\widehat{SD}(e_i)}$$\n\n:::\n\nHence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,\n\n$$\nr_i \\siid N(0,1)\n$$\n\n### Marginal distributions of residuals\n\nTo look for problems with our model, we can check whether the residuals $e_i$ and standardized residuals $r_i$ look like they have the distributions that they are supposed to have, according to the model.\n\n---\n\n#### Standardized residuals in R\n\n\n\n\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\n\nrstandard(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 1.15982 -0.92601 -0.87479 -0.34723 1.03507 -0.73473 -0.39901 1.43752 \n#> 9 10 11 12 13 14 15 16 \n#> -0.82539 0.30606 0.92807 -0.87616 1.91428 -0.86559 -0.16430 -1.46376 \n#> 17 18 19 20 21 22 23 24 \n#> -1.11016 1.09658 -0.06761 -1.46159 -0.28696 1.58040 1.26717 -0.19805\nresid(bw_lm2)/sigma(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 0.97593 -0.77920 -0.79802 -0.32962 0.98258 -0.70279 -0.38166 1.34357 \n#> 9 10 11 12 13 14 15 16 \n#> -0.77144 0.28606 0.86741 -0.69282 1.51858 -0.76244 -0.15331 -1.36584 \n#> 17 18 19 20 21 22 23 24 \n#> -1.06123 1.04825 -0.06463 -1.34341 -0.26376 1.45262 1.16471 -0.16954\n```\n:::\n\n\n\n\n\n\n::: notes\nThese are not quite the same, because R is doing something more complicated and precise to get the standard errors. Let's not worry about those details for now; the difference is pretty small in this case:\n\n:::\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nrstandard_compare_plot = \n tibble(\n x = resid(bw_lm2)/sigma(bw_lm2), \n y = rstandard(bw_lm2)) |> \n ggplot(aes(x = x, y = y)) +\n geom_point() + \n theme_bw() +\n coord_equal() + \n xlab(\"resid(bw_lm2)/sigma(bw_lm2)\") +\n ylab(\"rstandard(bw_lm2)\") +\n geom_abline(\n aes(\n intercept = 0,\n slope = 1, \n col = \"x=y\")) +\n labs(colour=\"\") +\n scale_colour_manual(values=\"red\")\n\nprint(rstandard_compare_plot)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-65-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\nLet's add these residuals to the `tibble` of our dataset:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n fitted_lm2 = fitted(bw_lm2),\n \n resid_lm2 = resid(bw_lm2),\n # resid_lm2 = weight - fitted_lm2,\n \n std_resid_lm2 = rstandard(bw_lm2),\n # std_resid_lm2 = resid_lm2 / sigma(bw_lm2)\n )\n\nbw |> \n select(\n sex,\n age,\n weight,\n fitted_lm2,\n resid_lm2,\n std_resid_lm2\n )\n```\n\n::: 
{.cell-output-display}\n\n\n|sex | age| weight| fitted_lm2| resid_lm2| std_resid_lm2|\n|:------|---:|------:|----------:|---------:|-------------:|\n|female | 36| 2729| 2553| 176.27| 1.1598|\n|female | 36| 2412| 2553| -140.73| -0.9260|\n|female | 37| 2539| 2683| -144.13| -0.8748|\n|female | 38| 2754| 2814| -59.53| -0.3472|\n|female | 38| 2991| 2814| 177.47| 1.0351|\n|female | 39| 2817| 2944| -126.93| -0.7347|\n|female | 39| 2875| 2944| -68.93| -0.3990|\n|female | 40| 3317| 3074| 242.67| 1.4375|\n|female | 40| 2935| 3074| -139.33| -0.8254|\n|female | 40| 3126| 3074| 51.67| 0.3061|\n|female | 40| 3231| 3074| 156.67| 0.9281|\n|female | 42| 3210| 3335| -125.13| -0.8762|\n|male | 35| 2925| 2651| 274.28| 1.9143|\n|male | 36| 2625| 2763| -137.71| -0.8656|\n|male | 37| 2847| 2875| -27.69| -0.1643|\n|male | 37| 2628| 2875| -246.69| -1.4638|\n|male | 38| 2795| 2987| -191.67| -1.1102|\n|male | 38| 3176| 2987| 189.33| 1.0966|\n|male | 38| 2975| 2987| -11.67| -0.0676|\n|male | 40| 2968| 3211| -242.64| -1.4616|\n|male | 40| 3163| 3211| -47.64| -0.2870|\n|male | 40| 3473| 3211| 262.36| 1.5804|\n|male | 40| 3421| 3211| 210.36| 1.2672|\n|male | 41| 3292| 3323| -30.62| -0.1981|\n:::\n:::\n\n\n\n\n\n\n---\n\n::: notes\n\nNow let's build histograms:\n\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid_marginal_hist = \n bw |> \n ggplot(aes(x = resid_lm2)) +\n geom_histogram()\n\nprint(resid_marginal_hist)\n```\n\n::: {.cell-output-display}\n![Marginal distribution of (nonstandardized) residuals](Linear-models-overview_files/figure-html/fig-marg-dist-resid-1.png){#fig-marg-dist-resid width=672}\n:::\n:::\n\n\n\n\n\n\n::: notes\nHard to tell with this small amount of data, but I'm a bit concerned that the histogram doesn't show a bell-curve shape.\n\n:::\n\n---\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstd_resid_marginal_hist = \n bw |> \n ggplot(aes(x = std_resid_lm2)) +\n geom_histogram()\n\nprint(std_resid_marginal_hist)\n```\n\n::: {.cell-output-display}\n![Marginal 
distribution of standardized residuals](Linear-models-overview_files/figure-html/fig-marg-stresd-1.png){#fig-marg-stresd width=672}\n:::\n:::\n\n\n\n\n\n\n::: notes\nThis looks similar, although the scale of the x-axis got narrower, because we divided by $\\hat\\sigma$ (roughly speaking).\n\nStill hard to tell if the distribution is Gaussian.\n\n:::\n\n---\n\n### QQ plot of standardized residuals\n\n::: notes\nAnother way to assess normality is the QQ plot of the standardized residuals versus normal quantiles:\n\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlibrary(ggfortify) \n# needed to make ggplot2::autoplot() work for `lm` objects\n\nqqplot_lm2_auto = \n bw_lm2 |> \n autoplot(\n which = 2, # options are 1:6; can do multiple at once\n ncol = 1) +\n theme_classic()\n\nprint(qqplot_lm2_auto)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-69-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n::: notes\nIf the Gaussian model were correct, these points would follow the dotted line.\n\nFig 2.4 panel (c) in @dobson4e is a little different; they didn't specify how they produced it, but other statistical analysis systems do things differently from R.\n\nSee also @dunn2018generalized [§3.5.4](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3#Sec14:~:text=3.5.4%20Q%E2%80%93Q%20Plots%20and%20Normality).\n\n:::\n\n---\n\n#### QQ plot - how it's built\n\n::: notes\nLet's construct it by hand:\n:::\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = bw |> \n mutate(\n p = (rank(std_resid_lm2) - 1/2)/n(), # Hazen plotting positions, (i - 1/2)/n\n expected_quantiles_lm2 = qnorm(p)\n )\n\nqqplot_lm2 = \n bw |> \n ggplot(\n aes(\n x = expected_quantiles_lm2, \n y = std_resid_lm2, \n col = sex, \n shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n theme(legend.position='none') + # removing the plot legend\n ggtitle(\"Normal Q-Q\") +\n xlab(\"Theoretical Quantiles\") + \n ylab(\"Standardized residuals\")\n\n# find the expected 
line:\n\nps <- c(.25, .75) # reference probabilities\na <- quantile(rstandard(bw_lm2), ps) # empirical quantiles\nb <- qnorm(ps) # theoretical quantiles\n\nqq_slope = diff(a)/diff(b)\nqq_intcpt = a[1] - b[1] * qq_slope\n\nqqplot_lm2 = \n qqplot_lm2 +\n geom_abline(slope = qq_slope, intercept = qq_intcpt)\n\nprint(qqplot_lm2)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-70-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\n### Conditional distributions of residuals\n\nIf our Gaussian linear regression model is correct, the residuals $e_i$ and standardized residuals $r_i$ should have:\n\n- an approximately Gaussian distribution, with:\n- a mean of 0\n- a constant variance\n\nThis should be true **for every** value of $x$.\n\n---\n\nIf we didn't correctly guess the functional form of the linear component of the mean, \n$$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 X_1 + ... + \\beta_p X_p$$\n\nthen the residuals might have nonzero mean.\n\nRegardless of whether we guessed the mean function correctly, the variance of the residuals might differ between values of $x$.\n\n---\n\n#### Residuals versus fitted values\n\n::: notes\nTo look for these issues, we can plot the residuals $e_i$ against the fitted values $\\hat y_i$ (@fig-bw_lm2-resid-vs-fitted).\n:::\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 1, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model (@eq-BW-lm-interact): residuals versus fitted values](Linear-models-overview_files/figure-html/fig-bw_lm2-resid-vs-fitted-1.png){#fig-bw_lm2-resid-vs-fitted width=672}\n:::\n:::\n\n\n\n\n\n\n::: notes\nIf the model is correct, the blue line should stay flat and close to 0, and the cloud of dots should have the same vertical spread regardless of the fitted value.\n\nIf not, we probably need to change the functional form of the linear component of the mean, $$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 X_1 + ... 
+ \\beta_p X_p$$\n\n:::\n\n---\n\n\n#### Example: PLOS Medicine title length data\n\n(Adapted from @dobson4e, §6.7.1)\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(PLOS, package = \"dobson\")\nlibrary(ggplot2)\nfig1 = \n PLOS |> \n ggplot(\n aes(x = authors,\n y = nchar)\n ) +\n geom_point() +\n theme(legend.position = \"bottom\") +\n labs(col = \"\") +\n guides(col=guide_legend(ncol=3))\nfig1\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine* articles](Linear-models-overview_files/figure-html/fig-plos-1.png){#fig-plos width=672}\n:::\n:::\n\n\n\n\n\n---\n\n##### Linear fit\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_linear = lm(\n formula = nchar ~ authors, \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig2 = fig1 +\n geom_smooth(\n method = \"lm\", \n fullrange = TRUE,\n aes(col = \"lm(y ~ x)\"))\nfig2\n\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-lm-1.png){#fig-plos-lm-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-lm-2.png){#fig-plos-lm-2 width=672}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with linear model fit\n:::\n\n\n\n\n\n---\n\n##### Quadratic fit {.smaller}\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_quad = lm(\n formula = nchar ~ authors + I(authors^2), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-quad .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig3 = \n fig2 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2),\n aes(col = \"lm(y ~ x + I(x^2))\")\n )\nfig3\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-lm-quad-1.png){#fig-plos-lm-quad-1 width=672}\n:::\n\n::: 
{.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-lm-quad-2.png){#fig-plos-lm-quad-2 width=672}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with quadratic model fit\n:::\n\n\n\n\n\n---\n\n##### Linear versus quadratic fits\n\n\n\n\n\n::: {#fig-plos-lm-resid2 .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Linear](Linear-models-overview_files/figure-html/fig-plos-lm-resid2-1.png){#fig-plos-lm-resid2-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Quadratic](Linear-models-overview_files/figure-html/fig-plos-lm-resid2-2.png){#fig-plos-lm-resid2-2 width=672}\n:::\n\nResiduals versus fitted plot for linear and quadratic fits to `PLOS` data\n:::\n\n\n\n\n\n---\n\n##### Cubic fit\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_cub = lm(\n formula = nchar ~ authors + I(authors^2) + I(authors^3), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-cubic .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig4 = \n fig3 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2) + I(x ^ 3),\n aes(col = \"lm(y ~ x + I(x^2) + I(x ^ 3))\")\n )\nfig4\n\nautoplot(lm_PLOS_cub, which = 1, ncol = 1)\n\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-lm-cubic-1.png){#fig-plos-lm-cubic-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-lm-cubic-2.png){#fig-plos-lm-cubic-2 width=672}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with cubic model fit\n:::\n\n\n\n\n\n---\n\n##### Logarithmic fit\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_log = lm(nchar ~ log(authors), data = PLOS)\n```\n:::\n\n::: {#fig-plos-log .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig5 = fig4 + \n geom_smooth(\n 
method = \"lm\",\n fullrange = TRUE,\n formula = y ~ log(x),\n aes(col = \"lm(y ~ log(x))\")\n )\nfig5\n\nautoplot(lm_PLOS_log, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-html/fig-plos-log-1.png){#fig-plos-log-1 width=672}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-html/fig-plos-log-2.png){#fig-plos-log-2 width=672}\n:::\n\nlogarithmic fit\n:::\n\n\n\n\n\n---\n\n##### Model selection {.smaller}\n\n\n\n\n\n::: {#tbl-plos-lin-quad-anova .cell tbl-cap='linear vs quadratic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_linear, lm_PLOS_quad)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|----:|------:|\n| 876| 947502| NA| NA| NA| NA|\n| 875| 880950| 1| 66552| 66.1| 0|\n:::\n:::\n\n::: {#tbl-plos-quad-cub-anova .cell tbl-cap='quadratic vs cubic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_quad, lm_PLOS_cub)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|-----:|------:|\n| 875| 880950| NA| NA| NA| NA|\n| 874| 865933| 1| 15018| 15.16| 1e-04|\n:::\n:::\n\n\n\n\n\n---\n\n##### AIC/BIC {.smaller}\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_quad)\n#> [1] 8568\nAIC(lm_PLOS_cub)\n#> [1] 8555\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_cub)\n#> [1] 8555\nAIC(lm_PLOS_log)\n#> [1] 8544\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nBIC(lm_PLOS_cub)\n#> [1] 8578\nBIC(lm_PLOS_log)\n#> [1] 8558\n```\n:::\n\n\n\n\n\n---\n\n##### Extrapolation is dangerous\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfig_all = fig5 +\n xlim(0, 60)\nfig_all\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine*](Linear-models-overview_files/figure-html/fig-plos-multifit-1.png){#fig-plos-multifit width=672}\n:::\n:::\n\n\n\n\n\n\n\n---\n\n#### 
Scale-location plot\n\n::: notes\nWe can also plot the square roots of the absolute values of the standardized residuals against the fitted values (@fig-bw-scale-loc).\n:::\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 3, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![Scale-location plot of `birthweight` data](Linear-models-overview_files/figure-html/fig-bw-scale-loc-1.png){#fig-bw-scale-loc width=672}\n:::\n:::\n\n\n\n\n\n::: notes\nHere, the blue line doesn't need to be near 0, \nbut it should be flat. \nIf not, the residual variance $\\sigma^2$ might not be constant, \nand we might need to transform our outcome $Y$ \n(or use a model that allows non-constant variance).\n:::\n\n---\n\n\n#### Residuals versus leverage\n\n::: notes\n\nWe can also plot our standardized residuals against \"leverage\", which roughly speaking is a measure of how unusual each $x_i$ value is. Very unusual $x_i$ values can have extreme effects on the model fit, so we might want to remove those observations as outliers, particularly if they have large residuals.\n\n:::\n\n\n\n\n\n\n::: {.cell labels='fig-bw_lm2_resid-vs-leverage'}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 5, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model with interactions (@eq-BW-lm-interact): residuals versus leverage](Linear-models-overview_files/figure-html/unnamed-chunk-89-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n::: notes\nThe blue line should be relatively flat and close to 0 here.\n:::\n\n---\n\n### Diagnostics constructed by hand\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2),\n residlm2 = weight - predlm2,\n std_resid = residlm2 / sigma(bw_lm2),\n # std_resid_builtin = rstandard(bw_lm2), # uses leverage\n sqrt_abs_std_resid = std_resid |> abs() |> sqrt()\n \n )\n\n```\n:::\n\n\n\n\n\n\n##### Residuals vs fitted\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nresid_vs_fit = bw |> \n 
ggplot(\n aes(x = predlm2, y = residlm2, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n\n```\n:::\n\n\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-92-1.png){width=672}\n:::\n:::\n\n\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-93-1.png){width=672}\n:::\n:::\n\n\n\n\n\n:::\n\n##### Standardized residuals vs fitted\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw |> \n ggplot(\n aes(x = predlm2, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-94-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n##### Standardized residuals vs gestational age\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw |> \n ggplot(\n aes(x = age, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-95-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n##### `sqrt(abs(rstandard()))` vs fitted\n\nCompare with `autoplot(bw_lm2, 3)`\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n\nbw |> \n ggplot(\n aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-96-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n## Model selection\n\n(adapted from @dobson4e §6.3.3; for more information on prediction, see @james2013introduction and @rms2e).\n\n::: notes\nIf 
we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\n\nThere are a few possible metrics to consider for choosing a \"best\" model.\n:::\n\n### Mean squared error\n\nWe might want to minimize the **mean squared error**, $\\text E[(y-\\hat y)^2]$, for new observations that weren't in our data set when we fit the model.\n\nUnfortunately, $$\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2$$ gives a biased estimate of $\\text E[(y-\\hat y)^2]$ for new data. If we want an unbiased estimate, we will have to be clever.\n\n---\n\n#### Cross-validation\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n```\n:::\n\n\n\n\n\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\n##### comparing metrics\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, 
sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n```\n\n::: {.cell-output-display}\n\n\n|model | cvRMSE| r.squared| adj.r.squared| trainRMSE| loglik|\n|:-------|------:|---------:|-------------:|---------:|------:|\n|full | 6.906| 0.4805| 0.3831| 5.956| -61.84|\n|reduced | 6.586| 0.4454| 0.3802| 5.971| -62.49|\n:::\n:::\n\n\n\n\n\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(full_model, reduced_model)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|-----:|--:|---------:|-----:|------:|\n| 16| 567.7| NA| NA| NA| NA|\n| 17| 606.0| -1| -38.36| 1.081| 0.3139|\n:::\n:::\n\n\n\n\n\n\n---\n\n#### stepwise regression\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 30.301 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. 
\n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n```\n:::\n\n\n\n\n\n\n---\n\n#### Lasso\n\n$$\\arg \\min_{\\th} \\left\\{ -\\llik(\\th) + \\lambda \\sum_{j=1}^p|\\beta_j| \\right\\}$$\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n```\n:::\n\n\n\n\n\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(fit, xvar = 'lambda')\n\n```\n\n::: {.cell-output-display}\n![Lasso selection](Linear-models-overview_files/figure-html/fig-carbs-lasso-1.png){#fig-carbs-lasso width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncvfit = cv.glmnet(x,y)\nplot(cvfit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.2044\n#> age . 
\n#> weight -0.0926\n#> protein 0.8582\n```\n:::\n\n\n\n\n\n\n\n## Categorical covariates with more than two levels\n\n### Example: `birthweight`\n\nIn the birthweight example, the variable `sex` had only two observed values:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nunique(bw$sex)\n#> [1] female male \n#> Levels: female male\n```\n:::\n\n\n\n\n\n\nIf there are more than two observed values, we can't just use a single variable with 0s and 1s.\n\n### \n\n:::{.notes}\nFor example, @tbl-iris-data shows the \n[(in)famous](https://www.meganstodel.com/posts/no-to-iris/) \n`iris` data (@anderson1935irises), \nand @tbl-iris-summary provides summary statistics. \nThe data include three species: \"setosa\", \"versicolor\", and \"virginica\".\n:::\n\n\n\n\n\n\n::: {#tbl-iris-data .cell tbl-cap='The `iris` data'}\n\n```{.r .cell-code}\nhead(iris)\n```\n\n::: {.cell-output-display}\n\n\n| Sepal.Length| Sepal.Width| Petal.Length| Petal.Width|Species |\n|------------:|-----------:|------------:|-----------:|:-------|\n| 5.1| 3.5| 1.4| 0.2|setosa |\n| 4.9| 3.0| 1.4| 0.2|setosa |\n| 4.7| 3.2| 1.3| 0.2|setosa |\n| 4.6| 3.1| 1.5| 0.2|setosa |\n| 5.0| 3.6| 1.4| 0.2|setosa |\n| 5.4| 3.9| 1.7| 0.4|setosa |\n:::\n:::\n\n::: {#tbl-iris-summary .cell tbl-cap='Summary statistics for the `iris` data'}\n\n```{.r .cell-code}\nlibrary(table1)\ntable1(\n x = ~ . | Species,\n data = iris,\n overall = FALSE\n)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
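<table>\n<thead>\n<tr><th></th><th>setosa<br>(N=50)</th><th>versicolor<br>(N=50)</th><th>virginica<br>(N=50)</th></tr>\n</thead>\n<tbody>\n<tr><td>Sepal.Length</td><td></td><td></td><td></td></tr>\n<tr><td>Mean (SD)</td><td>5.01 (0.352)</td><td>5.94 (0.516)</td><td>6.59 (0.636)</td></tr>\n<tr><td>Median [Min, Max]</td><td>5.00 [4.30, 5.80]</td><td>5.90 [4.90, 7.00]</td><td>6.50 [4.90, 7.90]</td></tr>\n<tr><td>Sepal.Width</td><td></td><td></td><td></td></tr>\n<tr><td>Mean (SD)</td><td>3.43 (0.379)</td><td>2.77 (0.314)</td><td>2.97 (0.322)</td></tr>\n<tr><td>Median [Min, Max]</td><td>3.40 [2.30, 4.40]</td><td>2.80 [2.00, 3.40]</td><td>3.00 [2.20, 3.80]</td></tr>\n<tr><td>Petal.Length</td><td></td><td></td><td></td></tr>\n<tr><td>Mean (SD)</td><td>1.46 (0.174)</td><td>4.26 (0.470)</td><td>5.55 (0.552)</td></tr>\n<tr><td>Median [Min, Max]</td><td>1.50 [1.00, 1.90]</td><td>4.35 [3.00, 5.10]</td><td>5.55 [4.50, 6.90]</td></tr>\n<tr><td>Petal.Width</td><td></td><td></td><td></td></tr>\n<tr><td>Mean (SD)</td><td>0.246 (0.105)</td><td>1.33 (0.198)</td><td>2.03 (0.275)</td></tr>\n<tr><td>Median [Min, Max]</td><td>0.200 [0.100, 0.600]</td><td>1.30 [1.00, 1.80]</td><td>2.00 [1.40, 2.50]</td></tr>\n</tbody>\n</table>\n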
\n```\n\n:::\n:::\n\n\n\n\n\n\n---\n\nIf we want to model `Sepal.Length` by species, we could create a variable $X$ that represents \"setosa\" as $X=1$, \"virginica\" as $X=2$, and \"versicolor\" as $X=3$.\n\n\n\n\n\n\n::: {#tbl-numeric-coding .cell tbl-cap='`iris` data with numeric coding of species'}\n\n```{.r .cell-code}\ndata(iris) # this step is not always necessary, but ensures you're starting \n# from the original version of a dataset stored in a loaded package\n\niris = \n iris |> \n tibble() |>\n mutate(\n X = case_when(\n Species == \"setosa\" ~ 1,\n Species == \"virginica\" ~ 2,\n Species == \"versicolor\" ~ 3\n )\n )\n\niris |> \n distinct(Species, X)\n```\n\n::: {.cell-output-display}\n\n\n|Species | X|\n|:----------|--:|\n|setosa | 1|\n|versicolor | 3|\n|virginica | 2|\n:::\n:::\n\n\n\n\n\n\nThen we could fit a model like:\n\n\n\n\n\n\n::: {#tbl-iris-numeric-species .cell tbl-cap='Model of `iris` data with numeric coding of `Species`'}\n\n```{.r .cell-code}\niris_lm1 = lm(Sepal.Length ~ X, data = iris)\niris_lm1 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(148) | p |\n|:-----------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 4.91 | 0.16 | (4.60, 5.23) | 30.83 | < .001 |\n|X | 0.47 | 0.07 | (0.32, 0.61) | 6.30 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n### Let's see how that model looks:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot1 = iris |> \n ggplot(\n aes(\n x = X, \n y = Sepal.Length)\n ) +\n geom_point(alpha = .1) +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) +\n theme_bw(base_size = 18)\nprint(iris_plot1)\n\n```\n\n::: {.cell-output-display}\n![Model of `iris` data with numeric coding of `Species`](Linear-models-overview_files/figure-html/fig-iris-numeric-species-model-1.png){#fig-iris-numeric-species-model width=672}\n:::\n:::\n\n\n\n\n\n\nWe have forced the model to use a straight line for the three estimated 
means. Maybe not a good idea?\n\n### Let's see what R does with categorical variables by default:\n\n\n\n\n\n\n::: {#tbl-iris-model-factor1 .cell tbl-cap='Model of `iris` data with `Species` as a categorical variable'}\n\n```{.r .cell-code}\niris_lm2 = lm(Sepal.Length ~ Species, data = iris)\niris_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 0.93 | 0.10 | (0.73, 1.13) | 9.03 | < .001 |\n|Species (virginica) | 1.58 | 0.10 | (1.38, 1.79) | 15.37 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n### Re-parametrize with no intercept\n\nIf you don't want the default intercept-and-offsets parametrization, you can remove the intercept with \"- 1\", as we've seen previously:\n\n\n\n\n\n\n::: {#tbl-iris-no-intcpt .cell}\n\n```{.r .cell-code}\niris.lm2b = lm(Sepal.Length ~ Species - 1, data = iris)\niris.lm2b |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|Species (setosa) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 5.94 | 0.07 | (5.79, 6.08) | 81.54 | < .001 |\n|Species (virginica) | 6.59 | 0.07 | (6.44, 6.73) | 90.49 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n### Let's see what these new models look like:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot2 = \n iris |> \n mutate(\n predlm2 = predict(iris_lm2)) |> \n arrange(X) |> \n ggplot(aes(x = X, y = Sepal.Length)) +\n geom_point(alpha = .1) +\n geom_line(aes(y = predlm2), col = \"red\") +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) + \n theme_bw(base_size = 18)\n\nprint(iris_plot2)\n\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-html/fig-iris-no-intcpt-1.png){#fig-iris-no-intcpt 
width=672}\n:::\n:::\n\n\n\n\n\n\n### Let's see how R did that:\n\n\n\n\n\n\n::: {#tbl-iris-model-matrix-factor .cell}\n\n```{.r .cell-code}\nformula(iris_lm2)\n#> Sepal.Length ~ Species\nmodel.matrix(iris_lm2) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| (Intercept)| Speciesversicolor| Speciesvirginica|\n|-----------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 1| 1| 0|\n| 1| 0| 1|\n:::\n:::\n\n\n\n\n\n\nThis is called a \"corner point parametrization\".\n\n\n\n\n\n\n::: {#tbl-iris-group-point-parameterization .cell}\n\n```{.r .cell-code}\nformula(iris.lm2b)\n#> Sepal.Length ~ Species - 1\nmodel.matrix(iris.lm2b) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| Speciessetosa| Speciesversicolor| Speciesvirginica|\n|-------------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 0| 1| 0|\n| 0| 0| 1|\n:::\n:::\n\n\n\n\n\n\nThis can be called a \"group point parametrization\".\n\nThere are more options; see @dobson4e §6.4.1 and the \n[`codingMatrices` package](https://CRAN.R-project.org/package=codingMatrices) \n[vignette](https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf) \n(@venablescodingMatrices).\n\n## Ordinal covariates\n\n(c.f. 
@dobson4e §2.4.4)\n\n---\n\n::: notes\nWe can create ordinal variables in R using the `ordered()` function^[or equivalently, `factor(ordered = TRUE)`].\n:::\n\n:::{#exm-ordinal-variable}\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl = paste0(\n \"https://regression.ucsf.edu/sites/g/files/tkssra6706/\",\n \"f/wysiwyg/home/data/hersdata.dta\")\nlibrary(haven)\nhers = read_dta(url)\n```\n:::\n\n\n\n::: {#tbl-HERS .cell tbl-cap='HERS dataset'}\n\n```{.r .cell-code}\nhers |> head()\n```\n\n::: {.cell-output-display}\n\n\n| HT| age| raceth| nonwhite| smoking| drinkany| exercise| physact| globrat| poorfair| medcond| htnmeds| statins| diabetes| dmpills| insulin| weight| BMI| waist| WHR| glucose| weight1| BMI1| waist1| WHR1| glucose1| tchol| LDL| HDL| TG| tchol1| LDL1| HDL1| TG1| SBP| DBP| age10|\n|--:|---:|------:|--------:|-------:|--------:|--------:|-------:|-------:|--------:|-------:|-------:|-------:|--------:|-------:|-------:|------:|-----:|-----:|-----:|-------:|-------:|-----:|------:|-----:|--------:|-----:|-----:|---:|---:|------:|-----:|----:|---:|---:|---:|-----:|\n| 0| 70| 2| 1| 0| 0| 0| 5| 3| 0| 0| 1| 1| 0| 0| 0| 73.8| 23.69| 96.0| 0.932| 84| 73.6| 23.63| 93.0| 0.912| 94| 189| 122.4| 52| 73| 201| 137.6| 48| 77| 138| 78| 7.0|\n| 0| 62| 2| 1| 0| 0| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 70.9| 28.62| 93.0| 0.964| 111| 73.4| 28.89| 95.0| 0.964| 78| 307| 241.6| 44| 107| 216| 150.6| 48| 87| 118| 70| 6.2|\n| 1| 69| 1| 0| 0| 0| 0| 3| 3| 0| 0| 1| 0| 1| 0| 0| 102.0| 42.51| 110.2| 0.782| 114| 96.1| 40.73| 103.0| 0.774| 98| 254| 166.2| 57| 154| 254| 156.0| 66| 160| 134| 78| 6.9|\n| 0| 64| 1| 0| 1| 1| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 64.4| 24.39| 87.0| 0.877| 94| 58.6| 22.52| 77.0| 0.802| 93| 204| 116.2| 56| 159| 207| 122.6| 57| 137| 152| 72| 6.4|\n| 0| 65| 1| 0| 0| 0| 0| 2| 3| 0| 0| 0| 0| 0| 0| 0| 57.9| 21.90| 77.0| 0.794| 101| 58.9| 22.28| 76.5| 0.757| 92| 214| 150.6| 42| 107| 235| 172.2| 35| 139| 175| 95| 6.5|\n| 1| 68| 2| 1| 0| 1| 0| 3| 3| 0| 0| 0| 0| 0| 0| 0| 60.9| 
29.05| 96.0| 1.000| 116| 57.7| 27.52| 86.0| 0.910| 115| 212| 137.8| 52| 111| 202| 126.6| 53| 112| 174| 98| 6.8|\n:::\n:::\n\n\n\n\n\n\n:::\n\n---\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# C(contr = codingMatrices::contr.diff)\n\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/Linear-models-overview/execute-results/tex.json b/_freeze/Linear-models-overview/execute-results/tex.json index 9043fe6..86f8b60 100644 --- a/_freeze/Linear-models-overview/execute-results/tex.json +++ b/_freeze/Linear-models-overview/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "cf0077fda43f6716c6ac1ffa5069d436", + "hash": "32ab51f7bf136708900c866450f81545", "result": { "engine": "knitr", - "markdown": "---\ndf-print: paged\n---\n\n\n\n\n\n\n\n\n# Linear (Gaussian) Models\n\n---\n\n\n\n\n---\n\n### Configuring R {.unnumbered}\n\nFunctions from these packages will be used throughout this document:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(conflicted) # check for conflicting function definitions\n# library(printr) # inserts help-file output into markdown output\nlibrary(rmarkdown) # Convert R Markdown documents into a variety of formats.\nlibrary(pander) # format tables for markdown\nlibrary(ggplot2) # graphics\nlibrary(ggeasy) # help with graphics\nlibrary(ggfortify) # help with graphics\nlibrary(dplyr) # manipulate data\nlibrary(tibble) # `tibble`s extend `data.frame`s\nlibrary(magrittr) # `%>%` and other additional piping tools\nlibrary(haven) # import Stata files\nlibrary(knitr) # format R output for markdown\nlibrary(tidyr) # Tools to help to create tidy data\nlibrary(plotly) # interactive graphics\nlibrary(dobson) # datasets from Dobson and Barnett 2018\nlibrary(parameters) # format model output tables for markdown\nlibrary(haven) # import Stata files\nlibrary(latex2exp) # use LaTeX in R code (for figures and tables)\nlibrary(fs) # filesystem path manipulations\nlibrary(survival) # survival 
analysis\nlibrary(survminer) # survival analysis graphics\nlibrary(KMsurv) # datasets from Klein and Moeschberger\nlibrary(parameters) # format model output tables for\nlibrary(webshot2) # convert interactive content to static for pdf\nlibrary(forcats) # functions for categorical variables (\"factors\")\nlibrary(stringr) # functions for dealing with strings\nlibrary(lubridate) # functions for dealing with dates and times\n```\n:::\n\n\n\n\n\n\n\nHere are some R settings I use in this document:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(list = ls()) # delete any data that's already loaded into R\n\nconflicts_prefer(dplyr::filter)\nggplot2::theme_set(\n ggplot2::theme_bw() + \n # ggplot2::labs(col = \"\") +\n ggplot2::theme(\n legend.position = \"bottom\",\n text = ggplot2::element_text(size = 12, family = \"serif\")))\n\nknitr::opts_chunk$set(message = FALSE)\noptions('digits' = 4)\n\npanderOptions(\"big.mark\", \",\")\npander::panderOptions(\"table.emphasize.rownames\", FALSE)\npander::panderOptions(\"table.split.table\", Inf)\nconflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default\nlegend_text_size = 9\n```\n:::\n\n\n\n\n\n\n\n\n\n\n\n\\providecommand{\\cbl}[1]{\\left\\{#1\\right.}\n\\providecommand{\\cb}[1]{\\left\\{#1\\right\\}}\n\\providecommand{\\paren}[1]{\\left(#1\\right)}\n\\providecommand{\\sb}[1]{\\left[#1\\right]}\n\\def\\pr{\\text{p}}\n\\def\\am{\\arg \\max}\n\\def\\argmax{\\arg \\max}\n\\def\\p{\\text{p}}\n\\def\\P{\\text{P}}\n\\def\\ph{\\hat{\\text{p}}}\n\\def\\hp{\\hat{\\text{p}}}\n\\def\\ga{\\alpha}\n\\def\\b{\\beta}\n\\providecommand{\\floor}[1]{\\left \\lfloor{#1}\\right \\rfloor}\n\\providecommand{\\ceiling}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\providecommand{\\ceil}[1]{\\left \\lceil{#1}\\right 
\\rceil}\n\\def\\Ber{\\text{Ber}}\n\\def\\Bernoulli{\\text{Bernoulli}}\n\\def\\Pois{\\text{Pois}}\n\\def\\Poisson{\\text{Poisson}}\n\\def\\Gaus{\\text{Gaussian}}\n\\def\\Normal{\\text{N}}\n\\def\\NB{\\text{NegBin}}\n\\def\\NegBin{\\text{NegBin}}\n\\def\\vbeta{\\vec \\beta}\n\\def\\vb{\\vec \\b}\n\\def\\v0{\\vec{0}}\n\\def\\gb{\\beta}\n\\def\\gg{\\gamma}\n\\def\\gd{\\delta}\n\\def\\eps{\\varepsilon}\n\\def\\om{\\omega}\n\\def\\m{\\mu}\n\\def\\s{\\sigma}\n\\def\\l{\\lambda}\n\\def\\gs{\\sigma}\n\\def\\gm{\\mu}\n\\def\\M{\\text{M}}\n\\def\\gM{\\text{M}}\n\\def\\Mu{\\text{M}}\n\\def\\cd{\\cdot}\n\\def\\cds{\\cdots}\n\\def\\lds{\\ldots}\n\\def\\eqdef{\\stackrel{\\text{def}}{=}}\n\\def\\defeq{\\stackrel{\\text{def}}{=}}\n\\def\\hb{\\hat \\beta}\n\\def\\hl{\\hat \\lambda}\n\\def\\hy{\\hat y}\n\\def\\yh{\\hat y}\n\\def\\V{{\\text{Var}}}\n\\def\\hs{\\hat \\sigma}\n\\def\\hsig{\\hat \\sigma}\n\\def\\hS{\\hat \\Sigma}\n\\def\\hSig{\\hat \\Sigma}\n\\def\\hSigma{\\hat \\Sigma}\n\\def\\hSurv{\\hat{S}}\n\\providecommand{\\hSurvf}[1]{\\hat{S}\\paren{#1}}\n\\def\\dist{\\ \\sim \\ }\n\\def\\ddist{\\ \\dot{\\sim} \\ }\n\\def\\dsim{\\ \\dot{\\sim} \\ }\n\\def\\za{z_{1 - \\frac{\\alpha}{2}}}\n\\def\\cirad{\\za \\cdot \\hse{\\hb}}\n\\def\\ci{\\hb {\\color{red}\\pm} \\cirad}\n\\def\\th{\\theta}\n\\def\\Th{\\Theta}\n\\def\\xbar{\\bar{x}}\n\\def\\hth{\\hat\\theta}\n\\def\\hthml{\\hth_{\\text{ML}}}\n\\def\\ba{\\begin{aligned}}\n\\def\\ea{\\end{aligned}}\n\\def\\ind{⫫}\n\\def\\indpt{⫫}\n\\def\\all{\\forall}\n\\def\\iid{\\text{iid}}\n\\def\\ciid{\\text{ciid}}\n\\def\\simind{\\ \\sim_{\\ind}\\ }\n\\def\\siid{\\ \\sim_{\\iid}\\ }\n\\def\\simiid{\\siid}\n\\def\\distiid{\\siid}\n\\def\\tf{\\therefore}\n\\def\\Lik{\\mathcal{L}}\n\\def\\llik{\\ell}\n\\providecommand{\\llikf}[1]{\\llik \\paren{#1}}\n\\def\\score{\\ell'}\n\\providecommand{\\scoref}[1]{\\score \\paren{#1}}\n\\def\\hess{\\ell''}\n\\def\\hessian{\\ell''}\n\\providecommand{\\hessf}[1]{\\hess 
\\paren{#1}}\n\\providecommand{\\hessianf}[1]{\\hess \\paren{#1}}\n\\providecommand{\\starf}[1]{#1^*}\n\\def\\lik{\\ell}\n\\providecommand{\\est}[1]{\\widehat{#1}}\n\\providecommand{\\esttmp}[1]{{\\widehat{#1}}^*}\n\\def\\esttmpl{\\esttmp{\\lambda}}\n\\def\\cR{\\mathcal{R}}\n\\def\\range{\\mathcal{R}}\n\\def\\Range{\\mathcal{R}}\n\\providecommand{\\rangef}[1]{\\cR(#1)}\n\\def\\~{\\approx}\n\\def\\dapp{\\dot\\approx}\n\\providecommand{\\red}[1]{{\\color{red}#1}}\n\\providecommand{\\deriv}[1]{\\frac{\\partial}{\\partial #1}}\n\\providecommand{\\derivf}[2]{\\frac{\\partial #1}{\\partial #2}}\n\\providecommand{\\blue}[1]{{\\color{blue}#1}}\n\\providecommand{\\green}[1]{{\\color{green}#1}}\n\\providecommand{\\hE}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hExp}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hmu}[1]{\\hat{\\mu}\\sb{#1}}\n\\def\\Expp{\\mathbb{E}}\n\\def\\Ep{\\mathbb{E}}\n\\def\\expit{\\text{expit}}\n\\providecommand{\\expitf}[1]{\\expit\\cb{#1}}\n\\providecommand{\\dexpitf}[1]{\\expit'\\cb{#1}}\n\\def\\logit{\\text{logit}}\n\\providecommand{\\logitf}[1]{\\logit\\cb{#1}}\n\\providecommand{\\E}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Ef}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Exp}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Expf}[1]{\\mathbb{E}\\sb{#1}}\n\\def\\Varr{\\text{Var}}\n\\providecommand{\\var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\varf}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Varf}[1]{\\text{Var}\\paren{#1}}\n\\def\\Covt{\\text{Cov}}\n\\providecommand{\\covh}[1]{\\widehat{\\text{Cov}}\\paren{#1}}\n\\providecommand{\\Cov}[1]{\\Covt \\paren{#1}}\n\\providecommand{\\Covf}[1]{\\Covt 
\\paren{#1}}\n\\def\\varht{\\widehat{\\text{Var}}}\n\\providecommand{\\varh}[1]{\\varht\\paren{#1}}\n\\providecommand{\\varhf}[1]{\\varht\\paren{#1}}\n\\providecommand{\\vc}[1]{\\boldsymbol{#1}}\n\\providecommand{\\sd}[1]{\\text{sd}\\paren{#1}}\n\\providecommand{\\SD}[1]{\\text{SD}\\paren{#1}}\n\\providecommand{\\hSD}[1]{\\widehat{\\text{SD}}\\paren{#1}}\n\\providecommand{\\se}[1]{\\text{se}\\paren{#1}}\n\\providecommand{\\hse}[1]{\\hat{\\text{se}}\\paren{#1}}\n\\providecommand{\\SE}[1]{\\text{SE}\\paren{#1}}\n\\providecommand{\\HSE}[1]{\\widehat{\\text{SE}}\\paren{#1}}\n\\renewcommand{\\log}[1]{\\text{log}\\cb{#1}}\n\\providecommand{\\logf}[1]{\\text{log}\\cb{#1}}\n\\def\\dlog{\\text{log}'}\n\\providecommand{\\dlogf}[1]{\\dlog \\cb{#1}}\n\\renewcommand{\\exp}[1]{\\text{exp}\\cb{#1}}\n\\providecommand{\\expf}[1]{\\exp{#1}}\n\\def\\dexp{\\text{exp}'}\n\\providecommand{\\dexpf}[1]{\\dexp \\cb{#1}}\n\\providecommand{\\e}[1]{\\text{e}^{#1}}\n\\providecommand{\\ef}[1]{\\text{e}^{#1}}\n\\providecommand{\\inv}[1]{\\paren{#1}^{-1}}\n\\providecommand{\\invf}[1]{\\paren{#1}^{-1}}\n\\def\\oinf{I}\n\\def\\Nat{\\mathbb{N}}\n\\providecommand{\\oinff}[1]{\\oinf\\paren{#1}}\n\\def\\einf{\\mathcal{I}}\n\\providecommand{\\einff}[1]{\\einf\\paren{#1}}\n\\def\\heinf{\\hat{\\einf}}\n\\providecommand{\\heinff}[1]{\\heinf \\paren{#1}}\n\\providecommand{\\1}[1]{\\mathbb{1}_{#1}}\n\\providecommand{\\set}[1]{\\cb{#1}}\n\\providecommand{\\pf}[1]{\\p 
\\paren{#1}}\n\\providecommand{\\Bias}[1]{\\text{Bias}\\paren{#1}}\n\\providecommand{\\bias}[1]{\\text{Bias}\\paren{#1}}\n\\def\\ss{\\sigma^2}\n\\providecommand{\\ssqf}[1]{\\sigma^2\\paren{#1}}\n\\providecommand{\\mselr}[1]{\\text{MSE}\\paren{#1}}\n\\providecommand{\\maelr}[1]{\\text{MAE}\\paren{#1}}\n\\providecommand{\\abs}[1]{\\left|#1\\right|}\n\\providecommand{\\sqf}[1]{\\paren{#1}^2}\n\\providecommand{\\sq}{^2}\n\\def\\err{\\eps}\n\\providecommand{\\erf}[1]{\\err\\paren{#1}}\n\\renewcommand{\\vec}[1]{\\tilde{#1}}\n\\providecommand{\\v}[1]{\\vec{#1}}\n\\providecommand{\\matr}[1]{\\mathbf{#1}}\n\\def\\mX{\\matr{X}}\n\\def\\mx{\\matr{x}}\n\\def\\vx{\\vec{x}}\n\\def\\vX{\\vec{X}}\n\\def\\vy{\\vec{y}}\n\\def\\vY{\\vec{Y}}\n\\def\\vpi{\\vec{\\pi}}\n\\providecommand{\\mat}[1]{\\mathbf{#1}}\n\\providecommand{\\dsn}[1]{#1_1, \\ldots, #1_n}\n\\def\\X1n{\\dsn{X}}\n\\def\\Xin{\\dsn{X}}\n\\def\\x1n{\\dsn{x}}\n\\def\\'{^{\\top}}\n\\def\\dpr{\\cdot}\n\\def\\Xx1n{X_1=x_1, \\ldots, X_n = x_n}\n\\providecommand{\\dsvn}[2]{#1_1=#2_1, \\ldots, #1_n = #2_n}\n\\providecommand{\\sumn}[1]{\\sum_{#1=1}^n}\n\\def\\sumin{\\sum_{i=1}^n}\n\\def\\sumi1n{\\sum_{i=1}^n}\n\\def\\prodin{\\prod_{i=1}^n}\n\\def\\prodi1n{\\prod_{i=1}^n}\n\\providecommand{\\lp}[2]{#1 \\' \\beta}\n\\def\\odds{\\omega}\n\\def\\OR{\\text{OR}}\n\\def\\logodds{\\eta}\n\\def\\oddst{\\text{odds}}\n\\def\\probst{\\text{probs}}\n\\def\\probt{\\text{probt}}\n\\def\\probit{\\text{probit}}\n\\providecommand{\\oddsf}[1]{\\oddst\\cb{#1}}\n\\providecommand{\\doddsf}[1]{{\\oddst}'\\cb{#1}}\n\\def\\oddsinv{\\text{invodds}}\n\\providecommand{\\oddsinvf}[1]{\\oddsinv\\cb{#1}}\n\\def\\invoddsf{\\oddsinvf}\n\\providecommand{\\doddsinvf}[1]{{\\oddsinv}'\\cb{#1}}\n\\def\\dinvoddsf{\\doddsinvf}\n\\def\\haz{h}\n\\def\\cuhaz{H}\n\\def\\incidence{\\bar{\\haz}}\n\\def\\phaz{\\Expf{\\haz}}\n\n\n\n\n\n\n\n\n\n```{=html}\n\n```\n\n\n\n\n\n\n\n\n\n---\n\n:::{.callout-note}\nThis content is adapted from:\n\n- @dobson4e, Chapters 2-6\n- 
@dunn2018generalized, Chapters 2-3\n- @vittinghoff2e, Chapter 4\n\nThere are numerous textbooks specifically for linear regression, including:\n\n- @kutner2005applied: used for UCLA Biostatistics MS level linear models class\n- @chatterjee2015regression: used for Stanford MS-level linear models class\n- @seber2012linear: used for UCLA Biostatistics PhD level linear models class and UC Davis STA 108.\n- @kleinbaum2014applied: same first author as @kleinbaum2010logistic and @kleinbaum2012survival\n- @weisberg2005applied\n- *Linear Models with R* [@Faraway2025-io]\n\n\n## Overview\n\n### Why this course includes linear regression {.smaller}\n\n:::{.fragment .fade-in-then-semi-out}\n* This course is about *generalized linear models* (for non-Gaussian outcomes)\n:::\n\n:::{.fragment .fade-in-then-semi-out}\n* UC Davis STA 108 (\"Applied Statistical Methods: Regression Analysis\") is a prerequisite for this course, so everyone here should have some understanding of linear regression already.\n:::\n\n:::{.fragment .fade-in}\n* We will review linear regression to:\n - make sure everyone is caught up\n - to provide an epidemiological perspective on model interpretation.\n:::\n\n### Chapter overview\n\n* @sec-understand-LMs: how to interpret linear regression models\n\n* @sec-est-LMs: how to estimate linear regression models\n\n* @sec-infer-LMs: how to quantify uncertainty about our estimates\n\n* @sec-diagnose-LMs: how to tell if your model is insufficiently complex\n\n\n## Understanding Gaussian Linear Regression Models {#sec-understand-LMs}\n\n### Motivating example: birthweights and gestational age {.smaller}\n\nSuppose we want to learn about the distributions of birthweights (*outcome* $Y$) for (human) babies born at different gestational ages (*covariate* $A$) and with different chromosomal sexes (*covariate* $S$) (@dobson4e Example 2.2.2).\n\n::::: {.panel-tabset}\n\n#### Data as table\n\n\n\n\n\n\n\n\n::: {#tbl-birthweight-data1 .cell tbl-cap='`birthweight` data 
(@dobson4e Example 2.2.2)'}\n\n```{.r .cell-code}\nlibrary(dobson)\ndata(\"birthweight\", package = \"dobson\")\nbirthweight |> knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n| boys gestational age| boys weight| girls gestational age| girls weight|\n|--------------------:|-----------:|---------------------:|------------:|\n| 40| 2968| 40| 3317|\n| 38| 2795| 36| 2729|\n| 40| 3163| 40| 2935|\n| 35| 2925| 38| 2754|\n| 36| 2625| 42| 3210|\n| 37| 2847| 39| 2817|\n| 41| 3292| 40| 3126|\n| 40| 3473| 37| 2539|\n| 37| 2628| 36| 2412|\n| 38| 3176| 38| 2991|\n| 40| 3421| 39| 2875|\n| 38| 2975| 40| 3231|\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n#### Reshape data for graphing\n\n\n\n\n\n\n\n\n::: {#tbl-birthweight-data2 .cell tbl-cap='`birthweight` data reshaped'}\n\n```{.r .cell-code}\nbw = \n birthweight |> \n pivot_longer(\n cols = everything(),\n names_to = c(\"sex\", \".value\"),\n names_sep = \"s \"\n ) |> \n rename(age = `gestational age`) |> \n mutate(\n sex = sex |> \n case_match(\n \"boy\" ~ \"male\",\n \"girl\" ~ \"female\") |> \n factor(levels = c(\"female\", \"male\")))\n\nbw\n```\n\n::: {.cell-output-display}\n\n|sex | age| weight|\n|:------|---:|------:|\n|male | 40| 2968|\n|female | 40| 3317|\n|male | 38| 2795|\n|female | 36| 2729|\n|male | 40| 3163|\n|female | 40| 2935|\n|male | 35| 2925|\n|female | 38| 2754|\n|male | 36| 2625|\n|female | 42| 3210|\n|male | 37| 2847|\n|female | 39| 2817|\n|male | 41| 3292|\n|female | 40| 3126|\n|male | 40| 3473|\n|female | 37| 2539|\n|male | 37| 2628|\n|female | 36| 2412|\n|male | 38| 3176|\n|female | 38| 2991|\n|male | 40| 3421|\n|female | 39| 2875|\n|male | 38| 2975|\n|female | 40| 3231|\n\n:::\n:::\n\n\n\n\n\n\n\n\n#### Data as graph\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot1 = bw |> \n ggplot(aes(\n x = age, \n y = weight,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"Birthweight (grams)\") +\n theme(legend.position = \"bottom\") +\n # 
expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot1 + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![`birthweight` data (@dobson4e Example 2.2.2)](Linear-models-overview_files/figure-pdf/fig-plot-birthweight1-1.pdf){#fig-plot-birthweight1}\n:::\n:::\n\n\n\n\n\n\n\n\n:::::\n\n---\n\n#### Data notation\n\nLet's define some notation to represent this data.\n\n- $Y$: birthweight (measured in grams)\n- $S$: chromosomal sex: \"male\" (XY) or \"female\" (XX)\n- $M$: indicator variable for $S$ = \"male\"^[$M$ is implicitly a deterministic function of $S$]\n- $M = 0$ if female (XX)\n- $M = 1$ if male (XY)\n- $F$: indicator variable for $S$ = \"female\"^[$F$ is implicitly a deterministic function of $S$]\n- $F = 1$ if female (XX)\n- $F = 0$ if male (XY)\n\n- $A$: estimated gestational age at birth (measured in weeks).\n\n::: callout-note\nFemale is the **reference level** for the categorical variable $S$ \n(chromosomal sex) and corresponding indicator variable $M$ . \nThe choice of a reference level is arbitrary and does not limit what \nwe can do with the resulting model; \nit only makes it more computationally convenient to make inferences \nabout comparisons involving that reference group.\n:::\n\n### Parallel lines regression\n\nWe don't have enough data to model the distribution of birth weight \nseparately for each combination of gestational age and sex, \nso let's instead consider a (relatively) simple model for how that \ndistribution varies with gestational age and sex:\n\n$$p(Y=y|A=a,S=s) \\siid N(\\mu(a,s), \\sigma^2)$$\n\n$$\n\\ba\n\\mu(a,s)\n&\\eqdef \\Exp{Y|A=a, S=s} \\\\\n&= \\beta_0 + \\beta_A a+ \\beta_M m\n\\ea\n$$ {#eq-lm-parallel}\n\n:::{.notes}\n\n@tbl-lm-parallel shows the parameter estimates from R.\n@fig-parallel-fit1 shows the estimated model, superimposed on the data.\n\n:::\n\n::: {.column width=40%}\n\n\n\n\n\n\n\n\n::: {#tbl-lm-parallel .cell tbl-cap='Estimate of [Model @eq-lm-parallel] for `birthweight` data'}\n\n```{.r 
.cell-code}\nbw_lm1 = lm(\n formula = weight ~ sex + age, \n data = bw)\n\nbw_lm1 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------|:--------:|\n|(Intercept) | -1773.32 |\n|sex (female) | 0.00 |\n|sex (male) | 163.04 |\n|age | 120.89 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=50%}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(`E[Y|X=x]` = fitted(bw_lm1)) |> \n arrange(sex, age)\n\nplot2 = \n plot1 %+% bw +\n geom_line(aes(y = `E[Y|X=x]`))\n\nprint(plot2)\n\n```\n\n::: {.cell-output-display}\n![Parallel-slopes model of birthweight](Linear-models-overview_files/figure-pdf/fig-parallel-fit1-1.pdf){#fig-parallel-fit1}\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n#### Model assumptions and predictions\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n::: {#exr-pred-fem-parallel}\n\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n\n\n\n\n\n\n\n\n::: {#tbl-coef-model1 .cell tbl-cap='Estimated coefficients for [model @eq-lm-parallel]'}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm1)[\"(Intercept)\"] + coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n# print(pred_female)\n### built-in prediction: \n# predict(bw_lm1, newdata = tibble(sex = \"female\", age = 36))\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 0, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 \\\\\n&= 2578.8739\n\\ea\n$$\n:::\n\n---\n\n:::{#exr-pred-male-parallel}\n\nWhat's the mean birthweight for a male born at 36 weeks?\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> 
(Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm1)[\"(Intercept)\"] + \n coef(bw_lm1)[\"sexmale\"] + \n coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 \\\\\n&= 2741.9132\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-sex-parallel-1}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= \n2741.9132 - 2578.8739\\\\\n&=\n163.0393\n\\end{aligned}\n$$\n\nShortcut:\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36) - \n(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36) \\\\\n&= \\beta_M \\\\ \n&= 163.0393\n\\end{aligned}\n$$\n\n:::\n\n:::{.notes}\n\nNote that age doesn't show up in this difference: in other words, according to this model, the difference between females and males with the same gestational age is the same for every age.\n\nThat's an assumption of the model; it's built-in to the parametric structure, even before we plug in the estimated values of those parameters.\n\nThat's why the lines are parallel.\n\n:::\n\n### Interactions {.smaller}\n\n:::{.notes}\nWhat if we don't like that parallel lines assumption?\n\nThen we need to allow an \"interaction\" between age $A$ and sex $S$:\n:::\n\n$$\nE[Y|A=a, S=s] = \\beta_0 + \\beta_A a+ \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$ {#eq-BW-lm-interact}\n\n::: notes\nNow, the slope of mean birthweight $E[Y|A,S]$ with respect to gestational age $A$ depends on the value of 
sex $S$.\n:::\n\n::: {.column width=40% .smaller}\n\n\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-interact .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{.column width=5%}\n:::\n\n:::{.column width=55%}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-pdf/fig-bw-interaction-1.pdf){#fig-bw-interaction}\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n::: {.notes}\nNow we can see that the lines aren't parallel.\n:::\n\n---\n\nHere's another way we could rewrite this model (by collecting terms involving $A$):\n\n$$\nE[Y|A, M] = \\beta_0 + \\beta_M M+ (\\beta_A + \\beta_{AM} M) A\n$$\n\n::: callout-note\nIf you want to understand a coefficient in a model with interactions, collect terms for the corresponding variable, and you will see what other variables are interacting with the variable you are interested in.\n:::\n\n:::{.notes}\nIn this case, the variable $M$ is interacting with $A$. So the slope of $Y$ with respect to $A$ depends on the value of $M$.\n\nAccording to this model, there is no such thing as \"*the* slope of birthweight with respect to age\". There are two slopes, one for each sex.^[using the definite article \"the\" would mean there is only one slope.] 
We can only talk about \"the slope of birthweight with respect to age among males\" and \"the slope of birthweight with respect to age among females\".\n\nThen: that coefficient is the difference in means per unit change in its corresponding variable, when the other collected variables are set to 0.\n:::\n\n---\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n:::{#exr-pred-fem-interact}\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n:::\n\n---\n\n::: {.solution}\n\\ \n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm2)[\"(Intercept)\"] + coef(bw_lm2)[\"age\"]*36\n```\n:::\n\n\n\n\n\n\n\n\n$$\nE[Y|M = 0, A = 36] = \n\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36 \n= 2552.7333\n$$ \n\n:::\n\n---\n\n:::{#exr-pred-interact-male_36}\nWhat's the mean birthweight for a male born at 36 weeks?\n\n:::\n\n---\n\n::: solution\n\\ \n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm2)[\"(Intercept)\"] + \n coef(bw_lm2)[\"sexmale\"] + \n coef(bw_lm2)[\"age\"]*36 + \n coef(bw_lm2)[\"sexmale:age\"] * 36\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36]\n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36\\\\\n&= 2762.7069\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-gender-interact}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\ \n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36)\\\\ \n&\\ \\ \\ \\ \\ -(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36) \\\\\n&= \\beta_{M} + \\beta_{AM}\\cdot 36\\\\\n&= 209.9736\n\\end{aligned}\n$$\n:::\n\n:::{.notes}\nNote that age now does show up in the difference: in other words, according to this model, the difference 
in mean birthweights between females and males with the same gestational age can vary by gestational age.\n\nThat's how the lines in the graph ended up non-parallel.\n\n:::\n\n### Stratified regression {.smaller}\n\n:::{.notes}\nWe could re-write the interaction model as a stratified model, with a slope and intercept for each sex:\n:::\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_M m + \\beta_{AM} (a \\cdot m) + \n\\beta_F f + \\beta_{AF} (a \\cdot f)\n$$ {#eq-model-strat}\n\nCompare this stratified model with our interaction model, @eq-BW-lm-interact:\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_0 + \\beta_A a + \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$\n\n::: notes\n\nIn the stratified model, the intercept term $\\beta_0$ has been relabeled as $\\beta_F$.\n\n:::\n\n::: {.column width=45%}\n\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-interact2 .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=45%}\n\n\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-strat .cell tbl-cap='Birthweight model - stratified betas'}\n\n```{.r .cell-code}\nbw_lm_strat = \n bw |> \n lm(\n formula = weight ~ sex + sex:age - 1, \n data = _)\n\nbw_lm_strat |> \n parameters() |>\n print_md(\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------------|:--------:|\n|sex (female) | -2141.67 |\n|sex (male) | -1268.67 |\n|sex (female) × age | 130.40 |\n|sex (male) × age | 111.98 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n### Curved-line regression\n\n::: notes\nIf we 
transform some of our covariates ($X$s) and plot the resulting model on the original covariate scale, we end up with curved regression lines:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm3 = lm(weight ~ sex:log(age) - 1, data = bw)\nlibrary(palmerpenguins)\n\nggpenguins <- \n palmerpenguins::penguins |> \n dplyr::filter(species == \"Adelie\") |> \n ggplot(\n aes(x = bill_length_mm , y = body_mass_g)) +\n geom_point() + \n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\nggpenguins2 = ggpenguins +\n stat_smooth(\n method = \"lm\",\n formula = y ~ log(x),\n geom = \"smooth\") +\n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\n\nggpenguins2 |> print()\n```\n\n::: {.cell-output-display}\n![`palmerpenguins` model with `bill_length` entering on log scale](Linear-models-overview_files/figure-pdf/fig-penguins-log-x-1.pdf){#fig-penguins-log-x}\n:::\n:::\n\n\n\n\n\n\n\n\n## Estimating Linear Models via Maximum Likelihood {#sec-est-LMs}\n\n### Likelihood, log-likelihood, and score functions for linear regression {.smaller}\n\n:::{.notes}\n\nIn EPI 203 and @sec-intro-MLEs, we learned how to fit outcome-only models of the form $p(X=x|\\theta)$ to iid data $\\vx = (x_1,…,x_n)$ using maximum likelihood estimation.\n\nNow, we apply the same procedure to linear regression models:\n\n:::\n\n$$\n\\mathcal L(\\vec y|\\mat x,\\beta, \\sigma^2) = \n\\prod_{i=1}^n (2\\pi\\sigma^2)^{-1/2} \n\\exp{-\\frac{1}{2\\sigma^2}(y_i - \\vec{x_i}'\\beta)^2}\n$$ {#eq-linreg-lik}\n\n$$\n\\ell(\\vec y|\\mat x,\\beta, \\sigma^2) \n= -\\frac{n}{2}\\log{\\sigma^2} - \n\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i - \\vec{x_i}' \\beta)^2\n$$ {#eq-linreg-loglik}\n\n$$\n\\ell'_{\\beta}(\\vec y|\\mat x,\\beta, \\sigma^2) \n= - \n\\frac{1}{2\\sigma^2}\\deriv{\\beta}\n\\paren{\\sum_{i=1}^n (y_i - \\vec{x_i}\\' \\beta)^2}\n$$ {#eq-linreg-score}\n\n---\n\n::: notes\nLet's switch to matrix-vector notation:\n:::\n\n$$\n\\sum_{i=1}^n (y_i - \\vx_i\\' \\vb)^2 \n= (\\vy - 
\\mX\\vb)'(\\vy - \\mX\\vb)\n$$\n\n---\n\nSo\n\n$$\n\\begin{aligned}\n(\\vy - \\mX\\vb)'(\\vy - \\mX\\vb) \n&= (\\vy' - \\vb'X')(\\vy - \\mX\\vb)\n\\\\ &= y'y - \\vb'X'y - y'\\mX\\vb +\\vb'\\mX'\\mX\\beta\n\\\\ &= y'y - 2y'\\mX\\beta +\\beta'\\mX'\\mX\\beta\n\\end{aligned}\n$$\n\n### Deriving the linear regression score function\n\n::: notes\nWe will use some results from [vector calculus](math-prereqs.qmd#sec-vector-calculus):\n:::\n\n$$\n\\begin{aligned}\n\\deriv{\\beta}\\paren{\\sum_{i=1}^n (y_i - x_i' \\beta)^2} \n &= \\deriv{\\beta}(\\vy - X\\beta)'(\\vy - X\\beta)\n\\\\ &= \\deriv{\\beta} (y'y - 2y'X\\beta +\\beta'X'X\\beta)\n\\\\ &= (- 2X'y +2X'X\\beta)\n\\\\ &= - 2X'(y - X\\beta)\n\\\\ &= - 2X'(y - \\Expp[y])\n\\\\ &= - 2X' \\err(y)\n\\end{aligned}\n$${#eq-scorefun-linreg}\n\n---\n\nSo if the score function $\\ell'_{\\beta}(\\beta,\\sigma^2) = 0$, then\n\n$$\n\\begin{aligned}\n0 &= (- 2X'y +2X'X\\beta)\\\\\n2X'y &= 2X'X\\beta\\\\\nX'y &= X'X\\beta\\\\\n(X'X)^{-1}X'y &= \\beta\n\\end{aligned}\n$$\n\n---\n\nThe second derivative matrix $\\ell_{\\beta, \\beta'} ''(\\beta, \\sigma^2;\\mathbf X,\\vy)$ is negative definite at $\\beta = (X'X)^{-1}X'y$, so $\\hat \\beta_{ML} = (X'X)^{-1}X'y$ is the MLE for $\\beta$.\n\n---\n\nSimilarly (not shown):\n\n$$\n\\hat\\sigma^2_{ML} = \\frac{1}{n} (Y-X\\hat\\beta)'(Y-X\\hat\\beta)\n$$\n\nAnd\n\n$$\n\\begin{aligned}\n\\mathcal I_{\\beta} &= E[-\\ell_{\\beta, \\beta'} ''(Y|X,\\beta, \\sigma^2)]\\\\\n&= \\frac{1}{\\sigma^2}X'X\n\\end{aligned}\n$$\n\n---\n\nSo:\n\n$$\nVar(\\hat \\beta) \\approx (\\mathcal I_{\\beta})^{-1} = \\sigma^2 (X'X)^{-1}\n$$\n\nand\n\n$$\n\\hat\\beta \\dot \\sim N(\\beta, \\mathcal I_{\\beta}^{-1})\n$$ \n\n:::{.notes}\n\nThese are all results you have hopefully seen before.\n\n:::\n\n---\n\nIn the Gaussian linear regression case, we also have exact results:\n\n$$\n\\frac{\\hat\\beta_j - \\beta_j}{\\hse{\\hat\\beta_j}} \\dist t_{n-p}\n$$ \n\n---\n\nIn our model 2 above, the estimated covariance matrix of $\\hat\\beta$, $\\heinf(\\beta)^{-1}$, is:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nbw_lm2 |> vcov()\n#> (Intercept) sexmale age sexmale:age\n#> (Intercept) 1353968 -1353968 -34871.0 34871.0\n#> sexmale -1353968 2596387 34871.0 -67211.0\n#> age -34871 34871 899.9 -899.9\n#> sexmale:age 34871 -67211 -899.9 1743.5\n```\n:::\n\n\n\n\n\n\n\n\nIf we take the square roots of the diagonals, we get the standard errors listed in the model output:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> vcov() |> diag() |> sqrt()\n#> (Intercept) sexmale age sexmale:age \n#> 1163.60 1611.33 30.00 41.76\n```\n:::\n\n::: {#tbl-mod-intx .cell tbl-cap='Estimated model for `birthweight` data with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nSo we can do confidence intervals, hypothesis tests, and p-values exactly as in the one-variable case we looked at previously.\n\n### Residual Standard Deviation\n\n::: notes\n$\\hs$ represents an *estimate* of the *Residual Standard Deviation* parameter, $\\s$. 
\nWe can extract $\\hs$ from the fitted model, using the `sigma()` function:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsigma(bw_lm2)\n#> [1] 180.6\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n#### $\\s$ is NOT \"Residual standard error\"\n\n::: notes\nIn the `summary.lm()` output, this estimate is labeled as `\"Residual standard error\"`:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsummary(bw_lm2)\n#> \n#> Call:\n#> lm(formula = weight ~ sex + age + sex:age, data = bw)\n#> \n#> Residuals:\n#> Min 1Q Median 3Q Max \n#> -246.7 -138.1 -39.1 176.6 274.3 \n#> \n#> Coefficients:\n#> Estimate Std. Error t value Pr(>|t|) \n#> (Intercept) -2141.7 1163.6 -1.84 0.08057 . \n#> sexmale 873.0 1611.3 0.54 0.59395 \n#> age 130.4 30.0 4.35 0.00031 ***\n#> sexmale:age -18.4 41.8 -0.44 0.66389 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> Residual standard error: 181 on 20 degrees of freedom\n#> Multiple R-squared: 0.643,\tAdjusted R-squared: 0.59 \n#> F-statistic: 12 on 3 and 20 DF, p-value: 0.000101\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n::: notes\nHowever, this is a misnomer:\n:::\n\n\n\n\n\n\n\n\n::: {.cell printr.help.sections='[\"description\",\"note\"]'}\n\n```{.r .cell-code code-fold=\"show\"}\nlibrary(printr) # captures ? documentation\n?stats::sigma\n#> Extract Residual Standard Deviation 'Sigma'\n#> \n#> Description:\n#> \n#> Extract the estimated standard deviation of the errors, the\n#> \"residual standard deviation\" (misnamed also \"residual standard\n#> error\", e.g., in 'summary.lm()''s output, from a fitted model).\n#> \n#> Many classical statistical models have a _scale parameter_,\n#> typically the standard deviation of a zero-mean normal (or\n#> Gaussian) random variable which is denoted as sigma. 
'sigma(.)'\n#> extracts the _estimated_ parameter from a fitted model, i.e.,\n#> sigma^.\n#> \n#> Note:\n#> \n#> The misnomer \"Residual standard *error*\" has been part of too many\n#> R (and S) outputs to be easily changed there.\n```\n:::\n\n\n\n\n\n\n\n\n## Inference about Gaussian Linear Regression Models {#sec-infer-LMs}\n\n### Motivating example: `birthweight` data\n\nResearch question: is there really an interaction between sex and age?\n\n$H_0: \\beta_{AM} = 0$\n\n$H_A: \\beta_{AM} \\neq 0$\n\n$P(|\\hat\\beta_{AM}| > |-18.4172| \\mid H_0)$ = ?\n\n### Wald tests and CIs {.smaller}\n\nR can give you Wald tests for single coefficients and corresponding CIs:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (female) | 0.00 | | | | |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nTo understand what's happening, let's replicate these results by hand for the interaction term.\n\n### P-values {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Estimate\"]\nse_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Std. 
Error\"]\ndfresid = bw_lm2$df.residual\nt_stat = abs(beta_hat)/se_hat\npval_t = \n pt(-t_stat, df = dfresid, lower.tail = TRUE) +\n pt(t_stat, df = dfresid, lower.tail = FALSE)\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\begin{aligned}\n&P\\paren{\n| \\hat \\beta_{AM} | > \n| -18.4172| \\middle| H_0\n} \n\\\\\n&= \\Pr \\paren{\n\\abs{ \\frac{\\hat\\beta_{AM}}{\\hat{SE}(\\hat\\beta_{AM})} } > \n\\abs{ \\frac{-18.4172}{41.7558} } \\middle| H_0\n}\\\\ \n&= \\Pr \\paren{\n\\abs{ T_{20} } > 0.4411 | H_0\n}\\\\\n&= 0.6639\n\\end{aligned}\n$$ \n\n::: notes\nThis matches the result in the table above.\n:::\n\n### Confidence intervals\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nq_t = qt(\n p = 0.975, \n df = dfresid, \n lower.tail = TRUE)\n\nconfint_radius_t = \n se_hat * q_t\n\nconfint_t = beta_hat + c(-1,1) * confint_radius_t\n\nprint(confint_t)\n#> [1] -105.52 68.68\n```\n:::\n\n\n\n\n\n\n\n\n::: notes\nThis also matches.\n:::\n\n### Gaussian approximations\n\nHere are the asymptotic (Gaussian approximation) equivalents:\n\n### P-values {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npval_z = pnorm(abs(t_stat), lower = FALSE) * 
2\n\nprint(pval_z)\n#> [1] 0.6592\n```\n:::\n\n\n\n\n\n\n\n\n### Confidence intervals {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nconfint_radius_z = se_hat * qnorm(0.975, lower = TRUE)\nconfint_z = \n beta_hat + c(-1,1) * confint_radius_z\nprint(confint_z)\n#> [1] -100.26 63.42\n```\n:::\n\n\n\n\n\n\n\n\n### Likelihood ratio statistics\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlogLik(bw_lm2)\n#> 'log Lik.' -156.6 (df=5)\nlogLik(bw_lm1)\n#> 'log Lik.' -156.7 (df=4)\n\nlLR = (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()\ndelta_df = (bw_lm1$df.residual - df.residual(bw_lm2))\n\n\nx_max = 1\n\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd_lLR = function(x, df = delta_df) dchisq(x, df = df)\n\nchisq_plot = \n ggplot() + \n geom_function(fun = d_lLR) +\n stat_function( fun = d_lLR, xlim = c(lLR, x_max), geom = \"area\", fill = \"gray\") +\n geom_segment(aes(x = lLR, xend = lLR, y = 0, yend = d_lLR(lLR)), col = \"red\") + \n xlim(0.0001,x_max) + \n ylim(0,4) + \n ylab(\"p(X=x)\") + \n xlab(\"log(likelihood ratio) statistic [x]\") +\n theme_classic()\nchisq_plot |> print()\n```\n\n::: {.cell-output-display}\n![Chi-square distribution](Linear-models-overview_files/figure-pdf/fig-chisq-plot-1.pdf){#fig-chisq-plot}\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\nNow we can get the p-value:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(\n q = 2*lLR, \n df = delta_df, \n lower = FALSE) |> \n print()\n#> [1] 0.6298\n```\n:::\n\n\n\n\n\n\n\n\n\n---\n\nIn practice you don't have to do this by hand; there are functions to do it for 
you:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# built in\nlibrary(lmtest)\nlrtest(bw_lm2, bw_lm1)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|------:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 4| -156.7| -1| 0.2323| 0.6298|\n:::\n:::\n\n\n\n\n\n\n\n\n## Goodness of fit\n\n### AIC and BIC\n\n::: notes\nWhen we use likelihood ratio tests, we are comparing how well different models fit the data.\n\nLikelihood ratio tests require \"nested\" models: one must be a special case of the other.\n\nIf we have non-nested models, we can instead use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC):\n:::\n\n- AIC = $-2 * \\ell(\\hat\\theta) + 2 * p$\n\n- BIC = $-2 * \\ell(\\hat\\theta) + p * \\text{log}(n)$\n\nwhere $\\ell$ is the log-likelihood of the data evaluated using the parameter estimates $\\hat\\theta$, $p$ is the number of estimated parameters in the model (including $\\hat\\sigma^2$), and $n$ is the number of observations.\n\nYou can calculate these criteria using the `logLik()` function, or use the built-in R functions:\n\n#### AIC in R\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n 2*(length(coef(bw_lm2))+1) # sigma counts as a parameter here\n#> [1] 323.2\n\nAIC(bw_lm2)\n#> [1] 323.2\n```\n:::\n\n\n\n\n\n\n\n\n#### BIC in R\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n (length(coef(bw_lm2))+1) * log(nobs(bw_lm2))\n#> [1] 329\n\nBIC(bw_lm2)\n#> [1] 329\n```\n:::\n\n\n\n\n\n\n\n\nLarge values of AIC and BIC are worse than small values. 
There are no hypothesis tests or p-values associated with these criteria.\n\n### (Residual) Deviance\n\nLet $q$ be the number of distinct covariate combinations in a data set.\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique = \n bw |> \n count(sex, age)\n\nn_unique.bw = nrow(bw.X.unique)\n```\n:::\n\n\n\n\n\n\n\n\nFor example, in the `birthweight` data, there are $q = 12$ unique patterns (@tbl-bw-x-combos).\n\n\n\n\n\n\n\n\n::: {#tbl-bw-x-combos .cell tbl-cap='Unique covariate combinations in the `birthweight` data, with replicate counts'}\n\n```{.r .cell-code}\nbw.X.unique\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 36| 2|\n|female | 37| 1|\n|female | 38| 2|\n|female | 39| 2|\n|female | 40| 4|\n|female | 42| 1|\n|male | 35| 1|\n|male | 36| 1|\n|male | 37| 2|\n|male | 38| 3|\n|male | 40| 4|\n|male | 41| 1|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n::: {#def-replicates}\n#### Replicates\nIf a given covariate pattern has more than one observation in a dataset, those observations are called **replicates**.\n:::\n\n---\n\n::: {#exm-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nIn the `birthweight` dataset, there are 2 replicates of the combination \"female, age 36\" (@tbl-bw-x-combos).\n\n:::\n\n---\n\n::: {#exr-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nWhich covariate pattern(s) in the `birthweight` data has the most replicates?\n\n:::\n\n---\n\n::: {#sol-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nTwo covariate patterns are tied for most replicates: males at age 40 weeks \nand females at age 40 weeks.\n40 weeks is the usual length for human pregnancy (@polin2011fetal), so this result makes sense.\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique |> dplyr::filter(n == max(n))\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 40| 4|\n|male | 40| 4|\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n#### Saturated models 
{.smaller}\n\nThe most complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:\n\n\n\n\n\n\n\n\n::: {#tbl-bw-model-sat .cell tbl-cap='Saturated model for the `birthweight` data'}\n\n```{.r .cell-code}\nlm_max = \n bw |> \n mutate(age = factor(age)) |> \n lm(\n formula = weight ~ sex:age - 1, \n data = _)\n\nlm_max |> \n parameters() |> \n print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(12) | p |\n|:--------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|sex (male) × age35 | 2925.00 | 187.92 | (2515.55, 3334.45) | 15.56 | < .001 |\n|sex (female) × age36 | 2570.50 | 132.88 | (2280.98, 2860.02) | 19.34 | < .001 |\n|sex (male) × age36 | 2625.00 | 187.92 | (2215.55, 3034.45) | 13.97 | < .001 |\n|sex (female) × age37 | 2539.00 | 187.92 | (2129.55, 2948.45) | 13.51 | < .001 |\n|sex (male) × age37 | 2737.50 | 132.88 | (2447.98, 3027.02) | 20.60 | < .001 |\n|sex (female) × age38 | 2872.50 | 132.88 | (2582.98, 3162.02) | 21.62 | < .001 |\n|sex (male) × age38 | 2982.00 | 108.50 | (2745.60, 3218.40) | 27.48 | < .001 |\n|sex (female) × age39 | 2846.00 | 132.88 | (2556.48, 3135.52) | 21.42 | < .001 |\n|sex (female) × age40 | 3152.25 | 93.96 | (2947.52, 3356.98) | 33.55 | < .001 |\n|sex (male) × age40 | 3256.25 | 93.96 | (3051.52, 3460.98) | 34.66 | < .001 |\n|sex (male) × age41 | 3292.00 | 187.92 | (2882.55, 3701.45) | 17.52 | < .001 |\n|sex (female) × age42 | 3210.00 | 187.92 | (2800.55, 3619.45) | 17.08 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nWe call this model the **full**, **maximal**, or **saturated** model for this dataset.\n\nWe can calculate the log-likelihood of this model as usual:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm_max)\n#> 'log Lik.' 
-151.4 (df=13)\n```\n:::\n\n\n\n\n\n\n\n\nWe can compare this model to our other models using chi-square tests, as usual:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, bw_lm2)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 5| -156.6| -8| 10.36| 0.241|\n:::\n:::\n\n\n\n\n\n\n\n\nThe likelihood ratio statistic for this test is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell) = 10.3554$$ where:\n\n- $\\ell_{\\text{full}}$ is the log-likelihood of the full model: -151.4016\n- $\\ell$ is the log-likelihood of our comparison model (two slopes, two intercepts): -156.5793\n\nThis statistic is called the **deviance** or **residual deviance** for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.\n\nThe corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us if there's evidence that we missed any important patterns. 
(Remember, a nonsignificant p-value could mean that we didn't miss anything and a more complicated model is unnecessary, or it could mean we just don't have enough data to tell the difference between these models.)\n\n### Null Deviance\n\nSimilarly, the *least* complicated model we could fit would have only one mean parameter, an intercept:\n\n$$\\text E[Y|X=x] = \\beta_0$$ We can fit this model in R like so:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm0 = lm(weight ~ 1, data = bw)\n\nlm0 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(23) | p |\n|:-----------|:-----------:|:-----:|:------------------:|:-----:|:------:|\n|(Intercept) | 2967.67 | 57.58 | (2848.56, 3086.77) | 51.54 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nThis model also has a likelihood:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm0)\n#> 'log Lik.' -169 (df=2)\n```\n:::\n\n\n\n\n\n\n\n\nAnd we can compare it to more complicated models using a likelihood ratio test:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlrtest(bw_lm2, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 2| -169.0| -3| 24.75| 0|\n:::\n:::\n\n\n\n\n\n\n\n\nThe likelihood ratio statistic for the test comparing the null model to the maximal model is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell_{0}) = 35.1067$$ where:\n\n- $\\ell_{\\text{0}}$ is the log-likelihood of the null model: -168.955\n- $\\ell_{\\text{full}}$ is the log-likelihood of the maximal model: -151.4016\n\nIn R, this test is:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|---:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 2| -169.0| -11| 35.11| 2e-04|\n:::\n:::\n\n\n\n\n\n\n\n\nThis log-likelihood ratio statistic is called the **null deviance**. 
It tells us whether we have enough data to detect a difference between the null and full models.\n\n## Rescaling\n\n### Rescale age {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |>\n mutate(\n `age - mean` = age - mean(age),\n `age - 36wks` = age - 36\n )\n\nlm1c = lm(weight ~ sex + `age - 36wks`, data = bw)\n\nlm2c = lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)\n\nparameters(lm2c, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:------------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|(Intercept) | 2552.73 | 97.59 | (2349.16, 2756.30) | 26.16 | < .001 |\n|sex (male) | 209.97 | 129.75 | (-60.68, 480.63) | 1.62 | 0.121 |\n|age - 36wks | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age - 36wks | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nCompare with what we got without rescaling:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparameters(bw_lm2, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n## Prediction\n\n### Prediction for linear models\n\n:::{#def-predicted-value}\n#### Predicted value\n\nIn a regression model $\\p(y|x)$, the **predicted value** of $y$ given $x$ is the estimated mean of $Y$ given $X$:\n\n$$\\hat y \\eqdef \\hE{Y|X=x}$$\n:::\n\n---\n\nFor linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value $x_j$ by its 
corresponding estimated coefficient $\\hat\\beta_j$ and adding up the results:\n\n$$\n\\begin{aligned}\n\\hat Y &= \\hat E[Y|X=x] \\\\\n&= x'\\hat\\beta \\\\\n&= \\hat\\beta_0\\cdot 1 + \\hat\\beta_1 x_1 + ... + \\hat\\beta_p x_p\n\\end{aligned}\n$$\n\n---\n\n### Example: prediction for the `birthweight` data\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nX = c(1,1,40)\nsum(X * coef(bw_lm1))\n#> [1] 3225\n```\n:::\n\n\n\n\n\n\n\n\nR has built-in functions for prediction:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = tibble(age = 40, sex = \"male\")\nbw_lm1 |> predict(newdata = x)\n#> 1 \n#> 3225\n```\n:::\n\n\n\n\n\n\n\n\nIf you don't provide `newdata`, R will use the covariate values from the original dataset:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredict(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n\n\n\n\nThese special predictions are called the *fitted values* of the dataset:\n\n:::{#def-fitted-value}\n\nFor a given dataset $(\\vY, \\mX)$ and corresponding fitted model $\\p_{\\hb}(\\vy|\\mx)$, the **fitted value** of $y_i$ is the predicted value of $y$ when $\\vX=\\vx_i$ using the estimated parameters $\\hb$.\n\n:::\n\nR has an extra function to get these values:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfitted(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n\n\n\n\n### Quantifying uncertainty in predictions\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE)\n#> $fit\n#> 1 \n#> 3225 \n#> \n#> $se.fit\n#> [1] 61.46\n#> \n#> $df\n#> [1] 21\n#> \n#> $residual.scale\n#> [1] 177.1\n```\n:::\n\n\n\n\n\n\n\n\nThis is a `list()`; you 
can extract the elements with `$` or `magrittr::use_series()`:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE) |> \n use_series(se.fit)\n#> [1] 61.46\n```\n:::\n\n\n\n\n\n\n\n\nYou can get **confidence intervals** for $\\E{Y|X=x}$:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> predict(\n newdata = x,\n interval = \"confidence\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 3098| 3353|\n:::\n:::\n\n\n\n\n\n\n\n\nYou can also get **prediction intervals** for the value of an individual outcome $Y$:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(newdata = x, interval = \"predict\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 2836| 3615|\n:::\n:::\n\n\n\n\n\n\n\n\nThe warning from the last command is: \"predictions on current data refer to *future* responses\" (since you already know what happened to the current data, and thus don't need to predict it).\n\nSee `?predict.lm` for more.\n\n## Diagnostics {#sec-diagnose-LMs}\n\n:::{.callout-tip}\nThis section is adapted from @dobson4e [§6.2-6.3] and \n@dunn2018generalized [Chapter 3](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3).\n:::\n### Assumptions in linear regression models {.smaller .scrollable}\n\n$$Y|\\vX \\simind N(\\vX'\\b,\\ss)$$\n\n1. Normality: The distribution conditional on a given $X$ value is normal\n\n2. Correct Functional Form: The conditional means have the structure \n\n$$E[Y|\\vec X = \\vec x] = \\vec x'\\beta$$\n3. Homoskedasticity: The variance $\\ss$ is constant (with respect to $\\vx$)\n\n4. 
Independence: The observations are statistically independent\n\n### Direct visualization\n\n::: notes\nThe most direct way to examine the fit of a model is to compare it to the raw observed data.\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-pdf/fig-bw-interaction2-1.pdf){#fig-bw-interaction2}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nIt's not easy to assess these assumptions from this model.\nIf there are multiple continuous covariates, it becomes even harder to visualize the raw data.\n:::\n\n### Residuals\n\n::: notes\nMaybe we can transform the data and model in some way to make it easier to inspect.\n:::\n:::{#def-resid-noise}\n#### Residual noise\n\nThe **residual noise** in a probabilistic model $p(Y)$ is the difference between an observed value $y$ and its distributional mean:\n\n$$\\eps(y) \\eqdef y - \\Exp{Y}$$ {#eq-def-resid}\n:::\n\n:::{.notes}\nWe use the same notation for residual noise that we used for [errors](estimation.qmd#def-error). \n$\\Exp{Y}$ can be viewed as an estimate of $Y$, before $y$ is observed.\nConversely, each observation $y$ can be viewed as an estimate of $\\Exp{Y}$ (albeit an imprecise one, individually, since $n=1$). 
\n\n:::\n\nWe can rearrange @eq-def-resid to view $y$ as the sum of its mean plus the residual noise:\n\n$$y = \\Exp{Y} + \\eps(y)$$\n\n---\n\n:::{#thm-gaussian-resid-noise}\n#### Residuals in Gaussian models\n\nIf $Y$ has a Gaussian distribution, then $\\err(Y)$ also has a Gaussian distribution, and vice versa.\n:::\n\n:::{.proof}\nLeft to the reader.\n:::\n\n---\n\n:::{#def-resid-fitted}\n#### Residual errors of a fitted model value\n\nThe **residual of a fitted value $\\hat y$** (shorthand: \"residual\") is its [error](estimation.qmd#def-error):\n$$\n\\ba\ne(\\hat y) &\\eqdef \\erf{\\hat y}\n\\\\&= y - \\hat y\n\\ea\n$$\n:::\n\n$e(\\hat y)$ can be seen as the maximum likelihood estimate of the residual noise:\n\n$$\n\\ba\ne(\\hy) &= y - \\hat y\n\\\\ &= \\hat\\eps_{ML}\n\\ea\n$$\n\n---\n\n#### General characteristics of residuals\n\n:::{#thm-resid-unbiased}\nFor [unbiased](estimation.qmd#sec-unbiased-estimators) estimators $\\hth$:\n\n$$\\E{e(y)} = 0$$ {#eq-mean-resid-unbiased}\n$$\\Var{e(y)} \\approx \\ss$$ {#eq-var-resid-unbiased}\n\n:::\n\n:::{.proof}\n\\ \n\n@eq-mean-resid-unbiased:\n\n$$\n\\ba\n\\Ef{e(y)} &= \\Ef{y - \\hat y}\n\\\\ &= \\Ef{y} - \\Ef{\\hat y}\n\\\\ &= \\Ef{y} - \\Ef{y}\n\\\\ &= 0\n\\ea\n$$\n\n@eq-var-resid-unbiased:\n\n$$\n\\ba\n\\Var{e(y)} &= \\Var{y - \\hy}\n\\\\ &= \\Var{y} + \\Var{\\hy} - 2 \\Cov{y, \\hy}\n\\\\ &{\\dot{\\approx}} \\Var{y} + 0 - 2 \\cdot 0\n\\\\ &= \\Var{y}\n\\\\ &= \\ss\n\\ea\n$$\n:::\n\n---\n\n#### Characteristics of residuals in Gaussian models\n\nWith enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance $\\sigma^2$, which we can estimate using $\\hat\\sigma^2$; that is:\n\n$$\ne_i \\siid N(0, \\hat\\sigma^2)\n$$\n\n---\n\n:::{#exm-resid-bw}\n#### residuals in `birthweight` data\n\nR provides a function for residuals:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 
242.67 -139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{#exr-calc-resids}\nCheck R's output by computing the residuals directly.\n:::\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw$weight - fitted(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 242.67 -139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n\n\n\n\nThis matches R's output!\n:::\n\n---\n\n#### Graph the residuals\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = bw |> \n mutate(resids_intxn = \n weight - fitted(bw_lm2))\n\nplot_bw_resid =\n bw |> \n ggplot(aes(\n x = age, \n y = resids_intxn,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"residuals (grams)\") +\n theme(legend.position = \"bottom\") +\n # expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot_bw_resid + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![Residuals of interaction model for `birthweight` data](Linear-models-overview_files/figure-pdf/fig-resids-intxn-1.pdf){#fig-resids-intxn}\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n:::{#def-stred}\n\n#### Standardized residuals\n\n$$r_i = \\frac{e_i}{\\widehat{SD}(e_i)}$$\n\n:::\n\nHence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,\n\n$$\nr_i \\siid N(0,1)\n$$\n\n### Marginal distributions of residuals\n\nTo look for problems with our model, we can check whether the residuals $e_i$ and standardized residuals $r_i$ look like they have the distributions that they are supposed to have, according to the model.\n\n---\n\n#### Standardized residuals in 
R\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nrstandard(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 1.15982 -0.92601 -0.87479 -0.34723 1.03507 -0.73473 -0.39901 1.43752 \n#> 9 10 11 12 13 14 15 16 \n#> -0.82539 0.30606 0.92807 -0.87616 1.91428 -0.86559 -0.16430 -1.46376 \n#> 17 18 19 20 21 22 23 24 \n#> -1.11016 1.09658 -0.06761 -1.46159 -0.28696 1.58040 1.26717 -0.19805\nresid(bw_lm2)/sigma(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 0.97593 -0.77920 -0.79802 -0.32962 0.98258 -0.70279 -0.38166 1.34357 \n#> 9 10 11 12 13 14 15 16 \n#> -0.77144 0.28606 0.86741 -0.69282 1.51858 -0.76244 -0.15331 -1.36584 \n#> 17 18 19 20 21 22 23 24 \n#> -1.06123 1.04825 -0.06463 -1.34341 -0.26376 1.45262 1.16471 -0.16954\n```\n:::\n\n\n\n\n\n\n\n\n::: notes\nThese are not quite the same, because R is doing something more complicated and precise to get the standard errors. Let's not worry about those details for now; the difference is pretty small in this case:\n\n:::\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nrstandard_compare_plot = \n tibble(\n x = resid(bw_lm2)/sigma(bw_lm2), \n y = rstandard(bw_lm2)) |> \n ggplot(aes(x = x, y = y)) +\n geom_point() + \n theme_bw() +\n coord_equal() + \n xlab(\"resid(bw_lm2)/sigma(bw_lm2)\") +\n ylab(\"rstandard(bw_lm2)\") +\n geom_abline(\n aes(\n intercept = 0,\n slope = 1, \n col = \"x=y\")) +\n labs(colour=\"\") +\n scale_colour_manual(values=\"red\")\n\nprint(rstandard_compare_plot)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-65-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\nLet's add these residuals to the `tibble` of our dataset:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n fitted_lm2 = fitted(bw_lm2),\n \n resid_lm2 = resid(bw_lm2),\n # resid_lm2 = weight - fitted_lm2,\n \n std_resid_lm2 = rstandard(bw_lm2),\n # std_resid_lm2 = resid_lm2 / sigma(bw_lm2)\n )\n\nbw |> \n select(\n sex,\n age,\n weight,\n fitted_lm2,\n resid_lm2,\n std_resid_lm2\n 
)\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| weight| fitted_lm2| resid_lm2| std_resid_lm2|\n|:------|---:|------:|----------:|---------:|-------------:|\n|female | 36| 2729| 2553| 176.27| 1.1598|\n|female | 36| 2412| 2553| -140.73| -0.9260|\n|female | 37| 2539| 2683| -144.13| -0.8748|\n|female | 38| 2754| 2814| -59.53| -0.3472|\n|female | 38| 2991| 2814| 177.47| 1.0351|\n|female | 39| 2817| 2944| -126.93| -0.7347|\n|female | 39| 2875| 2944| -68.93| -0.3990|\n|female | 40| 3317| 3074| 242.67| 1.4375|\n|female | 40| 2935| 3074| -139.33| -0.8254|\n|female | 40| 3126| 3074| 51.67| 0.3061|\n|female | 40| 3231| 3074| 156.67| 0.9281|\n|female | 42| 3210| 3335| -125.13| -0.8762|\n|male | 35| 2925| 2651| 274.28| 1.9143|\n|male | 36| 2625| 2763| -137.71| -0.8656|\n|male | 37| 2847| 2875| -27.69| -0.1643|\n|male | 37| 2628| 2875| -246.69| -1.4638|\n|male | 38| 2795| 2987| -191.67| -1.1102|\n|male | 38| 3176| 2987| 189.33| 1.0966|\n|male | 38| 2975| 2987| -11.67| -0.0676|\n|male | 40| 2968| 3211| -242.64| -1.4616|\n|male | 40| 3163| 3211| -47.64| -0.2870|\n|male | 40| 3473| 3211| 262.36| 1.5804|\n|male | 40| 3421| 3211| 210.36| 1.2672|\n|male | 41| 3292| 3323| -30.62| -0.1981|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n::: notes\n\nNow let's build histograms:\n\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid_marginal_hist = \n bw |> \n ggplot(aes(x = resid_lm2)) +\n geom_histogram()\n\nprint(resid_marginal_hist)\n```\n\n::: {.cell-output-display}\n![Marginal distribution of (nonstandardized) residuals](Linear-models-overview_files/figure-pdf/fig-marg-dist-resid-1.pdf){#fig-marg-dist-resid}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nHard to tell with this small amount of data, but I'm a bit concerned that the histogram doesn't show a bell-curve shape.\n\n:::\n\n---\n\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstd_resid_marginal_hist = \n bw |> \n ggplot(aes(x = std_resid_lm2)) +\n geom_histogram()\n\nprint(std_resid_marginal_hist)\n```\n\n::: 
{.cell-output-display}\n![Marginal distribution of standardized residuals](Linear-models-overview_files/figure-pdf/fig-marg-stresd-1.pdf){#fig-marg-stresd}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nThis looks similar, although the scale of the x-axis got narrower, because we divided by $\\hat\\sigma$ (roughly speaking).\n\nStill hard to tell if the distribution is Gaussian.\n\n:::\n\n---\n\n### QQ plot of standardized residuals\n\n::: notes\nAnother way to assess normality is the QQ plot of the standardized residuals versus normal quantiles:\n\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlibrary(ggfortify) \n# needed to make ggplot2::autoplot() work for `lm` objects\n\nqqplot_lm2_auto = \n bw_lm2 |> \n autoplot(\n which = 2, # options are 1:6; can do multiple at once\n ncol = 1) +\n theme_classic()\n\nprint(qqplot_lm2_auto)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-69-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nIf the Gaussian model were correct, these points should follow the dotted line.\n\nFig 2.4 panel (c) in @dobson4e is a little different; they didn't specify how they produced it, but other statistical analysis systems do things differently from R.\n\nSee also @dunn2018generalized [§3.5.4](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3#Sec14:~:text=3.5.4%20Q%E2%80%93Q%20Plots%20and%20Normality).\n\n:::\n\n---\n\n#### QQ plot - how it's built\n\n::: notes\nLet's construct it by hand:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = bw |> \n mutate(\n p = (rank(std_resid_lm2) - 1/2)/n(), # plotting positions (i - 1/2)/n (Hazen's formula; Blom's is (i - 3/8)/(n + 1/4))\n expected_quantiles_lm2 = qnorm(p)\n )\n\nqqplot_lm2 = \n bw |> \n ggplot(\n aes(\n x = expected_quantiles_lm2, \n y = std_resid_lm2, \n col = sex, \n shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n theme(legend.position='none') + # removing the plot legend\n ggtitle(\"Normal Q-Q\") +\n xlab(\"Theoretical Quantiles\") + \n ylab(\"Standardized 
residuals\")\n\n# find the expected line:\n\nps <- c(.25, .75) # reference probabilities\na <- quantile(rstandard(bw_lm2), ps) # empirical quantiles\nb <- qnorm(ps) # theoretical quantiles\n\nqq_slope = diff(a)/diff(b)\nqq_intcpt = a[1] - b[1] * qq_slope\n\nqqplot_lm2 = \n qqplot_lm2 +\n geom_abline(slope = qq_slope, intercept = qq_intcpt)\n\nprint(qqplot_lm2)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-70-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n### Conditional distributions of residuals\n\nIf our Gaussian linear regression model is correct, the residuals $e_i$ and standardized residuals $r_i$ should have:\n\n- an approximately Gaussian distribution, with:\n- a mean of 0\n- a constant variance\n\nThis should be true **for every** value of $x$.\n\n---\n\nIf we didn't correctly guess the functional form of the linear component of the mean, \n$$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 X_1 + ... + \\beta_p X_p$$\n\nthen the residuals might have nonzero mean.\n\nRegardless of whether we guessed the mean function correctly, the variance of the residuals might differ between values of $x$.\n\n---\n\n#### Residuals versus fitted values\n\n::: notes\nTo look for these issues, we can plot the residuals $e_i$ against the fitted values $\\hat y_i$ (@fig-bw_lm2-resid-vs-fitted).\n:::\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 1, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model (@eq-BW-lm-interact): residuals versus fitted values](Linear-models-overview_files/figure-pdf/fig-bw_lm2-resid-vs-fitted-1.pdf){#fig-bw_lm2-resid-vs-fitted}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nIf the model is correct, the blue line should stay flat and close to 0, and the cloud of dots should have the same vertical spread regardless of the fitted value.\n\nIf not, we probably need to change the functional form of the linear component of the mean, $$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 
X_1 + ... + \\beta_p X_p$$\n\n:::\n\n---\n\n\n#### Example: PLOS Medicine title length data\n\n(Adapted from @dobson4e, §6.7.1)\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(PLOS, package = \"dobson\")\nlibrary(ggplot2)\nfig1 = \n PLOS |> \n ggplot(\n aes(x = authors,\n y = nchar)\n ) +\n geom_point() +\n theme(legend.position = \"bottom\") +\n labs(col = \"\") +\n guides(col=guide_legend(ncol=3))\nfig1\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine* articles](Linear-models-overview_files/figure-pdf/fig-plos-1.pdf){#fig-plos}\n:::\n:::\n\n\n\n\n\n\n\n---\n\n##### Linear fit\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_linear = lm(\n formula = nchar ~ authors, \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig2 = fig1 +\n geom_smooth(\n method = \"lm\", \n fullrange = TRUE,\n aes(col = \"lm(y ~ x)\"))\nfig2\n\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-lm-1.pdf){#fig-plos-lm-1}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-lm-2.pdf){#fig-plos-lm-2}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with linear model fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Quadratic fit {.smaller}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_quad = lm(\n formula = nchar ~ authors + I(authors^2), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-quad .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig3 = \n fig2 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2),\n aes(col = \"lm(y ~ x + I(x^2))\")\n )\nfig3\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-lm-quad-1.pdf){#fig-plos-lm-quad-1}\n:::\n\n::: 
{.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-lm-quad-2.pdf){#fig-plos-lm-quad-2}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with quadratic model fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Linear versus quadratic fits\n\n\n\n\n\n\n\n::: {#fig-plos-lm-resid2 .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Linear](Linear-models-overview_files/figure-pdf/fig-plos-lm-resid2-1.pdf){#fig-plos-lm-resid2-1}\n:::\n\n::: {.cell-output-display}\n![Quadratic](Linear-models-overview_files/figure-pdf/fig-plos-lm-resid2-2.pdf){#fig-plos-lm-resid2-2}\n:::\n\nResiduals versus fitted plot for linear and quadratic fits to `PLOS` data\n:::\n\n\n\n\n\n\n\n---\n\n##### Cubic fit\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_cub = lm(\n formula = nchar ~ authors + I(authors^2) + I(authors^3), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-cubic .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig4 = \n fig3 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2) + I(x ^ 3),\n aes(col = \"lm(y ~ x + I(x^2) + I(x ^ 3))\")\n )\nfig4\n\nautoplot(lm_PLOS_cub, which = 1, ncol = 1)\n\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-lm-cubic-1.pdf){#fig-plos-lm-cubic-1}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-lm-cubic-2.pdf){#fig-plos-lm-cubic-2}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with cubic model fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Logarithmic fit\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_log = lm(nchar ~ log(authors), data = PLOS)\n```\n:::\n\n::: {#fig-plos-log .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig5 = fig4 + \n geom_smooth(\n method = \"lm\",\n fullrange = 
TRUE,\n formula = y ~ log(x),\n aes(col = \"lm(y ~ log(x))\")\n )\nfig5\n\nautoplot(lm_PLOS_log, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-log-1.pdf){#fig-plos-log-1}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-log-2.pdf){#fig-plos-log-2}\n:::\n\nlogarithmic fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Model selection {.smaller}\n\n\n\n\n\n\n\n::: {#tbl-plos-lin-quad-anova .cell tbl-cap='linear vs quadratic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_linear, lm_PLOS_quad)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|----:|------:|\n| 876| 947502| NA| NA| NA| NA|\n| 875| 880950| 1| 66552| 66.1| 0|\n:::\n:::\n\n::: {#tbl-plos-quad-cub-anova .cell tbl-cap='quadratic vs cubic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_quad, lm_PLOS_cub)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|-----:|------:|\n| 875| 880950| NA| NA| NA| NA|\n| 874| 865933| 1| 15018| 15.16| 1e-04|\n:::\n:::\n\n\n\n\n\n\n\n---\n\n##### AIC/BIC {.smaller}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_quad)\n#> [1] 8568\nAIC(lm_PLOS_cub)\n#> [1] 8555\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_cub)\n#> [1] 8555\nAIC(lm_PLOS_log)\n#> [1] 8544\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nBIC(lm_PLOS_cub)\n#> [1] 8578\nBIC(lm_PLOS_log)\n#> [1] 8558\n```\n:::\n\n\n\n\n\n\n\n---\n\n##### Extrapolation is dangerous\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfig_all = fig5 +\n xlim(0, 60)\nfig_all\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine*](Linear-models-overview_files/figure-pdf/fig-plos-multifit-1.pdf){#fig-plos-multifit}\n:::\n:::\n\n\n\n\n\n\n\n\n\n---\n\n#### Scale-location plot\n\n::: notes\nWe 
can also plot the square roots of the absolute values of the standardized residuals against the fitted values (@fig-bw-scale-loc).\n:::\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 3, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![Scale-location plot of `birthweight` data](Linear-models-overview_files/figure-pdf/fig-bw-scale-loc-1.pdf){#fig-bw-scale-loc}\n:::\n:::\n\n\n\n\n\n\n\n::: notes\nHere, the blue line doesn't need to be near 0, \nbut it should be flat. \nIf not, the residual variance $\\sigma^2$ might not be constant, \nand we might need to transform our outcome $Y$ \n(or use a model that allows non-constant variance).\n:::\n\n---\n\n\n#### Residuals versus leverage\n\n::: notes\n\nWe can also plot our standardized residuals against \"leverage\", which roughly speaking is a measure of how unusual each $x_i$ value is. Very unusual $x_i$ values can have extreme effects on the model fit, so we might want to remove those observations as outliers, particularly if they have large residuals.\n\n:::\n\n\n\n\n\n\n\n\n::: {.cell labels='fig-bw_lm2_resid-vs-leverage'}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 5, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model with interactions (@eq-BW-lm-interact): residuals versus leverage](Linear-models-overview_files/figure-pdf/unnamed-chunk-89-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nThe blue line should be relatively flat and close to 0 here.\n:::\n\n---\n\n### Diagnostics constructed by hand\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2),\n residlm2 = weight - predlm2,\n std_resid = residlm2 / sigma(bw_lm2),\n # std_resid_builtin = rstandard(bw_lm2), # uses leverage\n sqrt_abs_std_resid = std_resid |> abs() |> sqrt()\n \n )\n\n```\n:::\n\n\n\n\n\n\n\n\n##### Residuals vs fitted\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nresid_vs_fit = bw |> \n ggplot(\n aes(x = predlm2, y = 
residlm2, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n\n```\n:::\n\n\n\n\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-92-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-93-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n##### Standardized residuals vs fitted\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw |> \n ggplot(\n aes(x = predlm2, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-94-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n##### Standardized residuals vs gestational age\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw |> \n ggplot(\n aes(x = age, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-95-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n##### `sqrt(abs(rstandard()))` vs fitted\n\nCompare with `autoplot(bw_lm2, 3)`\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n\nbw |> \n ggplot(\n aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-96-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n## Model selection\n\n(adapted from @dobson4e §6.3.3; for more information on prediction, see @james2013introduction and @rms2e).\n\n::: notes\nIf we have a lot of covariates in our dataset, we 
might want to choose a small subset to use in our model.\n\nThere are a few possible metrics to consider for choosing a \"best\" model.\n:::\n\n### Mean squared error\n\nWe might want to minimize the **mean squared error**, $\\text E[(y-\\hat y)^2]$, for new observations that weren't in our data set when we fit the model.\n\nUnfortunately, $$\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2$$ gives a biased estimate of $\\text E[(y-\\hat y)^2]$ for new data. If we want an unbiased estimate, we will have to be clever.\n\n---\n\n#### Cross-validation\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-98-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n##### comparing metrics\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> 
as.numeric())\n\ncompare_results\n```\n\n::: {.cell-output-display}\n\n\n|model | cvRMSE| r.squared| adj.r.squared| trainRMSE| loglik|\n|:-------|------:|---------:|-------------:|---------:|------:|\n|full | 6.723| 0.4805| 0.3831| 5.956| -61.84|\n|reduced | 6.698| 0.4454| 0.3802| 5.971| -62.49|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(full_model, reduced_model)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|-----:|--:|---------:|-----:|------:|\n| 16| 567.7| NA| NA| NA| NA|\n| 17| 606.0| -1| -38.36| 1.081| 0.3139|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n#### stepwise regression\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 30.301 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. 
\n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n#### Lasso\n\n$$\\arg \\min_{\\theta} \\left\\{ -\\llik(\\th) + \\lambda \\sum_{j=1}^p|\\beta_j| \\right\\}$$\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(fit, xvar = 'lambda')\n\n```\n\n::: {.cell-output-display}\n![Lasso selection](Linear-models-overview_files/figure-pdf/fig-carbs-lasso-1.pdf){#fig-carbs-lasso}\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncvfit = cv.glmnet(x,y)\nplot(cvfit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-104-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 33.943\n#> age . 
\n#> weight -0.124\n#> protein 1.094\n```\n:::\n\n\n\n\n\n\n\n\n\n## Categorical covariates with more than two levels\n\n### Example: `birthweight`\n\nIn the birthweight example, the variable `sex` had only two observed values:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nunique(bw$sex)\n#> [1] female male \n#> Levels: female male\n```\n:::\n\n\n\n\n\n\n\n\nIf there are more than two observed values, we can't just use a single variable with 0s and 1s.\n\n### \n\n:::{.notes}\nFor example, @tbl-iris-data shows the \n[(in)famous](https://www.meganstodel.com/posts/no-to-iris/) \n`iris` data (@anderson1935irises), \nand @tbl-iris-summary provides summary statistics. \nThe data include three species: \"setosa\", \"versicolor\", and \"virginica\".\n:::\n\n\n\n\n\n\n\n\n::: {#tbl-iris-data .cell tbl-cap='The `iris` data'}\n\n```{.r .cell-code}\nhead(iris)\n```\n\n::: {.cell-output-display}\n\n\n| Sepal.Length| Sepal.Width| Petal.Length| Petal.Width|Species |\n|------------:|-----------:|------------:|-----------:|:-------|\n| 5.1| 3.5| 1.4| 0.2|setosa |\n| 4.9| 3.0| 1.4| 0.2|setosa |\n| 4.7| 3.2| 1.3| 0.2|setosa |\n| 4.6| 3.1| 1.5| 0.2|setosa |\n| 5.0| 3.6| 1.4| 0.2|setosa |\n| 5.4| 3.9| 1.7| 0.4|setosa |\n:::\n:::\n\n::: {#tbl-iris-summary .cell tbl-cap='Summary statistics for the `iris` data'}\n\n```{.r .cell-code}\nlibrary(table1)\ntable1(\n x = ~ . 
| Species,\n data = iris,\n overall = FALSE\n)\n```\n\n::: {.cell-output-display}\n\n\\begin{tabular}[t]{llll}\n\\toprule\n  & setosa & versicolor & virginica\\\\\n\\midrule\n & (N=50) & (N=50) & (N=50)\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Sepal.Length}}\\\\\n\\hspace{1em}Mean (SD) & 5.01 (0.352) & 5.94 (0.516) & 6.59 (0.636)\\\\\n\\hspace{1em}Median [Min, Max] & 5.00 [4.30, 5.80] & 5.90 [4.90, 7.00] & 6.50 [4.90, 7.90]\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Sepal.Width}}\\\\\n\\hspace{1em}Mean (SD) & 3.43 (0.379) & 2.77 (0.314) & 2.97 (0.322)\\\\\n\\hspace{1em}Median [Min, Max] & 3.40 [2.30, 4.40] & 2.80 [2.00, 3.40] & 3.00 [2.20, 3.80]\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Petal.Length}}\\\\\n\\hspace{1em}Mean (SD) & 1.46 (0.174) & 4.26 (0.470) & 5.55 (0.552)\\\\\n\\hspace{1em}Median [Min, Max] & 1.50 [1.00, 1.90] & 4.35 [3.00, 5.10] & 5.55 [4.50, 6.90]\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Petal.Width}}\\\\\n\\hspace{1em}Mean (SD) & 0.246 (0.105) & 1.33 (0.198) & 2.03 (0.275)\\\\\n\\hspace{1em}Median [Min, Max] & 0.200 [0.100, 0.600] & 1.30 [1.00, 1.80] & 2.00 [1.40, 2.50]\\\\\n\\bottomrule\n\\end{tabular}\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\nIf we want to model `Sepal.Length` by species, we could create a variable $X$ that represents \"setosa\" as $X=1$, \"virginica\" as $X=2$, and \"versicolor\" as $X=3$.\n\n\n\n\n\n\n\n\n::: {#tbl-numeric-coding .cell tbl-cap='`iris` data with numeric coding of species'}\n\n```{.r .cell-code}\ndata(iris) # this step is not always necessary, but ensures you're starting \n# from the original version of a dataset stored in a loaded package\n\niris = \n iris |> \n tibble() |>\n mutate(\n X = case_when(\n Species == \"setosa\" ~ 1,\n Species == \"virginica\" ~ 2,\n Species == \"versicolor\" ~ 3\n )\n )\n\niris |> \n distinct(Species, X)\n```\n\n::: {.cell-output-display}\n\n\n|Species | X|\n|:----------|--:|\n|setosa | 1|\n|versicolor | 3|\n|virginica | 
2|\n:::\n:::\n\n\n\n\n\n\n\n\nThen we could fit a model like:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-numeric-species .cell tbl-cap='Model of `iris` data with numeric coding of `Species`'}\n\n```{.r .cell-code}\niris_lm1 = lm(Sepal.Length ~ X, data = iris)\niris_lm1 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(148) | p |\n|:-----------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 4.91 | 0.16 | (4.60, 5.23) | 30.83 | < .001 |\n|X | 0.47 | 0.07 | (0.32, 0.61) | 6.30 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n### Let's see how that model looks:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot1 = iris |> \n ggplot(\n aes(\n x = X, \n y = Sepal.Length)\n ) +\n geom_point(alpha = .1) +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) +\n theme_bw(base_size = 18)\nprint(iris_plot1)\n\n```\n\n::: {.cell-output-display}\n![Model of `iris` data with numeric coding of `Species`](Linear-models-overview_files/figure-pdf/fig-iris-numeric-species-model-1.pdf){#fig-iris-numeric-species-model}\n:::\n:::\n\n\n\n\n\n\n\n\nWe have forced the model to use a straight line for the three estimated means. 
This is probably not a good idea: `Species` is a nominal variable, so the numeric codes impose an arbitrary ordering and equal spacing.\n\n### Let's see what R does with categorical variables by default:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-model-factor1 .cell tbl-cap='Model of `iris` data with `Species` as a categorical variable'}\n\n```{.r .cell-code}\niris_lm2 = lm(Sepal.Length ~ Species, data = iris)\niris_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 0.93 | 0.10 | (0.73, 1.13) | 9.03 | < .001 |\n|Species (virginica) | 1.58 | 0.10 | (1.38, 1.79) | 15.37 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n### Re-parametrize with no intercept\n\nIf you don't want the default parametrization (an intercept for the reference level plus offsets for the other levels), you can remove the intercept with \"-1\", as we've seen previously:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-no-intcpt .cell}\n\n```{.r .cell-code}\niris.lm2b = lm(Sepal.Length ~ Species - 1, data = iris)\niris.lm2b |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|Species (setosa) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 5.94 | 0.07 | (5.79, 6.08) | 81.54 | < .001 |\n|Species (virginica) | 6.59 | 0.07 | (6.44, 6.73) | 90.49 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n### Let's see what these new models look like:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot2 = \n iris |> \n mutate(\n predlm2 = predict(iris_lm2)) |> \n arrange(X) |> \n ggplot(aes(x = X, y = Sepal.Length)) +\n geom_point(alpha = .1) +\n geom_line(aes(y = predlm2), col = \"red\") +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) + \n theme_bw(base_size = 18)\n\nprint(iris_plot2)\n\n```\n\n::: 
{.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/fig-iris-no-intcpt-1.pdf){#fig-iris-no-intcpt}\n:::\n:::\n\n\n\n\n\n\n\n\n### Let's see how R did that:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-model-matrix-factor .cell}\n\n```{.r .cell-code}\nformula(iris_lm2)\n#> Sepal.Length ~ Species\nmodel.matrix(iris_lm2) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| (Intercept)| Speciesversicolor| Speciesvirginica|\n|-----------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 1| 1| 0|\n| 1| 0| 1|\n:::\n:::\n\n\n\n\n\n\n\n\nThis is called a \"corner point parametrization\": the intercept is the mean for the reference level (\"setosa\"), and each other coefficient is an offset from that mean.\n\n\n\n\n\n\n\n\n::: {#tbl-iris-group-point-parameterization .cell}\n\n```{.r .cell-code}\nformula(iris.lm2b)\n#> Sepal.Length ~ Species - 1\nmodel.matrix(iris.lm2b) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| Speciessetosa| Speciesversicolor| Speciesvirginica|\n|-------------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 0| 1| 0|\n| 0| 0| 1|\n:::\n:::\n\n\n\n\n\n\n\n\nThis can be called a \"group point parametrization\": each coefficient is the mean for one species.\n\nThere are more options; see @dobson4e §6.4.1 and the \n[`codingMatrices` package](https://CRAN.R-project.org/package=codingMatrices) \n[vignette](https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf) \n(@venablescodingMatrices).\n\n## Ordinal covariates\n\n(cf. 
@dobson4e §2.4.4)\n\n---\n\n::: notes\nWe can create ordinal variables in R using the `ordered()` function^[or equivalently, `factor(ordered = TRUE)`].\n:::\n\n:::{#exm-ordinal-variable}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl = paste0(\n \"https://regression.ucsf.edu/sites/g/files/tkssra6706/\",\n \"f/wysiwyg/home/data/hersdata.dta\")\nlibrary(haven)\nhers = read_dta(url)\n```\n:::\n\n\n\n::: {#tbl-HERS .cell tbl-cap='HERS dataset'}\n\n```{.r .cell-code}\nhers |> head()\n```\n\n::: {.cell-output-display}\n\n\n| HT| age| raceth| nonwhite| smoking| drinkany| exercise| physact| globrat| poorfair| medcond| htnmeds| statins| diabetes| dmpills| insulin| weight| BMI| waist| WHR| glucose| weight1| BMI1| waist1| WHR1| glucose1| tchol| LDL| HDL| TG| tchol1| LDL1| HDL1| TG1| SBP| DBP| age10|\n|--:|---:|------:|--------:|-------:|--------:|--------:|-------:|-------:|--------:|-------:|-------:|-------:|--------:|-------:|-------:|------:|-----:|-----:|-----:|-------:|-------:|-----:|------:|-----:|--------:|-----:|-----:|---:|---:|------:|-----:|----:|---:|---:|---:|-----:|\n| 0| 70| 2| 1| 0| 0| 0| 5| 3| 0| 0| 1| 1| 0| 0| 0| 73.8| 23.69| 96.0| 0.932| 84| 73.6| 23.63| 93.0| 0.912| 94| 189| 122.4| 52| 73| 201| 137.6| 48| 77| 138| 78| 7.0|\n| 0| 62| 2| 1| 0| 0| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 70.9| 28.62| 93.0| 0.964| 111| 73.4| 28.89| 95.0| 0.964| 78| 307| 241.6| 44| 107| 216| 150.6| 48| 87| 118| 70| 6.2|\n| 1| 69| 1| 0| 0| 0| 0| 3| 3| 0| 0| 1| 0| 1| 0| 0| 102.0| 42.51| 110.2| 0.782| 114| 96.1| 40.73| 103.0| 0.774| 98| 254| 166.2| 57| 154| 254| 156.0| 66| 160| 134| 78| 6.9|\n| 0| 64| 1| 0| 1| 1| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 64.4| 24.39| 87.0| 0.877| 94| 58.6| 22.52| 77.0| 0.802| 93| 204| 116.2| 56| 159| 207| 122.6| 57| 137| 152| 72| 6.4|\n| 0| 65| 1| 0| 0| 0| 0| 2| 3| 0| 0| 0| 0| 0| 0| 0| 57.9| 21.90| 77.0| 0.794| 101| 58.9| 22.28| 76.5| 0.757| 92| 214| 150.6| 42| 107| 235| 172.2| 35| 139| 175| 95| 6.5|\n| 1| 68| 2| 1| 0| 1| 0| 3| 3| 0| 0| 0| 0| 0| 0| 0| 60.9| 
29.05| 96.0| 1.000| 116| 57.7| 27.52| 86.0| 0.910| 115| 212| 137.8| 52| 111| 202| 126.6| 53| 112| 174| 98| 6.8|\n:::\n:::\n\n\n\n\n\n\n\n\n\n:::\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# C(contr = codingMatrices::contr.diff)\n\n```\n:::\n", + "markdown": "---\ndf-print: paged\n---\n\n\n\n\n\n\n\n\n# Linear (Gaussian) Models\n\n---\n\n\n\n\n---\n\n### Configuring R {.unnumbered}\n\nFunctions from these packages will be used throughout this document:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(conflicted) # check for conflicting function definitions\n# library(printr) # inserts help-file output into markdown output\nlibrary(rmarkdown) # Convert R Markdown documents into a variety of formats.\nlibrary(pander) # format tables for markdown\nlibrary(ggplot2) # graphics\nlibrary(ggeasy) # help with graphics\nlibrary(ggfortify) # help with graphics\nlibrary(dplyr) # manipulate data\nlibrary(tibble) # `tibble`s extend `data.frame`s\nlibrary(magrittr) # `%>%` and other additional piping tools\nlibrary(haven) # import Stata files\nlibrary(knitr) # format R output for markdown\nlibrary(tidyr) # Tools to help to create tidy data\nlibrary(plotly) # interactive graphics\nlibrary(dobson) # datasets from Dobson and Barnett 2018\nlibrary(parameters) # format model output tables for markdown\nlibrary(haven) # import Stata files\nlibrary(latex2exp) # use LaTeX in R code (for figures and tables)\nlibrary(fs) # filesystem path manipulations\nlibrary(survival) # survival analysis\nlibrary(survminer) # survival analysis graphics\nlibrary(KMsurv) # datasets from Klein and Moeschberger\nlibrary(parameters) # format model output tables for\nlibrary(webshot2) # convert interactive content to static for pdf\nlibrary(forcats) # functions for categorical variables (\"factors\")\nlibrary(stringr) # functions for dealing with strings\nlibrary(lubridate) # functions for dealing with dates and times\n```\n:::\n\n\n\n\n\n\n\nHere are some R settings I use in this 
document:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(list = ls()) # delete any data that's already loaded into R\n\nconflicts_prefer(dplyr::filter)\nggplot2::theme_set(\n ggplot2::theme_bw() + \n # ggplot2::labs(col = \"\") +\n ggplot2::theme(\n legend.position = \"bottom\",\n text = ggplot2::element_text(size = 12, family = \"serif\")))\n\nknitr::opts_chunk$set(message = FALSE)\noptions('digits' = 4)\n\npanderOptions(\"big.mark\", \",\")\npander::panderOptions(\"table.emphasize.rownames\", FALSE)\npander::panderOptions(\"table.split.table\", Inf)\nconflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default\nlegend_text_size = 9\n```\n:::\n\n\n\n\n\n\n\n\n\n\n\n\\providecommand{\\cbl}[1]{\\left\\{#1\\right.}\n\\providecommand{\\cb}[1]{\\left\\{#1\\right\\}}\n\\providecommand{\\paren}[1]{\\left(#1\\right)}\n\\providecommand{\\sb}[1]{\\left[#1\\right]}\n\\def\\pr{\\text{p}}\n\\def\\am{\\arg \\max}\n\\def\\argmax{\\arg \\max}\n\\def\\p{\\text{p}}\n\\def\\P{\\text{P}}\n\\def\\ph{\\hat{\\text{p}}}\n\\def\\hp{\\hat{\\text{p}}}\n\\def\\ga{\\alpha}\n\\def\\b{\\beta}\n\\providecommand{\\floor}[1]{\\left \\lfloor{#1}\\right \\rfloor}\n\\providecommand{\\ceiling}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\providecommand{\\ceil}[1]{\\left \\lceil{#1}\\right \\rceil}\n\\def\\Ber{\\text{Ber}}\n\\def\\Bernoulli{\\text{Bernoulli}}\n\\def\\Pois{\\text{Pois}}\n\\def\\Poisson{\\text{Poisson}}\n\\def\\Gaus{\\text{Gaussian}}\n\\def\\Normal{\\text{N}}\n\\def\\NB{\\text{NegBin}}\n\\def\\NegBin{\\text{NegBin}}\n\\def\\vbeta{\\vec \\beta}\n\\def\\vb{\\vec 
\\b}\n\\def\\v0{\\vec{0}}\n\\def\\gb{\\beta}\n\\def\\gg{\\gamma}\n\\def\\gd{\\delta}\n\\def\\eps{\\varepsilon}\n\\def\\om{\\omega}\n\\def\\m{\\mu}\n\\def\\s{\\sigma}\n\\def\\l{\\lambda}\n\\def\\gs{\\sigma}\n\\def\\gm{\\mu}\n\\def\\M{\\text{M}}\n\\def\\gM{\\text{M}}\n\\def\\Mu{\\text{M}}\n\\def\\cd{\\cdot}\n\\def\\cds{\\cdots}\n\\def\\lds{\\ldots}\n\\def\\eqdef{\\stackrel{\\text{def}}{=}}\n\\def\\defeq{\\stackrel{\\text{def}}{=}}\n\\def\\hb{\\hat \\beta}\n\\def\\hl{\\hat \\lambda}\n\\def\\hy{\\hat y}\n\\def\\yh{\\hat y}\n\\def\\V{{\\text{Var}}}\n\\def\\hs{\\hat \\sigma}\n\\def\\hsig{\\hat \\sigma}\n\\def\\hS{\\hat \\Sigma}\n\\def\\hSig{\\hat \\Sigma}\n\\def\\hSigma{\\hat \\Sigma}\n\\def\\hSurv{\\hat{S}}\n\\providecommand{\\hSurvf}[1]{\\hat{S}\\paren{#1}}\n\\def\\dist{\\ \\sim \\ }\n\\def\\ddist{\\ \\dot{\\sim} \\ }\n\\def\\dsim{\\ \\dot{\\sim} \\ }\n\\def\\za{z_{1 - \\frac{\\alpha}{2}}}\n\\def\\cirad{\\za \\cdot \\hse{\\hb}}\n\\def\\ci{\\hb {\\color{red}\\pm} \\cirad}\n\\def\\th{\\theta}\n\\def\\Th{\\Theta}\n\\def\\xbar{\\bar{x}}\n\\def\\hth{\\hat\\theta}\n\\def\\hthml{\\hth_{\\text{ML}}}\n\\def\\ba{\\begin{aligned}}\n\\def\\ea{\\end{aligned}}\n\\def\\ind{⫫}\n\\def\\indpt{⫫}\n\\def\\all{\\forall}\n\\def\\iid{\\text{iid}}\n\\def\\ciid{\\text{ciid}}\n\\def\\simind{\\ \\sim_{\\ind}\\ }\n\\def\\siid{\\ \\sim_{\\iid}\\ }\n\\def\\simiid{\\siid}\n\\def\\distiid{\\siid}\n\\def\\tf{\\therefore}\n\\def\\Lik{\\mathcal{L}}\n\\def\\llik{\\ell}\n\\providecommand{\\llikf}[1]{\\llik \\paren{#1}}\n\\def\\score{\\ell'}\n\\providecommand{\\scoref}[1]{\\score \\paren{#1}}\n\\def\\hess{\\ell''}\n\\def\\hessian{\\ell''}\n\\providecommand{\\hessf}[1]{\\hess \\paren{#1}}\n\\providecommand{\\hessianf}[1]{\\hess 
\\paren{#1}}\n\\providecommand{\\starf}[1]{#1^*}\n\\def\\lik{\\ell}\n\\providecommand{\\est}[1]{\\widehat{#1}}\n\\providecommand{\\esttmp}[1]{{\\widehat{#1}}^*}\n\\def\\esttmpl{\\esttmp{\\lambda}}\n\\def\\cR{\\mathcal{R}}\n\\def\\range{\\mathcal{R}}\n\\def\\Range{\\mathcal{R}}\n\\providecommand{\\rangef}[1]{\\cR(#1)}\n\\def\\~{\\approx}\n\\def\\dapp{\\dot\\approx}\n\\providecommand{\\red}[1]{{\\color{red}#1}}\n\\providecommand{\\deriv}[1]{\\frac{\\partial}{\\partial #1}}\n\\providecommand{\\derivf}[2]{\\frac{\\partial #1}{\\partial #2}}\n\\providecommand{\\blue}[1]{{\\color{blue}#1}}\n\\providecommand{\\green}[1]{{\\color{green}#1}}\n\\providecommand{\\hE}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hExp}[1]{\\hat{\\text{E}}\\sb{#1}}\n\\providecommand{\\hmu}[1]{\\hat{\\mu}\\sb{#1}}\n\\def\\Expp{\\mathbb{E}}\n\\def\\Ep{\\mathbb{E}}\n\\def\\expit{\\text{expit}}\n\\providecommand{\\expitf}[1]{\\expit\\cb{#1}}\n\\providecommand{\\dexpitf}[1]{\\expit'\\cb{#1}}\n\\def\\logit{\\text{logit}}\n\\providecommand{\\logitf}[1]{\\logit\\cb{#1}}\n\\providecommand{\\E}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Ef}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Exp}[1]{\\mathbb{E}\\sb{#1}}\n\\providecommand{\\Expf}[1]{\\mathbb{E}\\sb{#1}}\n\\def\\Varr{\\text{Var}}\n\\providecommand{\\var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\varf}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Var}[1]{\\text{Var}\\paren{#1}}\n\\providecommand{\\Varf}[1]{\\text{Var}\\paren{#1}}\n\\def\\Covt{\\text{Cov}}\n\\providecommand{\\covh}[1]{\\widehat{\\text{Cov}}\\paren{#1}}\n\\providecommand{\\Cov}[1]{\\Covt \\paren{#1}}\n\\providecommand{\\Covf}[1]{\\Covt 
\\paren{#1}}\n\\def\\varht{\\widehat{\\text{Var}}}\n\\providecommand{\\varh}[1]{\\varht\\paren{#1}}\n\\providecommand{\\varhf}[1]{\\varht\\paren{#1}}\n\\providecommand{\\vc}[1]{\\boldsymbol{#1}}\n\\providecommand{\\sd}[1]{\\text{sd}\\paren{#1}}\n\\providecommand{\\SD}[1]{\\text{SD}\\paren{#1}}\n\\providecommand{\\hSD}[1]{\\widehat{\\text{SD}}\\paren{#1}}\n\\providecommand{\\se}[1]{\\text{se}\\paren{#1}}\n\\providecommand{\\hse}[1]{\\hat{\\text{se}}\\paren{#1}}\n\\providecommand{\\SE}[1]{\\text{SE}\\paren{#1}}\n\\providecommand{\\HSE}[1]{\\widehat{\\text{SE}}\\paren{#1}}\n\\renewcommand{\\log}[1]{\\text{log}\\cb{#1}}\n\\providecommand{\\logf}[1]{\\text{log}\\cb{#1}}\n\\def\\dlog{\\text{log}'}\n\\providecommand{\\dlogf}[1]{\\dlog \\cb{#1}}\n\\renewcommand{\\exp}[1]{\\text{exp}\\cb{#1}}\n\\providecommand{\\expf}[1]{\\exp{#1}}\n\\def\\dexp{\\text{exp}'}\n\\providecommand{\\dexpf}[1]{\\dexp \\cb{#1}}\n\\providecommand{\\e}[1]{\\text{e}^{#1}}\n\\providecommand{\\ef}[1]{\\text{e}^{#1}}\n\\providecommand{\\inv}[1]{\\paren{#1}^{-1}}\n\\providecommand{\\invf}[1]{\\paren{#1}^{-1}}\n\\def\\oinf{I}\n\\def\\Nat{\\mathbb{N}}\n\\providecommand{\\oinff}[1]{\\oinf\\paren{#1}}\n\\def\\einf{\\mathcal{I}}\n\\providecommand{\\einff}[1]{\\einf\\paren{#1}}\n\\def\\heinf{\\hat{\\einf}}\n\\providecommand{\\heinff}[1]{\\heinf \\paren{#1}}\n\\providecommand{\\1}[1]{\\mathbb{1}_{#1}}\n\\providecommand{\\set}[1]{\\cb{#1}}\n\\providecommand{\\pf}[1]{\\p 
\\paren{#1}}\n\\providecommand{\\Bias}[1]{\\text{Bias}\\paren{#1}}\n\\providecommand{\\bias}[1]{\\text{Bias}\\paren{#1}}\n\\def\\ss{\\sigma^2}\n\\providecommand{\\ssqf}[1]{\\sigma^2\\paren{#1}}\n\\providecommand{\\mselr}[1]{\\text{MSE}\\paren{#1}}\n\\providecommand{\\maelr}[1]{\\text{MAE}\\paren{#1}}\n\\providecommand{\\abs}[1]{\\left|#1\\right|}\n\\providecommand{\\sqf}[1]{\\paren{#1}^2}\n\\providecommand{\\sq}{^2}\n\\def\\err{\\eps}\n\\providecommand{\\erf}[1]{\\err\\paren{#1}}\n\\renewcommand{\\vec}[1]{\\tilde{#1}}\n\\providecommand{\\v}[1]{\\vec{#1}}\n\\providecommand{\\matr}[1]{\\mathbf{#1}}\n\\def\\mX{\\matr{X}}\n\\def\\mx{\\matr{x}}\n\\def\\vx{\\vec{x}}\n\\def\\vX{\\vec{X}}\n\\def\\vy{\\vec{y}}\n\\def\\vY{\\vec{Y}}\n\\def\\vpi{\\vec{\\pi}}\n\\providecommand{\\mat}[1]{\\mathbf{#1}}\n\\providecommand{\\dsn}[1]{#1_1, \\ldots, #1_n}\n\\def\\X1n{\\dsn{X}}\n\\def\\Xin{\\dsn{X}}\n\\def\\x1n{\\dsn{x}}\n\\def\\'{^{\\top}}\n\\def\\dpr{\\cdot}\n\\def\\Xx1n{X_1=x_1, \\ldots, X_n = x_n}\n\\providecommand{\\dsvn}[2]{#1_1=#2_1, \\ldots, #1_n = #2_n}\n\\providecommand{\\sumn}[1]{\\sum_{#1=1}^n}\n\\def\\sumin{\\sum_{i=1}^n}\n\\def\\sumi1n{\\sum_{i=1}^n}\n\\def\\prodin{\\prod_{i=1}^n}\n\\def\\prodi1n{\\prod_{i=1}^n}\n\\providecommand{\\lp}[2]{#1 \\' \\beta}\n\\def\\odds{\\omega}\n\\def\\OR{\\text{OR}}\n\\def\\logodds{\\eta}\n\\def\\oddst{\\text{odds}}\n\\def\\probst{\\text{probs}}\n\\def\\probt{\\text{probt}}\n\\def\\probit{\\text{probit}}\n\\providecommand{\\oddsf}[1]{\\oddst\\cb{#1}}\n\\providecommand{\\doddsf}[1]{{\\oddst}'\\cb{#1}}\n\\def\\oddsinv{\\text{invodds}}\n\\providecommand{\\oddsinvf}[1]{\\oddsinv\\cb{#1}}\n\\def\\invoddsf{\\oddsinvf}\n\\providecommand{\\doddsinvf}[1]{{\\oddsinv}'\\cb{#1}}\n\\def\\dinvoddsf{\\doddsinvf}\n\\def\\haz{h}\n\\def\\cuhaz{H}\n\\def\\incidence{\\bar{\\haz}}\n\\def\\phaz{\\Expf{\\haz}}\n\n\n\n\n\n\n\n\n\n```{=html}\n\n```\n\n\n\n\n\n\n\n\n\n---\n\n:::{.callout-note}\nThis content is adapted from:\n\n- @dobson4e, Chapters 2-6\n- 
@dunn2018generalized, Chapters 2-3\n- @vittinghoff2e, Chapter 4\n\n:::\n\nThere are numerous textbooks specifically for linear regression, including:\n\n- @kutner2005applied: used for UCLA Biostatistics MS level linear models class\n- @chatterjee2015regression: used for Stanford MS-level linear models class\n- @seber2012linear: used for UCLA Biostatistics PhD level linear models class and UC Davis STA 108.\n- @kleinbaum2014applied: same first author as @kleinbaum2010logistic and @kleinbaum2012survival\n- @weisberg2005applied\n- *Linear Models with R* [@Faraway2025-io]\n\n\n## Overview\n\n### Why this course includes linear regression {.smaller}\n\n:::{.fragment .fade-in-then-semi-out}\n* This course is about *generalized linear models* (for non-Gaussian outcomes)\n:::\n\n:::{.fragment .fade-in-then-semi-out}\n* UC Davis STA 108 (\"Applied Statistical Methods: Regression Analysis\") is a prerequisite for this course, so everyone here should have some understanding of linear regression already.\n:::\n\n:::{.fragment .fade-in}\n* We will review linear regression to:\n - make sure everyone is caught up\n - to provide an epidemiological perspective on model interpretation.\n:::\n\n### Chapter overview\n\n* @sec-understand-LMs: how to interpret linear regression models\n\n* @sec-est-LMs: how to estimate linear regression models\n\n* @sec-infer-LMs: how to quantify uncertainty about our estimates\n\n* @sec-diagnose-LMs: how to tell if your model is insufficiently complex\n\n\n## Understanding Gaussian Linear Regression Models {#sec-understand-LMs}\n\n### Motivating example: birthweights and gestational age {.smaller}\n\nSuppose we want to learn about the distributions of birthweights (*outcome* $Y$) for (human) babies born at different gestational ages (*covariate* $A$) and with different chromosomal sexes (*covariate* $S$) (@dobson4e Example 2.2.2).\n\n::::: {.panel-tabset}\n\n#### Data as table\n\n\n\n\n\n\n\n\n::: {#tbl-birthweight-data1 .cell tbl-cap='`birthweight` 
data (@dobson4e Example 2.2.2)'}\n\n```{.r .cell-code}\nlibrary(dobson)\ndata(\"birthweight\", package = \"dobson\")\nbirthweight |> knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n| boys gestational age| boys weight| girls gestational age| girls weight|\n|--------------------:|-----------:|---------------------:|------------:|\n| 40| 2968| 40| 3317|\n| 38| 2795| 36| 2729|\n| 40| 3163| 40| 2935|\n| 35| 2925| 38| 2754|\n| 36| 2625| 42| 3210|\n| 37| 2847| 39| 2817|\n| 41| 3292| 40| 3126|\n| 40| 3473| 37| 2539|\n| 37| 2628| 36| 2412|\n| 38| 3176| 38| 2991|\n| 40| 3421| 39| 2875|\n| 38| 2975| 40| 3231|\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n#### Reshape data for graphing\n\n\n\n\n\n\n\n\n::: {#tbl-birthweight-data2 .cell tbl-cap='`birthweight` data reshaped'}\n\n```{.r .cell-code}\nbw = \n birthweight |> \n pivot_longer(\n cols = everything(),\n names_to = c(\"sex\", \".value\"),\n names_sep = \"s \"\n ) |> \n rename(age = `gestational age`) |> \n mutate(\n sex = sex |> \n case_match(\n \"boy\" ~ \"male\",\n \"girl\" ~ \"female\") |> \n factor(levels = c(\"female\", \"male\")))\n\nbw\n```\n\n::: {.cell-output-display}\n\n|sex | age| weight|\n|:------|---:|------:|\n|male | 40| 2968|\n|female | 40| 3317|\n|male | 38| 2795|\n|female | 36| 2729|\n|male | 40| 3163|\n|female | 40| 2935|\n|male | 35| 2925|\n|female | 38| 2754|\n|male | 36| 2625|\n|female | 42| 3210|\n|male | 37| 2847|\n|female | 39| 2817|\n|male | 41| 3292|\n|female | 40| 3126|\n|male | 40| 3473|\n|female | 37| 2539|\n|male | 37| 2628|\n|female | 36| 2412|\n|male | 38| 3176|\n|female | 38| 2991|\n|male | 40| 3421|\n|female | 39| 2875|\n|male | 38| 2975|\n|female | 40| 3231|\n\n:::\n:::\n\n\n\n\n\n\n\n\n#### Data as graph\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot1 = bw |> \n ggplot(aes(\n x = age, \n y = weight,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"Birthweight (grams)\") +\n theme(legend.position = \"bottom\") +\n # 
expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot1 + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![`birthweight` data (@dobson4e Example 2.2.2)](Linear-models-overview_files/figure-pdf/fig-plot-birthweight1-1.pdf){#fig-plot-birthweight1}\n:::\n:::\n\n\n\n\n\n\n\n\n:::::\n\n---\n\n#### Data notation\n\nLet's define some notation to represent this data.\n\n- $Y$: birthweight (measured in grams)\n- $S$: chromosomal sex: \"male\" (XY) or \"female\" (XX)\n- $M$: indicator variable for $S$ = \"male\"^[$M$ is implicitly a deterministic function of $S$]\n- $M = 0$ if female (XX)\n- $M = 1$ if male (XY)\n- $F$: indicator variable for $S$ = \"female\"^[$F$ is implicitly a deterministic function of $S$]\n- $F = 1$ if female (XX)\n- $F = 0$ if male (XY)\n\n- $A$: estimated gestational age at birth (measured in weeks).\n\n::: callout-note\nFemale is the **reference level** for the categorical variable $S$ \n(chromosomal sex) and corresponding indicator variable $M$ . \nThe choice of a reference level is arbitrary and does not limit what \nwe can do with the resulting model; \nit only makes it more computationally convenient to make inferences \nabout comparisons involving that reference group.\n:::\n\n### Parallel lines regression\n\nWe don't have enough data to model the distribution of birth weight \nseparately for each combination of gestational age and sex, \nso let's instead consider a (relatively) simple model for how that \ndistribution varies with gestational age and sex:\n\n$$p(Y=y|A=a,S=s) \\siid N(\\mu(a,s), \\sigma^2)$$\n\n$$\n\\ba\n\\mu(a,s)\n&\\eqdef \\Exp{Y|A=a, S=s} \\\\\n&= \\beta_0 + \\beta_A a+ \\beta_M m\n\\ea\n$$ {#eq-lm-parallel}\n\n:::{.notes}\n\n@tbl-lm-parallel shows the parameter estimates from R.\n@fig-parallel-fit1 shows the estimated model, superimposed on the data.\n\n:::\n\n::: {.column width=40%}\n\n\n\n\n\n\n\n\n::: {#tbl-lm-parallel .cell tbl-cap='Estimate of [Model @eq-lm-parallel] for `birthweight` data'}\n\n```{.r 
.cell-code}\nbw_lm1 = lm(\n formula = weight ~ sex + age, \n data = bw)\n\nbw_lm1 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------|:--------:|\n|(Intercept) | -1773.32 |\n|sex (female) | 0.00 |\n|sex (male) | 163.04 |\n|age | 120.89 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=50%}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(`E[Y|X=x]` = fitted(bw_lm1)) |> \n arrange(sex, age)\n\nplot2 = \n plot1 %+% bw +\n geom_line(aes(y = `E[Y|X=x]`))\n\nprint(plot2)\n\n```\n\n::: {.cell-output-display}\n![Parallel-slopes model of birthweight](Linear-models-overview_files/figure-pdf/fig-parallel-fit1-1.pdf){#fig-parallel-fit1}\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n#### Model assumptions and predictions\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n::: {#exr-pred-fem-parallel}\n\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n\n\n\n\n\n\n\n\n::: {#tbl-coef-model1 .cell tbl-cap='Estimated coefficients for [model @eq-lm-parallel]'}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm1)[\"(Intercept)\"] + coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n# print(pred_female)\n### built-in prediction: \n# predict(bw_lm1, newdata = tibble(sex = \"female\", age = 36))\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 0, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 \\\\\n&= 2578.8739\n\\ea\n$$\n:::\n\n---\n\n:::{#exr-pred-male-parallel}\n\nWhat's the mean birthweight for a male born at 36 weeks?\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> 
(Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm1)[\"(Intercept)\"] + \n coef(bw_lm1)[\"sexmale\"] + \n coef(bw_lm1)[\"age\"]*36\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36] \n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 \\\\\n&= 2741.9132\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-sex-parallel-1}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(bw_lm1)\n#> (Intercept) sexmale age \n#> -1773.3 163.0 120.9\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= \n2741.9132 - 2578.8739\\\\\n&=\n163.0393\n\\end{aligned}\n$$\n\nShortcut:\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\\n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36) - \n(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36) \\\\\n&= \\beta_M \\\\ \n&= 163.0393\n\\end{aligned}\n$$\n\n:::\n\n:::{.notes}\n\nNote that age doesn't show up in this difference: in other words, according to this model, the difference between females and males with the same gestational age is the same for every age.\n\nThat's an assumption of the model; it's built-in to the parametric structure, even before we plug in the estimated values of those parameters.\n\nThat's why the lines are parallel.\n\n:::\n\n### Interactions {.smaller}\n\n:::{.notes}\nWhat if we don't like that parallel lines assumption?\n\nThen we need to allow an \"interaction\" between age $A$ and sex $S$:\n:::\n\n$$\nE[Y|A=a, S=s] = \\beta_0 + \\beta_A a+ \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$ {#eq-BW-lm-interact}\n\n::: notes\nNow, the slope of mean birthweight $E[Y|A,S]$ with respect to gestational age $A$ depends on the value of 
sex $S$.\n:::\n\n::: {.column width=40% .smaller}\n\n\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-interact .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{.column width=5%}\n:::\n\n:::{.column width=55%}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-pdf/fig-bw-interaction-1.pdf){#fig-bw-interaction}\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n::: {.notes}\nNow we can see that the lines aren't parallel.\n:::\n\n---\n\nHere's another way we could rewrite this model (by collecting terms involving $A$):\n\n$$\nE[Y|A, M] = \\beta_0 + \\beta_M M+ (\\beta_A + \\beta_{AM} M) A\n$$\n\n::: callout-note\nIf you want to understand a coefficient in a model with interactions, collect terms for the corresponding variable, and you will see what other variables are interacting with the variable you are interested in.\n:::\n\n:::{.notes}\nIn this case, the variable $M$ is interacting with $A$. So the slope of $Y$ with respect to $A$ depends on the value of $M$.\n\nAccording to this model, there is no such thing as \"*the* slope of birthweight with respect to age\". There are two slopes, one for each sex.^[using the definite article \"the\" would mean there is only one slope.] 
We can only talk about \"the slope of birthweight with respect to age among males\" and \"the slope of birthweight with respect to age among females\".\n\nThen: that coefficient is the difference in means per unit change in its corresponding variable, when the other collected variables are set to 0.\n:::\n\n---\n\n::: notes\nTo learn what this model is assuming, let's plug in a few values.\n:::\n\n:::{#exr-pred-fem-interact}\nAccording to this model, what's the mean birthweight for a female born at 36 weeks?\n:::\n\n---\n\n::: {.solution}\n\\ \n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_female = coef(bw_lm2)[\"(Intercept)\"] + coef(bw_lm2)[\"age\"]*36\n```\n:::\n\n\n\n\n\n\n\n\n$$\nE[Y|M = 0, A = 36] = \n\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36 \n= 2552.7333\n$$ \n\n:::\n\n---\n\n:::{#exr-pred-interact-male_36}\nWhat's the mean birthweight for a male born at 36 weeks?\n\n:::\n\n---\n\n::: solution\n\\ \n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npred_male = \n coef(bw_lm2)[\"(Intercept)\"] + \n coef(bw_lm2)[\"sexmale\"] + \n coef(bw_lm2)[\"age\"]*36 + \n coef(bw_lm2)[\"sexmale:age\"] * 36\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\ba\nE[Y|M = 1, A = 36]\n&= \\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36\\\\\n&= 2762.7069\n\\ea\n$$\n\n:::\n\n---\n\n:::{#exr-diff-gender-interact}\nWhat's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?\n:::\n\n---\n\n:::{.solution}\n\n$$\n\\begin{aligned}\n& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\\\ \n&= (\\beta_0 + \\beta_M \\cdot 1+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 1 \\cdot 36)\\\\ \n&\\ \\ \\ \\ \\ -(\\beta_0 + \\beta_M \\cdot 0+ \\beta_A \\cdot 36 + \\beta_{AM} \\cdot 0 \\cdot 36) \\\\\n&= \\beta_{M} + \\beta_{AM}\\cdot 36\\\\\n&= 209.9736\n\\end{aligned}\n$$\n:::\n\n:::{.notes}\nNote that age now does show up in the difference: in other words, according to this model, the difference 
in mean birthweights between females and males with the same gestational age can vary by gestational age.\n\nThat's how the lines in the graph ended up non-parallel.\n\n:::\n\n### Stratified regression {.smaller}\n\n:::{.notes}\nWe could re-write the interaction model as a stratified model, with a slope and intercept for each sex:\n:::\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_M m + \\beta_{AM} (a \\cdot m) + \n\\beta_F f + \\beta_{AF} (a \\cdot f)\n$$ {#eq-model-strat}\n\nCompare this stratified model with our interaction model, @eq-BW-lm-interact:\n\n$$\n\\E{Y|A=a, S=s} = \n\\beta_0 + \\beta_A a + \\beta_M m + \\beta_{AM} (a \\cdot m)\n$$\n\n::: notes\n\nIn the stratified model, the intercept term $\\beta_0$ has been relabeled as $\\beta_F$.\n\n:::\n\n::: {.column width=45%}\n\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-interact2 .cell tbl-cap='Birthweight model with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE,\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:----------------|:--------:|\n|(Intercept) | -2141.67 |\n|sex (female) | 0.00 |\n|sex (male) | 872.99 |\n|age | 130.40 |\n|sex (male) × age | -18.42 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{.column width=10%}\n:::\n\n:::{.column width=45%}\n\n\n\n\n\n\n\n\n::: {#tbl-bw-model-coefs-strat .cell tbl-cap='Birthweight model - stratified betas'}\n\n```{.r .cell-code}\nbw_lm_strat = \n bw |> \n lm(\n formula = weight ~ sex + sex:age - 1, \n data = _)\n\nbw_lm_strat |> \n parameters() |>\n print_md(\n # show_sigma = TRUE,\n select = \"{estimate}\")\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Estimate |\n|:------------------|:--------:|\n|sex (female) | -2141.67 |\n|sex (male) | -1268.67 |\n|sex (female) × age | 130.40 |\n|sex (male) × age | 111.98 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n### Curved-line regression\n\n::: notes\nIf we 
transform some of our covariates ($X$s) and plot the resulting model on the original covariate scale, we end up with curved regression lines:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm3 = lm(weight ~ sex:log(age) - 1, data = bw)\nlibrary(palmerpenguins)\n\nggpenguins <- \n palmerpenguins::penguins |> \n dplyr::filter(species == \"Adelie\") |> \n ggplot(\n aes(x = bill_length_mm , y = body_mass_g)) +\n geom_point() + \n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\nggpenguins2 = ggpenguins +\n stat_smooth(\n method = \"lm\",\n formula = y ~ log(x),\n geom = \"smooth\") +\n xlab(\"Bill length (mm)\") + \n ylab(\"Body mass (g)\")\n\n\nggpenguins2 |> print()\n```\n\n::: {.cell-output-display}\n![`palmerpenguins` model with `bill_length` entering on log scale](Linear-models-overview_files/figure-pdf/fig-penguins-log-x-1.pdf){#fig-penguins-log-x}\n:::\n:::\n\n\n\n\n\n\n\n\n## Estimating Linear Models via Maximum Likelihood {#sec-est-LMs}\n\n### Likelihood, log-likelihood, and score functions for linear regression {.smaller}\n\n:::{.notes}\n\nIn EPI 203 and @sec-intro-MLEs, we learned how to fit outcome-only models of the form $p(X=x|\\theta)$ to iid data $\\vx = (x_1,…,x_n)$ using maximum likelihood estimation.\n\nNow, we apply the same procedure to linear regression models:\n\n:::\n\n$$\n\\mathcal L(\\vec y|\\mat x,\\beta, \\sigma^2) = \n\\prod_{i=1}^n (2\\pi\\sigma^2)^{-1/2} \n\\exp{-\\frac{1}{2\\sigma^2}(y_i - \\vec{x_i}'\\beta)^2}\n$$ {#eq-linreg-lik}\n\n$$\n\\ell(\\vec y|\\mat x,\\beta, \\sigma^2) \n= -\\frac{n}{2}\\log{\\sigma^2} - \n\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i - \\vec{x_i}' \\beta)^2\n$$ {#eq-linreg-loglik}\n\n$$\n\\ell'_{\\beta}(\\vec y|\\mat x,\\beta, \\sigma^2) \n= - \n\\frac{1}{2\\sigma^2}\\deriv{\\beta}\n\\paren{\\sum_{i=1}^n (y_i - \\vec{x_i}\\' \\beta)^2}\n$$ {#eq-linreg-score}\n\n---\n\n::: notes\nLet's switch to matrix-vector notation:\n:::\n\n$$\n\\sum_{i=1}^n (y_i - \\vx_i\\' \\vb)^2 \n= (\\vy - 
\\mX\\vb)'(\\vy - \\mX\\vb)\n$$\n\n---\n\nSo\n\n$$\n\\begin{aligned}\n(\\vy - \\mX\\vb)'(\\vy - \\mX\\vb) \n&= (\\vy' - \\vb'X')(\\vy - \\mX\\vb)\n\\\\ &= y'y - \\vb'X'y - y'\\mX\\vb +\\vb'\\mX'\\mX\\beta\n\\\\ &= y'y - 2y'\\mX\\beta +\\beta'\\mX'\\mX\\beta\n\\end{aligned}\n$$\n\n### Deriving the linear regression score function\n\n::: notes\nWe will use some results from [vector calculus](math-prereqs.qmd#sec-vector-calculus):\n:::\n\n$$\n\\begin{aligned}\n\\deriv{\\beta}\\paren{\\sum_{i=1}^n (y_i - x_i' \\beta)^2} \n &= \\deriv{\\beta}(\\vy - X\\beta)'(\\vy - X\\beta)\n\\\\ &= \\deriv{\\beta} (y'y - 2y'X\\beta +\\beta'X'X\\beta)\n\\\\ &= (- 2X'y +2X'X\\beta)\n\\\\ &= - 2X'(y - X\\beta)\n\\\\ &= - 2X'(y - \\Expp[y])\n\\\\ &= - 2X' \\err(y)\n\\end{aligned}\n$${#eq-scorefun-linreg}\n\n---\n\nSo if $\\ell'_{\\beta}(\\beta,\\sigma^2) = 0$, then\n\n$$\n\\begin{aligned}\n0 &= (- 2X'y +2X'X\\beta)\\\\\n2X'y &= 2X'X\\beta\\\\\nX'y &= X'X\\beta\\\\\n(X'X)^{-1}X'y &= \\beta\n\\end{aligned}\n$$\n\n---\n\nThe second derivative matrix $\\ell_{\\beta, \\beta'} ''(\\beta, \\sigma^2;\\mathbf X,\\vy)$ is negative definite at $\\beta = (X'X)^{-1}X'y$, so $\\hat \\beta_{ML} = (X'X)^{-1}X'y$ is the MLE for $\\beta$.\n\n---\n\nSimilarly (not shown):\n\n$$\n\\hat\\sigma^2_{ML} = \\frac{1}{n} (Y-X\\hat\\beta)'(Y-X\\hat\\beta)\n$$\n\nAnd\n\n$$\n\\begin{aligned}\n\\mathcal I_{\\beta} &= E[-\\ell_{\\beta, \\beta'} ''(Y|X,\\beta, \\sigma^2)]\\\\\n&= \\frac{1}{\\sigma^2}X'X\n\\end{aligned}\n$$\n\n---\n\nSo:\n\n$$\nVar(\\hat \\beta) \\approx (\\mathcal I_{\\beta})^{-1} = \\sigma^2 (X'X)^{-1}\n$$\n\nand\n\n$$\n\\hat\\beta \\dot \\sim N(\\beta, \\mathcal I_{\\beta}^{-1})\n$$ \n\n:::{.notes}\n\nThese are all results you have hopefully seen before.\n\n:::\n\n---\n\nIn the Gaussian linear regression case, we also have exact results:\n\n$$\n\\frac{\\hat\\beta_j - \\beta_j}{\\hse{\\hat\\beta_j}} \\dist t_{n-p}\n$$ \n\n---\n\nIn our model 2 above, the estimated covariance matrix $\\heinf(\\beta)^{-1}$ is:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nbw_lm2 |> vcov()\n#> (Intercept) sexmale age sexmale:age\n#> (Intercept) 1353968 -1353968 -34871.0 34871.0\n#> sexmale -1353968 2596387 34871.0 -67211.0\n#> age -34871 34871 899.9 -899.9\n#> sexmale:age 34871 -67211 -899.9 1743.5\n```\n:::\n\n\n\n\n\n\n\n\nIf we take the square roots of the diagonals, we get the standard errors listed in the model output:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> vcov() |> diag() |> sqrt()\n#> (Intercept) sexmale age sexmale:age \n#> 1163.60 1611.33 30.00 41.76\n```\n:::\n\n::: {#tbl-mod-intx .cell tbl-cap='Estimated model for `birthweight` data with interaction term'}\n\n```{.r .cell-code}\nbw_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nSo we can do confidence intervals, hypothesis tests, and p-values exactly as in the one-variable case we looked at previously.\n\n### Residual Standard Deviation\n\n::: notes\n$\\hs$ represents an *estimate* of the *Residual Standard Deviation* parameter, $\\s$. 
\nWe can extract $\\hs$ from the fitted model, using the `sigma()` function:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsigma(bw_lm2)\n#> [1] 180.6\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n#### $\\s$ is NOT \"Residual standard error\"\n\n::: notes\nIn the `summary.lm()` output, this estimate is labeled as `\"Residual standard error\"`:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nsummary(bw_lm2)\n#> \n#> Call:\n#> lm(formula = weight ~ sex + age + sex:age, data = bw)\n#> \n#> Residuals:\n#> Min 1Q Median 3Q Max \n#> -246.7 -138.1 -39.1 176.6 274.3 \n#> \n#> Coefficients:\n#> Estimate Std. Error t value Pr(>|t|) \n#> (Intercept) -2141.7 1163.6 -1.84 0.08057 . \n#> sexmale 873.0 1611.3 0.54 0.59395 \n#> age 130.4 30.0 4.35 0.00031 ***\n#> sexmale:age -18.4 41.8 -0.44 0.66389 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> Residual standard error: 181 on 20 degrees of freedom\n#> Multiple R-squared: 0.643,\tAdjusted R-squared: 0.59 \n#> F-statistic: 12 on 3 and 20 DF, p-value: 0.000101\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n::: notes\nHowever, this is a misnomer:\n:::\n\n\n\n\n\n\n\n\n::: {.cell printr.help.sections='[\"description\",\"note\"]'}\n\n```{.r .cell-code code-fold=\"show\"}\nlibrary(printr) # captures ? documentation\n?stats::sigma\n#> Extract Residual Standard Deviation 'Sigma'\n#> \n#> Description:\n#> \n#> Extract the estimated standard deviation of the errors, the\n#> \"residual standard deviation\" (misnamed also \"residual standard\n#> error\", e.g., in 'summary.lm()''s output, from a fitted model).\n#> \n#> Many classical statistical models have a _scale parameter_,\n#> typically the standard deviation of a zero-mean normal (or\n#> Gaussian) random variable which is denoted as sigma. 
'sigma(.)'\n#> extracts the _estimated_ parameter from a fitted model, i.e.,\n#> sigma^.\n#> \n#> Note:\n#> \n#> The misnomer \"Residual standard *error*\" has been part of too many\n#> R (and S) outputs to be easily changed there.\n```\n:::\n\n\n\n\n\n\n\n\n## Inference about Gaussian Linear Regression Models {#sec-infer-LMs}\n\n### Motivating example: `birthweight` data\n\nResearch question: is there really an interaction between sex and age?\n\n$H_0: \\beta_{AM} = 0$\n\n$H_A: \\beta_{AM} \\neq 0$\n\n$P(|\\hat\\beta_{AM}| > |-18.4172| \\mid H_0)$ = ?\n\n### Wald tests and CIs {.smaller}\n\nR can give you Wald tests for single coefficients and corresponding CIs:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw_lm2 |> \n parameters() |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (female) | 0.00 | | | | |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nTo understand what's happening, let's replicate these results by hand for the interaction term.\n\n### P-values {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Estimate\"]\nse_hat = coef(summary(bw_lm2))[\"sexmale:age\", \"Std. 
Error\"]\ndfresid = bw_lm2$df.residual\nt_stat = abs(beta_hat)/se_hat\npval_t = \n pt(-t_stat, df = dfresid, lower.tail = TRUE) +\n pt(t_stat, df = dfresid, lower.tail = FALSE)\n```\n:::\n\n\n\n\n\n\n\n\n$$\n\\begin{aligned}\n&P\\paren{\n| \\hat \\beta_{AM} | > \n| -18.4172| \\middle| H_0\n} \n\\\\\n&= \\Pr \\paren{\n\\abs{ \\frac{\\hat\\beta_{AM}}{\\hat{SE}(\\hat\\beta_{AM})} } > \n\\abs{ \\frac{-18.4172}{41.7558} } \\middle| H_0\n}\\\\ \n&= \\Pr \\paren{\n\\abs{ T_{20} } > 0.4411 | H_0\n}\\\\\n&= 0.6639\n\\end{aligned}\n$$ \n\n::: notes\nThis matches the result in the table above.\n:::\n\n### Confidence intervals\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nq_t = qt(\n p = 0.975, \n df = dfresid, \n lower.tail = TRUE)\n\nconfint_radius_t = \n se_hat * q_t\n\nconfint_t = beta_hat + c(-1,1) * confint_radius_t\n\nprint(confint_t)\n#> [1] -105.52 68.68\n```\n:::\n\n\n\n\n\n\n\n\n::: notes\nThis also matches.\n:::\n\n### Gaussian approximations\n\nHere are the asymptotic (Gaussian approximation) equivalents:\n\n### P-values {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npval_z = pnorm(abs(t_stat), lower = FALSE) * 
2\n\nprint(pval_z)\n#> [1] 0.6592\n```\n:::\n\n\n\n\n\n\n\n\n### Confidence intervals {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm2 |> \n parameters(keep = \"sexmale:age\") |>\n print_md(\n include_reference = TRUE)\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-----:|:----------------:|:-----:|:-----:|\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nconfint_radius_z = se_hat * qnorm(0.975, lower = TRUE)\nconfint_z = \n beta_hat + c(-1,1) * confint_radius_z\nprint(confint_z)\n#> [1] -100.26 63.42\n```\n:::\n\n\n\n\n\n\n\n\n### Likelihood ratio statistics\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlogLik(bw_lm2)\n#> 'log Lik.' -156.6 (df=5)\nlogLik(bw_lm1)\n#> 'log Lik.' -156.7 (df=4)\n\nlLR = (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()\ndelta_df = (bw_lm1$df.residual - df.residual(bw_lm2))\n\n\nx_max = 1\n\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd_lLR = function(x, df = delta_df) dchisq(x, df = df)\n\nchisq_plot = \n ggplot() + \n geom_function(fun = d_lLR) +\n stat_function( fun = d_lLR, xlim = c(lLR, x_max), geom = \"area\", fill = \"gray\") +\n geom_segment(aes(x = lLR, xend = lLR, y = 0, yend = d_lLR(lLR)), col = \"red\") + \n xlim(0.0001,x_max) + \n ylim(0,4) + \n ylab(\"p(X=x)\") + \n xlab(\"log(likelihood ratio) statistic [x]\") +\n theme_classic()\nchisq_plot |> print()\n```\n\n::: {.cell-output-display}\n![Chi-square distribution](Linear-models-overview_files/figure-pdf/fig-chisq-plot-1.pdf){#fig-chisq-plot}\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\nNow we can get the p-value:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npchisq(\n q = 2*lLR, \n df = delta_df, \n lower = FALSE) |> \n print()\n#> [1] 0.6298\n```\n:::\n\n\n\n\n\n\n\n\n\n---\n\nIn practice you don't have to do this by hand; there are functions to do it for 
you:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# built in\nlibrary(lmtest)\nlrtest(bw_lm2, bw_lm1)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|------:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 4| -156.7| -1| 0.2323| 0.6298|\n:::\n:::\n\n\n\n\n\n\n\n\n## Goodness of fit\n\n### AIC and BIC\n\n::: notes\nWhen we use likelihood ratio tests, we are comparing how well different models fit the data.\n\nLikelihood ratio tests require \"nested\" models: one must be a special case of the other.\n\nIf we have non-nested models, we can instead use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC):\n:::\n\n- AIC = $-2 * \\ell(\\hat\\theta) + 2 * p$\n\n- BIC = $-2 * \\ell(\\hat\\theta) + p * \\text{log}(n)$\n\nwhere $\\ell$ is the log-likelihood of the data evaluated using the parameter estimates $\\hat\\theta$, $p$ is the number of estimated parameters in the model (including $\\hat\\sigma^2$), and $n$ is the number of observations.\n\nYou can calculate these criteria using the `logLik()` function, or use the built-in R functions:\n\n#### AIC in R\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n 2*(length(coef(bw_lm2))+1) # sigma counts as a parameter here\n#> [1] 323.2\n\nAIC(bw_lm2)\n#> [1] 323.2\n```\n:::\n\n\n\n\n\n\n\n\n#### BIC in R\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n-2 * logLik(bw_lm2) |> as.numeric() + \n (length(coef(bw_lm2))+1) * log(nobs(bw_lm2))\n#> [1] 329\n\nBIC(bw_lm2)\n#> [1] 329\n```\n:::\n\n\n\n\n\n\n\n\nLarge values of AIC and BIC are worse than small values. 
There are no hypothesis tests or p-values associated with these criteria.\n\n### (Residual) Deviance\n\nLet $q$ be the number of distinct covariate combinations in a data set.\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique = \n bw |> \n count(sex, age)\n\nn_unique.bw = nrow(bw.X.unique)\n```\n:::\n\n\n\n\n\n\n\n\nFor example, in the `birthweight` data, there are $q = 12$ unique patterns (@tbl-bw-x-combos).\n\n\n\n\n\n\n\n\n::: {#tbl-bw-x-combos .cell tbl-cap='Unique covariate combinations in the `birthweight` data, with replicate counts'}\n\n```{.r .cell-code}\nbw.X.unique\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 36| 2|\n|female | 37| 1|\n|female | 38| 2|\n|female | 39| 2|\n|female | 40| 4|\n|female | 42| 1|\n|male | 35| 1|\n|male | 36| 1|\n|male | 37| 2|\n|male | 38| 3|\n|male | 40| 4|\n|male | 41| 1|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n::: {#def-replicates}\n#### Replicates\nIf a given covariate pattern has more than one observation in a dataset, those observations are called **replicates**.\n:::\n\n---\n\n::: {#exm-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nIn the `birthweight` dataset, there are 2 replicates of the combination \"female, age 36\" (@tbl-bw-x-combos).\n\n:::\n\n---\n\n::: {#exr-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nWhich covariate pattern(s) in the `birthweight` data has the most replicates?\n\n:::\n\n---\n\n::: {#sol-replicate-bw}\n\n#### Replicates in the `birthweight` data\n\nTwo covariate patterns are tied for most replicates: males at age 40 weeks \nand females at age 40 weeks.\n40 weeks is the usual length for human pregnancy (@polin2011fetal), so this result makes sense.\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw.X.unique |> dplyr::filter(n == max(n))\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| n|\n|:------|---:|--:|\n|female | 40| 4|\n|male | 40| 4|\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n#### Saturated models 
{.smaller}\n\nThe most complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:\n\n\n\n\n\n\n\n\n::: {#tbl-bw-model-sat .cell tbl-cap='Saturated model for the `birthweight` data'}\n\n```{.r .cell-code}\nlm_max = \n bw |> \n mutate(age = factor(age)) |> \n lm(\n formula = weight ~ sex:age - 1, \n data = _)\n\nlm_max |> \n parameters() |> \n print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(12) | p |\n|:--------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|sex (male) × age35 | 2925.00 | 187.92 | (2515.55, 3334.45) | 15.56 | < .001 |\n|sex (female) × age36 | 2570.50 | 132.88 | (2280.98, 2860.02) | 19.34 | < .001 |\n|sex (male) × age36 | 2625.00 | 187.92 | (2215.55, 3034.45) | 13.97 | < .001 |\n|sex (female) × age37 | 2539.00 | 187.92 | (2129.55, 2948.45) | 13.51 | < .001 |\n|sex (male) × age37 | 2737.50 | 132.88 | (2447.98, 3027.02) | 20.60 | < .001 |\n|sex (female) × age38 | 2872.50 | 132.88 | (2582.98, 3162.02) | 21.62 | < .001 |\n|sex (male) × age38 | 2982.00 | 108.50 | (2745.60, 3218.40) | 27.48 | < .001 |\n|sex (female) × age39 | 2846.00 | 132.88 | (2556.48, 3135.52) | 21.42 | < .001 |\n|sex (female) × age40 | 3152.25 | 93.96 | (2947.52, 3356.98) | 33.55 | < .001 |\n|sex (male) × age40 | 3256.25 | 93.96 | (3051.52, 3460.98) | 34.66 | < .001 |\n|sex (male) × age41 | 3292.00 | 187.92 | (2882.55, 3701.45) | 17.52 | < .001 |\n|sex (female) × age42 | 3210.00 | 187.92 | (2800.55, 3619.45) | 17.08 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nWe call this model the **full**, **maximal**, or **saturated** model for this dataset.\n\nWe can calculate the log-likelihood of this model as usual:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm_max)\n#> 'log Lik.' 
-151.4 (df=13)\n```\n:::\n\n\n\n\n\n\n\n\nWe can compare this model to our other models using chi-square tests, as usual:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, bw_lm2)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 5| -156.6| -8| 10.36| 0.241|\n:::\n:::\n\n\n\n\n\n\n\n\nThe likelihood ratio statistic for this test is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell) = 10.3554$$ where:\n\n- $\\ell_{\\text{full}}$ is the log-likelihood of the full model: -151.4016\n- $\\ell$ is the log-likelihood of our comparison model (two slopes, two intercepts): -156.5793\n\nThis statistic is called the **deviance** or **residual deviance** for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.\n\nThe corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us if there's evidence that we missed any important patterns. 
(Remember, a nonsignificant p-value could mean that we didn't miss anything and a more complicated model is unnecessary, or it could mean we just don't have enough data to tell the difference between these models.)\n\n### Null Deviance\n\nSimilarly, the *least* complicated model we could fit would have only one mean parameter, an intercept:\n\n$$\\text E[Y|X=x] = \\beta_0$$ We can fit this model in R like so:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm0 = lm(weight ~ 1, data = bw)\n\nlm0 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(23) | p |\n|:-----------|:-----------:|:-----:|:------------------:|:-----:|:------:|\n|(Intercept) | 2967.67 | 57.58 | (2848.56, 3086.77) | 51.54 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nThis model also has a likelihood:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogLik(lm0)\n#> 'log Lik.' -169 (df=2)\n```\n:::\n\n\n\n\n\n\n\n\nAnd we can compare it to more complicated models using a likelihood ratio test:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlrtest(bw_lm2, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|--:|-----:|----------:|\n| 5| -156.6| NA| NA| NA|\n| 2| -169.0| -3| 24.75| 0|\n:::\n:::\n\n\n\n\n\n\n\n\nThe likelihood ratio statistic for the test comparing the null model to the maximal model is $$\\lambda = 2 * (\\ell_{\\text{full}} - \\ell_{0}) = 35.1067$$ where:\n\n- $\\ell_{\\text{0}}$ is the log-likelihood of the null model: -168.955\n- $\\ell_{\\text{full}}$ is the log-likelihood of the maximal model: -151.4016\n\nIn R, this test is:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlrtest(lm_max, lm0)\n```\n\n::: {.cell-output-display}\n\n\n| #Df| LogLik| Df| Chisq| Pr(>Chisq)|\n|---:|------:|---:|-----:|----------:|\n| 13| -151.4| NA| NA| NA|\n| 2| -169.0| -11| 35.11| 2e-04|\n:::\n:::\n\n\n\n\n\n\n\n\nThis log-likelihood ratio statistic is called the **null deviance**. 
It tells us whether we have enough data to detect a difference between the null and full models.\n\n## Rescaling\n\n### Rescale age {.smaller}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |>\n mutate(\n `age - mean` = age - mean(age),\n `age - 36wks` = age - 36\n )\n\nlm1c = lm(weight ~ sex + `age - 36wks`, data = bw)\n\nlm2c = lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)\n\nparameters(lm2c, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:------------------------|:-----------:|:------:|:------------------:|:-----:|:------:|\n|(Intercept) | 2552.73 | 97.59 | (2349.16, 2756.30) | 26.16 | < .001 |\n|sex (male) | 209.97 | 129.75 | (-60.68, 480.63) | 1.62 | 0.121 |\n|age - 36wks | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age - 36wks | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nCompare with what we got without rescaling:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparameters(bw_lm2, ci_method = \"wald\") |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(20) | p |\n|:----------------|:-----------:|:-------:|:-------------------:|:-----:|:------:|\n|(Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |\n|sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |\n|age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |\n|sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n## Prediction\n\n### Prediction for linear models\n\n:::{#def-predicted-value}\n#### Predicted value\n\nIn a regression model $\\p(y|x)$, the **predicted value** of $y$ given $x$ is the estimated mean of $Y$ given $X$:\n\n$$\\hat y \\eqdef \\hE{Y|X=x}$$\n:::\n\n---\n\nFor linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value $x_j$ by its 
corresponding coefficient estimate $\\hat\\beta_j$ and adding up the results:\n\n$$\n\\begin{aligned}\n\\hat y &= \\hat E[Y|X=x] \\\\\n&= x'\\hat\\beta \\\\\n&= \\hat\\beta_0\\cdot 1 + \\hat\\beta_1 x_1 + \\cdots + \\hat\\beta_p x_p\n\\end{aligned}\n$$\n\n---\n\n### Example: prediction for the `birthweight` data\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nX = c(1,1,40)\nsum(X * coef(bw_lm1))\n#> [1] 3225\n```\n:::\n\n\n\n\n\n\n\n\nR has built-in functions for prediction:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = tibble(age = 40, sex = \"male\")\nbw_lm1 |> predict(newdata = x)\n#> 1 \n#> 3225\n```\n:::\n\n\n\n\n\n\n\n\nIf you don't provide `newdata`, R will use the covariate values from the original dataset:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredict(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n\n\n\n\nThese special predictions are called the *fitted values* of the dataset:\n\n:::{#def-fitted-value}\n\nFor a given dataset $(\\vY, \\mX)$ and corresponding fitted model $\\p_{\\hb}(\\vy|\\mx)$, the **fitted value** of $y_i$ is the predicted value of $y$ when $\\vX=\\vx_i$ using the estimated parameters $\\hb$.\n\n:::\n\nR has an extra function to get these values:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfitted(bw_lm1)\n#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \n#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 \n#> 17 18 19 20 21 22 23 24 \n#> 2863 2579 2984 2821 3225 2942 2984 3062\n```\n:::\n\n\n\n\n\n\n\n\n### Quantifying uncertainty in predictions\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE)\n#> $fit\n#> 1 \n#> 3225 \n#> \n#> $se.fit\n#> [1] 61.46\n#> \n#> $df\n#> [1] 21\n#> \n#> $residual.scale\n#> [1] 177.1\n```\n:::\n\n\n\n\n\n\n\n\nThis is a `list()`; you 
can extract the elements with `$` or `magrittr::use_series()`:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(\n newdata = x,\n se.fit = TRUE) |> \n use_series(se.fit)\n#> [1] 61.46\n```\n:::\n\n\n\n\n\n\n\n\nYou can get **confidence intervals** for $\\E{Y|X=x}$:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> predict(\n newdata = x,\n interval = \"confidence\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 3098| 3353|\n:::\n:::\n\n\n\n\n\n\n\n\nYou can also get **prediction intervals** for the value of an individual outcome $Y$:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw_lm1 |> \n predict(newdata = x, interval = \"predict\")\n```\n\n::: {.cell-output-display}\n\n\n| fit| lwr| upr|\n|----:|----:|----:|\n| 3225| 2836| 3615|\n:::\n:::\n\n\n\n\n\n\n\n\nThe warning from the last command is: \"predictions on current data refer to *future* responses\" (since you already know what happened to the current data, and thus don't need to predict it).\n\nSee `?predict.lm` for more.\n\n## Diagnostics {#sec-diagnose-LMs}\n\n:::{.callout-tip}\nThis section is adapted from @dobson4e [§6.2-6.3] and \n@dunn2018generalized [Chapter 3](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3).\n:::\n### Assumptions in linear regression models {.smaller .scrollable}\n\n$$Y|\\vX \\simind N(\\vX'\\b,\\ss)$$\n\n1. Normality: The distribution conditional on a given $X$ value is normal\n\n2. Correct Functional Form: The conditional means have the structure \n\n$$E[Y|\\vec X = \\vec x] = \\vec x'\\beta$$\n3. Homoskedasticity: The variance $\\ss$ is constant (with respect to $\\vx$)\n\n4. 
Independence: The observations are statistically independent\n\n### Direct visualization\n\n::: notes\nThe most direct way to examine the fit of a model is to compare it to the raw observed data.\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2)\n ) |> \n arrange(sex, age)\n\nplot1_interact = \n plot1 %+% bw +\n geom_line(aes(y = predlm2))\n\nprint(plot1_interact)\n```\n\n::: {.cell-output-display}\n![Birthweight model with interaction term](Linear-models-overview_files/figure-pdf/fig-bw-interaction2-1.pdf){#fig-bw-interaction2}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nIt's not easy to assess these assumptions from this model.\nIf there are multiple continuous covariates, it becomes even harder to visualize the raw data.\n:::\n\n### Residuals\n\n::: notes\nMaybe we can transform the data and model in some way to make it easier to inspect.\n:::\n:::{#def-resid-noise}\n#### Residual noise\n\nThe **residual noise** in a probabilistic model $p(Y)$ is the difference between an observed value $y$ and its distributional mean:\n\n$$\\eps(y) \\eqdef y - \\Exp{Y}$$ {#eq-def-resid}\n:::\n\n:::{.notes}\nWe use the same notation for residual noise that we used for [errors](estimation.qmd#def-error). \n$\\Exp{Y}$ can be viewed as an estimate of $Y$, before $y$ is observed.\nConversely, each observation $y$ can be viewed as an estimate of $\\Exp{Y}$ (albeit an imprecise one, individually, since $n=1$). 
\n\n:::\n\nWe can rearrange @eq-def-resid to view $y$ as the sum of its mean plus the residual noise:\n\n$$y = \\Exp{Y} + \\eps(y)$$\n\n---\n\n:::{#thm-gaussian-resid-noise}\n#### Residuals in Gaussian models\n\nIf $Y$ has a Gaussian distribution, then $\\err(Y)$ also has a Gaussian distribution, and vice versa.\n:::\n\n:::{.proof}\nLeft to the reader.\n:::\n\n---\n\n:::{#def-resid-fitted}\n#### Residual errors of a fitted model value\n\nThe **residual of a fitted value $\\hat y$** (shorthand: \"residual\") is its [error](estimation.qmd#def-error):\n$$\n\\ba\ne(\\hat y) &\\eqdef \\erf{\\hat y}\n\\\\&= y - \\hat y\n\\ea\n$$\n:::\n\n$e(\\hat y)$ can be seen as the maximum likelihood estimate of the residual noise:\n\n$$\n\\ba\ne(\\hy) &= y - \\hat y\n\\\\ &= \\hat\\eps_{ML}\n\\ea\n$$\n\n---\n\n#### General characteristics of residuals\n\n:::{#thm-resid-unbiased}\nFor [unbiased](estimation.qmd#sec-unbiased-estimators) estimators $\\hth$:\n\n$$\\E{e(y)} = 0$$ {#eq-mean-resid-unbiased}\n$$\\Var{e(y)} \\approx \\ss$$ {#eq-var-resid-unbiased}\n\n:::\n\n:::{.proof}\n\\ \n\n@eq-mean-resid-unbiased:\n\n$$\n\\ba\n\\Ef{e(y)} &= \\Ef{y - \\hat y}\n\\\\ &= \\Ef{y} - \\Ef{\\hat y}\n\\\\ &= \\Ef{y} - \\Ef{y}\n\\\\ &= 0\n\\ea\n$$\n\n@eq-var-resid-unbiased:\n\n$$\n\\ba\n\\Var{e(y)} &= \\Var{y - \\hy}\n\\\\ &= \\Var{y} + \\Var{\\hy} - 2 \\Cov{y, \\hy}\n\\\\ &{\\dot{\\approx}} \\Var{y} + 0 - 2 \\cdot 0\n\\\\ &= \\Var{y}\n\\\\ &= \\ss\n\\ea\n$$\n:::\n\n---\n\n#### Characteristics of residuals in Gaussian models\n\nWith enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance $\\sigma^2$, which we can estimate using $\\hat\\sigma^2$; that is:\n\n$$\ne_i \\siid N(0, \\hat\\sigma^2)\n$$\n\n---\n\n:::{#exm-resid-bw}\n#### Residuals in `birthweight` data\n\nR provides a function for residuals:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 
242.67 -139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n\n\n\n\n:::\n\n:::{#exr-calc-resids}\nCheck R's output by computing the residuals directly.\n:::\n\n:::{.solution}\n\\ \n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw$weight - fitted(bw_lm2)\n#> 1 2 3 4 5 6 7 8 9 10 \n#> 176.27 -140.73 -144.13 -59.53 177.47 -126.93 -68.93 242.67 -139.33 51.67 \n#> 11 12 13 14 15 16 17 18 19 20 \n#> 156.67 -125.13 274.28 -137.71 -27.69 -246.69 -191.67 189.33 -11.67 -242.64 \n#> 21 22 23 24 \n#> -47.64 262.36 210.36 -30.62\n```\n:::\n\n\n\n\n\n\n\n\nThis matches R's output!\n:::\n\n---\n\n#### Graph the residuals\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw = bw |> \n mutate(resids_intxn = \n weight - fitted(bw_lm2))\n\nplot_bw_resid =\n bw |> \n ggplot(aes(\n x = age, \n y = resids_intxn,\n linetype = sex,\n shape = sex,\n col = sex)) +\n theme_bw() +\n xlab(\"Gestational age (weeks)\") +\n ylab(\"residuals (grams)\") +\n theme(legend.position = \"bottom\") +\n # expand_limits(y = 0, x = 0) +\n geom_point(alpha = .7)\nprint(plot_bw_resid + facet_wrap(~ sex))\n```\n\n::: {.cell-output-display}\n![Residuals of interaction model for `birthweight` data](Linear-models-overview_files/figure-pdf/fig-resids-intxn-1.pdf){#fig-resids-intxn}\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n:::{#def-stred}\n\n#### Standardized residuals\n\n$$r_i = \\frac{e_i}{\\widehat{SD}(e_i)}$$\n\n:::\n\nHence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,\n\n$$\nr_i \\siid N(0,1)\n$$\n\n### Marginal distributions of residuals\n\nTo look for problems with our model, we can check whether the residuals $e_i$ and standardized residuals $r_i$ look like they have the distributions that they are supposed to have, according to the model.\n\n---\n\n#### Standardized residuals in 
R\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nrstandard(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 1.15982 -0.92601 -0.87479 -0.34723 1.03507 -0.73473 -0.39901 1.43752 \n#> 9 10 11 12 13 14 15 16 \n#> -0.82539 0.30606 0.92807 -0.87616 1.91428 -0.86559 -0.16430 -1.46376 \n#> 17 18 19 20 21 22 23 24 \n#> -1.11016 1.09658 -0.06761 -1.46159 -0.28696 1.58040 1.26717 -0.19805\nresid(bw_lm2)/sigma(bw_lm2)\n#> 1 2 3 4 5 6 7 8 \n#> 0.97593 -0.77920 -0.79802 -0.32962 0.98258 -0.70279 -0.38166 1.34357 \n#> 9 10 11 12 13 14 15 16 \n#> -0.77144 0.28606 0.86741 -0.69282 1.51858 -0.76244 -0.15331 -1.36584 \n#> 17 18 19 20 21 22 23 24 \n#> -1.06123 1.04825 -0.06463 -1.34341 -0.26376 1.45262 1.16471 -0.16954\n```\n:::\n\n\n\n\n\n\n\n\n::: notes\nThese are not quite the same, because R is doing something more complicated and precise to get the standard errors. Let's not worry about those details for now; the difference is pretty small in this case:\n\n:::\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nrstandard_compare_plot = \n tibble(\n x = resid(bw_lm2)/sigma(bw_lm2), \n y = rstandard(bw_lm2)) |> \n ggplot(aes(x = x, y = y)) +\n geom_point() + \n theme_bw() +\n coord_equal() + \n xlab(\"resid(bw_lm2)/sigma(bw_lm2)\") +\n ylab(\"rstandard(bw_lm2)\") +\n geom_abline(\n aes(\n intercept = 0,\n slope = 1, \n col = \"x=y\")) +\n labs(colour=\"\") +\n scale_colour_manual(values=\"red\")\n\nprint(rstandard_compare_plot)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-65-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\nLet's add these residuals to the `tibble` of our dataset:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n fitted_lm2 = fitted(bw_lm2),\n \n resid_lm2 = resid(bw_lm2),\n # resid_lm2 = weight - fitted_lm2,\n \n std_resid_lm2 = rstandard(bw_lm2),\n # std_resid_lm2 = resid_lm2 / sigma(bw_lm2)\n )\n\nbw |> \n select(\n sex,\n age,\n weight,\n fitted_lm2,\n resid_lm2,\n std_resid_lm2\n 
)\n```\n\n::: {.cell-output-display}\n\n\n|sex | age| weight| fitted_lm2| resid_lm2| std_resid_lm2|\n|:------|---:|------:|----------:|---------:|-------------:|\n|female | 36| 2729| 2553| 176.27| 1.1598|\n|female | 36| 2412| 2553| -140.73| -0.9260|\n|female | 37| 2539| 2683| -144.13| -0.8748|\n|female | 38| 2754| 2814| -59.53| -0.3472|\n|female | 38| 2991| 2814| 177.47| 1.0351|\n|female | 39| 2817| 2944| -126.93| -0.7347|\n|female | 39| 2875| 2944| -68.93| -0.3990|\n|female | 40| 3317| 3074| 242.67| 1.4375|\n|female | 40| 2935| 3074| -139.33| -0.8254|\n|female | 40| 3126| 3074| 51.67| 0.3061|\n|female | 40| 3231| 3074| 156.67| 0.9281|\n|female | 42| 3210| 3335| -125.13| -0.8762|\n|male | 35| 2925| 2651| 274.28| 1.9143|\n|male | 36| 2625| 2763| -137.71| -0.8656|\n|male | 37| 2847| 2875| -27.69| -0.1643|\n|male | 37| 2628| 2875| -246.69| -1.4638|\n|male | 38| 2795| 2987| -191.67| -1.1102|\n|male | 38| 3176| 2987| 189.33| 1.0966|\n|male | 38| 2975| 2987| -11.67| -0.0676|\n|male | 40| 2968| 3211| -242.64| -1.4616|\n|male | 40| 3163| 3211| -47.64| -0.2870|\n|male | 40| 3473| 3211| 262.36| 1.5804|\n|male | 40| 3421| 3211| 210.36| 1.2672|\n|male | 41| 3292| 3323| -30.62| -0.1981|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n::: notes\n\nNow let's build histograms:\n\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresid_marginal_hist = \n bw |> \n ggplot(aes(x = resid_lm2)) +\n geom_histogram()\n\nprint(resid_marginal_hist)\n```\n\n::: {.cell-output-display}\n![Marginal distribution of (nonstandardized) residuals](Linear-models-overview_files/figure-pdf/fig-marg-dist-resid-1.pdf){#fig-marg-dist-resid}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nHard to tell with this small amount of data, but I'm a bit concerned that the histogram doesn't show a bell-curve shape.\n\n:::\n\n---\n\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstd_resid_marginal_hist = \n bw |> \n ggplot(aes(x = std_resid_lm2)) +\n geom_histogram()\n\nprint(std_resid_marginal_hist)\n```\n\n::: 
{.cell-output-display}\n![Marginal distribution of standardized residuals](Linear-models-overview_files/figure-pdf/fig-marg-stresd-1.pdf){#fig-marg-stresd}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nThis looks similar, although the scale of the x-axis got narrower, because we divided by $\\hat\\sigma$ (roughly speaking).\n\nStill hard to tell if the distribution is Gaussian.\n\n:::\n\n---\n\n### QQ plot of standardized residuals\n\n::: notes\nAnother way to assess normality is the QQ plot of the standardized residuals versus normal quantiles:\n\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nlibrary(ggfortify) \n# needed to make ggplot2::autoplot() work for `lm` objects\n\nqqplot_lm2_auto = \n bw_lm2 |> \n autoplot(\n which = 2, # options are 1:6; can do multiple at once\n ncol = 1) +\n theme_classic()\n\nprint(qqplot_lm2_auto)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-69-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nIf the Gaussian model were correct, these points would follow the dotted line.\n\nFig 2.4 panel (c) in @dobson4e is a little different; they didn't specify how they produced it, but other statistical analysis systems do things differently from R.\n\nSee also @dunn2018generalized [§3.5.4](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3#Sec14:~:text=3.5.4%20Q%E2%80%93Q%20Plots%20and%20Normality).\n\n:::\n\n---\n\n#### QQ plot - how it's built\n\n::: notes\nLet's construct it by hand:\n:::\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = bw |> \n mutate(\n p = (rank(std_resid_lm2) - 1/2)/n(), # Hazen's plotting position, (i - 1/2)/n\n expected_quantiles_lm2 = qnorm(p)\n )\n\nqqplot_lm2 = \n bw |> \n ggplot(\n aes(\n x = expected_quantiles_lm2, \n y = std_resid_lm2, \n col = sex, \n shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n theme(legend.position='none') + # removing the plot legend\n ggtitle(\"Normal Q-Q\") +\n xlab(\"Theoretical Quantiles\") + \n ylab(\"Standardized 
residuals\")\n\n# find the expected line:\n\nps <- c(.25, .75) # reference probabilities\na <- quantile(rstandard(bw_lm2), ps) # empirical quantiles\nb <- qnorm(ps) # theoretical quantiles\n\nqq_slope = diff(a)/diff(b)\nqq_intcpt = a[1] - b[1] * qq_slope\n\nqqplot_lm2 = \n qqplot_lm2 +\n geom_abline(slope = qq_slope, intercept = qq_intcpt)\n\nprint(qqplot_lm2)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-70-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n### Conditional distributions of residuals\n\nIf our Gaussian linear regression model is correct, the residuals $e_i$ and standardized residuals $r_i$ should have:\n\n- an approximately Gaussian distribution, with:\n- a mean of 0\n- a constant variance\n\nThis should be true **for every** value of $x$.\n\n---\n\nIf we didn't correctly guess the functional form of the linear component of the mean, \n$$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 X_1 + ... + \\beta_p X_p$$\n\nthen the residuals might have nonzero mean.\n\nRegardless of whether we guessed the mean function correctly, the variance of the residuals might differ between values of $x$.\n\n---\n\n#### Residuals versus fitted values\n\n::: notes\nTo look for these issues, we can plot the residuals $e_i$ against the fitted values $\\hat y_i$ (@fig-bw_lm2-resid-vs-fitted).\n:::\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 1, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model (@eq-BW-lm-interact): residuals versus fitted values](Linear-models-overview_files/figure-pdf/fig-bw_lm2-resid-vs-fitted-1.pdf){#fig-bw_lm2-resid-vs-fitted}\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nIf the model is correct, the blue line should stay flat and close to 0, and the cloud of dots should have the same vertical spread regardless of the fitted value.\n\nIf not, we probably need to change the functional form of the linear component of the mean, $$\\text{E}[Y|X=x] = \\beta_0 + \\beta_1 
X_1 + ... + \\beta_p X_p$$\n\n:::\n\n---\n\n\n#### Example: PLOS Medicine title length data\n\n(Adapted from @dobson4e, §6.7.1)\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(PLOS, package = \"dobson\")\nlibrary(ggplot2)\nfig1 = \n PLOS |> \n ggplot(\n aes(x = authors,\n y = nchar)\n ) +\n geom_point() +\n theme(legend.position = \"bottom\") +\n labs(col = \"\") +\n guides(col=guide_legend(ncol=3))\nfig1\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine* articles](Linear-models-overview_files/figure-pdf/fig-plos-1.pdf){#fig-plos}\n:::\n:::\n\n\n\n\n\n\n\n---\n\n##### Linear fit\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_linear = lm(\n formula = nchar ~ authors, \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig2 = fig1 +\n geom_smooth(\n method = \"lm\", \n fullrange = TRUE,\n aes(col = \"lm(y ~ x)\"))\nfig2\n\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-lm-1.pdf){#fig-plos-lm-1}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-lm-2.pdf){#fig-plos-lm-2}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with linear model fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Quadratic fit {.smaller}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_quad = lm(\n formula = nchar ~ authors + I(authors^2), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-quad .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig3 = \n fig2 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2),\n aes(col = \"lm(y ~ x + I(x^2))\")\n )\nfig3\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-lm-quad-1.pdf){#fig-plos-lm-quad-1}\n:::\n\n::: 
{.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-lm-quad-2.pdf){#fig-plos-lm-quad-2}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with quadratic model fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Linear versus quadratic fits\n\n\n\n\n\n\n\n::: {#fig-plos-lm-resid2 .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nlibrary(ggfortify)\nautoplot(lm_PLOS_linear, which = 1, ncol = 1)\n\nautoplot(lm_PLOS_quad, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Linear](Linear-models-overview_files/figure-pdf/fig-plos-lm-resid2-1.pdf){#fig-plos-lm-resid2-1}\n:::\n\n::: {.cell-output-display}\n![Quadratic](Linear-models-overview_files/figure-pdf/fig-plos-lm-resid2-2.pdf){#fig-plos-lm-resid2-2}\n:::\n\nResiduals versus fitted plot for linear and quadratic fits to `PLOS` data\n:::\n\n\n\n\n\n\n\n---\n\n##### Cubic fit\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_cub = lm(\n formula = nchar ~ authors + I(authors^2) + I(authors^3), \n data = PLOS)\n```\n:::\n\n::: {#fig-plos-lm-cubic .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig4 = \n fig3 + \ngeom_smooth(\n method = \"lm\",\n fullrange = TRUE,\n formula = y ~ x + I(x ^ 2) + I(x ^ 3),\n aes(col = \"lm(y ~ x + I(x^2) + I(x ^ 3))\")\n )\nfig4\n\nautoplot(lm_PLOS_cub, which = 1, ncol = 1)\n\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-lm-cubic-1.pdf){#fig-plos-lm-cubic-1}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-lm-cubic-2.pdf){#fig-plos-lm-cubic-2}\n:::\n\nNumber of authors versus title length in *PLOS Medicine*, with cubic model fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Logarithmic fit\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_PLOS_log = lm(nchar ~ log(authors), data = PLOS)\n```\n:::\n\n::: {#fig-plos-log .cell layout-ncol=\"2\"}\n\n```{.r .cell-code}\nfig5 = fig4 + \n geom_smooth(\n method = \"lm\",\n fullrange = 
TRUE,\n formula = y ~ log(x),\n aes(col = \"lm(y ~ log(x))\")\n )\nfig5\n\nautoplot(lm_PLOS_log, which = 1, ncol = 1)\n```\n\n::: {.cell-output-display}\n![Data and fit](Linear-models-overview_files/figure-pdf/fig-plos-log-1.pdf){#fig-plos-log-1}\n:::\n\n::: {.cell-output-display}\n![Residuals vs fitted](Linear-models-overview_files/figure-pdf/fig-plos-log-2.pdf){#fig-plos-log-2}\n:::\n\nlogarithmic fit\n:::\n\n\n\n\n\n\n\n---\n\n##### Model selection {.smaller}\n\n\n\n\n\n\n\n::: {#tbl-plos-lin-quad-anova .cell tbl-cap='linear vs quadratic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_linear, lm_PLOS_quad)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|----:|------:|\n| 876| 947502| NA| NA| NA| NA|\n| 875| 880950| 1| 66552| 66.1| 0|\n:::\n:::\n\n::: {#tbl-plos-quad-cub-anova .cell tbl-cap='quadratic vs cubic'}\n\n```{.r .cell-code}\nanova(lm_PLOS_quad, lm_PLOS_cub)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|------:|--:|---------:|-----:|------:|\n| 875| 880950| NA| NA| NA| NA|\n| 874| 865933| 1| 15018| 15.16| 1e-04|\n:::\n:::\n\n\n\n\n\n\n\n---\n\n##### AIC/BIC {.smaller}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_quad)\n#> [1] 8568\nAIC(lm_PLOS_cub)\n#> [1] 8555\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nAIC(lm_PLOS_cub)\n#> [1] 8555\nAIC(lm_PLOS_log)\n#> [1] 8544\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"show\"}\nBIC(lm_PLOS_cub)\n#> [1] 8578\nBIC(lm_PLOS_log)\n#> [1] 8558\n```\n:::\n\n\n\n\n\n\n\n---\n\n##### Extrapolation is dangerous\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfig_all = fig5 +\n xlim(0, 60)\nfig_all\n```\n\n::: {.cell-output-display}\n![Number of authors versus title length in *PLOS Medicine*](Linear-models-overview_files/figure-pdf/fig-plos-multifit-1.pdf){#fig-plos-multifit}\n:::\n:::\n\n\n\n\n\n\n\n\n\n---\n\n#### Scale-location plot\n\n::: notes\nWe 
can also plot the square roots of the absolute values of the standardized residuals against the fitted values (@fig-bw-scale-loc).\n:::\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 3, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![Scale-location plot of `birthweight` data](Linear-models-overview_files/figure-pdf/fig-bw-scale-loc-1.pdf){#fig-bw-scale-loc}\n:::\n:::\n\n\n\n\n\n\n\n::: notes\nHere, the blue line doesn't need to be near 0, \nbut it should be flat. \nIf not, the residual variance $\\sigma^2$ might not be constant, \nand we might need to transform our outcome $Y$ \n(or use a model that allows non-constant variance).\n:::\n\n---\n\n\n#### Residuals versus leverage\n\n::: notes\n\nWe can also plot our standardized residuals against \"leverage\", which roughly speaking is a measure of how unusual each $x_i$ value is. Very unusual $x_i$ values can have extreme effects on the model fit, so we might want to remove those observations as outliers, particularly if they have large residuals.\n\n:::\n\n\n\n\n\n\n\n\n::: {.cell labels='fig-bw_lm2_resid-vs-leverage'}\n\n```{.r .cell-code}\nautoplot(bw_lm2, which = 5, ncol = 1) |> print()\n```\n\n::: {.cell-output-display}\n![`birthweight` model with interactions (@eq-BW-lm-interact): residuals versus leverage](Linear-models-overview_files/figure-pdf/unnamed-chunk-89-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n::: notes\nThe blue line should be relatively flat and close to 0 here.\n:::\n\n---\n\n### Diagnostics constructed by hand\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw = \n bw |> \n mutate(\n predlm2 = predict(bw_lm2),\n residlm2 = weight - predlm2,\n std_resid = residlm2 / sigma(bw_lm2),\n # std_resid_builtin = rstandard(bw_lm2), # uses leverage\n sqrt_abs_std_resid = std_resid |> abs() |> sqrt()\n \n )\n\n```\n:::\n\n\n\n\n\n\n\n\n##### Residuals vs fitted\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nresid_vs_fit = bw |> \n ggplot(\n aes(x = predlm2, y = 
residlm2, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n\n```\n:::\n\n\n\n\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-92-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(resid_vs_fit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-93-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n##### Standardized residuals vs fitted\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbw |> \n ggplot(\n aes(x = predlm2, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-94-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n##### Standardized residuals vs gestational age\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nbw |> \n ggplot(\n aes(x = age, y = std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-95-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n##### `sqrt(abs(rstandard()))` vs fitted\n\nCompare with `autoplot(bw_lm2, 3)`\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n\nbw |> \n ggplot(\n aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)\n ) + \n geom_point() +\n theme_classic() +\n geom_hline(yintercept = 0)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-96-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n## Model selection\n\n(adapted from @dobson4e §6.3.3; for more information on prediction, see @james2013introduction and @rms2e).\n\n::: notes\nIf we have a lot of covariates in our dataset, we 
might want to choose a small subset to use in our model.\n\nThere are a few possible metrics to consider for choosing a \"best\" model.\n:::\n\n### Mean squared error\n\nWe might want to minimize the **mean squared error**, $\\text E[(y-\\hat y)^2]$, for new observations that weren't in our data set when we fit the model.\n\nUnfortunately, $$\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2$$ gives a biased estimate of $\\text E[(y-\\hat y)^2]$ for new data. If we want an unbiased estimate, we will have to be clever.\n\n---\n\n#### Cross-validation\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-98-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n##### comparing metrics\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> 
as.numeric())\n\ncompare_results\n```\n\n::: {.cell-output-display}\n\n\n|model | cvRMSE| r.squared| adj.r.squared| trainRMSE| loglik|\n|:-------|------:|---------:|-------------:|---------:|------:|\n|full | 6.804| 0.4805| 0.3831| 5.956| -61.84|\n|reduced | 6.611| 0.4454| 0.3802| 5.971| -62.49|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(full_model, reduced_model)\n```\n\n::: {.cell-output-display}\n\n\n| Res.Df| RSS| Df| Sum of Sq| F| Pr(>F)|\n|------:|-----:|--:|---------:|-----:|------:|\n| 16| 567.7| NA| NA| NA| NA|\n| 17| 606.0| -1| -38.36| 1.081| 0.3139|\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n#### stepwise regression\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 30.301 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. 
\n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n#### Lasso\n\n$$\\arg\\min_{\\theta} \\left\\{ -\\llik(\\th) + \\lambda \\sum_{j=1}^p|\\beta_j| \\right\\}$$\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n```\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(fit, xvar = 'lambda')\n\n```\n\n::: {.cell-output-display}\n![Lasso selection](Linear-models-overview_files/figure-pdf/fig-carbs-lasso-1.pdf){#fig-carbs-lasso}\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncvfit = cv.glmnet(x,y)\nplot(cvfit)\n```\n\n::: {.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/unnamed-chunk-104-1.pdf)\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.2044\n#> age . 
\n#> weight -0.0926\n#> protein 0.8582\n```\n:::\n\n\n\n\n\n\n\n\n\n## Categorical covariates with more than two levels\n\n### Example: `birthweight`\n\nIn the birthweight example, the variable `sex` had only two observed values:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nunique(bw$sex)\n#> [1] female male \n#> Levels: female male\n```\n:::\n\n\n\n\n\n\n\n\nIf there are more than two observed values, we can't just use a single variable with 0s and 1s.\n\n### \n\n:::{.notes}\nFor example, @tbl-iris-data shows the \n[(in)famous](https://www.meganstodel.com/posts/no-to-iris/) \n`iris` data (@anderson1935irises), \nand @tbl-iris-summary provides summary statistics. \nThe data include three species: \"setosa\", \"versicolor\", and \"virginica\".\n:::\n\n\n\n\n\n\n\n\n::: {#tbl-iris-data .cell tbl-cap='The `iris` data'}\n\n```{.r .cell-code}\nhead(iris)\n```\n\n::: {.cell-output-display}\n\n\n| Sepal.Length| Sepal.Width| Petal.Length| Petal.Width|Species |\n|------------:|-----------:|------------:|-----------:|:-------|\n| 5.1| 3.5| 1.4| 0.2|setosa |\n| 4.9| 3.0| 1.4| 0.2|setosa |\n| 4.7| 3.2| 1.3| 0.2|setosa |\n| 4.6| 3.1| 1.5| 0.2|setosa |\n| 5.0| 3.6| 1.4| 0.2|setosa |\n| 5.4| 3.9| 1.7| 0.4|setosa |\n:::\n:::\n\n::: {#tbl-iris-summary .cell tbl-cap='Summary statistics for the `iris` data'}\n\n```{.r .cell-code}\nlibrary(table1)\ntable1(\n x = ~ . 
| Species,\n data = iris,\n overall = FALSE\n)\n```\n\n::: {.cell-output-display}\n\n\\begin{tabular}[t]{llll}\n\\toprule\n  & setosa & versicolor & virginica\\\\\n\\midrule\n & (N=50) & (N=50) & (N=50)\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Sepal.Length}}\\\\\n\\hspace{1em}Mean (SD) & 5.01 (0.352) & 5.94 (0.516) & 6.59 (0.636)\\\\\n\\hspace{1em}Median [Min, Max] & 5.00 [4.30, 5.80] & 5.90 [4.90, 7.00] & 6.50 [4.90, 7.90]\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Sepal.Width}}\\\\\n\\hspace{1em}Mean (SD) & 3.43 (0.379) & 2.77 (0.314) & 2.97 (0.322)\\\\\n\\hspace{1em}Median [Min, Max] & 3.40 [2.30, 4.40] & 2.80 [2.00, 3.40] & 3.00 [2.20, 3.80]\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Petal.Length}}\\\\\n\\hspace{1em}Mean (SD) & 1.46 (0.174) & 4.26 (0.470) & 5.55 (0.552)\\\\\n\\hspace{1em}Median [Min, Max] & 1.50 [1.00, 1.90] & 4.35 [3.00, 5.10] & 5.55 [4.50, 6.90]\\\\\n\\addlinespace[0.3em]\n\\multicolumn{4}{l}{\\textbf{Petal.Width}}\\\\\n\\hspace{1em}Mean (SD) & 0.246 (0.105) & 1.33 (0.198) & 2.03 (0.275)\\\\\n\\hspace{1em}Median [Min, Max] & 0.200 [0.100, 0.600] & 1.30 [1.00, 1.80] & 2.00 [1.40, 2.50]\\\\\n\\bottomrule\n\\end{tabular}\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n---\n\nIf we want to model `Sepal.Length` by species, we could create a variable $X$ that represents \"setosa\" as $X=1$, \"virginica\" as $X=2$, and \"versicolor\" as $X=3$.\n\n\n\n\n\n\n\n\n::: {#tbl-numeric-coding .cell tbl-cap='`iris` data with numeric coding of species'}\n\n```{.r .cell-code}\ndata(iris) # this step is not always necessary, but ensures you're starting \n# from the original version of a dataset stored in a loaded package\n\niris = \n iris |> \n tibble() |>\n mutate(\n X = case_when(\n Species == \"setosa\" ~ 1,\n Species == \"virginica\" ~ 2,\n Species == \"versicolor\" ~ 3\n )\n )\n\niris |> \n distinct(Species, X)\n```\n\n::: {.cell-output-display}\n\n\n|Species | X|\n|:----------|--:|\n|setosa | 1|\n|versicolor | 3|\n|virginica | 
2|\n:::\n:::\n\n\n\n\n\n\n\n\nThen we could fit a model like:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-numeric-species .cell tbl-cap='Model of `iris` data with numeric coding of `Species`'}\n\n```{.r .cell-code}\niris_lm1 = lm(Sepal.Length ~ X, data = iris)\niris_lm1 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(148) | p |\n|:-----------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 4.91 | 0.16 | (4.60, 5.23) | 30.83 | < .001 |\n|X | 0.47 | 0.07 | (0.32, 0.61) | 6.30 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n### Let's see how that model looks:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot1 = iris |> \n ggplot(\n aes(\n x = X, \n y = Sepal.Length)\n ) +\n geom_point(alpha = .1) +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) +\n theme_bw(base_size = 18)\nprint(iris_plot1)\n\n```\n\n::: {.cell-output-display}\n![Model of `iris` data with numeric coding of `Species`](Linear-models-overview_files/figure-pdf/fig-iris-numeric-species-model-1.pdf){#fig-iris-numeric-species-model}\n:::\n:::\n\n\n\n\n\n\n\n\nWe have forced the model to use a straight line for the three estimated means. 
This is probably not a good idea, because the numeric codes assigned to the species are arbitrary.\n\n### Let's see what R does with categorical variables by default:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-model-factor1 .cell tbl-cap='Model of `iris` data with `Species` as a categorical variable'}\n\n```{.r .cell-code}\niris_lm2 = lm(Sepal.Length ~ Species, data = iris)\niris_lm2 |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|(Intercept) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 0.93 | 0.10 | (0.73, 1.13) | 9.03 | < .001 |\n|Species (virginica) | 1.58 | 0.10 | (1.38, 1.79) | 15.37 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n### Re-parametrize with no intercept\n\nIf you don't want the default parametrization (a reference-level intercept plus offsets), you can remove the intercept with `- 1`, as we've seen previously:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-no-intcpt .cell}\n\n```{.r .cell-code}\niris.lm2b = lm(Sepal.Length ~ Species - 1, data = iris)\niris.lm2b |> parameters() |> print_md()\n```\n\n::: {.cell-output-display}\n\n\n|Parameter | Coefficient | SE | 95% CI | t(147) | p |\n|:--------------------|:-----------:|:----:|:------------:|:------:|:------:|\n|Species (setosa) | 5.01 | 0.07 | (4.86, 5.15) | 68.76 | < .001 |\n|Species (versicolor) | 5.94 | 0.07 | (5.79, 6.08) | 81.54 | < .001 |\n|Species (virginica) | 6.59 | 0.07 | (6.44, 6.73) | 90.49 | < .001 |\n\n\n:::\n:::\n\n\n\n\n\n\n\n\n### Let's see what these new models look like:\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\niris_plot2 = \n iris |> \n mutate(\n predlm2 = predict(iris_lm2)) |> \n arrange(X) |> \n ggplot(aes(x = X, y = Sepal.Length)) +\n geom_point(alpha = .1) +\n geom_line(aes(y = predlm2), col = \"red\") +\n geom_abline(\n intercept = coef(iris_lm1)[1], \n slope = coef(iris_lm1)[2]) + \n theme_bw(base_size = 18)\n\nprint(iris_plot2)\n\n```\n\n::: 
{.cell-output-display}\n![](Linear-models-overview_files/figure-pdf/fig-iris-no-intcpt-1.pdf){#fig-iris-no-intcpt}\n:::\n:::\n\n\n\n\n\n\n\n\n### Let's see how R did that:\n\n\n\n\n\n\n\n\n::: {#tbl-iris-model-matrix-factor .cell}\n\n```{.r .cell-code}\nformula(iris_lm2)\n#> Sepal.Length ~ Species\nmodel.matrix(iris_lm2) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| (Intercept)| Speciesversicolor| Speciesvirginica|\n|-----------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 1| 1| 0|\n| 1| 0| 1|\n:::\n:::\n\n\n\n\n\n\n\n\nThis is called a \"corner point parametrization\": the intercept is the mean for the reference level (`setosa`), and each remaining coefficient is that level's difference from the reference.\n\n\n\n\n\n\n\n\n::: {#tbl-iris-group-point-parameterization .cell}\n\n```{.r .cell-code}\nformula(iris.lm2b)\n#> Sepal.Length ~ Species - 1\nmodel.matrix(iris.lm2b) |> as_tibble() |> unique()\n```\n\n::: {.cell-output-display}\n\n\n| Speciessetosa| Speciesversicolor| Speciesvirginica|\n|-------------:|-----------------:|----------------:|\n| 1| 0| 0|\n| 0| 1| 0|\n| 0| 0| 1|\n:::\n:::\n\n\n\n\n\n\n\n\nThis can be called a \"group point parametrization\": each coefficient is the mean for one species.\n\nThere are more options; see @dobson4e §6.4.1 and the \n[`codingMatrices` package](https://CRAN.R-project.org/package=codingMatrices) \n[vignette](https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf) \n(@venablescodingMatrices).\n\n## Ordinal covariates\n\n(cf. 
@dobson4e §2.4.4)\n\n---\n\n::: notes\nWe can create ordinal variables in R using the `ordered()` function^[or equivalently, `factor(ordered = TRUE)`].\n:::\n\n:::{#exm-ordinal-variable}\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl = paste0(\n \"https://regression.ucsf.edu/sites/g/files/tkssra6706/\",\n \"f/wysiwyg/home/data/hersdata.dta\")\nlibrary(haven)\nhers = read_dta(url)\n```\n:::\n\n\n\n::: {#tbl-HERS .cell tbl-cap='HERS dataset'}\n\n```{.r .cell-code}\nhers |> head()\n```\n\n::: {.cell-output-display}\n\n\n| HT| age| raceth| nonwhite| smoking| drinkany| exercise| physact| globrat| poorfair| medcond| htnmeds| statins| diabetes| dmpills| insulin| weight| BMI| waist| WHR| glucose| weight1| BMI1| waist1| WHR1| glucose1| tchol| LDL| HDL| TG| tchol1| LDL1| HDL1| TG1| SBP| DBP| age10|\n|--:|---:|------:|--------:|-------:|--------:|--------:|-------:|-------:|--------:|-------:|-------:|-------:|--------:|-------:|-------:|------:|-----:|-----:|-----:|-------:|-------:|-----:|------:|-----:|--------:|-----:|-----:|---:|---:|------:|-----:|----:|---:|---:|---:|-----:|\n| 0| 70| 2| 1| 0| 0| 0| 5| 3| 0| 0| 1| 1| 0| 0| 0| 73.8| 23.69| 96.0| 0.932| 84| 73.6| 23.63| 93.0| 0.912| 94| 189| 122.4| 52| 73| 201| 137.6| 48| 77| 138| 78| 7.0|\n| 0| 62| 2| 1| 0| 0| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 70.9| 28.62| 93.0| 0.964| 111| 73.4| 28.89| 95.0| 0.964| 78| 307| 241.6| 44| 107| 216| 150.6| 48| 87| 118| 70| 6.2|\n| 1| 69| 1| 0| 0| 0| 0| 3| 3| 0| 0| 1| 0| 1| 0| 0| 102.0| 42.51| 110.2| 0.782| 114| 96.1| 40.73| 103.0| 0.774| 98| 254| 166.2| 57| 154| 254| 156.0| 66| 160| 134| 78| 6.9|\n| 0| 64| 1| 0| 1| 1| 0| 1| 3| 0| 1| 1| 0| 0| 0| 0| 64.4| 24.39| 87.0| 0.877| 94| 58.6| 22.52| 77.0| 0.802| 93| 204| 116.2| 56| 159| 207| 122.6| 57| 137| 152| 72| 6.4|\n| 0| 65| 1| 0| 0| 0| 0| 2| 3| 0| 0| 0| 0| 0| 0| 0| 57.9| 21.90| 77.0| 0.794| 101| 58.9| 22.28| 76.5| 0.757| 92| 214| 150.6| 42| 107| 235| 172.2| 35| 139| 175| 95| 6.5|\n| 1| 68| 2| 1| 0| 1| 0| 3| 3| 0| 0| 0| 0| 0| 0| 0| 60.9| 
29.05| 96.0| 1.000| 116| 57.7| 27.52| 86.0| 0.910| 115| 212| 137.8| 52| 111| 202| 126.6| 53| 112| 174| 98| 6.8|\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n---\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\n# C(contr = codingMatrices::contr.diff)\n\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/Linear-models-overview/figure-html/unnamed-chunk-104-1.png b/_freeze/Linear-models-overview/figure-html/unnamed-chunk-104-1.png index 45c9f8d..7758fe1 100644 Binary files a/_freeze/Linear-models-overview/figure-html/unnamed-chunk-104-1.png and b/_freeze/Linear-models-overview/figure-html/unnamed-chunk-104-1.png differ diff --git a/_freeze/Linear-models-overview/figure-html/unnamed-chunk-98-1.png b/_freeze/Linear-models-overview/figure-html/unnamed-chunk-98-1.png index 9a65470..d1a5772 100644 Binary files a/_freeze/Linear-models-overview/figure-html/unnamed-chunk-98-1.png and b/_freeze/Linear-models-overview/figure-html/unnamed-chunk-98-1.png differ diff --git a/intro-multilevel-models.qmd b/intro-multilevel-models.qmd index c1c65ad..bf23a27 100644 --- a/intro-multilevel-models.qmd +++ b/intro-multilevel-models.qmd @@ -1,5 +1,5 @@ # Introduction to multi-level models for correlated data For more, see -[EVE 225](https://catalog.ucdavis.edu/search/?q=EVE+225) -— Linear Mixed Modeling in Ecology & Evolution +[EVE 225](https://catalog.ucdavis.edu/search/?q=EVE+225): +Linear Mixed Modeling in Ecology & Evolution diff --git a/intro-to-survival-analysis.qmd b/intro-to-survival-analysis.qmd index 82b0dee..936eead 100644 --- a/intro-to-survival-analysis.qmd +++ b/intro-to-survival-analysis.qmd @@ -606,7 +606,7 @@ The first day is the most dangerous: # paste0( # "https://github.com/therneau/survival/raw/", # "f3ac93704949ff26e07720b56f2b18ffa8066470/", -# "data/survexp.rda") +# "Data/survexp.rda") #(newer versions of `survival` don't have the first-year breakdown; see: # https://cran.r-project.org/web/packages/survival/news.html)