diff --git a/.nojekyll b/.nojekyll index 07423882b..f38b1f54a 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -c97b46da \ No newline at end of file +fb006b97 \ No newline at end of file diff --git a/Linear-models-overview.html b/Linear-models-overview.html index e01ecb5fe..981677402 100644 --- a/Linear-models-overview.html +++ b/Linear-models-overview.html @@ -3926,7 +3926,7 @@

Note

full -6.838 +6.900 0.4805 0.3831 5.956 @@ -3934,7 +3934,7 @@

Note

reduced -6.610 +6.581 0.4454 0.3802 5.971 @@ -4071,10 +4071,10 @@

Note

Show R code
coef(cvfit, s = "lambda.1se")
 #> 4 x 1 sparse Matrix of class "dgCMatrix"
 #>                  s1
-#> (Intercept) 34.4241
+#> (Intercept) 34.2044
 #> age          .     
-#> weight      -0.0662
-#> protein      0.6607
+#> weight      -0.0926
+#> protein      0.8582

diff --git a/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png b/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png index c7dec15d9..222915efa 100644 Binary files a/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png and b/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png differ diff --git a/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png b/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png index 42a4f241d..c52402412 100644 Binary files a/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png and b/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png differ diff --git a/Regression-Models-for-Epidemiology.pdf b/Regression-Models-for-Epidemiology.pdf index 964ad5e48..db2ca47d2 100644 Binary files a/Regression-Models-for-Epidemiology.pdf and b/Regression-Models-for-Epidemiology.pdf differ diff --git a/coxph-model-building.html b/coxph-model-building.html index 30ec4c4c7..59db8fa14 100644 --- a/coxph-model-building.html +++ b/coxph-model-building.html @@ -446,7 +446,7 @@
Published
-

Last modified: 2024-05-21: 13:19:52 (PM)

+

Last modified: 2024-05-27: 11:07:18 (AM)

@@ -975,7 +975,7 @@
  • Both the statistical tests and the plots are useful.
  • 7.4 Goodness of Fit using the Cox-Snell Residuals

    -

    (references: Klein & Moeschberger textbook, §11.2, and Dobson & Barnett textbook, §10.6)

    +

    (references: Klein, Moeschberger, et al. (2003), §11.2, and Dobson and Barnett (2018), §10.6)

Suppose that an individual has a survival time \(T\) with survival function \(S(t)\), meaning that \(\Pr(T> t) = S(t)\). Then \(S(T)\) has a uniform distribution on \((0,1)\).
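This probability-integral-transform claim is easy to check by simulation. Here is a minimal sketch (not from the source), assuming an exponential event time with rate \(\lambda = 0.5\):

Show R code
# Sketch: if T ~ Exponential(lambda), then S(T) = exp(-lambda * T)
# should be Uniform(0,1); lambda = 0.5 is an assumed value.
set.seed(1)
lambda <- 0.5
t_event <- rexp(1e5, rate = lambda)
U <- exp(-lambda * t_event)   # S(T) for the exponential model
ks.test(U, "punif")           # large p-value: consistent with uniformity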

    \[ \begin{aligned} @@ -1396,7 +1396,7 @@

    7.7.1 Revisiting the leukemia dataset (anderson)

    We will analyze remission survival times on 42 leukemia patients, half on new treatment, half on standard treatment.

    -

    This is the same data as the drug6mp data from KMsurv, but with two other variables and without the pairing. This version comes from the Kleinbaum and Klein survival textbook (e.g., p281):

    +

    This is the same data as the drug6mp data from KMsurv, but with two other variables and without the pairing. This version comes from Kleinbaum and Klein (2012) (e.g., p281):

    Show R code
    
     anderson = 
    @@ -1754,7 +1754,7 @@
     #> # ℹ 9 more variables: z2 <int>, z3 <int>, z4 <int>, z5 <int>, z6 <int>,
     #> #   z7 <int>, z8 <int>, z9 <int>, z10 <int>
    -

    This dataset comes from the Copelan et al. (1991) study of allogenic bone marrow transplant therapy for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).

    +

This dataset comes from the Copelan et al. (1991) study of allogeneic bone marrow transplant therapy for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).

    Outcomes (endpoints)
    • The main endpoint is disease-free survival (t2 and d3) for the three risk groups, “ALL”, “AML Low Risk”, and “AML High Risk”.
    • @@ -2218,7 +2218,7 @@

    7.9 Recurrent Events

    -

    (Adapted from Kleinbaum and Klein, Ch 8)

    +

    (Adapted from Kleinbaum and Klein (2012), Ch 8)

    7.9.1 Bladder Cancer Data Set

    -

    The bladder cancer dataset from Kleinbaum and Klein contains recurrent event outcome information for eighty-six cancer patients followed for the recurrence of bladder cancer tumor after transurethral surgical excision (Byar and Green 1980). The exposure of interest is the effect of the drug treatment of thiotepa. Control variables are the initial number and initial size of tumors. The data layout is suitable for a counting processes approach.

    +

    The bladder cancer dataset from Kleinbaum and Klein (2012) contains recurrent event outcome information for eighty-six cancer patients followed for the recurrence of bladder cancer tumor after transurethral surgical excision (Byar and Green 1980). The exposure of interest is the effect of the drug treatment of thiotepa. Control variables are the initial number and initial size of tumors. The data layout is suitable for a counting processes approach.

This drug is still a possible choice for some patients. Another therapeutic choice is Bacillus Calmette-Guerin (BCG), a live bacterium related to the one that causes tuberculosis in cattle.
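To make the counting-process layout concrete, here is a minimal sketch (not from the source) using the survival package's bundled bladder2 data, which stores one (start, stop] interval per row; the variable names follow that bundled dataset and may differ from the textbook's version:

Show R code
# Sketch: counting-process (Andersen-Gill-style) Cox model for
# recurrent bladder-cancer events, with robust SEs clustered by patient
library(survival)
head(bladder2)  # id, rx, number, size, start, stop, event, enum
fit <- coxph(
  Surv(start, stop, event) ~ rx + number + size + cluster(id),
  data = bladder2)
summary(fit)    # cluster(id) handles within-patient correlation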

    Data dictionary

    @@ -2459,12 +2459,21 @@
    Canchola, Alison J, Susan L Stewart, Leslie Bernstein, Dee W West, Ronald K Ross, Dennis Deapen, Richard Pinder, et al. 2003. “Cox Regression Using Different Time-Scales.” Western Users of SAS Software. https://www.lexjansen.com/wuss/2003/DataAnalysis/i-cox_time_scales.pdf.
    +
+Copelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham, and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia with Allogeneic Bone Marrow Transplantation Following Preparation with BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838. +
    +
+Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC Press. https://doi.org/10.1201/9781315182780. +
    Grambsch, Patricia M, and Terry M Therneau. 1994. “Proportional Hazards Tests and Diagnostics Based on Weighted Residuals.” Biometrika 81 (3): 515–26. https://doi.org/10.1093/biomet/81.3.515.
    Klein, John P, Melvin L Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.
    +
+Kleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9. +
    diff --git a/intro-to-survival-analysis.html b/intro-to-survival-analysis.html index 260378b33..3ce1203d8 100644 --- a/intro-to-survival-analysis.html +++ b/intro-to-survival-analysis.html @@ -51,7 +51,27 @@ @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } - +/* CSS for citations */ +div.csl-bib-body { } +div.csl-entry { + clear: both; + margin-bottom: 0em; +} +.hanging-indent div.csl-entry { + margin-left:2em; + text-indent:-2em; +} +div.csl-left-margin { + min-width:2em; + float:left; +} +div.csl-right-inline { + margin-left:2em; + padding-left:1em; +} +div.csl-indent { + margin-left: 2em; +} @@ -406,11 +426,13 @@
  • 5.9 Example: Bone Marrow Transplant Data
  • @@ -436,7 +458,7 @@
    Published
    -

    Last modified: 2024-05-23: 17:33:42 (PM)

    +

    Last modified: 2024-05-27: 11:49:27 (AM)

    @@ -611,11 +633,14 @@ \end{aligned} \]


    +
    +

    Definition 5.1 (Survival function)  

    The survival function \(S(t)\) is the probability that the event time is later than \(t\). If the event in a clinical trial is death, then \(S(t)\) is the expected fraction of the original population at time 0 who have survived up to time \(t\) and are still alive at time \(t\); that is:

    If \(X_t\) represents survival status at time \(t\), with \(X_t = 1\) denoting alive at time \(t\) and \(X_t = 0\) denoting deceased at time \(t\), then:

    \[S(t) = \Pr(X_t=1) = \mathbb{E}[X_t]\]
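Because \(S(t)\) is the expectation of the alive-at-\(t\) indicator, it can be approximated by the sample mean of simulated indicators. A minimal sketch (not from the source), assuming an exponential event time:

Show R code
# Sketch: S(t0) = E[X_t0], estimated by the mean of alive-at-t0 indicators
set.seed(1)
lambda <- 0.5   # assumed rate
t0 <- 2         # assumed time point
X <- as.numeric(rexp(1e5, rate = lambda) > t0)
mean(X)             # simulated fraction still alive at t0
exp(-lambda * t0)   # exact S(t0) for the exponential model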

    +

    Example 5.3 (exponential distribution) Since \(S(t) = 1 - F(t)\), the survival function of the exponential distribution family of models is:

    @@ -641,7 +666,7 @@ 5.3.4 The Hazard Function

    Another important quantity is the hazard function:

    -

    Definition 5.1 (Hazard function) The hazard function for a random variable \(T\) at value \(t\) is the conditional density of \(T\) at \(t\), given \(T\ge t\); that is:

    +

    Definition 5.2 (Hazard function) The hazard function for a random variable \(T\) at value \(t\) is the conditional density of \(T\) at \(t\), given \(T\ge t\); that is:

    \[h(t) \stackrel{\text{def}}{=}p(T=t|T\ge t)\]

    If \(T\) represents the time at which an event occurs, then \(h(t)\) is the probability that the event occurs at time \(t\), given that it has not occurred prior to time \(t\).
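For a continuous distribution, this definition reduces to the ratio of the density to the survival function (Theorem 5.1). A minimal numerical sketch (not from the source) for the exponential model, whose hazard should be constant:

Show R code
# Sketch: for T ~ Exponential(0.5), h(t) = f(t)/S(t) is constant at 0.5
lambda <- 0.5
t <- seq(0.1, 10, by = 0.1)
h <- dexp(t, rate = lambda) / pexp(t, rate = lambda, lower.tail = FALSE)
range(h)   # both endpoints equal lambda: constant hazard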

    @@ -724,7 +749,7 @@

    The integral in Equation 5.1 is important enough to have its own name: cumulative hazard.

    -

    Definition 5.2 (cumulative hazard) The cumulative hazard function \(H(t)\) is defined as:

    +

    Definition 5.3 (cumulative hazard) The cumulative hazard function \(H(t)\) is defined as:

    \[H(t) \stackrel{\text{def}}{=}\int_{u=0}^t h(u) du\]

    As we will see below, \(H(t)\) is tractable to estimate, and we can then derive an estimate of the hazard function using an approximate derivative of the estimated cumulative hazard.
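A minimal sketch (not from the source) of the hazard-to-survival pipeline, numerically integrating an assumed constant hazard and exponentiating:

Show R code
# Sketch: H(t) as a Riemann sum of h(u), then S(t) = exp(-H(t))
lambda <- 0.5
dt <- 0.01
t <- seq(dt, 10, by = dt)
H <- cumsum(rep(lambda * dt, length(t)))   # integral of a constant hazard
max(abs(exp(-H) - exp(-lambda * t)))       # ~0: matches the exponential S(t)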

    @@ -1293,19 +1318,22 @@
  • At each event time \(t\), we compute the at-risk group size \(Y\), which is all those observations whose event time or censoring time is at least \(t\).

  • If \(d\) of the observations have an event time (not a censoring time) of \(t\), then the group of survivors immediately following time \(t\) is reduced by the fraction \[\frac{Y-d}{Y}=1-\frac{d}{Y}\]

  • -

    If the event times are \(t_i\) with events per time of \(d_i\) (\(1\le i \le k\)), then

    -

    \[\hat S(t) = \prod_{t_i < t}[1-d_i/Y_i]\]

    +
    +
    +

    Definition 5.4 (Kaplan-Meier Product-Limit Estimator of Survival Function) If the event times are \(t_i\) with events per time of \(d_i\) (\(1\le i \le k\)), then the Kaplan-Meier Product-Limit Estimator of the survival function is:

    +

\[\hat S(t) = \prod_{t_i < t} \left[1-\frac{d_i}{Y_i}\right] \tag{5.3}\]

    where \(Y_i\) is the set of observations whose time (event or censored) is \(\ge t_i\), the group at risk at time \(t_i\).

    +

    If there are no censored data, and there are \(n\) data points, then just after (say) the third event time

    \[ \begin{aligned} \hat S(t) -&= \prod_{t_i < t}[1-d_i/Y_i]\\ -&=[\frac{n-d_1}{n}][\frac{n-d_1-d_2}{n-d_1}][\frac{n-d_1-d_2-d_3}{n-d_1-d_2}]\\ -&=\frac{n-d_1-d_2-d_3}{n}\\ -&=1-\frac{d_1+d_2+d_3}{n}\\ -&=1-\hat F(t)\\ +&= \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right] +\\ &= \left[\frac{n-d_1}{n}\right] \left[\frac{n-d_1-d_2}{n-d_1}\right] \left[\frac{n-d_1-d_2-d_3}{n-d_1-d_2}\right] +\\ &= \frac{n-d_1-d_2-d_3}{n} +\\ &=1-\frac{d_1+d_2+d_3}{n} +\\ &=1-\hat F(t) \end{aligned} \]

    where \(\hat F(t)\) is the usual empirical CDF estimate.
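A minimal sketch (not from the source) of Equation 5.3 on toy data (values assumed for illustration), checked against survival::survfit:

Show R code
# Sketch: hand-rolled Kaplan-Meier product vs. survfit
library(survival)
time  <- c(6, 6, 6, 7, 9, 10, 10, 11, 13, 16)
event <- c(1, 1, 1, 1, 0,  1,  0,  0,  1,  0)
fit <- survfit(Surv(time, event) ~ 1)
d <- sapply(fit$time, function(u) sum(time == u & event == 1))  # events at u
Y <- sapply(fit$time, function(u) sum(time >= u))               # at risk at u
cbind(time = fit$time, by_hand = cumprod(1 - d / Y), survfit = fit$surv)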

    @@ -1932,8 +1960,19 @@

Interestingly, accounting for pairing reduces the significance of the difference.

    5.9 Example: Bone Marrow Transplant Data

    -

    (Copelan et al., 1991)

    +

    Copelan et al. (1991)


    +
    +
    +Figure 5.6: Recovery process from a bone marrow transplant (Fig. 1.1 from Klein, Moeschberger, et al. (2003)) +
    + +
    +
    +
    +
    +

    +5.9.1 Study design

    Treatment
    • @@ -1957,27 +1996,34 @@
    • death

    Any or all of these events may be censored.

    -

    -5.9.1 KMsurv::bmt data in R

    +

    +5.9.2 KMsurv::bmt data in R

    Show R code
    library(KMsurv)
     ?bmt
    -

    -5.9.2 Analysis plan

    +

    +5.9.3 Analysis plan

    • We concentrate for now on disease-free survival (t2 and d3) for the three risk groups, ALL, AML Low Risk, and AML High Risk.
    • We will construct the Kaplan-Meier survival curves, compare them, and test for differences.
    • We will construct the cumulative hazard curves and compare them.
    • We will estimate the hazard functions, interpret, and compare them.
    • -

    -5.9.3 Survival Function Estimate and Variance

    +

    +5.9.4 Survival Function Estimate and Variance

    \[\hat S(t) = \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]\] where \(Y_i\) is the group at risk at time \(t_i\).

    -

    The estimated variance of \(\hat S(t)\) is (Greenwood’s formula) \[\hat{\text{Var}}\left(\hat S (t)\right) = \hat S(t)^2\sum_{t_i <t}\frac{d_i}{Y_i(Y_i-d_i)}\] which we can use for confidence intervals for a survival function or a difference of survival functions.

    +

    The estimated variance of \(\hat S(t)\) is:

    +
    +

Theorem 5.4 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \[
+\widehat{\text{Var}}\left(\hat S(t)\right) = \hat S(t)^2\sum_{t_i <t}\frac{d_i}{Y_i(Y_i-d_i)}
+\tag{5.4}\]

    +
    +

    We can use Equation 5.4 for confidence intervals for a survival function or a difference of survival functions.
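A minimal sketch (not from the source) of Equation 5.4 on assumed toy data, compared against survfit. Note that survfit's std.err component is on the log-survival scale, so it is multiplied by \(\hat S(t)\) for the comparison:

Show R code
# Sketch: Greenwood's formula by hand vs. survfit's Greenwood-based SEs
library(survival)
time  <- c(6, 6, 6, 7, 9, 10, 10, 11, 13, 16)
event <- c(1, 1, 1, 1, 0,  1,  0,  0,  1,  0)
fit <- survfit(Surv(time, event) ~ 1)
d <- sapply(fit$time, function(u) sum(time == u & event == 1))
Y <- sapply(fit$time, function(u) sum(time >= u))
se_S <- fit$surv * sqrt(cumsum(d / (Y * (Y - d))))   # Equation 5.4
cbind(time = fit$time, by_hand = se_S, survfit = fit$surv * fit$std.err)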

    +
    Kaplan-Meier survival curves
    -
    code to preprocess and model bmt data
    library(KMsurv)
    +
    code to preprocess and model bmt data
    library(KMsurv)
     library(survival)
     data(bmt)
     
    @@ -2012,8 +2058,11 @@
     
    -

    Understanding Greenwood’s formula (optional)

    -

    To see where Greenwood’s formula comes from, let \(x_i=Y_i-d_i\). We approximate the solution treating each time as independent, with \(Y_i\) fixed and ignore randomness in times of failure and we treat \(x_i\) as independent binomials \(\text{Bin}(Y_i,p_i)\). Letting \(S(t)\) be the “true” survival function

    +

    +5.9.5 Understanding Greenwood’s formula (optional)

    +
    +

To see where Greenwood’s formula comes from, let \(x_i = Y_i - d_i\). We approximate by treating each event time as independent, holding \(Y_i\) fixed, ignoring randomness in the failure times, and treating the \(x_i\) as independent binomials \(\text{Bin}(Y_i,p_i)\). Letting \(S(t)\) be the “true” survival function,

    +

    \[ \begin{aligned} \hat S(t) &=\prod_{t_i<t}x_i/Y_i\\ @@ -2023,25 +2072,28 @@

\[
\begin{aligned}
\frac{\hat S(t)}{S(t)}
-&= \prod_{t_i<t} \frac{x_i}{p_iY_i}=\prod_{t_i<t} \frac{\hat p_i}{p_i}\\
-&=\prod_{t_i<t} \left( 1+\frac{\hat p_i-p_i}{p_i} \right)\\
-&\approx 1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i} \\
+&= \prod_{t_i<t} \frac{x_i}{p_iY_i}
+\\ &= \prod_{t_i<t} \frac{\hat p_i}{p_i}
+\\ &= \prod_{t_i<t} \left(1+\frac{\hat p_i-p_i}{p_i}\right)
+\\ &\approx 1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i}
\end{aligned}
\]

    +

    \[ \begin{aligned} \text{Var}\left(\frac{\hat S(t)}{S(t)}\right) -&\approx \text{Var}\left(1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i}\right) \\ -&=\sum_{t_i<t} \frac{1}{p_i^2}\frac{p_i(1-p_i)}{Y_i} \\ -&= \sum_{t_i<t} \frac{(1-p_i)}{p_iY_i} -\approx\sum_{t_i<t} \frac{(1-x_i/Y_i)}{x_i}\\ -&=\sum_{t_i<t} \frac{Y_i-x_i}{x_iY_i}=\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)}\\ -\text{Var}\left(\hat S(t)\right) +&\approx \text{Var}\left(1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i}\right) +\\ &=\sum_{t_i<t} \frac{1}{p_i^2}\frac{p_i(1-p_i)}{Y_i} +\\ &= \sum_{t_i<t} \frac{(1-p_i)}{p_iY_i} +\\ &\approx\sum_{t_i<t} \frac{(1-x_i/Y_i)}{x_i} +\\ &=\sum_{t_i<t} \frac{Y_i-x_i}{x_iY_i} +\\ &=\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} +\\ \therefore\text{Var}\left(\hat S(t)\right) &\approx \hat S(t)^2\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} \end{aligned} \]

    -

    -5.9.4 Test for differences among the disease groups

    +

    +5.9.6 Test for differences among the disease groups

Here we compute a chi-square test for association between disease group (group) and disease-free survival:

    Show R code
    survdiff(surv ~ group, data = bmt)
    @@ -2056,8 +2108,8 @@
     #>  Chisq= 13.8  on 2 degrees of freedom, p= 0.001
    -

    -5.9.5 Cumulative Hazard

    +

    +5.9.7 Cumulative Hazard

    \[ \begin{aligned} h(t) @@ -2094,8 +2146,12 @@ theme_bw() + theme(legend.position="bottom")

    -
    -
    Disease-Free Cumulative Hazard by Disease Group

    +
    +
    +Figure 5.7: Disease-Free Cumulative Hazard by Disease Group +
    + +
    @@ -2103,17 +2159,27 @@

    5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival


    +
    +

    Definition 5.5 (Nelson-Aalen Cumulative Hazard Estimator)  

    +

    The point hazard at time \(t_i\) can be estimated by \(d_i/Y_i\), which leads to the Nelson-Aalen estimator of the cumulative hazard:

    -

    \[\hat H_{NA}(t) \stackrel{\text{def}}{=}\sum_{t_i < t}\frac{d_i}{Y_i}\]

    +
    +

    \[\hat H_{NA}(t) \stackrel{\text{def}}{=}\sum_{t_i < t}\frac{d_i}{Y_i} \tag{5.5}\]

    +

    +
    +

    Theorem 5.5 (Variance of Nelson-Aalen estimator)  

    +

    The variance of this estimator is approximately:

    -

    \[ +

    +

    \[ \begin{aligned} \hat{\text{Var}}\left(\hat H_{NA} (t)\right) &= \sum_{t_i <t}\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\ &\approx \sum_{t_i <t}\frac{d_i}{Y_i^2} \end{aligned} -\]

    +\tag{5.6}\]

    +

    Since \(S(t)=\text{exp}\left\{-H(t)\right\}\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival function:

\[
\hat S_{NA}(t) = \text{exp}\left\{-\hat H_{NA}(t)\right\}
\]
@@ -2179,6 +2245,14 @@ +
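In code, a minimal sketch (not from the source) of Equations 5.5 and 5.6 and this exponentiated-cumulative-hazard survival estimate, on assumed toy data:

Show R code
# Sketch: Nelson-Aalen cumulative hazard, its approximate variance,
# and the derived survival estimate exp(-H)
library(survival)
time  <- c(6, 6, 6, 7, 9, 10, 10, 11, 13, 16)
event <- c(1, 1, 1, 1, 0,  1,  0,  0,  1,  0)
fit <- survfit(Surv(time, event) ~ 1)
d <- sapply(fit$time, function(u) sum(time == u & event == 1))
Y <- sapply(fit$time, function(u) sum(time >= u))
H_NA  <- cumsum(d / Y)     # Equation 5.5
var_H <- cumsum(d / Y^2)   # Equation 5.6 (approximation)
cbind(time = fit$time, H_NA, se_H = sqrt(var_H), S_alt = exp(-H_NA))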

    +
    + @@ -3855,8 +3855,8 @@

    wcgs_response_resid_plot |> ggplotly()
    -
    - +
    +

We can see a slight fan shape here: observations on the right have larger variance (as expected, since \(\text{var}(\bar y) = \pi(1-\pi)/n\) increases as \(\pi\) approaches \(0.5\)).
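The variance claim is easy to visualize. A minimal sketch (not from the source):

Show R code
# Sketch: the binomial-proportion variance pi*(1-pi)/n peaks at pi = 0.5
curve(x * (1 - x), from = 0, to = 1,
      xlab = expression(pi), ylab = expression(pi * (1 - pi)))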

    @@ -3992,8 +3992,8 @@

Show R code
    wcgs_resid_plot1 |> ggplotly()
    -
    - +
    +
    diff --git a/references.html b/references.html index 9881981b0..f7abd952a 100644 --- a/references.html +++ b/references.html @@ -420,6 +420,14 @@ Clayton, David, and Michael Hills. 2013. Statistical Models in Epidemiology. OUP Oxford. +
+Copelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff
+Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham,
+and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia
+with Allogeneic Bone Marrow Transplantation Following Preparation with
+BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838.
+
    Dalgaard, Peter. 2008. Introductory Statistics with r. New York, NY: Springer New York. https://link.springer.com/book/10.1007/978-0-387-79054-1. diff --git a/search.json b/search.json index 06ded1dc9..1b0abc1f5 100644 --- a/search.json +++ b/search.json @@ -193,7 +193,7 @@ "href": "Linear-models-overview.html#model-selection-1", "title": "\n2  Linear (Gaussian) Models\n", "section": "\n2.9 Model selection", - "text": "2.9 Model selection\n(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).\n\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\nThere are a few possible metrics to consider for choosing a “best” model.\n\n\n2.9.1 Mean squared error\nWe might want to minimize the mean squared error, \\(\\text E[(y-\\hat y)^2]\\), for new observations that weren’t in our data set when we fit the model.\nUnfortunately, \\[\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2\\] gives a biased estimate of \\(\\text E[(y-\\hat y)^2]\\) for new data. If we want an unbiased estimate, we will have to be clever.\n\nCross-validation\n\nShow R codedata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\n\n\n\nShow R coderesults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\ncomparing metrics\n\nShow R code\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n\n\n\nmodel\ncvRMSE\nr.squared\nadj.r.squared\ntrainRMSE\nloglik\n\n\n\nfull\n6.838\n0.4805\n0.3831\n5.956\n-61.84\n\n\nreduced\n6.610\n0.4454\n0.3802\n5.971\n-62.49\n\n\n\n\n\n\n\nShow R codeanova(full_model, reduced_model)\n\n\n\nRes.Df\nRSS\nDf\nSum of Sq\nF\nPr(>F)\n\n\n\n16\n567.7\nNA\nNA\nNA\nNA\n\n\n17\n606.0\n-1\n-38.36\n1.081\n0.3139\n\n\n\n\n\nstepwise regression\n\nShow R codelibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 35.648 \n#> Adj. R-Squared 0.380 Coef. 
Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. \n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n\n\nLasso\n\\[\\arg min_{\\theta} \\ell(\\theta) + \\lambda \\sum_{j=1}^p|\\beta_j|\\]\n\nShow R codelibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n\n\n\n\nShow R codeautoplot(fit, xvar = 'lambda')\n\n\n\nFigure 2.19: Lasso selection\n\n\n\n\n\n\n\n\n\nShow R codecvfit = cv.glmnet(x,y)\nplot(cvfit)\n\n\n\n\n\n\n\n\n\nShow R codecoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.4241\n#> age . \n#> weight -0.0662\n#> protein 0.6607", + "text": "2.9 Model selection\n(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).\n\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\nThere are a few possible metrics to consider for choosing a “best” model.\n\n\n2.9.1 Mean squared error\nWe might want to minimize the mean squared error, \\(\\text E[(y-\\hat y)^2]\\), for new observations that weren’t in our data set when we fit the model.\nUnfortunately, \\[\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2\\] gives a biased estimate of \\(\\text E[(y-\\hat y)^2]\\) for new data. If we want an unbiased estimate, we will have to be clever.\n\nCross-validation\n\nShow R codedata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . 
- age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\n\n\n\nShow R coderesults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\ncomparing metrics\n\nShow R code\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n\n\n\nmodel\ncvRMSE\nr.squared\nadj.r.squared\ntrainRMSE\nloglik\n\n\n\nfull\n6.900\n0.4805\n0.3831\n5.956\n-61.84\n\n\nreduced\n6.581\n0.4454\n0.3802\n5.971\n-62.49\n\n\n\n\n\n\n\nShow R codeanova(full_model, reduced_model)\n\n\n\nRes.Df\nRSS\nDf\nSum of Sq\nF\nPr(>F)\n\n\n\n16\n567.7\nNA\nNA\nNA\nNA\n\n\n17\n606.0\n-1\n-38.36\n1.081\n0.3139\n\n\n\n\n\nstepwise regression\n\nShow R codelibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 35.648 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. \n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. 
Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n\n\nLasso\n\\[\\arg min_{\\theta} \\ell(\\theta) + \\lambda \\sum_{j=1}^p|\\beta_j|\\]\n\nShow R codelibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n\n\n\n\nShow R codeautoplot(fit, xvar = 'lambda')\n\n\n\nFigure 2.19: Lasso selection\n\n\n\n\n\n\n\n\n\nShow R codecvfit = cv.glmnet(x,y)\nplot(cvfit)\n\n\n\n\n\n\n\n\n\nShow R codecoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.2044\n#> age . \n#> weight -0.0926\n#> protein 0.8582", "crumbs": [ "Generalized Linear Models", "2  Linear (Gaussian) Models" @@ -533,7 +533,7 @@ "href": "intro-to-survival-analysis.html#distribution-functions-for-time-to-event-variables", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.3 Distribution functions for time-to-event variables", - "text": "5.3 Distribution functions for time-to-event variables\n\n5.3.1 The Probability Density Function (PDF)\nFor a time-to-event variable \\(T\\) with a continuous distribution, the probability density function is defined as usual (see Section B.4.1).\n\nIn most time-to-event models, this density is assumed to be 0 for all \\(t<0\\); that is, \\(f(t) = 0, \\forall t<0\\). In other words, the support of \\(T\\) is typically \\([0,\\infty)\\).\n\n\n\nExample 5.1 (exponential distribution) Recall from Epi 202: the pdf of the exponential distribution family of models is:\n\\[p(T=t) = \\mathbb{1}_{t \\ge 0} \\cdot \\lambda \\text{e}^{-\\lambda t}\\]\nwhere \\(\\lambda > 0\\).\n\nHere are some examples of exponential pdfs:\n\n\n\n\n\n\n\n\n\n\n5.3.2 The Cumulative Distribution Function (CDF)\nThe cumulative distribution function is defined as:\n\\[\n\\begin{aligned}\nF(t) &\\stackrel{\\text{def}}{=}\\Pr(T \\le t)\\\\\n&=\\int_{u=-\\infty}^t f(u) du\n\\end{aligned}\n\\]\n\nExample 5.2 (exponential distribution) Recall from Epi 202: the cdf of the exponential distribution family of models is:\n\\[\nP(T\\le t) = \\mathbb{1}_{t \\ge 0} \\cdot (1- \\text{e}^{-\\lambda t})\n\\] where \\(\\lambda > 0\\).\n\nHere are some examples of exponential cdfs:\n\n\n\n\n\n\n\n\n\n5.3.3 The Survival Function\nFor survival data, a more important quantity is the survival function:\n\\[\n\\begin{aligned}\nS(t) &\\stackrel{\\text{def}}{=}\\Pr(T > t)\\\\\n&=\\int_{u=t}^\\infty p(u) du\\\\\n&=1-F(t)\\\\\n\\end{aligned}\n\\]\n\n\nThe survival function \\(S(t)\\) is the probability that the event time is later than \\(t\\). 
If the event in a clinical trial is death, then \\(S(t)\\) is the expected fraction of the original population at time 0 who have survived up to time \\(t\\) and are still alive at time \\(t\\); that is:\n\nIf \\(X_t\\) represents survival status at time \\(t\\), with \\(X_t = 1\\) denoting alive at time \\(t\\) and \\(X_t = 0\\) denoting deceased at time \\(t\\), then:\n\\[S(t) = \\Pr(X_t=1) = \\mathbb{E}[X_t]\\]\n\n\nExample 5.3 (exponential distribution) Since \\(S(t) = 1 - F(t)\\), the survival function of the exponential distribution family of models is:\n\\[\nP(T> t) = \\left\\{ {{\\text{e}^{-\\lambda t}, t\\ge0} \\atop {1, t \\le 0}}\\right.\n\\] where \\(\\lambda > 0\\).\nFigure 5.1 shows some examples of exponential survival functions.\n\n\n\n\n\n\nFigure 5.1: Exponential Survival Functions\n\n\n\n\n\n\n\n\n5.3.4 The Hazard Function\nAnother important quantity is the hazard function:\n\nDefinition 5.1 (Hazard function) The hazard function for a random variable \\(T\\) at value \\(t\\) is the conditional density of \\(T\\) at \\(t\\), given \\(T\\ge t\\); that is:\n\\[h(t) \\stackrel{\\text{def}}{=}p(T=t|T\\ge t)\\]\nIf \\(T\\) represents the time at which an event occurs, then \\(h(t)\\) is the probability that the event occurs at time \\(t\\), given that it has not occurred prior to time \\(t\\).\n\n\n\nThe hazard function has an important relationship to the density and survival functions, which we can use to derive the hazard function for a given probability distribution (Theorem 5.1).\n\n\nLemma 5.1 (Joint probability of a variable with itself) \\[p(T=t, T\\ge t) = p(T=t)\\]\n\nProof. Recall from Epi 202: if \\(A\\) and \\(B\\) are statistical events and \\(A\\subseteq B\\), then \\(p(A, B) = p(A)\\). In particular, \\(\\{T=t\\} \\subseteq \\{T\\geq t\\}\\), so \\(p(T=t, T\\ge t) = p(T=t)\\).\n\n\n\n\nTheorem 5.1 \\[h(t)=\\frac{f(t)}{S(t)}\\]\n\n\n\nProof. \\[\n\\begin{aligned}\nh(t) &=p(T=t|T\\ge t)\\\\\n&=\\frac{p(T=t, T\\ge t)}{p(T \\ge t)}\\\\\n&=\\frac{p(T=t)}{p(T \\ge t)}\\\\\n&=\\frac{f(t)}{S(t)}\n\\end{aligned}\n\\]\n\n\n\nExample 5.4 (exponential distribution) The hazard function of the exponential distribution family of models is:\n\\[\n\\begin{aligned}\nP(T=t|T \\ge t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{\\mathbb{1}_{t \\ge 0}\\cdot \\lambda \\text{e}^{-\\lambda t}}{\\text{e}^{-\\lambda t}}\\\\\n&=\\mathbb{1}_{t \\ge 0}\\cdot \\lambda\n\\end{aligned}\n\\] Figure 5.2 shows some examples of exponential hazard functions.\n\n\n\n\n\n\nFigure 5.2: Examples of hazard functions for exponential distributions\n\n\n\n\n\n\n\n\nWe can also view the hazard function as the derivative of the negative of the logarithm of the survival function:\n\nTheorem 5.2 (transform survival to hazard) \\[h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\]\n\n\n\nProof. 
\\[\n\\begin{aligned}\nh(t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{-S'(t)}{S(t)}\\\\\n&= -\\frac{S'(t)}{S(t)}\\\\\n&=-\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\\\\n&=\\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\n\\end{aligned}\n\\]\n\n\n5.3.5 The Cumulative Hazard Function\nSince \\(h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\) (see Theorem 5.2), we also have:\n\nCorollary 5.1 \\[S(t) = \\text{exp}\\left\\{-\\int_{u=0}^t h(u)du\\right\\} \\tag{5.1}\\]\n\n\n\nThe integral in Equation 5.1 is important enough to have its own name: cumulative hazard.\n\n\nDefinition 5.2 (cumulative hazard) The cumulative hazard function \\(H(t)\\) is defined as:\n\\[H(t) \\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\]\n\nAs we will see below, \\(H(t)\\) is tractable to estimate, and we can then derive an estimate of the hazard function using an approximate derivative of the estimated cumulative hazard.\n\n\nExample 5.5 The cumulative hazard function of the exponential distribution family of models is:\n\\[\nH(t) = \\mathbb{1}_{t \\ge 0}\\cdot \\lambda t\n\\]\nFigure 5.3 shows some examples of exponential cumulative hazard functions.\n\n\n\n\n\n\nFigure 5.3: Examples of exponential cumulative hazard functions\n\n\n\n\n\n\n\n\n5.3.6 Some Key Mathematical Relationships among Survival Concepts\nDiagram:\n\\[\nh(t) \\xrightarrow[]{\\int_{u=0}^t h(u)du} H(t)\n\\xrightarrow[]{\\text{exp}\\left\\{-H(t)\\right\\}} S(t)\n\\xrightarrow[]{1-S(t)} F(t)\n\\]\n\\[\nh(t) \\xleftarrow[\\frac{\\partial}{\\partial t}H(t)]{} H(t)\n\\xleftarrow[-\\text{log}\\left\\{S(t)\\right\\}]{} S(t)\n\\xleftarrow[1-F(t)]{} F(t)\n\\]\nIdentities:\n\\[\n\\begin{aligned}\nS(t) &= 1 - F(t)\\\\\n&= \\text{exp}\\left\\{-H(t)\\right\\}\\\\\nS'(t) &= -f(t)\\\\\nH(t) &= -\\text{log}\\left\\{S(t)\\right\\}\\\\\nH'(t) &= h(t)\\\\\nh(t) &= \\frac{f(t)}{S(t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\} \\\\\nf(t) &= h(t)\\cdot S(t)\\\\\n\\end{aligned}\n\\]\n\nSome proofs (others left as exercises):\n\\[\n\\begin{aligned}\nS'(t) &= \\frac{\\partial}{\\partial t}(1-F(t))\\\\\n&= -F'(t)\\\\\n&= -f(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n&= \\frac{S'(t)}{S(t)}\\\\\n&= -\\frac{f(t)}{S(t)}\\\\\n&= -h(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\nH(t)\n&\\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\\\\n&= \\int_0^t -\\frac{\\partial}{\\partial u}\\text{log}\\left\\{S(u)\\right\\} du\\\\\n&= \\left[-\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=0}^{u=t}\\\\\n&= \\left[\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=t}^{u=0}\\\\\n&= \\text{log}\\left\\{S(0)\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= \\text{log}\\left\\{1\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= 0 - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&=-\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\n\nCorollary:\n\\[S(t) = \\text{exp}\\left\\{-H(t)\\right\\}\\]\nExample: Time to death the US in 2004\nThe first day is the most dangerous:\n\n\n\nDaily Hazard Rates in 2004 for US Females\n\n\n\n\n\nExercise: hypothesize why these curves differ where they do?\n\n\n\nDaily Hazard Rates in 2004 for US Males and Females 1-40\n\n\n\n\n\nExercise: compare and contrast this curve with the corresponding hazard curve.\n\n\n\nSurvival Curve in 2004 for US Females\n\n\n\n\n\n5.3.7 Likelihood with censoring\nIf an event time \\(T\\) is observed exactly as 
\\(T=t\\), then the likelihood of that observation is just its probability density function:\n\\[\n\\begin{aligned}\n\\mathcal L(t)\n&= p(T=t)\\\\\n&\\stackrel{\\text{def}}{=}f_T(t)\\\\\n&= h_T(t)S_T(t)\\\\\n\\ell(t)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} + \\text{log}\\left\\{S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} - H_T(t)\\\\\n\\end{aligned}\n\\]\n\nIf instead the event time \\(T\\) is censored and only known to be after time \\(y\\), then the likelihood of that censored observation is instead the survival function evaluated at the censoring time:\n\\[\n\\begin{aligned}\n\\mathcal L(y)\n&=p_T(T>y)\\\\\n&\\stackrel{\\text{def}}{=}S_T(y)\\\\\n\\ell(y)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(y)\\right\\}\\\\\n&=\\text{log}\\left\\{S(y)\\right\\}\\\\\n&=-H(y)\\\\\n\\end{aligned}\n\\]\n\n\nWhat’s written above is incomplete. We also observed whether or not the observation was censored. Let \\(C\\) denote the time when censoring would occur (if the event did not occur first); let \\(f_C(y)\\) and \\(S_C(y)\\) be the corresponding density and survival functions for the censoring event.\nLet \\(Y\\) denote the time when observation ended (either by censoring or by the event of interest occurring), and let \\(D\\) be an indicator variable for the event occurring at \\(Y\\) (so \\(D=0\\) represents a censored observation and \\(D=1\\) represents an uncensored observation). In other words, let \\(Y \\stackrel{\\text{def}}{=}\\min(T,C)\\) and \\(D \\stackrel{\\text{def}}{=}\\mathbb 1{\\{T<=C\\}}\\).\nThen the complete likelihood of the observed data \\((Y,D)\\) is:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= p(Y=y, D=d)\\\\\n&= \\left[p(T=y,C> y)\\right]^d \\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n\\end{aligned}\n\\]\n\n\nTypically, survival analyses assume that \\(C\\) and \\(T\\) are mutually independent; this assumption is called “non-informative” censoring.\nThen the joint likelihood \\(p(Y,D)\\) factors into the product \\(p(Y), p(D)\\), and the likelihood reduces to:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= \\left[p(T=y,C> y)\\right]^d\\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n&= \\left[p(T=y)p(C> y)\\right]^d\\cdot\n\\left[p(T>y)p(C=y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)S_C(y)\\right]^d\\cdot\n\\left[S(y)f_C(y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)^d S_C(y)^d\\right]\\cdot\n\\left[S_T(y)^{1-d}f_C(y)^{1-d}\\right]\\\\\n&= \\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\end{aligned}\n\\]\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&= \\text{log}\\left\\{\\mathcal L(y,d) \\right\\}\\\\\n&= \\text{log}\\left\\{\n\\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\right\\}\\\\\n&= \\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\\\\\n\\end{aligned}\n\\] Let\n\n\n\\(\\theta_T\\) represent the parameters of \\(p_T(t)\\),\n\n\\(\\theta_C\\) represent the parameters of \\(p_C(c)\\),\n\n\\(\\theta = (\\theta_T, \\theta_C)\\) be the combined vector of all parameters.\n\n\n\nThe corresponding score function is:\n\n\\[\n\\begin{aligned}\n\\ell'(y,d)\n&= \\frac{\\partial}{\\partial \\theta}\n\\left[\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot 
S_C(y)^{d}\n\\right\\}\n\\right]\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n\\end{aligned}\n\\]\n\n\nAs long as \\(\\theta_C\\) and \\(\\theta_T\\) don’t share any parameters, then if censoring is non-informative, the partial derivative with respect to \\(\\theta_T\\) is:\n\n\\[\n\\begin{aligned}\n\\ell'_{\\theta_T}(y,d)\n&\\stackrel{\\text{def}}{=}\\frac{\\partial}{\\partial \\theta_T}\\ell(y,d)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right) + 0\\\\\n&=\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\\\\\n\\end{aligned}\n\\]\n\n\nThus, the MLE for \\(\\theta_T\\) won’t depend on \\(\\theta_C\\), and we can ignore the distribution of \\(C\\) when estimating the parameters of \\(f_T(t)=p(T=t)\\).\n\nThen:\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= f_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= \\left(h_T(y)^d S_T(y)^d\\right) \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)\\\\\n&= S_T(y) \\cdot h_T(y)^d \\\\\n\\end{aligned}\n\\]\n\nThat is, if the event occurred at time \\(y\\) (i.e., if \\(d=1\\)), then the likelihood of \\((Y,D) = (y,d)\\) is equal to the hazard function at \\(y\\) times the survival function at \\(y\\). Otherwise, the likelihood is equal to just the survival function at \\(y\\).\n\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&=\\text{log}\\left\\{\\mathcal L(y,d)\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y) \\cdot h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + \\text{log}\\left\\{h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n&= -H_T(y) + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n\\end{aligned}\n\\]\n\nIn other words, the log-likelihood contribution from a single observation \\((Y,D) = (y,d)\\) is equal to the negative cumulative hazard at \\(y\\), plus the log of the hazard at \\(y\\) if the event occurred at time \\(y\\).", + "text": "5.3 Distribution functions for time-to-event variables\n\n5.3.1 The Probability Density Function (PDF)\nFor a time-to-event variable \\(T\\) with a continuous distribution, the probability density function is defined as usual (see Section B.4.1).\n\nIn most time-to-event models, this density is assumed to be 0 for all \\(t<0\\); that is, \\(f(t) = 0, \\forall t<0\\). 
In other words, the support of \\(T\\) is typically \\([0,\\infty)\\).\n\n\n\nExample 5.1 (exponential distribution) Recall from Epi 202: the pdf of the exponential distribution family of models is:\n\\[p(T=t) = \\mathbb{1}_{t \\ge 0} \\cdot \\lambda \\text{e}^{-\\lambda t}\\]\nwhere \\(\\lambda > 0\\).\n\nHere are some examples of exponential pdfs:\n\n\n\n\n\n\n\n\n\n\n5.3.2 The Cumulative Distribution Function (CDF)\nThe cumulative distribution function is defined as:\n\\[\n\\begin{aligned}\nF(t) &\\stackrel{\\text{def}}{=}\\Pr(T \\le t)\\\\\n&=\\int_{u=-\\infty}^t f(u) du\n\\end{aligned}\n\\]\n\nExample 5.2 (exponential distribution) Recall from Epi 202: the cdf of the exponential distribution family of models is:\n\\[\nP(T\\le t) = \\mathbb{1}_{t \\ge 0} \\cdot (1- \\text{e}^{-\\lambda t})\n\\] where \\(\\lambda > 0\\).\n\nHere are some examples of exponential cdfs:\n\n\n\n\n\n\n\n\n\n5.3.3 The Survival Function\nFor survival data, a more important quantity is the survival function:\n\\[\n\\begin{aligned}\nS(t) &\\stackrel{\\text{def}}{=}\\Pr(T > t)\\\\\n&=\\int_{u=t}^\\infty p(u) du\\\\\n&=1-F(t)\\\\\n\\end{aligned}\n\\]\n\n\nDefinition 5.1 (Survival function)  \n\nThe survival function \\(S(t)\\) is the probability that the event time is later than \\(t\\). If the event in a clinical trial is death, then \\(S(t)\\) is the expected fraction of the original population at time 0 who have survived up to time \\(t\\) and are still alive at time \\(t\\); that is:\n\nIf \\(X_t\\) represents survival status at time \\(t\\), with \\(X_t = 1\\) denoting alive at time \\(t\\) and \\(X_t = 0\\) denoting deceased at time \\(t\\), then:\n\\[S(t) = \\Pr(X_t=1) = \\mathbb{E}[X_t]\\]\n\n\n\nExample 5.3 (exponential distribution) Since \\(S(t) = 1 - F(t)\\), the survival function of the exponential distribution family of models is:\n\\[\nP(T> t) = \\left\\{ {{\\text{e}^{-\\lambda t}, t\\ge0} \\atop {1, t \\le 0}}\\right.\n\\] where \\(\\lambda > 0\\).\nFigure 5.1 shows some examples of exponential survival functions.\n\n\n\n\n\n\nFigure 5.1: Exponential Survival Functions\n\n\n\n\n\n\n\n\n5.3.4 The Hazard Function\nAnother important quantity is the hazard function:\n\nDefinition 5.2 (Hazard function) The hazard function for a random variable \\(T\\) at value \\(t\\) is the conditional density of \\(T\\) at \\(t\\), given \\(T\\ge t\\); that is:\n\\[h(t) \\stackrel{\\text{def}}{=}p(T=t|T\\ge t)\\]\nIf \\(T\\) represents the time at which an event occurs, then \\(h(t)\\) is the probability that the event occurs at time \\(t\\), given that it has not occurred prior to time \\(t\\).\n\n\n\nThe hazard function has an important relationship to the density and survival functions, which we can use to derive the hazard function for a given probability distribution (Theorem 5.1).\n\n\nLemma 5.1 (Joint probability of a variable with itself) \\[p(T=t, T\\ge t) = p(T=t)\\]\n\nProof. Recall from Epi 202: if \\(A\\) and \\(B\\) are statistical events and \\(A\\subseteq B\\), then \\(p(A, B) = p(A)\\). In particular, \\(\\{T=t\\} \\subseteq \\{T\\geq t\\}\\), so \\(p(T=t, T\\ge t) = p(T=t)\\).\n\n\n\n\nTheorem 5.1 \\[h(t)=\\frac{f(t)}{S(t)}\\]\n\n\n\nProof. 
\\[\n\\begin{aligned}\nh(t) &=p(T=t|T\\ge t)\\\\\n&=\\frac{p(T=t, T\\ge t)}{p(T \\ge t)}\\\\\n&=\\frac{p(T=t)}{p(T \\ge t)}\\\\\n&=\\frac{f(t)}{S(t)}\n\\end{aligned}\n\\]\n\n\n\nExample 5.4 (exponential distribution) The hazard function of the exponential distribution family of models is:\n\\[\n\\begin{aligned}\nP(T=t|T \\ge t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{\\mathbb{1}_{t \\ge 0}\\cdot \\lambda \\text{e}^{-\\lambda t}}{\\text{e}^{-\\lambda t}}\\\\\n&=\\mathbb{1}_{t \\ge 0}\\cdot \\lambda\n\\end{aligned}\n\\] Figure 5.2 shows some examples of exponential hazard functions.\n\n\n\n\n\n\nFigure 5.2: Examples of hazard functions for exponential distributions\n\n\n\n\n\n\n\n\nWe can also view the hazard function as the derivative of the negative of the logarithm of the survival function:\n\nTheorem 5.2 (transform survival to hazard) \\[h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\]\n\n\n\nProof. \\[\n\\begin{aligned}\nh(t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{-S'(t)}{S(t)}\\\\\n&= -\\frac{S'(t)}{S(t)}\\\\\n&=-\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\\\\n&=\\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\n\\end{aligned}\n\\]\n\n\n5.3.5 The Cumulative Hazard Function\nSince \\(h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\) (see Theorem 5.2), we also have:\n\nCorollary 5.1 \\[S(t) = \\text{exp}\\left\\{-\\int_{u=0}^t h(u)du\\right\\} \\tag{5.1}\\]\n\n\n\nThe integral in Equation 5.1 is important enough to have its own name: cumulative hazard.\n\n\nDefinition 5.3 (cumulative hazard) The cumulative hazard function \\(H(t)\\) is defined as:\n\\[H(t) \\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\]\n\nAs we will see below, \\(H(t)\\) is tractable to estimate, and we can then derive an estimate of the hazard function using an approximate derivative of the estimated cumulative hazard.\n\n\nExample 5.5 The cumulative hazard function of the exponential distribution family of models is:\n\\[\nH(t) = \\mathbb{1}_{t \\ge 0}\\cdot \\lambda t\n\\]\nFigure 5.3 shows some examples of exponential cumulative hazard functions.\n\n\n\n\n\n\nFigure 5.3: Examples of exponential cumulative hazard functions\n\n\n\n\n\n\n\n\n5.3.6 Some Key Mathematical Relationships among Survival Concepts\nDiagram:\n\\[\nh(t) \\xrightarrow[]{\\int_{u=0}^t h(u)du} H(t)\n\\xrightarrow[]{\\text{exp}\\left\\{-H(t)\\right\\}} S(t)\n\\xrightarrow[]{1-S(t)} F(t)\n\\]\n\\[\nh(t) \\xleftarrow[\\frac{\\partial}{\\partial t}H(t)]{} H(t)\n\\xleftarrow[-\\text{log}\\left\\{S(t)\\right\\}]{} S(t)\n\\xleftarrow[1-F(t)]{} F(t)\n\\]\nIdentities:\n\\[\n\\begin{aligned}\nS(t) &= 1 - F(t)\\\\\n&= \\text{exp}\\left\\{-H(t)\\right\\}\\\\\nS'(t) &= -f(t)\\\\\nH(t) &= -\\text{log}\\left\\{S(t)\\right\\}\\\\\nH'(t) &= h(t)\\\\\nh(t) &= \\frac{f(t)}{S(t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\} \\\\\nf(t) &= h(t)\\cdot S(t)\\\\\n\\end{aligned}\n\\]\n\nSome proofs (others left as exercises):\n\\[\n\\begin{aligned}\nS'(t) &= \\frac{\\partial}{\\partial t}(1-F(t))\\\\\n&= -F'(t)\\\\\n&= -f(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n&= \\frac{S'(t)}{S(t)}\\\\\n&= -\\frac{f(t)}{S(t)}\\\\\n&= -h(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\nH(t)\n&\\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\\\\n&= \\int_0^t -\\frac{\\partial}{\\partial u}\\text{log}\\left\\{S(u)\\right\\} du\\\\\n&= 
\\left[-\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=0}^{u=t}\\\\\n&= \\left[\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=t}^{u=0}\\\\\n&= \\text{log}\\left\\{S(0)\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= \\text{log}\\left\\{1\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= 0 - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&=-\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\n\nCorollary:\n\\[S(t) = \\text{exp}\\left\\{-H(t)\\right\\}\\]\nExample: Time to death the US in 2004\nThe first day is the most dangerous:\n\n\n\nDaily Hazard Rates in 2004 for US Females\n\n\n\n\n\nExercise: hypothesize why these curves differ where they do?\n\n\n\nDaily Hazard Rates in 2004 for US Males and Females 1-40\n\n\n\n\n\nExercise: compare and contrast this curve with the corresponding hazard curve.\n\n\n\nSurvival Curve in 2004 for US Females\n\n\n\n\n\n5.3.7 Likelihood with censoring\nIf an event time \\(T\\) is observed exactly as \\(T=t\\), then the likelihood of that observation is just its probability density function:\n\\[\n\\begin{aligned}\n\\mathcal L(t)\n&= p(T=t)\\\\\n&\\stackrel{\\text{def}}{=}f_T(t)\\\\\n&= h_T(t)S_T(t)\\\\\n\\ell(t)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} + \\text{log}\\left\\{S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} - H_T(t)\\\\\n\\end{aligned}\n\\]\n\nIf instead the event time \\(T\\) is censored and only known to be after time \\(y\\), then the likelihood of that censored observation is instead the survival function evaluated at the censoring time:\n\\[\n\\begin{aligned}\n\\mathcal L(y)\n&=p_T(T>y)\\\\\n&\\stackrel{\\text{def}}{=}S_T(y)\\\\\n\\ell(y)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(y)\\right\\}\\\\\n&=\\text{log}\\left\\{S(y)\\right\\}\\\\\n&=-H(y)\\\\\n\\end{aligned}\n\\]\n\n\nWhat’s written above is incomplete. We also observed whether or not the observation was censored. Let \\(C\\) denote the time when censoring would occur (if the event did not occur first); let \\(f_C(y)\\) and \\(S_C(y)\\) be the corresponding density and survival functions for the censoring event.\nLet \\(Y\\) denote the time when observation ended (either by censoring or by the event of interest occurring), and let \\(D\\) be an indicator variable for the event occurring at \\(Y\\) (so \\(D=0\\) represents a censored observation and \\(D=1\\) represents an uncensored observation). 
In other words, let \\(Y \\stackrel{\\text{def}}{=}\\min(T,C)\\) and \\(D \\stackrel{\\text{def}}{=}\\mathbb 1{\\{T<=C\\}}\\).\nThen the complete likelihood of the observed data \\((Y,D)\\) is:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= p(Y=y, D=d)\\\\\n&= \\left[p(T=y,C> y)\\right]^d \\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n\\end{aligned}\n\\]\n\n\nTypically, survival analyses assume that \\(C\\) and \\(T\\) are mutually independent; this assumption is called “non-informative” censoring.\nThen the joint likelihood \\(p(Y,D)\\) factors into the product \\(p(Y), p(D)\\), and the likelihood reduces to:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= \\left[p(T=y,C> y)\\right]^d\\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n&= \\left[p(T=y)p(C> y)\\right]^d\\cdot\n\\left[p(T>y)p(C=y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)S_C(y)\\right]^d\\cdot\n\\left[S(y)f_C(y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)^d S_C(y)^d\\right]\\cdot\n\\left[S_T(y)^{1-d}f_C(y)^{1-d}\\right]\\\\\n&= \\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\end{aligned}\n\\]\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&= \\text{log}\\left\\{\\mathcal L(y,d) \\right\\}\\\\\n&= \\text{log}\\left\\{\n\\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\right\\}\\\\\n&= \\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\\\\\n\\end{aligned}\n\\] Let\n\n\n\\(\\theta_T\\) represent the parameters of \\(p_T(t)\\),\n\n\\(\\theta_C\\) represent the parameters of \\(p_C(c)\\),\n\n\\(\\theta = (\\theta_T, \\theta_C)\\) be the combined vector of all parameters.\n\n\n\nThe corresponding score function is:\n\n\\[\n\\begin{aligned}\n\\ell'(y,d)\n&= \\frac{\\partial}{\\partial \\theta}\n\\left[\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right]\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n\\end{aligned}\n\\]\n\n\nAs long as \\(\\theta_C\\) and \\(\\theta_T\\) don’t share any parameters, then if censoring is non-informative, the partial derivative with respect to \\(\\theta_T\\) is:\n\n\\[\n\\begin{aligned}\n\\ell'_{\\theta_T}(y,d)\n&\\stackrel{\\text{def}}{=}\\frac{\\partial}{\\partial \\theta_T}\\ell(y,d)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right) + 0\\\\\n&=\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\\\\\n\\end{aligned}\n\\]\n\n\nThus, the MLE for \\(\\theta_T\\) won’t depend on \\(\\theta_C\\), and we can ignore the distribution of \\(C\\) when estimating the parameters of \\(f_T(t)=p(T=t)\\).\n\nThen:\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= f_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= \\left(h_T(y)^d S_T(y)^d\\right) \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)\\\\\n&= S_T(y) 
\\cdot h_T(y)^d \\\\\n\\end{aligned}\n\\]\n\nThat is, if the event occurred at time \\(y\\) (i.e., if \\(d=1\\)), then the likelihood of \\((Y,D) = (y,d)\\) is equal to the hazard function at \\(y\\) times the survival function at \\(y\\). Otherwise, the likelihood is equal to just the survival function at \\(y\\).\n\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&=\\text{log}\\left\\{\\mathcal L(y,d)\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y) \\cdot h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + \\text{log}\\left\\{h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n&= -H_T(y) + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n\\end{aligned}\n\\]\n\nIn other words, the log-likelihood contribution from a single observation \\((Y,D) = (y,d)\\) is equal to the negative cumulative hazard at \\(y\\), plus the log of the hazard at \\(y\\) if the event occurred at time \\(y\\).", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -577,7 +577,7 @@ "href": "intro-to-survival-analysis.html#the-kaplan-meier-product-limit-estimator", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.7 The Kaplan-Meier Product Limit Estimator", - "text": "5.7 The Kaplan-Meier Product Limit Estimator\n\n5.7.1 Estimating survival in datasets without censoring\n\nIn the drug6mp dataset, the estimated survival function for the placebo patients is easy to compute. For any time \\(t\\) in months, \\(S(t)\\) is the fraction of patients with times greater than \\(t\\):\n\n\n5.7.2 Estimating survival in datasets with censoring\n\nFor the 6-MP patients, we cannot ignore the censored data because we know that the time to relapse is greater than the censoring time.\nFor any time \\(t\\) in months, we know that 6-MP patients with times greater than \\(t\\) have not relapsed, and those with relapse time less than \\(t\\) have relapsed, but we don’t know if patients with censored time less than \\(t\\) have relapsed or not.\nThe procedure we usually use is the Kaplan-Meier product-limit estimator of the survival function.\nThe Kaplan-Meier estimator is a step function (like the empirical cdf), which changes value only at the event times, not at the censoring times.\nAt each event time \\(t\\), we compute the at-risk group size \\(Y\\), which is all those observations whose event time or censoring time is at least \\(t\\).\nIf \\(d\\) of the observations have an event time (not a censoring time) of \\(t\\), then the group of survivors immediately following time \\(t\\) is reduced by the fraction \\[\\frac{Y-d}{Y}=1-\\frac{d}{Y}\\]\n\nIf the event times are \\(t_i\\) with events per time of \\(d_i\\) (\\(1\\le i \\le k\\)), then\n\\[\\hat S(t) = \\prod_{t_i < t}[1-d_i/Y_i]\\]\nwhere \\(Y_i\\) is the set of observations whose time (event or censored) is \\(\\ge t_i\\), the group at risk at time \\(t_i\\).\n\nIf there are no censored data, and there are \\(n\\) data points, then just after (say) the third event time\n\\[\n\\begin{aligned}\n\\hat S(t)\n&= \\prod_{t_i < t}[1-d_i/Y_i]\\\\\n&=[\\frac{n-d_1}{n}][\\frac{n-d_1-d_2}{n-d_1}][\\frac{n-d_1-d_2-d_3}{n-d_1-d_2}]\\\\\n&=\\frac{n-d_1-d_2-d_3}{n}\\\\\n&=1-\\frac{d_1+d_2+d_3}{n}\\\\\n&=1-\\hat F(t)\\\\\n\\end{aligned}\n\\]\nwhere \\(\\hat F(t)\\) is the usual empirical CDF estimate.\n\n5.7.3 Kaplan-Meier curve for drug6mp data\nHere is the Kaplan-Meier estimated survival curve for the patients who received 6-MP in the 
drug6mp dataset (we will see code to produce figures like this one shortly):\n\nShow R code# | echo: false\n\nrequire(KMsurv)\ndata(drug6mp)\nlibrary(dplyr)\nlibrary(survival)\n\ndrug6mp_km_model1 = \n drug6mp |> \n mutate(surv = Surv(t2, relapse)) |> \n survfit(formula = surv ~ 1, data = _)\n\nlibrary(ggfortify)\ndrug6mp_km_model1 |> \n autoplot(\n mark.time = TRUE,\n conf.int = FALSE) +\n expand_limits(y = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\")\n\n\n\nFigure 5.4: Kaplan-Meier Survival Curve for 6-MP Patients\n\n\n\n\n\n\n\n\n5.7.4 Kaplan-Meier calculations\nLet’s compute these estimates and build the chart by hand:\n\nShow R codelibrary(KMsurv)\nlibrary(dplyr)\ndata(drug6mp)\n\ndrug6mp.v2 = \n drug6mp |> \n as_tibble() |> \n mutate(\n remstat = remstat |> \n case_match(\n 1 ~ \"partial\",\n 2 ~ \"complete\"\n ),\n # renaming to \"outcome\" while relabeling is just a style choice:\n outcome = relapse |> \n case_match(\n 0 ~ \"censored\",\n 1 ~ \"relapsed\"\n )\n )\n\nkm.6mp =\n drug6mp.v2 |> \n summarize(\n .by = t2,\n Relapses = sum(outcome == \"relapsed\"),\n Censored = sum(outcome == \"censored\")) |>\n # here we add a start time row, so the graph starts at time 0:\n bind_rows(\n tibble(\n t2 = 0, \n Relapses = 0, \n Censored = 0)\n ) |> \n # sort in time order:\n arrange(t2) |>\n mutate(\n Exiting = Relapses + Censored,\n `Study Size` = sum(Exiting),\n Exited = cumsum(Exiting) |> dplyr::lag(default = 0),\n `At Risk` = `Study Size` - Exited,\n Hazard = Relapses / `At Risk`,\n `KM Factor` = 1 - Hazard,\n `Cumulative Hazard` = cumsum(`Hazard`),\n `KM Survival Curve` = cumprod(`KM Factor`)\n )\n\nlibrary(pander) \npander(km.6mp)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nt2\nRelapses\nCensored\nExiting\nStudy Size\nExited\nAt Risk\nHazard\nKM Factor\nCumulative Hazard\nKM Survival Curve\n\n\n\n0\n0\n0\n0\n21\n0\n21\n0\n1\n0\n1\n\n\n6\n3\n1\n4\n21\n0\n21\n0.1429\n0.8571\n0.1429\n0.8571\n\n\n7\n1\n0\n1\n21\n4\n17\n0.05882\n0.9412\n0.2017\n0.8067\n\n\n9\n0\n1\n1\n21\n5\n16\n0\n1\n0.2017\n0.8067\n\n\n10\n1\n1\n2\n21\n6\n15\n0.06667\n0.9333\n0.2683\n0.7529\n\n\n11\n0\n1\n1\n21\n8\n13\n0\n1\n0.2683\n0.7529\n\n\n13\n1\n0\n1\n21\n9\n12\n0.08333\n0.9167\n0.3517\n0.6902\n\n\n16\n1\n0\n1\n21\n10\n11\n0.09091\n0.9091\n0.4426\n0.6275\n\n\n17\n0\n1\n1\n21\n11\n10\n0\n1\n0.4426\n0.6275\n\n\n19\n0\n1\n1\n21\n12\n9\n0\n1\n0.4426\n0.6275\n\n\n20\n0\n1\n1\n21\n13\n8\n0\n1\n0.4426\n0.6275\n\n\n22\n1\n0\n1\n21\n14\n7\n0.1429\n0.8571\n0.5854\n0.5378\n\n\n23\n1\n0\n1\n21\n15\n6\n0.1667\n0.8333\n0.7521\n0.4482\n\n\n25\n0\n1\n1\n21\n16\n5\n0\n1\n0.7521\n0.4482\n\n\n32\n0\n2\n2\n21\n17\n4\n0\n1\n0.7521\n0.4482\n\n\n34\n0\n1\n1\n21\n19\n2\n0\n1\n0.7521\n0.4482\n\n\n35\n0\n1\n1\n21\n20\n1\n0\n1\n0.7521\n0.4482\n\n\n\n\n\n\nSummary\nFor the 6-MP patients at time 6 months, there are 21 patients at risk. At \\(t=6\\) there are 3 relapses and 1 censored observations.\nThe Kaplan-Meier factor is \\((21-3)/21 = 0.857\\). The number at risk for the next time (\\(t=7\\)) is \\(21-3-1=17\\).\nAt time 7 months, there are 17 patients at risk. At \\(t=7\\) there is 1 relapse and 0 censored observations. The Kaplan-Meier factor is \\((17-1)/17 = 0.941\\). The Kaplan Meier estimate is \\(0.857\\times0.941=0.807\\). 
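We can double-check these hand computations against survfit() before continuing (a quick sketch, reusing the drug6mp data and the survival package loaded above):

Show R code
# the first two KM factors, computed by hand:
s6 = (21 - 3) / 21       # at t = 6: 21 at risk, 3 relapses -> 0.8571
s7 = s6 * (17 - 1) / 17  # at t = 7: 17 at risk, 1 relapse  -> 0.8067

# survfit() should report the same survival estimates at t = 6 and t = 7:
library(survival)
fit = survfit(Surv(t2, relapse) ~ 1, data = drug6mp)
summary(fit, times = c(6, 7))$surv

Both values should match the hand calculations above.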
The number at risk for the next time (\\(t=9\\)) is \\(17-1=16\\).\n\nNow, let’s graph this estimated survival curve using ggplot():\n\nShow R codelibrary(ggplot2)\nconflicts_prefer(dplyr::filter)\nkm.6mp |> \n ggplot(aes(x = t2, y = `KM Survival Curve`)) +\n geom_step() +\n geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) +\n expand_limits(y = c(0,1), x = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\") +\n scale_y_continuous(labels = scales::percent)\n\n\n\nFigure 5.5: KM curve for 6MP patients, calculated by hand",
+    "text": "5.7 The Kaplan-Meier Product Limit Estimator\n\n5.7.1 Estimating survival in datasets without censoring\n\nIn the drug6mp dataset, the estimated survival function for the placebo patients is easy to compute. For any time \\(t\\) in months, \\(S(t)\\) is the fraction of patients with times greater than \\(t\\):\n\n\n5.7.2 Estimating survival in datasets with censoring\n\nFor the 6-MP patients, we cannot ignore the censored data because we know that the time to relapse is greater than the censoring time.\nFor any time \\(t\\) in months, we know that 6-MP patients with times greater than \\(t\\) have not relapsed, and those with relapse time less than \\(t\\) have relapsed, but we don’t know if patients with censored time less than \\(t\\) have relapsed or not.\nThe procedure we usually use is the Kaplan-Meier product-limit estimator of the survival function.\nThe Kaplan-Meier estimator is a step function (like the empirical cdf), which changes value only at the event times, not at the censoring times.\nAt each event time \\(t\\), we compute the at-risk group size \\(Y\\), which is all those observations whose event time or censoring time is at least \\(t\\).\nIf \\(d\\) of the observations have an event time (not a censoring time) of \\(t\\), then the group of survivors immediately following time \\(t\\) is reduced by the fraction \\[\\frac{Y-d}{Y}=1-\\frac{d}{Y}\\]\n\n\n\nDefinition 5.4 (Kaplan-Meier Product-Limit Estimator of Survival Function) If the event times are \\(t_i\\) with events per time of \\(d_i\\) (\\(1\\le i \\le k\\)), then the Kaplan-Meier Product-Limit Estimator of the survival function is:\n\\[\\hat S(t) = \\prod_{t_i < t} \\left[1-\\frac{d_i}{Y_i}\\right] \\tag{5.3}\\]\nwhere \\(Y_i\\) is the number of observations whose time (event or censored) is \\(\\ge t_i\\); that is, the size of the group at risk at time \\(t_i\\).\n\n\nIf there are no censored data, and there are \\(n\\) data points, then just after (say) the third event time\n\\[\n\\begin{aligned}\n\\hat S(t)\n&= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\\\ &= \\left[\\frac{n-d_1}{n}\\right] \\left[\\frac{n-d_1-d_2}{n-d_1}\\right] \\left[\\frac{n-d_1-d_2-d_3}{n-d_1-d_2}\\right]\n\\\\ &= \\frac{n-d_1-d_2-d_3}{n}\n\\\\ &=1-\\frac{d_1+d_2+d_3}{n}\n\\\\ &=1-\\hat F(t)\n\\end{aligned}\n\\]\nwhere \\(\\hat F(t)\\) is the usual empirical CDF estimate.\n\n5.7.3 Kaplan-Meier curve for drug6mp data\nHere is the Kaplan-Meier estimated survival curve for the patients who received 6-MP in the drug6mp dataset (we will see code to produce figures like this one shortly):\n\nShow R code# | echo: false\n\nrequire(KMsurv)\ndata(drug6mp)\nlibrary(dplyr)\nlibrary(survival)\n\ndrug6mp_km_model1 = \n drug6mp |> \n mutate(surv = Surv(t2, relapse)) |> \n survfit(formula = surv ~ 1, data = _)\n\nlibrary(ggfortify)\ndrug6mp_km_model1 |> \n autoplot(\n mark.time = TRUE,\n conf.int = FALSE) +\n expand_limits(y = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival
Curve\")\n\n\n\nFigure 5.4: Kaplan-Meier Survival Curve for 6-MP Patients\n\n\n\n\n\n\n\n\n5.7.4 Kaplan-Meier calculations\nLet’s compute these estimates and build the chart by hand:\n\nShow R codelibrary(KMsurv)\nlibrary(dplyr)\ndata(drug6mp)\n\ndrug6mp.v2 = \n drug6mp |> \n as_tibble() |> \n mutate(\n remstat = remstat |> \n case_match(\n 1 ~ \"partial\",\n 2 ~ \"complete\"\n ),\n # renaming to \"outcome\" while relabeling is just a style choice:\n outcome = relapse |> \n case_match(\n 0 ~ \"censored\",\n 1 ~ \"relapsed\"\n )\n )\n\nkm.6mp =\n drug6mp.v2 |> \n summarize(\n .by = t2,\n Relapses = sum(outcome == \"relapsed\"),\n Censored = sum(outcome == \"censored\")) |>\n # here we add a start time row, so the graph starts at time 0:\n bind_rows(\n tibble(\n t2 = 0, \n Relapses = 0, \n Censored = 0)\n ) |> \n # sort in time order:\n arrange(t2) |>\n mutate(\n Exiting = Relapses + Censored,\n `Study Size` = sum(Exiting),\n Exited = cumsum(Exiting) |> dplyr::lag(default = 0),\n `At Risk` = `Study Size` - Exited,\n Hazard = Relapses / `At Risk`,\n `KM Factor` = 1 - Hazard,\n `Cumulative Hazard` = cumsum(`Hazard`),\n `KM Survival Curve` = cumprod(`KM Factor`)\n )\n\nlibrary(pander) \npander(km.6mp)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nt2\nRelapses\nCensored\nExiting\nStudy Size\nExited\nAt Risk\nHazard\nKM Factor\nCumulative Hazard\nKM Survival Curve\n\n\n\n0\n0\n0\n0\n21\n0\n21\n0\n1\n0\n1\n\n\n6\n3\n1\n4\n21\n0\n21\n0.1429\n0.8571\n0.1429\n0.8571\n\n\n7\n1\n0\n1\n21\n4\n17\n0.05882\n0.9412\n0.2017\n0.8067\n\n\n9\n0\n1\n1\n21\n5\n16\n0\n1\n0.2017\n0.8067\n\n\n10\n1\n1\n2\n21\n6\n15\n0.06667\n0.9333\n0.2683\n0.7529\n\n\n11\n0\n1\n1\n21\n8\n13\n0\n1\n0.2683\n0.7529\n\n\n13\n1\n0\n1\n21\n9\n12\n0.08333\n0.9167\n0.3517\n0.6902\n\n\n16\n1\n0\n1\n21\n10\n11\n0.09091\n0.9091\n0.4426\n0.6275\n\n\n17\n0\n1\n1\n21\n11\n10\n0\n1\n0.4426\n0.6275\n\n\n19\n0\n1\n1\n21\n12\n9\n0\n1\n0.4426\n0.6275\n\n\n20\n0\n1\n1\n21\n13\n8\n0\n1\n0.4426\n0.6275\n\n\n22\n1\n0\n1\n21\n14\n7\n0.1429\n0.8571\n0.5854\n0.5378\n\n\n23\n1\n0\n1\n21\n15\n6\n0.1667\n0.8333\n0.7521\n0.4482\n\n\n25\n0\n1\n1\n21\n16\n5\n0\n1\n0.7521\n0.4482\n\n\n32\n0\n2\n2\n21\n17\n4\n0\n1\n0.7521\n0.4482\n\n\n34\n0\n1\n1\n21\n19\n2\n0\n1\n0.7521\n0.4482\n\n\n35\n0\n1\n1\n21\n20\n1\n0\n1\n0.7521\n0.4482\n\n\n\n\n\n\nSummary\nFor the 6-MP patients at time 6 months, there are 21 patients at risk. At \\(t=6\\) there are 3 relapses and 1 censored observations.\nThe Kaplan-Meier factor is \\((21-3)/21 = 0.857\\). The number at risk for the next time (\\(t=7\\)) is \\(21-3-1=17\\).\nAt time 7 months, there are 17 patients at risk. At \\(t=7\\) there is 1 relapse and 0 censored observations. The Kaplan-Meier factor is \\((17-1)/17 = 0.941\\). The Kaplan Meier estimate is \\(0.857\\times0.941=0.807\\). 
The number at risk for the next time (\\(t=9\\)) is \\(17-1=16\\).\n\nNow, let’s graph this estimated survival curve using ggplot():\n\nShow R codelibrary(ggplot2)\nconflicts_prefer(dplyr::filter)\nkm.6mp |> \n ggplot(aes(x = t2, y = `KM Survival Curve`)) +\n geom_step() +\n geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) +\n expand_limits(y = c(0,1), x = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\") +\n scale_y_continuous(labels = scales::percent)\n\n\n\nFigure 5.5: KM curve for 6MP patients, calculated by hand", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -599,7 +599,7 @@ "href": "intro-to-survival-analysis.html#example-bone-marrow-transplant-data", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.9 Example: Bone Marrow Transplant Data", - "text": "5.9 Example: Bone Marrow Transplant Data\n(Copelan et al., 1991)\n\nTreatment\n\n\nallogeneic (from a donor) bone marrow transplant therapy\n\nInclusion criteria\n\nacute myeloid leukemia (AML)\nacute lymphoblastic leukemia (ALL).\nPossible intermediate events\n\n\ngraft vs. host disease (GVHD): an immunological rejection response to the transplant\n\nplatelet recovery: a return of platelet count to normal levels.\n\nOne or the other, both in either order, or neither may occur.\nEnd point events\n\nrelapse of the disease\ndeath\n\nAny or all of these events may be censored.\n\n5.9.1 KMsurv::bmt data in R\n\nShow R codelibrary(KMsurv)\n?bmt\n\n\n\n5.9.2 Analysis plan\n\nWe concentrate for now on disease-free survival (t2 and d3) for the three risk groups, ALL, AML Low Risk, and AML High Risk.\nWe will construct the Kaplan-Meier survival curves, compare them, and test for differences.\nWe will construct the cumulative hazard curves and compare them.\nWe will estimate the hazard functions, interpret, and compare them.\n\n5.9.3 Survival Function Estimate and Variance\n\\[\\hat S(t) = \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\] where \\(Y_i\\) is the group at risk at time \\(t_i\\).\nThe estimated variance of \\(\\hat S(t)\\) is (Greenwood’s formula) \\[\\hat{\\text{Var}}\\left(\\hat S (t)\\right) = \\hat S(t)^2\\sum_{t_i <t}\\frac{d_i}{Y_i(Y_i-d_i)}\\] which we can use for confidence intervals for a survival function or a difference of survival functions.\nKaplan-Meier survival curves\n\ncode to preprocess and model bmt datalibrary(KMsurv)\nlibrary(survival)\ndata(bmt)\n\nbmt = \n bmt |> \n as_tibble() |> \n mutate(\n group = \n group |> \n factor(\n labels = c(\"ALL\",\"Low Risk AML\",\"High Risk AML\")),\n surv = Surv(t2,d3))\n\nkm_model1 = survfit(\n formula = surv ~ group, \n data = bmt)\n\n\n\nShow R codelibrary(ggfortify)\nautoplot(\n km_model1, \n conf.int = TRUE,\n ylab = \"Pr(disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\nDisease-Free Survival by Disease Group\n\n\n\n\nUnderstanding Greenwood’s formula (optional)\nTo see where Greenwood’s formula comes from, let \\(x_i=Y_i-d_i\\). We approximate the solution treating each time as independent, with \\(Y_i\\) fixed and ignore randomness in times of failure and we treat \\(x_i\\) as independent binomials \\(\\text{Bin}(Y_i,p_i)\\). 
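(Before the algebra, here is a quick numerical check of the formula; this sketch reuses the bmt data and surv column constructed above, and assumes that the std.err component of a survfit object stores the Greenwood sum on the cumulative-hazard scale, as documented in recent versions of the survival package:)

Show R code
fit_all = survfit(surv ~ 1, data = bmt, subset = group == "ALL")
d = fit_all$n.event
Y = fit_all$n.risk
# Greenwood's sum, computed by hand (valid where Y_i > d_i):
gw = cumsum(d / (Y * (Y - d)))
# should be TRUE, up to floating-point error, if our reading of
# the survfit internals is correct:
all.equal(gw, fit_all$std.err^2)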
Letting \\(S(t)\\) be the “true” survival function\n\\[\n\\begin{aligned}\n\\hat S(t) &=\\prod_{t_i<t}x_i/Y_i\\\\\nS(t)&=\\prod_{t_i<t}p_i\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\n\\frac{\\hat S(t)}{S(t)}\n&= \\prod_{t_i<t} \\frac{x_i}{p_iY_i}=\\prod_{t_i<t} \\frac{\\hat p_i}{p_i}\\\\\n&=\\prod_{t_i<t} \\left( 1+\\frac{\\hat p_i-p_i}{p_i} \\right)\\\\\n&\\approx 1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i} \\\\\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\n\\text{Var}\\left(\\frac{\\hat S(t)}{S(t)}\\right)\n&\\approx \\text{Var}\\left(1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\\right) \\\\\n&=\\sum_{t_i<t} \\frac{1}{p_i^2}\\frac{p_i(1-p_i)}{Y_i} \\\\\n&= \\sum_{t_i<t} \\frac{(1-p_i)}{p_iY_i}\n\\approx\\sum_{t_i<t} \\frac{(1-x_i/Y_i)}{x_i}\\\\\n&=\\sum_{t_i<t} \\frac{Y_i-x_i}{x_iY_i}=\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\\\\\n\\text{Var}\\left(\\hat S(t)\\right)\n&\\approx \\hat S(t)^2\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\end{aligned}\n\\]\n\n5.9.4 Test for differences among the disease groups\nHere we compute a chi-square test for assocation between disease group (group) and disease-free survival:\n\nShow R codesurvdiff(surv ~ group, data = bmt)\n#> Call:\n#> survdiff(formula = surv ~ group, data = bmt)\n#> \n#> N Observed Expected (O-E)^2/E (O-E)^2/V\n#> group=ALL 38 24 21.9 0.211 0.289\n#> group=Low Risk AML 54 25 40.0 5.604 11.012\n#> group=High Risk AML 45 34 21.2 7.756 10.529\n#> \n#> Chisq= 13.8 on 2 degrees of freedom, p= 0.001\n\n\n\n5.9.5 Cumulative Hazard\n\\[\n\\begin{aligned}\nh(t)\n&\\stackrel{\\text{def}}{=}P(T=t|T\\ge t)\\\\\n&= \\frac{p(T=t)}{P(T\\ge t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\nThe cumulative hazard (or integrated hazard) function is\n\\[H(t)\\stackrel{\\text{def}}{=}\\int_0^t h(t) dt\\] Since \\(h(t) = -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\) as shown above, we have:\n\\[\nH(t)=-\\text{log}\\left\\{S\\right\\}(t)\n\\]\n\nSo we can estimate \\(H(t)\\) as:\n\\[\n\\begin{aligned}\n\\hat H(t)\n&= -\\text{log}\\left\\{\\hat S(t)\\right\\}\\\\\n&= -\\text{log}\\left\\{\\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\right\\}\\\\\n&= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\nThis is the Kaplan-Meier (product-limit) estimate of cumulative hazard.\n\nExample: Cumulative Hazard Curves for Bone-Marrow Transplant (bmt) data\n\nShow R codeautoplot(\n fun = \"cumhaz\",\n km_model1, \n conf.int = FALSE,\n ylab = \"Cumulative hazard (disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\nDisease-Free Cumulative Hazard by Disease Group", + "text": "5.9 Example: Bone Marrow Transplant Data\nCopelan et al. (1991)\n\n\n\nFigure 5.6: Recovery process from a bone marrow transplant (Fig. 1.1 from Klein, Moeschberger, et al. (2003))\n\n\n\n\n\n\n\n5.9.1 Study design\nTreatment\n\n\nallogeneic (from a donor) bone marrow transplant therapy\n\nInclusion criteria\n\nacute myeloid leukemia (AML)\nacute lymphoblastic leukemia (ALL).\nPossible intermediate events\n\n\ngraft vs. 
host disease (GVHD): an immunological rejection response to the transplant\n\nplatelet recovery: a return of platelet count to normal levels.\n\nOne or the other, both in either order, or neither may occur.\nEnd point events\n\nrelapse of the disease\ndeath\n\nAny or all of these events may be censored.\n\n5.9.2 KMsurv::bmt data in R\n\nShow R codelibrary(KMsurv)\n?bmt\n\n\n\n5.9.3 Analysis plan\n\nWe concentrate for now on disease-free survival (t2 and d3) for the three risk groups, ALL, AML Low Risk, and AML High Risk.\nWe will construct the Kaplan-Meier survival curves, compare them, and test for differences.\nWe will construct the cumulative hazard curves and compare them.\nWe will estimate the hazard functions, interpret, and compare them.\n\n5.9.4 Survival Function Estimate and Variance\n\\[\\hat S(t) = \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\] where \\(Y_i\\) is the group at risk at time \\(t_i\\).\nThe estimated variance of \\(\\hat S(t)\\) is:\n\nTheorem 5.4 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \\[\n\\widehat{\\text{Var}}\\left(\\hat S(t)\\right) = \\hat S(t)^2\\sum_{t_i <t}\\frac{d_i}{Y_i(Y_i-d_i)}\n\\tag{5.4}\\]\n\nWe can use Equation 5.4 for confidence intervals for a survival function or a difference of survival functions.\n\nKaplan-Meier survival curves\n\ncode to preprocess and model bmt datalibrary(KMsurv)\nlibrary(survival)\ndata(bmt)\n\nbmt = \n bmt |> \n as_tibble() |> \n mutate(\n group = \n group |> \n factor(\n labels = c(\"ALL\",\"Low Risk AML\",\"High Risk AML\")),\n surv = Surv(t2,d3))\n\nkm_model1 = survfit(\n formula = surv ~ group, \n data = bmt)\n\n\n\nShow R codelibrary(ggfortify)\nautoplot(\n km_model1, \n conf.int = TRUE,\n ylab = \"Pr(disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\nDisease-Free Survival by Disease Group\n\n\n\n\n\n5.9.5 Understanding Greenwood’s formula (optional)\n\nTo see where Greenwood’s formula comes from, let \\(x_i = Y_i - d_i\\). We approximate by treating each event time as independent, holding each \\(Y_i\\) fixed, ignoring randomness in the failure times, and treating the \\(x_i\\) as independent binomials \\(\\text{Bin}(Y_i,p_i)\\). Letting \\(S(t)\\) be the “true” survival function\n\n\\[\n\\begin{aligned}\n\\hat S(t) &=\\prod_{t_i<t}x_i/Y_i\\\\\nS(t)&=\\prod_{t_i<t}p_i\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\n\\frac{\\hat S(t)}{S(t)}\n &= \\prod_{t_i<t} \\frac{x_i}
{p_iY_i}\n\\\\ &= \\prod_{t_i<t} \\frac{\\hat p_i}{p_i}\n\\\\ &= \\prod_{t_i<t} \\left(1+\\frac{\\hat p_i-p_i}{p_i}\\right)\n\\\\ &\\approx 1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\text{Var}\\left(\\frac{\\hat S(t)}{S(t)}\\right)\n&\\approx \\text{Var}\\left(1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\\right)\n\\\\ &=\\sum_{t_i<t} \\frac{1}{p_i^2}\\frac{p_i(1-p_i)}{Y_i}\n\\\\ &= \\sum_{t_i<t} \\frac{(1-p_i)}{p_iY_i}\n\\\\ &\\approx\\sum_{t_i<t} \\frac{(1-x_i/Y_i)}{x_i}\n\\\\ &=\\sum_{t_i<t} \\frac{Y_i-x_i}{x_iY_i}\n\\\\ &=\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\\\ \\therefore\\text{Var}\\left(\\hat S(t)\\right)\n&\\approx \\hat S(t)^2\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\end{aligned}\n\\]\n\n5.9.6 Test for differences among the disease groups\nHere we compute a chi-square test for association between disease group (group) and disease-free survival:\n\nShow R codesurvdiff(surv ~ group, data = bmt)\n#> Call:\n#> survdiff(formula = surv ~ group, data = bmt)\n#> \n#> N Observed Expected (O-E)^2/E (O-E)^2/V\n#> group=ALL 38 24 21.9 0.211 0.289\n#> group=Low Risk AML 54 25 40.0 5.604 11.012\n#> group=High Risk AML 45 34 21.2 7.756 10.529\n#> \n#> Chisq= 13.8 on 2 degrees of freedom, p= 0.001\n\n\n\n5.9.7 Cumulative Hazard\n\\[\n\\begin{aligned}\nh(t)\n&\\stackrel{\\text{def}}{=}P(T=t|T\\ge t)\\\\\n&= \\frac{p(T=t)}{P(T\\ge t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\nThe cumulative hazard (or integrated hazard) function is\n\\[H(t)\\stackrel{\\text{def}}{=}\\int_0^t h(u) \\, du\\] Since \\(h(t) = -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\) as shown above, we have:\n\\[\nH(t)=-\\text{log}\\left\\{S(t)\\right\\}\n\\]\n\nSo we can estimate \\(H(t)\\) as:\n\\[\n\\begin{aligned}\n\\hat H(t)\n&= -\\text{log}\\left\\{\\hat S(t)\\right\\}\\\\\n&= -\\text{log}\\left\\{\\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\right\\}\\\\\n&= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\nThis is the Kaplan-Meier (product-limit) estimate of cumulative hazard.\n\nExample: Cumulative Hazard Curves for Bone-Marrow Transplant (bmt) data\n\nShow R codeautoplot(\n fun = \"cumhaz\",\n km_model1, \n conf.int = FALSE,\n ylab = \"Cumulative hazard (disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\n\nFigure 5.7: Disease-Free Cumulative Hazard by Disease Group",
    "crumbs": [
      "Time to Event Models",
      "5  Introduction to Survival Analysis"
    ]
  },
@@ -610,7 +610,7 @@
    "href": "intro-to-survival-analysis.html#nelson-aalen-estimates-of-cumulative-hazard-and-survival",
    "title": "\n5  Introduction to Survival Analysis\n",
    "section": "\n5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival",
-    "text": "5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival\n\nThe point hazard at time \\(t_i\\) can be estimated by \\(d_i/Y_i\\), which leads to the Nelson-Aalen estimator of the cumulative hazard:\n\\[\\hat H_{NA}(t) \\stackrel{\\text{def}}{=}\\sum_{t_i < t}\\frac{d_i}{Y_i}\\]\n\nThe variance of this estimator is approximately:\n\\[\n\\begin{aligned}\n\\hat{\\text{Var}}\\left(\\hat H_{NA} (t)\\right)\n&= \\sum_{t_i <t}\\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\\\\n&\\approx \\sum_{t_i <t}\\frac{d_i}{Y_i^2}\n\\end{aligned}\n\\]\n\nSince \\(S(t)=\\text{exp}\\left\\{-H(t)\\right\\}\\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival
function:\n\\[\n\\begin{aligned}\n\\hat S_{NA}(t)\n&= \\text{exp}\\left\\{-\\hat H_{NA}(t)\\right\\}\\\\\n&= \\text{exp}\\left\\{-\\sum_{t_i < t}\\frac{d_i}{Y_i}\\right\\}\\\\\n&= \\prod_{t_i < t}\\text{exp}\\left\\{-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\n\nCompare these with the corresponding Kaplan-Meier estimates:\n\\[\n\\begin{aligned}\n\\hat H_{KM}(t) &= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\hat S_{KM}(t) &= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\end{aligned}\n\\]\n\nThe product limit estimate and the Nelson-Aalen estimate often do not differ by much. The latter is considered more accurate in small samples and also directly estimates the cumulative hazard. The \"fleming-harrington\" method for survfit() reduces to Nelson-Aalen when the data are unweighted. We can also estimate the cumulative hazard as the negative log of the KM survival function estimate.\n\n\n5.10.1 Application to bmt dataset\n\nShow R code\nna_fit = survfit(\n formula = surv ~ group,\n type = \"fleming-harrington\",\n data = bmt)\n\nkm_fit = survfit(\n formula = surv ~ group,\n type = \"kaplan-meier\",\n data = bmt)\n\nkm_and_na = \n bind_rows(\n .id = \"model\",\n \"Kaplan-Meier\" = km_fit |> fortify(surv.connect = TRUE),\n \"Nelson-Aalen\" = na_fit |> fortify(surv.connect = TRUE)\n ) |> \n as_tibble()\n\n\n\nShow R codekm_and_na |> \n ggplot(aes(x = time, y = surv, col = model)) +\n geom_step() +\n facet_grid(. ~ strata) +\n theme_bw() + \n ylab(\"S(t) = P(T>=t)\") +\n xlab(\"Survival time (t, days)\") +\n theme(legend.position = \"bottom\")\n\n\nKaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group\n\n\n\n\nThe Kaplan-Meier and Nelson-Aalen survival estimates are very similar for this dataset.", + "text": "5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival\n\n\nDefinition 5.5 (Nelson-Aalen Cumulative Hazard Estimator)  \n\nThe point hazard at time \\(t_i\\) can be estimated by \\(d_i/Y_i\\), which leads to the Nelson-Aalen estimator of the cumulative hazard:\n\n\\[\\hat H_{NA}(t) \\stackrel{\\text{def}}{=}\\sum_{t_i < t}\\frac{d_i}{Y_i} \\tag{5.5}\\]\n\n\n\nTheorem 5.5 (Variance of Nelson-Aalen estimator)  \n\nThe variance of this estimator is approximately:\n\n\\[\n\\begin{aligned}\n\\hat{\\text{Var}}\\left(\\hat H_{NA} (t)\\right)\n&= \\sum_{t_i <t}\\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\\\\n&\\approx \\sum_{t_i <t}\\frac{d_i}{Y_i^2}\n\\end{aligned}\n\\tag{5.6}\\]\n\n\nSince \\(S(t)=\\text{exp}\\left\\{-H(t)\\right\\}\\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival function:\n\\[\n\\begin{aligned}\n\\hat S_{NA}(t)\n&= \\text{exp}\\left\\{-\\hat H_{NA}(t)\\right\\}\\\\\n&= \\text{exp}\\left\\{-\\sum_{t_i < t}\\frac{d_i}{Y_i}\\right\\}\\\\\n&= \\prod_{t_i < t}\\text{exp}\\left\\{-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\n\nCompare these with the corresponding Kaplan-Meier estimates:\n\\[\n\\begin{aligned}\n\\hat H_{KM}(t) &= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\hat S_{KM}(t) &= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\end{aligned}\n\\]\n\nThe product limit estimate and the Nelson-Aalen estimate often do not differ by much. The latter is considered more accurate in small samples and also directly estimates the cumulative hazard. The \"fleming-harrington\" method for survfit() reduces to Nelson-Aalen when the data are unweighted. 
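Here is a small sketch computing both estimates by hand from the same event and at-risk counts (it reuses the bmt data and surv column constructed in the previous section):

Show R code
fit = survfit(surv ~ 1, data = bmt, subset = group == "ALL")
d = fit$n.event
Y = fit$n.risk
H_NA = cumsum(d / Y)       # Nelson-Aalen cumulative hazard
S_NA = exp(-H_NA)          # survival via the Nelson-Aalen estimate
S_KM = cumprod(1 - d / Y)  # Kaplan-Meier survival
H_KM = -log(S_KM)          # cumulative hazard via the KM estimate
range(S_KM - S_NA)         # the two survival estimates differ only slightly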
We can also estimate the cumulative hazard as the negative log of the KM survival function estimate.\n\n\n5.10.1 Application to bmt dataset\n\nShow R code\nna_fit = survfit(\n formula = surv ~ group,\n type = \"fleming-harrington\",\n data = bmt)\n\nkm_fit = survfit(\n formula = surv ~ group,\n type = \"kaplan-meier\",\n data = bmt)\n\nkm_and_na = \n bind_rows(\n .id = \"model\",\n \"Kaplan-Meier\" = km_fit |> fortify(surv.connect = TRUE),\n \"Nelson-Aalen\" = na_fit |> fortify(surv.connect = TRUE)\n ) |> \n as_tibble()\n\n\n\nShow R codekm_and_na |> \n ggplot(aes(x = time, y = surv, col = model)) +\n geom_step() +\n facet_grid(. ~ strata) +\n theme_bw() + \n ylab(\"S(t) = P(T>=t)\") +\n xlab(\"Survival time (t, days)\") +\n theme(legend.position = \"bottom\")\n\n\nKaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group\n\n\n\n\nThe Kaplan-Meier and Nelson-Aalen survival estimates are very similar for this dataset.\n\n\n\n\n\n\nCopelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham, and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia with Allogeneic Bone Marrow Transplantation Following Preparation with BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838 .\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -709,7 +709,7 @@ "href": "coxph-model-building.html#goodness-of-fit-using-the-cox-snell-residuals", "title": "\n7  Building Cox Proportional Hazards models\n", "section": "\n7.4 Goodness of Fit using the Cox-Snell Residuals", - "text": "7.4 Goodness of Fit using the Cox-Snell Residuals\n(references: Klein & Moeschberger textbook, §11.2, and Dobson & Barnett textbook, §10.6)\nSuppose that an individual has a survival time \\(T\\) which has survival function \\(S(t)\\), meaning that \\(\\Pr(T> t) = S(t)\\). Then \\(S(T)\\) has a uniform distribution on \\((0,1)\\).\n\\[\n\\begin{aligned}\n\\Pr(S(T_i) \\le u)\n&= \\Pr(T_i > S_i^{-1}(u))\\\\\n&= S_i(S_i^{-1}(u))\\\\\n&= u\n\\end{aligned}\n\\]\nAlso, if \\(U\\) has a uniform distribution on \\((0,1)\\), then what is the distribution of \\(-\\ln(U)\\)?\n\\[\n\\begin{aligned}\n\\Pr(-\\ln(U) < x) &= \\Pr(U>\\text{exp}\\left\\{-x\\right\\})\\\\\n&= 1-e^{-x}\n\\end{aligned}\n\\]\nwhich is the CDF of an exponential distribution with parameter \\(\\lambda=1\\).\nSo,\n\\[\n\\begin{aligned}\nr^{CS}_i&\n\\stackrel{\\text{def}}{=}-\\ln[\\hat S(t_i|x_i)]\n= \\hat H(t_i|\\tilde{x}_i)\n\\end{aligned}\n\\]\nshould have an exponential distribution with constant hazard \\(\\lambda=1\\) if the estimate \\(\\hat S_i\\) is accurate, which means that these values should look like a censored sample from this exponential distribution. 
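(A small simulation makes this concrete; the Weibull model below is an arbitrary stand-in for the true survival distribution:)

Show R code
set.seed(1)
tt = rweibull(1e5, shape = 1.5, scale = 2)                    # event times from a known model
s = pweibull(tt, shape = 1.5, scale = 2, lower.tail = FALSE)  # S(T) ~ Uniform(0,1)
r = -log(s)                                                   # so -log{S(T)} ~ Exponential(1)
c(mean(r), var(r))                                            # both should be close to 1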
These values are called generalized residuals or Cox-Snell residuals.\n\nShow R codehodg2 = hodg2 |> \n mutate(cs = predict(hodg.cox1, type = \"expected\"))\n\nsurv.csr = survfit(\n data = hodg2,\n formula = Surv(time = cs, event = delta == \"dead\") ~ 1,\n type = \"fleming-harrington\")\n\nautoplot(surv.csr, fun = \"cumhaz\") + \n geom_abline(aes(intercept = 0, slope = 1), col = \"red\") +\n theme_bw()\n\n\nCumulative Hazard of Cox-Snell Residuals\n\n\n\n\nThe line with slope 1 and intercept 0 fits the curve relatively well, so we don’t see lack of fit using this procedure.", + "text": "7.4 Goodness of Fit using the Cox-Snell Residuals\n(references: Klein, Moeschberger, et al. (2003), §11.2, and Dobson and Barnett (2018), §10.6)\nSuppose that an individual has a survival time \\(T\\) which has survival function \\(S(t)\\), meaning that \\(\\Pr(T> t) = S(t)\\). Then \\(S(T)\\) has a uniform distribution on \\((0,1)\\).\n\\[\n\\begin{aligned}\n\\Pr(S(T_i) \\le u)\n&= \\Pr(T_i > S_i^{-1}(u))\\\\\n&= S_i(S_i^{-1}(u))\\\\\n&= u\n\\end{aligned}\n\\]\nAlso, if \\(U\\) has a uniform distribution on \\((0,1)\\), then what is the distribution of \\(-\\ln(U)\\)?\n\\[\n\\begin{aligned}\n\\Pr(-\\ln(U) < x) &= \\Pr(U>\\text{exp}\\left\\{-x\\right\\})\\\\\n&= 1-e^{-x}\n\\end{aligned}\n\\]\nwhich is the CDF of an exponential distribution with parameter \\(\\lambda=1\\).\nSo,\n\\[\n\\begin{aligned}\nr^{CS}_i&\n\\stackrel{\\text{def}}{=}-\\ln[\\hat S(t_i|x_i)]\n= \\hat H(t_i|\\tilde{x}_i)\n\\end{aligned}\n\\]\nshould have an exponential distribution with constant hazard \\(\\lambda=1\\) if the estimate \\(\\hat S_i\\) is accurate, which means that these values should look like a censored sample from this exponential distribution. These values are called generalized residuals or Cox-Snell residuals.\n\nShow R codehodg2 = hodg2 |> \n mutate(cs = predict(hodg.cox1, type = \"expected\"))\n\nsurv.csr = survfit(\n data = hodg2,\n formula = Surv(time = cs, event = delta == \"dead\") ~ 1,\n type = \"fleming-harrington\")\n\nautoplot(surv.csr, fun = \"cumhaz\") + \n geom_abline(aes(intercept = 0, slope = 1), col = \"red\") +\n theme_bw()\n\n\nCumulative Hazard of Cox-Snell Residuals\n\n\n\n\nThe line with slope 1 and intercept 0 fits the curve relatively well, so we don’t see lack of fit using this procedure.", "crumbs": [ "Time to Event Models", "7  Building Cox Proportional Hazards models" @@ -742,7 +742,7 @@ "href": "coxph-model-building.html#stratified-survival-models", "title": "\n7  Building Cox Proportional Hazards models\n", "section": "\n7.7 Stratified survival models", - "text": "7.7 Stratified survival models\n\n7.7.1 Revisiting the leukemia dataset (anderson)\nWe will analyze remission survival times on 42 leukemia patients, half on new treatment, half on standard treatment.\nThis is the same data as the drug6mp data from KMsurv, but with two other variables and without the pairing. 
This version comes from the Kleinbaum and Klein survival textbook (e.g., p281):\n\nShow R code\nanderson = \n paste0(\n \"http://web1.sph.emory.edu/dkleinb/allDatasets/\",\n \"surv2datasets/anderson.dta\") |> \n haven::read_dta() |> \n mutate(\n status = status |> \n case_match(\n 1 ~ \"relapse\",\n 0 ~ \"censored\"\n ),\n \n sex = sex |> \n case_match(\n 0 ~ \"female\",\n 1 ~ \"male\"\n ) |> \n factor() |> \n relevel(ref = \"female\"),\n \n rx = rx |> \n case_match(\n 0 ~ \"new\",\n 1 ~ \"standard\"\n ) |> \n factor() |> relevel(ref = \"standard\"),\n \n surv = Surv(\n time = survt, \n event = (status == \"relapse\"))\n )\n\nprint(anderson)\n\n\n\n7.7.2 Cox semi-parametric proportional hazards model\n\nShow R code\nanderson.cox1 = coxph(\n formula = surv ~ rx + sex + logwbc,\n data = anderson)\n\nsummary(anderson.cox1)\n#> Call:\n#> coxph(formula = surv ~ rx + sex + logwbc, data = anderson)\n#> \n#> n= 42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -1.504 0.222 0.462 -3.26 0.0011 ** \n#> sexmale 0.315 1.370 0.455 0.69 0.4887 \n#> logwbc 1.682 5.376 0.337 5.00 5.8e-07 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.222 4.498 0.090 0.549\n#> sexmale 1.370 0.730 0.562 3.338\n#> logwbc 5.376 0.186 2.779 10.398\n#> \n#> Concordance= 0.851 (se = 0.041 )\n#> Likelihood ratio test= 47.2 on 3 df, p=3e-10\n#> Wald test = 33.5 on 3 df, p=2e-07\n#> Score (logrank) test = 48 on 3 df, p=2e-10\n\n\nTest the proportional hazards assumption\n\nShow R codecox.zph(anderson.cox1)\n#> chisq df p\n#> rx 0.036 1 0.85\n#> sex 5.420 1 0.02\n#> logwbc 0.142 1 0.71\n#> GLOBAL 5.879 3 0.12\n\n\nGraph the K-M survival curves\n\nShow R code\nanderson_km_model = survfit(\n formula = surv ~ sex,\n data = anderson)\n\nanderson_km_model |> \n autoplot(conf.int = FALSE) +\n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\n\n\n\n\n\nThe survival curves cross, which indicates a problem in the proportionality assumption by sex.\n\n7.7.3 Graph the Nelson-Aalen cumulative hazard\nWe can also look at the log-hazard (“cloglog survival”) plots:\n\nShow R codeanderson_na_model = survfit(\n formula = surv ~ sex,\n data = anderson,\n type = \"fleming\")\n\nanderson_na_model |> \n autoplot(\n fun = \"cumhaz\",\n conf.int = FALSE) +\n theme_classic() +\n theme(legend.position=\"bottom\") +\n ylab(\"log(Cumulative Hazard)\") +\n scale_y_continuous(\n trans = \"log10\",\n name = \"Cumulative hazard (H(t), log scale)\") +\n scale_x_continuous(\n breaks = c(1,2,5,10,20,50),\n trans = \"log\"\n )\n\n\nCumulative hazard (cloglog scale) for anderson data\n\n\n\n\nThis can be fixed by using strata or possibly by other model alterations.\n\n7.7.4 The Stratified Cox Model\n\nIn a stratified Cox model, each stratum, defined by one or more factors, has its own base survival function \\(h_0(t)\\).\nBut the coefficients for each variable not used in the strata definitions are assumed to be the same across strata.\nTo check if this assumption is reasonable one can include interactions with strata and see if they are significant (this may generate a warning and NA lines but these can be ignored).\nSince the sex variable shows possible non-proportionality, we try stratifying on sex.\n\n\nShow R code\nanderson.coxph.strat = \n coxph(\n formula = \n surv ~ rx + logwbc + strata(sex),\n data = anderson)\n\nsummary(anderson.coxph.strat)\n#> Call:\n#> coxph(formula = surv ~ rx + logwbc + strata(sex), data = anderson)\n#> \n#> n= 
42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.998 0.369 0.474 -2.11 0.035 * \n#> logwbc 1.454 4.279 0.344 4.22 2.4e-05 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.369 2.713 0.146 0.932\n#> logwbc 4.279 0.234 2.180 8.398\n#> \n#> Concordance= 0.812 (se = 0.059 )\n#> Likelihood ratio test= 32.1 on 2 df, p=1e-07\n#> Wald test = 22.8 on 2 df, p=1e-05\n#> Score (logrank) test = 30.8 on 2 df, p=2e-07\n\n\nLet’s compare this to a model fit only on the subset of males:\n\nShow R code\nanderson.coxph.male = \n coxph(\n formula = surv ~ rx + logwbc,\n subset = sex == \"male\",\n data = anderson)\n\nsummary(anderson.coxph.male)\n#> Call:\n#> coxph(formula = surv ~ rx + logwbc, data = anderson, subset = sex == \n#> \"male\")\n#> \n#> n= 20, number of events= 14 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -1.978 0.138 0.739 -2.68 0.0075 **\n#> logwbc 1.743 5.713 0.536 3.25 0.0011 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.138 7.227 0.0325 0.589\n#> logwbc 5.713 0.175 1.9991 16.328\n#> \n#> Concordance= 0.905 (se = 0.043 )\n#> Likelihood ratio test= 29.2 on 2 df, p=5e-07\n#> Wald test = 15.3 on 2 df, p=5e-04\n#> Score (logrank) test = 26.4 on 2 df, p=2e-06\n\n\n\nShow R codeanderson.coxph.female = \n coxph(\n formula = \n surv ~ rx + logwbc,\n subset = sex == \"female\",\n data = anderson)\n\nsummary(anderson.coxph.female)\n#> Call:\n#> coxph(formula = surv ~ rx + logwbc, data = anderson, subset = sex == \n#> \"female\")\n#> \n#> n= 22, number of events= 16 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.311 0.733 0.564 -0.55 0.581 \n#> logwbc 1.206 3.341 0.503 2.40 0.017 *\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.733 1.365 0.243 2.21\n#> logwbc 3.341 0.299 1.245 8.96\n#> \n#> Concordance= 0.692 (se = 0.085 )\n#> Likelihood ratio test= 6.65 on 2 df, p=0.04\n#> Wald test = 6.36 on 2 df, p=0.04\n#> Score (logrank) test = 6.74 on 2 df, p=0.03\n\n\nThe coefficients of treatment look different. Are they statistically different?\n\nShow R code\nanderson.coxph.strat.intxn = \n coxph(\n formula = surv ~ strata(sex) * (rx + logwbc),\n data = anderson)\n\nanderson.coxph.strat.intxn |> summary()\n#> Call:\n#> coxph(formula = surv ~ strata(sex) * (rx + logwbc), data = anderson)\n#> \n#> n= 42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.311 0.733 0.564 -0.55 0.581 \n#> logwbc 1.206 3.341 0.503 2.40 0.017 *\n#> strata(sex)male:rxnew -1.667 0.189 0.930 -1.79 0.073 .\n#> strata(sex)male:logwbc 0.537 1.710 0.735 0.73 0.465 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.733 1.365 0.2427 2.21\n#> logwbc 3.341 0.299 1.2452 8.96\n#> strata(sex)male:rxnew 0.189 5.294 0.0305 1.17\n#> strata(sex)male:logwbc 1.710 0.585 0.4048 7.23\n#> \n#> Concordance= 0.797 (se = 0.058 )\n#> Likelihood ratio test= 35.8 on 4 df, p=3e-07\n#> Wald test = 21.7 on 4 df, p=2e-04\n#> Score (logrank) test = 33.1 on 4 df, p=1e-06\n\n\n\nShow R codeanova(\n anderson.coxph.strat.intxn,\n anderson.coxph.strat)\n\n\n \n\n\n\nWe don’t have enough evidence to tell the difference between these two models.\n\n7.7.5 Conclusions\n\nWe chose to use a stratified model because of the apparent non-proportionality of the hazard for the sex variable.\nWhen we fit interactions with the strata variable, we did not get an improved model (via the likelihood ratio test).\nSo we use the stratifed model with coefficients that are the same across strata.\n\n7.7.6 Another Modeling Approach\n\nWe used an additive model without interactions and saw that we might need to stratify by sex.\nInstead, we could try to improve the model’s functional form - maybe the interaction of treatment and sex is real, and after fitting that we might not need separate hazard functions.\nEither approach may work.\n\n\nShow R code\nanderson.coxph.intxn = \n coxph(\n formula = surv ~ (rx + logwbc) * sex,\n data = anderson)\n\nanderson.coxph.intxn |> summary()\n#> Call:\n#> coxph(formula = surv ~ (rx + logwbc) * sex, data = anderson)\n#> \n#> n= 42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.3748 0.6874 0.5545 -0.68 0.499 \n#> logwbc 1.0637 2.8971 0.4726 2.25 0.024 *\n#> sexmale -2.8052 0.0605 2.0323 -1.38 0.167 \n#> rxnew:sexmale -2.1782 0.1132 0.9109 -2.39 0.017 *\n#> logwbc:sexmale 1.2303 3.4223 0.6301 1.95 0.051 .\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.6874 1.455 0.23185 2.038\n#> logwbc 2.8971 0.345 1.14730 7.315\n#> sexmale 0.0605 16.531 0.00113 3.248\n#> rxnew:sexmale 0.1132 8.830 0.01899 0.675\n#> logwbc:sexmale 3.4223 0.292 0.99539 11.766\n#> \n#> Concordance= 0.861 (se = 0.036 )\n#> Likelihood ratio test= 57 on 5 df, p=5e-11\n#> Wald test = 35.6 on 5 df, p=1e-06\n#> Score (logrank) test = 57.1 on 5 df, p=5e-11\n\n\n\nShow R codecox.zph(anderson.coxph.intxn)\n#> chisq df p\n#> rx 0.136 1 0.71\n#> logwbc 1.652 1 0.20\n#> sex 1.266 1 0.26\n#> rx:sex 0.149 1 0.70\n#> logwbc:sex 0.102 1 0.75\n#> GLOBAL 3.747 5 0.59", + "text": "7.7 Stratified survival models\n\n7.7.1 Revisiting the leukemia dataset (anderson)\nWe will analyze remission survival times on 42 leukemia patients, half on new treatment, half on standard treatment.\nThis is the same data as the drug6mp data from KMsurv, but with two other variables and without the pairing. 
This version comes from Kleinbaum and Klein (2012) (e.g., p281):\n\nShow R code\nanderson = \n paste0(\n \"http://web1.sph.emory.edu/dkleinb/allDatasets/\",\n \"surv2datasets/anderson.dta\") |> \n haven::read_dta() |> \n mutate(\n status = status |> \n case_match(\n 1 ~ \"relapse\",\n 0 ~ \"censored\"\n ),\n \n sex = sex |> \n case_match(\n 0 ~ \"female\",\n 1 ~ \"male\"\n ) |> \n factor() |> \n relevel(ref = \"female\"),\n \n rx = rx |> \n case_match(\n 0 ~ \"new\",\n 1 ~ \"standard\"\n ) |> \n factor() |> relevel(ref = \"standard\"),\n \n surv = Surv(\n time = survt, \n event = (status == \"relapse\"))\n )\n\nprint(anderson)\n\n\n\n7.7.2 Cox semi-parametric proportional hazards model\n\nShow R code\nanderson.cox1 = coxph(\n formula = surv ~ rx + sex + logwbc,\n data = anderson)\n\nsummary(anderson.cox1)\n#> Call:\n#> coxph(formula = surv ~ rx + sex + logwbc, data = anderson)\n#> \n#> n= 42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -1.504 0.222 0.462 -3.26 0.0011 ** \n#> sexmale 0.315 1.370 0.455 0.69 0.4887 \n#> logwbc 1.682 5.376 0.337 5.00 5.8e-07 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.222 4.498 0.090 0.549\n#> sexmale 1.370 0.730 0.562 3.338\n#> logwbc 5.376 0.186 2.779 10.398\n#> \n#> Concordance= 0.851 (se = 0.041 )\n#> Likelihood ratio test= 47.2 on 3 df, p=3e-10\n#> Wald test = 33.5 on 3 df, p=2e-07\n#> Score (logrank) test = 48 on 3 df, p=2e-10\n\n\nTest the proportional hazards assumption\n\nShow R codecox.zph(anderson.cox1)\n#> chisq df p\n#> rx 0.036 1 0.85\n#> sex 5.420 1 0.02\n#> logwbc 0.142 1 0.71\n#> GLOBAL 5.879 3 0.12\n\n\nGraph the K-M survival curves\n\nShow R code\nanderson_km_model = survfit(\n formula = surv ~ sex,\n data = anderson)\n\nanderson_km_model |> \n autoplot(conf.int = FALSE) +\n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\n\n\n\n\n\nThe survival curves cross, which indicates a problem in the proportionality assumption by sex.\n\n7.7.3 Graph the Nelson-Aalen cumulative hazard\nWe can also look at the log-hazard (“cloglog survival”) plots:\n\nShow R codeanderson_na_model = survfit(\n formula = surv ~ sex,\n data = anderson,\n type = \"fleming\")\n\nanderson_na_model |> \n autoplot(\n fun = \"cumhaz\",\n conf.int = FALSE) +\n theme_classic() +\n theme(legend.position=\"bottom\") +\n ylab(\"log(Cumulative Hazard)\") +\n scale_y_continuous(\n trans = \"log10\",\n name = \"Cumulative hazard (H(t), log scale)\") +\n scale_x_continuous(\n breaks = c(1,2,5,10,20,50),\n trans = \"log\"\n )\n\n\nCumulative hazard (cloglog scale) for anderson data\n\n\n\n\nThis can be fixed by using strata or possibly by other model alterations.\n\n7.7.4 The Stratified Cox Model\n\nIn a stratified Cox model, each stratum, defined by one or more factors, has its own base survival function \\(h_0(t)\\).\nBut the coefficients for each variable not used in the strata definitions are assumed to be the same across strata.\nTo check if this assumption is reasonable one can include interactions with strata and see if they are significant (this may generate a warning and NA lines but these can be ignored).\nSince the sex variable shows possible non-proportionality, we try stratifying on sex.\n\n\nShow R code\nanderson.coxph.strat = \n coxph(\n formula = \n surv ~ rx + logwbc + strata(sex),\n data = anderson)\n\nsummary(anderson.coxph.strat)\n#> Call:\n#> coxph(formula = surv ~ rx + logwbc + strata(sex), data = anderson)\n#> \n#> n= 42, number of 
events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.998 0.369 0.474 -2.11 0.035 * \n#> logwbc 1.454 4.279 0.344 4.22 2.4e-05 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.369 2.713 0.146 0.932\n#> logwbc 4.279 0.234 2.180 8.398\n#> \n#> Concordance= 0.812 (se = 0.059 )\n#> Likelihood ratio test= 32.1 on 2 df, p=1e-07\n#> Wald test = 22.8 on 2 df, p=1e-05\n#> Score (logrank) test = 30.8 on 2 df, p=2e-07\n\n\nLet’s compare this to a model fit only on the subset of males:\n\nShow R code\nanderson.coxph.male = \n coxph(\n formula = surv ~ rx + logwbc,\n subset = sex == \"male\",\n data = anderson)\n\nsummary(anderson.coxph.male)\n#> Call:\n#> coxph(formula = surv ~ rx + logwbc, data = anderson, subset = sex == \n#> \"male\")\n#> \n#> n= 20, number of events= 14 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -1.978 0.138 0.739 -2.68 0.0075 **\n#> logwbc 1.743 5.713 0.536 3.25 0.0011 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.138 7.227 0.0325 0.589\n#> logwbc 5.713 0.175 1.9991 16.328\n#> \n#> Concordance= 0.905 (se = 0.043 )\n#> Likelihood ratio test= 29.2 on 2 df, p=5e-07\n#> Wald test = 15.3 on 2 df, p=5e-04\n#> Score (logrank) test = 26.4 on 2 df, p=2e-06\n\n\n\nShow R codeanderson.coxph.female = \n coxph(\n formula = \n surv ~ rx + logwbc,\n subset = sex == \"female\",\n data = anderson)\n\nsummary(anderson.coxph.female)\n#> Call:\n#> coxph(formula = surv ~ rx + logwbc, data = anderson, subset = sex == \n#> \"female\")\n#> \n#> n= 22, number of events= 16 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.311 0.733 0.564 -0.55 0.581 \n#> logwbc 1.206 3.341 0.503 2.40 0.017 *\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.733 1.365 0.243 2.21\n#> logwbc 3.341 0.299 1.245 8.96\n#> \n#> Concordance= 0.692 (se = 0.085 )\n#> Likelihood ratio test= 6.65 on 2 df, p=0.04\n#> Wald test = 6.36 on 2 df, p=0.04\n#> Score (logrank) test = 6.74 on 2 df, p=0.03\n\n\nThe coefficients of treatment look different. Are they statistically different?\n\nShow R code\nanderson.coxph.strat.intxn = \n coxph(\n formula = surv ~ strata(sex) * (rx + logwbc),\n data = anderson)\n\nanderson.coxph.strat.intxn |> summary()\n#> Call:\n#> coxph(formula = surv ~ strata(sex) * (rx + logwbc), data = anderson)\n#> \n#> n= 42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.311 0.733 0.564 -0.55 0.581 \n#> logwbc 1.206 3.341 0.503 2.40 0.017 *\n#> strata(sex)male:rxnew -1.667 0.189 0.930 -1.79 0.073 .\n#> strata(sex)male:logwbc 0.537 1.710 0.735 0.73 0.465 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.733 1.365 0.2427 2.21\n#> logwbc 3.341 0.299 1.2452 8.96\n#> strata(sex)male:rxnew 0.189 5.294 0.0305 1.17\n#> strata(sex)male:logwbc 1.710 0.585 0.4048 7.23\n#> \n#> Concordance= 0.797 (se = 0.058 )\n#> Likelihood ratio test= 35.8 on 4 df, p=3e-07\n#> Wald test = 21.7 on 4 df, p=2e-04\n#> Score (logrank) test = 33.1 on 4 df, p=1e-06\n\n\n\nShow R codeanova(\n anderson.coxph.strat.intxn,\n anderson.coxph.strat)\n\n\n \n\n\n\nWe don’t have enough evidence to distinguish between these two models.\n\n7.7.5 Conclusions\n\nWe chose to use a stratified model because of the apparent non-proportionality of the hazard for the sex variable.\nWhen we fit interactions with the strata variable, we did not get an improved model (via the likelihood ratio test).\nSo we use the stratified model with coefficients that are the same across strata.\n\n7.7.6 Another Modeling Approach\n\nWe used an additive model without interactions and saw that we might need to stratify by sex.\nInstead, we could try to improve the model’s functional form: maybe the interaction of treatment and sex is real, and after fitting that we might not need separate hazard functions.\nEither approach may work.\n\n\nShow R code\nanderson.coxph.intxn = \n coxph(\n formula = surv ~ (rx + logwbc) * sex,\n data = anderson)\n\nanderson.coxph.intxn |> summary()\n#> Call:\n#> coxph(formula = surv ~ (rx + logwbc) * sex, data = anderson)\n#> \n#> n= 42, number of events= 30 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> rxnew -0.3748 0.6874 0.5545 -0.68 0.499 \n#> logwbc 1.0637 2.8971 0.4726 2.25 0.024 *\n#> sexmale -2.8052 0.0605 2.0323 -1.38 0.167 \n#> rxnew:sexmale -2.1782 0.1132 0.9109 -2.39 0.017 *\n#> logwbc:sexmale 1.2303 3.4223 0.6301 1.95 0.051 .\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> rxnew 0.6874 1.455 0.23185 2.038\n#> logwbc 2.8971 0.345 1.14730 7.315\n#> sexmale 0.0605 16.531 0.00113 3.248\n#> rxnew:sexmale 0.1132 8.830 0.01899 0.675\n#> logwbc:sexmale 3.4223 0.292 0.99539 11.766\n#> \n#> Concordance= 0.861 (se = 0.036 )\n#> Likelihood ratio test= 57 on 5 df, p=5e-11\n#> Wald test = 35.6 on 5 df, p=1e-06\n#> Score (logrank) test = 57.1 on 5 df, p=5e-11\n\n\n\nShow R codecox.zph(anderson.coxph.intxn)\n#> chisq df p\n#> rx 0.136 1 0.71\n#> logwbc 1.652 1 0.20\n#> sex 1.266 1 0.26\n#> rx:sex 0.149 1 0.70\n#> logwbc:sex 0.102 1 0.75\n#> GLOBAL 3.747 5 0.59",
    "crumbs": [
      "Time to Event Models",
      "7  Building Cox Proportional Hazards models"
    ]
  },
@@ -753,7 +753,7 @@
    "href": "coxph-model-building.html#time-varying-covariates",
    "title": "\n7  Building Cox Proportional Hazards models\n",
    "section": "\n7.8 Time-varying covariates",
-    "text": "7.8 Time-varying covariates\n(adapted from Klein, Moeschberger, et al. 
(2003), §9.2)\n\n7.8.1 Motivating example: back to the leukemia dataset\n\n# load the data:\ndata(bmt, package = 'KMsurv')\nbmt |> as_tibble() |> print(n = 5)\n#> # A tibble: 137 × 22\n#> group t1 t2 d1 d2 d3 ta da tc dc tp dp z1\n#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>\n#> 1 1 2081 2081 0 0 0 67 1 121 1 13 1 26\n#> 2 1 1602 1602 0 0 0 1602 0 139 1 18 1 21\n#> 3 1 1496 1496 0 0 0 1496 0 307 1 12 1 26\n#> 4 1 1462 1462 0 0 0 70 1 95 1 13 1 17\n#> 5 1 1433 1433 0 0 0 1433 0 236 1 12 1 32\n#> # ℹ 132 more rows\n#> # ℹ 9 more variables: z2 <int>, z3 <int>, z4 <int>, z5 <int>, z6 <int>,\n#> # z7 <int>, z8 <int>, z9 <int>, z10 <int>\n\nThis dataset comes from the Copelan et al. (1991) study of allogenic bone marrow transplant therapy for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).\nOutcomes (endpoints)\n\nThe main endpoint is disease-free survival (t2 and d3) for the three risk groups, “ALL”, “AML Low Risk”, and “AML High Risk”.\nPossible intermediate events\n\ngraft vs. host disease (GVHD), an immunological rejection response to the transplant (bad)\nacute (AGVHD)\nchronic (CGVHD)\nplatelet recovery, a return of platelet count to normal levels (good)\n\nOne or the other, both in either order, or neither may occur.\nCovariates\n\nWe are interested in possibly using the covariates z1-z10 to adjust for other factors.\nIn addition, the time-varying covariates for acute GVHD, chronic GVHD, and platelet recovery may be useful.\nPreprocessing\nWe reformat the data before analysis:\n\nShow R code# reformat the data:\nbmt1 = \n bmt |> \n as_tibble() |> \n mutate(\n id = 1:n(), # will be used to connect multiple records for the same individual\n \n group = group |> \n case_match(\n 1 ~ \"ALL\",\n 2 ~ \"Low Risk AML\",\n 3 ~ \"High Risk AML\") |> \n factor(levels = c(\"ALL\", \"Low Risk AML\", \"High Risk AML\")),\n \n `patient age` = z1,\n \n `donor age` = z2,\n \n `patient sex` = z3 |> \n case_match(\n 0 ~ \"Female\",\n 1 ~ \"Male\"),\n \n `donor sex` = z4 |> \n case_match(\n 0 ~ \"Female\",\n 1 ~ \"Male\"),\n \n `Patient CMV Status` = z5 |> \n case_match(\n 0 ~ \"CMV Negative\",\n 1 ~ \"CMV Positive\"),\n \n `Donor CMV Status` = z6 |> \n case_match(\n 0 ~ \"CMV Negative\",\n 1 ~ \"CMV Positive\"),\n \n `Waiting Time to Transplant` = z7,\n \n FAB = z8 |> \n case_match(\n 1 ~ \"Grade 4 Or 5 (AML only)\",\n 0 ~ \"Other\") |> \n factor() |> \n relevel(ref = \"Other\"),\n \n hospital = z9 |> # `z9` is hospital\n case_match(\n 1 ~ \"Ohio State University\",\n 2 ~ \"Alferd\",\n 3 ~ \"St. 
Vincent\",\n 4 ~ \"Hahnemann\") |> \n factor() |> \n relevel(ref = \"Ohio State University\"),\n \n MTX = (z10 == 1) # a prophylatic treatment for GVHD\n \n ) |> \n select(-(z1:z10)) # don't need these anymore\n\nbmt1 |> \n select(group, id:MTX) |> \n print(n = 10)\n#> # A tibble: 137 × 12\n#> group id `patient age` `donor age` `patient sex` `donor sex`\n#> <fct> <int> <int> <int> <chr> <chr> \n#> 1 ALL 1 26 33 Male Female \n#> 2 ALL 2 21 37 Male Male \n#> 3 ALL 3 26 35 Male Male \n#> 4 ALL 4 17 21 Female Male \n#> 5 ALL 5 32 36 Male Male \n#> 6 ALL 6 22 31 Male Male \n#> 7 ALL 7 20 17 Male Female \n#> 8 ALL 8 22 24 Male Female \n#> 9 ALL 9 18 21 Female Male \n#> 10 ALL 10 24 40 Male Male \n#> # ℹ 127 more rows\n#> # ℹ 6 more variables: `Patient CMV Status` <chr>, `Donor CMV Status` <chr>,\n#> # `Waiting Time to Transplant` <int>, FAB <fct>, hospital <fct>, MTX <lgl>\n\n\n\n7.8.2 Time-Dependent Covariates\n\nA time-dependent covariate (“TDC”) is a covariate whose value changes during the course of the study.\nFor variables like age that change in a linear manner with time, we can just use the value at the start.\nBut it may be plausible that when and if GVHD occurs, the risk of relapse or death increases, and when and if platelet recovery occurs, the risk decreases.\n\n7.8.3 Analysis in R\n\nWe form a variable precovery which is = 0 before platelet recovery and is = 1 after platelet recovery, if it occurs.\nFor each subject where platelet recovery occurs, we set up multiple records (lines in the data frame); for example one from t = 0 to the time of platelet recovery, and one from that time to relapse, recovery, or death.\nWe do the same for acute GVHD and chronic GVHD.\nFor each record, the covariates are constant.\n\n\nShow R code\nbmt2 = bmt1 |> \n #set up new long-format data set:\n tmerge(bmt1, id = id, tstop = t2) |> \n \n # the following three steps can be in any order, \n # and will still produce the same result:\n #add aghvd as tdc:\n tmerge(bmt1, id = id, agvhd = tdc(ta)) |> \n #add cghvd as tdc:\n tmerge(bmt1, id = id, cgvhd = tdc(tc)) |> \n #add platelet recovery as tdc:\n tmerge(bmt1, id = id, precovery = tdc(tp)) \n\nbmt2 = bmt2 |> \n as_tibble() |> \n mutate(status = as.numeric((tstop == t2) & d3))\n# status only = 1 if at end of t2 and not censored\n\n\nLet’s see how we’ve rearranged the first row of the data:\n\nShow R code\nbmt1 |> \n dplyr::filter(id == 1) |> \n dplyr::select(id, t1, d1, t2, d2, d3, ta, da, tc, dc, tp, dp)\n\n\n \n\n\n\nThe event times for this individual are:\n\n\nt = 0 time of transplant\n\ntp = 13 platelet recovery\n\nta = 67 acute GVHD onset\n\ntc = 121 chronic GVHD onset\n\nt2 = 2081 end of study, patient not relapsed or dead\n\nAfter converting the data to long-format, we have:\n\nShow R codebmt2 |> \n select(\n id,\n tstart,\n tstop,\n agvhd,\n cgvhd,\n precovery,\n status\n ) |> \n dplyr::filter(id == 1)\n\n\n \n\n\n\nNote that status could have been 1 on the last row, indicating that relapse or death occurred; since it is false, the participant must have exited the study without experiencing relapse or death (i.e., they were censored).\n\n7.8.4 Event sequences\nLet:\n\nA = acute GVHD\nC = chronic GVHD\nP = platelet recovery\n\nEach of the eight possible combinations of A or not-A, with C or not-C, with P or not-P occurs in this data set.\n\nA always occurs before C, and P always occurs before C, if both occur.\nThus there are ten event sequences in the data set: None, A, C, P, AC, AP, PA, PC, APC, and PAC.\nIn general, there could be as many as 
\\(1+3+(3)(2)+6=16\\) sequences, but our domain knowledge tells us that six of them are missing: CA, CP, ACP, CAP, CPA, and PCA (each would violate “A before C” or “P before C”).\nDifferent subjects could have 1, 2, 3, or 4 intervals, depending on which of acute GVHD, chronic GVHD, and/or platelet recovery occurred.\nThe final interval for any subject has status = 1 if the subject relapsed or died at that time; otherwise status = 0.\nAny earlier intervals have status = 0.\nEven though there might be multiple lines per ID in the dataset, there is never more than one event, so no alterations need be made in the estimation procedures or in the interpretation of the output.\nThe function tmerge in the survival package eases the process of constructing the new long-format dataset.\n\n7.8.5 Model with Time-Fixed Covariates\n\nShow R codebmt1 = \n bmt1 |> \n mutate(surv = Surv(t2,d3))\n\nbmt_coxph_TF = coxph(\n formula = surv ~ group + `patient age`*`donor age` + FAB,\n data = bmt1)\nsummary(bmt_coxph_TF)\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` * `donor age` + \n#> FAB, data = bmt1)\n#> \n#> n= 137, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.090648 0.335999 0.354279 -3.08 0.00208 ** \n#> groupHigh Risk AML -0.403905 0.667707 0.362777 -1.11 0.26555 \n#> `patient age` -0.081639 0.921605 0.036107 -2.26 0.02376 * \n#> `donor age` -0.084587 0.918892 0.030097 -2.81 0.00495 ** \n#> FABGrade 4 Or 5 (AML only) 0.837416 2.310388 0.278464 3.01 0.00264 ** \n#> `patient age`:`donor age` 0.003159 1.003164 0.000951 3.32 0.00089 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.336 2.976 0.168 0.673\n#> groupHigh Risk AML 0.668 1.498 0.328 1.360\n#> `patient age` 0.922 1.085 0.859 0.989\n#> `donor age` 0.919 1.088 0.866 0.975\n#> FABGrade 4 Or 5 (AML only) 2.310 0.433 1.339 3.988\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.665 (se = 0.033 )\n#> Likelihood ratio test= 32.8 on 6 df, p=1e-05\n#> Wald test = 33 on 6 df, p=1e-05\n#> Score (logrank) test = 35.8 on 6 df, p=3e-06\ndrop1(bmt_coxph_TF, test = \"Chisq\")\n\n\n \n\n\n\n\nShow R codebmt1$mres = \n bmt_coxph_TF |> \n update(. ~ . 
- `donor age`) |> \n residuals(type=\"martingale\")\n\nbmt1 |> \n ggplot(aes(x = `donor age`, y = mres)) +\n geom_point() +\n geom_smooth(method = \"loess\", aes(col = \"loess\")) +\n geom_smooth(method = 'lm', aes(col = \"lm\")) +\n theme_classic() +\n xlab(\"Donor age\") +\n ylab(\"Martingale Residuals\") +\n guides(col=guide_legend(title = \"\"))\n\n\nMartingale residuals for Donor age\n\n\n\n\nA more complex functional form for donor age seems warranted; left as an exercise for the reader.\nNow we will add the time-varying covariates:\n\nShow R code\n# add counting process formulation of Surv():\nbmt2 = \n bmt2 |> \n mutate(\n surv = \n Surv(\n time = tstart,\n time2 = tstop,\n event = status,\n type = \"counting\"))\n\n\nLet’s see how the data looks for patient 15:\n\nShow R codebmt1 |> dplyr::filter(id == 15) |> dplyr::select(tp, dp, tc,dc, ta, da, FAB, surv, t1, d1, t2, d2, d3)\n\n\n \n\n\nShow R codebmt2 |> dplyr::filter(id == 15) |> dplyr::select(id, agvhd, cgvhd, precovery, surv)\n\n\n \n\n\n\n\n7.8.6 Model with Time-Dependent Covariates\n\nShow R codebmt_coxph_TV = coxph(\n formula = \n surv ~ \n group + `patient age`*`donor age` + FAB + agvhd + cgvhd + precovery,\n data = bmt2)\n\nsummary(bmt_coxph_TV)\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` * `donor age` + \n#> FAB + agvhd + cgvhd + precovery, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.038514 0.353980 0.358220 -2.90 0.0037 **\n#> groupHigh Risk AML -0.380481 0.683533 0.374867 -1.01 0.3101 \n#> `patient age` -0.073351 0.929275 0.035956 -2.04 0.0413 * \n#> `donor age` -0.076406 0.926440 0.030196 -2.53 0.0114 * \n#> FABGrade 4 Or 5 (AML only) 0.805700 2.238263 0.284273 2.83 0.0046 **\n#> agvhd 0.150565 1.162491 0.306848 0.49 0.6237 \n#> cgvhd -0.116136 0.890354 0.289046 -0.40 0.6878 \n#> precovery -0.941123 0.390190 0.347861 -2.71 0.0068 **\n#> `patient age`:`donor age` 0.002895 1.002899 0.000944 3.07 0.0022 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.354 2.825 0.175 0.714\n#> groupHigh Risk AML 0.684 1.463 0.328 1.425\n#> `patient age` 0.929 1.076 0.866 0.997\n#> `donor age` 0.926 1.079 0.873 0.983\n#> FABGrade 4 Or 5 (AML only) 2.238 0.447 1.282 3.907\n#> agvhd 1.162 0.860 0.637 2.121\n#> cgvhd 0.890 1.123 0.505 1.569\n#> precovery 0.390 2.563 0.197 0.772\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.702 (se = 0.028 )\n#> Likelihood ratio test= 40.3 on 9 df, p=7e-06\n#> Wald test = 42.4 on 9 df, p=3e-06\n#> Score (logrank) test = 47.2 on 9 df, p=4e-07\n\n\nPlatelet recovery is highly significant.\nNeither acute GVHD (agvhd) nor chronic GVHD (cgvhd) has a statistically significant effect here, nor are they significant in models with the other one removed.\n\nShow R code\nupdate(bmt_coxph_TV, .~.-agvhd) |> summary()\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` + `donor age` + \n#> FAB + cgvhd + precovery + `patient age`:`donor age`, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.049870 0.349983 0.356727 -2.94 0.0032 **\n#> groupHigh Risk AML -0.417049 0.658988 0.365348 -1.14 0.2537 \n#> `patient age` -0.070749 0.931696 0.035477 -1.99 0.0461 * \n#> `donor age` -0.075693 0.927101 0.030075 -2.52 0.0118 * \n#> FABGrade 4 Or 5 (AML only) 0.807035 2.241253 0.283437 2.85 0.0044 **\n#> cgvhd -0.095393 0.909015 0.285979 -0.33 0.7387 \n#> precovery -0.983653 0.373942 0.338170 -2.91 0.0036 **\n#> `patient age`:`donor age` 0.002859 1.002863 0.000936 3.05 0.0023 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.350 2.857 0.174 0.704\n#> groupHigh Risk AML 0.659 1.517 0.322 1.349\n#> `patient age` 0.932 1.073 0.869 0.999\n#> `donor age` 0.927 1.079 0.874 0.983\n#> FABGrade 4 Or 5 (AML only) 2.241 0.446 1.286 3.906\n#> cgvhd 0.909 1.100 0.519 1.592\n#> precovery 0.374 2.674 0.193 0.726\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.701 (se = 0.027 )\n#> Likelihood ratio test= 40 on 8 df, p=3e-06\n#> Wald test = 42.4 on 8 df, p=1e-06\n#> Score (logrank) test = 47.2 on 8 df, p=1e-07\nupdate(bmt_coxph_TV, .~.-cgvhd) |> summary()\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` + `donor age` + \n#> FAB + agvhd + precovery + `patient age`:`donor age`, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.019638 0.360725 0.355311 -2.87 0.0041 **\n#> groupHigh Risk AML -0.381356 0.682935 0.374568 -1.02 0.3086 \n#> `patient age` -0.073189 0.929426 0.035890 -2.04 0.0414 * \n#> `donor age` -0.076753 0.926118 0.030121 -2.55 0.0108 * \n#> FABGrade 4 Or 5 (AML only) 0.811716 2.251769 0.284012 2.86 0.0043 **\n#> agvhd 0.131621 1.140676 0.302623 0.43 0.6636 \n#> precovery -0.946697 0.388021 0.347265 -2.73 0.0064 **\n#> `patient age`:`donor age` 0.002904 1.002908 0.000943 3.08 0.0021 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.361 2.772 0.180 0.724\n#> groupHigh Risk AML 0.683 1.464 0.328 1.423\n#> `patient age` 0.929 1.076 0.866 0.997\n#> `donor age` 0.926 1.080 0.873 0.982\n#> FABGrade 4 Or 5 (AML only) 2.252 0.444 1.291 3.929\n#> agvhd 1.141 0.877 0.630 2.064\n#> precovery 0.388 2.577 0.196 0.766\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.701 (se = 0.027 )\n#> Likelihood ratio test= 40.1 on 8 df, p=3e-06\n#> Wald test = 42.1 on 8 df, p=1e-06\n#> Score (logrank) test = 47.1 on 8 df, p=1e-07\n\n\nLet’s drop them both:\n\nShow R codebmt_coxph_TV2 = update(bmt_coxph_TV, . ~ . - agvhd -cgvhd)\nbmt_coxph_TV2 |> summary()\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` + `donor age` + \n#> FAB + precovery + `patient age`:`donor age`, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.032520 0.356108 0.353202 -2.92 0.0035 **\n#> groupHigh Risk AML -0.413888 0.661075 0.365209 -1.13 0.2571 \n#> `patient age` -0.070965 0.931495 0.035453 -2.00 0.0453 * \n#> `donor age` -0.076052 0.926768 0.030007 -2.53 0.0113 * \n#> FABGrade 4 Or 5 (AML only) 0.811926 2.252242 0.283231 2.87 0.0041 **\n#> precovery -0.983505 0.373998 0.337997 -2.91 0.0036 **\n#> `patient age`:`donor age` 0.002872 1.002876 0.000936 3.07 0.0021 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.356 2.808 0.178 0.712\n#> groupHigh Risk AML 0.661 1.513 0.323 1.352\n#> `patient age` 0.931 1.074 0.869 0.999\n#> `donor age` 0.927 1.079 0.874 0.983\n#> FABGrade 4 Or 5 (AML only) 2.252 0.444 1.293 3.924\n#> precovery 0.374 2.674 0.193 0.725\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.7 (se = 0.027 )\n#> Likelihood ratio test= 39.9 on 7 df, p=1e-06\n#> Wald test = 42.2 on 7 df, p=5e-07\n#> Score (logrank) test = 47.1 on 7 df, p=5e-08", + "text": "7.8 Time-varying covariates\n(adapted from Klein, Moeschberger, et al. (2003), §9.2)\n\n7.8.1 Motivating example: back to the leukemia dataset\n\n# load the data:\ndata(bmt, package = 'KMsurv')\nbmt |> as_tibble() |> print(n = 5)\n#> # A tibble: 137 × 22\n#> group t1 t2 d1 d2 d3 ta da tc dc tp dp z1\n#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>\n#> 1 1 2081 2081 0 0 0 67 1 121 1 13 1 26\n#> 2 1 1602 1602 0 0 0 1602 0 139 1 18 1 21\n#> 3 1 1496 1496 0 0 0 1496 0 307 1 12 1 26\n#> 4 1 1462 1462 0 0 0 70 1 95 1 13 1 17\n#> 5 1 1433 1433 0 0 0 1433 0 236 1 12 1 32\n#> # ℹ 132 more rows\n#> # ℹ 9 more variables: z2 <int>, z3 <int>, z4 <int>, z5 <int>, z6 <int>,\n#> # z7 <int>, z8 <int>, z9 <int>, z10 <int>\n\nThis dataset comes from the Copelan et al. (1991) study of allogenic bone marrow transplant therapy for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).\nOutcomes (endpoints)\n\nThe main endpoint is disease-free survival (t2 and d3) for the three risk groups, “ALL”, “AML Low Risk”, and “AML High Risk”.\nPossible intermediate events\n\ngraft vs. 
host disease (GVHD), an immunological rejection response to the transplant (bad)\nacute (AGVHD)\nchronic (CGVHD)\nplatelet recovery, a return of platelet count to normal levels (good)\n\nOne or the other, both in either order, or neither may occur.\nCovariates\n\nWe are interested in possibly using the covariates z1-z10 to adjust for other factors.\nIn addition, the time-varying covariates for acute GVHD, chronic GVHD, and platelet recovery may be useful.\nPreprocessing\nWe reformat the data before analysis:\n\nShow R code# reformat the data:\nbmt1 = \n bmt |> \n as_tibble() |> \n mutate(\n id = 1:n(), # will be used to connect multiple records for the same individual\n \n group = group |> \n case_match(\n 1 ~ \"ALL\",\n 2 ~ \"Low Risk AML\",\n 3 ~ \"High Risk AML\") |> \n factor(levels = c(\"ALL\", \"Low Risk AML\", \"High Risk AML\")),\n \n `patient age` = z1,\n \n `donor age` = z2,\n \n `patient sex` = z3 |> \n case_match(\n 0 ~ \"Female\",\n 1 ~ \"Male\"),\n \n `donor sex` = z4 |> \n case_match(\n 0 ~ \"Female\",\n 1 ~ \"Male\"),\n \n `Patient CMV Status` = z5 |> \n case_match(\n 0 ~ \"CMV Negative\",\n 1 ~ \"CMV Positive\"),\n \n `Donor CMV Status` = z6 |> \n case_match(\n 0 ~ \"CMV Negative\",\n 1 ~ \"CMV Positive\"),\n \n `Waiting Time to Transplant` = z7,\n \n FAB = z8 |> \n case_match(\n 1 ~ \"Grade 4 Or 5 (AML only)\",\n 0 ~ \"Other\") |> \n factor() |> \n relevel(ref = \"Other\"),\n \n hospital = z9 |> # `z9` is hospital\n case_match(\n 1 ~ \"Ohio State University\",\n 2 ~ \"Alfred\",\n 3 ~ \"St. Vincent\",\n 4 ~ \"Hahnemann\") |> \n factor() |> \n relevel(ref = \"Ohio State University\"),\n \n MTX = (z10 == 1) # a prophylactic treatment for GVHD\n \n ) |> \n select(-(z1:z10)) # don't need these anymore\n\nbmt1 |> \n select(group, id:MTX) |> \n print(n = 10)\n#> # A tibble: 137 × 12\n#> group id `patient age` `donor age` `patient sex` `donor sex`\n#> <fct> <int> <int> <int> <chr> <chr> \n#> 1 ALL 1 26 33 Male Female \n#> 2 ALL 2 21 37 Male Male \n#> 3 ALL 3 26 35 Male Male \n#> 4 ALL 4 17 21 Female Male \n#> 5 ALL 5 32 36 Male Male \n#> 6 ALL 6 22 31 Male Male \n#> 7 ALL 7 20 17 Male Female \n#> 8 ALL 8 22 24 Male Female \n#> 9 ALL 9 18 21 Female Male \n#> 10 ALL 10 24 40 Male Male \n#> # ℹ 127 more rows\n#> # ℹ 6 more variables: `Patient CMV Status` <chr>, `Donor CMV Status` <chr>,\n#> # `Waiting Time to Transplant` <int>, FAB <fct>, hospital <fct>, MTX <lgl>\n\n\n\n7.8.2 Time-Dependent Covariates\n\nA time-dependent covariate (“TDC”) is a covariate whose value changes during the course of the study.\nFor variables like age that change in a linear manner with time, we can just use the value at the start.\nBut it may be plausible that when and if GVHD occurs, the risk of relapse or death increases, and when and if platelet recovery occurs, the risk decreases.\n\n7.8.3 Analysis in R\n\nWe form a variable precovery, which is 0 before platelet recovery and 1 after platelet recovery, if it occurs.\nFor each subject where platelet recovery occurs, we set up multiple records (lines in the data frame); for example, one from t = 0 to the time of platelet recovery, and one from that time to relapse, death, or censoring.\nWe do the same for acute GVHD and chronic GVHD.\nFor each record, the covariates are constant.\n\n\nShow R code\nbmt2 = bmt1 |> \n #set up new long-format data set:\n tmerge(bmt1, id = id, tstop = t2) |> \n \n # the following three steps can be in any order, \n # and will still produce the same result:\n #add agvhd as tdc:\n tmerge(bmt1, id = id, agvhd = tdc(ta)) |> 
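\n # tdc(ta) creates a time-dependent 0/1 covariate: it is 0 on\n # intervals before time ta and 1 from ta onward, and tmerge()\n # splits each subject's follow-up at ta as needed (see ?survival::tmerge)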
\n #add cgvhd as tdc:\n tmerge(bmt1, id = id, cgvhd = tdc(tc)) |> \n #add platelet recovery as tdc:\n tmerge(bmt1, id = id, precovery = tdc(tp)) \n\nbmt2 = bmt2 |> \n as_tibble() |> \n mutate(status = as.numeric((tstop == t2) & d3))\n# status = 1 only at the end of follow-up (tstop == t2), and only if the event occurred (d3 == 1)\n\n\nLet’s see how we’ve rearranged the first row of the data:\n\nShow R code\nbmt1 |> \n dplyr::filter(id == 1) |> \n dplyr::select(id, t1, d1, t2, d2, d3, ta, da, tc, dc, tp, dp)\n\n\n \n\n\n\nThe event times for this individual are:\n\n\nt = 0 time of transplant\n\ntp = 13 platelet recovery\n\nta = 67 acute GVHD onset\n\ntc = 121 chronic GVHD onset\n\nt2 = 2081 end of study, patient not relapsed or dead\n\nAfter converting the data to long-format, we have:\n\nShow R codebmt2 |> \n select(\n id,\n tstart,\n tstop,\n agvhd,\n cgvhd,\n precovery,\n status\n ) |> \n dplyr::filter(id == 1)\n\n\n \n\n\n\nNote that status could have been 1 on the last row, indicating that relapse or death occurred; since it is 0 here, the participant must have exited the study without experiencing relapse or death (i.e., they were censored).\n\n7.8.4 Event sequences\nLet:\n\nA = acute GVHD\nC = chronic GVHD\nP = platelet recovery\n\nEach of the eight possible combinations of A or not-A, with C or not-C, with P or not-P occurs in this data set.\n\nA always occurs before C, and P always occurs before C, if both occur.\nThus there are ten event sequences in the data set: None, A, C, P, AC, AP, PA, PC, APC, and PAC.\nIn general, there could be as many as \(1+3+(3)(2)+6=16\) sequences (no events; three single events; six ordered pairs; six ordered triples), but our domain knowledge tells us that six are missing: CA, CP, ACP, CAP, CPA, and PCA (all the orderings in which C would precede A or P).\nDifferent subjects could have 1, 2, 3, or 4 intervals, depending on which of acute GVHD, chronic GVHD, and/or platelet recovery occurred.\nThe final interval for any subject has status = 1 if the subject relapsed or died at that time; otherwise status = 0.\nAny earlier intervals have status = 0.\nEven though there might be multiple lines per ID in the dataset, there is never more than one event, so no alterations need be made in the estimation procedures or in the interpretation of the output.\nThe function tmerge in the survival package eases the process of constructing the new long-format dataset.\n\n7.8.5 Model with Time-Fixed Covariates\n\nShow R codebmt1 = \n bmt1 |> \n mutate(surv = Surv(t2,d3))\n\nbmt_coxph_TF = coxph(\n formula = surv ~ group + `patient age`*`donor age` + FAB,\n data = bmt1)\nsummary(bmt_coxph_TF)\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` * `donor age` + \n#> FAB, data = bmt1)\n#> \n#> n= 137, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.090648 0.335999 0.354279 -3.08 0.00208 ** \n#> groupHigh Risk AML -0.403905 0.667707 0.362777 -1.11 0.26555 \n#> `patient age` -0.081639 0.921605 0.036107 -2.26 0.02376 * \n#> `donor age` -0.084587 0.918892 0.030097 -2.81 0.00495 ** \n#> FABGrade 4 Or 5 (AML only) 0.837416 2.310388 0.278464 3.01 0.00264 ** \n#> `patient age`:`donor age` 0.003159 1.003164 0.000951 3.32 0.00089 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.336 2.976 0.168 0.673\n#> groupHigh Risk AML 0.668 1.498 0.328 1.360\n#> `patient age` 0.922 1.085 0.859 0.989\n#> `donor age` 0.919 1.088 0.866 0.975\n#> FABGrade 4 Or 5 (AML only) 2.310 0.433 1.339 3.988\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.665 (se = 0.033 )\n#> Likelihood ratio test= 32.8 on 6 df, p=1e-05\n#> Wald test = 33 on 6 df, p=1e-05\n#> Score (logrank) test = 35.8 on 6 df, p=3e-06\ndrop1(bmt_coxph_TF, test = \"Chisq\")\n\n\n \n\n\n\n\nShow R codebmt1$mres = \n bmt_coxph_TF |> \n update(. ~ . - `donor age`) |> \n residuals(type=\"martingale\")\n\nbmt1 |> \n ggplot(aes(x = `donor age`, y = mres)) +\n geom_point() +\n geom_smooth(method = \"loess\", aes(col = \"loess\")) +\n geom_smooth(method = 'lm', aes(col = \"lm\")) +\n theme_classic() +\n xlab(\"Donor age\") +\n ylab(\"Martingale Residuals\") +\n guides(col=guide_legend(title = \"\"))\n\n\nMartingale residuals for Donor age\n\n\n\n\nA more complex functional form for donor age seems warranted; left as an exercise for the reader.\nNow we will add the time-varying covariates:\n\nShow R code\n# add counting process formulation of Surv():\nbmt2 = \n bmt2 |> \n mutate(\n surv = \n Surv(\n time = tstart,\n time2 = tstop,\n event = status,\n type = \"counting\"))\n\n\nLet’s see how the data looks for patient 15:\n\nShow R codebmt1 |> dplyr::filter(id == 15) |> dplyr::select(tp, dp, tc,dc, ta, da, FAB, surv, t1, d1, t2, d2, d3)\n\n\n \n\n\nShow R codebmt2 |> dplyr::filter(id == 15) |> dplyr::select(id, agvhd, cgvhd, precovery, surv)\n\n\n \n\n\n\n\n7.8.6 Model with Time-Dependent Covariates\n\nShow R codebmt_coxph_TV = coxph(\n formula = \n surv ~ \n group + `patient age`*`donor age` + FAB + agvhd + cgvhd + precovery,\n data = bmt2)\n\nsummary(bmt_coxph_TV)\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` * `donor age` + \n#> FAB + agvhd + cgvhd + precovery, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.038514 0.353980 0.358220 -2.90 0.0037 **\n#> groupHigh Risk AML -0.380481 0.683533 0.374867 -1.01 0.3101 \n#> `patient age` -0.073351 0.929275 0.035956 -2.04 0.0413 * \n#> `donor age` -0.076406 0.926440 0.030196 -2.53 0.0114 * \n#> FABGrade 4 Or 5 (AML only) 0.805700 2.238263 0.284273 2.83 0.0046 **\n#> agvhd 0.150565 1.162491 0.306848 0.49 0.6237 \n#> cgvhd -0.116136 0.890354 0.289046 -0.40 0.6878 \n#> precovery -0.941123 0.390190 0.347861 -2.71 0.0068 **\n#> `patient age`:`donor age` 0.002895 1.002899 0.000944 3.07 0.0022 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.354 2.825 0.175 0.714\n#> groupHigh Risk AML 0.684 1.463 0.328 1.425\n#> `patient age` 0.929 1.076 0.866 0.997\n#> `donor age` 0.926 1.079 0.873 0.983\n#> FABGrade 4 Or 5 (AML only) 2.238 0.447 1.282 3.907\n#> agvhd 1.162 0.860 0.637 2.121\n#> cgvhd 0.890 1.123 0.505 1.569\n#> precovery 0.390 2.563 0.197 0.772\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.702 (se = 0.028 )\n#> Likelihood ratio test= 40.3 on 9 df, p=7e-06\n#> Wald test = 42.4 on 9 df, p=3e-06\n#> Score (logrank) test = 47.2 on 9 df, p=4e-07\n\n\nPlatelet recovery is highly significant.\nNeither acute GVHD (agvhd) nor chronic GVHD (cgvhd) has a statistically significant effect here, nor are they significant in models with the other one removed.\n\nShow R code\nupdate(bmt_coxph_TV, .~.-agvhd) |> summary()\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` + `donor age` + \n#> FAB + cgvhd + precovery + `patient age`:`donor age`, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.049870 0.349983 0.356727 -2.94 0.0032 **\n#> groupHigh Risk AML -0.417049 0.658988 0.365348 -1.14 0.2537 \n#> `patient age` -0.070749 0.931696 0.035477 -1.99 0.0461 * \n#> `donor age` -0.075693 0.927101 0.030075 -2.52 0.0118 * \n#> FABGrade 4 Or 5 (AML only) 0.807035 2.241253 0.283437 2.85 0.0044 **\n#> cgvhd -0.095393 0.909015 0.285979 -0.33 0.7387 \n#> precovery -0.983653 0.373942 0.338170 -2.91 0.0036 **\n#> `patient age`:`donor age` 0.002859 1.002863 0.000936 3.05 0.0023 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.350 2.857 0.174 0.704\n#> groupHigh Risk AML 0.659 1.517 0.322 1.349\n#> `patient age` 0.932 1.073 0.869 0.999\n#> `donor age` 0.927 1.079 0.874 0.983\n#> FABGrade 4 Or 5 (AML only) 2.241 0.446 1.286 3.906\n#> cgvhd 0.909 1.100 0.519 1.592\n#> precovery 0.374 2.674 0.193 0.726\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.701 (se = 0.027 )\n#> Likelihood ratio test= 40 on 8 df, p=3e-06\n#> Wald test = 42.4 on 8 df, p=1e-06\n#> Score (logrank) test = 47.2 on 8 df, p=1e-07\nupdate(bmt_coxph_TV, .~.-cgvhd) |> summary()\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` + `donor age` + \n#> FAB + agvhd + precovery + `patient age`:`donor age`, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.019638 0.360725 0.355311 -2.87 0.0041 **\n#> groupHigh Risk AML -0.381356 0.682935 0.374568 -1.02 0.3086 \n#> `patient age` -0.073189 0.929426 0.035890 -2.04 0.0414 * \n#> `donor age` -0.076753 0.926118 0.030121 -2.55 0.0108 * \n#> FABGrade 4 Or 5 (AML only) 0.811716 2.251769 0.284012 2.86 0.0043 **\n#> agvhd 0.131621 1.140676 0.302623 0.43 0.6636 \n#> precovery -0.946697 0.388021 0.347265 -2.73 0.0064 **\n#> `patient age`:`donor age` 0.002904 1.002908 0.000943 3.08 0.0021 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.361 2.772 0.180 0.724\n#> groupHigh Risk AML 0.683 1.464 0.328 1.423\n#> `patient age` 0.929 1.076 0.866 0.997\n#> `donor age` 0.926 1.080 0.873 0.982\n#> FABGrade 4 Or 5 (AML only) 2.252 0.444 1.291 3.929\n#> agvhd 1.141 0.877 0.630 2.064\n#> precovery 0.388 2.577 0.196 0.766\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.701 (se = 0.027 )\n#> Likelihood ratio test= 40.1 on 8 df, p=3e-06\n#> Wald test = 42.1 on 8 df, p=1e-06\n#> Score (logrank) test = 47.1 on 8 df, p=1e-07\n\n\nLet’s drop them both:\n\nShow R codebmt_coxph_TV2 = update(bmt_coxph_TV, . ~ . - agvhd -cgvhd)\nbmt_coxph_TV2 |> summary()\n#> Call:\n#> coxph(formula = surv ~ group + `patient age` + `donor age` + \n#> FAB + precovery + `patient age`:`donor age`, data = bmt2)\n#> \n#> n= 341, number of events= 83 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> groupLow Risk AML -1.032520 0.356108 0.353202 -2.92 0.0035 **\n#> groupHigh Risk AML -0.413888 0.661075 0.365209 -1.13 0.2571 \n#> `patient age` -0.070965 0.931495 0.035453 -2.00 0.0453 * \n#> `donor age` -0.076052 0.926768 0.030007 -2.53 0.0113 * \n#> FABGrade 4 Or 5 (AML only) 0.811926 2.252242 0.283231 2.87 0.0041 **\n#> precovery -0.983505 0.373998 0.337997 -2.91 0.0036 **\n#> `patient age`:`donor age` 0.002872 1.002876 0.000936 3.07 0.0021 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> groupLow Risk AML 0.356 2.808 0.178 0.712\n#> groupHigh Risk AML 0.661 1.513 0.323 1.352\n#> `patient age` 0.931 1.074 0.869 0.999\n#> `donor age` 0.927 1.079 0.874 0.983\n#> FABGrade 4 Or 5 (AML only) 2.252 0.444 1.293 3.924\n#> precovery 0.374 2.674 0.193 0.725\n#> `patient age`:`donor age` 1.003 0.997 1.001 1.005\n#> \n#> Concordance= 0.7 (se = 0.027 )\n#> Likelihood ratio test= 39.9 on 7 df, p=1e-06\n#> Wald test = 42.2 on 7 df, p=5e-07\n#> Score (logrank) test = 47.1 on 7 df, p=5e-08", "crumbs": [ "Time to Event Models", "7  Building Cox Proportional Hazards models" @@ -764,7 +764,7 @@ "href": "coxph-model-building.html#recurrent-events", "title": "\n7  Building Cox Proportional Hazards models\n", "section": "\n7.9 Recurrent Events", - "text": "7.9 Recurrent Events\n(Adapted from Kleinbaum and Klein, Ch 8)\n\nSometimes an appropriate analysis requires consideration of recurrent events.\nA patient with arthritis may have more than one flareup. The same is true of many recurring-remitting diseases.\nIn this case, we have more than one line in the data frame, but each line may have an event.\nWe have to use a “robust” variance estimator to account for correlation of time-to-events within a patient.\n\n\n7.9.1 Bladder Cancer Data Set\nThe bladder cancer dataset from Kleinbaum and Klein contains recurrent event outcome information for eighty-six cancer patients followed for the recurrence of bladder cancer tumor after transurethral surgical excision (Byar and Green 1980). The exposure of interest is the effect of the drug treatment of thiotepa. Control variables are the initial number and initial size of tumors. The data layout is suitable for a counting processes approach.\nThis drug is still a possible choice for some patients. 
Another therapeutic choice is Bacillus Calmette-Guerin (BCG), a live bacterium related to cow tuberculosis.\nData dictionary\n\nVariables in the bladder dataset\n\nVariable\nDefinition\n\n\n\nid\nPatient unique ID\n\n\nstatus\nfor each time interval: 1 = recurred, 0 = censored\n\n\ninterval\n1 = first recurrence, etc.\n\n\nintime\n`tstop - tstart (all times in months)\n\n\ntstart\nstart of interval\n\n\ntstop\nend of interval\n\n\ntx\ntreatment code, 1 = thiotepa\n\n\nnum\nnumber of initial tumors\n\n\nsize\nsize of initial tumors (cm)\n\n\n\n\nThere are 85 patients and 190 lines in the dataset, meaning that many patients have more than one line.\nPatient 1 with 0 observation time was removed.\nOf the 85 patients, 47 had at least one recurrence and 38 had none.\n18 patients had exactly one recurrence.\nThere were up to 4 recurrences in a patient.\nOf the 190 intervals, 112 terminated with a recurrence and 78 were censored.\nDifferent intervals for the same patient are correlated.\n\nIs the effective sample size 47 or 112? This might narrow confidence intervals by as much as a factor of \\(\\sqrt{112/47}=1.54\\)\nWhat happens if I have 5 treatment and 5 control values and want to do a t-test and I then duplicate the 10 values as if the sample size was 20? This falsely narrows confidence intervals by a factor of \\(\\sqrt{2}=1.41\\).\n\n\nShow R codebladder = \n paste0(\n \"http://web1.sph.emory.edu/dkleinb/allDatasets\",\n \"/surv2datasets/bladder.dta\") |> \n read_dta() |> \n as_tibble()\n\nbladder = bladder[-1,] #remove subject with 0 observation time\nprint(bladder)\n\n\n\nShow R code\nbladder = \n bladder |> \n mutate(\n surv = \n Surv(\n time = start,\n time2 = stop,\n event = event,\n type = \"counting\"))\n\nbladder.cox1 = coxph(\n formula = surv~tx+num+size,\n data = bladder)\n\n#results with biased variance-covariance matrix:\nsummary(bladder.cox1)\n#> Call:\n#> coxph(formula = surv ~ tx + num + size, data = bladder)\n#> \n#> n= 190, number of events= 112 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> tx -0.4116 0.6626 0.1999 -2.06 0.03947 * \n#> num 0.1637 1.1778 0.0478 3.43 0.00061 ***\n#> size -0.0411 0.9598 0.0703 -0.58 0.55897 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> tx 0.663 1.509 0.448 0.98\n#> num 1.178 0.849 1.073 1.29\n#> size 0.960 1.042 0.836 1.10\n#> \n#> Concordance= 0.624 (se = 0.032 )\n#> Likelihood ratio test= 14.7 on 3 df, p=0.002\n#> Wald test = 15.9 on 3 df, p=0.001\n#> Score (logrank) test = 16.2 on 3 df, p=0.001\n\n\n\n\n\n\n\n\nNote\n\n\n\nThe likelihood ratio and score tests assume independence of observations within a cluster. The Wald and robust score tests do not.\n\n\nadding cluster = id\n\nIf we add cluster= id to the call to coxph, the coefficient estimates don’t change, but we get an additional column in the summary() output: robust se:\n\nShow R code\nbladder.cox2 = coxph(\n formula = surv ~ tx + num + size,\n cluster = id,\n data = bladder)\n\n#unbiased though this reduces power:\nsummary(bladder.cox2)\n#> Call:\n#> coxph(formula = surv ~ tx + num + size, data = bladder, cluster = id)\n#> \n#> n= 190, number of events= 112 \n#> \n#> coef exp(coef) se(coef) robust se z Pr(>|z|) \n#> tx -0.4116 0.6626 0.1999 0.2488 -1.65 0.0980 . \n#> num 0.1637 1.1778 0.0478 0.0584 2.80 0.0051 **\n#> size -0.0411 0.9598 0.0703 0.0742 -0.55 0.5799 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> tx 0.663 1.509 0.407 1.08\n#> num 1.178 0.849 1.050 1.32\n#> size 0.960 1.042 0.830 1.11\n#> \n#> Concordance= 0.624 (se = 0.031 )\n#> Likelihood ratio test= 14.7 on 3 df, p=0.002\n#> Wald test = 11.2 on 3 df, p=0.01\n#> Score (logrank) test = 16.2 on 3 df, p=0.001, Robust = 10.8 p=0.01\n#> \n#> (Note: the likelihood ratio and score tests assume independence of\n#> observations within a cluster, the Wald and robust score tests do not).\n\n\nrobust se is larger than se, and accounts for the repeated observations from the same individuals:\n\nShow R coderound(bladder.cox2$naive.var, 4)\n#> [,1] [,2] [,3]\n#> [1,] 0.0400 -0.0014 0.0000\n#> [2,] -0.0014 0.0023 0.0007\n#> [3,] 0.0000 0.0007 0.0049\nround(bladder.cox2$var, 4)\n#> [,1] [,2] [,3]\n#> [1,] 0.0619 -0.0026 -0.0004\n#> [2,] -0.0026 0.0034 0.0013\n#> [3,] -0.0004 0.0013 0.0055\n\n\nThese are the ratios of correct confidence intervals to naive ones:\n\nShow R codewith(bladder.cox2, diag(var)/diag(naive.var)) |> sqrt()\n#> [1] 1.244 1.223 1.056\n\n\nWe might try dropping the non-significant size variable:\n\nShow R code#remove non-significant size variable:\nbladder.cox3 = bladder.cox2 |> update(. ~ . - size)\nsummary(bladder.cox3)\n#> Call:\n#> coxph(formula = surv ~ tx + num, data = bladder, cluster = id)\n#> \n#> n= 190, number of events= 112 \n#> \n#> coef exp(coef) se(coef) robust se z Pr(>|z|) \n#> tx -0.4117 0.6625 0.2003 0.2515 -1.64 0.1017 \n#> num 0.1700 1.1853 0.0465 0.0564 3.02 0.0026 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> tx 0.663 1.509 0.405 1.08\n#> num 1.185 0.844 1.061 1.32\n#> \n#> Concordance= 0.623 (se = 0.031 )\n#> Likelihood ratio test= 14.3 on 2 df, p=8e-04\n#> Wald test = 10.2 on 2 df, p=0.006\n#> Score (logrank) test = 15.8 on 2 df, p=4e-04, Robust = 10.6 p=0.005\n#> \n#> (Note: the likelihood ratio and score tests assume independence of\n#> observations within a cluster, the Wald and robust score tests do not).\n\n\n\nWays to check PH assumption:\n\ncloglog\nschoenfeld residuals\ninteraction with time", + "text": "7.9 Recurrent Events\n(Adapted from Kleinbaum and Klein (2012), Ch 8)\n\nSometimes an appropriate analysis requires consideration of recurrent events.\nA patient with arthritis may have more than one flareup. The same is true of many recurring-remitting diseases.\nIn this case, we have more than one line in the data frame, but each line may have an event.\nWe have to use a “robust” variance estimator to account for correlation of time-to-events within a patient.\n\n\n7.9.1 Bladder Cancer Data Set\nThe bladder cancer dataset from Kleinbaum and Klein (2012) contains recurrent event outcome information for eighty-six cancer patients followed for the recurrence of bladder cancer tumor after transurethral surgical excision (Byar and Green 1980). The exposure of interest is the effect of the drug treatment of thiotepa. Control variables are the initial number and initial size of tumors. The data layout is suitable for a counting processes approach.\nThis drug is still a possible choice for some patients. 
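\n\nBefore loading the data, here is a minimal sketch (hypothetical numbers, not the bladder study) of why correlated repeats matter: if every observation is counted twice but treated as independent, the naive standard error shrinks by roughly \(\sqrt{2}\), exactly the kind of false precision that the robust (cluster) variance estimator used below protects against.\n\nShow R code# hypothetical data: duplicating observations exaggerates precision\nset.seed(42)\nx = rnorm(100) # 100 independent values\nx2 = rep(x, 2) # the same values, each entered twice\nse_true = sd(x) / sqrt(100) # correct SE of the mean\nse_dup = sd(x2) / sqrt(200) # naive SE pretending n = 200\nse_true / se_dup # close to sqrt(2) = 1.41\n\n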
Another therapeutic choice is Bacillus Calmette-Guérin (BCG), a live attenuated bacterium derived from Mycobacterium bovis, the cause of bovine (cow) tuberculosis.\nData dictionary\n\nVariables in the bladder dataset\n\nVariable\nDefinition\n\n\n\nid\nPatient unique ID\n\n\nstatus\nfor each time interval: 1 = recurred, 0 = censored\n\n\ninterval\n1 = first recurrence, etc.\n\n\nintime\ntstop - tstart (all times in months)\n\n\ntstart\nstart of interval\n\n\ntstop\nend of interval\n\n\ntx\ntreatment code, 1 = thiotepa\n\n\nnum\nnumber of initial tumors\n\n\nsize\nsize of initial tumors (cm)\n\n\n\n\nThere are 85 patients and 190 lines in the dataset, meaning that many patients have more than one line.\nPatient 1, with 0 observation time, was removed.\nOf the 85 patients, 47 had at least one recurrence and 38 had none.\n18 patients had exactly one recurrence.\nThere were up to 4 recurrences in a patient.\nOf the 190 intervals, 112 terminated with a recurrence and 78 were censored.\nDifferent intervals for the same patient are correlated.\n\nIs the effective sample size 47 or 112? Treating the 112 correlated intervals as independent might falsely narrow confidence intervals by as much as a factor of \(\sqrt{112/47}=1.54\).\nWhat happens if I have 5 treatment and 5 control values for a t-test, and I then duplicate the 10 values as if the sample size were 20? This falsely narrows confidence intervals by a factor of \(\sqrt{2}=1.41\).\n\n\nShow R codebladder = \n paste0(\n \"http://web1.sph.emory.edu/dkleinb/allDatasets\",\n \"/surv2datasets/bladder.dta\") |> \n read_dta() |> \n as_tibble()\n\nbladder = bladder[-1,] #remove subject with 0 observation time\nprint(bladder)\n\n\n\nShow R code\nbladder = \n bladder |> \n mutate(\n surv = \n Surv(\n time = start,\n time2 = stop,\n event = event,\n type = \"counting\"))\n\nbladder.cox1 = coxph(\n formula = surv~tx+num+size,\n data = bladder)\n\n#results with biased variance-covariance matrix:\nsummary(bladder.cox1)\n#> Call:\n#> coxph(formula = surv ~ tx + num + size, data = bladder)\n#> \n#> n= 190, number of events= 112 \n#> \n#> coef exp(coef) se(coef) z Pr(>|z|) \n#> tx -0.4116 0.6626 0.1999 -2.06 0.03947 * \n#> num 0.1637 1.1778 0.0478 3.43 0.00061 ***\n#> size -0.0411 0.9598 0.0703 -0.58 0.55897 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> tx 0.663 1.509 0.448 0.98\n#> num 1.178 0.849 1.073 1.29\n#> size 0.960 1.042 0.836 1.10\n#> \n#> Concordance= 0.624 (se = 0.032 )\n#> Likelihood ratio test= 14.7 on 3 df, p=0.002\n#> Wald test = 15.9 on 3 df, p=0.001\n#> Score (logrank) test = 16.2 on 3 df, p=0.001\n\n\n\n\n\n\n\n\nNote\n\n\n\nThe likelihood ratio and score tests assume independence of observations within a cluster. The Wald and robust score tests do not.\n\n\nadding cluster = id\n\nIf we add cluster = id to the call to coxph, the coefficient estimates don’t change, but we get an additional column in the summary() output: robust se:\n\nShow R code\nbladder.cox2 = coxph(\n formula = surv ~ tx + num + size,\n cluster = id,\n data = bladder)\n\n#unbiased, though this reduces power:\nsummary(bladder.cox2)\n#> Call:\n#> coxph(formula = surv ~ tx + num + size, data = bladder, cluster = id)\n#> \n#> n= 190, number of events= 112 \n#> \n#> coef exp(coef) se(coef) robust se z Pr(>|z|) \n#> tx -0.4116 0.6626 0.1999 0.2488 -1.65 0.0980 . \n#> num 0.1637 1.1778 0.0478 0.0584 2.80 0.0051 **\n#> size -0.0411 0.9598 0.0703 0.0742 -0.55 0.5799 \n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> tx 0.663 1.509 0.407 1.08\n#> num 1.178 0.849 1.050 1.32\n#> size 0.960 1.042 0.830 1.11\n#> \n#> Concordance= 0.624 (se = 0.031 )\n#> Likelihood ratio test= 14.7 on 3 df, p=0.002\n#> Wald test = 11.2 on 3 df, p=0.01\n#> Score (logrank) test = 16.2 on 3 df, p=0.001, Robust = 10.8 p=0.01\n#> \n#> (Note: the likelihood ratio and score tests assume independence of\n#> observations within a cluster, the Wald and robust score tests do not).\n\n\nrobust se is larger than se, and accounts for the repeated observations from the same individuals:\n\nShow R coderound(bladder.cox2$naive.var, 4)\n#> [,1] [,2] [,3]\n#> [1,] 0.0400 -0.0014 0.0000\n#> [2,] -0.0014 0.0023 0.0007\n#> [3,] 0.0000 0.0007 0.0049\nround(bladder.cox2$var, 4)\n#> [,1] [,2] [,3]\n#> [1,] 0.0619 -0.0026 -0.0004\n#> [2,] -0.0026 0.0034 0.0013\n#> [3,] -0.0004 0.0013 0.0055\n\n\nThese are the ratios of correct confidence intervals to naive ones:\n\nShow R codewith(bladder.cox2, diag(var)/diag(naive.var)) |> sqrt()\n#> [1] 1.244 1.223 1.056\n\n\nWe might try dropping the non-significant size variable:\n\nShow R code#remove non-significant size variable:\nbladder.cox3 = bladder.cox2 |> update(. ~ . - size)\nsummary(bladder.cox3)\n#> Call:\n#> coxph(formula = surv ~ tx + num, data = bladder, cluster = id)\n#> \n#> n= 190, number of events= 112 \n#> \n#> coef exp(coef) se(coef) robust se z Pr(>|z|) \n#> tx -0.4117 0.6625 0.2003 0.2515 -1.64 0.1017 \n#> num 0.1700 1.1853 0.0465 0.0564 3.02 0.0026 **\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> exp(coef) exp(-coef) lower .95 upper .95\n#> tx 0.663 1.509 0.405 1.08\n#> num 1.185 0.844 1.061 1.32\n#> \n#> Concordance= 0.623 (se = 0.031 )\n#> Likelihood ratio test= 14.3 on 2 df, p=8e-04\n#> Wald test = 10.2 on 2 df, p=0.006\n#> Score (logrank) test = 15.8 on 2 df, p=4e-04, Robust = 10.6 p=0.005\n#> \n#> (Note: the likelihood ratio and score tests assume independence of\n#> observations within a cluster, the Wald and robust score tests do not).\n\n\n\nWays to check PH assumption:\n\ncloglog\nschoenfeld residuals\ninteraction with time", "crumbs": [ "Time to Event Models", "7  Building Cox Proportional Hazards models" @@ -775,7 +775,7 @@ "href": "coxph-model-building.html#age-as-the-time-scale", "title": "\n7  Building Cox Proportional Hazards models\n", "section": "\n7.10 Age as the time scale", - "text": "7.10 Age as the time scale\nSee Canchola et al. (2003).\n\n\n\n\n\n\nCanchola, Alison J, Susan L Stewart, Leslie Bernstein, Dee W West, Ronald K Ross, Dennis Deapen, Richard Pinder, et al. 2003. “Cox Regression Using Different Time-Scales.” Western Users of SAS Software. https://www.lexjansen.com/wuss/2003/DataAnalysis/i-cox_time_scales.pdf.\n\n\nGrambsch, Patricia M, and Terry M Therneau. 1994. “Proportional Hazards Tests and Diagnostics Based on Weighted Residuals.” Biometrika 81 (3): 515–26. https://doi.org/10.1093/biomet/81.3.515.\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.", + "text": "7.10 Age as the time scale\nSee Canchola et al. (2003).\n\n\n\n\n\n\nCanchola, Alison J, Susan L Stewart, Leslie Bernstein, Dee W West, Ronald K Ross, Dennis Deapen, Richard Pinder, et al. 2003. “Cox Regression Using Different Time-Scales.” Western Users of SAS Software. 
https://www.lexjansen.com/wuss/2003/DataAnalysis/i-cox_time_scales.pdf.\n\n\nCopelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham, and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia with Allogeneic Bone Marrow Transplantation Following Preparation with BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838 .\n\n\nDobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.\n\n\nGrambsch, Patricia M, and Terry M Therneau. 1994. “Proportional Hazards Tests and Diagnostics Based on Weighted Residuals.” Biometrika 81 (3): 515–26. https://doi.org/10.1093/biomet/81.3.515.\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.\n\n\nKleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis a Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.", "crumbs": [ "Time to Event Models", "7  Building Cox Proportional Hazards models" @@ -819,7 +819,7 @@ "href": "references.html", "title": "References", "section": "", - "text": "Agresti, Alan. 2012. Categorical Data Analysis. Vol. 792. John\nWiley & Sons. https://www.wiley.com/en-us/Categorical+Data+Analysis%2C+3rd+Edition-p-9780470463635.\n\n\n———. 2018. An Introduction to Categorical Data Analysis. John\nWiley & Sons. https://www.wiley.com/en-us/An+Introduction+to+Categorical+Data+Analysis%2C+3rd+Edition-p-9781119405283.\n\n\nAnderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.”\nBulletin of American Iris Society 59: 2–5.\n\n\nBache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A\nForward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.\n\n\nBanerjee, Sudipto, and Anindya Roy. 2014. Linear Algebra and Matrix\nAnalysis for Statistics. Vol. 181. Crc Press Boca Raton.\n\n\nCanchola, Alison J, Susan L Stewart, Leslie Bernstein, Dee W West,\nRonald K Ross, Dennis Deapen, Richard Pinder, et al. 2003. “Cox\nRegression Using Different Time-Scales.” Western Users of SAS\nSoftware. https://www.lexjansen.com/wuss/2003/DataAnalysis/i-cox_time_scales.pdf.\n\n\nCasella, George, and Roger Berger. 2002. Statistical Inference.\n2nd ed. Cengage Learning. https://www.cengage.com/c/statistical-inference-2e-casella-berger/9780534243128/.\n\n\nChang, Winston. 2024. R Graphics Cookbook: Practical Recipes for\nVisualizing Data. O’Reilly Media. https://r-graphics.org/.\n\n\nChatterjee, Samprit, and Ali S Hadi. 2015. Regression Analysis by\nExample. John Wiley & Sons. https://www.wiley.com/en-us/Regression+Analysis+by+Example%2C+4th+Edition-p-9780470055458.\n\n\nClayton, David, and Michael Hills. 2013. Statistical Models in\nEpidemiology. OUP Oxford.\n\n\nDalgaard, Peter. 2008. Introductory Statistics with r. New\nYork, NY: Springer New York. https://link.springer.com/book/10.1007/978-0-387-79054-1.\n\n\nDobson, Annette J, and Adrian G Barnett. 2018. An Introduction to\nGeneralized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.\n\n\nDunn, Peter K, Gordon K Smyth, et al. 2018. Generalized Linear\nModels with Examples in r. Vol. 53. Springer. https://link.springer.com/book/10.1007/978-1-4419-0118-7.\n\n\nEfron, Bradley, and David V Hinkley. 1978. 
“Assessing the Accuracy\nof the Maximum Likelihood Estimator: Observed Versus Expected Fisher\nInformation.” Biometrika 65 (3): 457–83.\n\n\nFaraway, Julian J. 2016. Extending the Linear Model with r:\nGeneralized Linear, Mixed Effects and Nonparametric Regression\nModels. 2nd ed. Chapman; Hall/CRC. https://doi.org/10.1201/9781315382722.\n\n\nFay, Colin, Sébastien Rochette, Vincent Guyader, and Cervan Girard.\n2021. Engineering Production-Grade Shiny Apps. Chapman;\nHall/CRC. https://engineering-shiny.org/.\n\n\nFieller, Nick. 2016. Basics of Matrix Algebra for Statistics with\nr. Chapman; Hall/CRC. https://doi.org/10.1201/9781315370200.\n\n\nFox, John. 2015. Applied Regression Analysis and Generalized Linear\nModels. Sage publications.\n\n\nGrambsch, Patricia M, and Terry M Therneau. 1994. “Proportional\nHazards Tests and Diagnostics Based on Weighted Residuals.”\nBiometrika 81 (3): 515–26. https://doi.org/10.1093/biomet/81.3.515.\n\n\nHarrell, Frank E. 2015. Regression Modeling Strategies: With\nApplications to Linear Models, Logistic Regression, and Survival\nAnalysis. 2nd ed. Springer. https://doi.org/10.1007/978-3-319-19425-7.\n\n\nHosmer, David W, Stanley Lemeshow, and Rodney X Sturdivant. 2013.\nApplied Logistic Regression. John Wiley & Sons. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118548387.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al.\n2013. An Introduction to Statistical Learning. Vol. 112.\nSpringer. https://www.statlearning.com/.\n\n\nJewell, Nicholas P. 2003. Statistics for Epidemiology. chapman;\nhall/CRC. https://www.routledge.com/Statistics-for-Epidemiology/Jewell/p/book/9781584884330.\n\n\nKalbfleisch, John D, and Ross L Prentice. 2011. The Statistical\nAnalysis of Failure Time Data. John Wiley & Sons.\n\n\nKhuri, André I. 2003. Advanced Calculus with Applications in\nStatistics. John Wiley & Sons. https://www.wiley.com/en-us/Advanced+Calculus+with+Applications+in+Statistics%2C+2nd+Edition-p-9780471391043.\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival\nAnalysis: Techniques for Censored and Truncated Data. Vol. 1230.\nSpringer. https://link.springer.com/book/10.1007/b97377.\n\n\nKleinbaum, David G, and Mitchel Klein. 2010. Logistic\nRegression. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-1742-3.\n\n\n———. 2012. Survival Analysis a Self-Learning Text. 3rd ed.\nSpringer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.\n\n\nKleinbaum, David G, Lawrence L Kupper, Azhar Nizam, K Muller, and ES\nRosenberg. 2014. Applied Regression Analysis and Other Multivariable\nMethods. 5th ed. Cengage Learning. https://www.cengage.com/c/applied-regression-analysis-and-other-multivariable-methods-5e-kleinbaum/9781285051086/.\n\n\nKleinman, Ken, and Nicholas J Horton. 2009. SAS and r: Data\nManagement, Statistical Analysis, and Graphics. Chapman; Hall/CRC.\nhttps://www.routledge.com/SAS-and-R-Data-Management-Statistical-Analysis-and-Graphics-Second-Edition/Kleinman-Horton/p/book/9781466584495.\n\n\nKuhn, Max, and Julia Silge. 2022. Tidy Modeling with r. \"\nO’Reilly Media, Inc.\". https://www.tmwr.org/.\n\n\nKutner, Michael H, Christopher J Nachtsheim, John Neter, and William Li.\n2005. Applied Linear Statistical Models. McGraw-Hill.\n\n\nLawrance, Rachael, Evgeny Degtyarev, Philip Griffiths, Peter Trask,\nHelen Lau, Denise D’Alessio, Ingolf Griebsch, Gudrun Wallenstein, Kim\nCocks, and Kaspar Rufibach. 2020. 
“What Is an Estimand, and How\nDoes It Relate to Quantifying the Effect of Treatment on\nPatient-Reported Quality of Life Outcomes in Clinical Trials?”\nJournal of Patient-Reported Outcomes 4 (1): 1–8. https://link.springer.com/article/10.1186/s41687-020-00218-5.\n\n\nMcCullagh, Peter, and J. A. Nelder. 1989. Generalized Linear\nModels. 2nd ed. Routledge. https://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf.\n\n\nMcLachlan, Geoffrey J, and Thriyambakam Krishnan. 2007. The EM\nAlgorithm and Extensions. 2nd ed. John Wiley & Sons. https://doi.org/10.1002/9780470191613.\n\n\nMoore, Dirk F. 2016. Applied Survival Analysis Using r. Vol.\n473. Springer. https://doi.org/10.1007/978-3-319-31245-3.\n\n\nNahhas, Ramzi W. 2024. Introduction to Regression Methods for Public\nHealth Using r. CRC Press. https://www.bookdown.org/rwnahhas/RMPH/.\n\n\nPebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With\nApplications in R. Boca Raton: Chapman; Hall/CRC. https://doi.org/10.1201/9780429459016.\n\n\nPohl, Moritz, Lukas Baumann, Rouven Behnisch, Marietta Kirchner,\nJohannes Krisam, and Anja Sander. 2021. “Estimands—A Basic Element for Clinical\nTrials.” Deutsches Ärzteblatt\nInternational 118 (51-52): 883–88. https://doi.org/10.3238/arztebl.m2021.0373.\n\n\nPolin, Richard A, William W Fox, and Steven H Abman. 2011. Fetal and\nNeonatal Physiology. 4th ed. Elsevier health sciences.\n\n\nRosenman, Ray H, Richard J Brand, C David Jenkins, Meyer Friedman,\nReuben Straus, and Moses Wurm. 1975. “Coronary Heart Disease in\nthe Western Collaborative Group Study: Final Follow-up Experience of 8\n1/2 Years.” JAMA 233 (8): 872–77. https://doi.org/10.1001/jama.1975.03260080034016.\n\n\nSearle, Shayle R, and Andre I Khuri. 2017. Matrix Algebra Useful for\nStatistics. John Wiley & Sons.\n\n\nSeber, George AF, and Alan J Lee. 2012. Linear Regression\nAnalysis. 2nd ed. John Wiley & Sons. https://www.wiley.com/en-us/Linear+Regression+Analysis%2C+2nd+Edition-p-9781118274422.\n\n\nSelvin, Steve. 2001. Epidemiologic Analysis: A Case-Oriented\nApproach. Oxford University Press.\n\n\nVan Buuren, Stef. 2018. Flexible Imputation of Missing Data.\nCRC press. https://stefvanbuuren.name/fimd/.\n\n\nVenables, Bill. 2023. codingMatrices: Alternative Factor Coding\nMatrices for Linear Model Formulae. https://CRAN.R-project.org/package=codingMatrices.\n\n\nVittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E\nMcCulloch. 2012. Regression Methods in Biostatistics: Linear,\nLogistic, Survival, and Repeated Measures Models. 2nd ed. Springer.\nhttps://doi.org/10.1007/978-1-4614-1353-0.\n\n\nWickham, Hadley. 2019. Advanced r. Chapman; Hall/CRC. https://adv-r.hadley.nz/index.html.\n\n\n———. 2021. Mastering Shiny. \" O’Reilly Media, Inc.\". https://mastering-shiny.org/.\n\n\nWickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy\nD’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019.\n“Welcome to the tidyverse.”\nJournal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, and Jennifer Bryan. 2023. R Packages. \"\nO’Reilly Media, Inc.\". https://r-pkgs.org/.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023.\nR for Data Science. \" O’Reilly Media, Inc.\". https://r4ds.hadley.nz/.", + "text": "Agresti, Alan. 2012. Categorical Data Analysis. Vol. 792. John\nWiley & Sons. https://www.wiley.com/en-us/Categorical+Data+Analysis%2C+3rd+Edition-p-9780470463635.\n\n\n———. 2018. 
An Introduction to Categorical Data Analysis. John\nWiley & Sons. https://www.wiley.com/en-us/An+Introduction+to+Categorical+Data+Analysis%2C+3rd+Edition-p-9781119405283.\n\n\nAnderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.”\nBulletin of American Iris Society 59: 2–5.\n\n\nBache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A\nForward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.\n\n\nBanerjee, Sudipto, and Anindya Roy. 2014. Linear Algebra and Matrix\nAnalysis for Statistics. Vol. 181. Crc Press Boca Raton.\n\n\nCanchola, Alison J, Susan L Stewart, Leslie Bernstein, Dee W West,\nRonald K Ross, Dennis Deapen, Richard Pinder, et al. 2003. “Cox\nRegression Using Different Time-Scales.” Western Users of SAS\nSoftware. https://www.lexjansen.com/wuss/2003/DataAnalysis/i-cox_time_scales.pdf.\n\n\nCasella, George, and Roger Berger. 2002. Statistical Inference.\n2nd ed. Cengage Learning. https://www.cengage.com/c/statistical-inference-2e-casella-berger/9780534243128/.\n\n\nChang, Winston. 2024. R Graphics Cookbook: Practical Recipes for\nVisualizing Data. O’Reilly Media. https://r-graphics.org/.\n\n\nChatterjee, Samprit, and Ali S Hadi. 2015. Regression Analysis by\nExample. John Wiley & Sons. https://www.wiley.com/en-us/Regression+Analysis+by+Example%2C+4th+Edition-p-9780470055458.\n\n\nClayton, David, and Michael Hills. 2013. Statistical Models in\nEpidemiology. OUP Oxford.\n\n\nCopelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff\nSzer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham,\nand Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia\nwith Allogeneic Bone Marrow Transplantation Following Preparation with\nBuCy2.” https://doi.org/10.1182/blood.V78.3.838.838 .\n\n\nDalgaard, Peter. 2008. Introductory Statistics with r. New\nYork, NY: Springer New York. https://link.springer.com/book/10.1007/978-0-387-79054-1.\n\n\nDobson, Annette J, and Adrian G Barnett. 2018. An Introduction to\nGeneralized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.\n\n\nDunn, Peter K, Gordon K Smyth, et al. 2018. Generalized Linear\nModels with Examples in r. Vol. 53. Springer. https://link.springer.com/book/10.1007/978-1-4419-0118-7.\n\n\nEfron, Bradley, and David V Hinkley. 1978. “Assessing the Accuracy\nof the Maximum Likelihood Estimator: Observed Versus Expected Fisher\nInformation.” Biometrika 65 (3): 457–83.\n\n\nFaraway, Julian J. 2016. Extending the Linear Model with r:\nGeneralized Linear, Mixed Effects and Nonparametric Regression\nModels. 2nd ed. Chapman; Hall/CRC. https://doi.org/10.1201/9781315382722.\n\n\nFay, Colin, Sébastien Rochette, Vincent Guyader, and Cervan Girard.\n2021. Engineering Production-Grade Shiny Apps. Chapman;\nHall/CRC. https://engineering-shiny.org/.\n\n\nFieller, Nick. 2016. Basics of Matrix Algebra for Statistics with\nr. Chapman; Hall/CRC. https://doi.org/10.1201/9781315370200.\n\n\nFox, John. 2015. Applied Regression Analysis and Generalized Linear\nModels. Sage publications.\n\n\nGrambsch, Patricia M, and Terry M Therneau. 1994. “Proportional\nHazards Tests and Diagnostics Based on Weighted Residuals.”\nBiometrika 81 (3): 515–26. https://doi.org/10.1093/biomet/81.3.515.\n\n\nHarrell, Frank E. 2015. Regression Modeling Strategies: With\nApplications to Linear Models, Logistic Regression, and Survival\nAnalysis. 2nd ed. Springer. https://doi.org/10.1007/978-3-319-19425-7.\n\n\nHosmer, David W, Stanley Lemeshow, and Rodney X Sturdivant. 2013.\nApplied Logistic Regression. 
John Wiley & Sons. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118548387.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al.\n2013. An Introduction to Statistical Learning. Vol. 112.\nSpringer. https://www.statlearning.com/.\n\n\nJewell, Nicholas P. 2003. Statistics for Epidemiology. chapman;\nhall/CRC. https://www.routledge.com/Statistics-for-Epidemiology/Jewell/p/book/9781584884330.\n\n\nKalbfleisch, John D, and Ross L Prentice. 2011. The Statistical\nAnalysis of Failure Time Data. John Wiley & Sons.\n\n\nKhuri, André I. 2003. Advanced Calculus with Applications in\nStatistics. John Wiley & Sons. https://www.wiley.com/en-us/Advanced+Calculus+with+Applications+in+Statistics%2C+2nd+Edition-p-9780471391043.\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival\nAnalysis: Techniques for Censored and Truncated Data. Vol. 1230.\nSpringer. https://link.springer.com/book/10.1007/b97377.\n\n\nKleinbaum, David G, and Mitchel Klein. 2010. Logistic\nRegression. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-1742-3.\n\n\n———. 2012. Survival Analysis a Self-Learning Text. 3rd ed.\nSpringer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.\n\n\nKleinbaum, David G, Lawrence L Kupper, Azhar Nizam, K Muller, and ES\nRosenberg. 2014. Applied Regression Analysis and Other Multivariable\nMethods. 5th ed. Cengage Learning. https://www.cengage.com/c/applied-regression-analysis-and-other-multivariable-methods-5e-kleinbaum/9781285051086/.\n\n\nKleinman, Ken, and Nicholas J Horton. 2009. SAS and r: Data\nManagement, Statistical Analysis, and Graphics. Chapman; Hall/CRC.\nhttps://www.routledge.com/SAS-and-R-Data-Management-Statistical-Analysis-and-Graphics-Second-Edition/Kleinman-Horton/p/book/9781466584495.\n\n\nKuhn, Max, and Julia Silge. 2022. Tidy Modeling with r. \"\nO’Reilly Media, Inc.\". https://www.tmwr.org/.\n\n\nKutner, Michael H, Christopher J Nachtsheim, John Neter, and William Li.\n2005. Applied Linear Statistical Models. McGraw-Hill.\n\n\nLawrance, Rachael, Evgeny Degtyarev, Philip Griffiths, Peter Trask,\nHelen Lau, Denise D’Alessio, Ingolf Griebsch, Gudrun Wallenstein, Kim\nCocks, and Kaspar Rufibach. 2020. “What Is an Estimand, and How\nDoes It Relate to Quantifying the Effect of Treatment on\nPatient-Reported Quality of Life Outcomes in Clinical Trials?”\nJournal of Patient-Reported Outcomes 4 (1): 1–8. https://link.springer.com/article/10.1186/s41687-020-00218-5.\n\n\nMcCullagh, Peter, and J. A. Nelder. 1989. Generalized Linear\nModels. 2nd ed. Routledge. https://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf.\n\n\nMcLachlan, Geoffrey J, and Thriyambakam Krishnan. 2007. The EM\nAlgorithm and Extensions. 2nd ed. John Wiley & Sons. https://doi.org/10.1002/9780470191613.\n\n\nMoore, Dirk F. 2016. Applied Survival Analysis Using r. Vol.\n473. Springer. https://doi.org/10.1007/978-3-319-31245-3.\n\n\nNahhas, Ramzi W. 2024. Introduction to Regression Methods for Public\nHealth Using r. CRC Press. https://www.bookdown.org/rwnahhas/RMPH/.\n\n\nPebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With\nApplications in R. Boca Raton: Chapman; Hall/CRC. https://doi.org/10.1201/9780429459016.\n\n\nPohl, Moritz, Lukas Baumann, Rouven Behnisch, Marietta Kirchner,\nJohannes Krisam, and Anja Sander. 2021. “Estimands—A Basic Element for Clinical\nTrials.” Deutsches Ärzteblatt\nInternational 118 (51-52): 883–88. 
https://doi.org/10.3238/arztebl.m2021.0373.\n\n\nPolin, Richard A, William W Fox, and Steven H Abman. 2011. Fetal and\nNeonatal Physiology. 4th ed. Elsevier health sciences.\n\n\nRosenman, Ray H, Richard J Brand, C David Jenkins, Meyer Friedman,\nReuben Straus, and Moses Wurm. 1975. “Coronary Heart Disease in\nthe Western Collaborative Group Study: Final Follow-up Experience of 8\n1/2 Years.” JAMA 233 (8): 872–77. https://doi.org/10.1001/jama.1975.03260080034016.\n\n\nSearle, Shayle R, and Andre I Khuri. 2017. Matrix Algebra Useful for\nStatistics. John Wiley & Sons.\n\n\nSeber, George AF, and Alan J Lee. 2012. Linear Regression\nAnalysis. 2nd ed. John Wiley & Sons. https://www.wiley.com/en-us/Linear+Regression+Analysis%2C+2nd+Edition-p-9781118274422.\n\n\nSelvin, Steve. 2001. Epidemiologic Analysis: A Case-Oriented\nApproach. Oxford University Press.\n\n\nVan Buuren, Stef. 2018. Flexible Imputation of Missing Data.\nCRC press. https://stefvanbuuren.name/fimd/.\n\n\nVenables, Bill. 2023. codingMatrices: Alternative Factor Coding\nMatrices for Linear Model Formulae. https://CRAN.R-project.org/package=codingMatrices.\n\n\nVittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E\nMcCulloch. 2012. Regression Methods in Biostatistics: Linear,\nLogistic, Survival, and Repeated Measures Models. 2nd ed. Springer.\nhttps://doi.org/10.1007/978-1-4614-1353-0.\n\n\nWickham, Hadley. 2019. Advanced r. Chapman; Hall/CRC. https://adv-r.hadley.nz/index.html.\n\n\n———. 2021. Mastering Shiny. \" O’Reilly Media, Inc.\". https://mastering-shiny.org/.\n\n\nWickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy\nD’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019.\n“Welcome to the tidyverse.”\nJournal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, and Jennifer Bryan. 2023. R Packages. \"\nO’Reilly Media, Inc.\". https://r-pkgs.org/.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023.\nR for Data Science. \" O’Reilly Media, Inc.\". https://r4ds.hadley.nz/.", "crumbs": [ "References" ]