diff --git a/.nojekyll b/.nojekyll index f38b1f54a..796047ecf 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -fb006b97 \ No newline at end of file +55f3eddd \ No newline at end of file diff --git a/Linear-models-overview.html b/Linear-models-overview.html index 981677402..94bd64fe8 100644 --- a/Linear-models-overview.html +++ b/Linear-models-overview.html @@ -3926,7 +3926,7 @@

Note

full -6.900 +7.071 0.4805 0.3831 5.956 @@ -3934,7 +3934,7 @@

Note

reduced -6.581 +6.767 0.4454 0.3802 5.971 diff --git a/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png b/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png index 222915efa..f7e541934 100644 Binary files a/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png and b/Linear-models-overview_files/figure-html/unnamed-chunk-104-1.png differ diff --git a/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png b/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png index c52402412..aa97d6684 100644 Binary files a/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png and b/Linear-models-overview_files/figure-html/unnamed-chunk-98-1.png differ diff --git a/Regression-Models-for-Epidemiology.pdf b/Regression-Models-for-Epidemiology.pdf index db2ca47d2..fb7870fd2 100644 Binary files a/Regression-Models-for-Epidemiology.pdf and b/Regression-Models-for-Epidemiology.pdf differ diff --git a/index.html b/index.html index f3fe81a93..f54d12209 100644 --- a/index.html +++ b/index.html @@ -395,7 +395,7 @@

Table of contents

Published
-

Last modified: 2024-05-27: 11:49:27 (AM)

+

Last modified: 2024-05-27: 12:56:57 (PM)

diff --git a/intro-MLEs.html b/intro-MLEs.html index 7b18107a8..828b10ef8 100644 --- a/intro-MLEs.html +++ b/intro-MLEs.html @@ -2082,7 +2082,7 @@
diff --git a/intro-to-survival-analysis.html b/intro-to-survival-analysis.html index 3ce1203d8..57348c59b 100644 --- a/intro-to-survival-analysis.html +++ b/intro-to-survival-analysis.html @@ -458,7 +458,7 @@
Published
-

Last modified: 2024-05-27: 11:49:27 (AM)

+

Last modified: 2024-05-27: 12:56:57 (PM)

@@ -638,8 +638,7 @@

The survival function \(S(t)\) is the probability that the event time is later than \(t\). If the event in a clinical trial is death, then \(S(t)\) is the expected fraction of the original population at time 0 who have survived up to time \(t\) and are still alive at time \(t\); that is:

-

If \(X_t\) represents survival status at time \(t\), with \(X_t = 1\) denoting alive at time \(t\) and \(X_t = 0\) denoting deceased at time \(t\), then:

-

\[S(t) = \Pr(X_t=1) = \mathbb{E}[X_t]\]

+

\[S(t) \stackrel{\text{def}}{=}\Pr(T > t) \tag{5.1}\]
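As a quick numerical sketch (assuming an exponential event-time model with an arbitrary rate), the identity \(S(t) = 1 - F(t) = \Pr(T > t)\) can be checked directly in R:

```r
# Sketch: S(t) = 1 - F(t) = Pr(T > t), illustrated with an exponential model
rate <- 0.5   # illustrative hazard rate
t <- 2        # illustrative time point
1 - pexp(t, rate = rate)                  # 1 - F(t)
pexp(t, rate = rate, lower.tail = FALSE)  # Pr(T > t), same value
exp(-rate * t)                            # closed form for the exponential
```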


@@ -662,6 +661,20 @@
+
+
+

Theorem 5.1 If \(A_t\) represents survival status at time \(t\), with \(A_t = 1\) denoting alive at time \(t\) and \(A_t = 0\) denoting deceased at time \(t\), then:

+

\[S(t) = \Pr(A_t=1) = \mathbb{E}[A_t]\]
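A small Monte Carlo sketch of this theorem, using an assumed exponential event-time distribution: the sample mean of the alive-at-\(t\) indicator approximates \(S(t)\).

```r
# Sketch: S(t) = E[A_t], where A_t = 1{T > t}
set.seed(1)
rate <- 0.5                                # assumed hazard rate
t <- 2
T_sim <- rexp(1e5, rate = rate)            # simulated event times
A_t <- as.numeric(T_sim > t)               # alive-at-time-t indicator
mean(A_t)                                  # Monte Carlo estimate of E[A_t]
pexp(t, rate, lower.tail = FALSE)          # exact S(t), for comparison
```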

+
+
+
+

Theorem 5.2 If \(T\) is a nonnegative random variable, then:

+

\[\mathbb{E}[T] = \int_{t=0}^{\infty} S(t)dt\]

+
+
+
+

Proof. See https://statproofbook.github.io/P/mean-nnrvar.html.
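A numerical sketch of Theorem 5.2 (not a proof), using the exponential distribution, whose mean is \(1/\lambda\):

```r
# Sketch: E[T] = integral of S(t) dt, checked numerically for an exponential model
rate <- 0.5
S <- function(t) pexp(t, rate = rate, lower.tail = FALSE)
integrate(S, lower = 0, upper = Inf)$value   # numeric integral of S(t)
1 / rate                                     # known mean of an Exponential(rate)
```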

+

5.3.4 The Hazard Function

Another important quantity is the hazard function:

@@ -672,7 +685,7 @@
-

The hazard function has an important relationship to the density and survival functions, which we can use to derive the hazard function for a given probability distribution (Theorem 5.1).

+

The hazard function has an important relationship to the density and survival functions, which we can use to derive the hazard function for a given probability distribution (Theorem 5.3).

Lemma 5.1 (Joint probability of a variable with itself) \[p(T=t, T\ge t) = p(T=t)\]

@@ -682,7 +695,7 @@

-

Theorem 5.1 \[h(t)=\frac{f(t)}{S(t)}\]

+

Theorem 5.3 \[h(t)=\frac{f(t)}{S(t)}\]
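For the exponential model, this ratio can be evaluated directly in R (an illustrative check with an arbitrary rate):

```r
# Sketch: h(t) = f(t) / S(t); for an exponential model this is the constant rate
rate <- 1.5
t <- seq(0.5, 3, by = 0.5)
dexp(t, rate) / pexp(t, rate, lower.tail = FALSE)  # equals `rate` at every t
```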


@@ -723,7 +736,7 @@

We can also view the hazard function as the derivative of the negative of the logarithm of the survival function:

-

Theorem 5.2 (transform survival to hazard) \[h(t) = \frac{\partial}{\partial t}\left\{-\text{log}\left\{S(t)\right\}\right\}\]

+

Theorem 5.4 (transform survival to hazard) \[h(t) = \frac{\partial}{\partial t}\left\{-\text{log}\left\{S(t)\right\}\right\}\]
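A finite-difference sketch of this identity, using a Weibull model as an arbitrary example:

```r
# Sketch: h(t) = d/dt { -log S(t) }, checked by finite differences for a Weibull model
shape <- 1.5; scale <- 2                     # illustrative Weibull parameters
S <- function(t) pweibull(t, shape, scale, lower.tail = FALSE)
t <- 1; eps <- 1e-5
(-log(S(t + eps)) - (-log(S(t)))) / eps      # finite-difference derivative of -log S(t)
dweibull(t, shape, scale) / S(t)             # h(t) = f(t)/S(t), should agree closely
```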


@@ -740,13 +753,13 @@

5.3.5 The Cumulative Hazard Function

-

Since \(h(t) = \frac{\partial}{\partial t}\left\{-\text{log}\left\{S(t)\right\}\right\}\) (see Theorem 5.2), we also have:

+

Since \(h(t) = \frac{\partial}{\partial t}\left\{-\text{log}\left\{S(t)\right\}\right\}\) (see Theorem 5.4), we also have:

-

Corollary 5.1 \[S(t) = \text{exp}\left\{-\int_{u=0}^t h(u)du\right\} \tag{5.1}\]

+

Corollary 5.1 \[S(t) = \text{exp}\left\{-\int_{u=0}^t h(u)du\right\} \tag{5.2}\]
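A numerical sketch of this corollary (again with an arbitrary Weibull model): integrate the hazard from 0 to \(t\) and exponentiate the negative of the result.

```r
# Sketch: S(t) = exp{ -integral_0^t h(u) du }, checked numerically for a Weibull model
shape <- 1.5; scale <- 2
h <- function(u) dweibull(u, shape, scale) /
  pweibull(u, shape, scale, lower.tail = FALSE)   # h(u) = f(u)/S(u)
t <- 1.8
exp(-integrate(h, lower = 0, upper = t)$value)    # exp{-H(t)}
pweibull(t, shape, scale, lower.tail = FALSE)     # S(t) computed directly
```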


-

The integral in Equation 5.1 is important enough to have its own name: cumulative hazard.

+

The integral in Equation 5.2 is important enough to have its own name: cumulative hazard.

Definition 5.3 (cumulative hazard) The cumulative hazard function \(H(t)\) is defined as:

@@ -1104,10 +1117,10 @@ \]


-

Theorem 5.3 Let \(T=\sum t_i\) and \(U=\sum u_j\). Then:

+

Theorem 5.5 Let \(T=\sum t_i\) and \(U=\sum u_j\). Then:

\[ -\ell(\lambda) = \frac{m}{T+U} -\tag{5.2}\]

+\hat{\lambda}_{ML} = \frac{m}{T+U} +\tag{5.3}\]
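A simulation sketch of this estimator, with an assumed true rate and a simple administrative censoring time; the estimate \(m/(T+U)\) should land near the true \(\lambda\):

```r
# Sketch: MLE of lambda from censored exponential data, lambda_hat = m / (T + U)
set.seed(42)
lambda <- 0.2                              # assumed true hazard rate
n <- 500
event_time <- rexp(n, rate = lambda)
cens_time <- 8                             # administrative censoring time
y <- pmin(event_time, cens_time)           # observed exit times
d <- as.numeric(event_time <= cens_time)   # 1 = event observed, 0 = censored
m <- sum(d)                                # number of observed events
m / sum(y)                                 # lambda_hat = m / (T + U); should be near 0.2
```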


@@ -1266,7 +1279,7 @@ 5.6.5 Exponential model
    -
  • We can compute the hazard rate, assuming an exponential model: number of relapses divided by the sum of the exit times (Equation 5.2).
  • +
  • We can compute the hazard rate, assuming an exponential model: number of relapses divided by the sum of the exit times (Equation 5.3).

\[\hat\lambda = \frac{\sum_{i=1}^nD_i}{\sum_{i=1}^nY_i}\]
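A sketch of this calculation on the `drug6mp` data: the 6-MP arm reproduces the \(9/359 \approx 0.025\) calculation shown later in this chapter, and in the placebo arm every patient relapsed, so the estimate is the reciprocal of the mean exit time.

```r
# Sketch: exponential-model hazard estimates for the drug6mp data
library(KMsurv)
data(drug6mp)
sum(drug6mp$relapse) / sum(drug6mp$t2)   # 6-MP arm: relapses / total follow-up = 9/359
nrow(drug6mp) / sum(drug6mp$t1)          # placebo arm: all relapsed, so 1 / mean(t1)
```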

@@ -1321,11 +1334,12 @@

Definition 5.4 (Kaplan-Meier Product-Limit Estimator of Survival Function) If the distinct event times are \(t_i\), with \(d_i\) events occurring at time \(t_i\) (\(1\le i \le k\)), then the Kaplan-Meier Product-Limit Estimator of the survival function is:

-

\[\hat S(t) = \prod_{t_i < t} \left[\frac{1-d_i}{Y_i}\right] \tag{5.3}\]

+

\[\hat S(t) = \prod_{t_i < t} \left[1-\frac{d_i}{Y_i}\right] \tag{5.4}\]

where \(Y_i\) is the number of observations whose time (event or censored) is \(\ge t_i\), i.e., the size of the group at risk at time \(t_i\).


-

If there are no censored data, and there are \(n\) data points, then just after (say) the third event time

+
+

Theorem 5.6 (Kaplan-Meier Estimate with No Censored Observations) If there are no censored data, and there are \(n\) data points, then just after (say) the third event time

\[ \begin{aligned} \hat S(t) &= \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right] \\ &= \left[\frac{n-d_1}{n}\right] \left[\frac{n-d_1-d_2}{n-d_1}\right] \left[\frac{n-d_1-d_2-d_3}{n-d_1-d_2}\right] \\ &= \frac{n-d_1-d_2-d_3}{n} \\ &= 1-\frac{d_1+d_2+d_3}{n} \\ &= 1-\hat F(t) \end{aligned} \]
@@ -1337,6 +1351,7 @@

where \(\hat F(t)\) is the usual empirical CDF estimate.
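A quick sketch of the no-censoring case with simulated data: when every observation is an event, the Kaplan-Meier estimate agrees with one minus the empirical CDF.

```r
# Sketch: with no censoring, the product-limit estimate reduces to 1 - ECDF
library(survival)
set.seed(7)
t_obs <- rexp(20, rate = 0.3)                      # all observations are events
fit <- survfit(Surv(t_obs, rep(1, 20)) ~ 1)
cbind(KM = fit$surv,
      one_minus_ecdf = 1 - ecdf(t_obs)(fit$time))  # the two columns should match
```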

+

5.7.3 Kaplan-Meier curve for drug6mp data

Here is the Kaplan-Meier estimated survival curve for the patients who received 6-MP in the drug6mp dataset (we will see code to produce figures like this one shortly):

@@ -2015,11 +2030,11 @@

\[\hat S(t) = \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]\] where \(Y_i\) is the group at risk at time \(t_i\).

The estimated variance of \(\hat S(t)\) is:

-

Theorem 5.4 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \[ +

Theorem 5.7 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \[ \widehat{\text{Var}}\left(\hat S(t)\right) = \hat S(t)^2\sum_{t_i <t}\frac{d_i}{Y_i(Y_i-d_i)} -\tag{5.4}\]

+\tag{5.5}\]

-

We can use Equation 5.4 for confidence intervals for a survival function or a difference of survival functions.

+

We can use Equation 5.5 for confidence intervals for a survival function or a difference of survival functions.
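A sketch of Greenwood's formula for the 6-MP arm of `drug6mp`, compared against the standard errors reported by `summary.survfit()` (Greenwood is the package default, so the columns should agree):

```r
# Sketch: Greenwood standard errors for the Kaplan-Meier estimate, by hand vs survfit
library(KMsurv)
library(survival)
data(drug6mp)
fit <- survfit(Surv(t2, relapse) ~ 1, data = drug6mp)
sm <- summary(fit)                                   # one row per observed event time
d <- sm$n.event; Y <- sm$n.risk; S <- sm$surv
by_hand_se <- S * sqrt(cumsum(d / (Y * (Y - d))))    # Greenwood formula
cbind(time = sm$time, survfit_se = sm$std.err, by_hand_se)
```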


Kaplan-Meier survival curves
@@ -2164,11 +2179,11 @@

The point hazard at time \(t_i\) can be estimated by \(d_i/Y_i\), which leads to the Nelson-Aalen estimator of the cumulative hazard:

-

\[\hat H_{NA}(t) \stackrel{\text{def}}{=}\sum_{t_i < t}\frac{d_i}{Y_i} \tag{5.5}\]

+

\[\hat H_{NA}(t) \stackrel{\text{def}}{=}\sum_{t_i < t}\frac{d_i}{Y_i} \tag{5.6}\]


-

Theorem 5.5 (Variance of Nelson-Aalen estimator)  

+

Theorem 5.8 (Variance of Nelson-Aalen estimator)  

The variance of this estimator is approximately:

@@ -2178,7 +2193,7 @@
\[ \begin{aligned} \widehat{\text{Var}}\left(\hat H_{NA}(t)\right) &= \sum_{t_i <t}\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\ &\approx \sum_{t_i <t}\frac{d_i}{Y_i^2} \end{aligned} -\tag{5.6}\]

+\tag{5.7}\]
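A sketch of the Nelson-Aalen estimate and its approximate variance for the 6-MP arm of `drug6mp`, built from the same event-time table:

```r
# Sketch: Nelson-Aalen cumulative hazard and its approximate variance
library(KMsurv)
library(survival)
data(drug6mp)
fit <- survfit(Surv(t2, relapse) ~ 1, data = drug6mp)
sm <- summary(fit)                 # one row per observed event time
d <- sm$n.event; Y <- sm$n.risk
H_na <- cumsum(d / Y)              # Nelson-Aalen cumulative hazard estimate
se_na <- sqrt(cumsum(d / Y^2))     # approximate standard error
cbind(time = sm$time, H_na, se_na)
```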


Since \(S(t)=\text{exp}\left\{-H(t)\right\}\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival function:

@@ -3076,1698 +3091,1729 @@ the expected fraction of the original population at time 0 who have survived up to time $t$ and are still alive at time $t$; that is: :::: -If $X_t$ represents survival status at time $t$, with $X_t = 1$ denoting alive at time $t$ and $X_t = 0$ denoting deceased at time $t$, then: +$$S(t) \eqdef \Pr(T > t)$${#eq-def-surv} -$$S(t) = \Pr(X_t=1) = \Expp[X_t]$$ +::: -::: +--- ---- - -:::{#exm-exp-survfn} -##### exponential distribution - -Since $S(t) = 1 - F(t)$, the survival function of the exponential -distribution family of models is: - -$$ -P(T> t) = \left\{ {{\text{e}^{-\lambda t}, t\ge0} \atop {1, t \le 0}}\right. -$$ where $\lambda > 0$. +:::{#exm-exp-survfn} +##### exponential distribution + +Since $S(t) = 1 - F(t)$, the survival function of the exponential +distribution family of models is: + +$$ +P(T> t) = \left\{ {{\text{e}^{-\lambda t}, t\ge0} \atop {1, t \le 0}}\right. +$$ where $\lambda > 0$. + +@fig-exp-survfuns shows some examples of exponential survival functions. -@fig-exp-survfuns shows some examples of exponential survival functions. +::: -::: +--- ---- - -```{r, echo = FALSE} -#| fig-cap: "Exponential Survival Functions" -#| label: fig-exp-survfuns -library(ggplot2) -ggplot() + - geom_function( - aes(col = "0.5"), - fun = \(x) pexp(x, lower = FALSE, rate = 0.5)) + - geom_function( - aes(col = "p = 1"), - fun = \(x) pexp(x, lower = FALSE, rate = 1)) + - geom_function( - aes(col = "p = 1.5"), - fun = \(x) pexp(x, lower = FALSE, rate = 1.5)) + - geom_function( - aes(col = "p = 5"), - fun = \(x) pexp(x, lower = FALSE, rate = 5)) + - theme_bw() + - ylab("S(t)") + - guides(col = guide_legend(title = expr(lambda))) + - xlab("Time (t)") + - xlim(0, 2.5) + - theme( - axis.title.x = - element_text( - angle = 0, - vjust = 1, - hjust = 1), - axis.title.y = - element_text( - angle = 0, - vjust = 1, - hjust = 1)) -``` +```{r, echo = FALSE} +#| fig-cap: "Exponential Survival Functions" +#| label: fig-exp-survfuns +library(ggplot2) +ggplot() + + geom_function( + aes(col = "0.5"), + fun = \(x) pexp(x, lower = FALSE, rate = 0.5)) + + geom_function( + aes(col = "p = 1"), + fun = \(x) pexp(x, lower = FALSE, rate = 1)) + + geom_function( + aes(col = "p = 1.5"), + fun = \(x) pexp(x, lower = FALSE, rate = 1.5)) + + geom_function( + aes(col = "p = 5"), + fun = \(x) pexp(x, lower = FALSE, rate = 5)) + + theme_bw() + + ylab("S(t)") + + guides(col = guide_legend(title = expr(lambda))) + + xlab("Time (t)") + + xlim(0, 2.5) + + theme( + axis.title.x = + element_text( + angle = 0, + vjust = 1, + hjust = 1), + axis.title.y = + element_text( + angle = 0, + vjust = 1, + hjust = 1)) +``` + +--- -### The Hazard Function +:::{#thm-surv-fn-as-mean-status} -Another important quantity is the **hazard function**: +If $A_t$ represents survival status at time $t$, with $A_t = 1$ denoting alive at time $t$ and $A_t = 0$ denoting deceased at time $t$, then: -:::{#def-hazard} +$$S(t) = \Pr(A_t=1) = \Expp[A_t]$$ -{{< include _def-hazard.qmd >}} +::: -::: +--- ---- +:::{#thm-surv-and-mean} -::: notes -The hazard function has an important relationship to the density and survival functions, -which we can use to derive the hazard function for a given probability distribution (@thm-hazard1). -::: - -:::::{#lem-joint-prob-same-var} - -#### Joint probability of a variable with itself - -$$p(T=t, T\ge t) = p(T=t)$$ - -::::::{.proof} -Recall from Epi 202: -if $A$ and $B$ are statistical events and $A\subseteq B$, then $p(A, B) = p(A)$. 
-In particular, $\{T=t\} \subseteq \{T\geq t\}$, so $p(T=t, T\ge t) = p(T=t)$. -:::::: -::::: +If $T$ is a nonnegative random variable, then: + +$$\Expp[T] = \int_{t=0}^{\infty} S(t)dt$$ + +::: + +--- + +:::{.proof} + +See <https://statproofbook.github.io/P/mean-nnrvar.html> or + +::: + +### The Hazard Function + +Another important quantity is the **hazard function**: ---- +:::{#def-hazard} -:::{#thm-hazard1} +{{< include _def-hazard.qmd >}} -$$h(t)=\frac{f(t)}{S(t)}$$ -::: - ---- - -::::{.proof} - -$$ -\begin{aligned} -h(t) &=p(T=t|T\ge t)\\ -&=\frac{p(T=t, T\ge t)}{p(T \ge t)}\\ -&=\frac{p(T=t)}{p(T \ge t)}\\ -&=\frac{f(t)}{S(t)} -\end{aligned} -$$ - -:::: - ---- - -:::{#exm-exp-haz} -##### exponential distribution - -The hazard function of the exponential distribution family of models is: - -$$ -\begin{aligned} -P(T=t|T \ge t) -&= \frac{f(t)}{S(t)}\\ -&= \frac{\mathbb{1}_{t \ge 0}\cdot \lambda \text{e}^{-\lambda t}}{\text{e}^{-\lambda t}}\\ -&=\mathbb{1}_{t \ge 0}\cdot \lambda -\end{aligned} -$$ -@fig-exp-hazard shows some examples of exponential hazard functions. - -::: - ---- - -```{r, echo = FALSE} -#| fig-cap: "Examples of hazard functions for exponential distributions" -#| label: fig-exp-hazard -library(ggplot2) -ggplot() + - geom_hline( - aes(col = "0.5",yintercept = 0.5)) + - geom_hline( - aes(col = "p = 1", yintercept = 1)) + - geom_hline( - aes(col = "p = 1.5", yintercept = 1.5)) + - geom_hline( - aes(col = "p = 5", yintercept = 5)) + - theme_bw() + - ylab("h(t)") + - ylim(0,5) + - guides(col = guide_legend(title = expr(lambda))) + - xlab("Time (t)") + - xlim(0, 2.5) + - theme( - axis.title.x = - element_text( - angle = 0, - vjust = 1, - hjust = 1), - axis.title.y = - element_text( - angle = 0, - vjust = 1, - hjust = 1)) -``` - ---- - -We can also view the hazard function as the derivative of the negative of the logarithm of the survival function: - -:::{#thm-h-logS} - -#### transform survival to hazard - -$$h(t) = \deriv{t}\cb{-\log{S(t)}}$$ -::: - ---- - -::::{.proof} -$$ -\begin{aligned} -h(t) -&= \frac{f(t)}{S(t)}\\ -&= \frac{-S'(t)}{S(t)}\\ -&= -\frac{S'(t)}{S(t)}\\ -&=-\deriv{t}\log{S(t)}\\ -&=\deriv{t}\cb{-\log{S(t)}} -\end{aligned} -$$ -:::: +::: + +--- + +::: notes +The hazard function has an important relationship to the density and survival functions, +which we can use to derive the hazard function for a given probability distribution (@thm-hazard1). +::: + +:::::{#lem-joint-prob-same-var} + +#### Joint probability of a variable with itself + +$$p(T=t, T\ge t) = p(T=t)$$ + +::::::{.proof} +Recall from Epi 202: +if $A$ and $B$ are statistical events and $A\subseteq B$, then $p(A, B) = p(A)$. +In particular, $\{T=t\} \subseteq \{T\geq t\}$, so $p(T=t, T\ge t) = p(T=t)$. +:::::: +::::: + +--- + +:::{#thm-hazard1} + +$$h(t)=\frac{f(t)}{S(t)}$$ +::: + +--- + +::::{.proof} + +$$ +\begin{aligned} +h(t) &=p(T=t|T\ge t)\\ +&=\frac{p(T=t, T\ge t)}{p(T \ge t)}\\ +&=\frac{p(T=t)}{p(T \ge t)}\\ +&=\frac{f(t)}{S(t)} +\end{aligned} +$$ + +:::: + +--- + +:::{#exm-exp-haz} +##### exponential distribution + +The hazard function of the exponential distribution family of models is: + +$$ +\begin{aligned} +P(T=t|T \ge t) +&= \frac{f(t)}{S(t)}\\ +&= \frac{\mathbb{1}_{t \ge 0}\cdot \lambda \text{e}^{-\lambda t}}{\text{e}^{-\lambda t}}\\ +&=\mathbb{1}_{t \ge 0}\cdot \lambda +\end{aligned} +$$ +@fig-exp-hazard shows some examples of exponential hazard functions. 
+ +::: + +--- + +```{r, echo = FALSE} +#| fig-cap: "Examples of hazard functions for exponential distributions" +#| label: fig-exp-hazard +library(ggplot2) +ggplot() + + geom_hline( + aes(col = "0.5",yintercept = 0.5)) + + geom_hline( + aes(col = "p = 1", yintercept = 1)) + + geom_hline( + aes(col = "p = 1.5", yintercept = 1.5)) + + geom_hline( + aes(col = "p = 5", yintercept = 5)) + + theme_bw() + + ylab("h(t)") + + ylim(0,5) + + guides(col = guide_legend(title = expr(lambda))) + + xlab("Time (t)") + + xlim(0, 2.5) + + theme( + axis.title.x = + element_text( + angle = 0, + vjust = 1, + hjust = 1), + axis.title.y = + element_text( + angle = 0, + vjust = 1, + hjust = 1)) +``` -### The Cumulative Hazard Function +--- -Since $h(t) = \deriv{t}\cb{-\log{S(t)}}$ (see @thm-h-logS), we also have: +We can also view the hazard function as the derivative of the negative of the logarithm of the survival function: -:::{#cor-surv-int-haz} -$$S(t) = \exp{-\int_{u=0}^t h(u)du}$${#eq-surv-int-haz} -::: +:::{#thm-h-logS} + +#### transform survival to hazard ---- - -::: notes -The integral in @eq-surv-int-haz is important enough to have its own name: **cumulative hazard**. -::: - -:::{#def-cumhaz} - -##### cumulative hazard - -The **cumulative hazard function** $H(t)$ is defined as: - -$$H(t) \eqdef \int_{u=0}^t h(u) du$$ - -::: - -As we will see below, $H(t)$ is tractable to estimate, and we can then -derive an estimate of the hazard function using an approximate derivative -of the estimated cumulative hazard. +$$h(t) = \deriv{t}\cb{-\log{S(t)}}$$ +::: + +--- + +::::{.proof} +$$ +\begin{aligned} +h(t) +&= \frac{f(t)}{S(t)}\\ +&= \frac{-S'(t)}{S(t)}\\ +&= -\frac{S'(t)}{S(t)}\\ +&=-\deriv{t}\log{S(t)}\\ +&=\deriv{t}\cb{-\log{S(t)}} +\end{aligned} +$$ +:::: + +### The Cumulative Hazard Function ---- +Since $h(t) = \deriv{t}\cb{-\log{S(t)}}$ (see @thm-h-logS), we also have: -:::{#exm-exp-cumhaz} - -The cumulative hazard function of the exponential distribution family of -models is: - -$$ -H(t) = \mathbb{1}_{t \ge 0}\cdot \lambda t -$$ - -@fig-cuhaz-exp shows some examples of exponential cumulative hazard functions. 
- -::: - ---- - -```{r, echo = FALSE} -#| fig-cap: "Examples of exponential cumulative hazard functions" -#| label: fig-cuhaz-exp -library(ggplot2) -ggplot() + - geom_abline( - aes(col = "0.5",intercept = 0, slope = 0.5)) + - geom_abline( - aes(col = "p = 1", intercept = 0, slope = 1)) + - geom_abline( - aes(col = "p = 1.5", intercept = 0, slope = 1.5)) + - geom_abline( - aes(col = "p = 5", intercept = 0, slope = 5)) + - theme_bw() + - ylab("H(t)") + - ylim(0,5) + - guides(col = guide_legend(title = expr(lambda))) + - xlab("Time (t)") + - xlim(0, 2.5) + - theme( - axis.title.x = - element_text( - angle = 0, - vjust = 1, - hjust = 1), - axis.title.y = - element_text( - angle = 0, - vjust = 1, - hjust = 1)) -``` - -### Some Key Mathematical Relationships among Survival Concepts - -#### Diagram: - -$$ -h(t) \xrightarrow[]{\int_{u=0}^t h(u)du} H(t) -\xrightarrow[]{\exp{-H(t)}} S(t) -\xrightarrow[]{1-S(t)} F(t) -$$ - -$$ -h(t) \xleftarrow[\deriv{t}H(t)]{} H(t) -\xleftarrow[-\log{S(t)}]{} S(t) -\xleftarrow[1-F(t)]{} F(t) -$$ - - ---- - -#### Identities: - -$$ -\begin{aligned} -S(t) &= 1 - F(t)\\ -&= \text{exp}\left\{-H(t)\right\}\\ -S'(t) &= -f(t)\\ -H(t) &= -\text{log}\left\{S(t)\right\}\\ -H'(t) &= h(t)\\ -h(t) &= \frac{f(t)}{S(t)}\\ - &= -\deriv{t}\log{S(t)} \\ -f(t) &= h(t)\cdot S(t)\\ -\end{aligned} -$$ - ---- +:::{#cor-surv-int-haz} +$$S(t) = \exp{-\int_{u=0}^t h(u)du}$${#eq-surv-int-haz} +::: + +--- + +::: notes +The integral in @eq-surv-int-haz is important enough to have its own name: **cumulative hazard**. +::: + +:::{#def-cumhaz} + +##### cumulative hazard + +The **cumulative hazard function** $H(t)$ is defined as: + +$$H(t) \eqdef \int_{u=0}^t h(u) du$$ + +::: + +As we will see below, $H(t)$ is tractable to estimate, and we can then +derive an estimate of the hazard function using an approximate derivative +of the estimated cumulative hazard. + +--- + +:::{#exm-exp-cumhaz} + +The cumulative hazard function of the exponential distribution family of +models is: + +$$ +H(t) = \mathbb{1}_{t \ge 0}\cdot \lambda t +$$ + +@fig-cuhaz-exp shows some examples of exponential cumulative hazard functions. 
+ +::: + +--- + +```{r, echo = FALSE} +#| fig-cap: "Examples of exponential cumulative hazard functions" +#| label: fig-cuhaz-exp +library(ggplot2) +ggplot() + + geom_abline( + aes(col = "0.5",intercept = 0, slope = 0.5)) + + geom_abline( + aes(col = "p = 1", intercept = 0, slope = 1)) + + geom_abline( + aes(col = "p = 1.5", intercept = 0, slope = 1.5)) + + geom_abline( + aes(col = "p = 5", intercept = 0, slope = 5)) + + theme_bw() + + ylab("H(t)") + + ylim(0,5) + + guides(col = guide_legend(title = expr(lambda))) + + xlab("Time (t)") + + xlim(0, 2.5) + + theme( + axis.title.x = + element_text( + angle = 0, + vjust = 1, + hjust = 1), + axis.title.y = + element_text( + angle = 0, + vjust = 1, + hjust = 1)) +``` + +### Some Key Mathematical Relationships among Survival Concepts + +#### Diagram: + +$$ +h(t) \xrightarrow[]{\int_{u=0}^t h(u)du} H(t) +\xrightarrow[]{\exp{-H(t)}} S(t) +\xrightarrow[]{1-S(t)} F(t) +$$ -Some proofs (others left as exercises): - -$$ -\begin{aligned} -S'(t) &= \deriv{t}(1-F(t))\\ -&= -F'(t)\\ -&= -f(t)\\ -\end{aligned} -$$ - ---- - -$$ -\begin{aligned} -\deriv{t}\log{S(t)} -&= \frac{S'(t)}{S(t)}\\ -&= -\frac{f(t)}{S(t)}\\ -&= -h(t)\\ -\end{aligned} -$$ - ---- - -$$ -\begin{aligned} -H(t) -&\eqdef \int_{u=0}^t h(u) du\\ -&= \int_0^t -\deriv{u}\text{log}\left\{S(u)\right\} du\\ -&= \left[-\text{log}\left\{S(u)\right\}\right]_{u=0}^{u=t}\\ -&= \left[\text{log}\left\{S(u)\right\}\right]_{u=t}^{u=0}\\ -&= \text{log}\left\{S(0)\right\} - \text{log}\left\{S(t)\right\}\\ -&= \text{log}\left\{1\right\} - \text{log}\left\{S(t)\right\}\\ -&= 0 - \text{log}\left\{S(t)\right\}\\ -&=-\text{log}\left\{S(t)\right\} -\end{aligned} -$$ - ---- - -Corollary: - -$$S(t) = \text{exp}\left\{-H(t)\right\}$$ - ---- - -#### Example: Time to death the US in 2004 +$$ +h(t) \xleftarrow[\deriv{t}H(t)]{} H(t) +\xleftarrow[-\log{S(t)}]{} S(t) +\xleftarrow[1-F(t)]{} F(t) +$$ + + +--- + +#### Identities: + +$$ +\begin{aligned} +S(t) &= 1 - F(t)\\ +&= \text{exp}\left\{-H(t)\right\}\\ +S'(t) &= -f(t)\\ +H(t) &= -\text{log}\left\{S(t)\right\}\\ +H'(t) &= h(t)\\ +h(t) &= \frac{f(t)}{S(t)}\\ + &= -\deriv{t}\log{S(t)} \\ +f(t) &= h(t)\cdot S(t)\\ +\end{aligned} +$$ + +--- + +Some proofs (others left as exercises): + +$$ +\begin{aligned} +S'(t) &= \deriv{t}(1-F(t))\\ +&= -F'(t)\\ +&= -f(t)\\ +\end{aligned} +$$ + +--- + +$$ +\begin{aligned} +\deriv{t}\log{S(t)} +&= \frac{S'(t)}{S(t)}\\ +&= -\frac{f(t)}{S(t)}\\ +&= -h(t)\\ +\end{aligned} +$$ -The first day is the most dangerous: +--- -```{r, echo = FALSE} -#| fig-cap: "Daily Hazard Rates in 2004 for US Females" -#| fig-pos: "H" -#| fig-height: 6 - -# download `survexp.rda` from: -# paste0( -# "https://github.com/therneau/survival/raw/", -# "f3ac93704949ff26e07720b56f2b18ffa8066470/", -# "data/survexp.rda") - -#(newer versions of `survival` don't have the first-year breakdown; see: -# https://cran.r-project.org/web/packages/survival/news.html) +$$ +\begin{aligned} +H(t) +&\eqdef \int_{u=0}^t h(u) du\\ +&= \int_0^t -\deriv{u}\text{log}\left\{S(u)\right\} du\\ +&= \left[-\text{log}\left\{S(u)\right\}\right]_{u=0}^{u=t}\\ +&= \left[\text{log}\left\{S(u)\right\}\right]_{u=t}^{u=0}\\ +&= \text{log}\left\{S(0)\right\} - \text{log}\left\{S(t)\right\}\\ +&= \text{log}\left\{1\right\} - \text{log}\left\{S(t)\right\}\\ +&= 0 - \text{log}\left\{S(t)\right\}\\ +&=-\text{log}\left\{S(t)\right\} +\end{aligned} +$$ -fs::path( - here::here(), - "data", - "survexp.rda") |> - load() -s1 <- survexp.us[,"female","2004"] -age1 <- c( - 0.5/365.25, - 4/365.25, - 17.5/365.25, - 
196.6/365.25, - 1:109+0.5) -s2 <- 365.25*s1[5:113] -s2 <- c(s1[1], 6*s1[2], 22*s1[3], 337.25*s1[4], s2) -cols <- rep(1,113) -cols[1] <- 2 -cols[2] <- 3 -cols[3] <- 4 - -plot(age1,s1,type="b",lwd=2,xlab="Age",ylab="Daily Hazard Rate",col=cols) - -text(10,.003,"First Day",col=2) -text(18,.00030,"Rest of First Week",col=3) -text(18,.00015,"Rest of First month",col=4) -``` +--- + +Corollary: + +$$S(t) = \text{exp}\left\{-H(t)\right\}$$ + +--- + +#### Example: Time to death the US in 2004 + +The first day is the most dangerous: + +```{r, echo = FALSE} +#| fig-cap: "Daily Hazard Rates in 2004 for US Females" +#| fig-pos: "H" +#| fig-height: 6 + +# download `survexp.rda` from: +# paste0( +# "https://github.com/therneau/survival/raw/", +# "f3ac93704949ff26e07720b56f2b18ffa8066470/", +# "data/survexp.rda") + +#(newer versions of `survival` don't have the first-year breakdown; see: +# https://cran.r-project.org/web/packages/survival/news.html) ---- - -Exercise: hypothesize why these curves differ where they do? - -```{r,echo = FALSE} -#| fig-cap: "Daily Hazard Rates in 2004 for US Males and Females 1-40" -#| fig-pos: "H" -yrs=1:40 -s1 <- survexp.us[5:113,"male","2004"] -s2 <- survexp.us[5:113,"female","2004"] - -age1 <- 1:109 - -plot(age1[yrs],s1[yrs],type="l",lwd=2,xlab="Age",ylab="Daily Hazard Rate") -lines(age1[yrs],s2[yrs],col=2,lwd=2) -legend(5,5e-6,c("Males","Females"),col=1:2,lwd=2) - -``` +fs::path( + here::here(), + "data", + "survexp.rda") |> + load() +s1 <- survexp.us[,"female","2004"] +age1 <- c( + 0.5/365.25, + 4/365.25, + 17.5/365.25, + 196.6/365.25, + 1:109+0.5) +s2 <- 365.25*s1[5:113] +s2 <- c(s1[1], 6*s1[2], 22*s1[3], 337.25*s1[4], s2) +cols <- rep(1,113) +cols[1] <- 2 +cols[2] <- 3 +cols[3] <- 4 ---- +plot(age1,s1,type="b",lwd=2,xlab="Age",ylab="Daily Hazard Rate",col=cols) -Exercise: compare and contrast this curve with the corresponding hazard -curve. - -```{r, echo = FALSE} -#| fig-cap: "Survival Curve in 2004 for US Females" -#| fig-pos: "H" +text(10,.003,"First Day",col=2) +text(18,.00030,"Rest of First Week",col=3) +text(18,.00015,"Rest of First month",col=4) +``` + +--- -s1 <- survexp.us[,"female","2004"] +Exercise: hypothesize why these curves differ where they do? 
-s2 <- 365.25*s1[5:113] -s2 <- c(s1[1], 6*s1[2], 21*s1[3], 337.25*s1[4], s2) -cs2 <- cumsum(s2) -age2 <- c(1/365.25, 7/365.25, 28/365.25, 1:110) -plot(age2,exp(-cs2),type="l",lwd=2,xlab="Age",ylab="Survival") - -``` - ---- - -### Likelihood with censoring - -If an event time $T$ is observed exactly as $T=t$, then the likelihood -of that observation is just its probability density function: +```{r,echo = FALSE} +#| fig-cap: "Daily Hazard Rates in 2004 for US Males and Females 1-40" +#| fig-pos: "H" +yrs=1:40 +s1 <- survexp.us[5:113,"male","2004"] +s2 <- survexp.us[5:113,"female","2004"] + +age1 <- 1:109 + +plot(age1[yrs],s1[yrs],type="l",lwd=2,xlab="Age",ylab="Daily Hazard Rate") +lines(age1[yrs],s2[yrs],col=2,lwd=2) +legend(5,5e-6,c("Males","Females"),col=1:2,lwd=2) + +``` -$$ -\begin{aligned} -\mathcal L(t) -&= p(T=t)\\ -&\eqdef f_T(t)\\ -&= h_T(t)S_T(t)\\ -\ell(t) -&\eqdef \text{log}\left\{\mathcal L(t)\right\}\\ -&= \text{log}\left\{h_T(t)S_T(t)\right\}\\ -&= \text{log}\left\{h_T(t)\right\} + \text{log}\left\{S_T(t)\right\}\\ -&= \text{log}\left\{h_T(t)\right\} - H_T(t)\\ -\end{aligned} -$$ - ---- - -If instead the event time $T$ is censored and only known to be after -time $y$, then the likelihood of that censored observation is instead -the survival function evaluated at the censoring time: - -$$ -\begin{aligned} -\mathcal L(y) -&=p_T(T>y)\\ -&\eqdef S_T(y)\\ -\ell(y) -&\eqdef \text{log}\left\{\mathcal L(y)\right\}\\ -&=\text{log}\left\{S(y)\right\}\\ -&=-H(y)\\ -\end{aligned} -$$ - ---- - -::: notes -What's written above is incomplete. We also observed whether or not the -observation was censored. Let $C$ denote the time when censoring would -occur (if the event did not occur first); let $f_C(y)$ and $S_C(y)$ be -the corresponding density and survival functions for the censoring -event. - -Let $Y$ denote the time when observation ended (either by censoring or -by the event of interest occurring), and let $D$ be an indicator -variable for the event occurring at $Y$ (so $D=0$ represents a censored -observation and $D=1$ represents an uncensored observation). In other -words, let $Y \eqdef \min(T,C)$ and -$D \eqdef \mathbb 1{\{T<=C\}}$. - -Then the complete likelihood of the observed data $(Y,D)$ is: -::: - -$$ -\begin{aligned} -\mathcal L(y,d) -&= p(Y=y, D=d)\\ -&= \left[p(T=y,C> y)\right]^d \cdot -\left[p(T>y,C=y)\right]^{1-d}\\ -\end{aligned} -$$ +--- + +Exercise: compare and contrast this curve with the corresponding hazard +curve. 
+ +```{r, echo = FALSE} +#| fig-cap: "Survival Curve in 2004 for US Females" +#| fig-pos: "H" + +s1 <- survexp.us[,"female","2004"] + +s2 <- 365.25*s1[5:113] +s2 <- c(s1[1], 6*s1[2], 21*s1[3], 337.25*s1[4], s2) +cs2 <- cumsum(s2) +age2 <- c(1/365.25, 7/365.25, 28/365.25, 1:110) +plot(age2,exp(-cs2),type="l",lwd=2,xlab="Age",ylab="Survival") + +``` + +--- + +### Likelihood with censoring + +If an event time $T$ is observed exactly as $T=t$, then the likelihood +of that observation is just its probability density function: + +$$ +\begin{aligned} +\mathcal L(t) +&= p(T=t)\\ +&\eqdef f_T(t)\\ +&= h_T(t)S_T(t)\\ +\ell(t) +&\eqdef \text{log}\left\{\mathcal L(t)\right\}\\ +&= \text{log}\left\{h_T(t)S_T(t)\right\}\\ +&= \text{log}\left\{h_T(t)\right\} + \text{log}\left\{S_T(t)\right\}\\ +&= \text{log}\left\{h_T(t)\right\} - H_T(t)\\ +\end{aligned} +$$ + +--- + +If instead the event time $T$ is censored and only known to be after +time $y$, then the likelihood of that censored observation is instead +the survival function evaluated at the censoring time: + +$$ +\begin{aligned} +\mathcal L(y) +&=p_T(T>y)\\ +&\eqdef S_T(y)\\ +\ell(y) +&\eqdef \text{log}\left\{\mathcal L(y)\right\}\\ +&=\text{log}\left\{S(y)\right\}\\ +&=-H(y)\\ +\end{aligned} +$$ + +--- ---- - -::: notes -Typically, survival analyses assume that $C$ and $T$ are mutually -independent; this assumption is called "non-informative" censoring. - -Then the joint likelihood $p(Y,D)$ factors into the product -$p(Y), p(D)$, and the likelihood reduces to: -::: - -$$ -\begin{aligned} -\mathcal L(y,d) -&= \left[p(T=y,C> y)\right]^d\cdot -\left[p(T>y,C=y)\right]^{1-d}\\ -&= \left[p(T=y)p(C> y)\right]^d\cdot -\left[p(T>y)p(C=y)\right]^{1-d}\\ -&= \left[f_T(y)S_C(y)\right]^d\cdot -\left[S(y)f_C(y)\right]^{1-d}\\ -&= \left[f_T(y)^d S_C(y)^d\right]\cdot -\left[S_T(y)^{1-d}f_C(y)^{1-d}\right]\\ -&= \left(f_T(y)^d \cdot S_T(y)^{1-d}\right)\cdot -\left(f_C(y)^{1-d} \cdot S_C(y)^{d}\right) +::: notes +What's written above is incomplete. We also observed whether or not the +observation was censored. Let $C$ denote the time when censoring would +occur (if the event did not occur first); let $f_C(y)$ and $S_C(y)$ be +the corresponding density and survival functions for the censoring +event. + +Let $Y$ denote the time when observation ended (either by censoring or +by the event of interest occurring), and let $D$ be an indicator +variable for the event occurring at $Y$ (so $D=0$ represents a censored +observation and $D=1$ represents an uncensored observation). In other +words, let $Y \eqdef \min(T,C)$ and +$D \eqdef \mathbb 1{\{T<=C\}}$. + +Then the complete likelihood of the observed data $(Y,D)$ is: +::: + +$$ +\begin{aligned} +\mathcal L(y,d) +&= p(Y=y, D=d)\\ +&= \left[p(T=y,C> y)\right]^d \cdot +\left[p(T>y,C=y)\right]^{1-d}\\ \end{aligned} $$ --- ::: notes -The corresponding log-likelihood is: -::: +Typically, survival analyses assume that $C$ and $T$ are mutually +independent; this assumption is called "non-informative" censoring. 
-$$ -\begin{aligned} -\ell(y,d) -&= \text{log}\left\{\mathcal L(y,d) \right\}\\ -&= \text{log}\left\{ -\left(f_T(y)^d \cdot S_T(y)^{1-d}\right)\cdot -\left(f_C(y)^{1-d} \cdot S_C(y)^{d}\right) -\right\}\\ -&= \text{log}\left\{ -f_T(y)^d \cdot S_T(y)^{1-d} -\right\} -+ -\text{log}\left\{ -f_C(y)^{1-d} \cdot S_C(y)^{d} -\right\}\\ -\end{aligned} -$$ Let - -- $\theta_T$ represent the parameters of $p_T(t)$, -- $\theta_C$ represent the parameters of $p_C(c)$, -- $\theta = (\theta_T, \theta_C)$ be the combined vector of all - parameters. - ---- - -::: notes -The corresponding score function is: -::: - -$$ -\begin{aligned} -\ell'(y,d) -&= \deriv{\theta} -\left[ -\text{log}\left\{ +Then the joint likelihood $p(Y,D)$ factors into the product +$p(Y), p(D)$, and the likelihood reduces to: +::: + +$$ +\begin{aligned} +\mathcal L(y,d) +&= \left[p(T=y,C> y)\right]^d\cdot +\left[p(T>y,C=y)\right]^{1-d}\\ +&= \left[p(T=y)p(C> y)\right]^d\cdot +\left[p(T>y)p(C=y)\right]^{1-d}\\ +&= \left[f_T(y)S_C(y)\right]^d\cdot +\left[S(y)f_C(y)\right]^{1-d}\\ +&= \left[f_T(y)^d S_C(y)^d\right]\cdot +\left[S_T(y)^{1-d}f_C(y)^{1-d}\right]\\ +&= \left(f_T(y)^d \cdot S_T(y)^{1-d}\right)\cdot +\left(f_C(y)^{1-d} \cdot S_C(y)^{d}\right) +\end{aligned} +$$ + +--- + +::: notes +The corresponding log-likelihood is: +::: + +$$ +\begin{aligned} +\ell(y,d) +&= \text{log}\left\{\mathcal L(y,d) \right\}\\ +&= \text{log}\left\{ +\left(f_T(y)^d \cdot S_T(y)^{1-d}\right)\cdot +\left(f_C(y)^{1-d} \cdot S_C(y)^{d}\right) +\right\}\\ +&= \text{log}\left\{ f_T(y)^d \cdot S_T(y)^{1-d} \right\} + \text{log}\left\{ f_C(y)^{1-d} \cdot S_C(y)^{d} -\right\} -\right]\\ -&= -\left( -\deriv{\theta} -\text{log}\left\{ -f_T(y)^d \cdot S_T(y)^{1-d} -\right\} -\right) -+ -\left( -\deriv{\theta} -\text{log}\left\{ -f_C(y)^{1-d} \cdot S_C(y)^{d} -\right\} -\right)\\ -\end{aligned} -$$ - ---- - -::: notes -As long as $\theta_C$ and $\theta_T$ don't share any parameters, then if -censoring is non-informative, the partial derivative with respect to -$\theta_T$ is: -::: - -$$ -\begin{aligned} -\ell'_{\theta_T}(y,d) -&\eqdef \deriv{\theta_T}\ell(y,d)\\ -&= -\left( -\deriv{\theta_T} -\text{log}\left\{ -f_T(y)^d \cdot S_T(y)^{1-d} -\right\} -\right) -+ -\left( -\deriv{\theta_T} -\text{log}\left\{ -f_C(y)^{1-d} \cdot S_C(y)^{d} -\right\} -\right)\\ -&= -\left( -\deriv{\theta_T} -\text{log}\left\{ -f_T(y)^d \cdot S_T(y)^{1-d} -\right\} -\right) + 0\\ -&= -\deriv{\theta_T} -\text{log}\left\{ -f_T(y)^d \cdot S_T(y)^{1-d} -\right\}\\ -\end{aligned} -$$ - ---- - -::: notes -Thus, the MLE for $\theta_T$ won't depend on $\theta_C$, and we can -ignore the distribution of $C$ when estimating the parameters of -$f_T(t)=p(T=t)$. -::: - -Then: - -$$ -\begin{aligned} -\mathcal L(y,d) -&= f_T(y)^d \cdot S_T(y)^{1-d}\\ -&= \left(h_T(y)^d S_T(y)^d\right) \cdot S_T(y)^{1-d}\\ -&= h_T(y)^d \cdot S_T(y)^d \cdot S_T(y)^{1-d}\\ -&= h_T(y)^d \cdot S_T(y)\\ -&= S_T(y) \cdot h_T(y)^d \\ -\end{aligned} -$$ - -::: notes -That is, if the event occurred at time $y$ (i.e., if $d=1$), then the -likelihood of $(Y,D) = (y,d)$ is equal to the hazard function at $y$ -times the survival function at $y$. Otherwise, the likelihood is equal -to just the survival function at $y$. -::: +\right\}\\ +\end{aligned} +$$ Let + +- $\theta_T$ represent the parameters of $p_T(t)$, +- $\theta_C$ represent the parameters of $p_C(c)$, +- $\theta = (\theta_T, \theta_C)$ be the combined vector of all + parameters. 
+ +--- + +::: notes +The corresponding score function is: +::: + +$$ +\begin{aligned} +\ell'(y,d) +&= \deriv{\theta} +\left[ +\text{log}\left\{ +f_T(y)^d \cdot S_T(y)^{1-d} +\right\} ++ +\text{log}\left\{ +f_C(y)^{1-d} \cdot S_C(y)^{d} +\right\} +\right]\\ +&= +\left( +\deriv{\theta} +\text{log}\left\{ +f_T(y)^d \cdot S_T(y)^{1-d} +\right\} +\right) ++ +\left( +\deriv{\theta} +\text{log}\left\{ +f_C(y)^{1-d} \cdot S_C(y)^{d} +\right\} +\right)\\ +\end{aligned} +$$ + +--- + +::: notes +As long as $\theta_C$ and $\theta_T$ don't share any parameters, then if +censoring is non-informative, the partial derivative with respect to +$\theta_T$ is: +::: + +$$ +\begin{aligned} +\ell'_{\theta_T}(y,d) +&\eqdef \deriv{\theta_T}\ell(y,d)\\ +&= +\left( +\deriv{\theta_T} +\text{log}\left\{ +f_T(y)^d \cdot S_T(y)^{1-d} +\right\} +\right) ++ +\left( +\deriv{\theta_T} +\text{log}\left\{ +f_C(y)^{1-d} \cdot S_C(y)^{d} +\right\} +\right)\\ +&= +\left( +\deriv{\theta_T} +\text{log}\left\{ +f_T(y)^d \cdot S_T(y)^{1-d} +\right\} +\right) + 0\\ +&= +\deriv{\theta_T} +\text{log}\left\{ +f_T(y)^d \cdot S_T(y)^{1-d} +\right\}\\ +\end{aligned} +$$ + +--- ---- - -::: notes -The corresponding log-likelihood is: +::: notes +Thus, the MLE for $\theta_T$ won't depend on $\theta_C$, and we can +ignore the distribution of $C$ when estimating the parameters of +$f_T(t)=p(T=t)$. ::: -$$ -\begin{aligned} -\ell(y,d) -&=\text{log}\left\{\mathcal L(y,d)\right\}\\ -&= \text{log}\left\{S_T(y) \cdot h_T(y)^d\right\}\\ -&= \text{log}\left\{S_T(y)\right\} + \text{log}\left\{h_T(y)^d\right\}\\ -&= \text{log}\left\{S_T(y)\right\} + d\cdot \text{log}\left\{h_T(y)\right\}\\ -&= -H_T(y) + d\cdot \text{log}\left\{h_T(y)\right\}\\ -\end{aligned} -$$ - -::: notes -In other words, the log-likelihood contribution from a single -observation $(Y,D) = (y,d)$ is equal to the negative cumulative hazard -at $y$, plus the log of the hazard at $y$ if the event occurred at time -$y$. -::: - -## Parametric Models for Time-to-Event Outcomes +Then: + +$$ +\begin{aligned} +\mathcal L(y,d) +&= f_T(y)^d \cdot S_T(y)^{1-d}\\ +&= \left(h_T(y)^d S_T(y)^d\right) \cdot S_T(y)^{1-d}\\ +&= h_T(y)^d \cdot S_T(y)^d \cdot S_T(y)^{1-d}\\ +&= h_T(y)^d \cdot S_T(y)\\ +&= S_T(y) \cdot h_T(y)^d \\ +\end{aligned} +$$ + +::: notes +That is, if the event occurred at time $y$ (i.e., if $d=1$), then the +likelihood of $(Y,D) = (y,d)$ is equal to the hazard function at $y$ +times the survival function at $y$. Otherwise, the likelihood is equal +to just the survival function at $y$. +::: -### Exponential Distribution +--- -- The exponential distribution is the base distribution for survival - analysis. -- The distribution has a constant hazard $\lambda$ -- The mean survival time is $\lambda^{-1}$ - ---- - -#### Mathematical details of exponential distribution - -$$ -\begin{aligned} -f(t) &= \lambda \text{e}^{-\lambda t}\\ -E(t) &= \lambda^{-1}\\ -Var(t) &= \lambda^{-2}\\ -F(t) &= 1-\text{e}^{-\lambda x}\\ -S(t)&= \text{e}^{-\lambda x}\\ -\ln(S(t))&=-\lambda x\\ -h(t) &= -\frac{f(t)}{S(t)} = -\frac{\lambda \text{e}^{-\lambda t}}{\text{e}^{-\lambda t}}=\lambda -\end{aligned} -$$ - ---- - -#### Estimating $\lambda$ {.smaller} - -- Suppose we have $m$ exponential survival times of - $t_1, t_2,\ldots,t_m$ and $k$ right-censored values at - $u_1,u_2,\ldots,u_k$. - -- A survival time of $t_i=10$ means that subject $i$ died at time 10. - A right-censored time $u_i=10$ means that at time 10, subject $i$ - was still alive and that we have no further follow-up. 
+::: notes +The corresponding log-likelihood is: +::: + +$$ +\begin{aligned} +\ell(y,d) +&=\text{log}\left\{\mathcal L(y,d)\right\}\\ +&= \text{log}\left\{S_T(y) \cdot h_T(y)^d\right\}\\ +&= \text{log}\left\{S_T(y)\right\} + \text{log}\left\{h_T(y)^d\right\}\\ +&= \text{log}\left\{S_T(y)\right\} + d\cdot \text{log}\left\{h_T(y)\right\}\\ +&= -H_T(y) + d\cdot \text{log}\left\{h_T(y)\right\}\\ +\end{aligned} +$$ + +::: notes +In other words, the log-likelihood contribution from a single +observation $(Y,D) = (y,d)$ is equal to the negative cumulative hazard +at $y$, plus the log of the hazard at $y$ if the event occurred at time +$y$. +::: + +## Parametric Models for Time-to-Event Outcomes + +### Exponential Distribution + +- The exponential distribution is the base distribution for survival + analysis. +- The distribution has a constant hazard $\lambda$ +- The mean survival time is $\lambda^{-1}$ + +--- -- For the moment we will assume that the survival distribution is - exponential and that all the subjects have the same parameter - $\lambda$. - -We have $m$ exponential survival times of $t_1, t_2,\ldots,t_m$ and $k$ -right-censored values at $u_1,u_2,\ldots,u_k$. The log-likelihood of an -observed survival time $t_i$ is $$ -\text{log}\left\{\lambda \text{e}^{-\lambda t_i}\right\} = -\text{log}\left\{\lambda\right\}-\lambda t_i -$$ and the likelihood of a censored value is the probability of that -outcome (survival greater than $u_j$) so the log-likelihood is - +#### Mathematical details of exponential distribution + +$$ +\begin{aligned} +f(t) &= \lambda \text{e}^{-\lambda t}\\ +E(t) &= \lambda^{-1}\\ +Var(t) &= \lambda^{-2}\\ +F(t) &= 1-\text{e}^{-\lambda x}\\ +S(t)&= \text{e}^{-\lambda x}\\ +\ln(S(t))&=-\lambda x\\ +h(t) &= -\frac{f(t)}{S(t)} = -\frac{\lambda \text{e}^{-\lambda t}}{\text{e}^{-\lambda t}}=\lambda +\end{aligned} $$ -\ba -\ell_j(\lambda) &= \text{log}\left\{\lambda \text{e}^{u_j}\right\} -\\ &= -\lambda u_j -\ea -$$ - ---- - -:::{#thm-mle-exp} -Let $T=\sum t_i$ and $U=\sum u_j$. Then: - -$$ -\ell(\lambda) = \frac{m}{T+U} -$$ {#eq-mle-exp} -::: - ---- - -::: proof - -$$ -\begin{aligned} -\ell(\lambda) &= \sum_{i=1}^m( \ln \lambda-\lambda t_i) + \sum_{j=1}^k (-\lambda u_j)\\ -&= m \ln \lambda -(T+U)\lambda\\ -\ell'(\lambda) -&=m\lambda^{-1} -(T+U)\\ -\hat{\lambda} &= \frac{m}{T+U} -\ea -$$ -::: - ---- - -$$ -\ba -\ell''&=-m/\lambda^2\\ -&< 0\\ -\hat E[T] &= \hat\lambda^{-1}\\ -&= \frac{T+U}{m} -\end{aligned} -$$ + +--- + +#### Estimating $\lambda$ {.smaller} + +- Suppose we have $m$ exponential survival times of + $t_1, t_2,\ldots,t_m$ and $k$ right-censored values at + $u_1,u_2,\ldots,u_k$. + +- A survival time of $t_i=10$ means that subject $i$ died at time 10. + A right-censored time $u_i=10$ means that at time 10, subject $i$ + was still alive and that we have no further follow-up. + +- For the moment we will assume that the survival distribution is + exponential and that all the subjects have the same parameter + $\lambda$. + +We have $m$ exponential survival times of $t_1, t_2,\ldots,t_m$ and $k$ +right-censored values at $u_1,u_2,\ldots,u_k$. 
The log-likelihood of an +observed survival time $t_i$ is $$ +\text{log}\left\{\lambda \text{e}^{-\lambda t_i}\right\} = +\text{log}\left\{\lambda\right\}-\lambda t_i +$$ and the likelihood of a censored value is the probability of that +outcome (survival greater than $u_j$) so the log-likelihood is + +$$ +\ba +\ell_j(\lambda) &= \text{log}\left\{\lambda \text{e}^{u_j}\right\} +\\ &= -\lambda u_j +\ea +$$ + +--- + +:::{#thm-mle-exp} +Let $T=\sum t_i$ and $U=\sum u_j$. Then: + +$$ +\hat{\lambda}_{ML} = \frac{m}{T+U} +$$ {#eq-mle-exp} +::: --- -#### Fisher Information and Standard Error +::: proof $$ \begin{aligned} -E[-\ell''] -& = m/\lambda^2\\ -\text{Var}\left(\hat\lambda\right) -&\approx \left(E[-\ell'']\right)^{-1}\\ -&=\lambda^2/m\\ -\text{SE}\left(\hat\lambda\right) -&= \sqrt{\text{Var}\left(\hat\lambda\right)}\\ -&\approx \lambda/\sqrt{m} -\end{aligned} -$$ +\ell(\lambda) &= \sum_{i=1}^m( \ln \lambda-\lambda t_i) + \sum_{j=1}^k (-\lambda u_j)\\ +&= m \ln \lambda -(T+U)\lambda\\ +\ell'(\lambda) +&=m\lambda^{-1} -(T+U)\\ +\hat{\lambda} &= \frac{m}{T+U} +\ea +$$ +::: + +--- -::: notes -$\hat\lambda$ depends on the censoring times of the censored -observations, but $\text{Var}\left(\hat\lambda\right)$ only depends on -the number of uncensored observations, $m$, and not on the number of -censored observations ($k$). -::: - ---- +$$ +\ba +\ell''&=-m/\lambda^2\\ +&< 0\\ +\hat E[T] &= \hat\lambda^{-1}\\ +&= \frac{T+U}{m} +\end{aligned} +$$ -### Other Parametric Survival Distributions +--- -- Any density on $[0,\infty)$ can be a survival distribution, but the - most useful ones are all skew right. -- The most frequently used generalization of the exponential is the [Weibull](probability.qmd#sec-weibull). -- Other common choices are the gamma, log-normal, log-logistic, - Gompertz, inverse Gaussian, and Pareto. -- Most of what we do going forward is non-parametric or - semi-parametric, but sometimes these parametric distributions - provide a useful approach. - -## Nonparametric Survival Analysis - -### Basic ideas - -- Mostly, we work without a parametric model. +#### Fisher Information and Standard Error + +$$ +\begin{aligned} +E[-\ell''] +& = m/\lambda^2\\ +\text{Var}\left(\hat\lambda\right) +&\approx \left(E[-\ell'']\right)^{-1}\\ +&=\lambda^2/m\\ +\text{SE}\left(\hat\lambda\right) +&= \sqrt{\text{Var}\left(\hat\lambda\right)}\\ +&\approx \lambda/\sqrt{m} +\end{aligned} +$$ -- The first task is to estimate a survival function from data listing - survival times, and censoring times for censored data. - -- For example one patient may have relapsed at 10 months. Another - might have been followed for 32 months without a relapse having - occurred (censored). +::: notes +$\hat\lambda$ depends on the censoring times of the censored +observations, but $\text{Var}\left(\hat\lambda\right)$ only depends on +the number of uncensored observations, $m$, and not on the number of +censored observations ($k$). +::: -- The minimum information we need for each patient is a time and a - censoring variable which is 1 if the event occurred at the indicated - time and 0 if this is a censoring time. +--- + +### Other Parametric Survival Distributions -## Example: clinical trial for pediatric acute leukemia - -### Overview of study {.smaller} - -This is from a clinical trial in 1963 for 6-MP treatment vs. placebo for -Acute Leukemia in 42 children. - -- Pairs of children: +- Any density on $[0,\infty)$ can be a survival distribution, but the + most useful ones are all skew right. 
+- The most frequently used generalization of the exponential is the [Weibull](probability.qmd#sec-weibull). +- Other common choices are the gamma, log-normal, log-logistic, + Gompertz, inverse Gaussian, and Pareto. +- Most of what we do going forward is non-parametric or + semi-parametric, but sometimes these parametric distributions + provide a useful approach. - - matched by remission status at the time of treatment (`remstat`: - `1` = partial, `2` = complete) - - randomized to 6-MP (exit times in `t2`) or placebo (exit times - in `t1`) - -- Followed until relapse or end of study. - -- All of the placebo group relapsed, but some of the 6-MP group were - censored (which means they were still in remission); indicated by - `relapse` variable (`0` = censored, `1` = relapse). - -- 6-MP = 6-Mercaptopurine (Purinethol) is an anti-cancer - ("antineoplastic" or "cytotoxic") chemotherapy drug used currently - for Acute lymphoblastic leukemia (ALL). It is classified as an - antimetabolite. - -### Study design {.smaller} - -- Clinical trial in 1963 for 6-MP treatment vs. placebo for Acute Leukemia -in 42 children. -- Pairs of children: - - matched by remission status at the time of treatment (`remstat`) - - `remstat` = 1: partial - - `remstat` = 2: complete - - randomized to 6-MP (exit time: `t2`) or placebo (`t1`). -- Followed until relapse or end of study. - - All of the placebo group relapsed, - - Some of the 6-MP group were censored. - ---- +## Nonparametric Survival Analysis + +### Basic ideas + +- Mostly, we work without a parametric model. + +- The first task is to estimate a survival function from data listing + survival times, and censoring times for censored data. + +- For example one patient may have relapsed at 10 months. Another + might have been followed for 32 months without a relapse having + occurred (censored). + +- The minimum information we need for each patient is a time and a + censoring variable which is 1 if the event occurred at the indicated + time and 0 if this is a censoring time. + +## Example: clinical trial for pediatric acute leukemia + +### Overview of study {.smaller} + +This is from a clinical trial in 1963 for 6-MP treatment vs. placebo for +Acute Leukemia in 42 children. + +- Pairs of children: + + - matched by remission status at the time of treatment (`remstat`: + `1` = partial, `2` = complete) + - randomized to 6-MP (exit times in `t2`) or placebo (exit times + in `t1`) -```{r} -#| tbl-cap: "`drug6mp` pediatric acute leukemia data" -#| label: tbl-drug6mp -library(KMsurv) -data(drug6mp) -drug6mp = drug6mp |> as_tibble() |> print() -``` - -### Data documentation for `drug6mp` - -```{r, printr.help.sections = c("description", "format")} -#| fig-cap: Data documentation for `drug6mp` -#| label: fig-drug6mp-helpdoc -# library(printr) # inserts help-file output into markdown output -library(KMsurv) -?drug6mp -``` - -### Descriptive Statistics {.smaller} - -```{r} -#| tbl-cap: "Summary statistics for `drug6mp` data" -#| label: tbl-drug6mp-summary -summary(drug6mp) -``` - -::: notes - -- The average time in each group is not useful. Some of the 6-MP - patients have not relapsed at the time recorded, while all of the - placebo patients have relapsed. -- The median time is not really useful either because so many of the - 6-MP patients have not relapsed (12/21). -- Both are biased down in the 6-MP group. Remember that lower times - are worse since they indicate sooner recurrence. 
-::: - -### Exponential model - -::: notes -- We *can* compute the hazard rate, assuming an exponential model: -number of relapses divided by the sum of the exit times (@eq-mle-exp). -::: +- Followed until relapse or end of study. + +- All of the placebo group relapsed, but some of the 6-MP group were + censored (which means they were still in remission); indicated by + `relapse` variable (`0` = censored, `1` = relapse). + +- 6-MP = 6-Mercaptopurine (Purinethol) is an anti-cancer + ("antineoplastic" or "cytotoxic") chemotherapy drug used currently + for Acute lymphoblastic leukemia (ALL). It is classified as an + antimetabolite. + +### Study design {.smaller} + +- Clinical trial in 1963 for 6-MP treatment vs. placebo for Acute Leukemia +in 42 children. +- Pairs of children: + - matched by remission status at the time of treatment (`remstat`) + - `remstat` = 1: partial + - `remstat` = 2: complete + - randomized to 6-MP (exit time: `t2`) or placebo (`t1`). +- Followed until relapse or end of study. + - All of the placebo group relapsed, + - Some of the 6-MP group were censored. + +--- + +```{r} +#| tbl-cap: "`drug6mp` pediatric acute leukemia data" +#| label: tbl-drug6mp +library(KMsurv) +data(drug6mp) +drug6mp = drug6mp |> as_tibble() |> print() +``` + +### Data documentation for `drug6mp` + +```{r, printr.help.sections = c("description", "format")} +#| fig-cap: Data documentation for `drug6mp` +#| label: fig-drug6mp-helpdoc +# library(printr) # inserts help-file output into markdown output +library(KMsurv) +?drug6mp +``` + +### Descriptive Statistics {.smaller} -$$\hat\lambda = \frac{\sumin D_i}{\sumin Y_i}$$ - -::: notes -- For the placebo, that is just the reciprocal of the mean time: -::: -$$ -\ba -\hat \lambda_{\text{placebo}} -&= \frac{\sumin D_i}{\sumin Y_i} -\\ &= \frac{\sumin 1}{\sumin Y_i} -\\ &= \frac{n}{\sumin Y_i} -\\ &= \frac{1}{\bar{Y}} -\\ &= \frac{1}{`r drug6mp |> pull(t1) |> mean()`} -\\ &= `r 1/(drug6mp |> pull(t1) |> mean())` -\ea -$$ - ---- +```{r} +#| tbl-cap: "Summary statistics for `drug6mp` data" +#| label: tbl-drug6mp-summary +summary(drug6mp) +``` + +::: notes + +- The average time in each group is not useful. Some of the 6-MP + patients have not relapsed at the time recorded, while all of the + placebo patients have relapsed. +- The median time is not really useful either because so many of the + 6-MP patients have not relapsed (12/21). +- Both are biased down in the 6-MP group. Remember that lower times + are worse since they indicate sooner recurrence. +::: -- For the 6-MP group, $\hat\lambda = 9/359 = 0.025$ +### Exponential model -$$ -\ba -\hat \lambda_{\text{6-MP}} -&= \frac{\sumin D_i}{\sumin Y_i} -\\ &= \frac{9}{359} -\\ &= `r 9/359` -\ea -$$ - -- The estimated hazard in the placebo group is 4.6 times as - large as in the 6-MP group (assuming the hazard is constant over time). - - -## The Kaplan-Meier Product Limit Estimator - -### Estimating survival in datasets without censoring - -::: notes -In the `drug6mp` dataset, the estimated survival function for the placebo patients is easy to compute. -For any time $t$ in months, $S(t)$ is the fraction of patients with times greater than $t$: -::: +::: notes +- We *can* compute the hazard rate, assuming an exponential model: +number of relapses divided by the sum of the exit times (@eq-mle-exp). 
+::: + +$$\hat\lambda = \frac{\sumin D_i}{\sumin Y_i}$$ + +::: notes +- For the placebo, that is just the reciprocal of the mean time: +::: +$$ +\ba +\hat \lambda_{\text{placebo}} +&= \frac{\sumin D_i}{\sumin Y_i} +\\ &= \frac{\sumin 1}{\sumin Y_i} +\\ &= \frac{n}{\sumin Y_i} +\\ &= \frac{1}{\bar{Y}} +\\ &= \frac{1}{`r drug6mp |> pull(t1) |> mean()`} +\\ &= `r 1/(drug6mp |> pull(t1) |> mean())` +\ea +$$ -```{r} -``` - - -### Estimating survival in datasets with censoring - -- For the 6-MP patients, we cannot ignore the censored data because we - know that the time to relapse is greater than the censoring time. - -- For any time $t$ in months, we know that 6-MP patients with times - greater than $t$ have not relapsed, and those with relapse time less - than $t$ have relapsed, but we don't know if patients with censored - time less than $t$ have relapsed or not. +--- + +- For the 6-MP group, $\hat\lambda = 9/359 = 0.025$ + +$$ +\ba +\hat \lambda_{\text{6-MP}} +&= \frac{\sumin D_i}{\sumin Y_i} +\\ &= \frac{9}{359} +\\ &= `r 9/359` +\ea +$$ + +- The estimated hazard in the placebo group is 4.6 times as + large as in the 6-MP group (assuming the hazard is constant over time). -- The procedure we usually use is the Kaplan-Meier product-limit - estimator of the survival function. + +## The Kaplan-Meier Product Limit Estimator -- The Kaplan-Meier estimator is a step function (like the empirical - cdf), which changes value only at the event times, not at the - censoring times. - -- At each event time $t$, we compute the at-risk group size $Y$, which - is all those observations whose event time or censoring time is at - least $t$. - -- If $d$ of the observations have an event time (not a censoring time) - of $t$, then the group of survivors immediately following time $t$ - is reduced by the fraction $$\frac{Y-d}{Y}=1-\frac{d}{Y}$$ +### Estimating survival in datasets without censoring + +::: notes +In the `drug6mp` dataset, the estimated survival function for the placebo patients is easy to compute. +For any time $t$ in months, $S(t)$ is the fraction of patients with times greater than $t$: +::: + +```{r} +``` ---- - -:::{#def-KM-estimator} - -#### Kaplan-Meier Product-Limit Estimator of Survival Function + +### Estimating survival in datasets with censoring + +- For the 6-MP patients, we cannot ignore the censored data because we + know that the time to relapse is greater than the censoring time. -If the event times are $t_i$ with events per time of $d_i$ ($1\le i \le k$), -then the **Kaplan-Meier Product-Limit Estimator** -of the [survival function](@def-surv-fn) is: - -$$\hat S(t) = \prod_{t_i < t} \sb{\frac{1-d_i}{Y_i}}$${#eq-km-surv-est} - -where $Y_i$ is the set of observations whose time (event or censored) is -$\ge t_i$, the group at risk at time $t_i$. - -::: - ---- - -If there are no censored data, and there are $n$ data points, then just -after (say) the third event time +- For any time $t$ in months, we know that 6-MP patients with times + greater than $t$ have not relapsed, and those with relapse time less + than $t$ have relapsed, but we don't know if patients with censored + time less than $t$ have relapsed or not. + +- The procedure we usually use is the Kaplan-Meier product-limit + estimator of the survival function. + +- The Kaplan-Meier estimator is a step function (like the empirical + cdf), which changes value only at the event times, not at the + censoring times. 
+ +- At each event time $t$, we compute the at-risk group size $Y$, which + is all those observations whose event time or censoring time is at + least $t$. -$$ -\begin{aligned} -\hat S(t) -&= \prod_{t_i < t}\sb{1-\frac{d_i}{Y_i}} -\\ &= \sb{\frac{n-d_1}{n}} \sb{\frac{n-d_1-d_2}{n-d_1}} \sb{\frac{n-d_1-d_2-d_3}{n-d_1-d_2}} -\\ &= \frac{n-d_1-d_2-d_3}{n} -\\ &=1-\frac{d_1+d_2+d_3}{n} -\\ &=1-\hat F(t) -\end{aligned} -$$ - -where $\hat F(t)$ is the usual empirical CDF estimate. - -### Kaplan-Meier curve for `drug6mp` data - -Here is the Kaplan-Meier estimated survival curve for the patients who -received 6-MP in the `drug6mp` dataset (we will see code to produce -figures like this one shortly): +- If $d$ of the observations have an event time (not a censoring time) + of $t$, then the group of survivors immediately following time $t$ + is reduced by the fraction $$\frac{Y-d}{Y}=1-\frac{d}{Y}$$ + +--- + +:::{#def-KM-estimator} + +#### Kaplan-Meier Product-Limit Estimator of Survival Function + +If the event times are $t_i$ with events per time of $d_i$ ($1\le i \le k$), +then the **Kaplan-Meier Product-Limit Estimator** +of the [survival function](@def-surv-fn) is: + +$$\hat S(t) = \prod_{t_i < t} \sb{\frac{1-d_i}{Y_i}}$${#eq-km-surv-est} + +where $Y_i$ is the set of observations whose time (event or censored) is +$\ge t_i$, the group at risk at time $t_i$. -```{r} -#| fig-cap: "Kaplan-Meier Survival Curve for 6-MP Patients" -#| label: fig-KM-mp6 -# | echo: false +::: + +--- + +:::{#thm-KM-est-no-cens} +#### Kaplan-Meier Estimate with No Censored Observations -require(KMsurv) -data(drug6mp) -library(dplyr) -library(survival) - -drug6mp_km_model1 = - drug6mp |> - mutate(surv = Surv(t2, relapse)) |> - survfit(formula = surv ~ 1, data = _) - -library(ggfortify) -drug6mp_km_model1 |> - autoplot( - mark.time = TRUE, - conf.int = FALSE) + - expand_limits(y = 0) + - xlab('Time since diagnosis (months)') + - ylab("KM Survival Curve") - -``` - -### Kaplan-Meier calculations {.smaller} - -Let's compute these estimates and build the chart by hand: - -```{r} -library(KMsurv) -library(dplyr) -data(drug6mp) - -drug6mp.v2 = - drug6mp |> - as_tibble() |> - mutate( - remstat = remstat |> - case_match( - 1 ~ "partial", - 2 ~ "complete" - ), - # renaming to "outcome" while relabeling is just a style choice: - outcome = relapse |> - case_match( - 0 ~ "censored", - 1 ~ "relapsed" - ) - ) - -km.6mp = - drug6mp.v2 |> - summarize( - .by = t2, - Relapses = sum(outcome == "relapsed"), - Censored = sum(outcome == "censored")) |> - # here we add a start time row, so the graph starts at time 0: - bind_rows( - tibble( - t2 = 0, - Relapses = 0, - Censored = 0) - ) |> - # sort in time order: - arrange(t2) |> - mutate( - Exiting = Relapses + Censored, - `Study Size` = sum(Exiting), - Exited = cumsum(Exiting) |> dplyr::lag(default = 0), - `At Risk` = `Study Size` - Exited, - Hazard = Relapses / `At Risk`, - `KM Factor` = 1 - Hazard, - `Cumulative Hazard` = cumsum(`Hazard`), - `KM Survival Curve` = cumprod(`KM Factor`) - ) - -library(pander) -pander(km.6mp) -``` - ---- - -#### Summary - -For the 6-MP patients at time 6 months, there are 21 patients at risk. -At $t=6$ there are 3 relapses and 1 censored observations. - -The Kaplan-Meier factor is $(21-3)/21 = 0.857$. The number at risk for -the next time ($t=7$) is $21-3-1=17$. - -At time 7 months, there are 17 patients at risk. At $t=7$ there is 1 -relapse and 0 censored observations. The Kaplan-Meier factor is -$(17-1)/17 = 0.941$. 
The Kaplan Meier estimate is -$0.857\times0.941=0.807$. The number at risk for the next time ($t=9$) -is $17-1=16$. - ---- - -Now, let's graph this estimated survival curve using `ggplot()`: - -```{r "estimated survival curve"} -#| label: fig-km-by-hand -#| fig-cap: "KM curve for 6MP patients, calculated by hand" -library(ggplot2) -conflicts_prefer(dplyr::filter) -km.6mp |> - ggplot(aes(x = t2, y = `KM Survival Curve`)) + - geom_step() + - geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) + - expand_limits(y = c(0,1), x = 0) + - xlab('Time since diagnosis (months)') + - ylab("KM Survival Curve") + - scale_y_continuous(labels = scales::percent) -``` +If there are no censored data, and there are $n$ data points, then just +after (say) the third event time + +$$ +\begin{aligned} +\hat S(t) +&= \prod_{t_i < t}\sb{1-\frac{d_i}{Y_i}} +\\ &= \sb{\frac{n-d_1}{n}} \sb{\frac{n-d_1-d_2}{n-d_1}} \sb{\frac{n-d_1-d_2-d_3}{n-d_1-d_2}} +\\ &= \frac{n-d_1-d_2-d_3}{n} +\\ &=1-\frac{d_1+d_2+d_3}{n} +\\ &=1-\hat F(t) +\end{aligned} +$$ + +where $\hat F(t)$ is the usual empirical CDF estimate. + +::: + +### Kaplan-Meier curve for `drug6mp` data + +Here is the Kaplan-Meier estimated survival curve for the patients who +received 6-MP in the `drug6mp` dataset (we will see code to produce +figures like this one shortly): + +```{r} +#| fig-cap: "Kaplan-Meier Survival Curve for 6-MP Patients" +#| label: fig-KM-mp6 +# | echo: false + +require(KMsurv) +data(drug6mp) +library(dplyr) +library(survival) + +drug6mp_km_model1 = + drug6mp |> + mutate(surv = Surv(t2, relapse)) |> + survfit(formula = surv ~ 1, data = _) + +library(ggfortify) +drug6mp_km_model1 |> + autoplot( + mark.time = TRUE, + conf.int = FALSE) + + expand_limits(y = 0) + + xlab('Time since diagnosis (months)') + + ylab("KM Survival Curve") + +``` + +### Kaplan-Meier calculations {.smaller} + +Let's compute these estimates and build the chart by hand: + +```{r} +library(KMsurv) +library(dplyr) +data(drug6mp) + +drug6mp.v2 = + drug6mp |> + as_tibble() |> + mutate( + remstat = remstat |> + case_match( + 1 ~ "partial", + 2 ~ "complete" + ), + # renaming to "outcome" while relabeling is just a style choice: + outcome = relapse |> + case_match( + 0 ~ "censored", + 1 ~ "relapsed" + ) + ) + +km.6mp = + drug6mp.v2 |> + summarize( + .by = t2, + Relapses = sum(outcome == "relapsed"), + Censored = sum(outcome == "censored")) |> + # here we add a start time row, so the graph starts at time 0: + bind_rows( + tibble( + t2 = 0, + Relapses = 0, + Censored = 0) + ) |> + # sort in time order: + arrange(t2) |> + mutate( + Exiting = Relapses + Censored, + `Study Size` = sum(Exiting), + Exited = cumsum(Exiting) |> dplyr::lag(default = 0), + `At Risk` = `Study Size` - Exited, + Hazard = Relapses / `At Risk`, + `KM Factor` = 1 - Hazard, + `Cumulative Hazard` = cumsum(`Hazard`), + `KM Survival Curve` = cumprod(`KM Factor`) + ) + +library(pander) +pander(km.6mp) +``` + +--- + +#### Summary -## Using the `survival` package in R - -We don't have to do these calculations by hand every time; the -`survival` package and several others have functions available to -automate many of these tasks (full list: -<https://cran.r-project.org/web/views/Survival.html>). - -### The `Surv` function - -To use the `survival` package, the first step is telling R how to -combine the exit time and exit reason (censoring versus event) columns. -The `Surv()` function accomplishes this task. 
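As a quick illustration (with made-up values, not taken from the `drug6mp` data), an event
observed at time 5 and an observation right-censored at time 8 would be coded as:

```{r}
#| code-fold: show
# toy example (hypothetical times, not from drug6mp):
# the first subject had the event at t = 5; the second was censored at t = 8
library(survival)
Surv(time = c(5, 8), event = c(1, 0))
# the censored observation is printed with a trailing "+":  5  8+
```

The `drug6mp` example below applies the same idea to the real dataset.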
- -#### Example: `Surv()` with `drug6mp` data - -```{r} -#| code-fold: show -#| code-line-numbers: "5-7" -library(survival) -drug6mp.v3 = - drug6mp.v2 |> - mutate( - surv2 = Surv( - time = t2, - event = (outcome == "relapsed"))) - -print(drug6mp.v3) - -``` - -The output of `Surv()` is a vector of objects with class `Surv`. When we -print this vector: - -- observations where the event was observed are printed as the event - time (for example, `surv2 = 10` on line 1) +For the 6-MP patients at time 6 months, there are 21 patients at risk. +At $t=6$ there are 3 relapses and 1 censored observations. + +The Kaplan-Meier factor is $(21-3)/21 = 0.857$. The number at risk for +the next time ($t=7$) is $21-3-1=17$. + +At time 7 months, there are 17 patients at risk. At $t=7$ there is 1 +relapse and 0 censored observations. The Kaplan-Meier factor is +$(17-1)/17 = 0.941$. The Kaplan Meier estimate is +$0.857\times0.941=0.807$. The number at risk for the next time ($t=9$) +is $17-1=16$. + +--- + +Now, let's graph this estimated survival curve using `ggplot()`: + +```{r "estimated survival curve"} +#| label: fig-km-by-hand +#| fig-cap: "KM curve for 6MP patients, calculated by hand" +library(ggplot2) +conflicts_prefer(dplyr::filter) +km.6mp |> + ggplot(aes(x = t2, y = `KM Survival Curve`)) + + geom_step() + + geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) + + expand_limits(y = c(0,1), x = 0) + + xlab('Time since diagnosis (months)') + + ylab("KM Survival Curve") + + scale_y_continuous(labels = scales::percent) +``` + +## Using the `survival` package in R + +We don't have to do these calculations by hand every time; the +`survival` package and several others have functions available to +automate many of these tasks (full list: +<https://cran.r-project.org/web/views/Survival.html>). -- observations where the event was right-censored are printed as the - censoring time with a plus sign (`+`; for example, `surv2 = 32+` on - line 3). - -### The `survfit` function +### The `Surv` function + +To use the `survival` package, the first step is telling R how to +combine the exit time and exit reason (censoring versus event) columns. +The `Surv()` function accomplishes this task. -Once we have constructed our `Surv` variable, we can calculate the -Kaplan-Meier estimate of the survival curve using the `survfit()` -function. - -::: callout-note -The documentation for `?survfit` isn't too helpful; the -`survfit.formula` documentation is better. - -```{r, printr.help.sections = c("description", "usage")} -#| include: false -?survfit.formula -``` -::: - ---- - -#### Example: `survfit()` with `drug6mp` data - -Here we use `survfit()` to create a `survfit` object, which contains the -Kaplan-Meier estimate: - -```{r} -#| code-fold: show -drug6mp.km_model = survfit( - formula = surv2 ~ 1, - data = drug6mp.v3) -``` - -`print.survfit()` just gives some summary statistics: - -```{r} -#| code-fold: show -print(drug6mp.km_model) -``` +#### Example: `Surv()` with `drug6mp` data + +```{r} +#| code-fold: show +#| code-line-numbers: "5-7" +library(survival) +drug6mp.v3 = + drug6mp.v2 |> + mutate( + surv2 = Surv( + time = t2, + event = (outcome == "relapsed"))) + +print(drug6mp.v3) + +``` + +The output of `Surv()` is a vector of objects with class `Surv`. 
When we +print this vector: + +- observations where the event was observed are printed as the event + time (for example, `surv2 = 10` on line 1) + +- observations where the event was right-censored are printed as the + censoring time with a plus sign (`+`; for example, `surv2 = 32+` on + line 3). + +### The `survfit` function + +Once we have constructed our `Surv` variable, we can calculate the +Kaplan-Meier estimate of the survival curve using the `survfit()` +function. + +::: callout-note +The documentation for `?survfit` isn't too helpful; the +`survfit.formula` documentation is better. -`summary.survfit()` shows us the underlying Kaplan-Meier table: - -```{r} -#| code-fold: show -summary(drug6mp.km_model) -``` - ---- - -We can specify which time points we want using the `times` argument: +```{r, printr.help.sections = c("description", "usage")} +#| include: false +?survfit.formula +``` +::: + +--- + +#### Example: `survfit()` with `drug6mp` data + +Here we use `survfit()` to create a `survfit` object, which contains the +Kaplan-Meier estimate: ```{r} #| code-fold: show -summary( - drug6mp.km_model, - times = c(0, drug6mp.v3$t2)) - -``` - ---- - -```{r, printr.help.sections = c("description", "usage", "arguments")} -#| code-fold: show -?summary.survfit -``` - -### Plotting estimated survival functions - -We can plot `survfit` objects with `plot()`, `autoplot()`, or -`ggsurvplot()`: - -```{r} -#| code-fold: show -#| fig-cap: "Kaplan-Meier Survival Curve for 6-MP Patients" - -library(ggfortify) -autoplot(drug6mp.km_model) +drug6mp.km_model = survfit( + formula = surv2 ~ 1, + data = drug6mp.v3) +``` + +`print.survfit()` just gives some summary statistics: + +```{r} +#| code-fold: show +print(drug6mp.km_model) +``` + +`summary.survfit()` shows us the underlying Kaplan-Meier table: + +```{r} +#| code-fold: show +summary(drug6mp.km_model) +``` + +--- + +We can specify which time points we want using the `times` argument: -# not shown: -# plot(drug6mp.km_model) - -# library(survminer) -# ggsurvplot(drug6mp.km_model) - -``` +```{r} +#| code-fold: show +summary( + drug6mp.km_model, + times = c(0, drug6mp.v3$t2)) ---- +``` -#### quantiles of survival curve +--- -We can extract quantiles with `quantile()`: - -```{r} -#| code-line-numbers: "2" -drug6mp.km_model |> - quantile(p = c(.25, .5)) |> - as_tibble() |> - mutate(p = c(.25, .5)) |> - relocate(p, .before = everything()) -``` - -### Two-sample tests - -#### The `survdiff` function - -```{r, printr.help.sections = c("description", "usage")} -?survdiff -``` - -#### Example: `survdiff()` with `drug6mp` data - -Now we are going to compare the placebo and 6-MP data. 
We need to -reshape the data to make it usable with the standard `survival` -workflow: +```{r, printr.help.sections = c("description", "usage", "arguments")} +#| code-fold: show +?summary.survfit +``` + +### Plotting estimated survival functions + +We can plot `survfit` objects with `plot()`, `autoplot()`, or +`ggsurvplot()`: + +```{r} +#| code-fold: show +#| fig-cap: "Kaplan-Meier Survival Curve for 6-MP Patients" + +library(ggfortify) +autoplot(drug6mp.km_model) + +# not shown: +# plot(drug6mp.km_model) + +# library(survminer) +# ggsurvplot(drug6mp.km_model) + +``` -```{r} -library(survival) -library(tidyr) -drug6mp.v4 = - drug6mp.v3 |> - select(pair, remstat, t1, t2, outcome) |> - # here we are going to change the data from a wide format to long: - pivot_longer( - cols = c(t1, t2), - names_to = "treatment", - values_to = "exit_time") |> - mutate( - treatment = treatment |> - case_match( - "t1" ~ "placebo", - "t2" ~ "6-MP" - ), - outcome = if_else( - treatment == "placebo", - "relapsed", - outcome - ), - surv = Surv( - time = exit_time, - event = (outcome == "relapsed")) - ) -``` - ---- - -Using this long data format, we can fit a Kaplan-Meier curve for each -treatment group simultaneously: - -```{r} -drug6mp.km_model2 = - survfit( - formula = surv ~ treatment, - data = drug6mp.v4) -``` - ---- - -We can plot the curves in the same graph: - -```{r} -drug6mp.km_model2 |> autoplot() -``` - ---- - -We can also perform something like a t-test, where the null hypothesis -is that the curves are the same: - -```{r} -survdiff( - formula = surv ~ treatment, - data = drug6mp.v4) -``` - -By default, `survdiff()` ignores any pairing, -but we can use `strata()` to perform something similar to a paired t-test: - -```{r} -survdiff( - formula = surv ~ treatment + strata(pair), - data = drug6mp.v4) +--- + +#### quantiles of survival curve + +We can extract quantiles with `quantile()`: + +```{r} +#| code-line-numbers: "2" +drug6mp.km_model |> + quantile(p = c(.25, .5)) |> + as_tibble() |> + mutate(p = c(.25, .5)) |> + relocate(p, .before = everything()) +``` + +### Two-sample tests + +#### The `survdiff` function + +```{r, printr.help.sections = c("description", "usage")} +?survdiff +``` + +#### Example: `survdiff()` with `drug6mp` data + +Now we are going to compare the placebo and 6-MP data. We need to +reshape the data to make it usable with the standard `survival` +workflow: + +```{r} +library(survival) +library(tidyr) +drug6mp.v4 = + drug6mp.v3 |> + select(pair, remstat, t1, t2, outcome) |> + # here we are going to change the data from a wide format to long: + pivot_longer( + cols = c(t1, t2), + names_to = "treatment", + values_to = "exit_time") |> + mutate( + treatment = treatment |> + case_match( + "t1" ~ "placebo", + "t2" ~ "6-MP" + ), + outcome = if_else( + treatment == "placebo", + "relapsed", + outcome + ), + surv = Surv( + time = exit_time, + event = (outcome == "relapsed")) + ) +``` + +--- + +Using this long data format, we can fit a Kaplan-Meier curve for each +treatment group simultaneously: + +```{r} +drug6mp.km_model2 = + survfit( + formula = surv ~ treatment, + data = drug6mp.v4) +``` + +--- -``` +We can plot the curves in the same graph: -Interestingly, accounting for pairing reduces the significant of the -difference. - -## Example: Bone Marrow Transplant Data - -@copelan1991treatment +```{r} +drug6mp.km_model2 |> autoplot() +``` + --- -![Recovery process from a bone marrow transplant (Fig. 
1.1 from @klein2003survival)](images/bone marrow multi-stage model.png){#fig-bmt-mst} - ---- - -### Study design - -##### Treatment {.unnumbered} - -- **allogeneic** (from a donor) **bone marrow transplant therapy** - -##### Inclusion criteria {.unnumbered} - -- **acute myeloid leukemia (AML)** -- **acute lymphoblastic leukemia (ALL).** - -##### Possible intermediate events {.unnumbered} - -- **graft vs. host disease (GVHD)**: an immunological rejection - response to the transplant -- **platelet recovery**: a return of platelet count to normal levels. +We can also perform something like a t-test, where the null hypothesis +is that the curves are the same: + +```{r} +survdiff( + formula = surv ~ treatment, + data = drug6mp.v4) +``` + +By default, `survdiff()` ignores any pairing, +but we can use `strata()` to perform something similar to a paired t-test: + +```{r} +survdiff( + formula = surv ~ treatment + strata(pair), + data = drug6mp.v4) -One or the other, both in either order, or neither may occur. +``` -##### End point events - -- relapse of the disease -- death +Interestingly, accounting for pairing reduces the significant of the +difference. + +## Example: Bone Marrow Transplant Data -Any or all of these events may be censored. - -### `KMsurv::bmt` data in R - -```{r} -library(KMsurv) -?bmt -``` +@copelan1991treatment +--- + +![Recovery process from a bone marrow transplant (Fig. 1.1 from @klein2003survival)](images/bone marrow multi-stage model.png){#fig-bmt-mst} + +--- + +### Study design + +##### Treatment {.unnumbered} -### Analysis plan +- **allogeneic** (from a donor) **bone marrow transplant therapy** -- We concentrate for now on disease-free survival (`t2` and `d3`) for - the three risk groups, ALL, AML Low Risk, and AML High Risk. -- We will construct the Kaplan-Meier survival curves, compare them, - and test for differences. -- We will construct the cumulative hazard curves and compare them. -- We will estimate the hazard functions, interpret, and compare them. +##### Inclusion criteria {.unnumbered} + +- **acute myeloid leukemia (AML)** +- **acute lymphoblastic leukemia (ALL).** + +##### Possible intermediate events {.unnumbered} -### Survival Function Estimate and Variance - -$$\hat S(t) = \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]$$ where -$Y_i$ is the group at risk at time $t_i$. - -The estimated variance of $\hat S(t)$ is: - -:::{#thm-greenwood} -#### Greenwood's estimator for variance of Kaplan-Meier survival estimator - -$$ -\varhf{\hat S(t)} = \hat S(t)^2\sum_{t_i <t}\frac{d_i}{Y_i(Y_i-d_i)} -$${#eq-var-est-surv} -::: +- **graft vs. host disease (GVHD)**: an immunological rejection + response to the transplant +- **platelet recovery**: a return of platelet count to normal levels. + +One or the other, both in either order, or neither may occur. + +##### End point events + +- relapse of the disease +- death + +Any or all of these events may be censored. + +### `KMsurv::bmt` data in R - -We can use @eq-var-est-surv for confidence intervals for a survival function or a -difference of survival functions. - ---- - -##### Kaplan-Meier survival curves - -```{r} -#| code-summary: "code to preprocess and model `bmt` data" -library(KMsurv) -library(survival) -data(bmt) +```{r} +library(KMsurv) +?bmt +``` + +### Analysis plan + +- We concentrate for now on disease-free survival (`t2` and `d3`) for + the three risk groups, ALL, AML Low Risk, and AML High Risk. +- We will construct the Kaplan-Meier survival curves, compare them, + and test for differences. 
+- We will construct the cumulative hazard curves and compare them. +- We will estimate the hazard functions, interpret, and compare them. -bmt = - bmt |> - as_tibble() |> - mutate( - group = - group |> - factor( - labels = c("ALL","Low Risk AML","High Risk AML")), - surv = Surv(t2,d3)) +### Survival Function Estimate and Variance + +$$\hat S(t) = \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]$$ where +$Y_i$ is the group at risk at time $t_i$. + +The estimated variance of $\hat S(t)$ is: + +:::{#thm-greenwood} +#### Greenwood's estimator for variance of Kaplan-Meier survival estimator -km_model1 = survfit( - formula = surv ~ group, - data = bmt) -``` +$$ +\varhf{\hat S(t)} = \hat S(t)^2\sum_{t_i <t}\frac{d_i}{Y_i(Y_i-d_i)} +$${#eq-var-est-surv} +::: -```{r "KM survival curves for bmt data"} -#| fig-cap: "Disease-Free Survival by Disease Group" - -library(ggfortify) -autoplot( - km_model1, - conf.int = TRUE, - ylab = "Pr(disease-free survival)", - xlab = "Time since transplant (days)") + - theme_bw() + - theme(legend.position="bottom") - -``` - ---- + +We can use @eq-var-est-surv for confidence intervals for a survival function or a +difference of survival functions. + +--- + +##### Kaplan-Meier survival curves + +```{r} +#| code-summary: "code to preprocess and model `bmt` data" +library(KMsurv) +library(survival) +data(bmt) -### Understanding Greenwood's formula (optional) - -::: notes -To see where Greenwood's formula comes from, let $x_i = Y_i - d_i$. We -approximate the solution treating each time as independent, with $Y_i$ -fixed and ignore randomness in times of failure and we treat $x_i$ as -independent binomials $\text{Bin}(Y_i,p_i)$. Letting $S(t)$ be the -"true" survival function -::: +bmt = + bmt |> + as_tibble() |> + mutate( + group = + group |> + factor( + labels = c("ALL","Low Risk AML","High Risk AML")), + surv = Surv(t2,d3)) -$$ -\begin{aligned} -\hat S(t) &=\prod_{t_i<t}x_i/Y_i\\ -S(t)&=\prod_{t_i<t}p_i -\end{aligned} -$$ - -$$ -\begin{aligned} -\frac{\hat S(t)}{S(t)} - &= \prod_{t_i<t} \frac{x_i}. {p_iY_i} -\\ &= \prod_{t_i<t} \frac{\hat p_i}{p_i} -\\ &= \prod_{t_i<t} \paren{1+\frac{\hat p_i-p_i}{p_i}} -\\ &\approx 1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i} -\end{aligned} -$$ +km_model1 = survfit( + formula = surv ~ group, + data = bmt) +``` + +```{r "KM survival curves for bmt data"} +#| fig-cap: "Disease-Free Survival by Disease Group" + +library(ggfortify) +autoplot( + km_model1, + conf.int = TRUE, + ylab = "Pr(disease-free survival)", + xlab = "Time since transplant (days)") + + theme_bw() + + theme(legend.position="bottom") ---- +``` -$$ -\begin{aligned} -\text{Var}\left(\frac{\hat S(t)}{S(t)}\right) -&\approx \text{Var}\left(1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i}\right) -\\ &=\sum_{t_i<t} \frac{1}{p_i^2}\frac{p_i(1-p_i)}{Y_i} -\\ &= \sum_{t_i<t} \frac{(1-p_i)}{p_iY_i} -\\ &\approx\sum_{t_i<t} \frac{(1-x_i/Y_i)}{x_i} -\\ &=\sum_{t_i<t} \frac{Y_i-x_i}{x_iY_i} -\\ &=\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} -\\ \tf \text{Var}\left(\hat S(t)\right) -&\approx \hat S(t)^2\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} -\end{aligned} +--- + +### Understanding Greenwood's formula (optional) + +::: notes +To see where Greenwood's formula comes from, let $x_i = Y_i - d_i$. We +approximate the solution treating each time as independent, with $Y_i$ +fixed and ignore randomness in times of failure and we treat $x_i$ as +independent binomials $\text{Bin}(Y_i,p_i)$. 
Letting $S(t)$ be the +"true" survival function +::: + $$ - -### Test for differences among the disease groups - -Here we compute a chi-square test for assocation between disease group -(`group`) and disease-free survival: +\begin{aligned} +\hat S(t) &=\prod_{t_i<t}x_i/Y_i\\ +S(t)&=\prod_{t_i<t}p_i +\end{aligned} +$$ -```{r} -survdiff(surv ~ group, data = bmt) -``` - -### Cumulative Hazard - +$$ +\begin{aligned} +\frac{\hat S(t)}{S(t)} + &= \prod_{t_i<t} \frac{x_i}. {p_iY_i} +\\ &= \prod_{t_i<t} \frac{\hat p_i}{p_i} +\\ &= \prod_{t_i<t} \paren{1+\frac{\hat p_i-p_i}{p_i}} +\\ &\approx 1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i} +\end{aligned} $$ -\begin{aligned} -h(t) -&\eqdef P(T=t|T\ge t)\\ -&= \frac{p(T=t)}{P(T\ge t)}\\ -&= -\deriv{t}\text{log}\left\{S(t)\right\} -\end{aligned} -$$ - -The **cumulative hazard** (or **integrated hazard**) function is - -$$H(t)\eqdef \int_0^t h(t) dt$$ Since -$h(t) = -\deriv{t}\text{log}\left\{S(t)\right\}$ as shown above, we -have: - -$$ -H(t)=-\text{log}\left\{S\right\}(t) -$$ - ---- - -So we can estimate $H(t)$ as: + +--- + +$$ +\begin{aligned} +\text{Var}\left(\frac{\hat S(t)}{S(t)}\right) +&\approx \text{Var}\left(1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i}\right) +\\ &=\sum_{t_i<t} \frac{1}{p_i^2}\frac{p_i(1-p_i)}{Y_i} +\\ &= \sum_{t_i<t} \frac{(1-p_i)}{p_iY_i} +\\ &\approx\sum_{t_i<t} \frac{(1-x_i/Y_i)}{x_i} +\\ &=\sum_{t_i<t} \frac{Y_i-x_i}{x_iY_i} +\\ &=\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} +\\ \tf \text{Var}\left(\hat S(t)\right) +&\approx \hat S(t)^2\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} +\end{aligned} +$$ + +### Test for differences among the disease groups + +Here we compute a chi-square test for assocation between disease group +(`group`) and disease-free survival: -$$ -\begin{aligned} -\hat H(t) -&= -\text{log}\left\{\hat S(t)\right\}\\ -&= -\text{log}\left\{\prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]\right\}\\ -&= -\sum_{t_i < t}\text{log}\left\{1-\frac{d_i}{Y_i}\right\}\\ -\end{aligned} -$$ - -This is the **Kaplan-Meier (product-limit) estimate of cumulative -hazard**. 
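As a minimal sketch of this calculation in R (assuming the `drug6mp.km_model` object fit with
`survfit()` earlier is still available in the workspace), we can take the negative log of the
Kaplan-Meier survival estimates returned by `summary.survfit()`:

```{r}
#| code-fold: show
# sketch: the KM-based cumulative hazard estimate is the negative log of the
# KM survival estimate, evaluated just after each observed event time
# (assumes `drug6mp.km_model` from the survfit() example above is available)
km_summary = summary(drug6mp.km_model)

tibble(
  time   = km_summary$time,
  surv   = km_summary$surv,
  cumhaz = -log(km_summary$surv)
)
```

The `cumhaz` column computed this way corresponds to $\hat H_{KM}(t)$ evaluated just after each
event time.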
- ---- - -#### Example: Cumulative Hazard Curves for Bone-Marrow Transplant (`bmt`) data - -```{r} -#| fig-cap: "Disease-Free Cumulative Hazard by Disease Group" -#| label: fig-cuhaz-bmt - -autoplot( - fun = "cumhaz", - km_model1, - conf.int = FALSE, - ylab = "Cumulative hazard (disease-free survival)", - xlab = "Time since transplant (days)") + - theme_bw() + - theme(legend.position="bottom") -``` - -## Nelson-Aalen Estimates of Cumulative Hazard and Survival - ---- - -:::{#def-na-cuhaz-est} -#### Nelson-Aalen Cumulative Hazard Estimator -:::: notes -The point hazard at time $t_i$ can be estimated by $d_i/Y_i$, which -leads to the **Nelson-Aalen estimator of the cumulative hazard**: -:::: -$$\hat H_{NA}(t) \eqdef \sum_{t_i < t}\frac{d_i}{Y_i}$${#eq-NA-cuhaz-est} - -::: - ---- - -:::{#thm-var-NA-est} - -#### Variance of Nelson-Aalen estimator - -:::: notes -The variance of this estimator is approximately: -:::: - -$$ -\begin{aligned} -\hat{\text{Var}}\left(\hat H_{NA} (t)\right) -&= \sum_{t_i <t}\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\ -&\approx \sum_{t_i <t}\frac{d_i}{Y_i^2} -\end{aligned} -$${#eq-var-NA-cuhaz-est} - -::: - ---- - -Since $S(t)=\text{exp}\left\{-H(t)\right\}$, the Nelson-Aalen cumulative -hazard estimate can be converted into an alternate estimate of the -survival function: - -$$ -\begin{aligned} -\hat S_{NA}(t) -&= \text{exp}\left\{-\hat H_{NA}(t)\right\}\\ -&= \text{exp}\left\{-\sum_{t_i < t}\frac{d_i}{Y_i}\right\}\\ -&= \prod_{t_i < t}\text{exp}\left\{-\frac{d_i}{Y_i}\right\}\\ -\end{aligned} -$$ +```{r} +survdiff(surv ~ group, data = bmt) +``` + +### Cumulative Hazard + +$$ +\begin{aligned} +h(t) +&\eqdef P(T=t|T\ge t)\\ +&= \frac{p(T=t)}{P(T\ge t)}\\ +&= -\deriv{t}\text{log}\left\{S(t)\right\} +\end{aligned} +$$ + +The **cumulative hazard** (or **integrated hazard**) function is + +$$H(t)\eqdef \int_0^t h(t) dt$$ Since +$h(t) = -\deriv{t}\text{log}\left\{S(t)\right\}$ as shown above, we +have: + +$$ +H(t)=-\text{log}\left\{S\right\}(t) +$$ + +--- + +So we can estimate $H(t)$ as: + +$$ +\begin{aligned} +\hat H(t) +&= -\text{log}\left\{\hat S(t)\right\}\\ +&= -\text{log}\left\{\prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]\right\}\\ +&= -\sum_{t_i < t}\text{log}\left\{1-\frac{d_i}{Y_i}\right\}\\ +\end{aligned} +$$ + +This is the **Kaplan-Meier (product-limit) estimate of cumulative +hazard**. + +--- + +#### Example: Cumulative Hazard Curves for Bone-Marrow Transplant (`bmt`) data + +```{r} +#| fig-cap: "Disease-Free Cumulative Hazard by Disease Group" +#| label: fig-cuhaz-bmt + +autoplot( + fun = "cumhaz", + km_model1, + conf.int = FALSE, + ylab = "Cumulative hazard (disease-free survival)", + xlab = "Time since transplant (days)") + + theme_bw() + + theme(legend.position="bottom") +``` + +## Nelson-Aalen Estimates of Cumulative Hazard and Survival + +--- + +:::{#def-na-cuhaz-est} +#### Nelson-Aalen Cumulative Hazard Estimator +:::: notes +The point hazard at time $t_i$ can be estimated by $d_i/Y_i$, which +leads to the **Nelson-Aalen estimator of the cumulative hazard**: +:::: +$$\hat H_{NA}(t) \eqdef \sum_{t_i < t}\frac{d_i}{Y_i}$${#eq-NA-cuhaz-est} + +::: + +--- + +:::{#thm-var-NA-est} ---- +#### Variance of Nelson-Aalen estimator -Compare these with the corresponding Kaplan-Meier estimates: - -$$ -\begin{aligned} -\hat H_{KM}(t) &= -\sum_{t_i < t}\text{log}\left\{1-\frac{d_i}{Y_i}\right\}\\ -\hat S_{KM}(t) &= \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right] -\end{aligned} -$$ - -::: notes -The product limit estimate and the Nelson-Aalen estimate often do not -differ by much. 
The latter is considered more accurate in small samples -and also directly estimates the cumulative hazard. -The `"fleming-harrington"` method for `survfit()` reduces to Nelson-Aalen -when the data are unweighted. -We can also estimate the cumulative hazard -as the negative log of the KM survival function estimate. -::: - -### Application to `bmt` dataset - -```{r} -na_fit = survfit( - formula = surv ~ group, - type = "fleming-harrington", - data = bmt) +:::: notes +The variance of this estimator is approximately: +:::: + +$$ +\begin{aligned} +\hat{\text{Var}}\left(\hat H_{NA} (t)\right) +&= \sum_{t_i <t}\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\ +&\approx \sum_{t_i <t}\frac{d_i}{Y_i^2} +\end{aligned} +$${#eq-var-NA-cuhaz-est} + +::: + +--- + +Since $S(t)=\text{exp}\left\{-H(t)\right\}$, the Nelson-Aalen cumulative +hazard estimate can be converted into an alternate estimate of the +survival function: + +$$ +\begin{aligned} +\hat S_{NA}(t) +&= \text{exp}\left\{-\hat H_{NA}(t)\right\}\\ +&= \text{exp}\left\{-\sum_{t_i < t}\frac{d_i}{Y_i}\right\}\\ +&= \prod_{t_i < t}\text{exp}\left\{-\frac{d_i}{Y_i}\right\}\\ +\end{aligned} +$$ -km_fit = survfit( - formula = surv ~ group, - type = "kaplan-meier", - data = bmt) - -km_and_na = - bind_rows( - .id = "model", - "Kaplan-Meier" = km_fit |> fortify(surv.connect = TRUE), - "Nelson-Aalen" = na_fit |> fortify(surv.connect = TRUE) - ) |> - as_tibble() - -``` - -```{r} -#| fig-cap: "Kaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group" - -km_and_na |> - ggplot(aes(x = time, y = surv, col = model)) + - geom_step() + - facet_grid(. ~ strata) + - theme_bw() + - ylab("S(t) = P(T>=t)") + - xlab("Survival time (t, days)") + - theme(legend.position = "bottom") - -``` +--- + +Compare these with the corresponding Kaplan-Meier estimates: + +$$ +\begin{aligned} +\hat H_{KM}(t) &= -\sum_{t_i < t}\text{log}\left\{1-\frac{d_i}{Y_i}\right\}\\ +\hat S_{KM}(t) &= \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right] +\end{aligned} +$$ + +::: notes +The product limit estimate and the Nelson-Aalen estimate often do not +differ by much. The latter is considered more accurate in small samples +and also directly estimates the cumulative hazard. +The `"fleming-harrington"` method for `survfit()` reduces to Nelson-Aalen +when the data are unweighted. +We can also estimate the cumulative hazard +as the negative log of the KM survival function estimate. +::: + +### Application to `bmt` dataset + +```{r} +na_fit = survfit( + formula = surv ~ group, + type = "fleming-harrington", + data = bmt) -The Kaplan-Meier and Nelson-Aalen survival estimates are very similar -for this dataset. +km_fit = survfit( + formula = surv ~ group, + type = "kaplan-meier", + data = bmt) + +km_and_na = + bind_rows( + .id = "model", + "Kaplan-Meier" = km_fit |> fortify(surv.connect = TRUE), + "Nelson-Aalen" = na_fit |> fortify(surv.connect = TRUE) + ) |> + as_tibble() + +``` + +```{r} +#| fig-cap: "Kaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group" + +km_and_na |> + ggplot(aes(x = time, y = surv, col = model)) + + geom_step() + + facet_grid(. ~ strata) + + theme_bw() + + ylab("S(t) = P(T>=t)") + + xlab("Survival time (t, days)") + + theme(legend.position = "bottom") + +``` + +The Kaplan-Meier and Nelson-Aalen survival estimates are very similar +for this dataset. diff --git a/logistic-regression.html b/logistic-regression.html index 15cfaa2be..8f50aaaf7 100644 --- a/logistic-regression.html +++ b/logistic-regression.html @@ -3703,8 +3703,8 @@

<
ggplotly(HL_plot)
-
- +
+

@@ -3855,8 +3855,8 @@

wcgs_response_resid_plot |> ggplotly()
-
- +
+

We can see a slight fan-shape here: observations on the right have larger variance (as expected since \(var(\bar y) = \pi(1-\pi)/n\) is maximized when \(\pi = 0.5\)).

@@ -3992,8 +3992,8 @@

Re
wcgs_resid_plot1 |> ggplotly()
-
- +
+

diff --git a/search.json b/search.json index 1b0abc1f5..d6a7c9f5e 100644 --- a/search.json +++ b/search.json @@ -193,7 +193,7 @@ "href": "Linear-models-overview.html#model-selection-1", "title": "\n2  Linear (Gaussian) Models\n", "section": "\n2.9 Model selection", - "text": "2.9 Model selection\n(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).\n\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\nThere are a few possible metrics to consider for choosing a “best” model.\n\n\n2.9.1 Mean squared error\nWe might want to minimize the mean squared error, \\(\\text E[(y-\\hat y)^2]\\), for new observations that weren’t in our data set when we fit the model.\nUnfortunately, \\[\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2\\] gives a biased estimate of \\(\\text E[(y-\\hat y)^2]\\) for new data. If we want an unbiased estimate, we will have to be clever.\n\nCross-validation\n\nShow R codedata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . - age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\n\n\n\nShow R coderesults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\ncomparing metrics\n\nShow R code\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n\n\n\nmodel\ncvRMSE\nr.squared\nadj.r.squared\ntrainRMSE\nloglik\n\n\n\nfull\n6.900\n0.4805\n0.3831\n5.956\n-61.84\n\n\nreduced\n6.581\n0.4454\n0.3802\n5.971\n-62.49\n\n\n\n\n\n\n\nShow R codeanova(full_model, reduced_model)\n\n\n\nRes.Df\nRSS\nDf\nSum of Sq\nF\nPr(>F)\n\n\n\n16\n567.7\nNA\nNA\nNA\nNA\n\n\n17\n606.0\n-1\n-38.36\n1.081\n0.3139\n\n\n\n\n\nstepwise regression\n\nShow R codelibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 35.648 \n#> Adj. R-Squared 0.380 Coef. 
Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. \n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n\n\nLasso\n\\[\\arg min_{\\theta} \\ell(\\theta) + \\lambda \\sum_{j=1}^p|\\beta_j|\\]\n\nShow R codelibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n\n\n\n\nShow R codeautoplot(fit, xvar = 'lambda')\n\n\n\nFigure 2.19: Lasso selection\n\n\n\n\n\n\n\n\n\nShow R codecvfit = cv.glmnet(x,y)\nplot(cvfit)\n\n\n\n\n\n\n\n\n\nShow R codecoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.2044\n#> age . \n#> weight -0.0926\n#> protein 0.8582", + "text": "2.9 Model selection\n(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).\n\nIf we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.\nThere are a few possible metrics to consider for choosing a “best” model.\n\n\n2.9.1 Mean squared error\nWe might want to minimize the mean squared error, \\(\\text E[(y-\\hat y)^2]\\), for new observations that weren’t in our data set when we fit the model.\nUnfortunately, \\[\\frac{1}{n}\\sum_{i=1}^n (y_i-\\hat y_i)^2\\] gives a biased estimate of \\(\\text E[(y-\\hat y)^2]\\) for new data. If we want an unbiased estimate, we will have to be clever.\n\nCross-validation\n\nShow R codedata(\"carbohydrate\", package = \"dobson\")\nlibrary(cvTools)\nfull_model <- lm(carbohydrate ~ ., data = carbohydrate)\ncv_full = \n full_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\nreduced_model = update(full_model, \n formula = ~ . 
- age)\n\ncv_reduced = \n reduced_model |> cvFit(\n data = carbohydrate, K = 5, R = 10,\n y = carbohydrate$carbohydrate)\n\n\n\n\nShow R coderesults_reduced = \n tibble(\n model = \"wgt+protein\",\n errs = cv_reduced$reps[])\nresults_full = \n tibble(model = \"wgt+age+protein\",\n errs = cv_full$reps[])\n\ncv_results = \n bind_rows(results_reduced, results_full)\n\ncv_results |> \n ggplot(aes(y = model, x = errs)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\ncomparing metrics\n\nShow R code\ncompare_results = tribble(\n ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,\n \"full\", cv_full$cv, summary(full_model)$r.squared, summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),\n \"reduced\", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())\n\ncompare_results\n\n\n\nmodel\ncvRMSE\nr.squared\nadj.r.squared\ntrainRMSE\nloglik\n\n\n\nfull\n7.071\n0.4805\n0.3831\n5.956\n-61.84\n\n\nreduced\n6.767\n0.4454\n0.3802\n5.971\n-62.49\n\n\n\n\n\n\n\nShow R codeanova(full_model, reduced_model)\n\n\n\nRes.Df\nRSS\nDf\nSum of Sq\nF\nPr(>F)\n\n\n\n16\n567.7\nNA\nNA\nNA\nNA\n\n\n17\n606.0\n-1\n-38.36\n1.081\n0.3139\n\n\n\n\n\nstepwise regression\n\nShow R codelibrary(olsrr)\nolsrr:::ols_step_both_aic(full_model)\n#> \n#> \n#> Stepwise Summary \n#> -------------------------------------------------------------------------\n#> Step Variable AIC SBC SBIC R2 Adj. R2 \n#> -------------------------------------------------------------------------\n#> 0 Base Model 140.773 142.764 83.068 0.00000 0.00000 \n#> 1 protein (+) 137.950 140.937 80.438 0.21427 0.17061 \n#> 2 weight (+) 132.981 136.964 77.191 0.44544 0.38020 \n#> -------------------------------------------------------------------------\n#> \n#> Final Model Output \n#> ------------------\n#> \n#> Model Summary \n#> ---------------------------------------------------------------\n#> R 0.667 RMSE 5.505 \n#> R-Squared 0.445 MSE 35.648 \n#> Adj. R-Squared 0.380 Coef. Var 15.879 \n#> Pred R-Squared 0.236 AIC 132.981 \n#> MAE 4.593 SBC 136.964 \n#> ---------------------------------------------------------------\n#> RMSE: Root Mean Square Error \n#> MSE: Mean Square Error \n#> MAE: Mean Absolute Error \n#> AIC: Akaike Information Criteria \n#> SBC: Schwarz Bayesian Criteria \n#> \n#> ANOVA \n#> -------------------------------------------------------------------\n#> Sum of \n#> Squares DF Mean Square F Sig. \n#> -------------------------------------------------------------------\n#> Regression 486.778 2 243.389 6.827 0.0067 \n#> Residual 606.022 17 35.648 \n#> Total 1092.800 19 \n#> -------------------------------------------------------------------\n#> \n#> Parameter Estimates \n#> ----------------------------------------------------------------------------------------\n#> model Beta Std. Error Std. 
Beta t Sig lower upper \n#> ----------------------------------------------------------------------------------------\n#> (Intercept) 33.130 12.572 2.635 0.017 6.607 59.654 \n#> protein 1.824 0.623 0.534 2.927 0.009 0.509 3.139 \n#> weight -0.222 0.083 -0.486 -2.662 0.016 -0.397 -0.046 \n#> ----------------------------------------------------------------------------------------\n\n\nLasso\n\\[\\arg min_{\\theta} \\ell(\\theta) + \\lambda \\sum_{j=1}^p|\\beta_j|\\]\n\nShow R codelibrary(glmnet)\ny = carbohydrate$carbohydrate\nx = carbohydrate |> \n select(age, weight, protein) |> \n as.matrix()\nfit = glmnet(x,y)\n\n\n\n\nShow R codeautoplot(fit, xvar = 'lambda')\n\n\n\nFigure 2.19: Lasso selection\n\n\n\n\n\n\n\n\n\nShow R codecvfit = cv.glmnet(x,y)\nplot(cvfit)\n\n\n\n\n\n\n\n\n\nShow R codecoef(cvfit, s = \"lambda.1se\")\n#> 4 x 1 sparse Matrix of class \"dgCMatrix\"\n#> s1\n#> (Intercept) 34.2044\n#> age . \n#> weight -0.0926\n#> protein 0.8582", "crumbs": [ "Generalized Linear Models", "2  Linear (Gaussian) Models" @@ -533,7 +533,7 @@ "href": "intro-to-survival-analysis.html#distribution-functions-for-time-to-event-variables", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.3 Distribution functions for time-to-event variables", - "text": "5.3 Distribution functions for time-to-event variables\n\n5.3.1 The Probability Density Function (PDF)\nFor a time-to-event variable \\(T\\) with a continuous distribution, the probability density function is defined as usual (see Section B.4.1).\n\nIn most time-to-event models, this density is assumed to be 0 for all \\(t<0\\); that is, \\(f(t) = 0, \\forall t<0\\). In other words, the support of \\(T\\) is typically \\([0,\\infty)\\).\n\n\n\nExample 5.1 (exponential distribution) Recall from Epi 202: the pdf of the exponential distribution family of models is:\n\\[p(T=t) = \\mathbb{1}_{t \\ge 0} \\cdot \\lambda \\text{e}^{-\\lambda t}\\]\nwhere \\(\\lambda > 0\\).\n\nHere are some examples of exponential pdfs:\n\n\n\n\n\n\n\n\n\n\n5.3.2 The Cumulative Distribution Function (CDF)\nThe cumulative distribution function is defined as:\n\\[\n\\begin{aligned}\nF(t) &\\stackrel{\\text{def}}{=}\\Pr(T \\le t)\\\\\n&=\\int_{u=-\\infty}^t f(u) du\n\\end{aligned}\n\\]\n\nExample 5.2 (exponential distribution) Recall from Epi 202: the cdf of the exponential distribution family of models is:\n\\[\nP(T\\le t) = \\mathbb{1}_{t \\ge 0} \\cdot (1- \\text{e}^{-\\lambda t})\n\\] where \\(\\lambda > 0\\).\n\nHere are some examples of exponential cdfs:\n\n\n\n\n\n\n\n\n\n5.3.3 The Survival Function\nFor survival data, a more important quantity is the survival function:\n\\[\n\\begin{aligned}\nS(t) &\\stackrel{\\text{def}}{=}\\Pr(T > t)\\\\\n&=\\int_{u=t}^\\infty p(u) du\\\\\n&=1-F(t)\\\\\n\\end{aligned}\n\\]\n\n\nDefinition 5.1 (Survival function)  \n\nThe survival function \\(S(t)\\) is the probability that the event time is later than \\(t\\). 
If the event in a clinical trial is death, then \\(S(t)\\) is the expected fraction of the original population at time 0 who have survived up to time \\(t\\) and are still alive at time \\(t\\); that is:\n\nIf \\(X_t\\) represents survival status at time \\(t\\), with \\(X_t = 1\\) denoting alive at time \\(t\\) and \\(X_t = 0\\) denoting deceased at time \\(t\\), then:\n\\[S(t) = \\Pr(X_t=1) = \\mathbb{E}[X_t]\\]\n\n\n\nExample 5.3 (exponential distribution) Since \\(S(t) = 1 - F(t)\\), the survival function of the exponential distribution family of models is:\n\\[\nP(T> t) = \\left\\{ {{\\text{e}^{-\\lambda t}, t\\ge0} \\atop {1, t \\le 0}}\\right.\n\\] where \\(\\lambda > 0\\).\nFigure 5.1 shows some examples of exponential survival functions.\n\n\n\n\n\n\nFigure 5.1: Exponential Survival Functions\n\n\n\n\n\n\n\n\n5.3.4 The Hazard Function\nAnother important quantity is the hazard function:\n\nDefinition 5.2 (Hazard function) The hazard function for a random variable \\(T\\) at value \\(t\\) is the conditional density of \\(T\\) at \\(t\\), given \\(T\\ge t\\); that is:\n\\[h(t) \\stackrel{\\text{def}}{=}p(T=t|T\\ge t)\\]\nIf \\(T\\) represents the time at which an event occurs, then \\(h(t)\\) is the probability that the event occurs at time \\(t\\), given that it has not occurred prior to time \\(t\\).\n\n\n\nThe hazard function has an important relationship to the density and survival functions, which we can use to derive the hazard function for a given probability distribution (Theorem 5.1).\n\n\nLemma 5.1 (Joint probability of a variable with itself) \\[p(T=t, T\\ge t) = p(T=t)\\]\n\nProof. Recall from Epi 202: if \\(A\\) and \\(B\\) are statistical events and \\(A\\subseteq B\\), then \\(p(A, B) = p(A)\\). In particular, \\(\\{T=t\\} \\subseteq \\{T\\geq t\\}\\), so \\(p(T=t, T\\ge t) = p(T=t)\\).\n\n\n\n\nTheorem 5.1 \\[h(t)=\\frac{f(t)}{S(t)}\\]\n\n\n\nProof. \\[\n\\begin{aligned}\nh(t) &=p(T=t|T\\ge t)\\\\\n&=\\frac{p(T=t, T\\ge t)}{p(T \\ge t)}\\\\\n&=\\frac{p(T=t)}{p(T \\ge t)}\\\\\n&=\\frac{f(t)}{S(t)}\n\\end{aligned}\n\\]\n\n\n\nExample 5.4 (exponential distribution) The hazard function of the exponential distribution family of models is:\n\\[\n\\begin{aligned}\nP(T=t|T \\ge t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{\\mathbb{1}_{t \\ge 0}\\cdot \\lambda \\text{e}^{-\\lambda t}}{\\text{e}^{-\\lambda t}}\\\\\n&=\\mathbb{1}_{t \\ge 0}\\cdot \\lambda\n\\end{aligned}\n\\] Figure 5.2 shows some examples of exponential hazard functions.\n\n\n\n\n\n\nFigure 5.2: Examples of hazard functions for exponential distributions\n\n\n\n\n\n\n\n\nWe can also view the hazard function as the derivative of the negative of the logarithm of the survival function:\n\nTheorem 5.2 (transform survival to hazard) \\[h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\]\n\n\n\nProof. 
\\[\n\\begin{aligned}\nh(t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{-S'(t)}{S(t)}\\\\\n&= -\\frac{S'(t)}{S(t)}\\\\\n&=-\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\\\\n&=\\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\n\\end{aligned}\n\\]\n\n\n5.3.5 The Cumulative Hazard Function\nSince \\(h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\) (see Theorem 5.2), we also have:\n\nCorollary 5.1 \\[S(t) = \\text{exp}\\left\\{-\\int_{u=0}^t h(u)du\\right\\} \\tag{5.1}\\]\n\n\n\nThe integral in Equation 5.1 is important enough to have its own name: cumulative hazard.\n\n\nDefinition 5.3 (cumulative hazard) The cumulative hazard function \\(H(t)\\) is defined as:\n\\[H(t) \\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\]\n\nAs we will see below, \\(H(t)\\) is tractable to estimate, and we can then derive an estimate of the hazard function using an approximate derivative of the estimated cumulative hazard.\n\n\nExample 5.5 The cumulative hazard function of the exponential distribution family of models is:\n\\[\nH(t) = \\mathbb{1}_{t \\ge 0}\\cdot \\lambda t\n\\]\nFigure 5.3 shows some examples of exponential cumulative hazard functions.\n\n\n\n\n\n\nFigure 5.3: Examples of exponential cumulative hazard functions\n\n\n\n\n\n\n\n\n5.3.6 Some Key Mathematical Relationships among Survival Concepts\nDiagram:\n\\[\nh(t) \\xrightarrow[]{\\int_{u=0}^t h(u)du} H(t)\n\\xrightarrow[]{\\text{exp}\\left\\{-H(t)\\right\\}} S(t)\n\\xrightarrow[]{1-S(t)} F(t)\n\\]\n\\[\nh(t) \\xleftarrow[\\frac{\\partial}{\\partial t}H(t)]{} H(t)\n\\xleftarrow[-\\text{log}\\left\\{S(t)\\right\\}]{} S(t)\n\\xleftarrow[1-F(t)]{} F(t)\n\\]\nIdentities:\n\\[\n\\begin{aligned}\nS(t) &= 1 - F(t)\\\\\n&= \\text{exp}\\left\\{-H(t)\\right\\}\\\\\nS'(t) &= -f(t)\\\\\nH(t) &= -\\text{log}\\left\\{S(t)\\right\\}\\\\\nH'(t) &= h(t)\\\\\nh(t) &= \\frac{f(t)}{S(t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\} \\\\\nf(t) &= h(t)\\cdot S(t)\\\\\n\\end{aligned}\n\\]\n\nSome proofs (others left as exercises):\n\\[\n\\begin{aligned}\nS'(t) &= \\frac{\\partial}{\\partial t}(1-F(t))\\\\\n&= -F'(t)\\\\\n&= -f(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n&= \\frac{S'(t)}{S(t)}\\\\\n&= -\\frac{f(t)}{S(t)}\\\\\n&= -h(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\nH(t)\n&\\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\\\\n&= \\int_0^t -\\frac{\\partial}{\\partial u}\\text{log}\\left\\{S(u)\\right\\} du\\\\\n&= \\left[-\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=0}^{u=t}\\\\\n&= \\left[\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=t}^{u=0}\\\\\n&= \\text{log}\\left\\{S(0)\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= \\text{log}\\left\\{1\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= 0 - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&=-\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\n\nCorollary:\n\\[S(t) = \\text{exp}\\left\\{-H(t)\\right\\}\\]\nExample: Time to death the US in 2004\nThe first day is the most dangerous:\n\n\n\nDaily Hazard Rates in 2004 for US Females\n\n\n\n\n\nExercise: hypothesize why these curves differ where they do?\n\n\n\nDaily Hazard Rates in 2004 for US Males and Females 1-40\n\n\n\n\n\nExercise: compare and contrast this curve with the corresponding hazard curve.\n\n\n\nSurvival Curve in 2004 for US Females\n\n\n\n\n\n5.3.7 Likelihood with censoring\nIf an event time \\(T\\) is observed exactly as 
\\(T=t\\), then the likelihood of that observation is just its probability density function:\n\\[\n\\begin{aligned}\n\\mathcal L(t)\n&= p(T=t)\\\\\n&\\stackrel{\\text{def}}{=}f_T(t)\\\\\n&= h_T(t)S_T(t)\\\\\n\\ell(t)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} + \\text{log}\\left\\{S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} - H_T(t)\\\\\n\\end{aligned}\n\\]\n\nIf instead the event time \\(T\\) is censored and only known to be after time \\(y\\), then the likelihood of that censored observation is instead the survival function evaluated at the censoring time:\n\\[\n\\begin{aligned}\n\\mathcal L(y)\n&=p_T(T>y)\\\\\n&\\stackrel{\\text{def}}{=}S_T(y)\\\\\n\\ell(y)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(y)\\right\\}\\\\\n&=\\text{log}\\left\\{S(y)\\right\\}\\\\\n&=-H(y)\\\\\n\\end{aligned}\n\\]\n\n\nWhat’s written above is incomplete. We also observed whether or not the observation was censored. Let \\(C\\) denote the time when censoring would occur (if the event did not occur first); let \\(f_C(y)\\) and \\(S_C(y)\\) be the corresponding density and survival functions for the censoring event.\nLet \\(Y\\) denote the time when observation ended (either by censoring or by the event of interest occurring), and let \\(D\\) be an indicator variable for the event occurring at \\(Y\\) (so \\(D=0\\) represents a censored observation and \\(D=1\\) represents an uncensored observation). In other words, let \\(Y \\stackrel{\\text{def}}{=}\\min(T,C)\\) and \\(D \\stackrel{\\text{def}}{=}\\mathbb 1{\\{T<=C\\}}\\).\nThen the complete likelihood of the observed data \\((Y,D)\\) is:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= p(Y=y, D=d)\\\\\n&= \\left[p(T=y,C> y)\\right]^d \\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n\\end{aligned}\n\\]\n\n\nTypically, survival analyses assume that \\(C\\) and \\(T\\) are mutually independent; this assumption is called “non-informative” censoring.\nThen the joint likelihood \\(p(Y,D)\\) factors into the product \\(p(Y), p(D)\\), and the likelihood reduces to:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= \\left[p(T=y,C> y)\\right]^d\\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n&= \\left[p(T=y)p(C> y)\\right]^d\\cdot\n\\left[p(T>y)p(C=y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)S_C(y)\\right]^d\\cdot\n\\left[S(y)f_C(y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)^d S_C(y)^d\\right]\\cdot\n\\left[S_T(y)^{1-d}f_C(y)^{1-d}\\right]\\\\\n&= \\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\end{aligned}\n\\]\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&= \\text{log}\\left\\{\\mathcal L(y,d) \\right\\}\\\\\n&= \\text{log}\\left\\{\n\\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\right\\}\\\\\n&= \\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\\\\\n\\end{aligned}\n\\] Let\n\n\n\\(\\theta_T\\) represent the parameters of \\(p_T(t)\\),\n\n\\(\\theta_C\\) represent the parameters of \\(p_C(c)\\),\n\n\\(\\theta = (\\theta_T, \\theta_C)\\) be the combined vector of all parameters.\n\n\n\nThe corresponding score function is:\n\n\\[\n\\begin{aligned}\n\\ell'(y,d)\n&= \\frac{\\partial}{\\partial \\theta}\n\\left[\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot 
S_C(y)^{d}\n\\right\\}\n\\right]\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n\\end{aligned}\n\\]\n\n\nAs long as \\(\\theta_C\\) and \\(\\theta_T\\) don’t share any parameters, then if censoring is non-informative, the partial derivative with respect to \\(\\theta_T\\) is:\n\n\\[\n\\begin{aligned}\n\\ell'_{\\theta_T}(y,d)\n&\\stackrel{\\text{def}}{=}\\frac{\\partial}{\\partial \\theta_T}\\ell(y,d)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right) + 0\\\\\n&=\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\\\\\n\\end{aligned}\n\\]\n\n\nThus, the MLE for \\(\\theta_T\\) won’t depend on \\(\\theta_C\\), and we can ignore the distribution of \\(C\\) when estimating the parameters of \\(f_T(t)=p(T=t)\\).\n\nThen:\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= f_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= \\left(h_T(y)^d S_T(y)^d\\right) \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)\\\\\n&= S_T(y) \\cdot h_T(y)^d \\\\\n\\end{aligned}\n\\]\n\nThat is, if the event occurred at time \\(y\\) (i.e., if \\(d=1\\)), then the likelihood of \\((Y,D) = (y,d)\\) is equal to the hazard function at \\(y\\) times the survival function at \\(y\\). Otherwise, the likelihood is equal to just the survival function at \\(y\\).\n\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&=\\text{log}\\left\\{\\mathcal L(y,d)\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y) \\cdot h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + \\text{log}\\left\\{h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n&= -H_T(y) + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n\\end{aligned}\n\\]\n\nIn other words, the log-likelihood contribution from a single observation \\((Y,D) = (y,d)\\) is equal to the negative cumulative hazard at \\(y\\), plus the log of the hazard at \\(y\\) if the event occurred at time \\(y\\).", + "text": "5.3 Distribution functions for time-to-event variables\n\n5.3.1 The Probability Density Function (PDF)\nFor a time-to-event variable \\(T\\) with a continuous distribution, the probability density function is defined as usual (see Section B.4.1).\n\nIn most time-to-event models, this density is assumed to be 0 for all \\(t<0\\); that is, \\(f(t) = 0, \\forall t<0\\). 
In other words, the support of \\(T\\) is typically \\([0,\\infty)\\).\n\n\n\nExample 5.1 (exponential distribution) Recall from Epi 202: the pdf of the exponential distribution family of models is:\n\\[p(T=t) = \\mathbb{1}_{t \\ge 0} \\cdot \\lambda \\text{e}^{-\\lambda t}\\]\nwhere \\(\\lambda > 0\\).\n\nHere are some examples of exponential pdfs:\n\n\n\n\n\n\n\n\n\n\n5.3.2 The Cumulative Distribution Function (CDF)\nThe cumulative distribution function is defined as:\n\\[\n\\begin{aligned}\nF(t) &\\stackrel{\\text{def}}{=}\\Pr(T \\le t)\\\\\n&=\\int_{u=-\\infty}^t f(u) du\n\\end{aligned}\n\\]\n\nExample 5.2 (exponential distribution) Recall from Epi 202: the cdf of the exponential distribution family of models is:\n\\[\nP(T\\le t) = \\mathbb{1}_{t \\ge 0} \\cdot (1- \\text{e}^{-\\lambda t})\n\\] where \\(\\lambda > 0\\).\n\nHere are some examples of exponential cdfs:\n\n\n\n\n\n\n\n\n\n5.3.3 The Survival Function\nFor survival data, a more important quantity is the survival function:\n\\[\n\\begin{aligned}\nS(t) &\\stackrel{\\text{def}}{=}\\Pr(T > t)\\\\\n&=\\int_{u=t}^\\infty p(u) du\\\\\n&=1-F(t)\\\\\n\\end{aligned}\n\\]\n\n\nDefinition 5.1 (Survival function)  \n\nThe survival function \\(S(t)\\) is the probability that the event time is later than \\(t\\). If the event in a clinical trial is death, then \\(S(t)\\) is the expected fraction of the original population at time 0 who have survived up to time \\(t\\) and are still alive at time \\(t\\); that is:\n\n\\[S(t) \\stackrel{\\text{def}}{=}\\Pr(T > t) \\tag{5.1}\\]\n\n\n\nExample 5.3 (exponential distribution) Since \\(S(t) = 1 - F(t)\\), the survival function of the exponential distribution family of models is:\n\\[\nP(T> t) = \\left\\{ {{\\text{e}^{-\\lambda t}, t\\ge0} \\atop {1, t \\le 0}}\\right.\n\\] where \\(\\lambda > 0\\).\nFigure 5.1 shows some examples of exponential survival functions.\n\n\n\n\n\n\nFigure 5.1: Exponential Survival Functions\n\n\n\n\n\n\n\n\n\nTheorem 5.1 If \\(A_t\\) represents survival status at time \\(t\\), with \\(A_t = 1\\) denoting alive at time \\(t\\) and \\(A_t = 0\\) denoting deceased at time \\(t\\), then:\n\\[S(t) = \\Pr(A_t=1) = \\mathbb{E}[A_t]\\]\n\n\n\nTheorem 5.2 If \\(T\\) is a nonnegative random variable, then:\n\\[\\mathbb{E}[T] = \\int_{t=0}^{\\infty} S(t)dt\\]\n\n\n\nProof. See https://statproofbook.github.io/P/mean-nnrvar.html or\n\n\n5.3.4 The Hazard Function\nAnother important quantity is the hazard function:\n\nDefinition 5.2 (Hazard function) The hazard function for a random variable \\(T\\) at value \\(t\\) is the conditional density of \\(T\\) at \\(t\\), given \\(T\\ge t\\); that is:\n\\[h(t) \\stackrel{\\text{def}}{=}p(T=t|T\\ge t)\\]\nIf \\(T\\) represents the time at which an event occurs, then \\(h(t)\\) is the probability that the event occurs at time \\(t\\), given that it has not occurred prior to time \\(t\\).\n\n\n\nThe hazard function has an important relationship to the density and survival functions, which we can use to derive the hazard function for a given probability distribution (Theorem 5.3).\n\n\nLemma 5.1 (Joint probability of a variable with itself) \\[p(T=t, T\\ge t) = p(T=t)\\]\n\nProof. Recall from Epi 202: if \\(A\\) and \\(B\\) are statistical events and \\(A\\subseteq B\\), then \\(p(A, B) = p(A)\\). In particular, \\(\\{T=t\\} \\subseteq \\{T\\geq t\\}\\), so \\(p(T=t, T\\ge t) = p(T=t)\\).\n\n\n\n\nTheorem 5.3 \\[h(t)=\\frac{f(t)}{S(t)}\\]\n\n\n\nProof. 
\\[\n\\begin{aligned}\nh(t) &=p(T=t|T\\ge t)\\\\\n&=\\frac{p(T=t, T\\ge t)}{p(T \\ge t)}\\\\\n&=\\frac{p(T=t)}{p(T \\ge t)}\\\\\n&=\\frac{f(t)}{S(t)}\n\\end{aligned}\n\\]\n\n\n\nExample 5.4 (exponential distribution) The hazard function of the exponential distribution family of models is:\n\\[\n\\begin{aligned}\nP(T=t|T \\ge t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{\\mathbb{1}_{t \\ge 0}\\cdot \\lambda \\text{e}^{-\\lambda t}}{\\text{e}^{-\\lambda t}}\\\\\n&=\\mathbb{1}_{t \\ge 0}\\cdot \\lambda\n\\end{aligned}\n\\] Figure 5.2 shows some examples of exponential hazard functions.\n\n\n\n\n\n\nFigure 5.2: Examples of hazard functions for exponential distributions\n\n\n\n\n\n\n\n\nWe can also view the hazard function as the derivative of the negative of the logarithm of the survival function:\n\nTheorem 5.4 (transform survival to hazard) \\[h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\]\n\n\n\nProof. \\[\n\\begin{aligned}\nh(t)\n&= \\frac{f(t)}{S(t)}\\\\\n&= \\frac{-S'(t)}{S(t)}\\\\\n&= -\\frac{S'(t)}{S(t)}\\\\\n&=-\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\\\\n&=\\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\n\\end{aligned}\n\\]\n\n\n5.3.5 The Cumulative Hazard Function\nSince \\(h(t) = \\frac{\\partial}{\\partial t}\\left\\{-\\text{log}\\left\\{S(t)\\right\\}\\right\\}\\) (see Theorem 5.4), we also have:\n\nCorollary 5.1 \\[S(t) = \\text{exp}\\left\\{-\\int_{u=0}^t h(u)du\\right\\} \\tag{5.2}\\]\n\n\n\nThe integral in Equation 5.2 is important enough to have its own name: cumulative hazard.\n\n\nDefinition 5.3 (cumulative hazard) The cumulative hazard function \\(H(t)\\) is defined as:\n\\[H(t) \\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\]\n\nAs we will see below, \\(H(t)\\) is tractable to estimate, and we can then derive an estimate of the hazard function using an approximate derivative of the estimated cumulative hazard.\n\n\nExample 5.5 The cumulative hazard function of the exponential distribution family of models is:\n\\[\nH(t) = \\mathbb{1}_{t \\ge 0}\\cdot \\lambda t\n\\]\nFigure 5.3 shows some examples of exponential cumulative hazard functions.\n\n\n\n\n\n\nFigure 5.3: Examples of exponential cumulative hazard functions\n\n\n\n\n\n\n\n\n5.3.6 Some Key Mathematical Relationships among Survival Concepts\nDiagram:\n\\[\nh(t) \\xrightarrow[]{\\int_{u=0}^t h(u)du} H(t)\n\\xrightarrow[]{\\text{exp}\\left\\{-H(t)\\right\\}} S(t)\n\\xrightarrow[]{1-S(t)} F(t)\n\\]\n\\[\nh(t) \\xleftarrow[\\frac{\\partial}{\\partial t}H(t)]{} H(t)\n\\xleftarrow[-\\text{log}\\left\\{S(t)\\right\\}]{} S(t)\n\\xleftarrow[1-F(t)]{} F(t)\n\\]\nIdentities:\n\\[\n\\begin{aligned}\nS(t) &= 1 - F(t)\\\\\n&= \\text{exp}\\left\\{-H(t)\\right\\}\\\\\nS'(t) &= -f(t)\\\\\nH(t) &= -\\text{log}\\left\\{S(t)\\right\\}\\\\\nH'(t) &= h(t)\\\\\nh(t) &= \\frac{f(t)}{S(t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\} \\\\\nf(t) &= h(t)\\cdot S(t)\\\\\n\\end{aligned}\n\\]\n\nSome proofs (others left as exercises):\n\\[\n\\begin{aligned}\nS'(t) &= \\frac{\\partial}{\\partial t}(1-F(t))\\\\\n&= -F'(t)\\\\\n&= -f(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n&= \\frac{S'(t)}{S(t)}\\\\\n&= -\\frac{f(t)}{S(t)}\\\\\n&= -h(t)\\\\\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\nH(t)\n&\\stackrel{\\text{def}}{=}\\int_{u=0}^t h(u) du\\\\\n&= \\int_0^t -\\frac{\\partial}{\\partial u}\\text{log}\\left\\{S(u)\\right\\} du\\\\\n&= 
\\left[-\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=0}^{u=t}\\\\\n&= \\left[\\text{log}\\left\\{S(u)\\right\\}\\right]_{u=t}^{u=0}\\\\\n&= \\text{log}\\left\\{S(0)\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= \\text{log}\\left\\{1\\right\\} - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&= 0 - \\text{log}\\left\\{S(t)\\right\\}\\\\\n&=-\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\n\nCorollary:\n\\[S(t) = \\text{exp}\\left\\{-H(t)\\right\\}\\]\nExample: Time to death the US in 2004\nThe first day is the most dangerous:\n\n\n\nDaily Hazard Rates in 2004 for US Females\n\n\n\n\n\nExercise: hypothesize why these curves differ where they do?\n\n\n\nDaily Hazard Rates in 2004 for US Males and Females 1-40\n\n\n\n\n\nExercise: compare and contrast this curve with the corresponding hazard curve.\n\n\n\nSurvival Curve in 2004 for US Females\n\n\n\n\n\n5.3.7 Likelihood with censoring\nIf an event time \\(T\\) is observed exactly as \\(T=t\\), then the likelihood of that observation is just its probability density function:\n\\[\n\\begin{aligned}\n\\mathcal L(t)\n&= p(T=t)\\\\\n&\\stackrel{\\text{def}}{=}f_T(t)\\\\\n&= h_T(t)S_T(t)\\\\\n\\ell(t)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} + \\text{log}\\left\\{S_T(t)\\right\\}\\\\\n&= \\text{log}\\left\\{h_T(t)\\right\\} - H_T(t)\\\\\n\\end{aligned}\n\\]\n\nIf instead the event time \\(T\\) is censored and only known to be after time \\(y\\), then the likelihood of that censored observation is instead the survival function evaluated at the censoring time:\n\\[\n\\begin{aligned}\n\\mathcal L(y)\n&=p_T(T>y)\\\\\n&\\stackrel{\\text{def}}{=}S_T(y)\\\\\n\\ell(y)\n&\\stackrel{\\text{def}}{=}\\text{log}\\left\\{\\mathcal L(y)\\right\\}\\\\\n&=\\text{log}\\left\\{S(y)\\right\\}\\\\\n&=-H(y)\\\\\n\\end{aligned}\n\\]\n\n\nWhat’s written above is incomplete. We also observed whether or not the observation was censored. Let \\(C\\) denote the time when censoring would occur (if the event did not occur first); let \\(f_C(y)\\) and \\(S_C(y)\\) be the corresponding density and survival functions for the censoring event.\nLet \\(Y\\) denote the time when observation ended (either by censoring or by the event of interest occurring), and let \\(D\\) be an indicator variable for the event occurring at \\(Y\\) (so \\(D=0\\) represents a censored observation and \\(D=1\\) represents an uncensored observation). 
In other words, let \\(Y \\stackrel{\\text{def}}{=}\\min(T,C)\\) and \\(D \\stackrel{\\text{def}}{=}\\mathbb 1{\\{T<=C\\}}\\).\nThen the complete likelihood of the observed data \\((Y,D)\\) is:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= p(Y=y, D=d)\\\\\n&= \\left[p(T=y,C> y)\\right]^d \\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n\\end{aligned}\n\\]\n\n\nTypically, survival analyses assume that \\(C\\) and \\(T\\) are mutually independent; this assumption is called “non-informative” censoring.\nThen the joint likelihood \\(p(Y,D)\\) factors into the product \\(p(Y), p(D)\\), and the likelihood reduces to:\n\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= \\left[p(T=y,C> y)\\right]^d\\cdot\n\\left[p(T>y,C=y)\\right]^{1-d}\\\\\n&= \\left[p(T=y)p(C> y)\\right]^d\\cdot\n\\left[p(T>y)p(C=y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)S_C(y)\\right]^d\\cdot\n\\left[S(y)f_C(y)\\right]^{1-d}\\\\\n&= \\left[f_T(y)^d S_C(y)^d\\right]\\cdot\n\\left[S_T(y)^{1-d}f_C(y)^{1-d}\\right]\\\\\n&= \\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\end{aligned}\n\\]\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&= \\text{log}\\left\\{\\mathcal L(y,d) \\right\\}\\\\\n&= \\text{log}\\left\\{\n\\left(f_T(y)^d \\cdot S_T(y)^{1-d}\\right)\\cdot\n\\left(f_C(y)^{1-d} \\cdot S_C(y)^{d}\\right)\n\\right\\}\\\\\n&= \\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\\\\\n\\end{aligned}\n\\] Let\n\n\n\\(\\theta_T\\) represent the parameters of \\(p_T(t)\\),\n\n\\(\\theta_C\\) represent the parameters of \\(p_C(c)\\),\n\n\\(\\theta = (\\theta_T, \\theta_C)\\) be the combined vector of all parameters.\n\n\n\nThe corresponding score function is:\n\n\\[\n\\begin{aligned}\n\\ell'(y,d)\n&= \\frac{\\partial}{\\partial \\theta}\n\\left[\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n+\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right]\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n\\end{aligned}\n\\]\n\n\nAs long as \\(\\theta_C\\) and \\(\\theta_T\\) don’t share any parameters, then if censoring is non-informative, the partial derivative with respect to \\(\\theta_T\\) is:\n\n\\[\n\\begin{aligned}\n\\ell'_{\\theta_T}(y,d)\n&\\stackrel{\\text{def}}{=}\\frac{\\partial}{\\partial \\theta_T}\\ell(y,d)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right)\n+\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_C(y)^{1-d} \\cdot S_C(y)^{d}\n\\right\\}\n\\right)\\\\\n&=\n\\left(\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\n\\right) + 0\\\\\n&=\n\\frac{\\partial}{\\partial \\theta_T}\n\\text{log}\\left\\{\nf_T(y)^d \\cdot S_T(y)^{1-d}\n\\right\\}\\\\\n\\end{aligned}\n\\]\n\n\nThus, the MLE for \\(\\theta_T\\) won’t depend on \\(\\theta_C\\), and we can ignore the distribution of \\(C\\) when estimating the parameters of \\(f_T(t)=p(T=t)\\).\n\nThen:\n\\[\n\\begin{aligned}\n\\mathcal L(y,d)\n&= f_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= \\left(h_T(y)^d S_T(y)^d\\right) \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)^d \\cdot S_T(y)^{1-d}\\\\\n&= h_T(y)^d \\cdot S_T(y)\\\\\n&= S_T(y) 
\\cdot h_T(y)^d \\\\\n\\end{aligned}\n\\]\n\nThat is, if the event occurred at time \\(y\\) (i.e., if \\(d=1\\)), then the likelihood of \\((Y,D) = (y,d)\\) is equal to the hazard function at \\(y\\) times the survival function at \\(y\\). Otherwise, the likelihood is equal to just the survival function at \\(y\\).\n\n\n\nThe corresponding log-likelihood is:\n\n\\[\n\\begin{aligned}\n\\ell(y,d)\n&=\\text{log}\\left\\{\\mathcal L(y,d)\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y) \\cdot h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + \\text{log}\\left\\{h_T(y)^d\\right\\}\\\\\n&= \\text{log}\\left\\{S_T(y)\\right\\} + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n&= -H_T(y) + d\\cdot \\text{log}\\left\\{h_T(y)\\right\\}\\\\\n\\end{aligned}\n\\]\n\nIn other words, the log-likelihood contribution from a single observation \\((Y,D) = (y,d)\\) is equal to the negative cumulative hazard at \\(y\\), plus the log of the hazard at \\(y\\) if the event occurred at time \\(y\\).", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -544,7 +544,7 @@ "href": "intro-to-survival-analysis.html#parametric-models-for-time-to-event-outcomes", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.4 Parametric Models for Time-to-Event Outcomes", - "text": "5.4 Parametric Models for Time-to-Event Outcomes\n\n5.4.1 Exponential Distribution\n\nThe exponential distribution is the base distribution for survival analysis.\nThe distribution has a constant hazard \\(\\lambda\\)\n\nThe mean survival time is \\(\\lambda^{-1}\\)\n\n\n\nMathematical details of exponential distribution\n\\[\n\\begin{aligned}\nf(t) &= \\lambda \\text{e}^{-\\lambda t}\\\\\nE(t) &= \\lambda^{-1}\\\\\nVar(t) &= \\lambda^{-2}\\\\\nF(t) &= 1-\\text{e}^{-\\lambda x}\\\\\nS(t)&= \\text{e}^{-\\lambda x}\\\\\n\\ln(S(t))&=-\\lambda x\\\\\nh(t) &= -\\frac{f(t)}{S(t)} = -\\frac{\\lambda \\text{e}^{-\\lambda t}}{\\text{e}^{-\\lambda t}}=\\lambda\n\\end{aligned}\n\\]\nEstimating \\(\\lambda\\)\n\n\nSuppose we have \\(m\\) exponential survival times of \\(t_1, t_2,\\ldots,t_m\\) and \\(k\\) right-censored values at \\(u_1,u_2,\\ldots,u_k\\).\nA survival time of \\(t_i=10\\) means that subject \\(i\\) died at time 10. A right-censored time \\(u_i=10\\) means that at time 10, subject \\(i\\) was still alive and that we have no further follow-up.\nFor the moment we will assume that the survival distribution is exponential and that all the subjects have the same parameter \\(\\lambda\\).\n\nWe have \\(m\\) exponential survival times of \\(t_1, t_2,\\ldots,t_m\\) and \\(k\\) right-censored values at \\(u_1,u_2,\\ldots,u_k\\). The log-likelihood of an observed survival time \\(t_i\\) is \\[\n\\text{log}\\left\\{\\lambda \\text{e}^{-\\lambda t_i}\\right\\} =\n\\text{log}\\left\\{\\lambda\\right\\}-\\lambda t_i\n\\] and the likelihood of a censored value is the probability of that outcome (survival greater than \\(u_j\\)) so the log-likelihood is\n\\[\n\\begin{aligned}\n\\ell_j(\\lambda) &= \\text{log}\\left\\{\\lambda \\text{e}^{u_j}\\right\\}\n\\\\ &= -\\lambda u_j\n\\end{aligned}\n\\]\n\n\nTheorem 5.3 Let \\(T=\\sum t_i\\) and \\(U=\\sum u_j\\). Then:\n\\[\n\\ell(\\lambda) = \\frac{m}{T+U}\n\\tag{5.2}\\]\n\n\n\nProof. 
\\[\n\\begin{aligned}\n\\ell(\\lambda) &= \\sum_{i=1}^m( \\ln \\lambda-\\lambda t_i) + \\sum_{j=1}^k (-\\lambda u_j)\\\\\n&= m \\ln \\lambda -(T+U)\\lambda\\\\\n\\ell'(\\lambda)\n&=m\\lambda^{-1} -(T+U)\\\\\n\\hat{\\lambda} &= \\frac{m}{T+U}\n\\end{aligned}\n\\]\n\n\n\\[\n\\begin{aligned}\n\\ell''&=-m/\\lambda^2\\\\\n&< 0\\\\\n\\hat E[T] &= \\hat\\lambda^{-1}\\\\\n&= \\frac{T+U}{m}\n\\end{aligned}\n\\]\nFisher Information and Standard Error\n\\[\n\\begin{aligned}\nE[-\\ell'']\n& = m/\\lambda^2\\\\\n\\text{Var}\\left(\\hat\\lambda\\right)\n&\\approx \\left(E[-\\ell'']\\right)^{-1}\\\\\n&=\\lambda^2/m\\\\\n\\text{SE}\\left(\\hat\\lambda\\right)\n&= \\sqrt{\\text{Var}\\left(\\hat\\lambda\\right)}\\\\\n&\\approx \\lambda/\\sqrt{m}\n\\end{aligned}\n\\]\n\n\\(\\hat\\lambda\\) depends on the censoring times of the censored observations, but \\(\\text{Var}\\left(\\hat\\lambda\\right)\\) only depends on the number of uncensored observations, \\(m\\), and not on the number of censored observations (\\(k\\)).\n\n\n5.4.2 Other Parametric Survival Distributions\n\nAny density on \\([0,\\infty)\\) can be a survival distribution, but the most useful ones are all skew right.\nThe most frequently used generalization of the exponential is the Weibull.\nOther common choices are the gamma, log-normal, log-logistic, Gompertz, inverse Gaussian, and Pareto.\nMost of what we do going forward is non-parametric or semi-parametric, but sometimes these parametric distributions provide a useful approach.", + "text": "5.4 Parametric Models for Time-to-Event Outcomes\n\n5.4.1 Exponential Distribution\n\nThe exponential distribution is the base distribution for survival analysis.\nThe distribution has a constant hazard \\(\\lambda\\)\n\nThe mean survival time is \\(\\lambda^{-1}\\)\n\n\n\nMathematical details of exponential distribution\n\\[\n\\begin{aligned}\nf(t) &= \\lambda \\text{e}^{-\\lambda t}\\\\\nE(t) &= \\lambda^{-1}\\\\\nVar(t) &= \\lambda^{-2}\\\\\nF(t) &= 1-\\text{e}^{-\\lambda x}\\\\\nS(t)&= \\text{e}^{-\\lambda x}\\\\\n\\ln(S(t))&=-\\lambda x\\\\\nh(t) &= -\\frac{f(t)}{S(t)} = -\\frac{\\lambda \\text{e}^{-\\lambda t}}{\\text{e}^{-\\lambda t}}=\\lambda\n\\end{aligned}\n\\]\nEstimating \\(\\lambda\\)\n\n\nSuppose we have \\(m\\) exponential survival times of \\(t_1, t_2,\\ldots,t_m\\) and \\(k\\) right-censored values at \\(u_1,u_2,\\ldots,u_k\\).\nA survival time of \\(t_i=10\\) means that subject \\(i\\) died at time 10. A right-censored time \\(u_i=10\\) means that at time 10, subject \\(i\\) was still alive and that we have no further follow-up.\nFor the moment we will assume that the survival distribution is exponential and that all the subjects have the same parameter \\(\\lambda\\).\n\nWe have \\(m\\) exponential survival times of \\(t_1, t_2,\\ldots,t_m\\) and \\(k\\) right-censored values at \\(u_1,u_2,\\ldots,u_k\\). The log-likelihood of an observed survival time \\(t_i\\) is \\[\n\\text{log}\\left\\{\\lambda \\text{e}^{-\\lambda t_i}\\right\\} =\n\\text{log}\\left\\{\\lambda\\right\\}-\\lambda t_i\n\\] and the likelihood of a censored value is the probability of that outcome (survival greater than \\(u_j\\)) so the log-likelihood is\n\\[\n\\begin{aligned}\n\\ell_j(\\lambda) &= \\text{log}\\left\\{\\lambda \\text{e}^{u_j}\\right\\}\n\\\\ &= -\\lambda u_j\n\\end{aligned}\n\\]\n\n\nTheorem 5.5 Let \\(T=\\sum t_i\\) and \\(U=\\sum u_j\\). Then:\n\\[\n\\hat{\\lambda}_{ML} = \\frac{m}{T+U}\n\\tag{5.3}\\]\n\n\n\nProof. 
\\[\n\\begin{aligned}\n\\ell(\\lambda) &= \\sum_{i=1}^m( \\ln \\lambda-\\lambda t_i) + \\sum_{j=1}^k (-\\lambda u_j)\\\\\n&= m \\ln \\lambda -(T+U)\\lambda\\\\\n\\ell'(\\lambda)\n&=m\\lambda^{-1} -(T+U)\\\\\n\\hat{\\lambda} &= \\frac{m}{T+U}\n\\end{aligned}\n\\]\n\n\n\\[\n\\begin{aligned}\n\\ell''&=-m/\\lambda^2\\\\\n&< 0\\\\\n\\hat E[T] &= \\hat\\lambda^{-1}\\\\\n&= \\frac{T+U}{m}\n\\end{aligned}\n\\]\nFisher Information and Standard Error\n\\[\n\\begin{aligned}\nE[-\\ell'']\n& = m/\\lambda^2\\\\\n\\text{Var}\\left(\\hat\\lambda\\right)\n&\\approx \\left(E[-\\ell'']\\right)^{-1}\\\\\n&=\\lambda^2/m\\\\\n\\text{SE}\\left(\\hat\\lambda\\right)\n&= \\sqrt{\\text{Var}\\left(\\hat\\lambda\\right)}\\\\\n&\\approx \\lambda/\\sqrt{m}\n\\end{aligned}\n\\]\n\n\\(\\hat\\lambda\\) depends on the censoring times of the censored observations, but \\(\\text{Var}\\left(\\hat\\lambda\\right)\\) only depends on the number of uncensored observations, \\(m\\), and not on the number of censored observations (\\(k\\)).\n\n\n5.4.2 Other Parametric Survival Distributions\n\nAny density on \\([0,\\infty)\\) can be a survival distribution, but the most useful ones are all skew right.\nThe most frequently used generalization of the exponential is the Weibull.\nOther common choices are the gamma, log-normal, log-logistic, Gompertz, inverse Gaussian, and Pareto.\nMost of what we do going forward is non-parametric or semi-parametric, but sometimes these parametric distributions provide a useful approach.", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -566,7 +566,7 @@ "href": "intro-to-survival-analysis.html#example-clinical-trial-for-pediatric-acute-leukemia", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.6 Example: clinical trial for pediatric acute leukemia", - "text": "5.6 Example: clinical trial for pediatric acute leukemia\n\n5.6.1 Overview of study\nThis is from a clinical trial in 1963 for 6-MP treatment vs. placebo for Acute Leukemia in 42 children.\n\n\nPairs of children:\n\nmatched by remission status at the time of treatment (remstat: 1 = partial, 2 = complete)\nrandomized to 6-MP (exit times in t2) or placebo (exit times in t1)\n\n\nFollowed until relapse or end of study.\nAll of the placebo group relapsed, but some of the 6-MP group were censored (which means they were still in remission); indicated by relapse variable (0 = censored, 1 = relapse).\n6-MP = 6-Mercaptopurine (Purinethol) is an anti-cancer (“antineoplastic” or “cytotoxic”) chemotherapy drug used currently for Acute lymphoblastic leukemia (ALL). It is classified as an antimetabolite.\n\n5.6.2 Study design\n\nClinical trial in 1963 for 6-MP treatment vs. 
placebo for Acute Leukemia in 42 children.\nPairs of children:\n\nmatched by remission status at the time of treatment (remstat)\n\n\nremstat = 1: partial\n\nremstat = 2: complete\n\n\nrandomized to 6-MP (exit time: t2) or placebo (t1).\n\n\nFollowed until relapse or end of study.\n\nAll of the placebo group relapsed,\nSome of the 6-MP group were censored.\n\n\n\n\n\n\nTable 5.1: drug6mp pediatric acute leukemia data\n\nShow R codelibrary(KMsurv)\ndata(drug6mp)\ndrug6mp = drug6mp |> as_tibble() |> print()\n#> # A tibble: 21 × 5\n#> pair remstat t1 t2 relapse\n#> <int> <int> <int> <int> <int>\n#> 1 1 1 1 10 1\n#> 2 2 2 22 7 1\n#> 3 3 2 3 32 0\n#> 4 4 2 12 23 1\n#> 5 5 2 8 22 1\n#> 6 6 1 17 6 1\n#> 7 7 2 2 16 1\n#> 8 8 2 11 34 0\n#> 9 9 2 8 32 0\n#> 10 10 2 12 25 0\n#> # ℹ 11 more rows\n\n\n\n\n\n5.6.3 Data documentation for drug6mp\n\n\nShow R code# library(printr) # inserts help-file output into markdown output\nlibrary(KMsurv)\n?drug6mp\n\n\n\n5.6.4 Descriptive Statistics\n\n\nTable 5.2: Summary statistics for drug6mp data\n\nShow R codesummary(drug6mp)\n#> pair remstat t1 t2 relapse \n#> Min. : 1 Min. :1.00 Min. : 1.00 Min. : 6.0 Min. :0.000 \n#> 1st Qu.: 6 1st Qu.:2.00 1st Qu.: 4.00 1st Qu.: 9.0 1st Qu.:0.000 \n#> Median :11 Median :2.00 Median : 8.00 Median :16.0 Median :0.000 \n#> Mean :11 Mean :1.76 Mean : 8.67 Mean :17.1 Mean :0.429 \n#> 3rd Qu.:16 3rd Qu.:2.00 3rd Qu.:12.00 3rd Qu.:23.0 3rd Qu.:1.000 \n#> Max. :21 Max. :2.00 Max. :23.00 Max. :35.0 Max. :1.000\n\n\n\n\n\n\nThe average time in each group is not useful. Some of the 6-MP patients have not relapsed at the time recorded, while all of the placebo patients have relapsed.\nThe median time is not really useful either because so many of the 6-MP patients have not relapsed (12/21).\nBoth are biased down in the 6-MP group. Remember that lower times are worse since they indicate sooner recurrence.\n\n\n\n5.6.5 Exponential model\n\n\nWe can compute the hazard rate, assuming an exponential model: number of relapses divided by the sum of the exit times (Equation 5.2).\n\n\n\\[\\hat\\lambda = \\frac{\\sum_{i=1}^nD_i}{\\sum_{i=1}^nY_i}\\]\n\n\nFor the placebo, that is just the reciprocal of the mean time:\n\n\n\\[\n\\begin{aligned}\n\\hat \\lambda_{\\text{placebo}}\n&= \\frac{\\sum_{i=1}^nD_i}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{\\sum_{i=1}^n1}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{n}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{1}{\\bar{Y}}\n\\\\ &= \\frac{1}{8.6667}\n\\\\ &= 0.1154\n\\end{aligned}\n\\]\n\n\nFor the 6-MP group, \\(\\hat\\lambda = 9/359 = 0.025\\)\n\n\n\\[\n\\begin{aligned}\n\\hat \\lambda_{\\text{6-MP}}\n&= \\frac{\\sum_{i=1}^nD_i}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{9}{359}\n\\\\ &= 0.0251\n\\end{aligned}\n\\]\n\nThe estimated hazard in the placebo group is 4.6 times as large as in the 6-MP group (assuming the hazard is constant over time).", + "text": "5.6 Example: clinical trial for pediatric acute leukemia\n\n5.6.1 Overview of study\nThis is from a clinical trial in 1963 for 6-MP treatment vs. 
placebo for Acute Leukemia in 42 children.\n\n\nPairs of children:\n\nmatched by remission status at the time of treatment (remstat: 1 = partial, 2 = complete)\nrandomized to 6-MP (exit times in t2) or placebo (exit times in t1)\n\n\nFollowed until relapse or end of study.\nAll of the placebo group relapsed, but some of the 6-MP group were censored (which means they were still in remission); indicated by relapse variable (0 = censored, 1 = relapse).\n6-MP = 6-Mercaptopurine (Purinethol) is an anti-cancer (“antineoplastic” or “cytotoxic”) chemotherapy drug used currently for Acute lymphoblastic leukemia (ALL). It is classified as an antimetabolite.\n\n5.6.2 Study design\n\nClinical trial in 1963 for 6-MP treatment vs. placebo for Acute Leukemia in 42 children.\nPairs of children:\n\nmatched by remission status at the time of treatment (remstat)\n\n\nremstat = 1: partial\n\nremstat = 2: complete\n\n\nrandomized to 6-MP (exit time: t2) or placebo (t1).\n\n\nFollowed until relapse or end of study.\n\nAll of the placebo group relapsed,\nSome of the 6-MP group were censored.\n\n\n\n\n\n\nTable 5.1: drug6mp pediatric acute leukemia data\n\nShow R codelibrary(KMsurv)\ndata(drug6mp)\ndrug6mp = drug6mp |> as_tibble() |> print()\n#> # A tibble: 21 × 5\n#> pair remstat t1 t2 relapse\n#> <int> <int> <int> <int> <int>\n#> 1 1 1 1 10 1\n#> 2 2 2 22 7 1\n#> 3 3 2 3 32 0\n#> 4 4 2 12 23 1\n#> 5 5 2 8 22 1\n#> 6 6 1 17 6 1\n#> 7 7 2 2 16 1\n#> 8 8 2 11 34 0\n#> 9 9 2 8 32 0\n#> 10 10 2 12 25 0\n#> # ℹ 11 more rows\n\n\n\n\n\n5.6.3 Data documentation for drug6mp\n\n\nShow R code# library(printr) # inserts help-file output into markdown output\nlibrary(KMsurv)\n?drug6mp\n\n\n\n5.6.4 Descriptive Statistics\n\n\nTable 5.2: Summary statistics for drug6mp data\n\nShow R codesummary(drug6mp)\n#> pair remstat t1 t2 relapse \n#> Min. : 1 Min. :1.00 Min. : 1.00 Min. : 6.0 Min. :0.000 \n#> 1st Qu.: 6 1st Qu.:2.00 1st Qu.: 4.00 1st Qu.: 9.0 1st Qu.:0.000 \n#> Median :11 Median :2.00 Median : 8.00 Median :16.0 Median :0.000 \n#> Mean :11 Mean :1.76 Mean : 8.67 Mean :17.1 Mean :0.429 \n#> 3rd Qu.:16 3rd Qu.:2.00 3rd Qu.:12.00 3rd Qu.:23.0 3rd Qu.:1.000 \n#> Max. :21 Max. :2.00 Max. :23.00 Max. :35.0 Max. :1.000\n\n\n\n\n\n\nThe average time in each group is not useful. Some of the 6-MP patients have not relapsed at the time recorded, while all of the placebo patients have relapsed.\nThe median time is not really useful either because so many of the 6-MP patients have not relapsed (12/21).\nBoth are biased down in the 6-MP group. 
Remember that lower times are worse since they indicate sooner recurrence.\n\n\n\n5.6.5 Exponential model\n\n\nWe can compute the hazard rate, assuming an exponential model: number of relapses divided by the sum of the exit times (Equation 5.3).\n\n\n\\[\\hat\\lambda = \\frac{\\sum_{i=1}^nD_i}{\\sum_{i=1}^nY_i}\\]\n\n\nFor the placebo, that is just the reciprocal of the mean time:\n\n\n\\[\n\\begin{aligned}\n\\hat \\lambda_{\\text{placebo}}\n&= \\frac{\\sum_{i=1}^nD_i}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{\\sum_{i=1}^n1}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{n}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{1}{\\bar{Y}}\n\\\\ &= \\frac{1}{8.6667}\n\\\\ &= 0.1154\n\\end{aligned}\n\\]\n\n\nFor the 6-MP group, \\(\\hat\\lambda = 9/359 = 0.025\\)\n\n\n\\[\n\\begin{aligned}\n\\hat \\lambda_{\\text{6-MP}}\n&= \\frac{\\sum_{i=1}^nD_i}{\\sum_{i=1}^nY_i}\n\\\\ &= \\frac{9}{359}\n\\\\ &= 0.0251\n\\end{aligned}\n\\]\n\nThe estimated hazard in the placebo group is 4.6 times as large as in the 6-MP group (assuming the hazard is constant over time).", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -577,7 +577,7 @@ "href": "intro-to-survival-analysis.html#the-kaplan-meier-product-limit-estimator", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.7 The Kaplan-Meier Product Limit Estimator", - "text": "5.7 The Kaplan-Meier Product Limit Estimator\n\n5.7.1 Estimating survival in datasets without censoring\n\nIn the drug6mp dataset, the estimated survival function for the placebo patients is easy to compute. For any time \\(t\\) in months, \\(S(t)\\) is the fraction of patients with times greater than \\(t\\):\n\n\n5.7.2 Estimating survival in datasets with censoring\n\nFor the 6-MP patients, we cannot ignore the censored data because we know that the time to relapse is greater than the censoring time.\nFor any time \\(t\\) in months, we know that 6-MP patients with times greater than \\(t\\) have not relapsed, and those with relapse time less than \\(t\\) have relapsed, but we don’t know if patients with censored time less than \\(t\\) have relapsed or not.\nThe procedure we usually use is the Kaplan-Meier product-limit estimator of the survival function.\nThe Kaplan-Meier estimator is a step function (like the empirical cdf), which changes value only at the event times, not at the censoring times.\nAt each event time \\(t\\), we compute the at-risk group size \\(Y\\), which is all those observations whose event time or censoring time is at least \\(t\\).\nIf \\(d\\) of the observations have an event time (not a censoring time) of \\(t\\), then the group of survivors immediately following time \\(t\\) is reduced by the fraction \\[\\frac{Y-d}{Y}=1-\\frac{d}{Y}\\]\n\n\n\nDefinition 5.4 (Kaplan-Meier Product-Limit Estimator of Survival Function) If the event times are \\(t_i\\) with events per time of \\(d_i\\) (\\(1\\le i \\le k\\)), then the Kaplan-Meier Product-Limit Estimator of the survival function is:\n\\[\\hat S(t) = \\prod_{t_i < t} \\left[\\frac{1-d_i}{Y_i}\\right] \\tag{5.3}\\]\nwhere \\(Y_i\\) is the set of observations whose time (event or censored) is \\(\\ge t_i\\), the group at risk at time \\(t_i\\).\n\n\nIf there are no censored data, and there are \\(n\\) data points, then just after (say) the third event time\n\\[\n\\begin{aligned}\n\\hat S(t)\n&= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\\\ &= \\left[\\frac{n-d_1}{n}\\right] \\left[\\frac{n-d_1-d_2}{n-d_1}\\right] \\left[\\frac{n-d_1-d_2-d_3}{n-d_1-d_2}\\right]\n\\\\ &= 
\\frac{n-d_1-d_2-d_3}{n}\n\\\\ &=1-\\frac{d_1+d_2+d_3}{n}\n\\\\ &=1-\\hat F(t)\n\\end{aligned}\n\\]\nwhere \\(\\hat F(t)\\) is the usual empirical CDF estimate.\n\n5.7.3 Kaplan-Meier curve for drug6mp data\nHere is the Kaplan-Meier estimated survival curve for the patients who received 6-MP in the drug6mp dataset (we will see code to produce figures like this one shortly):\n\nShow R code# | echo: false\n\nrequire(KMsurv)\ndata(drug6mp)\nlibrary(dplyr)\nlibrary(survival)\n\ndrug6mp_km_model1 = \n drug6mp |> \n mutate(surv = Surv(t2, relapse)) |> \n survfit(formula = surv ~ 1, data = _)\n\nlibrary(ggfortify)\ndrug6mp_km_model1 |> \n autoplot(\n mark.time = TRUE,\n conf.int = FALSE) +\n expand_limits(y = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\")\n\n\n\nFigure 5.4: Kaplan-Meier Survival Curve for 6-MP Patients\n\n\n\n\n\n\n\n\n5.7.4 Kaplan-Meier calculations\nLet’s compute these estimates and build the chart by hand:\n\nShow R codelibrary(KMsurv)\nlibrary(dplyr)\ndata(drug6mp)\n\ndrug6mp.v2 = \n drug6mp |> \n as_tibble() |> \n mutate(\n remstat = remstat |> \n case_match(\n 1 ~ \"partial\",\n 2 ~ \"complete\"\n ),\n # renaming to \"outcome\" while relabeling is just a style choice:\n outcome = relapse |> \n case_match(\n 0 ~ \"censored\",\n 1 ~ \"relapsed\"\n )\n )\n\nkm.6mp =\n drug6mp.v2 |> \n summarize(\n .by = t2,\n Relapses = sum(outcome == \"relapsed\"),\n Censored = sum(outcome == \"censored\")) |>\n # here we add a start time row, so the graph starts at time 0:\n bind_rows(\n tibble(\n t2 = 0, \n Relapses = 0, \n Censored = 0)\n ) |> \n # sort in time order:\n arrange(t2) |>\n mutate(\n Exiting = Relapses + Censored,\n `Study Size` = sum(Exiting),\n Exited = cumsum(Exiting) |> dplyr::lag(default = 0),\n `At Risk` = `Study Size` - Exited,\n Hazard = Relapses / `At Risk`,\n `KM Factor` = 1 - Hazard,\n `Cumulative Hazard` = cumsum(`Hazard`),\n `KM Survival Curve` = cumprod(`KM Factor`)\n )\n\nlibrary(pander) \npander(km.6mp)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nt2\nRelapses\nCensored\nExiting\nStudy Size\nExited\nAt Risk\nHazard\nKM Factor\nCumulative Hazard\nKM Survival Curve\n\n\n\n0\n0\n0\n0\n21\n0\n21\n0\n1\n0\n1\n\n\n6\n3\n1\n4\n21\n0\n21\n0.1429\n0.8571\n0.1429\n0.8571\n\n\n7\n1\n0\n1\n21\n4\n17\n0.05882\n0.9412\n0.2017\n0.8067\n\n\n9\n0\n1\n1\n21\n5\n16\n0\n1\n0.2017\n0.8067\n\n\n10\n1\n1\n2\n21\n6\n15\n0.06667\n0.9333\n0.2683\n0.7529\n\n\n11\n0\n1\n1\n21\n8\n13\n0\n1\n0.2683\n0.7529\n\n\n13\n1\n0\n1\n21\n9\n12\n0.08333\n0.9167\n0.3517\n0.6902\n\n\n16\n1\n0\n1\n21\n10\n11\n0.09091\n0.9091\n0.4426\n0.6275\n\n\n17\n0\n1\n1\n21\n11\n10\n0\n1\n0.4426\n0.6275\n\n\n19\n0\n1\n1\n21\n12\n9\n0\n1\n0.4426\n0.6275\n\n\n20\n0\n1\n1\n21\n13\n8\n0\n1\n0.4426\n0.6275\n\n\n22\n1\n0\n1\n21\n14\n7\n0.1429\n0.8571\n0.5854\n0.5378\n\n\n23\n1\n0\n1\n21\n15\n6\n0.1667\n0.8333\n0.7521\n0.4482\n\n\n25\n0\n1\n1\n21\n16\n5\n0\n1\n0.7521\n0.4482\n\n\n32\n0\n2\n2\n21\n17\n4\n0\n1\n0.7521\n0.4482\n\n\n34\n0\n1\n1\n21\n19\n2\n0\n1\n0.7521\n0.4482\n\n\n35\n0\n1\n1\n21\n20\n1\n0\n1\n0.7521\n0.4482\n\n\n\n\n\n\nSummary\nFor the 6-MP patients at time 6 months, there are 21 patients at risk. At \\(t=6\\) there are 3 relapses and 1 censored observations.\nThe Kaplan-Meier factor is \\((21-3)/21 = 0.857\\). The number at risk for the next time (\\(t=7\\)) is \\(21-3-1=17\\).\nAt time 7 months, there are 17 patients at risk. At \\(t=7\\) there is 1 relapse and 0 censored observations. The Kaplan-Meier factor is \\((17-1)/17 = 0.941\\). 
The Kaplan Meier estimate is \\(0.857\\times0.941=0.807\\). The number at risk for the next time (\\(t=9\\)) is \\(17-1=16\\).\n\nNow, let’s graph this estimated survival curve using ggplot():\n\nShow R codelibrary(ggplot2)\nconflicts_prefer(dplyr::filter)\nkm.6mp |> \n ggplot(aes(x = t2, y = `KM Survival Curve`)) +\n geom_step() +\n geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) +\n expand_limits(y = c(0,1), x = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\") +\n scale_y_continuous(labels = scales::percent)\n\n\n\nFigure 5.5: KM curve for 6MP patients, calculated by hand", + "text": "5.7 The Kaplan-Meier Product Limit Estimator\n\n5.7.1 Estimating survival in datasets without censoring\n\nIn the drug6mp dataset, the estimated survival function for the placebo patients is easy to compute. For any time \\(t\\) in months, \\(S(t)\\) is the fraction of patients with times greater than \\(t\\):\n\n\n5.7.2 Estimating survival in datasets with censoring\n\nFor the 6-MP patients, we cannot ignore the censored data because we know that the time to relapse is greater than the censoring time.\nFor any time \\(t\\) in months, we know that 6-MP patients with times greater than \\(t\\) have not relapsed, and those with relapse time less than \\(t\\) have relapsed, but we don’t know if patients with censored time less than \\(t\\) have relapsed or not.\nThe procedure we usually use is the Kaplan-Meier product-limit estimator of the survival function.\nThe Kaplan-Meier estimator is a step function (like the empirical cdf), which changes value only at the event times, not at the censoring times.\nAt each event time \\(t\\), we compute the at-risk group size \\(Y\\), which is all those observations whose event time or censoring time is at least \\(t\\).\nIf \\(d\\) of the observations have an event time (not a censoring time) of \\(t\\), then the group of survivors immediately following time \\(t\\) is reduced by the fraction \\[\\frac{Y-d}{Y}=1-\\frac{d}{Y}\\]\n\n\n\nDefinition 5.4 (Kaplan-Meier Product-Limit Estimator of Survival Function) If the event times are \\(t_i\\) with events per time of \\(d_i\\) (\\(1\\le i \\le k\\)), then the Kaplan-Meier Product-Limit Estimator of the survival function is:\n\\[\\hat S(t) = \\prod_{t_i < t} \\left[\\frac{1-d_i}{Y_i}\\right] \\tag{5.4}\\]\nwhere \\(Y_i\\) is the set of observations whose time (event or censored) is \\(\\ge t_i\\), the group at risk at time \\(t_i\\).\n\n\n\nTheorem 5.6 (Kaplan-Meier Estimate with No Censored Observations) If there are no censored data, and there are \\(n\\) data points, then just after (say) the third event time\n\\[\n\\begin{aligned}\n\\hat S(t)\n&= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\\\ &= \\left[\\frac{n-d_1}{n}\\right] \\left[\\frac{n-d_1-d_2}{n-d_1}\\right] \\left[\\frac{n-d_1-d_2-d_3}{n-d_1-d_2}\\right]\n\\\\ &= \\frac{n-d_1-d_2-d_3}{n}\n\\\\ &=1-\\frac{d_1+d_2+d_3}{n}\n\\\\ &=1-\\hat F(t)\n\\end{aligned}\n\\]\nwhere \\(\\hat F(t)\\) is the usual empirical CDF estimate.\n\n\n5.7.3 Kaplan-Meier curve for drug6mp data\nHere is the Kaplan-Meier estimated survival curve for the patients who received 6-MP in the drug6mp dataset (we will see code to produce figures like this one shortly):\n\nShow R code# | echo: false\n\nrequire(KMsurv)\ndata(drug6mp)\nlibrary(dplyr)\nlibrary(survival)\n\ndrug6mp_km_model1 = \n drug6mp |> \n mutate(surv = Surv(t2, relapse)) |> \n survfit(formula = surv ~ 1, data = _)\n\nlibrary(ggfortify)\ndrug6mp_km_model1 |> \n autoplot(\n mark.time 
= TRUE,\n conf.int = FALSE) +\n expand_limits(y = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\")\n\n\n\nFigure 5.4: Kaplan-Meier Survival Curve for 6-MP Patients\n\n\n\n\n\n\n\n\n5.7.4 Kaplan-Meier calculations\nLet’s compute these estimates and build the chart by hand:\n\nShow R codelibrary(KMsurv)\nlibrary(dplyr)\ndata(drug6mp)\n\ndrug6mp.v2 = \n drug6mp |> \n as_tibble() |> \n mutate(\n remstat = remstat |> \n case_match(\n 1 ~ \"partial\",\n 2 ~ \"complete\"\n ),\n # renaming to \"outcome\" while relabeling is just a style choice:\n outcome = relapse |> \n case_match(\n 0 ~ \"censored\",\n 1 ~ \"relapsed\"\n )\n )\n\nkm.6mp =\n drug6mp.v2 |> \n summarize(\n .by = t2,\n Relapses = sum(outcome == \"relapsed\"),\n Censored = sum(outcome == \"censored\")) |>\n # here we add a start time row, so the graph starts at time 0:\n bind_rows(\n tibble(\n t2 = 0, \n Relapses = 0, \n Censored = 0)\n ) |> \n # sort in time order:\n arrange(t2) |>\n mutate(\n Exiting = Relapses + Censored,\n `Study Size` = sum(Exiting),\n Exited = cumsum(Exiting) |> dplyr::lag(default = 0),\n `At Risk` = `Study Size` - Exited,\n Hazard = Relapses / `At Risk`,\n `KM Factor` = 1 - Hazard,\n `Cumulative Hazard` = cumsum(`Hazard`),\n `KM Survival Curve` = cumprod(`KM Factor`)\n )\n\nlibrary(pander) \npander(km.6mp)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nt2\nRelapses\nCensored\nExiting\nStudy Size\nExited\nAt Risk\nHazard\nKM Factor\nCumulative Hazard\nKM Survival Curve\n\n\n\n0\n0\n0\n0\n21\n0\n21\n0\n1\n0\n1\n\n\n6\n3\n1\n4\n21\n0\n21\n0.1429\n0.8571\n0.1429\n0.8571\n\n\n7\n1\n0\n1\n21\n4\n17\n0.05882\n0.9412\n0.2017\n0.8067\n\n\n9\n0\n1\n1\n21\n5\n16\n0\n1\n0.2017\n0.8067\n\n\n10\n1\n1\n2\n21\n6\n15\n0.06667\n0.9333\n0.2683\n0.7529\n\n\n11\n0\n1\n1\n21\n8\n13\n0\n1\n0.2683\n0.7529\n\n\n13\n1\n0\n1\n21\n9\n12\n0.08333\n0.9167\n0.3517\n0.6902\n\n\n16\n1\n0\n1\n21\n10\n11\n0.09091\n0.9091\n0.4426\n0.6275\n\n\n17\n0\n1\n1\n21\n11\n10\n0\n1\n0.4426\n0.6275\n\n\n19\n0\n1\n1\n21\n12\n9\n0\n1\n0.4426\n0.6275\n\n\n20\n0\n1\n1\n21\n13\n8\n0\n1\n0.4426\n0.6275\n\n\n22\n1\n0\n1\n21\n14\n7\n0.1429\n0.8571\n0.5854\n0.5378\n\n\n23\n1\n0\n1\n21\n15\n6\n0.1667\n0.8333\n0.7521\n0.4482\n\n\n25\n0\n1\n1\n21\n16\n5\n0\n1\n0.7521\n0.4482\n\n\n32\n0\n2\n2\n21\n17\n4\n0\n1\n0.7521\n0.4482\n\n\n34\n0\n1\n1\n21\n19\n2\n0\n1\n0.7521\n0.4482\n\n\n35\n0\n1\n1\n21\n20\n1\n0\n1\n0.7521\n0.4482\n\n\n\n\n\n\nSummary\nFor the 6-MP patients at time 6 months, there are 21 patients at risk. At \\(t=6\\) there are 3 relapses and 1 censored observations.\nThe Kaplan-Meier factor is \\((21-3)/21 = 0.857\\). The number at risk for the next time (\\(t=7\\)) is \\(21-3-1=17\\).\nAt time 7 months, there are 17 patients at risk. At \\(t=7\\) there is 1 relapse and 0 censored observations. The Kaplan-Meier factor is \\((17-1)/17 = 0.941\\). The Kaplan Meier estimate is \\(0.857\\times0.941=0.807\\). 
The number at risk for the next time (\\(t=9\\)) is \\(17-1=16\\).\n\nNow, let’s graph this estimated survival curve using ggplot():\n\nShow R codelibrary(ggplot2)\nconflicts_prefer(dplyr::filter)\nkm.6mp |> \n ggplot(aes(x = t2, y = `KM Survival Curve`)) +\n geom_step() +\n geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) +\n expand_limits(y = c(0,1), x = 0) +\n xlab('Time since diagnosis (months)') +\n ylab(\"KM Survival Curve\") +\n scale_y_continuous(labels = scales::percent)\n\n\n\nFigure 5.5: KM curve for 6MP patients, calculated by hand", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -599,7 +599,7 @@ "href": "intro-to-survival-analysis.html#example-bone-marrow-transplant-data", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.9 Example: Bone Marrow Transplant Data", - "text": "5.9 Example: Bone Marrow Transplant Data\nCopelan et al. (1991)\n\n\n\nFigure 5.6: Recovery process from a bone marrow transplant (Fig. 1.1 from Klein, Moeschberger, et al. (2003))\n\n\n\n\n\n\n\n5.9.1 Study design\nTreatment\n\n\nallogeneic (from a donor) bone marrow transplant therapy\n\nInclusion criteria\n\nacute myeloid leukemia (AML)\nacute lymphoblastic leukemia (ALL).\nPossible intermediate events\n\n\ngraft vs. host disease (GVHD): an immunological rejection response to the transplant\n\nplatelet recovery: a return of platelet count to normal levels.\n\nOne or the other, both in either order, or neither may occur.\nEnd point events\n\nrelapse of the disease\ndeath\n\nAny or all of these events may be censored.\n\n5.9.2 KMsurv::bmt data in R\n\nShow R codelibrary(KMsurv)\n?bmt\n\n\n\n5.9.3 Analysis plan\n\nWe concentrate for now on disease-free survival (t2 and d3) for the three risk groups, ALL, AML Low Risk, and AML High Risk.\nWe will construct the Kaplan-Meier survival curves, compare them, and test for differences.\nWe will construct the cumulative hazard curves and compare them.\nWe will estimate the hazard functions, interpret, and compare them.\n\n5.9.4 Survival Function Estimate and Variance\n\\[\\hat S(t) = \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\] where \\(Y_i\\) is the group at risk at time \\(t_i\\).\nThe estimated variance of \\(\\hat S(t)\\) is:\n\nTheorem 5.4 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \\[\n\\widehat{\\text{Var}}\\left(\\hat S(t)\\right) = \\hat S(t)^2\\sum_{t_i <t}\\frac{d_i}{Y_i(Y_i-d_i)}\n\\tag{5.4}\\]\n\nWe can use Equation 5.4 for confidence intervals for a survival function or a difference of survival functions.\n\nKaplan-Meier survival curves\n\ncode to preprocess and model bmt datalibrary(KMsurv)\nlibrary(survival)\ndata(bmt)\n\nbmt = \n bmt |> \n as_tibble() |> \n mutate(\n group = \n group |> \n factor(\n labels = c(\"ALL\",\"Low Risk AML\",\"High Risk AML\")),\n surv = Surv(t2,d3))\n\nkm_model1 = survfit(\n formula = surv ~ group, \n data = bmt)\n\n\n\nShow R codelibrary(ggfortify)\nautoplot(\n km_model1, \n conf.int = TRUE,\n ylab = \"Pr(disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\nDisease-Free Survival by Disease Group\n\n\n\n\n\n5.9.5 Understanding Greenwood’s formula (optional)\n\nTo see where Greenwood’s formula comes from, let \\(x_i = Y_i - d_i\\). We approximate the solution treating each time as independent, with \\(Y_i\\) fixed and ignore randomness in times of failure and we treat \\(x_i\\) as independent binomials \\(\\text{Bin}(Y_i,p_i)\\). 
Letting \\(S(t)\\) be the “true” survival function\n\n\\[\n\\begin{aligned}\n\\hat S(t) &=\\prod_{t_i<t}x_i/Y_i\\\\\nS(t)&=\\prod_{t_i<t}p_i\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\n\\frac{\\hat S(t)}{S(t)}\n &= \\prod_{t_i<t} \\frac{x_i}. {p_iY_i}\n\\\\ &= \\prod_{t_i<t} \\frac{\\hat p_i}{p_i}\n\\\\ &= \\prod_{t_i<t} \\left(1+\\frac{\\hat p_i-p_i}{p_i}\\right)\n\\\\ &\\approx 1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\text{Var}\\left(\\frac{\\hat S(t)}{S(t)}\\right)\n&\\approx \\text{Var}\\left(1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\\right)\n\\\\ &=\\sum_{t_i<t} \\frac{1}{p_i^2}\\frac{p_i(1-p_i)}{Y_i}\n\\\\ &= \\sum_{t_i<t} \\frac{(1-p_i)}{p_iY_i}\n\\\\ &\\approx\\sum_{t_i<t} \\frac{(1-x_i/Y_i)}{x_i}\n\\\\ &=\\sum_{t_i<t} \\frac{Y_i-x_i}{x_iY_i}\n\\\\ &=\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\\\ \\therefore\\text{Var}\\left(\\hat S(t)\\right)\n&\\approx \\hat S(t)^2\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\end{aligned}\n\\]\n\n5.9.6 Test for differences among the disease groups\nHere we compute a chi-square test for assocation between disease group (group) and disease-free survival:\n\nShow R codesurvdiff(surv ~ group, data = bmt)\n#> Call:\n#> survdiff(formula = surv ~ group, data = bmt)\n#> \n#> N Observed Expected (O-E)^2/E (O-E)^2/V\n#> group=ALL 38 24 21.9 0.211 0.289\n#> group=Low Risk AML 54 25 40.0 5.604 11.012\n#> group=High Risk AML 45 34 21.2 7.756 10.529\n#> \n#> Chisq= 13.8 on 2 degrees of freedom, p= 0.001\n\n\n\n5.9.7 Cumulative Hazard\n\\[\n\\begin{aligned}\nh(t)\n&\\stackrel{\\text{def}}{=}P(T=t|T\\ge t)\\\\\n&= \\frac{p(T=t)}{P(T\\ge t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\nThe cumulative hazard (or integrated hazard) function is\n\\[H(t)\\stackrel{\\text{def}}{=}\\int_0^t h(t) dt\\] Since \\(h(t) = -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\) as shown above, we have:\n\\[\nH(t)=-\\text{log}\\left\\{S\\right\\}(t)\n\\]\n\nSo we can estimate \\(H(t)\\) as:\n\\[\n\\begin{aligned}\n\\hat H(t)\n&= -\\text{log}\\left\\{\\hat S(t)\\right\\}\\\\\n&= -\\text{log}\\left\\{\\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\right\\}\\\\\n&= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\nThis is the Kaplan-Meier (product-limit) estimate of cumulative hazard.\n\nExample: Cumulative Hazard Curves for Bone-Marrow Transplant (bmt) data\n\nShow R codeautoplot(\n fun = \"cumhaz\",\n km_model1, \n conf.int = FALSE,\n ylab = \"Cumulative hazard (disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\n\nFigure 5.7: Disease-Free Cumulative Hazard by Disease Group", + "text": "5.9 Example: Bone Marrow Transplant Data\nCopelan et al. (1991)\n\n\n\nFigure 5.6: Recovery process from a bone marrow transplant (Fig. 1.1 from Klein, Moeschberger, et al. (2003))\n\n\n\n\n\n\n\n5.9.1 Study design\nTreatment\n\n\nallogeneic (from a donor) bone marrow transplant therapy\n\nInclusion criteria\n\nacute myeloid leukemia (AML)\nacute lymphoblastic leukemia (ALL).\nPossible intermediate events\n\n\ngraft vs. 
host disease (GVHD): an immunological rejection response to the transplant\n\nplatelet recovery: a return of platelet count to normal levels.\n\nOne or the other, both in either order, or neither may occur.\nEnd point events\n\nrelapse of the disease\ndeath\n\nAny or all of these events may be censored.\n\n5.9.2 KMsurv::bmt data in R\n\nShow R codelibrary(KMsurv)\n?bmt\n\n\n\n5.9.3 Analysis plan\n\nWe concentrate for now on disease-free survival (t2 and d3) for the three risk groups, ALL, AML Low Risk, and AML High Risk.\nWe will construct the Kaplan-Meier survival curves, compare them, and test for differences.\nWe will construct the cumulative hazard curves and compare them.\nWe will estimate the hazard functions, interpret, and compare them.\n\n5.9.4 Survival Function Estimate and Variance\n\\[\\hat S(t) = \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\] where \\(Y_i\\) is the group at risk at time \\(t_i\\).\nThe estimated variance of \\(\\hat S(t)\\) is:\n\nTheorem 5.7 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \\[\n\\widehat{\\text{Var}}\\left(\\hat S(t)\\right) = \\hat S(t)^2\\sum_{t_i <t}\\frac{d_i}{Y_i(Y_i-d_i)}\n\\tag{5.5}\\]\n\nWe can use Equation 5.5 for confidence intervals for a survival function or a difference of survival functions.\n\nKaplan-Meier survival curves\n\ncode to preprocess and model bmt datalibrary(KMsurv)\nlibrary(survival)\ndata(bmt)\n\nbmt = \n bmt |> \n as_tibble() |> \n mutate(\n group = \n group |> \n factor(\n labels = c(\"ALL\",\"Low Risk AML\",\"High Risk AML\")),\n surv = Surv(t2,d3))\n\nkm_model1 = survfit(\n formula = surv ~ group, \n data = bmt)\n\n\n\nShow R codelibrary(ggfortify)\nautoplot(\n km_model1, \n conf.int = TRUE,\n ylab = \"Pr(disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\nDisease-Free Survival by Disease Group\n\n\n\n\n\n5.9.5 Understanding Greenwood’s formula (optional)\n\nTo see where Greenwood’s formula comes from, let \\(x_i = Y_i - d_i\\). We approximate the solution treating each time as independent, with \\(Y_i\\) fixed and ignore randomness in times of failure and we treat \\(x_i\\) as independent binomials \\(\\text{Bin}(Y_i,p_i)\\). Letting \\(S(t)\\) be the “true” survival function\n\n\\[\n\\begin{aligned}\n\\hat S(t) &=\\prod_{t_i<t}x_i/Y_i\\\\\nS(t)&=\\prod_{t_i<t}p_i\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\n\\frac{\\hat S(t)}{S(t)}\n &= \\prod_{t_i<t} \\frac{x_i}. 
{p_iY_i}\n\\\\ &= \\prod_{t_i<t} \\frac{\\hat p_i}{p_i}\n\\\\ &= \\prod_{t_i<t} \\left(1+\\frac{\\hat p_i-p_i}{p_i}\\right)\n\\\\ &\\approx 1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\n\\end{aligned}\n\\]\n\n\\[\n\\begin{aligned}\n\\text{Var}\\left(\\frac{\\hat S(t)}{S(t)}\\right)\n&\\approx \\text{Var}\\left(1+\\sum_{t_i<t} \\frac{\\hat p_i-p_i}{p_i}\\right)\n\\\\ &=\\sum_{t_i<t} \\frac{1}{p_i^2}\\frac{p_i(1-p_i)}{Y_i}\n\\\\ &= \\sum_{t_i<t} \\frac{(1-p_i)}{p_iY_i}\n\\\\ &\\approx\\sum_{t_i<t} \\frac{(1-x_i/Y_i)}{x_i}\n\\\\ &=\\sum_{t_i<t} \\frac{Y_i-x_i}{x_iY_i}\n\\\\ &=\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\\\ \\therefore\\text{Var}\\left(\\hat S(t)\\right)\n&\\approx \\hat S(t)^2\\sum_{t_i<t} \\frac{d_i}{Y_i(Y_i-d_i)}\n\\end{aligned}\n\\]\n\n5.9.6 Test for differences among the disease groups\nHere we compute a chi-square test for assocation between disease group (group) and disease-free survival:\n\nShow R codesurvdiff(surv ~ group, data = bmt)\n#> Call:\n#> survdiff(formula = surv ~ group, data = bmt)\n#> \n#> N Observed Expected (O-E)^2/E (O-E)^2/V\n#> group=ALL 38 24 21.9 0.211 0.289\n#> group=Low Risk AML 54 25 40.0 5.604 11.012\n#> group=High Risk AML 45 34 21.2 7.756 10.529\n#> \n#> Chisq= 13.8 on 2 degrees of freedom, p= 0.001\n\n\n\n5.9.7 Cumulative Hazard\n\\[\n\\begin{aligned}\nh(t)\n&\\stackrel{\\text{def}}{=}P(T=t|T\\ge t)\\\\\n&= \\frac{p(T=t)}{P(T\\ge t)}\\\\\n&= -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\n\\end{aligned}\n\\]\nThe cumulative hazard (or integrated hazard) function is\n\\[H(t)\\stackrel{\\text{def}}{=}\\int_0^t h(t) dt\\] Since \\(h(t) = -\\frac{\\partial}{\\partial t}\\text{log}\\left\\{S(t)\\right\\}\\) as shown above, we have:\n\\[\nH(t)=-\\text{log}\\left\\{S\\right\\}(t)\n\\]\n\nSo we can estimate \\(H(t)\\) as:\n\\[\n\\begin{aligned}\n\\hat H(t)\n&= -\\text{log}\\left\\{\\hat S(t)\\right\\}\\\\\n&= -\\text{log}\\left\\{\\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\\right\\}\\\\\n&= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\nThis is the Kaplan-Meier (product-limit) estimate of cumulative hazard.\n\nExample: Cumulative Hazard Curves for Bone-Marrow Transplant (bmt) data\n\nShow R codeautoplot(\n fun = \"cumhaz\",\n km_model1, \n conf.int = FALSE,\n ylab = \"Cumulative hazard (disease-free survival)\",\n xlab = \"Time since transplant (days)\") + \n theme_bw() +\n theme(legend.position=\"bottom\")\n\n\n\nFigure 5.7: Disease-Free Cumulative Hazard by Disease Group", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis" @@ -610,7 +610,7 @@ "href": "intro-to-survival-analysis.html#nelson-aalen-estimates-of-cumulative-hazard-and-survival", "title": "\n5  Introduction to Survival Analysis\n", "section": "\n5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival", - "text": "5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival\n\n\nDefinition 5.5 (Nelson-Aalen Cumulative Hazard Estimator)  \n\nThe point hazard at time \\(t_i\\) can be estimated by \\(d_i/Y_i\\), which leads to the Nelson-Aalen estimator of the cumulative hazard:\n\n\\[\\hat H_{NA}(t) \\stackrel{\\text{def}}{=}\\sum_{t_i < t}\\frac{d_i}{Y_i} \\tag{5.5}\\]\n\n\n\nTheorem 5.5 (Variance of Nelson-Aalen estimator)  \n\nThe variance of this estimator is approximately:\n\n\\[\n\\begin{aligned}\n\\hat{\\text{Var}}\\left(\\hat H_{NA} (t)\\right)\n&= \\sum_{t_i <t}\\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\\\\n&\\approx \\sum_{t_i <t}\\frac{d_i}{Y_i^2}\n\\end{aligned}\n\\tag{5.6}\\]\n\n\nSince 
\\(S(t)=\\text{exp}\\left\\{-H(t)\\right\\}\\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival function:\n\\[\n\\begin{aligned}\n\\hat S_{NA}(t)\n&= \\text{exp}\\left\\{-\\hat H_{NA}(t)\\right\\}\\\\\n&= \\text{exp}\\left\\{-\\sum_{t_i < t}\\frac{d_i}{Y_i}\\right\\}\\\\\n&= \\prod_{t_i < t}\\text{exp}\\left\\{-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\n\nCompare these with the corresponding Kaplan-Meier estimates:\n\\[\n\\begin{aligned}\n\\hat H_{KM}(t) &= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\hat S_{KM}(t) &= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\end{aligned}\n\\]\n\nThe product limit estimate and the Nelson-Aalen estimate often do not differ by much. The latter is considered more accurate in small samples and also directly estimates the cumulative hazard. The \"fleming-harrington\" method for survfit() reduces to Nelson-Aalen when the data are unweighted. We can also estimate the cumulative hazard as the negative log of the KM survival function estimate.\n\n\n5.10.1 Application to bmt dataset\n\nShow R code\nna_fit = survfit(\n formula = surv ~ group,\n type = \"fleming-harrington\",\n data = bmt)\n\nkm_fit = survfit(\n formula = surv ~ group,\n type = \"kaplan-meier\",\n data = bmt)\n\nkm_and_na = \n bind_rows(\n .id = \"model\",\n \"Kaplan-Meier\" = km_fit |> fortify(surv.connect = TRUE),\n \"Nelson-Aalen\" = na_fit |> fortify(surv.connect = TRUE)\n ) |> \n as_tibble()\n\n\n\nShow R codekm_and_na |> \n ggplot(aes(x = time, y = surv, col = model)) +\n geom_step() +\n facet_grid(. ~ strata) +\n theme_bw() + \n ylab(\"S(t) = P(T>=t)\") +\n xlab(\"Survival time (t, days)\") +\n theme(legend.position = \"bottom\")\n\n\nKaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group\n\n\n\n\nThe Kaplan-Meier and Nelson-Aalen survival estimates are very similar for this dataset.\n\n\n\n\n\n\nCopelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham, and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia with Allogeneic Bone Marrow Transplantation Following Preparation with BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838 .\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. 
https://link.springer.com/book/10.1007/b97377.", + "text": "5.10 Nelson-Aalen Estimates of Cumulative Hazard and Survival\n\n\nDefinition 5.5 (Nelson-Aalen Cumulative Hazard Estimator)  \n\nThe point hazard at time \\(t_i\\) can be estimated by \\(d_i/Y_i\\), which leads to the Nelson-Aalen estimator of the cumulative hazard:\n\n\\[\\hat H_{NA}(t) \\stackrel{\\text{def}}{=}\\sum_{t_i < t}\\frac{d_i}{Y_i} \\tag{5.6}\\]\n\n\n\nTheorem 5.8 (Variance of Nelson-Aalen estimator)  \n\nThe variance of this estimator is approximately:\n\n\\[\n\\begin{aligned}\n\\hat{\\text{Var}}\\left(\\hat H_{NA} (t)\\right)\n&= \\sum_{t_i <t}\\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\\\\n&\\approx \\sum_{t_i <t}\\frac{d_i}{Y_i^2}\n\\end{aligned}\n\\tag{5.7}\\]\n\n\nSince \\(S(t)=\\text{exp}\\left\\{-H(t)\\right\\}\\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival function:\n\\[\n\\begin{aligned}\n\\hat S_{NA}(t)\n&= \\text{exp}\\left\\{-\\hat H_{NA}(t)\\right\\}\\\\\n&= \\text{exp}\\left\\{-\\sum_{t_i < t}\\frac{d_i}{Y_i}\\right\\}\\\\\n&= \\prod_{t_i < t}\\text{exp}\\left\\{-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\end{aligned}\n\\]\n\nCompare these with the corresponding Kaplan-Meier estimates:\n\\[\n\\begin{aligned}\n\\hat H_{KM}(t) &= -\\sum_{t_i < t}\\text{log}\\left\\{1-\\frac{d_i}{Y_i}\\right\\}\\\\\n\\hat S_{KM}(t) &= \\prod_{t_i < t}\\left[1-\\frac{d_i}{Y_i}\\right]\n\\end{aligned}\n\\]\n\nThe product limit estimate and the Nelson-Aalen estimate often do not differ by much. The latter is considered more accurate in small samples and also directly estimates the cumulative hazard. The \"fleming-harrington\" method for survfit() reduces to Nelson-Aalen when the data are unweighted. We can also estimate the cumulative hazard as the negative log of the KM survival function estimate.\n\n\n5.10.1 Application to bmt dataset\n\nShow R code\nna_fit = survfit(\n formula = surv ~ group,\n type = \"fleming-harrington\",\n data = bmt)\n\nkm_fit = survfit(\n formula = surv ~ group,\n type = \"kaplan-meier\",\n data = bmt)\n\nkm_and_na = \n bind_rows(\n .id = \"model\",\n \"Kaplan-Meier\" = km_fit |> fortify(surv.connect = TRUE),\n \"Nelson-Aalen\" = na_fit |> fortify(surv.connect = TRUE)\n ) |> \n as_tibble()\n\n\n\nShow R codekm_and_na |> \n ggplot(aes(x = time, y = surv, col = model)) +\n geom_step() +\n facet_grid(. ~ strata) +\n theme_bw() + \n ylab(\"S(t) = P(T>=t)\") +\n xlab(\"Survival time (t, days)\") +\n theme(legend.position = \"bottom\")\n\n\nKaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group\n\n\n\n\nThe Kaplan-Meier and Nelson-Aalen survival estimates are very similar for this dataset.\n\n\n\n\n\n\nCopelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham, and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia with Allogeneic Bone Marrow Transplantation Following Preparation with BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838 .\n\n\nKlein, John P, Melvin L Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.", "crumbs": [ "Time to Event Models", "5  Introduction to Survival Analysis"
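As a small check tying Section 5.10 back to the hand calculations in Section 5.7.4 (a sketch, assuming the `km.6mp` table built there is still in memory): the `Cumulative Hazard` column of that table is exactly the Nelson-Aalen estimate \(\hat H_{NA}(t) = \sum_{t_i < t} d_i/Y_i\), so exponentiating its negative gives the Nelson-Aalen survival estimate, which can be compared directly with the Kaplan-Meier column.

```r
library(dplyr)

km.6mp |>
  transmute(
    t2,
    S_KM = `KM Survival Curve`,       # product-limit estimate
    S_NA = exp(-`Cumulative Hazard`)  # Nelson-Aalen-based estimate
  )
```

As expected, the two estimates track each other closely, with \(\hat S_{NA}(t)\) never below \(\hat S_{KM}(t)\), since \(\text{exp}\left\{-d_i/Y_i\right\} \ge 1 - d_i/Y_i\) for each factor.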