survival.Rmd

---
title: "cox regression"
author: "liuc"
date: "11/29/2021"
output: html_document
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# 生存分析的COX回归和KM分析

<https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html>
> https://thomaselove.github.io/2020-432-book/index.html
> https://cran.r-project.org/web/packages/survminer/vignettes/Specifiying_weights_in_log-rank_comparisons.html


本笔记主要记录较为完整的Cox回归模型和KM分析，其中对于Lasso-Cox、mixed-Cox等，详见*lasso_cox.Rmd*,
*mixed_cox.Rmd*这两个笔记。

生存分析所分析的数据特点为time-to-event数据，time总是from to end的，比如Time from response to recurrence
这类最常见的数据，在面对此类数据时除了period需要考察外还需要考虑起始时间的特点，这一点在删失数据的各种分类更需注意。
对于左删失和interval censor数据需要别的数据处理方法，此处主要讨论以右删失数据为主。

右删失数据是在实验结束之时，事件还没有发生，也不知道它啥时候发生，是比较常见的。

任何数据分析，统计分析，开始都需要对数据进行探索和清洗，此处暂且略过。

## cox regression

```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(survival)
library(survminer) # for KM plot
library(riskRegression)
library(rms)
library(gtsummary) # for beautiful table
library(ggsurvfit) # recommend use survfit2
```

建模. 对于cox回归其实质同线性回归;cox回归需要满足等比例假设，对于不满足的自变量采用分层分析的方法。

_对于因子变量_应如何纳入到具体的模型中呢？对于设置成factor的因子R中模型常见的处理方式是将其自动设置为dummy变量。就像`model.matrix(~ 0 + factor(ph.ecog), lung)`

```{r}
# data(lung)

lung <- na.omit(lung) %>% 
  as_tibble() %>% 
  rstatix::convert_as_factor(sex, ph.ecog)

# 多变量cox回归
mod <- coxph(Surv(time, status==2)~age + sex + ph.ecog + ph.karno + meal.cal + wt.loss, 
             data = lung, 
             # y=TRUE, 
             x = TRUE)
summary(mod) # Concordance 即是c-index

```


```{r}
# 等比例风险假设，cox回归需要满足的假设, 可采用cox.zph函数并结合cumulative incidence plot
# GLOBAL为整体上是否满足等比例风险，p值大于0.05一般视为满足, 对于某一不满足的自变量采用分层分析的方法进行
(z <- cox.zph(mod))

# We can plot the scaled Schoenfeld residuals directly with ggcoxzph from the survminer package.
ggcoxzph(z)
```

```{r}
# 下面为分层分析后的变量，以下更为具体的内容暂不考虑，此处只做一个记录
# 
mod_strata <- coxph(Surv(time, status)~age + sex + ph.ecog + strata(ph.karno) + meal.cal + wt.loss, 
             data = lung, 
             y=TRUE, 
             x = TRUE)

mod_strata # 分层变量不在作为自变量因素
summary(mod_strata)
```


```{r}
#ggcoxdiagnostics(fit, type = "schoenfeld")
plot(z[3])
abline(0,0, col="red") # 0水平线
abline(h=mod$coef[3], col="green", lwd=2, lty=2) 
```


*对于Cox回归结果的解释：*P值和CI为此系数是否对模型具有影响,检验系数值是否为0。而coef或者exp(coef)为HR值，HR值等于1为变量中的因素之间没有差异，对于分类变量而言，sex解读为女性(2)比男性(1)降低了57%的风险；对于自变量为连续变量的数据而言，其值一般正值表示预后更差。 these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance.holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.
Rsquare in this setting is Cox and Snell’s pseudo-R2, which reflects the improvement of the model we have fit over the model with the intercept alone - a comparison that is tested by the likelihood ratio test. The maximum value of this statistic is often less than one, in which case R will tell you that. Here, our observed pseudo-R2 is 0.207 and that is out of a possible maximum of 0.996.


*筛选变量：*
```{r}
step(mod)
```

a more detailed description of the likelihood-ratio test of the model 
```{r}
anova(mod)
```

```{r}
car::Anova(mod = mod, type = 'III')
```


```{r}
broom::tidy(mod, exponentiate = TRUE)
```


*对于不满足等比例假设的另一种方法：*

上文有看到别人对某一变量进行strata分层，此处再介绍两种。对于这个主题本笔记还需进一步的整理。

处理不满足等比例假设的方式中对数据进行分层，以及对数据进行转换：

```{r}
# 通过对不符合PH假设的变量，即时依变量随时间变化的分层节点进行分层
# 此函数提供的是将整个数据集进行分层
# 以下survival包中的示例表明按照karno变量进行分层

vet2 <- survival::survSplit(Surv(time, status) ~ ., 
                    data= veteran, 
                    cut=c(60, 120), # 依据具体的变化情况设置分段的拐点
                    episode= "timegroup", 
                    id="id" # add an identifier
                    )

fit2 <- coxph(Surv(tstart, time, status) ~ karno* strata(timegroup) +
                age + trt, data= vet2)

c(overall= coef(fit1)[1],
  t0_60  = coef(fit2)[1],
  t60_120= sum(coef(fit2)[c(1,4)]),
  t120   = sum(coef(fit2)[c(1,5)]))

cox.zph(fit2)

```

对数据进行转换
```{r}
# ?tt
fit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno), # 对karno进行变换
              data = veteran, 
              tt = function(x, t, ...) x * log(t+20) # 具体变换方式
              )
fit3
```


变量筛选:

```{r}
# 提取模型estimate
# 并转化成HR值
broom::tidy(mod, 
            exp = TRUE
            )

# 输出漂亮表格
gtsummary::tbl_regression(mod, exp = T)

```


*简答起见可以选择finalfit进行拟合*下面是一个示例：

参考官方文档。

```{r}
# melanoma = boot::melanoma

dependent_os  <- "Surv(time, status_os)"
dependent_dss <- "Surv(time, status_dss)"
dependent_crr <- "Surv(time, status_crr)"
explanatory   <- c("age", "sex", "thickness", "ulcer")

melanoma %>% 
    finalfit::finalfit(dependent_os, explanatory)


melanoma %>% 
    hr_plot(dependent_os, explanatory)
```


c-index以及校准曲线:
c-index用于评估模型的区分度，可以理解为模型拟合后自身的区分能力。
校准曲线用于评估模型的准确性或者说与实际的一致性。

```{r}
# 基于riskRegression的校准曲线
mod2 <- coxph(Surv(time, status==2)~age + sex + ph.ecog + ph.karno, 
             data = lung, x = T)
xs <- riskRegression::Score(list(model1 = mod, model2 = mod2),
         Hist(time, status==2)~1,
         data=lung,
         plots="calibration",
         metrics='Brier')

riskRegression::plotCalibration(xs, cens.method="jackknife",
                col = c('black', 'red')
                )

plotCalibration(xs,
                model="model1", # 绘制模型1的校准曲线
                bars=TRUE,  # 校准曲线为条形图形式
                show.frequencies = TRUE, # 条形图上显示频率
                xlab = "预测的风险", # x轴标签
                ylab = "观察的风险", # y轴标签
                col = c("#ca3e47","#1ee3cf"), # 设置条形图的颜色
                names.cex = 1.2, # X轴数字的缩放倍数
                cex = 2.0) # 设置图例和标签的缩放倍数)
```

基于riskRegression的ROC曲线:
survivalROC包和timeROC包也能绘制生存资料的ROC曲线，但是他们一般是基于单因素的。

```{r}
# 在对cox回归计算ROC时，一般会指定一个时间
roc_res <- Score(list(model1 = mod), 
            Hist(time, status==2)~1, 
            data = lung,
            # times = 365,
            plots = 'roc',
            metrics ="auc")

plotROC(roc_res, 
        xlab="1-Specificity",
        ylab="Sensitivity",
        lty=1,
        cex=1.1,
        pch=2,
        lwd=2,
        col="red",
        legend="Model1")

```

### 呈现cox回归的结果

interpret the results is so import!

在上述的`summary(mod)`的结果中，三个整体的检验(likelihood,Wald,Score)P值小于0.05，说明模型整体上是有意义的。

```{r}
# The HR is interpreted as the instantaneous rate of occurrence of the event of interest in those who are still at risk for the event. It is not a risk, though it is commonly interpreted as such
# HR = 0.59 implies that around 0.6 times as many females are dying as males, at any given time


```


### 基于rms包的分析:

列线图的绘制、c-index、校准曲线等.

`time.inc`目前倾向于设置为最大的感兴趣时间.

```{r}
# rms::cph

lung2 <- lung

units(lung2$time) <- "Day"
dd <- datadist(lung2)
options(datadist = 'dd')

label(lung2$age) <- 'Patient Age' # 在nomograph中显示的变量label

stime <- Surv(lung2$time, lung2$status == 2)

coxm <- rms::cph(
  formula = Surv(time, status==2)~age + sex + ph.ecog + ph.karno + meal.cal + wt.loss,
  x = T,
  y = T,
  surv = T,
  data = lung2,
  time.inc = 365*2 # 此参数的设置会影响到后续的绘图，如survplot
)
coxm # 和mod估计的结果是一致的


```
*interpret the result: *


```{r}
anova(coxm)
```
```{r}
plot(summary(coxm))
```

等比例检验
```{r}
cox.zph(coxm)

plot(cox.zph(coxm))
```


If we want to use the Spearman  ρ2 plot to consider how we might perhaps incorporate non-linear terms describing any or all of the five potential predictors (age, sex...) for survival time.
确定是否需要考虑非线性的拟合。需要考虑样本数，来决定是否有必要拟合一个更为复杂的模型。

```{r}
plot(spearman2(time ~age + sex + ph.ecog + ph.karno + meal.cal + wt.loss, data=lung2))
```


*concordance* is only appropriate when we have at least one continuous predictor in our Cox model, in which case it assesses the probability of agreement between the survival time and the risk score generated by the predictor (or set of predictors.) A value of 1 indicates perfect agreement, but values of 0.6 to 0.7 are more common in survival data. 0.5 is an agreement that is no better than chance. Here, our concordance is 0.65, which is a fairly typical value.
```{r}
# c-index 的计算
# Dxy/2+0.5
c <- validate(coxm, 
             dxy=T, 
             B=1000)
```


nomogram

```{r}
## 设置预测时间
# 预测时间的选择，需要依据具体的数据和time的分布结合临床需求进行操作
# 比如此例，因为lung数据集中time单位为Day，而我们比较关注一年，两年的生存期情况
surv <- Survival(coxm)
surv1 <- function(x)surv(365 * 1, lp=x)
surv2 <- function(x)surv(365 * 2, lp=x)

nomo <- nomogram(coxm,
                fun = list(surv1, surv2),
                funlabel = c('1-year LR probability',
                             '2-year LR probability'
                             ),
                lp = F,
                maxscale = 100 # 以百分比的形式展示
                )
plot(nomo, 
     #1.变量与图形的占比
     xfrac=.35, 
     #2.变量字体加粗
     cex.var=1, 
     #3.数轴：字体的大小
     cex.axis=0.8,
     #4.数轴：刻度的长度
     tcl=-0.5,
     #5.数轴：文字与刻度的距离
     lmgp=0.3, 
     #6.数轴：刻度下的文字，1=连续显示，2=隔一个显示一个
     label.every=1, 
     #7.1个页面有几个数轴(这个可以压缩行间距)
     naxes=13,
     #8.垂直线的颜色.
     col.grid=gray(c(0.8, 0.95)),
     #9.线性预测轴名字
     lplabel="Linear Predictorlp", 
     #10变量分数名字
     points.label='Points', 
     #11总分名字
     total.points.label='Total Points',
     force.label=T#没啥用。TRUE强制标记的每个刻度线都绘制标签，我也没研究明白
     )
```

```{r}
survplot(coxm, sex)

survplot(coxm, age=median(lung2$age), conf.int=.95, 
         col='blue', time.inc=12, n.risk=TRUE,
         conf='bands', type="kaplan-meier", 
         xlab="Study Survival Time in Months")

ggplot(Predict(coxm, age))
```

rms包的校准曲线绘制:

校准曲线是模型验证的方法，一般我们会通过外部数据或者测试数据集对模型进行验证。但在没有外部数据集的情况下，则可以考虑进行内部验证，比如会常用bootstrap重采样的方法，并绘制校准曲线。

校准曲线是绘制一条预测和实际的曲线，For survival models, "predicted" means predicted survival probability at a single time point, and "observed" refers to the corresponding Kaplan-Meier survival estimate, stratifying on intervals of predicted survival,

u: the time point for which to validate predictions for survival models. For cph fits, you must have specified surv=TRUE, time.inc=u, where u is the constant specifying the time to predict.
u值是在模型中是一个按照具体课题设置的时间，其默认的时间单位为Day，但是可以依据自己数据的时间单位进行修改，比如我们的时间单位为month，则，而如果我们比较关注5年生存期，而time单位为月的话则u=60等。
`Hmisc::units(lung2$time) <- "month"`


```{r}
cal <- rms::calibrate(coxm,#模型名称
              cmethod='KM',
              method='boot',#检测方法
              u=365 * 2, #评估的时间，注：一定要与模型的时间一致,如何在模型中设置此时间取决于自己
              m=50, #每次抽样的样本量，样本总量除以3或4，因为一般将所有样本分为3、4组，当然如果样本量多也可以多设置组
              B=1000)#抽样次数

plot(cal,
     add=F,
     conf.int=T,
     subtitles = F,
     cex.subtitles=0.8, 
     lwd=2,
     lty=1,
     errbar.col="blue",
     xlim=c(0,1),#调节x.y轴刻度范围
     ylim=c(0,1),
     xlab="nomo for two year survival",
     ylab="actual two year survival",
     col="red")

# another plot
plot(cal,lwd=2,lty=1,errbar.col=c(rgb(0,118,192,maxColorValue=255)),
     xlim=c(0,0.8),ylim=c(0,1),
     xlab="Nomogram-Predicted Probability of 5-Year OS",
     ylab="Actual 5-Year OS(proportion)", 
     col=c(rgb(192,98,83,maxColorValue=255)))
lines(cal[,c("mean.predicted","KM")],type="b",lwd=2,
      col=c(rgb(192,98,83,maxColorValue=255)), pch=16)
abline(0,1,lty=3,lwd=2,col=c(rgb(0,118,192,maxColorValue=255)))
```

ggDCA包的决策曲线分析： 决策曲线（Decision curve analysis，DCA）

```{r}
library(ggDCA)

dca1 <- ggDCA::dca(mod,
            new.data = NULL,
            times = "median"
            )

ggDCA::AUDC(dca1)

ggplot(dca1,       
       model.names="model1",
       linetype =F,
       lwd = 1.2) 
```


对于决策曲线的绘制，发现了了一个新的工具`dcurves`.


*多因素cox回归分析后的单因素的KM曲线*

或者应该说是cofounder-adjusted expected survival probabilities, 不知此时是否还合适称之为KMplot.

在混杂因素纳入模型后，聚焦的单一变量在一些文献中，也会绘制Adjusted Survival Curves. The `survfit` function estimates $S(t)$, by default at the mean values of the covariates:

> <http://www.sthda.com/english/wiki/cox-proportional-hazards-model>
> <https://bookdown.org/sestelo/sa_financial/adjusting-survival-curves.html>
> https://stackoverflow.com/questions/70783093/how-to-generate-covariate-adjusted-cox-survival-hazard-functions

经过多方尝试，似乎`https://github.com/RobinDenz1/adjustedCurves` 是个不错的选择。

```{r}
new_df <- with(lung,
               data.frame(sex = c(1, 2), 
                          age = rep(mean(age, na.rm = TRUE), 2),
                          ph.ecog = c(1, 1),
                          ph.karno = rep(mean(ph.karno, na.rm = TRUE), 2)
                          )
               ) %>% rstatix::convert_as_factor(sex, ph.ecog)

fit <- survfit(mod2, newdata = new_df)

ggsurvplot(fit, conf.int = TRUE, 
           pval = TRUE,data = new_df,
           legend.labs=c("Sex=1", "Sex=2"),
           ggtheme = theme_minimal())

# 利用survminer
# 似乎对因子变量不怎么支持
curve <- surv_adjustedcurves(mod2, variable = 'sex', data = lung, method = "average")
ggadjustedcurves(mod2, variable = 'sex', data = lung, method = "average")


# 尝试adjustedCurves的用法
adjsurv <- adjustedCurves::adjustedsurv(data=lung,
                        variable="sex",
                        ev_time="time",
                        event="status",
                        method="direct",
                        outcome_model=cox_mod,
                        conf_int=TRUE)

# plot with confidence intervals
plot(adjsurv, conf_int=TRUE)
```

**两个模型的比较**
```{r}
anova(mod2, mod) #fist the reduced model, second the full model
```


## *KM plot*

KM plot一个很重要的点在于，得有一个分类数据。

一般来说，log-rank检验的应用条件跟COX回归类似，都需要满足等比例风险假定条件，通俗来说，最好是两条曲线大致平行（当然这不是严格定义，但容易理解）。Wilcoxon检验对服从对数正态分布的生存数据比log-rank检验更好一些。
最后，简单说一下其它检验方法。
除了log-rank检验和Wilcoxon检验，还有其它几种常见的，也是软件提供的。如Peto检验，Tarone-Ware检验。它们与log-rank检验和Wilcoxon检验的不同就是权重不一样，如Tarone-Ware检验的权重是根号N，即N的平方根，也就是说，它也是越来越小，但是既然开了根号，那变化就没有那么大了。Peto检验的权重是生存估计函数，也是一个变化值。所以，其实根据这些权重，就不难理解这些方法侧重的方向。
最最后，几种检验方法应用场景小结。
从统计学角度，这几种检验方法其实所做的无效假设都是一样的，都是为了检验组间的生存曲线是否有统计学差异。然而，具体细节上有不同。如果你在研究时，确信某种疗法在一开始效果较好，随着时间推移，可能效果会减弱，此时应事先就确定采用Wilcoxon检验（或Tarone-Ware检验，视情况而定）而不是log-rank检验。采用log-rank检验，实际上相当于你对某种疗法并没有太多的概念，可能认为该疗法在整个研究期间应该效果差不多。
严格来说，统计学不应该根据事后的生存曲线图和p值来选择方法，而应该在研究设计时就有计划，确定统计学方法。因为它们所对应的情形并不相同。

To perform a standard log rank test, one can use
`survdiff(Surv(time, status) ~ group, data = obs, rho = 0)`
How could you perform a Wilcoxon signed rank test, a Peto test or a Fleming-Harrington test?
 to get a "Peto test" is to use rho=1, or we can use the `survMisc` package:
 `library(survMisc) comp(fit)$tests$lrTests`，`comp(ten(fit))`结果里包含一系列的计算结果。R在这种时候容易给人一种不值得信任的感觉，如果有一个R统计联盟就好了。

*SAS*中在计算CI时，默认使用的是loglog方法，而R默认的是log，可以更改`survfit`的参数，`survfit(Surv(time, status==2) ~ sex, data = lung, conf.type = "log-log")`


查看survfit.formula的帮助页。

```{r}
# ?survfit.formula
fit <- survfit(Surv(time, status==2) ~ sex, data = lung,
               conf.int = 0.95,
               se.fit = TRUE,
               conf.type = "log-log" # 和SAS保持一致
               )
fit # 中位生存期以及95%CI

# 中位生存期、95%CI
surv_median(fit)
surv_pvalue(fit)

# Estimating x-year survival
# 查看某个时间点的生存率, 如下示例为1年时的生存率
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
```

```{r}

# 从survdiff中提取pvalue
sd <- survdiff(Surv(time, status) ~ sex, data = lung)
1 - pchisq(sd$chisq, length(sd$n) - 1)
# directly calculate the log-rank test p-value using survdiff()
survdiff(Surv(time, status) ~ sex, data = lung)


# KM plot
pp <- ggsurvplot(fit, 
                 data = lung,
                 surv.median.line = "hv",
                 legend.title = "Subtype",
                 legend.labs = c("Male", "Female"),
                 legend = c(0.9,0.9),
                 censor.shape="|", censor.size = 4,
                 pval = TRUE,
                 pval.method = TRUE,
                 pval.size = 4,
                 # pval.coord = c(0, 0.03),
                 conf.int = FALSE,
                 conf.int.style = "step",
                 add.all = FALSE,
                 risk.table = 'abs_pct',
                 tables.height = 0.2,
                 tables.theme = theme_cleantable(),
                 risk.table.fontsize = 5,
                 risk.table.y.text = TRUE,
                 fontsize = 8,
                 font.x = c(15),
                 font.y = c(15),
                 # font.family = 'Book Antiqua',
                 # font.tickslab = c(20, "plain"),
                 ncensor.plot = FALSE,
                 ncensor.plot.height = 0.25,
                 # cumevents.y.text = 8,
                 # ylab="Cumulative survival (percentage)",
                 xlab = " Time(Days)",
                 break.time.by = 50,
                 palette = 'aaas',
                 ggtheme = cowplot::theme_cowplot(font_size = 15) +
                   theme(axis.text.x = element_text(face = 'bold'),
                         axis.text.y = element_text(face = 'bold'),
                         axis.line = element_line(linetype = 'solid',
                                                  size = 1),
                         axis.ticks = element_line(size = 1),
                         axis.title = element_text(face = 'bold')
                   )
                 # ggtheme = theme_survminer() +
                 #   theme(legend.text = element_text(size = 20),
                 #         legend.title = element_text(size = 20),
                 #         axis.text.x = element_text(size = 20),
                 #         text = element_text(size = 20),
                 #         plot.title = element_text(hjust = 0.0)
                   # )
)
pp$plot <- pp$plot + 
  labs(title    = "Survival curves",
       subtitle = "Based on Kaplan-Meier estimates"
  ) +
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0))
pp
```


*此处增加一个ggsurvfit包的示例*

此包推荐拟合时用其包装好的survfit2函数;
`CDISC ADaM ADTTE`推荐的数据格式中，1竟然是censored。。。无语。。需要注意。。。   

```{r}
fit2 <- survfit2(Surv(time, status==2) ~ sex, data = lung,
               conf.int = 0.95,
               se.fit = TRUE
               )

fit2 %>% 
  ggsurvfit::ggsurvfit() +
  add_confidence_interval() +
  add_risktable() +
  add_quantile(y_value = 0.6, color = "gray50", size = 0.75) +
  scale_color_manual(values = c('#54738E', '#82AC7C')) +
  scale_fill_manual(values = c('#54738E', '#82AC7C')) +
  add_pvalue(caption = "Log-rank {p.value}") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(ncol = 1)) +
  labs(title = "KM plot",
       y = "Percentage Survival")
```


*Landmark分析：*如果随访的不同时间段间基础风险发生变化，需要分时间段分别分析。处理KM曲线交叉的情况.

更一般的，当处理非baseline变量时

```{r}
# landmark 分析, 分时间段分析, 以某个时间点为截断点，分别做KM分析，以解决time-to-treatment bias或immortal time bias
# https://github.com/jinseob2kim/jskm
# https://bbisr.shinyapps.winship.emory.edu/CASAS/
jskm(fit, mark = F,  surv.scale = "percent", pval =T, table = T, 
     cut.landmark = 500, # 需要修改的函数
     showpercent = T)


p3 <- jskm(fit,
         pval = T,
         table = T,
         label.nrisk = "Numbers at risk",
         ystrataname = "Treatment",
         ystratalabs = c("Obs", "Lev", "Lev+5FU"),
         marks = F,
         linecols = "Set2",
         xlabs = "Time (Day)",
         cut.landmark = 365)#选择分界点


# landmark的分析，首先需要选择landmark time
# 然后再选择time 大于landmark time的人群
# Calculate follow-up from landmark time and apply traditional log-rank tests or Cox regression


```


## Random Forests Model for survival

随机森林生存分析。利用随机森林算法执行的生存分析方法，可用R包，包括`ranger`, `randomForestSRC`包等，以下列一个`ranger`的常规用法，关于`randomForestSRC`的用法见另一篇笔记。

随机生存森林，和随机森林一致，可以筛选变量，包括变量重要性的排序等；以及构建预测模型。一般适用于右删失数据。

```{r}
# ranger model
r_fit <- ranger::ranger(Surv(time, status) ~ trt + celltype + 
                     karno + diagtime + age + prior,
                     data = vet,
                     mtry = 4,
                     importance = "permutation",
                     splitrule = "extratrees",
                     verbose = TRUE)

# Average the survival models
death_times <- r_fit$unique.death.times 
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob,mean)

# Plot the survival models for each patient
plot(r_fit$unique.death.times,r_fit$survival[1,], 
     type = "l", 
     ylim = c(0,1),
     col = "red",
     xlab = "Days",
     ylab = "survival",
     main = "Patient Survival Curves")

```


## Parametric Survival Analysis

> https://github.com/FJRubio67/ShortCourseParamSurvival
> https://rpubs.com/FJRubio/PSM
> https://rpubs.com/FJRubio/PSRM


参数生存分析包含two regression models: PH(the proportional hazards) and AFT(accelerated failure time).

By selecting a parametric from the catalogue of distributions, we are making assumptions about the possible hazard rates of the true distribution. Selecting the best model using formal tools is usually recommended.
对于参数类的生存分析，需要选择所使用的参数方法。

- Weibull, Lognormal, and Generalised Gamma.(没有太看明白，难道不能加协变量吗？)
- PH
- AFT
- General Hazard (GH) model

分布和两种model的关系，


*Weibull分布：*
生存分析的参数模型中，指数分布最为简单，不过其所假定的死亡风险不变太过理想化，在实际中应用不多。`Weibull`分布除了风险参数lambda参数外，多一个控制曲线形状的参数，用p表示，p<1表示多数病人较早死亡，少数人生存时间较长；p>1表示早期死亡例数较少，后期死亡数较多；p=1表示病人死亡数不随时间变化，此时相当于指数分布(指数分布可以看作其的一个特殊情况)。那么Weibull回归分析结果长什么样呢？其实跟普通回归没什么太大区别，也是给出截距、各影响因素的参数估计值。如果说唯一有不同的地方就是，它还给出了Weibull分布的参数。你会发现p=1.72，而且从置信区间来看，明显大于1。这说明了两件事：第一，既然p从统计学上来看不等于1，说明这不是指数分布，而是Weibull分布。第二，既然p大于1，说明在这个数据中，死亡风险一开始较低，后面逐渐增大。从生存角度则说明，一开始死亡人数可能较少，后面反而死亡人数越来越多。


*lognormal: *
Lognormal 分布是一种连续概率分布，通常用于描述正态分布的对数。在 R 中，可以使用“survival”包中的“survreg”函数进行 Lognormal 分布分析。


*shared frailty model: *
Shared frailty model是一种生存分析模型，它考虑了群组内个体之间的相关性，常用于分析有群组结构的数据，如医院、家庭、社区等。在R中，可以使用coxme包实现shared frailty model。类似于重复测量资料。详细见笔记`mixed_cox.Rmd`.


```{r, include=FALSE}
library(flexsurv) # 此包提供诸多参数类的拟合
library(eha) # event history and survival analysis
library(MASS)
library(fitdistrplus)

```


### hp model


类似的三个模型是？见上面的描述。weibull 分布，或者`log-logistic`分布，是值得在生存分析中应用的，不过作为参数模型，需要对数据的分布进行一些检验。

```{r}
# Survival model using a Weibull distribution
# wb 和 flexsurvreg 拟合函数分别可以做对应的weibull分析

# By default R reports coefficients in the accelerated failure time metric.
wb <- survival::survreg(formula = Surv(time, status) ~ 1, data = lung,
                           dist = "weibull")

# 
fit.weibull <- flexsurv::flexsurvreg(formula = Surv(time, status) ~ 1, data = lung,
                           dist = "weibull")

fit.weibull
```
*interpret the result:* shape参数确定了生存时间分布的形状，而scale参数确定了生存时间分布的比例因子。
从置信区间来看，shape明显大于1，即是死亡风险在前期较低后期较高。


```{r}
# summary(fit.weibull)

# 预测患者生存概率，对于有自变量的情况下
lung$sex <- factor(lung$sex)

weib_model <- flexsurv::flexsurvreg(formula = Surv(time, status) ~ age + sex + ph.ecog, 
                                      data = lung,
                           dist = "weibull")

fit.aft <- eha::aftreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung,
                         dist = "weibull" # 此处参数可以选择weibull等
                         )


new_data <- data.frame(age = 65, sex = "1", ph.ecog = 1)
predict(weib_model, new_data, type = "survival")
```


```{r}
# 拟合生存函数和风险函数
spline_fit <- flexsurvspline(Surv(time, status) ~ 1, data = lung)


```

```{r}
# 绘制生存概率密度函数和风险密度函数
ggplot(data.frame(time = seq(0, 50, by = 0.1)), aes(x = time)) +
  geom_line(aes(y = dweibull(time, shape = weibull_fit$estimate[1], 
                            scale = weibull_fit$estimate[2])), 
            color = "red") +
  geom_line(aes(y = -log(survfit(Surv(time, status) ~ 1)$surv)), 
            color = "blue") +
  ylab("Probability")
```


### AFT fit

> https://en.wikipedia.org/wiki/Accelerated_failure_time_model

AFT(Accelerated Failure Time), AFT models assume that the hazard rate, or risk of an event occurring, is constant over time. This model estimates parameters for the effects of predictor variables on the time to event.

```{r}
# AFT model using a lognormal baseline distribution
fit.aft <- eha::aftreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung |> na.omit(),
                         dist = "lognormal" # 此处参数可以选择weibull等
                         )

summary(fit.aft)


```


```{r}
fit.aft2 <- eha::aftreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung,
                         dist = "weibull"
                         )

fit.aft2
```
*interpret the result: *本模型的结果和SPSS结果保持一致。不过一些Coef值的正负不一致。


*Cox-Snell Residual plot*
```{r}
# 四幅图的诊断
plot(fit.aft2)
```


PH model, 指的是Proportional hazards model with parametric baseline hazard(s). Allows for stratification with different scale and shape in each stratum, and left truncated and right censored data.
```{r}
# PH model using a lognormal baseline distribution
fit.ph <- eha::phreg(formula = Surv(time, status) ~ age + sex , data = lung,
                dist = "lognormal")

summary(fit.ph)
```
```{r}
fit.ph2 <- eha::phreg(formula = Surv(time, status) ~ sex , data = lung,
                dist = "gompertz")


plot(fit.ph2, fn = "haz", col = c("red", "blue"), 
     main = "Male mortality by social class", 
     ylab = "Gompertz hazard", 
     xlab = "Sex")
```


```{r}
# Comparison of the three models using AIC
AIC.aft <- -2*(fit.aft$loglik[2]) + 2*length(fit.aft$coefficients)
AIC.ph <- -2*(fit.ph$loglik[2]) + 2*length(fit.ph$coefficients)

AIC.aft
AIC.ph
```


```{r}
fitdistrplus::fitdist()
```