-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsurvival.Rmd
executable file
·903 lines (651 loc) · 31.3 KB
/
survival.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
---
title: "cox regression"
author: "liuc"
date: "11/29/2021"
output: html_document
editor_options:
markdown:
wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# 生存分析的COX回归和KM分析
<https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html>
> https://thomaselove.github.io/2020-432-book/index.html
> https://cran.r-project.org/web/packages/survminer/vignettes/Specifiying_weights_in_log-rank_comparisons.html
本笔记主要记录较为完整的Cox回归模型和KM分析,其中对于Lasso-Cox、mixed-Cox等,详见*lasso_cox.Rmd*,
*mixed_cox.Rmd*这两个笔记。
生存分析所分析的数据特点为time-to-event数据,time总是from to end的,比如Time from response to recurrence
这类最常见的数据,在面对此类数据时除了period需要考察外还需要考虑起始时间的特点,这一点在删失数据的各种分类更需注意。
对于左删失和interval censor数据需要别的数据处理方法,此处主要讨论以右删失数据为主。
右删失数据是在实验结束之时,事件还没有发生,也不知道它啥时候发生,是比较常见的。
任何数据分析,统计分析,开始都需要对数据进行探索和清洗,此处暂且略过。
## cox regression
```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(survival)
library(survminer) # for KM plot
library(riskRegression)
library(rms)
library(gtsummary) # for beautiful table
library(ggsurvfit) # recommend use survfit2
```
建模. 对于cox回归其实质同线性回归;cox回归需要满足等比例假设,对于不满足的自变量采用分层分析的方法。
_对于因子变量_应如何纳入到具体的模型中呢?对于设置成factor的因子R中模型常见的处理方式是将其自动设置为dummy变量。就像`model.matrix(~ 0 + factor(ph.ecog), lung)`
```{r}
# data(lung)
lung <- na.omit(lung) %>%
as_tibble() %>%
rstatix::convert_as_factor(sex, ph.ecog)
# 多变量cox回归
mod <- coxph(Surv(time, status==2)~age + sex + ph.ecog + ph.karno + meal.cal + wt.loss,
data = lung,
# y=TRUE,
x = TRUE)
summary(mod) # Concordance 即是c-index
```
```{r}
# 等比例风险假设,cox回归需要满足的假设, 可采用cox.zph函数并结合cumulative incidence plot
# GLOBAL为整体上是否满足等比例风险,p值大于0.05一般视为满足, 对于某一不满足的自变量采用分层分析的方法进行
(z <- cox.zph(mod))
# We can plot the scaled Schoenfeld residuals directly with ggcoxzph from the survminer package.
ggcoxzph(z)
```
```{r}
# 下面为分层分析后的变量,以下更为具体的内容暂不考虑,此处只做一个记录
#
mod_strata <- coxph(Surv(time, status)~age + sex + ph.ecog + strata(ph.karno) + meal.cal + wt.loss,
data = lung,
y=TRUE,
x = TRUE)
mod_strata # 分层变量不在作为自变量因素
summary(mod_strata)
```
```{r}
#ggcoxdiagnostics(fit, type = "schoenfeld")
plot(z[3])
abline(0,0, col="red") # 0水平线
abline(h=mod$coef[3], col="green", lwd=2, lty=2)
```
*对于Cox回归结果的解释:*P值和CI为此系数是否对模型具有影响,检验系数值是否为0。而coef或者exp(coef)为HR值,HR值等于1为变量中的因素之间没有差异,对于分类变量而言,sex解读为女性(2)比男性(1)降低了57%的风险;对于自变量为连续变量的数据而言,其值一般正值表示预后更差。 these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance.holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.
Rsquare in this setting is Cox and Snell’s pseudo-R2, which reflects the improvement of the model we have fit over the model with the intercept alone - a comparison that is tested by the likelihood ratio test. The maximum value of this statistic is often less than one, in which case R will tell you that. Here, our observed pseudo-R2 is 0.207 and that is out of a possible maximum of 0.996.
*筛选变量:*
```{r}
step(mod)
```
a more detailed description of the likelihood-ratio test of the model
```{r}
anova(mod)
```
```{r}
car::Anova(mod = mod, type = 'III')
```
```{r}
broom::tidy(mod, exponentiate = TRUE)
```
*对于不满足等比例假设的另一种方法:*
上文有看到别人对某一变量进行strata分层,此处再介绍两种。对于这个主题本笔记还需进一步的整理。
处理不满足等比例假设的方式中对数据进行分层,以及对数据进行转换:
```{r}
# 通过对不符合PH假设的变量,即时依变量随时间变化的分层节点进行分层
# 此函数提供的是将整个数据集进行分层
# 以下survival包中的示例表明按照karno变量进行分层
vet2 <- survival::survSplit(Surv(time, status) ~ .,
data= veteran,
cut=c(60, 120), # 依据具体的变化情况设置分段的拐点
episode= "timegroup",
id="id" # add an identifier
)
fit2 <- coxph(Surv(tstart, time, status) ~ karno* strata(timegroup) +
age + trt, data= vet2)
c(overall= coef(fit1)[1],
t0_60 = coef(fit2)[1],
t60_120= sum(coef(fit2)[c(1,4)]),
t120 = sum(coef(fit2)[c(1,5)]))
cox.zph(fit2)
```
对数据进行转换
```{r}
# ?tt
fit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno), # 对karno进行变换
data = veteran,
tt = function(x, t, ...) x * log(t+20) # 具体变换方式
)
fit3
```
变量筛选:
```{r}
# 提取模型estimate
# 并转化成HR值
broom::tidy(mod,
exp = TRUE
)
# 输出漂亮表格
gtsummary::tbl_regression(mod, exp = T)
```
*简答起见可以选择finalfit进行拟合*下面是一个示例:
参考官方文档。
```{r}
# melanoma = boot::melanoma
dependent_os <- "Surv(time, status_os)"
dependent_dss <- "Surv(time, status_dss)"
dependent_crr <- "Surv(time, status_crr)"
explanatory <- c("age", "sex", "thickness", "ulcer")
melanoma %>%
finalfit::finalfit(dependent_os, explanatory)
melanoma %>%
hr_plot(dependent_os, explanatory)
```
c-index以及校准曲线:
c-index用于评估模型的区分度,可以理解为模型拟合后自身的区分能力。
校准曲线用于评估模型的准确性或者说与实际的一致性。
```{r}
# 基于riskRegression的校准曲线
mod2 <- coxph(Surv(time, status==2)~age + sex + ph.ecog + ph.karno,
data = lung, x = T)
xs <- riskRegression::Score(list(model1 = mod, model2 = mod2),
Hist(time, status==2)~1,
data=lung,
plots="calibration",
metrics='Brier')
riskRegression::plotCalibration(xs, cens.method="jackknife",
col = c('black', 'red')
)
plotCalibration(xs,
model="model1", # 绘制模型1的校准曲线
bars=TRUE, # 校准曲线为条形图形式
show.frequencies = TRUE, # 条形图上显示频率
xlab = "预测的风险", # x轴标签
ylab = "观察的风险", # y轴标签
col = c("#ca3e47","#1ee3cf"), # 设置条形图的颜色
names.cex = 1.2, # X轴数字的缩放倍数
cex = 2.0) # 设置图例和标签的缩放倍数)
```
基于riskRegression的ROC曲线:
survivalROC包和timeROC包也能绘制生存资料的ROC曲线,但是他们一般是基于单因素的。
```{r}
# 在对cox回归计算ROC时,一般会指定一个时间
roc_res <- Score(list(model1 = mod),
Hist(time, status==2)~1,
data = lung,
# times = 365,
plots = 'roc',
metrics ="auc")
plotROC(roc_res,
xlab="1-Specificity",
ylab="Sensitivity",
lty=1,
cex=1.1,
pch=2,
lwd=2,
col="red",
legend="Model1")
```
### 呈现cox回归的结果
interpret the results is so import!
在上述的`summary(mod)`的结果中,三个整体的检验(likelihood,Wald,Score)P值小于0.05,说明模型整体上是有意义的。
```{r}
# The HR is interpreted as the instantaneous rate of occurrence of the event of interest in those who are still at risk for the event. It is not a risk, though it is commonly interpreted as such
# HR = 0.59 implies that around 0.6 times as many females are dying as males, at any given time
```
### 基于rms包的分析:
列线图的绘制、c-index、校准曲线等.
`time.inc`目前倾向于设置为最大的感兴趣时间.
```{r}
# rms::cph
lung2 <- lung
units(lung2$time) <- "Day"
dd <- datadist(lung2)
options(datadist = 'dd')
label(lung2$age) <- 'Patient Age' # 在nomograph中显示的变量label
stime <- Surv(lung2$time, lung2$status == 2)
coxm <- rms::cph(
formula = Surv(time, status==2)~age + sex + ph.ecog + ph.karno + meal.cal + wt.loss,
x = T,
y = T,
surv = T,
data = lung2,
time.inc = 365*2 # 此参数的设置会影响到后续的绘图,如survplot
)
coxm # 和mod估计的结果是一致的
```
*interpret the result: *
```{r}
anova(coxm)
```
```{r}
plot(summary(coxm))
```
等比例检验
```{r}
cox.zph(coxm)
plot(cox.zph(coxm))
```
If we want to use the Spearman ρ2 plot to consider how we might perhaps incorporate non-linear terms describing any or all of the five potential predictors (age, sex...) for survival time.
确定是否需要考虑非线性的拟合。需要考虑样本数,来决定是否有必要拟合一个更为复杂的模型。
```{r}
plot(spearman2(time ~age + sex + ph.ecog + ph.karno + meal.cal + wt.loss, data=lung2))
```
*concordance* is only appropriate when we have at least one continuous predictor in our Cox model, in which case it assesses the probability of agreement between the survival time and the risk score generated by the predictor (or set of predictors.) A value of 1 indicates perfect agreement, but values of 0.6 to 0.7 are more common in survival data. 0.5 is an agreement that is no better than chance. Here, our concordance is 0.65, which is a fairly typical value.
```{r}
# c-index 的计算
# Dxy/2+0.5
c <- validate(coxm,
dxy=T,
B=1000)
```
nomogram
```{r}
## 设置预测时间
# 预测时间的选择,需要依据具体的数据和time的分布结合临床需求进行操作
# 比如此例,因为lung数据集中time单位为Day,而我们比较关注一年,两年的生存期情况
surv <- Survival(coxm)
surv1 <- function(x)surv(365 * 1, lp=x)
surv2 <- function(x)surv(365 * 2, lp=x)
nomo <- nomogram(coxm,
fun = list(surv1, surv2),
funlabel = c('1-year LR probability',
'2-year LR probability'
),
lp = F,
maxscale = 100 # 以百分比的形式展示
)
plot(nomo,
#1.变量与图形的占比
xfrac=.35,
#2.变量字体加粗
cex.var=1,
#3.数轴:字体的大小
cex.axis=0.8,
#4.数轴:刻度的长度
tcl=-0.5,
#5.数轴:文字与刻度的距离
lmgp=0.3,
#6.数轴:刻度下的文字,1=连续显示,2=隔一个显示一个
label.every=1,
#7.1个页面有几个数轴(这个可以压缩行间距)
naxes=13,
#8.垂直线的颜色.
col.grid=gray(c(0.8, 0.95)),
#9.线性预测轴名字
lplabel="Linear Predictorlp",
#10变量分数名字
points.label='Points',
#11总分名字
total.points.label='Total Points',
force.label=T#没啥用。TRUE强制标记的每个刻度线都绘制标签,我也没研究明白
)
```
```{r}
survplot(coxm, sex)
survplot(coxm, age=median(lung2$age), conf.int=.95,
col='blue', time.inc=12, n.risk=TRUE,
conf='bands', type="kaplan-meier",
xlab="Study Survival Time in Months")
ggplot(Predict(coxm, age))
```
rms包的校准曲线绘制:
校准曲线是模型验证的方法,一般我们会通过外部数据或者测试数据集对模型进行验证。但在没有外部数据集的情况下,则可以考虑进行内部验证,比如会常用bootstrap重采样的方法,并绘制校准曲线。
校准曲线是绘制一条预测和实际的曲线,For survival models, "predicted" means predicted survival probability at a single time point, and "observed" refers to the corresponding Kaplan-Meier survival estimate, stratifying on intervals of predicted survival,
u: the time point for which to validate predictions for survival models. For cph fits, you must have specified surv=TRUE, time.inc=u, where u is the constant specifying the time to predict.
u值是在模型中是一个按照具体课题设置的时间,其默认的时间单位为Day,但是可以依据自己数据的时间单位进行修改,比如我们的时间单位为month,则,而如果我们比较关注5年生存期,而time单位为月的话则u=60等。
`Hmisc::units(lung2$time) <- "month"`
```{r}
cal <- rms::calibrate(coxm,#模型名称
cmethod='KM',
method='boot',#检测方法
u=365 * 2, #评估的时间,注:一定要与模型的时间一致,如何在模型中设置此时间取决于自己
m=50, #每次抽样的样本量,样本总量除以3或4,因为一般将所有样本分为3、4组,当然如果样本量多也可以多设置组
B=1000)#抽样次数
plot(cal,
add=F,
conf.int=T,
subtitles = F,
cex.subtitles=0.8,
lwd=2,
lty=1,
errbar.col="blue",
xlim=c(0,1),#调节x.y轴刻度范围
ylim=c(0,1),
xlab="nomo for two year survival",
ylab="actual two year survival",
col="red")
# another plot
plot(cal,lwd=2,lty=1,errbar.col=c(rgb(0,118,192,maxColorValue=255)),
xlim=c(0,0.8),ylim=c(0,1),
xlab="Nomogram-Predicted Probability of 5-Year OS",
ylab="Actual 5-Year OS(proportion)",
col=c(rgb(192,98,83,maxColorValue=255)))
lines(cal[,c("mean.predicted","KM")],type="b",lwd=2,
col=c(rgb(192,98,83,maxColorValue=255)), pch=16)
abline(0,1,lty=3,lwd=2,col=c(rgb(0,118,192,maxColorValue=255)))
```
ggDCA包的决策曲线分析: 决策曲线(Decision curve analysis,DCA)
```{r}
library(ggDCA)
dca1 <- ggDCA::dca(mod,
new.data = NULL,
times = "median"
)
ggDCA::AUDC(dca1)
ggplot(dca1,
model.names="model1",
linetype =F,
lwd = 1.2)
```
对于决策曲线的绘制,发现了了一个新的工具`dcurves`.
*多因素cox回归分析后的单因素的KM曲线*
或者应该说是cofounder-adjusted expected survival probabilities, 不知此时是否还合适称之为KMplot.
在混杂因素纳入模型后,聚焦的单一变量在一些文献中,也会绘制Adjusted Survival Curves. The `survfit` function estimates $S(t)$, by default at the mean values of the covariates:
> <http://www.sthda.com/english/wiki/cox-proportional-hazards-model>
> <https://bookdown.org/sestelo/sa_financial/adjusting-survival-curves.html>
> https://stackoverflow.com/questions/70783093/how-to-generate-covariate-adjusted-cox-survival-hazard-functions
经过多方尝试,似乎`https://github.com/RobinDenz1/adjustedCurves` 是个不错的选择。
```{r}
new_df <- with(lung,
data.frame(sex = c(1, 2),
age = rep(mean(age, na.rm = TRUE), 2),
ph.ecog = c(1, 1),
ph.karno = rep(mean(ph.karno, na.rm = TRUE), 2)
)
) %>% rstatix::convert_as_factor(sex, ph.ecog)
fit <- survfit(mod2, newdata = new_df)
ggsurvplot(fit, conf.int = TRUE,
pval = TRUE,data = new_df,
legend.labs=c("Sex=1", "Sex=2"),
ggtheme = theme_minimal())
# 利用survminer
# 似乎对因子变量不怎么支持
curve <- surv_adjustedcurves(mod2, variable = 'sex', data = lung, method = "average")
ggadjustedcurves(mod2, variable = 'sex', data = lung, method = "average")
# 尝试adjustedCurves的用法
adjsurv <- adjustedCurves::adjustedsurv(data=lung,
variable="sex",
ev_time="time",
event="status",
method="direct",
outcome_model=cox_mod,
conf_int=TRUE)
# plot with confidence intervals
plot(adjsurv, conf_int=TRUE)
```
**两个模型的比较**
```{r}
anova(mod2, mod) #fist the reduced model, second the full model
```
## *KM plot*
KM plot一个很重要的点在于,得有一个分类数据。
一般来说,log-rank检验的应用条件跟COX回归类似,都需要满足等比例风险假定条件,通俗来说,最好是两条曲线大致平行(当然这不是严格定义,但容易理解)。Wilcoxon检验对服从对数正态分布的生存数据比log-rank检验更好一些。
最后,简单说一下其它检验方法。
除了log-rank检验和Wilcoxon检验,还有其它几种常见的,也是软件提供的。如Peto检验,Tarone-Ware检验。它们与log-rank检验和Wilcoxon检验的不同就是权重不一样,如Tarone-Ware检验的权重是根号N,即N的平方根,也就是说,它也是越来越小,但是既然开了根号,那变化就没有那么大了。Peto检验的权重是生存估计函数,也是一个变化值。所以,其实根据这些权重,就不难理解这些方法侧重的方向。
最最后,几种检验方法应用场景小结。
从统计学角度,这几种检验方法其实所做的无效假设都是一样的,都是为了检验组间的生存曲线是否有统计学差异。然而,具体细节上有不同。如果你在研究时,确信某种疗法在一开始效果较好,随着时间推移,可能效果会减弱,此时应事先就确定采用Wilcoxon检验(或Tarone-Ware检验,视情况而定)而不是log-rank检验。采用log-rank检验,实际上相当于你对某种疗法并没有太多的概念,可能认为该疗法在整个研究期间应该效果差不多。
严格来说,统计学不应该根据事后的生存曲线图和p值来选择方法,而应该在研究设计时就有计划,确定统计学方法。因为它们所对应的情形并不相同。
To perform a standard log rank test, one can use
`survdiff(Surv(time, status) ~ group, data = obs, rho = 0)`
How could you perform a Wilcoxon signed rank test, a Peto test or a Fleming-Harrington test?
to get a "Peto test" is to use rho=1, or we can use the `survMisc` package:
`library(survMisc) comp(fit)$tests$lrTests`,`comp(ten(fit))`结果里包含一系列的计算结果。R在这种时候容易给人一种不值得信任的感觉,如果有一个R统计联盟就好了。
*SAS*中在计算CI时,默认使用的是loglog方法,而R默认的是log,可以更改`survfit`的参数,`survfit(Surv(time, status==2) ~ sex, data = lung, conf.type = "log-log")`
查看survfit.formula的帮助页。
```{r}
# ?survfit.formula
fit <- survfit(Surv(time, status==2) ~ sex, data = lung,
conf.int = 0.95,
se.fit = TRUE,
conf.type = "log-log" # 和SAS保持一致
)
fit # 中位生存期以及95%CI
# 中位生存期、95%CI
surv_median(fit)
surv_pvalue(fit)
# Estimating x-year survival
# 查看某个时间点的生存率, 如下示例为1年时的生存率
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
```
```{r}
# 从survdiff中提取pvalue
sd <- survdiff(Surv(time, status) ~ sex, data = lung)
1 - pchisq(sd$chisq, length(sd$n) - 1)
# directly calculate the log-rank test p-value using survdiff()
survdiff(Surv(time, status) ~ sex, data = lung)
# KM plot
pp <- ggsurvplot(fit,
data = lung,
surv.median.line = "hv",
legend.title = "Subtype",
legend.labs = c("Male", "Female"),
legend = c(0.9,0.9),
censor.shape="|", censor.size = 4,
pval = TRUE,
pval.method = TRUE,
pval.size = 4,
# pval.coord = c(0, 0.03),
conf.int = FALSE,
conf.int.style = "step",
add.all = FALSE,
risk.table = 'abs_pct',
tables.height = 0.2,
tables.theme = theme_cleantable(),
risk.table.fontsize = 5,
risk.table.y.text = TRUE,
fontsize = 8,
font.x = c(15),
font.y = c(15),
# font.family = 'Book Antiqua',
# font.tickslab = c(20, "plain"),
ncensor.plot = FALSE,
ncensor.plot.height = 0.25,
# cumevents.y.text = 8,
# ylab="Cumulative survival (percentage)",
xlab = " Time(Days)",
break.time.by = 50,
palette = 'aaas',
ggtheme = cowplot::theme_cowplot(font_size = 15) +
theme(axis.text.x = element_text(face = 'bold'),
axis.text.y = element_text(face = 'bold'),
axis.line = element_line(linetype = 'solid',
size = 1),
axis.ticks = element_line(size = 1),
axis.title = element_text(face = 'bold')
)
# ggtheme = theme_survminer() +
# theme(legend.text = element_text(size = 20),
# legend.title = element_text(size = 20),
# axis.text.x = element_text(size = 20),
# text = element_text(size = 20),
# plot.title = element_text(hjust = 0.0)
# )
)
pp$plot <- pp$plot +
labs(title = "Survival curves",
subtitle = "Based on Kaplan-Meier estimates"
) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
pp
```
*此处增加一个ggsurvfit包的示例*
此包推荐拟合时用其包装好的survfit2函数;
`CDISC ADaM ADTTE`推荐的数据格式中,1竟然是censored。。。无语。。需要注意。。。
```{r}
fit2 <- survfit2(Surv(time, status==2) ~ sex, data = lung,
conf.int = 0.95,
se.fit = TRUE
)
fit2 %>%
ggsurvfit::ggsurvfit() +
add_confidence_interval() +
add_risktable() +
add_quantile(y_value = 0.6, color = "gray50", size = 0.75) +
scale_color_manual(values = c('#54738E', '#82AC7C')) +
scale_fill_manual(values = c('#54738E', '#82AC7C')) +
add_pvalue(caption = "Log-rank {p.value}") +
theme(legend.position = "bottom") +
guides(color = guide_legend(ncol = 1)) +
labs(title = "KM plot",
y = "Percentage Survival")
```
*Landmark分析:*如果随访的不同时间段间基础风险发生变化,需要分时间段分别分析。处理KM曲线交叉的情况.
更一般的,当处理非baseline变量时
```{r}
# landmark 分析, 分时间段分析, 以某个时间点为截断点,分别做KM分析,以解决time-to-treatment bias或immortal time bias
# https://github.com/jinseob2kim/jskm
# https://bbisr.shinyapps.winship.emory.edu/CASAS/
jskm(fit, mark = F, surv.scale = "percent", pval =T, table = T,
cut.landmark = 500, # 需要修改的函数
showpercent = T)
p3 <- jskm(fit,
pval = T,
table = T,
label.nrisk = "Numbers at risk",
ystrataname = "Treatment",
ystratalabs = c("Obs", "Lev", "Lev+5FU"),
marks = F,
linecols = "Set2",
xlabs = "Time (Day)",
cut.landmark = 365)#选择分界点
# landmark的分析,首先需要选择landmark time
# 然后再选择time 大于landmark time的人群
# Calculate follow-up from landmark time and apply traditional log-rank tests or Cox regression
```
## Random Forests Model for survival
随机森林生存分析。利用随机森林算法执行的生存分析方法,可用R包,包括`ranger`, `randomForestSRC`包等,以下列一个`ranger`的常规用法,关于`randomForestSRC`的用法见另一篇笔记。
随机生存森林,和随机森林一致,可以筛选变量,包括变量重要性的排序等;以及构建预测模型。一般适用于右删失数据。
```{r}
# ranger model
r_fit <- ranger::ranger(Surv(time, status) ~ trt + celltype +
karno + diagtime + age + prior,
data = vet,
mtry = 4,
importance = "permutation",
splitrule = "extratrees",
verbose = TRUE)
# Average the survival models
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob,mean)
# Plot the survival models for each patient
plot(r_fit$unique.death.times,r_fit$survival[1,],
type = "l",
ylim = c(0,1),
col = "red",
xlab = "Days",
ylab = "survival",
main = "Patient Survival Curves")
```
## Parametric Survival Analysis
> https://github.com/FJRubio67/ShortCourseParamSurvival
> https://rpubs.com/FJRubio/PSM
> https://rpubs.com/FJRubio/PSRM
参数生存分析包含two regression models: PH(the proportional hazards) and AFT(accelerated failure time).
By selecting a parametric from the catalogue of distributions, we are making assumptions about the possible hazard rates of the true distribution. Selecting the best model using formal tools is usually recommended.
对于参数类的生存分析,需要选择所使用的参数方法。
- Weibull, Lognormal, and Generalised Gamma.(没有太看明白,难道不能加协变量吗?)
- PH
- AFT
- General Hazard (GH) model
分布和两种model的关系,
*Weibull分布:*
生存分析的参数模型中,指数分布最为简单,不过其所假定的死亡风险不变太过理想化,在实际中应用不多。`Weibull`分布除了风险参数lambda参数外,多一个控制曲线形状的参数,用p表示,p<1表示多数病人较早死亡,少数人生存时间较长;p>1表示早期死亡例数较少,后期死亡数较多;p=1表示病人死亡数不随时间变化,此时相当于指数分布(指数分布可以看作其的一个特殊情况)。那么Weibull回归分析结果长什么样呢?其实跟普通回归没什么太大区别,也是给出截距、各影响因素的参数估计值。如果说唯一有不同的地方就是,它还给出了Weibull分布的参数。你会发现p=1.72,而且从置信区间来看,明显大于1。这说明了两件事:第一,既然p从统计学上来看不等于1,说明这不是指数分布,而是Weibull分布。第二,既然p大于1,说明在这个数据中,死亡风险一开始较低,后面逐渐增大。从生存角度则说明,一开始死亡人数可能较少,后面反而死亡人数越来越多。
*lognormal: *
Lognormal 分布是一种连续概率分布,通常用于描述正态分布的对数。在 R 中,可以使用“survival”包中的“survreg”函数进行 Lognormal 分布分析。
*shared frailty model: *
Shared frailty model是一种生存分析模型,它考虑了群组内个体之间的相关性,常用于分析有群组结构的数据,如医院、家庭、社区等。在R中,可以使用coxme包实现shared frailty model。类似于重复测量资料。详细见笔记`mixed_cox.Rmd`.
```{r, include=FALSE}
library(flexsurv) # 此包提供诸多参数类的拟合
library(eha) # event history and survival analysis
library(MASS)
library(fitdistrplus)
```
### hp model
类似的三个模型是?见上面的描述。weibull 分布,或者`log-logistic`分布,是值得在生存分析中应用的,不过作为参数模型,需要对数据的分布进行一些检验。
```{r}
# Survival model using a Weibull distribution
# wb 和 flexsurvreg 拟合函数分别可以做对应的weibull分析
# By default R reports coefficients in the accelerated failure time metric.
wb <- survival::survreg(formula = Surv(time, status) ~ 1, data = lung,
dist = "weibull")
#
fit.weibull <- flexsurv::flexsurvreg(formula = Surv(time, status) ~ 1, data = lung,
dist = "weibull")
fit.weibull
```
*interpret the result:* shape参数确定了生存时间分布的形状,而scale参数确定了生存时间分布的比例因子。
从置信区间来看,shape明显大于1,即是死亡风险在前期较低后期较高。
```{r}
# summary(fit.weibull)
# 预测患者生存概率,对于有自变量的情况下
lung$sex <- factor(lung$sex)
weib_model <- flexsurv::flexsurvreg(formula = Surv(time, status) ~ age + sex + ph.ecog,
data = lung,
dist = "weibull")
fit.aft <- eha::aftreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung,
dist = "weibull" # 此处参数可以选择weibull等
)
new_data <- data.frame(age = 65, sex = "1", ph.ecog = 1)
predict(weib_model, new_data, type = "survival")
```
```{r}
# 拟合生存函数和风险函数
spline_fit <- flexsurvspline(Surv(time, status) ~ 1, data = lung)
```
```{r}
# 绘制生存概率密度函数和风险密度函数
ggplot(data.frame(time = seq(0, 50, by = 0.1)), aes(x = time)) +
geom_line(aes(y = dweibull(time, shape = weibull_fit$estimate[1],
scale = weibull_fit$estimate[2])),
color = "red") +
geom_line(aes(y = -log(survfit(Surv(time, status) ~ 1)$surv)),
color = "blue") +
ylab("Probability")
```
### AFT fit
> https://en.wikipedia.org/wiki/Accelerated_failure_time_model
AFT(Accelerated Failure Time), AFT models assume that the hazard rate, or risk of an event occurring, is constant over time. This model estimates parameters for the effects of predictor variables on the time to event.
```{r}
# AFT model using a lognormal baseline distribution
fit.aft <- eha::aftreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung |> na.omit(),
dist = "lognormal" # 此处参数可以选择weibull等
)
summary(fit.aft)
```
```{r}
fit.aft2 <- eha::aftreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung,
dist = "weibull"
)
fit.aft2
```
*interpret the result: *本模型的结果和SPSS结果保持一致。不过一些Coef值的正负不一致。
*Cox-Snell Residual plot*
```{r}
# 四幅图的诊断
plot(fit.aft2)
```
PH model, 指的是Proportional hazards model with parametric baseline hazard(s). Allows for stratification with different scale and shape in each stratum, and left truncated and right censored data.
```{r}
# PH model using a lognormal baseline distribution
fit.ph <- eha::phreg(formula = Surv(time, status) ~ age + sex , data = lung,
dist = "lognormal")
summary(fit.ph)
```
```{r}
fit.ph2 <- eha::phreg(formula = Surv(time, status) ~ sex , data = lung,
dist = "gompertz")
plot(fit.ph2, fn = "haz", col = c("red", "blue"),
main = "Male mortality by social class",
ylab = "Gompertz hazard",
xlab = "Sex")
```
```{r}
# Comparison of the three models using AIC
AIC.aft <- -2*(fit.aft$loglik[2]) + 2*length(fit.aft$coefficients)
AIC.ph <- -2*(fit.ph$loglik[2]) + 2*length(fit.ph$coefficients)
AIC.aft
AIC.ph
```
```{r}
fitdistrplus::fitdist()
```