-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathextract_and_cross_map_ICD_codes_MVR_and_Schnitzer_2011.Rmd
1435 lines (1204 loc) · 80.5 KB
/
extract_and_cross_map_ICD_codes_MVR_and_Schnitzer_2011.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: Extract & cross-map ICD codes from MVR papers and Schnitzer et al 2011
author: "Jan Savinc"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: true
toc_float: true
code_folding: hide
editor_options:
chunk_output_type: console
bibliography: bibliography.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, warning=FALSE}
library(tidyverse)
library(readxl)
library(knitr)
library(huxtable)
library(fuzzyjoin)
```
# Intro
# Rationale for loading complete listing of ICD codes
The reason for importing the complete listing of ICD codes is to assist in extracting the ICD codes used as inclusion/exclusion criteria in previous literature, specifically where ranges of codes were defined, such as "X85 - Y03, Y08 - Y09"; not all codes in the X85 - X99 range are valid codes, for example, and by importing the ocmplete listing, we can generate a range of valid codes. These will be used as predictors in the analysis in the CHASe project.
# Loading ICD codes
The UK uses unmodified WHO-published ICD codes (unlike the US where they use the - CM modification, or Australia, where they use -AM). ICD-9 was used until 1996, and ICD-10 since then.
## UK codes
These were previously imported from the UK Biobank data, and the processed data is loaded:
```{r}
icd9_codes_UK <- read_csv("./processed_ICD_codes/master_icd9_code_list_UK(WHO).csv")
icd10_codes_UK <- read_csv("./processed_ICD_codes/master_icd10_code_list_UK(WHO).csv")
```
## USA (CM) codes
These are included in the *icd* package for R.
```{r}
library(icd)
```
## General Equivalence Mappings for the CM versions of ICD
The latest GEM files were downloaded from the [CDC website](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2018/Dxgem_2018.zip). The 2018 release was used because it was the most recent at the time of writing.
```{r}
gem_icd9cm <-
read_fwf("./raw/2018_I9gem.txt",
fwf_cols(
source = c(1, 5),
target = c(7, 13),
approximate=c(15,15),
no_map=c(16,16),
combination=c(17,17),
scenario=c(18,18),
choice_list=c(19,19)
)
)
gem_icd10cm <-
read_fwf("./raw/2018_I10gem.txt",
fwf_cols(
source = c(1, 7),
target = c(9, 13),
approximate=c(15,15),
no_map=c(16,16),
combination=c(17,17),
scenario=c(18,18),
choice_list=c(19,19)
)
)
## for our purposes the distinction between forward & backward maps isn't relevant, so we can combine the two to simplify finding matches
gem_combined <-
bind_rows(
gem_icd9cm %>% rename(icd9cm=source, icd10cm=target) %>% mutate(direction = "f"),
gem_icd10cm %>% rename(icd9cm=target, icd10cm=source) %>% mutate(direction = "b")
)
```
## Parsing code ranges
Sources often supply code definitions as ranges, e.g. *Y08-Y09*. Because of the complicated rules of how codes are defined, ranges need to be parsed into sequences, that is, *Y08-Y09* needs to be turned into the sequence *Y08,Y09*. This is not trivial, so below is an algorithm for producing these. Note that this is omitted in the report, but is included in the .Rmd source file!
```{r}
## helper function to generate codes from a range given in a string
## e.g. given "Y08 - Y09", generate c("Y08", "Y09")
## this assume 3-character codes and own't work
generate_codes_from_range <- function(codeRangeString) {
## if no hyphen in string, return the string
if (!str_detect(codeRangeString, "-|–")) return(codeRangeString)
hasLetters <- str_detect(codeRangeString, pattern="[a-zA-Z]")
hasDecimals <- str_detect(codeRangeString, pattern="\\.")
codes <- toupper(trimws(unlist(str_split(codeRangeString, pattern= "-|–")))) # extract codes, and sort them
numbers <- parse_number(codes) # extract the numbers and sort them
numCharacters <- max(nchar(codes))
if (hasLetters) {
lettrs <- gsub(codes, pattern="^([a-zA-Z]{1}).*$", replacement = "\\1")
lettersAreSame <- lettrs[1] == lettrs[2]
}
incrementStep <- 1
if (hasDecimals) {
numDecimalPlaces <- max(nchar(str_split(codes, pattern = "\\.", simplify = TRUE)[,2]))
incrementStep <- 1 / 10^numDecimalPlaces
}
## in the simple case where no letters are included, just generate a sequence from the numbers given and pad
if (!hasLetters) {
## now we generate the pre-padded sequence of numbers - depending on the format required, they'll be modified next
sequenceNumbersPrePadding <- seq(from=numbers[1], to=numbers[2], by=incrementStep)
if (hasDecimals) { # if it has decimals we pad using formatC for decimals, "f"
finishedSequence <- formatC(sequenceNumbersPrePadding, format = "f", digits=numDecimalPlaces, width=numCharacters, flag="0")
} else { # if nop decimals, we pad by adding a "0" until desired width is reached
finishedSequence <- str_pad(sequenceNumbersPrePadding, width=numCharacters, pad="0")
}
return(finishedSequence)
} else { ## else if it has letters, it's a little more complicated
## if the letters are the same
if (lettersAreSame) {
sequencePrePadding <- seq(from=numbers[1], to=numbers[2], by=incrementStep)
## pad the sequence to the required width - 1 to account for the letter, but differently depending on whether there are decimals
if (hasDecimals) {
paddedSequence <- formatC(sequencePrePadding, format = "f", digits=numDecimalPlaces, width=numCharacters-1, flag="0")
} else {
paddedSequence <- str_pad(sequencePrePadding, width=numCharacters-1, pad="0")
}
finishedSequence <- paste0(lettrs[1], paddedSequence)
return(finishedSequence)
} else { ## if the letters are not the same it's more complex, because we need to generate the sequences between the letters also
lettersIndices <- seq.int(from=which(LETTERS==lettrs[1]), to=which(LETTERS==lettrs[2]))
lettersSequence <- LETTERS[lettersIndices]
## for the middle sequence between the letters provided, figure out which numbers are needed, they'll go between 0 and a maximum number, e.g.
## for E75 to F12, we need E75-99 and F00-12; for E72.1 to F12.3, we needE72.1-99.9 etc.
maxNumber <- (10^(numCharacters - 1)) - 1
if (hasDecimals) maxNumber <- (10^(numCharacters-2) - 1) / 10^numDecimalPlaces
if (hasDecimals) {
numbersSequenceLeftWithpadding <- formatC(seq(from=numbers[1], to=maxNumber, by=incrementStep), width = numCharacters-1, format = "f", flag = "0", digits=numDecimalPlaces)
numbersSequenceRightWithpadding <- formatC(seq(from=0, to=numbers[2], by=incrementStep), width = numCharacters-1, format = "f", flag = "0", digits=numDecimalPlaces)
numbersSequenceMiddleWithpadding <- formatC(seq(from=0, to=maxNumber, by=incrementStep), width = numCharacters-1, format = "f", flag = "0", digits=numDecimalPlaces)
} else {
numbersSequenceLeftWithpadding <- formatC(seq(from=numbers[1], to=maxNumber, by=incrementStep), width = numCharacters-1, format = "d")
numbersSequenceRightWithpadding <- formatC(seq(from=0, to=numbers[2], by=incrementStep), width = numCharacters-1, format = "d", flag = "0")
numbersSequenceMiddleWithpadding <- formatC(seq(from=0, to=maxNumber, by=incrementStep), width = numCharacters-1, format = "d", flag = "0")
}
finishedSequence <-
c(
paste0(head(lettersSequence,n=1),numbersSequenceLeftWithpadding),
as.vector(unlist(sapply(lettersSequence[c(-1,-length(lettersSequence))], function(x) paste0(x,numbersSequenceMiddleWithpadding)))),
paste0(tail(lettersSequence,n=1),numbersSequenceRightWithpadding)
)
return(finishedSequence)
}
}
## note that we've omitted checking whether the produced codes are actually in use
}
```
## Helper functions
Below are some helper functions to use in looking at the codes
```{r}
## remove dot
remove_dot <- function(x) gsub(x, pattern="\\.", replacement="")
## helper function to expand range codes eg "960-978" into numeric range...
## and collapse to string eg "960,961,962"
expand_range_to_csv_simple_case <- function(this_range) {
this_range %>%
str_split(., pattern="–|-") %>% # the Schnitzer tables use mid-length dash
unlist %>%
trimws %>%
as.numeric %>%
{seq(from=min(.), to=max(.), by=1)} %>%
paste(., collapse=",")
}
## helper functions for generating forward & backward mapped codes from the CM GEM lists
compile_mapping_cm <- function(code) {
lapply(code, function(x) {
x <- remove_dot(x)
gem_combined %>%
filter(str_starts(icd9cm, pattern=x)) %>%
mutate(prefix_code=x)
}) %>% bind_rows
}
## shortcuts for finding codes with a prefix
find_prefix_in_master_list <- function(this_code, master_list) {
master_list %>%
filter(str_starts(code, remove_dot(this_code)))
}
find_prefix_in_icd9 <- function(this_code) {
find_prefix_in_master_list(this_code, master_list = icd9_codes_UK)
}
find_prefix_in_icd10 <- function(this_code) {
find_prefix_in_master_list(this_code, master_list = icd10_codes_UK)
}
## shortcuts for finding codes by keyword
find_keyword_in_master_list <- function(key_word, master_list) {
master_list %>%
filter(str_detect(meaning, regex(pattern = key_word, ignore_case = TRUE)))
}
find_keyword_in_icd9 <- function(key_word) {
find_keyword_in_master_list(key_word, master_list = icd9_codes_UK)
}
find_keyword_in_icd10 <- function(key_word) {
find_keyword_in_master_list(key_word, master_list = icd10_codes_UK)
}
## shortcuts for looking up child codes in ICD-9-CM and ICD-10-CM
find_children_in_icd_cm <- function(parent_code) {
tibble(
code = icd::children(as.character(parent_code))
) %>%
mutate(
description = sapply(code, icd::explain_code, USE.NAMES = FALSE)
)
}
```
# ICD-9-CM to ICD-10-CM cross-mapping
Cross-mapping was done manually in consultation with the master lists of ICD-9 and ICD-10 codes, and the online version of the [WHO ICD-10](https://icd.who.int/browse10/2016/en). To interpret the original intent of ICD-9-CM codes specified by Schnitzer et al. 2011 and Schnitzer 2004, we also consulted the -CM lists of codes as provided by the *icd* package, specifically using the function *icd::explain_code()*, and the online listing of codes at the [following address](http://icd9.chrisendres.com/index.php).
Finally, the ICD-9-CM to ICD-10-CM (and reverse) General Equivalence Mapping was also consulted - the helper function *compile_mapping_cm()* was used for this.
The helper functions defined earlier were used to find codes in the master lists by prefix, *find_prefix_in_icd9()*, or by keyword *find_keyword_in_icd9()*.
Some mappings were simple: e.g. gonococcal infection in ICD-9, *098*, has an equivalent ICD-10 code, *A54*.
The mappings were more complicated with edge cases, especially where codes fell into Not otherwise specified (NOS) or Other categories, because some conditions that ICD-9 categorised as Other were categorised as a separate code in ICD-10. For example: E988 in ICD-9 (Injury by other/unspecified means undeteremined whether accidentally or purposely inflicted), subdivided into: `r icd9_codes_UK %>% filter(str_starts(code,"E988")) %>% select(meaning) %>% kable`. Some of the 4th digit categories in ICD-9 exist as separate 3-character codes in ICD-10: `r icd10_codes_UK %>% filter(code %in% paste0("Y",c(19,26,27,31:34))) %>% select(meaning) %>% kable`.
## Failed attempt at fully algorithmic mapping
The plan was initially to compile a full GEM mapping from ICD-9-CM to ICD-10-CM, and then to attempt mapping from the -CM variant codes to ICD-9 and ICD-10. This turned out to be too arduous a task: the fully automated GEM mapping produces several false leads, and omits some meaningful mappings that are easily spotted by a human, but difficult to write rules for.
# Importing ICD inclusion/exclusion criteria from past literature
There are two sources of codes: the MVR (Maltreatment & Violence-related) code lists [@agi_2010; @gilbert_2012], and the codes suggestive of maltreatment [@schnitzer_2011]
# MVR - Maltreatment & Violence-related codes
Arturo Gonzalez-Izquierdo kindly provided the list of ICD codes compiled by his team for the 2010 paper [@agi_2010], which is in the file **/raw/CODES - concern groups 2011 v6.xlsx**, specifically in the sheet *Victimization related*.
A newer listing of codes is available in the supplementary file to the Gilbert et al. 2012 paper [@gilbert_2012], which is more compact and also adds a few more codes. The supplementary file is included here: **/raw/1-s2.0-S0140673611610878-mmc1.pdf**
## Extracting MVR codes from Gilbert et al 2012 paper
```{r}
## Note: the way tabulizer extracts the table, we receive a list, the first element of which is a matrix of rows and columns;
## The codes of interest are in column 2, in rows 26 to 33
mvr_codes_raw <- tabulizer::extract_tables(file="./raw/1-s2.0-S0140673611610878-mmc1.pdf", pages=11)[[1]][26:33,2]
mvr_codes_condensed <-
mvr_codes_raw %>%
tibble(raw=.) %>%
filter(raw!="") %>%
separate(raw, into=c("category_mvr","icd_9","icd_10"), sep="\\: |\\; and ") %>%
mutate(
grouping = if_else(condition = str_starts(category_mvr, "\\d"), true = "Maltreatment-related injury", false = "Marker conditions in infants (<1 y)"),
category_mvr = gsub(category_mvr, pattern="^\\d\\.\\s+", replacement=""),
category_mvr = gsub(category_mvr, pattern="\\-syndrome", replacement=""),
icd_9 = gsub(icd_9, pattern="^.*codes\\s+", replacement=""),
icd_10 = gsub(icd_10, pattern="^.*code(s)?\\s+", replacement=""),
icd_10 = gsub(icd_10, pattern="42\\.8", replacement="S42.8") # fix missing S in one of the codes
) %>%
mutate_all(~trimws(.)) %>%
mutate_all(~gsub(x=., pattern="\\s\\s+", replacement=" ")) %>%
select(grouping, category_mvr, icd_9, icd_10) %>%
mutate(mvr_index=1:n())
mvr_codes_long <-
mvr_codes_condensed %>%
pivot_longer(cols=starts_with("icd"), names_to="icd_version", values_to="codes") %>%
mutate(icd_version=parse_number(icd_version)) %>%
separate_rows(codes, sep = "\\,\\s*|,\\s*or") %>%
separate_rows(codes, sep = "\\s*or\\s*|\\s*and\\s*") %>%
mutate(
codes=trimws(codes)
) %>%
filter(codes!="") %>%
mutate(codes = sapply(codes, FUN=function(x) generate_codes_from_range(x) %>% paste(collapse = ","), USE.NAMES = FALSE)) %>%
separate_rows(codes, sep=",") %>%
rename(code_decimal=codes) %>%
mutate(code=remove_dot(code_decimal))
write_csv(x = mvr_codes_long %>% select(-mvr_index), path="./processed_ICD_codes/MVR_codes_long.csv")
```
## Extracting Gonzalez-Izquierdo et al. 2010 codes
Note: see above for a newer listing of codes - this listing can be considered obsolete now
The spreadsheet is nicely formatted for reading, and we can remove several rows and columns to make importing the data easier.
Columns 2,3,4 correspond to code description, ICD-10 code, and ICD-9 code, respectively. The data is in rows 2 to 40, which in Excel corresponds to the cell range **B2:B40**.
```{r}
agi_2010_file <- "./raw/CODES - concern groups 2011 v6.xlsx"
agi_2010_raw <- read_excel(agi_2010_file, sheet = "Victimization related", trim_ws = TRUE, range = "B2:D40")
## this produces a "condensed" file, meaning that several codes are expressed as ranges rather than individual values, and some codes are listed with commas separating individual entries
agi_2010_long_condensed <-
agi_2010_raw %>%
rename(description_icd10=Description, icd10code =`ICD-10 Codes`, icd9code=`ICD-9 CODES`) %>% # rename to standard names
filter(!is.na(description_icd10)) %>% # remove blank rows
mutate( # extract header to include as broad maltreatment type variable
category_mvr = if_else(
condition = is.na(icd10code),
true = description_icd10,
false = as.character(NA)
)
) %>%
fill(category_mvr) %>% # LOCF on maltreatment type
filter(!is.na(icd10code) & !is.na(icd9code)) %>% # remove header row
mutate(agi_index = 1:nrow(.)) %>%
gather(-description_icd10,-category_mvr,-agi_index, key="icd_version", value="code_decimal") %>%
mutate(
icd_version = case_when(
icd_version == "icd10code" ~ 10,
icd_version == "icd9code" ~ 9,
TRUE ~ as.numeric(NA)
),
code_decimal = gsub(code_decimal, pattern="\\s", replacement="") # remove spaces
)
```
Some codes have been entered as ranges rather than a list of codes. The ranges need to be expanded.
There will be two steps to this:
1. Split up comma-separated values and put them in separate rows
2. Split up ranges by generating all values in the range
```{r}
agi_2010_long <-
agi_2010_long_condensed %>%
separate_rows(code_decimal, sep = ",") %>% # separate comma-separated values
mutate(code_decimal = sapply(code_decimal, function(x) {generate_codes_from_range(x) %>% paste(collapse=",")}, USE.NAMES = FALSE)) %>% # convert range-defined codes to comma-separated values
separate_rows(code_decimal, sep = ",") %>% # split comma-seprated values again
mutate(code = remove_dot(code_decimal)) #%>%
# filter((icd_version==9 & code %in% icd9_codes_UK$code_decimal) | (icd_version==10 & code %in% icd10_codes_UK$code_decimal)) # remove non-UK codes
## now write the file
write_csv(agi_2010_long, path = "./processed_ICD_codes/AGI_et_al_2010_ICD_codes.csv")
```
## Comparing AGI et al 2010 to MVR codes
There are some codes that aren't in one list or the other:
Not in AGI et al 2010 [@agi_2010]:
```{r}
mvr_codes_long %>% filter(str_starts(grouping,"Maltreatment")) %>% anti_join(agi_2010_long, by=c("code_decimal","icd_version")) %>% select(code, icd_version) %>% left_join(bind_rows(icd9_codes_UK %>% mutate(icd_version=9),icd10_codes_UK %>% mutate(icd_version=10)), by = c("code","icd_version")) %>% knitr::kable()
```
Not in Gilbert et al 2012 [@gilbert_2012]:
```{r}
agi_2010_long %>% anti_join(mvr_codes_long %>% filter(str_starts(grouping,"Maltreatment")), by=c("code_decimal","icd_version")) %>% select(code_decimal, icd_version) %>% left_join(bind_rows(icd9_codes_UK %>% mutate(icd_version=9),icd10_codes_UK %>% mutate(icd_version=10)), by = c("code_decimal","icd_version")) %>% knitr::kable()
```
## Merging the MVR codes for a complete-most list
Given that there are codes in each of the lists, we can simply merge the two lists and use those codes!
```{r}
order_of_mvr_categories <- agi_2010_long$category_mvr %>% unique
mvr_codes_merged_long_raw <-
bind_rows(
mvr_codes_long %>% filter(str_starts(grouping,"Maltreatment")) %>% select(-grouping),
agi_2010_long %>% select(-description_icd10, -agi_index)
) %>%
distinct
codes_that_are_matched_by_shorter_codes <- map(
mvr_codes_merged_long_raw$code,
function(x) {
pool_to_match <- mvr_codes_merged_long_raw$code[mvr_codes_merged_long_raw$code!=x]
pool_to_match[str_detect(pool_to_match, pattern=paste0("^", x))]
}
) %>% unlist
mvr_codes_merged_long <-
mvr_codes_merged_long_raw %>%
filter(!code %in% codes_that_are_matched_by_shorter_codes) %>%
arrange(factor(category_mvr, levels = order_of_mvr_categories))
write_csv(mvr_codes_merged_long, path = "./processed_ICD_codes/MVR_codes_merged_long.csv")
```
## Condensed version of merged codes
```{r}
mvr_codes_merged_condensed_long <-
bind_rows(
mvr_codes_condensed %>% filter(str_starts(grouping,"Maltreatment")) %>% select(-grouping) %>% pivot_longer(cols=starts_with("icd"), names_to="icd_version", values_to="code_decimal") %>%
mutate(icd_version=parse_number(icd_version)),
agi_2010_long_condensed %>% select(-description_icd10)
) %>%
mutate(code_decimal=gsub(code_decimal,pattern="(\\,\\s)*or|(\\,\\s)*and", replacement=",")) %>%
separate_rows(code_decimal, sep="\\,") %>%
mutate(
code_decimal=trimws(code_decimal),
code_decimal = gsub(code_decimal, pattern="(\\s*\\-\\s*)", replacement=" - ")
) %>%
filter(str_detect(code_decimal,"\\-")|code_decimal %in% mvr_codes_merged_long$code_decimal) %>% # only keep codes that are expressed as range or are actually in the final merged list
select(-mvr_index, -agi_index) %>%
distinct %>%
arrange(factor(category_mvr, levels = order_of_mvr_categories), icd_version, code_decimal) %>%
filter( # go through list by hand, pick out redundant entries...
!(icd_version==9 & code_decimal %in% c("994.2","994.3")) &
!(icd_version==9 & code_decimal %in% c("E960", "E960 - E966", "E968", "E969")) &
!(icd_version==10 & code_decimal %in% c("X85 - Y09")) &
!(icd_version==9 & code_decimal %in% c("E983 - E989")) &
!(icd_version==10 & code_decimal %in% c("Y20 - Y34")) &
!(icd_version==10 & code_decimal %in% c("Z72.3 - Z72.9", "Z60", "Z61", "Z62", "Z63"))
) %>%
group_by(
category_mvr, icd_version
) %>%
summarise(
codes = paste(code_decimal, collapse = ", ")
)
mvr_codes_merged_condensed_wide <-
mvr_codes_merged_condensed_long %>%
pivot_wider(names_from = icd_version, values_from = codes, names_prefix = "codes_icd_") %>%
arrange(factor(category_mvr, levels = order_of_mvr_categories))
write_csv(mvr_codes_merged_condensed_wide, path = "./processed_ICD_codes/MVR_codes_merged_condensed_for_appendix.csv")
```
## MVR codes & age appropriateness
When using MVR codes to tag mothers' medical records (past the age of 18), some codes become less appropriate to use, specifically codes for smoking, holiday relief care, etc. For this purpose we'll produce a complete list of captured ICD codes by MVR, so we can further decide on which ones would be appropriate to use with adults.
Specifically, the codes in the categories Adverse social circumstances are suspect!
```{r}
mvr_codes_expanded_for_age_check <-
bind_rows(
icd9_codes_UK %>%
regex_inner_join(
mvr_codes_long %>%
filter(icd_version==9 & category_mvr=="Adverse social circumstances") %>%
mutate(prefix = paste0("^",code)) %>% # make a prefix for fuzzy matching
select(grouping, category_mvr,prefix)
,
by=c("code"="prefix")
) %>% mutate(icd_version=9)
,
icd10_codes_UK %>%
regex_inner_join(
mvr_codes_long %>%
filter(icd_version==10 & category_mvr=="Adverse social circumstances") %>%
mutate(prefix = paste0("^",code)) %>% # make a prefix for fuzzy matching
select(grouping, category_mvr,prefix)
,
by=c("code"="prefix")
) %>% mutate(icd_version=10)
) %>%
mutate(prefix=gsub(prefix,pattern="\\^", replacement="")) %>%
select(category_mvr, prefix, icd_version, code, meaning) %>%
mutate(notes="") # add an empty notes column for manual review!
write_csv(x = mvr_codes_expanded_for_age_check, path = "./processed_ICD_codes/MVR_codes_expanded_for_age_check.csv")
```
# Schnitzer et al. 2011
[Schnitzer et al 2011](https://doi.org/10.1016/j.chiabu.2010.06.008) [@schnitzer_2011] provide a list of ICD codes suggestive of child maltreatment, but don't provide the codes in an easily parsed format.
The following documents how I extracted the tables and converted them to a usable format.
I tried scraping the web article with no success, so I used a Google Chrome add-on called [*table-to-spreadsheet*](https://chrome.google.com/webstore/detail/table-to-spreadsheet/haidhlbpihfihbjcggmffnmhgiddjcoc) and saved all tables as .xlsx from the paper to folder *schnitzer2011\\raw*.
Note that MS Excel displays a warning about the data being corrupted when you try to open them, but displays the tables just fine - I assume this is the chrome plugin's fault, maybe not following the .xlsx format correctly.
The following tables were taken from the web article:
```{r}
dirTables <- "./schnitzer2011/"
dir(dirTables)
```
## Understanding the criteria
Each table is accompanied by additional notes denoted with superscript letters (^a^, ^b^, etc.) which need to be dealt with appropriately and removed from the table contents.
Depending on how we define what codes should count as suggestive of child maltreatment, we would use different tables. Schnitzer et al. 2011 compiled a larger list of suggestive codes through expert consultation and literature review which they then further refined by manually reviewing random samples of 50 cases for each ICD code deemed suggestive, and only retaining those where 66% or more of cases where thought to possibly or probably indicate child maltreatment.
Therefore there are two lists:
* The prior list of suggestive ICD codes (table A1 in Schnitzer et al 2011) - this list is broader, i.e. a more liberal criterion
* The empirically tested list of suggestive ICD codes (table 2 in Schnitzer et al 2011) - this list is narrower, i.e. a more conservative criterion
In addition, there is a further list of possible sexual maltreatment ICD codes that were not present in the dataset reviewed by Schnitzer et al 2011 and so could not be reviewed, yet may still be suggestive.
## Processing individual Schnitzer 2011 tables
For ease of use in later data analysis, the tables from the Schnitzer et al 2011 paper need to be converted into a format that can be used as a lookup table for use in analysing ICD 9 codes mapping them onto ICD 10 codes.
The tables will be converted to long format, with one row for each ICD code, and all other information in appropriate columns.
## Table 2 (narrower/conservative/empirical list of inclusion criteria)
```{r}
schnitzer_2011_table2_file <- "./schnitzer2011/Table 2. ICD-9 codes suggestivea of maltreatment.xlsx"
schnitzer_2011_table2_raw <- read_excel(schnitzer_2011_table2_file, trim_ws = TRUE)
```
The data is currently in a non-standard format, having been grabbed from the table in the paper.
We can remove the last 4 columns because they report the percentage of cases confirmed as suggesting maltreatment on review. Additionally, we can remove the first row which contains no relevant information. We also rename the columns to a more easily programmed-with format, and remove the leading less-than sign from the age.
The type of maltreatment is currently a heading separating rows, but it should be added as a column instead. For this we use a last observation carried forward (LOCF), from *tidyr* library (*tidyr::fill*). Finally we remove the rows that were used for headers.
Next we deal with with the table notes denoted in the paper by superscript letters (^a^, ^b^).
From the paper:
> a
> The suggestive codes are those where more than 66% of records reviewed were classified as probable or possible maltreatment.
>
> b
> Designates a code with a sample size of less than 5.
Superscript ^a^ was only used in the description of the table, and superscript ^b^ has no bearing on the criteria. This means we can safely remove them from the final table.
```{r}
schnitzer_2011_inclusions_empirical_raw <-
schnitzer_2011_table2_raw %>%
select(1:3) %>% # keeping first 3 columns is same s dropping final 4 columns
slice(-1) %>% # remove 1st row
rename( # rename the columns to a standard format
code_decimal = `ICD-9 code`,
description_icd9cm = `Code description`,
age_less_than = `Age included (years)`
) %>%
mutate(age_less_than = parse_number(age_less_than)) %>%
mutate(
maltreatment_type_schnitzer2011 = # make a new column for type of maltreatment, using those rows that were used as headers
ifelse(
test = str_detect(code_decimal, "^ICD\\-9"), # headers begin with string "ICD-9"
yes = code_decimal,
no = NA
)
) %>%
fill(maltreatment_type_schnitzer2011) %>% # LOCF on maltreatment type
filter(!str_detect(code_decimal, "^ICD\\-9")) %>% # remove header rows
mutate(
description_icd9cm = # use reg.ex. to find b in final position and remove it
gsub(description_icd9cm, pattern = "a$|b$", replacement = "")
) %>%
mutate(inclusion_index = 1:n()) # add an index for later use to each inclusion listed
index_table_schnitzer_2011_inclusions_age <-
schnitzer_2011_inclusions_empirical_raw %>%
select(age_less_than, inclusion_index) %>%
distinct
```
Next we deal with rows where multiple codes are denoted in same row, separated by commas. We'll leave the range-defined codes for poisoning intact for now.
Once we have each inclusion listed and paired with an index number, we'll put them in a separate table for later use in cross-mapping to ICD-10.
```{r}
schnitzer_2011_inclusions_empirical_icd9cm <-
schnitzer_2011_inclusions_empirical_raw %>%
separate_rows(code_decimal, sep=", ")
```
Now we separate the range-defined codes as well (poisoning codes *960-979*):
```{r}
schnitzer_2011_inclusions_empirical_icd9cm <-
schnitzer_2011_inclusions_empirical_icd9cm %>%
mutate(
code_decimal =
sapply(code_decimal, USE.NAMES = FALSE,
FUN= function(x) if (str_detect(x, pattern="–|-")) expand_range_to_csv_simple_case(x) else x)
) %>%
separate_rows(code_decimal, sep=",") %>%
mutate(code = remove_dot(code_decimal)) # add a code with the dot removed
index_table_schnitzer_2011_inclusions_icd9cm <-
schnitzer_2011_inclusions_empirical_icd9cm %>%
select(inclusion_code = code, inclusion_code_decimal=code_decimal, inclusion_index)
## table of categories mapped to inclusion index
index_table_schnitzer_2011_categories <-
schnitzer_2011_inclusions_empirical_icd9cm %>%
select(inclusion_index, maltreatment_type_schnitzer2011) %>%
distinct %>%
mutate(maltreatment_type_schnitzer2011 = sub(maltreatment_type_schnitzer2011, pattern="ICD-9 .* suggestive of ", replacement=""))
```
## Converting to ICD-9
ICD-9 and ICD-9-CM are supposed to agree in codes up to 4 characters long. Some codes in one version aren't defined in the other and vice versa. We'll check how many of the inclusion criteria in Schnitzer 2011 are missing in ICD-9, and if truncating the last digit yields an appropriate code:
```{r}
schnitzer_2011_inclusions_empirical_icd9cm %>%
filter(!code %in% icd9_codes_UK$code) %>%
mutate(truncated_code = str_sub(code, end = -2)) %>%
left_join(icd9_codes_UK, by=c("truncated_code"="code")) %>%
select(code_decimal.x, description_icd9cm, truncated_code, meaning)
```
Looking at the above, some truncated codes matched to ICD-9 codes all appear too unspecific as there's no guarantee that they would have been used for the same conditions.
The ICD-9 heading *852 Subarachnoid, subdural and extradural haemorrhage, following injury* seems to be a decent fit for ICD-9-CM *852.2 Traumatic subdural hemorrhage*. Although the latter code is a sub-code of *852* in ICD-9-CM, the 4th digit in the CM system specifies the location of the haemorrhage, whereas in ICD-9 the 4th digit specifies whether an an open intercranial wound was present - this distinction is made in ICD-9-CM at 5th digit level.
Note that there is already another *852* heading entry in the Schnitzer 2011 list, *852.2 Traumatic subarachnoid hemorrhage*; this will also be included by adding the less specific *852* code to the inclusions.
For the other ICD-9-CM codes without a match in ICD-9, we can attempt to search for keywords:
**362.81 Retinal hemorrhage**
```{r}
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="retina") & str_detect(tolower(meaning), pattern="h.*morrhage|bleed"))
```
**V71.81 Observation for abuse/neglect**
```{r}
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="observation") | str_detect(tolower(meaning), pattern="abus|neglect"))
```
**E869.4 Second-hand tobacco smoke**
```{r}
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="expos") & str_detect(tolower(meaning), pattern="tobacco"))
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="second") & str_detect(tolower(meaning), pattern="tobacco"))
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="second") & str_detect(tolower(meaning), pattern="smoke"))
```
Second-hand smoke (ICD-9-M E869.4) was not included as a code in ICD-9, but it was in ICD-10 as code Z58.7 Exposure to tobacco smoke.
To conclude, there don't seem to be appropriate ICD-9 codes for the above ICD-9-CM codes.
The following table shows the ICD-9-CM and ICD-9 codes matched up so you can see that the mapping makes sense:
```{r}
schnitzer_2011_inclusions_empirical_icd9cm %>%
left_join(icd9_codes_UK, by="code") %>%
select(code=code_decimal.x, icd_9_cm=description_icd9cm, icd_9=meaning) %>%
kable
```
All the above shown algorithmically matched codes appear to match in meaning.
Now we can construct a list of ICD-9-CM codes maped to ICD-9 codes.
```{r}
schnitzer_2011_inclusions_empirical_icd9cm_and_icd9 <-
schnitzer_2011_inclusions_empirical_icd9cm %>%
left_join(icd9_codes_UK %>% select(code, description_icd9=meaning), by=c("code")) %>%
mutate(
code_icd9 = case_when(
str_starts(code, pattern="852") ~ "852",
!is.na(description_icd9) ~ code,
TRUE ~ as.character(NA)
),
description_icd9 = ifelse(code_icd9=="852", icd9_codes_UK$meaning[icd9_codes_UK$code=="852"], description_icd9)
)
schnitzer_2011_inclusions_icd9 <-
schnitzer_2011_inclusions_empirical_icd9cm_and_icd9 %>%
filter(!is.na(code_icd9)) %>%
select(code = code_icd9, description_icd9, inclusion_index, maltreatment_type_schnitzer2011) %>%
mutate(code_decimal = sapply(code, function(x) icd::short_to_decimal(x), USE.NAMES = FALSE)) %>% # add decimal code also
left_join(schnitzer_2011_inclusions_empirical_icd9cm %>% select(inclusion_index,age_less_than) %>% distinct, by="inclusion_index") # also merge the age included variable
index_table_schnitzer_2011_inclusions_icd9 <-
schnitzer_2011_inclusions_icd9 %>%
select(inclusion_index, inclusion_code=code, inclusion_code_decimal=code_decimal)
## write interim results to csv file
# write.csv(schnitzer_2011_inclusions_empirical_icd9, file = "./processed_ICD_codes/Schnitzer_et_al_2011_empirical_inclusions_icd9.csv", row.names = FALSE)
```
## Converting to ICD-10
```{r}
# index_table_schnitzer_2011_inclusions_icd9 %>% left_join(icd9_codes_UK %>% select(code, meaning), by=c("inclusion_code"="code"))
schnitzer_2011_inclusions_icd10 <-
index_table_schnitzer_2011_inclusions_icd9cm %>%
mutate(
icd10 = case_when(
inclusion_code == "0541" ~ "A60",
inclusion_code == "098" ~ "A54",
inclusion_code == "6149" ~ "N73.9",
inclusion_code == "9224" ~ "S30.2",
inclusion_code == "V715" ~ "Z04.4",
inclusion_code == "V7181" ~ as.character(NA), # Observation and evaluation for Abuse and neglect - no equivalent in ICD-10
inclusion_code == "36281" ~ "H35.6",
inclusion_code == "8070" ~ "S22.30",
inclusion_code == "8071" ~ "S22.31",
inclusion_code == "8522" ~ "S06.5",
inclusion_code == "8530" ~ "S06.8",
inclusion_code == "8631" ~ "S36.31",
inclusion_code == "E965" ~ "X93-X96",
inclusion_code == "E966" ~ "X99",
inclusion_code == "E9682" ~ "Y00",
inclusion_code == "E9689" ~ "Y09",
inclusion_code == "E988" ~ "Y19,Y26,Y27,Y31-Y34", # several items categorised at 4th digit level in "other" in ICD-9 now have their own 3 digit codes in ICD-10
inclusion_code == "800" ~ "S02.0",
inclusion_code == "805" ~ "S12.0,S12.1,S12.2,S12.7,S12.9,S22.0,S22.1,S32.0,S32.7", # ICD-9 doesn't speficy location of vertebrae, so we need to pick out all possibilities in ICD-10; included are multiple fractures of spine as these presumably involve vertebrae
inclusion_code == "811" ~ "S42.1",
inclusion_code == "8520" ~ "S06.6",
inclusion_code == "862" ~ "S277-S279", # ICD-9 862 excludes pneumo/hemothorax, whereas ICD-10 combines p/h-t with intrathoracic injuries
inclusion_code == "8632" ~ "S36.40",
inclusion_code == "8633" ~ "S36.41",
inclusion_code == "865" ~ "S36.0",
inclusion_code == "952" ~ "S14.0,S14.1,S24.0,S24.1,S34.0,S34.1,T06.0,T06.1,T09.3", # ICD-9 speficies location of spinal cord lesion at 4th digit level, whereas ICD-10 included spinal cord injuries at different locations at separate 3-character codes
inclusion_code == "262" ~ "E43",
inclusion_code == "5210" ~ "K02",
inclusion_code == "6927" ~ "L57.8",
inclusion_code == "808" ~ "S32.1-S32.8,T02.1", # in ICD-9 pelvis fractures are a single code; in ICD-10 they are combined with lumbar spine
inclusion_code == "860" ~ "S27.0-S27.2", # see note about intrathoracic injuries above
inclusion_code == "861" ~ "S26, S27.3-S27.6", # ICD-9 specified heart & lung in single code; ICD-10 separates them; also se note about intrathoracic organs
inclusion_code == "8638" ~ "S36.2,S36.8,S36.9", # ICD-9 separately specifies stomach, small intenstine, colon or rectum
inclusion_code == "864" ~ "S36.1",
inclusion_code == "866" ~ "S37.0",
inclusion_code == "941" ~ "T20",
inclusion_code == "942" ~ "T21",
inclusion_code == "945" ~ "T24,T25", # 945 means burns of lower limbs, and doesn't specify if feet/ankle included, hence also T25
inclusion_code == "946" ~ "T29",
inclusion_code %in% as.character(960:979) ~ "T36-T50", # poisoning
inclusion_code == "9941" ~ "T751",
inclusion_code == "E8694" ~ "Z58.7", # exposure to tobacco smoke
inclusion_code == "E9102" ~ "W67-W70",
inclusion_code == "E9104" ~ "W65,W66",
inclusion_code == "E9108" ~ "W73",
inclusion_code == "E9109" ~ "W74",
inclusion_code == "E9600" ~ "Y04",
inclusion_code == "E985" ~ "Y22-Y25",
inclusion_code == "E980" ~ "Y10-Y19",
inclusion_code == "V60" ~ "Z59",
TRUE ~ as.character(NA)
)
)
index_table_schnitzer_2011_inclusions_icd10 <-
schnitzer_2011_inclusions_icd10 %>%
select(inclusion_index, inclusion_code=icd10) %>%
separate_rows(inclusion_code, sep=",") %>%
filter(!is.na(inclusion_code)) %>% # remove missing mapping
mutate(
inclusion_code = remove_dot(inclusion_code),
inclusion_code = sapply(inclusion_code, function(x) generate_codes_from_range(x) %>% paste(collapse=","), USE.NAMES = FALSE )
) %>%
separate_rows(inclusion_code, sep=",") %>%
mutate(inclusion_code_decimal = sapply(inclusion_code, icd::short_to_decimal, USE.NAMES = FALSE)) %>%
distinct
## Now we write the inclusion tables to files
write_csv(index_table_schnitzer_2011_inclusions_age, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_age_index.csv")
write_csv(index_table_schnitzer_2011_inclusions_icd9, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_index_icd9.csv")
write_csv(index_table_schnitzer_2011_inclusions_icd10, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_index_icd10.csv")
write_csv(index_table_schnitzer_2011_categories, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_categories_index.csv")
```
## Exclusion criteria
Like with inclusion criteria, we will compile a table of exclusion criteria. These are extracted from *Tables A2* & *A3*, and are likewise in a one code per row format, except there are two columns of codes - one for the inclusion criterion and another for the exclusion criterion.
*Table A2* lists the same criteria we've extracted from *Table 2* above, but with added exclusions. There is an additional note to exclude some codes where the 4th digit is .6 or.7, denoted by superscript *b*; superscript *a* refers to cases where N<5 cases were reported in the paper and can be safely removed. Additionally, there is an entry in Table A2 which lists an additional requirement for code **E869.4 Second-hand tobacco smoke** to include at least one from a list of codes. This will be included in a separate table of additional requirements.
*Table A3* lists all exclusion criteria for ICD-9-CM code **262: Other severe malnutrition**, so those will need to be added.
```{r}
schnitzer_2011_table2a_file <- "./schnitzer2011/Table A2. Included ages, co-occurring exclusion codes and calculated weights for ICD-9-CM codes suggestive of maltreatment.xlsx"
schnitzer_2011_table2a_raw <- read_excel(schnitzer_2011_table2a_file, trim_ws = TRUE)
schnitzer_2011_table3a_file <-
"./schnitzer2011/Table A3. Exclusions for use with the nutritional deficiency and failure to thrive ICD-9-CM codes.xlsx"
schnitzer_2011_table3a_raw <-
read_excel(schnitzer_2011_table3a_file, trim_ws = TRUE) %>%
rename(code=Code, condition=Condition) %>%
mutate(
malnutrition_index = 1: nrow(.)
)
```
## Table A2
There is an additional note to exclude some codes where the 4th digit is .6 or.7, denoted by superscript *b*; superscript *a* refers to cases where N<5 cases were reported in the paper and can be safely removed. Additionally, there is an entry in Table A2 which lists an additional requirement for code **E869.4 Second-hand tobacco smoke** to include at least one from a list of codes. This will be included in a separate table of additional requirements.
Not all entries in Table A2 also have exclusion codes (or extra requirements), so we remove ones that don't to begin with. We also add the inclusion index generated earlier when we processed inclusion codes - this will allow us a to produce a more compact file.
We also remove superscript ^a^ from codes as it has no bearing on our project.
```{r}
schnitzer_2011_exclusions_empirical_raw <-
schnitzer_2011_table2a_raw %>%
select(exclusions=`Co-Occurring exclusion codes`, code_decimal=Code, condition=Condition, -Weight, -`Age included`) %>% # rename exclusions and remove the irrelevant Weight amd age columns
filter(!is.na(exclusions)) %>% # remove rows where no exclusion code was provided
# mutate(exclusion_index = 1:nrow(.)) %>% # add index variable
left_join(schnitzer_2011_inclusions_empirical_raw %>% select(code_decimal, inclusion_index), by="code_decimal") %>% # add inclusion index
mutate(
condition = sub(condition, pattern="a$", replacement="") # remove final a in Conditions that include it
)
```
## Additional inclusion criteria
We'll start by compiling the table of extra inclusion requirements, which can be spotted in table A2 by the phrase *Include only* in the exclusions column:
```{r}
schnitzer_2011_inclusions_empirical_extra_requirements_icd9cm <-
schnitzer_2011_exclusions_empirical_raw %>%
filter(str_detect(exclusions, pattern="Include only")) %>% # the phrase denoting additional code requirements
mutate(exclusions = sub(exclusions, pattern="Include only with co-occurring code\\: ", replacement="")) %>% # remove the preceding sentence
separate_rows(exclusions, sep="; or |; ") %>%
select(extra_required_range_icd9cm=exclusions,inclusion_index)
schnitzer_2011_inclusions_empirical_extra_requirements_icd9_icd10_condensed <-
schnitzer_2011_inclusions_empirical_extra_requirements_icd9cm %>%
mutate(
extra_required_range_icd9 = case_when(
extra_required_range_icd9cm == "480.0–487.8" ~ "480-487",
extra_required_range_icd9cm == "490.0–491.9" ~ "490,491",
extra_required_range_icd9cm == "466.0–466.19" ~ "466",
extra_required_range_icd9cm == "493.0–493.9" ~ "493",
TRUE ~ as.character(extra_required_range_icd9cm)
),
extra_required_range_icd10 = case_when(
extra_required_range_icd9cm == "480.0–487.8" ~ "J10-J18", # pneumonia & influenza (not avian) - I excluded J09 as that's what avian influenza falls under
extra_required_range_icd9cm == "490.0–491.9" ~ "J40-J42", # bronchitis not specified acute/chronic, chronic bronchitis
extra_required_range_icd9cm == "466.0–466.19" ~ "J20-J21,J68", # Acute bronchitis and bronchiolitis; added ICD-10 respiratory conditions due to exposure to chemicals
extra_required_range_icd9cm == "493.0–493.9" ~ "J44-J46", # asthma
extra_required_range_icd9cm == "381.0–381.4" ~ "H65", # Nonsuppurative otitis media
TRUE ~ as.character(NA)
)
) %>%
select(
inclusion_index,
icd9=extra_required_range_icd9, icd10=extra_required_range_icd10
)
schnitzer_2011_inclusions_empirical_extra_requirements <-
schnitzer_2011_inclusions_empirical_extra_requirements_icd9_icd10_condensed %>%
gather(-inclusion_index, key="icd_version", value="required_any") %>%
mutate(
icd_version = parse_number(icd_version)
) %>%
separate_rows(required_any,sep=",") %>%
mutate(required_any = sapply(required_any,function(x) generate_codes_from_range(x) %>% paste(collapse = ","),USE.NAMES = FALSE)) %>%
separate_rows(required_any,sep=",")
index_table_schnitzer_2011_inclusions_extra_requirements_icd10 <-
schnitzer_2011_inclusions_empirical_extra_requirements %>%
filter(icd_version==10) %>%
select(inclusion_index, required_any_code=required_any)
## Write the file
write_csv(index_table_schnitzer_2011_inclusions_extra_requirements_icd10, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_extra_requirements_index_icd10.csv")
```
## Exclusions
Now that we've processed the extra requirements, we can remove that row from the remaining data, and the entry for malnutrition which specifies an entire separate table, which will be dealt with later.
```{r}
schnitzer_2011_exclusions_empirical_raw_remaining <-
schnitzer_2011_exclusions_empirical_raw %>%
filter(!str_detect(exclusions, pattern="Include only")) %>% # the phrase denoting additional code requirements
filter(!str_detect(exclusions, pattern="See Table")) #%>% # the phrase denoting the separate malnutrition exclusions
# mutate(entry_index = 1:n()) # add entry index for use in emrging later
```
We have `r nrow(schnitzer_2011_exclusions_empirical_raw_remaining)` rows with a mixture of specified single codes, code ranges, and semi-colon separated codes and code ranges.
Included are code ranges with superscript *b* denoting exclusion codes:
>Unless 4th digit = .6 or.7.
The *b* always follows the second code in a range, i.e.: E815–E819^b^. Ultimately we will generate the full range of specified codes, and remove cases where 4th digit is .6 or .7 where noted.
Note also that all ranges where superscript *b* applies are listed as an interrupted range, separated by a comma: *E810–E813, E815–E819b*; superscript *b* applies to both of these ranges. This specifies external cause codes for transport accidents, excluding cases where the victim was a pedestrian or cyclist (4th digit .6 or .7). E814 is entirely for pedestrian victims and so was excluded in the range.
This means there are now these cases:
* Single code entries
* Semicolon-separated entries
+ Single codes
+ Code ranges
+ Interrupted code ranges, separated by comma
The elegant solution is to split comma-separated values first, then deal with range-defined codes by transforming the range into a long list of comma-separated codes, and then split comma-seprated values a second time.
Instead of compiling these line-by-line, we can group the exclusion codes, process them individually, and then re-merge them with the target codes. We'll split the codes with semicolons into separate rows, and it should be obvious afterwards that there's only a small number of groups of exclusion codes to deal with that repeat throughout the table.
## Cross-mapping exclusions to ICD-9 and ICD-10
Next we deal with the exclusions. We'll put semicolon separated values in separate rows to begin with, add an id, and then deal with them individually, mapping them to ICD-9 and ICD-10.
```{r}
schnitzer_2011_exclusions_empirical_separated_semicolon_exclusions <-
schnitzer_2011_exclusions_empirical_raw_remaining %>%
separate_rows(exclusions, sep="; ") %>%
select(inclusion_index, exclusions)
index_table_schnitzer_2011_exclusions_icd9cm_condensed <-
schnitzer_2011_exclusions_empirical_separated_semicolon_exclusions %>%
select(exclusions) %>%
distinct %>%
arrange(nchar(exclusions)) %>%
mutate(exclusion_index = 1:n())
index_table_schnitzer_2011_map_inclusions_and_exclusions <-
schnitzer_2011_exclusions_empirical_separated_semicolon_exclusions %>%
left_join(index_table_schnitzer_2011_exclusions_icd9cm_condensed, by="exclusions") %>%
select(matches("index"))
```
As mentioned above, there are only a small(ish) number of separate exclusions listed: `r kable(index_table_schnitzer_2011_exclusions_icd9cm_condensed)`
We also now have a lookup table linking the code entries for target codes in the exclusions table, and the exclusions: `r kable(index_table_schnitzer_2011_map_inclusions_and_exclusions)`
Now that each of the unique groupings of exclusions have had an id assigned and we've also compiled a mapping of exclusion ids to inclusion ids, we can also convert the exclusion codes to ICD-9 and ICD-10.
```{r}
index_table_schnitzer_2011_exclusions_icd9_10_condensed <-
index_table_schnitzer_2011_exclusions_icd9cm_condensed %>%
mutate(
icd_9 = case_when(
exclusions == "771.2" ~ "771.22", # congenital herpes simplex - in ICD-9-CM it's grouped with "Other", and ICD-9 has a 5th digit code
exclusions == "756.51" ~ "756.50", # osteogenesis imperfecta - different 5th digits
exclusions == "733.10–733.19" ~ "733.1", # pathological fracture - covers entire 4th digit range
TRUE ~ exclusions
),
icd_10 = case_when(
exclusions == "767" ~ "P10-P15,P52.4,P52.6,P52.8,P52.9", # birth trauma, plus several non-traumatic brain haemorrhages that were included in ICD-9-CM 767 birth trauma due to hypoxia/anoxia (but excluding intraventricular & subarachnoid haemorrhages)
exclusions == "765" ~ "P05,P07", # Disorders relating to short gestation and low birthweight
exclusions == "771.2" ~ "P35.2", # congenital herpes
exclusions == "098.4" ~ "A54.3", # Gonococcal infection of eye
exclusions == "771.6" ~ "P39.1", # Neonatal conjunctivitis and dacryocystitis
exclusions == "756.51" ~ "Q78.0", # Osteogenesis imperfecta
exclusions == "E960.1" ~ "Y05,T74.2", # rape (I included attempted rape Y05 - this would flag as definite maltreatment)
exclusions == "E968.4" ~ "Y06", # ICD-9 criminal neglect, ICD-10 Neglect and abandonment
exclusions == "286–287" ~ "D65-D69", # Coagulation defects, purpura and other haemorrhagic conditions
exclusions == "E800–E819" ~ "V01-V99", # ICD-9-CM railway & motor traffic accidents: ICD-10 transport accidents - these are for use with gential contusions & spinal cord injuries, so a broader ICD-10 definition is okay
exclusions == "E890–E897" ~ "X00-X09", # ICD-9-CM ACCIDENTS CAUSED BY FIRE AND FLAMES (but without Other & NOS): ICD-10 Exposure to smoke, fire and flames
exclusions == "E870–E876" ~ "Y60-Y69", # ICD-9-CM MISADVENTURES TO PATIENTS DURING SURGICAL AND MEDICAL CARE
exclusions == "733.10–733.19" ~ "M48.5,M80,M84.4,M90.7", # ICD-9 pathological fracture: scattered across various ICD-10 codes
exclusions == "E810–E813, E815–E819b" ~ "V20-V99", # ICD-9 MOTOR VEHICLE TRAFFIC ACCIDENTS (E810-E819) excluding cases where child was pedestrian or cyclist; ICD-10 transport accidents except ones where victim was pedestrian or pedal cyclist
TRUE ~ as.character(NA)
)
) %>%
select(-exclusions) # remove the original ICD-9-CM codes, no longer needed
```
## Separating exclusion codes to individual codes
Most exclusions are listed as ranges of codes, so we'll now separate them next.
The first step is to tag the line with superscript ^b^ so we can deal with it later, then the superscript ^b^ can be removed from the exclusion codes. Note that this only applies to the ICD-9 codes - ICD-10 codes specify pedestrian/cyclist at 3-character level and only appropriate codes have been included already.
Second, we'll separate comma separated codes.
Third, we'll expand codes expressed as ranges and convert them into comma-seaparated codes,
Fourth, we'll separate the newly generated comma-separated values again
```{r}
## helper function to deal with superscript b cases
## essentially, we are given a 4 character E-code, e.g. E815
## and we need to compile a range of 5th character codes, E815.1 - E815.9, but excluding .6 and .7
generate_5th_digit_codes_superscript_b <- function(code) {
digits = c(1:9)[-c(6,7)]
return(paste(code,digits,sep="."))
}
index_table_schnitzer_2011_exclusions_icd9 <-