extract_and_cross_map_ICD_codes_MVR_and_Schnitzer_2011.Rmd

---
title: Extract & cross-map ICD codes from MVR papers and Schnitzer et al 2011
author: "Jan Savinc"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
editor_options: 
  chunk_output_type: console
bibliography: bibliography.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, warning=FALSE}
library(tidyverse)
library(readxl)
library(knitr)
library(huxtable)
library(fuzzyjoin)
```

# Intro

# Rationale for loading complete listing of ICD codes

The reason for importing the complete listing of ICD codes is to assist in extracting the ICD codes used as inclusion/exclusion criteria in previous literature, specifically where ranges of codes were defined, such as "X85 - Y03, Y08 - Y09"; not all codes in the X85 - X99 range are valid codes, for example, and by importing the ocmplete listing, we can generate a range of valid codes. These will be used as predictors in the analysis in the CHASe project.

# Loading ICD codes

The UK uses unmodified WHO-published ICD codes (unlike the US where they use the - CM modification, or Australia, where they use -AM). ICD-9 was used until 1996, and ICD-10 since then. 

## UK codes

These were previously imported from the UK Biobank data, and the processed data is loaded:

```{r}
icd9_codes_UK <- read_csv("./processed_ICD_codes/master_icd9_code_list_UK(WHO).csv")
icd10_codes_UK <- read_csv("./processed_ICD_codes/master_icd10_code_list_UK(WHO).csv")
```

## USA (CM) codes

These are included in the *icd* package for R.

```{r}
library(icd)
```

## General Equivalence Mappings for the CM versions of ICD

The latest GEM files were downloaded from the [CDC website](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2018/Dxgem_2018.zip). The 2018 release was used because it was the most recent at the time of writing.

```{r}
gem_icd9cm <- 
  read_fwf("./raw/2018_I9gem.txt", 
           fwf_cols(
             source = c(1, 5), 
             target = c(7, 13), 
             approximate=c(15,15), 
             no_map=c(16,16), 
             combination=c(17,17), 
             scenario=c(18,18), 
             choice_list=c(19,19) 
             )
           )

gem_icd10cm <- 
  read_fwf("./raw/2018_I10gem.txt", 
           fwf_cols(
             source = c(1, 7), 
             target = c(9, 13), 
             approximate=c(15,15), 
             no_map=c(16,16),  
             combination=c(17,17), 
             scenario=c(18,18), 
             choice_list=c(19,19) 
             )
           )

## for our purposes the distinction between forward & backward maps isn't relevant, so we can combine the two to simplify finding matches
gem_combined <- 
  bind_rows(
    gem_icd9cm %>% rename(icd9cm=source, icd10cm=target) %>% mutate(direction = "f"),
    gem_icd10cm %>% rename(icd9cm=target, icd10cm=source) %>% mutate(direction = "b")
  )
```


## Parsing code ranges

Sources often supply code definitions as ranges, e.g. *Y08-Y09*. Because of the complicated rules of how codes are defined, ranges need to be parsed into sequences, that is, *Y08-Y09* needs to be turned into the sequence *Y08,Y09*. This is not trivial, so below is an algorithm for producing these. Note that this is omitted in the report, but is included in the .Rmd source file!

```{r}
## helper function to generate codes from a range given in a string
## e.g. given "Y08 - Y09", generate c("Y08", "Y09") 
## this assume 3-character codes and own't work 
generate_codes_from_range <- function(codeRangeString) {
  ## if no hyphen in string, return the string
  if (!str_detect(codeRangeString, "-|–")) return(codeRangeString)
  
  hasLetters <- str_detect(codeRangeString, pattern="[a-zA-Z]")
  hasDecimals <- str_detect(codeRangeString, pattern="\\.")
  
  codes <- toupper(trimws(unlist(str_split(codeRangeString, pattern= "-|–"))))  # extract codes, and sort them
  numbers <- parse_number(codes)  # extract the numbers and sort them
  
  numCharacters <- max(nchar(codes))
  
  if (hasLetters) {
    lettrs <- gsub(codes, pattern="^([a-zA-Z]{1}).*$", replacement = "\\1")
    lettersAreSame <- lettrs[1] == lettrs[2]
  }
  
  incrementStep <- 1
  if (hasDecimals) {
    numDecimalPlaces <-  max(nchar(str_split(codes, pattern = "\\.", simplify = TRUE)[,2]))
    incrementStep <- 1 / 10^numDecimalPlaces
  }
  
  ## in the simple case where no letters are included, just generate a sequence from the numbers given and pad
  if (!hasLetters) {
      ## now we generate the pre-padded sequence of numbers - depending on the format required, they'll be modified next
    sequenceNumbersPrePadding <- seq(from=numbers[1], to=numbers[2], by=incrementStep)
    if (hasDecimals) {  # if it has decimals we pad using formatC for decimals, "f"
      finishedSequence <- formatC(sequenceNumbersPrePadding, format = "f", digits=numDecimalPlaces, width=numCharacters, flag="0")
    } else {  # if nop decimals, we pad by adding a "0" until desired width is reached
      finishedSequence <- str_pad(sequenceNumbersPrePadding, width=numCharacters, pad="0")
    }
    return(finishedSequence)
  } else { ## else if it has letters, it's a little more complicated 
    ## if the letters are the same
    if (lettersAreSame) {
      sequencePrePadding <- seq(from=numbers[1], to=numbers[2], by=incrementStep)
      ## pad the sequence to the required width - 1 to account for the letter, but differently depending on whether there are decimals
      if (hasDecimals) {
        paddedSequence <- formatC(sequencePrePadding, format = "f", digits=numDecimalPlaces, width=numCharacters-1, flag="0")
      } else {
        paddedSequence <- str_pad(sequencePrePadding, width=numCharacters-1, pad="0")
      }
      finishedSequence <- paste0(lettrs[1], paddedSequence)
      return(finishedSequence)
    } else { ## if the letters are not the same it's more complex, because we need to generate the sequences between the letters also
      lettersIndices <- seq.int(from=which(LETTERS==lettrs[1]), to=which(LETTERS==lettrs[2]))
      lettersSequence <- LETTERS[lettersIndices]
      
      ## for the middle sequence between the letters provided, figure out which numbers are needed, they'll go between 0 and a maximum number, e.g.
      ## for E75 to F12, we need E75-99 and F00-12; for E72.1 to F12.3, we needE72.1-99.9 etc.
      maxNumber <- (10^(numCharacters - 1)) - 1
      if (hasDecimals) maxNumber <- (10^(numCharacters-2) - 1) / 10^numDecimalPlaces
      if (hasDecimals) {
      numbersSequenceLeftWithpadding <- formatC(seq(from=numbers[1], to=maxNumber, by=incrementStep), width = numCharacters-1, format = "f", flag = "0", digits=numDecimalPlaces)
      numbersSequenceRightWithpadding <- formatC(seq(from=0, to=numbers[2], by=incrementStep), width = numCharacters-1, format = "f", flag = "0", digits=numDecimalPlaces)
      numbersSequenceMiddleWithpadding <- formatC(seq(from=0, to=maxNumber, by=incrementStep), width = numCharacters-1, format = "f", flag = "0", digits=numDecimalPlaces)
      } else {
      numbersSequenceLeftWithpadding <- formatC(seq(from=numbers[1], to=maxNumber, by=incrementStep), width = numCharacters-1, format = "d")
      numbersSequenceRightWithpadding <- formatC(seq(from=0, to=numbers[2], by=incrementStep), width = numCharacters-1, format = "d", flag = "0")
      numbersSequenceMiddleWithpadding <- formatC(seq(from=0, to=maxNumber, by=incrementStep), width = numCharacters-1, format = "d", flag = "0")
      }
      
      finishedSequence <-
        c(
          paste0(head(lettersSequence,n=1),numbersSequenceLeftWithpadding),
          as.vector(unlist(sapply(lettersSequence[c(-1,-length(lettersSequence))], function(x) paste0(x,numbersSequenceMiddleWithpadding)))),
          paste0(tail(lettersSequence,n=1),numbersSequenceRightWithpadding)
        )
      return(finishedSequence)
    }
  }
  ## note that we've omitted checking whether the produced codes are actually in use
}
```

## Helper functions

Below are some helper functions to use in looking at the codes

```{r}
## remove dot
remove_dot <- function(x) gsub(x, pattern="\\.", replacement="")

## helper function to expand range codes eg "960-978" into numeric range...
## and collapse to string eg "960,961,962"
expand_range_to_csv_simple_case <- function(this_range) {
  this_range %>% 
   str_split(., pattern="–|-") %>%  # the Schnitzer tables use mid-length dash
   unlist %>% 
   trimws %>% 
   as.numeric %>% 
   {seq(from=min(.), to=max(.), by=1)} %>%
   paste(., collapse=",")
}

## helper functions for generating forward & backward mapped codes from the CM GEM lists
compile_mapping_cm <- function(code) {
  lapply(code, function(x) {
      x <- remove_dot(x)
      gem_combined %>% 
      filter(str_starts(icd9cm, pattern=x)) %>%
      mutate(prefix_code=x)
  }) %>% bind_rows
}

## shortcuts for finding codes with a prefix
find_prefix_in_master_list <- function(this_code, master_list) {
  master_list %>%
    filter(str_starts(code, remove_dot(this_code)))
}
find_prefix_in_icd9 <- function(this_code) {
  find_prefix_in_master_list(this_code, master_list = icd9_codes_UK)
}
find_prefix_in_icd10 <- function(this_code) {
  find_prefix_in_master_list(this_code, master_list = icd10_codes_UK)
}
## shortcuts for finding codes by keyword
find_keyword_in_master_list <- function(key_word, master_list) {
  master_list %>%
    filter(str_detect(meaning, regex(pattern = key_word, ignore_case = TRUE)))
}
find_keyword_in_icd9 <- function(key_word) {
  find_keyword_in_master_list(key_word, master_list = icd9_codes_UK)
}
find_keyword_in_icd10 <- function(key_word) {
  find_keyword_in_master_list(key_word, master_list = icd10_codes_UK)
}
## shortcuts for looking up child codes in ICD-9-CM and ICD-10-CM
find_children_in_icd_cm <- function(parent_code) {
  tibble(
    code = icd::children(as.character(parent_code))
  ) %>%
    mutate(
      description = sapply(code, icd::explain_code, USE.NAMES = FALSE)
    )
}
```


# ICD-9-CM to ICD-10-CM cross-mapping

Cross-mapping was done manually in consultation with the master lists of ICD-9 and ICD-10 codes, and the online version of the [WHO ICD-10](https://icd.who.int/browse10/2016/en). To interpret the original intent of ICD-9-CM codes specified by Schnitzer et al. 2011 and Schnitzer 2004, we also consulted the -CM lists of codes as provided by the *icd* package, specifically using the function *icd::explain_code()*, and the online listing of codes at the [following address](http://icd9.chrisendres.com/index.php).

Finally, the ICD-9-CM to ICD-10-CM (and reverse) General Equivalence Mapping was also consulted - the helper function *compile_mapping_cm()* was used for this.

The helper functions defined earlier were used to find codes in the master lists by prefix, *find_prefix_in_icd9()*, or by keyword *find_keyword_in_icd9()*.

Some mappings were simple: e.g. gonococcal infection in ICD-9, *098*, has an equivalent ICD-10 code, *A54*.

The mappings were more complicated with edge cases, especially where codes fell into Not otherwise specified (NOS) or Other categories, because some conditions that ICD-9 categorised as Other were categorised as a separate code in ICD-10. For example: E988 in ICD-9 (Injury by other/unspecified means undeteremined whether accidentally or purposely inflicted), subdivided into: `r icd9_codes_UK %>% filter(str_starts(code,"E988")) %>% select(meaning) %>% kable`. Some of the 4th digit categories in ICD-9 exist as separate 3-character codes in ICD-10: `r icd10_codes_UK %>% filter(code %in% paste0("Y",c(19,26,27,31:34))) %>% select(meaning) %>% kable`.

## Failed attempt at fully algorithmic mapping

The plan was initially to compile a full GEM mapping from ICD-9-CM to ICD-10-CM, and then to attempt mapping from the -CM variant codes to ICD-9 and ICD-10. This turned out to be too arduous a task: the fully automated GEM mapping produces several false leads, and omits some meaningful mappings that are easily spotted by a human, but difficult to write rules for.

# Importing ICD inclusion/exclusion criteria from past literature

There are two sources of codes: the MVR (Maltreatment & Violence-related) code lists [@agi_2010; @gilbert_2012], and the codes suggestive of maltreatment [@schnitzer_2011]

# MVR - Maltreatment & Violence-related codes 

Arturo Gonzalez-Izquierdo kindly provided the list of ICD codes compiled by his team for the 2010 paper [@agi_2010], which is in the file **/raw/CODES - concern groups 2011 v6.xlsx**, specifically in the sheet *Victimization related*.

A newer listing of codes is available in the supplementary file to the Gilbert et al. 2012 paper [@gilbert_2012], which is more compact and also adds a few more codes. The supplementary file is included here: **/raw/1-s2.0-S0140673611610878-mmc1.pdf**

## Extracting MVR codes from Gilbert et al 2012 paper

```{r}
## Note: the way tabulizer extracts the table, we receive a list, the first element of which is a matrix of rows and columns;
## The codes of interest are in column 2, in rows 26 to 33

mvr_codes_raw <- tabulizer::extract_tables(file="./raw/1-s2.0-S0140673611610878-mmc1.pdf", pages=11)[[1]][26:33,2]

mvr_codes_condensed <- 
  mvr_codes_raw %>%
  tibble(raw=.) %>%
  filter(raw!="") %>%
  separate(raw, into=c("category_mvr","icd_9","icd_10"), sep="\\: |\\; and ") %>%
  mutate(
    grouping = if_else(condition = str_starts(category_mvr, "\\d"), true = "Maltreatment-related injury", false = "Marker conditions in infants (<1 y)"),
    category_mvr = gsub(category_mvr, pattern="^\\d\\.\\s+", replacement=""),
    category_mvr = gsub(category_mvr, pattern="\\-syndrome", replacement=""),
    icd_9 = gsub(icd_9, pattern="^.*codes\\s+", replacement=""),
    icd_10 = gsub(icd_10, pattern="^.*code(s)?\\s+", replacement=""),
    icd_10 = gsub(icd_10, pattern="42\\.8", replacement="S42.8")  #  fix missing S in one of the codes
  ) %>%
  mutate_all(~trimws(.)) %>%
  mutate_all(~gsub(x=., pattern="\\s\\s+", replacement=" ")) %>%
  select(grouping, category_mvr, icd_9, icd_10) %>%
  mutate(mvr_index=1:n())

mvr_codes_long <-
  mvr_codes_condensed %>%
  pivot_longer(cols=starts_with("icd"), names_to="icd_version", values_to="codes") %>%
  mutate(icd_version=parse_number(icd_version)) %>%
  separate_rows(codes, sep = "\\,\\s*|,\\s*or") %>%
  separate_rows(codes, sep = "\\s*or\\s*|\\s*and\\s*") %>%
  mutate(
    codes=trimws(codes)
    ) %>%
  filter(codes!="") %>%
  mutate(codes = sapply(codes, FUN=function(x) generate_codes_from_range(x) %>% paste(collapse = ","), USE.NAMES = FALSE)) %>%
  separate_rows(codes, sep=",") %>%
  rename(code_decimal=codes) %>%
  mutate(code=remove_dot(code_decimal))

write_csv(x = mvr_codes_long %>% select(-mvr_index), path="./processed_ICD_codes/MVR_codes_long.csv")
```

## Extracting Gonzalez-Izquierdo et al. 2010 codes

Note: see above for a newer listing of codes - this listing can be considered obsolete now

The spreadsheet is nicely formatted for reading, and we can remove several rows and columns to make importing the data easier.
Columns 2,3,4 correspond to code description, ICD-10 code, and ICD-9 code, respectively. The data is in rows 2 to 40, which in Excel corresponds to the cell range **B2:B40**.

```{r}
agi_2010_file <- "./raw/CODES - concern groups 2011 v6.xlsx"
agi_2010_raw <- read_excel(agi_2010_file, sheet = "Victimization related", trim_ws = TRUE, range = "B2:D40")

## this produces a "condensed" file, meaning that several codes are expressed as ranges rather than individual values, and some codes are listed with commas separating individual entries
agi_2010_long_condensed <-
  agi_2010_raw %>%
  rename(description_icd10=Description, icd10code =`ICD-10 Codes`, icd9code=`ICD-9 CODES`) %>%  # rename to standard names
  filter(!is.na(description_icd10)) %>%  # remove blank rows
  mutate(  # extract header to include as broad maltreatment type variable
    category_mvr = if_else(
      condition = is.na(icd10code),
      true = description_icd10,
      false = as.character(NA)
    )
  ) %>%
  fill(category_mvr) %>%  # LOCF on maltreatment type
  filter(!is.na(icd10code) & !is.na(icd9code)) %>%  # remove header row
  mutate(agi_index = 1:nrow(.)) %>%
  gather(-description_icd10,-category_mvr,-agi_index, key="icd_version", value="code_decimal") %>%
  mutate(
    icd_version = case_when(
      icd_version == "icd10code" ~ 10,
      icd_version == "icd9code" ~ 9,
      TRUE ~ as.numeric(NA)
    ),
    code_decimal = gsub(code_decimal, pattern="\\s", replacement="")  # remove spaces
  )
```

Some codes have been entered as ranges rather than a list of codes. The ranges need to be expanded.

There will be two steps to this:

1. Split up comma-separated values and put them in separate rows
2. Split up ranges by generating all values in the range

```{r}
agi_2010_long <-
  agi_2010_long_condensed %>%
  separate_rows(code_decimal, sep = ",") %>%  # separate comma-separated values
  mutate(code_decimal = sapply(code_decimal, function(x) {generate_codes_from_range(x) %>% paste(collapse=",")}, USE.NAMES = FALSE)) %>%  # convert range-defined codes to comma-separated values
  separate_rows(code_decimal, sep = ",") %>%  # split comma-seprated values again
  mutate(code = remove_dot(code_decimal)) #%>%
  # filter((icd_version==9 & code %in% icd9_codes_UK$code_decimal) | (icd_version==10 & code %in% icd10_codes_UK$code_decimal))  # remove non-UK codes

## now write the file
write_csv(agi_2010_long, path = "./processed_ICD_codes/AGI_et_al_2010_ICD_codes.csv")
```

## Comparing AGI et al 2010 to MVR codes

There are some codes that aren't in one list or the other:

Not in AGI et al 2010 [@agi_2010]:

```{r}
mvr_codes_long %>% filter(str_starts(grouping,"Maltreatment")) %>% anti_join(agi_2010_long, by=c("code_decimal","icd_version")) %>% select(code, icd_version) %>% left_join(bind_rows(icd9_codes_UK %>% mutate(icd_version=9),icd10_codes_UK %>% mutate(icd_version=10)), by = c("code","icd_version")) %>% knitr::kable()
```

Not in Gilbert et al 2012 [@gilbert_2012]:

```{r}
agi_2010_long %>% anti_join(mvr_codes_long %>% filter(str_starts(grouping,"Maltreatment")), by=c("code_decimal","icd_version")) %>% select(code_decimal, icd_version) %>% left_join(bind_rows(icd9_codes_UK %>% mutate(icd_version=9),icd10_codes_UK %>% mutate(icd_version=10)), by = c("code_decimal","icd_version")) %>% knitr::kable()
```

## Merging the MVR codes for a complete-most list

Given that there are codes in each of the lists, we can simply merge the two lists and use those codes!

```{r}
order_of_mvr_categories <- agi_2010_long$category_mvr %>% unique

mvr_codes_merged_long_raw <-
  bind_rows(
    mvr_codes_long %>% filter(str_starts(grouping,"Maltreatment")) %>% select(-grouping),
    agi_2010_long %>% select(-description_icd10, -agi_index)
  ) %>%
  distinct

codes_that_are_matched_by_shorter_codes <- map(
  mvr_codes_merged_long_raw$code,
  function(x) {
    pool_to_match <- mvr_codes_merged_long_raw$code[mvr_codes_merged_long_raw$code!=x]
    pool_to_match[str_detect(pool_to_match, pattern=paste0("^", x))]
    }
) %>% unlist

mvr_codes_merged_long <-
  mvr_codes_merged_long_raw %>% 
  filter(!code %in% codes_that_are_matched_by_shorter_codes) %>%
  arrange(factor(category_mvr, levels = order_of_mvr_categories))

write_csv(mvr_codes_merged_long, path = "./processed_ICD_codes/MVR_codes_merged_long.csv")
```

## Condensed version of merged codes

```{r}
mvr_codes_merged_condensed_long <-
  bind_rows(
    mvr_codes_condensed %>% filter(str_starts(grouping,"Maltreatment")) %>% select(-grouping) %>% pivot_longer(cols=starts_with("icd"), names_to="icd_version", values_to="code_decimal") %>%
  mutate(icd_version=parse_number(icd_version)),
   agi_2010_long_condensed %>% select(-description_icd10)
  ) %>%
  mutate(code_decimal=gsub(code_decimal,pattern="(\\,\\s)*or|(\\,\\s)*and", replacement=",")) %>%
  separate_rows(code_decimal, sep="\\,") %>%
  mutate(
    code_decimal=trimws(code_decimal),
    code_decimal = gsub(code_decimal, pattern="(\\s*\\-\\s*)", replacement=" - ")
    ) %>%
  filter(str_detect(code_decimal,"\\-")|code_decimal %in% mvr_codes_merged_long$code_decimal) %>%  # only keep codes that are expressed as range or are actually in the final merged list
  select(-mvr_index, -agi_index) %>%
  distinct %>%
  arrange(factor(category_mvr, levels = order_of_mvr_categories), icd_version, code_decimal) %>%
  filter(  # go through list by hand, pick out redundant entries...
    !(icd_version==9 & code_decimal %in% c("994.2","994.3")) &
      !(icd_version==9 & code_decimal %in% c("E960", "E960 - E966", "E968", "E969")) &
      !(icd_version==10 & code_decimal %in% c("X85 - Y09")) &
      !(icd_version==9 & code_decimal %in% c("E983 - E989")) &
      !(icd_version==10 & code_decimal %in% c("Y20 - Y34")) &
      !(icd_version==10 & code_decimal %in% c("Z72.3 - Z72.9", "Z60", "Z61", "Z62", "Z63"))
  ) %>%
  group_by(
    category_mvr, icd_version
  ) %>%
  summarise(
    codes = paste(code_decimal, collapse = ", ")
  )

mvr_codes_merged_condensed_wide <-
  mvr_codes_merged_condensed_long %>%
  pivot_wider(names_from = icd_version, values_from = codes, names_prefix = "codes_icd_") %>%
  arrange(factor(category_mvr, levels = order_of_mvr_categories))

write_csv(mvr_codes_merged_condensed_wide, path = "./processed_ICD_codes/MVR_codes_merged_condensed_for_appendix.csv")
```

## MVR codes & age appropriateness

When using MVR codes to tag mothers' medical records (past the age of 18), some codes become less appropriate to use, specifically codes for smoking, holiday relief care, etc. For this purpose we'll produce a complete list of captured ICD codes by MVR, so we can further decide on which ones would be appropriate to use with adults.

Specifically, the codes in the categories Adverse social circumstances are suspect!

```{r}
mvr_codes_expanded_for_age_check <-
  bind_rows(
    icd9_codes_UK %>%
    regex_inner_join(
      mvr_codes_long %>%
      filter(icd_version==9 & category_mvr=="Adverse social circumstances") %>% 
      mutate(prefix = paste0("^",code)) %>%  # make a prefix for fuzzy matching
      select(grouping, category_mvr,prefix)
      , 
      by=c("code"="prefix")
    ) %>% mutate(icd_version=9)
    ,
    icd10_codes_UK %>%
    regex_inner_join(
      mvr_codes_long %>%
      filter(icd_version==10 & category_mvr=="Adverse social circumstances") %>% 
      mutate(prefix = paste0("^",code)) %>%  # make a prefix for fuzzy matching
      select(grouping, category_mvr,prefix)
      , 
      by=c("code"="prefix")
    ) %>% mutate(icd_version=10)
  ) %>%
  mutate(prefix=gsub(prefix,pattern="\\^", replacement="")) %>%
  select(category_mvr, prefix, icd_version, code, meaning) %>%
  mutate(notes="")  # add an empty notes column for manual review!

write_csv(x = mvr_codes_expanded_for_age_check, path = "./processed_ICD_codes/MVR_codes_expanded_for_age_check.csv")
```


# Schnitzer et al. 2011

[Schnitzer et al 2011](https://doi.org/10.1016/j.chiabu.2010.06.008) [@schnitzer_2011] provide a list of ICD codes suggestive of child maltreatment, but don't provide the codes in an easily parsed format.

The following documents how I extracted the tables and converted them to a usable format.

I tried scraping the web article with no success, so I used a Google Chrome add-on called [*table-to-spreadsheet*](https://chrome.google.com/webstore/detail/table-to-spreadsheet/haidhlbpihfihbjcggmffnmhgiddjcoc) and saved all tables as .xlsx from the paper to folder *schnitzer2011\\raw*.

Note that MS Excel displays a warning about the data being corrupted when you try to open them, but displays the tables just fine - I assume this is the chrome plugin's fault, maybe not following the .xlsx format correctly.

The following tables were taken from the web article:

```{r}
dirTables <- "./schnitzer2011/"
dir(dirTables)
```

## Understanding the criteria

Each table is accompanied by additional notes denoted with superscript letters (^a^, ^b^, etc.) which need to be dealt with appropriately and removed from the table contents.

Depending on how we define what codes should count as suggestive of child maltreatment, we would use different tables. Schnitzer et al. 2011 compiled a larger list of suggestive codes through expert consultation and literature review which they then further refined by manually reviewing random samples of 50 cases for each ICD code deemed suggestive, and only retaining those where 66% or more of cases where thought to possibly or probably indicate child maltreatment.

Therefore there are two lists:

 * The prior list of suggestive ICD codes (table A1 in Schnitzer et al 2011) - this list is broader, i.e. a more liberal criterion
 * The empirically tested list of suggestive ICD codes (table 2 in Schnitzer et al 2011) - this list is narrower, i.e. a more conservative criterion

In addition, there is a further list of possible sexual maltreatment ICD codes that were not present in the dataset reviewed by Schnitzer et al 2011 and so could not be reviewed, yet may still be suggestive.

## Processing individual Schnitzer 2011 tables

For ease of use in later data analysis, the tables from the Schnitzer et al 2011 paper need to be converted into a format that can be used as a lookup table for use in analysing ICD 9 codes mapping them onto ICD 10 codes.

The tables will be converted to long format, with one row for each ICD code, and all other information in appropriate columns.

## Table 2 (narrower/conservative/empirical list of inclusion criteria)

```{r}
schnitzer_2011_table2_file <- "./schnitzer2011/Table 2. ICD-9 codes suggestivea of maltreatment.xlsx"
schnitzer_2011_table2_raw <- read_excel(schnitzer_2011_table2_file, trim_ws = TRUE)
```

The data is currently in a non-standard format, having been grabbed from the table in the paper.

We can remove the last 4 columns because they report the percentage of cases confirmed as suggesting maltreatment on review. Additionally, we can remove the first row which contains no relevant information. We also rename the columns to a more easily programmed-with format, and remove the leading less-than sign from the age.

The type of maltreatment is currently a heading separating rows, but it should be added as a column instead. For this we use a last observation carried forward (LOCF), from *tidyr* library (*tidyr::fill*). Finally we remove the rows that were used for headers.

Next we deal with with the table notes denoted in the paper by superscript letters (^a^, ^b^).
From the paper:

> a
> The suggestive codes are those where more than 66% of records reviewed were classified as probable or possible maltreatment.
>
> b
> Designates a code with a sample size of less than 5.

Superscript ^a^ was only used in the description of the table, and superscript ^b^ has no bearing on the criteria. This means we can safely remove them from the final table.

```{r}
schnitzer_2011_inclusions_empirical_raw <-
  schnitzer_2011_table2_raw %>%
  select(1:3) %>%  # keeping first 3 columns is same s dropping final 4 columns
  slice(-1) %>%  # remove 1st row
  rename(  # rename the columns to a standard format
    code_decimal = `ICD-9 code`, 
    description_icd9cm = `Code description`,
    age_less_than = `Age included (years)`
    ) %>%
  mutate(age_less_than = parse_number(age_less_than)) %>%
  mutate(
    maltreatment_type_schnitzer2011 =  # make a new column for type of maltreatment, using those rows that were used as headers
      ifelse(
        test = str_detect(code_decimal, "^ICD\\-9"),  # headers begin with string "ICD-9"
        yes = code_decimal,
        no = NA
      )
  ) %>%
  fill(maltreatment_type_schnitzer2011) %>%  # LOCF on maltreatment type
  filter(!str_detect(code_decimal, "^ICD\\-9")) %>%  # remove header rows
  mutate(
    description_icd9cm =  # use reg.ex. to find b in final position and remove it
      gsub(description_icd9cm, pattern = "a$|b$", replacement = "")
  ) %>%
  mutate(inclusion_index = 1:n())  # add an index for later use to each inclusion listed

index_table_schnitzer_2011_inclusions_age <- 
  schnitzer_2011_inclusions_empirical_raw %>%
  select(age_less_than, inclusion_index) %>%
  distinct
```

Next we deal with rows where multiple codes are denoted in same row, separated by commas. We'll leave the range-defined codes for poisoning intact for now.

Once we have each inclusion listed and paired with an index number, we'll put them in a separate table for later use in cross-mapping to ICD-10.

```{r}
schnitzer_2011_inclusions_empirical_icd9cm <-
  schnitzer_2011_inclusions_empirical_raw %>%
  separate_rows(code_decimal, sep=", ")
```

Now we separate the range-defined codes as well (poisoning codes *960-979*):

```{r}
schnitzer_2011_inclusions_empirical_icd9cm <-
  schnitzer_2011_inclusions_empirical_icd9cm %>%
  mutate(
    code_decimal = 
      sapply(code_decimal, USE.NAMES = FALSE, 
             FUN= function(x) if (str_detect(x, pattern="–|-")) expand_range_to_csv_simple_case(x) else x)
    ) %>%
  separate_rows(code_decimal, sep=",") %>%
  mutate(code = remove_dot(code_decimal))  # add a code with the dot removed

index_table_schnitzer_2011_inclusions_icd9cm <-
  schnitzer_2011_inclusions_empirical_icd9cm %>%
  select(inclusion_code = code, inclusion_code_decimal=code_decimal, inclusion_index)

## table of categories mapped to inclusion index
index_table_schnitzer_2011_categories <-
  schnitzer_2011_inclusions_empirical_icd9cm %>%
  select(inclusion_index, maltreatment_type_schnitzer2011) %>%
  distinct %>%
  mutate(maltreatment_type_schnitzer2011 = sub(maltreatment_type_schnitzer2011, pattern="ICD-9 .* suggestive of ", replacement=""))
```

## Converting to ICD-9

ICD-9 and ICD-9-CM are supposed to agree in codes up to 4 characters long. Some codes in one version aren't defined in the other and vice versa. We'll check how many of the inclusion criteria in Schnitzer 2011 are missing in ICD-9, and if truncating the last digit yields an appropriate code:

```{r}
schnitzer_2011_inclusions_empirical_icd9cm %>%
  filter(!code %in% icd9_codes_UK$code) %>%
  mutate(truncated_code = str_sub(code, end = -2)) %>%
  left_join(icd9_codes_UK, by=c("truncated_code"="code")) %>%
  select(code_decimal.x, description_icd9cm, truncated_code, meaning)
```

Looking at the above, some truncated codes matched to ICD-9 codes all appear too unspecific as there's no guarantee that they would have been used for the same conditions.

The ICD-9 heading *852 Subarachnoid, subdural and extradural haemorrhage, following injury* seems to be a decent fit for ICD-9-CM *852.2 Traumatic subdural hemorrhage*. Although the latter code is a sub-code of *852* in ICD-9-CM, the 4th digit in the CM system specifies the location of the haemorrhage, whereas in ICD-9 the 4th digit specifies whether an an open intercranial wound was present - this distinction is made in ICD-9-CM at 5th digit level.

Note that there is already another *852* heading entry in the Schnitzer 2011 list, *852.2 Traumatic subarachnoid hemorrhage*; this will also be included by adding the less specific *852* code to the inclusions.

For the other ICD-9-CM codes without a match in ICD-9, we can attempt to search for keywords:

**362.81 Retinal hemorrhage**

```{r}
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="retina") & str_detect(tolower(meaning), pattern="h.*morrhage|bleed"))
```

**V71.81 Observation for abuse/neglect**

```{r}
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="observation") | str_detect(tolower(meaning), pattern="abus|neglect"))
```

**E869.4 Second-hand tobacco smoke**

```{r}
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="expos") & str_detect(tolower(meaning), pattern="tobacco"))
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="second") & str_detect(tolower(meaning), pattern="tobacco"))
icd9_codes_UK %>% filter(str_detect(tolower(meaning), pattern="second") & str_detect(tolower(meaning), pattern="smoke"))
```

Second-hand smoke (ICD-9-M E869.4) was not included as a code in ICD-9, but it was in ICD-10 as code Z58.7 Exposure to tobacco smoke.

To conclude, there don't seem to be appropriate ICD-9 codes for the above ICD-9-CM codes.

The following table shows the ICD-9-CM and ICD-9 codes matched up so you can see that the mapping makes sense:

```{r}
schnitzer_2011_inclusions_empirical_icd9cm %>%
  left_join(icd9_codes_UK, by="code") %>%
  select(code=code_decimal.x, icd_9_cm=description_icd9cm, icd_9=meaning) %>%
  kable
```

All the above shown algorithmically matched codes appear to match in meaning.

Now we can construct a list of ICD-9-CM codes maped to ICD-9 codes.

```{r}
schnitzer_2011_inclusions_empirical_icd9cm_and_icd9 <-
  schnitzer_2011_inclusions_empirical_icd9cm %>%
  left_join(icd9_codes_UK %>% select(code, description_icd9=meaning), by=c("code")) %>%
  mutate(
    code_icd9 = case_when(
      str_starts(code, pattern="852") ~ "852",
      !is.na(description_icd9) ~ code,
      TRUE ~ as.character(NA)
      ),
    description_icd9 = ifelse(code_icd9=="852", icd9_codes_UK$meaning[icd9_codes_UK$code=="852"], description_icd9)
  )

schnitzer_2011_inclusions_icd9 <-
  schnitzer_2011_inclusions_empirical_icd9cm_and_icd9 %>%
  filter(!is.na(code_icd9)) %>%
  select(code = code_icd9, description_icd9, inclusion_index, maltreatment_type_schnitzer2011) %>%
  mutate(code_decimal = sapply(code, function(x) icd::short_to_decimal(x), USE.NAMES = FALSE)) %>%  # add decimal code also
  left_join(schnitzer_2011_inclusions_empirical_icd9cm %>% select(inclusion_index,age_less_than) %>% distinct, by="inclusion_index")  # also merge the age included variable

index_table_schnitzer_2011_inclusions_icd9 <-
  schnitzer_2011_inclusions_icd9 %>% 
  select(inclusion_index, inclusion_code=code, inclusion_code_decimal=code_decimal)

## write interim results to csv file
# write.csv(schnitzer_2011_inclusions_empirical_icd9, file = "./processed_ICD_codes/Schnitzer_et_al_2011_empirical_inclusions_icd9.csv", row.names = FALSE)
```

## Converting to ICD-10

```{r}
# index_table_schnitzer_2011_inclusions_icd9 %>% left_join(icd9_codes_UK %>% select(code, meaning), by=c("inclusion_code"="code"))

schnitzer_2011_inclusions_icd10 <-
  index_table_schnitzer_2011_inclusions_icd9cm %>%
  mutate(
    icd10 = case_when(
      inclusion_code == "0541" ~ "A60",
      inclusion_code == "098" ~ "A54",
      inclusion_code == "6149" ~ "N73.9",
      inclusion_code == "9224" ~ "S30.2",
      inclusion_code == "V715" ~ "Z04.4",
      inclusion_code == "V7181" ~ as.character(NA),  # Observation and evaluation for Abuse and neglect - no equivalent in ICD-10
      inclusion_code == "36281" ~ "H35.6",
      inclusion_code == "8070" ~ "S22.30",
      inclusion_code == "8071" ~ "S22.31",
      inclusion_code == "8522" ~ "S06.5",
      inclusion_code == "8530" ~ "S06.8",
      inclusion_code == "8631" ~ "S36.31",
      inclusion_code == "E965" ~ "X93-X96",
      inclusion_code == "E966" ~ "X99",
      inclusion_code == "E9682" ~ "Y00",
      inclusion_code == "E9689" ~ "Y09",
      inclusion_code == "E988" ~ "Y19,Y26,Y27,Y31-Y34",  # several items categorised at 4th digit level in "other" in ICD-9 now have their own 3 digit codes in ICD-10 
      inclusion_code == "800" ~ "S02.0",
      inclusion_code == "805" ~ "S12.0,S12.1,S12.2,S12.7,S12.9,S22.0,S22.1,S32.0,S32.7", # ICD-9 doesn't speficy location of vertebrae, so we need to pick out all possibilities in ICD-10; included are multiple fractures of spine as these presumably involve vertebrae
      inclusion_code == "811" ~ "S42.1",
      inclusion_code == "8520" ~ "S06.6",
      inclusion_code == "862" ~ "S277-S279",  # ICD-9 862 excludes pneumo/hemothorax, whereas ICD-10 combines p/h-t with intrathoracic injuries
      inclusion_code == "8632" ~ "S36.40",
      inclusion_code == "8633" ~ "S36.41",
      inclusion_code == "865" ~ "S36.0",
      inclusion_code == "952" ~ "S14.0,S14.1,S24.0,S24.1,S34.0,S34.1,T06.0,T06.1,T09.3", # ICD-9 speficies location of spinal cord lesion at 4th digit level, whereas ICD-10 included spinal cord injuries at different locations at separate 3-character codes
      inclusion_code == "262" ~ "E43",
      inclusion_code == "5210" ~ "K02",
      inclusion_code == "6927" ~ "L57.8",
      inclusion_code == "808" ~ "S32.1-S32.8,T02.1",  # in ICD-9 pelvis fractures are a single code; in ICD-10 they are combined with lumbar spine
      inclusion_code == "860" ~ "S27.0-S27.2",  # see note about intrathoracic injuries above
      inclusion_code == "861" ~ "S26, S27.3-S27.6",  # ICD-9 specified heart & lung in single code; ICD-10 separates them; also se note about intrathoracic organs
      inclusion_code == "8638" ~ "S36.2,S36.8,S36.9",  # ICD-9 separately specifies stomach, small intenstine, colon or rectum
      inclusion_code == "864" ~ "S36.1",
      inclusion_code == "866" ~ "S37.0",
      inclusion_code == "941" ~ "T20",
      inclusion_code == "942" ~ "T21",
      inclusion_code == "945" ~ "T24,T25",  # 945 means burns of lower limbs, and doesn't specify if feet/ankle included, hence also T25
      inclusion_code == "946" ~ "T29",
      inclusion_code %in% as.character(960:979) ~ "T36-T50",  # poisoning
      inclusion_code == "9941" ~ "T751",
      inclusion_code == "E8694" ~ "Z58.7",  # exposure to tobacco smoke
      inclusion_code == "E9102" ~ "W67-W70",
      inclusion_code == "E9104" ~ "W65,W66",
      inclusion_code == "E9108" ~ "W73",
      inclusion_code == "E9109" ~ "W74",
      inclusion_code == "E9600" ~ "Y04",
      inclusion_code == "E985" ~ "Y22-Y25",
      inclusion_code == "E980" ~ "Y10-Y19",
      inclusion_code == "V60" ~ "Z59",
      TRUE ~ as.character(NA)
    )
  )

index_table_schnitzer_2011_inclusions_icd10 <-
  schnitzer_2011_inclusions_icd10 %>% 
  select(inclusion_index, inclusion_code=icd10) %>%
  separate_rows(inclusion_code, sep=",") %>%
  filter(!is.na(inclusion_code)) %>%  # remove missing mapping
  mutate(
    inclusion_code = remove_dot(inclusion_code),
    inclusion_code = sapply(inclusion_code, function(x) generate_codes_from_range(x) %>% paste(collapse=","), USE.NAMES = FALSE )
    ) %>%
  separate_rows(inclusion_code, sep=",") %>%
  mutate(inclusion_code_decimal = sapply(inclusion_code, icd::short_to_decimal, USE.NAMES = FALSE)) %>%
  distinct

## Now we write the inclusion tables to files
write_csv(index_table_schnitzer_2011_inclusions_age, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_age_index.csv")
write_csv(index_table_schnitzer_2011_inclusions_icd9, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_index_icd9.csv")
write_csv(index_table_schnitzer_2011_inclusions_icd10, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_index_icd10.csv")
write_csv(index_table_schnitzer_2011_categories, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_categories_index.csv")
```

## Exclusion criteria

Like with inclusion criteria, we will compile a table of exclusion criteria. These are extracted from *Tables A2* & *A3*, and are likewise in a one code per row format, except there are two columns of codes - one for the inclusion criterion and another for the exclusion criterion.

*Table A2* lists the same criteria we've extracted from *Table 2* above, but with added exclusions. There is an additional note to exclude some codes where the 4th digit is .6 or.7, denoted by superscript *b*; superscript *a* refers to cases where N<5 cases were reported in the paper and can be safely removed. Additionally, there is an entry in Table A2 which lists an additional requirement for code **E869.4 Second-hand tobacco smoke** to include at least one from a list of codes. This will be included in a separate table of additional requirements.

*Table A3* lists all exclusion criteria for ICD-9-CM code **262: Other severe malnutrition**, so those will need to be added. 

```{r}
schnitzer_2011_table2a_file <- "./schnitzer2011/Table A2. Included ages, co-occurring exclusion codes and calculated weights for ICD-9-CM codes suggestive of maltreatment.xlsx"
schnitzer_2011_table2a_raw <- read_excel(schnitzer_2011_table2a_file, trim_ws = TRUE)

schnitzer_2011_table3a_file <- 
  "./schnitzer2011/Table A3. Exclusions for use with the nutritional deficiency and failure to thrive ICD-9-CM codes.xlsx"
schnitzer_2011_table3a_raw <- 
  read_excel(schnitzer_2011_table3a_file, trim_ws = TRUE) %>%
  rename(code=Code, condition=Condition) %>%
  mutate(
    malnutrition_index = 1: nrow(.)
  )
```

## Table A2

There is an additional note to exclude some codes where the 4th digit is .6 or.7, denoted by superscript *b*; superscript *a* refers to cases where N<5 cases were reported in the paper and can be safely removed. Additionally, there is an entry in Table A2 which lists an additional requirement for code **E869.4 Second-hand tobacco smoke** to include at least one from a list of codes. This will be included in a separate table of additional requirements.

Not all entries in Table A2 also have exclusion codes (or extra requirements), so we remove ones that don't to begin with. We also add the inclusion index generated earlier when we processed inclusion codes - this will allow us a to produce a more compact file.

We also remove superscript ^a^ from codes as it has no bearing on our project.

```{r}
schnitzer_2011_exclusions_empirical_raw <- 
  schnitzer_2011_table2a_raw %>%
  select(exclusions=`Co-Occurring exclusion codes`, code_decimal=Code, condition=Condition, -Weight, -`Age included`) %>%  # rename exclusions and  remove the irrelevant Weight amd age columns
  filter(!is.na(exclusions)) %>%  # remove rows where no exclusion code was provided
  # mutate(exclusion_index = 1:nrow(.)) %>%  # add index variable
  left_join(schnitzer_2011_inclusions_empirical_raw %>% select(code_decimal, inclusion_index), by="code_decimal") %>%  # add inclusion index
  mutate(
    condition = sub(condition, pattern="a$", replacement="")  # remove final a in Conditions that include it
  )
``` 

## Additional inclusion criteria

We'll start by compiling the table of extra inclusion requirements, which can be spotted in table A2 by the phrase *Include only* in the exclusions column:

```{r}
schnitzer_2011_inclusions_empirical_extra_requirements_icd9cm <-
  schnitzer_2011_exclusions_empirical_raw %>%
  filter(str_detect(exclusions, pattern="Include only")) %>%  # the phrase denoting additional code requirements
  mutate(exclusions = sub(exclusions, pattern="Include only with co-occurring code\\: ", replacement="")) %>%  # remove the preceding sentence
  separate_rows(exclusions, sep="; or |; ") %>%
  select(extra_required_range_icd9cm=exclusions,inclusion_index)

schnitzer_2011_inclusions_empirical_extra_requirements_icd9_icd10_condensed <-
  schnitzer_2011_inclusions_empirical_extra_requirements_icd9cm %>%
  mutate(
    extra_required_range_icd9 = case_when(
      extra_required_range_icd9cm == "480.0–487.8" ~ "480-487",
      extra_required_range_icd9cm == "490.0–491.9" ~ "490,491",
      extra_required_range_icd9cm == "466.0–466.19" ~ "466",
      extra_required_range_icd9cm == "493.0–493.9" ~ "493",
      TRUE ~ as.character(extra_required_range_icd9cm)
    ),
    extra_required_range_icd10 = case_when(
      extra_required_range_icd9cm == "480.0–487.8" ~ "J10-J18",  # pneumonia & influenza (not avian) - I excluded J09 as that's what avian influenza falls under
      extra_required_range_icd9cm == "490.0–491.9" ~ "J40-J42",  # bronchitis not specified acute/chronic, chronic bronchitis
      extra_required_range_icd9cm == "466.0–466.19" ~ "J20-J21,J68",  # Acute bronchitis and bronchiolitis; added ICD-10 respiratory conditions due to exposure to chemicals
      extra_required_range_icd9cm == "493.0–493.9" ~ "J44-J46",  # asthma
      extra_required_range_icd9cm == "381.0–381.4" ~ "H65",  # Nonsuppurative otitis media
      TRUE ~ as.character(NA)
    )
  ) %>%
  select(
    inclusion_index,
    icd9=extra_required_range_icd9, icd10=extra_required_range_icd10
  )

schnitzer_2011_inclusions_empirical_extra_requirements <-
  schnitzer_2011_inclusions_empirical_extra_requirements_icd9_icd10_condensed %>%
  gather(-inclusion_index, key="icd_version", value="required_any") %>%
  mutate(
    icd_version = parse_number(icd_version)
  ) %>%
  separate_rows(required_any,sep=",") %>%
  mutate(required_any = sapply(required_any,function(x) generate_codes_from_range(x) %>% paste(collapse = ","),USE.NAMES = FALSE)) %>%
  separate_rows(required_any,sep=",")

index_table_schnitzer_2011_inclusions_extra_requirements_icd10 <-
  schnitzer_2011_inclusions_empirical_extra_requirements %>%
  filter(icd_version==10) %>%
  select(inclusion_index, required_any_code=required_any)

## Write the file
write_csv(index_table_schnitzer_2011_inclusions_extra_requirements_icd10, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_inclusions_extra_requirements_index_icd10.csv")
```

## Exclusions

Now that we've processed the extra requirements, we can remove that row from the remaining data, and the entry for malnutrition which specifies an entire separate table, which will be dealt with later.

```{r}
schnitzer_2011_exclusions_empirical_raw_remaining <-
  schnitzer_2011_exclusions_empirical_raw %>%
  filter(!str_detect(exclusions, pattern="Include only")) %>%  # the phrase denoting additional code requirements
  filter(!str_detect(exclusions, pattern="See Table")) #%>%  # the phrase denoting the separate malnutrition exclusions
  # mutate(entry_index = 1:n())  # add entry index for use in emrging later
```

We have `r nrow(schnitzer_2011_exclusions_empirical_raw_remaining)` rows with a mixture of specified single codes, code ranges, and semi-colon separated codes and code ranges.

Included are code ranges with superscript *b* denoting exclusion codes:

>Unless 4th digit = .6 or.7.

The *b* always follows the second code in a range, i.e.: E815–E819^b^. Ultimately we will generate the full range of specified codes, and remove cases where 4th digit is .6 or .7 where noted.

Note also that all ranges where superscript *b* applies are listed as an interrupted range, separated by a comma: *E810–E813, E815–E819b*; superscript *b* applies to both of these ranges. This specifies external cause codes for transport accidents, excluding cases where the victim was a pedestrian or cyclist (4th digit .6 or .7). E814 is entirely for pedestrian victims and so was excluded in the range.

This means there are now these cases:
 * Single code entries
 * Semicolon-separated entries
    + Single codes
    + Code ranges
    + Interrupted code ranges, separated by comma

The elegant solution is to split comma-separated values first, then deal with range-defined codes by transforming the range into a long list of comma-separated codes, and then split comma-seprated values a second time.

Instead of compiling these line-by-line, we can group the exclusion codes, process them individually, and then re-merge them with the target codes. We'll split the codes with semicolons into separate rows, and it should be obvious afterwards that there's only a small number of groups of exclusion codes to deal with that repeat throughout the table.

## Cross-mapping exclusions to ICD-9 and ICD-10

Next we deal with the exclusions. We'll put semicolon separated values in separate rows to begin with, add an id, and then deal with them individually, mapping them to ICD-9 and ICD-10.

```{r}
schnitzer_2011_exclusions_empirical_separated_semicolon_exclusions <-
  schnitzer_2011_exclusions_empirical_raw_remaining %>%
  separate_rows(exclusions, sep="; ") %>%
  select(inclusion_index, exclusions)

index_table_schnitzer_2011_exclusions_icd9cm_condensed <-
  schnitzer_2011_exclusions_empirical_separated_semicolon_exclusions %>%
  select(exclusions) %>%
  distinct %>%
  arrange(nchar(exclusions)) %>%
  mutate(exclusion_index = 1:n())

index_table_schnitzer_2011_map_inclusions_and_exclusions <-
  schnitzer_2011_exclusions_empirical_separated_semicolon_exclusions %>%
  left_join(index_table_schnitzer_2011_exclusions_icd9cm_condensed, by="exclusions") %>%
  select(matches("index"))
```

As mentioned above, there are only a small(ish) number of separate exclusions listed: `r kable(index_table_schnitzer_2011_exclusions_icd9cm_condensed)`

We also now have a lookup table linking the code entries for target codes in the exclusions table, and the exclusions: `r kable(index_table_schnitzer_2011_map_inclusions_and_exclusions)`

Now that each of the unique groupings of exclusions have had an id assigned and we've also compiled a mapping of exclusion ids to inclusion ids, we can also convert the exclusion codes to ICD-9 and ICD-10.

```{r}
index_table_schnitzer_2011_exclusions_icd9_10_condensed <-
  index_table_schnitzer_2011_exclusions_icd9cm_condensed %>%
  mutate(
    icd_9 = case_when(
      exclusions == "771.2" ~ "771.22",  # congenital herpes simplex - in ICD-9-CM it's grouped with "Other", and ICD-9 has a 5th digit code
      exclusions == "756.51" ~ "756.50",  # osteogenesis imperfecta - different 5th digits
      exclusions == "733.10–733.19" ~ "733.1",  # pathological fracture - covers entire 4th digit range
      TRUE ~ exclusions
    ),
    icd_10 = case_when(
      exclusions == "767" ~ "P10-P15,P52.4,P52.6,P52.8,P52.9",  # birth trauma, plus several non-traumatic brain haemorrhages that were included in ICD-9-CM 767 birth trauma due to hypoxia/anoxia (but excluding intraventricular & subarachnoid haemorrhages)
      exclusions == "765" ~ "P05,P07",  # Disorders relating to short gestation and low birthweight
      exclusions == "771.2" ~ "P35.2",  # congenital herpes
      exclusions == "098.4" ~ "A54.3",  # Gonococcal infection of eye
      exclusions == "771.6" ~ "P39.1",  # Neonatal conjunctivitis and dacryocystitis
      exclusions == "756.51" ~ "Q78.0",  # Osteogenesis imperfecta
      exclusions == "E960.1" ~ "Y05,T74.2",  # rape (I included attempted rape Y05 - this would flag as definite maltreatment)
      exclusions == "E968.4" ~ "Y06",  # ICD-9 criminal neglect, ICD-10 Neglect and abandonment
      exclusions == "286–287" ~ "D65-D69",  # Coagulation defects, purpura and other haemorrhagic conditions
      exclusions == "E800–E819" ~ "V01-V99",  # ICD-9-CM railway & motor traffic accidents: ICD-10 transport accidents - these are for use with gential contusions & spinal cord injuries, so a broader ICD-10 definition is okay
      exclusions == "E890–E897" ~ "X00-X09",  # ICD-9-CM ACCIDENTS CAUSED BY FIRE AND FLAMES (but without Other & NOS): ICD-10 Exposure to smoke, fire and flames
      exclusions == "E870–E876" ~ "Y60-Y69",  # ICD-9-CM MISADVENTURES TO PATIENTS DURING SURGICAL AND MEDICAL CARE
      exclusions == "733.10–733.19" ~ "M48.5,M80,M84.4,M90.7",  # ICD-9 pathological fracture: scattered across various ICD-10 codes
      exclusions == "E810–E813, E815–E819b" ~ "V20-V99",  # ICD-9 MOTOR VEHICLE TRAFFIC ACCIDENTS (E810-E819) excluding cases where child was pedestrian or cyclist; ICD-10 transport accidents except ones where victim was pedestrian or pedal cyclist
      TRUE ~ as.character(NA)
    )
  ) %>%
  select(-exclusions)  # remove the original ICD-9-CM codes, no longer needed
```

## Separating exclusion codes to individual codes

Most exclusions are listed as ranges of codes, so we'll now separate them next.

The first step is to tag the line with superscript ^b^ so we can deal with it later, then the superscript ^b^ can be removed from the exclusion codes. Note that this only applies to the ICD-9 codes - ICD-10 codes specify pedestrian/cyclist at 3-character level and only appropriate codes have been included already.

Second, we'll separate comma separated codes.
Third, we'll expand codes expressed as ranges and convert them into comma-seaparated codes,
Fourth, we'll separate the newly generated comma-separated values again

```{r}
## helper function to deal with superscript b cases
## essentially, we are given a 4 character E-code, e.g. E815
## and we need to compile a range of 5th character codes, E815.1 - E815.9, but excluding .6 and .7
generate_5th_digit_codes_superscript_b <- function(code) {
  digits = c(1:9)[-c(6,7)]
  return(paste(code,digits,sep="."))
}

index_table_schnitzer_2011_exclusions_icd9 <-
  index_table_schnitzer_2011_exclusions_icd9_10_condensed %>%
  select(-icd_10) %>%  # remove icd-10 codes
  mutate(
    superscript_b = str_detect(icd_9, pattern="b$"),  # tag superscript b so we can deal with it later
    icd_9 = sub(icd_9, pattern="b$", replacement = "")  # remove final now that we have a tag
  ) %>%
  separate_rows(icd_9, sep = ", ") %>%  # separate comma-separated values
  mutate(icd_9 = sapply(icd_9, FUN=function(x) generate_codes_from_range(x) %>% paste(collapse = ","), USE.NAMES = FALSE)) %>%  # collapse range codes into comma searated values
  separate_rows(icd_9, sep = ",") %>%
  mutate(  # convert superscript b codes to comma separated 5th digit codes
    icd_9 = 
      ifelse(
        superscript_b,
          sapply(
            icd_9, 
            FUN=function(x) generate_5th_digit_codes_superscript_b(x) %>% paste(collapse = ","), 
            USE.NAMES = FALSE
          ),
        icd_9
      )
  ) %>%
  separate_rows(icd_9, sep = ",") %>%  # separate newly generated 5th digit codes
  select(-superscript_b) %>%  # the superscript b tag is no longer needed
  rename(exclusion_code_decimal = icd_9) %>%
  mutate(exclusion_code = remove_dot(exclusion_code_decimal))

index_table_schnitzer_2011_exclusions_icd10 <-
  index_table_schnitzer_2011_exclusions_icd9_10_condensed %>%
  select(-icd_9) %>%  # remove icd-9 codes
  separate_rows(icd_10, sep=",") %>%
  mutate(icd_10 = sapply(icd_10, FUN=function(x) generate_codes_from_range(x) %>% paste(collapse = ","), USE.NAMES = FALSE)) %>%  # collapse range codes into comma searated values
  separate_rows(icd_10, sep=",") %>%
  rename(exclusion_code_decimal = icd_10) %>%
  mutate(exclusion_code = remove_dot(exclusion_code_decimal))
```

## Table A3

Table A3 is is again a case of parsing out codes that have been denoted as ranges of codes; we do this by separately dealing with single codes and codes that include a hyphen. There were no comma-separated codes listed, so we can go straight into converting to ICD-9 and ICD-10.

```{r}
schnitzer_2011_malnutrition_exclusions_icd9_icd10 <-
  schnitzer_2011_table3a_raw %>%
  mutate(
    icd9_range = case_when(
      code=="009.0" ~ "009.0",  # Infectious colitis, enteritis and gastroenteritis
      code=="010.0–018.9" ~ "010-018",  # covers entire range
      code=="042" ~ "042-044",  # ICD-9 codes 042-044 are for HIV
      code=="070.00–070.9" ~ "070",  # covers entire range
      code=="140.0–208.91" ~ "140-208",  # covers entire range
      code=="243–244.9" ~ "243-244",  # covers entire range
      code=="250.00–250.93" ~ "250",  # covers entire code block
      code=="252.0–252.9" ~ "252",  # covers entire code block
      code=="253.0–253.9" ~ "253",  # covers entire code block
      code=="270.0–275.9" ~ "270-275",  # covers entire code block
      code=="277.00" ~ "277.0",  # cystic fibrosis
      code=="330.0–344.42" ~ "330-344",  # covers entire range
      code=="446.0–446.7" ~ "446",  # covers entire code block
      code=="493.00–493.92" ~ "493",  # covers entire code block
      code=="530.81" ~ "530.8",  # ICD-9 Other disorders of oesophagus = best match for ICD-9-CM Esophageal reflux 
      code=="555.0–558.9" ~ "555-558",  # covers entire code range
      code=="575.0–576.9" ~ "575-576",  # covers entire code range
      code=="571.0–571.9" ~ "571",  # covers entire code block
      code=="579.0–579.9" ~ "579",  # covers entire code block
      code=="593.9" ~ "593.9, 585.9,",  # Schnitzer says Chronic renal insufficiency, which is actually ICD-9-CM 585.9; we'll include both
      code=="710.0–710.9" ~ "710",  # covers entire code block
      code=="714.0–714.9" ~ "714",  # covers entire code block (note how ICD-9-CM is less specific than ICD-9!)
      code=="745.0–747.9" ~ "745-747",  # covers entire code block
      code=="749.00–749.25" ~ "749",  # covers entire code block
      code=="758.0–758.9" ~ "758",  # covers entire code block
      code=="760.71" ~ "760.76",  # Fetus/newborn affected by alcohol via placenta/breast milk
      code=="771.0–771.89" ~ "771",  # covers entire code block
      code=="772.10–772.14" ~ "772.1",  # Fetal and neonatal intraventricular haemorrhage
      code=="852.00–853.19" ~ "852-853",  # traumatic intracranial haemorrhages (subarach.,subdural,extradural,other)
      code=="984.0–984.9" ~ "984",  # covers entire range
      TRUE ~ code
    ),
    icd10_range = case_when(
      code=="009.0" ~ "A09",  # ICD-9 Infectious colitis, enteritis and gastroenteritis to Other gastroenteritis and colitis of infectious and unspecified origin
      code=="010.0–018.9" ~ "A15-A19",  # Tuberculosis
      code=="042" ~ "B20-B24",  # HIV
      code=="070.00–070.9" ~ "B15-B19",  # viral hepatitis
      code=="140.0–208.91" ~ "C00-C97",  # malignancy
      code=="243–244.9" ~ "E00-E03,E89.0",  # congenital & acquired hypothyroidism; E89.0 postprocedural hypothyroidism
      code=="250.00–250.93" ~ "E10-E14",  # diabetes mellitus
      code=="252.0–252.9" ~ "E20-E21, E89.2",  # Parathyroid disorders; E89.0 Postprocedural hypothyroidism
      code=="253.0–253.9" ~ "E22-E23, E89.3", # E22 Hyperfunction of pituitary gland; E23 Hypofunction and other disorders of pituitary gland;	E89.2 Postprocedural hypoparathyroidism
      code=="270.0–275.9" ~ "E70-E90, D89, M10",  # E70-E90 Metabolic disorders; However, there's matches in other categories; D89 Other disorders involving the immune mechanism, not elsewhere classified; M10 Gout;
      # compile_mapping_cm(as.character(270:275)) %>% filter(!str_detect(icd10cm,pattern="E[7-9]|D89|D47|M1[A0]"))
      code=="271.3" ~ "E73",  # E73 lactose intolerance
      code=="277.00" ~ "E84",  # E84 Cystic fibrosis
      code=="317–319" ~ "F70-F79",  # mental retardation
      code=="330.0–344.42" ~ "G00-G99, R52",  # neurologic hereditary etc - cocneptually matches G00-G99 Diseases of the nervous system; R52 Pain
      # this range is composed of:
      # HEREDITARY AND DEGENERATIVE DISEASES OF THE CENTRAL NERVOUS SYSTEM (330-337)
      # PAIN (338)
      # OTHER HEADACHE SYNDROMES (339)
      # OTHER DISORDERS OF THE CENTRAL NERVOUS SYSTEM (340-349) except 348-349, which covers Other conditions of brain and 349 Other and unspecified disorders of the nervous system; not clear why those were excluded - we'll cover the entire G00-G99 range
      code=="431" ~ "I61",  # Intracerebral hemorrhage
      code=="446.0–446.7" ~ "M30-M31",  # Polyarteritis nodosa and allied conditions to ICD-10 M30 Polyarteritis nodosa and related conditions and M31 Other necrotising vasculopathies
      code=="493.00–493.92" ~ "J45",  # J45 asthma
      code=="530.81" ~ "K21",  # K21 Gastro-oesophageal reflux disease
      code=="555.0–558.9" ~ "K50-K52",  # Inflammatory bowel disease: NONINFECTIOUS ENTERITIS AND COLITIS (555-558) - Noninfective enteritis and colitis (K50-K52)
      code=="575.0–576.9" ~ "K82-K83",  # K82 Other diseases of gallbladder; K83 Other diseases of biliary tract
      code=="571.0–571.9" ~ "K70-K76",  # K70-K76 Diseases of liver (K77 excluded - diseases of liver classified elsewhere)
      code=="577.8" ~ "K86.8",  # Pancreatic insufficiency (Other specified diseases of pancreas) - K86.8 Other specified diseases of pancreas
      code=="579.0–579.9" ~ "K90-K91",  # K90 intestinal malabsorption; K91 Postprocedural disorders of digestive system, not elsewhere classified
      code=="588.8" ~ "N25.8",  # N25.8 Other disorders resulting from impaired renal tubular function
      code=="593.9" ~ "N18-N19, N28.9",  # Schnitzer specified "Chronic renal insufficiency", but listed code 593.9 Disorders of kidney and ureter, unspecified; I'm assuming they mean the former, which maps to N18 & possibly N19, but we'll also include N28.9 Disorder of kidney and ureter, unspecified
      code=="599.0" ~ "N39.0",  # N39.0 Urinary tract infection, site not specified
      code=="710.0–710.9" ~ "M32-M36",  # M32-M36 based on matching ICD-9-CM code headings to ICD-10 & prefixes generated via GEMs
      code=="714.0–714.9" ~ "M05-M06, M08, M12.0",  # based on GEMs, we also include M12.0 Chronic postrheumatic arthropathy
      code=="745.0–747.9" ~ "Q20-Q28, P29.3",  # "Cardiac disease, congenital" - Q20-Q28 Congenital malformations of the circulatory system; also includes P29.3 Persistent foetal circulation
      code=="749.00–749.25" ~ "Q35-Q37",  # Q35-Q37 Cleft lip and cleft palate
      code=="750.5" ~ "Q40.0",  # congenital pyloric stenosis
      code=="751.3" ~ "Q43.1",  # Hirschsprung disease
      code=="758.0–758.9" ~ "Q90-Q99",  # Chromosomal abnormalities, not elsewhere classified (Q90-Q99)
      code=="760.71" ~ "Q86.0, P04.3", # Q86.0 Fetal alcohol syndrome (dysmorphic); P04.3 Fetus and newborn affected by maternal use of alcohol
      code=="767.0" ~ "P10, P11.1, P11.2",  # P10 Intracranial laceration and haemorrhage due to birth injury; P11.1 Other specified brain damage due to birth injury'; P11.2 Unspecified brain damage due to birth injury
      code=="770.7" ~ "P27.1",  # P27.1 Bronchopulmonary dysplasia originating in the perinatal period
      code=="771.0–771.89" ~ "P35-P39",  # Infections specific to the perinatal period (P35-P39)
      code=="772.10–772.14" ~ "P52",  # P10.2 Intraventricular haemorrhage due to birth injury already covered above; P52 Intracranial nontraumatic haemorrhage of fetus and newborn
      code=="852.00–853.19" ~ "S063-S068",  # traumatic intracranial haemorrhages
      code=="984.0–984.9" ~ "T56.0",  #  T56.0 toxic effect of lead and its compounds
      TRUE ~ code
    )
  )
```

## Merging malnutrition-related exclusion codes to exclusions

We now add the malnutrition-related exclusion codes to the ICD-9 and ICD-10 exclusion codes already generated. In addition, we add a new index representing the malnutrition codes to the mapping of inclusion to exclusion code indexes.

```{r}
malnutrition_code <- "262"
malnutrition_index <- index_table_schnitzer_2011_inclusions_icd9$inclusion_index[index_table_schnitzer_2011_inclusions_icd9$inclusion_code==malnutrition_code]

malnutrition_exclusion_index_new <- max(c(index_table_schnitzer_2011_exclusions_icd9$exclusion_index,index_table_schnitzer_2011_exclusions_icd10$exclusion_index))+1

index_table_schnitzer_2011_malnutrition_exclusions_condensed <-
  schnitzer_2011_malnutrition_exclusions_icd9_icd10 %>%
  select(
    exclusion_range_icd9 = icd9_range,
    exclusion_range_icd10 = icd10_range
    ) %>%
  mutate(exclusion_index = malnutrition_exclusion_index_new)

index_table_schnitzer_2011_malnutrition_exclusions_icd9 <-
  index_table_schnitzer_2011_malnutrition_exclusions_condensed %>%
  select(exclusion_index, exclusion_range_icd9) %>%
  mutate(exclusion_range_icd9 = gsub(exclusion_range_icd9,pattern=" ",replacement = "")) %>%  # remove spaces
  separate_rows(exclusion_range_icd9, sep=",") %>%
  mutate(exclusion_code = sapply(exclusion_range_icd9,function(x) generate_codes_from_range(x) %>% paste(collapse = ","),USE.NAMES = FALSE)) %>%
  separate_rows(exclusion_code, sep=",") %>%
  select(exclusion_index, exclusion_code_decimal=exclusion_code) %>%
  mutate(exclusion_code = icd::decimal_to_short(exclusion_code_decimal) %>% as.character)

index_table_schnitzer_2011_malnutrition_exclusions_icd10 <-
  index_table_schnitzer_2011_malnutrition_exclusions_condensed %>%
  select(exclusion_index, exclusion_range_icd10) %>%
  mutate(exclusion_range_icd10 = gsub(exclusion_range_icd10,pattern=" ",replacement = "")) %>%  # remove spaces
  separate_rows(exclusion_range_icd10, sep=",") %>%
  mutate(exclusion_code = sapply(exclusion_range_icd10,function(x) generate_codes_from_range(x) %>% paste(collapse = ","),USE.NAMES = FALSE)) %>%
  separate_rows(exclusion_code, sep=",") %>%
  select(exclusion_index, exclusion_code_decimal=exclusion_code) %>%
  mutate(exclusion_code = icd::decimal_to_short(exclusion_code_decimal) %>% as.character)

## add new mapping of maltreatment inclusion & exclusions
index_table_schnitzer_2011_map_inclusions_and_exclusions <-
  index_table_schnitzer_2011_map_inclusions_and_exclusions %>%
  add_case(inclusion_index=malnutrition_index, exclusion_index=malnutrition_exclusion_index_new)
  
index_table_schnitzer_2011_exclusions_icd9 <-
  bind_rows(
    index_table_schnitzer_2011_exclusions_icd9,
    index_table_schnitzer_2011_malnutrition_exclusions_icd9
  )
  
index_table_schnitzer_2011_exclusions_icd10 <-
  bind_rows(
    index_table_schnitzer_2011_exclusions_icd10,
    index_table_schnitzer_2011_malnutrition_exclusions_icd10
  )
```

## Write completed exclusion indices to files 

```{r}
## write index of inclusions mapped to exclusions
write_csv(index_table_schnitzer_2011_map_inclusions_and_exclusions, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_index_map_inclusions_exclusions.csv")

## write the icd9 and icd10 indexes for exclusions
write_csv(index_table_schnitzer_2011_exclusions_icd9, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_exclusions_index_icd9.csv")
write_csv(index_table_schnitzer_2011_exclusions_icd10, path = "./processed_ICD_codes/schnitzer_et_al_2011_empirical_exclusions_index_icd10.csv")
```

# Writing human-readable spreadsheets of Schnitzer et al 2011 cross-mapping

For the purposes of validation and dissemination of the cross-mapping, it is useful to convert the lists of cross-mapped codes in long format to a condensed spreadsheet format also.

```{r}
readable_crossmapping_spreadsheet <- list(
  inclusions =
    schnitzer_2011_inclusions_empirical_icd9cm_and_icd9 %>%
    rename(icd9cm=code_decimal) %>%
    mutate(icd9 = as.character(icd::short_to_decimal(code_icd9))) %>%
    group_by(inclusion_index) %>%
    summarise(icd9cm = paste(icd9cm,collapse=", "), icd9 = paste(icd9, collapse = ", "), icd9 = ifelse(icd9=="NA", NA_character_, icd9)) %>%
    mutate(
      icd9cm = ifelse(str_starts(icd9cm, pattern="960"),"960-979",icd9cm),
      icd9 = ifelse(str_starts(icd9, pattern="960"),"960-979",icd9),
      comment_icd9 = case_when(
         str_starts(icd9cm, pattern="V71.81") ~ "No ICD-9 code for Observation for abuse/neglect",
         str_starts(icd9cm, pattern="362.81") ~ "No ICD-9 code for Retinal haemorrhage",
         str_starts(icd9cm, pattern="E869.4") ~ "No ICD-9 code for Second-hand tobacco smoke exposure",
         TRUE ~ as.character(NA)
      ) 
    ) %>%
    left_join(
      schnitzer_2011_inclusions_icd10 %>%
      group_by(inclusion_index) %>%
      summarise(icd10 = paste(icd10, collapse=", "), icd10 = ifelse(icd10=="NA", NA_character_, icd10)) %>%
      mutate(icd10 = ifelse(str_starts(icd10,pattern="T36"),"T36-T50",icd10)),
      by = "inclusion_index"
    ) %>%
    mutate(
      comment_icd10 = case_when(
        str_detect(icd9cm, pattern="V71.81") ~ "No ICD-10 code for Observation and evaluation for abuse and neglect",
        str_detect(icd9cm, pattern="E988") ~ "Several 4th digit level diagnoses in ICD-9 are 3-character codes in ICD-10",
        str_detect(icd9cm, pattern="805") ~ "ICD-9 doesn't specify location of fractured vertebrae, so we need to pick out all possibilities in ICD-10; included are multiple fractures of spine as these will involve vertebrae fractures also",
        str_detect(icd9cm, pattern="862") ~ "ICD-9 code 862 excludes pneumo/hemothorax, whereas ICD-10 combines pneumo/hemothorax with intrathoracic injuries",
        str_detect(icd9cm, pattern="952") ~ "ICD-9 specifies location of spinal cord lesion at 4th digit level, whereas in ICD-10 spinal cord injuries at different locations have their own 3-character codes",
        str_detect(icd9cm, pattern="808") ~ "in ICD-9, pelvis fractures are grouped under a single code; in ICD-10 they are combined with lumbar spine injuries",
        str_detect(icd9cm, pattern="860") ~ "see note above on pneumo/hemothorax and intrathoracic injuries above",
        str_detect(icd9cm, pattern="861") ~ "ICD-9 specified heart & lung injuries in single code; ICD-10 has separate codes, but also see note above about intrathoracic injuries",
        str_detect(icd9cm, pattern="863.8") ~ "ICD-9 separately specifies 4th digit codes for stomach, small intenstine, colon or rectum, so to match the Other code we include ICD-10 codes for gastrointestinal injuries excluded by those",
        str_detect(icd9cm, pattern="945") ~ "ICD-9-CM 945 stands for burns of lower limbs, and doesn't specify if feet/ankle included, hence also T25",
        TRUE ~ as.character(NA)
      )
    ) %>%
    left_join(
      schnitzer_2011_inclusions_empirical_icd9cm_and_icd9 %>% select(inclusion_index, condition=description_icd9cm), by="inclusion_index"
    ),
   exclusions = 
    index_table_schnitzer_2011_exclusions_icd9cm_condensed %>%
  mutate(
    icd9 = case_when(
      exclusions == "771.2" ~ "771.22",  # congenital herpes simplex - in ICD-9-CM it's grouped with "Other", and ICD-9 has a 5th digit code
      exclusions == "756.51" ~ "756.50",  # osteogenesis imperfecta - different 5th digits
      exclusions == "733.10–733.19" ~ "733.1",  # pathological fracture - covers entire 4th digit range
      TRUE ~ exclusions
    ),
    comments_icd9 = case_when(
      exclusions == "771.2" ~ "in ICD-9-CM herpes simplex is grouped with Other, but in ICD-9 it has a separate 5th digit code",
      exclusions == "756.51" ~ "osteogenesis imperfecta has different 5th digit between ICD-9-CM and ICD-9",
      exclusions == "733.10–733.19" ~ "Entire 4th digit code range 733.1 refers to pathological fracture",
      TRUE ~ as.character(NA)
    )
  ) %>%
  left_join(index_table_schnitzer_2011_exclusions_icd9_10_condensed %>% select(exclusion_index, icd10=icd_10), by="exclusion_index") %>%
  mutate(
    comments_icd10 = case_when(
      exclusions == "767" ~ "ICD-10 codes for Birth trauma plus several codes for non-traumatic brain haemorrhages that were included in ICD-9-CM 767 birth trauma due to hypoxia/anoxia (but excluding intraventricular & subarachnoid haemorrhages which werent included in ICD-9-CM code 767)",
      exclusions == "765" ~ "ICD-10 codes for Disorders relating to: short gestation, low birthweight",
      exclusions == "E960.1" ~ "ICD-9 E960.1 Rape mapped to ICD-10 codes Y05 Attempted rape and T74.2 Sexual abuse",
      exclusions == "E968.4" ~ "ICD-9 E968.4 criminal neglect mapped to ICD-10 Y06 Neglect and abandonment",
      exclusions == "E800–E819" ~ "ICD-9-CM Railway & motor traffic accidents mapped to ICD-10 Transport accidents; ICD-10 has a broader definition which is acceptable, since these codes are used to exclude cases of genital contusions & spinal cord injuries",
      exclusions == "E890–E897" ~ "ICD-9-CM ACCIDENTS CAUSED BY FIRE AND FLAMES (but excluding Other & NOS codes) mapped to ICD-10 Exposure to smoke, fire and flames",
      exclusions == "733.10–733.19" ~ "ICD-9-CM code 733.1 for pathological fracture corresponds to multiple separate ICD-10 codes",
      exclusions == "E810–E813, E815–E819b" ~ "ICD-9 MOTOR VEHICLE TRAFFIC ACCIDENTS (E810-E819) excluding cases where child was pedestrian or cyclist; ICD-10 transport accidents except ones where victim was pedestrian or pedal cyclist (i.e. V01-V19)",
      TRUE ~ as.character(NA)
    )
  ) %>%
    rename(icd9cm = exclusions) %>%
    left_join(index_table_schnitzer_2011_map_inclusions_and_exclusions, by="exclusion_index"),
  malnutrition_exclusions = 
    schnitzer_2011_malnutrition_exclusions_icd9_icd10 %>%
    rename(condition_schnitzer=condition, icd9=icd9_range, icd10=icd10_range) %>%
    mutate(
    comments_icd9 = case_when(
      code=="277.00" ~ "cystic fibrosis is 277.0",
      code=="530.81" ~ "530.8 Other disorders of oesophagus = best match for ICD-9-CM Esophageal reflux",
      code=="593.9" ~ "Schnitzer gives code 593.9 Unspecified disorder of kidney and ureter, but gives description Chronic renal insufficiency, which is actually ICD-9-CM 585.9; we'll include both",
      code=="714.0–714.9" ~ "note how ICD-9-CM is less specific than ICD-9!",
      code=="760.71" ~ "760.76 Fetus/newborn affected by alcohol via placenta/breast milk",
      code=="772.10–772.14" ~ "772.1 Fetal and neonatal intraventricular haemorrhage",
      code=="852.00–853.19" ~ "852-853 traumatic intracranial haemorrhages (subarach.,subdural,extradural,other)",
      TRUE ~ NA_character_
      ),
    comments_icd10 = case_when(
      code=="009.0" ~ "ICD-9 Infectious colitis, enteritis and gastroenteritis, to ICD-10 A09 Other gastroenteritis and colitis of infectious and unspecified origin",
      code=="010.0–018.9" ~ "A15-A19 Tuberculosis",
      code=="042" ~ "B20-B24 HIV",
      code=="243–244.9" ~ "E00-E03 congenital & acquired hypothyroidism; E89.0 postprocedural hypothyroidism", 
      code=="252.0–252.9" ~ "E20-E21 Parathyroid disorders; E89.0 Postprocedural hypoparathyroidism",
      code=="253.0–253.9" ~ "E22 Hyperfunction of pituitary gland; E23 Hypofunction and other disorders of pituitary gland;	E89.2 Postprocedural hypoparathyroidism",
      code=="270.0–275.9" ~ "E70-E90 Metabolic disorders; However, there's matches in other categories; D89 Other disorders involving the immune mechanism, not elsewhere classified; M10 Gout;", 
      code=="330.0–344.42" ~ "this range is composed of: HEREDITARY AND DEGENERATIVE DISEASES OF THE CENTRAL NERVOUS SYSTEM (330-337);  PAIN (338); OTHER HEADACHE SYNDROMES (339); OTHER DISORDERS OF THE CENTRAL NERVOUS SYSTEM (340-349) except 348-349, which covers Other conditions of brain and 349 Other and unspecified disorders of the nervous system; not clear why those were excluded - we'll cover the entire G00-G99 range; also matches R52 Pain",
      code=="446.0–446.7" ~ "Polyarteritis nodosa and allied conditions to ICD-10 M30 Polyarteritis nodosa and related conditions and M31 Other necrotising vasculopathies",
      code=="530.81" ~ "K21 Gastro-oesophageal reflux disease",
      code=="555.0–558.9" ~ "Inflammatory bowel disease: NONINFECTIOUS ENTERITIS AND COLITIS (555-558) - Noninfective enteritis and colitis (K50-K52)",
      code=="575.0–576.9" ~ "K82 Other diseases of gallbladder; K83 Other diseases of biliary tract", 
      code=="571.0–571.9" ~ "K70-K76 Diseases of liver (K77 excluded - diseases of liver classified elsewhere)", 
      code=="577.8" ~ "Pancreatic insufficiency (Other specified diseases of pancreas) - K86.8 Other specified diseases of pancreas", 
      code=="579.0–579.9" ~ "K90 intestinal malabsorption; K91 Postprocedural disorders of digestive system, not elsewhere classified",
      code=="593.9" ~ "Schnitzer specified Chronic renal insufficiency, but listed code 593.9 Disorders of kidney and ureter, unspecified; I'm assuming they mean the former, which maps to N18 & possibly N19, but we'll also include N28.9 Disorder of kidney and ureter, unspecified", 
      code=="710.0–710.9" ~ "M32-M36 based on matching ICD-9-CM code headings to ICD-10 & prefixes generated via GEMs",
      code=="714.0–714.9" ~ "based on GEMs, we also include M12.0 Chronic postrheumatic arthropathy",
      code=="745.0–747.9" ~ "Q20-Q28 Congenital malformations of the circulatory system; also includes P29.3 Persistent foetal circulation",
      code=="760.71" ~ "Q86.0 Fetal alcohol syndrome (dysmorphic); P04.3 Fetus and newborn affected by maternal use of alcohol", 
      code=="767.0" ~ "P10 Intracranial laceration and haemorrhage due to birth injury; P11.1 Other specified brain damage due to birth injury'; P11.2 Unspecified brain damage due to birth injury",
      code=="772.10–772.14" ~ "P10.2 Intraventricular haemorrhage due to birth injury already covered above; P52 Intracranial nontraumatic haemorrhage of fetus and newborn",
      TRUE ~ NA_character_
      )
    ) %>%
    select(icd9cm = code, icd9, comments_icd9, icd10, comments_icd10, malnutrition_index) %>%
    left_join(
      schnitzer_2011_malnutrition_exclusions_icd9_icd10 %>% select(malnutrition_index, condition), by="malnutrition_index"
    ),
  additional_requirements =
    schnitzer_2011_inclusions_empirical_extra_requirements_icd9cm %>%
    rename(code = extra_required_range_icd9cm) %>%
    bind_cols(schnitzer_2011_inclusions_empirical_extra_requirements_icd9_icd10_condensed) %>%
    select(-matches("inclusion_index1")) %>%
    mutate(
      comments_icd9 = NA_character_,
      comments_icd10 = case_when(
      code == "480.0–487.8" ~ "pneumonia & influenza (not avian) - I excluded J09 as that's what avian influenza falls under",  # 
      code == "466.0–466.19" ~ "Acute bronchitis and bronchiolitis; added ICD-10 J68 respiratory conditions due to exposure to chemicals",  # 
      TRUE ~ NA_character_
      )
    ) %>%
    rename(icd9cm=code)
) %>%
  map(., ~distinct(.))

readable_crossmapping_spreadsheet_with_codes <-
  readable_crossmapping_spreadsheet %>%
  map(., .f = ~select(., -matches("_index|condition")))

readable_crossmapping_spreadsheet_with_descriptions_instead_of_icd_cm_codes <-
  readable_crossmapping_spreadsheet %>%
  map(., .f = ~select(., -matches("_index|icd9cm")) %>% select(matches("condition"), everything()))

readable_crossmapping_spreadsheet_full <-
  readable_crossmapping_spreadsheet %>%
  map(., .f = ~select(.,matches("condition"),matches("_index"),everything()))

condensed_crossmapping_spreadsheet <-
  list(
    inclusions = readable_crossmapping_spreadsheet$inclusions %>%
      select(condition, inclusion_index, icd9, icd10) %>%
      group_by(inclusion_index, condition) %>%
      summarise(
        icd9 = paste(icd9, collapse=", "), icd10 = paste(icd10, collapse=", ")
      ) %>%
      mutate_at(vars(icd9,icd10), ~ifelse(.=="NA",NA_character_,.)) %>%
      left_join(
        readable_crossmapping_spreadsheet$exclusions %>% 
          select(inclusion_index, icd9, icd10) %>%
          distinct %>%
          group_by(inclusion_index) %>%
        summarise(
          exclusions_icd9 = paste(icd9, collapse=", "), exclusions_icd10 = paste(icd10, collapse=", ")
        ),
        by="inclusion_index"
      ) %>%
      mutate(
        # flag_4th_digit = str_detect(exclusions_icd9, pattern="b$"),
        exclusions_icd9 = gsub(exclusions_icd9, pattern="b$", replacement="\u00b9"),  # add a superscript 1 sign for use in the footnotes
        icd9 = if_else(inclusion_index==23, paste0(icd9, "\u00b2"), icd9),
        icd10 = if_else(inclusion_index==23, paste0(icd10, "\u00b2"), icd10),
        icd10 = if_else(inclusion_index==38, paste0(icd10, "\u00b3"), icd10)
        ) %>%
      ungroup %>%
      select(-inclusion_index),
    malnutrition_exclusions =
      readable_crossmapping_spreadsheet$malnutrition_exclusions %>%
      select(icd9, icd10),
    additional_requirements = 
      readable_crossmapping_spreadsheet$additional_requirements %>%
      select(icd9, icd10)
  )
```

## Save table as excel

```{r}
## function that replaces comments with blanks, and codes with "N/A" where they are NA
replace_na_in_excel <- function(x) {
  replace_na(data=x,list(icd9="N/A",icd10="N/A",comments_icd9="",comments_icd10="", exclusions_icd9="", exclusions_icd10=""))
}

add_zero_width_space <- function(this_string) {
  if_else(
    condition = str_starts(this_string, pattern="\\d"),
    true = paste0("\u200b",this_string),
    false = this_string
  )
}

excel_condensed_crossmapping_spreadsheet <- openxlsx::createWorkbook()

condensed_crossmapping_spreadsheet$inclusions %>%
  replace_na(list(icd9="N/A",icd10="N/A",exclusions_icd9="", exclusions_icd10="")) %>%
  mutate_all(add_zero_width_space) %>%
  as_hux(add_colnames = TRUE) %>%
  add_footnote(text="\u00b9 Excluding codes where 4th digit is .6 or .7;\u00b2 For all exclusion codes see Other severe malnutrition exclusions table; \u00b3 Include only with co-ocurring codes in 2nd hand smoke code table", number_format = list(as.character)) %>%
  as_Workbook(Workbook=excel_condensed_crossmapping_spreadsheet, sheet="Inclusions") %>%
  (function(x) openxlsx::setColWidths(wb = x, sheet = 1, cols = 1:ncol(condensed_crossmapping_spreadsheet$inclusions), widths = 40))

condensed_crossmapping_spreadsheet$malnutrition_exclusions %>%
  mutate_all(add_zero_width_space) %>%
  as_hux(add_colnames = TRUE) %>%
  as_Workbook(Workbook=excel_condensed_crossmapping_spreadsheet, sheet="Malnutrition exclusions") %>%
  (function(x) openxlsx::setColWidths(wb = x, sheet = 2, cols = 1:ncol(condensed_crossmapping_spreadsheet$malnutrition_exclusions), widths = 40))

condensed_crossmapping_spreadsheet$additional_requirements %>%
  mutate_all(add_zero_width_space) %>%
  as_hux(add_colnames = TRUE) %>%
  as_Workbook(Workbook=excel_condensed_crossmapping_spreadsheet, sheet="Additional requirements") %>%
  (function(x) openxlsx::setColWidths(wb = x, sheet = 3, cols = 1:ncol(condensed_crossmapping_spreadsheet$additional_requirements), widths = 40))

openxlsx::saveWorkbook(wb = excel_condensed_crossmapping_spreadsheet, file = "./processed_ICD_codes/Schnitzer_et_al_2011_condensed_ICD_codes_list_for_appendix.xlsx", overwrite = TRUE)

openxlsx::write.xlsx(x=readable_crossmapping_spreadsheet_with_codes %>% map(., replace_na_in_excel), file = "./processed_ICD_codes/Schnitzer_et_al_2011_ICD_crossmapping_for_appendix_with_icd9cm_codes.xlsx")
openxlsx::write.xlsx(x=readable_crossmapping_spreadsheet_with_descriptions_instead_of_icd_cm_codes %>% map(., replace_na_in_excel), file = "./processed_ICD_codes/Schnitzer_et_al_2011_ICD_crossmapping_for_appendix_with_condition_descriptions.xlsx")
openxlsx::write.xlsx(x=readable_crossmapping_spreadsheet_full %>% map(., replace_na_in_excel), file = "./processed_ICD_codes/Schnitzer_et_al_2011_ICD_crossmapping_for_appendix_with_codes_and_descriptions.xlsx")
```

# Conclusion

We've parsed the MVR codes, as well as parsed & cross-mapped to ICD-9 and ICD-10 the list of Schnitzer 2011 inclusion codes, exclusion codes (including for maltreatment), and a separate list of additional inclusion requirements for the 2nd hand smoke exposure code in ICD-10.

# References