https://www.healthit.gov/topic/scientific-initiatives/pcor/machine-learning
Innovative artificial intelligence (AI) methods and the increase in computational power support the use of tools and advanced technologies such as machine learning, which consumes large amounts of data to generate predictions that yield actionable information.
Current AI workflows make it possible to conduct complex studies and uncover deeper insights than traditional analytical methods do. As the volume and availability of electronic health data increase, patient-centered outcomes research (PCOR) investigators need better tools to analyze data and interpret outcomes. A foundation of high-quality training data is critical to developing robust machine-learning models. Training datasets are essential for training prediction models that use machine-learning algorithms, extracting the features most relevant to specified research goals, and revealing meaningful associations.
Through this project, ONC, in partnership with the National Institutes of Health (NIH) National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), advanced the application of AI/ML in patient-centered outcomes research (PCOR) by generating high-quality training datasets for a chronic kidney disease (CKD) use case: predicting mortality within the first 90 days of dialysis.
The goal of this project was to conduct foundational work to support future applications of AI/ML to improve PCOR, health, and health care delivery. This project achieved its goal by:
- Establishing a kidney disease-related use case that benefits from ML applications using data from the United States Renal Data System (USRDS)
- Identifying and convening a multidisciplinary expert panel to provide input into the criteria for the development of high-quality training datasets and provisionally testing these datasets using ML algorithms
- Developing machine-learning models and using conventional metrics to evaluate their performance
- Capturing lessons learned from the process of developing the training datasets and ML models
- Disseminating project tools, activities, and points to consider/lessons learned to encourage future applications of these methods by PCOR researchers
This project began in 2019 and ended in September 2021.
See this blog post for more background.
Please contact onc.request@hhs.gov with questions about this project and rankin_summer@bah.com with any issues or questions about the code.
Please see the detailed implementation guidance provided below or download it as a pdf.
Summer K. Rankin. (2021). onc-healthit/2021PCOR-ML-AI: ONC-BAH-1 (v1.1). Zenodo. https://doi.org/10.5281/zenodo.5514290
@software{summer_k_rankin_2021_5514290,
author = {Summer K. Rankin},
title = {onc-healthit/2021PCOR-ML-AI: ONC-BAH-1},
month = sep,
year = 2021,
publisher = {Zenodo},
version = {v1.1},
doi = {10.5281/zenodo.5514290},
url = {https://doi.org/10.5281/zenodo.5514290}
}
- 6.1 Data Understanding
- 6.2 Overall Training Dataset
- 6.2.0 Deidentify the Data
- 6.2.1 Overview of Cohort Creation
- 6.2.2 Connect to Postgres Database
- 6.2.3 Convert Data to CSV
- 6.2.4 Create Patients Table
- 6.2.5 Create Medevid Table
- 6.2.6 Join Patients to Medevid
- 6.2.7 Create Transplant Waitlist Features
- 6.2.8 Create Partition Data
- 6.2.9 Join patients_medevid_waitlist Table to the Partition Index
- 6.2.10 Get Pre-ESRD Claims Data
- 6.2.11 Create Claims Tables
- 6.2.12 Map Diagnosis Codes (drg_cd) to Primary Diagnosis Codes (pdgns_cd)
- 6.2.13 Get pre-2011 pre-ESRD Claims Data
- 6.2.14 Diagnosis Groupings
- 6.2.15 Aggregate pre-ESRD Claims Data
- 6.2.16 Join the preesrdfeatures Tables to the Partition Index
- 6.2.17 Map ICD-9 to ICD-10
- 6.2.18 Prepare Data for Modeling
- 6.2.19 Impute Missing Values
- 6.2.20 Utility Files
- dx_mappings_ucsf.csv
- 2017_I9gem_map.txt
- icd10_ccs_codes.R
- icd10_dx_codes.txt
- icd9_ccs_codes.R
- icd9_dx_2014.txt
- imputation_rules.xlsx
- pre_esrd_ip_claim_variables.R
- pre_esrd_hh_claim_variables.R
- pre_esrd_hs_claim_variables.R
- pre_esrd_op_claim_variables.R
- pre_esrd_sn_claim_variables.R
- pre_esrd_pre2011_claim_variables.R
- setfieldtypes.R
- 6.2.21 Documentation of the Training Dataset
- 6.3 ML Modeling and Evaluation
- 6.3.1 Non-Imputed XGBoost Model
- 6.3.1.1 Pre-processing the training dataset
- 6.3.1.2 Hyperparameter tuning for the non-imputed dataset
- 6.3.1.3 Final XGBoost model for the non-imputed dataset
- 6.3.1.4 Calibration
- 6.3.1.5 Plotting calibrated results
- 6.3.1.6 Saving data for the fairness assessment
- 6.3.1.7 Fairness assessment
- 6.3.1.8 Risk assessment
- 6.3.2 Imputed XGBoost Model
- 6.3.2.1 Pre-processing the training dataset
- 6.3.2.2 Hyperparameter tuning for each imputed dataset
- 6.3.2.3 Pooled hyperparameter tuning
- 6.3.2.4 Final imputed XGBoost model
- 6.3.2.5 Calibration
- 6.3.2.6 Plotting calibrated results
- 6.3.2.7 Saving data for the fairness assessment
- 6.3.2.8 Fairness assessment
- 6.3.2.9 Risk assessment
- 6.3.3 Logistic Regression (LR) Model
- 6.3.4 Artificial Neural Network--Multilayer Perceptron (MLP) Model
- 6.3.4.1 Run docker container (optional)
- 6.3.4.2 Run on a server (i.e. AWS)
- 6.3.4.3 Pre-processing the data
- 6.3.4.4 Hyperparameter tuning
- 6.3.4.5 Building layers and compiling the model
- 6.3.4.6 Final MLP model
- 6.3.4.7 Pool results
- 6.3.4.8 Plot results
- 6.3.4.9 Fairness assessment
- 6.3.4.10 Risk assessment
The source data for building the overall training dataset was obtained from the United States Renal Data System (USRDS), the national data registry developed from resources initiated by the Centers for Medicare & Medicaid Services (CMS) and its funded end-stage kidney disease (ESKD) networks and subsequently maintained by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). USRDS stores and distributes data on the outcomes and treatments of the chronic kidney disease (CKD) and ESKD populations in the U.S. (Note: to be consistent with USRDS terminology for data tables, this document uses end stage renal disease - ESRD - instead of ESKD.) To better understand the data, data profiling was performed on the demographic variables and the outcome variable of interest (mortality in the first 90 days of dialysis). Information on constructing the outcome variable can be found in Section 6.2.4 Create Patients Table.
The distribution of patients in the cohort who survived versus died in the first 90 days after dialysis initiation:
The age distribution of patients who survived versus died in the first 90 days after dialysis initiation:
The sex distribution of patients who survived versus died in the first 90 days after dialysis initiation:
The distribution by race of patients who survived versus died in the first 90 days after dialysis initiation:
Section 6.2 details the methodology used to create the overall training dataset. A high-level overview of the tables used for the training dataset can be found in Section 6.2.1 Overview of Cohort Creation; the process results in a final dataset with 1,150,195 observations and 188 variables. The final dataset used for modeling is stored in PostgreSQL (Postgres) tables called medxpreesrd for the non-imputed variables and micecomplete_pmm for the imputed variables (5 sets of imputations were generated; more information on imputations can be found in Section 6.2.19 Impute Missing Values).
The construction of medxpreesrd involves using more than 20 USRDS data tables, as well as publicly available data for mapping diagnosis codes to groupings.
All scripts are located in the DataSet/ directory on GitHub.
Two types of files are involved in constructing medxpreesrd:
- Sequential scripts - these have the prefix "S0-", "S1-", etc. to indicate the sequence in which they are run
- Utility scripts - these create the data used by the sequential scripts
Other resources that could be helpful to users include:
- USRDS Researcher's Guide
- USRDS Researcher's Guide Appendix A
- USRDS Researcher's Guide Appendix B
- USRDS Researcher's Guide Appendix C
- USRDS Researcher's Guide Appendix D
The data received from USRDS was de-identified before use to comply with the approved University of California, San Francisco (UCSF) institutional review board (IRB) study plan. Per the Health Insurance Portability and Accountability Act (HIPAA) guidance, the following are identifiers:
- All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death
- All ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
- All geographic subdivisions smaller than a state, including street address, city, county, precinct, zip code, and their equivalent geocodes
The date variables in USRDS were de-identified by offsetting each date by a randomly chosen number specific to each patient. For example, if the first ESRD service date is April 5, 2016 and the random offset is 60 days, then the first ESRD service date is transformed to April 5, 2016 plus 60 days (June 4, 2016); for the same patient, if the date of birth is September 1, 1950, then the date of birth is transformed to September 1, 1950 plus 60 days (October 31, 1950). The ages of patients were de-identified by setting the age of all patients over the age of 90 to 90. The location variables were de-identified by removing all location variables (zip code, etc.) from the dataset.
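To make the de-identification procedure concrete, the sketch below applies a patient-specific date offset, caps ages at 90, and drops geographic variables. This is an illustrative example only, not the actual USRDS processing code, and the column names (first_se_date, dob, age, offset_days, zipcode, county, city) are hypothetical.
# Minimal sketch of the de-identification approach described above (hypothetical column names)
library(dplyr)
library(lubridate)
deidentify = function(df) {
  df %>%
    mutate(
      first_se_date = first_se_date + days(offset_days),  # shift each date by the patient-specific offset
      dob = dob + days(offset_days),
      age = ifelse(age > 90, 90, age)                      # cap ages over 90 at 90
    ) %>%
    select(-any_of(c("zipcode", "county", "city")))        # drop geographic identifiers
}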
Points to consider
-
Other methods can be used to de-identify locations without completely deleting the variables, such as combining all zip codes with the same three initial digits to form geographic units containing more than 20,000 people according to the current publicly available data from the Bureau of the Census. For all such geographic units containing 20,000 or fewer people, the initial three digits should be changed to 000 (see the sketch after this list).
-
Complete de-identification of the datasets obtained from USRDS was performed to comply with UCSF IRB requirements. Not all IRBs may require that PII/PHI be de-identified prior to use in a project. Future researchers may consider working with their IRB to ensure that relevant identifier variables for a specific use case are retained in the source dataset used for building the training datasets and ML models.
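Following up on the zip-code generalization point above, here is a minimal sketch of that approach. It is illustrative only; zip3_pop is a hypothetical lookup table of population by 3-digit zip prefix that a researcher would build from Census data, and the zipcode column name is assumed.
# Minimal sketch of zip-code generalization (hypothetical table and column names)
library(dplyr)
library(stringr)
generalize_zip = function(df, zip3_pop) {
  df %>%
    mutate(zip3 = str_sub(zipcode, 1, 3)) %>%             # keep only the first three digits
    left_join(zip3_pop, by = "zip3") %>%                  # adds a population column for each prefix
    mutate(zip3 = ifelse(population <= 20000, "000", zip3)) %>%  # mask units with 20,000 or fewer people
    select(-zipcode, -population)
}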
This diagram is a high level view of the tables used to create the cohort for the dataset. The number of rows and/or patients is listed at each stage of the cohort selection. Each of these R scripts is detailed below in the guide. The colors represent the following items:
- Yellow = R scripts
- Pink = Tables in the PostgreSQL database
- Green = Inclusion criteria
Steps for running the S0-connectToPostgres.R script
This script creates and calls functions to 1) create a connection to a Postgres database and 2) drop a Postgres table if it exists. Both of these functions are used heavily throughout the dataset creation and are typically loaded at the top of each script. Each user will have their own details for connecting to Postgres.
Environment:
The Postgres Database for this project was hosted in an Amazon Web Services (AWS) environment with the following specifications:
Name: t2.large
vCPU: 2
GPU: 0
Architecture: x86_64
Memory: 8 GB
Storage: 500 GB
Operating System: Windows
Network Performance: low to moderate
Zone: US govcloud west
The overall training dataset was created using R (version 3.6.3 (2020-02-29) running on x86_64 Linux Ubuntu 20.04.1 LTS) and a PostgreSQL database (PostgreSQL 12.3, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit). The specific R libraries and versions are shown in the table below:
R library | Version |
---|---|
RPostgres | 1.3.1 |
DBI | 1.1.1 |
stringr | 1.4.0 |
haven | 2.4.0 |
readr | 1.4.0 |
lubridate | 1.7.9.2 |
dplyr | 1.0.4 |
magrittr | 1.5 |
tidyr | 1.1.2 |
sqldf | 0.4-11 |
RSQLite | 2.2.3 |
gsubfn | 0.7 |
proto | 1.0.0 |
readxl | 1.3.1 |
plyr | 1.8.6 |
mice | 3.13.0 |
Input:
Name of the database
database port
user name
user password
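The cred object used in Step 1 below is not defined in this guide; each user supplies their own connection details. One hypothetical way to define it is as a named list (real credentials should be kept out of version control):
# Hypothetical example of defining the connection credentials used below
cred = list(
  dbname = "usrds",                    # name of the database
  host = "localhost",                  # database host
  port = 5432,                         # database port
  user = "usrds_user",                 # user name
  password = Sys.getenv("PGPASSWORD")  # user password, e.g., read from an environment variable
)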
Step 1. Create function for connecting to Postgres.
getConnection <- function() {
  # connect to the Postgres database using the user-supplied credentials in cred
  dbConnect(
    RPostgres::Postgres(),
    dbname = cred$dbname,
    host = cred$host,
    port = cred$port,
    user = cred$user,
    password = cred$password
  )
}
- Output:
- An object called "con" that can be used in database queries.
- Example:
con = getConnection()
Step 2. Create function for dropping a Postgres table if it exists.
drop_table_function <- function(con, tablename) {
if (isTRUE(dbExistsTable(con, tablename))) {
print(str_glue("existing {tablename} table dropped"))
dbRemoveTable(con, tablename)
}
else {
print(str_glue("{tablename} table does not exist"))
}
}
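- Example (this call also appears later in the dataset scripts; it drops the patients table if it already exists):
drop_table_function(con, "patients")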
Steps for running the S1a_convertSAStoCSV.R script
This script reads in a list of files from the raw source data (.sas7bdat files in the USRDS data) and saves each as a .csv file. The reason for converting to CSV before loading into R is that the SAS files contain more information than is needed or can be stored in Postgres or an R table, such as the categorical variable encodings that are also documented in the USRDS Researcher's Guide Appendices B and C.
Input:
- source_data_dir string path to the raw data file
- file_name string name of the file without the extension
- output_data_dir string path to the output directory for csv files
Output:
- A .csv version of the file with the same file_name
Step 1. Convert and combine the raw source data from .sas7bdat files to .csv files
convert_to_csv = function(source_data_dir, file_name, output_data_dir) {
raw_file_path = haven::read_sas(str_glue("{source_data_dir}{file_name}.sas7bdat"))
csv_path = str_glue("{output_data_dir}{file_name}.csv")
write_csv(raw_file_path, csv_path)
}
- Example:
convert_to_csv("/home/sas_data_usrds/", "patients", "/home/csv_data_usrds/")
Steps for running the S1b_patients.R script
This script creates the table patients
in the Postgres database, after filtering on the criteria to create the study cohort and the dependent variable (died in the first 90 days of dialysis).
Input: csv files are produced in script S1a-convertSAStoCSV.R
patients.csv
Output: Postgres table
patients
Step 1. Import patients and apply exclusion criteria
cohort_patients = read_csv(file.path(data_dir, "patients.csv"), col_types = cols(
CDTYPE = "c"))
names(cohort_patients) = tolower(names(cohort_patients))
Connect to the Postgres database using the S0-connectToPostgres.R script, which results in the variable con that is used in the queries to Postgres. This also imports the drop_table_function used when creating a Postgres table. These functions are used in almost every script and will be imported at the top in the code.
source(file.path(source_dir, "S0-connectToPostgres.R"))
Store the raw cohort_patients data as "patients" table in Postgres (after dropping the table if it exists).
fields = names(cohort_patients)
drop_table_function(con, "patients")
dbCreateTable(
con,
name = "patients",
fields = cohort_patients,
row.names = NULL
)
dbWriteTable(
con,
name = "patients",
value = cohort_patients,
row.names = FALSE,
append = TRUE
)
The patients table holds the data for the 1,150,195 patients (rows) in the cohort, which is created by excluding the following rows:
- age less than 18
- incident year before 2008
- incident year after 2017
- first dialysis date is missing
- death date is before first dialysis date (one patient)
The script applies each exclusion criterion separately to calculate the number of rows/patients at each stage, but the exclusion can be done more simply with the single SQL query shown below (commented out in the script).
exclude_patients = str_glue(
"DELETE FROM patients
WHERE inc_age<18
OR incyear>2017
OR incyear<2008
OR masked_firstdial IS NULL
OR masked_died<masked_firstdial"
)
dbSendStatement(con, exclude_patients)
Step 2. Using the create_dependent_var function, set all inc_age values greater than 90 to 90.
This step double-checks that all patients with age > 90 are set to 90.
patients_dependent_var = cohort_patients %>%
mutate(inc_age=ifelse(inc_age>90, 90, inc_age),
Step 3. Create the days_on_dial variable by converting the masked_died and masked_firstdial to a date and by calculating the number of days between them.
masked_firstdial = as_date(masked_firstdial, origin = "1960-01-01"),
masked_died = as_date(masked_died, origin = "1960-01-01"),
days_on_dial = as.double(difftime(masked_died,
masked_firstdial,
units = "days")),
Step 4. Create the dependent variable (outcome variable) died_in_90 by setting it to 1 (died) for patients whose days_on_dial value is 90 or less, and to 0 (survived) for patients with no value for days_on_dial or a value greater than 90.
died_in_90 = ifelse(is.na(days_on_dial), 0, ifelse(days_on_dial <= 90, 1, 0)),
Step 5. Convert the date variables that are used to calculate waitlist and transplant status.
masked_first_se = as_date(masked_first_se, origin = "1960-01-01"),
can_first_listing_dt is the first date patient is ever waitlisted.
masked_can_first_listing_dt = as_date(masked_can_first_listing_dt, origin = "1960-01-01"),
can_rem_dt is the date patient was removed from the waitlist the first time.
masked_can_rem_dt = as_date(masked_can_rem_dt, origin = "1960-01-01"),
masked_tx1date = as_date(masked_tx1date, origin = "1960-01-01"),
masked_tx1fail = as_date(masked_tx1fail, origin = "1960-01-01")
Step 6. Save this table to Postgres as patients, after dropping the initial patients table.
drop_table_function(con, "patients")
dbCreateTable(
con,
name = "patients",
fields = patients_dependent_var,
row.names = NULL
)
dbWriteTable(
con,
name = "patients",
value = patients_dependent_var,
row.names = FALSE,
append = TRUE
)
Points to consider
-
The study cohort as well as the outcome variable for the use case should be driven by a strong clinical understanding of the data and defined with clinician input. For example, kidney transplant events occurring within the first 90 days of initiating dialysis were evaluated as a potential competing outcome for death. However, since less than 1% of the patient cohort received kidney transplants, retaining these patients in the study cohort would likely have only a small effect on modeling. After consultation with clinicians and the Technical Expert Panel, it was determined that patients with kidney transplants should be retained in the patient cohort for the purposes of this analysis.
-
The origin date used when converting the masked date variables to readable dates is 1960-01-01.
The Medevid table contains the data from CMS Form 2728 -- Medical Evidence Form, a form required to be completed and submitted when a patient is diagnosed with ESRD and receives their first chronic dialysis treatment(s) or receives a transplant. Medevid contains data on patient demographics, insurance status, comorbid conditions, primary cause of kidney failure, and laboratory values at the time of ESRD diagnosis as well as prior nephrology care, dietician care, and patient education.
Steps to running the S1c_medevid.R script
This script creates the medevid table in the Postgres database.
The medevid table is filtered based on the following:
- Keep only rows with a matching usrds_id from our patients table
- For usrds_ids with multiple entries, select the first entry (in the table order) for each usrds_id
- The final table results in 1,150,195 patients/rows
Input: csv files are produced in script S1a-convertSAStoCSV.R
patients
medevid.csv
Output: Postgres table
medevid
Step 1. Import the medevid data
raw_medevid = read_csv(file.path(data_dir, filename), col_types = cols(
CDTYPE = "c",
masked_UREADT = "c",
ALGCON = "c",
PATNOTINFORMEDREASON = "c",
RACEC = "c",
RACE_SUB_CODE = "c"))
names(raw_medevid) = tolower(names(raw_medevid))
Step 2. Import the usrds_ids from the patients table
patients_filtered = dbGetQuery(
con,
str_glue(
"
SELECT *
FROM {table_name_pt}
"))
Step 3. Remove unused columns
Columns with comorbidities that were only collected in the 1995 version of the Medical Evidence Form were removed from the training dataset, because these data are not collected for patients with ESRD incident years between 2008 and 2017.
medevid_ids_filtered = raw_medevid %>%
select(-c(
"como_ihd",
"como_mi",
"como_cararr",
"como_dysrhyt",
"como_pericar",
"como_diabprim",
"como_hiv",
"como_aids")) %>%
Step 4. Filter on usrds_ids from the patients table
filter(usrds_id %in% patients_filtered$usrds_id)
Step 5. Keep the first row of medevid data if a usrds_id has more than one, per the USRDS Researcher's Guide for de-duplicating the medevid table
medevid_filtered = medevid_ids_filtered %>%
distinct(usrds_id, .keep_all = TRUE) %>%
Step 6. Calculate the dialysis training time in days
mutate(
masked_trnend = as_date(masked_trnend, origin = "1960-01-01"),
masked_trstdat = as_date(masked_trstdat, origin = "1960-01-01"),
dial_train_time = as.double(difftime(masked_trnend,
masked_trstdat,
units = "days"))
)
Points to consider
-
It is important not to sort or alter the order of (or import into a SQL database) the medevid table before selecting the first entry per usrds_id. The order of the entries in the medevid table is curated as per the USRDS Researcher's Guide, which advises users to select the first medevid entry for analysis. For other use cases, especially those requiring a longitudinal dataset, the multiple MEDEVID records per patient may need to be retained. Decisions on how to handle the duplicated data should be made with the proposed use case in mind.
-
Constructed features should be validated with clinicians to ensure that they are meaningful. For example, several preliminary features were initially constructed from the variables in the medevid table to serve as a proxy for prior nephrology care, such as whether a physician signature was present on the Medical Evidence Form, the time between a patient signature and physician signature on the Medical Evidence Form, etc. After discussing these features with the clinicians on the team, it was determined that since a nephrologist's signature is required when a patient is referred for dialysis treatment, the signature date field is not a good proxy measure.
Steps to running the S1d_patients_medevid_join.R script
This script creates the patients_medevid table in the Postgres database. The patients_medevid table consists of the patients table with the data from the medevid table added via a left join on usrds_id.
Input: Postgres tables
patients
medevid
Output: Postgres table
patients_medevid
The result of this script produces the same 1,150,195 rows as in the patients table.
Step 1. Left join the patients table and medevid table on usrds_id
patients_medevid = left_join(
patients_filtered %>% select(-c( "cdtype")),
medevid_filtered %>% select(-c("randomoffsetindays", "disgrpc", "network", "inc_age",
"pdis", "sex", "race", "masked_died")),
by = "usrds_id"
)
The cdtype column is kept from the medevid table; the other duplicate columns are kept from the patients table.
Step 2. Populate any missing values for the sex and pdis variables in patients with values from medevid (otherwise keep the duplicate columns from patients)
patients_medevid = patients_medevid %>%
mutate(
sex = ifelse(is.na(sex), sex_med, sex),
pdis = ifelse(is.na(pdis), pdis_med, pdis)
) %>%
select(-c(sex_med, pdis_med))
The transplant data tables contain kidney transplant information from the United Network for Organ Sharing (UNOS), such as information on transplant eligibility and transplant status. These features are created because transplant waitlist status and the amount of time on the transplant waitlist are indicative of a patient's overall health status. Patients who are on the transplant waitlist are generally much healthier than those who are not listed.
Steps to running the S1e_patients_medevid_waitlist.R script
This script creates the waitseq_ki, waitseq_kp, tx, tx_waitlist_vars, and patients_medevid_waitlist tables in the Postgres database from the .csv files and the patients_medevid table.
Input: csv files are produced in script S1a-convertSAStoCSV.R
patients_medevid
tx.csv
waitseq_kp.csv
waitseq_ki.csv
Output: Postgres tables
waitseq_ki
waitseq_kp
tx
tx_waitlist_vars
patients_medevid_waitlist
patients_medevid_waitlist is the full cohort that should be used from this point forward.
The result of this script is the calculation of the following variables, which are added to the patients_medevid table and saved as the patients_medevid_waitlist table:
- days_on_waitlist (number of days on the transplant waitlist)
- waitlist_status (active, transplanted, removed, never)
Step 1. Import the patients_medevid table
pat = dbGetQuery(con,
"SELECT *
FROM patients_medevid")
Step 2. Import the waitseq_ki.csv file.
waitseq_ki = read_csv(file.path(data_dir,"waitseq_ki.csv"), col_types = cols(
USRDS_ID = col_double(),
randomOffsetInDays = col_double(),
PROVUSRD = col_double(),
PID = col_double(),
masked_BEGIN = col_double(),
masked_ENDING = col_double()
))
Step 3. Transform column names to lowercase
names(waitseq_ki) = tolower(names(waitseq_ki))
Step 4. Filter on rows with usrds_id in cohort.
waitseq_ki = waitseq_ki %>%
filter(usrds_id %in% pat$usrds_id) %>%
Step 5. Set masked_begin and masked_ending as dates.
mutate(ws_list_dt = as_date(masked_begin, origin = "1960-01-01"),
ws_end_dt = as_date(masked_ending, origin = "1960-01-01"),
source = "ki") %>%
Step 6. Keep only the following columns:
- ws_list_dt = New Waiting Period Starting Date
- ws_end_dt = New Waiting Period Ending Date
- provusrd = USRDS Assigned Facility ID
- source = 'ki'
- usrds_id
- pid
select(usrds_id, pid, provusrd, ws_list_dt, ws_end_dt, source)
Step 7. Save as the waitseq_ki table in Postgres
fields = names(waitseq_ki)
drop_table_function(con, "waitseq_ki")
print(str_glue("create waitseq_ki in postgres"))
dbCreateTable(
con,
name = "waitseq_ki",
fields = waitseq_ki,
row.names = NULL
)
dbWriteTable(
con,
name = "waitseq_ki",
value = waitseq_ki,
row.names = FALSE,
append = TRUE
)
Step 8. Import waitseq_kp.csv
waitseq_kp = read_csv(file.path(data_dir,"waitseq_kp.csv"), col_types = cols(
USRDS_ID = col_double(),
randomOffsetInDays = col_double(),
PROVUSRD = col_double(),
PID = col_double(),
masked_BEGIN = col_double(),
masked_ENDING = col_double()
))
names(waitseq_kp) = tolower(names(waitseq_kp))
Step 9. Filter on rows with usrds_id in cohort.
waitseq_kp = waitseq_kp %>%
filter(usrds_id %in% pat$usrds_id) %>%
Step 10. Set masked_begin and masked_ending as dates and save with new names ws_list_dt and ws_end_dt.
mutate(ws_list_dt = as_date(masked_begin, origin = "1960-01-01"),
ws_end_dt = as_date(masked_ending, origin = "1960-01-01"),
source = "kp") %>%
Step 11. Keep only the following columns:
- ws_list_dt = New Waiting Period Starting Date
- ws_end_dt = New Waiting Period Ending Date
- provusrd = USRDS Assigned Facility ID
- source = 'kp'
- usrds_id
- pid
select(usrds_id, pid, provusrd, ws_list_dt, ws_end_dt, source)
Step 12. Save as the waitseq_kp table in Postgres.
fields = names(waitseq_kp)
drop_table_function(con, "waitseq_kp")
print(str_glue("create waitseq_kp in postgres"))
dbCreateTable(
con,
name = "waitseq_kp",
fields = waitseq_kp,
row.names = NULL
)
dbWriteTable(
con,
name = "waitseq_kp",
value = waitseq_kp,
row.names = FALSE,
append = TRUE
)
Step 13. Concatenate waitseq_ki and waitseq_kp.
waitseq = bind_rows(waitseq_ki, waitseq_kp) %>%
arrange(usrds_id, ws_list_dt)
Step 14. Join the new waitseq table to patients.
pat_waitseq = left_join(
pat %>% select(usrds_id, masked_first_se, masked_firstdial,
masked_can_first_listing_dt, masked_can_rem_dt,
masked_tx1date, masked_died, can_rem_cd, masked_tx1fail),
waitseq,
by = "usrds_id") %>%
arrange(usrds_id, ws_list_dt)
Step 15. Label patients as ACTIVE on the waitlist and calculate the days on the transplant waitlist for ACTIVE patients
Patients are labeled as "active" (those who are considered active on the transplant waitlist) if they meet one of the following criteria:
- If list_date is before dial_date and end_date is on or after dial_date
- Status is ACTIVE on the first day of dialysis
First, check if the earliest listing date from waitseq matches the first listing date from patients.
first_list = pat_waitseq %>% group_by(usrds_id) %>%
arrange(usrds_id, ws_list_dt) %>%
distinct(usrds_id, .keep_all = TRUE) %>%
ungroup(usrds_id)
If list_date is before dial_date and end_date is on or after dial_date, OR if list_dt < dial_dt and end_dt == NA: status is ACTIVE.
pat_waitseq = pat_waitseq %>%
mutate(active = ifelse(
(ws_list_dt < masked_firstdial & ws_end_dt >= masked_firstdial) | (ws_list_dt < masked_firstdial & is.na(ws_end_dt)), 1, 0))
active = pat_waitseq %>%
filter((ws_list_dt < masked_firstdial & ws_end_dt >= masked_firstdial) | (ws_list_dt < masked_firstdial & is.na(ws_end_dt)))
Calculate the days on the transplant waitlist for ACTIVE patients using dial_dt - ws_list_dt.
active = active %>%
mutate(
days_on_waitlist = as.double(difftime(masked_firstdial,
ws_list_dt,
units = "days"))
)
Step 16. Remove ACTIVE patients from pat_waitseq
Sort by usrds_id and ws_list_dt and keep the row with the earliest ws_list_dt.
active = active %>% group_by(usrds_id) %>%
arrange(usrds_id, ws_list_dt) %>%
distinct(usrds_id, .keep_all = TRUE) %>%
ungroup(usrds_id)
Remove active patients from pat_waitseq. Get the unique usrds_ids in the active dataframe.
active_id = unique(active$usrds_id)
Filter out rows from pat_waitseq where usrds_id is in the list of active USRDS IDs (active_id).
pat_waitseq_not_act = pat_waitseq %>%
filter(!usrds_id %in% active_id)
Step 17. Import the transplant dataset and process the data
Import tx.csv.
tx = read_csv(file.path(data_dir,"tx.csv"), col_types = cols(
DHISP = "c",
DSEX = "c",
RHISP = "c",
RSEX = "c"
))
names(tx) = tolower(names(tx))
Filter on rows with usrds_id in cohort.
tx = tx %>%
filter(usrds_id %in% pat$usrds_id) %>%
Transform masked_tdate to a date and save as t_tx_dt.
mutate(t_tx_dt = as_date(masked_tdate, origin = "1960-01-01"),
Transform masked_faildate to a date and save as t_fail_dt.
t_fail_dt = as_date(masked_faildate, origin = "1960-01-01")) %>%
Keep only the following columns:
- t_tx_dt = transplant date
- t_fail_dt = Transplant Failure Date
- provusrd = USRDS Assigned Facility ID
- tottx = Patient Total Number of Transplants
- tx_srce = Source of Transplant Record
- usrds_id
select(usrds_id, provusrd, t_tx_dt, t_fail_dt, tottx, tx_srce) %>%
arrange(usrds_id, t_tx_dt)
Step 18. Save as tx in the Postgres database
fields = names(tx)
drop_table_function(con, "tx")
print(str_glue("create tx in postgres"))
dbCreateTable(
con,
name = "tx",
fields = tx,
row.names = NULL
)
dbWriteTable(
con,
name = "tx",
value = tx,
row.names = FALSE,
append = TRUE
)
Step 19. Construct TRANSPLANTED status
Subset rows where the listing date (ws_list_dt) and the list end date (ws_end_dt) are both before the patient's dialysis start date (masked_firstdial).
list_before_dial = pat_waitseq_not_act %>%
filter(ws_list_dt < masked_firstdial & ws_end_dt < masked_firstdial)
Group by usrds_id, sort by largest to smallest end_date, and keep the maximum end_date for each usrds_id.
closest_end_dt_to_dial = list_before_dial %>% group_by(usrds_id) %>%
arrange(usrds_id, desc(ws_end_dt)) %>%
distinct(usrds_id, .keep_all = TRUE) %>%
ungroup(usrds_id)
Left join closest_end_dt_to_dial and tx on usrds_id. This has the effect of filtering the tx dataset and keeping rows where usrds_id is in closest_end_dt_to_dial.
If the maximum end date (max_end_dt) is equal to the transplant date (t_tx_dt), then the status is TRANSPLANTED.
max_end_dt = left_join(
closest_end_dt_to_dial %>% select(-pid, -provusrd),
tx %>% select(usrds_id, t_tx_dt, t_fail_dt),
by = "usrds_id"
)
max_end_dt = max_end_dt %>%
mutate(transplanted = if_else(is.na(t_tx_dt), 0,
if_else(ws_end_dt == t_tx_dt, 1, 0)))
transplanted = max_end_dt %>%
filter(ws_end_dt == t_tx_dt)
Days on waitlist for TRANSPLANTED patients is t_tx_dt - ws_list_dt.
transplanted = transplanted %>%
mutate(
days_on_waitlist = as.double(difftime(t_tx_dt,
ws_list_dt,
units = "days"))
)
Step 20. Construct REMOVED status
Remove rows from max_end_dt where usrds_id is in transplanted.
Get the unique usrds_ids in transplanted dataframe.
transplanted_id = unique(transplanted$usrds_id)
Filter out rows from max_end_dt where usrds_id is in transplanted_id.
no_act_or_trans = max_end_dt %>%
filter(!usrds_id %in% transplanted_id)
The remaining IDs should have REMOVED status. Check that all rows meet the removed criteria.
num_no_act_tx = nrow(no_act_or_trans %>%
filter(ws_end_dt != t_tx_dt | is.na(t_tx_dt)))
Create a REMOVED column and set removed = 1 if ws_end_dt != t_tx_dt or t_tx_dt = NA.
no_act_or_trans = no_act_or_trans %>%
mutate(removed = if_else(ws_end_dt != t_tx_dt | is.na(t_tx_dt), 1, 0))
removed = no_act_or_trans %>%
filter(ws_end_dt != t_tx_dt | is.na(t_tx_dt))
Days on waitlist for REMOVED patients is ws_end_dt - ws_list_dt.
removed = removed %>%
mutate(
days_on_waitlist = as.double(difftime(ws_end_dt,
ws_list_dt,
units = "days"))
)
Note: REMOVED only has duplicates because the tx table has duplicate rows for some patients, but the waitseq start and end dates are the same for both rows of each usrds_id, so only the first record is kept.
removed = removed %>% group_by(usrds_id) %>%
arrange(usrds_id, ws_list_dt) %>%
distinct(usrds_id, .keep_all = TRUE) %>%
ungroup(usrds_id)
Get the unique usrds_ids in the removed dataframe.
removed_id = unique(removed$usrds_id)
Step 21. Merge days_on_waitlist with usrds_id from active, transplanted, and removed.
days = bind_rows(active %>% select(usrds_id, days_on_waitlist),
transplanted %>% select(usrds_id, days_on_waitlist),
removed %>% select(usrds_id, days_on_waitlist))
days = days %>% arrange(usrds_id)
Add ACTIVE patients to the patients table by setting all rows in pat where usrds_id is in active_id to ACTIVE = 1.
pat = pat %>%
mutate(active = if_else(usrds_id %in% active_id, 1, 0)) %>%
select(usrds_id, active, masked_first_se, masked_firstdial, masked_can_first_listing_dt,
masked_can_rem_dt, masked_tx1date, masked_died, can_rem_cd, masked_tx1fail)
Add TRANSPLANTED patients to the patients table by setting all rows in pat where usrds_id is in transplanted_id to TRANSPLANTED = 1.
pat = pat %>%
mutate(transplanted = if_else(usrds_id %in% transplanted_id, 1, 0))
n_both = nrow(pat %>% filter(active == 1 & transplanted == 1))
if (n_both!=0){
print("WARNING! rows exist where active and transplanted are both == 1")
}
Add REMOVED patients to the patients table by setting all rows in pat where usrds_id is in removed_id to REMOVED = 1.
pat = pat %>%
mutate(removed = if_else(usrds_id %in% removed_id, 1, 0))
Step 22. Construct the NEVER status
Set all rows where active, transplanted, and removed are all 0 to NEVER = 1.
pat = pat %>%
mutate(never = if_else(active == 0 & transplanted == 0 & removed == 0, 1, 0))
Calculate the time on the waitlist. Join days_on_waitlist onto the patients table.
pat = left_join(
pat,
days,
by = "usrds_id"
)
For patients who were never on the waitlist (and therefore have a missing days_on_waitlist), set days_on_waitlist to 0.
pat = pat %>%
mutate(days_on_waitlist = replace_na(days_on_waitlist, 0))
Step 23. Reshape into long form with one waitlist_status variable.
pat2 = pat %>%
mutate(waitlist_status = names(
pat %>% select(
active, transplanted, removed,never))[
max.col(pat %>% select(active, transplanted, removed, never))])
Step 24. Save waitlist variables to Postgres.
tx_waitlist_vars = pat2 %>%
select(usrds_id, waitlist_status, days_on_waitlist) %>%
arrange(usrds_id)
csv_path = str_glue("{data_dir}tx_waitlist_vars.csv")
write_csv(tx_waitlist_vars,csv_path)
drop_table_function(con, "tx_waitlist_vars")
dbWriteTable(
con,
name = "tx_waitlist_vars",
value = tx_waitlist_vars,
row.names = FALSE,
append = TRUE)
Step 25. Merge with patients_medevid and save to Postgres by adding the waitlist and transplant features to the patients_medevid table.
patients_med = dbGetQuery(con,
"SELECT *
FROM patients_medevid")
patients_med_waitlist = inner_join(
patients_med,
tx_waitlist_vars,
by="usrds_id"
)
fields = names(patients_med_waitlist)
print(str_glue("create patients_medevid_waitlist in postgres"))
drop_table_function(con, "patients_medevid_waitlist")
dbCreateTable(
con,
name = "patients_medevid_waitlist",
fields = patients_med_waitlist,
row.names = NULL
)
dbWriteTable(
con,
name = "patients_medevid_waitlist",
value = patients_med_waitlist,
row.names = FALSE,
append = TRUE
)
Steps for Running S2a_partitionData.R
This script creates the partition_10 table in Postgres, which consists of usrds_id and subset, and adds the subset column to the patients_medevid_waitlist table.
This subset column is the result of partitioning the number of rows (1,150,195) into 10 random subsets (numbered 0, 1, ..., 9) and assigning a patient identifier (usrds_id) to each subset. The purpose of partitioning the data is three-fold:
- to ensure that there is no leakage between the training and test datasets (correctly stratify the classes)
- to manage performance of imputation code (larger datasets require longer run times)
- to ensure that the machine learning models are reproducible for any users (as opposed to setting the seed and using a library like caret to partition)
Note: Each subset is approximately 10% because it is constructed completely at random.
Input: patients_medevid_waitlist table from Postgres
Output: partition_10 table in Postgres
Step 1. Define a function that assigns each usrds_id to one of num_partitions (10) subsets, indexed in a column named subset, and saves the result to Postgres as partition_10
partition_data <- function(con,
usrds_id,
num_partitions,
data_tablename,
seed_value) {
set.seed(2539)
randvalue = runif(
length(usrds_id),
min = 0,
max = num_partitions
)
universe = cbind(
usrds_id,
floor(randvalue)) %>%
as.data.frame()
names(universe) = c("usrds_id", "subset")
tblname = str_glue("partition_{num_partitions}")
drop_table_function(con, tblname)
dbWriteTable(
con,
tblname,
universe,
append = FALSE,
row.names = FALSE
)
}
Step 2. Import the usrds_ids from Postgres.
data_tbl = "patients_medevid_waitlist"
usrds_id = dbGetQuery(
con,
str_glue(
"
SELECT usrds_id
FROM {data_tbl}
ORDER BY usrds_id
"))
usrds_id = usrds_id$usrds_id
Call the function defined above to create the 10 partitions.
partition_data(
con,
usrds_id,
num_partitions = 10,
data_tablename = data_tbl
)
Steps to running the S2b_join_partition_data.R script
This script joins the patients_medevid_waitlist table to the partition index.
Input: patients_medevid_waitlist
Output: patients_medevid_waitlist
Step 1. Define a function to import and alter the patients_medevid_waitlist table by adding the subset column, and save to Postgres.
join_data_partitions <- function(con,
data_tablename="patients_medevid_waitlist",
num_partitions=10){
dbSendStatement(con, str_glue(
"
ALTER TABLE {data_tablename}
ADD subset integer
"), n = -1)
dbSendStatement(
con,
str_glue(
"
UPDATE {data_tablename} d
SET subset = p.subset
FROM partition_{num_partitions} p
WHERE d.usrds_id = p.usrds_id
"), n = -1)
}
Step 2. Execute the function
data_tbl = "patients_medevid_waitlist"
join_data_partitions(
con,
data_tablename = data_tbl,
num_partitions = 10
)
Steps to running the S2c_calculate_partition_totals.R script
This script creates a table with counts of select categories for each data partition to ensure that the partitions are representative of the entire dataset.
Input: patients_medevid_waitlist
Output: ./partition_totals_rev_method.csv
Step 1. Pull the data from the database and count the number of patients per partition for the following variables:
- sex = male
- race = white
- number of missing hemoglobin values
- number of missing serum creatinine values
- number of missing albumin values
- number of patients who died in the first 90 days (outcome variable)
df = dbGetQuery(
con,
"
SELECT *
FROM patients_medevid_waitlist
"
)
subsets_totals = df %>%
select(subset) %>%
group_by(subset) %>%
count()
subsets_totals = rename(subsets_totals, c("total_pts"=n))
subsets_male = df %>%
filter(sex==1) %>%
select(sex, subset) %>%
group_by(sex, subset) %>%
count()
subsets_male <- rename(subsets_male, c("total_males"=n))
subsets_white = df %>%
filter(race==1) %>%
select(subset, race) %>%
group_by(subset,race) %>%
count()
subsets_white <- rename(subsets_white, c("total_white"=n))
subsets_heme = df %>%
filter(is.na(heglb)==TRUE) %>%
select(subset,heglb) %>%
group_by(heglb, subset) %>%
count()
subsets_heme <- rename(subsets_heme, c("total_heme_na"=n))
subsets_sercr = df %>%
filter(is.na(sercr)==TRUE) %>%
select(subset,sercr) %>%
group_by(sercr, subset) %>%
count()
subsets_sercr <- rename(subsets_sercr, c("total_sercr_na"=n))
subsets_album = df %>%
filter(is.na(album)==TRUE) %>%
select(subset,album) %>%
group_by(album, subset) %>%
count()
subsets_album <- rename(subsets_album, c("total_album_na"=n))
subsets_outcome = df %>%
filter(died_in_90==1) %>%
select(subset,died_in_90) %>%
group_by(died_in_90,subset) %>%
count()
subsets_outcome <- rename(subsets_outcome, c("total_died"=n))
dd =
left_join(
subsets_totals,
subsets_outcome,
by='subset'
)
dd = left_join(
dd,
subsets_male,
by='subset'
)
dd = left_join(
dd,
subsets_white,
by='subset'
)
dd = left_join( dd,
subsets_heme,
by='subset'
)
dd = left_join( dd,
subsets_album,
by='subset'
)
dd = left_join( dd,
subsets_sercr,
by='subset'
)
write_csv(dd, "partition_totals_rev_method.csv")
Table: Counts of select categories for each data partition
Sub-set | Number of Males | Number of Race Group (White) | Number of Missing Hemoglobin Values | Number of Missing Serum Creatinine Values | Number of Missing Albumin Values | Total Number of Patients | Number of Patients who Died |
---|---|---|---|---|---|---|---|
0 | 65,981 | 76,535 | 17,248 | 2,055 | 35,925 | 114,824 | 8,529 |
1 | 66,131 | 76,864 | 17,108 | 2,051 | 35,129 | 115,050 | 8,773 |
2 | 66,137 | 76,773 | 17,240 | 2,043 | 35,428 | 115,044 | 8,669 |
3 | 66,031 | 76,846 | 17,406 | 1,937 | 35,100 | 115,027 | 8,426 |
4 | 66,282 | 76,788 | 16,971 | 1,917 | 34,933 | 114,802 | 8,549 |
5 | 66,042 | 76,652 | 17,285 | 2,008 | 35,138 | 114,936 | 8,671 |
6 | 66,579 | 77,002 | 17,266 | 1,976 | 35,219 | 115,207 | 8,728 |
7 | 66,332 | 77,221 | 17,266 | 2,035 | 35,019 | 115,557 | 8,695 |
8 | 66,982 | 76,605 | 17,027 | 2,014 | 34,797 | 114,925 | 8,478 |
9 | 66,033 | 76,751 | 16,847 | 1,936 | 34,973 | 114,823 | 8,565 |
The pre-ESRD claims tables in USRDS contain Medicare pre-ESRD Part A and Part B claims, which are used to construct features for health care received prior to ESRD diagnosis.
Steps for running S3a_esrd_claims.R
This script extracts, filters, and stores the pre-ESRD claims tables from 2011-2017 for the cohort. This script uses the create_claim_table.R functions detailed in the next section.
Input: csv files are produced in script S1a-convertSAStoCSV.R
- create_claim_table.R
- pre_esrd_ip_claim_variables.R
- pre_esrd_hs_claim_variables.R
- pre_esrd_hh_claim_variables.R
- pre_esrd_op_claim_variables.R
- pre_esrd_sn_claim_variables.R
Output: The Postgres tables
preesrd5y_ip_clm_inc
preesrd5y_hs_clm_inc
preesrd5y_hh_clm_inc
preesrd5y_op_clm_inc
preesrd5y_sn_clm_inc
Step 1. Import the input file names and column types from the pre_esrd_{xx}_claim_variables.R scripts
The types of claims include:
- Inpatient (IP)
- Outpatient (OP)
- Home health (HH)
- Hospice (HS)
- Skilled Nursing Unit (SN)
source('CreateDataSet/create_claim_table.R')
claim_types = c(
"ip",
"hs",
"hh",
"op",
"sn"
)
Step 2. Import and run the create_claim_table function for each claim type for years 2011-2017.
for (typ in claim_types) {
source(str_glue("CreateDataSet/pre_esrd_{typ}_claim_variables.R"))
create_claim_table(
data_dir,
con,
filenames_esrd,
fieldnames_esrd,
columns_esrd,
columns_esrd_2015,
table_name_pt='patients_medevid_waitlist'
)
}
Points to consider
Pre-ESRD claims data includes clinical as well as administrative information. Clinicians should be engaged to identify the variables in the claims data with predictive value.
Steps to running the create_claim_table.R script
This script contains the functions used in S3a_esrd_claims.R to create the pre-ESRD claims tables. The schema for the tables changes from year to year. For example, there is no cdtype field prior to 2014, since all diagnosis codes were ICD9 prior to 2014. The script handles these year-to-year changes in schema.
Step 1. Define create_claim_table function
create_claim_table <- function(
data_dir,
con,
filenames,
fieldnames,
column_type,
column_type_2015,
table_name_pt) {
# send information to insert each year of claims data into the same Postgres table
fieldnames = tolower(fieldnames)
for (filename in filenames) {
incident_year =
substr(filename, str_length(filename) - 3, str_length(filename))
if (incident_year < 2015) {
# claims prior to 2015 are all icd9, so we set cdtype to I for those years
csvfile = read_csv(file.path(data_dir, str_glue("{filename}.csv")), col_types = column_type_2015)
csvfile = csvfile %>%
mutate(cdtype = "I")
}
else {
csvfile = read_csv(file.path(data_dir, str_glue("{filename}.csv")), col_types = column_type)
}
tblname = str_remove(filename, incident_year)
names(csvfile) = tolower(names(csvfile))
fields = names(csvfile)
patients = dbGetQuery(
con,
str_glue(
"SELECT usrds_id
FROM {table_name_pt}")
)
df = patients %>%
inner_join(
csvfile,
by = "usrds_id") %>%
mutate(
incident_year = incident_year)
df$pdgns_cd = df$pdgns_cd %>%
trimws() %>%
str_pad(.,
width = 7,
side = "right",
pad = "0")
if (grepl('_ip_', tblname)){
df = createIP_CLM(df, incident_year)
}
else {
df <- df %>%
filter(!is.na(masked_clm_from) & (masked_clm_from != ""))
}
# Append every set, except '2012' which will be the first table to import.
# this is b/c 2012 has the format that we want to use to create the table
# and append the other years since the format changes between 2011 and 2012-2017
if (incident_year==2012){
drop_table_function(con, tblname)
print(str_glue("creating {tblname} claims using {incident_year}={nrow(df)}
patients={nrow(df %>% distinct(usrds_id, keep_all=FALSE))}"))
dbWriteTable(
con,
tblname,
df[, fieldnames],
append = FALSE,
row.names = FALSE)
}
else {
print(str_glue("adding {incident_year} to {tblname}={nrow(df)}
patients={nrow(df %>% distinct(usrds_id, keep_all=FALSE))}"))
dbWriteTable(
con,
tblname,
df[, fieldnames],
append = TRUE,
row.names = FALSE)
}
}
}
Step 2. Create a separate function createIP_CLM to handle the inpatient (IP) claims differently. This filters out rows with missing data.
createIP_CLM = function(df, incident_year) {
# filtering for table named "preesrd5y_ip_clm"
print(str_glue("filtering IP claims {incident_year}"))
df = df %>%
filter(
!is.na(masked_clm_from) &
(masked_clm_from != "") &
!is.na(drg_cd) &
(drg_cd != "")
)
return(df)
}
More information about the primary diagnosis codes and aggregate categories can be found in Section 6.2.14 Diagnosis Groupings.
Steps to running the S3b_mapDrgCdToPdgnsCd.R script
Prior to 2011, there is no pdgns_cd (primary diagnosis code) in the USRDS pre-ESRD data. This is an issue, because we need the pdgns_cd in order to map a claim to one of the 12 aggregate categories. This script addresses the issue by mapping the drg_cd (which is available prior to 2011) to a pdgns_cd. The mapping is not one-to-one. This script therefore constructs a probability distribution for the mapping, and the pdgns_cd is subsequently constructed based on this probability distribution.
Input: Postgres table
preesrd5y_ip_clm_inc
Output: Postgres table
drg_cd_mapping
The script S3a-esrd_claims.R must be run in order to generate the data used by this script. The in-patient claims have both drg_cd and pdgns_cd; these are used as the source data for mapping drg_cd to pdgns_cd.
Step 1. Import data from the preesrd5y_ip_clm_inc table
res = dbGetQuery(
con,
"WITH pre_drg_pdgn AS (
SELECT drg_cd, pdgns_cd, COUNT(*) AS nmbr
FROM preesrd5y_ip_clm_inc
WHERE cdtype='I'
GROUP BY drg_cd, pdgns_cd),
drg_cd_tbl AS (
SELECT drg_cd, pdgns_cd, nmbr,
row_number() OVER (PARTITION BY drg_cd
ORDER BY nmbr DESC)
FROM pre_drg_pdgn
)
SELECT a.drg_cd, a.pdgns_cd, a.nmbr, a.row_number, SUM(b.nmbr) AS cum
FROM drg_cd_tbl a
INNER JOIN drg_cd_tbl b
ON a.drg_cd=b.drg_cd
AND a.row_number<=b.row_number
GROUP BY a.drg_cd, a.pdgns_cd, a.nmbr, a.row_number
ORDER BY a.drg_cd, a.row_number"
)
Step 2. Aggregate table by drg_cd
bydrgcd = res %>%
group_by(drg_cd) %>%
dplyr::summarise(
total = sum(as.numeric(nmbr)))
res = res %>%
inner_join(
bydrgcd,
by = "drg_cd")
res = res %>%
mutate(
cum0 = as.numeric(cum - nmbr),
cum = as.numeric(cum),
lb = cum0 / total,
ub = cum / total
)
Step 3. Select the columns to save.
drg_cd_mapping = res %>%
select(
drg_cd,
pdgns_cd,
lb,
ub)
Step 4. Save to Postgres as drg_cd_mapping
drg_tblname = "drg_cd_mapping"
drop_table_function(con, drg_tblname)
dbWriteTable(con,
drg_tblname,
drg_cd_mapping,
append = F,
row.names = FALSE)
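For each drg_cd, the lb and ub columns define cumulative-probability bounds over the pdgns_cd values observed with that drg_cd, so a pdgns_cd can be drawn by sampling a uniform random number and selecting the row whose (lb, ub] interval contains it (this is what the pre-2011 claims script in the next section does). Below is a minimal, illustrative example with made-up values; the drg_cd and pdgns_cd codes shown are hypothetical.
# Toy example only: a hypothetical drg_cd whose inpatient claims mapped to
# pdgns_cd "A" 60% of the time, "B" 30%, and "C" 10%
toy_map = data.frame(
  drg_cd   = "291",
  pdgns_cd = c("A", "B", "C"),
  lb       = c(0.0, 0.6, 0.9),
  ub       = c(0.6, 0.9, 1.0),
  stringsAsFactors = FALSE
)
rv = runif(1)  # uniform draw in (0, 1)
# select the pdgns_cd whose cumulative-probability interval (lb, ub] contains rv;
# this returns "A" with probability 0.6, "B" with 0.3, and "C" with 0.1
toy_map$pdgns_cd[rv > toy_map$lb & rv <= toy_map$ub]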
Steps to running the S3c_esrd_claims_pre_2011.R script
Before 2011, pre-ESRD claims are stored in the files inc2008.csv, inc2009.csv, and inc2010.csv. These files are organized differently from the other pre-ESRD files: the type of claim is not part of the file name (instead, it is identified in the file's contents in a field called "hcfasaf"), and the contents of the file can differ from year to year. Also, the pdgns_cd is not available in these pre-2011 files, so this script constructs a pdgns_cd from the drg_cd, which is available in those years.
Input: .csv files are produced in script S1a-convertSAStoCSV.R
inc2008.csv
inc2009.csv
inc2010.csv
drg_cd_mapping
Output: Rows of pre-2011 claims for the cohort added to the following Postgres tables
preesrd5y_ip_clm_inc
preesrd5y_hh_clm_inc
preesrd5y_hs_clm_inc
preesrd5y_op_clm_inc
preesrd5y_sn_clm_inc
File names and column types are defined in pre_esrd_pre2011_claim_variables.R
source('CreateDataSet/pre_esrd_pre2011_claim_variables.R')
Step 1. Import the pre-2011 claims and filter on usrds_ids in the cohort and features in the post-2011 claims.
- Set cdtype = "I" to indicate ICD-9
- Set any missing drg_cd=000.
create_pre_2011 <- function(
data_dir,
filename,
tblname,
append_flag,
table_name_pt,
newIn2010,
column_types){
inc20xx = read_csv(file.path(data_dir, str_glue("{filename}.csv")), col_types=column_types)
incident_year =
substr(filename, str_length(filename) - 3, str_length(filename))
names(inc20xx) = tolower(names(inc20xx))
patients = dbGetQuery(
con,
str_glue(
"SELECT usrds_id
FROM {table_name_pt}")
)
# filter on ids from the patient table
inc20xx = inc20xx %>%
filter(
usrds_id %in% patients$usrds_id) %>%
mutate(
incident_year = incident_year,
cdtype = "I",
drg_cd = ifelse(
is.na(drg_cd), "000", drg_cd),
drg_cd = ifelse(
drg_cd == "", "000", drg_cd)) %>%
mutate(
drg_cd = as.numeric(drg_cd))
sortednm = names(inc20xx) %>% sort()
inc20xx = inc20xx[, sortednm]
if (append_flag==FALSE){
inc20xx[, newIn2010] = NA
drop_table_function(con, tblname)
}
print(nrow(inc20xx))
dbWriteTable(
con,
tblname,
inc20xx,
append = append_flag,
row.names = FALSE)
}
Step 2: For each claim type (home health - hh, hospice - hs, inpatient - ip, skilled nursing unit - sn, outpatient - op)
- Generate a uniform random number for each record in pre2011 claims, and look up pdgns_cd from drg_cd_mapping based on this random number, which will produce a pdgns_cd reflecting the underlying joint distribution of (drg_cd, pdgns_cd) in the data
get_claim_type_x <- function(claim_type, table_nm) {
print(str_glue("get {claim_type}"))
df = dbGetQuery(
con,
str_glue(
"
SELECT *
FROM {table_nm}
WHERE hcfasaf='{claim_type}'
"))
return(df)
}
get_distribution <- function(df){
# Generate a uniform rv for each record in df, and look up pdgns_cd from drg_cd_mapping
# based on this rv, which will produce a pdgns_cd reflecting the underlying
# joint distribution of (drg_cd,pdgns_cd) in the data
print("get distribution of drg_cd, pdgns_cd")
set.seed(597)
df$rv = runif(
dim(df)[1]
)
temptablename = "temp_df"
drop_table_function(con, temptablename)
dbWriteTable(
con,
temptablename,
df,
temporary = TRUE
)
dg = dbGetQuery(
con,
str_glue(
"
SELECT a.*, b.pdgns_cd
FROM {temptablename} a
LEFT JOIN drg_cd_mapping b
ON a.drg_cd = b.drg_cd
AND a.rv <= b.ub
AND a.rv > b.lb
"))
return(dg)
}
Step 3: Insert these rows into the main Postgres table for this claim type.
insert_claim_rows <- function(claim_type, pre2011_data) {
#Get the field names to be inserted into the pre-esrd data,
# in the correct order
print(str_glue("intert pre 2011 {claim_type} rows into table {nrow(pre2011_data)}"))
main_fieldnames = names(
dbGetQuery(
con,
str_glue(
"
SELECT *
FROM preesrd5y_{claim_type}_clm_inc
LIMIT 10
")
)
)
#Set fields in main claims fieldnames that do not appear in the pre2011 data = nan
pre2011_data[, setdiff(main_fieldnames, names(pre2011_data))] = NA
# Include only fields also in main_fieldnames, in the proper order
pre2011_data = pre2011_data[, main_fieldnames]
# append pre2011 rows to the main claims table
main_tblname = str_glue("preesrd5y_{claim_type}_clm_inc")
dbWriteTable(
con,
main_tblname,
pre2011_data,
append = TRUE,
row.names = FALSE)
}
Step 4. Define the wrapper function to separate into each year and claim type and save to Postgres tables.
source_pre_2011 <- function(data_dir, tblname, column_types) {
newIn2010 = c(
"dpoadmin",
"dpodose",
"hgb",
"dpocash",
"attending_phys",
"operating_phys",
"other_phys"
)
create_pre_2011(data_dir,
"inc2010",
tblname,
append_flag=FALSE,
table_name_pt = "patients_medevid_waitlist",
newIn2010,
column_types)
create_pre_2011(data_dir,
"inc2009",
tblname,
append_flag=TRUE,
table_name_pt = "patients_medevid_waitlist",
newIn2010,
column_types)
create_pre_2011(data_dir,
"inc2008",
tblname,
append_flag=TRUE,
table_name_pt = "patients_medevid_waitlist",
newIn2010,
column_types)
########BEGIN HOME HEALTH#######
df = get_claim_type_x("H",tblname)
dg = get_distribution(df)
insert_claim_rows("hh", dg)
rm(df,dg)
####BEGIN HOSPICE##########
df = get_claim_type_x("S", tblname)
dg = get_distribution(df)
insert_claim_rows("hs", dg)
rm(df,dg)
####BEGIN INPATIENT#######
df = get_claim_type_x("I", tblname)
dg = get_distribution(df)
insert_claim_rows("ip", dg)
rm(df,dg)
###BEGIN SKILLED NURSING####
df = get_claim_type_x("N", tblname)
dg = get_distribution(df)
insert_claim_rows("sn", dg)
rm(df,dg)
####BEGIN OUTPATIENT####
df = get_claim_type_x("O", tblname)
# Step 2: Generate a uniform rv for each record in df, and look up pdgns_cd from drg_cd_mapping
# based on this rv, which will produce a pdgns_cd reflecting the underlying
# joint distribution of (drg_cd, pdgns_cd) in the data
set.seed(597)
df$rv = runif(
dim(df)[1]
)
temptablename = "temp_df"
drop_table_function(con, temptablename)
dbWriteTable(
con,
temptablename,
df,
temporary = TRUE
)
make_query <- function(dg_vals, temptablename){
dg = str_glue(
"WITH w as (
SELECT *
FROM {temptablename}
WHERE MOD(CAST(usrds_id AS NUMERIC),10) IN ({dg_vals})
)
SELECT a.*, b.pdgns_cd
FROM w a
LEFT JOIN drg_cd_mapping b
ON a.drg_cd = b.drg_cd
AND a.rv <= b.ub
AND a.rv > b.lb"
)
return(dg)
}
dg_1 = dbGetQuery(con, make_query("0,1", temptablename))
dg_2 = dbGetQuery(con, make_query("2,3", temptablename))
dg_3 = dbGetQuery(con, make_query("4,5", temptablename))
dg_4 = dbGetQuery(con, make_query("6,7", temptablename))
dg_5 = dbGetQuery(con, make_query("8,9", temptablename))
dg = rbind(dg_1, dg_2)
dg = dg %>%
rbind(dg_3) %>%
rbind(dg_4) %>%
rbind(dg_5)
#step 3 append rows to main table
insert_claim_rows("op", dg)
}
Step 5. Execute all functions defined above
source_pre_2011(data_dir,"pre_esrd_2011", columns_esrd_2015)
There are several thousand primary diagnosis codes in the pre-ESRD claims data, which need to be meaningfully categorized in order to create useful features. The 12 major disease groups determined by the clinicians on the project are: diabetes, hypertension, heart failure, cardiovascular arterial disease, cerebrovascular disease, peripheral arterial disease, pneumonia, kidney failure, malignant neoplasm, smoking, alcohol dependence, and drug dependence.
Steps for running S3d_dxCodeGrouping.R
This script maps each value in the pdgns_cd column in the pre-ESRD data to one of 12 aggregated diagnosis groupings, and stores the mapping in the dxmap Postgres table. Two sources of input are used for the groupings: CCS (Clinical Classification System) and UCSF physician expertise.
Input:
- icd9_ccs_codes.R (for CCS groupings)
- icd10_ccs_codes.R (for CCS groupings)
- icd9_dx_2014.txt (for the icd9 pdgns_cd)
- icd10_dx_codes.txt (for the icd10 pdgns_cd)
- dx_mappings_ucsf.csv (for UCSF-advised categorizations of diagnosis codes)
Output:
dxmap
Step 1. Define functions
read_icd9 <- function(directory, filename) {
#READ IN ICD9 SOURCE DATA
lines = readLines(file.path(directory,filename))
lines =
iconv(lines[2:length(lines)],
from = "latin1",
to = "ASCII",
sub = ""
)
#Convert utf-8 to ASCII and remove special characters like umlauts and accents
pdgns_cd = substr(lines, 1, 6) %>%
trimws() %>%
str_pad(.,
width = 7,
side = "right",
pad = "0"
)
description = substr(lines, 7, 130)
df9 = as.data.frame(cbind(pdgns_cd, description))
df9$cdtype = "I"
return(df9)
}
read_icd10 <- function(directory, filename){
lines = readLines(file.path(directory, filename))
lines <-
iconv(lines[2:length(lines)],
from = "latin1",
to = "ASCII",
sub = ""
)
pdgns_cd = substr(lines, 1, 7) %>%
trimws() %>%
str_pad(.,
width = 7,
side = "right",
pad = "0"
)
description = substr(lines, 11, 130)
df10 = as.data.frame(cbind(pdgns_cd, description), stringsAsFactors = F)
df10 = df10 %>% filter(pdgns_cd != '0000000')
#There may be multiple entries with the same pdgns_cd for icd10, so choose one
df10 = sqldf(
"
SELECT pdgns_cd, MAX(description) AS description
FROM df10
GROUP BY pdgns_cd"
)
df10$cdtype = "D"
return(df10)
}
map_pdgns = function(df9, df10){
# join icd9 and icd10
df <- as.data.frame(rbind(df9, df10)) %>%
mutate_at(
vars('cdtype', 'pdgns_cd', 'description'),
as.character
)
df = df %>%
mutate(
dx_neo = as.integer(
grepl("malignant neoplasm", tolower(df$description)) &
!grepl("family history", tolower(df$description))
), #flag malignant neoplasm codes, excluding "family history of malignant neoplasm" codes
# dx_poi=as.integer(grepl("poisoning",tolower(df$description))),
dx_smo = as.integer((
cdtype == 'D' & pdgns_cd %in% smo_10
) |(
cdtype == 'I' & pdgns_cd %in% smo_9)
),
dx_alc = as.integer((
cdtype == 'D' & pdgns_cd %in% alc_10
) | (
cdtype == 'I' & pdgns_cd %in% alc_9)
),
dx_drg = as.integer((
cdtype == 'D' & pdgns_cd %in% drg_10
) | (
cdtype == 'I' & pdgns_cd %in% drg_9)
),
dx_pne = as.integer((
cdtype == 'D' & pdgns_cd %in% pne_10
) | (
cdtype == 'I' & pdgns_cd %in% pne_9)
),
dx_kid = as.integer((
cdtype == 'D' & pdgns_cd %in% kid_10
) | (
cdtype == 'I' & pdgns_cd %in% kid_9)
)
)
return(df)
}
getComorbids <- function(directory, filename, df, colname, prefix = 'dx_') {
ucsf_mappings = read.csv(file.path(directory, filename), stringsAsFactors = FALSE)
dg = sqldf(
"SELECT df.*, b.label
FROM df
LEFT JOIN ucsf_mappings b
ON df.pdgns_cd>=b.lb
AND df.pdgns_cd<=b.ub",
method = "raw"
)
values = unique(dg[, colname]) %>% setdiff(NA)
for (v in values) {
dg[, paste0(prefix, v)] = (as.integer(dg[, colname] == v))
dg[, paste0(prefix, v)] = replace_na(dg[, paste0(prefix, v)], 0)
}
dg$label = NULL
return(dg)
}
Step 2. Execute Functions
df9 = read_icd9(source_dir, "icd9_dx_2014.txt")
df10 = read_icd10(source_dir, "icd10_dx_codes.txt")
mapped9_10 = map_pdgns(df9, df10)
dh = getComorbids(source_dir, "dx_mappings_ucsf.csv", df=mapped9_10, colname = "label")
Step 3. Save to Postgres database as dxmap
drop_table_function(con, "dxmap")
tblname = "dxmap"
dbWriteTable(
con,
tblname,
dh,
append = FALSE,
row.names = FALSE
)
Points to consider
The primary diagnosis codes in the pre-ESRD claims should be converted, with clinicians' input, into relevant disease groupings that can be used to create features with predictive value. It is difficult to find a one-size-fits-all method for mapping diagnosis codes to meaningful categories because the categories are highly dependent on the use case. Future researchers may want to consider alternative disease groupings that are informed by clinicians and other health-care researchers.
Steps for running S4a_pre_esrd_full.R
USRDS data have multiple pre-ESRD claims per patient. This script aggregates the data for each patient through the following steps:
- Merge the pre-ESRD claims tables
- Construct counts of claims grouped by type of claim and diagnosis code
- Create one record per patient, with all pre-ESRD summary statistics aggregated for each patient
- Create binary variables to indicate the presence or absence of pre-ESRD claims and of each type of claim (IP, HH, HS, OP, SN)
The record includes total number of claims and total length of stay, grouped by:
- Type of claim (IP, HH, HS, OP, SN) and
- The aggregated diagnosis grouping.
Input:
preesrd5y_ip_clm_inc
preesrd5y_hs_clm_inc
preesrd5y_hh_clm_inc
preesrd5y_op_clm_inc
preesrd5y_sn_clm_inc
patients_medevid_waitlist
Output:
preesrdfeatures
Table: Number of unique patients with each type of Medicare Pre-ESRD claims
 | Inpatient (IP) | Outpatient (OP) | Skilled Nursing Unit (SN) | Home Health (HH) | Hospice (HS) |
---|---|---|---|---|---|
Number of Unique Patients | 553,704 | 514,926 | 140,417 | 224,272 | 12,482 |
Total Number of Claims | 2,496,683 | 15,222,280 | 592,970 | 939,751 | 50,200 |
Step 1. Define functions for SQL queries to get claim information for 3 types of aggregations and join to the dxmap
table.
prepareQuery = function(dxcols, tablename, qryAggType = 1, testMode = 0) {
qry_pt1=paste0("b.", dxcols$column_name, collapse=",")
if (qryAggType == 1) {
vec1 = paste0("SUM(stay*", dxcols$column_name, ")")
vec2 = paste0(" AS stay", substr(dxcols$column_name, 3, 6))
qry_pt5 = paste0(vec1, vec2, collapse = ", ")
} else if (qryAggType == 2) {
vec1 = paste0("SUM(", dxcols$column_name, ")")
vec2 = paste0(" AS clms", substr(dxcols$column_name, 3, 6))
qry_pt5 = paste0(vec1, vec2, collapse = ", ")
} else if (qryAggType == 3) {
vec1 = paste0("MAX(", dxcols$column_name, ")")
vec2 = paste0(" AS has", substr(dxcols$column_name, 3, 6))
qry_pt5 = paste0(vec1, vec2, collapse = ", ")
}
qry_main = str_glue("WITH w AS (
SELECT a.usrds_id,
a.pdgns_cd,
a.masked_clm_thru-a.masked_clm_from AS stay,
a.cdtype,
a.hgb,
a.hcrit,
{qry_pt1}
FROM {tablename} a
LEFT JOIN dxmap b
ON a.cdtype=b.cdtype
AND a.pdgns_cd=b.pdgns_cd
)
SELECT usrds_id, {qry_pt5}
FROM w
GROUP BY usrds_id"
)
return(qry_main)
}
Step 2. Get column names
dxcols = names(dbGetQuery(
con,
"
SELECT *
FROM dxmap
LIMIT 5
"))
dxcols = dxcols[4:length(dxcols)] %>% as.data.frame()
names(dxcols) = "column_name"
Step 3. Send a SQL query for each type of claim and aggregation.
ip1 = dbGetQuery(con,prepareQuery(
dxcols,
"preesrd5y_ip_clm_inc",
qryAggType = 1,
testMode = 0
))
ip2 = dbGetQuery(con,prepareQuery(
dxcols,
"preesrd5y_ip_clm_inc",
qryAggType = 2,
testMode = 0
))
ip3 = dbGetQuery(con,prepareQuery(
dxcols,
"preesrd5y_ip_clm_inc",
qryAggType = 3,
testMode = 0
))
op1 = dbGetQuery(con,prepareQuery(
dxcols,
"preesrd5y_op_clm_inc",
qryAggType = 1,
testMode = 0
))
op2 = dbGetQuery(con,prepareQuery(
dxcols,
"preesrd5y_op_clm_inc",
qryAggType = 2,
testMode = 0
))
op3 = dbGetQuery(con,prepareQuery(
dxcols,
"preesrd5y_op_clm_inc",
qryAggType = 3,
testMode = 0
))
sn1 = dbGetQuery(con, prepareQuery(
dxcols,
"preesrd5y_sn_clm_inc",
qryAggType = 1,
testMode = 0
))
sn2 = dbGetQuery(con, prepareQuery(
dxcols,
"preesrd5y_sn_clm_inc",
qryAggType = 2,
testMode = 0
))
sn3 = dbGetQuery(con, prepareQuery(
dxcols,
"preesrd5y_sn_clm_inc",
qryAggType = 3,
testMode = 0
))
Step 4. Calculate MAX(masked_clm_thru)-MIN(masked_clm_from) as the time range of claims for each patient.
prepareAggQuery = function(clm_type) {
qry_main = str_glue("SELECT usrds_id,
SUM(masked_clm_thru-masked_clm_from) AS stay,
MAX(masked_clm_thru)-MIN(masked_clm_from) AS range,
MIN(masked_clm_from) AS earliest_clm,
MAX(masked_clm_thru) AS latest_clm,
COUNT(*) AS claims
FROM preesrd5y_{clm_type}_clm_inc
GROUP BY usrds_id"
)
return(qry_main)
}
hha = dbGetQuery(con, prepareAggQuery("hh"))
ipa = dbGetQuery(con, prepareAggQuery("ip"))
opa = dbGetQuery(con, prepareAggQuery("op"))
sna = dbGetQuery(con, prepareAggQuery("sn"))
hsa = dbGetQuery(con, prepareAggQuery("hs"))
Note: A large amount of code devoted to creating and combining these queries is not included in this guide. See the code for details.
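The omitted code builds the query (qry) used in Step 5 below to combine these aggregates into one row per patient. For orientation only, here is a rough R-side sketch of the same idea, assuming the aggregate data frames created in Step 4 (this is not the project's omitted query-building code):
library(dplyr)
claims_wide = ipa %>%
  rename_with(~ paste0(.x, "_ip"), -usrds_id) %>%
  full_join(opa %>% rename_with(~ paste0(.x, "_op"), -usrds_id), by = "usrds_id") %>%
  full_join(sna %>% rename_with(~ paste0(.x, "_sn"), -usrds_id), by = "usrds_id") %>%
  full_join(hha %>% rename_with(~ paste0(.x, "_hh"), -usrds_id), by = "usrds_id") %>%
  full_join(hsa %>% rename_with(~ paste0(.x, "_hs"), -usrds_id), by = "usrds_id")
#claims_wide has one row per usrds_id with columns such as claims_ip, stay_ip, earliest_clm_ip, etc.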
Step 5. Get claims_range
df = dbGetQuery(con, qry)
earliest_cols = names(df)[grepl("earliest_clm", names(df))]
latest_cols = names(df)[grepl("latest_clm", names(df))]
for (c in earliest_cols) {
df[, c] = ifelse(is.na(df[, c]), 500000, df[, c])
}
for (c in latest_cols) {
df[, c] = ifelse(is.na(df[, c]), -500000, df[, c])
}
earliest_claim_date = apply(df[, earliest_cols], 1, "min")
latest_claim_date = apply(df[, latest_cols], 1, "max")
df$claims_range = latest_claim_date - earliest_claim_date
cols_to_delete = union(earliest_cols, latest_cols)
df[, cols_to_delete] = NULL
From the individual columns named "has_dx_claimtype" (e.g., "has_neo_ip"), create a single column "has_dx".
has_cols = names(df)[grepl("has_", names(df))]
Step 6. Create a list of diagnosis groupings
dxs = unique(
substr(
has_cols, 5, 7))
Step 7. Create a binary result that yields 1 if the diagnosis is present for any claim type, 0 if it is absent for all non-missing claim types, and NA if all values are NA.
- Example 1: has_dia_ip=NA, has_dia_op=0, has_dia_sn=1, so x=c(NA,0,1); returns 1
- Example 2: x=c(NA,NA,NA); returns NA
- Example 3: x=c(NA,0,NA); returns 0
mymax = function(x) {
p_sum = sum(x > 0, na.rm = T) #number of positive elements
z_sum = sum(x == 0, na.rm = T) #number of zero elements
return(ifelse(p_sum > 0, 1, ifelse(z_sum > 0, 0, NA)))
}
Use this so we end up with NA if a vector is all NA
safe.max = function(invector) {
na.pct = sum(is.na(invector))/length(invector)
if (na.pct == 1) {
return(NA) }
else {
return(max(invector,na.rm=TRUE))
}
}
For each diagnosis grouping
for (c in dxs) {
hasdxcols = has_cols[grepl(c, has_cols)]
df[,paste0("has_",c)]=apply(
df[,hasdxcols],
1,
function(x) safe.max(as.numeric(x))
)
}
hasvars = names(df)[grepl("has_", names(df))]
hasvarsettings = hasvars[grepl("_ip$|_op$|_sn$|_hh$|_hs$", hasvars)]
df[, hasvarsettings] = NULL #remove variables like "has_neo_ip", keeping in "has_neo"
df$claims_range = ifelse(df$claims_range < 0, NA, df$claims_range)
Step 8. Create a binary feature for each claim type. These are used in the parametric models instead of the detailed claim numbers.
df$prior_hh_care = as.integer(df$claims_hh > 0 &
!is.na(df$claims_hh))
df$prior_hs_care = as.integer(df$claims_hs > 0 &
!is.na(df$claims_hs))
df$prior_ip_care = as.integer(df$claims_ip > 0 &
!is.na(df$claims_ip))
df$prior_op_care = as.integer(df$claims_op > 0 &
!is.na(df$claims_op))
df$prior_sn_care = as.integer(df$claims_sn > 0 &
!is.na(df$claims_sn))
priorvars = names(df)[grepl("prior_", names(df))]
df$has_preesrd_claim = apply(
df[, priorvars],
1,
function(x) safe.max(as.numeric(x))
)
Step 9. Save to Postgres as preesrdfeatures
drop_table_function(con, pre_esrd_tblname)
dbWriteTable(
con,
pre_esrd_tblname,
df,
field.types = myfieldtypes,
append = FALSE,
row.names = FALSE
)
Steps for running the S4b_join_to_partition_data.R script
Join the preesrdfeatures
table to our partition index.
Input:
preesrdfeatures
partition_10
Output:
preesrdfeatures
Step 1. Define a function to import and alter the preesrdfeatures
table by adding the subset column, and save to Postgres.
join_data_partitions <- function(con,
data_tablename="preesrdfeatures",
num_partitions=10){
dbSendStatement(con, str_glue(
"
ALTER TABLE {data_tablename}
ADD subset integer
"), n = -1)
dbSendStatement(
con,
str_glue(
"
UPDATE {data_tablename} d
SET subset = p.subset
FROM partition_{num_partitions} p
WHERE d.usrds_id = p.usrds_id
"), n = -1)
}
Step 2. Execute the function
data_tbl = "preesrdfeatures"
join_data_partitions(
con,
data_tablename = data_tbl,
num_partitions = 10
)
Steps for S5_pdis_mapping.R
This script maps ICD-9 to ICD-10 codes and creates a table named pdis_recode_map
which is used in S6-Prepare Data Set, for assigning pdis to a numeric value called pdis_recode. The 2017_I9gem_map.txt is used for this purpose.
PDIS (primary disease causing renal failure) is either the ICD-9 or ICD-10 code for the primary cause of renal failure, depending on the year of the claim. Claims prior to 2015 contain ICD-9 codes, claims after 2016 contain ICD-10 codes, and claims from 2015 and 2016 can be either ICD-9 or ICD-10. The pdis variable is stored in the USRDS data as a character variable, and the ICD-9 codes were mapped to their ICD-10 equivalents to preserve the original meaning of the character variable in the numeric encoding. It was re-coded by the following steps (a small illustration follows this list):
- Mapping all codes to their ICD-10 equivalent
- Converting them to a factor
- Typing them to a numeric
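As a minimal illustration of this re-coding (toy code values, not actual USRDS values):
icd10_equiv = c("N183000", "E102100", "N183000") #toy ICD-10 equivalents of pdis after mapping
pdis_recode = as.numeric(as.factor(icd10_equiv)) #convert to a factor, then to a numeric
#pdis_recode is c(2, 1, 2): identical codes receive the same numeric value, distinct codes differ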
Input: The pdis column from patients_medevid_waitlist
(originally comes from the patients
table)
2017_I9gem_map.txt
patients_medevid_waitlist
Output:
pdis_recode_map
Step 1. Import the data from Postgres.
df1 = dbGetQuery(con,
"SELECT *
FROM patients_medevid_waitlist")
Whether the code is ICD-9 or ICD-10 is determined by the column cdtype. To create the map, get each unique pdis value for each cdtype, excluding entries where cdtype is missing (NULL). (There are 20,003 patients where cdtype is missing.)
pdis_occurrences = dbGetQuery(con,
"SELECT cdtype, pdis, COUNT(*) AS nmbr
FROM patients_medevid_waitlist
WHERE cdtype IS NOT NULL
GROUP BY pdis, cdtype"
)
Step 2. Standardize the format so that we can match with another pdis file.
pdis_occurrences$pdis = pdis_occurrences$pdis %>%
trimws() %>%
str_pad(.,
width = 7,
side = "right",
pad = "0"
)
Step 3. Import 2017_I9gem_map.txt
map_icd_9_to_10 = read.table(file = file.path(source_dir, "2017_I9gem_map.txt"),
header = TRUE) %>%
select(icd9, icd10)
Step 4. Format columns
map_icd_9_to_10 = map_icd_9_to_10 %>%
mutate(icd9 = icd9 %>%
trimws() %>%
str_pad(.,
width = 7,
side = "right",
pad = "0"
),
icd10 = icd10 %>%
trimws() %>%
str_pad(.,
width = 7,
side = "right",
pad = "0"
)
)
ICD-10: The character-level recode (pdis_recode_char) is the same as pdis when cdtype equals "D" (indicating ICD-10)
pdis_occurrences_D = pdis_occurrences %>%
filter(cdtype == "D") %>%
mutate(pdis_recode_char = pdis)
Step 5. Use the crosswalk to map the ICD-9 codes to ICD-10
pdis_occurrences_I = sqldf(
"SELECT a.*, b.icd10 AS pdis_recode_char
FROM pdis_occurrences a
LEFT JOIN map_icd_9_to_10 b
ON a.pdis=b.icd9
WHERE a.cdtype='I'",
method = "raw"
)
Step 6. Concatenate the 2 maps
pdis_recode_map = union(pdis_occurrences_D, pdis_occurrences_I)
pdis_recode_map = pdis_recode_map %>%
mutate(pdis_recode = as.factor(pdis_recode_char) %>% as.numeric())
Step 7. Calculate the total number of occurrences (nmbr) of each recode value
pdis_recode_agg = pdis_recode_map %>%
group_by(pdis_recode) %>%
dplyr::summarise(pdis_recode_nmbr = sum(nmbr)) %>%
as.data.frame()
pdis_recode_map = pdis_recode_map %>% left_join(pdis_recode_agg, by = "pdis_recode")
Step 8. Save to Postgres as pdis_recode_map
tblname = "pdis_recode_map"
drop_table_function(con, tblname)
dbWriteTable(con,
tblname,
pdis_recode_map,
append = FALSE,
row.names = FALSE)
Steps for running S6-prepareDataSet.R script
This script creates medxpreesrd
and uses the full dataset patients_medevid_waitlist
and preesrdfeatures
to construct the table medxpreesrd
by:
- creating binary variables to indicate whether imputed values are missing or out of bounds for a given patient
- encoding character values to numeric
- counting the number of value types for como_* columns
- incorporating pdis_recode column
- deleting features not used for modeling
Input:
patients_medevid_waitlist
preesrdfeatures
pdis_recode_map
dxmap
imputation_rules.xlsx
Output:
medxpreesrd
Step 1. Define variables
subsets = "0, 1"
tablename = "patients_medevid_waitlist"
table_preesrd = "preesrdfeatures"
medex_tblname = "medxpreesrd"
Step 2. Import 2 partitions of data (subsets = "0, 1") from patients_medevid_waitlist
. This code should be run 5 times (once for each pair of the 10 partitions/subsets); this example shows a single pass. The full code contains the details for running the functions on the remaining partitions.
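A sketch of the outer loop over the five pairs of partitions (the actual script repeats Steps 2-13 below for each pair):
subset_pairs = list(c(0, 1), c(2, 3), c(4, 5), c(6, 7), c(8, 9))
for (pair in subset_pairs) {
subsets = paste(pair, collapse = ", ") #e.g., "0, 1"
#run Steps 2-13 below with this value of subsets
}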
qry = str_glue(
"SELECT *
FROM {tablename}
WHERE subset IN ({subsets})"
)
data_subset = dbGetQuery(con, qry)
For each variable in the list vars = c("height","weight","bmi","sercr","album","gfr_epi","heglb"), introduce a binary variable for whether the variable is NA (meaning "missing") and a separate binary variable for whether it is out of bounds (that is, not missing but below the clinically plausible minimum or above the clinically plausible maximum as determined by UCSF clinicians). These boundaries are imported from imputation_rules.xlsx. The function valueExceptions returns a data frame with usrds_id (the key field) and binary values indicating whether each column in vars is NA and whether it is out of bounds.
valueExceptions = function(df, vars) {
bounds = read_excel(str_glue("{source_dir}imputation_rules.xlsx"), sheet =
"Bounds") %>% as.data.frame()
isnavars = c()
for (v in vars) {
newv = str_glue("wasna_{v}")
df[, newv] = as.integer(is.na(df[, v]))
isnavars = c(isnavars, newv)
}
outofbndsvars = c()
for (v in vars) {
newv = str_glue("outofbnds_{v}")
df[, newv] = as.integer(!is.na(df[, v]) &
!(df[, v] >= bounds[1, v] &
df[, v] <= bounds[2, v]))
outofbndsvars = c(outofbndsvars, newv)
}
return(df[, c("usrds_id", isnavars, outofbndsvars)])
}
Step 3. Execute valueExceptions
labvars=c("height","weight","bmi","sercr","album","gfr_epi","heglb")
ve=valueExceptions(data_subset,labvars)
Join the out of bounds binary indicators for each patient to the data.
df=data_subset %>%
left_join(
ve,
by="usrds_id")
Step 4. Create a function to set the values to NA if it is out of bounds.
setOutOfBoundsToNA=function(df,vars) {
for (v in vars) {
df[,v]=ifelse(
df[,paste0("outofbnds_",v)] == 1,
NA,
df[,v])
}
return(df)
}
Step 5. Execute the above function on the data
df=setOutOfBoundsToNA(df,labvars)
oobvars = setdiff(names(df),names(data_subset))
Step 6. Get the list of categorical features.
getCategoryVars <- function(dataset){
pattern1 = "^MEDCOV|^PATTXOP|^PATINFORMED$|^DIET|^NEPHCARE|^EPO" %>% tolower()
pattern2 = "^DIAL|^TYPTRN|^AVGMATURING|^AVFMATURING" %>% tolower()
pattern3 = "^ACCESSTYPE|^TRCERT|^CDTYPE" %>% tolower()
pattern4 = "^EMPCUR|^EMPPREV|^pdis$|^hispanic$|^COMO_" %>% tolower()
categoryVars = names(dataset)[grepl(pattern1, names(dataset))]
categoryVars = union(categoryVars, names(dataset)[grepl(pattern2, names(dataset))])
categoryVars = union(categoryVars, names(dataset)[grepl(pattern3, names(dataset))])
categoryVars = union(categoryVars, names(dataset)[grepl(pattern4, names(dataset))])
return(categoryVars)
}
categoryVars = getCategoryVars(df)
Step 7. Get the list of continuous features
getContinuousVars <- function(dataset){
pattern_continuous = "^GFR_EPI|^SERCR|^ALBUM|^HEGLB|^HBA1C|^BMI$|^HEIGHT|^WEIGHT" %>% tolower()
continuousVars = names(dataset)[grepl(pattern_continuous, names(dataset))]
return(continuousVars)
}
continuousVars = getContinuousVars(df)
df = df[, c("usrds_id",
"subset",
"comorbid",
"inc_age",
"race",
"sex",
"disgrpc",
"waitlist_status",
"days_on_waitlist",
"died_in_90",
oobvars,
categoryVars,
continuousVars)]
Step 8. Get non numeric features
getNonNumericCols = function(dx) {
cols = c()
for (v in names(dx)) {
if (!is.numeric(dx[, v])) {
cols = c(cols, v)
}
}
return(cols)
}
nonNumCols = setdiff(getNonNumericCols(df), c("pdis", "comorbid", "cdtype","hispanic","waitlist_status"))
ML models typically require numeric values instead of characters or factors. The function replaceCharacterVals ensures that character values are replaced with a number.
replaceCharacterVals = function(dx,
vars,
sourceValue = c("N", "Y", "M", "F", "U", "C", "X", "D", "I", "A", "R"),
sinkValue = c("2", "1", "12", "13", "9", "15", "16", "17", "18", "20", "21"))
{
for (v in vars) {
print(v)
dx[, v] = mapvalues(pull(dx, v), sourceValue, sinkValue)
dx[, v] = as.integer(pull(dx, v))
}
return(dx)
}
df = replaceCharacterVals(df, nonNumCols)
pdis must be encoded as a number prior to ML model training.
recodePdis = function(df, con) {
df$pdis = df$pdis %>%
trimws() %>% str_pad(.,
width = 7,
side = "right",
pad = "0") #Format pdis with the same padding as in pdis_recode_map
pdis_map = dbGetQuery(
con, "
SELECT pdis, cdtype, pdis_recode
FROM pdis_recode_map")
pdis_map = pdis_map %>%
group_by(pdis, cdtype) %>%
dplyr::summarise(pdis_recode = min(pdis_recode))
df = df %>%
left_join(
pdis_map,
by = c("cdtype", "pdis")) %>%
mutate(
pdis_recode = ifelse(is.na(pdis_recode), 9999, pdis_recode)
)
return(df)
}
df = recodePdis(df, con)
Step 9. Count value types in como_* variables for each ID
comoEncode <- function(dataset){
como_names = names(dataset)[grepl("^como_", names(dataset))]
dataset$num_como_nas = apply(
dataset[, como_names],
1,
function(xx)
sum(is.na(xx))
)
dataset$num_como_Ns = apply(
dataset[, como_names],
1,
function(xx)
sum(xx == 2, na.rm = TRUE)
)
dataset$num_como_Ys = apply(
dataset[, como_names],
1,
function(xx)
sum(xx == 1, na.rm = TRUE)
)
dataset$num_como_Us = apply(
dataset[, como_names],
1,
function(xx)
sum(xx == 9, na.rm = TRUE)
)
return(dataset)
}
df = comoEncode(df)
Step 10. Remove features from the medevid
table that are not needed for the ML models
- All comorbidities that are only captured on the 1995 Medical Evidence Form (and therefore before our cohort's ESRD incident years of 2008-2017), such as como_cararr, como_hiv, etc.
- All laboratory variables that have greater than 40% missingness, such as albumin and hba1c
- All year variables and masked dates, such as incyear, masked_died, etc.
- All pdis (primary disease causing ESRD) re-formatted variables, such as pdis_count, pdis_mortality, etc.
varsToDelete = c(
"albumlm",
"como_ihd",
"como_mi",
"como_cararr",
"como_dysrhyt",
"como_pericar",
"como_diabprim",
"como_hiv",
"como_aids",
"comorbid_count",
"comorbid_mortality",
"comorbid_se",
"comorbid",
"ethn",
"hba1c",
"incyear",
"masked_died",
"masked_tx1fail",
"masked_txactdt",
"masked_txlstdt",
"masked_txinitdt",
"masked_remdate",
"masked_unossdt",
"masked_mefdate",
"masked_ctdate",
"masked_tdate",
"masked_patsign",
"masked_trstdat",
"masked_trnend",
"pdis_count",
"pdis_mortality",
"pdis_se",
"pdis",
"recnum",
"tottx"
)
df[, varsToDelete] = NULL
Step 11. Get the preesrdfeatures
for the 2 subsets of data.
qry = str_glue(
"SELECT *
FROM {table_preesrd}
WHERE subset in ({subsets})"
)
preesrd = dbGetQuery(con, qry)
Step 12. Join the data with columns from preesrdfeatures
full_data = df %>%
left_join(
preesrd,
by = c("usrds_id","subset")
)
Step 13. Save to Postgres as medxpreesrd
dbWriteTable(
con,
"medxpreesrd",
full_data,
row.names = FALSE
)
Missing data have the potential to introduce bias and loss of information, which can result in invalid conclusions. Multiple imputation was chosen over single imputation methods because it addresses the uncertainty about missing data by creating several plausible imputed datasets. For this project, multiple imputation was performed on the clinical and laboratory values (height, weight, BMI, serum creatinine, serum albumin, hemoglobin, and GFR-EPI) using the MICE (multiple imputation by chained equations) library in R (version 3.13.0), the predictive mean matching method for the imputations, and 5 imputations (5 datasets) to achieve 95% relative efficiency.
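The 5-imputation choice can be checked against Rubin's relative-efficiency approximation, RE = 1 / (1 + gamma / m), where gamma is the fraction of missing information and m is the number of imputations. A quick check, assuming for illustration gamma = 0.3 (not a value reported by this project):
m = 5
gamma = 0.3
re = 1 / (1 + gamma / m) #about 0.94, i.e., roughly 95% relative efficiency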
Steps for running S7_makeImputations.R
This script contains the code to create 5 imputations for missing data for each of the laboratory variables
weight
height
gfr_epi
sercr
album
and saves the data to Postgres as the micecomplete_pmm
table.
The table micecomplete_pmm
has 5 rows per usrds_id, one row for each of the 5 imputations, containing the imputed columns. A modeler who wants to use imputed values would use both medxpreesrd
and micecomplete_pmm
, replacing weight, height, bmi, sercr, etc. in medxpreesrd
with the imputed values from micecomplete_pmm
. This is shown in the modeling steps.
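A minimal sketch of that replacement (a sketch only; the imputed column names are assumed from the list above, and the modeling scripts show the exact query that was used):
imputed = dbGetQuery(con, "
SELECT m.*, i.height AS height_imp, i.weight AS weight_imp, i.bmi AS bmi_imp,
i.sercr AS sercr_imp, i.album AS album_imp, i.gfr_epi AS gfr_epi_imp, i.heglb AS heglb_imp
FROM medxpreesrd m
JOIN micecomplete_pmm i ON m.usrds_id = i.usrds_id")
#each usrds_id now appears 5 times (once per imputation); a modeler either fits a model on
#each imputed copy separately or selects one imputation per patient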
Input:
medxpreesrd
Output:
micecomplete_pmm
Step 1. Define the function to import the data and impute the missing values and save to Postgres as a new table. The function:
- Sets out-of-bounds values to NA so that they will be imputed
- Lists the variables to impute
- Lists the variables used to inform the imputation
- Imputes the missing values
- Calculates BMI and GFR as they are derived from other imputed variables
makeImputations <-
function(con, subset, bounds, impseed, data_tablename) {
df = dbGetQuery(
con,
str_glue(
"SELECT *
FROM {data_tablename}
WHERE subset={subset}"
))
varstoimpute = names(bounds)[2:length(names(bounds))]
varstoimpute = c(
"height",
"weight",
"bmi",
"sercr",
"album",
"gfr_epi",
"heglb"
)
varstouse = c(
"inc_age",
"race",
"sex",
"hispanic",
"num_como_nas",
"num_como_Ns",
"num_como_Ys",
"num_como_Us",
"sercr",
"height",
"weight",
"album",
"heglb"
)
dg = df[, c("usrds_id", union(varstoimpute, varstouse))]
dh = df[, c("usrds_id", "wasna_gfr_epi")]
dg = dg %>%
mutate(
hispanic = as.factor(hispanic),
race = as.factor(race),
sex = ifelse(is.na(sex), 0, sex) %>%
as.factor()
)
imp <- mice(dg, seed = impseed, maxit = 0)
predictorMatrixDf = imp$predictorMatrix
#An entry of 1 means the column variable was used to impute the row variable
meth = imp$method
#row_imputed indexes the row (variable to be imputed);
#c indexes the column (variable to use as an independent variable to impute row_imputed)
for (row_imputed in colnames(predictorMatrixDf)) {
predictorMatrixDf[,row_imputed ] = 0
}
for (col_imputed in varstoimpute) {
for (impute_by in varstouse) {
if (col_imputed != impute_by)
predictorMatrixDf[col_imputed, impute_by] = 1
}
}
# bmi is arithmetically related to weight and height
# so it needs to be handled with a separate model
predictorMatrixDf["bmi", "height"] = 1
predictorMatrixDf["bmi", "weight"] = 1
for (to_use in c("usrds_id", varstouse)) {
meth[to_use] = ""
}
for (to_impute in varstoimpute) {
meth[to_impute] = "pmm"
}
meth["bmi"] = "~ I(weight/(.01*height)^2)"
#Model the arithmetic relationship among bmi, weight, and height
miceimp <-
mice(
dg,
m = 5,
maxit = 20,
threshold = .99999,
seed = impseed,
predictorMatrix = predictorMatrixDf,
method = meth,
print = FALSE
)
writeImputations(
con,
miceimp,
varstoimpute,
dh,
subset
)
return(0)
}
Step 2. Set the seeds for each subset, import the boundary data.
seeds = c(2397, 3289, 4323, 4732, 691, 2388, 2688, 176, 1521, 461)
source_dir = file.path("CreateDataSet")
bounds = read_excel(file.path(source_dir, "imputation_rules.xlsx"), sheet ="Bounds"
) %>% as.data.frame()
Step 3. Execute the makeImputations function for each subset of the data (according to the 10 partitions).
for (s in 0:9) {
makeImputations(
con,
subset = s,
bounds,
impseed = seeds[s + 1], #R vectors are 1-indexed, so subset 0 uses seeds[1]
data_tablename="medxpreesrd"
)
}
Points to consider
- Curating clinical and laboratory variables requires input from clinicians to remove outlier values and to properly identify relevant variables to retain in the training dataset. For example:
- Both hemoglobin and hematocrit are included in the USRDS data and are used by clinicians to identify anemia. Clinicians identified hemoglobin as the more accurate variable, so hematocrit was removed from the training dataset since keeping both features results in redundant data.
- GFR-EPI is kept in the training data as it is preferred in clinical practice over GFR MDRD.
- For smaller datasets, all features in the training dataset can be used to inform the imputation. With a dataset of over 1 million observations, using all features is time prohibitive. Variables that are rarely missing and correlate with the variables to be imputed (age, race, sex, ethnicity, number of comorbidities, and the clinical/laboratory values) were used in the imputation model for this project.
- Variables like BMI and GFR should be passively imputed since they are derived from other imputed variables (BMI: height and weight; GFR: serum creatinine, along with age, sex, and race).
- There are various imputation methods that can be selected from the 'mice' package, such as 'norm', 'pmm', etc. Running a goodness-of-fit check on the imputations by masking and then imputing known values, as well as comparing the run times of each method, will help the user select an appropriate imputation method.
- Only the features that were missing at random (MAR) with a percent missingness <40% were imputed (i.e., the clinical and laboratory values of height, weight, BMI, serum creatinine, serum albumin, hemoglobin, and GFR-EPI). Future researchers could improve the imputations by imputing features that are missing not at random (MNAR) with a more complex imputation model. Additionally, other multiple imputation packages, such as mi and Amelia, could also be used for the imputations.
- Storing imputations in a database table separate from the table storing medevid, patients, and pre-ESRD data prevents the rest of the training dataset from being stored five times; it also reduces the amount of storage required for the training dataset.
Utility file: Used by the script S3d_dxCodeGrouping.R to produce the table dxMap
UCSF advised the categorizations of diagnosis codes.
Utility file: Used by the script S5_pdis_mapping.R to produce the table pdis_recode_map
for mapping pdis codes between ICD-9 and ICD-10.
Utility file: Used by the script S3d_dxCodeGrouping.R to produce the table dxMap
Group codes related to alcohol abuse, drug abuse, pulmonary disorders, and renal failure based on the Clinical classification system (CCS) rules for grouping ICD-10 diagnosis codes.
See https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp#download
Utility file: Used by the script S3d_dxCodeGrouping.R to produce the table dxMap
Group codes related to alcohol abuse, drug abuse, pulmonary disorders, and renal failure based on the Clinical classification system (CCS) rules for grouping ICD-10 diagnosis codes.
See https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp#download
Utility file: Used by the script S3d_dxCodeGrouping.R to produce the table dxMap
Group codes related to alcohol abuse, drug abuse, pulmonary disorders, and renal failure based on the Clinical classification system (CCS) rules for grouping ICD-9 diagnosis codes.
See https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp#download
Utility file: Used by the script S3d_dxCodeGrouping.R to produce the table dxMap
Group codes related to alcohol abuse, drug abuse, pulmonary disorders, and renal failure based on the Clinical classification system (CCS) rules for grouping ICD9 diagnosis codes.
See https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp#download
Utility used by script S7-makeImputations.R and S6-prepareDataSet.R
Contains the upper and lower bounds for clinical and laboratory variables.
Table: Upper and lower bounds for clinical and laboratory variables
Variable | Lower bound | Upper bound |
---|---|---|
Height (cm) | 76 | 243 |
Weight (kg) | 20 | 250 |
BMI (kg/m2) | 13 | 75 |
Serum Creatinine (mg/dL) | 0.5 | 50 |
Serum Albumin (g/dL) | 0.5 | 8 |
GFR EPI | 1 | 30 |
Hemoglobin (g/dL) | 2 | 18 |
Utility: used by S3a_esrd_claims.R
Filenames and column types to input into S3a_esrd_claims.R to create the table for the inpatient (IP) pre-ESRD claims from 2011-2017. Pre-ESRD claims in USRDS are kept in files specific to the year. The order of these filenames is very important as the 2012 table needs to be created first in the function. The .csv files named here are produced in script S1a-convertSAStoCSV.R
Filenames:
"preesrd5y_ip_clm_inc2012"
"preesrd5y_ip_clm_inc2013"
"preesrd5y_ip_clm_inc2014"
"preesrd5y_ip_clm_inc2015"
"preesrd5y_ip_clm_inc2016"
"preesrd5y_ip_clm_inc2017"
"preesrd5y_ip_clm_inc2011"
Utility: used by S3a_esrd_claims.R
File names and column types to input into S3a_esrd_claims.R to create the table for the home health (HH) pre-ESRD claims from 2011-2017. Pre-ESRD claims in USRDS are kept in files specific to the year. The order of these filenames is very important as the 2012 table needs to be created first in the function. The .csv files named here are produced in script S1a-convertSAStoCSV.R
Filenames:
"preesrd5y_hh_clm_inc2012"
"preesrd5y_hh_clm_inc2013"
"preesrd5y_hh_clm_inc2014"
"preesrd5y_hh_clm_inc2015"
"preesrd5y_hh_clm_inc2016"
"preesrd5y_hh_clm_inc2017"
"preesrd5y_hh_clm_inc2011"
Utility: used by S3a_esrd_claims.R
Filenames and column types to input into S3a_esrd_claims.R to create the table for the hospice (HS) pre-ESRD claims from 2011-2017. Pre-ESRD claims in USRDS are kept in files specific to the year. The order of these filenames is very important as the 2012 table needs to be created first in the function. The .csv files named here are produced in script S1a-convertSAStoCSV.R
Filenames:
"preesrd5y_hs_clm_inc2012"
"preesrd5y_hs_clm_inc2013"
"preesrd5y_hs_clm_inc2014"
"preesrd5y_hs_clm_inc2015"
"preesrd5y_hs_clm_inc2016"
"preesrd5y_hs_clm_inc2017"
"preesrd5y_hs_clm_inc2011"
Utility script: used by S3a_esrd_claims.R
Filenames and column types to input into S3a_esrd_claims.R to create the table for the outpatient (OP) pre-ESRD claims from 2011-2017. Pre-ESRD claims in USRDS are kept in files specific to the year. The order of these filenames is very important as the 2012 table needs to be created first in the function. The .csv files named here are produced in script S1a-convertSAStoCSV.R
Filenames:
"preesrd5y_op_clm_inc2012"
"preesrd5y_op_clm_inc2013"
"preesrd5y_op_clm_inc2014"
"preesrd5y_op_clm_inc2015"
"preesrd5y_op_clm_inc2016"
"preesrd5y_op_clm_inc2017"
"preesrd5y_op_clm_inc2011"
Utility: used by S3a_esrd_claims.R
Filenames and column types to input into S3a_esrd_claims.R to create the table for the skilled nursing (SN) pre-ESRD claims from 2011-2017. Pre-ESRD claims in USRDS are kept in files specific to the year. The order of these filenames is very important as the 2012 table needs to be created first in the function. The .csv files named here are produced in script S1a-convertSAStoCSV.R.
Filenames:
"preesrd5y_sn_clm_inc2012"
"preesrd5y_sn_clm_inc2013"
"preesrd5y_sn_clm_inc2014"
"preesrd5y_sn_clm_inc2015"
"preesrd5y_sn_clm_inc2016"
"preesrd5y_sn_clm_inc2017"
"preesrd5y_sn_clm_inc2011"
Utility: used by S3c_esrd_claims_pre_2011.R
Filenames and column types to input into S3c_esrd_claims_pre_2011.R to create the tables for the pre-ESRD claims from 2008-2010. Pre-2011 pre-ESRD claims in USRDS are kept in files specific to the year but not claim type. The .csv files named here are produced in script S1a-convertSAStoCSV.R.
Filenames:
inc2010
inc2009
inc2008
Used by the script S4a_pre_esrd_full.R
Utility file: The "setfieldtypes" utility is used in order to cast these column types explicitly, thereby avoiding auto assignment of "integer64" as the column data type.
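A minimal sketch of this pattern (hypothetical column names; the utility presumably defines the full myfieldtypes vector used in Step 9 of S4a_pre_esrd_full.R):
myfieldtypes = c(usrds_id = "numeric", claims_ip = "integer", stay_ip = "integer") #hypothetical subset of the full vector
dbWriteTable(con,
pre_esrd_tblname,
df,
field.types = myfieldtypes,
append = FALSE,
row.names = FALSE)
#declaring field.types explicitly prevents the driver from auto-assigning integer64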
All features included in the training dataset are documented in the data dictionary. Features fall into two main categories:
- Features taken directly from USRDS
- Features constructed from the data in USRDS
Additionally, all features in the training dataset are labeled as operational or not operational to identify and flag "nuisance variables" and to ensure that the ML models do not learn on noise.
Three ML algorithms were selected to provisionally test the training dataset: eXtreme gradient boosting (XGBoost), logistic regression (LR), and a neural network implementation--multilayer perceptron (MLP). Some of the general considerations for selecting an algorithm include characteristics of the training dataset (tabular data vs image data, number of features, etc.), algorithms that have performed well in a specific domain area (kidney disease/clinical use cases), and available computational resources (for example, deep learning algorithms require intense compute resources). The algorithms that were selected are a mixture of non-parametric (XGBoost) and parametric (LR and MLP) models.
- XGBoost is a popular implementation of gradient boosted decision trees because it performs especially well on tabular data; can be applied to a wide array of use cases, data types, and desired prediction outcomes (regression vs classification); and can handle missing values natively, which allows for a comparison between models run on non-imputed data and models run on imputed data.
- LR is a classic classification model that can be used to examine the association of (categorical or continuous) independent variables with one binary dependent variable. However, it requires that the input dataset have no missing values.
- MLP is a class of hierarchical artificial neural network (ANN) that consists of at least three layers of nodes (an input layer, a hidden layer, and an output layer) to carry out the process of machine learning. MLPs are used for tabular datasets and classification prediction problems.
The training dataset derived from the raw USRDS dataset was developed by building features relevant to the use case of predicting mortality in the first 90 days of dialysis. Each feature captures information known about a patient on or prior to the date of dialysis initiation. The final structure of the training dataset, which was used to train and test the ML models, consists of approximately 200 features and has one record per patient. Two sets of features were included in the training dataset: features taken directly from the USRDS datasets and features that were constructed. The training dataset with the full set of features was split into train and test at approximately a 70/30 ratio.
Each section below contains the code for pre-processing the data, hyperparameter tuning, final modeling, calibration, fairness assessment, and risk assessment for each model. All models were evaluated using the area under the receiver operating characteristic curve (AUC ROC) and the confusion matrix (true positives, true negatives, false positives, and false negatives) at thresholds from 0.1 to 0.5.
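For reference, a minimal sketch of how the confusion matrix and its derived metrics behave at a single threshold (toy vectors, not project results; the model scripts below compute these from aggregated counts):
threshold = 0.3
y_true = c(0, 1, 0, 1, 1) #observed died_in_90
y_score = c(0.10, 0.45, 0.28, 0.33, 0.05) #predicted probabilities
tp = sum(y_score >= threshold & y_true == 1)
fp = sum(y_score >= threshold & y_true == 0)
fn = sum(y_score < threshold & y_true == 1)
tn = sum(y_score < threshold & y_true == 0)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
f1_score = 2 * ppv * sensitivity / (ppv + sensitivity)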
Points to Consider
The 90-day mortality outcome was predicted using USRDS data available on or prior to the ESRD diagnosis for patients who progressed to ESRD. This means that the ML models predicted an outcome conditional on ESRD (i.e., the models are applicable only to those having ESRD). Future extensions of this work could merge USRDS data with EHR data to predict progression to ESRD, or could incorporate patient-centered features from EHR data to better predict mortality in the first 90 days after dialysis initiation.
All results for the non-imputed XGBoost model are located in the /roc_auc/ directory.
Environment
The environment used for the non-imputed XGBoost model was purchased on Amazon Web Services (AWS):
Name: m5.4xlarge
vCPU: 16
GPU: 0
Cores: 8
Threads per core: 2
Architecture: x86_64
Memory: 64 GB
Operating System: Linux (Ubuntu 20.04.1 Focal Fossa)
Network Performance: 10 GB or less
Zone: US govcloud west
All sections of code for the non-imputed XGBoost model take approximately 2 days to run.
The XGBoost models were created using R (version 3.6.3 (2020-02-29)) and the following libraries:
R library | Version |
---|---|
RPostgres | 1.3.1 |
DBI | 1.1.1 |
stringr | 1.4.0 |
haven | 2.4.0 |
readr | 1.4.0 |
lubridate | 1.7.9.2 |
dplyr | 1.0.4 |
magrittr | 1.5 |
tidyr | 1.1.2 |
sqldf | 0.4-11 |
RSQLite | 2.2.3 |
gsubfn | 0.7 |
proto | 1.0.0 |
readxl | 1.3.1 |
plyr | 1.8.6 |
skimr | 2.1.2 |
data.table | 1.14.0 |
mltools | 0.3.5 |
here | 1.0.1 |
rgenoud | 5.8-3.0 |
DiceKriging | 1.5.8 |
purrr | 0.3.4 |
mlrMBO | 1.1.5 |
mlr | 2.18.0 |
smoof | 1.6.0.2 |
checkmate | 2.0.0 |
ParamHelpers | 1.14 |
xgboost | 1.3.2.1 |
Matrix | 1.2-18 |
rBayesianOptimization | 1.1.0 |
rsample | 0.0.9 |
pROC | 1.17.0.1 |
openxlsx | 4.2.3 |
XGBoost can only handle numeric values as inputs to the model but can natively handle missing values, so all categorical variables were one-hot encoded into dummy variables that are binary indicators of each level of the categorical features (e.g., the sex feature is turned into 3 columns: sex_1 (male), sex_2 (female), sex_3 (unknown)).
Steps for running the 0_xgb_nonimputed_preprocess.R script
- Inputs:
- medxpreesrd
table from the Postgres database
- category_variables.R
- Outputs:
- universe.RData (data ready for modeling)
Step 1. Load the libraries
library(RPostgres) #Interface to PostgreSQL
library(DBI) #R database interface
library(dplyr)
library(tidyr)
library(skimr) #Summarizing databases
library(data.table)
library(mltools) #data.table and mltools are needed for the "one_hot" function
library(readr) #read rds
Step 2. Load the list of categorical variables
source(file.path("~","ONC_xgboost","category_variables.R"))
Step 3. Connect to the Postgres database and load the medxpreesrd
table as the variable universe
The credentials required to connect to the database should be inserted in the following snippet of code:
con <- dbConnect(
RPostgres::Postgres(),
dbname = '',
host = '',
port = '',
user = '',
password = '')
The data from the database is loaded into R as the variable universe
.
universe=dbGetQuery(
con,
"SELECT *
FROM medxpreesrd")
Step 4. Set numeric features as numeric type
num_vars = setdiff(names(universe) , categoryVars)
continuous_vars = c("height", "weight", "bmi", "sercr", "album", "gfr_epi", "heglb")
num_vars = setdiff(num_vars, continuous_vars)
for (cc in num_vars) {
universe[,cc]=as.numeric(universe[,cc])
}
Step 5. Separate categorical features from continuous and numeric to one-hot encode
for (c in categoryVars) {
universe[,c]=as.factor(universe[,c])
}
universe=data.table(universe)
universe=one_hot(as.data.table(universe), naCols=TRUE, dropUnusedLevels = TRUE)
Step 6. Save the pre-processed data
save(universe, file="universe.RData")
Points to consider
One-hot encoding the categorical variables is preferable to numeric encoding (casting categorical codes as numeric) because it does not impose an artificial ordering on the categories. However, one-hot encoding increases the number of variables in the training dataset, which increases run time.
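As a small illustration of what the one_hot call in the pre-processing script produces (toy data, not project data):
library(data.table)
library(mltools)
toy = data.table(usrds_id = 1:3, sex = factor(c(1, 2, NA)))
one_hot(toy, naCols = TRUE, dropUnusedLevels = TRUE)
#returns binary indicator columns such as sex_1, sex_2, and sex_NA alongside usrds_id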
Hyperparameters were tuned on the training partitions of the non-imputed dataset with Bayesian optimization and 5-fold cross-validation to identify the optimal hyperparameters for the model. Bayesian optimization is preferred over grid search or random search for hyperparameter tuning because it can find a set of hyperparameters that yields model performance comparable to what an exhaustive grid search would find. Exhaustive grid search can require testing tens of thousands of hyperparameter sets, which can be computationally infeasible or take an extraordinary amount of time. Because Bayesian optimization tests combinations in an informed approach, it is often able to find an optimal set of hyperparameters in only 50-100 iterations.
The best performing model was identified by selecting the hyperparameter combination with the highest AUC ROC.
Steps for running the 1_xgb_nonimputed_cv.R script
- Input:
- universe.RData
- Output:
- 2021_xgb_cv_results_nonimputed.RData
Step 1. Load the libraries
library(RPostgres)
library(DBI)
library(xgboost)
library(dplyr)
library(tidyr)
library(magrittr)
library(smoof)
library(mlrMBO) # for bayesian optimisation
library(skimr) # for summarizing databases
library(purrr) # to evaluate the loglikelihood of each parameter set in the random grid search
library(DiceKriging)
library(rgenoud)
library(here)
library(data.table)
library(mltools) #data.table and mltools are needed for "one_hot" function
library(readr) #read rds
Step 2. Load the one hot encoded data and keep only the training subsets (1-6) and separate dependent variable and non-feature columns used for identification or subsetting
load('universe.RData')
depvar = "died_in_90"
trainsubsets = c(0,1,2,3,4,5,6)
rhscols = setdiff(names(universe), c("usrds_id", "subset", "died_in_90"))
train_onc=universe %>% filter(subset %in% trainsubsets) %>% as.data.frame()
Step 3. Generate the list of indices for 5-fold cross validation
# creating 5 fold validation
cv_folds = rBayesianOptimization::KFold(train_onc[, depvar],
nfolds= 5,
stratified = TRUE,
seed= 0)
Step 4. Prepare the training dataset as a matrix
dtrain <-xgb.DMatrix(as.matrix(train_onc[, rhscols]), label = train_onc[, depvar])
Step 5. Define the parameters for hyperparameter tuning
The hyperparameters that were selected for tuning and the ranges that were tuned were:
Parameter | Range |
---|---|
Eta | 0.001 - 0.8 |
Gamma | 0 - 9 |
Lambda | 1 - 9 |
Alpha | 0 - 9 |
Max Depth | 2 - 10 |
Minimum Child Weight | 1 - 5 |
Number of Rounds | 10 - 500 |
Subsample | 0.2 - 1 |
Column Sample by Tree | 0.3 - 1 |
Maximum Bin | 255 - 1023 |
Additional information on these hyperparameters can be found at https://xgboost.readthedocs.io/en/latest/.
Parameters that were set include:
- scale_pos_weight set to sqrt(12) (approximately 3.5), which is the square root of the ratio of the negative class (survived the first 90 days of dialysis) to the positive class (died in the first 90 days of dialysis). This parameter handles the class imbalance by weighting the minority class (died in the first 90 days of dialysis); a sketch of where this value comes from follows this list.
- Number of Bayesian optimization iterations set to 100. Bayesian optimization will run through 100 iterations to identify the optimal hyperparameters.
- Early stopping rounds set to 15, evaluated using the highest AUC ROC. This setting ends model training if the AUC ROC has not improved in 15 rounds.
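A minimal sketch of where that weight comes from (assuming train_onc and depvar as defined in Step 2 above):
n_neg = sum(train_onc[, depvar] == 0) #survived the first 90 days of dialysis
n_pos = sum(train_onc[, depvar] == 1) #died in the first 90 days of dialysis
scale_pos_weight = sqrt(n_neg / n_pos) #the ratio is roughly 12 in this cohort, so the code below uses sqrt(12)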
# Tune parameters ---------------------------------------------------
obj.fun <- smoof::makeSingleObjectiveFunction(
name = "xgb_cv_bayes",
fn = function(x){
set.seed(12345)
cv <- xgb.cv(params = list(
booster = "gbtree",
scale_pos_weight = sqrt(12),
eta = x["eta"],
max_depth = x["max_depth"],
min_child_weight = x["min_child_weight"],
gamma = x["gamma"],
lambda = x["lambda"],
alpha = x["alpha"],
subsample = x["subsample"],
colsample_bytree = x["colsample_bytree"],
max_bin = x["max_bin"],
objective = 'binary:logistic',
eval_metric = "auc",
tree_method = "hist"),
data=dtrain,
nrounds = x["nround"],
folds = cv_folds,
prediction = FALSE,
showsd = TRUE,
early_stopping_rounds = 15,
verbose = 1)
cv$evaluation_log[, max(test_auc_mean)]
},
par.set = makeParamSet(
makeNumericParam("eta", lower = 0.001, upper = 0.8),
makeNumericParam("gamma", lower = 0, upper = 9),
makeNumericParam("lambda", lower = 1, upper = 9),
makeNumericParam("alpha", lower = 0, upper = 9),
makeIntegerParam("max_depth", lower = 2, upper = 10),
makeIntegerParam("min_child_weight", lower = 1, upper = 5),
makeIntegerParam("nround", lower = 10, upper = 500),
makeNumericParam("subsample", lower = 0.2, upper = 1),
makeNumericParam("colsample_bytree", lower = 0.3, upper = 1),
makeIntegerParam("max_bin", lower = 255, upper = 1023)
),
minimize = FALSE
)
des = generateDesign(n=length(getParamSet(obj.fun)$pars)+1, # the number of initial design points cannot equal the number of hyperparameters, so we add 1 to the total number of hyperparameters (a smaller design than the mlrMBO default also reduces computation time)
par.set = getParamSet(obj.fun),
fun = lhs::randomLHS) ## If no design is given by the user, mlrMBO generates a maximin Latin Hypercube Design of size 4 times the number of the black-box function's parameters.
control = makeMBOControl()
control = setMBOControlTermination(control, iters = 100) # number of Bayesian iterations
Step 6. Tune the hyperparameters with Bayesian optimization and 5-fold cross-validation on the training data
results = mbo(fun = obj.fun,
design = des,
control = control,
show.info = TRUE)
Step 7. Save the results to the output file
save(results, file = "2021_xgb_cv_results_nonimputed.RData")
Points to consider
-
Benchmark tests should be run on a fewer number of iterations to gauge the run-time per iteration. Hyperparameter tuning in a model with a large hyperparameter space, such as for gradient boosted decision trees, can be computationally and time intensive. This approach allows the user to estimate the time to completion for the hyperparameter tuning script.
-
Different AUC evaluation metrics can be chosen to determine the optimal set of hyperparameters, such as optimizing on precision-recall (PR) AUC or model calibration. The decision for the metric on which to optimize should be made in conjunction with clinical experts and depend on the goals of the model or study.
Steps for running the 2_xgb_nonimputed_final_model.R script
The final model is trained on the training subsets (0-6) of all 5 sets of non-imputed data using the optimal hyperparameters from hyperparameter tuning. The final model is evaluated on the testing subsets (7-9) of all 5 sets of non-imputed data using the ROC AUC as well as on the confusion matrix (true positives, false positives, true negatives, and false negatives) and associated model evaluation metrics (sensitivity, specificity, positive predictive value, positive likelihood ratio, and F1 score) at 0.1-0.5 thresholds.
- Input:
universe.RData
2021_xgb_cv_results_nonimputed.RData
- Output:
[date]_xgbResults_onehot_nonimp.csv
2021_xgb_nonimputed_feature_importance.RData
2021_xgb_nonimputed_y_proba.csv
2021_nonimputed_predictions.xlsx
2021_xgbResults_nonimputed.csv
Step 1. Load the libraries
library(xgboost)
library(sqldf)
library(dplyr)
library(tidyr)
library(magrittr)
library(smoof)
library(mlrMBO) # for bayesian optimisation
library(skimr) # for summarising databases
library(purrr) # to evaluate the loglikelihood of each parameter set in the random grid search
library(DiceKriging)
library(rgenoud)
library(here)
library(data.table)
library(mltools) #data.table and mltools are needed for "one_hot" function
library(readr) #read rds
Step 2. Load the one-hot encoded data and split the training subsets (0-6) from the test subsets (7-9) and separate dependent variable and non-feature columns used for identification or subsetting.
load("universe.RData")
depvar = "died_in_90"
trainsubsets = c(0,1,2,3,4,5,6)
testsubsets = c(7,8,9)
rhscols = setdiff(names(universe), c("usrds_id", "subset", "died_in_90"))
train_onc=universe %>% filter(subset %in% trainsubsets) %>% as.data.frame()
train_onc = train_onc[order(train_onc$usrds_id),]
test_onc=universe %>% filter(subset %in% testsubsets) %>% as.data.frame()
test_onc = test_onc[order(test_onc$usrds_id),]
Step 3. Set the train and test datasets in the matrix format
dtrain <-xgb.DMatrix(as.matrix(train_onc[, rhscols]), label = train_onc[, depvar])
dtest <-xgb.DMatrix(as.matrix(test_onc[, rhscols]), label = test_onc[, depvar])
Step 4. Set the seed and load in the optimal hyperparameters identified during hyperparameter tuning
set.seed(297)
load("./roc_auc/2021_xgb_cv_results_nonimputed.RData")
xeta= results$x[['eta']]
xgamma= results$x[['gamma']]
xlambda= results$x[['lambda']]
xalpha= results$x[['alpha']]
xmax_depth= results$x[['max_depth']]
xmin_child_weight= results$x[['min_child_weight']]
xnround=results$x[['nround']]
xsubsample= results$x[['subsample']]
xcolsample_bytree= results$x[['colsample_bytree']]
xmax_bin=results$x[['max_bin']]
scenarios = as.data.frame(
rbind(
c(xalpha, xcolsample_bytree, xeta, xgamma, xlambda, xmax_bin, xmax_depth, xmin_child_weight, xnround, xsubsample)
))
names(scenarios)=c("alpha","colsample_bytree","eta","gamma","lambda","max_bin","max_depth",
"min_child_weight","rounds","subsample")
scenarios$inx = 1:dim(scenarios)[1]
watchlist <- list(eval = dtest, train = dtrain)
attr(dtrain, 'label') <- getinfo(dtrain, 'label')
dy = NULL
for (i in scenarios$inx) {
s = scenarios[scenarios$inx == i, ]
param <-
list(
max_depth = s$max_depth,
eta = s$eta,
nthread = 16,
verbosity = 0,
gamma = s$gamma,
lambda = s$lambda,
alpha = s$alpha,
maximize = TRUE,
tree_method = "hist",
max_bin = s$max_bin,
min_child_weight=s$min_child_weight,
eval_metric = "auc",
colsample_bytree=s$colsample_bytree,
subsample=s$subsample,
scale_pos_weight=sqrt(12),
objective = "binary:logistic"
)
Step 5. Set the seed and run the final non-imputed XGBoost model
set.seed(297)
starttime = proc.time()[3]
fit <-
xgb.train(
param,
dtrain,
s$rounds,
# nthread=16,
watchlist,
maximize = TRUE,
early_stopping_rounds = 15,
verbose = 1
)
Step 6: Obtain feature importance and save the file
feature_imp = xgb.importance(fit$feature_names,
model = fit)
save(feature_imp, file = "./roc_auc/2021_xgb_nonimputed_feature_importance.RData")
Step 7. Save the predictions
dx = as.data.frame(cbind(predict(fit, newdata = dtest), as.vector(getinfo(dtest, "label"))))
names(dx)[1:2] = c("score", "y")
dx$usrds_id = test_onc$usrds_id
write.csv(dx,file="./roc_auc/2021_xgb_nonimputed_y_proba.csv")
openxlsx::write.xlsx(as.data.frame(dx), file = "./roc_auc/2021_nonimputed_predictions.xlsx",
sheetName='Sheet1', row.names=FALSE,showNA = F)
Step 8. Calculate the confusion matrix by each threshold value
outdata = as.data.frame(seq(0, .99, .01))
names(outdata) = "bin"
above_thresh = sqldf(
"select a.bin as threshold, sum(b.y) as tp, count(b.y) as detections
from outdata a
left join dx b on a.bin<=b.score
group by a.bin
order by a.bin desc"
)
below_thresh = sqldf(
"select a.bin as threshold, sum(b.y) as fn, count(b.y) as nondetections
from outdata a
left join dx b on a.bin>b.score
group by a.bin
order by a.bin desc"
)
perfdata = above_thresh %>% left_join(below_thresh, by = c("threshold"))
perfdata$tp = replace_na(perfdata$tp, 0)
perfdata$fn = replace_na(perfdata$fn, 0)
perfdata = perfdata %>% mutate(
fp = detections - tp,
tn = nondetections - fn,
sensitivity = tp / (tp + fn),
specificity = tn / (fp + tn),
fpr = 1 - specificity,
tpr = sensitivity,
LR = sensitivity / (1 - specificity),
ppv = tp / detections,
npv = tn / (tn + fn)
)
perfdata$iter = i
durationinsecs = proc.time()[3] - starttime #elapsed seconds since starttime was set in Step 5
perfdata$durationinsecs = durationinsecs
perfdata$auc_xgb_test = max(fit$evaluation_log$eval_auc)
perfdata$auc_xgb_train = max(fit$evaluation_log$train_auc)
dy = as.data.frame(rbind(dy, perfdata))
print(paste0("Finished iteration ", i, " auc_xgb_test: ", max(perfdata$auc_xgb_test), " Duration ", durationinsecs))
}
dy = dy %>% mutate(
accuracy = (tp + tn) / (tp + tn + fp + fn),
f1_score = 2 * ppv * sensitivity / (ppv + sensitivity)
)
Step 9. Save the results file
write.csv(dy,file="./roc_auc/2021_xgbResults_nonimputed.csv")
The calibration curve shows the reliability of the model by each prediction score category, the number of patients that fall into each category, and the proportion of patients in each category who actually died in the first 90 days following dialysis initiation.
Steps for running the 3_xgb_nonimputed_calibration.ipynb script
- Input:
2021_xgb_nonimputed_y_proba.csv
medxpreesrd
- Output:
xgb_nonimputed_calibrated.pickle
y_calibrated_xgb_nonimputed.pickle
Step 1. Import libraries
import pandas as pd
import numpy as np
import pickle
import sys
#path to the functions file
sys.path.append('../../onc_functions/')
# import custom functions
from plot_functions import onc_plot_calibration_curve
from calibrate_onc import calibrate_onc
#connect to the postgres database
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
con = create_engine('postgresql://username:password@location/dbname')
Step 2. Load results from the XGBoost non-imputed model
pred_df = pd.read_csv('./roc_auc/2021_xgb_nonimputed_y_proba.csv')
Step 3. Plot the calibration of the original model. This function onc_plot_calibration_curve is located in the /onc_functions/plot_functions.py file.
# imports needed by this function (presumably provided at the top of plot_functions.py)
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def onc_plot_calibration_curve(y_true, y_proba, label, filename):
    # calculate numbers to plot
    clf_score = brier_score_loss(y_true, y_proba, pos_label=1)
    fraction_of_positives, mean_predicted_value = \
        calibration_curve(y_true, y_proba, n_bins=10)
    # set up plot
    fig1 = plt.figure(1, figsize=(10,10))#,dpi=400)
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))
    # plot the reference for a perfectly calibrated model
    ax1.plot([0, 1], [0, 1], "k:", label="Reference Line")
    # plot the calibration curve
    ax1.plot(mean_predicted_value, fraction_of_positives, "ks-",
             label=label)
    # plot histogram of predicted values
    ax2.hist(y_proba, range=(0, 1), bins=10, label=label,
             histtype="step", lw=2)
    # set axes and other figure parameters
    ax1.set_ylabel("Observed Event Rate")
    ax1.set_xlabel("Predicted Event Rate")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc="lower right")
    ax2.set_xlabel("Mean predicted value")
    ax2.set_ylabel("Count")
    ax2.legend(loc="upper right", ncol=1)
    plt.rc('axes', labelsize=22)   # fontsize of the x and y labels
    plt.rc('xtick', labelsize=15)  # fontsize of the tick labels
    plt.rc('ytick', labelsize=15)  # fontsize of the tick labels
    plt.rc('legend', fontsize=20)  # legend fontsize
    # save figure at the specified resolution
    plt.savefig(filename + ".png", dpi=400, transparent=True)
    plt.show()
Run the function above.
onc_plot_calibration_curve(
y_true=data.y,
y_proba=data.score,
label='XGBoost non-imputed',
filename='./roc_auc/xgb_nonimputed_orig_calibration')
The XGBoost model can be calibrated by training an isotonic regression on a portion of the testing set. (Model calibration is performed as probabilities of death in the first 90 days are more informative and useful for clinicians than a simple binary prediction. In order to produce valid probability estimates, predicted events rates should track observed rates across the full range of predicted risk.)
Step 4. Load the subset for each ID:
df = pd.read_sql_query('''
SELECT usrds_id, subset FROM medxpreesrd;''', con)
Merge the subset details with the predictions.
data = pd.merge(pred_df, df, how="left", on="usrds_id")
The next steps are inside function calibrate_onc located in the /onc_functions/calibrate_onc.py file.
Step 5. Split the predictions from the test set (how we evaluated the model) into a test/train for the calibration (isotonic regression classifier). Split test data (subsets 7-9) into new train (7-8)/test (9) sets
calibration_train_set = data[((data.subset==7)|(data.subset==8))].copy()
calibration_test_set = data[data.subset==9].copy()
Step 6. Define the calibration model
ir = IsotonicRegression(out_of_bounds="clip")
Step 7. Fit the model to the XGBoost predictions from the (new) training set
ir.fit(calibration_train_set.score, calibration_train_set.y)
Step 8. Evaluate the model using the (new) test set
p_calibrated = ir.transform(calibration_test_set.score)
calibration_test_set['p_calibrated'] = p_calibrated
Step 9. Save
with open(path + 'model_calibrated_' + model_name + '.pickle', 'wb') as picklefile:
pickle.dump(ir,picklefile)
with open(path + 'y_calibrated_' + model_name + '.pickle', 'wb') as picklefile:
pickle.dump(calibration_test_set, picklefile)
Step 10. Print the scores from the original and calibrated model. The function print_calibrated_results is found in the /onc_functions/calibrate_onc.py file.
def print_calibrated_results(y_true, y_pred, y_calibrated):
'''print scores for pre and post calibration'''
acc = accuracy_score(y_true, np.round(y_pred))
acc_calibrated = accuracy_score(y_true, np.round(y_calibrated ))
print ("accuracy - original/calibrated:", acc, "/", acc_calibrated)
auc = roc_auc_score(y_true, y_pred)
auc_calibrated = roc_auc_score(y_true, y_calibrated)
print ("ROC AUC - original/calibrated: ", auc, "/", auc_calibrated)
pr = average_precision_score(y_true, y_pred)
pr_calibrated = average_precision_score(y_true, y_calibrated )
print ("avg precision - original/calibrated:", pr, "/", pr_calibrated)
clf_score = brier_score_loss(y_true, y_calibrated, pos_label=1)
print("\tBrier: %1.3f" % (clf_score))
Run these 2 calibration functions
calibrated_results = calibrate_onc(data, path='./roc_auc/', model_name='xgb_nonimputed')
Steps for running the 4_xgb_nonimputed_calibrated_plots.ipynb script
- Input:
y_calibrated_xgb_nonimputed.pickle
- Output:
xgb_nonimputed_calibration.png
xgb_nonimputed_mortality_bar.png
xgb_nonimputed_roc_auc_bw.png
2021_xgb_nonimputed_calibrated_confusion_matrix.csv
Step 1. Import libraries
import pandas as pd
import numpy as np
import pickle
import sys
#add the absolute path to the onc_functions directory
sys.path.append('../../onc_functions/')
#import custom plotting functions
from plot_functions import (onc_plot_calibration_curve,
onc_calc_cm,
onc_plot_roc,
onc_plot_precision_recall,
onc_plot_risk,
onc_plot_roc_no_threshold)
Step 2. Load results from the calibrated model
with open('./roc_auc/y_calibrated_xgb_nonimputed.pickle', 'rb') as picklefile:
calibrated_results = pickle.load(picklefile)
Step 3. Plot the calibration curve of the calibrated model using the same onc_plot_calibration_curve function from /onc_functions/plot_functions.py
onc_plot_calibration_curve(
y_true=calibrated_results.y,
y_proba=calibrated_results.p_calibrated,
label='XGBoost_non-imputed calibrated',
filename='./roc_auc/xgb_nonimputed_calibrated')
Step 4. Plot the Risk of the calibrated model. This function onc_plot_risk is located and imported from /onc_functions/plot_functions.py
def onc_plot_risk(y_true, y_proba, label, filename):
# calculate values for plot
fraction_of_positives, mean_predicted_value = \
calibration_curve(y_true, y_proba, n_bins=10)
# set up figure params
fig1 = plt.figure(1, figsize=(12,30),dpi=400)
ax1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2)
# bar plot
xs = np.arange(len(fraction_of_positives))
ax1.bar(xs, mean_predicted_value, color='k', width = 0.25, label=label)
ax1.bar(xs+.25, fraction_of_positives, color='gray', width = 0.25, label='Observed')
#more figure settings
plt.xticks(xs, np.arange(1, len(xs)+1, 1))
ax1.set_ylabel("Mortality Rate")
ax1.set_xlabel("Decile of Predicted Mortality Risk")
ax1.legend(loc="upper left")
plt.rc('axes', labelsize=22) # fontsize of the x and y labels
plt.rc('xtick', labelsize=15) # fontsize of the tick labels
plt.rc('ytick', labelsize=15) # fontsize of the tick labels
plt.rc('legend', fontsize=20) # legend fontsize
#save plot
plt.savefig(filename + ".png", dpi=400, transparent=True)
onc_plot_risk(
    calibrated_results.y,
    calibrated_results.p_calibrated,
    label='Predicted (XGBoost Non-Imputed)',
    filename='./roc_auc/xgb_nonimputed_mortality_bar')
Step 5. Plot the ROC AUC of the calibrated model. This function onc_plot_roc is located and imported from /onc_functions/plot_functions.py
def onc_plot_roc(y_true, y_pred, model_name, **kwargs):
'''
Plot the ROC AUC and return the test ROC AUC results.
INPUT: y_true, y_pred, model_name, **kwargs
'''
#calc values for plot
false_positives, true_positives, threshold = roc_curve(y_true, y_pred)
c_roc_auc_score = auc(false_positives, true_positives)
#set figure params
fig1 = plt.figure(1, figsize=(12,30),dpi=400)
ax1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2)
#plot reference line for chance
ax1.plot([0, 1], [0, 1], linestyle='--', lw=2, color='gray',
label='Chance', alpha=.8)
# plot AUC ROC
ax1.plot(false_positives, true_positives,
label=r'ROC (AUC = %0.3f)' % (c_roc_auc_score),
lw=2, alpha=.8, color = 'k')
# additional figure params
ax1.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],)
ax1.legend(loc="lower right")
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.rc('axes', labelsize=22) # fontsize of the x and y labels
plt.rc('xtick', labelsize=15) # fontsize of the tick labels
plt.rc('ytick', labelsize=15) # fontsize of the tick labels
plt.rc('legend', fontsize=20) # legend fontsize
# save plot
plt.savefig(model_name + "_calibrated_roc_auc_bw.png", dpi=400, transparent=True)
plt.show()
onc_plot_roc(
calibrated_results.y,
calibrated_results.p_calibrated,
model_name='xgb_nonimputed');
Step 6. Save the performance metrics at multiple thresholds. The following function is imported from /onc_functions/plot_functions.py
def onc_calc_cm(y_true, y_predictions, range_probas=[0.1,0.5]):
'''
Plot the confusion matrix and scores for multiple thresholds
'''
df = pd.DataFrame(index = range_probas,
columns=['threshold','sensitivity','specificity',
'likelihood_ratio_neg','likelihood_ratio_pos',
'tp','fp','tn','fn','total_survived','total_deceased',])
for proba_threshold in range_probas:
cm = confusion_matrix(y_true, y_predictions > proba_threshold)
tn = cm[0][0]
fp = cm[0][1]
sensitivity = recall_score(y_true, y_predictions > proba_threshold)
specificity = tn / (tn + fp)
df.loc[proba_threshold, "threshold"] = proba_threshold
df.loc[proba_threshold,"sensitivity"] = sensitivity
df.loc[proba_threshold, "specificity"] = specificity
df.loc[proba_threshold, "likelihood_ratio_neg"] = (1-sensitivity)/specificity
df.loc[proba_threshold, "likelihood_ratio_pos"] = sensitivity/(1-specificity)
df.loc[proba_threshold, "tp"] = cm[1][1]
df.loc[proba_threshold, "fp"] = fp
df.loc[proba_threshold, "tn"] = tn
df.loc[proba_threshold, "fn"] = cm[1][0]
df.loc[proba_threshold, "total_survived"] = np.sum(cm[0])
df.loc[proba_threshold, "total_deceased"] = np.sum(cm[1])
return df
cm = onc_calc_cm(
calibrated_results.y,
calibrated_results.p_calibrated,
range_probas=[.10,.20, .30, .40, .50])
cm.to_csv('./roc_auc/2021_xgb_nonimputed_calibrated_confusion_matrix.csv')
Steps for running the 5_xgb_fairness_assess_get_data.ipynb script
Get the columns of data required to compute fairness assessment and save the file.
- inc_age = age
- sex
- dialtyp = type of dialysis
- race
- hispanic
- Input:
`medxpreesrd` table from Postgres
- Output:
complete_fairness_data.pickle
Step 1. Import the libraries
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import sys
import pickle
Step 2. Connect to the Postgres database
The credentials required to connect to the database should be inserted in the code snippet below:
con = create_engine('postgresql://username:password@location/dbname')
Step 3. Import the columns required for the fairness assessment from the database
df = pd.read_sql_query('''SELECT usrds_id, died_in_90, inc_age, sex, dialtyp, race, hispanic, subset FROM medxpreesrd;''', con)
Step 4. Save the file
with open('complete_fairness_data.pickle', 'wb') as picklefile:
pickle.dump(df, picklefile)
ML models can perform differently for different categories of patients, so the non-imputed XGBoost model was assessed for fairness, or how well the model performs for each category of interest (demographics—sex, race, and age—as well as initial dialysis modality). Age was binned into the following categories based on clinician input and an example in the literature: 18-25, 26-35, 36-45, 46-55, 56-65, 66-75, 76-85, 86+. The USRDS predefined categories for race, sex, and dialysis modality were used for the fairness assessment.
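As a quick illustration of this binning (not part of the project code; the ages below are made up), pd.cut with the bin edges used in the fairness assessment maps raw ages onto the eight categories:
import pandas as pd
# illustrative ages only; the bin edges mirror those used in the fairness assessment
ages = pd.Series([19, 27, 44, 58, 70, 83, 88])
agegroup = pd.cut(ages,
                  bins=[17, 25, 35, 45, 55, 65, 75, 85, 90],
                  labels=[1, 2, 3, 4, 5, 6, 7, 8])
print(agegroup.tolist())  # [1, 2, 3, 5, 6, 7, 8]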
Steps for running the 6_xgb_nonimputed_fairness.ipynb script
Calculations for specific groups of patients to assess the fairness of the final model for all patients in the test subsets. Note: Fairness assessment is run on the non-calibrated model results.
- Input:
complete_fairness_data.pickle
2021_xgb_nonimputed_y_proba.csv
- Output:
2021_xgb_nonimputed_fairness.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y%m%d")
# import custom functions
import sys
#path to the functions directory
sys.path.append('../../onc_functions/')
from fairness import get_fairness_assessment
Step 2. Write the function that calculates AUC and the confusion matrix from the model prediction scores. This function is located and imported from the /onc_functions/fairness.py file.
def get_fairness_assessment(df, y_proba_col_name, y_true_col_name):
#turn the continuous age variable into age categories
df['agegroup'] = pd.cut(df.inc_age,
bins=[17, 25, 35, 45, 55, 65, 75, 85, 90],
labels=[1, 2, 3, 4, 5, 6, 7, 8])
df = df.drop(columns=['inc_age'])
#replace NaNs with a large number that does not appear in the data, effectively creating another category for missing values
df.loc[:,['race','dialtyp','hispanic']] = df.loc[:,['race','dialtyp','hispanic']].fillna(100.0, axis=1).copy()
#Identify the cols for the fairness assessment
fairness_cols = ['agegroup', 'sex','dialtyp', 'race','hispanic']
#loop through all categories and values to get counts, auc, and confusion matrix
rows_list = []
for col in fairness_cols:
for name, c in df.groupby(col):
fairness_dict = {}
fairness_dict['Feature'] = col
fairness_dict['Value'] = name
fairness_dict['Count'] = c.shape[0]
fairness_dict['AUC'] = roc_auc_score(c[y_true_col_name], c[y_proba_col_name])
tn, fp, fn, tp = confusion_matrix(y_true = c[y_true_col_name],
y_pred = np.where(c[y_proba_col_name] >= 0.5, 1, 0)).ravel()
fairness_dict['TN'] = tn
fairness_dict['FP'] = fp
fairness_dict['FN'] = fn
fairness_dict['TP'] = tp
rows_list.append(fairness_dict)
#convert results from a list to a dataframe
df_fairness = pd.DataFrame(rows_list)
return df_fairness
Step 3. Load results from the model and fairness details
pred_df = pd.read_csv('./roc_auc/2021_xgb_nonimputed_y_proba.csv')
with open('../complete_fairness_data.pickle', 'rb') as f:
dataset = pickle.load(f)
# merge model results with fairness details
data = pred_df.merge(dataset, how='left', on=['usrds_id'])
Step 4. Calculate fairness assessment
fairness = get_fairness_assessment(data,
y_proba_col_name='score',
y_true_col_name='died_in_90')
Step 5. Save results
fairness.to_csv('./roc_results/' + str(dte) + '_xgb_nonimputed_fairness.csv')
Points to consider
Performing the fairness assessment on the categories of interest gives additional insight into how the model performs for different patient categories of interest (by demographics, etc.). Future researchers should perform fairness assessments to better evaluate model performance, especially for models that may be deployed in a clinical setting. Other methods of assessing fairness include evaluating true positives, sensitivity, positive predictive value, etc. at various thresholds across the different groups of interest, which would allow selection of a threshold that balances model performance across the groups of interest.
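One possible sketch of that kind of threshold-by-group comparison is shown below. It is illustrative rather than part of the project code, and assumes a dataframe with a group column, a true-label column, and a predicted-probability column (names are placeholders):
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

def group_threshold_metrics(df, group_col, y_true_col, y_proba_col,
                            thresholds=(0.1, 0.3, 0.5)):
    '''Sensitivity and positive predictive value per group at several thresholds (sketch).'''
    rows = []
    for name, g in df.groupby(group_col):
        for t in thresholds:
            # force a 2x2 matrix even if a group is missing a class
            tn, fp, fn, tp = confusion_matrix(
                g[y_true_col], (g[y_proba_col] >= t).astype(int),
                labels=[0, 1]).ravel()
            rows.append({'group': name, 'threshold': t,
                         'sensitivity': tp / (tp + fn) if (tp + fn) else np.nan,
                         'ppv': tp / (tp + fp) if (tp + fp) else np.nan})
    return pd.DataFrame(rows)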
Steps for running the 7_xgb_nonimputed_risk_categories.ipynb script
Note: Risk assessment is run on the non-calibrated model results
- Input:
complete_fairness_data.pickle
2021_xgb_nonimputed_y_proba.csv
- Output:
2021_xgb_nonimputed_risk_cat.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import sys
#path to the functions directory
sys.path.append('../../onc_functions/')
from risk import get_risk_categories
print('python-' + sys.version)
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y%m%d")
Step 2. Import the details from the fairness assessment
with open('../complete_fairness_data.pickle', 'rb') as f:
dataset = pickle.load(f)
Step 3. Import the pooled results from the model
pred_df = pd.read_csv('./roc_auc/2021_xgb_nonimputed_y_proba.csv')
Step 4. Merge the details with the results
data = pred_df.merge(dataset, on=['usrds_id'])
Step 5. Calculate risk. The function get_risk_categories is imported from the /onc_functions/risk.py file.
def get_risk_categories(dataset, y_proba_col_name, y_true_col_name):
test_x_pd = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id')
del dataset
df = test_x_pd.loc[:,[y_true_col_name,y_proba_col_name]]
#construct the risk categories from the predicted score
df['risk_categories'] = pd.cut(df[y_proba_col_name],
bins=[-0.1, 0.09, 0.19, 0.29, 0.39, 0.49, 0.59, 0.69, 0.79, 0.89, 0.99],
labels=['0-0.09', '0.1-0.19', '0.2-0.29', '0.3-0.39', '0.4-0.49',
'0.5-0.59','0.6-0.69','0.7-0.79','0.8-0.89','0.9-0.99'])
#loop through all the categories to get the predicted score
risk_list = []
for name, c in df.groupby('risk_categories'):
risk_dict = {}
risk_dict['Risk Category'] = name
risk_dict['Count'] = c[y_true_col_name].shape[0]
risk_dict['Count Died in 90'] = c[y_true_col_name].sum()
risk_dict['Count Survived'] = c[y_true_col_name].shape[0]-c[y_true_col_name].sum()
risk_dict['Percent Died in 90'] = c[y_true_col_name].sum()/c[y_true_col_name].shape[0]
risk_list.append(risk_dict)
df_risk = pd.DataFrame(risk_list)
return df_risk
Run the function above
risk_cat = get_risk_categories(data,
y_proba_col_name='score',
y_true_col_name='died_in_90')
Step 6. Save
risk_cat.to_csv('./results/' + str(dte) + '_xgb_nonimputed_risk_cat.csv')
All results for the imputed XGBoost model are located in the /roc_results/ directory.
Environment
The environment used for the Imputed XGBoost model was purchased on Amazon Web Services (AWS):
Name: m5.12xlarge
vCPU: 48
GPU: 0
Cores: 24
Threads per core: 2
Architecture: x86_64
Memory: 192 GB
Operating System: Linux (Ubuntu 20.04 Focal Fossa)
Network Performance: 10 Gbps
Zone: US govcloud west
Points to consider
Hyperparameter tuning for the imputed XGBoost model required an instance with more than 65 GB of memory. When provisioning additional memory, we also increased the number of cores, which helped improve computation time. The model uses parallel processing across all available cores/CPUs. Running the entire code for the imputed XGBoost model took approximately 5 days.
The data pre-processing for the imputed XGBoost model is similar to the steps for the non-imputed XGBoost model: all categorical variables were one-hot encoded into dummy variables that are binary indicators of each factor in the categorical features (e.g., the sex feature is turned into 3 columns: sex_1 (male), sex_2 (female), sex_3 (unknown)). An additional step is required to read in the 5 imputed datasets (micecomplete_pmm) and merge them with the medxpreesrd data.
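As a small illustration of that dummy-variable expansion (shown here in pandas for brevity; the project performs this step in R with the mltools one_hot function), a categorical sex column with values 1/2/3 becomes three binary indicator columns:
import pandas as pd
# toy data; column values are illustrative only
df = pd.DataFrame({'sex': pd.Categorical([1, 2, 2, 3, 1])})
dummies = pd.get_dummies(df, columns=['sex'], prefix='sex')
print(dummies.columns.tolist())  # ['sex_1', 'sex_2', 'sex_3']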
Steps for running the 0_xgb_imputed_preprocess.R script
- Load the medxpreesrd table from Postgres and the imputed data micecomplete_pmm.
- Merge to create our 5 datasets. Left join medxpreesrd and the first set of imputations, keeping imputed cols from imp1, not medxpreesrd.
- Categorical features get one-hot encoded.
- Input:
medxpreesrd and micecomplete_pmm tables from Postgres
- Output: universe.RData (data ready for modeling)
Step 1. Load the libraries
library(RPostgres) #Interface to PostgreSQL
library(DBI) #R database interface
library(dplyr)
library(tidyr)
library(skimr) #Summarizing databases
library(data.table)
library(mltools) #data.table and mltools are needed for the "one_hot" function
library(readr) #read rds
Step 2. Load the list of categorical variables
source(file.path("~","ONC_xgboost","category_variables.R"))
Step 3. Connect to the Postgres database and load the medxpreesrd table as the variable universe.
The credentials required to connect to the database should be inserted in the code snippet below:
con <- dbConnect(
RPostgres::Postgres(),
dbname = '',
host = '',
port = '',
user = '',
password = '')
The data from the database is loaded into R as the variable universe.
universe=dbGetQuery(
con,
"SELECT *
FROM medxpreesrd")
Step 4. Load the micecomplete_pmm table as the variable imputations_pmm.
imputations_pmm = dbGetQuery(
con,
"
SELECT *, row_number() OVER(PARTITION BY usrds_id) AS impnum
FROM micecomplete_pmm
")
Step 5. Left join the medxpreesrd and imputations_pmm tables and keep only the imputed columns from imputations_pmm.
universe = left_join(
universe %>%
select(-c("height", "weight", "bmi", "sercr", "album", "gfr_epi", "heglb", "cdtype")),
imputations_pmm,
by = c("usrds_id", "subset")
)
Step 6. Set numeric features as numeric type
num_vars = setdiff(names(universe) , categoryVars)
continuous_vars = c("height", "weight", "bmi", "sercr", "album", "gfr_epi", "heglb")
num_vars = setdiff(num_vars, continuous_vars)
for (cc in num_vars) {
universe[,cc]=as.numeric(universe[,cc])
}
Step 7. Separate categorical features from continuous and numeric to one-hot encode
for (c in categoryVars) {
universe[,c]=as.factor(universe[,c])
}
universe=data.table(universe)
universe=one_hot(as.data.table(universe), naCols=TRUE, dropUnusedLevels = TRUE)
Step 8. Save the pre-processed data
save(universe, file="universe.RData")
Points to consider
One-hot encoding the categorical variables is preferable to numeric encoding (casting categorical codes as numeric) because it does not impose an artificial ordering on unordered categories. However, one-hot encoding increases the number of variables in the training dataset, which increases run time; features with more than 5 categories are typically not one-hot encoded for this reason.
This section tunes the hyperparameters using Bayesian optimization and 5-fold cross validation on each imputed dataset and returns the optimal hyperparameters for each imputed dataset.
Steps for running the 1_xgb_imputed_get_hyperparams.R script
This file will run 100 Bayesian models that will:
- Result in a new range of hyperparameters.
- Set the dependent variable to died_in_90.
- Drop non-feature and dependent variable cols ("usrds_id", "subset","died_in_90")
- Create the train and test sets based on the following "subset" values
trainsubsets = c(0,1,2,3,4,5,6)
- Input:
universe.RData
- Output:
[date]_xgb_results_imputed_1.RData
Step 1. Load the libraries
library(xgboost)
library(dplyr)
library(tidyr)
library(magrittr)
library(smoof)
library(mlrMBO) # for bayesian optimisation
library(skimr) # for summarising databases
library(purrr) # to evaluate the loglikelihood of each parameter set in the random grid search
library(DiceKriging)
library(rgenoud)
library(data.table)
library(mltools) #data.table and mltools are needed for "one_hot" function
library(readr) #read rds
library(rBayesianOptimization)
library(Matrix)
Step 2. Load the one-hot encoded data and keep only the training subsets (0-6).
load('universe.RData')
depvar = "died_in_90"
trainsubsets = c(0,1,2,3,4,5,6)
Step 3. Initiate a list to hold the hyperparameter tuning results from each of the 5 imputed datasets
model_results <-list()
Step 4. Loop through each imputation
for(i in 1:5){
----
}
For each imputation perform the following steps:
Step 5. Set the training dataset
train=universe %>%
filter(subset %in% trainsubsets, impnum == i) %>% as.data.frame()
Step 6. Generate the list of indices for 5-fold cross validation
cv_folds = rBayesianOptimization::KFold(train[, depvar], # creating 5 fold validation
nfolds= 5,
stratified = TRUE,
seed = 0)
Step 7. Set the training dataset as numeric and prepare the training dataset as a matrix
train[] <- lapply(train, as.numeric) #force to numeric columns
options(na.action='na.pass')
trainm <- sparse.model.matrix(died_in_90 ~ ., data = train[, rhscols])
dtrain <- xgb.DMatrix(data = trainm, label=train[, depvar])
Step 8. Define the parameters for hyperparameter tuning
The hyperparameters that were selected for tuning and the ranges that were tuned were:
Parameter | Range |
---|---|
Eta | 0.001 - 0.8 |
Gamma | 0 - 9 |
Lambda | 1 - 9 |
Alpha | 0 - 9 |
Max Depth | 2 - 10 |
Minimum Child Weight | 1 - 5 |
Number of Rounds | 10 - 500 |
Subsample | 0.2 - 1 |
Column Sample by Tree | 0.3 - 1 |
Maximum Bin | 255 - 1023 |
Additional hyperparameters can be found at https://xgboost.readthedocs.io/en/latest/.
Parameters that were set include:
- Scale_pos_weight as sqrt(12) ≈ 3.5, the square root of the ratio of the negative class (survived the first 90 days of dialysis) to the positive class (died in the first 90 days of dialysis); see the short sketch after this list. This parameter handles the class imbalance by weighting the minority class (died in the first 90 days of dialysis).
- Number of iterations as 100. Bayesian optimization will run through 100 iterations to identify the optimal hyperparameters.
- Early stopping rounds to 15, as evaluated using the highest ROC AUC. This parameter ends model training if the ROC AUC has not increased in 15 iterations.
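The short sketch below shows how such a weight could be computed from class counts; the counts are made up for illustration only (the R script that follows simply hard-codes sqrt(12)):
import numpy as np
# illustrative counts: roughly 12 survivors for every death in the first 90 days
n_negative = 120000  # survived the first 90 days (assumed count, for illustration)
n_positive = 10000   # died in the first 90 days (assumed count, for illustration)
scale_pos_weight = np.sqrt(n_negative / n_positive)
print(round(scale_pos_weight, 2))  # ~3.46, i.e. sqrt(12)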
# Tune parameters ---------------------------------------------------
obj.fun <- smoof::makeSingleObjectiveFunction(
name = "xgb_cv_bayes",
fn = function(x){
set.seed(12345)
cv <- xgb.cv(params = list(
booster = "gbtree",
scale_pos_weight = sqrt(12),
eta = x["eta"],
max_depth = x["max_depth"],
min_child_weight = x["min_child_weight"],
gamma = x["gamma"],
lambda = x["lambda"],
alpha = x["alpha"],
subsample = x["subsample"],
colsample_bytree = x["colsample_bytree"],
max_bin = x["max_bin"],
objective = 'binary:logistic',
eval_metric = "auc",
tree_method = "hist"),
data=dtrain,
nrounds = x["nround"],
folds = cv_folds,
prediction = FALSE,
showsd = TRUE,
early_stopping_rounds = 15,
verbose = 1)
cv$evaluation_log[, max(test_auc_mean)]
},
par.set = makeParamSet(
makeNumericParam("eta", lower = 0.001, upper = 0.8),
makeNumericParam("gamma", lower = 0, upper = 9),
makeNumericParam("lambda", lower = 1, upper = 9),
makeNumericParam("alpha", lower = 0, upper = 9),
makeIntegerParam("max_depth", lower = 2, upper = 10),
makeIntegerParam("min_child_weight", lower = 1, upper = 5),
makeIntegerParam("nround", lower = 10, upper = 500),
makeNumericParam("subsample", lower = 0.2, upper = 1),
makeNumericParam("colsample_bytree", lower = 0.3, upper = 1),
makeIntegerParam("max_bin", lower = 255, upper = 1023)
),
minimize = FALSE
)
des = generateDesign(n=length(getParamSet(obj.fun)$pars)+1, # the initial design must have more points than parameters; using (number of hyperparameters + 1) points keeps computation time down
                     par.set = getParamSet(obj.fun),
                     fun = lhs::randomLHS) # if no design is given, mlrMBO generates a maximin Latin Hypercube Design of size 4 times the number of the black-box function's parameters
control = makeMBOControl()
control = setMBOControlTermination(control, iters = 100) # number of Bayesian iterations
Step 9. Tune the hyperparameters with Bayesian optimization and 5-fold cross-validation on the training data
results = mbo(fun = obj.fun,
design = des,
control = control,
show.info = TRUE)
Step 10. Save the results to the list model_results
model_results[[i]] <- results
This is the end of the loop. To return the best hyperparameters for model i, use the following line of code
model_results[[i]]$x
Step 11. Save the output to a file
save(model_results, file = "2021_xgb_results_imputed_1.RData")
Points to consider
Bayesian optimization was used to narrow down the hyperparameter ranges to consider for a pooled approach to hyperparameter tuning. Using Bayesian optimization to limit the hyperparameter space reduced the time required to run a pooled approach for hyperparameter tuning for the imputed dataset.
This script runs a random cross-validated search over this narrowed range using a pooled approach for all 5 imputed datasets, takes the single set of hyperparameters with the best AUC, and runs it with the validation data. The "best" single combination of hyperparameters resulting from this script was fed into 5 individual models (one per imputed dataset); the 5 resulting predictions were averaged and used to compute AUC.
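For orientation, a compact Python sketch of the pooling step is shown below (the project's actual implementation is the R loop later in this section); it assumes one score dataframe per imputed dataset, each keyed on usrds_id, with placeholder column names:
import pandas as pd
from sklearn.metrics import roc_auc_score

def pooled_auc(score_frames, labels):
    '''score_frames: list of 5 DataFrames with columns usrds_id and score
       (one per imputed dataset); labels: DataFrame with usrds_id and died_in_90.'''
    merged = score_frames[0].rename(columns={'score': 'score_1'})
    for k, frame in enumerate(score_frames[1:], start=2):
        merged = merged.merge(frame.rename(columns={'score': 'score_%d' % k}),
                              on='usrds_id')
    # average the per-imputation scores, then score against the true labels
    score_cols = [c for c in merged.columns if c.startswith('score_')]
    merged['averaged'] = merged[score_cols].mean(axis=1)
    merged = merged.merge(labels, on='usrds_id')
    return roc_auc_score(merged.died_in_90, merged.averaged)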
Steps for running the 2_xgb_imputed_gridsearch_cv.R script
- Input:
universe.RData
- Output:
2021_pooling_sample.RData
2021_final_hp_results_random_grid_imputed.xlsx
2021_final_hp_results_random_grid_imputed.RData
Step 1. Load the libraries
library(pROC)
library(rsample)
library(xgboost)
library(sqldf)
library(dplyr)
library(tidyr)
library(magrittr)
library(smoof)
library(mlrMBO) # for bayesian optimisation
library(skimr) # for summarising databases
library(purrr) # to evaluate the loglikelihood of each parameter set in the random grid search
library(DiceKriging)
library(rgenoud)
library(data.table)
library(mltools) #data.table and mltools are needed for "one_hot" function
library(readr) #read rds
library(rBayesianOptimization)
library(openxlsx)
library(Matrix)
Step 2. Load the one-hot encoded data and keep only the training subsets (0-6).
load("~/universe.RData")
depvar = "died_in_90"
rhscols = setdiff(names(universe), c("usrds_id", "subset", "cdtype"))
trainsubsets = c(0,1,2,3,4,5,6)
Step 3. Set the seed and the hyperparameter grid based on the narrowed ranges from the Bayesian optimization in the previous script
Set the seed.
set.seed(123)
25 iterations will be run for the pooled hyperparameter tuning approach.
how_many_models <- 25
Set the updated ranges for each hyperparameter based on the results from Bayesian optimization and randomly generate 25 values for each hyperparameter.
eta <- data.frame(eta = runif(how_many_models,min = 0.04852942, max = 0.08619335))
gamma <- data.frame(gamma = runif(how_many_models,min = 0.766442, max = 6.013658))
lambda <- data.frame(lambda = runif(how_many_models,min = 5.845102, max = 8.751962))
alpha <- data.frame(alpha = runif(how_many_models,min = 6.516213, max = 8.719468))
max_depth <- data.frame(max_depth = sample(6:7, how_many_models, replace=TRUE))
min_child_weight <- data.frame(min_child_weight = sample(1:4, how_many_models, replace=TRUE))
nround <- data.frame(nround = sample(419:499, how_many_models, replace=TRUE))
subsample <- data.frame(subsample = runif(how_many_models,min = 0.7314413, max = 0.8471972))
colsample_bytree <- data.frame(colsample_bytree = runif(how_many_models,min = 0.5921707, max = 0.8566342))
max_bin <- data.frame(max_bin = sample(529:972, how_many_models, replace=TRUE))
Initiate the hyperparameter grid.
random_grid <-eta %>%
bind_cols(gamma) %>%
bind_cols(lambda) %>%
bind_cols(alpha) %>%
bind_cols(max_depth) %>%
bind_cols(min_child_weight) %>%
bind_cols(nround) %>%
bind_cols(subsample) %>%
bind_cols(colsample_bytree) %>%
bind_cols(max_bin) %>%as_tibble()
df.params <- bind_rows(random_grid) %>%
mutate(rownum = row_number(),
model = row_number())
list_of_param_sets <- df.params %>% nest(-rownum)
colnames(list_of_param_sets) <- c("model","hyperparamters")
Step 4. Prepare the training dataset
train_full = universe %>%
  filter(subset <= 6) %>% as.data.frame()
Step 5. Loop through each imputation
for(i in 1:5){
----
}
For each imputation perform the following steps:
Step 6. Set the training dataset
train_onc=train_full %>%
filter(impnum == i) %>% as.data.frame()
Step 7. Sort the training dataset by usrds_id
train_onc = train_onc[order(train_onc$usrds_id),] #We sort the data to make sure a usrds_id will always end up in the training or validation set regardless of which imputed dataset we are using
Step 8. Remove columns containing all NAs
all_na <- function(x) any(!is.na(x)) #helper: TRUE if a column has at least one non-NA value (used below to drop all-NA columns)
train_onc <- train_onc %>% select_if(all_na) #removing the columns containing all NAs
Step 9. Set all columns to numeric and set the seed
train_onc[] <- lapply(train_onc, as.numeric) #force to numeric columns
set.seed(2369)
Step 10. Set the train/test split (70% for training and 30% for validation)
tr_te_split <- rsample::initial_split(train_onc, prop = 7/10) #70% for training, 30% for validation/test
train_onc <- rsample::training(tr_te_split) %>% as.data.frame()
test_onc <- rsample::testing(tr_te_split) %>% as.data.frame()
Step 11. Prepare training and validation/test set for modeling
options(na.action='na.pass')
trainm <- sparse.model.matrix(died_in_90 ~ ., data = train_onc[, c(rhscols,"died_in_90")])
dtrain <- xgb.DMatrix(data = trainm, label=train_onc[, depvar])
testm <- sparse.model.matrix(died_in_90 ~ ., data = test_onc[, c(rhscols,"died_in_90")])
dtest <- xgb.DMatrix(data = testm, label=test_onc[, depvar])
watchlist <- list(train = dtrain, eval = dtest)
Step 12. Write function to loop through all 25 hyperparameter combinations and run the model
random_grid_results <- list_of_param_sets %>%
mutate(results = map(hyperparamters, function(x){
message(paste0("model #", x$model,
" eta = ", x$eta,
" max.depth = ", x$max_depth,
" min_child_weigth = ", x$min_child_weight,
" subsample = ", x$subsample,
" colsample_bytree = ", x$colsample_bytree,
" gamma = ", x$gamma,
" nrounds = ", x$nround))
set.seed(12345)
singleModel <- xgb.train(params = list(
booster = "gbtree",
scale_pos_weight = sqrt(12),
eta = x$eta,
max_depth = x$max_depth,
min_child_weight = x$min_child_weight,
gamma = x$gamma,
lambda = x$lambda,
alpha = x$alpha,
subsample = x$subsample,
colsample_bytree = x$colsample_bytree,
max_bin = x$max_bin,
objective = 'binary:logistic',
eval_metric = "auc"),
data=dtrain,
nrounds = x$nround,
prediction = FALSE,
watchlist = watchlist,
showsd = TRUE,
early_stopping_rounds = 15,
verbose = 2)
output <- list(score = predict(singleModel, dtest),
id = test_onc$usrds_id
)
return(output)
}))
Step 13: Add all results to a list
all[[i]] <- random_grid_results
This is the end of the loop.
Step 14: Loop through each set of the 25 hyperparameters to pool the 5 prediction scores from each imputation together and calculate AUC
for(i in 1:how_many_models){
one <- as.data.frame(data.table::transpose(all[[1]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[1]]$results[[i]]$id)
two <- as.data.frame(data.table::transpose(all[[2]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[2]]$results[[i]]$id)
third <- as.data.frame(data.table::transpose(all[[3]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[3]]$results[[i]]$id)
fourth <- as.data.frame(data.table::transpose(all[[4]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[4]]$results[[i]]$id)
fifth <- as.data.frame(data.table::transpose(all[[5]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[5]]$results[[i]]$id)
pooling = one %>%
inner_join(two, by = "usrds_id") %>%
inner_join(third, by = "usrds_id") %>%
inner_join(fourth, by = "usrds_id") %>%
inner_join(fifth, by = "usrds_id")
pooling$averaged <- apply(pooling[2:ncol(pooling)], 1, mean) #averaging scores
pooling <- left_join(pooling, test_onc %>% select("usrds_id","died_in_90"), by = "usrds_id")
#calculate AUC
auc <- pROC::auc(pooling$died_in_90, pooling$averaged) #compute AUC
}
# Add the results to a dataframe
toAdd <- data.frame(hyper = all[[1]]$hyperparamters[[i]],
auc = auc)
final_hp_results <- rbind(final_hp_results,toAdd)
Step 15. Save the output file
save(final_hp_results, file = "2021_final_hp_results_random_grid_imputed.RData")
openxlsx::write.xlsx(as.data.frame(final_hp_results), file = "2021_final_hp_results_random_grid_imputed.xlsx",
sheetName='Sheet1', row.names=FALSE,showNA = F)
Points to consider
Using a random grid search on the narrowed hyperparameter ranges allowed for the pooling of the model prediction scores from each imputed dataset on each hyperparameter combination to produce one AUC, which resulted in a much shorter compute time to identify the optimal hyperparameters.
Steps for running the 3_xgb_imputed_final_hyperparams.R script
Run the final model using the best combination of hyperparameters from the previous step on each of the 5 imputed datasets. Pool the results by averaging across samples to get the ROC AUC, confusion matrix, and feature importance.
- Input:
universe.RData
- Output:
2021_final_hp_results_single_imputed_xgb.xlsx
2021_final_hp_results_single_imputed_xgb.RData
2021_conf_matrix.RData
2021_myplot_xgb.RData
2021_Rplots.pdf
2021_all_features.RData
2021_averaged_feature_importance_xgb.RData
2021_xgb_pooling_results_final_roc.csv
Step 1. Load the libraries
library(pROC)
library(rsample)
library(RPostgres)
library(DBI)
library(xgboost)
library(sqldf)
library(dplyr)
library(tidyr)
library(magrittr)
library(smoof)
library(mlrMBO) # for bayesian optimisation
library(skimr) # for summarising databases
library(purrr) # to evaluate the loglikelihood of each parameter set in the random grid search
library(DiceKriging)
library(rgenoud)
library(data.table)
library(mltools) #data.table and mltools are needed for "one_hot" function
library(readr) #read rds
library(rBayesianOptimization)
library(openxlsx)
library(Matrix)
library(stringr)
Step 2. Load the data and set the seed
load('~/universe.RData')
depvar = "died_in_90"
rhscols = setdiff(names(universe), c("usrds_id", "subset", "died_in_90"))
trainsubsets = c(0,1,2,3,4,5,6)
testsubsets = c(7,8,9)
set.seed(123)
Step 3. Set the number of models (1) and set the optimal hyperparameters from the previous section
how_many_models <- 1
eta <- data.frame(eta = 0.0501135)
gamma <- data.frame(gamma = 2.937342)
lambda <- data.frame(lambda = 8.20660)
alpha <- data.frame(alpha = 7.27306)
max_depth <- data.frame(max_depth = 7)
min_child_weight <- data.frame(min_child_weight = 2)
nround <- data.frame(nround = 493)
subsample <- data.frame(subsample = 0.7513711)
colsample_bytree <- data.frame(colsample_bytree = 0.6611578)
max_bin <- data.frame(max_bin = 935)
Step 4. Initiate the hyperparameter grid
random_grid <-eta %>%
bind_cols(gamma) %>%
bind_cols(lambda) %>%
bind_cols(alpha) %>%
bind_cols(max_depth) %>%
bind_cols(min_child_weight) %>%
bind_cols(nround) %>%
bind_cols(subsample) %>%
bind_cols(colsample_bytree) %>%
bind_cols(max_bin) %>%as_tibble()
df.params <- bind_rows(random_grid) %>%
mutate(rownum = row_number(),
model = row_number())
list_of_param_sets <- df.params %>% nest(-rownum)
colnames(list_of_param_sets) <- c("model","hyperparamters")
Step 5. Set the training and test datasets
train_full = universe %>%
filter(subset <=6 ) %>% as.data.frame()
test_full = universe %>%
filter(subset > 6 ) %>% as.data.frame()
Step 6. Loop through each imputation
for(i in 1:5){
----
}
For each imputation perform the following steps:
Step 7. Prepare the training and the test data for modeling
train_onc=train_full %>%
filter(impnum == i) %>% as.data.frame()
train_onc = train_onc[order(train_onc$usrds_id),]
rownames(train_onc) <- train_onc$usrds_id #preserving usrds_id as rownames because usrds_id will be removed in next line
train_onc <- train_onc[, c(rhscols,"died_in_90")] #selecting variables
all_na <- function(x) any(!is.na(x)) #helper: TRUE if a column has at least one non-NA value (used below to drop all-NA columns)
train_onc <- train_onc %>% select_if(all_na) #removing the columns containing all NAs
train_onc[] <- lapply(train_onc, as.numeric) #force to numeric columns
print(paste("dimensions for train_onc:",dim(train_onc)))
options(na.action='na.pass')
trainm <- sparse.model.matrix(died_in_90 ~ ., data = train_onc)
dtrain <- xgb.DMatrix(data = trainm, label=train_onc[, depvar])
rm(trainm)
rm(train_onc)
gc()
#### test pre-processing
test_onc=test_full %>%
filter(impnum == i) %>% as.data.frame()
test_onc = test_onc[order(test_onc$usrds_id),]
test_ids <- test_onc$usrds_id #preserving usrds_id
rownames(test_onc) <- test_onc$usrds_id #preserving usrds_id as rownames because usrds_id will be removed in next line
test_onc <- test_onc[, c(rhscols,"died_in_90")] #selecting variables
test_onc <- test_onc %>% select_if(all_na) #removing the columns containing all NAs
test_onc[] <- lapply(test_onc, as.numeric) #force to numeric columns
print(paste("dimensions for test_onc:",dim(test_onc)))
options(na.action='na.pass')
testm <- sparse.model.matrix(died_in_90 ~ ., data = test_onc)
dtest <- xgb.DMatrix(data = testm, label=test_onc[, depvar])
rm(testm)
gc()
Step 8. Write function to run the model
watchlist <- list(train = dtrain, eval = dtest)
random_grid_results <- list_of_param_sets %>%
mutate(results = map(hyperparamters, function(x){
message(paste0("model #", x$model,
" eta = ", x$eta,
" max.depth = ", x$max_depth,
" min_child_weigth = ", x$min_child_weight,
" subsample = ", x$subsample,
" colsample_bytree = ", x$colsample_bytree,
" gamma = ", x$gamma,
" nrounds = ", x$nround))
set.seed(12345)
singleModel <- xgb.train(params = list(
booster = "gbtree",
scale_pos_weight = sqrt(12),
eta = x$eta,
max_depth = x$max_depth,
min_child_weight = x$min_child_weight,
gamma = x$gamma,
lambda = x$lambda,
alpha = x$alpha,
subsample = x$subsample,
colsample_bytree = x$colsample_bytree,
max_bin = x$max_bin,
objective = 'binary:logistic',
eval_metric = "auc"),
data=dtrain,
nrounds = x$nround,
prediction = FALSE,
watchlist = watchlist,
showsd = TRUE,
early_stopping_rounds = 15,
verbose = 0)
Step 9. Obtain feature importance and save prediction scores and USRDS IDs to a list
feature_imp <- xgb.importance(singleModel$feature_names,
model = singleModel)
all_features[[i]] <- feature_imp # add feature_imp to list
output <- list(score = predict(singleModel, dtest),
id = test_ids)
return(output)
}))
Step 10. Add modeling results to a list
all[[i]] <- random_grid_results
This is the end of the loop.
Step 11: Pool the 5 prediction scores from each imputation together and calculate AUC
final_hp_results_single <- data.frame()
for(i in 1:how_many_models){
one <- as.data.frame(data.table::transpose(all[[1]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[1]]$results[[i]]$id)
two <- as.data.frame(data.table::transpose(all[[2]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[2]]$results[[i]]$id)
third <- as.data.frame(data.table::transpose(all[[3]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[3]]$results[[i]]$id)
fourth <- as.data.frame(data.table::transpose(all[[4]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[4]]$results[[i]]$id)
fifth <- as.data.frame(data.table::transpose(all[[5]]$results[[i]]))[1,] %>%
tidyr::gather(key = "usrds_id", value = "score") %>%
mutate(usrds_id = all[[5]]$results[[i]]$id)
pooling = one %>%
inner_join(two, by = "usrds_id") %>%
inner_join(third, by = "usrds_id") %>%
inner_join(fourth, by = "usrds_id") %>%
inner_join(fifth, by = "usrds_id")
pooling$averaged <- apply(pooling[2:ncol(pooling)], 1, mean) #averaging scores
pooling$usrds_id <- as.character(pooling$usrds_id)
test_onc$usrds_id <- as.character(rownames(test_onc))
pooling <- left_join(pooling, test_onc %>% select("usrds_id","died_in_90"), by = "usrds_id")
pooling$predicted <- ifelse(pooling$averaged > 0.5, 1,0)
print("pooling summary after left_join():")
summary(pooling)
print("conf matrix:")
table(pooling$predicted, pooling$died_in_90)
conf_matrix <- table(pooling$predicted, pooling$died_in_90)
save(conf_matrix, file = "2021_conf_matrix.RData")
Step 12. Calculate the confusion matrix and model evaluation metrics at a threshold of 0.5 on the pooled predictions
tp <- conf_matrix[2,2]
fp <- conf_matrix[2,1]
fn <- conf_matrix[1,2]
tn <- conf_matrix[1,1]
sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
fpr = 1 - specificity
tpr = sensitivity
LR = sensitivity / (1 - specificity)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
f1_score = 2 * ppv * sensitivity / (ppv + sensitivity)
accuracy <- mean(pooling$predicted == pooling$died_in_90)
Step 13. Plot ROC AUC
myplot <- pROC::plot.roc(pooling$died_in_90, pooling$averaged)
save(myplot, file = "2021_myplot_xgb.RData")
Step 14. Save final imputed XGBoost modeling results
write.csv(pooling, '2021_xgb_pooling_results_final_roc.csv')
final_hp_results_single <- rbind(final_hp_results_single,toAdd)
save(final_hp_results_single, file = "2021_final_hp_results_single_imputed_xgb.RData")
openxlsx::write.xlsx(as.data.frame(final_hp_results_single), file = "2021_final_hp_results_single_imputed_xgb.xlsx",
sheetName='Sheet1', row.names=FALSE,showNA = F)
}
print("saving the feature importance")
save(all_features, file = "2021_all_features.RData")
Step 15: Average feature importance across imputations and save file
#averaging the feature importance
averaged <- all_features %>% reduce(inner_join, by = "Feature") %>% as.data.frame()
rownames(averaged) <- averaged$Feature
averaged = averaged %>% select(contains("Gain"))
averaged$average = as.data.frame(apply(averaged, 1, mean)) #compute average
averaged$feature = rownames(averaged)
save(averaged, file = "2021_averaged_feature_importance_xgb.RData")
The calibration curve shows the reliability of the model by each prediction score category, the number of patients that fall into each category, and the proportion of patients in each category who actually died in the first 90 days following dialysis initiation.
Steps for running the 4_xgb_imputed_calibration.ipynb script
- Input:
2021_xgb_pooling_results_final_roc.csv
- Output:
model_calibrated_xgb_imputed.pickle
y_calibrated_xgb_imputed.pickle
Step 1. Import libraries
import pandas as pd
import numpy as np
import pickle
import sys
#path to the functions directory
sys.path.append('../../onc_functions/')
# import custom functions
from plot_functions import onc_plot_calibration_curve
from calibrate_onc import calibrate_onc
#connect to postgres database
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
con = create_engine('postgresql://username:password@location/dbname')
Step 2. Load results from the XGBoost Imputed model
# load results from the roc auc evaluated model
pred_df = pd.read_csv('./roc_results/2021_xgb_pooling_results_final_roc.csv')
pred_df = pred_df.loc[:,['averaged','died_in_90','usrds_id']]
Step 3. Plot the original model's calibration curve. This function onc_plot_calibration_curve is located in the /onc_functions/plot_functions.py file.
def onc_plot_calibration_curve(y_true, y_proba, label, filename):
#calculate numbers to plot
clf_score = brier_score_loss(y_true, y_proba, pos_label=1)
fraction_of_positives, mean_predicted_value = \
calibration_curve(y_true, y_proba, n_bins=10)
# set up plot
fig1 = plt.figure(1, figsize=(10,10))#,dpi=400)
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
#plot the reference for a perfectly calibrated model
ax1.plot([0, 1], [0, 1], "k:", label="Reference Line")
# plot the calibration curve
ax1.plot(mean_predicted_value, fraction_of_positives, "ks-",
label=label)
# plot histogram of predicted values
ax2.hist(y_proba, range=(0, 1), bins=10, label=label,
histtype="step", lw=2)
# set axes and other figure parameters
ax1.set_ylabel("Observed Event Rate")
ax1.set_xlabel("Predicted Event Rate")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper right", ncol=1)
plt.rc('axes', labelsize=22) # fontsize of the x and y labels
plt.rc('xtick', labelsize=15) # fontsize of the tick labels
plt.rc('ytick', labelsize=15) # fontsize of the tick labels
plt.rc('legend', fontsize=20) # legend fontsize
#save figure resolution
plt.savefig(filename + ".png", dpi=400, transparent=True)
plt.show()
Run the function above.
onc_plot_calibration_curve(
y_true=pred_df.died_in_90,
y_proba=pred_df.averaged,
label='XGBoost_imputed',
filename='./roc_results/xgb_imputed_orig_calibration')
The XGBoost model can be calibrated by training an isotonic regression on a portion of the testing set. (Model calibration is performed because probabilities of death in the first 90 days are more informative and useful to clinicians than a simple binary prediction. To produce valid probability estimates, predicted event rates should track observed rates across the full range of predicted risk.)
Step 4. Load the subset for each ID:
df = pd.read_sql_query('''
SELECT usrds_id, subset FROM medxpreesrd;''', con)
Merge the subset details with the predictions
data = pd.merge(pred_df, df, how="left", on="usrds_id")
The next steps are inside function calibrate_onc located in the /onc_functions/calibrate_onc.py file.
Step 5. Split the predictions from the test set (how we evaluated the model) into a test/train for the calibration (isotonic regression classifier). Split test data (subsets 7-9) into new train (7-8)/test (9) sets
calibration_train_set = data[((data.subset==7)|(data.subset==8))].copy()
calibration_test_set = data[data.subset==9].copy()
Step 6. Define the calibration model
ir = IsotonicRegression(out_of_bounds="clip")
Step 7. Fit the model to the XGBoost predictions from the (new) training set
ir.fit(calibration_train_set.score, calibration_train_set.y )
Step 8. Evaluate the model using the (new) test set
p_calibrated = ir.transform(calibration_test_set.score)
calibration_test_set['p_calibrated'] = p_calibrated
Step 9. Save
with open(path + 'model_calibrated_' + model_name + '.pickle', 'wb') as picklefile:
pickle.dump(ir,picklefile)
with open(path + 'y_calibrated_' + model_name + '.pickle', 'wb') as picklefile:
pickle.dump(calibration_test_set, picklefile)
Step 10. Print the scores from the original and calibrated model. The function print_calibrated_results is found in the /onc_functions/calibrate_onc.py file.
def print_calibrated_results(y_true, y_pred, y_calibrated):
'''print scores for pre and post calibration'''
acc = accuracy_score(y_true, np.round(y_pred))
acc_calibrated = accuracy_score(y_true, np.round(y_calibrated ))
print ("accuracy - original/calibrated:", acc, "/", acc_calibrated)
auc = roc_auc_score(y_true, y_pred)
auc_calibrated = roc_auc_score(y_true, y_calibrated)
print ("ROC AUC - original/calibrated: ", auc, "/", auc_calibrated)
pr = average_precision_score(y_true, y_pred)
pr_calibrated = average_precision_score(y_true, y_calibrated )
print ("avg precision - original/calibrated:", pr, "/", pr_calibrated)
clf_score = brier_score_loss(y_true, y_calibrated, pos_label=1)
print("\tBrier: %1.3f" % (clf_score))
Run these 2 calibration functions.
calibrated_results = calibrate_onc(data, path='./roc_results/',model_name='xgb_imputed')
Steps for running the 5_xgb_imputed_calibrated_plots.ipynb script
- Input:
y_calibrated_xgb_imputed.pickle
- Output:
xgb_imputed_calibration.png
xgb_imputed_mortality_bar.png
xgb_imputed_roc_auc_bw.png
2021_xgb_imputed_calibrated_confusion_matrix.csv
Step 1. Import libraries
import pandas as pd
import numpy as np
import pickle
import sys
#path to the functions directory
sys.path.append('../../onc_functions/')
#import custom plotting functions
from plot_functions import onc_plot_calibration_curve, onc_calc_cm, onc_plot_roc, onc_plot_precision_recall, onc_plot_risk
Step 2. Load results from the calibrated model
with open('./roc_results/y_calibrated_xgb_imputed.pickle', 'rb') as picklefile:
calibrated_results = pickle.load(picklefile)
Step 3. Plot the calibration curve of the calibrated model using the same onc_plot_calibration_curve function from /onc_functions/plot_functions.py
onc_plot_calibration_curve(
y_true=calibrated_results.y,
y_proba=calibrated_results.p_calibrated,
label='XGBoost imputed calibrated',
filename='./roc_results/xgb_imputed_calibrated')
Step 4. Plot the Risk of the calibrated model. This function onc_plot_risk is located and imported from plot_functions.py
def onc_plot_risk(y_true, y_proba, label, filename):
# calculate values for plot
fraction_of_positives, mean_predicted_value = \
calibration_curve(y_true, y_proba, n_bins=10)
# set up figure params
fig1 = plt.figure(1, figsize=(12,30),dpi=400)
ax1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2)
# bar plot
xs = np.arange(len(fraction_of_positives))
ax1.bar(xs, mean_predicted_value, color='k', width = 0.25, label=label)
ax1.bar(xs+.25, fraction_of_positives, color='gray', width = 0.25, label='Observed')
#more figure settings
plt.xticks(xs, np.arange(1, len(xs)+1, 1))
ax1.set_ylabel("Mortality Rate")
ax1.set_xlabel("Decile of Predicted Mortality Risk")
ax1.legend(loc="upper left")
plt.rc('axes', labelsize=22) # fontsize of the x and y labels
plt.rc('xtick', labelsize=15) # fontsize of the tick labels
plt.rc('ytick', labelsize=15) # fontsize of the tick labels
plt.rc('legend', fontsize=20) # legend fontsize
#save plot
plt.savefig(filename + ".png", dpi=400, transparent=True)
onc_plot_risk(
y_true=calibrated_results.y,
y_proba=calibrated_results.p_calibrated,
label='Predicted (XGBoost Imputed)',
filename='xgb_imputed_mortality_bar')
Step 5. Plot the ROC AUC of the calibrated model. This function onc_plot_roc is located and imported from plot_functions.py
def onc_plot_roc(y_true, y_pred, model_name, **kwargs):
'''
Plot the ROC AUC and return the test ROC AUC results.
INPUT: y_true, y_pred, model_name, **kwargs
'''
#calc values for plot
false_positives, true_positives, threshold = roc_curve(y_true, y_pred)
c_roc_auc_score = auc(false_positives, true_positives)
#set figure params
fig1 = plt.figure(1, figsize=(12,30),dpi=400)
ax1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2)
#plot reference line for chance
ax1.plot([0, 1], [0, 1], linestyle='--', lw=2, color='gray',
label='Chance', alpha=.8)
# plot AUC ROC
ax1.plot(false_positives, true_positives,
label=r'ROC (AUC = %0.3f)' % (c_roc_auc_score),
lw=2, alpha=.8, color = 'k')
# additional figure params
ax1.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],)
ax1.legend(loc="lower right")
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.rc('axes', labelsize=22) # fontsize of the x and y labels
plt.rc('xtick', labelsize=15) # fontsize of the tick labels
plt.rc('ytick', labelsize=15) # fontsize of the tick labels
plt.rc('legend', fontsize=20) # legend fontsize
# save plot
plt.savefig(model_name + "_calibrated_roc_auc_bw.png", dpi=400, transparent=True)
plt.show()
onc_plot_roc(
y_true=calibrated_results.y,
y_pred=calibrated_results.p_calibrated,
model_name='xgb_imputed');
Step 6. Save the performance metrics at multiple thresholds. The following function is imported from /onc_functions/plot_functions.py
def onc_calc_cm(y_true, y_predictions, range_probas=[0.1,0.5]):
'''
Plot the confusion matrix and scores for multiple thresholds
'''
df = pd.DataFrame(index = range_probas,
columns=['threshold','sensitivity','specificity',
'likelihood_ratio_neg','likelihood_ratio_pos',
'tp','fp','tn','fn','total_survived','total_deceased',])
for proba_threshold in range_probas:
cm = confusion_matrix(y_true, y_predictions > proba_threshold)
tn = cm[0][0]
fp = cm[0][1]
sensitivity = recall_score(y_true, y_predictions > proba_threshold)
specificity = tn / (tn + fp)
df.loc[proba_threshold, "threshold"] = proba_threshold
df.loc[proba_threshold,"sensitivity"] = sensitivity
df.loc[proba_threshold, "specificity"] = specificity
df.loc[proba_threshold, "likelihood_ratio_neg"] = (1-sensitivity)/specificity
df.loc[proba_threshold, "likelihood_ratio_pos"] = sensitivity/(1-specificity)
df.loc[proba_threshold, "tp"] = cm[1][1]
df.loc[proba_threshold, "fp"] = fp
df.loc[proba_threshold, "tn"] = tn
df.loc[proba_threshold, "fn"] = cm[1][0]
df.loc[proba_threshold, "total_survived"] = np.sum(cm[0])
df.loc[proba_threshold, "total_deceased"] = np.sum(cm[1])
return df
cm = onc_calc_cm(
calibrated_results.y,
calibrated_results.p_calibrated,
range_probas=[.10,.20, .30, .40, .50])
cm.to_csv('./roc_results/2021_xgb_imputed_calibrated_confusion_matrix.csv')
cm
Steps for running the 6_xgb_fairness_assess_get_data.ipynb script
Get the columns of data required to compute the fairness assessment and save the file.
- inc_age = age
- sex
- dialtyp = type of dialysis
- race
- hispanic
- Input:
medxpreesrd table from Postgres
- Output:
complete_fairness_data.pickle
Step 1. Import the libraries
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import sys
import pickle
Step 2. Connect to the Postgres database
The credentials required to connect to the database should be inserted in the code snippet below:
con = create_engine('postgresql://username:password@location/dbname')
Step 3. Import the columns required for the fairness assessment from the database
df = pd.read_sql_query('''SELECT usrds_id, died_in_90, inc_age, sex, dialtyp, race, hispanic, subset FROM medxpreesrd;''', con)
Step 4. Save the file
with open('complete_fairness_data.pickle', 'wb') as picklefile:
pickle.dump(df, picklefile)
ML models can perform differently for different categories of patients, so the imputed XGBoost model was assessed for fairness, or how well the model performs for each category of interest (demographics—sex, race, and age—as well as initial dialysis modality). Age is binned into the following categories based on UCSF clinician input and an example in literature: 18-25, 26-35, 36-45, 46-55, 56-65, 66-75, 76-85, 86+. The USRDS predefined categories for race, sex, and dialysis modality were used for the fairness assessment.
Steps for running the 7_xgb_imputed_fairness.ipynb script
Calculations for specific groups of patients to assess the fairness of the final model for all patients in the test subsets. Note: for the imputed XGBoost model, all fairness results are for imputation #5 of the non-calibrated model.
- Input:
2021_xgb_pooling_results_final_roc.csv
complete_fairness_data.pickle
- Output:
2021_xgb_imputed_fairness.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y%m%d")
import sys
#path to the functions directory
sys.path.append('../../onc_functions/')
# import custom function
from fairness import get_fairness_assessment
Step 2. Write the function that calculates AUC and the confusion matrix from the model prediction scores. This function is located and imported from the /onc_functions/fairness.py file.
def get_fairness_assessment(df, y_proba_col_name, y_true_col_name):
#turn the continuous age variable into age categories
df['agegroup'] = pd.cut(df.inc_age,
bins=[17, 25, 35, 45, 55, 65, 75, 85, 90],
labels=[1, 2, 3, 4, 5, 6, 7, 8])
df = df.drop(columns=['inc_age'])
#replace NaNs with a large number that does not appear in the data, effectively creating another category for missing values
df.loc[:,['race','dialtyp','hispanic']] = df.loc[:,['race','dialtyp','hispanic']].fillna(100.0, axis=1).copy()
#Identify the cols for the fairness assessment
fairness_cols = ['agegroup', 'sex','dialtyp', 'race','hispanic']
#loop through all categories and values to get counts, auc, and confusion matrix
rows_list = []
for col in fairness_cols:
for name, c in df.groupby(col):
fairness_dict = {}
fairness_dict['Feature'] = col
fairness_dict['Value'] = name
fairness_dict['Count'] = c.shape[0]
fairness_dict['AUC'] = roc_auc_score(c[y_true_col_name], c[y_proba_col_name])
tn, fp, fn, tp = confusion_matrix(y_true = c[y_true_col_name],
y_pred = np.where(c[y_proba_col_name] >= 0.5, 1, 0)).ravel()
fairness_dict['TN'] = tn
fairness_dict['FP'] = fp
fairness_dict['FN'] = fn
fairness_dict['TP'] = tp
rows_list.append(fairness_dict)
#convert results from a list to a dataframe
df_fairness = pd.DataFrame(rows_list)
return df_fairness
Step 3. Load results from the model and fairness details
pred_df = pd.read_csv('./roc_results/2021_xgb_pooling_results_final_roc.csv')
with open('../complete_fairness_data.pickle', 'rb') as f:
dataset = pickle.load(f)
# merge model results with fairness details
data = pred_df.merge(dataset, how='left', on=['usrds_id','died_in_90'])
Step 4. Calculate fairness assessment
fairness = get_fairness_assessment(data,
y_proba_col_name='averaged',
y_true_col_name='died_in_90')
Step 5. Save results
fairness.to_csv('./roc_results/' + str(dte) + '_xgb_imputed_fairness.csv')
Points to consider
Performing the fairness assessment on the categories of interest gives additional insight into how the model performs for different patient categories of interest (by demographics, etc.). Future researchers should perform fairness assessments to better evaluate model performance, especially for models that may be deployed in a clinical setting. Other methods of assessing fairness include evaluating true positives, sensitivity, positive predictive value, etc. at various thresholds across the different groups of interest, which would allow selection of a threshold that balances model performance across the groups of interest.
Steps for running the 8_xgb_imputed_risk_categories.ipynb script
Note: Risk categorization is run on the non-calibrated model results.
- Input:
complete_fairness_data.pickle
2021_xgb_pooling_results_final_roc.csv
- Output:
2021_xgb_imputed_risk_cat.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import sys
#path to the functions directory
sys.path.append('../../onc_functions/')
from risk import get_risk_categories
print('python-' + sys.version)
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y%m%d")
Step 2. Import the details from the fairness assessment
with open('../complete_fairness_data.pickle', 'rb') as f:
dataset = pickle.load(f)
Step 3. Import the pooled results from the model
pred_df = pd.read_csv('./roc_results/2021_xgb_pooling_results_final_roc.csv')
Step 4. Merge the details with the results
data = pred_df.merge(dataset, on=['usrds_id','died_in_90'])
Step 5. Calculate risk. The function get_risk_categories is imported from the /onc_functions/risk.py file.
def get_risk_categories(dataset, y_proba_col_name, y_true_col_name):
test_x_pd = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id')
del dataset
df = test_x_pd.loc[:,[y_true_col_name,y_proba_col_name]]
#construct the risk categories from the predicted score
df['risk_categories'] = pd.cut(df[y_proba_col_name],
bins=[-0.1, 0.09, 0.19, 0.29, 0.39, 0.49, 0.59, 0.69, 0.79, 0.89, 0.99],
labels=['0-0.09', '0.1-0.19', '0.2-0.29', '0.3-0.39', '0.4-0.49',
'0.5-0.59','0.6-0.69','0.7-0.79','0.8-0.89','0.9-0.99'])
#loop through all the categories to get the predicted score
risk_list = []
for name, c in df.groupby('risk_categories'):
risk_dict = {}
risk_dict['Risk Category'] = name
risk_dict['Count'] = c[y_true_col_name].shape[0]
risk_dict['Count Died in 90'] = c[y_true_col_name].sum()
risk_dict['Count Survived'] = c[y_true_col_name].shape[0]-c[y_true_col_name].sum()
risk_dict['Percent Died in 90'] = c[y_true_col_name].sum()/c[y_true_col_name].shape[0]
risk_list.append(risk_dict)
df_risk = pd.DataFrame(risk_list)
return df_risk
Run the function above
risk_cat = get_risk_categories(data,
y_proba_col_name='score',
y_true_col_name='died_in_90')
Step 6. Save
risk_cat.to_csv('./results/' + str(dte) + '_xgb_imputed_risk_cat.csv')
Logistic regression (LR) is a classic classification model that can be used to examine the association of one or more (categorical or continuous) independent variables with a binary dependent variable.
Environment
The environment used for the LR model was purchased on Amazon Web Services (AWS):
Name: m5.4xlarge
vCPU: 16
GPU: 0
Cores: 8
Threads per core: 2
Architecture: x86_64
Memory: 64 GB
Operating System: Linux (Ubuntu 20.04 Focal Fossa)
Network Performance: Up to 10 Gigabit
Zone: US govcloud west
The LR model takes less than 1 day to run each section of code if using the above environment.
The LR model and cross-validation methods from the Python (version 3.6.9) library scikit-learn (version 0.24.1) were used, along with the following libraries:
Python Library | Version |
---|---|
scikit-learn | 0.24.1 |
numpy | 1.19.5 |
pandas | 1.1.5 |
matplotlib | 3.3.3 |
seaborn | 0.11.1 |
Points to consider
The use of parallel processing significantly decreased the amount of time it took to run the model.
The preprocessing included one-hot encoding of categorical features and removal of features with missing values.
Steps for running the 1_lr_preprocessing.ipynb script
- Input:
medxpreesrd
micecomplete_pmm
numeric_columns.pickle
- Output:
complete1.pickle
complete2.pickle
complete3.pickle
complete4.pickle
complete5.pickle
Step 1. Install/import libraries
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
# other libraries
import numpy as np
import pandas as pd
import sys
import pickle
import seaborn as sns
#plotting
import matplotlib.pyplot as plt
%matplotlib inline
Step 2. Connect to the Postgres database.
The credentials for the Postgres database will be inserted here.
con = create_engine('postgresql://username:password@address/dbname')
Step 3. Get data
Load the full non-imputed data found in the medxpreesrd table in the Postgres database.
df = pd.read_sql_query('''SELECT * FROM medxpreesrd;''', con)
Get counts for each class. (This gets used later when we train the model.)
neg_class_count, positive_class_count = np.bincount(df['died_in_90'])
The labels are 2 integers, 0 (survived) or 1 (deceased). These correspond to the two outcome classes. Note that we have a class imbalance, with deceased being the minority class.
Label | Class | Count |
---|---|---|
0 | survived | 1064112 |
1 | deceased | 86083 |
Step 4. Remove missing data
Logistic regression models cannot account for missing values. The columns to remove are loaded at the top of the notebook in the variable vars_to_remove and were chosen by the clinical experts who are part of the project team.
For this dataset, the columns of pre-ESRD claims data that have missing values (claim counts, etc.) were removed, keeping the binary features from the Medicare pre-ESRD claims, which include indicators for claims in each care setting, an indicator for pre-ESRD claims, and indicators for each diagnosis group.
df.drop(columns=vars_to_remove,inplace=True)
Step 5. Encode categorical features
One variable, dial_train_time, was created by taking the difference between 2 dates in the medevid table. This feature was the only non-claims-related feature to have a large number of missing values, but instead of dropping it, it was encoded as follows:
- a number greater than zero=1
- 0=0
- missing="na" (a third category)
Thus, the feature is turned into a categorical rather than a numeric variable to retain some (though not all) information.
df.dial_train_time = df.dial_train_time.fillna(-1)
df.dial_train_time=df.dial_train_time.astype(int).clip(lower=-1,upper=1)
df.dial_train_time=df.dial_train_time.astype(str).replace("-1","na")
Use dummy variables for the categorical variables (the list is loaded at the top of the notebook); this is the method used for one-hot encoding.
Get the list of categorical variables that have more than 2 levels, then encode them using the pandas get_dummies function.
dummy_list = []
for col in categoryVars:
u = len(df[col].unique())
if u>2:
dummy_list.append(col)
df = pd.concat([df, pd.get_dummies(df.loc[:,dummy_list].astype('str'))],axis=1).drop(columns=dummy_list,axis=1)
Step 6. Load imputed data
Import imputed data micecomplete_pmm
table from Postgres.
imp = pd.read_sql_query('''SELECT *, row_number()
OVER(PARTITION BY usrds_id) AS impnum
FROM micecomplete_pmm
''', con)
Step 7. Remove the imputed columns from the original data (these will be merged back in from the imputed datasets)
df.drop(columns=["height", "weight", "bmi", "sercr", "album", "gfr_epi", "heglb"],inplace=True)
# df.shape is now (1150195, 290)
Step 8. Separate the imputed data into 5 data frames
This makes it easier to store, load, and compute.
imp1 = imp[imp.impnum==1]
imp2 = imp[imp.impnum==2]
imp3 = imp[imp.impnum==3]
imp4 = imp[imp.impnum==4]
imp5 = imp[imp.impnum==5]
Step 9. Merge the encoded data with each of the 5 imputed datasets.
This is a left merge on the non-imputed data based on the usrds_id and the subset number.
complete1 = pd.merge(df, imp1, how='left', on=["usrds_id","subset"])
complete2 = pd.merge(df, imp2, how='left', on=["usrds_id","subset"])
complete3 = pd.merge(df, imp3, how='left', on=["usrds_id","subset"])
complete4 = pd.merge(df, imp4, how='left', on=["usrds_id","subset"])
complete5 = pd.merge(df, imp5, how='left', on=["usrds_id","subset"])
complete5.shape
(1150195, 298)
Step 10. Save
Save each set to the current directory as a pickle file (a file type that works well for pandas dataframes).
with open('complete1.pickle', 'wb') as picklefile:
pickle.dump(complete1, picklefile)
with open('complete2.pickle', 'wb') as picklefile:
pickle.dump(complete2, picklefile)
with open('complete3.pickle', 'wb') as picklefile:
pickle.dump(complete3, picklefile)
with open('complete4.pickle', 'wb') as picklefile:
pickle.dump(complete4, picklefile)
with open('complete5.pickle', 'wb') as picklefile:
pickle.dump(complete5, picklefile)
Points to consider
The approach used to handle missing values is dependent on the dataset and the features in the dataset. Clinical expertise is crucial in understanding the impact of missing values and whether or not they should be imputed, removed, or replaced.
This script computes the 5-fold cross-validation grid search on each set of the complete imputed data to find the best hyperparameters for the logistic regression model. LR has few hyperparameters to set, so tuning is simpler for this model. The data was split into test and train sets (the same ~70/30 split used in the other models). The training data was used to run the cross-validation model. The following parameters were tested:
- regularization strength
- regularization type (penalty)
- max iterations for convergence
- class weight
The cross-validation was run once for each imputed dataset and the results were pooled (averaged), as sketched below.
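As a minimal sketch (not verbatim from the notebook), running the grid search once per imputation can be expressed as a loop. Here run_lr_cv is a hypothetical wrapper around the per-imputation steps shown below (load completeN.pickle, keep the training subsets, scale, and fit GridSearchCV).
# hypothetical wrapper: run_lr_cv(imp) loads complete{imp}.pickle and returns the fitted GridSearchCV
cv_fits = {}
for imp in [1, 2, 3, 4, 5]:
    cv_fits[imp] = run_lr_cv(imp)
# review cv_fits[imp].best_params_ for each imputation before choosing the final hyperparameters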
Steps for running the 2_logistic_regression.ipynb script
- Input:
complete1.pickle
complete2.pickle
complete3.pickle
complete4.pickle
complete5.pickle
- Output:
2021_LR_cv_clf_imp_1.pickle
2021_LR_cv_clf_imp_2.pickle
2021_LR_cv_clf_imp_3.pickle
2021_LR_cv_clf_imp_4.pickle
2021_LR_cv_clf_imp_5.pickle
2021_final_LR_model_test_pred_proba_imp_x.pickle
2021_final_model_LR_fpr_all.pickle
2021_final_model_LR_tpr_all.pickle
2021_final_model_LR_auc_all.pickle
2021_final_LR_model.pickle
Step 1. Install/import libraries
import pandas as pd
import numpy as np
import os
import pickle
import sklearn.metrics as metrics
from sklearn.metrics import auc, plot_confusion_matrix, roc_curve, plot_roc_curve, accuracy_score, roc_auc_score, classification_report, confusion_matrix, PrecisionRecallDisplay, precision_recall_curve, RocCurveDisplay
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.dummy import DummyRegressor, DummyClassifier
import seaborn as sns
import matplotlib.pyplot as plt
Step 2. Load numeric column list
with open('numeric_columns.pickle', 'rb') as f:
nu_cols = pickle.load(f)
Step 3. Import a set of data and scale numeric columns
Each imputed set should be run separately, so only import one set at a time. Here, we import imputation 5.
with open('complete5.pickle', 'rb') as f:
dataset = pickle.load(f)
Keep only the training data subsets (1-6) since we are only running a cross validation to obtain the optimal parameters for the model.
X_train = dataset[dataset.subset <= 6].copy().sort_values(by = 'usrds_id')
Step 4. Separate the labels (typically denoted as 'y') and save as an array.
y_train = np.array(X_train.pop('died_in_90'))
Step 5. Scale only the numeric columns with an sklearn library StandardScaler.
scaler = StandardScaler()
X_train[nu_cols] = scaler.fit_transform(X_train[nu_cols])
Then save the training data as an array (rather than a pandas dataframe) and drop the non-feature columns.
X_train = np.array(X_train.drop(columns=['subset','usrds_id','impnum']))
Step 6. Create the grid for the grid search
param_grid = [{
'penalty':['l1','l2','elasticnet'],
'C': np.logspace(-3, 3, 10, 20),
'max_iter': [500, 1000, 1500],
'class_weight' :['balanced']
}]
Step 7. Instantiate the model
lr_model = LogisticRegression()
clf = GridSearchCV(
lr_model,
param_grid=param_grid,
cv=5,
verbose=True,
n_jobs=-1,
scoring='average_precision'
)
Step 8. Fit the grid search 5-fold cross-validated logistic regression model
best_clf = clf.fit(X_train, y_train)
Save the model (here, imp is the imputation number of the dataset loaded above; 5 in this example).
with open('2021_LR_cv_clf_imp_'+str(imp)+'.pickle', 'wb') as picklefile:
pickle.dump(clf,picklefile)
Step 9. Calculate predictions on the training set (only the training subsets were loaded for the cross-validation)
pred_proba_onc_train = clf.predict_proba(X_train)[:,1]
Step 10. Metrics and results. Calculate the roc_auc (area under the receiver operating characteristic curve) on the training predictions; the per-fold cross-validation scores can be pulled from the fitted GridSearchCV object, as sketched below.
train_score = roc_auc_score(y_train, pred_proba_onc_train)
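As a minimal sketch (our assumption, using the standard GridSearchCV attributes), the per-fold cross-validation scores for the best parameter combination can also be inspected. Note these scores are average precision, per the scoring argument set in Step 7.
# per-fold cross-validation scores (average precision) for the best parameter combination
best_idx = clf.best_index_
fold_scores = [clf.cv_results_['split%d_test_score' % k][best_idx] for k in range(5)]
print('best params:', clf.best_params_)
print('per-fold scores:', fold_scores)
print('mean cv score:', clf.best_score_)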
Step 11. Import data and scale numeric cols
Train and evaluate the final model based on the best parameters from the cross-validation step. This must be done for each of the 5 imputed datasets.
Load a set of data.
with open('./complete' + str(imp) + '.pickle', 'rb') as f:
dataset = pickle.load(f)
Separate the training set (1-6)
train_x = dataset[dataset.subset <= 6].copy().sort_values(by = 'usrds_id')
from the test set (7-9).
test_x = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id')
Separate the labels.
train_y = np.array(train_x.pop('died_in_90'))
test_y = np.array(test_x.pop('died_in_90'))
Scale the numeric columns by fitting the scaler on the training set and then using it to transform the test set. Also remove the non-feature columns used to identify patients, imputations, or subsets.
scaler = StandardScaler()
train_x[nu_cols] = scaler.fit_transform(train_x[nu_cols])
train_x = np.array(train_x.drop(columns=['subset','usrds_id','impnum']))
test_x[nu_cols] = scaler.transform(test_x[nu_cols])
test_x = np.array(test_x.drop(columns=['subset','usrds_id','impnum']))
Step 12. Instantiate the final logistic regression model
Use the best hyperparameters from the cross-validation for the final model.
lr_model_final = LogisticRegression(C=0.1,
penalty='l2',
max_iter=1000,
solver='saga',
class_weight='balanced',
n_jobs=-1,
verbose=1,
random_state=499)
Step 13. Train the model
Fit the model on the training data subsets.
logistic_model_final = lr_model_final.fit(train_x, train_y)
Step 14. Evaluate the model
Evaluate the model by predicting on the test set and plot the outcome for each imputation.
pred_proba_onc_test = logistic_model_final.predict_proba(test_x)
Step 15. Save the predictions from the model
with open('2021_final_LR_model_test_pred_proba_imp_' + str(imp) + '.pickle', 'wb') as picklefile:
pickle.dump(pred_proba_onc_test, picklefile)
Points to consider
-
Standardization allows for comparison of multiple features in different units and the penalty (i.e., L1) will be applied more equally across the features. The model will learn the importance of features better and faster when it isn't overwhelmed by a feature with a much larger range than the others.
-
Logistic regression models do not perform well when the outcome variable is imbalanced (or heavily skewed towards one outcome). The outcome variable (survived vs. died_in_90) in the training dataset was balanced through weighting (edit the class_weight parameter in the model to give more weight to the minority class and less to the majority class); a short sketch follows this list. Balancing the data ensures that the models have sufficient data from both of the outcome classes (died vs. survived) on which to train. This results in a better balance between sensitivity and specificity, which is important for this dataset, where mortality is predicted.
-
Due to the small set of hyperparameters to tune, this model does not require a GPU or even multiple CPUs to run the cross-validation.
-
It is important to keep the test set separate when scaling, otherwise we are peeking at the test set which will cause an invalid evaluation of the model.
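As a minimal sketch (not from the notebooks), the class_weight='balanced' setting used in the grid corresponds to weighting each class by n_samples / (n_classes * n_class_samples); with the class counts shown earlier, that works out to roughly 0.54 for the survived class and 6.7 for the deceased class.
# 'balanced' class weights computed from the class counts shown in the preprocessing step
neg_class_count, positive_class_count = 1064112, 86083
total = neg_class_count + positive_class_count

weight_for_0 = total / (2.0 * neg_class_count)       # ~0.54
weight_for_1 = total / (2.0 * positive_class_count)  # ~6.68

# equivalent to passing class_weight='balanced' to LogisticRegression
class_weight = {0: weight_for_0, 1: weight_for_1}
print(class_weight)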
Steps for running the 3_pool_results_from_imputations.ipynb script
This script pools the probability predictions for each row/patient from all 5 imputations by averaging the scores.
- Input:
2021_final_LR_model_test_pred_proba_imp_1.pickle
2021_final_LR_model_test_pred_proba_imp_2.pickle
2021_final_LR_model_test_pred_proba_imp_3.pickle
2021_final_LR_model_test_pred_proba_imp_4.pickle
2021_final_LR_model_test_pred_proba_imp_5.pickle
- Output:
2021_final_LR_model_test_pred_proba_pooled.pickle
Step 1. Libraries
import pickle
import numpy as np
import pandas as pd
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
Step 2. Import results from each imputation
with open('./results/2021_final_LR_model_test_pred_proba_imp_1.pickle','rb') as f:
imp1_pred = pickle.load(f)
with open('./results/2021_final_LR_model_test_pred_proba_imp_2.pickle','rb') as f:
imp2_pred = pickle.load(f)
with open('./results/2021_final_LR_model_test_pred_proba_imp_3.pickle','rb') as f:
imp3_pred = pickle.load(f)
with open('./results/2021_final_LR_model_test_pred_proba_imp_4.pickle','rb') as f:
imp4_pred = pickle.load(f)
with open('./results/2021_final_LR_model_test_pred_proba_imp_5.pickle','rb') as f:
imp5_pred = pickle.load(f)
Step 3. Keep only the predictions from the positive class
pooled = pd.DataFrame()
pooled['imp1']=imp1_pred[:,1]
pooled['imp2']=imp2_pred[:,1]
pooled['imp3']=imp3_pred[:,1]
pooled['imp4']=imp4_pred[:,1]
pooled['imp5']=imp5_pred[:,1]
Step 4. Calculate the mean and standard deviation of the predicted probability for the positive class (died_in_90) for each patient/row.
pooled['score'] = pooled.mean(axis=1)
pooled['std_'] = pooled.std(axis=1)
Step 5. Import details from the medxpreesrd table
con = create_engine('postgresql://username:password@location/dbname')
dataset = pd.read_sql_query('''SELECT usrds_id, died_in_90, subset FROM medxpreesrd;''', con)
Step 6. Sort the values so they are in the same order as when the LR was calculated and keep only the test set.
dataset = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id').reset_index(drop=True)
Step 7. Merge the details with the pooled predictions.
pooled = pooled.merge(dataset, left_index=True, right_index=True)
Step 8. Save
with open('./results/2021_final_LR_model_test_pred_proba_pooled.pickle', 'wb') as picklefile:
pickle.dump(pooled, picklefile)
Steps for running the 4_plot_lr_roc_auc.ipynb script
- Input:
2021_final_LR_model_test_pred_proba_pooled.pickle
- Output:
lr_roc_auc_bw.png
2021_lr_confusion_matrix.csv
Step 1. Import libraries
import pandas as pd
import numpy as np
import pickle
import sys
#path to the functions directory
sys.path.append('../onc_functions/')
#import custom plotting functions
from plot_functions import onc_calc_cm, onc_plot_roc
Step 2. Load the pooled model results
with open('./results/2021_final_LR_model_test_pred_proba_pooled.pickle','rb') as f:
results = pickle.load(f)
results = results.loc[:,['score','died_in_90','subset','usrds_id']]
results = results.rename(columns={'died_in_90':'y'})
Step 3. Plot the ROC AUC. The function onc_plot_roc is located in, and imported from, /onc_functions/plot_functions.py
def onc_plot_roc(y_true, y_pred, model_name, **kwargs):
'''
Plot the ROC AUC and return the test ROC AUC results.
INPUT: y_true, y_pred, model_name, **kwargs
'''
#calc values for plot
false_positives, true_positives, threshold = roc_curve(y_true, y_pred)
c_roc_auc_score = auc(false_positives, true_positives)
#set figure params
fig1 = plt.figure(1, figsize=(12,30),dpi=400)
ax1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2)
#plot reference line for chance
ax1.plot([0, 1], [0, 1], linestyle='--', lw=2, color='gray',
label='Chance', alpha=.8)
# plot AUC ROC
ax1.plot(false_positives, true_positives,
label=r'ROC (AUC = %0.3f)' % (c_roc_auc_score),
lw=2, alpha=.8, color = 'k')
# additional figure params
ax1.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],)
ax1.legend(loc="lower right")
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.rc('axes', labelsize=22) # fontsize of the x and y labels
plt.rc('xtick', labelsize=15) # fontsize of the tick labels
plt.rc('ytick', labelsize=15) # fontsize of the tick labels
plt.rc('legend', fontsize=20) # legend fontsize
# save plot
plt.savefig(model_name + "_roc_auc_bw.png", dpi=400, transparent=True)
plt.show()
onc_plot_roc(
y_true=results.y,
y_pred=results.score,
model_name='lr');
Step 4. Print and save the performance metrics at multiple thresholds
def onc_calc_cm(y_true, y_predictions, range_probas=[0.1,0.5]):
'''
Plot the confusion matrix and scores for multiple thresholds
'''
df = pd.DataFrame(index = range_probas,
columns=['threshold','sensitivity','specificity',
'likelihood_ratio_neg','likelihood_ratio_pos',
'tp','fp','tn','fn','total_survived','total_deceased',])
for proba_threshold in range_probas:
cm = confusion_matrix(y_true, y_predictions > proba_threshold)
tn = cm[0][0]
fp = cm[0][1]
sensitivity = recall_score(y_true, y_predictions > proba_threshold)
specificity = tn / (tn + fp)
df.loc[proba_threshold, "threshold"] = proba_threshold
df.loc[proba_threshold,"sensitivity"] = sensitivity
df.loc[proba_threshold, "specificity"] = specificity
df.loc[proba_threshold, "likelihood_ratio_neg"] = (1-sensitivity)/specificity
df.loc[proba_threshold, "likelihood_ratio_pos"] = sensitivity/(1-specificity)
df.loc[proba_threshold, "tp"] = cm[1][1]
df.loc[proba_threshold, "fp"] = fp
df.loc[proba_threshold, "tn"] = tn
df.loc[proba_threshold, "fn"] = cm[1][0]
df.loc[proba_threshold, "total_survived"] = np.sum(cm[0])
df.loc[proba_threshold, "total_deceased"] = np.sum(cm[1])
return df
cm = onc_calc_cm(
results.y,
results.score,
range_probas=[.10,.20, .30, .40, .50])
cm.to_csv('./results/2021_lr_confusion_matrix.csv')
cm
Steps for running the 5_lr_feature_importance.ipynb script
Plot the feature importance according to the final logistic regression model.
- Input:
complete5.pickle
2021_final_LR_model.pickle
- Output:
2021_top_bottom_plot.svg
2021_top_log_regression_coef_20.csv
Step 1. Load the final model
with open('./results/2021_final_LR_model.pickle', 'rb') as picklefile:
logistic_model_final = pickle.load(picklefile)
Step 2. Import feature names. Import one imputation of the data to get the feature names.
with open('./complete5.pickle', 'rb') as f:
dataset = pickle.load(f)
feats = dataset.iloc[0:1,:]
Drop the columns that were not used as features in the model.
ff = feats.drop(columns=['usrds_id','subset','died_in_90','impnum']).copy()
ff = ff.columns
ff = np.array(ff)
Step 3. Sort features by coefficient score
Create a function to sort the features by their highest (tops) or lowest (bottom) coefficient to see which features were most important for the model in determining which class a patient was in. Here we keep the strongest scores regardless of whether they are positive or negative, because we are interested in the magnitude (difference from zero).
def get_most_important_features(r, model, n=5):
classes ={}
for class_index in range(model.coef_.shape[0]):
word_importances = [(el, r[i]) for i, el in enumerate(model.coef_[class_index])]
sorted_coeff = sorted(word_importances, key = lambda x : x[0], reverse=True)
tops = sorted(sorted_coeff[:n], key = lambda x : x[0])
bottom = sorted_coeff[-n:]
classes[class_index] = {
'tops':tops,
'bottom':bottom
}
return classes
Call function for the top and bottom 20 features.
importance = get_most_important_features(ff, logistic_model_final, 20)
Save the strongest scores (most negative and most positive).
top_scores = [a[0] for a in importance[0]['tops']]
top_words = [a[1] for a in importance[0]['tops']]
bottom_scores = [a[0] for a in importance[0]['bottom']]
bottom_words = [a[1] for a in importance[0]['bottom']]
top_coef = pd.DataFrame(columns=['vocab','coef'])
top_coef['vocab'] = top_words + bottom_words
top_coef['coef'] = top_scores + bottom_scores
top_coef = top_coef.sort_values(by='coef',axis=0,ascending=False)
top_coef.to_csv('./results/2021_top_log_regression_coef_20.csv')
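The output list also mentions a plot of these coefficients (2021_top_bottom_plot.svg). The following is a minimal sketch of such a plot, not the notebook's exact code; it reuses the top_coef dataframe built above.
import matplotlib.pyplot as plt

# horizontal bar chart of the strongest positive and negative coefficients
fig, ax = plt.subplots(figsize=(8, 10))
ax.barh(top_coef['vocab'], top_coef['coef'], color='k')
ax.set_xlabel('Logistic regression coefficient')
ax.set_ylabel('Feature')
plt.tight_layout()
plt.savefig('./results/2021_top_bottom_plot.svg')
plt.show()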
ML models can perform differently for different categories of patients, so the LR model was assessed for fairness, or how well the model performs for each category of interest (demographics—sex, race, and age—as well as initial dialysis modality). Age is binned into the following categories based on clinician input and an example in literature: 18-25, 26-35, 36-45, 46-55, 56-65, 66-75, 76-85, 86+. The USRDS predefined categories for race, sex, and dialysis modality were used for the fairness assessment.
Steps for running the 6_logistic_regression_fairness.ipynb script
This script calculates the ROC AUC for specific groups of patients to assess the fairness of the final model.
- Input:
medexpreesrd
2021_final_LR_model_test_pred_proba_pooled.pickle
- Output:
complete_fair1.pickle
2021_lr_fairness.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import sys
import pickle
# needed for the database connection in Step 2
from sqlalchemy import create_engine
#path to the functions directory
sys.path.append('../onc_functions/')
from fairness import get_fairness_assessment
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y")
print('python-' + sys.version)
Step 2. Get the columns of data required to compute fairness assessment and save them.
con = create_engine('postgresql://username:password@location/dbname')
df = pd.read_sql_query('''SELECT usrds_id, died_in_90, inc_age, sex, dialtyp, race, hispanic, subset FROM medxpreesrd;''', con)
Step 3. Save
with open('complete_fair1.pickle', 'wb') as picklefile:
pickle.dump(df, picklefile)
Step 4. Import the pooled results from the LR model
with open('./results/2021_final_LR_model_test_pred_proba_pooled.pickle', 'rb') as f:
y_pred = pickle.load(f)
Step 5. Merge the fairness details with the results
data = df.merge(y_pred, on=['usrds_id','died_in_90','subset'])
Step 6. Calculate fairness. The function get_fairness_assessment is imported from the /onc_functions/fairness.py file. This function calculates AUC and the confusion matrix from the model prediction scores for specific groups.
def get_fairness_assessment(df,
y_proba_col_name,
y_true_col_name):
#turn the continuous age variable into age categories
df['agegroup'] = pd.cut(df.inc_age,
bins=[17, 25, 35, 45, 55, 65, 75, 85, 90],
labels=[1, 2, 3, 4, 5, 6, 7, 8])
df = df.drop(columns=['inc_age'])
#replace NaNs with a large number that does not appear in the data, effectively creating another category for missing values
df.loc[:,['race','dialtyp','hispanic']] = df.loc[:,['race','dialtyp','hispanic']].fillna(100.0, axis=1).copy()
#Identify the cols for the fairness assessment
fairness_cols = ['agegroup', 'sex','dialtyp', 'race','hispanic']
#loop through all categories and values to get counts, auc, and confusion matrix
rows_list = []
for col in fairness_cols:
for name, c in df.groupby(col):
fairness_dict = {}
fairness_dict['Feature'] = col
fairness_dict['Value'] = name
fairness_dict['Count'] = c.shape[0]
fairness_dict['AUC'] = roc_auc_score(c[y_true_col_name], c[y_proba_col_name])
tn, fp, fn, tp = confusion_matrix(y_true = c[y_true_col_name],
y_pred = np.where(c[y_proba_col_name] >= 0.5, 1, 0)).ravel()
fairness_dict['TN'] = tn
fairness_dict['FP'] = fp
fairness_dict['FN'] = fn
fairness_dict['TP'] = tp
rows_list.append(fairness_dict)
#convert results from a list to a dataframe
df_fairness = pd.DataFrame(rows_list)
return df_fairness
Step 7. Calculate the assessment and save the results.
fairness = get_fairness_assessment(data,
y_proba_col_name='score',
y_true_col_name='died_in_90')
fairness.to_csv('./results/2021_lr_fairness.csv')
Points to consider
Performing the fairness assessment on the categories of interest gives additional insight into how the model performs for different patient categories (by demographics, etc.). Future researchers should perform fairness assessments to better evaluate model performance, especially for models that may be deployed in a clinical setting. Other methods of assessing fairness include evaluating true positives, sensitivity, positive predictive value, etc. at various thresholds across the different groups of interest, which would allow selection of a threshold that balances model performance across the groups of interest.
Steps for running the 7_logistic_regression_risk.ipynb script
- Input:
complete_fair1.pickle
2021_final_LR_model_test_pred_proba_pooled.pickle
- Output:
2021_lr_risk_cat.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import sys
#path to the functions directory
sys.path.append('../onc_functions/')
from risk import get_risk_categories
print('python-' + sys.version)
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y%m%d")
Step 2. Import the details from the fairness assessment
with open('complete_fair1.pickle','rb') as f:
details = pickle.load(f)
Step 3. Import the pooled results from the LR model
with open('./results/2021_final_LR_model_test_pred_proba_pooled.pickle', 'rb') as f:
y_pred = pickle.load(f)
Step 4. Merge the details with the results
data = details.merge(y_pred, on=['usrds_id','died_in_90','subset'])
Step 5. Calculate risk. The function get_risk_categories is imported from the /onc_functions/risk.py file.
def get_risk_categories(dataset, y_proba_col_name, y_true_col_name):
test_x_pd = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id')
del dataset
df = test_x_pd.loc[:,[y_true_col_name,y_proba_col_name]]
#construct the risk categories from the predicted score
df['risk_categories'] = pd.cut(df[y_proba_col_name],
bins=[-0.1, 0.09, 0.19, 0.29, 0.39, 0.49, 0.59, 0.69, 0.79, 0.89, 0.99],
labels=['0-0.09', '0.1-0.19', '0.2-0.29', '0.3-0.39', '0.4-0.49',
'0.5-0.59','0.6-0.69','0.7-0.79','0.8-0.89','0.9-0.99'])
#loop through all the categories to get the predicted score
risk_list = []
for name, c in df.groupby('risk_categories'):
risk_dict = {}
risk_dict['Risk Category'] = name
risk_dict['Count'] = c[y_true_col_name].shape[0]
risk_dict['Count Died in 90'] = c[y_true_col_name].sum()
risk_dict['Count Survived'] = c[y_true_col_name].shape[0]-c[y_true_col_name].sum()
risk_dict['Percent Died in 90'] = c[y_true_col_name].sum()/c[y_true_col_name].shape[0]
risk_list.append(risk_dict)
df_risk = pd.DataFrame(risk_list)
return df_risk
Run the function above
risk_cat = get_risk_categories(data,
y_proba_col_name='score',
y_true_col_name='died_in_90')
Step 6. Save
risk_cat.to_csv('./results/' + str(dte) + '_lr_risk_cat.csv')
This section contains the code for the artificial neural net model, a multilayer perceptron (MLP). The repository contains ipython notebooks to train an artificial neural net to classify patients as survived or died_in_90. Tensorflow is the library used to create and train the neural network. For a more detailed explanation of neural networks or tensorflow, we recommend the tutorials at https://www.tensorflow.org/tutorials.
Environment
These ipython notebooks can easily be run in a plain tensorflow docker container or used on their own in a conda or other ipython environment.
The environment used for the MLP model was purchased on Amazon Web Services (AWS):
Name: p2.16xlarge
vCPU: 64
Cores: 32
Threads per core: 2
Architecture: x86_64
Memory: 732 GB
GPU: 16
GPU Memory: 12 GB
GPU Manufacturer: NVIDIA
GPU Name: K80
Operating System: Linux (Ubuntu 20.04 Focal Fossa)
Network Performance: 25 Gigabit
Zone: US govcloud west
This code takes approximately 3 days to run all the sections.
The MLP model was trained using the Python (version 3.6.9) tensorflow (version 2.4.1) library; cross-validation methods were from scikit-learn (version 0.24.1), along with the following libraries:
Python library | Version |
---|---|
tensorflow | 2.4.1 |
scikit-learn | 0.24.1 |
numpy | 1.19.5 |
pandas | 1.1.5 |
matplotlib | 3.3.3 |
Points to consider
An instance with GPUs can be utilized for tuning a large number of hyperparameters, as done here. If multiple cores or a GPU are not available, we recommend choosing only a few hyperparameters to tune at a time.
Run the following command in the same directory where you store the jupyter notebook (the repository). This will first pull the container, then instantiate it and mount the current directory to allow you to access the notebooks and also save any changes.
Step 1. Use the following code to run the container on your local machine:
sudo docker run -it -v "$PWD":/tf/notebooks --name onc_tf_1 tensorflow/tensorflow:latest-jupyter
Step 2. Then open a browser and go to the default port mapping
localhost:8888
Step 3. There you will see this directory of notebooks that you can open and run.
Ports will need to be mapped when running the container. In our case, 8080 is the open port, and we can access it from the local machine at that port when connected to the server.
sudo docker run -it -p 8080:8888 -v "$PWD":/tf/notebooks --name onc_tf_1 tensorflow/tensorflow:latest-jupyter
To utilize a GPU (which can be useful for the cross-validation step).
sudo docker run -it -p 8080:8888 -v "$PWD":/tf/notebooks --name onc_tf_gpu tensorflow/tensorflow:latest-gpu-jupyter
Then you can access the current directory and notebooks at https://[my_aws_ec2_address]:8080
Points to consider
To run the cross-validation hyperparameter tuning with a GPU, run it from a python file (a conda environment worked), not from the ipython jupyter notebook.
The preprocessing is the same as for the logistic regression model and includes one-hot encoding of categorical features and removal of features with missing values (keeping the binary claims indicator features).
Steps for running the 1_mlp_preprocessing.ipynb script
- Input:
medxpreesrd
micecomplete_pmm
numeric_columns.pickle
- Output:
complete1.pickle
complete2.pickle
complete3.pickle
complete4.pickle
complete5.pickle
Step 1. Install and import libraries
Install the python packages into the docker container (if running in a container) and import them into the notebook.
!pip install --upgrade pip
!pip install --upgrade scikit-learn
!pip install pandas
!pip install psycopg2-binary
!pip install sqlalchemy
!pip install seaborn
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
# other libraries
import numpy as np
import pandas as pd
import sys
import pickle
#plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Step 2. Connect to the Postgres database
The credentials for the Postgres database will be inserted here.
con = create_engine('postgresql://username:password@address/dbname')
Step 3. Get data
Load the full non-imputed data found in the medxpreesrd table in the Postgres database.
df = pd.read_sql_query('''SELECT * FROM medxpreesrd;''', con)
Get counts for each class. (This gets used later when we train the model.)
neg_class_count, positive_class_count = np.bincount(df['died_in_90'])
The labels are 2 integers, 0 (survived) or 1 (deceased). These correspond to the two outcome classes. Note that we have a class imbalance, with deceased being the minority class.
Label | Class | Count |
---|---|---|
0 | survived | 1064112 |
1 | deceased | 86083 |
Step 4. Remove missing data
Neural network models cannot handle missing values. The columns to remove are loaded at the top of the notebook in the variable vars_to_remove and were chosen based on clinician input.
For this dataset, the columns of pre-ESRD claims data that have missing values (claim counts, etc.) were removed, keeping the binary features from the Medicare pre-ESRD claims, which include indicators for claims in each care setting, an indicator for pre-ESRD claims, and indicators for each diagnosis group.
df.drop(columns=vars_to_remove,inplace=True)
Step 5. Encode categorical features
One variable, dial_train_time, was created by taking the difference between 2 dates in the medevid table. This feature was the only non-claims-related feature to have a large number of missing values, but we did not want to drop it, so we encoded it in the following way:
- a number greater than zero=1
- 0=0
- missing="na" (a third category)
Thus, the feature is turned into a categorical rather than a numeric variable to retain some (though not all) information.
df.dial_train_time = df.dial_train_time.fillna(-1)
df.dial_train_time=df.dial_train_time.astype(int).clip(lower=-1,upper=1)
df.dial_train_time=df.dial_train_time.astype(str).replace("-1","na")
Use dummy variables for the categorical variables (the list is loaded at the top of the notebook).
Get the list of categorical variables that have more than 2 levels, then encode them using the pandas get_dummies function.
dummy_list = []
for col in categoryVars:
u = len(df[col].unique())
if u>2:
dummy_list.append(col)
df = pd.concat([df, pd.get_dummies(df.loc[:,dummy_list].astype('str'))],axis=1).drop(columns=dummy_list,axis=1)
Step 6. Load imputed data
Import imputed data micecomplete_pmm
table from Postgres.
imp = pd.read_sql_query('''SELECT *, row_number()
OVER(PARTITION BY usrds_id) AS impnum
FROM micecomplete_pmm
''', con)
Step 7. Remove the imputed columns from the original data (these will be merged back in from the imputed datasets)
df.drop(columns=["height", "weight", "bmi", "sercr", "album", "gfr_epi", "heglb"],inplace=True)
# df.shape is now (1150195, 290)
Step 8. Separate the imputed data into 5 dataframes
Separating the 5 imputed datasets into different dataframes makes it easier to store, load, and compute.
imp1 = imp[imp.impnum==1]
imp2 = imp[imp.impnum==2]
imp3 = imp[imp.impnum==3]
imp4 = imp[imp.impnum==4]
imp5 = imp[imp.impnum==5]
Step 9. Merge the encoded data with each of the 5 imputed datasets
This is a left merge on the non-imputed data based on the usrds_id and the subset number.
complete1 = pd.merge(df, imp1, how='left', on=["usrds_id","subset"])
complete2 = pd.merge(df, imp2, how='left', on=["usrds_id","subset"])
complete3 = pd.merge(df, imp3, how='left', on=["usrds_id","subset"])
complete4 = pd.merge(df, imp4, how='left', on=["usrds_id","subset"])
complete5 = pd.merge(df, imp5, how='left', on=["usrds_id","subset"])
complete5.shape
(1150195, 298)
Step 10. Save the data
Save each set to the current directory as a pickle file (a file type that works well for pandas dataframes).
with open('complete1.pickle', 'wb') as picklefile:
pickle.dump(complete1, picklefile)
with open('complete2.pickle', 'wb') as picklefile:
pickle.dump(complete2, picklefile)
with open('complete3.pickle', 'wb') as picklefile:
pickle.dump(complete3, picklefile)
with open('complete4.pickle', 'wb') as picklefile:
pickle.dump(complete4, picklefile)
with open('complete5.pickle', 'wb') as picklefile:
pickle.dump(complete5, picklefile)
Points to consider
The approach used to handle missing values is dependent on the dataset and the features in the dataset. Clinical expertise is crucial in understanding the impact of missing values and whether or not they should be imputed, removed, or replaced.
This script computes the 5-fold cross-validation grid search to find the best hyperparameters for the MLP model. Keras (tensorflow) is used to create the layers of the neural net and take advantage of multiple GPUs (if available). The grid search cross-validation is computed with scikit-learn's GridSearchCV. The class imbalance is handled by using the class_weight parameter. This hyperparameter cannot be tuned as part of the grid search, so the function mlp_cv must be run for each set of weights.
Steps for running the 2_mlp_cross_val.ipynb script
- Input:
build_mlp.py
complete1.pickle
complete2.pickle
complete3.pickle
complete4.pickle
complete5.pickle
numeric_columns.pickle
- Output: x = [1, 2, 3, 4, 5]
2021_grid_best_params_imp_x_weight_m.pickle
2021_grid_best_auc_imp_x_weight_m.pickle
2021_grid_cv_results_imp_x_weight_m.pickle
2021_grid_best_params_imp_x_weight_5.pickle
2021_grid_best_auc_imp_x_weight_5.pickle
2021_grid_cv_results_imp_x_weight_5.pickle
2021_grid_best_params_imp_x_weight_10.pickle
2021_grid_best_auc_imp_x_weight_10.pickle
2021_grid_cv_results_imp_x_weight_10.pickle
2021_grid_best_params_imp_x_weight_20.pickle
2021_grid_best_auc_imp_x_weight_20.pickle
2021_grid_cv_results_imp_x_weight_20.pickle
Step 1. Import packages
Tensorflow is a popular framework for training a neural network and keras is a high-level API used for ease of access to the tensorflow functions.
!pip install --upgrade scikit-learn
!pip install pandas
!pip install pyyaml h5py
!pip install seaborn
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.optimizers import Adam, SGD, Adamax
import sklearn.metrics as metrics
from sklearn.metrics import auc, plot_roc_curve, roc_curve, mean_squared_error, accuracy_score, roc_auc_score, classification_report, confusion_matrix, log_loss
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import sklearn
# load custom function for building the NN
import sys
sys.path.append('../onc_functions/')
from build_mlp import build_mlp
# other libraries
import numpy as np
import pandas as pd
import sys
import pickle
#plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Step 2. Import list of numeric columns
The following code snippet obtains the list of numeric columns to be scaled
with open('numeric_columns.pickle', 'rb') as f:
nu_cols = pickle.load(f)
Step 3. Import a set of data and scale numeric columns
Each imputed set should be run separately, so only import one set at a time. The example below imports imputation 5:
with open('complete5.pickle', 'rb') as f:
dataset = pickle.load(f)
Keep only the training data subsets (1-6) since we are only running a cross validation to obtain the optimal parameters for the model.
X_train = dataset[dataset.subset <= 6].copy().sort_values(by = 'usrds_id')
Separate the labels (typically denoted as 'y') and save as an array.
y_train = np.array(X_train.pop('died_in_90'))
Scale only the numeric columns with an sklearn library StandardScaler.
scaler = StandardScaler()
X_train[nu_cols] = scaler.fit_transform(X_train[nu_cols])
Save the training data as an array (rather than a pandas dataframe) and drop the non-feature columns.
X_train = np.array(X_train.drop(columns=['subset','usrds_id','impnum']))
Step 4. Create the grid for the grid search
Insert the different parameters you want to test when doing the cross validation. For a detailed explanation of each parameter, see the section on the Build_mlp.py script below.
neurons = [16, 32, 64, 128]
layers = [1, 2]
kernel_regularizer = ['l2']
dropout_rate = [ 0.1, 0.2, 0.4, 0.5, 0.6]
learn_rate = [.001, .0001, .0002]
activation = ['relu', 'sigmoid', 'tanh']
optimizer = ['Adam']
output_bias = [None]
epochs = [10, 20] # 1mill/256=4000 steps for one pass thru dataset
batches = [512, 256]
Step 5. Set parameters for early stopping (to optimize time)
Early stopping for the epochs based on the auc_pr (area under the precision-recall curve).
early_stopping = tf.keras.callbacks.EarlyStopping(
monitor='auc_pr' ,
verbose=1,
patience=10,
mode='max',
restore_best_weights=True)
Step 6. Build layers
Use the Keras wrapper for scikit-learn and our custom build_mlp function, imported from the custom python script build_mlp.py.
weighted_model_skl = KerasClassifier(build_fn=build_mlp,
verbose=0)
Points to consider
Standardization and scaling of numeric features allows for comparison of multiple features in different units. The model will learn the importance of features better and faster when it is not overwhelmed by a feature with a much larger range than the others.
This script is used to create the neural network by building the layers and then compiling the model. This model will consist of a few simple layers (Dense and Dropout) to illustrate the concepts and give an example.
Wrapper function
In the actual notebook, the full code is wrapped in a function (mlp_cv) for ease of running the cross-validation with the different weightings and each imputation dataset. An example of the code for imputation 5 is included below:
mlp_cv(class_weight_20, weight_name=20, imputation=5)
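As a minimal sketch (our assumption about how one might drive it, not verbatim from the notebook), the same call can be looped over the class-weight settings defined in Step 6 below, keeping a single imputation for tuning:
# run the cross-validation once per class weighting (weight dictionaries defined in Step 6 below)
weightings = {'m': class_weight_m, 5: class_weight_5, 10: class_weight_10, 20: class_weight_20}
for weight_name, class_weight in weightings.items():
    mlp_cv(class_weight, weight_name=weight_name, imputation=5)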
Steps for running the build_mlp.py script
Step 1. Import tensorflow packages to create the layers of the network.
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.metrics import AUC
Step 2. Select metrics to measure the loss and the accuracy of the model.
Multiple metrics can be reported as we have done here.
METRICS = [
tf.keras.metrics.TruePositives(name='tp'),
tf.keras.metrics.FalsePositives(name='fp'),
tf.keras.metrics.TrueNegatives(name='tn'),
tf.keras.metrics.FalseNegatives(name='fn'),
tf.keras.metrics.BinaryAccuracy(name='accuracy'),
tf.keras.metrics.Precision(name='precision'),
tf.keras.metrics.Recall(name='recall'),
tf.keras.metrics.AUC(name='auc'),
tf.keras.metrics.AUC(name='auc_pr',
num_thresholds=200,
curve="PR",
summation_method="interpolation",
dtype=None,
thresholds=None,
multi_label=False,
label_weights=None)
]
Step 3. Define a function to take in the parameters for building the network.
def build_mlp(
layers=2,
neurons=16,
output_bias=None,
optimizer='Adam',
activation='relu',
learn_rate=.0002,
dropout_rate=0.2,
kernel_regularizer='l2',
metrics=METRICS
):
if output_bias is not None:
output_bias = tf.keras.initializers.Constant(output_bias)
model = tf.keras.Sequential()
This loop allows us to test one or more identical dense layers, based on the layers argument. Dense (fully connected) layers consist of the number of neurons (nodes) defined in the neurons argument, an activation function defined by activation, an input shape of 294 (this is specific to our data, since we always have 294 features or columns; if a feature is added or removed, this number needs to reflect that), and a kernel regularizer defined by the kernel_regularizer argument.
for i in range(layers):
model.add(Dense(
neurons,
activation=activation,
input_shape=(294,),
kernel_regularizer=kernel_regularizer))
A Dropout layer is used to avoid overfitting when testing the model. The parameter dropout_rate determines the rate.
model.add(Dropout(dropout_rate))
The final layer is a dense layer with a sigmoid activation function to give the probability of the positive class for each sample.
model.add(Dense(
1,
activation='sigmoid',
bias_initializer=output_bias))
Step 4. Set the final arguments for compiling the model.
- Optimizer —This is how the model is updated based on the data it sees and its loss function.
- Loss function —This measures how accurate the model is during training. You want to minimize this function to "steer" the model in the right direction. BinaryCrossentropy() is used because we are training a binary classification model.
- Metrics —Used to monitor the training and testing steps. Here we use multiple metrics that are printed out.
# the hyperparameter grid above only tests 'Adam'; instantiate it directly with the selected learning rate
opt = Adam(lr=learn_rate)
model.compile(
optimizer=opt,
loss=tf.keras.losses.BinaryCrossentropy(),
metrics=metrics)
return model
Step 5. Instantiate the grid search cross validation model
Evaluate our list of parameters using the sci-kit learn package GridSearchCV to run a 5-fold cross validation. This will result in the best combination of these parameters according to our score (which is set to average_precision to optimize precision and recall). Multiple processors can be taken advantage of by setting n_jobs=-1. Training many different parameters will take a significant amount of time that depends on the number of processors and size of the data.
grid = GridSearchCV(
weighted_model_skl,
param_grid=params,
cv=5,
scoring='average_precision',
return_train_score=True,
n_jobs=-1
)
Step 6. Class imbalance and fit the model to the data
When fitting the model to the training data, the class imbalanced can be set using the class_weight parameter.
grid_result = grid.fit(
X_train,
y_train,
class_weight=selected_class_weight,
callbacks=[early_stopping]
)
Note that when testing multiple class weights, this must be done by fitting a new cross-validated model (running the entire script) for each different set of class weights to be tested, rather than as a hyperparameter in the model. Multiple class weights were tested on the data.
- Classes: 0=survived (negative class), 1=died_in_90 (positive class).
total = 1150195
positive_class_count = 86083 #(7.48% of total)
neg_class_count = 1064112 #(92.52% of total)
# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.
weight_for_0 = (1 / neg_class_count)*(total)/2.0
weight_for_1 = (1 / positive_class_count)*(total)/2.0
class_weight_m = {0: weight_for_0, 1: weight_for_1}
- A range of stronger weights for the minority class.
class_weight_5 = {0: 1, 1: 5}
class_weight_10 = {0: 1, 1: 10}
class_weight_20 = {0: 1, 1: 20}
Step 7. Summarize and print results
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Step 8. Save results
Save the result for each set of class weight parameters (to select the one with the best score) and for each dataset (each imputation) and review the best_params_ from each one to check for similarity.
with open('./results/2021_grid_best_params_imp_' + str(imputation) + '_weight_' + str(weight_name) + '.pickle', 'wb') as f:
pickle.dump(grid_result.best_params_, f)
with open('./results/2021_grid_best_auc_imp_' + str(imputation) + '_weight_' + str(weight_name) + '.pickle','wb') as f:
pickle.dump(grid_result.best_score_, f)
with open('./results/2021_grid_cv_results_imp_' + str(imputation) + '_weight_' + str(weight_name) + '.pickle','wb') as f:
pickle.dump(grid_result.cv_results_, f)
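As a minimal sketch (assuming the file-name pattern used in the save step above) of the review mentioned earlier, the saved best_params_ for each class weighting can be loaded and printed side by side:
import pickle

# compare the best hyperparameters found for each class weighting (imputation 5 in this example)
for weight_name in ['m', 5, 10, 20]:
    fname = './results/2021_grid_best_params_imp_5_weight_' + str(weight_name) + '.pickle'
    with open(fname, 'rb') as f:
        print(weight_name, pickle.load(f))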
Points to consider
-
One imputed set of data was used to tune hyperparameters which reduced computational time while effectively tuning the parameters.
-
Due to the severe class imbalance in the dataset, the AUC ROC tends to be high while recall is low. It is well known that precision-recall plots are more informative than AUC ROC plots when training a binary classification model on severely imbalanced data; therefore, the average precision metric from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) was used to tune the MLP model. A small usage sketch follows this list.
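A small usage sketch (not from the notebooks; y_true and y_score are assumed to be the true labels and predicted probabilities for the positive class) contrasting the two metrics:
from sklearn.metrics import average_precision_score, roc_auc_score

# on severely imbalanced data, average precision is usually the more informative of the two
print('ROC AUC:           %.3f' % roc_auc_score(y_true, y_score))
print('average precision: %.3f' % average_precision_score(y_true, y_score))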
Steps for running the 3_mlp_final_model.ipynb script
This script trains and evaluates the final MLP model based on the best parameters from the hyperparameter tuning/cross-validation step.
- Input:
complete1.pickle
complete2.pickle
complete3.pickle
complete4.pickle
complete5.pickle
numeric_columns.pickle
- Output: x=[1-5]
2021_MLP_final_results_imp_x.pickle
2021_MLP_final_model_imp_x.h5
2021_MLP_final_eval_imp_x.pickle
2021_MLP_final_fpr.pickle
2021_MLP_final_tpr.pickle
2021_MLP_final_auc.pickle
2021_MLP_final_ytest_all.pickle
2021_MLP_final_ypred_all.pickle
2021_MLP_final_prec.pickle
2021_MLP_final_recall.pickle
2021_MLP_final_avgprec_thresh.pickle
2021_MLP_final_avgprec_score.pickle
Step 1. Install and import libraries
!pip install --upgrade scikit-learn
!pip install pandas
!pip install pyyaml h5py
!pip install seaborn
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
import sklearn
from sklearn.metrics import auc, average_precision_score, roc_curve, precision_recall_curve, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
# other libraries
import numpy as np
import pandas as pd
import sys
import pickle
#plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Step 2. Import list of numeric columns
Get list of columns to be scaled
with open('numeric_columns.pickle', 'rb') as f:
nu_cols = pickle.load(f)
Step 3. Import data
Import the data from a single imputation.
imp=5
with open('complete' + str(imp) + '.pickle', 'rb') as f:
dataset = pickle.load(f)
Separate the training set (subsets 1-4).
X_train = dataset[dataset.subset <= 4].copy().sort_values(by = 'usrds_id')
Separate the labels.
y_train = np.array(X_train.pop('died_in_90'))
Separate the validation set (subsets 5-6). This validation set gets used as the 'test' set of data while the model is being trained to avoid ever 'looking' at the test set until we evaluate the model.
X_val = dataset[(dataset.subset == 6) | (dataset.subset == 5)].copy().sort_values(by = 'usrds_id')
Separate labels.
y_val = np.array(X_val.pop('died_in_90'))
Separate the test set (subsets 7-9) for evaluating the model to see how well it performs on new data. Sorting by usrds_id is important so that we can later calculate fairness (or you could just run the predictions again and save the order).
X_test = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id')
Separate labels.
y_test = np.array(X_test.pop('died_in_90'))
Scale only the numeric columns with the sklearn StandardScaler by fitting the scaler on the training set, then use it to transform the validation and test sets. Then drop the non-feature columns and save each set as an array.
scaler = StandardScaler()
X_train[nu_cols] = scaler.fit_transform(X_train[nu_cols])
X_train = np.array(X_train.drop(columns=['subset','usrds_id','impnum']))
X_val[nu_cols] = scaler.transform(X_val[nu_cols])
X_val = np.array(X_val.drop(columns=['subset','usrds_id','impnum']))
X_test[nu_cols] = scaler.transform(X_test[nu_cols])
X_test = np.array(X_test.drop(columns=['subset','usrds_id','impnum']))
Step 4. Build the final model
Build the final model based on the best hyperparameters from the previous cross-validation step. First set the metric(s) to report.
METRICS = [
tf.keras.metrics.TruePositives(name='tp'),
tf.keras.metrics.FalsePositives(name='fp'),
tf.keras.metrics.TrueNegatives(name='tn'),
tf.keras.metrics.FalseNegatives(name='fn'),
tf.keras.metrics.BinaryAccuracy(name='accuracy'),
tf.keras.metrics.Precision(name='precision'),
tf.keras.metrics.Recall(name='recall'),
tf.keras.metrics.AUC(name='auc'),
tf.keras.metrics.AUC(name='auc_pr',
num_thresholds=200,
curve="PR",
summation_method="interpolation",
dtype=None,
thresholds=None,
multi_label=False,
label_weights=None)
]
Define a function to create the model layers and compile the model. Insert the best hyperparameters from the cross-validation tuning. For details on each line, see the above explanation of the build_mlp.py script.
def final_build_mlp(
layers=2,
neurons=16,
output_bias=None,
optimizer='Adam',
activation='relu',
learn_rate=.0002,
dropout_rate=0.2,
kernel_regularizer='l2',
metrics=METRICS
):
model = tf.keras.Sequential()
#add 2 dense layers
for i in range(layers):
model.add(Dense(
neurons,
activation=activation,
input_shape=(294,),
kernel_regularizer=kernel_regularizer)
)
model.add(Dropout(dropout_rate))
model.add(Dense(
1,
activation='sigmoid',
bias_initializer=output_bias))
opt = Adam(lr=learn_rate)
model.compile(
optimizer=opt,
loss=tf.keras.losses.BinaryCrossentropy(),
metrics=metrics)
return model
Step 5. Instantiate the final model
final_model = final_build_mlp()
Step 6. Train (fit) the final model
Set the optimal parameters for fitting the model based on results from the cross-validation.
class_weight_10 = {0: 1, 1: 10}
epochs_final = 10
batches = 256
Fit the model to the training and validation data.
final_history = final_model.fit(
X_train,
y_train,
batch_size=batches,
epochs=epochs_final,
validation_data=(X_val, y_val),
class_weight=class_weight_10)
Step 7. Get predictions on the training set from the model
train_predictions_final = final_model.predict(
X_train,
batch_size=batches
)
Step 8. Evaluate the final model
Evaluate the final model that has been trained on the new data (test set).
test_predictions_final = final_model.predict(
X_test,
batch_size=batches
)
final_eval = final_model.evaluate(
X_test,
y_test,
batch_size=batches,
verbose=1
)
Print results of test set.
res = {}
for name, value in zip(final_model.metrics_names, final_eval):
    print(name, ': ', value)
    res[name] = value
Points to consider
Standardization and scaling of numeric features allows for comparison of multiple features in different units. The model will learn the importance of features better and faster when it isn't overwhelmed by a feature with a much larger range than the others. Scaling and standardization should be done separately for training and test sets so that we are not looking at the test set.
Neural network models are computationally expensive and there are many parameters to tune and decisions to make. It is beneficial to consult with other experts and test the code before running the full set of data. Multiple CPUs or GPUs can speed up computation time for the cross validation hyperparameter tuning step.
Steps for running 4_pool_results_from_imputations.ipynb
- Input:
medexpreesrd
2021_MLP_final_ytest_all.pickle
2021_MLP_final_ypred_all.pickle
- Output:
2021_final_MLP_model_test_pred_proba_pooled.pickle
Step 1. Import libraries
import pickle
import numpy as np
import pandas as pd
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine
Step 2. Import results from the MLP model
with open('./results/2021_MLP_final_ytest_all.pickle', 'rb') as f:
y_true_mlp = pickle.load(f)
with open('./results/2021_MLP_final_ypred_all.pickle', 'rb') as picklefile:
y_pred_proba_mlp = pickle.load(picklefile)
Step 3. Reformat into a pandas dataframe. This makes the data easier to work with.
The index for the values is zero here because we only saved the predictions for the positive class; otherwise it would be 1.
pooled = pd.DataFrame()
pooled['imp1']=y_pred_proba_mlp[0][:,0]
pooled['imp2']=y_pred_proba_mlp[1][:,0]
pooled['imp3']=y_pred_proba_mlp[2][:,0]
pooled['imp4']=y_pred_proba_mlp[3][:,0]
pooled['imp5']=y_pred_proba_mlp[4][:,0]
Step 4. Calculate the mean and standard deviation of the predicted probability for the positive class (died_in_90) for each patient/row.
imp_cols = ['imp1', 'imp2', 'imp3', 'imp4', 'imp5']
# compute the mean and standard deviation across the five imputations only,
# so the newly added score column is not included in the standard deviation
pooled['score'] = pooled[imp_cols].mean(axis=1)
pooled['std_'] = pooled[imp_cols].std(axis=1)
Step 5. Import the details from the original data
Enter your credentials for your postgres database.
con = create_engine('postgresql://username:password@location/dbname')
dataset = pd.read_sql_query('''SELECT usrds_id, died_in_90, subset FROM medxpreesrd;''', con)
Step 6. Keep only the test set and sort the rows so they are in the same order used when the MLP predictions were generated.
dataset = dataset[dataset.subset > 6].copy().sort_values(by = 'usrds_id').reset_index(drop=True)
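Because the merge in the next step joins purely on row position, a quick check before merging can confirm that the pooled predictions and the reloaded test rows line up (a minimal sanity check, not part of the original notebook):
# the pooled predictions and the test rows are matched by row position,
# so at a minimum their lengths should agree before merging
assert len(pooled) == len(dataset), 'pooled predictions and test rows are misaligned'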
Step 7. Merge the details with the pooled predictions.
pooled = pooled.merge(dataset, left_index=True, right_index=True)
Step 8. Save
with open('./results/2021_final_MLP_model_test_pred_proba_pooled.pickle', 'wb') as picklefile:
pickle.dump(pooled, picklefile)
Steps for running the 5_plot_mlp_roc_auc.ipynb script
- Input:
2021_final_MLP_model_test_pred_proba_pooled.pickle
- Output:
mlp_roc_auc_bw.png
2021_mlp_confusion_matrix.csv
Step 1. Import libraries
import pandas as pd
import numpy as np
import pickle
import sys
#path to the functions directory
sys.path.append('../onc_functions/')
#import custom plotting functions
from plot_functions import onc_calc_cm, onc_plot_roc, onc_plot_precision_recall
Step 2. Load the pooled model results
with open('./results/2021_final_MLP_model_test_pred_proba_pooled.pickle', 'rb') as f:
results = pickle.load(f)
results = results.loc[:,['score','died_in_90','subset','usrds_id']]
results = results.rename(columns={'died_in_90':'y'})
Step 3. Plot the ROC AUC. The function onc_plot_roc is imported from /onc_functions/plot_functions.py and is reproduced below for reference.
# imports used inside plot_functions.py
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

def onc_plot_roc(y_true, y_pred, model_name, **kwargs):
    '''
    Plot the ROC curve and report the test ROC AUC.
    INPUT: y_true, y_pred, model_name, **kwargs
    '''
    # calculate values for the plot
    false_positives, true_positives, threshold = roc_curve(y_true, y_pred)
    c_roc_auc_score = auc(false_positives, true_positives)
    # set figure params
    fig1 = plt.figure(1, figsize=(12, 30), dpi=400)
    ax1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2)
    # plot reference line for chance
    ax1.plot([0, 1], [0, 1], linestyle='--', lw=2, color='gray',
             label='Chance', alpha=.8)
    # plot the ROC curve
    ax1.plot(false_positives, true_positives,
             label=r'ROC (AUC = %0.3f)' % (c_roc_auc_score),
             lw=2, alpha=.8, color='k')
    # additional figure params
    ax1.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],)
    ax1.legend(loc="lower right")
    plt.xlabel('1-Specificity')
    plt.ylabel('Sensitivity')
    plt.rc('axes', labelsize=22)   # fontsize of the x and y labels
    plt.rc('xtick', labelsize=15)  # fontsize of the tick labels
    plt.rc('ytick', labelsize=15)  # fontsize of the tick labels
    plt.rc('legend', fontsize=20)  # legend fontsize
    # save plot
    plt.savefig(model_name + "_roc_auc_bw.png", dpi=400, transparent=True)
    plt.show()
onc_plot_roc(
y_true=results.y,
y_pred=results.score,
model_name='mlp');
Step 4. Print and save the performance metrics at multiple thresholds
# imports used inside plot_functions.py
from sklearn.metrics import confusion_matrix, recall_score

def onc_calc_cm(y_true, y_predictions, range_probas=[0.1, 0.5]):
    '''
    Calculate the confusion matrix and scores for multiple thresholds
    '''
    df = pd.DataFrame(index=range_probas,
                      columns=['threshold', 'sensitivity', 'specificity',
                               'likelihood_ratio_neg', 'likelihood_ratio_pos',
                               'tp', 'fp', 'tn', 'fn', 'total_survived', 'total_deceased',])
    for proba_threshold in range_probas:
        cm = confusion_matrix(y_true, y_predictions > proba_threshold)
        tn = cm[0][0]
        fp = cm[0][1]
        sensitivity = recall_score(y_true, y_predictions > proba_threshold)
        specificity = tn / (tn + fp)
        df.loc[proba_threshold, "threshold"] = proba_threshold
        df.loc[proba_threshold, "sensitivity"] = sensitivity
        df.loc[proba_threshold, "specificity"] = specificity
        df.loc[proba_threshold, "likelihood_ratio_neg"] = (1 - sensitivity) / specificity
        df.loc[proba_threshold, "likelihood_ratio_pos"] = sensitivity / (1 - specificity)
        df.loc[proba_threshold, "tp"] = cm[1][1]
        df.loc[proba_threshold, "fp"] = fp
        df.loc[proba_threshold, "tn"] = tn
        df.loc[proba_threshold, "fn"] = cm[1][0]
        df.loc[proba_threshold, "total_survived"] = np.sum(cm[0])
        df.loc[proba_threshold, "total_deceased"] = np.sum(cm[1])
    return df
cm = onc_calc_cm(
results.y,
results.score,
range_probas=[.10,.20, .30, .40, .50])
cm.to_csv('./results/2021_mlp_confusion_matrix.csv')
cm
ML models can perform differently for different categories of patients, so the MLP model was assessed for fairness, or how well the model performs for each category of interest (demographics—sex, race, and age—as well as initial dialysis modality). Age is binned into the following categories based on UCSF clinician input and an example in literature: 18-25, 26-35, 36-45, 46-55, 56-65, 66-75, 76-85, 86+. The USRDS predefined categories for race, sex, and dialysis modality were used for the fairness assessment.
Steps for running the 6_mlp_fairness_assessment.ipynb script
This script calculates the ROC AUC for specific groups of patients to assess the fairness of the final model.
- Input:
medexpreesrd
2021_final_MLP_model_test_pred_proba_pooled.pickle
- Output:
complete_fair1.pickle
2021_mlp_fairness.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import sys
#path to the functions directory
sys.path.append('../onc_functions/')
import psycopg2
from sqlalchemy import create_engine
from fairness import get_fairness_assessment
Step 2. Get the columns of data required for the fairness assessment
con = create_engine('postgresql://username:password@location/dbname')
df = pd.read_sql_query('''SELECT usrds_id, died_in_90, inc_age, sex, dialtyp, race, hispanic, subset FROM medxpreesrd;''', con)
Step 3. Save
with open('complete_fair1.pickle', 'wb') as picklefile:
pickle.dump(df, picklefile)
Step 4. Import the pooled results from the MLP model
with open('./results/2021_final_MLP_model_test_pred_proba_pooled.pickle', 'rb') as f:
proba = pickle.load(f)
Step 5. Merge the fairness details with the results
data = df.merge(proba, on=['usrds_id','subset','died_in_90'])
Step 6. Calculate fairness. The function get_fairness_assessment is imported from the /onc_functions/fairness.py file. This function calculates AUC and the confusion matrix from the model prediction scores for specific groups.
# imports used inside fairness.py
from sklearn.metrics import roc_auc_score, confusion_matrix

def get_fairness_assessment(data,
                            y_proba_col_name,
                            y_true_col_name):
    # work on a copy of the merged data
    df = data.copy()
    # turn the continuous age variable into age categories
    df['agegroup'] = pd.cut(df.inc_age,
                            bins=[17, 25, 35, 45, 55, 65, 75, 85, 90],
                            labels=[1, 2, 3, 4, 5, 6, 7, 8])
    df = df.drop(columns=['inc_age'])
    # replace NaNs with a large number that does not appear in the data,
    # effectively creating another category for missing values
    df.loc[:, ['race', 'dialtyp', 'hispanic']] = df.loc[:, ['race', 'dialtyp', 'hispanic']].fillna(100.0, axis=1).copy()
    # identify the columns for the fairness assessment
    fairness_cols = ['agegroup', 'sex', 'dialtyp', 'race', 'hispanic']
    # loop through all categories and values to get counts, AUC, and the confusion matrix
    rows_list = []
    for col in fairness_cols:
        for name, c in df.groupby(col):
            fairness_dict = {}
            fairness_dict['Feature'] = col
            fairness_dict['Value'] = name
            fairness_dict['Count'] = c.shape[0]
            fairness_dict['AUC'] = roc_auc_score(c[y_true_col_name], c[y_proba_col_name])
            tn, fp, fn, tp = confusion_matrix(y_true=c[y_true_col_name],
                                              y_pred=np.where(c[y_proba_col_name] >= 0.5, 1, 0)).ravel()
            fairness_dict['TN'] = tn
            fairness_dict['FP'] = fp
            fairness_dict['FN'] = fn
            fairness_dict['TP'] = tp
            rows_list.append(fairness_dict)
    # convert results from a list to a dataframe
    df_fairness = pd.DataFrame(rows_list)
    return df_fairness
Step 7. Calculate the assessment and save the results.
fairness = get_fairness_assessment(data,
y_proba_col_name='score',
y_true_col_name='died_in_90'
)
fairness.to_csv('./results/2021_mlp_fairness.csv')
Points to consider
Performing the fairness assessment on the categories of interest gives additional insight into how the model performs for different patient groups (by demographics, etc.). Future researchers should perform fairness assessments to better evaluate model performance, especially for models that may be deployed in a clinical setting. Other ways of assessing fairness include evaluating true positives, sensitivity, positive predictive value, and similar metrics at various thresholds across the groups of interest, which allows selection of a threshold that balances model performance across groups, as sketched below.
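A minimal sketch of that thresholded, per-group view (illustrative only, not part of the project code; the column names follow the merged fairness data used above, and group_metrics_by_threshold is a hypothetical helper):
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

def group_metrics_by_threshold(df, group_col, y_true_col='died_in_90',
                               y_proba_col='score',
                               thresholds=(0.1, 0.3, 0.5)):
    # For each group value and threshold, compute sensitivity and positive
    # predictive value so a threshold balancing the groups can be chosen.
    rows = []
    for value, grp in df.groupby(group_col):
        for t in thresholds:
            tn, fp, fn, tp = confusion_matrix(
                grp[y_true_col], (grp[y_proba_col] >= t).astype(int),
                labels=[0, 1]).ravel()
            rows.append({group_col: value,
                         'threshold': t,
                         'sensitivity': tp / (tp + fn) if (tp + fn) else np.nan,
                         'ppv': tp / (tp + fp) if (tp + fp) else np.nan})
    return pd.DataFrame(rows)

# example usage with the merged data from Step 5 of the fairness notebook
metrics_by_sex = group_metrics_by_threshold(data, 'sex')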
Steps for running the 7_mlp_risk.ipynb script
- Input:
complete_fair1.pickle
2021_final_MLP_model_test_pred_proba_pooled.pickle
- Output:
2021_mlp_risk_cat.csv
Step 1. Import libraries
import numpy as np
import pandas as pd
import pickle
import sys
#path to the functions directory
sys.path.append('../onc_functions/')
from risk import get_risk_categories
print('python-' + sys.version)
import datetime
dte = datetime.datetime.now()
dte = dte.strftime("%Y%m%d")
Step 2. Import the details from the fairness assessment
with open('complete_fair1.pickle','rb') as f:
details = pickle.load(f)
Step 3. Import the pooled results from the model
with open('./results/2021_final_MLP_model_test_pred_proba_pooled.pickle', 'rb') as f:
y_pred = pickle.load(f)
Step 4. Merge the details with the results
data = details.merge(y_pred, on=['usrds_id','died_in_90','subset'])
Step 5. Calculate risk. The function get_risk_categories is imported from the /onc_functions/risk.py file.
def get_risk_categories(dataset, y_proba_col_name, y_true_col_name):
    test_x_pd = dataset[dataset.subset > 6].copy().sort_values(by='usrds_id')
    del dataset
    df = test_x_pd.loc[:, [y_true_col_name, y_proba_col_name]]
    # construct the risk categories from the predicted score
    df['risk_categories'] = pd.cut(df[y_proba_col_name],
                                   bins=[-0.1, 0.09, 0.19, 0.29, 0.39, 0.49, 0.59, 0.69, 0.79, 0.89, 0.99],
                                   labels=['0-0.09', '0.1-0.19', '0.2-0.29', '0.3-0.39', '0.4-0.49',
                                           '0.5-0.59', '0.6-0.69', '0.7-0.79', '0.8-0.89', '0.9-0.99'])
    # loop through all the risk categories to count the observed outcomes in each
    risk_list = []
    for name, c in df.groupby('risk_categories'):
        risk_dict = {}
        risk_dict['Risk Category'] = name
        risk_dict['Count'] = c[y_true_col_name].shape[0]
        risk_dict['Count Died in 90'] = c[y_true_col_name].sum()
        risk_dict['Count Survived'] = c[y_true_col_name].shape[0] - c[y_true_col_name].sum()
        risk_dict['Percent Died in 90'] = c[y_true_col_name].sum() / c[y_true_col_name].shape[0]
        risk_list.append(risk_dict)
    df_risk = pd.DataFrame(risk_list)
    return df_risk
Run the function above
risk_cat = get_risk_categories(data,
y_proba_col_name='score',
y_true_col_name='died_in_90')
Step 6. Save
risk_cat.to_csv('./results/' + str(dte) + '_mlp_risk_cat.csv')