Skip to content

Latest commit

 

History

History
306 lines (250 loc) · 15.4 KB

i2b2.md

File metadata and controls

306 lines (250 loc) · 15.4 KB

i2b2

This job loads data into an i2b2 database. The ontology must already exist when the job is invoked.

This document assumes some familiarity with the i2b2 database schema.

Available parameters

  • COLUMN_MAP_FILE -- Required. Specifies the location of the column mapping file. Either a full path or (preferably) a relative path to a directory named i2b2 located in the same directory as the params file.
  • WORD_MAP_FILE -- Optional. Specifies the location of the word map file. It is specified in the same way as COLUMN_MAP_FILE.
  • SOURCE_SYSTEM -- Mandatory. Value with which to populate the column sourcesystem_cd. Important because the non-incremental job will delete all existing data that shares the same sourcesystem_cd. Maximum size is 50 characters.
  • VISIT_IDE_SOURCE -- Optional. The source for the visit identifiers. Together with the visit identifier, they represent a global identifier for the visit. If not specified, defaults to UNSPECIFIED. Maximum size is 50 characters.
  • PATIENT_IDE_SOURCE -- Optional. The source for the patient identifiers. Together with the patient identifiers, they represent a global identifier for the patient. If not specified, defaults to UNSPECIFIED. Maximum size is 50 characters.
  • PROVIDER_PATH -- Optional. An identifier, typically backslash- or slash-separated, that describes how the provider fits into some organizational hierarchy. Together with the provider id, they represent a global identifier for the provider. If not specified, it defaults to /. Maximum size is 700 characters.
  • DATE_FORMAT -- Optional. How the dates will be formatted in the parameters file and in the data files. It is an English locale bound SimpleDateFormatter format. If not specified, an ISO8601 format is assumed.
  • DOWNLOAD_DATE -- Optional. The value with which to populate the download_date administrative column, present in all i2b2 tables. Refers to the time the input data was obtained from the source system. The format is that specified through DATE_FORMAT.
  • INCREMENTAL -- Optional. Whether to activate the incremental mode. Use Y for yes and N for no. Defaults to N.

Overview

The job executes the following steps in order:

  1. Load column mappings from the column mappings file.
  2. Load word mappings from the word mappings file, if it exists.
  3. Load enumeration values from i2b2demodata.code_lookup.
  4. First Pass -- validation of data given the information in the column mapping file and identification of sets of patients, visits and providers (including creating implicit visits and the implicit provider, if necessary).
  5. (non-incremental) Delete existing facts and patient mappings, visit mappings, patients, visits and providers associated with the same sourcesystem_cd as the one for the running job. (incremental) Fetch the ids of already loaded patients and visits (overlapping with those found in the first pass) and identify which providers found in the first pass already exist in the database.
  6. Give out new ids to patients and visits identified in the first pass (except those already existing, during an incremental run).
  7. Insert new patients, visits and providers (non-incremental run: all identified in the first pass) into the database (without details).
  8. Second pass -- insert facts and update details of patients, visits and providers.

Column Mapping File

The column mapping file should be a TSV file with columns named FILENAME, COLUMN_NUMBER, MANDATORY, VARIABLE, TYPE and UNIT (preferably in this order). Example:

FILENAME COLUMN_NUMBER MANDATORY VARIABLE TYPE UNIT
main_data.tsv 3 true PAT:EID
main_data.tsv 4 false PRO:EID
main_data.tsv 5 false VIS:EID
main_data.tsv 1 true START_DATE
main_data.tsv 2 false END_DATE
main_data.tsv 7 true CON:C1-TEXT text
main_data.tsv 15 false MOD:1 text
main_data.tsv 7 true CON:C1-TEXT text
main_data.tsv 16 false MOD:1 text
main_data.tsv 8 false CON:C2-BLOB blob
main_data.tsv 9 true CON:C3-NUMBER number m/s
main_data.tsv 10 false CON:C4-NLP nlp
main_data.tsv 11 true PAT:sex
visit_data.tsv 1 true VIS:EID
visit_data.tsv 2 true PAT:EID
visit_data.tsv 3 false VIS:start_date
provider_data.tsv 1 true PRO:EID
provider_data.tsv 2 true PRO:name_char

FILENAME

The relative path of the data file in relation to an i2b2 directory existing alongside the params file or an absolute path. The data file should be a TSV file with a header. The header is ignored (it's only for the user's convenience).

COLUMN_NUMBER

The index of the column of the TSV file referred to in the FILENAME column. This index starts at 1. Mandatory.

MANDATORY

Whether there must be non-empty data in the referred to column of the data file (after word mapping has been applied). Must be true or false.

VARIABLE

To which variable data in the given column of the given file should be bound. There are several types of variables available:

  • CON:<concept code> or CON:<concept path>: represents a fact associated with an existing concept (row of concept_dimension). The concept will be checked for existence. In this case, the TYPE column must have a value.
  • MOD:<modifier code>: represents a modifier-associated fact (an ancillary fact). These facts complement the concept facts of the preceding concept variable (CON:<>). These variable mappings must satisfy the condition p defined as either 1) the MOD:<> variable mapping must be after a CON:<> variable mapping or 2) after another MOD:<> variable mapping that satisfies p ("after" meaning in a later row in the mapping file). A CON:<> variable and its following MOD:<> variables will be inserted with an identical primary key into observation_fact, except for the value of the modifier_cd column. In particular, they will share the same value for concept_cd and instance_num.
  • start_date: the value of the start_date column of facts inserted based on data of the same row of the data file. Required if there are CON:<> variables mapped in the same data file. Otherwise ignored. Represents the date that the observations started.
  • end_date: the value of the end_date column of facts inserted based on data of the same row of the data file. Always optional. Represents the date that the observations ended.
  • PAT:EID, VIS:EID and PRO:EID: the ids of the patients, visits and providers, as referred to in the source system.
  • PAT:VITAL_STATUS, PAT:BIRTH_DATE, PAT:SEX, PAT:AGE_IN_YEARS_NUM, PAT:LANGUAGE, PAT:RACE, PAT:MARITAL_STATUS, PAT:RELIGION, PAT:ZIP, PAT:STATECITYZIP_PATH, PAT:INCOME, PAT:BLOB: variables mapped to columns in patient_dimension.
  • VIS:ACTIVE_STATUS, VIS:START_DATE, VIS:END_DATE, VIS:INOUT, VIS:LOCATION, VIS:LOCATION_PATH, VIS:LENGTH_OF_STAY and VIS:BLOB: variables mapped to column in visit_dimension.
  • PRO:NAME_CHAR, PRO:BLOB: variables mapped to columns in provider_dimension.

The variable identifiers used in this column are not case sensitive, except for the concept codes, concept paths and modifier codes.

This column is mandatory.

TYPE

One of text, number, blob or nlp, representing each of the four fact data types in i2b2. Used exclusively (and mandatory) with CON:<> variables and MOD:<> variables.

ATTENTION: There is currently no validation against the preexisting concept metadata. For instance, if a concept's metadata indicates that the concept is a positive integer, the user will still be able to upload text for that concept, as long as he provides the type text in the TYPE column of the mapping file. This is an area for future improvement.

UNIT

Value that will be written to the units_cd column in observation_fact. Optional, and only for CON:<> and MOD:<> variables, since these are the only ones that trigger writes in observation_fact.

This column is optional.

Validations

  • The variables start_date and PAT:EID are mandatory if there are concept variables (CON:<>) mapped to one or more columns in the same data file. These variables must have the MANDATORY column set to true.
  • Except for CON:<> and MOD:<>, the variables cannot be repeated in the same file.
  • The set of PAT:<> (except PAT:EID), of VIS:<> (except VIS:EID) and of PRO:<> variables (except PRO:EID) all cannot have their members spread across different data files. For instance, PAT:VITAL_STATUS cannot be mapped in one data file and PAT:SEX in another.
  • Data files with PAT:<> variables associated must have also a PAT:EID variable mapped.
  • Data files with VIS:<> variables associated must have also mapped a PAT:EID variable and either a) a start_date variable (for visit auto-generation) or b) a mandatory VIS:EID variable.
  • Data files with VIS:EID variables must also have a PAT:EID variable.
  • Concept variables can be repeated in the same file, but only if a) they are associated with different columns or 2) they have a different set of modifier variables or 3) they have the same of modifier variables, but at least one is associated with different columns of the data file.
  • Modifier variables cannot be repeated for the same instance of a CON:<> variable. In order to repeat a modifier for the same concept, the concept has to be repeated as well (and hence create a new fact group, meaning a set of facts whose primary key differs only in the value of modifier_cd, each fact group having a distinct instance_num).

Each PAT:<>, VIS:<> and PRO:<> variable also has its own validation rules.

Word Mapping File

The word mapping file is a TSV file with four columns, all of which are mandatory: FILENAME, COLUMN_NUMBER, FROM and TO. Example:

FILENAME COLUMN_NUMBER FROM TO
main_data.tsv 12 (empty blob)
main_data.tsv 11 DEM
main_data.tsv 11 male DEM
main_data.tsv 11 female DEM
main_data.tsv 10 <a>to be replaced by null</a>

The word mapping functionality provides a trivial value replacement. It instructs the runtime to replace a certain value found in a certain column of a data file with another one. The replacement is done before the validations are applied.

First Pass Validations

The first pass checks the sanity of the data files and collects some data about dimensions needed for the subsequent steps.

Besides conformance of the variable values with the local validation rules of each variable, there are other validations, most of which are more global in nature. Starting with the former:

  • The PAT:<>, VIS:<> and PRO:<> (non :EID) variables have variable rules. They are either of type:
    • date (in which case their values must conform to the DATE_FORMAT),
    • string (with variable maximum length),
    • integer (with a minimum value in case of PAT:AGE_IN_YEARS_NUM).
    • enumeration (either hard-coded or with their possible values looked up in code_lookup).
  • Variable mappings marked with MANDATORY = true must find only non-empty values (after word mapping).
  • Dates (start_date and end_date) must conform to DATE_FORMAT.
  • As for fact variables (CON:<> and MOD:<>),
    • numbers have to match the pattern (?<operator><=|<|!=|=|>=|>)?(?<number>[^|]+)(?:|(?<flag>).+)?. The number capture group must be acceptable by the constructor of `BigDecimal (see the grammar). Addditionally the scale of the decimal cannot be larger than 5 (otherwise information will be lost) and its magnitude cannot be larger than 13,
    • values of text variables cannot be longer than 255 characters,
    • values of nlp variables must be well-formed XML.

It is also checked that:

  • Lines read from the data files have at least as many cells as the largest mapped column index in the respective file.
  • Details for a certain visit, patient or provider are seen only once. That is, in files with PAT:<>, VIS:<> and PRO:<> (non :EID) variable mappings, the identifier of the entity (obtained either through the :EID variable or through auto-generation) is seen only once.
  • A given visit identifier does not co-occur with different patient identifiers (both in the case where one data file has a certain pair and later in that same file or in another file a different pair with the same visit identifier is present and in the case, during incremental loading, where the database has a certain association and another one appears in the data files).

Identifier Generation

If a visit identifier or a provider identifier are needed (typically to insert facts, but also to insert visit and even provider details), then these can be auto-generated.

The visit identifiers are generated with the pattern <patient id>@<date> where the date is in the YYYY-MM-dd format.

There is only one provider identifier auto-generated, and its form is Provider for <source system>, where the source system code is that provided in the SOURCE_SYSTEM parameter.

Incremental Mode

The incremental loading mode is enabled by setting INCREMENTAL=y in the .params file. It allows loading new facts and loading new and updating new and existing patients, visits and providers. No rows will be removed.

A limitation of this mode is that the new facts will need to have different combinations of (patient, visit, provider, concept, start_date). It is not possible to either replace existing facts with updated ones or append new facts with the same tuples (and avoiding a PK conflict by increasing the instance_num).