This job loads data into an i2b2 database. The ontology must already exist when the job is invoked.
This document assumes some familiarity with the i2b2 database schema.
COLUMN_MAP_FILE
-- Required. Specifies the location of the column mapping file. Either a full path or (preferably) a relative path to a directory namedi2b2
located in the same directory as the params file.WORD_MAP_FILE
-- Optional. Specifies the location of the word map file. It is specified in the same way asCOLUMN_MAP_FILE
.SOURCE_SYSTEM
-- Mandatory. Value with which to populate the columnsourcesystem_cd
. Important because the non-incremental job will delete all existing data that shares the samesourcesystem_cd
. Maximum size is 50 characters.VISIT_IDE_SOURCE
-- Optional. The source for the visit identifiers. Together with the visit identifier, they represent a global identifier for the visit. If not specified, defaults toUNSPECIFIED
. Maximum size is 50 characters.PATIENT_IDE_SOURCE
-- Optional. The source for the patient identifiers. Together with the patient identifiers, they represent a global identifier for the patient. If not specified, defaults toUNSPECIFIED
. Maximum size is 50 characters.PROVIDER_PATH
-- Optional. An identifier, typically backslash- or slash-separated, that describes how the provider fits into some organizational hierarchy. Together with the provider id, they represent a global identifier for the provider. If not specified, it defaults to/
. Maximum size is 700 characters.DATE_FORMAT
-- Optional. How the dates will be formatted in the parameters file and in the data files. It is an English locale boundSimpleDateFormatter
format. If not specified, an ISO8601 format is assumed.DOWNLOAD_DATE
-- Optional. The value with which to populate thedownload_date
administrative column, present in all i2b2 tables. Refers to the time the input data was obtained from the source system. The format is that specified throughDATE_FORMAT
.INCREMENTAL
-- Optional. Whether to activate the incremental mode. UseY
for yes andN
for no. Defaults toN
.
The job executes the following steps in order:
- Load column mappings from the column mappings file.
- Load word mappings from the word mappings file, if it exists.
- Load enumeration values from
i2b2demodata.code_lookup
. - First Pass -- validation of data given the information in the column mapping file and identification of sets of patients, visits and providers (including creating implicit visits and the implicit provider, if necessary).
- (non-incremental) Delete existing facts and patient mappings, visit
mappings, patients, visits and providers associated with the same
sourcesystem_cd
as the one for the running job. (incremental) Fetch the ids of already loaded patients and visits (overlapping with those found in the first pass) and identify which providers found in the first pass already exist in the database. - Give out new ids to patients and visits identified in the first pass (except those already existing, during an incremental run).
- Insert new patients, visits and providers (non-incremental run: all identified in the first pass) into the database (without details).
- Second pass -- insert facts and update details of patients, visits and providers.
The column mapping file should be a TSV file with columns named FILENAME
,
COLUMN_NUMBER
, MANDATORY
, VARIABLE
, TYPE
and UNIT
(preferably in this
order). Example:
FILENAME | COLUMN_NUMBER | MANDATORY | VARIABLE | TYPE | UNIT |
---|---|---|---|---|---|
main_data.tsv | 3 | true | PAT:EID | ||
main_data.tsv | 4 | false | PRO:EID | ||
main_data.tsv | 5 | false | VIS:EID | ||
main_data.tsv | 1 | true | START_DATE | ||
main_data.tsv | 2 | false | END_DATE | ||
main_data.tsv | 7 | true | CON:C1-TEXT | text | |
main_data.tsv | 15 | false | MOD:1 | text | |
main_data.tsv | 7 | true | CON:C1-TEXT | text | |
main_data.tsv | 16 | false | MOD:1 | text | |
main_data.tsv | 8 | false | CON:C2-BLOB | blob | |
main_data.tsv | 9 | true | CON:C3-NUMBER | number | m/s |
main_data.tsv | 10 | false | CON:C4-NLP | nlp | |
main_data.tsv | 11 | true | PAT:sex | ||
visit_data.tsv | 1 | true | VIS:EID | ||
visit_data.tsv | 2 | true | PAT:EID | ||
visit_data.tsv | 3 | false | VIS:start_date | ||
provider_data.tsv | 1 | true | PRO:EID | ||
provider_data.tsv | 2 | true | PRO:name_char |
The relative path of the data file in relation to an i2b2
directory existing
alongside the params file or an absolute path. The data file should be a TSV
file with a header. The header is ignored (it's only for the user's
convenience).
The index of the column of the TSV file referred to in the FILENAME
column.
This index starts at 1. Mandatory.
Whether there must be non-empty data in the referred to column of the data file
(after word mapping has been applied). Must be true
or false
.
To which variable data in the given column of the given file should be bound. There are several types of variables available:
CON:<concept code>
orCON:<concept path>
: represents a fact associated with an existing concept (row ofconcept_dimension
). The concept will be checked for existence. In this case, theTYPE
column must have a value.MOD:<modifier code>
: represents a modifier-associated fact (an ancillary fact). These facts complement the concept facts of the preceding concept variable (CON:<>
). These variable mappings must satisfy the conditionp
defined as either 1) the MOD:<> variable mapping must be after aCON:<>
variable mapping or 2) after anotherMOD:<>
variable mapping that satisfiesp
("after" meaning in a later row in the mapping file). ACON:<>
variable and its followingMOD:<>
variables will be inserted with an identical primary key intoobservation_fact
, except for the value of themodifier_cd
column. In particular, they will share the same value forconcept_cd
andinstance_num
.start_date
: the value of thestart_date
column of facts inserted based on data of the same row of the data file. Required if there areCON:<>
variables mapped in the same data file. Otherwise ignored. Represents the date that the observations started.end_date
: the value of theend_date
column of facts inserted based on data of the same row of the data file. Always optional. Represents the date that the observations ended.PAT:EID
,VIS:EID
andPRO:EID
: the ids of the patients, visits and providers, as referred to in the source system.PAT:VITAL_STATUS
,PAT:BIRTH_DATE
,PAT:SEX
,PAT:AGE_IN_YEARS_NUM
,PAT:LANGUAGE
,PAT:RACE
,PAT:MARITAL_STATUS
,PAT:RELIGION
,PAT:ZIP
,PAT:STATECITYZIP_PATH
,PAT:INCOME
,PAT:BLOB
: variables mapped to columns inpatient_dimension
.VIS:ACTIVE_STATUS
,VIS:START_DATE
,VIS:END_DATE
,VIS:INOUT
,VIS:LOCATION
,VIS:LOCATION_PATH
,VIS:LENGTH_OF_STAY
andVIS:BLOB
: variables mapped to column invisit_dimension
.PRO:NAME_CHAR
,PRO:BLOB
: variables mapped to columns inprovider_dimension
.
The variable identifiers used in this column are not case sensitive, except for the concept codes, concept paths and modifier codes.
This column is mandatory.
One of text
, number
, blob
or nlp
, representing each of the four fact
data types in i2b2. Used exclusively (and mandatory) with CON:<>
variables
and MOD:<>
variables.
ATTENTION: There is currently no validation against the preexisting concept
metadata. For instance, if a concept's metadata indicates
that the concept is a positive integer, the user will still be able to upload
text for that concept, as long as he provides the type text
in the TYPE
column of the mapping file. This is an area for future improvement.
Value that will be written to the units_cd
column in observation_fact
.
Optional, and only for CON:<>
and MOD:<>
variables, since these are the only
ones that trigger writes in observation_fact
.
This column is optional.
- The variables
start_date
andPAT:EID
are mandatory if there are concept variables (CON:<>
) mapped to one or more columns in the same data file. These variables must have theMANDATORY
column set to true. - Except for
CON:<>
andMOD:<>
, the variables cannot be repeated in the same file. - The set of
PAT:<>
(exceptPAT:EID
), ofVIS:<>
(exceptVIS:EID
) and ofPRO:<>
variables (exceptPRO:EID
) all cannot have their members spread across different data files. For instance,PAT:VITAL_STATUS
cannot be mapped in one data file andPAT:SEX
in another. - Data files with
PAT:<>
variables associated must have also aPAT:EID
variable mapped. - Data files with
VIS:<>
variables associated must have also mapped aPAT:EID
variable and either a) astart_date
variable (for visit auto-generation) or b) a mandatoryVIS:EID
variable. - Data files with
VIS:EID
variables must also have aPAT:EID
variable. - Concept variables can be repeated in the same file, but only if a) they are associated with different columns or 2) they have a different set of modifier variables or 3) they have the same of modifier variables, but at least one is associated with different columns of the data file.
- Modifier variables cannot be repeated for the same instance of a
CON:<>
variable. In order to repeat a modifier for the same concept, the concept has to be repeated as well (and hence create a new fact group, meaning a set of facts whose primary key differs only in the value ofmodifier_cd
, each fact group having a distinctinstance_num
).
Each PAT:<>
, VIS:<>
and PRO:<>
variable also has its own validation rules.
The word mapping file is a TSV file with four columns, all of which are
mandatory: FILENAME
, COLUMN_NUMBER
, FROM
and TO
. Example:
FILENAME | COLUMN_NUMBER | FROM | TO |
---|---|---|---|
main_data.tsv | 12 | (empty blob) | |
main_data.tsv | 11 | DEM | |
main_data.tsv | 11 | male | DEM |
main_data.tsv | 11 | female | DEM |
main_data.tsv | 10 | <a>to be replaced by null</a> |
The word mapping functionality provides a trivial value replacement. It instructs the runtime to replace a certain value found in a certain column of a data file with another one. The replacement is done before the validations are applied.
The first pass checks the sanity of the data files and collects some data about dimensions needed for the subsequent steps.
Besides conformance of the variable values with the local validation rules of each variable, there are other validations, most of which are more global in nature. Starting with the former:
- The
PAT:<>
,VIS:<>
andPRO:<>
(non:EID
) variables have variable rules. They are either of type:- date (in which case their values must conform to the
DATE_FORMAT
), - string (with variable maximum length),
- integer (with a minimum value in case of
PAT:AGE_IN_YEARS_NUM
). - enumeration (either hard-coded or with their possible values looked up in
code_lookup
).
- date (in which case their values must conform to the
- Variable mappings marked with
MANDATORY = true
must find only non-empty values (after word mapping). - Dates (
start_date
andend_date
) must conform toDATE_FORMAT
. - As for fact variables (
CON:<>
andMOD:<>
),- numbers have to match the pattern
(?<operator><=|<|!=|=|>=|>)?(?<number>[^|]+)(?:|(?<flag>).+)?
. Thenumber
capture group must be acceptable by the constructor of `BigDecimal (see the grammar). Addditionally the scale of the decimal cannot be larger than 5 (otherwise information will be lost) and its magnitude cannot be larger than 13, - values of text variables cannot be longer than 255 characters,
- values of nlp variables must be well-formed XML.
- numbers have to match the pattern
It is also checked that:
- Lines read from the data files have at least as many cells as the largest mapped column index in the respective file.
- Details for a certain visit, patient or provider are seen only once. That is,
in files with
PAT:<>
,VIS:<>
andPRO:<>
(non:EID
) variable mappings, the identifier of the entity (obtained either through the:EID
variable or through auto-generation) is seen only once. - A given visit identifier does not co-occur with different patient identifiers (both in the case where one data file has a certain pair and later in that same file or in another file a different pair with the same visit identifier is present and in the case, during incremental loading, where the database has a certain association and another one appears in the data files).
If a visit identifier or a provider identifier are needed (typically to insert facts, but also to insert visit and even provider details), then these can be auto-generated.
The visit identifiers are generated with the pattern <patient id>@<date>
where
the date is in the YYYY-MM-dd format.
There is only one provider identifier auto-generated, and its form is Provider for <source system>
, where the source system code is that provided in the
SOURCE_SYSTEM
parameter.
The incremental loading mode is enabled by setting INCREMENTAL=y
in the
.params
file. It allows loading new facts and loading new and updating new and
existing patients, visits and providers. No rows will be removed.
A limitation of this mode is that the new facts will need to have different
combinations of (patient, visit, provider, concept, start_date). It is not
possible to either replace existing facts with updated ones or append new facts
with the same tuples (and avoiding a PK conflict by increasing the
instance_num
).