Biorj import and export

jdesaphy edited this page Oct 23, 2024 · 3 revisions

Biorj theory

Biorj is a data exchange file format for BioRels. Unlike many other file formats, it does not follow a rigid layout in which every field is predefined. Instead, it is based on a JSON representation of the database, or more precisely of the scientific concept you wish to export together with its dependencies. Let’s start with a bit of theory and context.

Biorj assay export example

Let’s take the example of exporting experimental data from an assay. As shown in the image above, experimental data operates at an L9 dependency level, meaning there are eight layers of critical dependencies that must be resolved before we can successfully export it. If you need a refresher on dependencies, please refer to the "Understanding Dependencies" section.

The preceding level is the Cell-Based Assay, which contains metadata about the assay type. This level has critical dependencies on several other concepts, including the BioAssay Ontology (L1), Molecular Entity (L5), Protein Target (L7), and Genetic Target (L5). Each of these concepts also has its own critical dependencies, encompassing protein, disease, molecular structure, and gene concepts. Therefore, when exporting any experimental data, all related data from these critical dependencies must be extracted.

In terms of database schema and relationships, a critical dependency can be represented in two ways: either through a foreign key linking to the dependent table or via a mapping table that contains foreign keys to both related tables. In the Biorj format, which we will discuss later, we will focus on these critical dependency tables, referred to as Parents (P).

In addition, some concepts can carry additional metadata. An assay, for example, can have publications associated with it. We usually refer to those as non-critical or related dependencies, since they augment the data but are not required. Such information can also be extracted and saved. In the Biorj format, those are called children.

If we wish to export experimental data, we do not want to spell out all the children and parents of the experimental data concept, which would include all 9 levels of dependencies and their non-critical dependencies. To reduce this burden, we can group tables into Blocks. A Publication block provides the rules to extract the publication information, but also its authors, institutions, journal and abstract. Then, if we want to extract an assay that has reported publications, we simply define in the assay block that we also want to extract a publication block, so that all publication information is extracted as well.

Biorj rules

Now that we have explained the concepts, let’s look at the rule set defined in $TG_DIR/BACKEND/SCRIPT/BIORJ/BIORJ_RULES. We will first focus on an example with publications:

BLOCK pmid_entry
PARENT pmid_journal
CHILD pmid_author_map<pmid_author<pmid_instit
CHILD pmid_abstract
END

A BLOCK is defined by the name of the main table for a scientific concept, in this case pmid_entry for a publication. A publication has one critical dependency, the journal it is published in, so pmid_journal is a PARENT of pmid_entry. The abstract of a publication is a non-critical dependency and is therefore defined as a CHILD of pmid_entry. Since a publication can have multiple authors, BioRels defines a mapping table, pmid_author_map, to bridge publications to their authors. In addition, an author is assigned to an institution (pmid_instit).
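As an illustration only, the block syntax can be read into a simple structure with a few lines of Python. This is a minimal sketch and not the BioRels parser: the function name `parse_blocks` and the dictionary layout are assumptions made for this example.

```python
def parse_blocks(text):
    """Parse BLOCK ... END sections into {main_table: {"PARENT": [...], "CHILD": [...]}}.

    Hypothetical helper for illustration; BioRels ships its own reader.
    """
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "BLOCK":
            current = rest.strip()
            blocks[current] = {"PARENT": [], "CHILD": []}
        elif keyword == "END":
            current = None
        elif keyword in ("PARENT", "CHILD") and current is not None:
            # Keep the raw path; it may chain several tables with '<' or '>'
            blocks[current][keyword].append(rest.strip())
    return blocks


RULES = """
BLOCK pmid_entry
PARENT pmid_journal
CHILD pmid_author_map<pmid_author<pmid_instit
CHILD pmid_abstract
END
"""

blocks = parse_blocks(RULES)
print(blocks["pmid_entry"]["PARENT"])
print(blocks["pmid_entry"]["CHILD"])
```

Keeping PARENT and CHILD paths in separate lists mirrors the distinction between critical and non-critical dependencies.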

Please note the < character. It is very important, as it defines the flow of the data. When you want to export a publication, BioRels first searches the pmid_entry table. Once it finds the publication, it searches for that publication ID in pmid_author_map. The results from pmid_author_map are used to retrieve the pmid_author records, which in turn are used to retrieve pmid_instit. Those retrieval rules are based on the foreign key constraints defined in BioRels.
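The retrieval order described above can be made explicit with a short sketch. It assumes that each < simply separates one table from the next one to visit; `retrieval_order` is a hypothetical name, and the actual lookups in BioRels rely on the foreign key constraints rather than on this list.

```python
def retrieval_order(main_table, path):
    """Return the tables in the order they are visited for one rule path.

    Assumption: '<' only separates one table from the next table to query.
    """
    return [main_table] + [table for table in path.split("<") if table]


order = retrieval_order("pmid_entry", "pmid_author_map<pmid_author<pmid_instit")

# Print each hop: the records found in the first table drive the search in the second
for found_in, searched_next in zip(order, order[1:]):
    print(f"{found_in} -> {searched_next}")
```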

BLOCK assay_entry
PARENT assay_cell<cell_entry:E
PARENT assay_cell<taxon
PARENT assay_tissue<anatomy_entry:E
PARENT taxon
PARENT source
PARENT assay_confidence
PARENT assay_type
PARENT assay_target<taxon
PARENT assay_target<assay_target_type
PARENT assay_target>assay_target_genetic_map<assay_genetic<taxon
PARENT assay_target>assay_target_genetic_map<assay_genetic<gene_seq:E
PARENT assay_target>assay_target_genetic_map<assay_genetic<transcript:E
PARENT assay_target>assay_target_protein_map<assay_protein<prot_seq<prot_entry:E
PARENT assay_target>assay_target_protein_map<assay_protein<gn_entry:E
CHILD assay_pmid<pmid_entry:E
CHILD activity_entry<source
CHILD activity_entry<bioassay_onto_entry
CHILD activity_entry<molecular_entity:E
PARENT assay_variant<prot_seq<prot_entry:E

Now let’s take the example of an assay. As before, we define a block with the main table for an assay: assay_entry. An assay has several simple critical dependencies, such as taxon, source, assay_confidence and assay_type. However, it also has more complex critical dependencies that require longer rules. assay_cell is a table defining the type of cell lines used in assays, as provided by ChEMBL. It is itself critically dependent on the cell_entry table, which defines all cell lines. Please note that cell_entry is followed by “:E”. This E stands for Entry and triggers all the data export rules defined in the BLOCK cell_entry. This makes it possible to extract all the critical and related dependencies of cell_entry without having to define all the rules again.
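To make the effect of :E more concrete, here is a minimal sketch of how a block could pull in the rules of another block. The function `tables_for_export` and the `blocks` dictionary are assumptions for this example, only a subset of the assay_entry rules is used, and the cell_entry rules are left empty because they are not shown on this page.

```python
def tables_for_export(block_name, blocks, tables=None, expanded=None):
    """Collect every table reached from a block, following ':E' into other blocks.

    `blocks` maps a main table to its rule paths (PARENT and CHILD mixed).
    Hypothetical structure, for illustration only.
    """
    tables = set() if tables is None else tables
    expanded = set() if expanded is None else expanded
    tables.add(block_name)
    if block_name in expanded:      # guard against blocks that reference each other
        return tables
    expanded.add(block_name)
    for path in blocks.get(block_name, []):
        # The direction marker does not matter for collecting table names
        for token in path.replace(">", "<").split("<"):
            table, _, suffix = token.partition(":")
            if suffix == "E":
                tables_for_export(table, blocks, tables, expanded)  # reuse that block
            else:
                tables.add(table)
    return tables


blocks = {
    "assay_entry": ["assay_cell<cell_entry:E", "assay_pmid<pmid_entry:E"],
    "pmid_entry": ["pmid_journal", "pmid_author_map<pmid_author<pmid_instit", "pmid_abstract"],
    "cell_entry": [],  # placeholder: the cell_entry rules are not listed here
}

for table in sorted(tables_for_export("assay_entry", blocks)):
    print(table)
```

With this subset, exporting assay_entry also reaches pmid_journal, pmid_author and pmid_instit, even though the assay block never names them.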

PARENT assay_target>assay_target_genetic_map<assay_genetic<taxon
PARENT assay_target>assay_target_genetic_map<assay_genetic<gene_seq:E
PARENT assay_target>assay_target_genetic_map<assay_genetic<transcript:E

These rules are particularly interesting. An assay record has a critical dependency on assay_target. However, assay_target_genetic_map is a mapping table between assay_target and assay_genetic, and as such the foreign keys to both tables are located in assay_target_genetic_map. The foreign key between assay_target and assay_target_genetic_map therefore points in the opposite direction, which is why the rule uses > instead of < at that step. gene_seq and transcript are both followed by :E, so their critical and non-critical dependencies are retrieved by calling their respective blocks.
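A rule path that mixes < and > can be split into steps that keep track of the marker in front of each table and of the :E suffix. The sketch below only tokenises the path; `split_path` is a hypothetical helper, and how BioRels interprets each marker internally is not covered here.

```python
import re


def split_path(path):
    """Split a rule path into (table, marker, expand) steps.

    marker is the '<' or '>' written before the table (None for the first table);
    expand is True when the table carries the ':E' suffix.
    """
    steps = []
    marker = None
    # The capturing group keeps the '<' and '>' separators in the result
    for token in re.split(r"([<>])", path.strip()):
        if token in ("<", ">"):
            marker = token
            continue
        table, _, suffix = token.partition(":")
        steps.append((table, marker, suffix == "E"))
    return steps


steps = split_path("assay_target>assay_target_genetic_map<assay_genetic<gene_seq:E")
for table, marker, expand in steps:
    print(table, marker, "expands its block" if expand else "")
```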

CHILD assay_pmid<pmid_entry:E

Finally, here is an example of a non-critical dependency: assay_pmid lists all the publications associated with an assay. Thanks to :E, this rule calls the pmid_entry block, which retrieves all of the publication’s metadata.

Biorj requirements and unique keys

For this process to work properly, the databases used for export and import must meet a few requirements:

  • The schema must be identical. This applies to both the table names and the column names

  • Foreign keys must be identical. Their names, however, do not matter

  • The schema name doesn’t matter

  • The unique keys for a given table must be identical
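The first requirement can be pictured as a comparison of table and column names. The sketch below assumes each database has already been summarised as a dictionary mapping a table name to its set of column names; how that summary is obtained is left out, `schema_differences` is a hypothetical name, and the column sets are deliberately partial.

```python
def schema_differences(source, target):
    """List the differences between two {table: set_of_columns} descriptions."""
    problems = []
    for table in sorted(set(source) | set(target)):
        if table not in source or table not in target:
            problems.append(f"{table}: table missing from one database")
        elif source[table] != target[table]:
            # Symmetric difference: columns present on one side only
            columns = ", ".join(sorted(source[table] ^ target[table]))
            problems.append(f"{table}: columns differ ({columns})")
    return problems


# Illustrative, partial column sets; the schema name plays no role in the comparison
source = {
    "anatomy_entry": {"anatomy_tag"},
    "anatomy_syn": {"anatomy_entry_id", "syn_type", "syn_value", "source_id"},
}
target = {
    "anatomy_syn": {"anatomy_entry_id", "syn_type", "syn_value"},
}

problems = schema_differences(source, target)
for problem in problems:
    print(problem)
```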

The list of foreign keys is generated on the fly during Biorj import and export. However, the list of unique keys must be defined in $TG_DIR/BACKEND/SCRIPT/BIORJ/BIORJ_RULES.

KEYS
activity_entry molecular_entity_id|assay_entry_id|value|unit_type
anatomy_entry anatomy_tag
anatomy_extdb source_id|anatomy_extdb|anatomy_entry_id
anatomy_syn anatomy_entry_id|syn_type|syn_value|source_id
assay_confidence description

Each line is made of two columns. The first column gives the table name, while the second column lists the column names of that table, separated by |, which together uniquely define a record for that table.
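Reading the KEYS section amounts to splitting each line once on whitespace and then on |. The following is a minimal sketch under that assumption; `parse_keys` is a hypothetical helper, not the function used by BioRels.

```python
def parse_keys(text):
    """Parse the KEYS section into {table: [column, ...]}."""
    keys = {}
    in_keys = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "KEYS":
            in_keys = True
            continue
        if line == "END":
            in_keys = False
            continue
        if in_keys and line:
            # First column: table name; second column: '|'-separated column names
            table, columns = line.split(None, 1)
            keys[table] = columns.split("|")
    return keys


KEYS_TEXT = """
KEYS
activity_entry molecular_entity_id|assay_entry_id|value|unit_type
anatomy_entry anatomy_tag
anatomy_extdb source_id|anatomy_extdb|anatomy_entry_id
anatomy_syn anatomy_entry_id|syn_type|syn_value|source_id
assay_confidence description
END
"""

keys = parse_keys(KEYS_TEXT)
for table, columns in keys.items():
    print(table, columns)
```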

Adding rules to Biorj

Depending on your use case, you might need to add a few rules to the BIORJ_RULES file. The following questions should point you in the right direction:

  • Are you adding a new data source that creates a new scientific concept? => Add block

  • Are you modifying any column that can be used to uniquely identify a record? => Update key

  • Are you changing the critical dependencies of the main table (foreign key)? => Update rules

  • Are you adding a new table that isn’t a new scientific concept? => Update rules

Adding a Biorj block

If you are extending the BioRels database schema, thank you for your contribution! You are now at the stage where you need to test the export/import of BioRels for your tables. If you have added a new scientific concept, you will need to create a new BLOCK.

BLOCK [TABLE_NAME]

A BLOCK starts with the keyword BLOCK followed by the main table of the scientific concept, i.e. the one that uniquely defines that concept. For instance, for a gene it would be gn_entry; for a taxon, it would be taxon. Next, you need to define the PARENT lines, which describe the critical dependencies of this concept. To do so, look at the foreign keys defined in your main table and list all the referenced tables. If those tables are not the main table of a scientific concept, follow the path of foreign keys until you reach a main table or there are no more foreign keys. Main tables can be found by looking at the table names of the different BLOCKs defined in BIORJ_RULES. If you reach a main table, add the suffix :E to include the metadata associated with it.

Next, you need to define the CHILD lines, which provide the list of non-critical dependencies for this new scientific concept. Similarly, follow the path of foreign keys until you reach a main table or there are no more foreign keys.

The last line of the BLOCK should be END.

In addition, you will need to update the unique keys, as described in the next section.
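Before testing a new block, a quick layout check can catch simple mistakes such as a missing END or a keyword glued to its table name. The sketch below is one possible check and not part of BioRels: `lint_blocks` is a hypothetical helper, it only handles BLOCK sections (not KEYS), and `my_concept_entry` and `my_concept_pmid` are made-up table names.

```python
def lint_blocks(text):
    """Return (line_number, message) pairs for lines that break the BLOCK layout."""
    problems = []
    open_block = None
    number = 0
    for number, raw in enumerate(text.splitlines(), start=1):
        parts = raw.split()
        if not parts:
            continue
        keyword = parts[0]
        if keyword == "BLOCK":
            if open_block is not None:
                problems.append((number, f"block {open_block} is missing END"))
            open_block = parts[1] if len(parts) > 1 else "<unnamed>"
        elif keyword == "END":
            if open_block is None:
                problems.append((number, "END without a BLOCK"))
            open_block = None
        elif keyword in ("PARENT", "CHILD"):
            if open_block is None:
                problems.append((number, f"{keyword} outside of a BLOCK"))
            elif len(parts) != 2:
                problems.append((number, f"{keyword} expects exactly one path"))
        else:
            problems.append((number, f"unknown keyword {keyword!r}"))
    if open_block is not None:
        # Reported on the last line read, since that is where END was expected
        problems.append((number, f"block {open_block} is missing END"))
    return problems


# Draft block with two deliberate mistakes: 'PARENTtaxon' and no END line
DRAFT = "\n".join([
    "BLOCK my_concept_entry",
    "PARENT source",
    "PARENTtaxon",
    "CHILD my_concept_pmid<pmid_entry:E",
])

problems = lint_blocks(DRAFT)
for line_number, message in problems:
    print(line_number, message)
```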

Update Biorj keys

To work properly, Biorj needs to know which columns in a given table uniquely define a record. As several unique keys and foreign keys can be defined for a given table, we need to list manually which one to use. The KEYS section of the BIORJ_RULES file is a list of key/value pairs, where the key is the table name in column 1 and the value is the list of columns, separated by |, that uniquely define a record.

If you have modified a table, please verify that the list of columns is correct.

If you have created a new table, please add the table with the list of columns at the end of the KEYS block, before the END line.

KEYS
activity_entry molecular_entity_id|assay_entry_id|value|unit_type
anatomy_entry anatomy_tag
anatomy_extdb source_id|anatomy_extdb|anatomy_entry_id
anatomy_syn anatomy_entry_id|syn_type|syn_value|source_id
assay_confidence description
[TABLE_NAME] [LIST_OF_COLUMNS_UNIQUELY_DEFINING_A_RECORD_SEP_BY_|]
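After editing the rules, it can be useful to check that every table named in a block also has a KEYS entry. The sketch below assumes that this is required for each table appearing in a rule path; `missing_keys` and `tables_in_paths` are hypothetical helpers. Because only the five KEYS lines quoted on this page are used, the tables reported as missing are simply absent from this excerpt, not necessarily from the real file.

```python
def tables_in_paths(paths):
    """Return every table name used in a list of rule paths."""
    tables = set()
    for path in paths:
        for token in path.replace(">", "<").split("<"):
            tables.add(token.partition(":")[0])  # drop the ':E' suffix if present
    return tables


def missing_keys(main_table, paths, keys):
    """List the tables of a block that have no entry in the KEYS mapping."""
    needed = {main_table} | tables_in_paths(paths)
    return sorted(needed - set(keys))


# KEYS excerpt from this page, as {table: [columns]}
keys = {
    "activity_entry": ["molecular_entity_id", "assay_entry_id", "value", "unit_type"],
    "anatomy_entry": ["anatomy_tag"],
    "anatomy_extdb": ["source_id", "anatomy_extdb", "anatomy_entry_id"],
    "anatomy_syn": ["anatomy_entry_id", "syn_type", "syn_value", "source_id"],
    "assay_confidence": ["description"],
}

# A few rule paths taken from the assay_entry block
paths = ["assay_confidence", "assay_tissue<anatomy_entry:E", "activity_entry<source"]

missing = missing_keys("assay_entry", paths, keys)
print(missing)
```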

Update rules

If you are modifying the foreign key relationships of a table, you will need to update the corresponding Biorj rules. To do so, please locate the corresponding scientific concept associated with the table and ensure the relationships are properly defined in Biorj rules.