Molecular entity

jdesaphy edited this page Oct 23, 2024 · 2 revisions

As drug and molecule structures become increasingly complex, establishing a level of abstraction in our database is essential. Scientists have developed remarkable molecules, such as siRNA, antibody-drug conjugates, and AAVs encapsulated with payloads. However, representing and unifying these diverse structures in a clear and comprehensive manner poses significant challenges.

In Biorels, a molecular entity serves as an abstract concept that represents what is tested in vitro or in vivo. Each molecular entity comprises one or more components, each with a defined molarity ratio. Components provide a second layer of abstraction, allowing us to define different sets of molecules, again with specified molarity ratios. A component can consist of one or more molecules, which may be connected covalently or through other means. We will illustrate this with some examples for clarity.

A small molecule is the simplest case. It forms a molecular component containing one molecule with a molarity ratio of 1, and a molecular entity containing that single component, also with a molarity ratio of 1.

As a second example, an siRNA conjugated with a small molecule is defined by a single molecular component made up of two molecules: the siRNA and the small molecule. Assuming a 1:1 ratio, each molecule has a molarity ratio of 0.5. The molecular entity is then defined by this single component with a molarity ratio of 1.

In contrast, an antibody-drug conjugate (ADC) encapsulated in a lipid nanoparticle (LNP) bound to a delivery peptide represents a more complex case. Here, the ADC acts as a molecular component composed of two or three molecules: the antibody, the drug, and possibly a linker. The LNP serves as another molecular component, incorporating the individual molecules of the LNP formulation as well as the peptide. Consequently, the molecular entity is defined by two components: the ADC component and the LNP component, each with its respective molarity ratio.
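The two-level abstraction described above can be pictured with a small Python sketch. The class and field names (`Molecule`, `Component`, `Entity`) are hypothetical and chosen for illustration only; they are not the Biorels schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Molecule:
    name: str  # e.g. "siRNA", "small molecule"


@dataclass
class Component:
    # Each pair is (molecule, molarity ratio of that molecule in the component).
    molecules: List[Tuple[Molecule, float]]


@dataclass
class Entity:
    # Each pair is (component, molarity ratio of that component in the entity).
    components: List[Tuple[Component, float]]


# A plain small molecule: one component holding one molecule, both ratios 1.
small_molecule_entity = Entity([(Component([(Molecule("small molecule"), 1.0)]), 1.0)])

# An siRNA conjugated 1:1 with a small molecule: one component, two molecules.
conjugate = Component([(Molecule("siRNA"), 0.5), (Molecule("small molecule"), 0.5)])
sirna_entity = Entity([(conjugate, 1.0)])
```

The ADC-in-LNP case would follow the same pattern with two `Component` objects in one `Entity`; its ratios are left out here because the text does not specify them.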

To streamline this process and minimize redundancy, we have developed a protocol for uniquely identifying the various modalities, components, and entities.

Small molecule standardization

First, we will need to define some vocabulary:

  • The initial molecule with its counterions is called the initial entry.

  • If a SMILES string contains multiple molecules, the molecule with the longest string is considered the main molecule. The other molecules are considered counterions.

  • The SMILES of the initial entry (main molecule + counterions) is called the Full SMILES.

  • The SMILES of the counterions is called the Counterion SMILES.

  • The SMILES of the main molecule is called the Molecular SMILES.

  • A value that has been standardized is marked with the suffix (s).
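The "longest string" rule can be sketched in a few lines of Python. The function name `split_full_smiles` is hypothetical, and the tie-breaking behaviour (the first of several equally long fragments wins) is an assumption, since the text does not say how ties are resolved.

```python
from typing import List, Tuple


def split_full_smiles(full_smiles: str) -> Tuple[str, List[str]]:
    """Split a dot-separated Full SMILES into (main molecule, counterions)."""
    fragments = full_smiles.split(".")
    # The fragment with the longest string is taken as the main molecule.
    # Assumption: on a tie, the first longest fragment is kept.
    main = max(fragments, key=len)
    counterions = list(fragments)
    counterions.remove(main)  # removes a single occurrence only
    return main, counterions


main, counterions = split_full_smiles("CC(=O)[O-].[Na+]")
# main is "CC(=O)[O-]" and counterions is ["[Na+]"]
```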

To insert small molecules into the database, two input files are required: a SMILES file for the main molecules and a SMILES file for the counterions. The main SMILES file must follow the format below:

FULL_SMILES[space]ID|InChI|InChI-Key|Counterions|Main_Molecule
  • FULL_SMILES: SMILES string of the complete molecule, including counterions.

  • ID: identifier from the data source.

  • InChI: InChI string generated from the SMILES string. You can provide it, but it will be regenerated for consistency. If you do not provide it, set it to NULL.

  • InChI-Key: InChI-Key string generated from the SMILES string. You can provide it, but it will be regenerated for consistency. If you do not provide it, set it to NULL.

  • Counterions: list of counterions, separated by a dot (.). If there are none, set it to NULL.

  • Main_Molecule: SMILES string of the main molecule, i.e. the longest molecule string in FULL_SMILES.
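As an illustration of the input format, the following Python sketch parses one line into its six fields. `SmilesRecord` and `parse_line` are hypothetical names, and the sketch assumes that no field contains a pipe and that the SMILES itself contains no space.

```python
from typing import NamedTuple, Optional


class SmilesRecord(NamedTuple):
    full_smiles: str
    source_id: str
    inchi: Optional[str]
    inchi_key: Optional[str]
    counterions: Optional[str]
    main_molecule: str


def _none_if_null(value: str) -> Optional[str]:
    # The literal string NULL marks a value that was not provided.
    return None if value == "NULL" else value


def parse_line(line: str) -> SmilesRecord:
    # The first space separates the SMILES from the pipe-delimited fields.
    full_smiles, rest = line.rstrip("\n").split(" ", 1)
    source_id, inchi, inchi_key, counterions, main = rest.split("|")
    return SmilesRecord(
        full_smiles,
        source_id,
        _none_if_null(inchi),
        _none_if_null(inchi_key),
        _none_if_null(counterions),
        main,
    )


rec = parse_line("CC(=O)[O-].[Na+] SRC1|NULL|NULL|[Na+]|CC(=O)[O-]")
```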

The counterion file is defined as follows:

Counterion_smiles[space]counterion_smiles

The next steps apply a series of standardizations. First, we standardize FULL_SMILES using LillyMol. The (s) suffix marks a standardized value.

FULL_SMILES(s)[space]ID|InChI|InChI-Key|Counterions|Main_Molecule

LillyMol then generates two files: one containing all molecules that were successfully standardized, and one containing molecules with a standardization issue. The next step is to swap the FULL_SMILES and Main_Molecule columns and add a column (T/F) recording standardization success (FULL_MOL_STD).

Main_Molecule[space]ID|InChI|InChI-Key|Counterions|FULL_SMILES(s)|FULL_MOL_STD(T/F)
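The column swap can be sketched as a plain string operation in Python. This is only an illustration of the resulting line layout: `swap_columns` is a hypothetical helper, and how the pipeline actually derives the success flag from the two LillyMol output files is not shown.

```python
def swap_columns(line: str, standardized: bool) -> str:
    """Move Main_Molecule to the front, FULL_SMILES(s) to the end, append T/F."""
    full_smiles_std, rest = line.rstrip("\n").split(" ", 1)
    source_id, inchi, inchi_key, counterions, main = rest.split("|")
    flag = "T" if standardized else "F"
    return f"{main} {source_id}|{inchi}|{inchi_key}|{counterions}|{full_smiles_std}|{flag}"


swapped = swap_columns("CC(=O)[O-].[Na+] SRC1|NULL|NULL|[Na+]|CC(=O)[O-]", True)
```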

Next, RDKit generates the InChI and InChI-Key for the standardized FULL_SMILES.

Main_Molecule[space]ID|InChI(s)|InChI-Key(s)|Counterions|FULL_SMILES(s)|FULL_MOL_STD(T/F)

Finally, the main molecule is standardized using LillyMol.

Main_Molecule(s)[space]ID|InChI(s)|InChI-Key(s)|Counterions|FULL_SMILES(s)|FULL_MOL_STD(T/F)

At this point, everything except the counterions has been standardized.

Counterion definition

The counterion file then goes through a separate standardization process in LillyMol, and the counterions are registered in the database. A mapping between each non-standardized counterion and its standardized record in the database is maintained. A counterion record is uniquely defined by its standardized SMILES string. When a counterion record contains multiple counterions, they are ordered alphabetically after standardization, and the alphabetically ordered standardized SMILES is saved in the SM_COUNTERION table.
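A minimal Python sketch of the ordering step is shown below. The helper name `counterion_key` is hypothetical. Joining the sorted SMILES with a dot is an assumption, and Python's default string ordering (by code point) may differ from the collation used by the actual pipeline or database.

```python
from typing import Iterable


def counterion_key(standardized_counterions: Iterable[str]) -> str:
    # Sort the standardized SMILES alphabetically, then join them with dots
    # (assumption) so the same set of counterions always yields the same string.
    return ".".join(sorted(standardized_counterions))


key = counterion_key(["[Na+]", "[Cl-]"])
```

Because the input order no longer matters, `counterion_key(["[Cl-]", "[Na+]"])` yields the same string.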

Main Molecule definition

After standardization, the standardized SMILES of the main molecule is stored in SM_MOLECULE. This standardized SMILES uniquely defines the main molecule.

Molecule entry

A Molecule entry stores the initial entry, i.e. the combination of a main molecule and its counterion(s), if any. A Molecule entry is uniquely defined by the following:

  • The standardized SMILES for the main molecule

  • The standardized SMILES of the counterion(s) or NULL if none

  • The InChI generated from the standardized full SMILES

  • The InChI-Key generated from the standardized full SMILES

An MD5 hash is generated from the combination of all four values, separated by underscores:

Hash=md5(INCHI.'_'.INCHI_KEY.'_'.Main_molecule(s).'_'.COUNTERION(s));
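For readers working outside PHP, the same construction can be sketched in Python with `hashlib`. The function name `entry_hash` and the `null_token` parameter are hypothetical. How a missing counterion is represented inside the hashed string is not stated in the text, so the literal "NULL" used here is an assumption, as is the UTF-8 encoding.

```python
import hashlib
from typing import Optional


def entry_hash(inchi: str, inchi_key: str, main_smiles: str,
               counterion_smiles: Optional[str], null_token: str = "NULL") -> str:
    # Assumption: a missing counterion is replaced by null_token before hashing.
    counterion = counterion_smiles if counterion_smiles is not None else null_token
    # Same order as the formula: InChI, InChI-Key, main molecule, counterion.
    payload = "_".join([inchi, inchi_key, main_smiles, counterion])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Illustrative inputs (acetic acid, no counterion).
h = entry_hash(
    "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",
    "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    "CC(=O)O",
    None,
)
```

The result is a 32-character hexadecimal string that changes if any of the four inputs changes.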

Important

The InChI and InChI-Key must be generated from the SMILES of the initial entry, but after standardization.

The information about the molecule entry can then be stored in SM_ENTRY as follows:

  • Sm_molecule_id: Foreign key to SM_molecule defining the main molecule

  • SM_counterion_id: Foreign key to SM_counterion defining the counterions – NULL if none

  • InChi: Generated InChi string

  • InChi-Key: Generated Inchi-Key string

  • Md5_hash: Generated Hash

  • Full_smiles: Standardized full smiles

Note

The full SMILES is not necessary to uniquely define a molecule entry. However, it is useful for visualization and data extraction.

Small molecule as molecular component

Once the molecule is registered in SM_ENTRY, we can register it as a molecular entity. First, it must be registered as a molecular component. A molecular component is an abstract representation of a molecule, or a set of molecules, covalently linked or not. For a small molecule, the corresponding molecular component consists of just one molecule: the small molecule itself. When multiple molecules are involved, we must provide the molar fraction of each individual molecule in the mixture. The sum of all molar fractions must be 1.
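The sum-to-one constraint is easy to check before insertion. The Python sketch below is one way to do it; the function name `check_molar_fractions` and the tolerance value are assumptions, chosen to absorb floating-point rounding.

```python
import math
from typing import Sequence


def check_molar_fractions(fractions: Sequence[float], tol: float = 1e-9) -> None:
    """Raise ValueError unless every fraction is in (0, 1] and they sum to 1."""
    if not fractions:
        raise ValueError("a component needs at least one molecule")
    if any(f <= 0 or f > 1 for f in fractions):
        raise ValueError("each molar fraction must be in (0, 1]")
    if not math.isclose(sum(fractions), 1.0, abs_tol=tol):
        raise ValueError(f"molar fractions sum to {sum(fractions)}, expected 1")


check_molar_fractions([1.0])       # single small molecule
check_molar_fractions([0.5, 0.5])  # 1:1 conjugate
```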

A molecular component is defined by several descriptors. The molecular_component_structure is the full SMILES or HELM representation of the different molecules in that component. The components column is the list of the molecules' hashes, ordered alphabetically and separated by |.

We then define two hashes: one for the structure, and one that combines structure and molar fraction. For the molecular_component_structure_hash, the guidelines are as follows:

  • For a single sub-component, it is the md5 hash of that sub-component (small molecule, nucleic acid, peptide).

  • For multiple sub-components, it is the md5 hash of the components column, i.e. the list of alphabetically ordered hashes, separated by |.

Md5(firstHash|secondHash|...)

The molecular_component_hash differs from molecular_component_structure_hash because it adds the molar_fraction to the hashes. Thus the molecular_component_hash is defined as:

Md5(firstHash:firstMolarFraction|[secondHash:secondMolarFraction…])
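Both component hashes can be sketched in Python as follows. The function names are hypothetical. Two details are assumptions because the text does not specify them: how the molar fraction is rendered as text (here `0.5` becomes "0.5" and `1.0` becomes "1"), and that the hash:fraction pairs are ordered by hash.

```python
import hashlib
from typing import List, Tuple


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def component_structure_hash(sub_hashes: List[str]) -> str:
    # Single sub-component: reuse the hash of that sub-component as is.
    if len(sub_hashes) == 1:
        return sub_hashes[0]
    # Several sub-components: md5 of the alphabetically ordered, |-joined hashes.
    return md5_hex("|".join(sorted(sub_hashes)))


def component_hash(sub_hashes_with_fraction: List[Tuple[str, float]]) -> str:
    # Assumption: pairs are ordered by hash and fractions formatted with "g".
    pairs = sorted(sub_hashes_with_fraction)
    return md5_hex("|".join(f"{h}:{frac:g}" for h, frac in pairs))


# Placeholder hashes standing in for real sub-component hashes.
sirna = md5_hex("siRNA placeholder")
small = md5_hex("small molecule placeholder")

structure = component_structure_hash([sirna, small])
equal_mix = component_hash([(sirna, 0.5), (small, 0.5)])
skewed_mix = component_hash([(sirna, 0.25), (small, 0.75)])
```

Here `equal_mix` and `skewed_mix` differ, while both mixtures share the same `structure` value.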

Once all descriptors have been defined for a molecular component, you can insert the record into the molecular_component table. Then each individual molecule, whether it is a small molecule, a peptide, an antibody, etc., must be linked to that component via the corresponding mapping table. This is where the molar fraction is stored.

Note

Two hashes are used for simplicity. For a component made of a single small molecule, for instance, the molecular_component record and the sm_entry record share the same hash, so the two tables can be joined directly on that hash to retrieve information about the small molecule, without intermediate steps.

Note

To get all identical mixtures of molecules but with different molarities, query the molecular_component_structure_hash.
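Such a lookup can be sketched with Python's built-in `sqlite3` module and an in-memory table. The table layout is reduced to the two hash columns, the rows are placeholders, and SQLite stands in for the real database, so this only illustrates the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE molecular_component (
           molecular_component_hash TEXT PRIMARY KEY,
           molecular_component_structure_hash TEXT
       )"""
)
# Placeholder rows: h1 and h2 are the same mixture at different molar fractions.
conn.executemany(
    "INSERT INTO molecular_component VALUES (?, ?)",
    [("h1", "s1"), ("h2", "s1"), ("h3", "s2")],
)


def same_mixture(connection, structure_hash):
    """Return the hashes of all components sharing one structure hash."""
    rows = connection.execute(
        "SELECT molecular_component_hash FROM molecular_component "
        "WHERE molecular_component_structure_hash = ? "
        "ORDER BY molecular_component_hash",
        (structure_hash,),
    )
    return [row[0] for row in rows]


matches = same_mixture(conn, "s1")
```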

Small molecule as molecular entity

A molecular entity is a set of molecular components, so the logic is very similar to registering a molecular component, if not simpler. A molecular entity is defined by two hashes (a structure hash, and a composite hash of structure and molar fraction), together with the list of molecular component hashes and the full structure in SMILES or HELM. The molecular components column is the list of alphabetically ordered molecular_component_hash values, separated by |.

For the molecular_structure_hash, the guidelines are as follows:

  • For a single component, it will be the molecular_component_structure_hash of the corresponding component.

  • For multiple components, it will be the md5 hash of the list of alphabetically ordered molecular_component_structure_hash values of the components, separated by |.

Md5(firstHash|secondHash|...)

The molecular_entity_hash differs from molecular_structure_hash because it adds the molar fraction to the hashes. Thus the molecular_entity_hash is the md5 hash of the list of alphabetically ordered pairs of molecular_component_hash:molar fraction:

Md5(firstComponentHash:firstMolarFraction|[secondComponentHash:secondMolarFraction…])
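At the entity level the same pattern applies one layer up, as in the hedged Python sketch below. Function names and the placeholder hashes are hypothetical. The text formatting of the molar fraction and the ordering of the pairs by component hash are assumptions.

```python
import hashlib
from typing import List, Tuple


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def entity_structure_hash(component_structure_hashes: List[str]) -> str:
    # Single component: the entity reuses that component's structure hash.
    if len(component_structure_hashes) == 1:
        return component_structure_hashes[0]
    return md5_hex("|".join(sorted(component_structure_hashes)))


def entity_hash(components: List[Tuple[str, float]]) -> str:
    # components holds (molecular_component_hash, molar fraction) pairs.
    # Assumption: pairs are ordered by hash and fractions formatted with "g".
    return md5_hex("|".join(f"{h}:{frac:g}" for h, frac in sorted(components)))


# Placeholder hashes for a two-component entity such as an ADC inside an LNP.
adc_structure, lnp_structure = md5_hex("ADC structure"), md5_hex("LNP structure")
adc_component, lnp_component = md5_hex("ADC component"), md5_hex("LNP component")

structure = entity_structure_hash([adc_structure, lnp_structure])
# The 0.5/0.5 fractions are illustrative only, not taken from the text.
composite = entity_hash([(adc_component, 0.5), (lnp_component, 0.5)])
```

For an entity with a single component, `entity_structure_hash` returns the component's structure hash unchanged, which is what lets a single small molecule be traced through all levels by one hash.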

Once all descriptors have been defined for a molecular entity, you can insert the record into the molecular_entity table. Then each individual component must be linked to that entity via the corresponding mapping table. This is where the molar fraction of each component is stored.

Note

For a single small molecule, the molecular_structure_hash of the entity will be the same as the md5 hash of the corresponding sm_entry record.

How to register your molecules

Your molecules are the most important molecules, so they deserve their own tables. Before you start, please review the Private section and the schema section of the documentation. This will help you set up the configuration according to your needs. If you enabled only the public schema, the tables are defined in the public schema. If you use both the public and private schemas, they are defined in the private schema only. This is an additional safeguard against leaking data into the public schema when you want it kept private.

The process for registering the structure of small molecules has already been outlined in the previous sections of Molecular entity. In this section, we will review how to identify your molecules and create libraries.

internal_molecule table: This table contains the identifiers and the inventory of your internal molecules.

internal_library table: This table provides the list of internal libraries.

internal_library_molecular_map table: This table associates a library with its list of compounds.

A script is provided in PRIVATE_SCRIPT/INTERNAL/load_internal_mol.php to load internal molecules. Please modify the file path in the copy section to make it work. The required input is a SMILES file in the following format:

SMILES IDENTIFIER[|INVENTORY]

INVENTORY is optional. If you provide it, it must be expressed in mg (milligrams) and separated from the identifier by a | (pipe). The script standardizes your small molecules, loads them into sm_molecule (molecular structure), sm_counterion (counterion), and sm_entry (molecule/counterion pair), and maps them to molecular_entity. It then loads the internal_molecule table with the inventory.
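To check an input file before running the loader, a line in this format can be parsed with a short Python sketch such as the one below. `parse_internal_line` is a hypothetical helper and not part of load_internal_mol.php. It assumes the identifier contains no pipe.

```python
from typing import Optional, Tuple


def parse_internal_line(line: str) -> Tuple[str, str, Optional[float]]:
    """Return (smiles, identifier, inventory in mg or None)."""
    # The first space separates the SMILES from the identifier part.
    smiles, rest = line.strip().split(" ", 1)
    # The inventory, when present, follows the identifier after a pipe.
    identifier, sep, inventory = rest.partition("|")
    inventory_mg = float(inventory) if sep else None
    return smiles, identifier, inventory_mg


with_inventory = parse_internal_line("CCO CMPD-1|12.5")
without_inventory = parse_internal_line("CCO CMPD-2")
```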