Use only mondo.sssom.tsv for disease normalisation? #269

matentzn · 2024-11-12T07:31:42Z

First of all, thanks for having me :) @marcello-deluca invited me to provide some feedback here, and I am glad to see a team that is as passionate about biomedical KGs as we are join forces! Let's get into it. (My tone when providing feedback is sometimes a bit German, sorry about that; I only do this because your overall product is awesome, else I wouldn't bother.)

At the moment, we are missing some interesting integration in ROBOKOP.

Let's look at:

MATCH (do:`biolink:Disease` {id: 'DOID:0050430'})-[r]-(a) 
MATCH (mondo:`biolink:Disease` {id: 'MONDO:0008234'})-[s]-(b) 
WHERE r.primary_knowledge_source <> "infores:ubergraph" 
AND s.primary_knowledge_source <> "infores:ubergraph"
RETURN do,mondo,r,s,a,b LIMIT 1

As you can see, only one of the two diseases, which are clearly the same, is associated with the Cutaneous lichen amyloidosis phenotype.

In Mondo SSSOM, these are mapped:

MONDO:0008234	multiple endocrine neoplasia type 2A	skos:exactMatch	DOID:0050430	multiple endocrine neoplasia type 2A	semapv:UnspecifiedMatching

This can have some unnecessary consequences for downstream prediction tasks, especially if links to the non-Mondo ID do not make it into the final KG subset used for learning.

The whole purpose of Mondo is to provide a broad-scope disease vocabulary onto which we can project existing disease vocabularies, at least the following in full: DO, ORDO, OMIM, NCIT neoplasm and UMLS (among others).

I would like to suggest two things:

When normalizing, always normalize diseases to Mondo if possible (first using exactMatch, then hasDbXref); if that is not possible, normalise to HPO (we have some cool new mappings you probably don't have yet!); and if that is not possible either, normalise to UMLS. Odds are that anything that does not fit any of these is simply not a disease.
When you document "equivalent_identifiers", also document somehow (not sure it's possible in biolink) the exact source of the mapping, e.g. a versioned dump of node normalizer. This goes a long way for transparency. That way I can see the mappings used, their precedence order, etc.
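
The precedence in the first suggestion could be sketched roughly as below. This is only an illustration, not how Node Normalizer or Babel actually work: the `Mapping` tuple, the `normalize_disease` function and the in-memory list of mappings are hypothetical. It assumes SSSOM-style rows with the target vocabulary on the subject side (as in the Mondo row above) and that hasDbXref is written as `oboInOwl:hasDbXref`.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    subject_id: str    # e.g. "MONDO:0008234"
    predicate_id: str  # e.g. "skos:exactMatch"
    object_id: str     # e.g. "DOID:0050430"


# Target vocabularies, most preferred first (assumed prefixes).
TARGET_PREFIXES = ["MONDO", "HP", "UMLS"]
# Mapping predicates, most preferred first.
PREDICATES = ["skos:exactMatch", "oboInOwl:hasDbXref"]


def prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def normalize_disease(curie: str, mappings: list[Mapping]) -> Optional[str]:
    """Return the preferred identifier for `curie`, or None if nothing fits."""
    for target in TARGET_PREFIXES:
        if prefix(curie) == target:
            return curie  # already in this target vocabulary
        for predicate in PREDICATES:
            for m in mappings:
                if (m.object_id == curie
                        and m.predicate_id == predicate
                        and prefix(m.subject_id) == target):
                    return m.subject_id
    return None  # per the suggestion: probably not a disease


mappings = [Mapping("MONDO:0008234", "skos:exactMatch", "DOID:0050430")]
print(normalize_disease("DOID:0050430", mappings))  # MONDO:0008234
```

The vocabulary loop is outermost so that a weaker Mondo mapping (hasDbXref) still wins over any HPO or UMLS mapping, which is one reading of "always normalize to Mondo if possible".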

cc @gaurav

EvanDietzMorris · 2024-11-15T19:31:14Z

Thanks @matentzn, this is great. This is a bit outside of the scope of ORION, because normalization decisions, and the list of equivalent identifiers, come directly from the Node Normalizer service, backed by Babel. You have tagged the right person, but he's currently on vacation for a few weeks.

In the meantime, I can say:

Currently the hierarchy of identifier preferences comes directly from the biolink model and MONDO is already preferred, but HPO and UMLS are not next. (https://github.com/biolink/biolink-model/blob/57345faddc36b127648292dd8c20bb9e9cf2b149/biolink-model.yaml#L8200-L8218) Customizing this for ORION/robokop would be problematic because all of the services relying on the node normalizer need to use the same preferred identifiers. This is something that would need to be done in biolink or babel.

For this specific case, it does look like maybe those two identifiers should be considered synonyms but they're not currently. (https://nodenormalization-sri.renci.org/1.5/get_normalized_nodes?curie=MONDO%3A0008234&curie=DOID%3A0050430)

equivalent_identifiers is actually not even valid biolink, so we should be looking into changing that anyway. AFAIK there is nothing in biolink that really fits what we have there. Synonym and alias are node properties but they say "human readable" synonyms (not curies?). There is a predicate "same as" but we don't want edges for this. The version of the node normalizer service used to normalize graphs is already tracked in graph metadata for ORION, but this is actually not the same as the version of Babel that was used to generate the synonyms, and this is an issue Gaurav is aware of and plans to address. ORION also generates metadata files that contain the exact mappings used to build a graph, and those could/should be provided alongside graphs.
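
To check whether two CURIEs currently end up in the same clique, something like the sketch below could be used. It is only an illustration: the helper names are hypothetical, and the response shape (a JSON object keyed by input CURIE, each value holding an `id.identifier` field, or `null` when the CURIE is unknown) is an assumption about the `get_normalized_nodes` endpoint linked above that should be checked against the live service.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the link above.
NN_URL = "https://nodenormalization-sri.renci.org/1.5/get_normalized_nodes"


def build_url(curies: list[str]) -> str:
    """Build a GET URL with one `curie` parameter per identifier."""
    return NN_URL + "?" + urlencode([("curie", c) for c in curies])


def same_clique(response: dict, a: str, b: str) -> bool:
    """True if both CURIEs normalize to the same preferred identifier.

    Assumes each value looks like {"id": {"identifier": ...}, ...} or is None.
    """
    node_a, node_b = response.get(a), response.get(b)
    if not node_a or not node_b:
        return False
    return node_a["id"]["identifier"] == node_b["id"]["identifier"]


def fetch(curies: list[str]) -> dict:
    """Call the service and parse the JSON body (needs network access)."""
    with urlopen(build_url(curies)) as resp:
        return json.load(resp)
```

For the pair discussed here, the call would be `same_clique(fetch([a, b]), a, b)` with `a = "MONDO:0008234"` and `b = "DOID:0050430"`.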

matentzn · 2024-11-16T11:31:29Z

> Currently the hierarchy of identifier preferences comes directly from the biolink model and MONDO is already preferred, but HPO and UMLS are not next. (https://github.com/biolink/biolink-model/blob/57345faddc36b127648292dd8c20bb9e9cf2b149/biolink-model.yaml#L8200-L8218) Customizing this for ORION/robokop would be problematic because all of the services relying on the node normalizer need to use the same preferred identifiers. This is something that would need to be done in biolink or babel.

Yikes... That is very unfortunate! But good to know, and I guess in principle it makes some sense (the order, though, does not IMO: DOID, OMIM, OMIM.PS, Orphanet and EFO are entirely subsumed under Mondo, and UMLS should be last in that list). It just means that, in the context of Everycure, I need to push for a different prefix preference to be used than the standard biolink one.

> For this specific case, it does look like maybe those two identifiers should be considered synonyms but they're not currently. (https://nodenormalization-sri.renci.org/1.5/get_normalized_nodes?curie=MONDO%3A0008234&curie=DOID%3A0050430)

Cool, thanks for checking! Shall I move this issue to the NN repo then?

> equivalent_identifiers is actually not even valid biolink, so we should be looking into changing that anyway. AFAIK there is nothing in biolink that really fits what we have there. Synonym and alias are node properties but they say "human readable" synonyms (not curies?).

https://biolink.github.io/biolink-model/exact_matches/?

> ORION also generates metadata files that contain the exact mappings used to build a graph, and those could/should be provided alongside graphs.

Great! I would be happy to advise on the formatting for that, e.g. https://mapping-commons.github.io/sssom/
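
As a rough idea of what that could look like, the sketch below writes one node's equivalent identifiers out as SSSOM-style TSV rows. The column names follow the SSSOM specification and the row shown in the issue description. The function name, the input structure, and the use of `skos:exactMatch` and `semapv:UnspecifiedMatching` for every row are assumptions for illustration, not ORION's actual metadata format. A real SSSOM file would also carry a metadata header (mapping set id, license, CURIE map), which is omitted here.

```python
import csv
import io

# Minimal subset of SSSOM columns used in this sketch.
SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def to_sssom_tsv(preferred_id: str, equivalent_ids: list[str]) -> str:
    """Serialize one clique as TSV: a header row plus one row per mapping."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(SSSOM_COLUMNS)
    for other in equivalent_ids:
        if other == preferred_id:
            continue  # skip the trivial self-mapping
        writer.writerow([preferred_id, "skos:exactMatch", other,
                         "semapv:UnspecifiedMatching"])
    return buf.getvalue()


print(to_sssom_tsv("MONDO:0008234", ["MONDO:0008234", "DOID:0050430"]))
```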
