Use only mondo.sssom.tsv for disease normalisation? #269

Open
matentzn opened this issue Nov 12, 2024 · 2 comments

Comments

@matentzn

First of all, thanks for having me :) @marcello-deluca invited me to provide some feedback here, and I am glad to see a team that is as passionate about biomedical KGs as we are joining forces! Let's get into it. (My tone when providing feedback is sometimes a bit German, sorry about that; I only do this because your overall product is awesome, else I wouldn't bother.)

At the moment, we are missing some interesting integration in ROBOKOP.

Let's look at this query:

MATCH (do:`biolink:Disease` {id: 'DOID:0050430'})-[r]-(a) 
MATCH (mondo:`biolink:Disease` {id: 'MONDO:0008234'})-[s]-(b) 
WHERE r.primary_knowledge_source <> "infores:ubergraph" 
AND s.primary_knowledge_source <> "infores:ubergraph"
RETURN do,mondo,r,s,a,b LIMIT 1

As you can see, only one of the two diseases, which are clearly the same, is associated with the Cutaneous lichen amyloidosis phenotype:

image

In Mondo SSSOM, these are mapped:

MONDO:0008234	multiple endocrine neoplasia type 2A	skos:exactMatch	DOID:0050430	multiple endocrine neoplasia type 2A	semapv:UnspecifiedMatching
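To make the mapping row concrete, here is a minimal Python sketch that reads it and builds a lookup from the mapped identifier to its Mondo identifier. The column names are an assumption inferred from the single row quoted above (a real `mondo.sssom.tsv` carries its own header and `#`-prefixed metadata lines), and `exact_matches_to_mondo` is a hypothetical helper, not part of any existing tool.

```python
import csv

# Assumption: column order inferred from the single row quoted above.
COLUMNS = [
    "subject_id", "subject_label", "predicate_id",
    "object_id", "object_label", "mapping_justification",
]

SAMPLE = (
    "MONDO:0008234\tmultiple endocrine neoplasia type 2A\tskos:exactMatch\t"
    "DOID:0050430\tmultiple endocrine neoplasia type 2A\t"
    "semapv:UnspecifiedMatching\n"
)


def exact_matches_to_mondo(tsv_text):
    """Map each object_id to its subject_id for skos:exactMatch rows."""
    # Skip '#'-prefixed metadata lines and a header row, if present.
    lines = [
        line for line in tsv_text.splitlines()
        if line
        and not line.startswith("#")
        and not line.startswith("subject_id\t")
    ]
    reader = csv.DictReader(
        lines, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {
        row["object_id"]: row["subject_id"]
        for row in reader
        if row["predicate_id"] == "skos:exactMatch"
    }


to_mondo = exact_matches_to_mondo(SAMPLE)
print(to_mondo["DOID:0050430"])  # MONDO:0008234
```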

This can have some unnecessary consequences for downstream prediction tasks, especially if links to the non-Mondo ID do not make it into the final KG subset used for learning.

The whole purpose of Mondo is to provide a broad-scope disease vocabulary onto which we can project existing disease vocabularies, with at least the following projected in full: DO, ORDO, OMIM, NCIT neoplasm and UMLS (among others).

I would like to suggest two things:

  1. When normalizing, always normalize diseases to Mondo if possible (first using exactMatch, then hasDbXref); if that is not possible, normalize to HPO (we have some cool new mappings you probably don't have yet!); and if that is not possible either, normalize to UMLS. Odds are that anything that does not fit any of these is simply not a disease.
  2. When you document "equivalent_identifiers", also document somehow (not sure it's possible in biolink) the exact source of the mapping, e.g. a versioned dump of node normalizer. This goes a long way for transparency. That way I can see the mappings used, their precedence order, etc.
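The precedence in suggestion 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: mappings are plain `(subject_id, predicate_id, object_id)` triples with the target vocabulary as subject, any predicate is accepted for the HPO and UMLS fallbacks (the suggestion does not name one), and every identifier other than the MONDO/DOID pair from this issue is made up.

```python
# Ordered fallbacks: (target prefix, required predicate or None for any).
PRECEDENCE = [
    ("MONDO", "skos:exactMatch"),
    ("MONDO", "oboInOwl:hasDbXref"),
    ("HP", None),    # assumption: any mapping predicate is acceptable
    ("UMLS", None),  # assumption: any mapping predicate is acceptable
]


def prefix_of(curie):
    """Return the prefix part of a CURIE, e.g. 'MONDO' for 'MONDO:0008234'."""
    return curie.split(":", 1)[0]


def normalize_disease(curie, mappings):
    """Return the preferred identifier for `curie`, or None if nothing fits."""
    if prefix_of(curie) == "MONDO":
        return curie
    candidates = [(s, p) for s, p, o in mappings if o == curie]
    for target_prefix, required_predicate in PRECEDENCE:
        for subject, predicate in candidates:
            if (prefix_of(subject) == target_prefix
                    and required_predicate in (None, predicate)):
                return subject
    return None


# Toy data: only the first triple comes from this issue; the EX:* and
# HP/UMLS identifiers below are hypothetical placeholders.
TOY_MAPPINGS = [
    ("MONDO:0008234", "skos:exactMatch", "DOID:0050430"),
    ("UMLS:C0000000", "skos:exactMatch", "EX:0000001"),
    ("HP:0000000", "skos:closeMatch", "EX:0000001"),
]

print(normalize_disease("DOID:0050430", TOY_MAPPINGS))  # MONDO:0008234
```

With this ordering, `EX:0000001` resolves to the HPO identifier even though the UMLS mapping is listed first, and an unmapped identifier returns `None`, which matches the "probably not a disease" reading.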

cc @gaurav

@EvanDietzMorris
Contributor

Thanks @matentzn, this is great. This is a bit outside of the scope of ORION, because normalization decisions, and the list of equivalent identifiers, come directly from the Node Normalizer service, backed by Babel. You have tagged the right person, but he's currently on vacation for a few weeks.

In the meantime, I can say:

  1. Currently the hierarchy of identifier preferences comes directly from the biolink model and MONDO is already preferred, but HPO and UMLS are not next. (https://github.com/biolink/biolink-model/blob/57345faddc36b127648292dd8c20bb9e9cf2b149/biolink-model.yaml#L8200-L8218) Customizing this for ORION/robokop would be problematic because all of the services relying on the node normalizer need to use the same preferred identifiers. This is something that would need to be done in biolink or babel.

For this specific case, it does look like maybe those two identifiers should be considered synonyms but they're not currently. (https://nodenormalization-sri.renci.org/1.5/get_normalized_nodes?curie=MONDO%3A0008234&curie=DOID%3A0050430)
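For anyone who wants to repeat this check, a rough Python sketch follows. The endpoint and query string are the ones linked above; the response shape used by `same_clique` (each CURIE mapping to an object with `id.identifier`, or to null when unknown) is an assumption about the service, and the payloads at the bottom are hand-written stand-ins rather than real responses.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://nodenormalization-sri.renci.org/1.5/get_normalized_nodes"


def build_url(curies):
    """Build the get_normalized_nodes URL with one `curie` parameter each."""
    return BASE_URL + "?" + urlencode([("curie", c) for c in curies])


def fetch(curies):
    """Query the service (network call; not exercised in this sketch)."""
    with urlopen(build_url(curies)) as response:
        return json.load(response)


def same_clique(payload, a, b):
    """True if both CURIEs normalize to the same preferred identifier.

    Assumes payload[curie]["id"]["identifier"] holds the preferred id.
    """
    entry_a, entry_b = payload.get(a), payload.get(b)
    if not entry_a or not entry_b:
        return False
    return entry_a["id"]["identifier"] == entry_b["id"]["identifier"]


# Hand-written stand-ins for the two situations discussed in this issue.
NOT_MERGED = {
    "MONDO:0008234": {"id": {"identifier": "MONDO:0008234"}},
    "DOID:0050430": {"id": {"identifier": "DOID:0050430"}},
}
MERGED = {
    "MONDO:0008234": {"id": {"identifier": "MONDO:0008234"}},
    "DOID:0050430": {"id": {"identifier": "MONDO:0008234"}},
}

print(same_clique(NOT_MERGED, "MONDO:0008234", "DOID:0050430"))  # False
print(same_clique(MERGED, "MONDO:0008234", "DOID:0050430"))      # True
```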

  2. equivalent_identifiers is actually not even valid biolink, so we should be looking into changing that anyway. AFAIK there is nothing in biolink that really fits what we have there. Synonym and alias are node properties but they say "human readable" synonyms (not curies?). There is a predicate "same as" but we don't want edges for this. The version of the node normalizer service used to normalize graphs is already tracked in graph metadata for ORION, but this is actually not the same as the version of Babel that was used to generate the synonyms, and this is an issue Gaurav is aware of and plans to address. ORION also generates metadata files that contain the exact mappings used to build a graph, and those could/should be provided alongside graphs.

@matentzn
Author

> Currently the hierarchy of identifier preferences comes directly from the biolink model and MONDO is already preferred, but HPO and UMLS are not next. (https://github.com/biolink/biolink-model/blob/57345faddc36b127648292dd8c20bb9e9cf2b149/biolink-model.yaml#L8200-L8218) Customizing this for ORION/robokop would be problematic because all of the services relying on the node normalizer need to use the same preferred identifiers. This is something that would need to be done in biolink or babel.

Yikes... That is very unfortunate! But good to know, and I guess in principle it makes some sense (the order, though, does not IMO: DOID, OMIM, OMIM.PS, orphanet and EFO are entirely subsumed under Mondo, and UMLS should be last in that list). It just means that in the context of Everycure, I need to push for a different prefix preference to be used than the standard biolink one.
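To illustrate what a different prefix preference would change, here is a small Python sketch that picks the preferred identifier from a set of equivalent identifiers given an ordered prefix list. Both orders shown are hypothetical examples for this discussion, not the actual Biolink list, and `preferred_identifier` is an illustrative helper rather than Node Normalizer code.

```python
def preferred_identifier(clique, prefix_order):
    """Pick the CURIE whose prefix comes earliest in `prefix_order`.

    Prefixes missing from `prefix_order` rank after all listed ones;
    ties keep the first CURIE encountered in `clique`.
    """
    rank = {prefix: i for i, prefix in enumerate(prefix_order)}
    return min(clique, key=lambda c: rank.get(c.split(":", 1)[0], len(rank)))


# Hypothetical orders, for illustration only.
MONDO_FIRST = ["MONDO", "DOID", "OMIM", "UMLS"]
DOID_FIRST = ["DOID", "MONDO", "OMIM", "UMLS"]

clique = ["DOID:0050430", "MONDO:0008234"]
print(preferred_identifier(clique, MONDO_FIRST))  # MONDO:0008234
print(preferred_identifier(clique, DOID_FIRST))   # DOID:0050430
```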

> For this specific case, it does look like maybe those two identifiers should be considered synonyms but they're not currently. (https://nodenormalization-sri.renci.org/1.5/get_normalized_nodes?curie=MONDO%3A0008234&curie=DOID%3A0050430)

Cool, thanks for checking! Shall I move this issue to the Node Normalizer (NN) repo then?

> equivalent_identifiers is actually not even valid biolink, so we should be looking into changing that anyway. AFAIK there is nothing in biolink that really fits what we have there. Synonym and alias are node properties but they say "human readable" synonyms (not curies?).

https://biolink.github.io/biolink-model/exact_matches/?

> ORION also generates metadata files that contain the exact mappings used to build a graph, and those could/should be provided alongside graphs.

Great! I would be happy to advise on the formatting for that, e.g. https://mapping-commons.github.io/sssom/
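As a starting point for that conversation, here is a minimal Python sketch of what writing such mappings as an SSSOM-style TSV could look like: a `#`-prefixed metadata header followed by tab-separated rows. The helper `to_sssom_tsv`, the `mapping_set_id` and the license value are hypothetical placeholders; only the mapping row itself comes from this issue, and the real column set should follow the SSSOM documentation linked above.

```python
import csv
import io

# Minimal column set for this sketch; a real file may carry more slots.
SSSOM_COLUMNS = [
    "subject_id", "predicate_id", "object_id", "mapping_justification",
]


def to_sssom_tsv(mappings, mapping_set_id, license_url):
    """Serialize mapping dicts as a TSV with a commented metadata header."""
    buffer = io.StringIO()
    buffer.write(f"#mapping_set_id: {mapping_set_id}\n")
    buffer.write(f"#license: {license_url}\n")
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(SSSOM_COLUMNS)
    for mapping in mappings:
        writer.writerow([mapping[column] for column in SSSOM_COLUMNS])
    return buffer.getvalue()


tsv_text = to_sssom_tsv(
    [{
        "subject_id": "MONDO:0008234",
        "predicate_id": "skos:exactMatch",
        "object_id": "DOID:0050430",
        "mapping_justification": "semapv:UnspecifiedMatching",
    }],
    mapping_set_id="https://example.org/mappings/demo",  # hypothetical
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",  # placeholder
)
print(tsv_text)
```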
