Giving "infores" some love? #270

Open
matentzn opened this issue Nov 12, 2024 · 4 comments

@matentzn

Playing with your ROBOKOP instance, I noticed two things and would like to suggest some improvements for increased provenance transparency.

Use `primary_knowledge_source` only ever for the very original source

Let's look at a query like this:
MATCH (n:`biolink:Gene` {id: 'NCBIGene:5979'})-[r]-(d:`biolink:Disease` {id: 'DOID:0050430'})
RETURN n,r,d LIMIT 25

For the exact same association, you get a wild mix of primary_knowledge_source and aggregator_knowledge_source. I believe, while it is hard, that for aggregation of scientific evidence it is critical that the very original source making that statement is the primary_knowledge_source, and all other downstream sources should only ever be mentioned as aggregator_knowledge_source. In particular, "infores:monarchinitiative" should (at least if I am not mistaken, given "infores:hpo-annotations", cc @kevinschaper?) never appear as a primary_knowledge_source. Same as pharos:

image

Neither should, and this is IMO very important, infores:ubergraph. Here it is critical that every single integrated edge gets infores:sourceontology (e.g. infores:uberon) for maximum transparency. Again, it would be great if the edge could point somehow to a version of the knowledge source (asserted_id: [ubergraph2023-01-01, pharos2024-03-4, monarch2024-03-04], etc), but I don't know how that's done in Biolink. This is not just about provenance. This is also about attribution: we want to make sure that when we deliver high-impact KGs like ROBOKOP to the science world, everyone knows that "wow, uberon really made a difference to beef up the context for our node embeddings". This is only possible if we add that info on every single edge.

Either never or always aggregate "aggregator_knowledge_source"

Right now we have a mix of cases, like in the query above. The advantage of "always aggregate" is you can see immediately how well an edge is supported in the graph (how many aggregators have deemed it trustworthy). On the other hand, there is a risk of not being able to adequately integrate association metadata if it diverges across resources. I don't know the right answer to this, but in order to recommend preprocessing for ML tools (should the number of edges between two nodes matter?) I believe this has to be done consistently.

Otherwise looks great! You have 42 knowledge sources, and all of them appear in the infores registry, which is awesome!

These were my two cents!

cc @marcello-deluca

@EvanDietzMorris
Contributor

Thanks @matentzn, it's great to have more eyes on this.

For point one, that is certainly the intention. For the query you shared, I'm not sure those are all the exact same association. In fact, the reason there are so many edges that look similar is because they came from different primary knowledge sources, even when they came through the same aggregator. The edge merging algorithm in ORION uses the primary knowledge source as part of the criteria for determining whether two edges are the same, so edges with different primary knowledge sources are always kept separate.
uniprot -> pharos -> robokop
monarchinitiative -> pharos -> robokop
eram -> pharos -> robokop
ctd -> pharos -> robokop

This means that pharos has something about this kind of relationship from all of these underlying sources, as separate database entries. However, it appears we may have an issue with pharos because it is an aggregator of aggregators. Monarchinitiative is used because pharos has "Monarch" as the source database, and unfortunately, I don't think pharos provides the true primary source for those. Same for edges from CTD -> pharos.

We do have pharos as the primary knowledge source for some edges, but only when the real primary source could not be determined or for edge cases we never handled, but they should be considered mistakes or TBD (of course it is still helpful to identify and correct these cases).

For ubergraph, let's loop in @balhoff, but I'm under the impression that many (most?) of the edges from Ubergraph actually are generated by Ubergraph in a way where it should be considered the primary source. It is not simply aggregating knowledge from other sources, but generating edges using techniques like logical entailment.

Re: including versions of sources, that's definitely a good idea. ORION tracks source versions in graph metadata as best as it can (many sources do not provide real version identifiers), but does not include any of that inside the graph.

For your second point, I'm not sure I understand. We ingest some sources that are the primary knowledge source for their data, and some sources that are aggregators already. So to properly track that we necessarily have some edges with an aggregator knowledge source and some without. Maybe I'm misunderstanding what you mean though.

@matentzn
Author

> Thanks @matentzn, it's great to have more eyes on this.

Always happy to, just hope I am not a burden here!

> For point one, that is certainly the intention. For the query you shared, I'm not sure those are all the exact same association.

Sorry, for point one I shared the picture not because it has two associations, but because it has primary_knowledge_source infores:pharos, which seemed unlikely (it's possible though!). I understand that this is difficult, but when ingesting sources we should make sure that the provenance of every single edge we are importing can be traced to the scientific event that produced it (e.g., the group that extracted the information from some scientific paper).

> In fact, the reason there are so many edges that look similar is because they came from different primary knowledge sources, even when they came through the same aggregator.

OK, got it. You say they "look similar": so even if subject, relation, and object are identical, there can still be biologically meaningful differences?

> The edge merging algorithm in ORION uses the primary knowledge source as part of the criteria for determining whether two edges are the same, so edges with different primary knowledge sources are always kept separate.

Yeah, this makes sense. Thank you for clarifying. So when you have two relationships with the same primary knowledge source and different aggregators, you always merge?

> Monarchinitiative is used because pharos has "Monarch" as the source database, and unfortunately, I don't think pharos provides the true primary source for those. Same for edges from CTD -> pharos.
>
> We do have pharos as the primary knowledge source for some edges, but only when the real primary source could not be determined or for edge cases we never handled, but they should be considered mistakes or TBD (of course it is still helpful to identify and correct these cases).

If this information is not available, our great leaders like @balhoff and @cbizon should be encouraged to reach out and push the resources to include this provenance - not having it severely diminishes the credibility of our KGs.

> For ubergraph, let's loop in @balhoff, but I'm under the impression that many (most?) of the edges from Ubergraph actually are generated by Ubergraph in a way where it should be considered the primary source. It is not simply aggregating knowledge from other sources, but generating edges using techniques like logical entailment.

Agreed, thanks for pointing that out! But I don't think I saw infores:mondo anywhere in ROBOKOP.

> Re: including versions of sources, that's definitely a good idea. ORION tracks source versions in graph metadata as best as it can (many sources do not provide real version identifiers), but does not include any of that inside the graph.

Awesome.

> For your second point, I'm not sure I understand. We ingest some sources that are the primary knowledge source for their data, and some sources that are aggregators already. So to properly track that we necessarily have some edges with an aggregator knowledge source and some without. Maybe I'm misunderstanding what you mean though.

I think you commented the right things above. If this assumption is correct, the second concern is mitigated:

  1. If two associations have diverging primary knowledge providers, they are both asserted as separate relations (I understand that's the case)
  2. If two associations from different aggregators have the same primary knowledge provider, they are merged into one association and the knowledge aggregator metadata field is aggregated (i.e. both aggregators are included).
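
The two rules above could be sketched in Python roughly as follows. This is only an illustration of the behaviour described in this thread, not ORION's actual code: the `merge_edges` name, the plain-dict edge layout, the placeholder predicate `biolink:related_to`, and the `infores:example-aggregator` identifier are all assumptions made up for the example.

```python
def merge_edges(edges):
    """Merge edges that share subject, predicate, object and primary source.

    Hypothetical sketch: edges are plain dicts, not ORION's real data model.
    """
    merged = {}
    for edge in edges:
        # The primary knowledge source is part of the merge key, so edges
        # with different primary sources are never merged (rule 1).
        key = (
            edge["subject"],
            edge["predicate"],
            edge["object"],
            edge["primary_knowledge_source"],
        )
        if key not in merged:
            # Copy the edge and start with an empty aggregator list.
            merged[key] = {**edge, "aggregator_knowledge_source": []}
        target = merged[key]["aggregator_knowledge_source"]
        # Same primary source: keep the union of all aggregators (rule 2).
        for aggregator in edge.get("aggregator_knowledge_source", []):
            if aggregator not in target:
                target.append(aggregator)
    return list(merged.values())


# Invented example data; the predicate and second aggregator are placeholders.
base = {
    "subject": "NCBIGene:5979",
    "predicate": "biolink:related_to",
    "object": "DOID:0050430",
}
edges = [
    {**base, "primary_knowledge_source": "infores:ctd",
     "aggregator_knowledge_source": ["infores:pharos"]},
    {**base, "primary_knowledge_source": "infores:ctd",
     "aggregator_knowledge_source": ["infores:example-aggregator"]},
    {**base, "primary_knowledge_source": "infores:uniprot",
     "aggregator_knowledge_source": ["infores:pharos"]},
]

for edge in merge_edges(edges):
    print(edge["primary_knowledge_source"], edge["aggregator_knowledge_source"])
```

Under these assumptions, the two ctd edges collapse into one edge listing both aggregators, while the uniprot edge stays separate. The length of `aggregator_knowledge_source` would then be the "how many aggregators support this edge" signal.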

@EvanDietzMorris thanks so much, this is great feedback!

@balhoff

balhoff commented Nov 16, 2024

@matentzn it would be hard/slow to check if an Ubergraph edge is provided by an ontology (compared to just downloading edges). But it would be possible. I think a decent assumption would be to credit whichever ontology is the source of the subject term—what do you think about that?
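
A rough sketch of that assumption, for discussion: guess the ontology from the CURIE prefix of the subject term. The function name, the rule of lowercasing the prefix to form an infores identifier, the fallback to infores:ubergraph, and the small `known` set are all illustrative guesses; real identifiers would need to be checked against the infores registry.

```python
def credit_by_subject(subject_curie, known_infores):
    """Guess the source ontology of an Ubergraph edge from its subject CURIE.

    Hypothetical helper: not part of Ubergraph, ORION, or Biolink tooling.
    """
    prefix, sep, _local_id = subject_curie.partition(":")
    candidate = "infores:" + prefix.lower()
    # Only accept candidates that appear in the (assumed) set of registered ids.
    if sep and candidate in known_infores:
        return candidate
    # Otherwise fall back to crediting Ubergraph itself.
    return "infores:ubergraph"


# Assumed subset of registered identifiers, for illustration only.
known = {"infores:uberon", "infores:mondo"}
print(credit_by_subject("UBERON:0002107", known))
```

This only credits by subject prefix; it cannot tell an asserted ontology edge from one that Ubergraph inferred.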

@matentzn
Author

Credit by subject is a good proxy if we don't have anything else, but my gut feeling tells me that in the long run, we should attribute edges directly, using rdfs:isDefinedBy during ubergraph construction, so you can attribute all the amazing RG stuff that ubergraph does (and @EvanDietzMorris rightfully points out!) to infores:ubergraph!

If we don't have a quick function in ROBOT "annotate-all-axioms" (I think we do in "extract"?) we can write one and inject as a robot plugin.
