Giving "infores" some love? #270
Thanks @matentzn, it's great to have more eyes on this.

For point one, that is certainly the intention. For the query you shared, I'm not sure those are all the exact same association. In fact, the reason there are so many edges that look similar is that they came from different primary knowledge sources, even when they came through the same aggregator. The edge merging algorithm in ORION uses the primary knowledge source as part of the criteria for determining whether two edges are the same, so edges with different primary knowledge sources are always kept separate. This means that pharos has something about this kind of relationship from all of these underlying sources, as separate database entries.

However, it appears we may have an issue with pharos because it is an aggregator of aggregators. Monarchinitiative is used because pharos has "Monarch" as the source database, and unfortunately, I don't think pharos provides the true primary source for those. The same goes for edges from CTD -> pharos. We do have pharos as the primary knowledge source for some edges, but only when the real primary source could not be determined or for edge cases we never handled; those should be considered mistakes or TBD (of course it is still helpful to identify and correct these cases).

For Ubergraph, let's loop in @balhoff, but I'm under the impression that many (most?) of the edges from Ubergraph actually are generated by Ubergraph in a way where it should be considered the primary source. It is not simply aggregating knowledge from other sources, but generating edges using techniques like logical entailment.

Re: including versions of sources, that's definitely a good idea. ORION tracks source versions in graph metadata as best as it can (many sources do not provide real version identifiers), but does not include any of that inside the graph.

For your second point, I'm not sure I understand. We ingest some sources that are the primary knowledge source for their data, and some sources that are already aggregators. So to properly track that, we necessarily have some edges with an aggregator knowledge source and some without. Maybe I'm misunderstanding what you mean, though.
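In case it helps to picture that merging step, here is a minimal sketch of the idea (not ORION's actual code; the field names simply mirror Biolink slot names): the merge key includes the primary knowledge source, so edges that differ only in that field never collapse into one.

```python
# Minimal sketch of edge merging keyed on the primary knowledge source.
# Illustrative only, not ORION's actual implementation.

def merge_key(edge: dict) -> tuple:
    """Edges with different primary knowledge sources never merge."""
    return (
        edge["subject"],
        edge["predicate"],
        edge["object"],
        edge["primary_knowledge_source"],
    )

def merge_edges(edges: list[dict]) -> list[dict]:
    merged = {}
    for edge in edges:
        key = merge_key(edge)
        if key not in merged:
            merged[key] = dict(edge, aggregator_knowledge_source=set())
        # Union the aggregator provenance from all duplicate edges.
        merged[key]["aggregator_knowledge_source"].update(
            edge.get("aggregator_knowledge_source", [])
        )
    return list(merged.values())
```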
Always happy to, just hope I am not a burden here!
Sorry, for point one, I shared the picture not because it has two associations, but because it has
OK, got it. So for the edges you say "look similar", even when subject, relation, and object are identical, there can still be biologically meaningful differences?
Yeah, this makes sense. Thank you for clarifying. So when you have two relationships with the same primary knowledge source and different aggregators, you always merge?
If this information is not available, our great leaders like @balhoff and @cbizon should be encouraged to reach out and push the resources to include this provenance - not having it severely diminishes the credibility of our KGs.
Agreed, thanks for pointing that out! But I don't think I saw infores:mondo anywhere in ROBOKOP.
Awesome.
I think you commented the right things above. If this assumption is correct, the second concern is mitigated:
@EvanDietzMorris thanks so much, this is great feedback!
@matentzn it would be hard/slow to check if an Ubergraph edge is provided by an ontology (compared to just downloading edges). But it would be possible. I think a decent assumption would be to credit whichever ontology is the source of the subject term - what do you think about that?
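As a rough sketch of that heuristic (the prefix-to-infores mapping below is purely illustrative, not an established registry lookup), the credited source could be derived from the subject CURIE:

```python
# Hypothetical sketch: derive a knowledge source from the subject term's
# CURIE prefix. The mapping below is illustrative, not authoritative.
PREFIX_TO_INFORES = {
    "UBERON": "infores:uberon",
    "CL": "infores:cl",
    "GO": "infores:go",
    "HP": "infores:hpo",
}

def credit_by_subject(subject_curie: str) -> str:
    prefix = subject_curie.split(":", 1)[0]
    # Fall back to crediting Ubergraph itself when the prefix is unknown.
    return PREFIX_TO_INFORES.get(prefix, "infores:ubergraph")

print(credit_by_subject("UBERON:0002107"))  # infores:uberon
```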
Credit by subject is a good proxy if we don't have anything else, but my gut feeling tells me that in the long run, we should attribute edges directly, using rdfs:isDefinedBy during ubergraph construction, so you can attribute all the amazing RG stuff that ubergraph does (and @EvanDietzMorris rightfully points out!) to infores:ubergraph! If we don't have a quick function in ROBOT "annotate-all-axioms" (I think we do in "extract"?) we can write one and inject it as a ROBOT plugin.
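A minimal sketch of what such tagging could look like with rdflib, assuming per-term (rather than per-axiom) rdfs:isDefinedBy annotation and example file names; a proper ROBOT plugin would presumably do the equivalent at the axiom level:

```python
# Hedged sketch: stamp every class in an ontology file with rdfs:isDefinedBy
# during graph construction, so downstream tooling can attribute edges
# without guessing from CURIE prefixes. File names and IRI are examples.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDF, RDFS

ONTOLOGY_IRI = URIRef("http://purl.obolibrary.org/obo/uberon.owl")

g = Graph()
g.parse("uberon.owl", format="xml")  # example input file

for cls in g.subjects(RDF.type, OWL.Class):
    if isinstance(cls, URIRef):
        g.add((cls, RDFS.isDefinedBy, ONTOLOGY_IRI))

g.serialize("uberon-annotated.ttl", format="turtle")
```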
Playing with your ROBOKOP instance, I noticed two things and would like to suggest some improvements for increased provenance transparency.
Use `primary_knowledge_source` only ever for the very original source
Let's look at a query like this: for the exact same association, you get a wild mix of `primary_knowledge_source` and `aggregator_knowledge_source`.

I believe, while it is hard, that for aggregation of scientific evidence it is critical that the very original source making that statement is the `primary_knowledge_source`, and all other downstream sources should only ever be mentioned as `aggregator_knowledge_source`. In particular, "infores:monarchinitiative" should (at least if I am not mistaken, given "infores:hpo-annotations", cc @kevinschaper?) never appear as a `primary_knowledge_source`. The same goes for pharos. Neither should, and this is IMO very important, `infores:ubergraph`. Here it is critical that every single integrated edge gets the infores of its source ontology (e.g. infores:uberon) for maximum transparency. Again, it would be great if the edge could point somehow to a version of the knowledge source (`asserted_id`: [ubergraph2023-01-01, pharos2024-03-04, monarch2024-03-04], etc), but I don't know how that's done in Biolink.

This is not just about provenance. This is also about attribution: we want to make sure that when we deliver high-impact KGs like ROBOKOP to the science world, everyone knows that "wow, uberon really made a difference to beef up the context for our node embeddings". This is only possible if we add that info on every single edge.
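To make the proposed convention concrete, a hedged example of what a single edge's provenance could look like (placeholder CURIEs and an assumed graph-level infores, not actual ROBOKOP data): the original annotation source is the primary, and everything downstream is an aggregator.

```python
# Illustrative example only: the CURIEs and infores values below are
# placeholders, not taken from ROBOKOP. The point is the shape of the
# provenance chain, not the specific association.
edge = {
    "subject": "MONDO:0000001",        # placeholder disease term
    "predicate": "biolink:has_phenotype",
    "object": "HP:0000001",            # placeholder phenotype term
    "primary_knowledge_source": "infores:hpo-annotations",
    "aggregator_knowledge_source": [
        "infores:monarchinitiative",   # aggregated the original annotation
        "infores:robokop-kg",          # hypothetical identifier for the final graph
    ],
}
```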
Either never or always aggregate `aggregator_knowledge_source`
Right now we have a mix of cases, like in the query above. The advantage of "always aggregate" is that you can see immediately how well an edge is supported in the graph (how many aggregators have deemed it trustworthy). On the other hand, there is a risk of not being able to adequately integrate association metadata if it diverges across resources. I don't know the right answer to this, but in order to recommend preprocessing for ML tools (should the number of edges between two nodes matter?), I believe this has to be done consistently.
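If "always aggregate" were chosen, ML preprocessing could read support off each merged edge instead of counting parallel edges; a minimal sketch (the `knowledge_source_count` field name is made up for illustration), reusing the merged-edge shape from the earlier example:

```python
# Hedged sketch: once parallel edges have been collapsed, expose how many
# knowledge sources vouch for each edge as an explicit feature rather than
# as duplicate edges between the same node pair.
def add_support_counts(merged_edges: list[dict]) -> list[dict]:
    for edge in merged_edges:
        sources = set(edge.get("aggregator_knowledge_source", []))
        sources.add(edge["primary_knowledge_source"])
        edge["knowledge_source_count"] = len(sources)  # hypothetical field
    return merged_edges
```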
Otherwise looks great! You have 42 knowledge sources, and all of them appear in the infores registry, which is awesome!
These were my two cents!
cc @marcello-deluca