Initial draft of the SSSOM/RDF spec. #469
This is the complete proposal for the specification of the SSSOM/RDF serialisation format, according to the current state of the discussions about it.
This is a great start. Let's go back and forth a bit over this; I made my first round of comments, the biggest bomb being the suggestion to specify a bespoke serialisation of curie_map.
Use BCP14 keywords more consistently. Add a "special consideration" section to explain the possibility of injecting "direct" SPO triples.
Some cosmetic things, I will ask two reviewers to chime in so we can merge this asap!
This is huge work, thanks @gouttegd!! much appreciated!
I'm going to run prettier on the markdown before reviewing it, after Damien has had a chance to address your suggestions.
Clarify that a "string literal" is a `xsd:string` literal -- this has the side-effect of clarifying that it cannot be a langString. Also fix incorrect use of pav:authoredBy to represent the creator_id slot.
@cthoyt As you wish, but please note that you can also view a “rendered” version directly on the branch: https://github.com/mapping-commons/sssom/blob/rdf-spec/src/docs/spec-formats-rdf.md
This command: `npx prettier --prose-wrap always --check --write src/docs/spec-formats-rdf.md`
- the predicate is either:
  - the property indicated by the `URI` field in the LinkML description of the
    slot, if such a field is present;
  - or a property constructed by concatenating the `https://w3id.org/sssom/`
is there a situation where this actually happens? Shouldn't the LinkML schema be explicit and exhaustive?
Err, have you looked at the SSSOM schema? It happens everywhere. Most of the slots do not have an explicit `URI` field – because for most of the slots there isn’t a readily available property in a pre-existing vocabulary.
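For illustration, the fallback rule under discussion could be sketched as follows. This is only a sketch, not the normative algorithm: the quoted excerpt is cut off after the namespace, so appending the slot name to it is an assumption, and `predicate_for_slot` and its parameters are hypothetical names.

```python
# Namespace quoted in the draft specification excerpt.
SSSOM_NS = "https://w3id.org/sssom/"


def predicate_for_slot(slot_name, slot_uri=None):
    """Return the IRI of the RDF predicate used to represent a slot.

    `slot_uri` stands for the explicit URI field of the slot in the LinkML
    schema, when there is one. Falling back to namespace + slot name is an
    ASSUMPTION: the quoted excerpt is truncated after the namespace.
    """
    if slot_uri:
        return slot_uri
    return SSSOM_NS + slot_name


# A slot without an explicit URI falls back to the SSSOM namespace.
print(predicate_for_slot("mapping_justification"))
# A slot with an explicit URI uses it as is (the URI here is illustrative).
print(predicate_for_slot("creator_id", "http://purl.org/dc/terms/creator"))
```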
(e.g. `mappings`, `extension_definitions`)

The value MUST be represented as a RDF resource. Whether the resource is named
(IRI) or not (blank node) will depend on the type of the object, see the
[section on representing SSSOM objects](#sssom-objects) below for details.
It is a bit confusing that this is documented here but not in full. Can you make an explicit enumeration here of the three things this applies to, with a link to each?
(e.g. `mappings`, `extension_definitions`)
The value MUST be represented as a RDF resource. Whether the resource is named
(IRI) or not (blank node) will depend on the type of the object, see the
[section on representing SSSOM objects](#sssom-objects) below for details.
The value MUST be represented as a RDF resource. Whether the resource is named
(IRI) or not (blank node) will depend on the type of the object, see the following sections:
1. [mappings](#representation-of-a-mapping-object)
2. [mapping set](#representation-of-a-mappingset-object)
3. [extension definition](#representation-of-a-extensiondefinition-object)
Mildly against it: ① it means there will be one more place to update should we ever add a new type of object, and ② I don’t think it is that confusing. It is expected, and maybe even unavoidable, that a specification document should contain “forward references” (references to things that are detailed later); implementers should not be confused by that.
> SSSOM/TSV file (remember that the SSSOM/TSV format _requires_ that identifiers
> be serialised as CURIEs).
shade thrown at CURIEs :p. Let's extend this with a sentence with some context like
> SSSOM/TSV file (remember that the SSSOM/TSV format _requires_ that identifiers
> be serialised as CURIEs).
> SSSOM/TSV file (remember that the SSSOM/TSV format _requires_ that identifiers
> be serialised as CURIEs to support the easy understanding on its content).
Sorry, what? This is a factual statement, how is that a “shade thrown at CURIEs”?
And I believe this factual statement is necessary because people regularly forget the SSSOM/TSV-specific requirement that identifiers MUST be in CURIE form – even the very people who should know very well about that requirement, as in this discussion.
Besides, your added sentence is incorrect. The fact that SSSOM/TSV absolutely requires identifiers to be in CURIE form has nothing to do with “supporting the easy understanding on its contents”.
First because it is dubious that `FBbt:00004508` is easier to understand than `http://purl.obolibrary.org/obo/FBbt_00004508` – personally I’d say it’s worse, because now you need to look up what `FBbt` stands for, whereas the full-length IRI is self-sufficient.
Second, even if CURIEs were actually easier to understand, SSSOM/TSV could very well have simply encouraged the use of CURIE forms, instead of requiring it (as in, “identifiers in SSSOM/TSV files SHOULD be in CURIE form”, instead of “identifiers in SSSOM/TSV files MUST be in CURIE form”).
The real reason SSSOM/TSV absolutely requires the use of CURIEs is, if I recall correctly, that SSSOM-Py developers didn’t want to have to deal with the possibility that a SSSOM/TSV file could contain both full-length identifiers and CURIEs. Notably because this would have required either the use of some heuristics to infer the form of an identifier, or the use of some syntactic sugar to distinguish between an IRI and a CURIE, like what is done in most RDF syntaxes (e.g. IRIs in angle brackets vs “naked” CURIEs), which nobody wanted for some reason.
But regardless of all that, there is no need for the RDF specification to elaborate on why SSSOM/TSV requires the use of CURIEs. People simply need to be reminded that it is the case, because they frequently forget about it and this has consequences if you want to be able to convert a SSSOM/RDF mapping set back to the SSSOM/TSV format.
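To make the round-trip consequence concrete, here is a minimal sketch of what a converter has to do when going from SSSOM/RDF back to SSSOM/TSV: every IRI must be contracted to a CURIE, which is only possible if a matching prefix is declared. The `contract` helper and the longest-prefix rule are assumptions made for illustration, not something taken from the specification or from an existing library.

```python
def contract(iri, prefix_map):
    """Contract a full-length IRI into a CURIE.

    `prefix_map` maps prefix names to URL prefixes. The longest matching
    URL prefix wins (an assumed tie-breaking rule). A ValueError is raised
    when no declared prefix matches, since SSSOM/TSV cannot hold a bare IRI.
    """
    best = None
    for name, url_prefix in prefix_map.items():
        if iri.startswith(url_prefix):
            if best is None or len(url_prefix) > len(best[1]):
                best = (name, url_prefix)
    if best is None:
        raise ValueError(f"no prefix declared for {iri}")
    name, url_prefix = best
    return f"{name}:{iri[len(url_prefix):]}"


prefix_map = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "OBO": "http://purl.obolibrary.org/obo/",
}
print(contract("http://purl.obolibrary.org/obo/FBbt_00004508", prefix_map))
```

With the prefix map above, the call prints `FBbt:00004508`; an IRI outside any declared namespace raises an error instead of silently ending up in the TSV file in full-length form.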
When that behaviour is enabled, implementations SHOULD NOT inject such triples
in the following cases:

- when the record represents a literal mapping (that is, `subject_type` or
oh man, every time I remember this is actually a part of SSSOM, I cry
Well, the previous solution (the so-called “literal profile”) was even worse.
>
> It is recommended not to inject such direct triples for negated mapping
> records because they would seem to convey a meaning that is the exact opposite
> of what the records mean.
Thanks, this is super important. I think it raises the more general question of whether there is a standard for representing "negated" triples in RDF. Maybe we can link to external information with more explanation about why this doesn't exist, if it doesn't (and if it does, maybe we should use it).
I don't think it exists after some looking around
The only thing I am aware of (which, does it even need to be said, does not mean a lot! :D ) is the concept of RDF Surfaces.
Basically, the idea (as far as I understand it) is that some RDF triples could live within a “negated surface”, a part of the RDF graph that only contains negated assertions. This would look like this:
@prefix log: <http://www.w3.org/2000/10/swap/log#> .
@prefix FBbt: <http://purl.obolibrary.org/obo/FBbt_> .
@prefix UBERON: <http://purl.obolibrary.org/obo/UBERON_> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
(_:x) log:onNegativeSurface {
    FBbt:00004508 skos:exactMatch UBERON:0000056 .
} .
Of note, however:
① The paper is still under review, so it certainly doesn’t look like it’s ready for prime-time.
② Not sure how much of that is actually usable in RDF. The examples in the paper (and the example above, which is derived from them) use N3 because it allows expressing formulae (the list of statements enclosed within `{ }`), something that does not exist in RDF.
> records because they would seem to convey a meaning that is the exact opposite
> of what the records mean.
>
> It is recommended not to inject such direct triples for no-match mapping
Musing: we might consider for the future how to represent the fact that there's no match in a different way
The time to discuss that was before the 1.0 release last year. Now, even if we come up with another way to represent the absence of match, we will have to keep the `sssom:NoTermFound` mechanism for compatibility anyway.
I meant making our own RDF idiom for it
Is there really a need for it?
Given the following mapping:
subject_id predicate_id object_id
FBbt:00004568 skos:exactMatch sssom:NoTermFound
The fact that `FBbt:00004568` has no match is already expressed in the RDF serialisation:
[] a owl:Axiom ;
    owl:annotatedSource FBbt:00004568 ;
    owl:annotatedProperty skos:exactMatch ;
    owl:annotatedTarget sssom:NoTermFound .
Yes, it does not appear as a “direct triple”, but I fail to see how that is a problem.
The point of “direct triples” is a convenience for RDF consumers, so that they can quickly find the mappings they want by simple queries over the RDF graph. Like, they want all the exact matches to `FBbt:00004568`? They just need to get all the triples where `FBbt:00004568` is the subject and `skos:exactMatch` is the predicate – something that, with most RDF libraries, is a matter of a single function call. And if that query does not return anything (which would be the case in our example here), well, they know that there is no match.
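The argument can be sketched in a few lines. This is a toy model, not a real RDF library: the graph is a plain Python set of string triples, and `direct_triples` and `objects` are hypothetical helpers. The sketch assumes that a negated record is one whose `predicate_modifier` is `Not`, and it leaves out the literal-mapping case because the quoted excerpt describing it is truncated.

```python
NO_TERM_FOUND = "sssom:NoTermFound"


def direct_triples(records):
    """Yield the optional "direct" (subject, predicate, object) triples.

    Negated records and no-match records are skipped, following the
    recommendations quoted above. Literal mappings are NOT handled here.
    """
    for record in records:
        if record.get("predicate_modifier") == "Not":
            continue  # negated mapping: a direct triple would say the opposite
        if NO_TERM_FOUND in (record["subject_id"], record["object_id"]):
            continue  # no-match record: there is nothing to point at
        yield (record["subject_id"], record["predicate_id"], record["object_id"])


def objects(graph, subject, predicate):
    """Stand-in for the single lookup call most RDF libraries offer."""
    return sorted(o for (s, p, o) in graph if s == subject and p == predicate)


records = [
    {"subject_id": "FBbt:00004568", "predicate_id": "skos:exactMatch",
     "object_id": "sssom:NoTermFound"},
    {"subject_id": "FBbt:00004508", "predicate_id": "skos:exactMatch",
     "object_id": "UBERON:0000056"},
]
graph = set(direct_triples(records))

print(objects(graph, "FBbt:00004568", "skos:exactMatch"))  # empty: no match
print(objects(graph, "FBbt:00004508", "skos:exactMatch"))
```

The first lookup returns an empty list, which is all a consumer needs to conclude that there is no match; the no-match record itself would remain available in the reified `owl:Axiom` form.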
Thanks all for being patient and @gouttegd for writing this important groundwork. I had a full read and left some minor comments.
One larger question is still about the role of this specification. Is the goal to make sure that we can implement RDF serialization outside of a LinkML context (which I consider very important; it's not reliable to defer to that implicitly)?
In many cases, I think we should be much more explicit. Let's consider, in a follow-up, developing an example suite of SSSOM/TSV and SSSOM/RDF input/output pairs against which a serializer can be validated.
Err, to allow developers to write SSSOM/RDF serialisers and deserialisers? The same way the SSSOM/TSV specification is there to allow developers to write SSSOM/TSV serialisers and deserialisers. What else do you think the specification of a format is for?
Yes. Because LinkML is made by and for Python developers. The LinkML runtime (with its built-in RDF serialisers and deserialisers) is only available in Python – there is no support whatsoever for any other language. Programmers in other languages have to implement RDF serialisation “outside of a LinkML context”, because there is no such thing as “a LinkML context” for them. And as if that was not enough, it so happens that LinkML barely bothers to fully describe the way their RDF serialiser and deserialiser work (how objects described in a LinkML schema are turned into a RDF graph, or read from a RDF graph), which means that in practice, without the formal specification that we are trying to make here, the only way for someone wishing to read/write SSSOM/RDF files while having the silly idea of not working in Python is to reverse-engineer SSSOM-Py – that’s what I had to do when I added RDF support in SSSOM-Java. I highlighted at the very beginning of the discussion about the RDF serialisation that this was not acceptable for something claiming to be a “standard”.
Besides, the LinkML-generated serialization does not always do what we’d want. For example, it will serialize a mapping record with an explicit `record_id` as:
[] a owl:Axiom ;
    owl:annotatedSource UBERON:0000001 ;
    owl:annotatedProperty semapv:crossSpeciesExactMatch ;
    owl:annotatedTarget FBbt:00000001 ;
    sssom:mapping_justification semapv:ManualMappingCuration ;
    sssom:record_id "https://example.org/mymapping1" .
whereas it was quickly agreed in the discussion about the RDF serialisation that, whenever a `record_id` is present, the record should instead be serialized as:
<https://example.org/mymapping1> a owl:Axiom ;
    owl:annotatedSource UBERON:0000001 ;
    owl:annotatedProperty semapv:crossSpeciesExactMatch ;
    owl:annotatedTarget FBbt:00000001 ;
    sssom:mapping_justification semapv:ManualMappingCuration .
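The difference between the two outputs comes down to how the subject node of the record is chosen. A minimal sketch of that choice, assuming the record is a plain dictionary and that only the slots shown in the example are emitted (`mapping_subject` and `to_turtle` are hypothetical helpers, and prefix declarations are left out):

```python
def mapping_subject(record):
    """Return the Turtle term used as the subject of a mapping record.

    A record with a `record_id` becomes a named resource; a record without
    one is left as a blank node.
    """
    record_id = record.get("record_id")
    return f"<{record_id}>" if record_id else "[]"


def to_turtle(record):
    """Render the reified form of a mapping record (prefixes not declared)."""
    lines = [
        f"{mapping_subject(record)} a owl:Axiom ;",
        f"    owl:annotatedSource {record['subject_id']} ;",
        f"    owl:annotatedProperty {record['predicate_id']} ;",
        f"    owl:annotatedTarget {record['object_id']} ;",
        f"    sssom:mapping_justification {record['mapping_justification']} .",
    ]
    return "\n".join(lines)


record = {
    "record_id": "https://example.org/mymapping1",
    "subject_id": "UBERON:0000001",
    "predicate_id": "semapv:crossSpeciesExactMatch",
    "object_id": "FBbt:00000001",
    "mapping_justification": "semapv:ManualMappingCuration",
}
print(to_turtle(record))
```

In this sketch the identifier is carried by the node itself rather than by a separate `sssom:record_id` literal, which matches the second form shown above.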
Resolves [#421, #457]
- `docs/` have been added/updated if necessary
- `make test` has been run locally
- [ ] tests have been added/updated (if applicable)