Contributor

@gouttegd gouttegd commented Jul 14, 2025

Resolves [#421, #457]

  • docs/ have been added/updated if necessary
  • make test has been run locally
  • [ ] tests have been added/updated (if applicable)
  • CHANGELOG.md has been updated.

This is the complete proposal for the specification of the SSSOM/RDF serialisation format, according to the current state of the discussions about it.

@gouttegd gouttegd self-assigned this Jul 14, 2025
@gouttegd gouttegd requested a review from matentzn July 14, 2025 22:23
As noticed by @nichtich:

> the use of `pav:authoredOn` only makes sense if `pav:createdOn` is
> used as well to differentiate two types of dates, in addition to the
> publication date. SSSOM only has one type of date so there is no need
> not to use plain old `dcterms:created`.

closes #457
Collaborator

@matentzn matentzn left a comment

This is a great start. Let's go back and forth over this a bit; I made my first round of comments, the biggest bomb being to specify a bespoke serialisation of curie_map.

Use BCP14 keywords more consistently.

Add a "special consideration" section to explain the possibility of
injecting "direct" SPO triples.
Collaborator

@matentzn matentzn left a comment

Some cosmetic things, I will ask two reviewers to chime in so we can merge this asap!

This is huge work, thanks @gouttegd!! much appreciated!

@matentzn matentzn requested review from cthoyt and ehartley September 24, 2025 20:17
Member

cthoyt commented Sep 24, 2025

i'm going to run prettier on the markdown before reviewing it, after damien has a chance to address your suggestions

Clarify that a "string literal" is a `xsd:string` literal -- this has
the side-effect of clarifying that it cannot be a langString.

Also fix incorrect use of pav:authoredBy to represent the creator_id
slot.
@gouttegd
Contributor Author

@cthoyt As you wish, but please note that you can also view a “rendered” version directly on the branch: https://github.com/mapping-commons/sssom/blob/rdf-spec/src/docs/spec-formats-rdf.md

This command: `npx prettier --prose-wrap always --check --write src/docs/spec-formats-rdf.md`
- the predicate is either:
- the property indicated by the `URI` field in the LinkML description of the
slot, if such a field is present;
- or a property constructed by concatenating the `https://w3id.org/sssom/`
Member

is there a situation where this actually happens? Shouldn't the LinkML schema be explicit and exhaustive?

Contributor Author

@gouttegd gouttegd Sep 26, 2025

Err, have you looked at the SSSOM schema? It happens everywhere. Most of the slots do not have an explicit URI field – because for most of the slots there isn’t a readily available property in a pre-existing vocabulary.
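
The predicate rule under discussion can be sketched as follows. This is only an illustration, not the specification's wording: the quoted rule is cut off after the namespace, so concatenating it with the slot name is an assumption, and the function name `predicate_for_slot` and its parameters are hypothetical.

```python
# Namespace quoted in the rule under review.
SSSOM_NS = "https://w3id.org/sssom/"


def predicate_for_slot(slot_name, slot_uri=None):
    """Return the predicate IRI used to serialise a slot (hypothetical helper).

    slot_uri stands for the `URI` field of the slot's LinkML description,
    or None when the slot has no such field.
    """
    if slot_uri:
        # The schema names an existing property: use it as-is.
        return slot_uri
    # ASSUMPTION: the fallback appends the slot name to the SSSOM namespace.
    return SSSOM_NS + slot_name


# A slot without an explicit URI falls back to the SSSOM namespace.
print(predicate_for_slot("mapping_justification"))
# A slot with an explicit URI (the IRI here is only illustrative).
print(predicate_for_slot("creator_id", "http://purl.org/dc/terms/creator"))
```

Under that assumption, most slots would end up in the SSSOM namespace, which matches the remark that the fallback "happens everywhere".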

Comment on lines +76 to +80
(e.g. `mappings`, `extension_definitions`)

The value MUST be represented as a RDF resource. Whether the resource is named
(IRI) or not (blank node) will depend on the type of the object, see the
[section on representing SSSOM objects](#sssom-objects) below for details.
Member

It is a bit confusing that this is documented here but not in full. Can you make an explicit enumeration here of the three things this applies to, with a link to each?

Suggested change
(e.g. `mappings`, `extension_definitions`)
The value MUST be represented as a RDF resource. Whether the resource is named
(IRI) or not (blank node) will depend on the type of the object, see the
[section on representing SSSOM objects](#sssom-objects) below for details.
The value MUST be represented as a RDF resource. Whether the resource is named
(IRI) or not (blank node) will depend on the type of the object, see the following sections:
1. [mappings](#representation-of-a-mapping-object)
2. [mapping set](#representation-of-a-mappingset-object)
3. [extension definition](#representation-of-a-extensiondefinition-object)

Contributor Author

Mildly against it ① as it means there will be one more place to update should we ever add a new type of object, and ② I don’t think it is that confusing. It is expected, and maybe even unavoidable, that a specification document should contain “forward references” (references to things that are detailed later), implementers should not be confused by that.

Comment on lines +198 to +199
> SSSOM/TSV file (remember that the SSSOM/TSV format _requires_ that identifiers
> be serialised as CURIEs).
Member

shade thrown at CURIEs :p. Let's extend this with a sentence with some context like

Suggested change
> SSSOM/TSV file (remember that the SSSOM/TSV format _requires_ that identifiers
> be serialised as CURIEs).
> SSSOM/TSV file (remember that the SSSOM/TSV format _requires_ that identifiers
> be serialised as CURIEs to support the easy understanding on its content).

Contributor Author

@gouttegd gouttegd Sep 26, 2025

Sorry, what? This is a factual statement, how is that a “shade thrown at CURIEs”?

And I believe this factual statement is necessary because people regularly forget the SSSOM/TSV-specific requirement that identifiers MUST be in CURIE form – even the very people who should know very well about that requirement, as in this discussion.

Besides, your added sentence is incorrect. The fact that SSSOM/TSV absolutely requires identifiers to be in CURIE form has nothing to do with “supporting the easy understanding on its contents”.

First, because it is dubious that FBbt:00004508 is easier to understand than http://purl.obolibrary.org/obo/FBbt_00004508 – personally I’d say it’s worse, because now you need to look up what FBbt stands for, whereas the full-length IRI is self-sufficient.

Second, even if CURIEs were actually easier to understand, SSSOM/TSV could very well have simply encouraged the use of CURIE forms, instead of requiring it (as in, “identifiers in SSSOM/TSV files SHOULD be CURIE form”, instead of “identifiers in SSSOM/TSV files MUST be in CURIE form”).

The real reason SSSOM/TSV absolutely requires the use of CURIEs is, if I recall correctly, that SSSOM-Py developers didn’t want to have to deal with the possibility that a SSSOM/TSV file could contain both full-length identifiers and CURIEs. Notably, this would have required either the use of some heuristics to infer the form of an identifier, or the use of some syntactic sugar to distinguish between an IRI and a CURIE, like what is done in most RDF syntaxes (e.g. IRIs in angle brackets vs “naked” CURIEs), which nobody wanted for some reason.

But regardless of all that, there is no need for the RDF specification to elaborate on why SSSOM/TSV requires the use of CURIEs. People simply need to be reminded that it is the case, because they frequently forget about it and this has some consequences if you want to be able to re-convert a SSSOM/RDF mapping set back to the SSSOM/TSV format.
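
To illustrate the consequence for round-tripping, here is a minimal sketch of compressing an IRI read from an RDF graph back into a CURIE, as a SSSOM/TSV writer would need to do. The function name `compress` and the prefix map contents are assumptions made for the example; this is not how any particular SSSOM library does it.

```python
def compress(iri, prefix_map):
    """Return the CURIE form of an IRI, or None if no known prefix applies.

    prefix_map maps a prefix name to its namespace IRI. When several
    namespaces match, the longest one wins.
    """
    best = None
    for prefix, namespace in prefix_map.items():
        if iri.startswith(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        # No prefix is available, so the identifier cannot be written
        # to SSSOM/TSV in the required CURIE form.
        return None
    prefix, namespace = best
    return prefix + ":" + iri[len(namespace):]


# Hypothetical prefix map, for illustration only.
prefix_map = {
    "obo": "http://purl.obolibrary.org/obo/",
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
}
print(compress("http://purl.obolibrary.org/obo/FBbt_00004508", prefix_map))
print(compress("https://example.org/unknown/123", prefix_map))
```

The second call returns `None`: an identifier with no declared prefix is exactly the case a SSSOM/RDF producer has to anticipate if the mapping set is ever to be converted back to SSSOM/TSV.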

When that behaviour is enabled, implementations SHOULD NOT inject such triples
in the following cases:

- when the record represents a literal mapping (that is, `subject_type` or
Member

oh man, every time I remember this is actually a part of SSSOM, I cry

Contributor Author

Well, the previous solution (the so-called “literal profile”) was even worse.

>
> It is recommended not to inject such direct triples for negated mapping
> records because they would seem to convey a meaning that is the exact opposite
> of what the records mean.
Member

Thanks, this is super important. I think it begs the more general question of if there's a standard for representing "negated" triples in RDF. Maybe we can link to external information with more explanation about why this doesn't exist, if it doesn't (and if it does, maybe we should use it)

I don't think it exists after some looking around

Contributor Author

The only thing I am aware of (which, does it even need to be said, does not mean a lot! :D ) is the concept of RDF Surfaces.

Basically, the idea (as far as I understand it) is that some RDF triples could live within a “negated surface”, a part of the RDF graph that only contains negated assertions. This would look like this:

@prefix log: <http://www.w3.org/2000/10/swap/log#> .
@prefix FBbt: <http://purl.obolibrary.org/obo/FBbt_> .
@prefix UBERON: <http://purl.obolibrary.org/obo/UBERON_> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

(_:x) log:onNegativeSurface {
  FBbt:00004508 skos:exactMatch UBERON:0000056 .
} .

Of note, however:

① The paper is still under review, so it certainly doesn’t look like it’s ready for prime-time.

② Not sure how much of that is actually usable in RDF. The examples in the paper (and the example above, which is derived from them) use N3 because it allows formulae to be expressed (the list of statements enclosed within { }), something that does not exist in RDF.

> records because they would seem to convey a meaning that is the exact opposite
> of what the records mean.
>
> It is recommended not to inject such direct triples for no-match mapping
Member

@cthoyt cthoyt Sep 26, 2025

Musing: we might consider for the future how to represent the fact that there's no match in a different way

Contributor Author

The time to discuss that was before the 1.0 release last year. Now, even if we come up with another way to represent the absence of match, we will have to keep the sssom:NoTermFound mechanism for compatibility anyway.

Member

I meant making our own RDF idiom for it

Contributor Author

Is there really a need for it?

Given the following mapping:

subject_id      predicate_id      object_id
FBbt:00004568   skos:exactMatch   sssom:NoTermFound

The fact FBbt:00004568 has no match is already expressed in the RDF serialisation:

[] a owl:Axiom ;
   owl:annotatedSource FBbt:00004568 ;
   owl:annotatedProperty skos:exactMatch ;
   owl:annotatedTarget sssom:NoTermFound .

Yes, it does not appear as a “direct triple”, but I fail to see how that is a problem.

The point of “direct triples” is to offer a convenience to RDF consumers, so that they can quickly find the mappings they want by simple queries over the RDF graph. Like, they want all the exact matches to FBbt:00004568? They just need to get all the triples where FBbt:00004568 is the subject and skos:exactMatch is the predicate – something that, with most RDF libraries, is a matter of a single function call. And if that query does not return anything (which would be the case in our example here), well, they know that there is no match.
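
The lookup described above can be sketched with nothing more than a set of triples. This is a simplified stand-in for a real RDF graph: identifiers are kept as plain CURIE strings, the blank node label `_:b0` and the helper name `objects` are made up for the example, and the direct triple for UBERON:0000001 is assumed to have been injected by the serialiser. With an actual RDF library the same query is typically one call, for instance `Graph.objects(subject, predicate)` in rdflib.

```python
# A toy graph: one injected "direct" triple, plus the reified record
# for the no-match mapping (which gets no direct triple).
graph = {
    ("UBERON:0000001", "semapv:crossSpeciesExactMatch", "FBbt:00000001"),
    ("_:b0", "rdf:type", "owl:Axiom"),
    ("_:b0", "owl:annotatedSource", "FBbt:00004568"),
    ("_:b0", "owl:annotatedProperty", "skos:exactMatch"),
    ("_:b0", "owl:annotatedTarget", "sssom:NoTermFound"),
}


def objects(graph, subject, predicate):
    """Return, sorted, every object of the triples with this subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# A mapped term: the direct triple is found immediately.
print(objects(graph, "UBERON:0000001", "semapv:crossSpeciesExactMatch"))
# The no-match term: nothing comes back, so the consumer knows there is no match.
print(objects(graph, "FBbt:00004568", "skos:exactMatch"))
```

The no-match information is still available to anyone who queries the reified `owl:Axiom` record; it simply does not surface through the shortcut.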

Member

@cthoyt cthoyt left a comment

Thanks all for being patient and @gouttegd for writing this important groundwork. I had a full read and left some minor comments.

One larger question is still about what's the role of this specification? Is the goal to make sure that we can implement RDF serialization outside of a LinkML context? (I consider that very important; it's not reliable to defer to LinkML implicitly.)

In many cases, I think we should be much more explicit. Let's consider, in a follow-up, developing an example suite of SSSOM/TSV and SSSOM/RDF input/output pairs against which a serializer can be validated.

Contributor Author

gouttegd commented Sep 26, 2025

> One larger question is still about what's the role of this specification?

Err, to allow developers to write SSSOM/RDF serialisers and deserialisers? The same way the SSSOM/TSV specification is there to allow developers to write SSSOM/TSV serialisers and deserialisers.

What else do you think the specification of a format is for?

> Is the goal to make sure that we can implement RDF serialization outside of a LinkML context

Yes. Because LinkML is made by and for Python developers. The LinkML runtime (with its built-in RDF serialisers and deserialisers) is only available in Python – there is no support whatsoever for any other language. Programmers in other languages have to implement RDF serialisation “outside of a LinkML context”, because there is no such thing as “a LinkML context” for them.

And as if that was not enough, it so happens that LinkML barely bothers to fully describe the way their RDF serialiser and deserialiser work (how objects described in a LinkML schema are turned into a RDF graph, or read from a RDF graph), which means that in practice, without the formal specification that we are trying to make here, the only way for someone wishing to read/write SSSOM/RDF files while having the silly idea of not working in Python is to reverse-engineer SSSOM-Py – that’s what I had to do when I added RDF support in SSSOM-Java. I highlighted at the very beginning of the discussion about the RDF serialisation that this was not acceptable for something claiming to be a “standard”.

Contributor Author

gouttegd commented Sep 26, 2025

Besides, the LinkML-generated serialization does not always do what we’d want.

For example, it will serialize a mapping record with an explicit record_id as:

[] a owl:Axiom ;
   owl:annotatedSource UBERON:0000001 ;
   owl:annotatedProperty semapv:crossSpeciesExactMatch ;
   owl:annotatedTarget FBbt:00000001 ;
   sssom:mapping_justification semapv:ManualMappingCuration ;
   sssom:record_id "https://example.org/mymapping1" .

whereas it was quickly agreed in the discussion about the RDF serialisation that, whenever a record_id is available, it should be used as the named resource that represents the entire record (this is one of the points that were the most important to @matentzn ), as in:

<https://example.org/mymapping1> a owl:Axiom ;
   owl:annotatedSource UBERON:0000001 ;
   owl:annotatedProperty semapv:crossSpeciesExactMatch ;
   owl:annotatedTarget FBbt:00000001 ;
   sssom:mapping_justification semapv:ManualMappingCuration .
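
The agreed behaviour can be sketched as a small function that picks the node representing the record. This is a rough illustration only: the function name `record_to_turtle` and the dictionary layout are hypothetical, prefix declarations are omitted, only the slots from the example above are handled, and whether `record_id` should additionally be written as a property is left out here rather than decided.

```python
def record_to_turtle(record):
    """Render a mapping record (a dict of CURIE strings) as a Turtle snippet."""
    record_id = record.get("record_id")
    # Named resource when a record_id is available, blank node otherwise.
    node = "<" + record_id + ">" if record_id else "[]"
    lines = [
        node + " a owl:Axiom ;",
        "   owl:annotatedSource " + record["subject_id"] + " ;",
        "   owl:annotatedProperty " + record["predicate_id"] + " ;",
        "   owl:annotatedTarget " + record["object_id"] + " ;",
        "   sssom:mapping_justification " + record["mapping_justification"] + " .",
    ]
    return "\n".join(lines)


record = {
    "record_id": "https://example.org/mymapping1",
    "subject_id": "UBERON:0000001",
    "predicate_id": "semapv:crossSpeciesExactMatch",
    "object_id": "FBbt:00000001",
    "mapping_justification": "semapv:ManualMappingCuration",
}
print(record_to_turtle(record))
```

Removing the `record_id` key from the dictionary makes the same function fall back to the anonymous `[]` form.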
