Make RO-Crate paths absolute within knowledge graphs #1
Comments
I tried a bit; see the attached branch if you want. I will continue in the next couple of days. One thing is unclear: where should the version to include in the arcp:// URL come from? It's not in the metadata aggregation and not in the RO-Crates. I suppose it's part of the WorkflowHub record, or did I miss it somewhere? Also I notice that some records contain Finally there are plenty of records returning 404 or 403.
I think the BSC/COMPSs ones like @rsirvent may know more. Obviously
Branch looks great, @volodymyrss! I think @alexhambley may have an idea of where to find the workflow version; we may have to embed some metadata from the Bioschemas https://workflowhub.eu/workflows/29.jsonld (or custom JSON https://workflowhub.eu/workflows/29.json) and look for the Bioschemas info should probably also be added to the knowledge graph but that's a separate issue.
You are right, @stain. As discussed in today's call, and just to update this thread, the URL is a placeholder for where the file can be located (on a specific machine, in a specific path), not a URL that is publicly available or resolvable. So, if a user wants to re-execute an application on the same HPC system where it was originally run, they need to ensure they have access permissions to that specific path on that specific machine. This is useful for use cases where reproducibility needs a specific hardware/platform to reproduce results.
Thanks @stain and @rsirvent for the explanation. I will include a comment mentioning that we include this kind of link even if the hostname is not an FQDN.
@volodymyrss will try to do a secondary call to the workflow
It's now implemented in https://github.com/workflowhub-eu/workflowhub-graph/tree/feature-absolute-url-ro-crates and tested, e.g., in workflowhub-graph/tests/test_absolutize.py (lines 35 to 51 in aaf025d).
Further validation will follow according to #12. As agreed in the meeting today, closing this.
Most RO-Crates in WorkflowHub are attached RO-Crates, as a ro-crate-metadata.json inside a ZIP file, which references other files (and itself) with @id paths within the ZIP file like ./ and workflow.xml
This causes a problem when multiple RO-Crates are loaded into a knowledge graph, as they will all talk about ./ while meaning different RO-Crates. As triple stores like Jena make relative paths absolute, you may end up with file:///dev or similar as the base URI.
The ZIP file (originally git content) is not (currently) exposed by WorkflowHub (@fbacall to verify), so there is no good http/https URI to prefix this with.
https://www.researchobject.org/ro-crate/1.2-DRAFT/appendix/relative-uris.html#establishing-a-base-uri-inside-a-zip-file explains how a base URI can be established within a ZIP file:
However, in this project we should not use a random UUID, as it would be different for each export.
https://s11.no/2018/arcp.html explains the arcp scheme in a bit more detail and https://datatracker.ietf.org/doc/id/draft-soilandreyes-arcp-03 the full detail. (Yes this draft should be progressed to RFC!)
Here we can use a location-based UUID derived from the download URI of the RO-Crate ZIP (even if in practice we end up not downloading the ZIP), as in section A.2 of the draft; effectively, this hashes the URI to make a UUIDv5 identifier.
The Python arcp library can do this UUID and ARCP calculation easily.
The hash-based approach of A.3 could also be used, but then we have to download the RO-Crate and calculate its checksum (sha512, for instance). This would then be unique wherever the ZIP file is encountered.
There is a discussion to be had on what is allowed to change or not on subsequent updates of the knowledge graph. I think we decided for now to only keep the latest version from WorkflowHub, so the first (location-based) style would make sense. However, it should probably still use the versioned download link for the arcp calculation, e.g. https://workflowhub.eu/workflows/795/ro_crate?version=2 would give a different UUID than https://workflowhub.eu/workflows/795/ro_crate?version=1
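As a rough illustration of the location-based calculation on versioned download links, here is a minimal sketch using only the Python standard library. It assumes, following my reading of section A.2 of the draft, that the UUIDv5 is computed in the standard URL namespace; the function name `arcp_location_base` is hypothetical. The arcp library has its own helper for this (`arcp_location`, as far as I recall), which would be the better choice in the actual pipeline.

```python
import uuid


def arcp_location_base(download_url: str) -> str:
    """Return a location-based arcp base URI for a ZIP download URL.

    Hypothetical helper: assumes the UUIDv5 is taken in the standard
    URL namespace, as described in section A.2 of the arcp draft.
    """
    # uuid5 hashes the namespace plus the URL, so the same URL always
    # yields the same UUID, and no download of the ZIP is needed.
    crate_uuid = uuid.uuid5(uuid.NAMESPACE_URL, download_url)
    return f"arcp://uuid,{crate_uuid}/"


v1 = arcp_location_base("https://workflowhub.eu/workflows/795/ro_crate?version=1")
v2 = arcp_location_base("https://workflowhub.eu/workflows/795/ro_crate?version=2")
print(v1)
print(v2)
```

Because the version is part of the hashed URL, each workflow version gets its own stable base URI, which stays the same across repeated exports of the knowledge graph.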
To do this task you could modify the base URI for RDF parsing: for instance, Jena's riot command line tool (see Docker image https://hub.docker.com/r/stain/jena) can take a --base= parameter, and similarly Python rdflib can take a base parameter. A simpler way is to modify the @context as in the RO-Crate documentation: first make sure it's an [array], then add {"@base": "arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/"} to the end of the @context array. The JSON-LD parser should then make all relative URIs absolute within this path.
We can implement this as a single Python function or command line tool perhaps.
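The @context modification could look roughly like the sketch below. This is not the code from the linked branch; the function name `add_base_to_context` and the sample metadata are made up for illustration, and the base URI is the example value quoted above.

```python
import json


def add_base_to_context(metadata: dict, base_uri: str) -> dict:
    """Return a copy of RO-Crate metadata with {"@base": ...} appended to @context.

    Hypothetical helper for illustration only.
    """
    result = dict(metadata)  # shallow copy, the input dict is left untouched
    context = result.get("@context", [])
    # @context may be a single string or object, so wrap it in an array first.
    if not isinstance(context, list):
        context = [context]
    # Append @base last so it is processed after the existing context entries.
    result["@context"] = list(context) + [{"@base": base_uri}]
    return result


# Made-up minimal metadata, standing in for a real ro-crate-metadata.json.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [{"@id": "./", "@type": "Dataset"}],
}
absolutized = add_base_to_context(
    crate, "arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/"
)
print(json.dumps(absolutized["@context"], indent=2))
```

Note that the @id values in the JSON stay relative (./ and so on); it is the JSON-LD parser that resolves them against @base when the crate is loaded into the graph.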
I've tagged @volodymyrss (who has not yet accepted the invite to this repo) as he agreed in call 2024-05-17 to have a look. @alexhambley may have suggestions on how this can fit into the pipeline.