Make RO-Crate paths absolute within knowledge graphs #1

Closed
stain opened this issue May 18, 2024 · 7 comments

@stain
Member

stain commented May 18, 2024

Most RO-Crates in WorkflowHub are attached RO-Crates: a ro-crate-metadata.json inside a ZIP file, which references other files (and itself) with @id paths relative to the ZIP file, like ./ and workflow.xml.

This causes a problem when multiple RO-Crates are loaded into a knowledge graph, as they will all talk about ./ while meaning different RO-Crates. Because triple stores like Jena make paths absolute, you may end up with file:///dev or similar as the base URI.

The ZIP file (originally git content) is not (currently) exposed by WorkflowHub (@fbacall to verify) and so there is no good http/https URI to prefix this with.

https://www.researchobject.org/ro-crate/1.2-DRAFT/appendix/relative-uris.html#establishing-a-base-uri-inside-a-zip-file explains how a base uri can be established within a ZIP file:

For instance, given a randomly generated UUID b7749d0b-0e47-5fc4-999d-f154abe68065 we can use arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/ as the @base:

{
  "@context": [
    "https://w3id.org/ro/crate/1.2-DRAFT/context",
    {"@base": "arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/"}
  ],
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
...

However, in this project we should not use a random UUID, as it would be different for each export.

https://s11.no/2018/arcp.html explains the arcp scheme in a bit more detail and https://datatracker.ietf.org/doc/id/draft-soilandreyes-arcp-03 the full detail. (Yes this draft should be progressed to RFC!)

Here we can use a location-based UUID derived from the download URI of the RO-Crate ZIP (even if in practice we end up not downloading the ZIP), as in section A.2 of the draft. Effectively, this hashes the URI to make a UUIDv5 identifier.

The Python arcp library can do this UUID and ARCP calculation easily:

>>> arcp_location("http://example.com/data.zip", "/file.txt")
'arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/file.txt'
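
To make the hashing step concrete, here is a minimal stdlib-only sketch of the location-based calculation. It assumes, from my reading of section A.2, that the UUIDv5 is computed over the location URI in the standard URL namespace; location_arcp is a hypothetical helper name, not part of the arcp library, and I have not checked its output against the library's.

```python
import uuid


def location_arcp(location: str, path: str = "/") -> str:
    """Build an arcp URI whose authority is a UUIDv5 of the archive location."""
    # Assumption: the name-based UUID uses the standard URL namespace.
    archive_id = uuid.uuid5(uuid.NAMESPACE_URL, location)
    # The path inside the archive is always rooted at "/".
    if not path.startswith("/"):
        path = "/" + path
    return f"arcp://uuid,{archive_id}{path}"


# The versioned download links hash to different, but reproducible, bases.
v1 = location_arcp("https://workflowhub.eu/workflows/795/ro_crate?version=1")
v2 = location_arcp("https://workflowhub.eu/workflows/795/ro_crate?version=2")
print(v1)
print(v2)
```

Because the UUID is derived only from the URI string, repeated exports of the same versioned link give the same base without downloading anything.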

A.3 hash-based identifiers could also be used, but then we would have to download the RO-Crate and calculate its checksum, for instance sha512. The identifier would then be the same wherever the ZIP file is encountered.
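
For comparison, a rough sketch of the hash-based variant using only the standard library. It assumes the draft's A.3 form follows the RFC 6920 "ni" style (algorithm name, then the base64url digest without padding); hash_arcp and its parameters are hypothetical names for illustration, so check the draft before relying on the exact syntax.

```python
import base64
import hashlib

# Assumed mapping from hashlib names to RFC 6920 style algorithm labels.
NI_NAMES = {"sha256": "sha-256", "sha512": "sha-512"}


def hash_arcp(zip_bytes: bytes, path: str = "/", algorithm: str = "sha256") -> str:
    """Build an arcp URI whose authority is a digest of the archive bytes."""
    digest = hashlib.new(algorithm, zip_bytes).digest()
    # base64url encoding with the trailing "=" padding removed.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    if not path.startswith("/"):
        path = "/" + path
    return f"arcp://ni,{NI_NAMES[algorithm]};{encoded}{path}"


example = hash_arcp(b"stand-in for the ZIP bytes", "ro-crate-metadata.json")
print(example)
```

The trade-off is the one described above: this needs the full ZIP content, while the location-based form only needs the download URI.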

There is a discussion to be had on what is allowed to change on subsequent updates of the knowledge graph. I think we decided for now to keep only the latest version from WorkflowHub, so the first (location-based) style makes sense. However, it should probably still use the versioned download link for the arcp calculation, e.g. https://workflowhub.eu/workflows/795/ro_crate?version=2 would give a different UUID than https://workflowhub.eu/workflows/795/ro_crate?version=1

To do this task you could modify the base URI used for RDF parsing: Jena's riot command line tool (see the Docker image https://hub.docker.com/r/stain/jena) can take a --base= parameter, and similarly Python's rdflib can take a base parameter. A simpler way is to modify the @context as in the RO-Crate documentation: first make sure it's an [array], then add {"@base": "arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/"} to the end of the @context array. The JSON-LD parser should then make all relative URIs absolute within this path.

>>> { "@context": [ { "@base": arcp.arcp_location("https://workflowhub.eu/workflows/795/ro_crate?version=1") } ] }
{'@context': [{'@base': 'arcp://uuid,25f1f4ac-57e9-5f9e-942d-220a513ccec7/'}]}
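
The @context modification could look roughly like the sketch below. with_base is a hypothetical helper name, and the sketch only covers the cases described above: a @context that is a single string or object gets wrapped in an array, and the @base entry is appended at the end.

```python
import copy


def with_base(crate: dict, base: str) -> dict:
    """Return a copy of a JSON-LD document with {"@base": base} appended to @context."""
    result = copy.deepcopy(crate)  # leave the caller's document untouched
    context = result.get("@context", [])
    # Normalise a single string or object context into an array first.
    if not isinstance(context, list):
        context = [context]
    context.append({"@base": base})
    result["@context"] = context
    return result


crate = {
    "@context": "https://w3id.org/ro/crate/1.2-DRAFT/context",
    "@graph": [{"@id": "ro-crate-metadata.json", "@type": "CreativeWork"}],
}
rebased = with_base(crate, "arcp://uuid,25f1f4ac-57e9-5f9e-942d-220a513ccec7/")
print(rebased["@context"])
```

Returning a copy rather than editing in place keeps the original crate available, for example for comparing the graph before and after.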

We can implement this as a single Python function or command line tool perhaps.

I've tagged @volodymyrss (who has not yet accepted the invite to this repo) as he agreed in call 2024-05-17 to have a look. @alexhambley may have suggestions on how this can fit into the pipeline.

@volodymyrss volodymyrss self-assigned this May 19, 2024
@volodymyrss volodymyrss pinned this issue May 19, 2024
@volodymyrss
Collaborator

volodymyrss commented May 21, 2024

I tried a bit, see the attached branch if you want; I will continue in the next couple of days. One thing is unclear: where do we take the version to include in the arcp:// URL? It's not in the metadata aggregation and not in the RO-Crates. I suppose it's part of the WorkflowHub record? Or did I miss it somewhere?

Also, I notice that some records contain file:// explicitly, like https://dev.workflowhub.eu/workflows/552/ro_crate (they might even be absolute paths at some BSC facility). Then simply setting @base does not work. Should these records be ignored?

Finally there are plenty of records returning 404 or 403.

@stain
Member Author

stain commented May 23, 2024

I think the BSC/COMPSs ones like file://s07r1b51-ib0/gpfs/home/bsc19/bsc19057/COMPSs-DP/tutorial_apps/java/matmul/C.0.0 are deliberately addressing files on their local HPC file system, and we can leave them as-is. They may be large/sensitive although in this case they look like neither!

@rsirvent may know more. Obviously s07r1b51-ib0 is somewhat unique but not globally unique (it is not a fully qualified hostname as expected by the file: URI scheme), so these could have benefited from their own arcp mapping on the workflow side.

@stain
Member Author

stain commented May 23, 2024

Branch looks great, @volodymyrss! I think @alexhambley may have an idea of where to find the workflow version. We may have to embed some metadata from the Bioschemas https://workflowhub.eu/workflows/29.jsonld (or custom JSON https://workflowhub.eu/workflows/29.json) and look for version. Ideally there should be explicit signposting and links in the API so we don't have to construct any URIs, but we can feed that back to @fbacall for the API.

The Bioschemas info should probably also be added to the knowledge graph, but that's a separate issue.

@rsirvent

> I think the BSC/COMPSs ones like file://s07r1b51-ib0/gpfs/home/bsc19/bsc19057/COMPSs-DP/tutorial_apps/java/matmul/C.0.0 are deliberately addressing files on their local HPC file system, and we can leave them as-is. They may be large/sensitive although in this case they look like neither!

You are right, @stain. As discussed in today's call, and just to update this thread, the URL is a placeholder for where the file can be located (on a specific machine, at a specific path), not a URL that is publicly available or resolvable. So, if a user wants to re-execute an application on the same HPC system where it was originally run, they need to ensure they have access permissions to that specific path on that specific machine. This is useful for use cases where reproducibility needs a specific hardware/platform to reproduce results.

@volodymyrss
Collaborator

> I think the BSC/COMPSs ones like file://s07r1b51-ib0/gpfs/home/bsc19/bsc19057/COMPSs-DP/tutorial_apps/java/matmul/C.0.0 are deliberately addressing files on their local HPC file system, and we can leave them as-is. They may be large/sensitive although in this case they look like neither!

> You are right, @stain. As discussed in today's call, and just to update this thread, the URL is a placeholder for where the file can be located (on a specific machine, at a specific path), not a URL that is publicly available or resolvable. So, if a user wants to re-execute an application on the same HPC system where it was originally run, they need to ensure they have access permissions to that specific path on that specific machine. This is useful for use cases where reproducibility needs a specific hardware/platform to reproduce results.

Thanks @stain and @rsirvent for the explanation. I will include a comment mentioning that we keep these kinds of links even if the hostname is not an FQDN.
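
As an illustration of the distinction discussed here, the sketch below sorts @id values into relative ones (which @base would rewrite) and file: URIs that are left as-is. classify_id and its labels are hypothetical, and the "contains a dot" test is only a crude stand-in for a real FQDN check.

```python
from urllib.parse import urlsplit


def classify_id(identifier: str) -> str:
    """Label an @id as 'relative', 'file-local', 'file-fqdn' or 'absolute'."""
    parts = urlsplit(identifier)
    if not parts.scheme:
        return "relative"  # resolved against @base by the JSON-LD parser
    if parts.scheme == "file":
        # Crude heuristic: a fully qualified hostname contains at least one dot.
        return "file-fqdn" if "." in parts.netloc else "file-local"
    return "absolute"


for example_id in [
    "./",
    "workflow.xml",
    "file://s07r1b51-ib0/gpfs/home/bsc19/bsc19057/COMPSs-DP/tutorial_apps/java/matmul/C.0.0",
    "https://workflowhub.eu/workflows/29.jsonld",
]:
    print(classify_id(example_id), example_id)
```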

@stain
Member Author

stain commented May 24, 2024

@volodymyrss will try to do a secondary call to the workflow .json to fetch its current version, then retrieve the versioned RO-Crate, using that versioned link as part of calculating the base URI.
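
A sketch of the two pieces this involves, without the HTTP call itself. The URL pattern is the one from the versioned download links earlier in this thread. The location of the version inside the workflow .json is an assumption (data, attributes, latest_version), so it is passed in as a parameter; both function names are hypothetical.

```python
def latest_version(record: dict, key_path=("data", "attributes", "latest_version")) -> int:
    """Pull the current version number out of a parsed workflow .json record."""
    # Assumption: the key path above matches the WorkflowHub JSON; adjust if not.
    value = record
    for key in key_path:
        value = value[key]
    return int(value)


def versioned_crate_url(base_url: str, workflow_id: int, version: int) -> str:
    """Build the versioned RO-Crate download link used for the arcp calculation."""
    return f"{base_url.rstrip('/')}/workflows/{workflow_id}/ro_crate?version={version}"


# Stand-in for the parsed response of the secondary call.
record = {"data": {"attributes": {"latest_version": 2}}}
url = versioned_crate_url("https://workflowhub.eu", 795, latest_version(record))
print(url)
```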

@volodymyrss
Collaborator

It's now implemented in https://github.com/workflowhub-eu/workflowhub-graph/tree/feature-absolute-url-ro-crates and tested e.g. here

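import json

import rdflib

# make_paths_absolute, is_all_absolute, json_data and BASE_URL come from the branch above.
subjects = []  # one CreativeWork subject per version
for version in [1, 2]:
    json_data_abs_paths = make_paths_absolute(
        json_data, BASE_URL, 41, version
    )
    parsed_graph = rdflib.Graph().parse(
        data=json.dumps(json_data_abs_paths), format="json-ld"
    )
    assert is_all_absolute(parsed_graph)
    subject = parsed_graph.query(
        "SELECT ?s WHERE { ?s a <http://schema.org/CreativeWork> }"
    ).bindings[0]["s"]
    subjects.append(subject)
# The two versions must not share the same subject URI.
assert subjects[0] != subjects[1]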

Further validation will follow according to #12. As agreed in the meeting today, closing this.
