Added data_access module for managed access data. Fixes #1535 #1537

ESapenaVentura · 2023-11-08T12:00:52Z

Fixes #1535

related to: ebi-ait/dcp-ingest-central#967

Release notes

For type/project/project schema:

Added new required field data_access

For module/ontology/data_access_ontology schema:

Created module

ncalvanese1

LGTM

hannes-ucsc · 2023-11-08T17:29:11Z

Can you describe what use cases this change is intended to facilitate? I assumed we were planning to use DUOS and this change doesn't seem to tie into that.

amnonkhen · 2023-11-09T22:59:25Z

> Can you describe what use cases this change is intended to facilitate? I assumed we were planning to use DUOS and this change doesn't seem to tie into that.

Good point @hannes-ucsc
We can have another field, in addition to type and notes, to hold the DUO code. It can be an enum or an ontology term restricted to children of data use permission or data use modifier. Currently, the implementation on the ingest side takes the type property into account when authorising an API call. I don't object to using the DUO code for that, but I would like to finish system tests on the change as it is. To do so, I need this schema change live in the dev environment, which requires merging it to the dev branch.

hannes-ucsc · 2023-11-09T23:11:59Z

When I said DUOS, I meant https://duos.broadinstitute.org/

@ncalvanese1 would the addition of a field with the DUO ontology as described by @amnonkhen work for you when it comes to setting up snapshot permissions?

amnonkhen · 2023-11-09T23:37:22Z

> When I said DUOS, I meant https://duos.broadinstitute.org/

Thanks for the clarification.
I need a project to have an indication of whether it is open access or managed access, because I use it to allow or prevent access to metadata in ingest via its REST API.
@hannes-ucsc Are you objecting to the type and notes attributes of the dataAccess module, or are you suggesting adding something in addition to them?

hannes-ucsc · 2023-11-09T23:48:45Z

I'm not objecting to anything. I am asking questions.

gabsie · 2023-11-10T10:37:28Z

Hi, @hannes-ucsc @amnonkhen @ESapenaVentura @ncalvanese1

Basically, storing this information (managed access or open access) with us and passing it to the downstream components will allow us to differentiate how to handle those datasets.

As Amnon has said, this way we will make sure these projects are treated with the required security with us in ingest and down the line with you. So for now with the pilot we are starting, we can mark that project as 'managed access'.

There is no problem with adding an additional field/note for DUO codes, but I don't know if at this stage we will always get these from contributors or be able to assign them correctly. I think we have to continue our current conversations about managed access implementations before deciding on the DUO codes.

Hope this is okay and a good first step.

amnonkhen · 2023-11-10T13:17:00Z

Thanks @gabsie for the clarifications and @hannes-ucsc for the questions. I did not say you were objecting, I was trying to understand why you were asking the questions you were asking, in order to figure out how to explain my points better.

hannes-ucsc · 2023-11-14T01:36:00Z

Simply marking a project as containing managed-access data is not sufficient information for determining and enforcing who should have access to that data. Assuming that in order to answer that question, we ultimately want to integrate with DUOS (not DUO) and SAM, the changes proposed here are not taking us in that direction. They will likely need to be backed out and replaced with something else.

For the pilot, why can't we just communicate the set of MA-projects to @ncalvanese1 when they are ready to be imported, instead of burdening the schema with a temporary solution? The set of people who should have access will also have to be communicated to @ncalvanese1 so giving him a few project UUIDs in addition to that set of user identities seems no big deal.

Because I can't attend the biweekly Tuesday meetings, I am likely out of the loop with respect to the pilot. If someone here is the mastermind for the pilot effort, or knows who that person is, please let me know.

amnonkhen · 2023-11-20T16:38:02Z

@hannes-ucsc each component in the DCP will figure out the access rights of its own users. Ingest will get information from the DAC, in either push or pull mode, about users who need access to the project, and protect the API calls accordingly. The same would apply to other components. The access list can change after the project has been exported by ingest, so it is not ingest's responsibility to communicate the access rights downstream. The only thing ingest would communicate is whether the project is a managed access project. Downstream components would query the DAC (or consult a local copy) for managed access projects.

In addition, we need the change in the pilot because ingest uses the schema when it exports files. This is a small change, which we can refine at a later stage.

hannes-ucsc · 2023-11-20T17:49:01Z

> each component in the DCP will figure out the access rights of its own users

That's a bit simplistic, I'm afraid, and, I must point out, your opinion rather than an already accepted decision of the DCP/2.

In my opinion, each component shouldn't have to "figure out access rights". It should enforce access, and do so consistently with other components across the platform, and we've already adopted the specification for how this would work in TDR and downstream. You approved the PR that added that part of the specification. Each component can only enforce access if it has the information to do so. To that extent, Azul and Data Browser implement the specification, and tie into TDR for determining who has access. What's missing is the mechanism by which TDR determines access, and I thought consensus was that we would use DUOS. We need to figure out how this would work, ideally with an addition to the DCP/2 specification.

> This is a small change, which we can refine at a later stage.

If you need a way to temporarily mark a project as managed-access for the pilot, you could just adopt a naming convention for the project description or its short name. That way we wouldn't need to burden the schema with a change that we already know is insufficient for a complete implementation of managed-access for HCA.

gabsie · 2023-11-21T17:05:21Z

Hey both!
@hannes-ucsc - shall we try and organise a short meeting to decide on these? Maybe involving Nate, us, and you?
Tell us some slots that are convenient for you; I think this will help us move on. Managed access has not been easy to coordinate so far. Hannes, there's also a bi-weekly meeting, organised by John Randell at the HCA exec office, which we feel you should be invited to. The next one is Dec 7th, 4pm UK time.

hannes-ucsc · 2023-11-21T19:07:10Z

That works for me!

gabsie · 2023-11-22T11:27:44Z

I will tell them to add you to that meeting. Meanwhile, just the few of us need to catch up, so I suggest Wednesday next week, 29 Dec, at 4 or 5pm UK time. Let me know if this works, otherwise please suggest another time.

amnonkhen · 2023-11-27T15:30:00Z

@hannes-ucsc I see 2 scenarios:

1. It is ingest's responsibility to update DUOS with the managed access information when projects are registered in ingest as "managed access".
2. Managed access registration is recorded on DUOS (not sure yet by whom). HCA components query DUOS when they receive a request to access a project resource, so that they can enforce access to that resource.

I have been under the impression that scenario 2 is the effective scenario. Do you see it differently?

Can you please clarify what you mean?

hannes-ucsc · 2023-11-27T17:28:40Z

> Can you please clarify what you mean?

Specifically, which of my statements is unclear to you?

gabsie · 2023-11-27T21:15:32Z

Hi @amnonkhen and @hannes-ucsc, but also @ncalvanese1 - let's discuss this in a call. I have proposed this Wednesday, 29 Dec, 4 pm UK time. Does that work for you?

hannes-ucsc · 2023-11-27T21:59:48Z

> I have proposed this Wednesday, 29 Dec, 4 pm UK time. Does that work for you?

Assuming you mean Nov 29, I'm hesitant to have a meeting without wranglers or leadership present. We need to determine who the authority is on deciding the conditions by which specific users are granted access to managed-access projects in HCA. I don't think this is something the audience of the meeting you proposed would be able to determine. That's why I'd rather wait for the big picture to be formed during the meeting on 12/7/2023. Then we can discuss the implementation details. OTOH, it would be helpful to hear from @ncalvanese1. If he thinks a separate meeting would be useful, I'd be happy to join.

gabsie · 2023-12-05T09:25:42Z

Hey Hannes, no need for now for the separate meeting.
As a start, let's get you to join the next HCA Managed access call, which will potentially clarify some aspects, specifically the role of the DACO and how they handle registrations and requests.
If we still need another meeting after, we can schedule that.

Gabby

idazucchi · 2024-02-27T15:07:17Z

With the last few commits I addressed two comments which are not visible now:

I’ve changed naming from data use to data restriction for all fields/descriptions
Labels: I’ve made the ontology_label into an enum and added a set of rules to enforce the correct pairs of labels and ids. I think this is important because otherwise we can get cases where the ID indicates that the dataset is managed access while the label indicates open access and we don’t know which one to trust

In addition to this I’ve made the ontology id required so we can never push a project that lacks the access restrictions. I’ve also made the ontology label required so we have a field that’s human readable, which makes it easier to check that we selected the right DUO code for the project
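As a rough illustration of the ID/label pairing rule described in this comment, here is a minimal Python sketch. The pairing table, the function name `check_restriction_pair`, and the "general research use" label are assumptions made for illustration; they are not the contents of the schema file in this PR. The other two labels and the DUO IDs are the ones quoted elsewhere in this thread.

```python
# Hypothetical pairing table: DUO term ID -> expected human-readable label.
# Assumed for illustration; the real rules live in the JSON schema.
ALLOWED_PAIRS = {
    "DUO:0000004": "no restriction",
    "DUO:0000042": "general research use",
    "DUO:0000046": "non-commercial use only",
}


def check_restriction_pair(ontology_id, ontology_label):
    """Return a list of error messages; an empty list means the pair is consistent."""
    errors = []
    if ontology_id is None:
        errors.append("ontology ID is required")
    elif ontology_id not in ALLOWED_PAIRS:
        errors.append(f"unknown ontology ID: {ontology_id}")

    if ontology_label is None:
        errors.append("ontology label is required")
    elif ontology_id in ALLOWED_PAIRS and ALLOWED_PAIRS[ontology_id] != ontology_label:
        # This is the case the comment worries about: ID and label disagree.
        errors.append(
            f"label {ontology_label!r} does not match ID {ontology_id} "
            f"(expected {ALLOWED_PAIRS[ontology_id]!r})"
        )
    return errors


if __name__ == "__main__":
    print(check_restriction_pair("DUO:0000004", "no restriction"))
    print(check_restriction_pair("DUO:0000042", "no restriction"))
```

Making both fields required and rejecting mismatched pairs means a consumer never has to decide which of the two values to trust.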

hannes-ucsc

A couple of questions:

hannes-ucsc · 2024-02-27T17:57:03Z

json_schema/module/ontology/data_use_restriction_ontology.json

@@ -0,0 +1,94 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",


Is the # at the end intended?

ESapenaVentura · 2024-02-28

It's a bit confusing, but yes: for draft 7 the meta-schema $id contains a hash at the end (see https://json-schema.org/draft-07/schema).

Later drafts do not continue this convention; I'm not entirely sure why.

hannes-ucsc · 2024-02-27T18:01:18Z

json_schema/module/ontology/data_use_restriction_ontology.json

+            "type": "string",
+            "enum": [
+                "no restriction",
+                "non-commercial use only",


IIRC, there was some talk during our meeting that NCU was to be used as a modifier in combination with GRU, at least in the form to be filled out by contributors. This PR models NCU as a stand-alone enum item. Do we need to change the form or the schema or are you OK with that inconsistency?

idazucchi · 2024-02-28T16:43:43Z

I restructured the enum field to combine the general research use code with the non-commercial use only modifier. Considering all the rules we’ve added, this module won’t behave like other ontology modules, so I’ve moved it to the project modules instead.

hannes-ucsc · 2024-02-28T17:01:39Z

json_schema/module/project/data_use_restriction.json

+            "enum": [
+                "DUO:0000004",
+                "DUO:0000042",
+                "DUO:0000042;DUO:0000046"


The use of a custom separator strikes me as hacky since it requires custom parsing on the consumer end. Multiplicity in JSON is natively handled as—I hesitate to bring this up again—arrays. You could still restrict the valid term combinations with allOf and require DUO:0000004.

Alternatively, we could ditch the combination from the contributor form. I personally find it confusing and it makes for a simpler "radio button" form UI instead of the more complicated "only some checkbox combinations are valid" approach.
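To make the array alternative concrete, here is a minimal Python sketch of how a consumer could check a list of DUO terms against a fixed set of valid combinations. The set `ALLOWED_COMBINATIONS` and the function name `is_valid_restriction` are assumptions for illustration, mirroring the three enum values in the diff above; this is not how the schema or ingest actually validates the field.

```python
# Hypothetical set of valid term combinations, mirroring the three enum
# values in the diff. frozenset makes the check independent of term order.
ALLOWED_COMBINATIONS = {
    frozenset({"DUO:0000004"}),
    frozenset({"DUO:0000042"}),
    frozenset({"DUO:0000042", "DUO:0000046"}),
}


def is_valid_restriction(terms):
    """Return True if `terms` is a list of distinct DUO IDs forming an allowed combination."""
    if not isinstance(terms, list):
        return False
    if not all(isinstance(term, str) for term in terms):
        return False
    if len(terms) != len(set(terms)):  # reject duplicated terms
        return False
    return frozenset(terms) in ALLOWED_COMBINATIONS


if __name__ == "__main__":
    # With an array, no custom parsing is needed on the consumer side.
    print(is_valid_restriction(["DUO:0000042", "DUO:0000046"]))
    # The separator-based value has to be split before it can be interpreted.
    print(is_valid_restriction("DUO:0000042;DUO:0000046".split(";")))
    print(is_valid_restriction(["DUO:0000046"]))
```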

NoopDog · 2024-02-28T19:31:25Z

Hi folks,

I wanted to take a second to point out some prior art.

It looks like Terra and the DUOS API encode the meaning of the consents rather than consistently using the ontology terms and ontology term IDs.

DUOS API

Looking at the output of the DUOS API, you can see modeling like:

"dataUse": {
    "generalUse": true,
    "hmbResearch": false,
    "diseaseRestrictions": [],
    "populationOriginsAncestry": false,
    "commercialUse": true,
    "ethicsApprovalRequired": false,
    "collaboratorRequired": false,
    "geographicalRestrictions": "",
    "geneticStudiesOnly": false,
    "publicationResults": false
},

Here is an example where ontology term IDs are used (for disease):

"dataUse": {
    "generalUse": false,
    "hmbResearch": false,
    "diseaseRestrictions": [
        "http://purl.obolibrary.org/obo/DOID_1287"
    ],
    "populationOriginsAncestry": false,
    "commercialUse": true,
    "ethicsApprovalRequired": false,
    "collaboratorRequired": false,
    "geographicalRestrictions": "",
    "geneticStudiesOnly": false,
    "publicationResults": false
},

The DUOS API also gives a text summary for the restrictions like:

"translatedDataUse": "Samples are restricted for use under the following conditions:\nData use is limited for studying: brain cancer [DS]\nCommercial use is not prohibited.\nData use for methods development research irrespective of the specified data use limitations is not prohibited.\nRestrictions for use as a control set for diseases other than those defined were not specified."

Terra

In Terra, you can see what looks to be a UI embodiment of this concept in the workspace dataset attributes:

[Screenshot: Terra workspace dataset attributes]

AnVIL Explorer

In the AnVIL Explorer and Dataset Catalog, the consents are represented in a single string:

[Screenshot: consent string in the AnVIL Explorer]

AnVIL Dataset Catalog

Here is an example of a consent code with explanatory text from the AnVIL Dataset Catalog:

[Screenshot: consent code with explanatory text in the AnVIL Dataset Catalog]

Options

Following the examples above, some options for us are:

1. Use a single string to represent the consent e.g., "NRES," "GRU," or "GRU-NPU". This is simple, flexible, and future-proofs us to represent any consent in the future. We would validate that the consent string is in the allowed set.
2. Use the DUOS/Broad-type approach to semantically represent the specific constraints we care about with a structure like:

NRES

{
      "noRestrictions": true,
      "generalResearchUse": false,
      "nonCommercialUseOnly": false
}

GRU

{
      "noRestrictions": false,
      "generalResearchUse": true,
      "nonCommercialUseOnly": false
}

GRU-NPU

{
      "noRestrictions": false,
      "generalResearchUse": true,
      "nonCommercialUseOnly": true
}

Translated Data Use
Optionally for both approaches above, we could add a "translatedDataUse" field, with values calculated from the input of:

NRES - No restrictions
GRU - General research use
GRU-NPU - General research use by not-for-profit organizations only.

The text would need to be validated, of course.

Given that we have basic requirements for representing consents, it seems like option 1 above (e.g., GRU-NPU) might be easiest all around.
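The two options are closely related: the boolean structure and the translated text can both be derived from the single consent string. A minimal Python sketch, assuming hypothetical names (`CONSENT_FLAGS`, `TRANSLATED_DATA_USE`, `expand_consent`) and using only the three codes and descriptions listed in this comment:

```python
# Hypothetical lookup tables built from the three consent codes in this comment.
CONSENT_FLAGS = {
    "NRES": {"noRestrictions": True, "generalResearchUse": False, "nonCommercialUseOnly": False},
    "GRU": {"noRestrictions": False, "generalResearchUse": True, "nonCommercialUseOnly": False},
    "GRU-NPU": {"noRestrictions": False, "generalResearchUse": True, "nonCommercialUseOnly": True},
}

TRANSLATED_DATA_USE = {
    "NRES": "No restrictions",
    "GRU": "General research use",
    "GRU-NPU": "General research use by not-for-profit organizations only.",
}


def expand_consent(code):
    """Validate a consent string and expand it into flags plus a text summary."""
    if code not in CONSENT_FLAGS:
        raise ValueError(f"consent code not in the allowed set: {code!r}")
    return {**CONSENT_FLAGS[code], "translatedDataUse": TRANSLATED_DATA_USE[code]}


if __name__ == "__main__":
    print(expand_consent("GRU-NPU"))
```

Storing only the string and deriving the rest keeps the stored metadata small while still allowing a structured view for consumers that want one.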

I hope this helps. I am curious what other folks think.

Cheers,
D

NoopDog · 2024-02-29T16:59:58Z

After our discussion, I wanted to propose that we use an enum with the following allowed values:

NRES (DUO_0000004)
GRU (DUO_0000042)
GRU-NCU (DUO_0000042 - DUO_0000046)

We could also use:

NRES
GRU
GRU-NCU

And tie the codes to the ontology term IDs in the description section of the enum definition.

Cheers,
D

NoopDog · 2024-02-29T17:24:19Z

If we want to model our requirements more explicitly, it could be done with two fields:

dataUsePermission - required, with allowed values of DUO_0000004 or DUO_0000042
dataUseModifier - optional, with allowed values of DUO_0000046 but only allowed if dataUsePermission is DUO_0000042

So we would have:

NRES

dataUseRestriction: {
   dataUsePermission: "DUO_0000004",
   dataUseModifier: null
}

GRU

dataUseRestriction: {
   dataUsePermission: "DUO_0000042",
   dataUseModifier: null
}

GRU-NCU

dataUseRestriction: {
   dataUsePermission: "DUO_0000042",
   dataUseModifier:  "DUO_0000046"
}

@idazucchi, @amnonkhen, would ingest be able to validate that dataUseModifier: "DUO_0000046" is only used in the context of dataUsePermission: "DUO_0000042"? For example, can we detect the following is invalid and prevent this from being entered?

NRES-NCU (Invalid)

dataUseRestriction: {
   dataUsePermission: "DUO_0000004",
   dataUseModifier:  "DUO_0000046"
}
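For what it's worth, the dependency between the two fields is straightforward to check in code. The following is a minimal Python sketch under stated assumptions: the function name `validate_data_use_restriction` and the lookup tables are hypothetical, and this is not a description of how ingest validates metadata.

```python
# Hypothetical tables for the two-field model proposed above.
PERMISSIONS = {"DUO_0000004", "DUO_0000042"}
# A modifier is only allowed together with the permissions listed here.
MODIFIERS_BY_PERMISSION = {"DUO_0000042": {"DUO_0000046"}}


def validate_data_use_restriction(restriction):
    """Return a list of error messages for a dataUseRestriction object."""
    errors = []
    permission = restriction.get("dataUsePermission")
    modifier = restriction.get("dataUseModifier")  # optional, may be None

    if permission not in PERMISSIONS:
        errors.append(f"dataUsePermission must be one of {sorted(PERMISSIONS)}")

    allowed_modifiers = MODIFIERS_BY_PERMISSION.get(permission, set())
    if modifier is not None and modifier not in allowed_modifiers:
        errors.append(
            f"dataUseModifier {modifier!r} is not allowed with "
            f"dataUsePermission {permission!r}"
        )
    return errors


if __name__ == "__main__":
    valid = {"dataUsePermission": "DUO_0000042", "dataUseModifier": "DUO_0000046"}
    invalid = {"dataUsePermission": "DUO_0000004", "dataUseModifier": "DUO_0000046"}
    print(validate_data_use_restriction(valid))
    print(validate_data_use_restriction(invalid))
```

The same constraint could likely be expressed in the schema itself with draft-07 conditionals, which would keep the rule next to the field definitions instead of in consumer code.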

NoopDog · 2024-02-29T17:25:07Z

For reference, a link to the Data Use Ontology is here https://www.ebi.ac.uk/ols4/ontologies/duo

hannes-ucsc

I withdraw my vote. You can go ahead and merge without my approval.

https://humancellatlas.slack.com/archives/C01360XN04S/p1709227804910639?thread_ts=1709053675.573539&cid=C01360XN04S

NoopDog

LGTM! 🚀 Thanks for working through this!

amnonkhen

KISS! Love it!

ESapenaVentura added 2 commits November 8, 2023 11:57

Added data_access module as a mandatory field. Fixes #1535

944efef

Added data_access module. Fixes #1535

f467ce6

ESapenaVentura requested review from amnonkhen, NoopDog, hannes-ucsc and ncalvanese1 November 8, 2023 13:57

ESapenaVentura mentioned this pull request Nov 8, 2023

add data access details to project #1535

Closed

3 tasks

ncalvanese1 approved these changes Nov 8, 2023

amnonkhen changed the base branch from staging to develop November 9, 2023 23:00

amnonkhen approved these changes Nov 20, 2023

idazucchi added 2 commits February 15, 2024 17:05

Adds DUO codes to indicate data access type

922e89a

removes data access as project module

0c21a7c

Merge branch 'esv-managedAccess-Issue1535' of https://github.com/HumanCellAtlas/metadata-schema into esv-managedAccess-Issue1535

70b5b1d

fixes linting errors

1587e07

hannes-ucsc reviewed Feb 27, 2024

NoopDog self-assigned this Feb 27, 2024

idazucchi added 3 commits February 28, 2024 11:48

treats NCU as a modifier for GRU and moved module to project

a946c53

updates log and version for project module

5e01ac7

updates field names

5607b71

idazucchi requested review from amnonkhen, ncalvanese1 and hannes-ucsc February 28, 2024 16:48

hannes-ucsc reviewed Feb 28, 2024

hannes-ucsc reviewed Feb 29, 2024

data use restriction formatted as enum field with three letter labels

9ec18e2

ESapenaVentura commented Mar 1, 2024

json_schema/type/project/project.json (outdated, resolved)

ESapenaVentura commented Mar 1, 2024

src/schema_linter.py (outdated, resolved)

idazucchi and others added 2 commits March 1, 2024 13:56

Update src/schema_linter.py

3b5e5bb

Co-authored-by: ESapenaVentura <38617863+ESapenaVentura@users.noreply.github.com>

Apply suggestions from code review

1fc0ae4

Co-authored-by: ESapenaVentura <38617863+ESapenaVentura@users.noreply.github.com>

NoopDog approved these changes Mar 1, 2024

ncalvanese1 approved these changes Mar 1, 2024

amnonkhen approved these changes Mar 4, 2024

idazucchi added 2 commits March 4, 2024 13:57

Merge branch 'staging' of https://github.com/HumanCellAtlas/metadata-schema into esv-managedAccess-Issue1535

8bd93a3

Ran release_prepare.py script.

33e7fb8

idazucchi merged commit 91adce8 into staging Mar 4, 2024
3 of 5 checks passed

idazucchi deleted the esv-managedAccess-Issue1535 branch March 4, 2024 14:06
