-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added data_access module for managed access data. Fixes #1535 #1537
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Can you describe what use cases this change is intended to facilitate? I assumed we were planning to use DUOS and this change doesn't seem to tie into that. |
Good point @hannes-ucsc |
When I said DUOS, I meant https://duos.broadinstitute.org/ @ncalvanese1 would the addition of a field with the DUO ontology as described by @amnonkhen work for you when it comes to setting up snapshot permissions? |
Thanks for the clarification. |
I'm not objecting to anything. I am asking questions. |
Hi, @hannes-ucsc @amnonkhen @ESapenaVentura @ncalvanese1 Basically the purpose of storing with us and passing this information (managed access or open access) to the downstream components will allow us to differentiate how to handle those datasets. As Amnon has said, this way we will make sure these projects are treated with the required security with us in ingest and down the line with you. So for now with the pilot we are starting, we can mark that project as 'managed access'. There is no problem to add an additional field/note around DUO codes, but I don't know if at this stage we will always know these from contributors or can assign them correctly. I think we have to continue our current conversations about managed access implementations to decide on the DUO codes. Hope this is okay and a good first step. |
Thanks @gabsie for the clarifications and @hannes-ucsc for the questions. I did not say you were objecting, I was trying to understand why you were asking the questions you were asking, in order to figure out how to explain my points better. |
Simply marking a project as containing managed-access data is not sufficient information for determining and enforcing who should have access to that data. Assuming that in order to answer that question, we ultimately want to integrate with DUOS (not DUO) and SAM, the changes proposed here are not taking us in that direction. They will likely need to be backed out and replaced with something else. For the pilot, why can't we just communicate the set of MA-projects to @ncalvanese1 when they are ready to be imported, instead of burdening the schema with a temporary solution? The set of people who should have access will also have to be communicated to @ncalvanese1 so giving him a few project UUIDs in addition to that set of user identities seems no big deal. Because I can't attend the biweekly Tuesday meetings, I am likely out of the loop with respect to the pilot. If someone here is the mastermind for the pilot effort, or knows who that person is, please let me know. |
@hannes-ucsc each component in the DCP will figure out the access rights of its own users. ingest will get information from the DAC in either push or pull mode about users who need access to the project, and protect the api calls accordingly. The same would apply for other components. The access list can change after the project has been exported by ingest, so it is not ingest's responsibility to communicate the access rights down stream. The only thing ingest would communicate is whether the project is a managed access project. Downstream components would query the DAC (or consult a local copy) for managed access projects. In addition, we need the change in the pilot because ingest uses the schema when it exports files. This is a small change, which we can refine at a later stage. |
That's a bit simplistic, I'm afraid, and I must point out, your opinion, not some already accepted decision of the DCP/2. In my opinion, each component shouldn't have to "figure out access rights", it should enforce access, and do so consistently with other components across the platform, and we've already adopted the specification how this would work in TDR and downstream. You approved the PR that added that part of the specification. Each component can only enforce access if it has the information to do so. To that extent, Azul and Data Browser implement the specification, and tie into TDR for determining who has access. What's missing is the mechanism by which TDR determines access and I thought consensus was that we would use DUOS. We need to figure out how this would work, ideally with an addition to the DCP/2 specification.
If you need a way to temporarily mark a project as managed-access for the pilot, you could just adopt a naming convention for the project description or its short name. That way we wouldn't need to burden the schema with a change that we already know is insufficient for a complete implementation of managed-access for HCA. |
Hey both! |
That works for me! |
I will tell them to add you to that meeting. Meanwhile, just us need to catch up, so I suggest next week Wednesday, 29 Dec, 4 or 5pm UK time? Let me know if this works, otherwise please suggest another time. |
@hannes-ucsc I see 2 scenarios:
I have been under the impression that scenario 2 is the effective scenario. Do you see it differently? Can you please clarify what you mean? |
Specifically, which of my statements is unclear to you? |
Hi @amnonkhen and @hannes-ucsc, but also @ncalvanese1 - let's discuss this in a call. I have proposed this Wednesday, 29 Dec, 4 pm UK time. Does that work for you? |
Assuming you mean Nov 29, I'm hesitant to have a meeting without wranglers or leadership present. We need to determine who the authority is on deciding the conditions by which specific users are granted access to managed-access projects in HCA. I don't think this is something the audience of the meeting you proposed would be able to determine. That's why I'd rather wait for the big picture to be formed during the meeting on 12/7/2023. Then we can discuss the implementation details. OTOH, it would be helpful to hear from @ncalvanese1. If he thinks a separate meeting would be useful, I'd be happy to join. |
Hey Hannes, no need for now for the separate meeting. Gabby |
…nCellAtlas/metadata-schema into esv-managedAccess-Issue1535
With the last few commits I addressed two comments which are not visible now:
In addition to this I’ve made the ontology id required so we can never push a project that lacks the access restrictions. I’ve also made the ontology label required so we have a field that’s human readable, which makes it easier to check that we selected the right DUO code for the project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions:
@@ -0,0 +1,94 @@ | |||
{ | |||
"$schema": "http://json-schema.org/draft-07/schema#", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the #
at the end intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit confusing but yes, for draft 7 the meta-schema $id
contains a hash at the end https://json-schema.org/draft-07/schema
Following drafts do not continue with this convention, not entirely sure why
"type": "string", | ||
"enum": [ | ||
"no restriction", | ||
"non-commercial use only", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, there was some talk during our meeting that NCU was to be used as a modifier in combination with GRU, at least in the form to be filled out by contributors. This PR models NCU as a stand-alone enum item. Do we need to change the form or the schema or are you OK with that inconsistency?
I restructured the enum field to combine the general research use code with the non-commercial use only modifier. Considering all the rules we’ve added this module won’t behave like other ontology modules I’ve moved to the project modules instead. |
"enum": [ | ||
"DUO:0000004", | ||
"DUO:0000042", | ||
"DUO:0000042;DUO:0000046" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of a custom separator strikes me as hacky since it requires custom parsing on the consumer end. Multiplicity in JSON is natively handled as—I hesitate to bring this up again—arrays. You could still restrict the valid term combinations with allOf
and require DUO:0000004
.
Alternatively, we could ditch the combination from the contributor form. I personally find it confusing and it makes for a simpler "radio button" form UI instead of the more complicated "only some checkbox combinations are valid" approach.
Hi folks, I wanted to take a second to point out some prior art. It looks like Terra and the DUOS API encode the meaning of the consents rather than consistently using the ontology terms and ontology term IDs. DUOS APILooking at the output of the DUOS API, you can see modeling like:
Here is an example where ontology term IDs are used (for disease):
The DUOS API also gives a text summary for the restrictions like:
TerraIn Terra, you can see what looks to be a UI embodiment of this concept in the workspace dataset attributes like: AnVIL ExplorerIn the AnVIL Explorer and Dataset Catalog, the consents are represented in a single string like: AnVIL Dataset CatalogHere is an example of a consent code with explanatory text from the AnVIL Dataset Catalog: OptionsFollowing the examples above, some options for us are:
NRES
GRU
GRU-NPU
Translated Data Use NRES - No restrictions The text would need to be validated, of course. Given that we have basic requirements for representing consents, it seems like option 1 above (e.g., GRU-NPU) might be easiest all around. I hope this helps. I am curious what other folks think. Cheers, |
After our discussion, I wanted to propose that we use an enum with the following allowed values:
We could also use:
And tie the codes to the ontology term IDs in the description section of the enum definition. Cheers, |
If we want to model our requirements more explicitly, it could be done with two fields:
So we would have: NRES
GRU
GRU-NPU
@idazucchi, @amnonkhen, would ingest be able to validate that NRES-NPU (Invalid)
|
For reference, a link to the Data Use Ontology is here https://www.ebi.ac.uk/ols4/ontologies/duo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I withdraw my vote. You can go ahead and merge without my approval.
Co-authored-by: ESapenaVentura <38617863+ESapenaVentura@users.noreply.github.com>
Co-authored-by: ESapenaVentura <38617863+ESapenaVentura@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚀 Thanks for working through this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KISS! Love it!
…schema into esv-managedAccess-Issue1535
Fixes #1535
related to: ebi-ait/dcp-ingest-central#967
Release notes
For type/project/project schema:
data_access
For module/ontology/data_access_ontology schema: