Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index AnVIL dataset description from DUOS #5547

Closed
14 tasks done
bvizzier-ucsc opened this issue Sep 18, 2023 · 13 comments
Closed
14 tasks done

Index AnVIL dataset description from DUOS #5547

bvizzier-ucsc opened this issue Sep 18, 2023 · 13 comments
Assignees
Labels
+ [priority] High API API change affecting callers code [subject] Production code demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team enh [type] New feature or request orange [process] Done by the Azul team spike:1 [process] Spike estimate of one point

Comments

@bvizzier-ucsc
Copy link

bvizzier-ucsc commented Sep 18, 2023

We currently don't have a method to receive dataset descriptions from Terra.

There is a brief discussion of this topic with Michael Baumann in Slack's #ucsc-anvil-explorer-collab channel.

The proposed TDR API change is a new endpoint /api/repository/v1/snapshots/{id} that would return JSON that included the dataset information. They have committed to supplying the JSON structure by early next week (preferably by the end of this week).

We will ingest the description at the time of indexing.

I'm looking for an estimate of the effort to accomplish this.

[Edit: A refined plan was shared in the Sep 19 Broad/UCSC standup. Updating the description to reflect that plan.]

  • Security design review completed; the Resolution of this issue does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below
@bvizzier-ucsc bvizzier-ucsc added the orange [process] Done by the Azul team label Sep 18, 2023
@achave11-ucsc achave11-ucsc changed the title Spike for implementation of Dataset supplemental descriptions Index dataset description using TDR API Sep 19, 2023
@achave11-ucsc
Copy link
Member

@hannes-ucsc: "This is dataset, not project. For the latter see #4827. This is an unusual request because we will typically obtain metadata from a Big Query table, but here we will obtain it from TDR's REST API. The REST API has given us some grievance performance-wise in the past so I expect some complications involving retries and time-boxing of requests."

@achave11-ucsc achave11-ucsc added the spike:1 [process] Spike estimate of one point label Sep 19, 2023
@achave11-ucsc
Copy link
Member

Spike for design and estimate.

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Sep 22, 2023

My first stab at this was much more complicated but I realized that we can handle with a special bundle type, similarly to how we handle supplementary files.

We'll assume that the description is only needed in outer entities of the dataset type. This means that it won't be possible to query Azul for, say, a donor given the description of the dataset the donor is part of. That's acceptable because it is not a use case we need to support.

In the tdr_anvil repository plugin, add BundleEntityType.dataset and have Plugin._list_bundles emit a bundle for every dataset row. In Plugin._emulate_bundle, fetch the description of the dataset from a certain TDR API endpoint (the exact endpoint is TBD, the Broad tentatively mentioned /api/repository/v1/snapshots/{id} but they are not sure yet) and emit a psarse dataset entity with just the description property populated. Have the DatasetTransformer emit a contribution to the outer dataset entity. This dataset description contribution will also be sparse in that it will only have contents.datasets[0].description populated. Ensure that the aggregation of the dataset entity merges this special contribution with the organic dataset contributions in such a way that the contents.datasets[0].description property is retained in the aggregate.

Assuming the endpoint ends up being the one tentatively given to us by the Broad (see previous paragraph), there is the underlying assumption that there is only one dataset row per snapshot. This means that when fetching the dataset description, the rowid does not matter, only the SourceRef does. However, we should take measures to assert that assumption, whenever that is possible without incurring a ton of cost in terms of both code complexity and computational effort. It would be simple to ensure that the query against the anvil_dataset table returns only one row but this is complicated by the fact that we index a snapshot in multiple partitions: only one partition should contain any dataset rows, and that partition should only contain one such row. Since partitions are handled independently and concurrently, it would be difficult to detect if more than one partition contain a dataset row. To accommodate this, we will consider excluding the partition key from the query's WHERE clause so that every partition lists all anvil_dataset rows. Every partion asserts the assumption, but only one partition emits a bundle for the row, the partiton that actually contains that row. This should accommodate the cost concerns above, since it seems straight-forward to code and the computational cost is incurred predomninantly by making the query, not by whether it returns zero or one rows.

@hannes-ucsc hannes-ucsc changed the title Index dataset description using TDR API Index dataset description from TDR API Sep 22, 2023
@hannes-ucsc hannes-ucsc added the needs info [process] Resolution requires more information label Sep 22, 2023
@hannes-ucsc
Copy link
Member

We need the exact specification of the endpoint that we should use and with what arguments. If it is the endpoint tentatively mentioned on Slack: I believe we're already hitting that endpoint for a different purpose and we're currently experiencing degraded performance so I would like some assertion from the Broad that the performance issue has been addressed before we start implementing this.

@dsotirho-ucsc dsotirho-ucsc added enh [type] New feature or request code [subject] Production code labels Sep 25, 2023
@dsotirho-ucsc dsotirho-ucsc added this to the AnVIL Public Release milestone Sep 25, 2023
@bvizzier-ucsc
Copy link
Author

Here is a spreadsheet that identifies the available information.

@NoopDog Which of these fields are the priority for the Data Browser to display?

@bvizzier-ucsc
Copy link
Author

Assigning to Dave to identify the high priority fields.

@bvizzier-ucsc
Copy link
Author

@hannes-ucsc @NoopDog Please hold on this... I found out today that they are looking at an alternate path for handing off this data.

The long term plan is to hand off this data via DUOS and they think that may be available in a few weeks. They will be getting us documentation on the DUOS interface (which is under development).

Hannes, Let's discuss this.

I'm going to move this back to Parked until we have more information.

@bvizzier-ucsc bvizzier-ucsc modified the milestones: AnVIL Public Release, AnVIL Beta Release, 5547 Oct 3, 2023
@dsotirho-ucsc dsotirho-ucsc removed this from the AnVIL Beta Release milestone Oct 10, 2023
@bvizzier-ucsc
Copy link
Author

Nate provided the following information on Friday, Oct 13. Please review and followup as needed.

On October 13, 2023 at 6:48:35 AM, Nathan Calvanese wrote:

Hi Ben,

I just wanted to provide you with an update on how we expect the team will be able to collect study and dataset metadata for AnVIL snapshots in the Data Explorer, to aid you and the team in being able to scope out the work:

  1. Retrieve the DUOS Dataset Identifier from the snapshot using the retrieveSnapshot API endpoint in TDR. This is contained in the duosFirecloudGroup.duosId property in the response.
  2. Provide this DUOS Dataset Identifier to the Get Study Registration by Dataset Identifier API endpoint in DUOS to retrieve the study and dataset information associated with the snapshot.
    • The API endpoint is now available in dev, as you can see from the link above. Please keep in mind that the response will be limited to only the dataset associated with the snapshot (as opposed to all datasets for the study, which can be retrieved using a different endpoint if needed).
    • The full schema can be viewed using the Schema API endpoint in DUOS.

I am going to work with the team on getting some dummy data into dev and attached to our dev snapshots to help unblock actual development, but I'm hopeful that the above information should be enough to at least unblock any scoping efforts on your end.

Please let me know if I can answer any questions!

Thanks!
Nate

@bvizzier-ucsc
Copy link
Author

@achave11-ucsc
Copy link
Member

@hannes-ucsc to figure out next steps.

@achave11-ucsc achave11-ucsc changed the title Index dataset description from TDR API Index dataset description from Terra API Oct 16, 2023
@hannes-ucsc hannes-ucsc changed the title Index dataset description from Terra API Index dataset description from DUOS Nov 9, 2023
@hannes-ucsc hannes-ucsc changed the title Index dataset description from DUOS Index AnVIL dataset description from DUOS Nov 9, 2023
@hannes-ucsc hannes-ucsc removed the needs info [process] Resolution requires more information label Nov 15, 2023
@hannes-ucsc
Copy link
Member

hannes-ucsc commented Nov 15, 2023

For demo, show new datasets.description property in the service response to /index/datasets in anvilprod. Show absence from other /index/… endpoints.

@hannes-ucsc hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Nov 15, 2023
@nadove-ucsc nadove-ucsc added the demoed [process] Successfully demonstrated to team label Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
+ [priority] High API API change affecting callers code [subject] Production code demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team enh [type] New feature or request orange [process] Done by the Azul team spike:1 [process] Spike estimate of one point
Projects
None yet
Development

No branches or pull requests

6 participants