Index AnVIL dataset description from DUOS #5547

bvizzier-ucsc · 2023-09-18T23:44:11Z

We currently don't have a method to receive dataset descriptions from Terra.

There is a brief discussion of this topic with Michael Baumann in Slack's #ucsc-anvil-explorer-collab channel.

The proposed TDR API change is a new endpoint, /api/repository/v1/snapshots/{id}, that would return JSON that includes the dataset information. They have committed to supplying the JSON structure by early next week (preferably by the end of this week).

We will ingest the description at the time of indexing.

I'm looking for an estimate of the effort to accomplish this.

[Edit: A refined plan was shared in the Sep 19 Broad/UCSC standup. Updating the description to reflect that plan.]

achave11-ucsc · 2023-09-19T19:01:41Z

@hannes-ucsc: "This is dataset, not project. For the latter see #4827. This is an unusual request because we will typically obtain metadata from a BigQuery table, but here we will obtain it from TDR's REST API. The REST API has given us some grief performance-wise in the past so I expect some complications involving retries and time-boxing of requests."
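To make the "retries and time-boxing" concern concrete, here is a minimal sketch, not Azul's actual code. The names `call_with_retries` and `DeadlineExceeded`, the default attempt count and the backoff values are all hypothetical. The caller passes in the function that performs the request, so the sketch does not assume any particular HTTP library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar('T')


class DeadlineExceeded(Exception):
    """Raised when attempts or the overall time budget are used up."""


def call_with_retries(request: Callable[[float], T],
                      *,
                      deadline: float,
                      max_attempts: int = 3,
                      backoff: float = 0.5,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """
    Call `request(timeout)` until it succeeds, the attempts are used up, or
    the overall time budget (`deadline`, in seconds) is exhausted. Each
    attempt is given the remaining budget as its own timeout.
    """
    start = clock()
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        remaining = deadline - (clock() - start)
        if remaining <= 0:
            break
        try:
            return request(remaining)
        except Exception as e:  # a real client would catch narrower errors
            last_error = e
            if attempt + 1 < max_attempts:
                # Exponential backoff between attempts
                sleep(backoff * 2 ** attempt)
    raise DeadlineExceeded(
        f'Gave up after {max_attempts} attempts or {deadline}s'
    ) from last_error
```

The overall budget matters more than the per-request timeout here: an indexer handling many bundles concurrently should fail fast and visibly rather than let a slow endpoint stall it indefinitely.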

achave11-ucsc · 2023-09-19T19:02:26Z

Spike for design and estimate.

hannes-ucsc · 2023-09-22T17:53:27Z

My first stab at this was much more complicated, but I realized that we can handle this with a special bundle type, similar to how we handle supplementary files.

We'll assume that the description is only needed in outer entities of the dataset type. This means that it won't be possible to query Azul for, say, a donor given the description of the dataset the donor is part of. That's acceptable because it is not a use case we need to support.

In the tdr_anvil repository plugin, add BundleEntityType.dataset and have Plugin._list_bundles emit a bundle for every dataset row. In Plugin._emulate_bundle, fetch the description of the dataset from a certain TDR API endpoint (the exact endpoint is TBD; the Broad tentatively mentioned /api/repository/v1/snapshots/{id} but they are not sure yet) and emit a sparse dataset entity with just the description property populated. Have the DatasetTransformer emit a contribution to the outer dataset entity. This dataset description contribution will also be sparse in that it will only have contents.datasets[0].description populated. Ensure that the aggregation of the dataset entity merges this special contribution with the organic dataset contributions in such a way that the contents.datasets[0].description property is retained in the aggregate.
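The merge requirement in the last sentence can be illustrated with a small, self-contained sketch. This is not Azul's aggregator; `aggregate_dataset` and the flat dictionary shape of a contribution are hypothetical simplifications. The rule it uses is an assumption: for each property, keep the first non-None value seen across contributions.

```python
from typing import Any, Mapping, Sequence


def aggregate_dataset(contributions: Sequence[Mapping[str, Any]]) -> dict[str, Any]:
    """
    Merge contributions to one dataset entity. A property that is missing or
    None in one contribution is filled in from another, so a sparse
    contribution carrying only `description` survives the merge regardless
    of the order in which contributions arrive.
    """
    aggregate: dict[str, Any] = {}
    for contribution in contributions:
        for key, value in contribution.items():
            if value is not None and aggregate.get(key) is None:
                aggregate[key] = value
    return aggregate


# Hypothetical contributions: one organic, one sparse
organic = {'dataset_id': 'd1', 'title': 'Example dataset', 'description': None}
sparse = {'dataset_id': 'd1', 'description': 'Free-text description'}
merged = aggregate_dataset([organic, sparse])
```

The point of the sketch is only that the aggregation must not let an organic contribution's empty description overwrite the populated one from the sparse contribution.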

Assuming the endpoint ends up being the one tentatively given to us by the Broad (see previous paragraph), there is the underlying assumption that there is only one dataset row per snapshot. This means that when fetching the dataset description, the row ID does not matter, only the SourceRef does. However, we should take measures to assert that assumption whenever that is possible without incurring a ton of cost in terms of both code complexity and computational effort. It would be simple to ensure that the query against the anvil_dataset table returns only one row, but this is complicated by the fact that we index a snapshot in multiple partitions: only one partition should contain any dataset rows, and that partition should only contain one such row. Since partitions are handled independently and concurrently, it would be difficult to detect if more than one partition contains a dataset row. To accommodate this, we will consider excluding the partition key from the query's WHERE clause so that every partition lists all anvil_dataset rows. Every partition asserts the assumption, but only one partition emits a bundle for the row: the partition that actually contains that row. This should address the cost concerns above, since it seems straightforward to code and the computational cost is incurred predominantly by making the query, not by whether it returns zero or one rows.
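The partition scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: `dataset_rows_to_emit` is a hypothetical helper, and it assumes a partition is identified by a prefix of the row ID. The input is the full list of anvil_dataset row IDs, that is, the result of the query without the partition key in its WHERE clause.

```python
from typing import Sequence


def dataset_rows_to_emit(all_dataset_row_ids: Sequence[str],
                         partition_prefix: str) -> list[str]:
    """
    Every partition sees all dataset rows and checks that there is exactly
    one. Only the partition whose prefix matches that row's ID emits a
    bundle for it; every other partition emits nothing.
    """
    if len(all_dataset_row_ids) != 1:
        raise ValueError(
            f'Expected exactly one dataset row per snapshot, '
            f'got {len(all_dataset_row_ids)}'
        )
    return [
        row_id
        for row_id in all_dataset_row_ids
        if row_id.startswith(partition_prefix)
    ]
```

With a single row whose ID starts with `a3`, the partition for prefix `a` returns that row and the partition for prefix `b` returns an empty list, while both perform the same check on the row count.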

hannes-ucsc · 2023-09-22T18:25:50Z

We need the exact specification of the endpoint that we should use and with what arguments. If it is the endpoint tentatively mentioned on Slack, I believe we're already hitting that endpoint for a different purpose, and we're currently experiencing degraded performance, so I would like some assurance from the Broad that the performance issue has been addressed before we start implementing this.

hannes-ucsc · 2023-09-22T18:28:51Z

https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1695407307731049?thread_ts=1694792957.916559&cid=C03TPJS54DC

bvizzier-ucsc · 2023-10-03T03:25:39Z

Here is a spreadsheet that identifies the available information.

@NoopDog Which of these fields are the priority for the Data Browser to display?

bvizzier-ucsc · 2023-10-03T03:28:13Z

Assigning to Dave to identify the high priority fields.

bvizzier-ucsc · 2023-10-03T16:59:36Z

@hannes-ucsc @NoopDog Please hold on this... I found out today that they are looking at an alternate path for handing off this data.

The long-term plan is to hand off this data via DUOS, and they think that may be available in a few weeks. They will be getting us documentation on the DUOS interface (which is under development).

Hannes, let's discuss this.

I'm going to move this back to Parked until we have more information.

bvizzier-ucsc · 2023-10-13T20:18:58Z

Nate provided the following information on Friday, Oct 13. Please review and follow up as needed.

On October 13, 2023 at 6:48:35 AM, Nathan Calvanese wrote:

Hi Ben,

I just wanted to provide you with an update on how we expect the team will be able to collect study and dataset metadata for AnVIL snapshots in the Data Explorer, to aid you and the team in being able to scope out the work:

Retrieve the DUOS Dataset Identifier from the snapshot using the retrieveSnapshot API endpoint in TDR. This is contained in the duosFirecloudGroup.duosId property in the response.

Provide this DUOS Dataset Identifier to the Get Study Registration by Dataset Identifier API endpoint in DUOS to retrieve the study and dataset information associated with the snapshot.

The API endpoint is now available in dev, as you can see from the link above. Please keep in mind that the response will be limited to only the dataset associated with the snapshot (as opposed to all datasets for the study, which can be retrieved using a different endpoint if needed).

The full schema can be viewed using the Schema API endpoint in DUOS.

I am going to work with the team on getting some dummy data into dev and attached to our dev snapshots to help unblock actual development, but I'm hopeful that the above information should be enough to at least unblock any scoping efforts on your end.

Please let me know if I can answer any questions!

Thanks!
Nate
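The two-step lookup described in the email can be sketched as below. This is a minimal illustration, not the implementation that was merged. The snapshot path and the `duosFirecloudGroup.duosId` property come from this thread; the function names, the `get_json` callable and the `duos_url_template` parameter are hypothetical. The path of the DUOS endpoint is not stated in this thread, so the caller must supply it.

```python
from typing import Any, Callable, Mapping, Optional

JSON = Mapping[str, Any]


def duos_id_from_snapshot(snapshot: JSON) -> Optional[str]:
    """
    Extract duosFirecloudGroup.duosId from a TDR snapshot response,
    returning None if the snapshot is not linked to a DUOS dataset.
    """
    group = snapshot.get('duosFirecloudGroup') or {}
    return group.get('duosId')


def fetch_dataset_registration(snapshot_id: str,
                               *,
                               get_json: Callable[[str], JSON],
                               tdr_url: str,
                               duos_url_template: str) -> Optional[JSON]:
    """
    Step 1: retrieve the snapshot from TDR and read its DUOS identifier.
    Step 2: pass that identifier to DUOS to retrieve the registration.

    `get_json` performs an authenticated GET and returns the parsed body.
    `duos_url_template` must contain a `{duos_id}` placeholder.
    """
    snapshot = get_json(f'{tdr_url}/api/repository/v1/snapshots/{snapshot_id}')
    duos_id = duos_id_from_snapshot(snapshot)
    if duos_id is None:
        return None
    return get_json(duos_url_template.format(duos_id=duos_id))
```

Injecting `get_json` keeps the sketch independent of any HTTP client and lets the retry and time-boxing policy discussed earlier in the thread live in one place.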

bvizzier-ucsc · 2023-10-16T18:33:43Z

Nate also posted the above in Slack.

achave11-ucsc · 2023-10-16T19:10:36Z

@hannes-ucsc to figure out next steps.

hannes-ucsc · 2023-11-15T02:31:28Z

For demo, show new datasets.description property in the service response to /index/datasets in anvilprod. Show absence from other /index/… endpoints.

bvizzier-ucsc added the orange [process] Done by the Azul team label Sep 18, 2023

achave11-ucsc changed the title ~~Spike for implementation of Dataset supplemental descriptions~~ Index dataset description using TDR API Sep 19, 2023

achave11-ucsc assigned hannes-ucsc Sep 19, 2023

achave11-ucsc added the spike:1 [process] Spike estimate of one point label Sep 19, 2023

hannes-ucsc changed the title ~~Index dataset description using TDR API~~ Index dataset description from TDR API Sep 22, 2023

hannes-ucsc added the needs info [process] Resolution requires more information label Sep 22, 2023

dsotirho-ucsc added enh [type] New feature or request code [subject] Production code labels Sep 25, 2023

dsotirho-ucsc added this to the AnVIL Public Release milestone Sep 25, 2023

bvizzier-ucsc assigned NoopDog Oct 3, 2023

bvizzier-ucsc modified the milestones: AnVIL Public Release, AnVIL Beta Release, 5547 Oct 3, 2023

dsotirho-ucsc removed this from the AnVIL Beta Release milestone Oct 10, 2023

achave11-ucsc unassigned NoopDog Oct 16, 2023

achave11-ucsc changed the title ~~Index dataset description from TDR API~~ Index dataset description from Terra API Oct 16, 2023

nadove-ucsc added a commit that referenced this issue Oct 27, 2023

[a r] Index dataset description from Terra API (#5547)

674ed45

nadove-ucsc added a commit that referenced this issue Oct 27, 2023

[a r] Index dataset description from Terra API (#5547)

85024ca

nadove-ucsc added a commit that referenced this issue Oct 27, 2023

[a r] Index dataset description from Terra API (#5547)

eb0903e

nadove-ucsc added a commit that referenced this issue Oct 27, 2023

[a r] Index dataset description from Terra API (#5547)

8529308

nadove-ucsc added a commit that referenced this issue Oct 30, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

1b1c403

nadove-ucsc added a commit that referenced this issue Nov 1, 2023

[a r] Index dataset description from Terra API (#5547)

49ef44c

nadove-ucsc added a commit that referenced this issue Nov 1, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

dad6c73

nadove-ucsc added a commit that referenced this issue Nov 1, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

c0f92b9

nadove-ucsc added a commit that referenced this issue Nov 7, 2023

[a r] Index dataset description from Terra API (#5547)

09aa312

nadove-ucsc added a commit that referenced this issue Nov 7, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

3e57af1

nadove-ucsc added a commit that referenced this issue Nov 7, 2023

[a r] Index dataset description from Terra API (#5547)

7834873

nadove-ucsc added a commit that referenced this issue Nov 7, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

8de3bfb

nadove-ucsc added a commit that referenced this issue Nov 8, 2023

[a r] Index dataset description from Terra API (#5547)

ece7ac8

nadove-ucsc added a commit that referenced this issue Nov 8, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

566d725

nadove-ucsc added a commit that referenced this issue Nov 8, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

049efb0

nadove-ucsc added a commit that referenced this issue Nov 8, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

57fb8d5

nadove-ucsc added a commit that referenced this issue Nov 8, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

24fe985

nadove-ucsc added a commit that referenced this issue Nov 9, 2023

[a r] Index dataset description from Terra API (#5547)

f438852

nadove-ucsc added a commit that referenced this issue Nov 9, 2023

fixup! [a r] Index dataset description from Terra API (#5547)

6d19727

hannes-ucsc changed the title ~~Index dataset description from Terra API~~ Index dataset description from DUOS Nov 9, 2023

hannes-ucsc changed the title ~~Index dataset description from DUOS~~ Index AnVIL dataset description from DUOS Nov 9, 2023

nadove-ucsc added a commit that referenced this issue Nov 14, 2023

[a r] Index AnVIL dataset description from DUOS (#5547)

2296f29

nadove-ucsc added a commit that referenced this issue Nov 15, 2023

[a r] Index AnVIL dataset description from DUOS (#5547)

c30b221

hannes-ucsc removed the needs info [process] Resolution requires more information label Nov 15, 2023

hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Nov 15, 2023

achave11-ucsc pushed a commit that referenced this issue Nov 15, 2023

[a r] Index AnVIL dataset description from DUOS (#5547)

2f1842e

achave11-ucsc added a commit that referenced this issue Nov 15, 2023

[r a] Index AnVIL dataset description from DUOS (#5547, PR #5649)

dcb0992

nadove-ucsc added the demoed [process] Successfully demonstrated to team label Nov 21, 2023

hannes-ucsc closed this as completed Mar 26, 2024
