Can't generate verbatim manifest without filters in AnVIL #6108

Open
achave11-ucsc opened this issue Mar 28, 2024 · 2 comments
Assignees
Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
manifests [subject] Generation and contents of manifests
needs design [process] Solution to issue has yet to be devised
orange [process] Done by the Azul team

Comments

@achave11-ucsc
Member

achave11-ucsc commented Mar 28, 2024

#6108 (comment)

@github-actions github-actions bot added the orange [process] Done by the Azul team label Mar 28, 2024
@achave11-ucsc achave11-ucsc changed the title from "Can't generate verbatim manifest for catalog anvil5" to "Can't generate verbatim manifest without filters in AnVIL" Mar 28, 2024
@dsotirho-ucsc
Contributor

Assignee to consider next steps.

@hannes-ucsc
Member

The generation simply times out.

Our standard approach to this problem is to partition the manifest as we do for the compact manifest, but the verbatim manifest formats complicate this because we have to ensure that each replica is written only once. The AvroPFB verbatim manifest further complicates this with the forward-only reference constraint that Terra imposes. We are currently working around that constraint by not exposing relations/links in the AvroPFB schema we generate.

The naive approach to ensuring uniqueness of replicas in the generated manifest is to use a set of already emitted replicas. Since each partition in a partitioned manifest runs in a different Lambda invocation and potentially a different execution context, we would have to persist that set between invocations. My original design already describes optimizations to reduce the size of the set (not tracking replicas with just one hub, tracking hubs instead of replicas). A smaller set is obviously faster to read and write.
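
To make the naive approach concrete, here is a minimal Python sketch. It is not Azul code: the `Replica` class, the `emit_partition` function and the JSON-encoded state string are all hypothetical, and it assumes each replica carries the IDs of its hubs. It shows the set of emitted replicas being handed from one partition to the next, plus the first optimization (replicas with just one hub are never tracked, on the assumption that they can only be encountered once).

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Replica:
    # Hypothetical stand-in for a replica document; not Azul's actual type.
    replica_id: str
    hub_ids: tuple[str, ...]


def emit_partition(replicas: Iterable[Replica],
                   state: str) -> tuple[list[Replica], str]:
    """
    Emit the replicas of one partition, skipping those already emitted by
    an earlier partition. `state` is the JSON-encoded set of tracked
    replica IDs that would have to be persisted between invocations.
    """
    seen = set(json.loads(state))
    emitted = []
    for replica in replicas:
        # Assumption: a replica with at most one hub is only ever reached
        # once, so it does not need to be tracked at all.
        if len(replica.hub_ids) > 1:
            if replica.replica_id in seen:
                continue
            seen.add(replica.replica_id)
        emitted.append(replica)
    # Serialize deterministically so the persisted state is reproducible.
    return emitted, json.dumps(sorted(seen))


a = Replica('a', ('h1',))
b = Replica('b', ('h1', 'h2'))
c = Replica('c', ('h2',))

# Two partitions, as if run by two separate invocations.
first, state = emit_partition([a, b], '[]')
second, state = emit_partition([b, c], state)
```

In this sketch only `b` ends up in the persisted state, because it is the only replica with more than one hub. Where and how that state would actually be stored between invocations is left open here.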

I don't have a good solution at this time. Interestingly, the verbatim JSONL manifest for HCA does not time out (that is without links; we don't know whether adding links would break that). I don't think there is much of a use case for an all-inclusive, unfiltered AvroPFB manifest. The purpose of that manifest is exporting it to Terra. The resulting Terra workspace would be huge for both AnVIL and HCA.

@hannes-ucsc hannes-ucsc added enh [type] New feature or request manifests [subject] Generation and contents of manifests bug [type] A defect preventing use of the system as specified - [priority] Medium and removed enh [type] New feature or request labels Apr 4, 2024
@hannes-ucsc hannes-ucsc added the needs design [process] Solution to issue has yet to be devised label Sep 4, 2024
3 participants