Create JSONL-based verbatim manifest format #6028

Closed
nadove-ucsc opened this issue Mar 11, 2024 · 4 comments
Labels
+ [priority] High API API change affecting callers demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team enh [type] New feature or request manifests [subject] Generation and contents of manifests orange [process] Done by the Azul team

Comments

@nadove-ucsc
Contributor

nadove-ucsc commented Mar 11, 2024

#2693 (comment)

This is intended as an intermediate step before implementing a full handover using the PFB format.

nadove-ucsc added the orange [process] Done by the Azul team label Mar 11, 2024
nadove-ucsc added enh [type] New feature or request API API change affecting callers manifests [subject] Generation and contents of manifests labels Mar 11, 2024
dsotirho-ucsc added the + [priority] High label Mar 12, 2024
hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Mar 25, 2024
@hannes-ucsc
Member

hannes-ucsc commented Mar 25, 2024

For the demo, generate a largish unfiltered JSONL manifest in anvilprod and, once the PR lands there, in prod. Show that every replica is listed exactly once. Show that the numbers of file and project/dataset replicas match the counts reported by the summary endpoint.
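
The "listed at most once" half of that check could be sketched as follows. This is only an illustration: it assumes each line of the manifest is a single JSON object, and it identifies a replica by a digest of its canonical JSON form, since the actual replica schema is not described in this issue. The function names and the file name `manifest.jsonl` are hypothetical.

```python
import hashlib
import json
from collections import Counter
from typing import Iterable


def replica_key(line: str) -> str:
    """Return a digest of the canonical JSON form of one manifest line."""
    replica = json.loads(line)
    # Sorting keys makes the digest independent of key order in the line.
    canonical = json.dumps(replica, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode()).hexdigest()


def duplicate_replicas(lines: Iterable[str]) -> dict[str, int]:
    """Map each digest that occurs more than once to its occurrence count."""
    counts = Counter(replica_key(line) for line in lines if line.strip())
    return {key: n for key, n in counts.items() if n > 1}
```

A possible use would be `with open('manifest.jsonl') as f: assert not duplicate_replicas(f)`. Showing that no replica is missing still requires comparing the number of lines against the counts from the summary endpoint.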

In anvilprod, generate a small JSONL manifest with filters matching files from both managed-access (MA) and public sources. Generate that manifest once while providing the access token of a user who has access to all of those sources, and again without providing an access token. Compare the two manifests and show that the one generated anonymously does not contain entities from managed-access datasets.
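
One way to carry out that comparison is a set difference over the lines of the two manifests. The sketch below again assumes one JSON object per line and compares lines by their canonical JSON form; the function names are hypothetical and nothing here reflects the actual replica schema.

```python
import json
from typing import Iterable


def canonical_lines(lines: Iterable[str]) -> set[str]:
    """Return the set of non-empty manifest lines in canonical JSON form."""
    return {
        json.dumps(json.loads(line), sort_keys=True)
        for line in lines
        if line.strip()
    }


def compare_manifests(
    authenticated: Iterable[str],
    anonymous: Iterable[str],
) -> tuple[set[str], set[str]]:
    """Return (only in authenticated, only in anonymous)."""
    auth = canonical_lines(authenticated)
    anon = canonical_lines(anonymous)
    return auth - anon, anon - auth
```

For the demo, the second set would be expected to be empty, and every entry in the first set would be expected to belong to a managed-access dataset. Confirming the latter requires inspecting the entries themselves.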

@nadove-ucsc
Contributor Author

nadove-ucsc commented Mar 26, 2024

Generating an unfiltered manifest of the anvil5 catalog fails because the Lambda function times out after 30 s while computing the content hash. For scale, anvil5 has 981,284 bundles, whereas dcp35 has only 206,132.

@nadove-ucsc
Contributor Author

nadove-ucsc commented Mar 26, 2024

The largest manifest I was able to generate for anvil5 took about 5 minutes, contained 1,202,647 replicas, and was 507 MB. The Lambda function's ephemeral storage is 512 MB, and larger manifests failed with the step-function error message: "OSError: [Errno 28] No space left on device".
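
For a rough sense of scale, the figures above give an average replica size and, from that, an estimate of how many replicas fit into a given amount of storage. This is back-of-the-envelope arithmetic only: it assumes "MB" means 10^6 bytes and that the average size carries over to other manifests, which it may not, since replica sizes presumably vary by entity type. The function names are hypothetical.

```python
# Figures reported in the comment above.
MANIFEST_BYTES = 507 * 10**6  # assumption: "507 MB" means 507 * 10^6 bytes
REPLICA_COUNT = 1_202_647


def average_replica_size(manifest_bytes: int, replica_count: int) -> float:
    """Average number of manifest bytes per replica."""
    return manifest_bytes / replica_count


def replicas_that_fit(storage_bytes: int, avg_replica_bytes: float) -> int:
    """Estimate how many average-sized replicas fit into the given storage."""
    return int(storage_bytes // avg_replica_bytes)


avg = average_replica_size(MANIFEST_BYTES, REPLICA_COUNT)
print(f'average replica size: {avg:.0f} bytes')
print(f'replicas fitting in 10^9 bytes: {replicas_that_fit(10**9, avg):,}')
```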

@nadove-ucsc
Contributor Author

With the manifest content hash replaced with a hardcoded value and ephemeral storage increased to 1 GB, I ran into a new limit when trying to create an unfiltered manifest: the Lambda function timed out after 15 minutes, which is the maximum timeout AWS Lambda allows.

hannes-ucsc mentioned this issue Mar 28, 2024
14 tasks