Estimate recipe size #136

Open
rabernat opened this issue May 13, 2021 · 1 comment

Comments

@rabernat
Contributor

It would be very useful to get an estimate of the total size of the target dataset produced by a recipe in GB / TB. For example, this information could be used by bakery managers to decide whether to accept a dataset into their storage.

Here are some different ways we could do this without actually running the whole recipe.

  1. Create a test version of the recipe (see Testing / profiling recipes #97) and examine the total size of the test target. Scale up based on the "pruning factor" (what fraction of the full data did the test dataset pull).
  2. Go through each file in the recipe's FilePattern, inspect its size, and sum the sizes to get an estimate. This only works for static file inputs (not APIs like OPeNDAP), and it may not accurately reflect the target size if the recipe does a lot of processing.
  3. Randomly sample files from the FilePattern and scale up.
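
To make the arithmetic behind the three options concrete, here is a minimal sketch. The function names, the `get_size` callable, and the idea of passing input paths as a plain list are all hypothetical stand-ins for illustration; none of this is existing pangeo-forge-recipes API.

```python
import random
from typing import Callable, Optional, Sequence


def scale_from_test_target(test_target_bytes: int, pruning_factor: float) -> float:
    """Option 1: scale the size of a pruned test target up to the full recipe.

    `pruning_factor` is the fraction of the full data the test dataset pulled.
    """
    if not 0 < pruning_factor <= 1:
        raise ValueError("pruning_factor must be in (0, 1]")
    return test_target_bytes / pruning_factor


def sum_input_sizes(paths: Sequence[str], get_size: Callable[[str], int]) -> int:
    """Option 2: inspect every input file and sum the sizes."""
    return sum(get_size(p) for p in paths)


def estimate_from_sample(
    paths: Sequence[str],
    get_size: Callable[[str], int],
    n_samples: int,
    seed: Optional[int] = None,
) -> float:
    """Option 3: inspect a random sample of inputs and scale up by the input count."""
    if not paths:
        return 0.0
    n_samples = max(1, min(n_samples, len(paths)))
    rng = random.Random(seed)
    sample = rng.sample(list(paths), n_samples)
    mean_size = sum(get_size(p) for p in sample) / n_samples
    return mean_size * len(paths)
```

Options 2 and 3 estimate the size of the inputs, not the target, so they inherit the caveat above about recipes that do a lot of processing. In practice `get_size` could wrap something like `os.path.getsize` for local files; that choice is also an assumption.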
@cisaacstern
Member

> 1. Create a test version of the recipe (see Testing / profiling recipes #97) and examine the total size of the test target. Scale up based on the "pruning factor" (what fraction of the full data did the test dataset pull).

Are there known reasons why this is not the obvious best direction to pursue? It seems to dovetail nicely with other objectives, and should be relatively accurate, assuming the as-yet-unimplemented prune method referenced in pangeo-forge/staged-recipes#28 (comment) is "prune factor"-aware.
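
As a rough illustration of what "prune factor"-aware could mean, here is a sketch of a helper that returns the pruned inputs together with the fraction it kept. The name `prune_inputs`, its `nkeep` parameter, and its return shape are hypothetical; the actual prune method is not implemented yet and may look quite different.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def prune_inputs(inputs: Sequence[T], nkeep: int = 2) -> Tuple[List[T], float]:
    """Keep the first `nkeep` inputs and report the fraction of inputs kept.

    Assumption: inputs are roughly equal in size, so the fraction of inputs
    kept approximates the fraction of the full data pulled.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    if nkeep < 1:
        raise ValueError("nkeep must be at least 1")
    kept = list(inputs[:nkeep])
    return kept, len(kept) / len(inputs)
```

With the factor in hand, the full-size estimate is just the test target size divided by it. If input sizes vary a lot, the factor would need to be weighted by bytes instead of file counts.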
