Lambda function based orchestration for horizontal scaling of aggregations #1
base: main
Conversation
By far the biggest hassle, and the place with the most room for improvement, has been both sides of the spatial subsetting. By both, I mean:
I've been using morton indexing, which makes the second problem slightly easier at the cost of making the first problem more annoying. The mortie library that I maintain is built on top of healpix, which is wrapped via healpy, and the entire stack has pretty lackluster support for converting complex polygons into coverage maps of healpix indices. Here are some of the (mostly failed) attempts at solving this:
... and many more that are equally frustrating. Currently, the way that we solve this is a pretty reliable but janky solution that does the following:
You could do it for the entire Antarctic polygon, and it would massively overestimate the cell indices that you want, but doing it per basin and merging gets relatively close to proper coverage with a slight over-provision. Determining which cells have data from the NASA CMR is also a hassle. Bounding boxes are useless in the polar regions, so instead we try to use the simplified geometries as query objects. It's relatively easy to get to a '90%' solution: we take the point / line geometries from the NASA CMR, calculate the morton indices for the vertices in those geometries, and use that to provision our worker nodes. It almost works beautifully:
Compare the above output to what's in the description of this PR. We note a few things:
The wall-clock time difference is due to the extra points we need to process... but the majority of the cost difference comes from the extra ~600 cells we check to make sure we aren't missing any data. The reason we're missing the 12 cells in the first place is that the geometries in the NASA CMR are simplified, and we tend to miss the edges of some cells near the pole hole. We fix this by densifying the geometries... which at times means crossing the pole hole and adding in many, many extra cells. But we don't want to miss any data (certainly not 200 million observations). A quick intermediate fix here is probably calling a buffer operation... but it would be nice to have an exact solution rather than a marginally improved heuristic, if possible.
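For illustration, here is a rough sketch of what both halves could look like, with healpy's nested pixel ids standing in for mortie's morton indices. Every name below is a placeholder rather than the actual pipeline, each basin is reduced to a convex hull before querying (one plausible source of the slight over-provision), buffering is done naively in degrees, and longitude wrap-around at the pole is ignored:

```python
# Sketch only: healpy nested ids as a stand-in for mortie's morton indices;
# all function and variable names here are illustrative, not the real pipeline.
import numpy as np
import healpy as hp
import shapely


def granule_cells(lon, lat, nside):
    """The '90%' solution: index the vertices of CMR point/line geometries."""
    return np.unique(hp.ang2pix(nside, lon, lat, nest=True, lonlat=True))


def basin_coverage(basin_polygons, nside, buffer_deg=0.1, seg_deg=0.25):
    """Per-basin coverage map, merged; densify + buffer to catch edge cells."""
    cells = set()
    for poly in basin_polygons:
        dense = shapely.segmentize(poly, max_segment_length=seg_deg)
        padded = dense.buffer(buffer_deg)        # crude: degrees, not metres
        hull = padded.convex_hull                # query_polygon wants convex input
        lonlat = np.asarray(hull.exterior.coords)[:-1]  # drop repeated closing vertex
        vec = hp.ang2vec(lonlat[:, 0], lonlat[:, 1], lonlat=True)
        cells.update(hp.query_polygon(nside, vec, inclusive=True, nest=True))
    return np.array(sorted(cells))
```

The `inclusive=True` flag already pads the result by returning any cell that overlaps the polygon (not just cells whose centers fall inside), so the explicit densify/buffer step is only there to catch the simplified-geometry edge cases near the pole hole.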
- Use Docker AL2023 container for builds
- Add git-lfs for layer zip files
- Pin numpy<2.3 to avoid Lambda issues
- Patch astropy/__init__.py to remove pytest dependency
- Remove boto3/botocore from layer (Lambda provides)
- Disable ARM64 build (healpy compilation issues)
- Include working v14 layer (73MB, matches AWS)
Todo:
Notes: This lambda layer from commit 522185c is what you want to use to reproduce; the more recent one is still being debugged and isn't known to work correctly yet. Our infrastructure for the cloud deployment is here; the Terraform / OpenTofu template is mostly for setting up JupyterHub, but the most recent commits also patch in our Lambda permissions. For now, they're specific to the prototype, but we'll update them to allow arbitrary Lambda function calls from the notebook environment in the future.
The HEAD version of the lambda zip is working now; the earlier issue was either from bumping h5coro 0.8.0 to 1.0.3, or a regression in pandas from version 2.3.2 to 2.3.3.
h5coro version 1.0.3 is working now; older versions (i.e., 0.8.0) didn't have an explicit
arm64 drops the cost and time a bit. With a better-working spatial metadata subset, we can expect close to 6 minutes and under $2. Output (and notebook) updated to the public bucket.
@espg, I'm looking at the optimizations we discussed at AGU now (sorry for the delay). Is the granule catalog you've used "granule_catalog_cycle22_order6_stereo.json" the output from one of the
sorry, please ignore this question. I found the source in |

Rough attempt at making an embarrassingly parallel version of what demo_s3_xdggs.ipynb does. Lots of room for improvement.
What currently works: scaling to ~2000 workers. Generally, I set this slightly lower (1,700) for two reasons:
In order to minimize wall-clock compute time, I sort the runs so we start with the furthest-south cells first, which, due to orbital convergence, tend to have the most observations (or zero observations, if we're trying to calculate a cell within the pole hole).
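As an illustration of how that south-first ordering and the per-cell fan-out might be wired together with boto3 (the function name, payload fields, and thread-pool dispatch below are placeholders, not the actual orchestration code):

```python
# Sketch: order cells south-first, then fire one async Lambda invocation per cell.
# "aggregate-cell" and the payload shape are hypothetical, not the real interface.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3
import healpy as hp


def south_first(cells, nside):
    """Sort nested healpix cell ids by centre latitude, southernmost first."""
    _, lat = hp.pix2ang(nside, cells, nest=True, lonlat=True)
    return [cell for _, cell in sorted(zip(lat, cells))]


def fan_out(cells, nside, function_name="aggregate-cell"):
    """Async fan-out; the account/function concurrency limit caps parallelism."""
    client = boto3.client("lambda")

    def invoke(cell):
        return client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # async: Lambda queues beyond the concurrency cap
            Payload=json.dumps({"cell": int(cell), "nside": nside}),
        )

    # Threads only speed up request dispatch; the workers themselves run in Lambda.
    with ThreadPoolExecutor(max_workers=64) as pool:
        list(pool.map(invoke, south_first(cells, nside)))
```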
The default AWS cap on concurrent Lambda functions is 1,000, but they'll immediately approve 2,000 on request:
Here's the TL;DR of what we get out of this pipeline:
That's 1.37 billion observations aggregated to a grid in a little under 8 minutes, for about $3. I would love to see it happen for $2 and take under 5 minutes.
Here are the relevant speed optimizations that make this work:
Number 5 above means 1 vCPU with 2048 MB of RAM. For Lambda, AWS charges by the GB-second, where GB refers to the amount of RAM you provision for the process; vCPUs scale proportionally to that RAM, at roughly 1 vCPU per 2 GB. Our horizontal processes are pretty light on RAM (under 768 MB), so it seems like dropping the memory allocation would be a way to make things cheaper. However, dropping memory also drops your CPU allocation, and the scaling on a fraction of a vCPU versus a fully dedicated vCPU is abysmal... so much so that runs take well over twice as long with 0.5 vCPUs as with 1 vCPU, which ends up both more expensive and slower in wall-clock time.
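To make the GB-second arithmetic concrete, here's a back-of-the-envelope cost sketch. The per-GB-second and per-request prices are the published us-east-1 list rates at the time of writing (check your region), and the 60-second average duration is purely illustrative, chosen to land near the ~$3 figure above:

```python
# Back-of-the-envelope Lambda cost: GB-seconds * price, plus the per-request charge.
def lambda_cost(n_invocations, avg_seconds, memory_mb, arm64=False):
    gb_seconds = n_invocations * avg_seconds * (memory_mb / 1024)
    per_gb_second = 0.0000133334 if arm64 else 0.0000166667  # us-east-1 list price
    per_request = 0.20 / 1_000_000
    return gb_seconds * per_gb_second + n_invocations * per_request


# e.g. ~1,700 workers averaging ~60 s each at 2048 MB:
print(f"x86:   ${lambda_cost(1700, 60, 2048):.2f}")              # ≈ $3.40
print(f"arm64: ${lambda_cost(1700, 60, 2048, arm64=True):.2f}")  # ≈ $2.72
```

This also shows why halving memory doesn't pay off: at 1024 MB the GB-second rate halves, but if the run takes more than twice as long on the fractional vCPU, the total GB-seconds (and the wall clock) both go up.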
I suspect that this may get re-implemented using DevSeed's recommendation of cubed at some point; the only reason it isn't used now is that I had already started building the lambda layers in AWS and wanted to benchmark there prior to switching to a new library.