
Conversation

espg (Contributor) commented Dec 10, 2025

Rough attempt at making an embarrassingly parallel version of what demo_s3_xdggs.ipynb does. Lots of room for improvement.

What currently works: scaling to ~2000 workers. Generally, I set this slightly lower (1,700) for two reasons:

  1. The wall time is dominated by the longest-running process, so increasing workers past a certain point doesn't actually speed anything up.
  2. We start to run into scaling errors around 2000+ workers (SSL handshake errors with the dispatcher and other hard-to-fix issues), so it's good to stay slightly under that limit.

To minimize wall-clock time, I sort the runs so that the southernmost cells go first; due to orbital convergence they tend to have the most observations (or zero observations, if the cell sits inside the pole hole).
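
For reference, that ordering is just a sort on cell latitude before dispatch. A minimal sketch, assuming the catalog entries carry a cell-center latitude (the `lat` field and `invoke_worker` call are hypothetical placeholders):

```python
import json

# Hypothetical catalog layout: {"cells": [{"cell_id": ..., "lat": ..., "granules": [...]}, ...]}
with open("granule_catalog_cycle22_order6_stereo.json") as f:
    catalog = json.load(f)

# Southernmost cells first: orbital convergence means they usually hold the
# most observations, so launching them earliest keeps the slowest jobs from
# defining the tail of the wall-clock time.
cells = sorted(catalog["cells"], key=lambda c: c["lat"])

for cell in cells:
    invoke_worker(cell)  # hypothetical async Lambda invocation
```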

The default AWS cap on concurrent Lambda functions is 1,000, but they'll immediately approve 2,000 on request:

> aws service-quotas request-service-quota-increase \
      --service-code lambda \
      --quota-code L-B99A9384 \
      --desired-value 2000 \
      --region us-west-2

Here's the TL;DR of what we get out of this pipeline:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/5] Loading granule catalog from granule_catalog_cycle22_order6_stereo.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 2395
      Total granules: 4153
      Processing 2395 cells

[2/5] Authenticating with NASA Earthdata...
      Credentials expire: 2025-12-10 05:46:36+00:00

[3/5] Invoking 2395 Lambda functions (max 1700 concurrent)...
      Output: s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
[4/5] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 89,229.4s (24.79 hours)
      Memory: 2048MB (2.0GB)
      GB-seconds: 178,458.8
      Cost: $2.9743

[5/5] Summary
======================================================================
      Total cells:          2395
      With data:            1739
      Empty (no granules):  0
      Empty (filtered):     656
      Errors:               0
      Total observations:   1,370,185,727
----------------------------------------------------------------------
      Wall clock time:      463.2s (7.7m)
      Lambda compute time:  89,229.4s (1487.2m)
      Throughput:           5.2 cells/sec
      Estimated cost:       $2.9743
----------------------------------------------------------------------
      Output location:      s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
======================================================================

That's 1.37 billion observations aggregated to a grid in a little under 8 minutes for about $3. I'd love to see it happen for $2 and take under 5 minutes.

Here are the relevant speed optimizations that make this work:

  1. Catalog-based query. We hit the NASA CMR once to download granule metadata for a time range (one orbital cycle above, 92 days) and a large spatial area (everything south of -60 latitude), and then query that catalog locally to split work up across workers (see the sketch after this list). Way faster (which makes it cheaper), with less to debug.
  2. h5coro hyperslices to do spatial subsetting in place for the ICESat-2 data
  3. Aggressive metadata culling via item number 1 above... this is also where there's much more room for improvement (see comment below)
  4. Serverless horizontal scaling with Lambda orchestration (possibly switching to cubed in the future)
  5. Right-sized worker processes
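
To illustrate item 1, the one-shot CMR hit can be a single paginated granule search over the cycle and the polar region, cached to disk and queried locally afterwards. A rough sketch against the public CMR search API; the date window and output filename are placeholders, and the real catalog builder (lambda/built_granule_catalog.py) may filter differently:

```python
import json
import requests

CMR_URL = "https://cmr.earthdata.nasa.gov/search/granules.json"

params = {
    "short_name": "ATL06",
    # placeholder ~92-day window covering one orbital cycle
    "temporal": "2024-01-01T00:00:00Z,2024-04-01T23:59:59Z",
    # coarse spatial filter: everything south of -60 latitude
    "bounding_box": "-180,-90,180,-60",
    "page_size": 2000,
}

granules, headers = [], {}
while True:
    r = requests.get(CMR_URL, params=params, headers=headers, timeout=60)
    r.raise_for_status()
    entries = r.json()["feed"]["entry"]
    granules.extend(entries)
    search_after = r.headers.get("CMR-Search-After")
    if not entries or not search_after:
        break
    # CMR pagination: echo the CMR-Search-After header back on the next request
    headers["CMR-Search-After"] = search_after

with open("granule_catalog.json", "w") as f:
    json.dump(granules, f)  # everything downstream reads this file, not the CMR
```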

Number 5 above means 1 vCPU with 2048 MB of RAM. For Lambda, AWS charges by the GB-second, where the GB refers to the amount of RAM you provision for the function; vCPU allocation scales proportionally with that RAM, at roughly 1 vCPU per 2 GB. Our horizontal processes are pretty light on RAM (under 768 MB), so it seems like dropping the memory allocation would be an easy way to make things cheaper. However, dropping memory also drops your CPU allocation, and the scaling on a fraction of a vCPU versus a fully dedicated vCPU is abysmal... so much so that a job takes well over twice as long with 0.5 vCPUs as with 1 vCPU, which ends up both more expensive and slower in wall-clock time.
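
The billing arithmetic behind the cost lines above is just execution seconds times provisioned GB times a per-GB-second rate (ignoring the tiny per-request charge). A sketch using the published Lambda rates, which vary by region and can change:

```python
# Lambda billing: cost = execution seconds * provisioned memory (GB) * price per GB-second
X86_PRICE = 0.0000166667     # USD per GB-second for x86_64; arm64 is ~0.0000133334

def lambda_cost(total_seconds, memory_mb=2048, price=X86_PRICE):
    gb_seconds = total_seconds * (memory_mb / 1024)
    return gb_seconds, gb_seconds * price

gb_s, usd = lambda_cost(89_229.4)                        # the x86 production run above
print(f"{gb_s:,.1f} GB-s, ${usd:.4f}")                   # -> 178,458.8 GB-s, $2.9743

gb_s, usd = lambda_cost(87_059.8, price=0.0000133334)    # the arm64 run further down
print(f"{gb_s:,.1f} GB-s, ${usd:.4f}")                   # -> 174,119.6 GB-s, $2.3216
```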

I suspect that this may get re-implemented using DevSeed's recommendation of cubed at some point; the only reason it isn't used now is that I had already started building the Lambda layers in AWS and wanted to benchmark there before switching to a new library.

espg (Contributor, Author) commented Dec 10, 2025

By far the biggest hassle, and the place with the most room for improvement, has been both sides of the spatial subsetting. By both sides, I mean:

  1. Defining the worker 'cells' that encompass and cover all of Antarctica
  2. Determining which of those cells actually have data

I've been using Morton indexing, which makes the second problem slightly easier at the cost of making the first problem more annoying. The mortie library that I maintain is built on top of HEALPix (wrapped via healpy), and the entire stack has pretty lackluster support for converting complex polygons into coverage maps of HEALPix indices. Here are some of the (mostly failed) attempts at solving this:

  1. Building a minimum spanning tree of indices based on coverage of the polygon vertices. This is complicated by the fact that the spanning tree isn't restricted to contiguous cells, so it's easy for 'holes' to appear in the geometry.
  2. Using the built-in coverage function in HEALPix to convert from a polygon to HEALPix index coverage. This only works for convex geometries, which seems to exclude essentially every geometry I'd like to use.
  3. Trying the above, but using qhull to force a convex geometry. A miserable and abject failure; geometries that are convex in polar stereographic aren't necessarily convex in spherical coordinates.

... and many more that are equally frustrating. Currently, we solve this with a pretty reliable but janky approach that does the following:

  1. Take each of the 27 Antarctic drainage basins and calculate (a) the centroid, and (b) the distance from that centroid to the furthest point in the drainage polygon.
  2. Use a HEALPix disc query with that center point and distance to grab the set of HEALPix indices at the desired cell resolution.
  3. Merge all of those per-basin disc coverages together, and then take the np.unique set of the result.

You could do this for the entire Antarctic polygon, but it would massively overestimate the cell indices you want; doing it per basin and merging gets relatively close to proper coverage with only a slight over-provision.
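
A minimal sketch of that per-basin disc query using healpy directly; the basin centroids and radii below are placeholder values, and mortie's Morton indices would be derived from the resulting HEALPix cells:

```python
import numpy as np
import healpy as hp

def basin_disc_cells(centroid_lonlat, radius_rad, nside):
    """HEALPix cells within an angular disc around a basin centroid."""
    lon, lat = centroid_lonlat
    vec = hp.ang2vec(lon, lat, lonlat=True)
    # inclusive=True also returns cells that merely intersect the disc,
    # trading a slight over-provision for not leaving holes in the coverage.
    return hp.query_disc(nside, vec, radius_rad, inclusive=True, nest=True)

# Placeholder per-basin inputs: (lon, lat) centroid and the great-circle
# distance (radians) from that centroid to the basin's furthest vertex.
basins = [((60.0, -72.0), 0.05), ((-120.0, -78.0), 0.08)]

nside = hp.order2nside(6)  # "parent order 6" from the catalog above
cells = np.unique(np.concatenate([basin_disc_cells(c, r, nside) for c, r in basins]))
```

Whether NESTED or RING ordering is correct here depends on how mortie maps HEALPix pixels to Morton indices, so treat nest=True as an assumption.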

Determining which cells have data from the NASA CMR is also a hassle. Bounding boxes are useless in the polar regions, so instead we try to use the simplified geometries as query objects. It's relatively easy to get to a '90%' solution-- we take the point / line geometries from the NASA CMR, calculate the Morton indices for the vertices in those geometries, and use those to provision our worker nodes. It almost works beautifully:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/5] Loading granule catalog from granule_catalog_cycle22_order6.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 1742
      Total granules: 4153
      Processing 1742 cells

[2/5] Authenticating with NASA Earthdata...
      Credentials expire: 2025-12-10 05:07:33+00:00

[3/5] Invoking 1742 Lambda functions (max 1700 concurrent)...
      Output: s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
      [  50/1742] empty (filtered) | 3.8 cells/s, ETA 7.4m
      [ 100/1742] OK (139 cells, 17,605 obs) | 6.8 cells/s, ETA 4.0m
      [ 150/1742] OK (773 cells, 73,345 obs) | 9.0 cells/s, ETA 3.0m
      [ 200/1742] OK (488 cells, 38,105 obs) | 10.9 cells/s, ETA 2.4m
      [ 250/1742] OK (876 cells, 65,582 obs) | 12.2 cells/s, ETA 2.0m
      [ 300/1742] OK (610 cells, 70,907 obs) | 13.1 cells/s, ETA 1.8m
      [ 350/1742] OK (1468 cells, 193,102 obs) | 14.3 cells/s, ETA 1.6m
      [ 400/1742] OK (1344 cells, 146,479 obs) | 15.7 cells/s, ETA 1.4m
      [ 450/1742] OK (2551 cells, 400,688 obs) | 16.9 cells/s, ETA 1.3m
      [ 500/1742] OK (1773 cells, 215,345 obs) | 18.2 cells/s, ETA 1.1m
      [ 550/1742] OK (2694 cells, 373,129 obs) | 19.6 cells/s, ETA 1.0m
      [ 600/1742] OK (2036 cells, 272,797 obs) | 20.8 cells/s, ETA 0.9m
      [ 650/1742] OK (2839 cells, 477,125 obs) | 21.9 cells/s, ETA 0.8m
      [ 700/1742] OK (2268 cells, 264,217 obs) | 23.1 cells/s, ETA 0.8m
      [ 750/1742] OK (3082 cells, 513,850 obs) | 24.1 cells/s, ETA 0.7m
      [ 800/1742] OK (2868 cells, 465,596 obs) | 25.0 cells/s, ETA 0.6m
      [ 850/1742] OK (2469 cells, 348,257 obs) | 25.9 cells/s, ETA 0.6m
      [ 900/1742] OK (3114 cells, 443,050 obs) | 26.8 cells/s, ETA 0.5m
      [ 950/1742] OK (2644 cells, 349,372 obs) | 27.6 cells/s, ETA 0.5m
      [1000/1742] OK (3029 cells, 564,727 obs) | 28.1 cells/s, ETA 0.4m
      [1050/1742] OK (3520 cells, 794,216 obs) | 28.5 cells/s, ETA 0.4m
      [1100/1742] OK (3421 cells, 661,995 obs) | 28.5 cells/s, ETA 0.4m
      [1150/1742] OK (3664 cells, 785,075 obs) | 28.8 cells/s, ETA 0.3m
      [1200/1742] OK (3081 cells, 487,608 obs) | 28.6 cells/s, ETA 0.3m
      [1250/1742] OK (3759 cells, 972,599 obs) | 28.3 cells/s, ETA 0.3m
      [1300/1742] OK (3507 cells, 870,284 obs) | 28.3 cells/s, ETA 0.3m
      [1350/1742] OK (3907 cells, 940,720 obs) | 28.3 cells/s, ETA 0.2m
      [1400/1742] OK (3268 cells, 709,374 obs) | 27.9 cells/s, ETA 0.2m
      [1450/1742] OK (3890 cells, 883,258 obs) | 27.0 cells/s, ETA 0.2m
      [1500/1742] OK (3386 cells, 1,205,016 obs) | 25.4 cells/s, ETA 0.2m
      [1550/1742] OK (3809 cells, 1,192,299 obs) | 23.3 cells/s, ETA 0.1m
      [1600/1742] OK (3720 cells, 1,521,734 obs) | 21.4 cells/s, ETA 0.1m
      [1650/1742] OK (3973 cells, 866,088 obs) | 18.0 cells/s, ETA 0.1m
      [1700/1742] OK (4094 cells, 2,804,549 obs) | 13.4 cells/s, ETA 0.1m

[4/5] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 62,289.5s (17.30 hours)
      Memory: 2048MB (2.0GB)
      GB-seconds: 124,579.0
      Cost: $2.0763

[5/5] Summary
======================================================================
      Total cells:          1742
      With data:            1727
      Empty (no granules):  0
      Empty (filtered):     15
      Errors:               0
      Total observations:   1,107,767,160
----------------------------------------------------------------------
      Wall clock time:      377.4s (6.3m)
      Lambda compute time:  62,289.5s (1038.2m)
      Throughput:           4.6 cells/sec
      Estimated cost:       $2.0763
----------------------------------------------------------------------
      Output location:      s3://jupyterhub-englacial-scratch-429435741471/atl06/production/

espg (Contributor, Author) commented Dec 10, 2025

Compare the above output to what's in the description of this PR. We note a few things:

  1. We're missing 200 million observations
  2. Those 200 million observations are from 12 cells that aren't processed
  3. We're doing a way better job with false positives-- of the 1742 cells we check, 99% have data. That compares against the production solution, where only 72.6% of the 2395 cells we check have data.
  4. The above costs about a third less to process and finishes a minute faster

The wall clock time difference is due to the extra points we need to process... but the majority of the cost difference comes from the extra ~600 cells we check to make sure we aren't missing any data. The reason we're missing the 12 cells in the first place is that the geometries in the NASA CMR are simplified, and we tend to miss the edges of some cells near the pole hole. We fix this by densifying the geometries... which at times means crossing the pole hole and adding in many, many extra cells. But we don't want to miss any data (certainly not 200 million observations).

A quick intermediate fix here is probably calling a buffer operation... but it would be nice to have an exact solution rather than a marginally improved heuristic, if possible.
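
A rough sketch of that intermediate fix with shapely: densify (segmentize) the simplified CMR geometry, pad it with a small buffer, then run the same vertex-to-cell step over the denser vertex set. The distances are placeholder values in degrees and would need care near the pole and the antimeridian, and plain HEALPix pixels stand in for mortie's Morton indices:

```python
import numpy as np
import healpy as hp
import shapely
from shapely.geometry import shape

def provision_cells(cmr_geojson_geometry, nside, max_seg_deg=0.25, buffer_deg=0.5):
    """Cells 'touched' by a granule, from a densified + buffered CMR geometry.

    Densifying adds vertices so narrow cells can't slip between the original
    simplified vertices; the buffer catches cell edges that the simplified
    outline clips past near the pole hole. Both knobs trade extra (cheap,
    later empty-filtered) cells for not dropping data.
    """
    geom = shape(cmr_geojson_geometry)               # GeoJSON dict -> shapely geometry
    dense = shapely.segmentize(geom, max_seg_deg)    # add a vertex at least every max_seg_deg degrees
    padded = dense.buffer(buffer_deg)                # assumes the result is a single Polygon
    lon, lat = np.asarray(padded.exterior.coords).T  # boundary vertices of the padded footprint
    return np.unique(hp.ang2pix(nside, lon, lat, nest=True, lonlat=True))
```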

- Use Docker AL2023 container for builds
- Add git-lfs for layer zip files
- Pin numpy<2.3 to avoid Lambda issues
- Patch astropy/__init__.py to remove pytest dependency
- Remove boto3/botocore from layer (Lambda provides)
- Disable ARM64 build (healpy compilation issues)
- Include working v14 layer (73MB, matches AWS)
espg (Contributor, Author) commented Dec 10, 2025

Results from the above run(s):

[figure: results_lambda]

(from lambda/visualize_production_results.ipynb)

espg mentioned this pull request Dec 10, 2025
espg (Contributor, Author) commented Dec 10, 2025

Todo:

  • Fix / diagnose the spatial subsetting issue (i.e., fix the 90% solution to work 100%)
  • Finalize the build pipeline for the working x86 layer (the working version is hand-tuned)
  • Create a working ARM layer (lower priority; requires building healpy wheels for Python 3.11 on ARM)
  • Better output and visualization pipeline / notebook (lots of work to be done here!)
  • Define modular library functions and calls that aren't specific to atl06 (feedback appreciated)

Notes

This Lambda layer from commit 522185c is what you want to use to reproduce; the more recent one is still being debugged and isn't known to work correctly yet.

Our infrastructure for the cloud deployment is here; the Terraform / OpenTofu template is mostly for setting up JupyterHub, but the most recent commits also patch in our Lambda permissions. For now, they're specific to the prototype, but we'll update them to allow arbitrary Lambda function calls from the notebook environment in the future.

espg (Contributor, Author) commented Dec 10, 2025

The HEAD version of the Lambda zip is working now; there was an issue either in bumping h5coro from 0.8.0 to 1.0.3, or a regression in pandas from 2.3.2 to 2.3.3.

espg (Contributor, Author) commented Dec 10, 2025

h5coro version 1.0.3 is working now. Older versions (i.e., 0.8.0) didn't have an explicit close() method when opening h5 datasets, but would close and release memory automatically as objects were iterated over (a reference / variable-name overwrite would trigger release and garbage collection). Version 1.0.3 not only added the close() method, it requires it in order to de-reference and release memory.
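
In practice that just means pairing every open with an explicit close() under 1.0.3, e.g. via try/finally or a small context manager around the read. The import layout and constructor call below follow the h5coro examples from memory and should be checked against the h5coro docs; treat the exact signatures as assumptions:

```python
from contextlib import contextmanager

from h5coro import h5coro, s3driver  # assumed import layout

@contextmanager
def open_granule(resource, credentials):
    """Guarantee close() runs, even if the read raises.

    h5coro 0.8.0 released memory when the reader object was rebound or
    garbage collected; 1.0.3 holds its references until close() is called,
    which leaks memory inside a warm Lambda worker if the call is skipped.
    """
    h5obj = h5coro.H5Coro(resource, s3driver.S3Driver, credentials=credentials)  # assumed signature
    try:
        yield h5obj
    finally:
        h5obj.close()
```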

espg (Contributor, Author) commented Dec 11, 2025

arm64 drops the cost and time a bit:

----------------------------------------------------------------------
      Total Lambda execution time: 87,059.8s (24.18 hours)
      Memory: 2048MB (2.0GB)
      Architecture: arm64
      GB-seconds: 174,119.5
      Cost: $2.3216

[5/5] Summary
======================================================================
      Total cells:          2395
      With data:            1739
      Empty (no granules):  0
      Empty (filtered):     656
      Errors:               0
      Total observations:   1,370,185,727
----------------------------------------------------------------------
      Wall clock time:      409.7s (6.8m)
      Lambda compute time:  87,059.8s (1451.0m)
      Throughput:           5.8 cells/sec
      Estimated cost:       $2.3216
----------------------------------------------------------------------
      Output location:      s3://xagg/atl06/production/
======================================================================

With a better working spatial metadata subset, we can expect close to 6 minutes and under $2. Output (and notebook) updated to the public bucket.

maxrjones commented

@espg, I'm looking at the optimizations we discussed at AGU now (sorry for the delay). Is the granule catalog you've used "granule_catalog_cycle22_order6_stereo.json" the output from one of the query_cmr_*.py scripts?

maxrjones commented

> @espg, I'm looking at the optimizations we discussed at AGU now (sorry for the delay). Is the granule catalog you've used "granule_catalog_cycle22_order6_stereo.json" the output from one of the query_cmr_*.py scripts?

sorry, please ignore this question. I found the source in lambda/built_granule_catalog.py.
