
Conversation

espg (Contributor) commented Dec 10, 2025

Rough attempt at making an embarrassingly parallel version of what demo_s3_xdggs.ipynb does. Lots of room for improvement.

What currently works: scaling to ~2000 workers. Generally, I set this slightly lower (1,700) for two reasons:

  1. The wall time is dominated by the longest-running process, so increasing workers past a certain point doesn't actually speed anything up.
  2. We start to run into scaling errors around 2000+ workers (SSL handshake errors with the dispatcher and other hard-to-fix issues), so it's good to stay slightly under that limit.

To minimize wall-clock time, I sort the runs so that the southernmost cells go first; due to orbital convergence they tend to have the most observations (or zero observations, if the cell sits inside the pole hole).
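
For reference, that ordering is just a sort on cell latitude before dispatch. A minimal sketch, assuming the catalog entries carry a cell-center latitude (the `lat` field and `invoke_worker` call are hypothetical placeholders):

```python
import json

# Hypothetical catalog layout: {"cells": [{"cell_id": ..., "lat": ..., "granules": [...]}, ...]}
with open("granule_catalog_cycle22_order6_stereo.json") as f:
    catalog = json.load(f)

# Southernmost cells first: orbital convergence means they usually hold the
# most observations, so launching them earliest keeps the slowest jobs from
# defining the tail of the wall-clock time.
cells = sorted(catalog["cells"], key=lambda c: c["lat"])

for cell in cells:
    invoke_worker(cell)  # hypothetical async Lambda invocation
```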

The default AWS cap on concurrent Lambda functions is 1,000, but they'll immediately approve 2,000 on request:

> aws service-quotas request-service-quota-increase \
      --service-code lambda \
      --quota-code L-B99A9384 \
      --desired-value 2000 \
      --region us-west-2

Here's the TL;DR of what we get out of this pipeline:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/5] Loading granule catalog from granule_catalog_cycle22_order6_stereo.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 2395
      Total granules: 4153
      Processing 2395 cells

[2/5] Authenticating with NASA Earthdata...
      Credentials expire: 2025-12-10 05:46:36+00:00

[3/5] Invoking 2395 Lambda functions (max 1700 concurrent)...
      Output: s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
[4/5] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 89,229.4s (24.79 hours)
      Memory: 2048MB (2.0GB)
      GB-seconds: 178,458.8
      Cost: $2.9743

[5/5] Summary
======================================================================
      Total cells:          2395
      With data:            1739
      Empty (no granules):  0
      Empty (filtered):     656
      Errors:               0
      Total observations:   1,370,185,727
----------------------------------------------------------------------
      Wall clock time:      463.2s (7.7m)
      Lambda compute time:  89,229.4s (1487.2m)
      Throughput:           5.2 cells/sec
      Estimated cost:       $2.9743
----------------------------------------------------------------------
      Output location:      s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
======================================================================

That's 1.37 billion observations aggregated to a grid in a little under 8 minutes for about $3. I'd love to see it happen for $2 and take under 5 minutes.

Here are the relevant speed optimizations that make this work:

  1. Catalog-based query. We hit the NASA CMR once to download granule metadata for a time range (one orbital cycle above, 92 days) and a large spatial area (everything south of -60 latitude), and then query that catalog locally to split work up across workers (see the sketch after this list). Way faster (which makes it cheaper), with less to debug.
  2. h5coro hyperslices to do spatial subsetting in place for the ICESat-2 data
  3. Aggressive metadata culling via item number 1 above... this is also where there's much more room for improvement (see comment below)
  4. Serverless horizontal scaling with Lambda orchestration (possibly switching to cubed in the future)
  5. Right-sized worker processes
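
To illustrate item 1, the one-shot CMR hit can be a single paginated granule search over the cycle and the polar region, cached to disk and queried locally afterwards. A rough sketch against the public CMR search API; the date window and output filename are placeholders, and the real catalog builder (lambda/built_granule_catalog.py) may filter differently:

```python
import json
import requests

CMR_URL = "https://cmr.earthdata.nasa.gov/search/granules.json"

params = {
    "short_name": "ATL06",
    # placeholder ~92-day window covering one orbital cycle
    "temporal": "2024-01-01T00:00:00Z,2024-04-01T23:59:59Z",
    # coarse spatial filter: everything south of -60 latitude
    "bounding_box": "-180,-90,180,-60",
    "page_size": 2000,
}

granules, headers = [], {}
while True:
    r = requests.get(CMR_URL, params=params, headers=headers, timeout=60)
    r.raise_for_status()
    entries = r.json()["feed"]["entry"]
    granules.extend(entries)
    search_after = r.headers.get("CMR-Search-After")
    if not entries or not search_after:
        break
    # CMR pagination: echo the CMR-Search-After header back on the next request
    headers["CMR-Search-After"] = search_after

with open("granule_catalog.json", "w") as f:
    json.dump(granules, f)  # everything downstream reads this file, not the CMR
```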

Number 5 above means 1 vCPU with 2048 MB of RAM. For Lambda, AWS charges by the GB-second, where the GB refers to the amount of RAM you provision for the function; vCPU allocation scales proportionally with that RAM, at roughly 1 vCPU per 2 GB. Our horizontal processes are pretty light on RAM (under 768 MB), so it seems like dropping the memory allocation would be an easy way to make things cheaper. However, dropping memory also drops your CPU allocation, and the scaling on a fraction of a vCPU versus a fully dedicated vCPU is abysmal... so much so that a job takes well over twice as long with 0.5 vCPUs as with 1 vCPU, which ends up both more expensive and slower in wall-clock time.
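
The billing arithmetic behind the cost lines above is just execution seconds times provisioned GB times a per-GB-second rate (ignoring the tiny per-request charge). A sketch using the published Lambda rates, which vary by region and can change:

```python
# Lambda billing: cost = execution seconds * provisioned memory (GB) * price per GB-second
X86_PRICE = 0.0000166667     # USD per GB-second for x86_64; arm64 is ~0.0000133334

def lambda_cost(total_seconds, memory_mb=2048, price=X86_PRICE):
    gb_seconds = total_seconds * (memory_mb / 1024)
    return gb_seconds, gb_seconds * price

gb_s, usd = lambda_cost(89_229.4)                        # the x86 production run above
print(f"{gb_s:,.1f} GB-s, ${usd:.4f}")                   # -> 178,458.8 GB-s, $2.9743

gb_s, usd = lambda_cost(87_059.8, price=0.0000133334)    # the arm64 run further down
print(f"{gb_s:,.1f} GB-s, ${usd:.4f}")                   # -> 174,119.6 GB-s, $2.3216
```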

I suspect that this may get re-implemented using DevSeed's recommendation of cubed at some point; the only reason it isn't used now is that I had already started building the Lambda layers in AWS and wanted to benchmark there before switching to a new library.

espg (Contributor, Author) commented Dec 10, 2025

By far the biggest hassle, and the place with the most room for improvement, has been both sides of the spatial subsetting. By both sides, I mean:

  1. Defining the worker 'cells' that encompass and cover all of Antarctica
  2. Determining which of those cells actually have data

I've been using Morton indexing, which makes the second problem slightly easier at the cost of making the first problem more annoying. The mortie library that I maintain is built on top of HEALPix (wrapped via healpy), and the entire stack has pretty lackluster support for converting complex polygons into coverage maps of HEALPix indices. Here are some of the (mostly failed) attempts at solving this:

  1. Building a minimum spanning tree of indices based on coverage of the polygon vertices. This is complicated by the fact that the spanning tree isn't restricted to contiguous cells, so it's easy for 'holes' to appear in the geometry.
  2. Using the built-in coverage function in HEALPix to convert from a polygon to HEALPix index coverage. This only works for convex geometries, which seems to exclude essentially every geometry I'd like to use.
  3. Trying the above, but using qhull to force a convex geometry. A miserable and abject failure; geometries that are convex in polar stereographic aren't necessarily convex in spherical coordinates.

... and many more that are equally frustrating. Currently, we solve this with a pretty reliable but janky approach that does the following:

  1. Take each of the 27 Antarctic drainage basins and calculate (a) the centroid, and (b) the distance from that centroid to the furthest point in the drainage polygon.
  2. Use a HEALPix disc query with that center point and distance to grab the set of HEALPix indices at the desired cell resolution.
  3. Merge all of those per-basin disc coverages together, and then take the np.unique set of the result.

You could do this for the entire Antarctic polygon, but it would massively overestimate the cell indices you want; doing it per basin and merging gets relatively close to proper coverage with only a slight over-provision.
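
A minimal sketch of that per-basin disc query using healpy directly; the basin centroids and radii below are placeholder values, and mortie's Morton indices would be derived from the resulting HEALPix cells:

```python
import numpy as np
import healpy as hp

def basin_disc_cells(centroid_lonlat, radius_rad, nside):
    """HEALPix cells within an angular disc around a basin centroid."""
    lon, lat = centroid_lonlat
    vec = hp.ang2vec(lon, lat, lonlat=True)
    # inclusive=True also returns cells that merely intersect the disc,
    # trading a slight over-provision for not leaving holes in the coverage.
    return hp.query_disc(nside, vec, radius_rad, inclusive=True, nest=True)

# Placeholder per-basin inputs: (lon, lat) centroid and the great-circle
# distance (radians) from that centroid to the basin's furthest vertex.
basins = [((60.0, -72.0), 0.05), ((-120.0, -78.0), 0.08)]

nside = hp.order2nside(6)  # "parent order 6" from the catalog above
cells = np.unique(np.concatenate([basin_disc_cells(c, r, nside) for c, r in basins]))
```

Whether NESTED or RING ordering is correct here depends on how mortie maps HEALPix pixels to Morton indices, so treat nest=True as an assumption.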

Determining which cells have data from the NASA CMR is also a hassle. Bounding boxes are useless in the polar regions, so instead we try to use the simplified geometries as query objects. It's relatively easy to get to a '90%' solution-- we take the point / line geometries from the NASA CMR, calculate the Morton indices for the vertices in those geometries, and use those to provision our worker nodes. It almost works beautifully:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/5] Loading granule catalog from granule_catalog_cycle22_order6.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 1742
      Total granules: 4153
      Processing 1742 cells

[2/5] Authenticating with NASA Earthdata...
      Credentials expire: 2025-12-10 05:07:33+00:00

[3/5] Invoking 1742 Lambda functions (max 1700 concurrent)...
      Output: s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
      [  50/1742] empty (filtered) | 3.8 cells/s, ETA 7.4m
      [ 100/1742] OK (139 cells, 17,605 obs) | 6.8 cells/s, ETA 4.0m
      [ 150/1742] OK (773 cells, 73,345 obs) | 9.0 cells/s, ETA 3.0m
      [ 200/1742] OK (488 cells, 38,105 obs) | 10.9 cells/s, ETA 2.4m
      [ 250/1742] OK (876 cells, 65,582 obs) | 12.2 cells/s, ETA 2.0m
      [ 300/1742] OK (610 cells, 70,907 obs) | 13.1 cells/s, ETA 1.8m
      [ 350/1742] OK (1468 cells, 193,102 obs) | 14.3 cells/s, ETA 1.6m
      [ 400/1742] OK (1344 cells, 146,479 obs) | 15.7 cells/s, ETA 1.4m
      [ 450/1742] OK (2551 cells, 400,688 obs) | 16.9 cells/s, ETA 1.3m
      [ 500/1742] OK (1773 cells, 215,345 obs) | 18.2 cells/s, ETA 1.1m
      [ 550/1742] OK (2694 cells, 373,129 obs) | 19.6 cells/s, ETA 1.0m
      [ 600/1742] OK (2036 cells, 272,797 obs) | 20.8 cells/s, ETA 0.9m
      [ 650/1742] OK (2839 cells, 477,125 obs) | 21.9 cells/s, ETA 0.8m
      [ 700/1742] OK (2268 cells, 264,217 obs) | 23.1 cells/s, ETA 0.8m
      [ 750/1742] OK (3082 cells, 513,850 obs) | 24.1 cells/s, ETA 0.7m
      [ 800/1742] OK (2868 cells, 465,596 obs) | 25.0 cells/s, ETA 0.6m
      [ 850/1742] OK (2469 cells, 348,257 obs) | 25.9 cells/s, ETA 0.6m
      [ 900/1742] OK (3114 cells, 443,050 obs) | 26.8 cells/s, ETA 0.5m
      [ 950/1742] OK (2644 cells, 349,372 obs) | 27.6 cells/s, ETA 0.5m
      [1000/1742] OK (3029 cells, 564,727 obs) | 28.1 cells/s, ETA 0.4m
      [1050/1742] OK (3520 cells, 794,216 obs) | 28.5 cells/s, ETA 0.4m
      [1100/1742] OK (3421 cells, 661,995 obs) | 28.5 cells/s, ETA 0.4m
      [1150/1742] OK (3664 cells, 785,075 obs) | 28.8 cells/s, ETA 0.3m
      [1200/1742] OK (3081 cells, 487,608 obs) | 28.6 cells/s, ETA 0.3m
      [1250/1742] OK (3759 cells, 972,599 obs) | 28.3 cells/s, ETA 0.3m
      [1300/1742] OK (3507 cells, 870,284 obs) | 28.3 cells/s, ETA 0.3m
      [1350/1742] OK (3907 cells, 940,720 obs) | 28.3 cells/s, ETA 0.2m
      [1400/1742] OK (3268 cells, 709,374 obs) | 27.9 cells/s, ETA 0.2m
      [1450/1742] OK (3890 cells, 883,258 obs) | 27.0 cells/s, ETA 0.2m
      [1500/1742] OK (3386 cells, 1,205,016 obs) | 25.4 cells/s, ETA 0.2m
      [1550/1742] OK (3809 cells, 1,192,299 obs) | 23.3 cells/s, ETA 0.1m
      [1600/1742] OK (3720 cells, 1,521,734 obs) | 21.4 cells/s, ETA 0.1m
      [1650/1742] OK (3973 cells, 866,088 obs) | 18.0 cells/s, ETA 0.1m
      [1700/1742] OK (4094 cells, 2,804,549 obs) | 13.4 cells/s, ETA 0.1m

[4/5] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 62,289.5s (17.30 hours)
      Memory: 2048MB (2.0GB)
      GB-seconds: 124,579.0
      Cost: $2.0763

[5/5] Summary
======================================================================
      Total cells:          1742
      With data:            1727
      Empty (no granules):  0
      Empty (filtered):     15
      Errors:               0
      Total observations:   1,107,767,160
----------------------------------------------------------------------
      Wall clock time:      377.4s (6.3m)
      Lambda compute time:  62,289.5s (1038.2m)
      Throughput:           4.6 cells/sec
      Estimated cost:       $2.0763
----------------------------------------------------------------------
      Output location:      s3://jupyterhub-englacial-scratch-429435741471/atl06/production/

espg (Contributor, Author) commented Dec 10, 2025

Compare the above output to what's in the description of this PR. We note a few things:

  1. We're missing 200 million observations
  2. Those 200 million observations are from 12 cells that aren't processed
  3. We're doing a way better job with false positives-- of the 1742 cells we check, 99% have data. That compares against the production solution, where only 72.6% of the 2395 cells we check have data.
  4. The above costs about a third less to process and finishes a minute faster

The wall clock time difference is due to the extra points we need to process... but the majority of the cost difference comes from the extra ~600 cells we check to make sure we aren't missing any data. The reason we're missing the 12 cells in the first place is that the geometries in the NASA CMR are simplified, and we tend to miss the edges of some cells near the pole hole. We fix this by densifying the geometries... which at times means crossing the pole hole and adding in many, many extra cells. But we don't want to miss any data (certainly not 200 million observations).

A quick intermediate fix here is probably calling a buffer operation... but it would be nice to have an exact solution rather than a marginally improved heuristic, if possible.
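
A rough sketch of that intermediate fix with shapely: densify (segmentize) the simplified CMR geometry, pad it with a small buffer, then run the same vertex-to-cell step over the denser vertex set. The distances are placeholder values in degrees and would need care near the pole and the antimeridian, and plain HEALPix pixels stand in for mortie's Morton indices:

```python
import numpy as np
import healpy as hp
import shapely
from shapely.geometry import shape

def provision_cells(cmr_geojson_geometry, nside, max_seg_deg=0.25, buffer_deg=0.5):
    """Cells 'touched' by a granule, from a densified + buffered CMR geometry.

    Densifying adds vertices so narrow cells can't slip between the original
    simplified vertices; the buffer catches cell edges that the simplified
    outline clips past near the pole hole. Both knobs trade extra (cheap,
    later empty-filtered) cells for not dropping data.
    """
    geom = shape(cmr_geojson_geometry)               # GeoJSON dict -> shapely geometry
    dense = shapely.segmentize(geom, max_seg_deg)    # add a vertex at least every max_seg_deg degrees
    padded = dense.buffer(buffer_deg)                # assumes the result is a single Polygon
    lon, lat = np.asarray(padded.exterior.coords).T  # boundary vertices of the padded footprint
    return np.unique(hp.ang2pix(nside, lon, lat, nest=True, lonlat=True))
```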

- Use Docker AL2023 container for builds
- Add git-lfs for layer zip files
- Pin numpy<2.3 to avoid Lambda issues
- Patch astropy/__init__.py to remove pytest dependency
- Remove boto3/botocore from layer (Lambda provides)
- Disable ARM64 build (healpy compilation issues)
- Include working v14 layer (73MB, matches AWS)
espg (Contributor, Author) commented Dec 10, 2025

Results from the above run(s):

[figure: results_lambda]

(from lambda/visualize_production_results.ipynb)

espg mentioned this pull request Dec 10, 2025
espg (Contributor, Author) commented Dec 10, 2025

Todo:

  • Fix / diagnose the spatial subsetting issue (i.e., fix the 90% solution to work 100%)
  • Finalize the build pipeline for the working x86 layer (the working version is hand-tuned)
  • Create a working ARM layer (lower priority; requires building healpy wheels for Python 3.11 on ARM)
  • Better output and visualization pipeline / notebook (lots of work to be done here!)
  • Define modular library functions and calls that aren't specific to atl06 (feedback appreciated)

Notes

This Lambda layer from commit 522185c is what you want to use to reproduce; the more recent one is still being debugged and isn't known to work correctly yet.

Our infrastructure for the cloud deployment is here; the Terraform / OpenTofu template is mostly for setting up JupyterHub, but the most recent commits also patch in our Lambda permissions. For now, they're specific to the prototype, but we'll update them to allow arbitrary Lambda function calls from the notebook environment in the future.

espg (Contributor, Author) commented Dec 10, 2025

The HEAD version of the Lambda zip is working now; there was an issue either in bumping h5coro from 0.8.0 to 1.0.3, or a regression in pandas from 2.3.2 to 2.3.3.

espg (Contributor, Author) commented Dec 10, 2025

h5coro version 1.0.3 is working now. Older versions (i.e., 0.8.0) didn't have an explicit close() method when opening h5 datasets, but would close and release memory automatically as objects were iterated over (a reference / variable-name overwrite would trigger release and garbage collection). Version 1.0.3 not only added the close() method, it requires it in order to de-reference and release memory.
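
In practice that just means pairing every open with an explicit close() under 1.0.3, e.g. via try/finally or a small context manager around the read. The import layout and constructor call below follow the h5coro examples from memory and should be checked against the h5coro docs; treat the exact signatures as assumptions:

```python
from contextlib import contextmanager

from h5coro import h5coro, s3driver  # assumed import layout

@contextmanager
def open_granule(resource, credentials):
    """Guarantee close() runs, even if the read raises.

    h5coro 0.8.0 released memory when the reader object was rebound or
    garbage collected; 1.0.3 holds its references until close() is called,
    which leaks memory inside a warm Lambda worker if the call is skipped.
    """
    h5obj = h5coro.H5Coro(resource, s3driver.S3Driver, credentials=credentials)  # assumed signature
    try:
        yield h5obj
    finally:
        h5obj.close()
```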

espg (Contributor, Author) commented Dec 11, 2025

arm64 drops the cost and time a bit:

----------------------------------------------------------------------
      Total Lambda execution time: 87,059.8s (24.18 hours)
      Memory: 2048MB (2.0GB)
      Architecture: arm64
      GB-seconds: 174,119.5
      Cost: $2.3216

[5/5] Summary
======================================================================
      Total cells:          2395
      With data:            1739
      Empty (no granules):  0
      Empty (filtered):     656
      Errors:               0
      Total observations:   1,370,185,727
----------------------------------------------------------------------
      Wall clock time:      409.7s (6.8m)
      Lambda compute time:  87,059.8s (1451.0m)
      Throughput:           5.8 cells/sec
      Estimated cost:       $2.3216
----------------------------------------------------------------------
      Output location:      s3://xagg/atl06/production/
======================================================================

With a better working spatial metadata subset, we can expect close to 6 minutes and under $2. Output (and notebook) updated to the public bucket.

maxrjones commented

@espg, I'm looking at the optimizations we discussed at AGU now (sorry for the delay). Is the granule catalog you've used "granule_catalog_cycle22_order6_stereo.json" the output from one of the query_cmr_*.py scripts?

maxrjones commented

> @espg, I'm looking at the optimizations we discussed at AGU now (sorry for the delay). Is the granule catalog you've used "granule_catalog_cycle22_order6_stereo.json" the output from one of the query_cmr_*.py scripts?

sorry, please ignore this question. I found the source in lambda/built_granule_catalog.py.
