Design discussion #2

@espg

New xagg prototype up (#1) that gets at some of what I've been thinking about for aggregations. The lambda takes 6 to 7 minutes and costs $2 to $3 to run a continental aggregation function on atl06 for one orbital cycle.

My hope is that the design pattern is somewhat portable and can be generalized. Here's the high-level overview of what happens:

  1. Spatio-temporal metadata query. The atl06 example hits NASA CMR, but this is broadly compatible with anything that conforms to a STAC API.
  2. Definition of what to fetch (i.e., which columns). This is in a Python function right now, but would probably be abstracted to a YAML template. This is also where we might add (for example) a reader specification (the current example uses h5coro).
  3. Definition of what aggregation to do. This is defined in the same Python file, and could be a script, function, or similar.
  4. Definition of input / output grid. The input grid defines the workers; the output grid is self-explanatory. This is implemented using healpix for both, but could be generalized to arbitrary user-defined grids later on.
  5. What to write to. This is currently a stub (i.e., parquet files), but should be more grid-like (i.e., Zarr or similar, with partial spatially chunked writes).
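
Steps 2 through 4 can be sketched roughly as below. This is a minimal, stdlib-only illustration and not the xagg code: `AggregationSpec`, `parent_cell`, `shard_by_input_cell`, and `aggregate_shard` are hypothetical names, `h_li` is only an example column, and it assumes each observation already carries a NESTED healpix index at some fine order. The metadata query and the h5 read are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class AggregationSpec:
    """Hypothetical job description; field names are illustrative only."""
    columns: tuple[str, ...]                  # step 2: which columns to fetch
    reducer: Callable[[list[float]], float]   # step 3: which aggregation to run
    input_order: int                          # step 4: coarse order, one worker per cell
    output_order: int                         # step 4: order of the output grid


def parent_cell(cell: int, order: int, parent_order: int) -> int:
    """Coarsen a NESTED healpix index; each order adds two bits per cell."""
    return cell >> (2 * (order - parent_order))


def shard_by_input_cell(rows, obs_order, spec):
    """Group observations by input-grid cell, i.e. one shard per worker."""
    shards = defaultdict(list)
    for row in rows:
        shards[parent_cell(row["cell"], obs_order, spec.input_order)].append(row)
    return dict(shards)


def aggregate_shard(rows, obs_order, spec):
    """What a single worker does: reduce each column per output-grid cell."""
    groups = defaultdict(list)
    for row in rows:
        groups[parent_cell(row["cell"], obs_order, spec.output_order)].append(row)
    return {
        cell: {col: spec.reducer([r[col] for r in members]) for col in spec.columns}
        for cell, members in groups.items()
    }


# Toy observations indexed at order 4; values are made up.
spec = AggregationSpec(columns=("h_li",), reducer=mean, input_order=1, output_order=3)
rows = [
    {"cell": 5, "h_li": 10.0},
    {"cell": 6, "h_li": 20.0},
    {"cell": 64, "h_li": 30.0},
]
shards = shard_by_input_cell(rows, obs_order=4, spec=spec)
results = {cell: aggregate_shard(shard, 4, spec) for cell, shard in shards.items()}
```

Keeping the spec as plain data is what would make the move to a YAML template straightforward; only the reducer needs to be looked up by name.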

In principle, we could (for minimal work) swap out ICESat-2 for anything in the CMR whose data model has us reading h5 files with per-observation lat/lon data; that's a huge number of science products already.

A few bigger picture items:

  1. What do we write to? We currently use parquet/geoparquet, but Zarr seems like a promising candidate.
  2. What do we visualize with? I'd love the ability to watch the output populate as the data is written.
    • We had looked at lonboard a while ago; it works with geoparquet and is client side, which is great
      • My understanding is that it's based on deck.gl, which doesn't support non-Mercator projections, which is a deal breaker for the polar regions
      • How feasible is upstreaming multiple projection types?
    • Other options are zarr viewers
      • Requires fixed array output size (honestly a reasonable and workable restriction)
      • Seem to have some support for healpix metadata (not required, but nice)
      • Some issues with only displaying global grid extent for the healpix example
  3. Can we generalize the pattern to a common interface for both local and cloud deployment?
    • What's the cloud backend (cubed?), and how do we handle authentication?
    • What's the local backend (vaex?)
    • How do we integrate with the existing ecosystem?
  4. Do we support custom input and output grids, or just custom output grids, or neither?
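
On the fixed array size and partial writes: with healpix in NESTED ordering, the output size is known up front (12 * 4**order cells), and every input-grid cell maps to one contiguous index range in the output, so each worker could write its own slice independently. A rough sketch of that arithmetic, assuming NESTED ordering and using a plain list as a stand-in for a chunked store such as Zarr (`n_cells`, `chunk_slice`, and `write_partial` are hypothetical helpers):

```python
import math


def n_cells(order: int) -> int:
    """Number of healpix cells at a given order (nside = 2**order)."""
    return 12 * 4 ** order


def chunk_slice(worker_cell: int, worker_order: int, output_order: int) -> slice:
    """Contiguous NESTED index range of output cells inside one worker cell."""
    width = 4 ** (output_order - worker_order)
    return slice(worker_cell * width, (worker_cell + 1) * width)


def write_partial(grid, worker_cell, worker_order, output_order, values):
    """Write one worker's results into its own slice of the output array."""
    target = chunk_slice(worker_cell, worker_order, output_order)
    if len(values) != target.stop - target.start:
        raise ValueError("values must cover the worker's whole slice")
    grid[target] = values


# Stand-in for the output array: fixed size, unwritten cells stay NaN.
output_order, worker_order = 3, 1
grid = [math.nan] * n_cells(output_order)
write_partial(grid, 1, worker_order, output_order, [1.0] * 16)
```

If the store's chunk size matches the slice width, no two workers touch the same chunk, which is the property that makes partial spatially chunked writes safe to run in parallel.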
