Manifest Splitting #767

Merged
dcherian merged 76 commits into main from split-manifests
May 13, 2025
@dcherian dcherian commented Feb 21, 2025

  • does the config get serialized properly?
  • Add ManifestSplitCondition.AnyArray
  • Python tests for Or, And
  • Add docs
  • real-world benchmark; test with ERA5
  • add ndim based condition (3D vs 4D) (if someone asks for it)

Minimal docs here: https://icechunk--767.org.readthedocs.build/en/767/icechunk-python/performance/

I rewrote the ERA5 manifests to put 1 year per manifest (~9000 chunks); this gets us a 3X speedup.
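
One way to picture the "1 year per manifest" layout: assuming the time axis is cut into fixed-size runs of chunk indices, each run gets its own manifest, so a reader that needs one run can, in principle, fetch only that manifest. The sketch below is illustrative only. `split_axis` and `split_index` are hypothetical names, not Icechunk functions, and the sizes in the usage note are toy values rather than the ERA5 numbers.

```rust
use std::ops::Range;

/// Split the chunk-index range `0..num_chunks` of one dimension into
/// consecutive half-open ranges of at most `split_size` chunks each.
/// Hypothetical helper for illustration; not part of Icechunk's API.
fn split_axis(num_chunks: u32, split_size: u32) -> Vec<Range<u32>> {
    assert!(split_size > 0, "split_size must be positive");
    (0..num_chunks)
        .step_by(split_size as usize)
        // The last range is clipped so it never runs past the end of the axis.
        .map(|start| start..start.saturating_add(split_size).min(num_chunks))
        .collect()
}

/// Position of the split (and so of the manifest) holding `chunk_index`,
/// assuming the fixed-size layout produced by `split_axis`.
fn split_index(chunk_index: u32, split_size: u32) -> u32 {
    chunk_index / split_size
}
```

With these definitions, `split_axis(10, 4)` yields the ranges `0..4`, `4..8` and `8..10`, and `split_index(9, 4)` points at the third of them (index 2).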

pub struct ManifestShards(Vec<ManifestExtents>);

impl ManifestShards {
    pub fn default(ndim: usize) -> Self {

Contributor Author

I don't like this, but it is certainly tied to ndim.

Collaborator

Maybe ManifestSplits is an enum to avoid this?

enum ManifestSplits {
   Single,
   Multiple(Vec<ManifestExtents>)
}

What I don't like is the empty vector. I wonder if Rust has a NonEmptyVec type, otherwise, a trick people use is:

...
   Multiple{ first: ManifestExtents, rest: Vec<ManifestExtents>}
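
For reference, the standard library has no non-empty vector type, so the `first`/`rest` trick is usually written by hand. A minimal generic sketch follows; `NonEmpty` and its methods are names chosen here for illustration and are not taken from this PR.

```rust
/// A sequence that cannot be empty: the first element lives outside the `Vec`,
/// so the type itself guarantees at least one item.
#[derive(Debug, Clone, PartialEq)]
struct NonEmpty<T> {
    first: T,
    rest: Vec<T>,
}

impl<T> NonEmpty<T> {
    /// Returns `None` for an empty input, so emptiness is handled once,
    /// at construction time.
    fn from_vec(mut items: Vec<T>) -> Option<Self> {
        if items.is_empty() {
            return None;
        }
        let first = items.remove(0);
        Some(Self { first, rest: items })
    }

    /// Number of elements; always at least 1.
    fn len(&self) -> usize {
        1 + self.rest.len()
    }

    /// Iterates over every element, `first` included.
    fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        std::iter::once(&self.first).chain(self.rest.iter())
    }
}
```

The cost of this design is that every consumer has to go through `iter()` or handle `first` separately, which is the usual trade-off against keeping a plain `Vec` and checking for emptiness at each use.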

Contributor Author

Ok, I don't need the default any more. It was an artifact that appeared because I implemented the core logic before wiring up the config. Now the default gets set when parsing the config using the Array Metadata.

@@ -37,9 +33,77 @@ impl ManifestExtents {
        Self(v)
    }

    pub fn contains(&self, coord: &[u32]) -> bool {
        self.iter().zip(coord.iter()).all(|(range, that)| range.contains(that))

Collaborator

We need to start checking on writes that indexes have the proper size for the metadata.
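
One reading of this comment: `zip` stops at the shorter of its two inputs, so `contains` as written never notices a coordinate with the wrong number of dimensions. The sketch below shows that behaviour and one possible guard. `ManifestExtents` here is a stand-in with an assumed `Vec<Range<u32>>` layout, and `contains_checked` is a hypothetical name, not something from the PR.

```rust
use std::ops::Range;

/// Stand-in for the PR's `ManifestExtents`: one half-open range per dimension.
/// The inner layout is an assumption made for this example.
struct ManifestExtents(Vec<Range<u32>>);

impl ManifestExtents {
    /// Same logic as in the diff: because `zip` truncates to the shorter side,
    /// a coordinate with too few or too many entries is never rejected.
    fn contains(&self, coord: &[u32]) -> bool {
        self.0.iter().zip(coord.iter()).all(|(range, that)| range.contains(that))
    }

    /// Hypothetical variant that first rejects coordinates whose length
    /// differs from the number of dimensions.
    fn contains_checked(&self, coord: &[u32]) -> Result<bool, String> {
        if coord.len() != self.0.len() {
            return Err(format!(
                "coordinate has {} dimensions, expected {}",
                coord.len(),
                self.0.len()
            ));
        }
        Ok(self.contains(coord))
    }
}
```

For a 2-D extent `[0..2, 0..3]`, `contains(&[1])` returns `true` even though the coordinate is one dimension short, while `contains_checked(&[1])` returns an error.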

@dcherian dcherian force-pushed the split-manifests branch 2 times, most recently from e7d9221 to 09476a4 Compare March 6, 2025 23:02
@dcherian dcherian force-pushed the split-manifests branch from bc67218 to 9954cda Compare May 8, 2025 03:14
@dcherian dcherian marked this pull request as ready for review May 8, 2025 03:30
@dcherian dcherian requested a review from paraseba May 8, 2025 03:30
@dcherian dcherian force-pushed the split-manifests branch from fe64bda to 0891269 Compare May 8, 2025 04:26
@dcherian dcherian force-pushed the split-manifests branch from 0891269 to b5812d5 Compare May 8, 2025 04:31
Ok(())
}

// #[tokio::test]

Collaborator

ping

@dcherian dcherian enabled auto-merge (squash) May 13, 2025 15:16
@dcherian dcherian merged commit 9b78cf6 into main May 13, 2025
8 checks passed
@dcherian dcherian deleted the split-manifests branch May 13, 2025 15:25
3 participants