Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache Manifest files #787

Merged
merged 5 commits into from
Sep 10, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 11 additions & 3 deletions pyiceberg/table/snapshots.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
import time
from collections import defaultdict
from enum import Enum
from functools import lru_cache
from typing import TYPE_CHECKING, Any, DefaultDict, Dict, Iterable, List, Mapping, Optional

from pydantic import Field, PrivateAttr, model_serializer
Expand Down Expand Up @@ -230,6 +231,13 @@ def __eq__(self, other: Any) -> bool:
)


@lru_cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are your thoughts on this, wouldn't be clearer if we could explicitly pass the max size for the lru_cache annotation, to store max entries rather than going default.
@lru_cache(maxsize=128)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the default is 128, I don't think we need to explicitly define maxsize=128.

Also, _manifests() is not a public API, so we would probably not set the cache size through user config via pyiceberg.yaml. Is there a use-case where the user would want to set the cache size?

Copy link
Contributor

@MehulBatra MehulBatra Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree the default size is what you mentioned, It was just an example, but the idea was to make it more readable & configurable.
Please feel free to take your call.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 128 big enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 128 should be enough. We're caching manifest_list here and the number of manifest lists should be <100 in most cases.
We can always make this configurable in the future :)

def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
"""Return the manifests from the manifest list."""
file = io.new_input(manifest_list)
return list(read_manifest_list(file))


class Snapshot(IcebergBaseModel):
snapshot_id: int = Field(alias="snapshot-id")
parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None)
Expand All @@ -250,9 +258,9 @@ def __str__(self) -> str:
return result_str

def manifests(self, io: FileIO) -> List[ManifestFile]:
if self.manifest_list is not None:
file = io.new_input(self.manifest_list)
return list(read_manifest_list(file))
"""Return the manifests for the given snapshot."""
if self.manifest_list:
return _manifests(io, self.manifest_list)
return []


Expand Down