Skip to content

Conversation

@arielshaqed
Copy link
Contributor

  • Hold commits map as a space-efficient arena map (h/t @nopcoder)
  • Reduce first-parents memory usage by changing it from map[string]bool to map[string]struct{}

Together (of course the first does almost all the work) this reduces memory usage by almost a half.

This reduces a third of memory - from ~9+ to ~6+ GiB in one case.
This saves ~40% memory when processing a huge repository.
@github-actions github-actions bot added the area/cataloger Improvements or additions to the cataloger label Dec 11, 2025
@arielshaqed arielshaqed linked an issue Dec 11, 2025 that may be closed by this pull request
@arielshaqed arielshaqed added performance include-changelog PR description should be included in next release changelog CGC labels Dec 11, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request reduces memory consumption in garbage collection preparation by replacing standard Go maps with space-efficient arena maps and changing map[string]bool to map[string]struct{}. The changes also optimize memory usage by storing commit creation times as int64 microseconds instead of time.Time objects.

  • Introduces arena.Map for storing commits, reducing memory footprint significantly
  • Changes firstParents from map[string]bool to a custom StringSet wrapping map[string]struct{}
  • Replaces time.Time with int64 microseconds for commit creation timestamps

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
pkg/graveler/retention/active_commits.go Replaces standard commit map with arena.Map, changes CommitNode to store creation time as int64 microseconds, updates Get() to return pointer, adds BinaryCommitID type for arena key conversion
pkg/graveler/ref/commit_ordered_iterator.go Introduces StringSet wrapper to replace map[string]bool with map[string]struct{} for memory efficiency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

CreationDate time.Time
MainParent graveler.CommitID
MetaRangeID graveler.MetaRangeID
// CreationSecs is in microseconds-since-epoch. It takes 3x less space than time.Time.
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "CreationSecs is in microseconds-since-epoch" but the field name is "CreationUsecs". The comment should say "CreationUsecs is in microseconds-since-epoch" to match the field name.

Suggested change
// CreationSecs is in microseconds-since-epoch. It takes 3x less space than time.Time.
// CreationUsecs is in microseconds-since-epoch. It takes 3x less space than time.Time.

Copilot uses AI. Check for mistakes.
func ToBinaryCommitID(hexID graveler.CommitID) BinaryCommitID {
b, err := hex.DecodeString(string(hexID))
if err != nil {
// This check should happen at compile time but cannot happen at
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "This check should happen at compile time but cannot happen at" and appears to be incomplete. It should be completed or clarified to explain what the check is for and why it cannot happen at compile time.

Suggested change
// This check should happen at compile time but cannot happen at
// Ideally, the validity and format of CommitID should be enforced at compile time,
// but since CommitID is a string and its value is only known at runtime, this check
// must happen here. If this panic occurs, it indicates a bug elsewhere in the code.

Copilot uses AI. Check for mistakes.
b, err := hex.DecodeString(string(hexID))
if err != nil {
// This check should happen at compile time but cannot happen at
panic(fmt.Sprintf("decode hex %s: %w", hexID, err))
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panic message uses %w format specifier which is for error wrapping with fmt.Errorf, but this is in fmt.Sprintf which doesn't support %w. This should use %s or %v instead to properly format the error.

Suggested change
panic(fmt.Sprintf("decode hex %s: %w", hexID, err))
panic(fmt.Sprintf("decode hex %s: %s", hexID, err))

Copilot uses AI. Check for mistakes.
Comment on lines +21 to +32
type nothing struct{}

type StringSet struct {
set map[string]nothing
}

func NewStringSet() *StringSet {
return &StringSet{set: make(map[string]nothing)}
}

func (s *StringSet) Set(str string) {
s.set[str] = nothing{}
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type name "nothing" is not descriptive or idiomatic for Go. The conventional and more clear name for an empty struct used as a placeholder is "struct{}" used inline, or if you need a named type, a more descriptive name like "empty" or "marker" would be better. However, the most idiomatic approach would be to use "struct{}" directly in the map definition without creating a separate type.

Suggested change
type nothing struct{}
type StringSet struct {
set map[string]nothing
}
func NewStringSet() *StringSet {
return &StringSet{set: make(map[string]nothing)}
}
func (s *StringSet) Set(str string) {
s.set[str] = nothing{}
type StringSet struct {
set map[string]struct{}
}
func NewStringSet() *StringSet {
return &StringSet{set: make(map[string]struct{})}
}
func (s *StringSet) Set(str string) {
s.set[str] = struct{}{}

Copilot uses AI. Check for mistakes.
Comment on lines 35 to 41
func ToBinaryCommitID(hexID graveler.CommitID) BinaryCommitID {
b, err := hex.DecodeString(string(hexID))
if err != nil {
// This check should happen at compile time but cannot happen at
panic(fmt.Sprintf("decode hex %s: %w", hexID, err))
}
return BinaryCommitID(b[:arena.KEY_SIZE_BOUND])
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ToBinaryCommitID function truncates commit IDs to arena.KEY_SIZE_BOUND (16 bytes), but commit IDs are SHA256 hashes which are 32 bytes when decoded from hex (64 hex characters as per commitIDStringLength constant). This truncation creates a significant risk of collisions - two different commit IDs could map to the same BinaryCommitID if they share the same first 16 bytes. This could lead to incorrect behavior where different commits are treated as the same commit in the arena map.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just a missing commit; of course I tested it with the correct value.

res := make(map[graveler.CommitID]graveler.MetaRangeID)
for commitID := range commitSet {
res[commitID] = commitNodes[commitID].MetaRangeID
res[commitID] = commitNodes.Get(ToBinaryCommitID(commitID)).MetaRangeID
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code assumes that Get will never return nil for commit IDs in the commitSet. If a commit ID exists in commitSet but is not found in commitNodes (which could happen due to the truncation issue in ToBinaryCommitID or other data consistency issues), this will panic when trying to access .MetaRangeID on a nil pointer. Consider adding a nil check and appropriate error handling.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

@arielshaqed arielshaqed Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done - the previous code would emit a blank MetaRangeID here. So it had the exact same failure mode, only it could only be discovered further downstream.
Of course not finding the commit node here indicates an internal consistency error - these are literally the same commits.

Discovered during manually running GC.
Flagged by copilot
It needs to use 256-bit commit IDs - this is now required by active_commits.
@github-actions github-actions bot added the area/testing Improvements or additions to tests label Dec 11, 2025
Copy link
Contributor

@nopcoder nopcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - nice work in reducing the memory footprint.

kept the comments on the data structures - left almost nothing to review.

left a small comment related to naming and extra code that I suggest to drop.

Comment on lines +18 to +37
firstParents *StringSet
}

type nothing struct{}

type StringSet struct {
set map[string]nothing
}

func NewStringSet() *StringSet {
return &StringSet{set: make(map[string]nothing)}
}

func (s *StringSet) Set(str string) {
s.set[str] = nothing{}
}

func (s *StringSet) Get(str string) bool {
_, ok := s.set[str]
return ok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use map[string]struct{}
No need to define a struct, hold a pointer to it, adding methods - over eng.

// arena.KEY_SIZE_BOUND bytes.
type BinaryCommitID string

func ToBinaryCommitID(hexID graveler.CommitID) BinaryCommitID {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to add MustXXX to reflect that this func panic in case of an error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cataloger Improvements or additions to the cataloger area/testing Improvements or additions to tests CGC include-changelog PR description should be included in next release changelog performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce prepare GC commits memory consumption

3 participants