-
Notifications
You must be signed in to change notification settings - Fork 2.3k
fix(image): implement caching and domain matching for image rewriting #6180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Whoa! Easy there, Partner!This PR is too big. Please break it up into smaller PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if req.Auth == nil && s.authStore != nil { | ||
| domain := extractDomainFromImage(imageName) | ||
| if cfg, ok := s.authStore.GetCRIConfig(domain); ok { | ||
| req.Auth = ToV1AuthConfig(&cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix PullImage auth lookup for normalized domains
PullImage now calls extractDomainFromImage and passes the normalized value directly to GetCRIConfig. For Docker Hub images this function always returns docker.io, while existing credentials are commonly keyed by the canonical endpoints (index.docker.io or registry-1.docker.io). After this change those entries will no longer be found and private pulls from Docker Hub will proceed without credentials. Consider keeping the original registry string or applying the same domain-matching logic used in rewriteImage before fetching the config.
Useful? React with 👍 / 👎.
dinoallo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use a LRU cache instead of simply evict cache entry based on their entry time
dinoallo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also cache eviction can also be optimized. Currently eviction has O(n) time complexity, if the size of cache is n.
e264ee7 to
9689300
Compare
26c58b7 to
4a64ebf
Compare
|
The following are the results of the four tests. The current image-cri-shim configmap configuration is as follows. The global rollback address is http://sealos.hub:5000. This image does not exist locally. 1、Execute the command to pull the image.
The cache was not hit, and the final pull address was index.docker.io/nowinkey/service-cornjob:v0.0.1, and the image was successfully pulled.
2、Execute the command to pull the image.
Cache hit.
3、Execute the command to pull the image.
4、Update the configmap and set an incorrect docker.io account.
Configmap hot update successful.
First, clean up the local image, then execute the command to pull the image.
|
2300c61 to
250c271
Compare
lifecycle/staging/src/github.com/labring/image-cri-shim/.github/workflows/close-pr.yml
Show resolved
Hide resolved
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
sealos/lifecycle/cmd/image-cri-shim/cmd/root_test.go
Lines 37 to 41 in 250c271
| func (f *fakeShim) Setup() error { return nil } | |
| func (f *fakeShim) Start() error { return nil } | |
| func (f *fakeShim) Stop() {} |
The unit test’s fakeShim only implements Setup, Start, Stop, and UpdateAuth (lines 37‑44), but the shim.Shim interface now also requires UpdateCache and CacheStats. As a result watchAuthConfig cannot be compiled for tests: *fakeShim does not implement shim.Shim (missing methods UpdateCache, CacheStats). Please extend the fake to implement the new methods (even with no‑ops) so go test ./lifecycle/cmd/image-cri-shim/cmd builds again.
sealos/lifecycle/cmd/image-cri-shim/cmd/root_test.go
Lines 74 to 77 in 250c271
| done := make(chan error, 1) | |
| go func() { | |
| done <- watchAuthConfig(ctx, cfgPath, shim, 10*time.Millisecond) | |
| }() |
watchAuthConfig now has a fifth parameter (the cache stats interval updater), but the test still calls it with the old four-argument signature (lines 74‑77). This causes a compile-time failure in root_test.go. Pass the new callback argument (e.g. nil for tests) so the package builds.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
lifecycle/staging/src/github.com/labring/image-cri-shim/pkg/server/cri_server_v1.go
Outdated
Show resolved
Hide resolved
250c271 to
d419a5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.
lifecycle/staging/src/github.com/labring/image-cri-shim/pkg/server/utils.go
Show resolved
Hide resolved
lifecycle/staging/src/github.com/labring/image-cri-shim/pkg/types/configmap_sync.go
Show resolved
Hide resolved
lifecycle/staging/src/github.com/labring/image-cri-shim/pkg/server/cri_server_v1_test.go
Outdated
Show resolved
Hide resolved
lifecycle/staging/src/github.com/labring/image-cri-shim/pkg/server/cri_server_v1.go
Show resolved
Hide resolved
d7873db to
1b6fe07
Compare
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for regDomain, cfg := range registries { | ||
| if strings.HasSuffix(domain, "."+regDomain) || strings.HasSuffix(regDomain, "."+domain) { | ||
| s.cacheDomainMatch(domain, regDomain) | ||
| cfgCopy := cfg | ||
| return regDomain, &cfgCopy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stop matching parent domains to child registry creds
In findMatchingRegistry you now treat a configured registry as a match not only when the image domain is equal or a subdomain (strings.HasSuffix(domain, "."+regDomain)) but also when the registry domain ends with the image domain (strings.HasSuffix(regDomain, "."+domain)). This second clause incorrectly matches parent domains to child registry entries. For example, if an operator only configured credentials for mirror.example.com and kubelet pulls example.com/nginx, the function will return the mirror credentials, cache that mapping, and every pull for example.com/* will be rewritten toward the wrong registry (and will usually fail auth). The cache also cements the wrong mapping so the shim never falls back to other registries. Please drop the strings.HasSuffix(regDomain, "."+domain) test so we only match when the image is an equal or more specific domain.
Useful? React with 👍 / 👎.
lifecycle/staging/src/github.com/labring/image-cri-shim/pkg/types/config.go
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 11 comments.
| if cfg.Cache.DomainCacheTTL.Duration != 15*time.Minute { | ||
| t.Fatalf("expected domain cache ttl 15m, got %s", cfg.Cache.DomainCacheTTL.Duration) | ||
| } | ||
| if cfg.Cache.DisableStats != true { |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition cfg.Cache.DisableStats != true uses explicit comparison with true. This should be simplified to !cfg.Cache.DisableStats for better readability and idiomatic Go style.
| if cfg.Cache.DisableStats != true { | |
| if !cfg.Cache.DisableStats { |
| s.cacheMutex.Unlock() | ||
| s.setMaxCacheSize(normalized.ImageCacheSize) |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Race condition: The function reads and writes the cache fields outside the lock. While s.cacheMutex.Lock() is called, it's released with s.cacheMutex.Unlock() before calling s.setMaxCacheSize(normalized.ImageCacheSize). If setMaxCacheSize accesses cache-related fields that other goroutines might read/modify, this could cause a race condition. Consider whether setMaxCacheSize should be called within the locked section or if it properly handles its own locking.
| s.cacheMutex.Unlock() | |
| s.setMaxCacheSize(normalized.ImageCacheSize) | |
| s.setMaxCacheSize(normalized.ImageCacheSize) | |
| s.cacheMutex.Unlock() |
| // domainCacheRatio determines the size of the domain cache as imageCacheSize/ratio. A ratio of | ||
| // 10 keeps domain cache small relative to image cache to cap memory use while still providing | ||
| // effective domain-level memoization; tweak if a different ratio is desired. | ||
| domainCacheRatio = 10 |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The constant domainCacheRatio is defined in cri_server_v1.go but never explicitly validated. If s.maxCacheSize is less than 10 (the ratio value), the domain cache size calculation at line 349 could result in a very small cache (potentially just 1), which might not be the intended behavior. Consider documenting this edge case or adding validation.
| func (s *v1ImageService) logRewriteResult(action, original, rewritten, source string, cacheHit bool, replaced bool) { | ||
| logger.Info("rewrite action=%s cache_hit=%t source=%s original=%s result=%s replaced=%t", | ||
| action, cacheHit, source, original, rewritten, replaced) |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logging statement uses action as a parameter, but in the context where it's called (line 402), if action is not sanitized or validated, it could potentially contain format specifiers that interfere with the log formatting. Consider using %s format specifier for the action parameter or ensuring action values are from a controlled set.
| case <-tick: | ||
| stats := imgShim.CacheStats() | ||
| logger.Info("cache stats: image_hits=%d image_misses=%d domain_hits=%d domain_misses=%d image_evictions=%d domain_evictions=%d invalidations=%d generated_at=%s", | ||
| stats.ImageHits, stats.ImageMisses, stats.DomainHits, stats.DomainMisses, stats.ImageEvictions, stats.DomainEvictions, stats.Invalidations, stats.GeneratedAt.Format(time.RFC3339)) | ||
| } |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The stats reporter goroutine started by startCacheStatsReporter may leak if the context is cancelled while a ticker operation is in progress. The case <-tick: branch doesn't check if the context is done before calling imgShim.CacheStats() and logging. Consider adding a context check at the beginning of the tick handling to ensure clean shutdown.
|
|
||
| func NewServer(options Options) (Server, error) { | ||
| if !filepath.IsAbs(options.Socket) { | ||
| return nil, fmt.Errorf("invalid socked") |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling: "socked" should be "socket" in the error message.
| return nil, fmt.Errorf("invalid socked") | |
| return nil, fmt.Errorf("invalid socket") |
| func (s *v1ImageService) maxDomainCacheSize() int { | ||
| if s.maxCacheSize <= 0 { | ||
| return 0 | ||
| } |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Magic number: The domain cache size calculation uses a hardcoded ratio of 10 without clear justification. While there is a comment explaining the ratio, the actual value should be defined as a named constant (which it is as domainCacheRatio), but the constant is defined far from where it's used in maxDomainCacheSize(). Consider adding a comment at the usage site referencing why this specific ratio was chosen.
| } | |
| } | |
| // Use a smaller domain cache to reduce memory usage; see domainCacheRatio definition for rationale. |
| if authStore != nil { | ||
| authStore.AddObserver(service.invalidateCache) | ||
| } | ||
| return service |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Potential nil pointer dereference: If s.authStore is nil when newV1ImageService is called (line 189), the observer is not added. However, later in rewriteImage (line 210), there's a nil check for s.authStore. This is good defensive programming, but the initialization logic in newV1ImageService should be consistent - either always require a non-nil authStore or handle the nil case throughout. Consider documenting whether nil authStore is a supported configuration.
| return | ||
| } | ||
| s.domainCache = domainCache |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent error handling: When lru.New() fails for the image cache (line 433-437), the code logs a warning and sets s.imageCache = nil. However, for the domain cache failure (line 446-450), it also sets s.domainCache = nil but then returns early. This means if the domain cache creation fails, the hadCache invalidation metric recording at line 453-455 is skipped. Consider making the error handling consistent for both cache creation failures.
| return | |
| } | |
| s.domainCache = domainCache | |
| } else { | |
| s.domainCache = domainCache | |
| } |
| } | ||
| ref, err := name.ParseReference(image) | ||
| if err != nil { | ||
| return normalizeDomainCandidate(registryFromImage(image)) |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function registryFromImage is called in extractDomainFromImage but is not defined in the visible code changes. This could cause a compilation error if this function doesn't exist elsewhere in the codebase.
bff2b2d to
148182d
Compare
… private registry protection, ConfigMap sync + hot-reload, and logging reduction\n\n- Keep original domain on first pull; fallback only if offline manifest exists\n- Skip rewrite for configured private registries (honor original domain)\n- Sync kube-system/image-cri-shim ConfigMap at startup and periodically; emit single-line summary (offline/private domains, cache)\n- Reduce noisy startup logs; demote auth/cache details to debug\n- Add failure logs for first attempt and fallback attempt to aid troubleshooting\n\nRef: PR labring#6180
148182d to
a3846b6
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #6180 +/- ##
=======================================
Coverage 61.86% 61.86%
=======================================
Files 8 8
Lines 653 653
=======================================
Hits 404 404
Misses 202 202
Partials 47 47 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
… private registry protection, ConfigMap sync + hot-reload, and logging reduction\n\n- Keep original domain on first pull; fallback only if offline manifest exists\n- Skip rewrite for configured private registries (honor original domain)\n- Sync kube-system/image-cri-shim ConfigMap at startup and periodically; emit single-line summary (offline/private domains, cache)\n- Reduce noisy startup logs; demote auth/cache details to debug\n- Add failure logs for first attempt and fallback attempt to aid troubleshooting\n\nRef: PR labring#6180
Signed-off-by: cuisongliu <cuisongliu@qq.com>
…hing Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
…oad support Signed-off-by: cuisongliu <cuisongliu@qq.com>
…oad support Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
…service Signed-off-by: cuisongliu <cuisongliu@qq.com>
…arity Signed-off-by: cuisongliu <cuisongliu@qq.com>
…mage service Signed-off-by: cuisongliu <cuisongliu@qq.com>
…nd improve case sensitivity Signed-off-by: cuisongliu <cuisongliu@qq.com>
…on failure; exact domain matching with caching (#900) * fix(image-cri-shim): prefer original registry; fallback offline only on failure; exact domain matching with caching * lifecycle: remove staging entrypoint for image-cri-shim and unify to CLI build; no functional change.
5828c62 to
c946791
Compare







No description provided.