Skip to content

Conversation

@cuisongliu
Copy link
Collaborator

No description provided.

@cuisongliu cuisongliu requested review from a team as code owners November 13, 2025 06:19
@pull-request-size
Copy link

Whoa! Easy there, Partner!

This PR is too big. Please break it up into smaller PRs.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 323 to 519
if req.Auth == nil && s.authStore != nil {
domain := extractDomainFromImage(imageName)
if cfg, ok := s.authStore.GetCRIConfig(domain); ok {
req.Auth = ToV1AuthConfig(&cfg)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fix PullImage auth lookup for normalized domains

PullImage now calls extractDomainFromImage and passes the normalized value directly to GetCRIConfig. For Docker Hub images this function always returns docker.io, while existing credentials are commonly keyed by the canonical endpoints (index.docker.io or registry-1.docker.io). After this change those entries will no longer be found and private pulls from Docker Hub will proceed without credentials. Consider keeping the original registry string or applying the same domain-matching logic used in rewriteImage before fetching the config.

Useful? React with 👍 / 👎.

@dinoallo dinoallo requested a review from Copilot November 13, 2025 07:38
Copilot finished reviewing on behalf of dinoallo November 13, 2025 07:43

This comment was marked as resolved.

Copy link
Collaborator

@dinoallo dinoallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use a LRU cache instead of simply evict cache entry based on their entry time

Copy link
Collaborator

@dinoallo dinoallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cache eviction can also be optimized. Currently eviction has O(n) time complexity, if the size of cache is n.

@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from e264ee7 to 9689300 Compare November 13, 2025 15:53
@cuisongliu cuisongliu added this to the v5.2 milestone Nov 13, 2025
@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from 26c58b7 to 4a64ebf Compare November 14, 2025 09:31
@nowinkeyy
Copy link
Contributor

The following are the results of the four tests.

The current image-cri-shim configmap configuration is as follows. The global rollback address is http://sealos.hub:5000.

data:
  registries.yaml: |
    version: v1
    sealos:
      address: http://sealos.hub:5000
      auth:
        username: admin
        password: xxx

    registries:
      - address: https://ghcr.io
        auth:
          username: your-ghcr-user
          password: your-ghcr-token
      - address: https://docker.io
        auth: {}

    reloadInterval: 15s
    force: false
    debug: true
    timeout: 15m
    cache:
      imageCacheSize: 1024
      imageCacheTTL: 30m
      domainCacheTTL: 10m
      statsLogInterval: 1m
      disableStats: false

This image does not exist locally.
image

1、Execute the command to pull the image.

crictl --image-endpoint unix:///var/run/image-cri-shim.sock pull docker.io/nowinkey/service-cornjob:v0.0.1
image image The cache was not hit, and the final pull address was index.docker.io/nowinkey/service-cornjob:v0.0.1, and the image was successfully pulled.

2、Execute the command to pull the image.

crictl --image-endpoint unix:///var/run/image-cri-shim.sock pull docker.io/nowinkey/service-cornjob:v0.0.1
image Cache hit.

3、Execute the command to pull the image.

crictl --image-endpoint unix:///var/run/image-cri-shim.sock pull query.io/nowinkey/service-cornjob:v0.0.1
image image The cache was not hit. First, it tried accessing http://sealos.hub:5000/v2/nowinkey/service-cornjob/manifests/v0.0.1, but couldn't find the image. Finally, it rewrote the address to index.docker.io/nowinkey/service-cornjob:v0.0.1 and successfully pulled the image.

4、Update the configmap and set an incorrect docker.io account.

    registries:
      - address: https://ghcr.io
        auth:
          username: your-ghcr-user
          password: your-ghcr-token
      - address: https://docker.io
        auth: 
          username: your-docker-user
          password: your-docker-token
image Configmap hot update successful.

First, clean up the local image, then execute the command to pull the image.

crictl --image-endpoint unix:///var/run/image-cri-shim.sock pull docker.io/nowinkey/service-cornjob:v0.0.1
image image The cache was not hit. The docker.io address was matched first, then a 401 error was reported. Finally, the image was successfully pulled using the original address.

@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch 2 times, most recently from 2300c61 to 250c271 Compare November 15, 2025 14:57
@cuisongliu
Copy link
Collaborator Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

func (f *fakeShim) Setup() error { return nil }
func (f *fakeShim) Start() error { return nil }
func (f *fakeShim) Stop() {}

P1 Badge Fake shim no longer satisfies shim.Shim

The unit test’s fakeShim only implements Setup, Start, Stop, and UpdateAuth (lines 37‑44), but the shim.Shim interface now also requires UpdateCache and CacheStats. As a result watchAuthConfig cannot be compiled for tests: *fakeShim does not implement shim.Shim (missing methods UpdateCache, CacheStats). Please extend the fake to implement the new methods (even with no‑ops) so go test ./lifecycle/cmd/image-cri-shim/cmd builds again.


done := make(chan error, 1)
go func() {
done <- watchAuthConfig(ctx, cfgPath, shim, 10*time.Millisecond)
}()

P1 Badge Test call to watchAuthConfig missing stats callback

watchAuthConfig now has a fifth parameter (the cache stats interval updater), but the test still calls it with the old four-argument signature (lines 74‑77). This causes a compile-time failure in root_test.go. Pass the new callback argument (e.g. nil for tests) so the package builds.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@cuisongliu cuisongliu requested a review from Copilot November 17, 2025 04:10
@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from 250c271 to d419a5e Compare November 17, 2025 04:13
Copilot finished reviewing on behalf of cuisongliu November 17, 2025 04:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.

@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from d7873db to 1b6fe07 Compare November 17, 2025 06:03
@cuisongliu cuisongliu requested a review from Copilot November 17, 2025 06:04
@cuisongliu
Copy link
Collaborator Author

@codex review

Copilot finished reviewing on behalf of cuisongliu November 17, 2025 06:08
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 390 to 394
for regDomain, cfg := range registries {
if strings.HasSuffix(domain, "."+regDomain) || strings.HasSuffix(regDomain, "."+domain) {
s.cacheDomainMatch(domain, regDomain)
cfgCopy := cfg
return regDomain, &cfgCopy

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop matching parent domains to child registry creds

In findMatchingRegistry you now treat a configured registry as a match not only when the image domain is equal or a subdomain (strings.HasSuffix(domain, "."+regDomain)) but also when the registry domain ends with the image domain (strings.HasSuffix(regDomain, "."+domain)). This second clause incorrectly matches parent domains to child registry entries. For example, if an operator only configured credentials for mirror.example.com and kubelet pulls example.com/nginx, the function will return the mirror credentials, cache that mapping, and every pull for example.com/* will be rewritten toward the wrong registry (and will usually fail auth). The cache also cements the wrong mapping so the shim never falls back to other registries. Please drop the strings.HasSuffix(regDomain, "."+domain) test so we only match when the image is an equal or more specific domain.

Useful? React with 👍 / 👎.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 11 comments.

if cfg.Cache.DomainCacheTTL.Duration != 15*time.Minute {
t.Fatalf("expected domain cache ttl 15m, got %s", cfg.Cache.DomainCacheTTL.Duration)
}
if cfg.Cache.DisableStats != true {
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition cfg.Cache.DisableStats != true uses explicit comparison with true. This should be simplified to !cfg.Cache.DisableStats for better readability and idiomatic Go style.

Suggested change
if cfg.Cache.DisableStats != true {
if !cfg.Cache.DisableStats {

Copilot uses AI. Check for mistakes.
Comment on lines +200 to +201
s.cacheMutex.Unlock()
s.setMaxCacheSize(normalized.ImageCacheSize)
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition: The function reads and writes the cache fields outside the lock. While s.cacheMutex.Lock() is called, it's released with s.cacheMutex.Unlock() before calling s.setMaxCacheSize(normalized.ImageCacheSize). If setMaxCacheSize accesses cache-related fields that other goroutines might read/modify, this could cause a race condition. Consider whether setMaxCacheSize should be called within the locked section or if it properly handles its own locking.

Suggested change
s.cacheMutex.Unlock()
s.setMaxCacheSize(normalized.ImageCacheSize)
s.setMaxCacheSize(normalized.ImageCacheSize)
s.cacheMutex.Unlock()

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +101
// domainCacheRatio determines the size of the domain cache as imageCacheSize/ratio. A ratio of
// 10 keeps domain cache small relative to image cache to cap memory use while still providing
// effective domain-level memoization; tweak if a different ratio is desired.
domainCacheRatio = 10
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The constant domainCacheRatio is defined in cri_server_v1.go but never explicitly validated. If s.maxCacheSize is less than 10 (the ratio value), the domain cache size calculation at line 349 could result in a very small cache (potentially just 1), which might not be the intended behavior. Consider documenting this edge case or adding validation.

Copilot uses AI. Check for mistakes.
Comment on lines +401 to +388
func (s *v1ImageService) logRewriteResult(action, original, rewritten, source string, cacheHit bool, replaced bool) {
logger.Info("rewrite action=%s cache_hit=%t source=%s original=%s result=%s replaced=%t",
action, cacheHit, source, original, rewritten, replaced)
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging statement uses action as a parameter, but in the context where it's called (line 402), if action is not sanitized or validated, it could potentially contain format specifiers that interfere with the log formatting. Consider using %s format specifier for the action parameter or ensuring action values are from a controlled set.

Copilot uses AI. Check for mistakes.
Comment on lines +222 to +226
case <-tick:
stats := imgShim.CacheStats()
logger.Info("cache stats: image_hits=%d image_misses=%d domain_hits=%d domain_misses=%d image_evictions=%d domain_evictions=%d invalidations=%d generated_at=%s",
stats.ImageHits, stats.ImageMisses, stats.DomainHits, stats.DomainMisses, stats.ImageEvictions, stats.DomainEvictions, stats.Invalidations, stats.GeneratedAt.Format(time.RFC3339))
}
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The stats reporter goroutine started by startCacheStatsReporter may leak if the context is cancelled while a ticker operation is in progress. The case <-tick: branch doesn't check if the context is done before calling imgShim.CacheStats() and logging. Consider adding a context check at the beginning of the tick handling to ensure clean shutdown.

Copilot uses AI. Check for mistakes.

func NewServer(options Options) (Server, error) {
if !filepath.IsAbs(options.Socket) {
return nil, fmt.Errorf("invalid socked")
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: "socked" should be "socket" in the error message.

Suggested change
return nil, fmt.Errorf("invalid socked")
return nil, fmt.Errorf("invalid socket")

Copilot uses AI. Check for mistakes.
func (s *v1ImageService) maxDomainCacheSize() int {
if s.maxCacheSize <= 0 {
return 0
}
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Magic number: The domain cache size calculation uses a hardcoded ratio of 10 without clear justification. While there is a comment explaining the ratio, the actual value should be defined as a named constant (which it is as domainCacheRatio), but the constant is defined far from where it's used in maxDomainCacheSize(). Consider adding a comment at the usage site referencing why this specific ratio was chosen.

Suggested change
}
}
// Use a smaller domain cache to reduce memory usage; see domainCacheRatio definition for rationale.

Copilot uses AI. Check for mistakes.
Comment on lines 189 to 192
if authStore != nil {
authStore.AddObserver(service.invalidateCache)
}
return service
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Potential nil pointer dereference: If s.authStore is nil when newV1ImageService is called (line 189), the observer is not added. However, later in rewriteImage (line 210), there's a nil check for s.authStore. This is good defensive programming, but the initialization logic in newV1ImageService should be consistent - either always require a non-nil authStore or handle the nil case throughout. Consider documenting whether nil authStore is a supported configuration.

Copilot uses AI. Check for mistakes.
Comment on lines +450 to +437
return
}
s.domainCache = domainCache
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent error handling: When lru.New() fails for the image cache (line 433-437), the code logs a warning and sets s.imageCache = nil. However, for the domain cache failure (line 446-450), it also sets s.domainCache = nil but then returns early. This means if the domain cache creation fails, the hadCache invalidation metric recording at line 453-455 is skipped. Consider making the error handling consistent for both cache creation failures.

Suggested change
return
}
s.domainCache = domainCache
} else {
s.domainCache = domainCache
}

Copilot uses AI. Check for mistakes.
}
ref, err := name.ParseReference(image)
if err != nil {
return normalizeDomainCandidate(registryFromImage(image))
Copy link

Copilot AI Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function registryFromImage is called in extractDomainFromImage but is not defined in the visible code changes. This could cause a compilation error if this function doesn't exist elsewhere in the codebase.

Copilot uses AI. Check for mistakes.
@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from bff2b2d to 148182d Compare November 19, 2025 07:14
nowinkeyy added a commit to nowinkeyy/sealos that referenced this pull request Nov 25, 2025
… private registry protection, ConfigMap sync + hot-reload, and logging reduction\n\n- Keep original domain on first pull; fallback only if offline manifest exists\n- Skip rewrite for configured private registries (honor original domain)\n- Sync kube-system/image-cri-shim ConfigMap at startup and periodically; emit single-line summary (offline/private domains, cache)\n- Reduce noisy startup logs; demote auth/cache details to debug\n- Add failure logs for first attempt and fallback attempt to aid troubleshooting\n\nRef: PR labring#6180
@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from 148182d to a3846b6 Compare November 25, 2025 11:47
@codecov
Copy link

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.86%. Comparing base (2df0bdf) to head (a3846b6).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6180   +/-   ##
=======================================
  Coverage   61.86%   61.86%           
=======================================
  Files           8        8           
  Lines         653      653           
=======================================
  Hits          404      404           
  Misses        202      202           
  Partials       47       47           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

nowinkeyy added a commit to nowinkeyy/sealos that referenced this pull request Nov 26, 2025
… private registry protection, ConfigMap sync + hot-reload, and logging reduction\n\n- Keep original domain on first pull; fallback only if offline manifest exists\n- Skip rewrite for configured private registries (honor original domain)\n- Sync kube-system/image-cri-shim ConfigMap at startup and periodically; emit single-line summary (offline/private domains, cache)\n- Reduce noisy startup logs; demote auth/cache details to debug\n- Add failure logs for first attempt and fallback attempt to aid troubleshooting\n\nRef: PR labring#6180
cuisongliu and others added 13 commits November 26, 2025 20:22
Signed-off-by: cuisongliu <cuisongliu@qq.com>
…hing

Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
…oad support

Signed-off-by: cuisongliu <cuisongliu@qq.com>
…oad support

Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
Signed-off-by: cuisongliu <cuisongliu@qq.com>
…service

Signed-off-by: cuisongliu <cuisongliu@qq.com>
…arity

Signed-off-by: cuisongliu <cuisongliu@qq.com>
…mage service

Signed-off-by: cuisongliu <cuisongliu@qq.com>
…nd improve case sensitivity

Signed-off-by: cuisongliu <cuisongliu@qq.com>
…on failure; exact domain matching with caching (#900)

* fix(image-cri-shim): prefer original registry; fallback offline only on failure; exact domain matching with caching

* lifecycle: remove staging entrypoint for image-cri-shim and unify to CLI build; no functional change.
@cuisongliu cuisongliu force-pushed the fix-image-cri-shim-cache branch from 5828c62 to c946791 Compare November 26, 2025 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants