This repository was archived by the owner on Sep 24, 2025. It is now read-only.

tricktron

Summary

Fix infinite sync failure loops that occur when managed namespaces are deleted. The fix implements automatic namespace validation and cleanup during cluster cache synchronization.

Fixes: argoproj/argo-cd#24709

Problem

When namespaces managed by ArgoCD are deleted without first removing the managed-by label, the GitOps Engine enters an infinite failure loop during cluster cache sync operations. The processApi() function attempts to list resources in deleted namespaces, resulting in 403 Forbidden errors from the Kubernetes API. This causes:

  • Complete sync failures every 10 minutes (default cache sync interval)
  • ArgoCD becomes unresponsive until manual controller restart
  • No automatic recovery mechanism exists

Root Cause: The sync() process iterates over the c.namespaces slice, which can still contain the names of deleted namespaces, and never checks that a namespace exists before issuing API calls against it.
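The failure mode can be illustrated with a minimal, standard-library-only sketch. It is not the gitops-engine code: the `cluster` type, its `live` map, `listResources`, and `errForbidden` are hypothetical stand-ins for the cluster cache, the API server state, the namespaced list call, and the 403 response. The point is only that a stale entry in `namespaces` makes every sync attempt fail in the same way.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden is a hypothetical stand-in for the 403 Forbidden
// error returned by the Kubernetes API.
var errForbidden = errors.New("403 Forbidden")

// cluster is a simplified, hypothetical model of the cluster cache.
type cluster struct {
	namespaces []string        // configured (managed) namespaces
	live       map[string]bool // namespaces that currently exist
}

// listResources simulates a namespaced list call that fails
// once the namespace has been deleted.
func (c *cluster) listResources(ns string) error {
	if !c.live[ns] {
		return fmt.Errorf("list resources in %q: %w", ns, errForbidden)
	}
	return nil
}

// sync mirrors the unvalidated loop: the first failing namespace
// aborts the whole sync, and nothing ever removes it from the list.
func (c *cluster) sync() error {
	for _, ns := range c.namespaces {
		if err := c.listResources(ns); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c := &cluster{
		namespaces: []string{"default", "team-a"},
		live:       map[string]bool{"default": true}, // team-a was deleted
	}
	// Every retry hits the same stale entry and fails again.
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Printf("sync attempt %d: %v\n", attempt, c.sync())
	}
}
```

Because `namespaces` is never updated, retrying cannot help; only a restart with a corrected configuration breaks the loop in this model.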

Solution

Implement namespace validation with automatic cleanup:

Key Changes

  1. namespaceExists() function - Validates namespace existence using canonical apierrors.IsNotFound() detection
  2. Enhanced processApi() - Skips deleted namespaces during resource processing and records them using thread-safe tracking
  3. Post-sync cleanup in sync() - Removes deleted namespaces from the configuration after parallel processing completes
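The three changes above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in pkg/cache/cluster.go: the `cluster` fields, `getNamespace`, `processNamespace`, and the `errNotFound` sentinel are hypothetical, and `errors.Is(err, errNotFound)` stands in for the `apierrors.IsNotFound()` check so that the example runs with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotFound is a hypothetical stand-in for an error that
// apierrors.IsNotFound() would classify as "not found".
var errNotFound = errors.New("namespace not found")

// cluster is a simplified, hypothetical model of the cluster cache.
type cluster struct {
	namespaces []string        // configured (managed) namespaces
	live       map[string]bool // simulated API server state

	mu      sync.Mutex
	deleted map[string]struct{} // namespaces found missing during this sync
}

// getNamespace simulates a GET on the namespace object.
func (c *cluster) getNamespace(ns string) error {
	if !c.live[ns] {
		return fmt.Errorf("get namespace %q: %w", ns, errNotFound)
	}
	return nil
}

// namespaceExists reports false only for a not-found error;
// any other error is returned to the caller unchanged.
func (c *cluster) namespaceExists(ns string) (bool, error) {
	err := c.getNamespace(ns)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errNotFound):
		return false, nil
	default:
		return false, err
	}
}

// processNamespace skips a missing namespace and records it under
// the mutex instead of failing the whole sync.
func (c *cluster) processNamespace(ns string) error {
	exists, err := c.namespaceExists(ns)
	if err != nil {
		return err
	}
	if !exists {
		c.mu.Lock()
		c.deleted[ns] = struct{}{}
		c.mu.Unlock()
		return nil
	}
	// Listing and caching the namespace's resources would happen here.
	return nil
}

// sync processes namespaces in parallel, then prunes the deleted
// ones once all goroutines have finished.
func (c *cluster) sync() error {
	c.deleted = map[string]struct{}{}
	errs := make([]error, len(c.namespaces))
	var wg sync.WaitGroup
	for i, ns := range c.namespaces {
		wg.Add(1)
		go func(i int, ns string) {
			defer wg.Done()
			errs[i] = c.processNamespace(ns)
		}(i, ns)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	// Post-sync cleanup: keep only namespaces that still exist.
	kept := make([]string, 0, len(c.namespaces))
	for _, ns := range c.namespaces {
		if _, gone := c.deleted[ns]; !gone {
			kept = append(kept, ns)
		}
	}
	c.namespaces = kept
	return nil
}

func main() {
	c := &cluster{
		namespaces: []string{"default", "team-a"},
		live:       map[string]bool{"default": true}, // team-a was deleted
	}
	fmt.Println(c.sync(), c.namespaces)
}
```

Deferring the removal until after `wg.Wait()` means the slice is never modified while goroutines are still ranging over its entries; the mutex only has to guard the `deleted` set.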

I also added a test for this scenario, TestSyncWithDeletedNamespace, and added the default namespace to the other tests so that they keep passing.

@tricktron tricktron requested a review from a team as a code owner September 23, 2025 11:41
@tricktron tricktron force-pushed the fix-cluster-sync-if-ns-has-been-been-deleted branch from 093236d to 57775ce Compare September 23, 2025 11:51

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.63%. Comparing base (8849c3f) to head (0b5d0a1).
⚠️ Report is 62 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/cache/cluster.go | 90.00% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #785      +/-   ##
==========================================
- Coverage   54.26%   47.63%   -6.63%     
==========================================
  Files          64       64              
  Lines        6164     6627     +463     
==========================================
- Hits         3345     3157     -188     
- Misses       2549     3212     +663     
+ Partials      270      258      -12     

Implement automatic detection and removal of deleted namespaces to
prevent infinite sync failure loops when namespaces are deleted without
removing the managed-by label first.

Signed-off-by: Thibault Gagnaux <thibault.gagnaux@bit.admin.ch>
@tricktron tricktron force-pushed the fix-cluster-sync-if-ns-has-been-been-deleted branch from 57775ce to 0b5d0a1 Compare September 23, 2025 19:08
Development

Successfully merging this pull request may close these issues.

Cluster Cache Sync Fails When Managed Namespaces are Deleted Without Label Removal