
[backport v2.10] [SURE-9460] Fleet not picking up gitrepo updates, no job created to update #3252

Open
rancherbot opened this issue Jan 24, 2025 · 1 comment


@rancherbot (Collaborator)

This is a backport issue for #3138, automatically created via a GitHub Actions workflow initiated by @0xavi0.

Original issue body:

SURE-9460

Issue description

After upgrading Rancher to 2.9.3 / Fleet to v0.10.4, some GitRepos no longer receive updates. The customer updates the repository, but the changes are not pushed to the clusters, and no Job is created to pull in the changes that should be tracked by the GitRepo.

In Fleet v0.10.4, changes were made to how jobs are managed. Could these changes be the cause of the issue here? #2932 seems to change how jobs are managed.

Business impact:

Applications that use Fleet for continuous delivery no longer receive updates.

Troubleshooting steps:

The GitJob pod's logs do not show jobs completing for the affected GitRepos, and we are also unable to find jobs for those GitRepos.
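
A minimal sketch of CLI checks that could help here, assuming a standard Rancher Fleet install (gitjob controller in cattle-fleet-system, GitRepos in fleet-default):

# Check the gitjob controller logs for errors around the affected GitRepos
kubectl logs -n cattle-fleet-system deployment/gitjob

# List git clone jobs; Fleet names them after the GitRepo they belong to
kubectl get jobs -n fleet-default | grep <gitrepo-name>

# Inspect the commit and polling timestamp recorded on the GitRepo (status fields referenced in the workaround notes below)
kubectl get gitrepo <gitrepo-name> -n fleet-default -o jsonpath='{.status.commit}{"\n"}{.status.lastPollingTriggered}{"\n"}'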

Repro steps:

Upgrade Rancher from 2.9.2 to 2.9.3.

Workaround:

Is a workaround available and implemented? Yes.
What is the workaround:
The customer found that editing a GitRepo in the Rancher UI, changing nothing, and then saving it eventually causes the repo to pull the change and make the necessary updates (a kubectl sketch of an equivalent no-op edit follows below).

When saving in this way, a couple of lines change within the GitRepo:
  • spec.correctDrift: {} is added
  • status.commit is updated
  • status.lastPollingTriggered is updated (the time changes by more than a day)
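
For reference, a kubectl sketch of an equivalent no-op edit from the CLI, assuming the GitRepo lives in the fleet-default namespace; the field mirrors the spec change the UI makes, listed above:

# Add spec.correctDrift: {} the same way the UI edit does; if the field is already set,
# change any other innocuous spec field instead so the controller sees a spec update.
kubectl patch gitrepo <gitrepo-name> -n fleet-default --type=merge -p '{"spec":{"correctDrift":{}}}'

Afterwards, status.commit and status.lastPollingTriggered should eventually move, as described for the UI workaround.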

Actual behavior:

Repositories are not updated.

Expected behavior:

Repositories are updated.

rancherbot added this to the v2.10.3 milestone Jan 24, 2025
rancherbot added this to Fleet Jan 24, 2025
github-project-automation bot moved this to 🆕 New in Fleet Jan 24, 2025
0xavi0 added a commit to 0xavi0/fleet that referenced this issue Jan 27, 2025
Port of: rancher#3239
Refers to rancher#3252

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
0xavi0 added a commit that referenced this issue Jan 27, 2025
Port of: #3239
Refers to #3252

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
0xavi0 added a commit that referenced this issue Jan 28, 2025
Port of: #3239
Refers to #3252

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
@mmartin24 (Collaborator) commented Jan 28, 2025

Tested in Rancher 2.10 with hotfix v0.11.3-hotfix-ch.1.3afc03bd and compared against a non-fixed version (Rancher 2.10.1 with fleet:v0.11.3), with upstream and downstream clusters using k3s.


Test steps

  • Deployed Rancher 2.10.0 with 1 downstream cluster and later updated the gitjob deployment with the hotfix image:
export TAG=v0.11.3-hotfix-ch.1.3afc03bd
kubectl set image -n cattle-fleet-system deployment/gitjob "*=rancher/fleet:$TAG-linux-amd64"
  • Fleet is set with 50 workers for gitrepo, bundle, and bundledeployment
  • Deployed 30 gitrepos with simple-chart: https://github.com/mmartin24/test-fleet/tree/test30/30gitrepos (thanks @0xavi0 for the examples); a scripted sketch follows after these steps
  • Noted down the commit hash
  • Updated a value (for instance, the name) in the fleet.yaml file of the 30 gitrepos at the same time
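
A scripted sketch of how the 30 GitRepos could be created in one go; the resource names, branch, and paths are assumptions based on the linked test repo, not the exact manifests used:

for i in $(seq -w 1 30); do
cat <<EOF | kubectl apply -f -
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: simple-chart-0$i        # e.g. simple-chart-001 ... simple-chart-030 (naming is an assumption)
  namespace: fleet-default
spec:
  repo: https://github.com/mmartin24/test-fleet
  branch: test30
  paths:
  - 30gitrepos/simple-chart-0$i  # assumed path layout inside the test repo
EOF
done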

Observations

  • The hash is correctly updated on all gitrepos in the fixed version, while some gitrepos lag behind in the non-fixed version (2.10.1), so the fix seems to work. See the screenshot of the fixed version (left) vs the non-fixed version (right), where some hashes belonging to the previous commit remain. A CLI check for the recorded commits is sketched after these observations.

[Screenshot: fixed version (left) vs non-fixed version (right)]

  • Some jobs fail more often in the fixed version than in the non-fixed version. The log error is a TLS handshake timeout. I discussed this with @0xavi0, and it appears to be because commits are picked up more quickly in the fixed version and GitHub applies some kind of rate limiting, so I understand this "side effect" is outside the scope of validating this fix.


Error

simple-chart-010-745db-ss6lq gitcloner-initializer time="2025-01-28T15:03:08Z" level=fatal msg="Get \"https://github.com/mmartin24/test-fleet/info/refs?service=git-upload-pack\": net/http: TLS handshake timeout"   

I repeated the same test with 15 workers (1 per 2 gitrepos) and the results were similar, both for the validity of the fix and for the occasional job failures due to rate limiting.
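
For the hash checks mentioned above, a quick CLI alternative to the UI, assuming the GitRepos are in fleet-default:

# Print the commit Fleet has recorded for every GitRepo, to spot ones stuck on the previous hash
kubectl get gitrepo -n fleet-default -o custom-columns=NAME:.metadata.name,COMMIT:.status.commit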


Video with real-time check with and without fix here:

Screencast.from.2025-01-28.16-12-00.mov

kkaempf moved this from 🆕 New to 📋 Backlog in Fleet Jan 29, 2025
manno moved this from 📋 Backlog to Needs QA review in Fleet Feb 5, 2025
0xavi0 added a commit to 0xavi0/fleet that referenced this issue Feb 6, 2025
Refers to rancher#3252

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
0xavi0 added a commit that referenced this issue Feb 6, 2025
Refers to #3252

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>