snapshot-controller logs report failure frequently #748

ambiknai · 2022-08-12T11:54:49Z

What happened:

In our csi-driver snapshot controller logs shows below error. There is no impact on functionality. But there are too many of these errors.

snapshot controller failed to update ...xxxxx
the object has been modified; please apply your changes to the latest version and try again

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment: I tested in IKs 1.22,1.23,1.24

Driver version: Our csi driver is not GA
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:
gcr.io/k8s-staging-sig-storage/csi-snapshotter:v6.0.1
gcr.io/k8s-staging-sig-storage/snapshot-controller:v6.0.1

The text was updated successfully, but these errors were encountered:

xing-yang · 2022-08-12T14:28:02Z

@ggriffiths Can you take a look? These are from the "update" that are not replaced with "patch" in snapshot-controller.

ggriffiths · 2022-08-24T21:12:28Z

Yes, there are many spots where we still use "update" instead of "patch":

Update:

pkg/common-controller/snapshot_controller.go:1383:	newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1454:		updatedSnapshot, err = ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1523:	newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1618:	updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1660:	updatedSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshot.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:547:	updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:642:	updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})

UpdateStatus:

pkg/common-controller/snapshot_controller.go:813:	newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1218:		newSnapshotObj, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:403:	newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), content, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:459:		newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), contentClone, metav1.UpdateOptions{})

This error will still be hit in these scenarios. We reduced the major scenarios in #526, but there is more work to be done. I'm happy to review a PR for this work.

ggriffiths · 2022-08-24T21:12:41Z

/help

k8s-ci-robot · 2022-08-24T21:12:42Z

@ggriffiths:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

camartinez04 · 2022-08-29T22:51:42Z

I've sent a PR-757

ggriffiths · 2022-08-29T23:04:03Z

/assign @camartinez04

blackpiglet · 2022-10-11T08:16:46Z

Also hit this issue.

E1011 03:12:02.763356       1 snapshot_controller.go:993] checkandRemovePVCFinalizer [velero-kibishii-data-kibishii-deployment-0-tsdgw]: removePVCFinalizer failed to remove finalizer snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again
E1011 03:12:02.763396       1 snapshot_controller.go:189] error check and remove PVC finalizer for snapshot [velero-kibishii-data-kibishii-deployment-0-tsdgw]: snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again

ggriffiths · 2022-10-19T19:36:58Z

/unassign @camartinez04

k8s-triage-robot · 2023-02-08T09:28:38Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

shubham-pampattiwar · 2023-05-30T17:15:05Z

@ggriffiths @xing-yang If anyone is not actively working on this issue, I would like to work on it.

xing-yang · 2023-05-30T17:18:56Z

Hi @shubham-pampattiwar, that's great. Can you please assign this issue to yourself? Thanks.

xing-yang · 2023-05-30T17:19:12Z

/remove-lifecycle stale

shubham-pampattiwar · 2023-05-30T17:20:48Z

/assign

ggriffiths · 2023-10-31T02:09:07Z

/unassign @shubham-pampattiwar

phoenix-bjoern · 2023-11-06T15:36:37Z

Spottet this issue when using Velero 1.12.1 and CSI plugin 0.6.1 together with SC 6.3.1:

E1106 14:02:37.353831       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd]: error occurred in createSnapshotWrapper: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
E1106 14:02:37.353865       1 snapshot_controller_base.go:351] could not sync content "snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd": failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
I1106 14:02:37.353904       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd through the plugin ...
I1106 14:02:37.353931       1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd", UID:"194a4dc1-c11c-44d4-acb4-504a0fed9314", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"934203821", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"

As the logs indicate the patch rewrites are not in the 6.3.1 release yet. So please release them ASAP to close this issue.

phoenix-bjoern · 2023-11-06T22:35:14Z

I've cherry-picked commit ff71329 from main to the release-6.3 branch and created new images:

phoenixmedia/k8s-snapshot-controller:v6.3.1-patch1
phoenixmedia/k8s-csi-snapshotter:v6.3.1-patch1

I can confirm the patch rewrites solve this issue. IMHO the patch could be safely merged to the release-6.3 branch.

phoenix-bjoern · 2023-11-07T07:25:33Z

See #876 (comment)

hairyhum · 2023-11-29T17:59:21Z

We have also seen similar errors about snapshot status update, most likely coming from UpdateStatus mentioned above. I don't see them addressed in the PR, but I believe those are also important.

julienvincent · 2023-12-27T23:28:06Z

Any update on this? This happens pretty consistently for me. Running v6.3.3.

phoenix-bjoern · 2023-12-28T00:07:07Z

@julienvincent can you provide logs to understand the context?
How do you create snapshots? Often two or more processes interact with the VolumeSnapshot objects.

julienvincent · 2023-12-28T00:12:07Z

@phoenix-bjoern Sure! What kind of logs would you like? Happy to provide.

Similar to other users in this thread, I am using Velero@1.12.2 with the csi plugin at 0.6.2 and Longhorn@1.5.3. Velero is creating the VolumeSnapshots and I am getting the same error as reported in this issue.

For example:

  Error:
    Message:     Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad: "snapshot controller failed to update snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad\": the object has been modified; please apply your changes to the latest version and try again"
    Time:        2023-12-28T00:01:34Z
  Ready To Use:  false

Happy to provide any other information you need.

phoenix-bjoern · 2023-12-28T18:47:58Z

@julienvincent the snapshot controller log should have a trace for the error. Can you share it to identify the exact code lines?

julienvincent · 2023-12-30T12:43:22Z

@phoenix-bjoern sure, here is an example:

E1230 12:33:25.731663       1 snapshot_controller_base.go:470] could not sync snapshot "default/velero-registry-data-7737fc94-tn4m2": snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:
I1230 12:33:25.731729       1 snapshot_controller_base.go:288] Failed to sync snapshot "default/velero-registry-data-7737fc94-tn4m2", will retry again: snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:

phoenix-bjoern · 2024-01-03T19:03:53Z

Thanks @julienvincent for the additional context. The trace shows a storage error: StorageError: invalid object, Code: 4. It seems the snapshot controller doesn't have an issue. It only catches the exception thrown by Velero and/or the CSI driver. Maybe check vmware-tanzu/velero#6179 for more context.

julienvincent · 2024-01-04T13:39:33Z

@phoenix-bjoern I might be misunderstanding something but the linked issue doesn't seem directly related to this issue.

That issue is more about longhorn behaviour of not deleting it's internal snapshots when CSI volume snapshots are deleted. AFAIU Velero is not actually involved in the CSI snapshot process other than initially creating the VolumeSnapshot resource. I can reproduce this issue independently of Velero by manually creating a VolumeSnapshot which I think excludes Velero's involvement in this.

In this case the snapshots themselves are not being reported as successful (but the underlying driver is successfully performing a snapshot).

But if I understand what you are saying this error message is set on the VolumeSnapshot resource by the driver (longhorn) and external-snapshotter is just reporting/relaying it?

Would you recommend opening an issue with longhorn?

phoenix-bjoern · 2024-01-04T17:50:09Z

@julienvincent The snapshot controller is only triggering a process which the storage driver then executes. Since the error seems to occur in the storage driver there is nothing you can in the snapshot controller.
So yes, it makes sense to continue investigating the issue in the Longhorn components.

kain88-de · 2024-05-15T10:32:19Z

Hi. We are running into the same issues. Have the UpdateStatus calls already been refactored to use apply instead of patch? If not I would like to help.

hoyho · 2024-06-18T03:09:04Z

Hi. We are running into the same issues. Have the UpdateStatus calls already been refactored to use apply instead of patch? If not I would like to help.

Having refactored last month.
However, some of tests failed since then. So I took several days to fix them until now.

SCLogo · 2024-08-02T12:33:01Z

@hoyho did you have time to finish ? do you know any ETA? thanks

hoyho · 2024-08-02T13:01:36Z

@hoyho did you have time to finish ? do you know any ETA? thanks

No problems. Probably will do it next week

xing-yang assigned ggriffiths Aug 12, 2022

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 24, 2022

ggriffiths removed their assignment Aug 24, 2022

k8s-ci-robot assigned camartinez04 Aug 29, 2022

k8s-ci-robot unassigned camartinez04 Oct 19, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2023

k8s-ci-robot assigned shubham-pampattiwar May 30, 2023

shubham-pampattiwar mentioned this issue Jul 12, 2023

Update VolumeSnapshot and VolumeSnapshotContent using JSON patch #876

Merged

k8s-ci-robot unassigned shubham-pampattiwar Oct 31, 2023

hairyhum mentioned this issue Dec 1, 2023

Retry on transient error when waiting for snapshot to be ready kanisterio/kanister#2508

Merged

10 tasks

phoenix-bjoern mentioned this issue Dec 7, 2023

Update VolumeSnapshot and VolumeSnapshotContent using JSON patch #974

Merged

kaovilai mentioned this issue Dec 11, 2023

Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content vmware-tanzu/velero#7197

Closed

This was referenced Feb 22, 2024

Use patch instead of update for GroupSnapshots, VolumeSnapshots, PVCs #1019

Closed

Use Patch to add finalizers, Get updated objects when removing finalizer with update fails with conflict. #1023

Closed

Test framework improvements #1024

Closed

kaovilai mentioned this issue Mar 8, 2024

VolumeSnapshotContents are retained after VolumeSnapshots have been deleted vmware-tanzu/velero#7511

Closed

hoyho linked a pull request Jun 18, 2024 that will close this issue

refactor controllers to use patch to update snapshot status #1110

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

snapshot-controller logs report failure frequently #748

snapshot-controller logs report failure frequently #748

ambiknai commented Aug 12, 2022

xing-yang commented Aug 12, 2022

ggriffiths commented Aug 24, 2022 •

edited

Loading

ggriffiths commented Aug 24, 2022

k8s-ci-robot commented Aug 24, 2022

camartinez04 commented Aug 29, 2022

ggriffiths commented Aug 29, 2022

blackpiglet commented Oct 11, 2022

ggriffiths commented Oct 19, 2022

k8s-triage-robot commented Feb 8, 2023

shubham-pampattiwar commented May 30, 2023

xing-yang commented May 30, 2023

xing-yang commented May 30, 2023

shubham-pampattiwar commented May 30, 2023

ggriffiths commented Oct 31, 2023

phoenix-bjoern commented Nov 6, 2023

phoenix-bjoern commented Nov 6, 2023

phoenix-bjoern commented Nov 7, 2023

hairyhum commented Nov 29, 2023

julienvincent commented Dec 27, 2023

phoenix-bjoern commented Dec 28, 2023

julienvincent commented Dec 28, 2023

phoenix-bjoern commented Dec 28, 2023

julienvincent commented Dec 30, 2023

phoenix-bjoern commented Jan 3, 2024

julienvincent commented Jan 4, 2024 •

edited

Loading

phoenix-bjoern commented Jan 4, 2024

kain88-de commented May 15, 2024

hoyho commented Jun 18, 2024

SCLogo commented Aug 2, 2024

hoyho commented Aug 2, 2024

snapshot-controller logs report failure frequently #748

snapshot-controller logs report failure frequently #748

Comments

ambiknai commented Aug 12, 2022

xing-yang commented Aug 12, 2022

ggriffiths commented Aug 24, 2022 • edited Loading

ggriffiths commented Aug 24, 2022

k8s-ci-robot commented Aug 24, 2022

Guidelines

camartinez04 commented Aug 29, 2022

ggriffiths commented Aug 29, 2022

blackpiglet commented Oct 11, 2022

ggriffiths commented Oct 19, 2022

k8s-triage-robot commented Feb 8, 2023

shubham-pampattiwar commented May 30, 2023

xing-yang commented May 30, 2023

xing-yang commented May 30, 2023

shubham-pampattiwar commented May 30, 2023

ggriffiths commented Oct 31, 2023

phoenix-bjoern commented Nov 6, 2023

phoenix-bjoern commented Nov 6, 2023

phoenix-bjoern commented Nov 7, 2023

hairyhum commented Nov 29, 2023

julienvincent commented Dec 27, 2023

phoenix-bjoern commented Dec 28, 2023

julienvincent commented Dec 28, 2023

phoenix-bjoern commented Dec 28, 2023

julienvincent commented Dec 30, 2023

phoenix-bjoern commented Jan 3, 2024

julienvincent commented Jan 4, 2024 • edited Loading

phoenix-bjoern commented Jan 4, 2024

kain88-de commented May 15, 2024

hoyho commented Jun 18, 2024

SCLogo commented Aug 2, 2024

hoyho commented Aug 2, 2024

ggriffiths commented Aug 24, 2022 •

edited

Loading

julienvincent commented Jan 4, 2024 •

edited

Loading