
On transient failure in velero csi plugin, the volumesnapshot is getting deleted without updating the object store #8116

Open
soumyapattnaik opened this issue Aug 14, 2024 · 14 comments
Labels: Needs info (Waiting for information)

@soumyapattnaik

What steps did you take and what happened:
In the Finalizing phase today, we do a Get on the VolumeSnapshot. If that Get fails due to a transient failure such as a TLS handshake timeout, the Velero CSI plugin deletes the VolumeSnapshot and VolumeSnapshotContent.

https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/e8f7af4b65f0ed6c69d340aefe2257dc25cd013f/internal/backup/volumesnapshot_action.go#L104

After the deletion, the backup controller re-uploads the backup tarball:

func buildFinalTarball(tr *tar.Reader, tw *tar.Writer, updateFiles map[string]FileForArchive) error {

But it does not update the CSI-related artifacts in the object store, so there is a mismatch between what is in the object store and what was actually backed up.

This has led to another issue in Velero: #7979

What did you expect to happen:

The expectation is that if the snapshot is cleaned up, the corresponding entry should also be removed from the object store. Also, for transient errors, Velero should have a retry mechanism that at least retries the Get operation rather than failing the operation upfront.
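For illustration, a minimal sketch of what such a retry could look like, using client-go's retry.OnError helper (the getVolumeSnapshotWithRetry function and its wiring are assumptions for this sketch, not the plugin's actual code):

package backup

import (
	"context"
	"time"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// getVolumeSnapshotWithRetry retries the Get on transient errors (TLS
// handshake timeouts, connection resets, etc.) with exponential backoff,
// but gives up immediately on a definitive NotFound.
func getVolumeSnapshotWithRetry(ctx context.Context, client snapshotclient.Interface,
	namespace, name string) (*snapshotv1.VolumeSnapshot, error) {
	var vs *snapshotv1.VolumeSnapshot
	backoff := wait.Backoff{Steps: 5, Duration: time.Second, Factor: 2.0, Jitter: 0.1}
	err := retry.OnError(backoff,
		// Retry everything except NotFound: only NotFound is a definitive answer.
		func(err error) bool { return !apierrors.IsNotFound(err) },
		func() error {
			var getErr error
			vs, getErr = client.SnapshotV1().VolumeSnapshots(namespace).Get(ctx, name, metav1.GetOptions{})
			return getErr
		})
	return vs, err
}

The backoff numbers are arbitrary; the point is that a brief kube-apiserver outage no longer takes the same code path as a successful Get.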

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@soumyapattnaik changed the title from "On transient failure in velero csi plugin the volumesnapshot is getting deleted without updating the object store" to "On transient failure in velero csi plugin, the volumesnapshot is getting deleted without updating the object store" on Aug 14, 2024
@Lyndon-Li
Contributor

@kaovilai Please help check whether design #8063 could cover the operations in the Finalizing phase.

@soumyapattnaik
Author

@Lyndon-Li I don't think the design will be able to address this issue. Velero doesn't patch the VSC during the Finalizing phase, and that is still not covered as part of design #8063.

@reasonerjt
Contributor

@soumyapattnaik
After reading the description, I'm confused:

In the finalizing phase today, we do a get on volumesnapshot, if it fails due to some transient failures like TLS handshake timeout, velero csi plugin deletes the volumesnapshot and volumesnapshotcontent.

I don't think this matches the line you pasted:
https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/e8f7af4b65f0ed6c69d340aefe2257dc25cd013f/internal/backup/volumesnapshot_action.go#L104

Generally, Velero removes the VolumeSnapshot to make sure the actual snapshot in the storage provider is not impacted when the resource is removed. I don't think it deliberately removes it when a Get fails.

Could you double check?

@reasonerjt added the Needs info (Waiting for information) label Aug 19, 2024
@blackpiglet
Contributor

vmware-tanzu/velero-plugin-for-csi#177 (comment)
The backup phase checking code in the VolumeSnapshot BIA was introduced by this comment.

The VolumeSnapshot deletion code is used to purge unneeded VolumeSnapshots after the PVC data is backed up. It is not used for error handling.

	if backup.Status.Phase == velerov1api.BackupPhaseFinalizing || backup.Status.Phase == velerov1api.BackupPhaseFinalizingPartiallyFailed {
		p.Log.WithField("Backup", fmt.Sprintf("%s/%s", backup.Namespace, backup.Name)).
			WithField("BackupPhase", backup.Status.Phase).Debugf("Clean VolumeSnapshots.")
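		// In the Finalizing phases the VolumeSnapshot has already served its
		// purpose, so it is purged here and the item is returned unchanged.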
		util.DeleteVolumeSnapshot(vs, *vsc, backup, snapshotClient.SnapshotV1(), p.Log)
		return item, nil, "", nil, nil
	}

@kaovilai
Contributor

@kaovilai Please help check whether design #8063 could cover the operations in the Finalizing phase.

Per #8063 (comment)

I think we have agreement that it can cover the Finalizing phase.

@soumyapattnaik
Author

soumyapattnaik commented Aug 20, 2024

@reasonerjt - on any transient failure at https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/e8f7af4b65f0ed6c69d340aefe2257dc25cd013f/internal/backup/volumesnapshot_action.go#L98C2-L98C157, the flow goes into this code: https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/e8f7af4b65f0ed6c69d340aefe2257dc25cd013f/internal/backup/volumesnapshot_action.go#L100

For my customer, the Get failed with a TLS handshake error, and then the logs below were printed.

Deleting Volumesnapshot XX/XXXX :: {"cmd":"/plugins/velero-plugin-for-csi"}
Deleted volumesnapshot with volumesnapshotContent XX/XXXX :: {"cmd":"/plugins/velero-plugin-for-csi"}

Also, from our ARM traces I could see that our disk snapshot gets cleaned up for this VS. For the other VSs where the Get calls succeeded, the flow went into line https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/e8f7af4b65f0ed6c69d340aefe2257dc25cd013f/internal/backup/volumesnapshot_action.go#L104 as pointed out by you above.

@blackpiglet
Contributor

I see. The error happened while waiting for VolumeSnapshot.Status.ReadyToUse to turn True.
What is the error reported?
Usually, this only happens when client-go cannot find the VS or VSC.

@soumyapattnaik
Author

The error was a transient one: the call was not reaching the API server because of a TLS handshake timeout. The VS and VSC were present for the duration of the Get call.

@blackpiglet
Contributor

blackpiglet commented Aug 22, 2024

By a transient error of TLS handshake timeout, do you mean the Velero pod lost its connection to the kube-apiserver?

It could cause Velero's client to fail to read VS and VSC.

If so, this issue is also related to the request for a retry mechanism with kube-apiserver.

@soumyapattnaik
Author

Yes, correct. Building retry logic here will help: in my customer's case this failure was observed for only one VS and VSC, and for the other two VS/VSC pairs there were no issues. Can you please share the issue number where the retry mechanism with the kube-apiserver is being discussed?

@blackpiglet
Contributor

#8063 (comment)

A retry mechanism was discussed there, although it may not cover your case.

Could you give more information about why the kube-apiserver didn't work temporarily?

@soumyapattnaik
Author

The setup belongs to one of our customers. I am not sure why the kube-apiserver didn't work temporarily.

@kaovilai
Contributor

For one of our customers it's due to API server SSL certificate rotation, which means TLS wouldn't work temporarily.

@reasonerjt
Contributor

@soumyapattnaik
After checking the code, this is a valid issue. The root cause is that the Execute of the CSI BIAv2 plugin runs again in FinalizeBackup, and the code in Execute does not differentiate the Finalize phase from the first round of backup.
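For illustration, a minimal standalone sketch (all names here are assumptions, not the plugin's actual code) of how the Finalizing-phase logic could distinguish "the snapshot is definitively gone" from "the API server is temporarily unreachable", so that a transient Get failure is retried instead of falling through to deletion:

package backup

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// getOutcome classifies the result of a VolumeSnapshot Get during Finalizing.
type getOutcome int

const (
	snapshotPresent getOutcome = iota // Get succeeded: safe to run cleanup
	snapshotGone                      // NotFound: nothing left to delete
	snapshotUnknown                   // transient error: retry, do NOT delete
)

// classifyGetError decides what the Finalizing logic should do with the error
// returned by a Get on the VolumeSnapshot. Anything that is not a definitive
// NotFound (e.g. a TLS handshake timeout) leaves the snapshot state unknown,
// so falling through to deletion would be unsafe.
func classifyGetError(err error) getOutcome {
	if err == nil {
		return snapshotPresent
	}
	if apierrors.IsNotFound(err) {
		return snapshotGone
	}
	return snapshotUnknown
}

With a classification like this, only snapshotPresent would proceed to DeleteVolumeSnapshot, snapshotGone would be a no-op, and snapshotUnknown would surface an error so the operation can be retried.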
