-
Notifications
You must be signed in to change notification settings - Fork 339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete volume requests are sent twice with new provisioner version v5.0.1 #1235
Delete volume requests are sent twice with new provisioner version v5.0.1 #1235
Comments
We also see similar behavior with pre-provisioned disks, the csi driver after the upgrade started to delete them after with the PVC and PV were deleted which is NOT expected. The pre-provisioned disks should remain untouched |
Another instance where pre-provisioned PVC was deleted as per the logs. But, from the cluster it is still showing Terminating.
|
This is as per the KEP design https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2644-honor-pv-reclaim-policy#csi-driver-volumes, essentially, the new finalizer will be added to even statically provisioned volumes if they are in a Bound state, and deleting the PVC, will delete the underlying volume, honoring the reclaim policy specified. |
@shrutinipane This is pretty odd, based on https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1545 Once the deleteVolumeOperation is triggered, it removes all the finalizers if necessary. Only after that would it log the message "PersistentVolume deleted succeeded" |
We have following setting in provisioner clusterrole:
Still while deleting statically provisioned PV, it remains in Terminating state. Following are the logs in provisioner sidecar:
|
@shrutinipane could you post the whole csi-provisioner logs? it would be helpful to get kcm logs as well. Please post the PV and PVC yaml as well. Also the steps, did you simply delete the PVC? Do the logs indicate something as a fallout after the PVC delete? |
These are the CSI pods:
The yaml representation of kubernetes objects created:
PV:
StorageClass:
Steps to reproduce defect:
Please find attached CSI provisioner logs, Kube Controller Manager logs and CSI Snap. CSI Provisioner logs:
Kube Controller Manager logs:
kube-controller-manager_logs.txt |
@deepakkinni do you need any more information on the above issue? |
@deepakkinni @Jainbrt @shrutinipane A similar issue was reported to kubernetes/kuberentes, and fixed by this PR in 1.30. Because this feature is alpha, it is not backported to v1.29. Only v1.30 and later versions are fixed. |
@carlory thanks for your response, but as this feature is not enabled, we are hitting this in common path. |
The root cause is that the pv controller marks the pv as failed, so the external provsioner cannot handle the pv. The previous solution is gated with if utilfeature.DefaultFeatureGate.Enabled(features.HonorPVReclaimPolicy) {
if metav1.HasAnnotation(volume.ObjectMeta, storagehelpers.AnnMigratedTo) {
// CSI migration scenario - do not depend on in-tree plugin
return nil, nil
}
if volume.Spec.CSI != nil {
// CSI volume source scenario - external provisioner is requested
return nil, nil
}
} Should we remove the feature-gate check in k/k and then backport the fix to previous releases? |
@carlory I think we'd need to do this. In theory, the fix should have been applicable regardless of the feature
But how did this work in the past? With older kcm and older external-provisioner, would deleting a statically provisioned volume(bound pv-pvc, delete PVC) leave the PV in a Failed state? @shrutinipane would it be possible to repeat this using <5.0.1 provisioner? What was the state of PV when PVC was deleted? can you post kcm logs as well?post pv,PVC yamls |
With older provisioner version, PV is in Released state and then gets deleted. CSI Provisioner Logs:
Kube Controller Manager logs:
Attaching log files: |
Thanks @shrutinipane From external-provisioner perspective the PV moves from
But in kcm, there's no indication that it moved into Released state, on the other hand the indicated error should have moved the PV to failed state
pluginName, deleted, err := ctrl.doDeleteVolume(ctx, volume)
if err != nil {
// Delete failed, update the volume and emit an event.
logger.V(3).Info("Deletion of volume failed", "volumeName", volume.Name, "err", err)
if volerr.IsDeletedVolumeInUse(err) {
// The plugin needs more time, don't mark the volume as Failed
// and send Normal event only
ctrl.eventRecorder.Event(volume, v1.EventTypeNormal, events.VolumeDelete, err.Error())
} else {
// The plugin failed, mark the volume as Failed and send Warning
// event
if _, err := ctrl.updateVolumePhaseWithEvent(ctx, volume, v1.VolumeFailed, v1.EventTypeWarning, events.VolumeFailedDelete, err.Error()); err != nil {
logger.V(4).Info("DeleteVolumeOperation: failed to mark volume as failed", "volumeName", volume.Name, "err", err)
// Save failed, retry on the next deletion attempt
return pluginName, err
}
} This was supposed to move the PV into https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/plugins.go#L770-L779 // FindDeletablePluginBySpec fetches a persistent volume plugin by spec. If
// no plugin is found, returns error.
func (pm *VolumePluginMgr) FindDeletablePluginBySpec(spec *Spec) (DeletableVolumePlugin, error) {
volumePlugin, err := pm.FindPluginBySpec(spec)
if err != nil {
return nil, err
}
if deletableVolumePlugin, ok := volumePlugin.(DeletableVolumePlugin); ok {
return deletableVolumePlugin, nil
}
return nil, fmt.Errorf("no deletable volume plugin matched")
} Will follow up this soon. |
The KCM updates pv's phase with the Released state and then may move this field into the Failed state, please refer to kubernetes/kubernetes#122030 (comment) for more details. We can observe the behavior from the above log snippet of the KCM and the external provisioner. |
Thanks @carlory for the help here and raising the fix. I was wondering if we would get new provisioner sidecar image or this be a dependency on newer k8s versions? |
@Jainbrt After the pr is merged, it will be backported to 1.30, 1.29, 1.28, and 1.27. The latest provisioner image is enough, so we don't need to release a newer one. |
Thanks @carlory for the help in understanding, while are k8s < 1.27 versions also impacted ? |
Yes, please refer to End of Life |
Hi @carlory, i don't think that is the root cause, it'll definitely resolve the issue(since we avoid moving into Failed state entirely), but there seems to be other issue. To summarize the observation based on logs in in #1235 (comment), the volume is moved into a Released state, followed by a Failed state, this infact was observed in the csi-provisioner logs:
As seen in the above logs, the csi driver threw the error with message Now, even if kcm doesn't move into Failed state, the driver may not necessarily successfully delete the volume because of the above error. In summary, the change kubernetes/kubernetes#125767 should go in, but doesn't mean it'll resolve @shrutinipane reported issue. Perhaps, the driver eventually detaches the volume from node, then maybe they'll not see the issue. |
@deepakkinni I have a different opinion on the error log. Let's see the function signature: func (p *csiProvisioner) canDeleteVolume(volume *v1.PersistentVolume) error It uses the return value to indicate whether the volume can be deleted or not. for _, va := range vaList {
if va.Spec.Source.PersistentVolumeName != nil && *va.Spec.Source.PersistentVolumeName == volume.Name {
return fmt.Errorf("persistentvolume %s is still attached to node %s", volume.Name, va.Spec.NodeName)
}
} the above code snippet is checking if the volume is still attached to a node. If it is, it has to return an error (not a boolean value) to indicate that the volume cannot be deleted. The detach operation is asynchronous, the volume may still be attached to the node when the |
|
Thanks for the clarification, @carlory. I was assuming the error was coming from the driver, but it seems like it's from the provisioner itself. |
Thanks @carlory and @deepakkinni for helping on this issue, do we have any outlook when will we get the fix kubernetes/kubernetes#125767 in next k8s PTF builds ? |
@carlory Regarding comment #1235 (comment), with the current implementation of provisioner v5.0.1, PV stays in Terminating stays and doesn't get cleaned up after 15 minutes. |
@shrutinipane once the pv is updated with Failed state by kcm, the csi-provisioner won't be able to handle the pv. so kubernetes/kubernetes#125767 will fix this. |
@deepakkinni @Jainbrt Kubernetes release 1.27 reaches the End of Life, please see https://kubernetes.io/releases/patch-releases/\#1-27. so 1.27 won't include this fix |
Sure @carlory . I hope this would be picked for 1.28,1.29 and 1.30 k8s further ptf releases. |
Found it, the provisioner computes a wrong JSON patch to remove the finalizer.
Line 1591 in 1194963
But Line 1595 in 1194963
|
/reopen |
@carlory: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Needs to bump dependencies @jsafrane |
@shrutinipane @andyzhangx @Jainbrt Before I tag a new sig-storage-lib-external-provisioner + external-provisioner versions, I'd like to make sure the fix is complete and there are no lose ends. Can you please test #1245? If you trust me, I prepared an image from the PR at quay.io/jsafrane/external-provisioner:test1245 Thanks in advance for any feedback. |
Sure @jsafrane we can give this a try. thanks for the help. |
@jsafrane we do not have access to pull the above images, can you please help with the same ? |
@Jainbrt sorry, I forgot to make the repo public. Please try again. |
Thanks @jsafrane yes, we are able to pull now. |
@Jainbrt is my image working? How much did you test it? |
@Jainbrt friendly reminder |
Hey @jsafrane, we are still testing this while we do see issue in static provisioning. |
I already released a fixed version of lib-external-provisioner and updated it here, I'm cherry-picking the fix and I plan to release v5.0.2 soon-ish. |
v5.0.2 tagged, waiting for image builds |
@jsafrane I noticed that this issue persists with the above image quay.io/jsafrane/external-provisioner:test1245 Is there any specific k8s version this image would work on? |
I just tested with 1.29.3 and csi-driver-hostpath with my provisioner image, the driver got just 1 DeleteVolume call. |
prov_lopgs.txt shows 1 delete attempt:
But then I can see only:
(which is odd... how was it Released before and now it's not?) |
@jsafrane we have observed this issue when we run the quick test using automation ( where deletion of pvc and pv is quick). Not sure if this is related. |
@jsafrane we tried with k8s 1.30.3 which does have changes from k8s too, and that works fine. Can you please help release 5.0.2 images ? |
5.0.2 images are out, |
Still having this problem. |
What happened:
When we try to delete volume using new provisioner version v5.0.1, we notice 2 volume deletion requests in the logs. As a result of which PV remains in Terminating state.
Following are the driver logs:
What you expected to happen:
PV should be deleted successfully.
How to reproduce it:
Environment:
kubectl version
): v1.29.3uname -a
):The text was updated successfully, but these errors were encountered: