Silence "deletion in progress" errors when deleting VRGs #1689

Draft · wants to merge 3 commits into base: main
3 changes: 0 additions & 3 deletions Makefile
@@ -190,9 +190,6 @@ test-util: generate manifests envtest ## Run util tests.
test-util-pvc: generate manifests envtest ## Run util-pvc tests.
go test ./internal/controller/util -coverprofile cover.out -ginkgo.focus PVCS_Util

test-kubeobjects: ## Run kubeobjects tests.
go test ./internal/controller/kubeobjects -coverprofile cover.out -ginkgo.focus Kubeobjects

test-drenv: ## Run drenv tests.
$(MAKE) -C test

20 changes: 14 additions & 6 deletions internal/controller/drplacementcontrol_controller.go
@@ -119,7 +119,7 @@ func (r *DRPlacementControlReconciler) SetupWithManager(mgr ctrl.Manager) error
// For more details, check Reconcile and its Result here:
// - https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.7.0/pkg/reconcile
//
//nolint:funlen,gocognit,gocyclo,cyclop
//nolint:funlen,gocognit,gocyclo,cyclop,nestif
func (r *DRPlacementControlReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := r.Log.WithValues("DRPC", req.NamespacedName, "rid", uuid.New())

@@ -166,13 +166,21 @@ func (r *DRPlacementControlReconciler) Reconcile(ctx context.Context, req ctrl.R
// then the DRPC should be deleted as well. The least we should do here is to clean up DPRC.
err := r.processDeletion(ctx, drpc, placementObj, logger)
if err != nil {
logger.Info(fmt.Sprintf("Error in deleting DRPC: (%v)", err))

statusErr := r.setDeletionStatusAndUpdate(ctx, drpc)
if statusErr != nil {
err = fmt.Errorf("drpc deletion failed: %w and status update failed: %w", err, statusErr)

return ctrl.Result{}, err
}

// Is this an expected condition?
if rmnutil.IsOperationInProgress(err) {
Member

Instead of passing back errors, which seems to be of limited use, and defining an error internally that has to be honored in other and existing error cases (which adds to the code awareness required across functions when returning errors), I would just not return an error and instead log it for troubleshooting.

That way we are not defining an error that other parts of the code need to return. The current code requests a reconcile using either a requeue or an error, which for us is the same. So this change would impact future work.

When the code was initially written there was some confusion in this regard, and errors were returned. I suggest we do not return errors, but request a requeue via the result.

Member Author

The current solution is a local change in some functions inside the flow, and a local change in Reconcile. This should be easy to backport (and we must backport it to 4.18, since this is a regression added recently).

If we don't use the error, we need to change all the function signatures in the delete flow. We can return (requeue, error), but this does not allow code to specify a RequeueAfter, and even if we don't have code that needs it right now in this flow, we may have code that needs it in other flows, or in this flow in the future, so this signature is not future proof. So it looks like the new signature should be (ctrl.Result, error).

But this does not match the other controllers, and we probably want a consistent way to return results from controller functions. Do you suggest refactoring all of ramen to return (ctrl.Result, error)?
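
For illustration, a rough sketch of what a delete-flow helper could look like with that signature (the helper name, argument, and delay are hypothetical):

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical helper: an expected "not done yet" condition asks for a delayed
// retry via the result instead of being reported as an error.
func ensureVRGsGone(ctx context.Context, vrgCount int) (ctrl.Result, error) {
	if vrgCount != 0 {
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}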

Member Author

For reference, searching for (ctrl.Result, error) shows that only a few functions use this return value.

Results (without Reconcile functions):

  • (r DRClusterReconciler) processCreateOrUpdate
  • (r DRClusterReconciler) processDeletion
  • (r *DRClusterConfigReconciler) processDeletion
  • (r *DRClusterConfigReconciler) processCreateOrUpdate
  • (r *DRPlacementControlReconciler) reconcileDRPCInstance
  • (r *DRPolicyReconciler) reconcile

Using this signature for the deletion flow seems common.

Member

This is a well-known and long-standing debate in software design. It's similar to classic discussions around handling conditions like "file not found": do you throw an exception or not? Is it an error/exception or simply an expected state? Similarly, in the context of a reconciler, when attempting to fetch a resource, receiving a NotFound isn't necessarily a failure but rather an expected condition. Yet it is still treated as an error.

In our case, the processDeletion function returns an error. If it doesn't complete fully, there's clearly a reason for it. This reason holds valuable context that can help the caller logic make informed decisions. Simply logging it without passing it back discards this context.

By returning the error, we let the caller decide whether to log, requeue, or take any other corrective action. I also believe that Result{} should not be propagated from function to function. Only the reconciler function should fill in that structure with the information returned from the helper functions.
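
As a rough sketch of the NotFound analogy (generic helper, name hypothetical), the usual controller-runtime idiom makes the expected condition explicit while still using the error channel for real failures:

import (
	"context"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Sketch: NotFound is an expected condition, surfaced as found == false,
// while any other API error is passed back to the caller to act on.
func fetchObject(ctx context.Context, c client.Client, key client.ObjectKey, obj client.Object) (bool, error) {
	if err := c.Get(ctx, key, obj); err != nil {
		if k8serrors.IsNotFound(err) {
			return false, nil
		}

		return false, err
	}

	return true, nil
}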

Member

I am simply stating that we update the result to requeue as true on errors and do not propagate the Result to other functions. The reconciler is just a loop that ends with a call requesting a requeue or not; the error does not hold much value.

If we add reasons to errors and start making smart decisions, we will need to use this pattern consistently, which is a burden, IMHO. If that will not be the case, then in this narrow use case we can do what is done here, but I worry that the pattern will be enforced in the future for no particular gain.
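
Roughly, a sketch of what I mean inside Reconcile, keeping processDeletion's current error-only signature:

if err := r.processDeletion(ctx, drpc, placementObj, logger); err != nil {
	// Log for troubleshooting; the error itself does not drive the flow.
	logger.Info("DRPC deletion did not complete, requeueing", "reason", err)

	return ctrl.Result{Requeue: true}, nil
}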

Member Author

I am simply stating that we update the result to requeue as true on errors and do not propagate the Result to other functions.

So you suggest changing the functions in the processDeletion flow to behave like this?

func deleteExample(...) bool {
    // Expected to succeed during deletion
    if err := trySomething(...); err != nil {
        log.Error(err, "failed to ...")
        return false
    }

    // Expected to return false for a while during deletion
    if !readyForNextStep(...) {
        return false
    }

    // Completed
    return true
}

So we log an INFO message in ensureVRGsDeleted if the VRG was not deleted yet, and return false.

If we add reasons to errors and start making smart decisions, we will need to use this pattern consistently.

This pattern of trying to do the next step on the way to the desired state is pretty common in ramen. It could be very useful if we used it everywhere. It would make the code much clearer and avoid all the bogus errors that we have today. This change makes it easy to adopt the pattern everywhere gradually, without changing a lot of code.

If that will not be the case, then in this narrow use case we can do what is done here, but I worry that the pattern will be enforced in the future for no particular gain.

How about using this pattern only in this controller for now? We need a way to fix the regression caused by adding the VRG on the secondary cluster. This should be backported to 4.18 soon, before it breaks QE tests in the same way it breaks our tests.

I can also split off the fix (the first commit) so we can merge it now. It will cause a lot of noise in the logs during deletion, but we can fix that in the next step, when we converge on the right way to handle the "not ready" state during reconcile.

logger.Info("Deleting DRPC in progress", "reason", err)

return ctrl.Result{Requeue: true}, nil
}

// Unexpected error.
return ctrl.Result{}, err
Member

I mean just let this be returned and remove the block starting from line 189 to 195 inclusive.

Member Author

The result will be an ERROR log with a stack trace - we see 18 of these for every delete operation. Since this is an expected condition, it is wrong to treat it as an error.

Member

Since this is an expected condition, it is wrong to treat it as an error.

Same as my comment above

}

@@ -736,7 +744,7 @@ func (r *DRPlacementControlReconciler) cleanupVRGs(
}

if len(vrgs) != 0 {
return fmt.Errorf("waiting for VRGs count to go to zero")
return rmnutil.OperationInProgress("waiting for VRGs count to go to zero")
}

// delete MCVs
@@ -761,7 +769,7 @@ func (r *DRPlacementControlReconciler) ensureVRGsDeleted(
for cluster, vrg := range vrgs {
if vrg.Spec.ReplicationState == replicationState {
if !ensureVRGsManagedByDRPC(r.Log, mwu, vrgs, drpc, vrgNamespace) {
return fmt.Errorf("%s VRG adoption in progress", replicationState)
return rmnutil.OperationInProgress(fmt.Sprintf("%s VRG adoption in progress", replicationState))
}

if err := mwu.DeleteManifestWork(mwu.BuildManifestWorkName(rmnutil.MWTypeVRG), cluster); err != nil {
Expand All @@ -773,7 +781,7 @@ func (r *DRPlacementControlReconciler) ensureVRGsDeleted(
}

if inProgress {
return fmt.Errorf("%s VRG manifestwork deletion in progress", replicationState)
return rmnutil.OperationInProgress(fmt.Sprintf("%s VRG manifestwork deletion in progress", replicationState))
}

return nil
12 changes: 0 additions & 12 deletions internal/controller/kubeobjects/requests.go
@@ -43,8 +43,6 @@ func RequestsMapKeyedByName(requestsStruct Requests) map[string]Request {
return requests
}

type RequestProcessingError struct{ string }

type CaptureSpec struct {
//+optional
Name string `json:"name,omitempty"`
@@ -138,16 +136,6 @@ type Operation struct {
InverseOp string `json:"inverseOp,omitempty"`
}

func RequestProcessingErrorCreate(s string) RequestProcessingError { return RequestProcessingError{s} }
func (e RequestProcessingError) Error() string { return e.string }

// Called by errors.Is() to match target.
func (RequestProcessingError) Is(target error) bool {
_, ok := target.(RequestProcessingError)

return ok
}

type RequestsManager interface {
ProtectsPath() string
RecoversPath() string
30 changes: 0 additions & 30 deletions internal/controller/kubeobjects/requests_test.go

This file was deleted.

13 changes: 7 additions & 6 deletions internal/controller/kubeobjects/velero/requests.go
@@ -17,6 +17,7 @@ import (

"github.com/go-logr/logr"
"github.com/ramendr/ramen/internal/controller/kubeobjects"
"github.com/ramendr/ramen/internal/controller/util"
velero "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
corev1 "k8s.io/api/core/v1"
k8serrors "k8s.io/apimachinery/pkg/api/errors"
@@ -249,11 +250,11 @@ func backupDummyStatusProcessAndRestore(
velero.BackupPhaseUploading,
velero.BackupPhaseUploadingPartialFailure,
velero.BackupPhaseDeleting:
return nil, kubeobjects.RequestProcessingErrorCreate("backup" + string(backup.Status.Phase))
return nil, util.OperationInProgress("backup" + string(backup.Status.Phase))
case velero.BackupPhaseFailedValidation:
return nil, errors.New("backup" + string(backup.Status.Phase))
default:
return nil, kubeobjects.RequestProcessingErrorCreate("backup.status.phase absent")
return nil, util.OperationInProgress("backup.status.phase absent")
}
}

@@ -283,13 +284,13 @@ func restoreStatusProcess(
return nil
case velero.RestorePhaseNew,
velero.RestorePhaseInProgress:
return kubeobjects.RequestProcessingErrorCreate("restore" + string(restore.Status.Phase))
return util.OperationInProgress("restore" + string(restore.Status.Phase))
case velero.RestorePhaseFailed,
velero.RestorePhaseFailedValidation,
velero.RestorePhasePartiallyFailed:
return errors.New("restore" + string(restore.Status.Phase))
default:
return kubeobjects.RequestProcessingErrorCreate("restore.status.phase absent")
return util.OperationInProgress("restore.status.phase absent")
}
}

@@ -399,13 +400,13 @@ func backupRealStatusProcess(
velero.BackupPhaseUploading,
velero.BackupPhaseUploadingPartialFailure,
velero.BackupPhaseDeleting:
return kubeobjects.RequestProcessingErrorCreate("backup" + string(backup.Status.Phase))
return util.OperationInProgress("backup" + string(backup.Status.Phase))
case velero.BackupPhaseFailedValidation,
velero.BackupPhasePartiallyFailed,
velero.BackupPhaseFailed:
return errors.New("backup" + string(backup.Status.Phase))
default:
return kubeobjects.RequestProcessingErrorCreate("backup.status.phase absent")
return util.OperationInProgress("backup.status.phase absent")
}
}

24 changes: 24 additions & 0 deletions internal/controller/util/errors.go
@@ -0,0 +1,24 @@
// SPDX-FileCopyrightText: The RamenDR authors
// SPDX-License-Identifier: Apache-2.0

package util

import "errors"

// OperationInProgress error is returned when an operation is in progress and we wait for the desired state. The error
// string should describe the operation for logging the error.
type OperationInProgress string

func (e OperationInProgress) Error() string { return string(e) }

// Called by errors.Is() to match target.
func (OperationInProgress) Is(target error) bool {
_, ok := target.(OperationInProgress)

return ok
}

// IsOperationInProgress returns true if err or error wrapped by it is an OperationInProgress error.
func IsOperationInProgress(err error) bool {
return errors.Is(err, OperationInProgress(""))
Member @BenamarMk (Dec 2, 2024)

One question though: is this call going to propagate back to line 15? I doubt it. errors.Is(...) is a free function.

Member Author

Member

I think it does here:

if x, ok := err.(interface{ Is(error) bool }); ok && x.Is(target) {
	return true
}

All good
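
For completeness, a minimal standalone sketch of that dispatch (standard library only, with a copy of the type from this file):

package main

import (
	"errors"
	"fmt"
)

// Copy of the type above, so the snippet runs on its own.
type OperationInProgress string

func (e OperationInProgress) Error() string { return string(e) }

// errors.Is calls this method on each error in the chain that implements it.
func (OperationInProgress) Is(target error) bool {
	_, ok := target.(OperationInProgress)

	return ok
}

func main() {
	err := fmt.Errorf("deleting DRPC: %w", OperationInProgress("waiting for VRGs count to go to zero"))

	// Prints "true": errors.Is unwraps err and dispatches to OperationInProgress.Is.
	fmt.Println(errors.Is(err, OperationInProgress("")))
}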

}
33 changes: 33 additions & 0 deletions internal/controller/util/errors_test.go
@@ -0,0 +1,33 @@
// SPDX-FileCopyrightText: The RamenDR authors
// SPDX-License-Identifier: Apache-2.0

package util_test

import (
"errors"
"fmt"

. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/ramendr/ramen/internal/controller/util"
)

var _ = Describe("OperationInProgress", func() {
Context("comparing errors", func() {
err := util.OperationInProgress("this operation")

It("is not equal", func() {
Expect(err == util.OperationInProgress("other operation")).To(Equal(false))
})
It("match error", func() {
Expect(util.IsOperationInProgress(err)).To(Equal(true))
})
It("match wrapped error", func() {
wrapped := fmt.Errorf("wrapping operation in progress: %w", err)
Expect(util.IsOperationInProgress(wrapped)).To(Equal(true))
})
It("does not match other errors", func() {
Expect(util.IsOperationInProgress(errors.New("other error"))).To(Equal(false))
})
})
})
4 changes: 2 additions & 2 deletions internal/controller/vrg_kubeobjects.go
@@ -361,7 +361,7 @@ func (v *VRGInstance) kubeObjectsGroupCapture(
continue
}

if errors.Is(err, kubeobjects.RequestProcessingError{}) {
if util.IsOperationInProgress(err) {
log1.Info("Kube objects group capturing", "state", err.Error())

continue
@@ -695,7 +695,7 @@ func (v *VRGInstance) executeRecoverGroup(result *ctrl.Result, s3StoreAccessor s
}
}

if errors.Is(err, kubeobjects.RequestProcessingError{}) {
if util.IsOperationInProgress(err) {
log1.Info("Kube objects group recovering", "state", err.Error())

return err