-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Silence "deletion in progress" errors when deleting VRGs #1689
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -119,7 +119,7 @@ func (r *DRPlacementControlReconciler) SetupWithManager(mgr ctrl.Manager) error | |
// For more details, check Reconcile and its Result here: | ||
// - https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.7.0/pkg/reconcile | ||
// | ||
//nolint:funlen,gocognit,gocyclo,cyclop | ||
//nolint:funlen,gocognit,gocyclo,cyclop,nestif | ||
func (r *DRPlacementControlReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { | ||
logger := r.Log.WithValues("DRPC", req.NamespacedName, "rid", uuid.New()) | ||
|
||
|
@@ -166,13 +166,21 @@ func (r *DRPlacementControlReconciler) Reconcile(ctx context.Context, req ctrl.R | |
// then the DRPC should be deleted as well. The least we should do here is to clean up DPRC. | ||
err := r.processDeletion(ctx, drpc, placementObj, logger) | ||
if err != nil { | ||
logger.Info(fmt.Sprintf("Error in deleting DRPC: (%v)", err)) | ||
|
||
statusErr := r.setDeletionStatusAndUpdate(ctx, drpc) | ||
if statusErr != nil { | ||
err = fmt.Errorf("drpc deletion failed: %w and status update failed: %w", err, statusErr) | ||
|
||
return ctrl.Result{}, err | ||
} | ||
|
||
// Is this an expected condition? | ||
if rmnutil.IsOperationInProgress(err) { | ||
logger.Info("Deleting DRPC in progress", "reason", err) | ||
|
||
return ctrl.Result{Requeue: true}, nil | ||
} | ||
|
||
// Unexpected error. | ||
return ctrl.Result{}, err | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean just let this be returned and remove the block starting from line 189 to 195 inclusive. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The result will be ERROR log with a stacktrace - we see 18 of these for every delete operation. Since this is an expected condition, it is wrong to treat this as an error. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Same as my comment above |
||
} | ||
|
||
|
@@ -736,7 +744,7 @@ func (r *DRPlacementControlReconciler) cleanupVRGs( | |
} | ||
|
||
if len(vrgs) != 0 { | ||
return fmt.Errorf("waiting for VRGs count to go to zero") | ||
return rmnutil.OperationInProgress("waiting for VRGs count to go to zero") | ||
} | ||
|
||
// delete MCVs | ||
|
@@ -761,7 +769,7 @@ func (r *DRPlacementControlReconciler) ensureVRGsDeleted( | |
for cluster, vrg := range vrgs { | ||
if vrg.Spec.ReplicationState == replicationState { | ||
if !ensureVRGsManagedByDRPC(r.Log, mwu, vrgs, drpc, vrgNamespace) { | ||
return fmt.Errorf("%s VRG adoption in progress", replicationState) | ||
return rmnutil.OperationInProgress(fmt.Sprintf("%s VRG adoption in progress", replicationState)) | ||
} | ||
|
||
if err := mwu.DeleteManifestWork(mwu.BuildManifestWorkName(rmnutil.MWTypeVRG), cluster); err != nil { | ||
|
@@ -773,7 +781,7 @@ func (r *DRPlacementControlReconciler) ensureVRGsDeleted( | |
} | ||
|
||
if inProgress { | ||
return fmt.Errorf("%s VRG manifestwork deletion in progress", replicationState) | ||
return rmnutil.OperationInProgress(fmt.Sprintf("%s VRG manifestwork deletion in progress", replicationState)) | ||
} | ||
|
||
return nil | ||
|
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
// SPDX-FileCopyrightText: The RamenDR authors | ||
// SPDX-License-Identifier: Apache-2.0package util | ||
|
||
package util | ||
|
||
import "errors" | ||
|
||
// OperationInProgress error is return when an operation is in progress and we wait for the desired state. The error | ||
// string should describe the operation for logging the error. | ||
type OperationInProgress string | ||
|
||
func (e OperationInProgress) Error() string { return string(e) } | ||
|
||
// Called by errors.Is() to match target. | ||
func (OperationInProgress) Is(target error) bool { | ||
_, ok := target.(OperationInProgress) | ||
|
||
return ok | ||
} | ||
|
||
// IsOperationInProgress returns true if err or error wrapped by it is an OperationInProgress error. | ||
func IsOperationInProgress(err error) bool { | ||
return errors.Is(err, OperationInProgress("")) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One question though, is this call going to propagate back to line 15? I doubt it. errors.Is(...) is a free function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, see errors.Is() implementation: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it does here:
All good |
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
// SPDX-FileCopyrightText: The RamenDR authors | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
package util_test | ||
|
||
import ( | ||
"errors" | ||
"fmt" | ||
|
||
. "github.com/onsi/ginkgo/v2" | ||
. "github.com/onsi/gomega" | ||
"github.com/ramendr/ramen/internal/controller/util" | ||
) | ||
|
||
var _ = Describe("OperationInProgress", func() { | ||
Context("comparing errors", func() { | ||
err := util.OperationInProgress("this operation") | ||
|
||
It("is not equal", func() { | ||
Expect(err == util.OperationInProgress("other operation")).To(Equal(false)) | ||
}) | ||
It("match error", func() { | ||
Expect(util.IsOperationInProgress(err)).To(Equal(true)) | ||
}) | ||
It("match wrapped error", func() { | ||
wrapped := fmt.Errorf("wrapping operation in progress: %w", err) | ||
Expect(util.IsOperationInProgress(wrapped)).To(Equal(true)) | ||
}) | ||
It("does not match other errors", func() { | ||
Expect(util.IsOperationInProgress(errors.New("other error"))).To(Equal(false)) | ||
}) | ||
}) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing back errors, which seems to be of limited use, and defining an error internally that needs to be honored in other error cases and also existing cases (which adds to code awareness across the functions when returning errors), I would instead just not return error and log it for troubleshooting use.
That way we are not defining an error that other parts of the code needs to return. Current code requests a reconcile using a requeue or an error, which for us is the same. So this change would impact future work.
When starting to write the code initially there was some confusion in this regard, and errors were returned. I suggest we do not return errors, but request a requeue via the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current solution is a local change in the in some functions inside the flow, and a local change in Reconcile. This should be easy to backport (and we must backport it to 4.18 since this is a regression added recently).
If we don't use the error, we need to change the all the function signatures in delete flow. We can return
requeue, error
, but this does not allow code to specify a RequeAfter, and even if we don't have code that needs it right now in this flow, we may have code that needs it in other flows or in this flow in the future, so this signature is not future proof. So looks like the new signature should becntl.Result, error
.But this does not match other controllers, and we probably want a consistent way to return results from controller functions. Do you suggest to refactor entire ramen to return
cntl.Result, error
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, searching for (ctrl.Result, error) show that only few functions use this return value.
Results (without Reconcile functions):
Using this signature for deletion flow seems common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a well-known and long-standing debate in software design. It's similar to classic discussions around handling conditions like a "file-not-found.", whether you throw an exception or not... Is it an error/exception or simply an expected state? Similarly, in the context of a reconciler, when attempting to fetch a resource, receiving a
NotFound
isn't necessarily a failure but rather an expected condition. But it is still treated as an error.In our case, the
processDeletion
function returns an error. If it doesn't complete fully, there's clearly a reason for it. This reason holds valuable context that can help the caller logic make informed decisions. Simply logging it without passing it back discards this context.By returning the error, we make the caller to decide whether to log, requeue, or take any other corrective action. I also believe that
Result{}
should not be propagated from function to function. Only the reconciler function should fill-in that structure with the information returned from the helper functions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am stating simply that we update result to requeue as
true
on errors and not propogate the Result to other functions. The reconciler is just a loop with an end to a call requesting a requeue or not, the error does not hold much value.If we add reasons for errors and start making smart decisions we will need to use this pattern consistently, which is a burden, IMHO. If that will not be the case then in this narrow use-case we can do what is done here, but I worry that the pattern will be enforced in the future for no particular gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you suggest to change functions in processDeletion flow to behave like this?
So we log an INFO message in ensureDeletedVRGs if we the VRG was not deleted yet, and return false.
This pattern of trying to do the next step in the way to the desired state is pretty common in ramen. It could be very useful if we use this pattern everywhere. It will make the code much more clear, and avoid all the bogus errors that we have today. This change make it easy to use this pattern everywhere gradually, without changing lot of code.
How about using this pattern only in this controller for now? We need a way to fix the regression caused by adding the vrg on the secondary cluster. This should be backported to 4.18 soon before it breaks QE tests in the same way it breaks our tests.
I can also split the fix (the first commit) so we can merge it now. It will cause a lot of noise in the logs during deletion, but we can fix this in the next step when we converge on what is the right way to handle "not ready" state during reconcile.