diff --git a/design/retry-patching-configuration_design.md b/design/retry-patching-configuration_design.md
new file mode 100644
index 0000000000..4643c620f0
--- /dev/null
+++ b/design/retry-patching-configuration_design.md
@@ -0,0 +1,103 @@

# Backup Restore Status Patch Retrying Configuration

## Abstract
When a backup or restore completes, we want to ensure that its custom resource progresses to the correct status.
If a patch call fails to update the status to completion, it should be retried up to a certain time limit.

This design proposes a way to configure the timeout that bounds these retries.

## Background
Original issue: https://github.com/vmware-tanzu/velero/issues/7207

Velero was performing a restore while the API server was rolling out to a new version.
It had trouble connecting to the API server, but the restore eventually succeeded.
However, since the API server was still in the middle of rolling out, Velero failed to update the restore CR status and gave up.

After the connection was restored, Velero did not attempt the update again, leaving the restore CR stuck at "InProgress" indefinitely.
This can lead to incorrect decisions by other components that rely on the backup/restore CR status to determine completion.

## Goals
- Make the retry patching timeout configurable by reusing the existing [`--resource-timeout` server flag](https://github.com/vmware-tanzu/velero/blob/d9ca14747925630664c9e4f85a682b5fc356806d/pkg/cmd/server/server.go#L245)

## Non Goals
- Create a new timeout flag
- Refactor the backup/restore workflow


## High-Level Design
We will add retries with a timeout to the existing patch calls that move a backup/restore from InProgress to a final status such as:
- Completed
- PartiallyFailed
- Failed
- FailedValidation


## Detailed Design
Relevant reconcilers will have a `resourceTimeout time.Duration` field added to their structs and to the parameters of the New[Backup|Restore]XReconciler functions (a sketch of this wiring follows the listings below).

In pkg/cmd/server/server.go, `func (s *server) runControllers(..) error` will also be updated to pass the added duration parameter to the New[Backup|Restore]XReconciler calls.

Current calls to kube.PatchResource that patch status will be replaced with kube.PatchResourceWithRetriesOnErrors, added to package `kube` below.

Calls that go through ...client.Patch() directly will be wrapped with client.RetriesPhasePatchFuncOnErrors(), added to package `client` below.

```go
// PatchResourceWithRetries patches the original resource with the updated resource, retrying when the provided retriable function returns true.
// We want retry to last up to 2 minutes: https://github.com/vmware-tanzu/velero/issues/7207#:~:text=A%20two%2Dminute%20retry%20is%20reasonable%20when%20there%20is%20API%20outage%20due%20to%20cert%20rotation.
func PatchResourceWithRetries(maxDuration time.Duration, original, updated client.Object, kbClient client.Client, retriable func(error) bool) error {
	return veleroPkgClient.RetriesPhasePatchFunc(maxDuration, func() error { return PatchResource(original, updated, kbClient) }, retriable)
}

// PatchResourceWithRetriesOnErrors patches the original resource with the updated resource, retrying when the operation returns an error.
func PatchResourceWithRetriesOnErrors(maxDuration time.Duration, original, updated client.Object, kbClient client.Client) error {
	return PatchResourceWithRetries(maxDuration, original, updated, kbClient, func(err error) bool {
		// retry to resolve the connection refused error that may occur when the API server is under heavy load
		// TODO: consider retrying only on specific error types; for now, we retry on all errors
		// specific errors:
		// - connection refused: https://pkg.go.dev/syscall#:~:text=Errno(0x67)-,ECONNREFUSED,-%3D%20Errno(0x6f
		return err != nil
	})
}
```

pkg/client/retry.go
```go
// CapBackoff provides a backoff with a set duration
func CapBackoff(duration time.Duration) wait.Backoff {
	if duration.Nanoseconds() < 0 {
		duration = 0
	}
	return wait.Backoff{
		Steps:    math.MaxInt,
		Duration: duration,
		Factor:   1.0,
		Jitter:   0.0,
	}
}

// RetriesPhasePatchFunc accepts a patch function param, retrying when the provided retriable function returns true.
func RetriesPhasePatchFunc(maxDuration time.Duration, fn func() error, retriable func(error) bool) error {
	return retry.OnError(CapBackoff(maxDuration), func(err error) bool { return retriable(err) }, fn)
}

// RetriesPhasePatchFuncOnErrors accepts a patch function param, retrying when the error is not nil.
func RetriesPhasePatchFuncOnErrors(maxDuration time.Duration, fn func() error) error {
	return RetriesPhasePatchFunc(maxDuration, fn, func(err error) bool { return err != nil })
}
```
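For concreteness, below is a minimal sketch of how one of the affected reconcilers might carry the new timeout and call the helper when moving a restore to its final phase. The `restoreReconciler` struct, the `NewRestoreReconciler` signature, and the `patchFinalStatus` helper are illustrative placeholders rather than the actual Velero controller code, and the import paths simply reflect the package layout this design assumes.

```go
package controller

import (
	"time"

	"github.com/sirupsen/logrus"
	kbclient "sigs.k8s.io/controller-runtime/pkg/client"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"github.com/vmware-tanzu/velero/pkg/util/kube"
)

// restoreReconciler is an illustrative stand-in for the real reconcilers; it
// carries the timeout taken from the existing --resource-timeout server flag.
type restoreReconciler struct {
	kbClient        kbclient.Client
	logger          logrus.FieldLogger
	resourceTimeout time.Duration // new field plumbed in from server.runControllers
}

// NewRestoreReconciler shows the constructor gaining the extra duration parameter.
func NewRestoreReconciler(kbClient kbclient.Client, logger logrus.FieldLogger, resourceTimeout time.Duration) *restoreReconciler {
	return &restoreReconciler{
		kbClient:        kbClient,
		logger:          logger,
		resourceTimeout: resourceTimeout,
	}
}

// patchFinalStatus replaces a one-shot kube.PatchResource call: the status patch
// that moves the restore from InProgress to Completed/PartiallyFailed/Failed is
// now retried on error for up to resourceTimeout.
func (r *restoreReconciler) patchFinalStatus(original, updated *velerov1api.Restore) error {
	if err := kube.PatchResourceWithRetriesOnErrors(r.resourceTimeout, original, updated, r.kbClient); err != nil {
		r.logger.WithError(err).Errorf("error updating restore %s final status", original.Name)
		return err
	}
	return nil
}
```

The backup-side reconcilers would follow the same pattern, with `runControllers` passing the value already parsed from `--resource-timeout` into each constructor.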
## Alternatives Considered
- Requeuing InProgress backups that the current Velero instance does not know to still be in progress, and marking them as failed (attempted in [#7863](https://github.com/vmware-tanzu/velero/pull/7863))
  - This was deemed to make the backup/restore flow harder to enhance with future reconciler updates, such as adding cancellation or parallel backups.

## Security Considerations
None

## Compatibility
Retries are only triggered for a backup or restore that the current instance already has in progress and is failing to patch successfully. Pre-existing InProgress backups/restores will not be re-processed and will remain stuck InProgress until there is another Velero server (re)start.

## Implementation
The implementation of this design will be based on the prior work in [#7845](https://github.com/vmware-tanzu/velero/pull/7845/).

## Open Issues
None