Retry failed runs with ObserveAndDelete policy #301

d-honeybadger · 2024-01-31T20:21:47Z

Description of your changes

Fixes #295
On backoff: Since Observe returns error whenever a retry run fails, crossplane-runtime will automatically apply backoff
We don't have control over the interval, max steps etc., but it's going to be the same default backoff that crossplane-runtime applies whenever Update fails, so we should have the bahavior that matches CheckWhenObserve policy's updates, and also matches other providers.

I have:

Read and followed Crossplane's [contribution process].
Run make reviewable to ensure this PR is ready for review.
Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Deploy modified version of provider-ansible in a k8s cluster (make local.xpkg.deploy.provider.provider-ansible)
Create an ansiblerun w/ the default run policy ObserveAndDelete and a playbook that will fail. Here's my example:

apiVersion: ansible.crossplane.io/v1alpha1
kind: AnsibleRun
metadata:
  annotations:
  name: test-run
spec:
  forProvider:
    executableInventory: false
    inventoryInline: |
      cluster:
        hosts:
          test:
            ansible_host: <VM IP>
    playbookInline: |
      - hosts: cluster
        tasks:
        - file:
            path: /test-file-missing
            state: present
    vars:
      ansible_ssh_private_key_file: ./ssh_id
  providerConfigRef:
    name: test-config
3. Watch the logs to see retries
4. Fix the playbook, e.g.:
 playbookInline: |
      - hosts: cluster
        tasks:
        - file:
            path: /test-file-missing
            state: touch
5. Check the logs to see that, after a successful run, no more runs happen during regular sync.

[contribution process]: https://git.io/fj2m9

d-honeybadger · 2024-01-31T20:26:45Z

cmd/provider/main.go

@@ -42,6 +43,7 @@ func main() {
 		timeout                = app.Flag("timeout", "Controls how long Ansible processes may run before they are killed.").Default("20m").Duration()
 		leaderElection         = app.Flag("leader-election", "Use leader election for the controller manager.").Short('l').Default("false").OverrideDefaultFromEnvar("LEADER_ELECTION").Bool()
 		maxReconcileRate       = app.Flag("max-reconcile-rate", "The maximum number of concurrent reconciliation operations.").Default("1").Int()
+		retryFailed            = app.Flag("retry-failed", "Whether to retry failed runs with ObserveAndDelete policy (with CheckWhenObserve, they are retried unconditionally).").Default("false").Bool()


Idk if this flag is a good idea: retrying failed syncs, to me, is something that a k8s controller should always do, it shouldn't even be configurable. And it's unconditionally retried with CheckWhenObserve policy.
But there might be provider-ansible users who already rely on the existing no-retries behavior, which is the only reason I added the flag. LMK what you'd prefer

I agree. retryFailed seems to be the usual case most of the time.

there might be provider-ansible users who already rely on the existing no-retries behavior

Do you have a concrete use case for this? If not, I'd prefer taking this as default behavior and open another issue to track the potential necessity of introducing this flag.

BTW: It looks the retryFailed flag is not really being used through out the code in this PR.

🤦 yup forgot to consume it in handleLastApplied
Anyway, removed the flag

d-honeybadger · 2024-01-31T20:28:06Z

cmd/provider/main.go

@@ -76,6 +78,13 @@ func main() {
 		Features:                &feature.Flags{},
 	}

-	kingpin.FatalIfError(ansible.Setup(mgr, o, *ansibleCollectionsPath, *ansibleRolesPath, *timeout), "Cannot setup Ansible controllers")
+	ao := ansiblerun.SetupOptions{


The list of parameters that apply only to the ansiblerun controller and not to the config controller was getting too long as I added another flag, so I decided to refactor a bit and out this into a struct

d-honeybadger · 2024-01-31T20:29:16Z

internal/controller/ansibleRun/ansibleRun.go

 	}

-	if !isUpToDate {


This whole block has just changed indentation cause it was moved out of the if. There aren't actually any logic changes

d-honeybadger · 2024-01-31T20:34:27Z

@morningspace ready for review 🙏

morningspace · 2024-02-01T12:38:45Z

@d-honeybadger There are some lint errors and conflicts need to be resolved.

Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>

d-honeybadger · 2024-02-01T15:36:37Z

@morningspace sorry about the linting errors! Turns out I commented out the linting line in the makefile cause at some point it was broken for me, and of course forgot that I did that...
Got it working now and all linting passing

morningspace

It looks awesome.

d-honeybadger commented Jan 31, 2024

View reviewed changes

d-honeybadger marked this pull request as ready for review January 31, 2024 20:33

d-honeybadger force-pushed the retry-on-failure branch from 600c974 to 8fc237f Compare February 1, 2024 15:06

d-honeybadger added 3 commits February 1, 2024 15:27

retry failed ansibleruns with ObserverAndDelete policy

1a7b96a

Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>

add unit tests for retrying failed ObserveAndDelete runs

9395d32

Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>

linting fixes

2d0c627

Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>

d-honeybadger force-pushed the retry-on-failure branch from 8fc237f to 2d0c627 Compare February 1, 2024 15:28

handleLastApplied cyclomatic complexity reduced, remove nolint comment

f54bc38

Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>

d-honeybadger force-pushed the retry-on-failure branch from 7205cf6 to f54bc38 Compare February 1, 2024 15:32

morningspace approved these changes Feb 2, 2024

View reviewed changes

morningspace merged commit 4d74aa6 into crossplane-contrib:main Feb 2, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry failed runs with ObserveAndDelete policy #301

Retry failed runs with ObserveAndDelete policy #301

d-honeybadger commented Jan 31, 2024 •

edited

Loading

d-honeybadger Jan 31, 2024

morningspace Feb 1, 2024

d-honeybadger Feb 1, 2024

d-honeybadger Jan 31, 2024

d-honeybadger Jan 31, 2024

d-honeybadger commented Jan 31, 2024

morningspace commented Feb 1, 2024

d-honeybadger commented Feb 1, 2024

morningspace left a comment

Retry failed runs with ObserveAndDelete policy #301

Retry failed runs with ObserveAndDelete policy #301

Conversation

d-honeybadger commented Jan 31, 2024 • edited Loading

Description of your changes

How has this code been tested

d-honeybadger Jan 31, 2024

Choose a reason for hiding this comment

morningspace Feb 1, 2024

Choose a reason for hiding this comment

d-honeybadger Feb 1, 2024

Choose a reason for hiding this comment

d-honeybadger Jan 31, 2024

Choose a reason for hiding this comment

d-honeybadger Jan 31, 2024

Choose a reason for hiding this comment

d-honeybadger commented Jan 31, 2024

morningspace commented Feb 1, 2024

d-honeybadger commented Feb 1, 2024

morningspace left a comment

Choose a reason for hiding this comment

d-honeybadger commented Jan 31, 2024 •

edited

Loading