Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry failed runs with ObserveAndDelete policy #301

Merged

Conversation

d-honeybadger
Copy link
Collaborator

@d-honeybadger d-honeybadger commented Jan 31, 2024

Description of your changes

Fixes #295
On backoff: Since Observe returns error whenever a retry run fails, crossplane-runtime will automatically apply backoff
We don't have control over the interval, max steps etc., but it's going to be the same default backoff that crossplane-runtime applies whenever Update fails, so we should have the bahavior that matches CheckWhenObserve policy's updates, and also matches other providers.

I have:

  • Read and followed Crossplane's [contribution process].
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

  1. Deploy modified version of provider-ansible in a k8s cluster (make local.xpkg.deploy.provider.provider-ansible)
  2. Create an ansiblerun w/ the default run policy ObserveAndDelete and a playbook that will fail. Here's my example:
apiVersion: ansible.crossplane.io/v1alpha1
kind: AnsibleRun
metadata:
  annotations:
  name: test-run
spec:
  forProvider:
    executableInventory: false
    inventoryInline: |
      cluster:
        hosts:
          test:
            ansible_host: <VM IP>
    playbookInline: |
      - hosts: cluster
        tasks:
        - file:
            path: /test-file-missing
            state: present
    vars:
      ansible_ssh_private_key_file: ./ssh_id
  providerConfigRef:
    name: test-config
3. Watch the logs to see retries
4. Fix the playbook, e.g.:
 playbookInline: |
      - hosts: cluster
        tasks:
        - file:
            path: /test-file-missing
            state: touch
5. Check the logs to see that, after a successful run, no more runs happen during regular sync.

[contribution process]: https://git.io/fj2m9

@@ -42,6 +43,7 @@ func main() {
timeout = app.Flag("timeout", "Controls how long Ansible processes may run before they are killed.").Default("20m").Duration()
leaderElection = app.Flag("leader-election", "Use leader election for the controller manager.").Short('l').Default("false").OverrideDefaultFromEnvar("LEADER_ELECTION").Bool()
maxReconcileRate = app.Flag("max-reconcile-rate", "The maximum number of concurrent reconciliation operations.").Default("1").Int()
retryFailed = app.Flag("retry-failed", "Whether to retry failed runs with ObserveAndDelete policy (with CheckWhenObserve, they are retried unconditionally).").Default("false").Bool()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk if this flag is a good idea: retrying failed syncs, to me, is something that a k8s controller should always do, it shouldn't even be configurable. And it's unconditionally retried with CheckWhenObserve policy.
But there might be provider-ansible users who already rely on the existing no-retries behavior, which is the only reason I added the flag. LMK what you'd prefer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. retryFailed seems to be the usual case most of the time.

there might be provider-ansible users who already rely on the existing no-retries behavior

Do you have a concrete use case for this? If not, I'd prefer taking this as default behavior and open another issue to track the potential necessity of introducing this flag.

BTW: It looks the retryFailed flag is not really being used through out the code in this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 yup forgot to consume it in handleLastApplied
Anyway, removed the flag

@@ -76,6 +78,13 @@ func main() {
Features: &feature.Flags{},
}

kingpin.FatalIfError(ansible.Setup(mgr, o, *ansibleCollectionsPath, *ansibleRolesPath, *timeout), "Cannot setup Ansible controllers")
ao := ansiblerun.SetupOptions{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list of parameters that apply only to the ansiblerun controller and not to the config controller was getting too long as I added another flag, so I decided to refactor a bit and out this into a struct

}

if !isUpToDate {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block has just changed indentation cause it was moved out of the if. There aren't actually any logic changes

@d-honeybadger d-honeybadger marked this pull request as ready for review January 31, 2024 20:33
@d-honeybadger
Copy link
Collaborator Author

@morningspace ready for review 🙏

@morningspace
Copy link
Collaborator

@d-honeybadger There are some lint errors and conflicts need to be resolved.

Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>
Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>
Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>
Signed-off-by: Dasha Komsa <komsa.darya@gmail.com>
@d-honeybadger
Copy link
Collaborator Author

@morningspace sorry about the linting errors! Turns out I commented out the linting line in the makefile cause at some point it was broken for me, and of course forgot that I did that...
Got it working now and all linting passing

Copy link
Collaborator

@morningspace morningspace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks awesome.

@morningspace morningspace merged commit 4d74aa6 into crossplane-contrib:main Feb 2, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry failed runs with ObserveAndDelete policy
2 participants