Unit goes into Blocked status when relation schema validation fails #223

Open
beliaev-maksim opened this issue Jun 20, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@beliaev-maksim
Member

beliaev-maksim commented Jun 20, 2023

After bumping the SDI version, we had to introduce extra error handling around the get_interfaces call to catch a possible RelationDataError.

When this error is caught, the unit goes into Blocked status:

except RelationDataError as err:
    raise ErrorWithStatus(str(err), BlockedStatus)

kfp-api/0*                    blocked   idle   10.1.147.102                    Failed to replan

This is not ideal, however, since there is no possible user action to unblock it.

@phoevos phoevos changed the title from "unit went into Blocked status with Failed to replan" to "Unit goes into Blocked status when relation schema validation fails" on Jun 28, 2023
@phoevos
Contributor

phoevos commented Jun 28, 2023

Explanation

Calling get_interfaces eventually results in the app attempting to retrieve the relation data (SDI.get_data()) and unwrap it. On the SDI side, unwrapping the relation data entails validating the collected data against the JSON schema. If required relation fields are missing (i.e. the relation data is partial), a RelationDataError is raised.

The above could occur in a couple of cases:

  • The relation is misconfigured / the remote app does not provide the required data.
  • The remote app updates the relation databag gradually (i.e. at different points in time), leaving our charm with partial data if it attempts to read from the relation before all required values have been set.

There is not much we can do in the first case, but we should make sure that the second (i.e. transient errors) is handled properly. When we expect the databag to be completed at some point, we should put the unit in Waiting status and ensure that the call to retrieve the data is repeated.
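To make the suggestion concrete, here is a minimal sketch of mapping the error to Waiting instead of Blocked. The classes below are local stand-ins so the snippet runs on its own; in a real charm they would come from SDI, chisme and ops. The helper name `read_relation_data` and the field name in the example are hypothetical. It also assumes that a later relation event re-runs the handler once the remote app has finished writing the databag, which is what repeats the call.

```python
class RelationDataError(Exception):
    """Stand-in for the error raised when required relation fields are missing."""


class WaitingStatus:
    """Stand-in for a Waiting unit status."""

    name = "waiting"

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for an error that carries the unit status to set."""

    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.msg = msg
        self.status = status_type(msg)


def read_relation_data(get_data):
    """Call get_data() and turn partial-data errors into a Waiting status.

    get_data is any callable that returns the unwrapped relation data or
    raises RelationDataError (hypothetical wrapper around the SDI call).
    """
    try:
        return get_data()
    except RelationDataError as err:
        # Waiting rather than Blocked: no user action can fix this, but the
        # remote app may still complete the databag.
        raise ErrorWithStatus(str(err), WaitingStatus) from err


def partial_databag():
    # Hypothetical failure: a required field has not been written yet.
    raise RelationDataError("missing required field 'service-port'")


try:
    read_relation_data(partial_databag)
except ErrorWithStatus as err:
    print(err.status.name, "-", err.status.message)
```

The caller that already converts ErrorWithStatus into a unit status would not need to change; only the status type passed at the raise site differs.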

@ca-scribner
Contributor

Is there a way to distinguish the two situations from each other? I can't think of one.

Personally, I'd think of the second scenario (gradual databag update) as a bug. If our charms are doing this, they're violating the schema by omitting required attributes. If Juju is doing this (which I don't think it is known to do), that's a Juju bug.

@ca-scribner
Contributor

To @beliaev-maksim's original point though, I agree that we should add this handling to all our charms.

To make this more reusable, we could also put it into chisme as a context manager that catches these different errors. That way we could define it in one place.
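A rough sketch of what such a context manager could look like, not an existing chisme API. The name `relation_errors_as_status` and its parameters are hypothetical, and the error and status classes are local stand-ins so the snippet is self-contained.

```python
from contextlib import contextmanager


class RelationDataError(Exception):
    """Stand-in for the relation schema validation error."""


class WaitingStatus:
    """Stand-in for a Waiting unit status."""

    name = "waiting"

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for an error that carries the unit status to set."""

    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.msg = msg
        self.status = status_type(msg)


@contextmanager
def relation_errors_as_status(status_type, errors=(RelationDataError,)):
    """Re-raise the listed relation errors as ErrorWithStatus(status_type).

    Hypothetical helper: the errors tuple is where the other relation-related
    exceptions would be listed, so the mapping lives in one place.
    """
    try:
        yield
    except errors as err:
        raise ErrorWithStatus(str(err), status_type) from err


def get_interfaces_stub():
    # Hypothetical stand-in for the real call failing on partial data.
    raise RelationDataError("relation data failed schema validation")


try:
    with relation_errors_as_status(WaitingStatus):
        get_interfaces_stub()
except ErrorWithStatus as err:
    print(err.status.name, "-", err.msg)
```

Exceptions not listed in `errors` pass through unchanged, so unrelated failures are not silently converted into a status.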

@ca-scribner ca-scribner added the bug Something isn't working label Jul 5, 2023

3 participants