Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to execute only some health checks #81

Open
johnament opened this issue Aug 27, 2017 · 24 comments · Fixed by #221
Open

Ability to execute only some health checks #81

johnament opened this issue Aug 27, 2017 · 24 comments · Fixed by #221

Comments

@johnament
Copy link
Contributor

It would be useful in the wire format if we specified a way to execute only some health checks, rather than all discovered @Health health checks.

I could see this being done from both client and server side. From client perspective, listing out a set of names to run, or a group name to run. On the server, I could see marker interfaces, annotations, or even a simple use of class name/@Inject @Named support.

@dfbadawi
Copy link

dfbadawi commented Jun 12, 2018

@johnament

Since @Health has been deprecated, so maybe defining a new default method in HealthCheck interface in order to define the group name?

default String group() {
    return null; //not a member of any group
}

@mkouba
Copy link

mkouba commented Feb 14, 2019

Copied from my comment on PR #142:

I would also recommend to add a new annotation to specify custom health checks groups, something like @HealthCheckGroup("checksA") (but find a better name ;-). @Named is probably not appropriate as it has side effects (named beans are accessible in EL etc.). This way it would be possible to select only specific checks for /health/{groupName} endpoints, using @Inject @Any Instance<HealthCheck> and instance.select(HealthCheckGroup.Literal.of("checksA")).

@cescoffier
Copy link
Contributor

I really like the group idea.

@mkouba
Copy link

mkouba commented Feb 14, 2019

In fact, readyness and liveness are de facto also groups but it's convenient to define qualifiers for these special groups.

@antoinesd
Copy link
Contributor

Group idea is interesting. We could even imagine slightly change @Health annotation usage to define these groups with it. WDYT ?

@antoinesd antoinesd modified the milestone: 2.0 Mar 29, 2019
@antoinesd antoinesd modified the milestones: 3.0, 2.2 Sep 3, 2019
@antoinesd
Copy link
Contributor

With Liveness and Readiness question is more about creating a custom type of HC

@antoinesd
Copy link
Contributor

So we go for HealthCheck groups.
urls will have the following pattern : /health/group/{groupName}
@HealthGroup qualifier will support repeatable annotation

@antoinesd
Copy link
Contributor

Discussion reopened after community concerns

@antoinesd antoinesd reopened this Jan 29, 2020
@antoinesd antoinesd modified the milestones: 2.2, 3.0 Jan 29, 2020
@aguibert
Copy link

I'd like to propose an alternative approach that would achieve a similar result but without increasing the API surface of the spec much. Currently a health check can have multiple checks in it like so:

  {
    "status": "DOWN",
    "checks": [
      {
        "name": "check1",
        "status": "UP"
      },
      {
          "name": "check2",
          "status": "UP"
      },
      {
          "name": "check3",
          "status": "DOWN"
      }
    ]
  }

Instead of making the application define a group name and having the observer use that group name (i.e. /health/group/myGroup) the observer could add filter parameters to their request.

For example, using the above example suppose I just want the status of check2 and check3, I could make a request to /health?check=check2&check=check3 and I would get the result:

  {
    "status": "DOWN",
    "checks": [
      {
          "name": "check2",
          "status": "UP"
      },
      {
          "name": "check3",
          "status": "DOWN"
      }
    ]
  }

I think this approach has several benefits:

  • Observers can easily discover possible check names by first invoking /health
  • Observers can request any arbitrary subset of checks they want without requiring application code changes
  • Application developer does not need to consider health check groupings while writing the application

@pgunapal
Copy link
Contributor

pgunapal commented Jan 30, 2020

@aguibert your alternative approach definitely solves the issues that was brought up, however only have one concern...

The request URL might become very long if we want multiple checks in the filter parameters
e.g. /health?check=check1?check=check2?check=check3?check=check4?check=check5....

Maybe, we could still use the @HealthGroup(/"name/"), but when we access the /health endpoint, we can list out the groupNames associated with each check in the JSON response as well. This will allow Observers to easily discover all the available group names for each check.

@aguibert
Copy link

I don't think long URLs is necessarily an issue, and IMO it's not enough reason to have an entirely separate mechanism (i.e. health groups)

@xstefank
Copy link
Member

Both approaches can be combined and we can allow users to choose. Maybe we can ask what would be the preference on the microprofile mailing list.

Also couldn't it be more problematic to share concrete names of the health checks with everyone that is calling the service (instead of 3 groups I need to pass the names of 12 HC classes)? And not so easy to add or remove a check from a group without breaking all clients. My 2c.

@aguibert
Copy link

Unless there is a good reason, I think it's better to only give users one way to accomplish something, otherwise it can cause confusion.

Sharing concrete names could be an issue, but I see 2 generic types of health checks:

  • provider-defined checks (e.g. Quarkus built-in health check for Datasource)
  • application-defined checks

Provider-defined checks will be stable due to providers not wanting to disrupt their users (changing HC name would be a breaking change). For app-defined checks, developers can already "group" multiple health checks into a single check if they want to with existing MP Health API.

The most common use-case I see here is an observer wanting to observe >=1 provider-defined health checks (stable names) and 0/1 app-defined health checks.

@xstefank
Copy link
Member

xstefank commented Jan 30, 2020

@aguibert I agree that one option will be better.

Sharing concrete names could be an issue, but I see 2 generic types of health checks:

provider-defined checks (e.g. Quarkus built-in health check for Datasource)
application-defined checks

provider-defined HC can already be disabled by mp.health.disable-default-procedures so we are talking here really only about application-defined HCs. We didn't account for groups of vendor-defined groups (if the vendor doesn't define the health group themselves).

For app-defined checks, developers can already "group" multiple health checks into a single check if they want to with existing MP Health API.

Exactly. Readiness and Liveness are already groups. The proposal is to extend these two groups with custom app-defined groups which are not suitable for readiness or liveness checks.

@aguibert
Copy link

provider-defined HC can already be disabled by mp.health.disable-default-procedures so we are talking here really only about application-defined HCs. We didn't account for groups of vendor-defined groups (if the vendor doesn't define the health group themselves).

The mp.health.disable-default-procedures is only a boolean option so you cannot choose which built-in checks you want to keep on.

Exactly. Readiness and Liveness are already groups. The proposal is to extend these two groups with custom app-defined groups which are not suitable for readiness or liveness checks.

If someone uses the regular @Health annotation does that fall outside of the readiness/liveness groups? If so, that seems like the solution the the problem. I know the anno is deprecated now, but perhaps it could be un-deprecated.

@xstefank
Copy link
Member

If someone uses the regular @health annotation does that fall outside of the readiness/liveness groups? If so, that seems like the solution the the problem. I know the anno is deprecated now, but perhaps it could be un-deprecated.

Yes, @health is not included in any of ready or live checks. The plan is to remove @health in 3.0 in June. What do you mean that it would be a solution to this problem?

@aguibert
Copy link

It seems to me like it's a solution to the problem because that's a health check a user can write which falls outside of the category of liveness or readiness groups. Additionally a user can give the response for an @Health annotated HealthCheck a name, which is effectively a "health group" if the developer desires.

@donbourne
Copy link
Member

@aguibert , I think what you are getting at is you could...

  1. define non-liveness/readiness healthchecks using @Health
    and then...
  2. be able to run any health check (liveness, readiness, or other) by specifying the health check name in the querystring.

I think we still need to settle on the prime use case for this though. Is it so that a client can decide which health checks they want to run for deeper problem determination in cases where those health checks may be too expensive to run all the time as part of liveness/readiness probes? or something else?

@donbourne
Copy link
Member

... and if that is the use case, perhaps we would also need a way to query the list of health checks without running them.

@johnament
Copy link
Contributor Author

Perhaps by executing an HTTP OPTIONS an external caller can discover the list of checks available?

@NottyCode
Copy link
Member

@johnament Are you able to describe the primary use case you were thinking of when you opened this? I think that would help to work out where a client or server approach is better and from that whether things like a check query api is necessary.

I know that Kubernetes isn’t the only valid use case, but that is a use case I understand and can get my head around. I think the usage scenario of this is less obvious. I’m not saying I doubt that a use case exists, just that I’m struggling to understand one based on this issue.

@johnament
Copy link
Contributor Author

Originally when I had this, it was actually for two reasons. The first was to differentiate liveliness and readiness checks (granted, when I first implemented this capability in the platform I was working on, it was simply described as I need a way to run these checks and those checks, and when it came down to seeing what checks they were it was the difference between checking liveliness and readiness). However, after that we actually realized something useful in our cloud monitoring capabilities, by exposing the list of checks to execute we can actually do things like narrow down nodes that were unavailable or pieces of our system that were failing due to the individual checks that would pass/fail.

I think now that liveliness/readiness are effectively the same as a health check group. Having other groups may make sense, and having the application dictate "this is what it means for my service to be up and operating within reason" or "this is what it means for me to start servicing new requests" makes a ton of sense. However, when it comes to an ops team needing to identify what part of a system is failing, they sometimes need to be able to specify only a certain check to run and want to focus on that one problem area (because sometimes, even the best developer leaves an accidental dependency on two external components in their code instead of just one). Being able to list out the available checks makes sense, but honestly I was thinking the execution of a single check is by a person, not by a system, and more likely than not that person would know the name of the check they want to execute and not have to rely on calling an API (then you have to start to consider that a developer may name their check check1234081 instead of a human readable hazelcast-cluster-connection as well as provide a describe of what the check looks for).

FWIW, the way we did this back in the day (when I first put in this ticket) was a query string, /health?checks=a,b,c

@xstefank xstefank modified the milestones: 3.0, 3.1 May 19, 2020
@xstefank xstefank removed this from the 3.1 milestone May 4, 2021
@migoldschmidt
Copy link

Hello @ALL,
sorry for my comment on this. Years ago I found the blogpost of @xstefank on health-groups and I was excited to find such a great feature in later MP Health spec versions. Recently I checked back and find this issue still open.

From my understanding there was some confusion in #232 regarding the use case of introducing HealthGroups, correct?

I think the main reason for such confusion is the different view on technical and functional health of a service. From my point of view it is all about monitoring, service restoration and different requirements on how to handle health errors.

So I'll try to explain it.
Technical service health issues happen within a service and should trigger some automation e.g. a kubernetes cluster to keep traffic away or even restart my service. Use them if something went wrong and could be fixed by automation like to restart the service on OOM errors.
Functional service health issues on the other hand should trigger other automation to get things fixed, even monitor with alarms and manually fixes are fine. But in such cases the service should not be restarted and also could receive further traffic as well. Also handling such issues is strongly related to the service and its capabilities. For example if some subservices fail which provide data but my service is able to use cached data as a fallback. Then such issues should decrease my service health e.g. state WARN but not kill the whole service.

In both cases my overall service health is bad, but different things need to be done to restore the service health .
If it is an technical error which can be fixed from kubernetes cluster manager with
A) a restart -> fail on liveness,
B) decreasing traffic -> fail on Readiness.
If it is some other error which cannot be fixed within the service or the kubernetes cluster manager, and possibly needs some manual interaction then
C) fail in health, but do not restart.

BY now C is not possible in MP Health (please interrupt me, if I am wrong).
I think for such reason some kind of custom health groups are a great choice and could be used by other monitoring tools in case the groups could be queried correspondingly.

I am not an expert on this but from my point of view Spring Boot does a great job explaining and reasoning about the usage of health groups, and comparing them with Readiness & Liveness endpoints.
https://docs.spring.io/spring-boot/docs/3.0.4/reference/htmlsingle/#actuator.endpoints.health.groups
Besides I am a fan of their cusomt-health path-style in favor of the proposed query string ;) but both solutions have their benefits.

Any chance to get such a feature in MP Health as well? I am absolutely sure devs out there will make use of it!

Thanks for such a great job on MP. I really love this spec!
Cheers Michael

@xstefank
Copy link
Member

Hi @migoldschmidt, I understand and can tell you that now I agree. We have had health groups in Quarkus/Smallrye Health for some time, and they are being used - https://smallrye.io/docs/smallrye-health/3.0.1/health-groups.html. So if you fancy doing a PR, that would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.