Ability to execute only some health checks #81

johnament · 2017-08-27T16:14:02Z

It would be useful in the wire format if we specified a way to execute only some health checks, rather than all discovered @Health health checks.

I could see this being done from both client and server side. From client perspective, listing out a set of names to run, or a group name to run. On the server, I could see marker interfaces, annotations, or even a simple use of class name/@Inject @Named support.

The text was updated successfully, but these errors were encountered:

dfbadawi · 2018-06-12T22:33:21Z

@johnament

Since @Health has been deprecated, so maybe defining a new default method in HealthCheck interface in order to define the group name?

default String group() {
    return null; //not a member of any group
}

mkouba · 2019-02-14T09:27:15Z

Copied from my comment on PR #142:

I would also recommend to add a new annotation to specify custom health checks groups, something like @HealthCheckGroup("checksA") (but find a better name ;-). @Named is probably not appropriate as it has side effects (named beans are accessible in EL etc.). This way it would be possible to select only specific checks for /health/{groupName} endpoints, using @Inject @Any Instance<HealthCheck> and instance.select(HealthCheckGroup.Literal.of("checksA")).

cescoffier · 2019-02-14T09:43:41Z

I really like the group idea.

mkouba · 2019-02-14T10:42:14Z

In fact, readyness and liveness are de facto also groups but it's convenient to define qualifiers for these special groups.

antoinesd · 2019-03-29T11:15:26Z

Group idea is interesting. We could even imagine slightly change @Health annotation usage to define these groups with it. WDYT ?

antoinesd · 2019-09-03T14:36:53Z

With Liveness and Readiness question is more about creating a custom type of HC

antoinesd · 2020-01-07T15:19:34Z

So we go for HealthCheck groups.
urls will have the following pattern : /health/group/{groupName}
@HealthGroup qualifier will support repeatable annotation

antoinesd · 2020-01-29T09:08:38Z

Discussion reopened after community concerns

aguibert · 2020-01-30T03:46:38Z

I'd like to propose an alternative approach that would achieve a similar result but without increasing the API surface of the spec much. Currently a health check can have multiple checks in it like so:

  {
    "status": "DOWN",
    "checks": [
      {
        "name": "check1",
        "status": "UP"
      },
      {
          "name": "check2",
          "status": "UP"
      },
      {
          "name": "check3",
          "status": "DOWN"
      }
    ]
  }

Instead of making the application define a group name and having the observer use that group name (i.e. /health/group/myGroup) the observer could add filter parameters to their request.

For example, using the above example suppose I just want the status of check2 and check3, I could make a request to /health?check=check2&check=check3 and I would get the result:

  {
    "status": "DOWN",
    "checks": [
      {
          "name": "check2",
          "status": "UP"
      },
      {
          "name": "check3",
          "status": "DOWN"
      }
    ]
  }

I think this approach has several benefits:

Observers can easily discover possible check names by first invoking /health
Observers can request any arbitrary subset of checks they want without requiring application code changes
Application developer does not need to consider health check groupings while writing the application

pgunapal · 2020-01-30T15:35:26Z

@aguibert your alternative approach definitely solves the issues that was brought up, however only have one concern...

The request URL might become very long if we want multiple checks in the filter parameters
e.g. /health?check=check1?check=check2?check=check3?check=check4?check=check5....

Maybe, we could still use the @HealthGroup(/"name/"), but when we access the /health endpoint, we can list out the groupNames associated with each check in the JSON response as well. This will allow Observers to easily discover all the available group names for each check.

aguibert · 2020-01-30T15:52:42Z

I don't think long URLs is necessarily an issue, and IMO it's not enough reason to have an entirely separate mechanism (i.e. health groups)

xstefank · 2020-01-30T16:04:46Z

Both approaches can be combined and we can allow users to choose. Maybe we can ask what would be the preference on the microprofile mailing list.

Also couldn't it be more problematic to share concrete names of the health checks with everyone that is calling the service (instead of 3 groups I need to pass the names of 12 HC classes)? And not so easy to add or remove a check from a group without breaking all clients. My 2c.

aguibert · 2020-01-30T16:28:12Z

Unless there is a good reason, I think it's better to only give users one way to accomplish something, otherwise it can cause confusion.

Sharing concrete names could be an issue, but I see 2 generic types of health checks:

provider-defined checks (e.g. Quarkus built-in health check for Datasource)
application-defined checks

Provider-defined checks will be stable due to providers not wanting to disrupt their users (changing HC name would be a breaking change). For app-defined checks, developers can already "group" multiple health checks into a single check if they want to with existing MP Health API.

The most common use-case I see here is an observer wanting to observe >=1 provider-defined health checks (stable names) and 0/1 app-defined health checks.

xstefank · 2020-01-30T16:41:02Z

@aguibert I agree that one option will be better.

Sharing concrete names could be an issue, but I see 2 generic types of health checks:

provider-defined checks (e.g. Quarkus built-in health check for Datasource)
application-defined checks

provider-defined HC can already be disabled by mp.health.disable-default-procedures so we are talking here really only about application-defined HCs. We didn't account for groups of vendor-defined groups (if the vendor doesn't define the health group themselves).

For app-defined checks, developers can already "group" multiple health checks into a single check if they want to with existing MP Health API.

Exactly. Readiness and Liveness are already groups. The proposal is to extend these two groups with custom app-defined groups which are not suitable for readiness or liveness checks.

aguibert · 2020-01-30T16:55:32Z

provider-defined HC can already be disabled by mp.health.disable-default-procedures so we are talking here really only about application-defined HCs. We didn't account for groups of vendor-defined groups (if the vendor doesn't define the health group themselves).

The mp.health.disable-default-procedures is only a boolean option so you cannot choose which built-in checks you want to keep on.

Exactly. Readiness and Liveness are already groups. The proposal is to extend these two groups with custom app-defined groups which are not suitable for readiness or liveness checks.

If someone uses the regular @Health annotation does that fall outside of the readiness/liveness groups? If so, that seems like the solution the the problem. I know the anno is deprecated now, but perhaps it could be un-deprecated.

xstefank · 2020-01-30T16:57:30Z

If someone uses the regular @health annotation does that fall outside of the readiness/liveness groups? If so, that seems like the solution the the problem. I know the anno is deprecated now, but perhaps it could be un-deprecated.

Yes, @health is not included in any of ready or live checks. The plan is to remove @health in 3.0 in June. What do you mean that it would be a solution to this problem?

aguibert · 2020-01-30T17:09:42Z

It seems to me like it's a solution to the problem because that's a health check a user can write which falls outside of the category of liveness or readiness groups. Additionally a user can give the response for an @Health annotated HealthCheck a name, which is effectively a "health group" if the developer desires.

donbourne · 2020-01-30T22:40:06Z

@aguibert , I think what you are getting at is you could...

define non-liveness/readiness healthchecks using @Health
and then...
be able to run any health check (liveness, readiness, or other) by specifying the health check name in the querystring.

I think we still need to settle on the prime use case for this though. Is it so that a client can decide which health checks they want to run for deeper problem determination in cases where those health checks may be too expensive to run all the time as part of liveness/readiness probes? or something else?

donbourne · 2020-01-30T22:41:31Z

... and if that is the use case, perhaps we would also need a way to query the list of health checks without running them.

johnament · 2020-01-30T22:48:44Z

Perhaps by executing an HTTP OPTIONS an external caller can discover the list of checks available?

NottyCode · 2020-01-31T01:02:58Z

@johnament Are you able to describe the primary use case you were thinking of when you opened this? I think that would help to work out where a client or server approach is better and from that whether things like a check query api is necessary.

I know that Kubernetes isn’t the only valid use case, but that is a use case I understand and can get my head around. I think the usage scenario of this is less obvious. I’m not saying I doubt that a use case exists, just that I’m struggling to understand one based on this issue.

johnament · 2020-01-31T04:15:24Z

Originally when I had this, it was actually for two reasons. The first was to differentiate liveliness and readiness checks (granted, when I first implemented this capability in the platform I was working on, it was simply described as I need a way to run these checks and those checks, and when it came down to seeing what checks they were it was the difference between checking liveliness and readiness). However, after that we actually realized something useful in our cloud monitoring capabilities, by exposing the list of checks to execute we can actually do things like narrow down nodes that were unavailable or pieces of our system that were failing due to the individual checks that would pass/fail.

I think now that liveliness/readiness are effectively the same as a health check group. Having other groups may make sense, and having the application dictate "this is what it means for my service to be up and operating within reason" or "this is what it means for me to start servicing new requests" makes a ton of sense. However, when it comes to an ops team needing to identify what part of a system is failing, they sometimes need to be able to specify only a certain check to run and want to focus on that one problem area (because sometimes, even the best developer leaves an accidental dependency on two external components in their code instead of just one). Being able to list out the available checks makes sense, but honestly I was thinking the execution of a single check is by a person, not by a system, and more likely than not that person would know the name of the check they want to execute and not have to rely on calling an API (then you have to start to consider that a developer may name their check check1234081 instead of a human readable hazelcast-cluster-connection as well as provide a describe of what the check looks for).

FWIW, the way we did this back in the day (when I first put in this ticket) was a query string, /health?checks=a,b,c

migoldschmidt · 2023-03-19T23:26:15Z

Hello @ALL,
sorry for my comment on this. Years ago I found the blogpost of @xstefank on health-groups and I was excited to find such a great feature in later MP Health spec versions. Recently I checked back and find this issue still open.

From my understanding there was some confusion in #232 regarding the use case of introducing HealthGroups, correct?

I think the main reason for such confusion is the different view on technical and functional health of a service. From my point of view it is all about monitoring, service restoration and different requirements on how to handle health errors.

So I'll try to explain it.
Technical service health issues happen within a service and should trigger some automation e.g. a kubernetes cluster to keep traffic away or even restart my service. Use them if something went wrong and could be fixed by automation like to restart the service on OOM errors.
Functional service health issues on the other hand should trigger other automation to get things fixed, even monitor with alarms and manually fixes are fine. But in such cases the service should not be restarted and also could receive further traffic as well. Also handling such issues is strongly related to the service and its capabilities. For example if some subservices fail which provide data but my service is able to use cached data as a fallback. Then such issues should decrease my service health e.g. state WARN but not kill the whole service.

In both cases my overall service health is bad, but different things need to be done to restore the service health .
If it is an technical error which can be fixed from kubernetes cluster manager with
A) a restart -> fail on liveness,
B) decreasing traffic -> fail on Readiness.
If it is some other error which cannot be fixed within the service or the kubernetes cluster manager, and possibly needs some manual interaction then
C) fail in health, but do not restart.

BY now C is not possible in MP Health (please interrupt me, if I am wrong).
I think for such reason some kind of custom health groups are a great choice and could be used by other monitoring tools in case the groups could be queried correspondingly.

I am not an expert on this but from my point of view Spring Boot does a great job explaining and reasoning about the usage of health groups, and comparing them with Readiness & Liveness endpoints.
https://docs.spring.io/spring-boot/docs/3.0.4/reference/htmlsingle/#actuator.endpoints.health.groups
Besides I am a fan of their cusomt-health path-style in favor of the proposed query string ;) but both solutions have their benefits.

Any chance to get such a feature in MP Health as well? I am absolutely sure devs out there will make use of it!

Thanks for such a great job on MP. I really love this spec!
Cheers Michael

xstefank · 2023-03-21T12:59:45Z

Hi @migoldschmidt, I understand and can tell you that now I agree. We have had health groups in Quarkus/Smallrye Health for some time, and they are being used - https://smallrye.io/docs/smallrye-health/3.0.1/health-groups.html. So if you fancy doing a PR, that would be great.

derekm mentioned this issue Oct 18, 2018

Don't deprecate the @Health annotation #148

Closed

mkouba mentioned this issue Feb 14, 2019

microprofile-health-35 Provide different types of health check #142

Merged

antoinesd modified the milestone: 2.0 Mar 29, 2019

antoinesd modified the milestones: 3.0, 2.2 Sep 3, 2019

antoinesd mentioned this issue Jan 10, 2020

microprofile-health-81 Ability to execute only some health checks #221

Merged

antoinesd closed this as completed in #221 Jan 14, 2020

xstefank mentioned this issue Jan 24, 2020

Health Group introduced caused more confusion rather than helping #232

Closed

antoinesd reopened this Jan 29, 2020

antoinesd modified the milestones: 2.2, 3.0 Jan 29, 2020

antoinesd mentioned this issue Jan 29, 2020

Revert @HealthGroup Introduction #236

Merged

xstefank modified the milestones: 3.0, 3.1 May 19, 2020

xstefank removed this from the 3.1 milestone May 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ability to execute only some health checks #81

Ability to execute only some health checks #81

johnament commented Aug 27, 2017

dfbadawi commented Jun 12, 2018 •

edited

Loading

mkouba commented Feb 14, 2019

cescoffier commented Feb 14, 2019

mkouba commented Feb 14, 2019

antoinesd commented Mar 29, 2019

antoinesd commented Sep 3, 2019

antoinesd commented Jan 7, 2020

antoinesd commented Jan 29, 2020

aguibert commented Jan 30, 2020

pgunapal commented Jan 30, 2020 •

edited

Loading

aguibert commented Jan 30, 2020

xstefank commented Jan 30, 2020

aguibert commented Jan 30, 2020

xstefank commented Jan 30, 2020 •

edited

Loading

aguibert commented Jan 30, 2020

xstefank commented Jan 30, 2020

aguibert commented Jan 30, 2020

donbourne commented Jan 30, 2020

donbourne commented Jan 30, 2020

johnament commented Jan 30, 2020

NottyCode commented Jan 31, 2020

johnament commented Jan 31, 2020

migoldschmidt commented Mar 19, 2023

xstefank commented Mar 21, 2023

Ability to execute only some health checks #81

Ability to execute only some health checks #81

Comments

johnament commented Aug 27, 2017

dfbadawi commented Jun 12, 2018 • edited Loading

mkouba commented Feb 14, 2019

cescoffier commented Feb 14, 2019

mkouba commented Feb 14, 2019

antoinesd commented Mar 29, 2019

antoinesd commented Sep 3, 2019

antoinesd commented Jan 7, 2020

antoinesd commented Jan 29, 2020

aguibert commented Jan 30, 2020

pgunapal commented Jan 30, 2020 • edited Loading

aguibert commented Jan 30, 2020

xstefank commented Jan 30, 2020

aguibert commented Jan 30, 2020

xstefank commented Jan 30, 2020 • edited Loading

aguibert commented Jan 30, 2020

xstefank commented Jan 30, 2020

aguibert commented Jan 30, 2020

donbourne commented Jan 30, 2020

donbourne commented Jan 30, 2020

johnament commented Jan 30, 2020

NottyCode commented Jan 31, 2020

johnament commented Jan 31, 2020

migoldschmidt commented Mar 19, 2023

xstefank commented Mar 21, 2023

dfbadawi commented Jun 12, 2018 •

edited

Loading

pgunapal commented Jan 30, 2020 •

edited

Loading

xstefank commented Jan 30, 2020 •

edited

Loading