add Keycloak SLO docs fixes #579 #1020

Open
wants to merge 2 commits into
base: main
3 changes: 2 additions & 1 deletion doc/kubernetes/modules/ROOT/nav.adoc
Expand Up @@ -14,9 +14,10 @@ include::partial$subnav-openshift.adoc[]
* xref:testing/index.adoc[]
* xref:running/index.adoc[]
** xref:running/infinispan-deployment.adoc[]
** xref:running/timeout_tunning.adoc[]
** xref:running/timeout_tuning.adoc[]
** xref:running/jvm/jvm_options.adoc[]
** Metrics
*** xref:running/metrics/keycloak_service_level_indicators.adoc[]
*** xref:running/metrics/jvm_metrics.adoc[]
*** xref:running/metrics/keycloak_cluster.adoc[]
*** xref:running/metrics/keycloak_with_external_infinispan.adoc[]
3 changes: 2 additions & 1 deletion doc/kubernetes/modules/ROOT/pages/running/index.adoc
Expand Up @@ -16,7 +16,7 @@ These guides will eventually be published Keycloak's main web page.
== Building blocks

* xref:running/infinispan-deployment.adoc[]
* xref:running/timeout_tunning.adoc[]
* xref:running/timeout_tuning.adoc[]

[#jvm-tuning]
== JVM tuning guides
Expand All @@ -26,6 +26,7 @@ These guides will eventually be published Keycloak's main web page.
[#monitoring-deployments]
== Monitoring deployments

* xref:running/metrics/keycloak_service_level_indicators.adoc[]
* xref:running/metrics/jvm_metrics.adoc[]
* xref:running/metrics/keycloak_cluster.adoc[]
* xref:running/metrics/keycloak_with_external_infinispan.adoc[]
@@ -0,0 +1,160 @@
= {project_name} Service Level Indicators
:description: This document contains details of the Service Level Indicators to monitor your {project_name} deployment's performance.

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are essential components in monitoring and maintaining the performance and reliability of {project_name} in production environments.

The Google Site Reliability Engineering book defines these terms as follows:

- A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided.

- A Service Level Objective (SLO) is a target value or range of values for a service level that is measured by an SLI.

By agreeing on these with stakeholders and tracking them, service owners
can ensure that deployments are aligned with users' expectations and that they neither over- nor under-deliver on the service they provide.

== Prerequisites

* Metrics need to be enabled for Keycloak, and the `http-metrics-slos` option needs to be set to the latency threshold used by the SLO defined below.
* A monitoring system collecting the metrics. The following paragraphs assume Prometheus or a similar system is used that supports the PromQL query language.

https://www.keycloak.org/keycloak-benchmark/kubernetes-guide/latest/running/metrics/keycloak_cluster#processing-time[More details about the metrics are captured here.]

== Definition of the service delivered

The following service definition is used in the next steps to identify the appropriate SLIs and SLOs. It should capture the behavior observed by its users.

====
As a {project_name} user,

* I want to be able to log in,
* refresh my token and
* log out,

so that I can use the applications that use {project_name} for authentication.
====

== Definition of SLI and SLO

The following provides example SLIs and SLOs based on the service description above and the metrics available in {project_name}.

[%autowidth,options="header"]
|===
| Characteristic | Service Level Indicator | Service Level Objective^*^ | Metric Source

| Availability
| Percentage of the time {project_name} is able to answer requests as measured by the monitoring system
| {project_name} should be available 99.9% of the time within a month (about 44 minutes of unavailability per month).
| Use the Prometheus `up` metric, which indicates whether the Prometheus server is able to scrape metrics from the {project_name} instance.

| Latency
| Response time for authentication-related HTTP requests as measured by the server
| 95% of all authentication-related requests should be faster than 250 ms within a 5-minute interval.
| {project_name} server-side metrics that track latency for specific endpoints, along with the response time distribution, using `http_server_requests_seconds_bucket` and `http_server_requests_seconds_count`.

| Errors
| Failed authentication requests due to server problems as measured by the server
| The rate of errors due to server problems for authentication requests should be less than 0.1% within a 5-minute interval.
| Identify server-side errors by filtering the metric `http_server_requests_seconds_count` on the tag `outcome` for the value `SERVER_ERROR`.

|===

^*^ These SLO target values are an example and should be tailored to fit your use case and deployment.
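
To make the availability target concrete, the following Python sketch converts an availability SLO into the downtime it permits. The function name and the default 30-day month are assumptions made for illustration only; they are not part of {project_name} or Prometheus.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Return the minutes of unavailability an availability SLO permits."""
    total_minutes = days * 24 * 60
    # The share of time that may be lost is whatever the SLO does not cover.
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # 99.9% availability over an assumed 30-day month
    print(round(allowed_downtime_minutes(99.9), 1))
```

For a 30-day month this yields about 43.2 minutes, and for a 31-day month about 44.6 minutes, which is consistent with the roughly 44 minutes quoted in the table.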

== PromQL queries

=== Availability

The following query returns a value of at least 1 if at least one {project_name} instance is available
and responding to Prometheus scrape requests,
and 0 if the service is down or unreachable.

Then use a tool like Grafana to show a 30-day interval and let it calculate the average of the metric in that time window.

[source,plaintext]
----
sum(
  up{
    container="keycloak", # <1>
    namespace="$namespace"
  }
)
OR
on() vector(0) # <2>
----
<1> Filter by additional tags to identify Keycloak
<2> Alternative value 0 when none of the Pods is available
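
As a rough illustration of the averaging step, the Python sketch below derives an availability figure from a series of query results. With several Pods running, the summed `up` value can be greater than 1, so this sketch counts an evaluation step as available when the value is at least 1. That interpretation, the function name, and the sample values are assumptions for illustration and not something Grafana or Prometheus prescribes.

```python
from typing import Sequence


def availability(samples: Sequence[float]) -> float:
    """Fraction of evaluation steps in which at least one instance was up.

    Each sample is the query result at one step: the number of reachable
    instances, or 0 when none could be scraped.
    """
    if not samples:
        raise ValueError("no samples in the time window")
    # Count a step as available once, regardless of how many Pods were up.
    up_steps = sum(1 for value in samples if value >= 1)
    return up_steps / len(samples)


if __name__ == "__main__":
    # Hypothetical results for four steps: 3 Pods, 3 Pods, outage, 2 Pods
    print(availability([3, 3, 0, 2]))
```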

=== Latency of authentication requests

This Prometheus query calculates the ratio of authentication requests
that completed within 0.25 seconds to all authentication requests for specific Keycloak endpoints, targeting a particular namespace, over the past 5 minutes. Multiply the result by 100 to express it as a percentage.

This example requires the Keycloak configuration `http-metrics-slos` to be set to `250` indicating that buckets for requests faster and slower than 250 ms should be recorded.
Setting `http-metrics-histograms-enabled` to `true` would capture additional buckets which can help with performance troubleshooting.

[source,plaintext]
----
sum(
  rate(
    http_server_requests_seconds_bucket{
      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
      le="0.25", # <2>
      container="keycloak", # <3>
      namespace="$namespace"}
    [5m] # <4>
  )
) without (le,uri,status,outcome,method,pod,instance) # <5>
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
      container="keycloak", # <3>
      namespace="$namespace"}
    [5m] # <4>
  )
) without (le,uri,status,outcome,method,pod,instance) # <5>
----
<1> URLs related to logging in
<2> Response time as defined by SLO
<3> Filter by additional tags to identify Keycloak
<4> Interval as specified by SLO
<5> Ignore as many labels as necessary to create a single sum
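
The arithmetic behind this query can be sketched in Python: the number of requests that fell into the 250 ms bucket is divided by the number of all requests in the same interval, and the result is compared with the 95% target from the SLO table. The function names and the request counts are hypothetical and serve only to illustrate the calculation.

```python
def latency_sli(fast_requests: float, all_requests: float) -> float:
    """Share of requests served within the threshold, between 0.0 and 1.0."""
    if all_requests <= 0:
        raise ValueError("no requests in the interval")
    return fast_requests / all_requests


def meets_latency_slo(fast_requests: float, all_requests: float,
                      target: float = 0.95) -> bool:
    """True if the share of fast requests reaches the SLO target."""
    return latency_sli(fast_requests, all_requests) >= target


if __name__ == "__main__":
    # Hypothetical 5-minute interval: 960 of 1000 requests within 250 ms
    print(meets_latency_slo(960, 1000))
```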

=== Errors for authentication requests

This Prometheus query calculates the ratio of authentication requests
that returned a server-side error to all authentication requests,
targeting a particular namespace, over the past 5 minutes.

[source,plaintext]
----
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
      outcome="SERVER_ERROR", # <2>
      container="keycloak", # <3>
      namespace="$namespace"}
    [5m] # <4>
  )
) without (le,uri,status,outcome,method,pod,instance) # <5>
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
      container="keycloak", # <3>
      namespace="$namespace"}
    [5m] # <4>
  )
) without (le,uri,status,outcome,method,pod,instance) # <5>
----
<1> URLs related to logging in
<2> Filter for all requests that responded with a server error (HTTP status 5xx)
<3> Filter for Keycloak containers
<4> Interval as specified by SLO
<5> Ignore as many labels as necessary to create a single sum
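
When evaluating this ratio outside of Prometheus, intervals without any traffic need a decision, because dividing by zero requests gives no meaningful error rate. The Python sketch below treats such intervals as error-free and compares the ratio with the 0.1% target. That choice, the function names, and the sample counts are assumptions for illustration only.

```python
def error_ratio(server_errors: float, all_requests: float) -> float:
    """Share of requests that failed with a server-side error."""
    if all_requests <= 0:
        # Assumption: an interval without traffic counts as error-free.
        return 0.0
    return server_errors / all_requests


def within_error_slo(server_errors: float, all_requests: float,
                     max_ratio: float = 0.001) -> bool:
    """True if the error ratio stays below the SLO limit (0.1% by default)."""
    return error_ratio(server_errors, all_requests) < max_ratio


if __name__ == "__main__":
    # Hypothetical 5-minute interval: 1 server error in 2000 requests
    print(within_error_slo(1, 2000))
```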

== Further Reading

* https://sre.google/sre-book/service-level-objectives/[Google SRE Book on Service Level Objectives]
* https://prometheus.io/docs/prometheus/latest/querying/basics/[Prometheus PromQL Basics]