keycloak · kami619 · Oct 22, 2024 · Oct 25, 2024
diff --git a/doc/kubernetes/modules/ROOT/nav.adoc b/doc/kubernetes/modules/ROOT/nav.adoc
@@ -14,9 +14,10 @@ include::partial$subnav-openshift.adoc[]
 * xref:testing/index.adoc[]
 * xref:running/index.adoc[]
 ** xref:running/infinispan-deployment.adoc[]
-** xref:running/timeout_tunning.adoc[]
+** xref:running/timeout_tuning.adoc[]
 ** xref:running/jvm/jvm_options.adoc[]
 ** Metrics
+*** xref:running/metrics/keycloak_service_level_indicators.adoc[]
 *** xref:running/metrics/jvm_metrics.adoc[]
 *** xref:running/metrics/keycloak_cluster.adoc[]
 *** xref:running/metrics/keycloak_with_external_infinispan.adoc[]

diff --git a/doc/kubernetes/modules/ROOT/pages/running/index.adoc b/doc/kubernetes/modules/ROOT/pages/running/index.adoc
@@ -16,7 +16,7 @@ These guides will eventually be published Keycloak's main web page.
 == Building blocks
 
 * xref:running/infinispan-deployment.adoc[]
-* xref:running/timeout_tunning.adoc[]
+* xref:running/timeout_tuning.adoc[]
 
 [#jvm-tuning]
 == JVM tuning guides
@@ -26,6 +26,7 @@ These guides will eventually be published Keycloak's main web page.
 [#monitoring-deployments]
 == Monitoring deployments
 
+* xref:running/metrics/keycloak_service_level_indicators.adoc[]
 * xref:running/metrics/jvm_metrics.adoc[]
 * xref:running/metrics/keycloak_cluster.adoc[]
 * xref:running/metrics/keycloak_with_external_infinispan.adoc[]

diff --git a/...netes/modules/ROOT/pages/running/metrics/keycloak_service_level_indicators.adoc b/...netes/modules/ROOT/pages/running/metrics/keycloak_service_level_indicators.adoc
@@ -0,0 +1,160 @@
+= {project_name} Service Level Indicators
+:description: This document contains details of the Service Level Indicators to monitor your {project_name} deployment's performance.
+
+Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are essential components in monitoring and maintaining the performance and reliability of {project_name} in production environments.
+
+The Google Site Reliability Engineering book defines these as follows:
+
+- A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided.
+
+- A Service Level Objective (SLO) is a target value or range of values for a service level that is measured by an SLI.
+
+By agreeing on these with stakeholders and tracking them, service owners
+can ensure that deployments are aligned with users' expectations and that they neither over- nor under-deliver on the service they provide.
+
+== Prerequisites
+
+* Metrics need to be enabled for Keycloak, and the `http-metrics-slos` option needs to be set to the latency threshold used by the SLO defined below.
+* A monitoring system collecting the metrics. The following paragraphs assume Prometheus or a similar system is used that supports the PromQL query language.
+
+https://www.keycloak.org/keycloak-benchmark/kubernetes-guide/latest/running/metrics/keycloak_cluster#processing-time[More details about the metrics are captured here.]
+
+== Definition of the service delivered
+
+The following service definition is used in the next steps to identify the appropriate SLIs and SLOs. It should capture the behavior observed by its users.
+
+====
+As a {project_name} user,
+
+* I want to be able to log in,
+* refresh my token and
+* log out,
+
+so that I can use the applications that use {project_name} for authentication.
+====
+
+== Definition of SLI and SLO
+
+The following provides example SLIs and SLOs based on the service description above and the metrics available in {project_name}.
+
+[%autowidth,options="header"]
+|===
+| Characteristic | Service Level Indicator | Service Level Objective^*^ | Metric Source
+
+| Availability
+| Percentage of the time {project_name} is able to answer requests as measured by the monitoring system
+| {project_name} should be available 99.9% of the time within a month (about 44 minutes of unavailability per month).
+| Use the Prometheus `up` metric which indicates if the Prometheus server is able to scrape metrics from the {project_name} instance.
+
+| Latency
+| Response time for authentication related HTTP requests as measured by the server
+| 95% of all authentication related requests should be faster than 250 ms within a 5-minute interval.
+| {project_name} server-side metrics to track latency for specific endpoints along with Response Time Distribution using `http_server_requests_seconds_bucket` and `http_server_requests_seconds_count`.
+
+| Errors
+| Failed authentication requests due to server problems as measured by the server
+| The rate of errors due to server problems for authentication requests should be less than 0.1% within a 5-minute interval.
+| Identify server-side errors by filtering the metric `http_server_requests_seconds_count` on the tag `outcome` for value `SERVER_ERROR`.
+
+|===
+
+^*^ These SLO target values are an example and should be tailored to fit your use case and deployment.
+
+== PromQL queries
+
+=== Availability
+
+The following query returns the number of {project_name} instances that are available
+and responding to Prometheus scrape requests: at least 1 if any instance is up,
+and 0 if the service is down or unreachable.
+
+Then use a tool like Grafana to show a 30-day interval and let it calculate the average of the metric in that time window.
+
+----
+sum(
+  up{
+    container="keycloak", # <1>
+    namespace="$namespace"
+  }
+)
+OR
+on() vector(0) # <2>
+----
+<1> Filter by additional tags to identify Keycloak
+<2> Alternative value 0 when none of the Pods is available
+
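+As an alternative to averaging in Grafana, the availability ratio can also be computed directly in PromQL with a subquery. The following is a minimal sketch, not a tested part of this guide: it assumes a 1-minute subquery resolution and a Prometheus retention of at least 30 days, and it clamps the instance count to 1 so that a deployment with several Pods still yields a value between 0 and 1 (for example, 0.999 corresponds to 99.9% availability).
+
+----
+avg_over_time(
+  (
+    clamp_max(
+      sum(
+        up{
+          container="keycloak",
+          namespace="$namespace"
+        }
+      )
+      OR
+      on() vector(0),
+      1 # <1>
+    )
+  )[30d:1m] # <2>
+)
+----
+<1> Count the service as available when at least one Pod is up
+<2> Evaluate over 30 days at an assumed resolution of 1 minute
+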
+=== Latency of authentication requests
+
+This Prometheus query calculates the percentage of authentication requests
+that completed within 0.25 seconds relative to all authentication requests for specific Keycloak endpoints, targeting a particular namespace and container, over the past 5 minutes.
+
+This example requires the Keycloak configuration `http-metrics-slos` to be set to `250` indicating that buckets for requests faster and slower than 250 ms should be recorded.
+Setting `http-metrics-histograms-enabled` to `true` would capture additional buckets which can help with performance troubleshooting.
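+
+As a minimal sketch, the settings above could look as follows in `conf/keycloak.conf`. The file location and the `metrics-enabled` line are assumptions about your deployment; the same options can also be passed as command line arguments or environment variables.
+
+----
+# Expose metrics (assumed not to be enabled yet)
+metrics-enabled=true
+# Record a bucket boundary at 250 ms to match the latency SLO
+http-metrics-slos=250
+# Optional: record additional histogram buckets for troubleshooting
+http-metrics-histograms-enabled=true
+----
+
+With a 250 ms bucket recorded, the latency ratio can be queried as follows: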
+
+----
+sum(
+  rate(
+    http_server_requests_seconds_bucket{
+      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
+      le="0.25", # <2>
+      container="keycloak", # <3>
+      namespace="$namespace"}
+    [5m] # <4>
+  )
+) without (le,uri,status,outcome,method,pod,instance) # <5>
+/
+sum(
+  rate(
+    http_server_requests_seconds_count{
+      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
+      container="keycloak", # <3>
+      namespace="$namespace"}
+    [5m] # <4>
+  )
+) without (le,uri,status,outcome,method,pod,instance) # <5>
+----
+<1> URLs related to logging in
+<2> Response time as defined by SLO
+<3> Filter by additional tags like `container` and `namespace` to identify Keycloak
+<4> Interval as specified by SLO
+<5> Ignore as many labels as necessary to create a single sum
+
+=== Errors for authentication requests
+
+This Prometheus query calculates the percentage of authentication requests
+that returned a server-side error relative to all authentication requests,
+targeting a particular namespace, over the past 5 minutes.
+
+[source,plaintext]
+----
+sum(
+  rate(
+    http_server_requests_seconds_count{
+      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
+      outcome="SERVER_ERROR", # <2>
+      container="keycloak", # <3>
+      namespace="$namespace"}
+    [5m] # <4>
+  )
+) without (le,uri,status,outcome,method,pod,instance) # <5>
+/
+sum(
+  rate(
+    http_server_requests_seconds_count{
+      uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*", # <1>
+      container="keycloak", # <3>
+      namespace="$namespace"}
+    [5m] # <4>
+  )
+) without (le,uri,status,outcome,method,pod,instance) # <5>
+----
+<1> URLs related to logging in
+<2> Filter for all requests that responded with a server error (HTTP status 5xx)
+<3> Filter for Keycloak containers
+<4> Interval as specified by SLO
+<5> Ignore as many labels as necessary to create a single sum
+
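+To compare this SLI against the example SLO target of 0.1%, the ratio can be wrapped in a Prometheus alerting rule. The following is a sketch only: the group name, alert name, `for` duration and severity label are hypothetical, and the `namespace` filter is omitted on the assumption that the `$namespace` placeholder used above is not substituted in a rule file.
+
+[source,yaml]
+----
+groups:
+  - name: keycloak-slo # hypothetical group name
+    rules:
+      - alert: KeycloakAuthErrorRateHigh # hypothetical alert name
+        expr: |
+          sum(rate(http_server_requests_seconds_count{
+            uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*",
+            outcome="SERVER_ERROR",
+            container="keycloak"}[5m]))
+          /
+          sum(rate(http_server_requests_seconds_count{
+            uri=~"/realms/{realm}/protocol/{protocol}/.*|/realms/{realm}/login-actions/.*",
+            container="keycloak"}[5m]))
+          > 0.001
+        for: 5m # hypothetical: fire only if the breach persists
+        labels:
+          severity: warning # hypothetical label
+----
+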
+== Further Reading
+
+* https://sre.google/sre-book/service-level-objectives/[Google SRE Book on Service Level Objectives]
+* https://prometheus.io/docs/prometheus/latest/querying/basics/[Prometheus PromQL Basics]
diff --git a/...s/ROOT/pages/running/timeout_tunning.adoc → ...es/ROOT/pages/running/timeout_tuning.adoc b/...s/ROOT/pages/running/timeout_tunning.adoc → ...es/ROOT/pages/running/timeout_tuning.adoc