Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Keycloak SLO docs fixes #579 #1020

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

add Keycloak SLO docs fixes #579 #1020

wants to merge 1 commit into from

Conversation

kami619
Copy link
Contributor

@kami619 kami619 commented Oct 22, 2024

Fixes #579

@kami619 kami619 requested a review from ahus1 October 22, 2024 21:44
Copy link
Contributor

@ahus1 ahus1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. It captures the discussions we had internally, still it would need to change its shape a bit to be useful as a future guide in Keycloak main.

  • The introduction needs to briefly explain SLI/SLO to our best knowledge
  • It should be more specific about which metrics should be used, as readers would otherwise be lost in the metrics and their tags.
  • It needs to be less specific about actual SLOs, as those would require a lot more assumptions to be explained and calculations to be made

See below for the comment.

Comment on lines 29 to 31
| Authentication Latency | XX% of {project_name} authentication requests should have a latency below 200ms. | 99% | 95% | {project_name} server-side metrics to track latency for specific endpoints along with Response Time Distribution. | `http_server_requests_seconds_count`, `http_server_requests_seconds_sum`.

https://www.keycloak.org/keycloak-benchmark/kubernetes-guide/latest/running/metrics/keycloak_cluster#processing-time[More details about the metrics are captured here.] | https://github.com/keycloak/keycloak-benchmark/blob/main/provision/minikube/monitoring/dashboards/authentication-code.json[Example Grafana dashboard]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to place this in the main Keycloak docs eventually, and it should be self-sufficient.
It is missing the URL path for URLs related to login, and the information that histograms for HTTP metrics need to be enabled first.

IMHO the heatmap you point to will not show the SLI. IMHO the diagram on the left will show the SLI/SLO

AFAIK the expression could be simplified by using all URLs starting with /realms/{realm}/protocol/{protocol}/.* and not restricting them to (token|auth|logout).

sum(irate(http_server_requests_seconds_bucket{uri=~"/realms/{realm}/protocol/{protocol}/(token|auth|logout)|/realms/{realm}/login-actions/authenticate", le="0.25", namespace="$namespace", pod="$pod_name"}[2m])) without (le) / 
irate(http_server_requests_seconds_count{uri=~"/realms/{realm}/protocol/{protocol}/(token|auth|logout)|/realms/{realm}/login-actions/authenticate", namespace="$namespace", pod="$pod_name"}[2m]) 

Comment on lines 33 to 34
| Error Rate “during login” | The error rate should be less than X.X%. | 0.1% | 0.05% | The ratio of failed authentication requests to total requests. | Failed requests could be identified by the `5xx error codes` generated by the {project_name} server and those could be further per URL.
|https://grafana.com/grafana/dashboards/10441-keycloak-metrics-dashboard/[Example Grafana dashboard]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'd like it to be more specific and show which URLs are relevant during login. This could be done with a PromQL query.

@kami619
Copy link
Contributor Author

kami619 commented Oct 23, 2024

@ahus1 I have addressed most of the review feedback, and I still have some doubts on how to implement your input on the PromQL queries. I will connect with you offline and work on it.

@kami619
Copy link
Contributor Author

kami619 commented Oct 24, 2024

@ ahus1 I added what we thought would be good concerning the structure of the doc. If you can double-check the error rate related PromQL queries, that would be great.

Signed-off-by: Kamesh Akella <kamesh.asp@gmail.com>
@ahus1 ahus1 self-assigned this Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metrics - Most important SLI to determine that Keycloak is healthy
2 participants