From 64cf50078c282e4505eb3d4351e47c4306c069ce Mon Sep 17 00:00:00 2001
From: Alexander Schwartz <aschwart@redhat.com>
Date: Tue, 18 Jul 2023 18:42:07 +0200
Subject: [PATCH] Add ROSA Benchmark Key Results

Closes #432

Co-authored-by: Michal Hajas <mhajas@redhat.com>
---
 doc/benchmark/modules/ROOT/nav.adoc           |   1 +
 .../report/rosa-benchmark-key-results.adoc    | 209 ++++++++++++++++++
 2 files changed, 210 insertions(+)
 create mode 100644 doc/benchmark/modules/ROOT/pages/report/rosa-benchmark-key-results.adoc

diff --git a/doc/benchmark/modules/ROOT/nav.adoc b/doc/benchmark/modules/ROOT/nav.adoc
index 4a306f7c4..2192a8909 100644
--- a/doc/benchmark/modules/ROOT/nav.adoc
+++ b/doc/benchmark/modules/ROOT/nav.adoc
@@ -13,6 +13,7 @@
 ** xref:report/trend-report.adoc[]
 ** xref:report/diagram-types.adoc[]
 ** xref:report/result-summary.adoc[]
+** xref:report/rosa-benchmark-key-results.adoc[]
 * xref:scenario-overview.adoc[]
 ** xref:scenario/authorization-code.adoc[]
 ** xref:scenario/list-sessions.adoc[]
diff --git a/doc/benchmark/modules/ROOT/pages/report/rosa-benchmark-key-results.adoc b/doc/benchmark/modules/ROOT/pages/report/rosa-benchmark-key-results.adoc
new file mode 100644
index 000000000..873b0c5ec
--- /dev/null
+++ b/doc/benchmark/modules/ROOT/pages/report/rosa-benchmark-key-results.adoc
@@ -0,0 +1,209 @@
+= Keycloak on ROSA Benchmark Key Results
+
+This page summarizes a benchmark run with Keycloak 22 performed in July 2023.
+Use the results as a starting point to calculate the requirements of a Keycloak environment.
+Then verify them by running load tests in your own environment.
+
+[WARNING]
+====
+CPU usage for refreshing a token is currently missing.
+We hope to add this soon.
+====
+
+== Data collection
+
+These are rough estimates read from Grafana dashboards.
+Full automation that shows repeatable results across releases is still pending.
+
+== Setup
+
+* OpenShift 4.13.x deployed on AWS via ROSA.
+* Machinepool with `m5.4xlarge` instances.
+* Keycloak 22 deployed with Operator and 3 pods.
+* Default user password hashing with PBKDF2 27,500 hash iterations.
+* Database seeded with 100,000 users and 100,000 clients.
+* Infinispan caches at default of 10,000 entries, so not all clients and users fit into the cache, and some requests will need to fetch the data from the database.
+* All sessions in distributed caches as per default, with two owners per entry, so one pod can fail without losing data.
+* PostgreSQL deployed inside the same OpenShift with ephemeral storage.
++
+A database with persistent storage will have higher latencies, which might lead to longer response times; still, the throughput should be similar.
+
+== Installation
+
+Deploy OpenShift and ROSA as described in xref:kubernetes-guide::prerequisite/prerequisite-rosa.adoc[ROSA] and xref:kubernetes-guide::prerequisite/prerequisite-openshift.adoc[OpenShift] with the following settings:
+
+.OpenShift `.env` file
+----
+# no KC_CPU_LIMITS set for this scenario
+KC_CPU_REQUESTS=6
+KC_INSTANCES=3
+KC_DISABLE_STICKY_SESSION=true
+KC_MEMORY_REQUESTS_MB=4000
+KC_MEMORY_LIMITS_MB=4000
+KC_HEAP_MAX_MB=2048
+KC_DB_POOL_INITIAL_SIZE=30
+KC_DB_POOL_MAX_SIZE=30
+KC_DB_POOL_MIN_SIZE=30
+----
+
+== Performance results
+
+[WARNING]
+====
+* Performance will be lower when scaling to more Pods (due to additional overhead) and when using a cross-datacenter setup (due to additional traffic and operations).
+
+* Increased cache sizes can improve the performance when Keycloak instances run for a longer time. Still, those caches need to be filled when an instance is restarted.
+
+* Use these values as a starting point and perform your own load tests before going into production.
+====
+
+Summary:
+
+* CPU usage scales linearly with the number of requests up to the tested limit below.
+* Memory usage scales linearly with the number of active sessions up to the tested limit below.
+
+Observations:
+
+* The base memory usage for an inactive Pod is 1 GB of RAM.
+
+* Leave 1 GB of extra head-room for spikes in RAM usage.
+
+* For each 100,000 active user sessions, add 500 MB per Pod in a three-node cluster (tested with up to 200,000 sessions).
++
+This assumes that each user connects to only one client.
+Memory requirements increase with the number of client sessions per user session (not tested yet).
+
+* For each 45 user logins per second, add 1 vCPU per Pod in a three-node cluster (tested with up to 300 per second).
++
+Keycloak spends most of the CPU time hashing the password provided by the user.
+
+* For each 250 client credential grants per second, add 1 vCPU per Pod in a three-node cluster (tested with up to 2000 per second).
++
+Most CPU time goes into creating new TLS connections, as each client runs only a single request.
+
+* Leave 100% extra head-room for CPU usage to handle spikes in the load.
+Performance of Keycloak dropped significantly when its Pods were throttled in our tests.
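+
+As a rough sketch, the observations above can be condensed into the following per-Pod rules of thumb.
+They assume a three-node cluster, one client session per user session, load within the tested limits, and 1 GB counted as 1000 MB; the doubling of the CPU and the extra 1 GB of memory are the head-room recommendations from above.
+
+.Per-Pod sizing rules derived from the observations (plain arithmetic, not executable)
+----
+CPU requested (vCPU)  = (user logins per second / 45) + (client credential grants per second / 250)
+CPU limit (vCPU)      = 2 * CPU requested
+Memory requested (MB) = 1000 + 500 * (active user sessions / 100000)
+Memory limit (MB)     = Memory requested + 1000
+----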
+
+=== Calculation example
+
+Target size:
+
+* 50,000 active user sessions
+* 45 logins per second
+* 250 client credential grants per second
+
+Limits calculated:
+
+* CPU requested: 2 vCPU
++
+(45 logins per second = 1 vCPU, 250 client credential grants per second = 1 vCPU)
+
+* CPU limit: 4 vCPU
++
+(doubling the CPU requested to handle peaks and to cover refresh token handling, for which we don't have numbers yet)
+
+* Memory requested: 1.25 GB
++
+(1 GB base memory plus 250 MB RAM for 50,000 active sessions, based on 500 MB per 100,000 sessions)
+
+* Memory limit: 2.25 GB
++
+(adding 1 GB to the memory requested)
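+
+As an illustration only, the per-Pod numbers for this target size could be expressed with the variables of the OpenShift `.env` file shown in the Installation section.
+The memory values follow the rule of 500 MB per 100,000 active sessions, which gives 250 MB for 50,000 sessions.
+`KC_CPU_LIMITS` is named in that file's comment but was not set in the benchmark, so its value format here is an assumption; heap and database pool settings are not covered by this calculation and are left out.
+
+.Hypothetical `.env` values for the calculation example
+----
+# Sketch for the target size above; verify with your own load tests
+# The sizing rules were measured with three Pods
+KC_INSTANCES=3
+# 45 logins/s (1 vCPU) + 250 client credential grants/s (1 vCPU)
+KC_CPU_REQUESTS=2
+# Assumption: KC_CPU_LIMITS takes the same format as KC_CPU_REQUESTS
+KC_CPU_LIMITS=4
+# 1000 MB base + 250 MB for 50,000 active sessions
+KC_MEMORY_REQUESTS_MB=1250
+# Memory requested + 1000 MB head-room
+KC_MEMORY_LIMITS_MB=2250
+----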
+
+== Tests performed
+
+Each test ran for 10 minutes.
+
+. Set up the ROSA cluster with default settings.
+. Scale machine pool.
++
+[source,bash,subs="+quotes"]
+----
+rosa edit machinepool -c **<clustername>** --min-replicas 3 --max-replicas 10 scaling
+----
+. Deploy Keycloak and Monitoring
++
+[source,bash]
+----
+cd provision/openshift
+task
+task monitoring
+----
+. Create dataset
++
+[source,bash]
+----
+task dataset-import -- -a create-realms -u 100000
+# wait for first task to complete
+task dataset-import -- -a create-clients -c 100000 -n realm-0
+----
+. Prepare environment for running the benchmark via Ansible
++
+See xref:run/running-benchmark-ansible.adoc[] for details.
++
+.Contents of `env.yml` used here
+[source,yaml]
+----
+cluster_size: 5
+instance_type: t3.small
+instance_volume_size: 30
+kcb_zip: ../benchmark/target/keycloak-benchmark-0.10-SNAPSHOT.zip
+kcb_heap_size: 1G
+----
+
+. Create load runners
++
+[source,bash,subs="+quotes"]
+----
+cd ../../ansible
+./aws_ec2.sh start **<region of ROSA cluster>**
+----
+. Run different load tests
+
+* Testing memory for creating sessions
++
+[source,bash,subs="+quotes"]
+----
+./benchmark.sh eu-west-1 \
+--scenario=keycloak.scenario.authentication.AuthorizationCode \
+--server-url=${KEYCLOAK_URL} \
+--realm-name=realm-0 \
+--users-per-sec=**<number of users per second>** \
+--ramp-up=20 \
+--logout-percentage=0 \
+--measurement=600 \
+--users-per-realm=100000 \
+--log-http-on-failure
+----
+
+* Testing CPU usage for user logins
++
+[source,bash,subs="+quotes"]
+----
+./benchmark.sh eu-west-1 \
+--scenario=keycloak.scenario.authentication.AuthorizationCode \
+--server-url=${KEYCLOAK_URL} \
+--realm-name=realm-0 \
+--users-per-sec=**<number of users per second>** \
+--ramp-up=20 \
+--logout-percentage=100 \
+--measurement=600 \
+--users-per-realm=100000 \
+--log-http-on-failure
+----
+
+* Testing CPU usage for client credential grants
++
+[source,bash,subs="+quotes"]
+----
+./benchmark.sh eu-west-1 \
+--scenario=keycloak.scenario.authentication.ClientSecret \
+--server-url=${KEYCLOAK_URL} \
+--realm-name=realm-0 \
+--users-per-sec=**<number of clients per second>** \
+--ramp-up=20 \
+--logout-percentage=100 \
+--measurement=600 \
+--users-per-realm=100000 \
+--log-http-on-failure
+----