diff --git a/doc/kubernetes/modules/ROOT/images/crossdc/active-passive-sync.dio.svg b/doc/kubernetes/modules/ROOT/images/crossdc/active-passive-sync.dio.svg new file mode 100644 index 000000000..85e4f436c --- /dev/null +++ b/doc/kubernetes/modules/ROOT/images/crossdc/active-passive-sync.dio.svg @@ -0,0 +1,4 @@ + + + +
Secondary Datacenter (passive)
Secondary Datacenter (passive)
Primary Datacenter (active)
Primary Datacenter (active)
Keycloak
Keycloak
Infinispan
Infinispan
Browser
Browser
Infinispan
Infinispan
Keycloak
Keycloak
Load Balancer
Load Balancer
Communication path
after failover / switchover 
Communication path...
<<sync>>
<<sync>>
Synchronous
Database
Synchronous...
Text is not SVG - cannot display
\ No newline at end of file diff --git a/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-az.dio.svg b/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-az.dio.svg index a5cbe0f51..dd1ac29da 100644 --- a/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-az.dio.svg +++ b/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-az.dio.svg @@ -1,4 +1,4 @@ -
AWS Zone eu-west-1a (active)
AWS Zone eu-west-1a (active)
OpenShift Cluster
OpenShift Cluster
«Pod»
Keycloak
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
GossipRouter
«Pod»...
«Local»
Browser
«Local»...
AWS Zone eu-west-1b (Passive)
AWS Zone eu-west-1b (Passive)
OpenShift Cluster
OpenShift Cluster
«Pod»
GossipRouter
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
Keycloak
«Pod»...
<<sync replication>>
<<sync rep...
Load Balancer
Load Balancer
Aurora DB Writer
Aurora DB Writer
Aurora DB Reader
Aurora DB Reader
Communication path
after failover / switchover 
of both Keycloak and Aurora
Communication path...
Text is not SVG - cannot display
\ No newline at end of file +
AWS Zone eu-west-1a (active)
AWS Zone eu-west-1a (active)
OpenShift Cluster
OpenShift Cluster
«Pod»
Keycloak
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
GossipRouter
«Pod»...
«Local»
Browser
«Local»...
AWS Zone eu-west-1b (Passive)
AWS Zone eu-west-1b (Passive)
OpenShift Cluster
OpenShift Cluster
«Pod»
GossipRouter
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
Keycloak
«Pod»...
Load Balancer
Load Balancer
Communication path
after failover / switchover 
of both Keycloak and Aurora
Communication path...
Synchronous
Database
(for example
Aurora Regional
 Deployment) 
Synchronous...
Text is not SVG - cannot display
\ No newline at end of file diff --git a/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-region.dio.svg b/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-region.dio.svg index f268bfa3c..0afe9a9b3 100644 --- a/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-region.dio.svg +++ b/doc/kubernetes/modules/ROOT/images/crossdc/infinispan-crossdc-region.dio.svg @@ -1,4 +1,4 @@ -
AWS Region eu-west-1 (active)
AWS Region eu-west-1 (active)
OpenShift Cluster
OpenShift Cluster
«Pod»
Keycloak
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
GossipRouter
«Pod»...
«Local»
Browser
«Local»...
AWS Region eu-east-1 (Passive)
AWS Region eu-east-1 (Passive)
OpenShift Cluster
OpenShift Cluster
«Pod»
GossipRouter
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
Keycloak
«Pod»...
<<async replication>>
<<async replication>>
Load Balancer
Load Balancer
Aurora Primary
DB Cluster
Aurora Primary...
Aurora Secondary
DB Cluster
Aurora Secondary...
Communication path
after failover / switchover 
of both Keycloak and Aurora
Communication path...
Text is not SVG - cannot display
\ No newline at end of file +
AWS Region eu-west-1 (active)
AWS Region eu-west-1 (active)
OpenShift Cluster
OpenShift Cluster
«Pod»
Keycloak
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
GossipRouter
«Pod»...
«Local»
Browser
«Local»...
AWS Region eu-east-1 (Passive)
AWS Region eu-east-1 (Passive)
OpenShift Cluster
OpenShift Cluster
«Pod»
GossipRouter
«Pod»...
«Pod»
Infinispan
«Pod»...
«Pod»
Keycloak
«Pod»...
Load Balancer
Load Balancer
Communication path
after failover / switchover 
of both Keycloak and Aurora
Communication path...
Asynchronous
Database
(for example
Aurora Global
 Deployment) 
Asynchronous...
Text is not SVG - cannot display
\ No newline at end of file diff --git a/doc/kubernetes/modules/ROOT/nav.adoc b/doc/kubernetes/modules/ROOT/nav.adoc index b9580266c..e68979e56 100644 --- a/doc/kubernetes/modules/ROOT/nav.adoc +++ b/doc/kubernetes/modules/ROOT/nav.adoc @@ -16,12 +16,15 @@ ** xref:openshift/installation-infinispan.adoc[] ** xref:openshift/cross-site-rosa.adoc[] * xref:running/index.adoc[] -** xref:running/keycloak-deployment.adoc[] -** xref:running/keycloak-with-external-infinispan.adoc[] -** xref:running/infinispan-deployment.adoc[] -** xref:running/infinispan-crossdc-deployment.adoc[] -** xref:running/aurora-multi-az.adoc[] -** xref:running/aurora-peering-connections.adoc[] +** xref:running/index.adoc#overview[Overviews] +*** xref:running/deployments/active-passive-sync.adoc[] +** xref:running/index.adoc#building-blocks[Building Blocks] +*** xref:running/keycloak-deployment.adoc[] +*** xref:running/keycloak-with-external-infinispan.adoc[] +*** xref:running/infinispan-deployment.adoc[] +*** xref:running/infinispan-crossdc-deployment.adoc[] +*** xref:running/aurora-multi-az.adoc[] +*** xref:running/aurora-peering-connections.adoc[] ** xref:running/concepts/index.adoc[] *** xref:running/concepts/database-connections.adoc[] *** xref:running/concepts/threads.adoc[] diff --git a/doc/kubernetes/modules/ROOT/pages/running/aurora-multi-az.adoc b/doc/kubernetes/modules/ROOT/pages/running/aurora-multi-az.adoc index c0a2e1fb9..64d67acf2 100644 --- a/doc/kubernetes/modules/ROOT/pages/running/aurora-multi-az.adoc +++ b/doc/kubernetes/modules/ROOT/pages/running/aurora-multi-az.adoc @@ -52,7 +52,7 @@ endpoint. To do this we create a `Keycloak` CR outlined in the xref:running/keyc we modify the following elements: -. Update `spec.db.url` to be `jdbc:postgresql://$HOST$:5432/keycloak` where `$HOST` is the +. Update `spec.db.url` to be `jdbc:postgresql://$HOST:5432/keycloak` where `$HOST` is the <>. . 
Ensure that the Secrets referenced by `spec.db.usernameSecret` and `spec.db.passwordSecret` contain usernames and diff --git a/doc/kubernetes/modules/ROOT/pages/running/deployments/active-passive-sync.adoc b/doc/kubernetes/modules/ROOT/pages/running/deployments/active-passive-sync.adoc new file mode 100644 index 000000000..93ca331c3 --- /dev/null +++ b/doc/kubernetes/modules/ROOT/pages/running/deployments/active-passive-sync.adoc @@ -0,0 +1,225 @@ += HA-Keycloak active/passive with synchronous replication +:navtitle: Active/passive with sync replication +:description: This concept describes the building blocks needed for a highly available active/passive setup and the behavior customers can expect from it. + +{description} + +== Audience + +Solution architects and customers who plan for a highly available Keycloak environment and want to learn about the requirements, benefits and tradeoffs of a synchronous active/passive setup. + +After summarizing the architecture, see <<building-blocks>> with links to blueprints for each building block. + +== Architecture + +Two independent Keycloak deployments run in different datacenters connected by a low-latency network. +Entities like users, realms and clients are stored in a database that is replicated synchronously across the two datacenters. +Sessions are stored exclusively in Infinispan, which is also replicated synchronously across the two datacenters. +Offline sessions are stored in Infinispan, and can optionally be stored in the database as well. + +image::crossdc/active-passive-sync.dio.svg[] + +=== When to use this setup + +Use this setup for customers who want to recover automatically from a datacenter failure without losing data or sessions. + +=== Causes of data and service loss + +While this setup aims for high availability, the following situations can still lead to service or data loss: + +* Network failures between the datacenters or failures of components can lead to short service downtimes while those failures are detected. 
+The service will be restored automatically. +The system is degraded until the redundancy of components and connectivity is restored. + +* Once failures occur in the communication between the datacenters, manual steps may be necessary to re-synchronize a degraded setup. +Future versions of Keycloak and Infinispan plan to reduce those manual operations. + +* Degraded setups can lead to service or data loss if additional components fail. +Monitoring is necessary to detect degraded setups. + +=== Failures this setup can survive + +[%autowidth] +|=== +| Failure | Recovery | RPO^1^ | RTO^2^ + +| Database node +| If the writer instance fails, the database can promote a reader instance in the same or the other datacenter to be the new writer. +| No data loss +| Seconds to minutes (depending on the database) + +| Keycloak node +| Multiple Keycloak instances run in each datacenter. If one instance fails, it takes a few seconds for the other nodes to notice the change, and some incoming requests might receive an error message or be delayed for a few seconds. +| No data loss +| Less than one minute + +| Infinispan node +| Multiple Infinispan instances run in each datacenter. If one instance fails, it takes a few seconds for the other nodes to notice the change. Sessions are stored in at least two Infinispan nodes, so a single node failure doesn't lead to data loss. +| No data loss +| Less than one minute + +| Infinispan cluster failure +| If the Infinispan cluster fails in the active datacenter, Keycloak won't be able to send session data to the secondary datacenter. +The state between Keycloak and Infinispan becomes out of sync, as does the state between the two datacenters. +Keycloak will continue on a best-effort basis, but the service might be degraded due to retry mechanisms. +Even when the Infinispan cluster is restored, its data will be out-of-sync with Keycloak. + +Manual switchover to the secondary datacenter is recommended. 
As that datacenter is out-of-sync, the customer should consider whether all data should be cleared from the session store of the passive datacenter to avoid out-of-date information. +| Loss of service and data +| Human intervention required + +| Infinispan connectivity +| If the connectivity between the two datacenters is lost, session information can't be sent to the other datacenter. +Incoming requests might receive an error message or be delayed for a few seconds. +The primary site marks the secondary site as offline and stops sending data to the secondary. +The setup is degraded until the connection is restored and the session data is re-synchronized to the secondary datacenter. +| No data loss ^3^ +| Less than one minute + +| Database connectivity +| If the connectivity between the two datacenters is lost, the synchronous replication will fail, and it might take some time for the primary site to mark the secondary offline. +Some requests might receive an error message or be delayed for a few seconds. +Manual operations might be necessary depending on the database. +| No data loss ^3^ +| Seconds to minutes (depending on the database) + +| Primary Datacenter +| If none of the Keycloak nodes is available, the load balancer will detect the outage and redirect the traffic to the secondary site. +Some requests might receive an error message until the load balancer has detected the primary datacenter failure. +The setup will be degraded until the primary site is back up and the session state has been manually synced from the secondary to the primary site. +| No data loss^3^ +| Less than one minute + +| Secondary Datacenter +| If the secondary datacenter is not available, it will take a moment for the primary Infinispan and database to mark the secondary datacenter offline. +Some requests might receive an error message while the detection takes place. 
+Once the secondary datacenter is up again, the session state needs to be manually synced from the primary site to the secondary site. +| No data loss^3^ +| Less than one minute + +|=== + +^1^: Recovery point objective, assuming all parts of the setup were healthy at the time this occurred. + +^2^: Recovery time objective. + +^3^: Manual operations needed to restore the degraded setup. + +The statement "`No data loss`" depends on the setup not being degraded from previous failures, which includes completing any pending manual operations to resynchronize the state between the datacenters. + +=== Known limitations + +==== Upgrades + +* On a Keycloak major version upgrade, all session data (except offline sessions) might be lost as the data format is not guaranteed to be compatible between major versions. +* On any Infinispan upgrade (major, minor or patch), all session data might be lost as cross-DC communication is guaranteed to work only between identical versions of Infinispan. + +==== Failovers + +* A successful failover requires a setup not degraded from previous failures. +All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss. +Customers must use monitoring to ensure degradations are detected and handled in a timely manner. + +==== Switchovers + +* A successful switchover requires a setup not degraded from previous failures. +All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss. +Customers must use monitoring to ensure degradations are detected and handled in a timely manner. + +==== Out-of-sync datacenters + +* The datacenters can become out of sync when a synchronous Infinispan request fails. +This is currently difficult to monitor, and it would need a full manual re-sync of Infinispan to recover. +Monitoring the number of cache entries in both datacenters and Keycloak's log file can show when this would become necessary. 
+Future versions of Keycloak and Infinispan plan to automate this. + +==== Manual operations + +* Manual operations that re-synchronize the Infinispan state between the datacenters will issue a full state transfer, which puts stress on the system (network, CPU, Java heap in Infinispan and Keycloak). + +=== Questions and answers + +Why a synchronous database?:: +A synchronously replicated database ensures that data written in the primary datacenter is always available in the secondary datacenter on failover and no data is lost. + +Why a synchronous Infinispan replication?:: +A synchronously replicated Infinispan ensures that sessions created, updated and deleted in the primary datacenter are always available in the secondary datacenter on failover and no data is lost. + +Why low-latency between datacenters?:: +Synchronous replication defers the response to the caller until the data is received at the secondary datacenter. +For synchronous database replication and synchronous Infinispan replication, low latency is necessary as each request can potentially require multiple interactions between the datacenters when data is updated, which amplifies the latency. + +Why active-passive?:: +Some databases support a single writer instance with a reader instance which is then promoted to be the new writer once the original writer fails. +In such a setup, it is beneficial for the latency to have the writer instance in the same datacenter as the currently active Keycloak. +Synchronous Infinispan replication can lead to deadlocks when entries in both datacenters are modified concurrently. + +Is this setup limited to two datacenters?:: +This setup could be extended to multiple datacenters, and there are no fundamental changes necessary to have, for example, three datacenters. Once more datacenters are added, the overall latency between the datacenters increases, and the likelihood of network failures, and therefore short downtimes, increases as well. 
+For now, it has been tested and documented with blueprints only for two datacenters. + +Is a synchronous cluster less stable than an asynchronous cluster?:: +An asynchronous setup would handle network failures between the datacenters gracefully, while the synchronous setup delays requests and returns errors to the caller in situations where the asynchronous setup would have deferred the writes to the secondary datacenter. +But as the secondary site would never be fully up to date with the primary site, this could lead to data loss during failovers. +This would include: ++ +-- +* Lost logouts (sessions logged out in the primary datacenter are still logged in on the secondary datacenter at the point of failover) +* Lost changes like users being able to log in with their old passwords (database changes not replicated to the secondary datacenter at the point of failover). +-- ++ +So there is effectively a tradeoff between availability and consistency. +For now, we have chosen to rank consistency higher than availability for Keycloak. + +[#building-blocks] +== Building blocks + +The following building blocks are needed to set up the architecture described above. +Each building block links to a blueprint with an example configuration. +They are listed in the order in which they need to be installed. + +=== Two datacenters with low-latency connection + +Ensures that synchronous replication is available for both the database and Infinispan. + +*Blueprint:* Two AWS Availability Zones within the same AWS Region. + +*Not considered:* Two regions on the same or different continents, as it would increase the latency and the likelihood of network failures. +Synchronous replication of databases as a service with Aurora Regional Deployments on AWS is only available within the same region. + +=== Environment for Keycloak and Infinispan + +Ensures that the instances are deployed and restarted as needed. 
+ +*Blueprint:* Red Hat OpenShift Service on AWS (ROSA) deployed in each availability zone. + +*Not considered:* A stretched ROSA cluster which spans multiple availability zones, as this could be a single point of failure if misconfigured. + +=== Database + +A synchronously replicated database across two datacenters. + +*Blueprint:* xref::running/aurora-multi-az.adoc[Amazon Aurora PostgreSQL Regional Deployment spanning two availability zones, connected to ROSA] + +=== Infinispan + +An Infinispan deployment which leverages Infinispan's Cross-DC functionality. + +*Blueprint:* xref::running/infinispan-crossdc-deployment.adoc[Deploy Infinispan using the Infinispan Operator on ROSA, and connect the two datacenters using Infinispan's Gossip Router]. + +*Not considered:* Direct interconnections between the OpenShift clusters on the network layer. +It might be considered in the future based on customer or community demand. + +=== Loadbalancer + +A load balancer which checks the `/health/live` URL of the Keycloak deployment in each datacenter. + +*Blueprint:* Amazon Route 53 active-passive failover. +If an availability zone goes offline, clients might fail to update their cached DNS settings and would fail to reach the secondary availability zone. + +*Not considered:* AWS Global Accelerator connecting to Red Hat OpenShift Service on AWS (ROSA) as it supports only weighted traffic routing and not active-passive failover. +To support active-passive failover, additional logic using, for example, AWS CloudWatch and AWS Lambda would be necessary to simulate the active-passive handling by adjusting the weights when the probes fail. + +=== Keycloak + +A clustered deployment of Keycloak in each datacenter, connected to an external Infinispan. + +*Blueprint:* xref::running/keycloak-deployment.adoc[Deploy Keycloak using the Keycloak Operator on ROSA], and xref::running/keycloak-with-external-infinispan.adoc[connect it to the external Infinispan] and the Aurora database. 
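As a minimal sketch of how the building blocks above come together in a `Keycloak` CR for one datacenter — all hostnames, the namespace, and the Secret name are hypothetical placeholders, not values taken from the blueprints:

```yaml
# Hypothetical sketch; hostnames and secret names are placeholders.
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
  namespace: keycloak
spec:
  instances: 3
  db:
    vendor: postgres
    # Writer endpoint of the Aurora Regional Deployment (placeholder host)
    url: jdbc:postgresql://keycloak-aurora.cluster-example.eu-west-1.rds.amazonaws.com:5432/keycloak
    usernameSecret:
      name: keycloak-db-secret
      key: username
    passwordSecret:
      name: keycloak-db-secret
      key: password
  hostname:
    hostname: keycloak.example.com
```

The connection to the external Infinispan is configured separately, as described in the linked Keycloak and Infinispan blueprints.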
diff --git a/doc/kubernetes/modules/ROOT/pages/running/index.adoc b/doc/kubernetes/modules/ROOT/pages/running/index.adoc index 1935a801f..de525e07c 100644 --- a/doc/kubernetes/modules/ROOT/pages/running/index.adoc +++ b/doc/kubernetes/modules/ROOT/pages/running/index.adoc @@ -4,6 +4,14 @@ {description} It summarizes the logic which is condensed in the Helm charts and scripts in this project to make it accessible as independent knowledge to adapt it to other environments. +[#overview] +== Overview of different configurations + +* xref::running/deployments/active-passive-sync.adoc[] + +[#building-blocks] +== Building Blocks + * xref::running/keycloak-deployment.adoc[] * xref::running/keycloak-with-external-infinispan.adoc[] * xref::running/infinispan-deployment.adoc[]
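The Route 53 active-passive failover named as the load balancer blueprint can be sketched as a `change-resource-record-sets` payload: a PRIMARY record guarded by a health check probing `/health/live` in the primary datacenter, and a SECONDARY record that Route 53 serves once that health check fails. All record names, targets, and the health check ID below are hypothetical placeholders:

```json
{
  "Comment": "Hypothetical active-passive failover records; all names are placeholders",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "keycloak.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
        "ResourceRecords": [{ "Value": "lb.primary-dc.example.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "keycloak.example.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "lb.secondary-dc.example.com" }]
      }
    }
  ]
}
```

A short TTL keeps the failover window small, but as noted above, clients with stale cached DNS entries may still fail to reach the secondary availability zone.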