A/P failure testing in SYNC mode #495

Closed
PavelVlha opened this issue Aug 21, 2023 · 9 comments

Test scenarios:

Manually switch from the active to the passive site under load

  • Document steps needed to do the switch
  • Observe the recovery time needed (higher response times, and maybe even failures, are expected as the passive site is not warmed up)

Make the passive site fail

  • Kill the connection between the active and passive sites to simulate a connection disruption. The idea is to remove the gossip router public route (we probably also need to kill/stop the Infinispan operator to make sure the operator does not recreate the route); a sketch of this follows the list
  • Observe the active site: how long does it take for the active site to detect that the passive site is down, and does the active site continue to work correctly without it?
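
A minimal sketch of how that disruption could be induced, assuming the Infinispan Operator and the Gossip Router route live in the same namespace (the deployment and route names are placeholders, not taken from this setup):

# Stop the operator first so it cannot recreate the route (deployment name is an assumption)
oc -n "${PROJECT}" scale deployment/infinispan-operator-controller-manager --replicas=0
# Remove the public Gossip Router route (route name is a placeholder)
oc -n "${PROJECT}" delete route infinispan-router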
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Aug 29, 2023

ryanemerson commented Aug 29, 2023

The kc-failover.sh script, plus the docs and associated changes required for the failover tests, were added as part of #507.
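
For reference, the script is driven entirely by environment variables; the invocation pattern used in all runs below is (values are examples taken from this issue):

# Trigger the failover FAILOVER_DELAY seconds after the script starts
PROJECT=<openshift-project> \
DOMAIN=<deployment-domain> \
FAILOVER_MODE=HEALTH_PROBE \    # modes exercised below: HEALTH_PROBE, ALL_ROUTES, CLUSTER_FAIL
FAILOVER_DELAY=120 \
./bin/kc-failover.sh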

Deployment

AURORA_CLUSTER=ryan-aurora
AURORA_REGION=${REGION}
ROSA_CLUSTER_NAME_1=ryan-active
ROSA_CLUSTER_NAME_2=ryan-passive
DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
KC_CLIENT_URL="client.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_1="primary.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_2="backup.${DOMAIN}"
CROSS_DC_IMAGE=quay.io/pruivo/infinispan-server:14.0-patched

KC_DB_POOL_MIN_SIZE=15
KC_DB_POOL_MAX_SIZE=15
KC_DB_POOL_INITIAL_SIZE=15
KC_INSTANCES=6
KC_DISABLE_STICKY_SESSION='true'
KC_MEMORY_REQUESTS_MB=2048
KC_MEMORY_LIMITS_MB=3000
KC_HEAP_MAX_MB=2048
KC_HEAP_INIT_MB=1024
KC_CPU_REQUESTS=8
KC_CPU_LIMITS=8
KC_OTEL='true'
KC_CUSTOM_INFINISPAN_CONFIG_FILE=config/kcb-infinispan-cache-remote-store-config.xml

Keycloak dataset initialised with:

curl -v -k "https://${KC_CLIENT_URL}/realms/master/dataset/create-realms?realm-name=realm-0&count=1&threads-count=10&users-per-realm=10000"
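
If needed, progress of the dataset creation can be polled via the dataset provider's status endpoint (assuming the standard keycloak-benchmark dataset provider endpoints):

# Poll until the realm/user creation task reports completion
curl -k "https://${KC_CLIENT_URL}/realms/master/dataset/status"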

Benchmark Setup

./ansible/benchmark.sh and the kc-failover.sh script are executed concurrently. The latter is run manually on one of the EC2 nodes used by benchmark.sh so that the Route53 failover latency can be observed more accurately.

Environment:

cluster_size: 2
kcb_zip: ../benchmark/target/keycloak-benchmark-0.10-SNAPSHOT.zip

Benchmark cmd

DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
REGION=eu-west-1
KC_CLIENT_URL="https://client.${DOMAIN}"
./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=300 \
  --realm-name=realm-0 \
  --logout-percentage=0 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Initial Results

Disable Health Check

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="HEALTH_PROBE" FAILOVER_DELAY=120 ./bin/kc-failover.sh

Route53 Failover time: 63 Seconds

Screenshot from 2023-08-29 15-23-33

Looking at the Infinispan logs, there seems to be an issue with lock contention in this scenario, as I believe writes are happening on both the active and the passive site:

14:05:45,567 WARN  (jgroups-239,infinispan-0-40258) [org.infinispan.CLUSTER] ISPN000071: Caught exception when handling command SingleXSiteRpcCommand{command=RemoveCommand{key=WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)], value=null, returnEntry=false, metadata=null, internalMetadata=null, flags=[], commandInvocationId=CommandInvocation:7e2b2d82-8f0c-42ce-8eed-bb3a0d057e98:1305642, valueMatcher=MATCH_ALWAYS, topologyId=-1}} org.infinispan.remoting.RemoteException: ISPN000217: Received exception from infinispan-2-46837, see cause for remote stack trace
	at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:25)
	at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
	at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
	at org.infinispan.remoting.transport.impl.SingleTargetRequest.addResponse(SingleTargetRequest.java:75)
	at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:45)
	at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:51)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1578)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1480)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1685)
	at org.jgroups.JChannel.up(JChannel.java:733)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:921)
	at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:527)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:138)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:845)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:226)
	at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1052)
	at org.jgroups.protocols.UNICAST3.addMessage(UNICAST3.java:794)
	at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:776)
	at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:425)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:658)
	at org.jgroups.protocols.VERIFY_SUSPECT2.up(VERIFY_SUSPECT2.java:105)
	at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:180)
	at org.jgroups.protocols.FD_SOCK2.up(FD_SOCK2.java:188)
	at org.jgroups.protocols.MERGE3.up(MERGE3.java:274)
	at org.jgroups.protocols.Discovery.up(Discovery.java:294)
	at org.jgroups.stack.Protocol.up(Protocol.java:314)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1178)
	at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:100)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)] and requestor CommandInvocation:infinispan-0-40258:832116. Lock is held by CommandInvocation:infinispan-2-46837:821621
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at org.infinispan.marshall.exts.ThrowableExternalizer.newThrowableInstance(ThrowableExternalizer.java:287)
	at org.infinispan.marshall.exts.ThrowableExternalizer.readGenericThrowable(ThrowableExternalizer.java:265)
	at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:240)
	at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:44)
	at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
	at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
	at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
	at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:32)
	at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:49)
	at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:41)
	at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
	at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
	at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
	at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:191)
	at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:220)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1570)
	... 26 more

Delete All Routes

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="ALL_ROUTES" FAILOVER_DELAY=120 ./bin/kc-failover.sh
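
My understanding of this mode (an assumption, not verified against the script) is that it deletes the active site's public Keycloak routes so that Route53 stops receiving healthy responses from it; a rough manual equivalent:

# Rough manual equivalent of ALL_ROUTES (label selector is an assumption)
oc -n "${PROJECT}" delete routes -l app=keycloak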

Route53 Failover time: 57 Seconds

Screenshot from 2023-08-29 15-25-41

Detailed results and exported Grafana dashboards can be found on the Google Drive.


ryanemerson commented Aug 29, 2023

Delete All Routes 2

The Grafana snapshots in failover-all-routes.tar.gz were not correctly exported.

Route53 Failover time: 58 Seconds

failover-all-routes-2.tar.gz

Screenshot from 2023-08-29 16-31-41

@ryanemerson

Disable Health Check 2

Route53 Failover time: 72 seconds

failover-health-probe-2.tar.gz

Screenshot from 2023-08-29 16-44-11


ryanemerson commented Aug 30, 2023

Disable Health Check 3

Following on from our discussions this afternoon, the failover now occurs after 4 minutes and the benchmark has been updated to use the following parameters:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

As hoped, utilising --logout-percentage=100 has removed the build-up of active users towards the end of the benchmark.

In the Infinispan logs there were no lock acquisition exceptions as seen in the first run, even though I ran the benchmark across two EC2 nodes as before.

Screenshot from 2023-08-30 16-30-45

failover-health-probe-3.tar.gz


ryanemerson commented Aug 31, 2023

Active Site Gossip Router Fail

Utilising the following benchmark with two EC2 clients:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Kill the connection between active and passive sites to simulate a connection disruption. The idea is to remove the gossip router public route (we probably also need to kill/stop the Infinispan operator to make sure the operator does not recreate the route)

Gossip Router pods on the active cluster are killed to simulate a network split between the active and passive site. Removing the GossipRouter Route is not sufficient, as the TCP connections established between the two sites remain open until the GR pod is killed.
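
For reproducibility, the kill amounts to something like the following; the kube context and label selector are assumptions and need to match the pods the Infinispan Operator creates for the Gossip Router:

# Delete the Gossip Router pods on the active cluster (context and label are assumptions)
oc --context active -n "${PROJECT}" delete pods -l app=infinispan-router-pod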

Observe the active site - how long does it take for the active site to detect passive is down, does the active site work correctly without the passive

Requests to the active site fail for ~ 45 seconds, after which the passive site is suspected and the active site continues to function.

Screenshot from 2023-08-31 11-23-14

Screenshot from 2023-08-31 11-23-30

Grafana hadn't installed correctly on my active site, so unfortunately the dashboards are missing from the results. I can repeat the experiment as required though.

I have dashboards for the passive site, just not the active. The uploaded tar has been updated.

gossip-router-fail-active.tar.gz

ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Aug 31, 2023

ryanemerson commented Aug 31, 2023

Passive site cluster fail

All Infinispan and Keycloak pods are killed on the passive site to simulate the site going down.
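
A sketch of the manual equivalent (the kube context, project, and label selectors are assumptions; FAILOVER_MODE=CLUSTER_FAIL in kc-failover.sh automates this step):

# Kill all Keycloak and Infinispan pods on the passive site
oc --context passive -n "${PROJECT}" delete pods -l app=keycloak
oc --context passive -n "${PROJECT}" delete pods -l app=infinispan-pod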

As expected, this produces results comparable to the Active Gossip Router failure scenario, as cross-site operations between the active and passive site fail until the passive site is suspected by JGroups.

Failover cmd:

DOMAIN=ryan-active-ryan-passive-mzeymtmk.keycloak-benchmark.com PROJECT=remerson-keycloak FAILOVER_MODE=CLUSTER_FAIL FAILOVER_DELAY=240 ./bin/kc-failover.sh

Screenshot from 2023-08-31 14-07-38
Screenshot from 2023-08-31 14-07-49

Screenshot from 2023-08-31 14-08-00

passive-site-cluster-fail.tar.gz

ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Sep 1, 2023
mhajas pushed a commit that referenced this issue Sep 4, 2023

ahus1 commented Sep 4, 2023

Discussion:

  • The downtime on the active site when the passive site goes down could be reduced to ~15 seconds by configuring different probes
  • On a manual takeover, a CLI command could take the passive site offline before the manual shutdown to avoid the downtime on the active site (see the CLI sketch below)
  • Spikes in response time on the active site are expected when any pod in the passive site goes down unexpectedly.

(Reducing the downtime from the current 40 seconds to 15 seconds requires a change in the Infinispan Operator.)
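
A sketch of what the planned-takeover flow could look like from an Infinispan CLI session on the active site; the site name site-b is an assumption, while the site commands themselves exist in Infinispan 14:

# Stop replicating to the passive site before its planned shutdown
site take-offline --all-caches --site=site-b
# ... restart / maintain the passive site ...
site bring-online --all-caches --site=site-b
# Push state so the passive site catches up with changes made while it was offline
site push-site-state --all-caches --site=site-b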

Alternative: switch to ASYNC mode, with the trade-off that not all updates on the active site are guaranteed to be available on the passive site. In ASYNC mode, a resync is necessary when the passive site is completely restarted, and the keys of pending updates are kept in an in-memory queue until they are sent.

Check with @stianst to verify if this is acceptable.

ahus1 self-assigned this Sep 5, 2023

ahus1 commented Sep 7, 2023

Most "sensible scenario":

Why customers want a SYNC setup:

  • No chance of the data loss that is possible with ASYNC replication

Follow-ups:

  • How to configure a health probe so that Route 53 or another load-balancing service doesn't send requests to a site that doesn't have all the data / is in need of a state transfer (applies to both ASYNC and SYNC): Investigate health probe for loadbalancer #534
  • Show with some functional testing (applies to both ASYNC and SYNC): Run functional cross-site testing #535
    • logging in on one site, using the session on the other site
    • logging out on one site, verifying the session is gone on the other site
    • modifying a client on one site, verifying the change is available on the other site (even if it was already cached there)

ahus1 changed the title from "A/P failure testing" to "A/P failure testing in SYNC mode" Sep 7, 2023

ahus1 commented Sep 7, 2023

Follow-up tasks have been captured in new issues, closing this issue.

ahus1 closed this as completed Sep 7, 2023