A/P failure testing in SYNC mode #495

Closed
PavelVlha opened this issue Aug 21, 2023 · 9 comments

Test scenarios:

Manually switch from the active to the passive site under load

  • Document steps needed to do the switch
  • Observe the recovery time needed (higher response times, and maybe even failures, are expected as the passive site is not warmed up)

Make the passive site fail

  • Kill the connection between the active and passive sites to simulate a connection disruption. The idea is to remove the gossip router public route (we probably also need to kill/stop the Infinispan operator to make sure the operator does not recreate the route); a sketch of this follows the list
  • Observe the active site: how long does it take for the active site to detect that the passive site is down, and does the active site continue to work correctly without it?
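
A minimal sketch of how that disruption could be induced, assuming the Infinispan Operator and the Gossip Router route live in the same namespace (the deployment and route names are placeholders, not taken from this setup):

# Stop the operator first so it cannot recreate the route (deployment name is an assumption)
oc -n "${PROJECT}" scale deployment/infinispan-operator-controller-manager --replicas=0
# Remove the public Gossip Router route (route name is a placeholder)
oc -n "${PROJECT}" delete route infinispan-router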
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Aug 29, 2023

ryanemerson commented Aug 29, 2023

The kc-failover.sh script, plus the docs and associated changes required for the failover tests, were added as part of #507.
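
For reference, the script is driven entirely by environment variables; the invocation pattern used in all runs below is (values are examples taken from this issue):

# Trigger the failover FAILOVER_DELAY seconds after the script starts
PROJECT=<openshift-project> \
DOMAIN=<deployment-domain> \
FAILOVER_MODE=HEALTH_PROBE \    # modes exercised below: HEALTH_PROBE, ALL_ROUTES, CLUSTER_FAIL
FAILOVER_DELAY=120 \
./bin/kc-failover.sh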

Deployment

AURORA_CLUSTER=ryan-aurora
AURORA_REGION=${REGION}
ROSA_CLUSTER_NAME_1=ryan-active
ROSA_CLUSTER_NAME_2=ryan-passive
DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
KC_CLIENT_URL="client.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_1="primary.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_2="backup.${DOMAIN}"
CROSS_DC_IMAGE=quay.io/pruivo/infinispan-server:14.0-patched

KC_DB_POOL_MIN_SIZE=15
KC_DB_POOL_MAX_SIZE=15
KC_DB_POOL_INITIAL_SIZE=15
KC_INSTANCES=6
KC_DISABLE_STICKY_SESSION='true'
KC_MEMORY_REQUESTS_MB=2048
KC_MEMORY_LIMITS_MB=3000
KC_HEAP_MAX_MB=2048
KC_HEAP_INIT_MB=1024
KC_CPU_REQUESTS=8
KC_CPU_LIMITS=8
KC_OTEL='true'
KC_CUSTOM_INFINISPAN_CONFIG_FILE=config/kcb-infinispan-cache-remote-store-config.xml

Keycloak dataset initialised with:

curl -v -k "https://${KC_CLIENT_URL}/realms/master/dataset/create-realms?realm-name=realm-0&count=1&threads-count=10&users-per-realm=10000"
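
If needed, progress of the dataset creation can be polled via the dataset provider's status endpoint (assuming the standard keycloak-benchmark dataset provider endpoints):

# Poll until the realm/user creation task reports completion
curl -k "https://${KC_CLIENT_URL}/realms/master/dataset/status"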

Benchmark Setup

./ansible/benchmark.sh and the kc-failover.sh script are executed concurrently. The latter is run manually on one of the EC2 nodes used by benchmark.sh so that the Route53 failover latency can be observed more accurately.

Environment:

cluster_size: 2
kcb_zip: ../benchmark/target/keycloak-benchmark-0.10-SNAPSHOT.zip

Benchmark cmd

DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
REGION=eu-west-1
KC_CLIENT_URL="https://client.${DOMAIN}"
./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=300 \
  --realm-name=realm-0 \
  --logout-percentage=0 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Initial Results

Disable Health Check

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="HEALTH_PROBE" FAILOVER_DELAY=120 ./bin/kc-failover.sh

Route53 Failover time: 63 Seconds

Screenshot from 2023-08-29 15-23-33

Looking at the Infinispan logs, there seems to be an issue with lock contention in this scenario, as I believe writes are happening on both the active and the passive site:

14:05:45,567 WARN  (jgroups-239,infinispan-0-40258) [org.infinispan.CLUSTER] ISPN000071: Caught exception when handling command SingleXSiteRpcCommand{command=RemoveCommand{key=WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)], value=null, returnEntry=false, metadata=null, internalMetadata=null, flags=[], commandInvocationId=CommandInvocation:7e2b2d82-8f0c-42ce-8eed-bb3a0d057e98:1305642, valueMatcher=MATCH_ALWAYS, topologyId=-1}} org.infinispan.remoting.RemoteException: ISPN000217: Received exception from infinispan-2-46837, see cause for remote stack trace
	at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:25)
	at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
	at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
	at org.infinispan.remoting.transport.impl.SingleTargetRequest.addResponse(SingleTargetRequest.java:75)
	at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:45)
	at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:51)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1578)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1480)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1685)
	at org.jgroups.JChannel.up(JChannel.java:733)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:921)
	at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:527)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:138)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:845)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:226)
	at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1052)
	at org.jgroups.protocols.UNICAST3.addMessage(UNICAST3.java:794)
	at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:776)
	at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:425)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:658)
	at org.jgroups.protocols.VERIFY_SUSPECT2.up(VERIFY_SUSPECT2.java:105)
	at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:180)
	at org.jgroups.protocols.FD_SOCK2.up(FD_SOCK2.java:188)
	at org.jgroups.protocols.MERGE3.up(MERGE3.java:274)
	at org.jgroups.protocols.Discovery.up(Discovery.java:294)
	at org.jgroups.stack.Protocol.up(Protocol.java:314)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1178)
	at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:100)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)] and requestor CommandInvocation:infinispan-0-40258:832116. Lock is held by CommandInvocation:infinispan-2-46837:821621
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at org.infinispan.marshall.exts.ThrowableExternalizer.newThrowableInstance(ThrowableExternalizer.java:287)
	at org.infinispan.marshall.exts.ThrowableExternalizer.readGenericThrowable(ThrowableExternalizer.java:265)
	at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:240)
	at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:44)
	at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
	at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
	at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
	at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:32)
	at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:49)
	at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:41)
	at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
	at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
	at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
	at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:191)
	at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:220)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1570)
	... 26 more

Delete All Routes

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="ALL_ROUTES" FAILOVER_DELAY=120 ./bin/kc-failover.sh
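
My understanding of this mode (an assumption, not verified against the script) is that it deletes the active site's public Keycloak routes so that Route53 stops receiving healthy responses from it; a rough manual equivalent:

# Rough manual equivalent of ALL_ROUTES (label selector is an assumption)
oc -n "${PROJECT}" delete routes -l app=keycloak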

Route53 Failover time: 57 Seconds

Screenshot from 2023-08-29 15-25-41

Detailed results and exported Grafana dashboards can be found on the Google Drive.


ryanemerson commented Aug 29, 2023

Delete All Routes 2

The Grafana snapshots in failover-all-routes.tar.gz were not correctly exported.

Route53 Failover time: 58 Seconds

failover-all-routes-2.tar.gz

Screenshot from 2023-08-29 16-31-41

@ryanemerson

Disable Health Check 2

Route53 Failover time: 72 seconds

failover-health-probe-2.tar.gz

Screenshot from 2023-08-29 16-44-11


ryanemerson commented Aug 30, 2023

Disable Health Check 3

Following on from our discussions this afternoon, the failover now occurs after 4 minutes and the benchmark has been updated to use the following parameters:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

As hoped, utilising --logout-percentage=100 has removed the build-up of active users towards the end of the benchmark.

In the Infinispan logs there were no lock acquisition exceptions as seen in the first run, even though I ran the benchmark across two EC2 nodes as before.

Screenshot from 2023-08-30 16-30-45

failover-health-probe-3.tar.gz


ryanemerson commented Aug 31, 2023

Active Site Gossip Router Fail

Utilising the following benchmark with two EC2 clients:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Kill the connection between active and passive sites to simulate a connection disruption. The idea is to remove the gossip router public route (we probably also need to kill/stop the Infinispan operator to make sure the operator does not recreate the route)

Gossip Router pods on the active cluster are killed to simulate a network split between the active and passive site. Removing the GossipRouter Route is not sufficient, as the TCP connections established between the two sites remain open until the GR pod is killed.
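
For reproducibility, the kill amounts to something like the following; the kube context and label selector are assumptions and need to match the pods the Infinispan Operator creates for the Gossip Router:

# Delete the Gossip Router pods on the active cluster (context and label are assumptions)
oc --context active -n "${PROJECT}" delete pods -l app=infinispan-router-pod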

Observe the active site - how long does it take for the active site to detect passive is down, does the active site work correctly without the passive

Requests to the active site fail for ~ 45 seconds, after which the passive site is suspected and the active site continues to function.

Screenshot from 2023-08-31 11-23-14

Screenshot from 2023-08-31 11-23-30

Grafana hadn't installed correctly on my active site, so unfortunately the dashboards are missing from the results. I can repeat the experiment as required though.

I have dashboards for the passive site, just not the active. The uploaded tar has been updated.

gossip-router-fail-active.tar.gz

ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Aug 31, 2023

ryanemerson commented Aug 31, 2023

Passive site cluster fail

All Infinispan and Keycloak pods are killed on the passive site to simulate the site going down.
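
A sketch of the manual equivalent (the kube context, project, and label selectors are assumptions; FAILOVER_MODE=CLUSTER_FAIL in kc-failover.sh automates this step):

# Kill all Keycloak and Infinispan pods on the passive site
oc --context passive -n "${PROJECT}" delete pods -l app=keycloak
oc --context passive -n "${PROJECT}" delete pods -l app=infinispan-pod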

As expected, this produces results comparable to the Active Gossip Router failure scenario, as cross-site operations between the active and passive site fail until the passive site is suspected by JGroups.

Failover cmd:

DOMAIN=ryan-active-ryan-passive-mzeymtmk.keycloak-benchmark.com PROJECT=remerson-keycloak FAILOVER_MODE=CLUSTER_FAIL FAILOVER_DELAY=240 ./bin/kc-failover.sh

Screenshot from 2023-08-31 14-07-38
Screenshot from 2023-08-31 14-07-49

Screenshot from 2023-08-31 14-08-00

passive-site-cluster-fail.tar.gz

ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Sep 1, 2023
mhajas pushed a commit that referenced this issue Sep 4, 2023

ahus1 commented Sep 4, 2023

Discussion:

  • The downtime on the active site when the passive site goes down could be reduced to ~15 seconds by configuring different probes
  • On a manual takeover, a CLI command could take the passive site offline before the manual shutdown to avoid the downtime on the active site (see the CLI sketch below)
  • Spikes in response time on the active site are expected when any pod in the passive site goes down unexpectedly.

(Reducing the downtime from the current 40 seconds to 15 seconds requires a change in the Infinispan Operator.)
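
A sketch of what the planned-takeover flow could look like from an Infinispan CLI session on the active site; the site name site-b is an assumption, while the site commands themselves exist in Infinispan 14:

# Stop replicating to the passive site before its planned shutdown
site take-offline --all-caches --site=site-b
# ... restart / maintain the passive site ...
site bring-online --all-caches --site=site-b
# Push state so the passive site catches up with changes made while it was offline
site push-site-state --all-caches --site=site-b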

Alternative: switch to ASYNC mode, with the trade-off that not all updates on the active site are guaranteed to be available on the passive site. In ASYNC mode, a resync is necessary when the passive site is completely restarted, and the keys of pending updates are kept in an in-memory queue until they are sent.

Check with @stianst to verify if this is acceptable.

ahus1 self-assigned this Sep 5, 2023

ahus1 commented Sep 7, 2023

Most "sensible scenario":

Why customers want a SYNC setup:

  • No chance of the data loss that is possible with ASYNC replication

Follow-ups:

  • How to configure a health probe so that Route 53 or another load-balancing service doesn't send requests to a site that doesn't have all the data / is in need of a state transfer (applies to both ASYNC and SYNC): Investigate health probe for loadbalancer #534
  • Show with some functional testing (applies to both ASYNC and SYNC): Run functional cross-site testing #535
    • logging in on one site, using the session on the other site
    • logging out on one site, verifying the session is gone on the other site
    • modifying a client on one site, verifying the change is available on the other site (even if it was already cached there)

ahus1 changed the title from "A/P failure testing" to "A/P failure testing in SYNC mode" Sep 7, 2023

ahus1 commented Sep 7, 2023

Follow-up tasks have been captured in new issues, closing this issue.

ahus1 closed this as completed Sep 7, 2023