A/P failure testing in SYNC mode #495
Deployment
Keycloak dataset initialised with:
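The exact initialisation command is not reproduced above. As a rough sketch only, a dataset matching the benchmark below (realm-0 with 10,000 users) could be created with the keycloak-benchmark dataset provider; the endpoint and parameter names here are assumptions, not values recorded in this issue:

```bash
# Hypothetical sketch: create one realm with 10000 users via the
# keycloak-benchmark dataset provider (endpoint and parameters are
# assumptions; KC_URL is a placeholder for the Keycloak base URL).
curl "https://${KC_URL}/realms/master/dataset/create-realms?count=1&users-per-realm=10000"

# Poll until the provider reports the task as finished.
curl "https://${KC_URL}/realms/master/dataset/status"
```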
Benchmark Setup
Environment:
Benchmark cmd
Initial Results

Disable Health Check

Failover cmd:
Route53 failover time: 63 seconds. Looking at the Infinispan logs, there seems to be an issue with lock contention in this scenario, as I believe writes are happening on both the active and the passive site:
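The failover command itself is not shown above. As a hedged illustration, a Route53 failover can be forced by disabling the health check that guards the active site; the health-check ID below is a placeholder, not a value from this test:

```bash
# Hypothetical sketch: disable the Route53 health check for the active site
# so DNS resolution fails over to the passive record.
HEALTH_CHECK_ID="00000000-0000-0000-0000-000000000000"  # placeholder
aws route53 update-health-check \
  --health-check-id "${HEALTH_CHECK_ID}" \
  --disabled
```

Clients keep resolving the old record until Route53's health-check interval and the record's TTL elapse, which is consistent with failover times in the 60-second range.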
Delete All Routes

Failover cmd:
Route53 failover time: 57 seconds. Detailed results and exported Grafana dashboards can be found on the Google Drive.
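Again, the exact command is not recorded above. A minimal sketch of deleting the active site's failover record with the AWS CLI; every identifier below is a placeholder:

```bash
# Hypothetical sketch: delete the active site's alias record so clients can
# only resolve the passive site (zone ID, names, and values are placeholders).
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "DELETE",
      "ResourceRecordSet": {
        "Name": "client.keycloak-benchmark.example.",
        "Type": "A",
        "SetIdentifier": "active",
        "Failover": "PRIMARY",
        "AliasTarget": {
          "HostedZoneId": "Z0000000000001",
          "DNSName": "active-lb.example.com.",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
```

Note that a DELETE change must match the existing record set exactly, which is why the full record definition has to be supplied.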
Delete All Routes 2

The Grafana snapshots in failover-all-routes.tar.gz were not correctly exported. Route53 failover time: 58 seconds.
Disable Health Check 2

Route53 failover time: 72 seconds.
Disable Health Check 3

Following on from our discussions this afternoon, the failover now occurs after 4 minutes and the benchmark has been updated to use the following parameters:

```bash
./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure
```

As hoped, utilising the updated parameters, the Infinispan logs showed none of the lock acquisition exceptions seen in the first run, even though I ran the benchmark across two EC2 nodes as before.
Active Site Gossip Router Fail

Utilising the following benchmark with two EC2 clients:
Gossip Router pods on the active cluster are killed to simulate a network split between the active and passive site (see the command sketch after this comment). Removing the GossipRouter Route is not sufficient, as the TCP connections established between the two sites remain open until the GR pod is killed.
Requests to the active site fail for ~45 seconds, after which the passive site is suspected and the active site continues to function.
I have dashboards for the passive site, but not for the active site. The uploaded tar has been updated.
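The kill command itself is not recorded in this comment. A minimal sketch, assuming the Gossip Router pods carry an Infinispan-Operator-style label; the context, namespace, and selector are all assumptions:

```bash
# Hypothetical sketch: kill the Gossip Router pods on the active site so the
# established inter-site TCP connections drop and the remote site is suspected.
kubectl --context "${ACTIVE_SITE_CONTEXT}" -n keycloak \
  delete pods -l app=infinispan-router-pod --force --grace-period=0
```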
Passive site cluster fail

All Infinispan and Keycloak pods are killed on the passive site to simulate the site going down. As expected, this produces results comparable to the Active Gossip Router failure scenario, as xsite operations between the active and passive site fail until the passive site is suspected by JGroups. Failover cmd:
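The failover command was not captured in this export. A hedged sketch, assuming default operator labels for the Infinispan and Keycloak pods (context, namespace, and selectors are assumptions):

```bash
# Hypothetical sketch: kill all Infinispan and Keycloak pods on the passive
# site at once to simulate the whole site going down.
kubectl --context "${PASSIVE_SITE_CONTEXT}" -n keycloak \
  delete pods -l app=infinispan-pod --force --grace-period=0
kubectl --context "${PASSIVE_SITE_CONTEXT}" -n keycloak \
  delete pods -l app=keycloak --force --grace-period=0
```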
Discussion:
(Reducing the downtime from the current 40 seconds to 15 seconds requires a change in the Infinispan Operator.) Alternative: go to ASYNC mode, with the tradeoff that not all updates on the active site are available on the passive site. In ASYNC mode, a full resync is necessary when the passive site is completely restarted, and the keys of pending updates are kept in memory in the replication queue until they are sent. Check with @stianst to verify whether this is acceptable.
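For illustration, switching a cache's cross-site backup strategy from SYNC to ASYNC through the Infinispan Operator could look roughly like the sketch below; the cache name, cluster name, namespace, and site name are all placeholders, not this deployment's actual configuration:

```bash
# Hypothetical sketch: declare a cache whose backup to the remote site is
# ASYNC instead of SYNC (all names below are placeholders).
kubectl apply -f - <<'EOF'
apiVersion: infinispan.org/v2alpha1
kind: Cache
metadata:
  name: sessions
  namespace: keycloak
spec:
  clusterName: infinispan
  name: sessions
  template: |
    <distributed-cache mode="SYNC">
      <backups>
        <backup site="site-b" strategy="ASYNC"/>
      </backups>
    </distributed-cache>
EOF
```

The tradeoff is the one described above: updates are queued in memory and can be lost if the active site fails before the queue is drained.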
Most "sensible scenario":
Why customers want a SYNC setup:
Follow-ups:
Follow-up tasks have been captured in new issues, closing this issue.
Test scenarios:

- Manually switch from the active to the passive site under load (see the sketch below)
- Make the passive site fail
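As a rough illustration of the first scenario, a manual switch could be forced by inverting the health check that guards the active site, so Route53 reports it unhealthy while the site itself stays up; the health-check ID is a placeholder, and this may not be the mechanism actually used:

```bash
# Hypothetical sketch: invert the active site's health check so Route53
# routes traffic to the passive record without any real outage.
aws route53 update-health-check \
  --health-check-id "${HEALTH_CHECK_ID}" \
  --inverted
```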