Create sample HA deployment #127

cmgrote · 2021-05-10T11:08:18Z

Create a sample chart for demonstrating a high-availability deployment of the Crux repository:

- 2-3 OMAG Server Platform pods
- Each configured with the same configuration document for the Crux repo config
- Each Crux repo config using a local (e.g. RocksDB) index, but pointing to the same "remote" (OMAG-external) document store and transaction log
- Probably simplest to start with just Kafka as this external store (for both document store and transaction log)

Configure the polling latency for Kafka to be 10-50ms rather than 1 full second, so that the default sync-index behaviour is not degraded too much by the polling intervals.

Also document the structure of such a configuration for reference purposes, explaining that Kafka is used only as an example and could be replaced by other external mechanisms such as S3 or JDBC.
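To make the shape of such a configuration concrete, here is a minimal sketch in Python that builds one shared configuration document: a local RocksDB index per pod, with both the document store and the transaction log pointing at the same Kafka cluster and polling every 10-50 ms. The key names (`crux/index-store`, `crux/document-store`, `crux/tx-log`, `kafka-config`, `poll-wait-duration`, `db-dir`) are modelled on Crux's module-style configuration and are assumptions to verify against the Crux documentation for the release in use. The function name, the `kafka:9092` address and the index directory are hypothetical.

```python
import json
from datetime import timedelta


def iso_duration(delta: timedelta) -> str:
    """Render a short timedelta as an ISO-8601 duration in seconds, e.g. PT0.05S."""
    return f"PT{delta.total_seconds():g}S"


def build_crux_config(bootstrap_servers: str, poll_wait: timedelta, index_dir: str) -> dict:
    """Build the configuration document that every pod would share.

    All key names below are assumptions modelled on Crux's module-style
    configuration; check them against the Crux reference before use.
    """
    # The issue suggests 10-50 ms so that index syncing is not slowed by polling.
    if not timedelta(milliseconds=10) <= poll_wait <= timedelta(milliseconds=50):
        raise ValueError("poll_wait should be between 10 ms and 50 ms")

    # One Kafka definition reused by both stores, so they cannot drift apart.
    kafka = {
        "crux/module": "crux.kafka/->kafka-config",
        "bootstrap-servers": bootstrap_servers,
    }
    wait = iso_duration(poll_wait)
    return {
        # Local to each pod: every pod maintains its own index.
        "crux/index-store": {
            "kv-store": {"crux/module": "crux.rocksdb/->kv-store", "db-dir": index_dir}
        },
        # Shared and OMAG-external: identical for every pod.
        "crux/document-store": {
            "crux/module": "crux.kafka/->document-store",
            "kafka-config": kafka,
            "poll-wait-duration": wait,
        },
        "crux/tx-log": {
            "crux/module": "crux.kafka/->tx-log",
            "kafka-config": kafka,
            "poll-wait-duration": wait,
        },
    }


# Hypothetical values, for illustration only.
config = build_crux_config("kafka:9092", timedelta(milliseconds=50), "data/rdb-index")
print(json.dumps(config, indent=2))
```

Swapping Kafka for another external mechanism would, under the same assumptions, mean replacing only the `crux/document-store` and `crux/tx-log` entries; the local index entry stays as it is.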

Signed-off-by: Christopher Grote <chris@thegrotes.net>

#127 Initial chart for providing a sample high availability config

cmgrote · 2021-05-15T15:14:12Z

Also consider documenting a more dynamic HA deployment:

- New connector OMAG pods that can come online (or be dropped) at any time
- Some quorum mechanism across the pods, so that one of the pods can be elected to periodically create an index checkpoint and store it in some out-of-cluster location (e.g. S3)
- Initial index store of each new OMAG pod taken from the latest such external checkpoint (see: https://opencrux.com/reference/21.04-1.16.0/checkpointing.html)
- A readiness probe that would ideally succeed only once the pod's local index is up-to-date (not sure this is feasible: what would indicate that the index is up-to-date, given that there is always some activity happening via other pods?)
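One possible answer to the readiness question, offered only as a sketch, is to stop requiring the index to be exactly up-to-date and instead treat the pod as ready once its local index is within a bounded lag of the shared transaction log. The Python below assumes the node can report the latest transaction submitted to the log and the latest transaction completed in the local index; the `IndexStatus` type, the `is_ready` function and the `max_lag` threshold are hypothetical, and how the two values are obtained is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexStatus:
    """Hypothetical snapshot of how far the local index has caught up."""
    latest_submitted_tx_id: Optional[int]  # newest transaction on the shared log
    latest_completed_tx_id: Optional[int]  # newest transaction in the local index


def is_ready(status: IndexStatus, max_lag: int = 0) -> bool:
    """Report ready once the local index is within max_lag transactions of the log."""
    if status.latest_submitted_tx_id is None:
        # Nothing has ever been written, so there is nothing to catch up on.
        return True
    if status.latest_completed_tx_id is None:
        # The log has content but the local index has not processed any of it yet.
        return False
    lag = status.latest_submitted_tx_id - status.latest_completed_tx_id
    return lag <= max_lag


# A pod still bootstrapping its index versus one that has nearly caught up.
print(is_ready(IndexStatus(100, 80), max_lag=10))
print(is_ready(IndexStatus(100, 95), max_lag=10))
```

A non-zero `max_lag` accepts that other pods keep writing while this one catches up; picking the threshold is a trade-off between admitting a pod early and serving slightly stale reads.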

This relies on having a configuration mechanism for the OMAG platform itself that does not require configuration and/or startup via REST. Otherwise, the readiness probe would have to succeed just so that the platform could be configured and started, at which point the pod would already be receiving other traffic via a load-balancing service. All of that traffic would fail until the connector is configured and started, which takes at least 20-30 seconds for an empty system and could take several minutes or longer if the pod is also bootstrapping its index. Several minutes of "random" failures for requests that the load-balancer happens to send to this bootstrapping pod would be unacceptable. Hence the dependency on a non-REST mechanism to start the pods, so that the readiness probe can indicate that the pod is truly ready to receive and (correctly) respond to requests.

cmgrote · 2021-05-21T10:31:39Z

Moved the dynamic deployment to a new issue, #150, given its dependency on Egeria core changes. The initial documentation for the original issue is now complete: https://odpi.github.io/egeria-connector-crux/high-availability/

cmgrote added the enhancement New feature or request label May 10, 2021

cmgrote self-assigned this May 10, 2021

cmgrote referenced this issue in cmgrote/egeria-connector-xtdb May 14, 2021

#127 Initial chart for providing a sample high availability config

7ba09e4

Signed-off-by: Christopher Grote <chris@thegrotes.net>

cmgrote added a commit that referenced this issue May 14, 2021

Merge pull request #137 from cmgrote/main

57d05bb

#127 Initial chart for providing a sample high availability config

cmgrote closed this as completed May 21, 2021
