
High Availability #470

Open
drewwells opened this issue Aug 27, 2023 · 11 comments

Comments

@drewwells
Contributor

Support a high-availability installation of spire-server to tolerate node outages and allow upgrades with zero downtime. HA would support only external databases, using multiple pods with anti-affinity rules to spread them across nodes. Running the database in a container would be a great step toward a portable example of this deployment.

Some changes that need to happen:

  • replace the StatefulSet with a Deployment
  • separate the controller manager from the spire-server pod, as they have different availability concerns
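
A minimal sketch of the anti-affinity piece, assuming the server pod template carries an `app.kubernetes.io/name: spire-server` label (the chart's actual rendered labels may differ):

```yaml
# Illustrative only: spread spire-server replicas across distinct nodes so a
# single node outage cannot take down every server.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: spire-server
        topologyKey: kubernetes.io/hostname
```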
@marcofranssen
Copy link
Contributor

marcofranssen commented Aug 28, 2023

Spire-controller-manager can't be separated, unfortunately, as it needs access to the spire-server.sock.

We might do a trick to run it on just one of the replicas until spire-controller-manager has another way of connecting to the server.sock, or maybe we can do a trick with a nodeSelector to always run it on the same node as one of the spire-server instances 🤔.
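
For context, a rough sketch of why the two are coupled today: the controller manager reaches the server over a Unix domain socket, so it runs as a sidecar sharing a volume with the spire-server container in the same pod. Image tags and the socket path below are illustrative, not the chart's actual rendering:

```yaml
# Sketch only: both containers mount the same emptyDir so the sidecar can
# reach the server's api.sock; this is what ties them to a single pod.
apiVersion: v1
kind: Pod
metadata:
  name: spire-server
spec:
  containers:
    - name: spire-server
      image: ghcr.io/spiffe/spire-server:1.7.2             # illustrative tag
      volumeMounts:
        - name: server-socket
          mountPath: /tmp/spire-server/private
    - name: spire-controller-manager
      image: ghcr.io/spiffe/spire-controller-manager:0.3.0 # illustrative tag
      volumeMounts:
        - name: server-socket
          mountPath: /tmp/spire-server/private
  volumes:
    - name: server-socket
      emptyDir: {}
```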

@drewwells
Contributor Author

drewwells commented Aug 28, 2023 via email

@marcofranssen
Contributor

Could you file an issue for that on the spire-controller-manager repo and reference it here?

@kfox1111
Contributor

StatefulSets can support HA, so that part isn't strictly needed.

An external DB should work with one of the HA database charts/operators.

While not ideal, I believe the controller-manager already does locking in its current config.

So I think it's possible today to build a usable HA configuration with the chart as-is.
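
One way to sanity-check the locking part on a running cluster (assuming the `spire-system` namespace used elsewhere in this thread; the lease name depends on the controller-manager's configuration):

```sh
# Leader election in controller-runtime shows up as a coordination.k8s.io Lease;
# the spire-controller-manager lease should list a single holder even with
# several server replicas running.
kubectl -n spire-system get leases
```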

@drewwells
Contributor Author

If you change replicas > 1 today, spire-server stops working. We need better HA support. In general, spire-server needs to move towards stateless operation, without the baggage of a controller or sqlite. Moving to a Deployment with a ReplicaSet encourages that.

@kfox1111
Contributor

Could you please provide some more detail? How do things stop working?

@drewwells
Contributor Author

An integration test would be a better way to prove it works. I don't know the details, but replicas > 1 caused an outage on our cluster.

@kfox1111
Contributor

kfox1111 commented Aug 28, 2023

+1 to more tests...

But I'm not seeing any brokenness. More details, please:

$ kubectl get pods --all-namespaces 
NAMESPACE      NAME                                                   READY   STATUS      RESTARTS   AGE
kube-system    coredns-787d4945fb-ssw2p                               1/1     Running     0          2m59s
kube-system    etcd-minikube                                          1/1     Running     0          3m11s
kube-system    kube-apiserver-minikube                                1/1     Running     0          3m13s
kube-system    kube-controller-manager-minikube                       1/1     Running     0          3m12s
kube-system    kube-proxy-45r5k                                       1/1     Running     0          2m59s
kube-system    kube-scheduler-minikube                                1/1     Running     0          3m12s
kube-system    storage-provisioner                                    1/1     Running     0          3m11s
mysql          mysql-0                                                1/1     Running     0          2m41s
spire-system   spire-agent-47tvn                                      1/1     Running     0          105s
spire-system   spire-server-0                                         2/2     Running     0          105s
spire-system   spire-server-1                                         2/2     Running     0          78s
spire-system   spire-server-2                                         2/2     Running     0          67s
spire-system   spire-server-test-connection                           0/1     Completed   0          44s
spire-system   spire-spiffe-csi-driver-xsn8d                          2/2     Running     0          105s
spire-system   spire-spiffe-oidc-discovery-provider-c679d5556-mmqnf   2/2     Running     0          105s
spire-system   spire-spiffe-oidc-discovery-provider-test-connection   0/3     Completed   0          37s
spire-system   spire-spiffe-oidc-discovery-provider-test-keys         0/1     Completed   0          33s

NOTES:
Installed spire…
+ helm test --namespace spire-system spire
NAME: spire
LAST DEPLOYED: Mon Aug 28 14:07:47 2023
NAMESPACE: spire-system
STATUS: deployed
REVISION: 1
TEST SUITE:     spire-spiffe-oidc-discovery-provider-test-connection
Last Started:   Mon Aug 28 14:08:55 2023
Last Completed: Mon Aug 28 14:08:59 2023
Phase:          Succeeded
TEST SUITE:     spire-spiffe-oidc-discovery-provider-test-keys
Last Started:   Mon Aug 28 14:08:59 2023
Last Completed: Mon Aug 28 14:09:22 2023
Phase:          Succeeded
TEST SUITE:     spire-server-test-connection
Last Started:   Mon Aug 28 14:08:48 2023
Last Completed: Mon Aug 28 14:08:55 2023
Phase:          Succeeded

@marcofranssen
Contributor

@drewwells Could you have a look at either the postgres or the mysql example?

Using that example, you should be able to increase the number of replicas.

Please let us know if that works. Hopefully that helps figure out the difference from your current attempt. Based on that, we might find a way to prevent the misconfiguration in this chart.
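
For anyone following along, a hedged sketch of the values involved, with key names modeled on the chart's examples; check the postgres/mysql example in this repo for the exact keys, which may differ by chart version:

```yaml
# Assumed key names, illustrative host and credentials; not a drop-in config.
spire-server:
  replicaCount: 3
  dataStore:
    sql:
      databaseType: mysql
      databaseName: spire
      host: mysql.mysql.svc.cluster.local
      port: 3306
      username: spire
      password: changeme   # placeholder; source this from a secret in practice
```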

@drewwells
Contributor Author

Regardless of the database, my concern is running multiple spire-controller-managers. An earlier comment noted that a spire-controller-manager ticket was needed here. Unless there are plans to build consensus across multiple stateful pods, the way etcd does, we should drop the StatefulSet and use a Deployment.

@kfox1111
Contributor

The spire-controller-manager itself takes a Kubernetes lock to prevent multiple instances from acting at once. So I believe that part is working today.

The StatefulSet is currently required to persist the intermediate CAs for each individual server. If the StatefulSet members are spread across multiple nodes using anti-affinity, it should still be HA.
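
To keep that property through node drains and rolling upgrades, a PodDisruptionBudget alongside the anti-affinity rules would help. A minimal sketch, assuming three replicas and an `app.kubernetes.io/name: spire-server` label (whether the chart should expose this as a value is a separate question):

```yaml
# Sketch: keep at least two servers available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spire-server
  namespace: spire-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: spire-server
```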
