
restarted secondary fails to join the cluster back #415

Open
gboutry opened this issue May 16, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@gboutry

gboutry commented May 16, 2024

Steps to reproduce

  1. Rollout-restart the pods in a 3-unit cluster (not 100% reproducible, but it happens often enough); see the sketch below.
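A minimal sketch of such a restart, assuming the charm pods are managed by a StatefulSet named heat-mysql in the openstack namespace (names inferred from the unit and endpoint names elsewhere in this report):

```
# Roll all pods of the (assumed) heat-mysql StatefulSet in the openstack namespace,
# one pod at a time; this is the operation that intermittently triggers the rejoin failure.
microk8s kubectl -n openstack rollout restart statefulset/heat-mysql
```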

Expected behavior

The secondary should rejoin the cluster.

Actual behavior

The secondary does not rejoin and is considered offline.

Versions

Operating system:

Juju CLI:

Juju agent:

Charm revision: 127

microk8s: MicroK8s v1.28.7 revision 6532

Log output

2024-05-16T13:45:18.111Z [container-agent] 2024-05-16 13:45:18 INFO juju-log Unit workload member-state is offline with member-role unknown
2024-05-16T13:45:21.896Z [container-agent] 2024-05-16 13:45:21 ERROR juju-log Failed to get cluster status for cluster-ab0e762c137dc447d08ce68b19fb20b3
2024-05-16T13:45:21.903Z [container-agent] 2024-05-16 13:45:21 ERROR juju-log Failed to get cluster endpoints
2024-05-16T13:45:21.903Z [container-agent] Traceback (most recent call last):
2024-05-16T13:45:21.903Z [container-agent]   File "/var/lib/juju/agents/unit-heat-mysql-0/charm/src/mysql_k8s_helpers.py", line 836, in update_endpoints
2024-05-16T13:45:21.903Z [container-agent]     rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
2024-05-16T13:45:21.903Z [container-agent]   File "/var/lib/juju/agents/unit-heat-mysql-0/charm/lib/charms/mysql/v0/mysql.py", line 1469, in get_cluster_endpoints
2024-05-16T13:45:21.903Z [container-agent]     raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
2024-05-16T13:45:21.903Z [container-agent] charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
2024-05-16T13:45:22.191Z [container-agent] 2024-05-16 13:45:22 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-05-16T13:47:53.387910Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-05-16T13:48:00.275796Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 3306'
2024-05-16T13:48:00.385285Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-05-16T13:48:07.654156Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 3306'
2024-05-16T13:48:07.767533Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-05-16T13:48:08.469058Z 28247 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2024-05-16T13:48:08.469343Z 28247 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

Additional context

After a debugging session with @paulomach, we got the instance to rejoin the cluster successfully by running: c.rejoin_instance("heat-mysql-0.heat-mysql-endpoints.openstack.svc.cluster.local:3306")

The command was run from the failed unit against the primary unit (ruling out a connectivity issue).
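For reference, a sketch of how that rejoin can be driven from MySQL Shell on the failed unit. This assumes mysqlsh is available on the unit; clusteradmin is a placeholder for a user with cluster admin privileges, and the primary address below (heat-mysql-1) is an example based on the endpoint names in the logs:

```
# Connect from the failed unit to the primary, fetch the InnoDB Cluster via the
# AdminAPI, and rejoin the failed instance (mysqlsh prompts for the password).
mysqlsh clusteradmin@heat-mysql-1.heat-mysql-endpoints.openstack.svc.cluster.local:3306 --py -e \
  'dba.get_cluster().rejoin_instance("heat-mysql-0.heat-mysql-endpoints.openstack.svc.cluster.local:3306")'
```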

@gboutry gboutry added the bug Something isn't working label May 16, 2024