OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart #2082
Conversation
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/1/input
(force-pushed from 1255998 to 5100d70)
heartbeat/podman-etcd (outdated diff)
```sh
# check to see if the container has already started
podman_simple_status
if [ $? -eq $OCF_SUCCESS ]; then
    return "$OCF_SUCCESS"
```
FYI. There's no need to quote rc codes (they're always numeric). I know some of the existing ones are quoted, but no need to undo those unless you're changing that part of the code.
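A tiny sketch of that point. `OCF_SUCCESS` is normally provided by `ocf-shellfuncs`; it is defined inline here only so the snippet runs standalone:

```shell
# OCF return codes are plain integers, so quoting them in `return` is
# harmless but unnecessary. OCF_SUCCESS is defined inline only to make
# the snippet self-contained (it normally comes from ocf-shellfuncs).
OCF_SUCCESS=0

quoted()   { return "$OCF_SUCCESS"; }  # works; the quotes are redundant
unquoted() { return $OCF_SUCCESS; }    # equally valid for numeric codes

quoted;   rc1=$?
unquoted; rc2=$?
echo "quoted rc=$rc1 unquoted rc=$rc2"
```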
I've updated this, but just as a heads up: I'm not sure we'll ever want to merge all the changes here (maybe just the first commits updating the active resource calculation). I've been using this branch to test the scenario, but I think @clobrano might have a better approach in the future than patching podman_start this way.
Yeah. You can squash the commits or remove the ones you don't want before requesting review.
@oalbrigt Re-requested review, going ahead only with the fix to active_resources count :)
(force-pushed from 899e40e to d5b4428)
LGTM.
I left some comments
```sh
# Filter out resources that are being stopped from the active list
for resource in $active_list; do
    local is_stopping=0
    for stop_resource in $stop_list; do
        if [ "$resource" = "$stop_resource" ]; then
            is_stopping=1
            break
        fi
    done
    if [ $is_stopping -eq 0 ]; then
        truly_active="$truly_active $resource"
    fi
done
```
I wonder, if:

- We're just returning the count of words in `$truly_active`
- The actual name of the resources is not important
- The problems with the `-z` test in the previous comment

it seems to me that we can simplify the function with some basic math. Something like this:

```sh
get_truly_active_resources_count() {
    local active_count stop_count
    active_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
    stop_count=$(echo "$OCF_RESKEY_CRM_meta_notify_stop_resource" | wc -w)
    echo $((active_count - stop_count))
}
```

Thoughts?
Hmmmmm, I see your point, but what if a resource appears in the stop list without also being active? Can we guarantee this from the Pacemaker code? In that case, this approach would break.
Also, I understand wanting to reduce this to pure arithmetic, but I feel that we might lose some valuable information in the future by changing from a set comparison to a subtraction of counts. It's true that the code is not very elegant, but I'm not sure it's mathematically equivalent after this change.
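A quick illustration of where the two approaches can diverge (the resource names here are made up for the example): if the stop list ever contained a resource that is not in the active list, count subtraction would undercount, while set filtering would not.

```shell
active_list="etcd:0 etcd:1"
stop_list="etcd:1 etcd:2"   # etcd:2 is stopping but is not in the active list

# Arithmetic approach: subtract word counts.
active_count=$(echo "$active_list" | wc -w)
stop_count=$(echo "$stop_list" | wc -w)
arith_result=$((active_count - stop_count))

# Set-filtering approach: keep active resources absent from the stop list.
truly_active=""
for resource in $active_list; do
    is_stopping=0
    for stop_resource in $stop_list; do
        if [ "$resource" = "$stop_resource" ]; then
            is_stopping=1
            break
        fi
    done
    if [ "$is_stopping" -eq 0 ]; then
        truly_active="$truly_active $resource"
    fi
done
set_result=$(echo "$truly_active" | wc -w)

# The arithmetic approach yields 0; set filtering correctly yields 1 (etcd:0).
echo "arithmetic=$arith_result set-filtering=$set_result"
```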
> what if a resource appears in the stop count without it being active too? Can we guarantee this from Pacemaker code?

While I wouldn't expect a resource listed to be stopped (as `OCF_RESKEY_CRM_meta_notify_stop_resource` is defined) not to be active as well (otherwise why stop it?), I agree that this defensive approach is safer 👍.
Fix rapid restart failure in podman-etcd resource agent

### Problem Statement

TNF (Two-Node Failover) clusters do not automatically recover from some etcd process crashes. When an etcd process is killed directly (bypassing Pacemaker's normal stop procedure), the cluster detects the failure via the monitor operation and attempts stop→start recovery, but the start operation fails. This requires manual intervention (`pcs resource cleanup etcd`) to recover the cluster.

### Root Cause

During rapid restart scenarios (e.g., process crash recovery), Pacemaker's clone notification variables show resources in transitional states. Specifically, a resource can appear in both the `active` and `stop` lists simultaneously. The podman-etcd agent was using a naive word count of `OCF_RESKEY_CRM_meta_notify_active_resource`, which doesn't account for resources being stopped. This caused the agent to see 2 active resources when it expected only 1 (the standalone leader), leading to startup failure.

### Solution

According to the Pacemaker documentation, during "Post-notification (stop) / Pre-notification (start)" transitions, the true active resource count must be calculated as the resources in the `active` list minus those that also appear in the `stop` list.
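A minimal sketch of this defensive count, following the set-filtering approach discussed in the review (the function name matches the PR's helper, but this is an illustration rather than the merged code):

```shell
get_truly_active_resources_count() {
    # Count resources in the notify "active" list that do not also appear
    # in the notify "stop" list (i.e., resources that are not mid-stop).
    count=0
    for resource in $OCF_RESKEY_CRM_meta_notify_active_resource; do
        is_stopping=0
        for stop_resource in $OCF_RESKEY_CRM_meta_notify_stop_resource; do
            if [ "$resource" = "$stop_resource" ]; then
                is_stopping=1
                break
            fi
        done
        if [ "$is_stopping" -eq 0 ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Example: during a rapid restart, etcd:1 is both active and stopping,
# so only etcd:0 counts as truly active.
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"
OCF_RESKEY_CRM_meta_notify_stop_resource="etcd:1"
get_truly_active_resources_count
```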
### Changes Made

- Added `get_truly_active_resources_count()` helper function (lines 1032-1072), which filters out resources in `active_resource` that also appear in `stop_resource`
- Updated the `active_resources_count` calculation in `podman_start` (line 1574)

### References

- `test/extended/two_node/tnf_recovery.go:173` -> To be merged