@fonta-rh fonta-rh commented Oct 14, 2025

Fix rapid restart failure in podman-etcd resource agent

Problem Statement

TNF (Two-Node Failover) clusters do not automatically recover from some etcd process crashes. When an etcd process is killed directly (bypassing Pacemaker's normal stop procedure), the cluster detects the failure via the monitor operation and attempts a stop→start recovery, but the start operation fails with:

ERROR: Unexpected active resource count: 2

This requires manual intervention (pcs resource cleanup etcd) to recover the cluster.

Root Cause

During rapid restart scenarios (e.g., process crash recovery), Pacemaker's clone notification variables show resources in transitional states. Specifically, a resource can appear in both the active and stop lists simultaneously:

notify: type=pre, operation=stop,
  active=[etcd:0 etcd:1],    ← Both marked active
  start=[etcd:1],             ← master-1 is starting
  stop=[etcd:1]               ← master-1 is also stopping

The podman-etcd agent was using a naive word count of OCF_RESKEY_CRM_meta_notify_active_resource, which doesn't account for resources being stopped. This caused the agent to see 2 active resources when it expected only 1 (the standalone leader), leading to startup failure.
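
To make the miscount concrete, here is a minimal sketch that replays the word count against the values from the notification above. The variable assignments are copied from that example, not captured from a live cluster, and `naive_count` is a name used only for this illustration.

```sh
#!/bin/sh
# Values copied from the pre-stop notification shown above (illustrative only).
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"
OCF_RESKEY_CRM_meta_notify_stop_resource="etcd:1"

# Naive count: every word in the active list is counted, including etcd:1,
# even though etcd:1 is also listed in the stop list.
naive_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)

# Normalize any padding that some wc implementations add.
naive_count=$((naive_count + 0))

echo "naive active count: $naive_count"
```

With these inputs the word count is 2, which is the value that triggers the "Unexpected active resource count" error.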

Solution

According to the Pacemaker documentation, during "Post-notification (stop) / Pre-notification (start)" transitions, the true active resource count must be calculated as:

Active resources = $OCF_RESKEY_CRM_meta_notify_active_resource
                   minus $OCF_RESKEY_CRM_meta_notify_stop_resource

Changes Made

  1. Added get_truly_active_resources_count() helper function (lines 1032-1072):

    • Implements the Pacemaker-documented algorithm for calculating true active count
    • Filters out resources from active_resource that also appear in stop_resource
  2. Updated active_resources_count calculation in podman_start (line 1574):

    # Before (BROKEN):
    active_resources_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
    
    # After (FIXED):
    active_resources_count=$(get_truly_active_resources_count)
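
For readers who want to try the idea outside the agent, the following is a rough, self-contained sketch of such a helper. It is an assumption based on the loop excerpt reviewed later in this conversation, not the merged function body: it counts directly instead of building a `truly_active` string, and it omits any logging or validation the real agent may perform.

```sh
#!/bin/sh
# Sketch only: approximates the helper described above, not the merged code.
get_truly_active_resources_count() {
    local active_list stop_list resource stop_resource is_stopping count

    active_list="$OCF_RESKEY_CRM_meta_notify_active_resource"
    stop_list="$OCF_RESKEY_CRM_meta_notify_stop_resource"
    count=0

    # Count every active resource that is not also scheduled to stop.
    for resource in $active_list; do
        is_stopping=0
        for stop_resource in $stop_list; do
            if [ "$resource" = "$stop_resource" ]; then
                is_stopping=1
                break
            fi
        done
        if [ "$is_stopping" -eq 0 ]; then
            count=$((count + 1))
        fi
    done

    echo "$count"
}

# Usage with the values from the notification in the Root Cause section.
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"
OCF_RESKEY_CRM_meta_notify_stop_resource="etcd:1"
active_resources_count=$(get_truly_active_resources_count)
echo "$active_resources_count"   # prints 1
```

Because etcd:1 appears in both lists, only etcd:0 is counted, which matches the single standalone leader that podman_start expects.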

knet-jenkins bot commented Oct 14, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/1/input

@oalbrigt oalbrigt changed the title from "OCPBUGS-59238: Redo counting of active_resources to avoid bug on rapid etcd restart" to "OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart" Oct 14, 2025
@fonta-rh fonta-rh force-pushed the OCPBUGS-59238-fix-active-resource-count branch from 1255998 to 5100d70 October 21, 2025 12:33
# check to see if the container has already started
podman_simple_status
if [ $? -eq $OCF_SUCCESS ]; then
    return "$OCF_SUCCESS"

Contributor

FYI. There's no need to quote rc codes (they're always numeric). I know some of the existing ones are quoted, but no need to undo those unless you're changing that part of the code.

Contributor Author

I've updated this, but just as a heads-up: I'm not sure we'll ever want to merge all the changes here (maybe just the first commits, which update the active resource calculation). I've been using this branch to test the scenario, but I think @clobrano might have a better approach in the future than patching podman_start this way.

Contributor

Yeah. You can squash the commits or remove the ones you don't want before requesting review.

Contributor Author

@oalbrigt Re-requested review, going ahead only with the fix to active_resources count :)

@clobrano clobrano self-requested a review October 24, 2025 11:54
@fonta-rh fonta-rh force-pushed the OCPBUGS-59238-fix-active-resource-count branch from 899e40e to d5b4428 October 29, 2025 09:36
@fonta-rh fonta-rh marked this pull request as ready for review October 29, 2025 09:36
@fonta-rh fonta-rh requested a review from oalbrigt October 29, 2025 09:36
Contributor

@oalbrigt oalbrigt left a comment

LGTM.

Collaborator

@clobrano clobrano left a comment

I left some comments

Comment on lines +1056 to +1068
# Filter out resources that are being stopped from the active list
for resource in $active_list; do
    local is_stopping=0
    for stop_resource in $stop_list; do
        if [ "$resource" = "$stop_resource" ]; then
            is_stopping=1
            break
        fi
    done
    if [ $is_stopping -eq 0 ]; then
        truly_active="$truly_active $resource"
    fi
done

Collaborator

I wonder, given that

  • we're just returning the count of words in $truly_active,
  • the actual names of the resources are not important, and
  • there are the problems with the -z test mentioned in the previous comment,

it seems to me that we can simplify the function with some basic math.

Something like this:

get_truly_active_resources_count() {
    local active_count stop_count

    active_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
    stop_count=$(echo "$OCF_RESKEY_CRM_meta_notify_stop_resource" | wc -w)

    echo $((active_count - stop_count))
}

Thoughts?

Contributor Author

Hmmmmm, I see your point, but what if a resource appears in the stop count without it being active too? Can we guarantee this from Pacemaker code? In that case, this approach would break.

Also, I understand wanting to reduce this to pure arithmetic, but I feel that we might lose some valuable information for the future by changing from a set comparison to a subtraction of counts. It's true that the code is not very elegant, but I'm not sure this change is mathematically equivalent.

Collaborator

"what if a resource appears in the stop count without it being active too? Can we guarantee this from Pacemaker code?"

While I wouldn't expect a resource "to be stopped" (as OCF_RESKEY_CRM_meta_notify_stop_resource is defined) not to also be active (otherwise why stop it?), I agree that this defensive approach is safer 👍.
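
The difference between the two approaches discussed in this thread can be shown with a small sketch. The helper names `count_by_subtraction` and `count_by_set_difference` are hypothetical and exist only for this comparison; the set difference here uses a `case` pattern match rather than the nested loop from the PR, and it assumes the lists are separated by single spaces. The second scenario, where the stop list names a resource that is not active, is a constructed input and not something observed from Pacemaker.

```sh
#!/bin/sh
# Hypothetical helpers for comparison; neither is taken from the agent.

# Count subtraction, as in the suggested simplification.
count_by_subtraction() {
    local active_count stop_count
    active_count=$(echo "$1" | wc -w)
    stop_count=$(echo "$2" | wc -w)
    echo $((active_count - stop_count))
}

# Set difference: count entries of the active list absent from the stop list.
count_by_set_difference() {
    local resource count=0
    for resource in $1; do
        case " $2 " in
            *" $resource "*) ;;            # also stopping: not counted
            *) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Usual case: the stopping resource is also listed as active.
count_by_subtraction "etcd:0 etcd:1" "etcd:1"      # prints 1
count_by_set_difference "etcd:0 etcd:1" "etcd:1"   # prints 1

# Constructed case: the stop list names a resource that is not active.
count_by_subtraction "etcd:0" "etcd:1"             # prints 0
count_by_set_difference "etcd:0" "etcd:1"          # prints 1
```

The two agree whenever the stop list is a subset of the active list, and diverge only in the constructed case, which is the scenario the defensive set comparison guards against.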

@clobrano clobrano merged commit b2947e7 into ClusterLabs:main Oct 31, 2025
1 check passed