OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart #2082
Conversation
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/1/input
(force-pushed from 1255998 to 5100d70)
heartbeat/podman-etcd (outdated diff)
```sh
# check to see if the container has already started
podman_simple_status
if [ $? -eq $OCF_SUCCESS ]; then
    return "$OCF_SUCCESS"
```
FYI. There's no need to quote rc codes (they're always numeric). I know some of the existing ones are quoted, but no need to undo those unless you're changing that part of the code.
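A tiny sketch of that point. `OCF_SUCCESS` is normally provided by `ocf-shellfuncs`; it is defined inline here only so the snippet runs standalone:

```shell
# OCF return codes are plain integers, so quoting them in `return` is
# harmless but unnecessary. OCF_SUCCESS is defined inline only to make
# the snippet self-contained (it normally comes from ocf-shellfuncs).
OCF_SUCCESS=0

quoted()   { return "$OCF_SUCCESS"; }  # works; the quotes are redundant
unquoted() { return $OCF_SUCCESS; }    # equally valid for numeric codes

quoted;   rc1=$?
unquoted; rc2=$?
echo "quoted rc=$rc1 unquoted rc=$rc2"
```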
I've updated this, but just as a heads up: I'm not sure we'll ever want to merge all the changes here (maybe just the first commits updating the active resource calculation). I've been using this branch to test the scenario, but I think @clobrano might have a better approach in the future than patching podman_start this way.
Yeah. You can squash the commits or remove the ones you don't want before requesting review.
@oalbrigt Re-requested review, going ahead only with the fix to active_resources count :)
(force-pushed from 899e40e to d5b4428)
LGTM.
I left some comments
```sh
# Filter out resources that are being stopped from the active list
for resource in $active_list; do
    local is_stopping=0
    for stop_resource in $stop_list; do
        if [ "$resource" = "$stop_resource" ]; then
            is_stopping=1
            break
        fi
    done
    if [ $is_stopping -eq 0 ]; then
        truly_active="$truly_active $resource"
    fi
done
```
I wonder, if:

- We're just returning the count of words in `$truly_active`
- The actual name of the resources is not important
- The problems with the `-z` test in the previous comment

it seems to me that we can simplify the function with some basic math. Something like this:

```sh
get_truly_active_resources_count() {
    local active_count stop_count
    active_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
    stop_count=$(echo "$OCF_RESKEY_CRM_meta_notify_stop_resource" | wc -w)
    echo $((active_count - stop_count))
}
```

Thoughts?
Hmmmmm, I see your point, but what if a resource appears in the stop list without also being active? Can we guarantee this from the Pacemaker code? In that case, this approach would break.
Also, I understand wanting to reduce this to pure arithmetic, but I feel that we might lose some valuable information in the future by changing from a set comparison to a subtraction of counts. It's true that the code is not very elegant, but I'm not sure it's mathematically equivalent after this change.
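A quick illustration of where the two approaches can diverge (the resource names here are made up for the example): if the stop list ever contained a resource that is not in the active list, count subtraction would undercount, while set filtering would not.

```shell
active_list="etcd:0 etcd:1"
stop_list="etcd:1 etcd:2"   # etcd:2 is stopping but is not in the active list

# Arithmetic approach: subtract word counts.
active_count=$(echo "$active_list" | wc -w)
stop_count=$(echo "$stop_list" | wc -w)
arith_result=$((active_count - stop_count))

# Set-filtering approach: keep active resources absent from the stop list.
truly_active=""
for resource in $active_list; do
    is_stopping=0
    for stop_resource in $stop_list; do
        if [ "$resource" = "$stop_resource" ]; then
            is_stopping=1
            break
        fi
    done
    if [ "$is_stopping" -eq 0 ]; then
        truly_active="$truly_active $resource"
    fi
done
set_result=$(echo "$truly_active" | wc -w)

# The arithmetic approach yields 0; set filtering correctly yields 1 (etcd:0).
echo "arithmetic=$arith_result set-filtering=$set_result"
```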
> what if a resource appears in the stop count without it being active too? Can we guarantee this from Pacemaker code?

While I wouldn't expect a resource listed to be stopped (as `OCF_RESKEY_CRM_meta_notify_stop_resource` is defined) not to be active as well (otherwise why stop it?), I agree that this defensive approach is safer 👍.
Fix rapid restart failure in podman-etcd resource agent

### Problem Statement

TNF (Two-Node Failover) clusters do not automatically recover from some etcd process crashes. When an etcd process is killed directly (bypassing Pacemaker's normal stop procedure), the cluster detects the failure via the monitor operation and attempts stop→start recovery, but the start operation fails. This requires manual intervention (`pcs resource cleanup etcd`) to recover the cluster.

### Root Cause

During rapid restart scenarios (e.g., process crash recovery), Pacemaker's clone notification variables show resources in transitional states. Specifically, a resource can appear in both the `active` and `stop` lists simultaneously. The podman-etcd agent was using a naive word count of `OCF_RESKEY_CRM_meta_notify_active_resource`, which doesn't account for resources being stopped. This caused the agent to see 2 active resources when it expected only 1 (the standalone leader), leading to startup failure.

### Solution

According to the Pacemaker documentation, during "Post-notification (stop) / Pre-notification (start)" transitions, the true active resource count must be calculated as the resources in the `active` list minus those that also appear in the `stop` list.
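A minimal sketch of this defensive count, following the set-filtering approach discussed in the review (the function name matches the PR's helper, but this is an illustration rather than the merged code):

```shell
get_truly_active_resources_count() {
    # Count resources in the notify "active" list that do not also appear
    # in the notify "stop" list (i.e., resources that are not mid-stop).
    count=0
    for resource in $OCF_RESKEY_CRM_meta_notify_active_resource; do
        is_stopping=0
        for stop_resource in $OCF_RESKEY_CRM_meta_notify_stop_resource; do
            if [ "$resource" = "$stop_resource" ]; then
                is_stopping=1
                break
            fi
        done
        if [ "$is_stopping" -eq 0 ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Example: during a rapid restart, etcd:1 is both active and stopping,
# so only etcd:0 counts as truly active.
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"
OCF_RESKEY_CRM_meta_notify_stop_resource="etcd:1"
get_truly_active_resources_count
```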
### Changes Made

- Added `get_truly_active_resources_count()` helper function (lines 1032-1072), which filters out resources in `active_resource` that also appear in `stop_resource`
- Updated the `active_resources_count` calculation in `podman_start` (line 1574)

### References

- `test/extended/two_node/tnf_recovery.go:173` -> To be merged