OCPBUGS-16008: Reconciler pending fix [WIP] #170

Closed

nicklesimba
Contributor

This fix resolves the issue where, after a forceful node reboot, force deleting a pod in a stateful set causes the pod to be recreated and remain indefinitely in the Pending state.

Note: The fix was originally authored by user xagent003. To resolve OCPBUGS-16008, we have made this patch available for 4.12, but we cannot merge this PR until the corresponding upstream, 4.14, and 4.13 changes are merged first. Nonetheless, 4.12 users who are affected by OCPBUGS-16008 can apply this patch to resolve the issue.

Solution description, as written on xagent003's upstream PR:
"We shouldn't treat all pending Pods as "alive" and skip the check. The list of all Pods fetched earlier may be stale, and as observed in some scenarios, several seconds before the ip-reconciler does the isPodAlive check.

Instead, can we retry a Get on an individual Pod, with the hopes that it has final IP/network annotations? So we try to refetch the pod a few times if it is Pending state and initial IP check fails. After that, just do the IP matching check like before"

Note that xagent003's upstream PR is stale and has since been rebased by dougbtv. You can find the current upstream PR here.

@openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/invalid-bug (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels Aug 4, 2023
@openshift-ci-robot
Contributor

@nicklesimba: This pull request references Jira Issue OCPBUGS-16008, which is invalid:

  • expected the bug to target the "4.12.z" version, but no target version was set
  • expected Jira Issue OCPBUGS-16008 to depend on a bug targeting a version in 4.13.0, 4.13.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from aneeshkp and fepan August 4, 2023 16:57
@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nicklesimba
Once this PR has been reviewed and has the lgtm label, please assign dougbtv for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2023

@nicklesimba: all tests passed!


@nicklesimba
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Sep 5, 2023

@nicklesimba: This pull request references Jira Issue OCPBUGS-16008, which is invalid:

  • bug is open, matching expected state (open)
  • bug target version (4.12.z) matches configured target version for branch (4.12.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
  • bug has dependents
  • dependent bug SUPPORTEX-15837 is not in the required OCPBUGS project

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@nicklesimba nicklesimba closed this Sep 6, 2023
@nicklesimba nicklesimba deleted the reconciler-pending-fix branch September 6, 2023 19:59
@openshift-ci-robot
Contributor

@nicklesimba: This pull request references Jira Issue OCPBUGS-16008. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
