fix(upgrade): add sleep to allow event to flush before panic #3234
Conversation
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
app/pre_upgrade.go (1)
`82-85`: **Consider restructuring event handling for better reliability.**

The current implementation might still miss events due to the order of operations: the sleep occurs after `Run()`'s deferred event recording but before the broadcaster shutdown.
Consider this more robust approach:
```diff
 func preUpgrade(c *cli.Context) error {
 	// ... existing code ...
-	defer eventBroadcaster.Shutdown()

 	err = newPreUpgrader(namespace, lhClient, eventRecorder).Run()
 	if err != nil {
 		logrus.Warnf("Done with Run() ... err is %v", err)
-		time.Sleep(5 * time.Minute)
+		// Ensure event is recorded and flushed
+		time.Sleep(preUpgradeEventFlushDelay)
 	}
+
+	// Gracefully shutdown the broadcaster after all events are flushed
+	eventBroadcaster.Shutdown()

 	return err
 }
```

Additionally, consider implementing a more deterministic approach:
- Add an event queue depth check
- Implement a proper wait mechanism with timeout
- Add metrics for event broadcasting failures
Would you like me to provide a more detailed implementation of these suggestions?
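As a rough illustration of the "proper wait mechanism with timeout" idea in the list above, here is a sketch using client-go. The helper name, polling interval, timeout, and the use of a `reason` field selector are all illustrative assumptions, not code from this PR:

```go
// A sketch of a deterministic wait: poll the API server until the upgrade
// event appears, instead of sleeping for a fixed period.
package app

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForUpgradeEvent polls until an event with the given reason exists in
// the namespace, or the timeout expires.
func waitForUpgradeEvent(ctx context.Context, clientset kubernetes.Interface, namespace, reason string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			events, err := clientset.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
				FieldSelector: fmt.Sprintf("reason=%s", reason),
			})
			if err != nil {
				return false, nil // keep polling on transient errors
			}
			return len(events.Items) > 0, nil
		})
}
```

A wait like this would let the job exit as soon as the event is persisted rather than always paying the full sleep, at the cost of extra API calls and code.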
app/post_upgrade.go (2)
`89-92`: **Log the specific error details before sleep.**

The current implementation only logs that `Run()` is done with an error. For better debugging, consider adding more context about the failure.
```diff
-logrus.Warnf("Done with Run() ... err is %v", err)
+logrus.Errorf("Post-upgrade checks failed: %v. Waiting for events to be flushed...", err)
 time.Sleep(5 * time.Minute)
```
Line range hint `41-67`: **Consider improving error handling in the CLI command.**

The current implementation logs the error and exits, but we could provide more user-friendly error messages and potentially different exit codes for different failure scenarios.
```diff
 Action: func(c *cli.Context) {
 	logrus.Info("Running post-upgrade...")
-	defer logrus.Info("Completed post-upgrade.")
+	defer func() {
+		if r := recover(); r != nil {
+			logrus.Errorf("Post-upgrade panic: %v", r)
+			// Ensure we exit with error code
+			cli.Exit("Post-upgrade failed with panic", 2)
+		}
+	}()
+
 	if err := postUpgrade(c); err != nil {
-		logrus.Fatalf("Error during post-upgrade: %v", err)
+		// Use structured error message
+		logrus.WithError(err).Error("Post-upgrade failed")
+		cli.Exit("Post-upgrade failed. Check logs for details.", 1)
 	}
+	logrus.Info("Completed post-upgrade successfully")
 },
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
- app/post_upgrade.go (1 hunks)
- app/pre_upgrade.go (2 hunks)
🔇 Additional comments (3)
app/pre_upgrade.go (2)
`4-5`: **LGTM: Import addition is appropriate.**

The `time` import is correctly placed and necessary for the new sleep functionality.
`82-85`: **Verify consistent error handling across upgrade components.**

Let's ensure this pattern is consistently applied across all upgrade-related components.
✅ Verification successful
Based on the results, I can now generate the final response as we have sufficient information about the error handling patterns across upgrade components.
Error handling is consistent across upgrade components
The verification shows that:
- Both `preUpgrader` and `postUpgrader` implement consistent error handling patterns in their `Run()` methods
- Event broadcasting and shutdown are handled similarly in both components
- Event recording for upgrade checks follows a consistent pattern with proper event reasons defined in constants
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for similar error handling patterns in upgrade-related files
echo "Checking error handling patterns in upgrade-related files..."
rg -A 5 "func.*[Uu]pgrade.*Run\(\)"

echo "Checking event broadcasting patterns..."
rg -A 5 "eventBroadcaster.Shutdown\(\)"

echo "Checking related event recording..."
rg "EventReasonFailedUpgradePreCheck|EventReasonPassedUpgradeCheck"
```

Length of output: 2340
app/post_upgrade.go (1)
Line range hint `52-58`: **Ensure proper cleanup of resources.**

The eventBroadcaster shutdown is deferred, but we should ensure it has enough time to process events before the program exits.
Let's verify the event broadcaster implementation:
Force-pushed from `d1b2d32` to `a1ade09`
LGTM
It seems that we need both `eventBroadcaster.Shutdown()` and the sleep:

- `eventBroadcaster.Shutdown()` will trigger event flushing
- The sleep will wait for the event to hit the API server

The other alternative considered was creating the event directly using the kubeclient, but using the eventBroadcaster looks more consistent with the rest of the code.
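For reference, the kubeclient alternative mentioned here would look roughly like the following sketch; the event name, involved object, and component are all illustrative assumptions:

```go
// A sketch of creating the Event synchronously with the kubeclient, so
// there is no broadcaster queue to flush before the process exits.
package app

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createUpgradeEventDirectly(ctx context.Context, clientset kubernetes.Interface, namespace, reason, message string) error {
	now := metav1.NewTime(time.Now())
	event := &corev1.Event{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("longhorn-upgrade.%d", now.UnixNano()),
			Namespace: namespace,
		},
		// Placeholder involved object; a real implementation would
		// reference the actual upgrade job or deployment.
		InvolvedObject: corev1.ObjectReference{
			Kind:      "Deployment",
			Namespace: namespace,
			Name:      "longhorn-manager",
		},
		Reason:         reason,
		Message:        message,
		Type:           corev1.EventTypeWarning,
		FirstTimestamp: now,
		LastTimestamp:  now,
		Count:          1,
		Source:         corev1.EventSource{Component: "longhorn-upgrade"},
	}
	// Create() returns only after the API server has accepted the event,
	// so no sleep would be needed.
	_, err := clientset.CoreV1().Events(namespace).Create(ctx, event, metav1.CreateOptions{})
	return err
}
```

The tradeoff, as noted, is that this diverges from how events are recorded elsewhere in the codebase.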
I'm fine with the sleep method. Another idea for reference is combining `EventSinkImpl` with

Then, after
Ok, let's go with the simple sleep, then, and keep this in our back pocket in case problems still show up in testing. Thanks!
```diff
-	defer eventBroadcaster.Shutdown()
+	defer func() {
+		eventBroadcaster.Shutdown()
+		time.Sleep(5 * time.Second)
```
I remember you used a big number before. How do you determine the sleep period?
The long period was to allow a shot at gathering logs before the pod exited. But as @PhanLe1010 points out, that could get in the way of another run, and if the event is in place, the logs are not as important. This is a reasonable guess at a time that should allow the event to be flushed, but get the result back to the job quickly.
@james-munson I see.

Ah, no, I mean leave a comment above `time.Sleep(5 * time.Second)`. Thank you.
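For illustration, the deferred shutdown with such a comment might read as follows; this is a sketch of the shape, not necessarily the exact wording merged:

```go
defer func() {
	eventBroadcaster.Shutdown()
	// Shutdown() only triggers the flush of queued events; it does not
	// wait for them to reach the API server. Sleep briefly so the
	// upgrade-check event is persisted before a potential panic kills the pod.
	time.Sleep(5 * time.Second)
}()
```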
Force-pushed from `a1ade09` to `6675802`
fix(upgrade): add sleep to allow event to flush before panic

Signed-off-by: James Munson <james.munson@suse.com>
Force-pushed from `6675802` to `0c21cf5`
@mergify backport v1.7.x v1.6.x
✅ Backports have been created
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9569
What this PR does / why we need it:
Add a sleep after `Run()`, before panicking, to let the queued event get flushed. We thought that `eventBroadcaster.Shutdown()` would take care of that, and it did in some testing, but there is still a race and it does not work reliably. QA found this while testing the feature and I did reproduce it - see the note in longhorn/longhorn#9569 (comment)
Special notes for your reviewer:
This is the simplest fix. We could do something more complicated, like calling `Watch()` on the events and waiting for the new event to appear, but there is no need to exit as soon as possible; the upgrade will not go forward anyway. The important thing is to guarantee the event can be found so the problem can be corrected.
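For reference, the `Watch()` alternative mentioned above might look roughly like this sketch; the field selector, reason parameter, and timeout handling are assumptions for illustration:

```go
// A sketch of watching for the upgrade event instead of sleeping.
package app

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForEventViaWatch blocks until an event with the given reason is
// observed, or the timeout expires.
func waitForEventViaWatch(clientset kubernetes.Interface, namespace, reason string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	w, err := clientset.CoreV1().Events(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("reason=%s", reason),
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for event with reason %q", reason)
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before event with reason %q appeared", reason)
			}
			if ev.Object != nil {
				// Any delivered object matching the selector means the
				// event has been persisted by the API server.
				return nil
			}
		}
	}
}
```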
Additional documentation or context