feat: add node timeout #109

Merged
merged 3 commits into from
May 23, 2024

@Yu-Jack Yu-Jack commented Apr 17, 2024

Problem

Sometimes nodes spend too much time collecting logs, or never finish at all. As a result, users can't download the support bundle.

Solution

Add a timeout for node log collection. When collection on a node reaches the timeout, the manager skips that node instead of getting stuck, so users can download the support bundle without waiting indefinitely. In that situation, however, the skipped node's logs are missing from the support bundle file.

For example, if node A finishes before the timeout but node B doesn't, the support bundle file contains only node A's logs. That still gives us something to check.
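
The skip-on-timeout behaviour can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `collector`, `complete`, and `waitWithTimeout` are hypothetical names, and only the idea (stop waiting once the timeout fires and report which nodes are still missing) comes from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// collector tracks which nodes have not delivered their bundle yet.
// Hypothetical type, for illustration only.
type collector struct {
	mu       sync.Mutex
	expected map[string]struct{}
	done     chan struct{} // closed once every expected node has reported
}

func newCollector(nodes ...string) *collector {
	c := &collector{expected: map[string]struct{}{}, done: make(chan struct{})}
	for _, n := range nodes {
		c.expected[n] = struct{}{}
	}
	if len(c.expected) == 0 {
		close(c.done) // nothing to wait for
	}
	return c
}

// complete marks a node as finished and closes done when none are left.
func (c *collector) complete(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.expected[node]; !ok {
		return // unknown or already completed node
	}
	delete(c.expected, node)
	if len(c.expected) == 0 {
		close(c.done)
	}
}

// waitWithTimeout blocks until all nodes report or the timeout elapses.
// It returns the sorted names of the nodes that were skipped (empty if none).
func (c *collector) waitWithTimeout(timeout time.Duration) []string {
	select {
	case <-c.done:
		return nil
	case <-time.After(timeout):
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	skipped := make([]string, 0, len(c.expected))
	for n := range c.expected {
		skipped = append(skipped, n)
	}
	sort.Strings(skipped)
	return skipped
}

func main() {
	c := newCollector("node-a", "node-b")
	c.complete("node-a") // node-a finishes in time; node-b never reports
	for _, n := range c.waitWithTimeout(100 * time.Millisecond) {
		fmt.Printf("Collection timed out for node: %s\n", n)
	}
}
```

In this sketch node-a's result is kept and only node-b is reported as skipped, which mirrors the A/B example above.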

Related Issue

harvester/harvester#1646

Test

This is the test case where node collection reaches the timeout.

level=debug msg="Creating daemonset supportbundle-agent-bundle-1wrog with image jk82421/support-bundle-kit:v0.0.36.4"  
level=debug msg="Waiting for the creation of agent DaemonSet Pods for scheduled node names collection"       
level=debug msg="Expecting bundles from nodes: map[jacklnode:]"      
level=info msg="Some node bundles are received." 
level=warning msg="Collection timed out for node: jacklnode"         
level=info msg="Succeed to run phase node bundle. Progress (60)."

It's hard to simulate a node being stuck for 30 minutes, so I suggest the following steps to test:

  1. Create a support bundle.
  2. Change the env of deployment/support-bundle-manager-xxxx, setting it like this:
       - name: SUPPORT_BUNDLE_NODE_TIMEOUT
         value: "1s"
  3. Wait for the deployment to restart.
  4. After downloading the support bundle, it shouldn't contain any node logs.
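
For reference, reading such a duration from the environment could look like the sketch below. The variable name is the one used in the steps above. The helper name `parseNodeTimeout`, the 30-minute default (assumed from the figure mentioned above), and the fall-back-on-error behaviour are assumptions, not necessarily what this PR implements.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultNodeTimeout is an assumption based on the "30 minutes" mentioned in the test notes.
const defaultNodeTimeout = 30 * time.Minute

// parseNodeTimeout converts a value such as "1s" or "30m" into a duration.
// Empty, malformed, or non-positive values fall back to the default (an assumption).
func parseNodeTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultNodeTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultNodeTimeout
	}
	return d
}

func main() {
	// SUPPORT_BUNDLE_NODE_TIMEOUT is the variable set on the manager deployment in the steps above.
	timeout := parseNodeTimeout(os.Getenv("SUPPORT_BUNDLE_NODE_TIMEOUT"))
	fmt.Println("node collection timeout:", timeout)
}
```

With the value "1s" from the steps above, every node that takes longer than one second would be skipped, which is why the downloaded bundle is expected to contain no node logs.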

TODO

If we're okay with this default timeout, the following features can be postponed.

  • Node timeout setting/documentation for harvester/harvester
  • Node timeout setting/documentation for longhorn

Signed-off-by: Jack Yu <jack.yu@suse.com>
@Yu-Jack Yu-Jack self-assigned this Apr 17, 2024
@Yu-Jack Yu-Jack marked this pull request as ready for review April 17, 2024 08:24
@@ -308,7 +332,7 @@ func (m *SupportBundleManager) completeNode(node string) {
 	if len(m.expectedNodes) == 0 {
 		if !m.done {
 			logrus.Debugf("All nodes are completed")
-			m.ch <- struct{}{}
+			close(m.ch)
Collaborator Author

We can just close the channel here, because nothing else writes to it.
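
As a small illustration of the point (not the PR code): a send on an unbuffered channel blocks until a receiver is ready, whereas `close` never blocks and wakes every current and future receiver. The sketch below uses hypothetical names (`notifier`, `signalDone`, `waited`) and relies only on standard Go channel semantics.

```go
package main

import (
	"fmt"
	"time"
)

// notifier closes ch at most once. The done flag guards against a double
// close, which would panic. In concurrent code the flag needs a lock.
type notifier struct {
	ch   chan struct{}
	done bool
}

func (n *notifier) signalDone() {
	if !n.done {
		close(n.ch)
		n.done = true
	}
}

// waited reports whether ch was closed before the timeout elapsed.
func waited(ch <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	n := &notifier{ch: make(chan struct{})}
	n.signalDone() // returns immediately even though nobody is receiving yet
	n.signalDone() // safe: the flag prevents a second close
	// A closed channel can be received from any number of times.
	fmt.Println(waited(n.ch, time.Second), waited(n.ch, time.Second))
}
```

Closing also means a waiter that has already given up (for example after a timeout) cannot leave the completing side blocked on a send.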


@ibrokethecloud ibrokethecloud left a comment

lgtm. thanks.


@c3y1huang c3y1huang left a comment

LGTM. Thank you.
