[Bug] Bundle is stuck permanently if collection agent fails on one node #73
Comments
@Yu-Jack can you help with this one? Thanks.
@innobead Sure, no problem, I'll look into it.
Although adding a timeout lets the manager finish the process eventually, it is hard to decide on a reasonable timeout. One reason is that different nodes have different environments, so we can't predict how long each node will take. Another reason is file size: we can't predict how large the files will be. For example, two example sizes are mentioned in #72, and the agent timeout also affects uploading. So I think we could monitor the progress of all nodes instead. To achieve this, we need to rewrite our shell script and combine it with our Golang code; doing that would also tell us which steps are stuck. After that, we could show the progress of each node in the GUI, so we know which node is stuck or failed and which node succeeded. Here is my idea:
If we choose the simple approach, I think we could do:
The disadvantage of this approach is that the timeout might not be very useful, as I mentioned before. @bk201 WDYT?
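As a rough illustration of the per-node progress monitoring proposed above, here is a minimal Go sketch. Everything in it is an assumption made for illustration: the `Tracker` type, the phase names, and the node names are hypothetical and are not taken from the support bundle code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Hypothetical phase names; the real collection steps may differ.
const (
	PhasePending    = "pending"
	PhaseCollecting = "collecting"
	PhaseUploading  = "uploading"
	PhaseDone       = "done"
	PhaseFailed     = "failed"
)

// Tracker records the last phase reported by each node's agent.
type Tracker struct {
	mu     sync.Mutex
	phases map[string]string
}

// NewTracker starts every expected node in the pending phase.
func NewTracker(nodes []string) *Tracker {
	t := &Tracker{phases: make(map[string]string, len(nodes))}
	for _, n := range nodes {
		t.phases[n] = PhasePending
	}
	return t
}

// Update stores the latest phase for a node. It is safe for concurrent use.
func (t *Tracker) Update(node, phase string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.phases[node] = phase
}

// Unfinished returns "node: phase" for every node that is not done,
// sorted by node name, so a UI can show which node is stuck or failed.
func (t *Tracker) Unfinished() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var out []string
	for n, p := range t.phases {
		if p != PhaseDone {
			out = append(out, n+": "+p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	t := NewTracker([]string{"node-1", "node-2", "node-3"})
	t.Update("node-1", PhaseDone)
	t.Update("node-2", PhaseUploading)
	t.Update("node-3", PhaseFailed)
	for _, line := range t.Unfinished() {
		fmt.Println(line)
	}
}
```

With a tracker like this, the manager would not need to guess a single timeout for all nodes; it could report the last known phase of each node and let the user decide whether to keep waiting.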
My two cents is to make it simple:
Can you check whether this issue is a duplicate of harvester/harvester#1646?
For reasons outlined in #72, the support bundle collection process could not complete on one node in a cluster. It looks like we wait here indefinitely to receive all expected bundles before proceeding. Since the collection agent on one node failed before checking in, we did not proceed to finish creating the bundle, and the user had nothing to send to support.
Some suggested resolutions:
Stop waiting on m.ch after some time, even if not all bundles have been received. This would ensure we get something, though we would have to determine what a reasonable timeout should be.
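The timeout suggestion could be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the channel here carries node names as strings, and `waitForBundles`, its parameters, and the node names are hypothetical; the actual type and usage of `m.ch` in the manager may differ.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// waitForBundles receives node names from done until every expected node
// has reported or the timeout elapses. It returns the sorted names of the
// nodes that never reported, or nil if all of them did.
func waitForBundles(done <-chan string, expected []string, timeout time.Duration) []string {
	pending := make(map[string]bool, len(expected))
	for _, n := range expected {
		pending[n] = true
	}

	// A single deadline for the whole wait, not one per received bundle.
	deadline := time.After(timeout)
	for len(pending) > 0 {
		select {
		case node := <-done:
			delete(pending, node)
		case <-deadline:
			missing := make([]string, 0, len(pending))
			for n := range pending {
				missing = append(missing, n)
			}
			sort.Strings(missing)
			return missing
		}
	}
	return nil
}

func main() {
	done := make(chan string, 3)
	done <- "node-1"
	done <- "node-2"
	// node-3 never reports, simulating a failed collection agent.

	missing := waitForBundles(done, []string{"node-1", "node-2", "node-3"}, 100*time.Millisecond)
	if len(missing) > 0 {
		fmt.Println("proceeding without bundles from:", missing)
	}
}
```

Returning the list of missing nodes lets the caller still build a partial bundle and record which nodes were skipped, so the user has something to send to support.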