
[Bug] Bundle is stuck permanently if collection agent fails on one node #73

Open
ejweber opened this issue May 25, 2023 · 4 comments

@ejweber
Contributor

ejweber commented May 25, 2023

For reasons outlined in #72, the support bundle collection process could not complete on one node in a cluster. It looks like we wait here indefinitely to receive all expected bundles before proceeding. Since the collection agent on one node failed before checking in, we did not proceed to finish creating the bundle, and the user had nothing to send to support.

Some suggested resolutions:

  • A timeout mechanism could automatically send on m.ch after some time, even if all bundles had not been received. This would ensure we got something, though we would have to determine what a reasonable timeout should be (see the sketch after this list).
  • Watch for DaemonSet Pod restarts. After some threshold (or maybe just one), stop expecting the corresponding collection agent to send a bundle.
  • The collection agent could survive errors like the one the user experienced and send at least something to the manager. This probably doesn't help us in a network partition, etc.
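
A rough sketch of what the first option could look like, assuming the manager currently blocks on a channel (m.ch in the code) until every node has reported; the function name, channel setup, and timeout value below are illustrative rather than the actual support-bundle-kit code:

```go
package main

import (
	"fmt"
	"time"
)

// waitForNodeBundles is a hypothetical stand-in for the manager's blocking
// wait: it returns once every expected node has reported, or once the
// timeout elapses, whichever comes first.
func waitForNodeBundles(done <-chan string, expected int, timeout time.Duration) []string {
	var received []string
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for len(received) < expected {
		select {
		case node := <-done:
			received = append(received, node)
		case <-timer.C:
			// Give up on nodes that never checked in and proceed with a
			// partial bundle instead of hanging forever.
			fmt.Printf("timed out waiting for %d of %d nodes\n", expected-len(received), expected)
			return received
		}
	}
	return received
}

func main() {
	done := make(chan string, 3)
	done <- "node-1"
	done <- "node-2"
	// node-3 never reports, e.g. its collection agent crashed.

	nodes := waitForNodeBundles(done, 3, 2*time.Second)
	fmt.Println("bundling results from:", nodes)
}
```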
@innobead

@Yu-Jack can you help with this one? thanks.

@Yu-Jack
Collaborator

Yu-Jack commented Mar 27, 2024

@innobead sure, no problem, I'll look into it.

@Yu-Jack
Collaborator

Yu-Jack commented Mar 28, 2024

Although adding a timeout lets the manager finish the process eventually, it's hard to decide on a reasonable timeout. One reason is that different nodes have different environments, so we can't predict how long each node will take.

Another reason is file size: we can't predict how large the files will be. For example, two example sizes are mentioned in #72, and the agent timeout also affects uploading.

So I think we could monitor the progress of all nodes. To achieve this, we would need to rewrite our shell script and combine it with our Golang code, which would also tell us which steps are stuck. After that, we could show the progress of each node on the GUI, so we know which nodes are stuck, failed, or succeeded, and even show a Stop button to terminate them.

Here is my idea:

  • Use smaller bundles instead of one big bundle during collection:
    In the original flow, after we collect A & B, we bundle them into a single bundle.tar.gz and send it to the manager.

    In the new flow, we could send separate bundles, A.tar.gz and B.tar.gz, to the manager. At least we would know what has already been collected; something is better than nothing. But we would need to unpack those tarballs in the manager pod and place the files locally.

    This also lets us know what the current step is during collection.

  • Manager pulls progress from the agent and sets a timeout for each step:
    Once we have the progress, we also have the full list of steps and know which steps haven't run yet (see the sketch after this list).

  • Tail the pod log when a timeout occurs:
    If a timeout occurs, tail the pod logs and save them into the bundle file for further investigation.
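
A rough sketch of the pull-progress idea, assuming the agent exposes some HTTP progress endpoint; the /progress path, the Progress payload shape, and the "done" step name are hypothetical, not an existing support-bundle-kit API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Progress is a hypothetical payload an agent could report: the step it is
// currently on and the steps it has already finished.
type Progress struct {
	CurrentStep    string   `json:"currentStep"`
	CompletedSteps []string `json:"completedSteps"`
}

// pollProgress polls one agent until it reports "done" or the current step
// makes no progress for stepTimeout.
func pollProgress(agentURL string, stepTimeout time.Duration) error {
	lastStep := ""
	deadline := time.Now().Add(stepTimeout)

	for {
		resp, err := http.Get(agentURL + "/progress")
		if err != nil {
			return err
		}
		var p Progress
		err = json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if err != nil {
			return err
		}

		switch {
		case p.CurrentStep == "done":
			return nil
		case p.CurrentStep != lastStep:
			// The agent advanced to a new step, so reset the per-step deadline.
			lastStep = p.CurrentStep
			deadline = time.Now().Add(stepTimeout)
		case time.Now().After(deadline):
			return fmt.Errorf("step %q stuck for more than %s", lastStep, stepTimeout)
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	if err := pollProgress("http://agent.example:8080", 2*time.Minute); err != nil {
		fmt.Println("agent failed or timed out:", err)
	}
}
```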

If we choose to do the simple thing, I think we could:

  • Add a large timeout, just to make sure the agent will finish eventually.
  • Have the manager pod tail the agent pod logs when the timeout occurs and save them into the bundle file for further investigation (see the log-tailing sketch at the end of this comment).

The disadvantage of this approach is that the timeout might not be very useful, as I mentioned before. @bk201 WDYT?
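
For the log-tailing part, a minimal sketch using client-go's pod log API; the namespace and pod name are placeholders, and the manager would substitute the real agent pod:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// dumpAgentLog copies the last lines of an agent pod's log to w so they can
// be added to the bundle when that agent times out.
func dumpAgentLog(ctx context.Context, client kubernetes.Interface, namespace, pod string, w io.Writer) error {
	tail := int64(200)
	req := client.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{TailLines: &tail})
	stream, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()

	_, err = io.Copy(w, stream)
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder namespace and pod name; the manager would use the real agent pod.
	err = dumpAgentLog(context.Background(), client, "support-bundle", "support-bundle-agent-xxxxx", os.Stdout)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```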

@bk201
Member

bk201 commented Mar 28, 2024

My two cents is to make it simple:

  • Set a reasonable timeout (we can measure it in our development VMs and add some buffer), and have the sb manager skip any agent that can't report back in time. A node might be dead or in trouble; the worst case is to ask the user/support to log in to the node to retrieve information. There is no need to do the fancy tailing thing, but we can include the agent pod log (in fact it's already included).
  • (Optional) Make the timeout values configurable (a minimal sketch follows this list).
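
A minimal sketch of a configurable timeout, assuming an environment-variable override; the variable name and default value are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultNodeTimeout is a placeholder; the real default would come from
// measurements in development clusters plus some buffer.
const defaultNodeTimeout = 30 * time.Minute

// nodeTimeout reads an optional override from the environment so the value
// can be tuned per deployment.
func nodeTimeout() time.Duration {
	if v := os.Getenv("SUPPORT_BUNDLE_NODE_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return defaultNodeTimeout
}

func main() {
	fmt.Println("waiting up to", nodeTimeout(), "for each node's agent")
}
```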

Can you check whether this issue duplicates harvester/harvester#1646?
