
[Bug] Bundle is stuck permanently if collection agent fails on one node #73

Open
ejweber opened this issue May 25, 2023 · 4 comments

@ejweber
Contributor

ejweber commented May 25, 2023

For reasons outlined in #72, the support bundle collection process could not complete on one node in a cluster. It looks like we wait here indefinitely to receive all expected bundles before proceeding. Since the collection agent on one node failed before checking in, we did not proceed to finish creating the bundle, and the user had nothing to send to support.

Some suggested resolutions:

  • A timeout mechanism could automatically send on m.ch after some time, even if all bundles had not been received. This would ensure we got something, though we would have to determine what a reasonable timeout should be (see the sketch after this list).
  • Watch for DaemonSet Pod restarts. After some threshold (or maybe just one), stop expecting the corresponding collection agent to send a bundle.
  • The collection agent could survive errors like the one the user experienced and send at least something to the manager. This probably doesn't help us in a network partition, etc.
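
A rough sketch of what the first option could look like, assuming the manager currently blocks on a channel (m.ch in the code) until every node has reported; the function name, channel setup, and timeout value below are illustrative rather than the actual support-bundle-kit code:

```go
package main

import (
	"fmt"
	"time"
)

// waitForNodeBundles is a hypothetical stand-in for the manager's blocking
// wait: it returns once every expected node has reported, or once the
// timeout elapses, whichever comes first.
func waitForNodeBundles(done <-chan string, expected int, timeout time.Duration) []string {
	var received []string
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for len(received) < expected {
		select {
		case node := <-done:
			received = append(received, node)
		case <-timer.C:
			// Give up on nodes that never checked in and proceed with a
			// partial bundle instead of hanging forever.
			fmt.Printf("timed out waiting for %d of %d nodes\n", expected-len(received), expected)
			return received
		}
	}
	return received
}

func main() {
	done := make(chan string, 3)
	done <- "node-1"
	done <- "node-2"
	// node-3 never reports, e.g. its collection agent crashed.

	nodes := waitForNodeBundles(done, 3, 2*time.Second)
	fmt.Println("bundling results from:", nodes)
}
```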
@innobead

@Yu-Jack can you help with this one? thanks.

@Yu-Jack
Collaborator

Yu-Jack commented Mar 27, 2024

@innobead sure, no problem, I'll look into it.

@Yu-Jack
Collaborator

Yu-Jack commented Mar 28, 2024

Although adding a timeout lets the manager finish the process eventually, it's hard to decide on a reasonable timeout. One reason is that different nodes have different environments, so we can't predict how long each node will take.

Another reason is file size: we can't predict how large the files will be. For example, two example sizes are mentioned in #72, and the agent timeout also affects uploading.

So I think we could monitor the progress of all nodes. To achieve this, we would need to rewrite our shell script and combine it with our Golang code, which would also tell us which steps are stuck. After that, we could show the progress of each node on the GUI, so we know which nodes are stuck, failed, or succeeded, and even show a Stop button to terminate them.

Here is my idea:

  • Use smaller bundles instead of one big bundle during collection:
    In the original flow, after we collect A & B, we bundle them into a single bundle.tar.gz and send it to the manager.

    In the new flow, we could send separate bundles, A.tar.gz and B.tar.gz, to the manager. At least we would know what has already been collected; something is better than nothing. But we would need to unpack those tarballs in the manager pod and place the files locally.

    This also lets us know what the current step is during collection.

  • Manager pulls progress from the agent and sets a timeout for each step:
    Once we have the progress, we also have the full list of steps and know which steps haven't run yet (see the sketch after this list).

  • Tail the pod log when a timeout occurs:
    If a timeout occurs, tail the pod logs and save them into the bundle file for further investigation.
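
A rough sketch of the pull-progress idea, assuming the agent exposes some HTTP progress endpoint; the /progress path, the Progress payload shape, and the "done" step name are hypothetical, not an existing support-bundle-kit API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Progress is a hypothetical payload an agent could report: the step it is
// currently on and the steps it has already finished.
type Progress struct {
	CurrentStep    string   `json:"currentStep"`
	CompletedSteps []string `json:"completedSteps"`
}

// pollProgress polls one agent until it reports "done" or the current step
// makes no progress for stepTimeout.
func pollProgress(agentURL string, stepTimeout time.Duration) error {
	lastStep := ""
	deadline := time.Now().Add(stepTimeout)

	for {
		resp, err := http.Get(agentURL + "/progress")
		if err != nil {
			return err
		}
		var p Progress
		err = json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if err != nil {
			return err
		}

		switch {
		case p.CurrentStep == "done":
			return nil
		case p.CurrentStep != lastStep:
			// The agent advanced to a new step, so reset the per-step deadline.
			lastStep = p.CurrentStep
			deadline = time.Now().Add(stepTimeout)
		case time.Now().After(deadline):
			return fmt.Errorf("step %q stuck for more than %s", lastStep, stepTimeout)
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	if err := pollProgress("http://agent.example:8080", 2*time.Minute); err != nil {
		fmt.Println("agent failed or timed out:", err)
	}
}
```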

If we choose to do the simple thing, I think we could:

  • Add a large timeout, just to make sure the agent will finish eventually.
  • Have the manager pod tail the agent pod logs when the timeout occurs and save them into the bundle file for further investigation (see the log-tailing sketch at the end of this comment).

The disadvantage of this approach is that the timeout might not be very useful, as I mentioned before. @bk201 WDYT?
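
For the log-tailing part, a minimal sketch using client-go's pod log API; the namespace and pod name are placeholders, and the manager would substitute the real agent pod:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// dumpAgentLog copies the last lines of an agent pod's log to w so they can
// be added to the bundle when that agent times out.
func dumpAgentLog(ctx context.Context, client kubernetes.Interface, namespace, pod string, w io.Writer) error {
	tail := int64(200)
	req := client.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{TailLines: &tail})
	stream, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()

	_, err = io.Copy(w, stream)
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder namespace and pod name; the manager would use the real agent pod.
	err = dumpAgentLog(context.Background(), client, "support-bundle", "support-bundle-agent-xxxxx", os.Stdout)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```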

@bk201
Member

bk201 commented Mar 28, 2024

My two cents is to make it simple:

  • Set a reasonable timeout (we can measure it in our development VMs and add some buffer), and have the sb manager skip any agent that can't report back in time. A node might be dead or in trouble; the worst case is to ask the user/support to log in to the node to retrieve information. There is no need to do the fancy tailing thing, but we can include the agent pod log (in fact it's already included).
  • (Optional) Make the timeout values configurable (a minimal sketch follows this list).
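
A minimal sketch of a configurable timeout, assuming an environment-variable override; the variable name and default value are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultNodeTimeout is a placeholder; the real default would come from
// measurements in development clusters plus some buffer.
const defaultNodeTimeout = 30 * time.Minute

// nodeTimeout reads an optional override from the environment so the value
// can be tuned per deployment.
func nodeTimeout() time.Duration {
	if v := os.Getenv("SUPPORT_BUNDLE_NODE_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return defaultNodeTimeout
}

func main() {
	fmt.Println("waiting up to", nodeTimeout(), "for each node's agent")
}
```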

Can you check whether this issue duplicates harvester/harvester#1646?
