Shuttle hangs indefinitely after hub connection failure #2346

Closed
tybook opened this issue Oct 1, 2024 · 0 comments · Fixed by #2358
Labels
s-triage Needs to be reviewed, designed and prioritized

Comments

@tybook
Contributor

tybook commented Oct 1, 2024

What is the bug?
Correction: recent hub instability is showing that our Shuttle-powered app (which is very similar in structure to the Shuttle example-app) doesn't handle hub connection errors well for either streaming or reconciliation. I originally thought this behavior was specific to backfill/reconciliation and tied to a change in Shuttle v0.6.1, but it's happening with v0.6.0 as well.
I'd expect errors to be thrown by Shuttle when hub API calls fail so our app can promptly error out and be restarted automatically by docker-compose, but that doesn't appear to be happening reliably. We have to manually restart the app after the hub comes back online for it to recover. What does Warpcast's error-handling code around Shuttle's hub API calls/streams look like? Is something missing from the example-app?
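For concreteness, the fail-fast behavior I'd like to be able to rely on is roughly the following (a minimal sketch, not actual app or Shuttle code; runStreamingAndBackfill is a hypothetical stand-in for whatever the app's entry point does):

declare function runStreamingAndBackfill(): Promise<void>; // hypothetical stand-in for the app's Shuttle setup

// Fail-fast sketch: if anything in the Shuttle/hub pipeline rejects, log and exit
// non-zero so docker-compose restarts the container once the hub is healthy again.
async function main(): Promise<void> {
  try {
    await runStreamingAndBackfill();
  } catch (err) {
    console.error("Unrecoverable hub error, exiting so docker-compose can restart us", err);
    process.exit(1);
  }
}

// Also treat unhandled rejections as fatal, since a swallowed rejection is exactly
// how a job could end up hanging instead of erroring out.
process.on("unhandledRejection", (err) => {
  console.error("Unhandled rejection", err);
  process.exit(1);
});

void main();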

I noticed after upgrading our Shuttle-powered app from Shuttle v0.5.10 to the latest v0.6.4 that our reconciliation/backfill jobs would hang indefinitely after a while. We use BullMQ to manage reconciliation/backfill jobs, very similar to the Shuttle example-app. The hanging jobs are stuck in the active state but aren't doing anything, and there's nothing meaningful in our logs.
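A possible band-aid on our side would be to race the job body against a timeout so a hub call that never settles fails the BullMQ job instead of pinning it in active forever. Rough sketch only; reconcileFid and the queue name are placeholders, not example-app code:

import { Worker } from "bullmq";

declare function reconcileFid(fid: number): Promise<void>; // placeholder for our real job body

const RECONCILE_TIMEOUT_MS = 10 * 60 * 1000;

// Race the reconciliation work against a timeout so a wedged hub call fails the
// job (and gets retried or flagged) rather than leaving it stuck in "active".
const worker = new Worker(
  "reconcile", // placeholder queue name
  async (job) => {
    let timer: NodeJS.Timeout | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`reconcile timed out for fid ${job.data.fid}`)),
        RECONCILE_TIMEOUT_MS,
      );
    });
    try {
      await Promise.race([reconcileFid(job.data.fid), timeout]);
    } finally {
      clearTimeout(timer);
    }
  },
  { connection: { host: "localhost", port: 6379 } },
);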

I saw there's a new streamFetch hub API being used for reconciliation as of Shuttle v0.6.1. Thinking that might be the problem, I forced it off by extending Shuttle's MessageReconciliation class within our app like so:

// Import paths assumed to match the Shuttle example-app.
import { HubRpcClient } from "@farcaster/hub-nodejs";
import { DB, MessageReconciliation } from "@farcaster/shuttle";
import pino from "pino";

export class SelectiveMessageReconciliation extends MessageReconciliation {
  constructor(client: HubRpcClient, db: DB, log: pino.Logger) {
    super(client, db, log);
    // Immediately close the stream connection because it appears to be buggy. Closing the stream causes Shuttle to
    // fall back on the old method of fetching messages, which is less efficient but may be more reliable.
    this.close();
  }
}
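
We then construct this subclass wherever the app would otherwise create a MessageReconciliation (sketch only; variable names are from our app, and exact method signatures should be checked against the Shuttle source):

// Drop-in replacement for MessageReconciliation in the backfill worker.
const reconciler = new SelectiveMessageReconciliation(hubClient, db, log);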

The problem persisted after this, which indicates the new streamFetch API isn't at fault per se. However, through trial and error I determined the hanging behavior was still introduced in v0.6.1. I think there is something wrong with how promises/errors are handled in this commit, such that backfill/reconciliation no longer automatically recovers from hub API connection failures the way it used to.
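
To illustrate the kind of bug I suspect (purely illustrative, not Shuttle's actual code): a stream whose error handler neither resolves nor rejects the surrounding promise will hang every caller that awaits it.

import { EventEmitter } from "node:events";

// Illustrative anti-pattern only: the error handler logs instead of rejecting,
// so the promise the caller awaits never settles.
function consumeStream(stream: EventEmitter): Promise<void> {
  return new Promise((resolve) => {
    stream.on("end", () => resolve());
    stream.on("error", (err) => {
      console.error("stream error", err); // bug: neither resolve() nor reject(err) is called
    });
  });
}

// A backfill job that does `await consumeStream(stream)` will sit in "active"
// forever if the hub connection drops and the stream emits "error" without "end".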

How can it be reproduced? (optional)
Unfortunately I don't have rock-solid repro steps. I'm basing this on observations of our Shuttle-powered app's metrics as our hubs intermittently experienced unrelated failures while I swapped between different Shuttle versions. I'm guessing you could repro with something like this, though:

  1. Start the Shuttle example-app in both streaming and backfill modes, pointing at some test hub.
  2. Take down the test hub, purposefully causing the example-app's connections to it to start failing.
  3. Bring back the test hub, and notice the example-app's streaming process automatically recovers but the backfill jobs don't. They hang indefinitely instead.
@github-actions github-actions bot added the s-triage Needs to be reviewed, designed and prioritized label Oct 1, 2024
@tybook tybook changed the title Shuttle reconciliation hangs indefinitely in versions >= 0.6.1 Shuttle hangs indefinitely after hub connection failure Oct 7, 2024