Shuttle hangs indefinitely after hub connection failure #2346

Closed
tybook opened this issue Oct 1, 2024 · 0 comments · Fixed by #2358
Labels
s-triage Needs to be reviewed, designed and prioritized

Comments

@tybook
Contributor

tybook commented Oct 1, 2024

What is the bug?
Correction: recent hub instability is showing that our Shuttle-powered app (which is very similar in structure to the Shuttle example-app) doesn't handle hub connection errors well for either streaming or reconciliation. I originally thought this behavior was specific to backfill/reconciliation and tied to a change in Shuttle v0.6.1, but it's happening with v0.6.0 as well.
I'd expect errors to be thrown by Shuttle when hub API calls fail so our app can promptly error out and be restarted automatically by docker-compose, but that doesn't appear to be happening reliably. We have to manually restart the app after the hub comes back online for it to recover. What does Warpcast's error-handling code around Shuttle's hub API calls/streams look like? Is something missing from the example-app?
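For concreteness, the fail-fast behavior I'd like to be able to rely on is roughly the following (a minimal sketch, not actual app or Shuttle code; runStreamingAndBackfill is a hypothetical stand-in for whatever the app's entry point does):

declare function runStreamingAndBackfill(): Promise<void>; // hypothetical stand-in for the app's Shuttle setup

// Fail-fast sketch: if anything in the Shuttle/hub pipeline rejects, log and exit
// non-zero so docker-compose restarts the container once the hub is healthy again.
async function main(): Promise<void> {
  try {
    await runStreamingAndBackfill();
  } catch (err) {
    console.error("Unrecoverable hub error, exiting so docker-compose can restart us", err);
    process.exit(1);
  }
}

// Also treat unhandled rejections as fatal, since a swallowed rejection is exactly
// how a job could end up hanging instead of erroring out.
process.on("unhandledRejection", (err) => {
  console.error("Unhandled rejection", err);
  process.exit(1);
});

void main();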

I noticed after upgrading our Shuttle-powered app from Shuttle v0.5.10 to the latest v0.6.4 that our reconciliation/backfill jobs would hang indefinitely after a while. We use BullMQ to manage reconciliation/backfill jobs, very similar to the Shuttle example-app. The hanging jobs are stuck in the active state but aren't doing anything, and there's nothing meaningful in our logs.
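A possible band-aid on our side would be to race the job body against a timeout so a hub call that never settles fails the BullMQ job instead of pinning it in active forever. Rough sketch only; reconcileFid and the queue name are placeholders, not example-app code:

import { Worker } from "bullmq";

declare function reconcileFid(fid: number): Promise<void>; // placeholder for our real job body

const RECONCILE_TIMEOUT_MS = 10 * 60 * 1000;

// Race the reconciliation work against a timeout so a wedged hub call fails the
// job (and gets retried or flagged) rather than leaving it stuck in "active".
const worker = new Worker(
  "reconcile", // placeholder queue name
  async (job) => {
    let timer: NodeJS.Timeout | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`reconcile timed out for fid ${job.data.fid}`)),
        RECONCILE_TIMEOUT_MS,
      );
    });
    try {
      await Promise.race([reconcileFid(job.data.fid), timeout]);
    } finally {
      clearTimeout(timer);
    }
  },
  { connection: { host: "localhost", port: 6379 } },
);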

I saw there's a new streamFetch hub API being used for reconciliation as of Shuttle v0.6.1. Thinking that might be the problem, I forced it off by extending Shuttle's MessageReconciliation class within our app like so:

// Import paths assumed to match the Shuttle example-app.
import { HubRpcClient } from "@farcaster/hub-nodejs";
import { DB, MessageReconciliation } from "@farcaster/shuttle";
import pino from "pino";

export class SelectiveMessageReconciliation extends MessageReconciliation {
  constructor(client: HubRpcClient, db: DB, log: pino.Logger) {
    super(client, db, log);
    // Immediately close the stream connection because it appears to be buggy. Closing the stream causes Shuttle to
    // fall back on the old method of fetching messages, which is less efficient but may be more reliable.
    this.close();
  }
}
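
We then construct this subclass wherever the app would otherwise create a MessageReconciliation (sketch only; variable names are from our app, and exact method signatures should be checked against the Shuttle source):

// Drop-in replacement for MessageReconciliation in the backfill worker.
const reconciler = new SelectiveMessageReconciliation(hubClient, db, log);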

The problem persisted after this, which indicates the new streamFetch API isn't at fault per se. However, through trial and error I determined the hanging behavior was still introduced in v0.6.1. I think there is something wrong with how promises/errors are handled in this commit, such that backfill/reconciliation no longer automatically recovers from hub API connection failures the way it used to.
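
To illustrate the kind of bug I suspect (purely illustrative, not Shuttle's actual code): a stream whose error handler neither resolves nor rejects the surrounding promise will hang every caller that awaits it.

import { EventEmitter } from "node:events";

// Illustrative anti-pattern only: the error handler logs instead of rejecting,
// so the promise the caller awaits never settles.
function consumeStream(stream: EventEmitter): Promise<void> {
  return new Promise((resolve) => {
    stream.on("end", () => resolve());
    stream.on("error", (err) => {
      console.error("stream error", err); // bug: neither resolve() nor reject(err) is called
    });
  });
}

// A backfill job that does `await consumeStream(stream)` will sit in "active"
// forever if the hub connection drops and the stream emits "error" without "end".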

How can it be reproduced? (optional)
Unfortunately I don't have rock-solid repro steps. I'm basing this on observations of our Shuttle-powered app's metrics as our hubs intermittently experienced unrelated failures while I swapped between different Shuttle versions. I'm guessing you could repro with something like this, though:

  1. Start the Shuttle example-app in both streaming and backfill modes, pointing at some test hub.
  2. Take down the test hub, purposefully causing the example-app's connections to it to start failing.
  3. Bring back the test hub, and notice the example-app's streaming process automatically recovers but the backfill jobs don't. They hang indefinitely instead.
@github-actions github-actions bot added the s-triage Needs to be reviewed, designed and prioritized label Oct 1, 2024
@tybook tybook changed the title Shuttle reconciliation hangs indefinitely in versions >= 0.6.1 Shuttle hangs indefinitely after hub connection failure Oct 7, 2024