What is the bug?
Correction: Recent hub instability is showing that our Shuttle-powered app (which is very similar in structure to the Shuttle example-app) doesn't handle hub connection errors well, for either streaming or reconciliation. I originally thought this behavior was specific to backfill/reconciliation and tied to a change in Shuttle v0.6.1, but it's happening with v0.6.0 as well.
I'd expect errors to be thrown by Shuttle when hub API calls fail so our app can promptly error out and be restarted automatically by docker-compose, but that doesn't appear to be happening reliably. We have to manually restart the app after the hub comes back online for it to recover. What does Warpcast's error-handling code around Shuttle's hub API calls/streams look like? Is something missing from the example-app?
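For context, the restart behavior we're relying on is just docker-compose's restart policy; a minimal sketch (the service name and image are placeholders, not our actual config):

```yaml
services:
  shuttle-app:                     # placeholder service name
    image: our-shuttle-app:latest  # placeholder image
    # on-failure only restarts the container when the process exits
    # non-zero, which is why we need Shuttle errors to crash the app
    # rather than be swallowed internally.
    restart: on-failure
```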
I noticed after upgrading our Shuttle-powered app from Shuttle v0.5.10 to the latest v0.6.4 that our reconciliation/backfill jobs would hang indefinitely after a while. We use BullMQ to manage reconcile/backfill jobs, very similarly to the Shuttle example-app. The hanging jobs are stuck in the active state, but aren't doing anything. Nothing meaningful in our logs.
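As a stopgap while debugging, one way to keep jobs from sitting in the active state forever is to race the reconcile work against a timeout so a hub outage fails the job instead of hanging it. This helper is our own addition, not part of Shuttle or BullMQ:

```typescript
// Stopgap sketch (our own addition, not part of Shuttle or BullMQ):
// race the reconcile work against a timeout so a hub outage rejects,
// moving the BullMQ job to "failed" (where it can be retried) instead
// of leaving it stuck in "active" forever.
export async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`job timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after success
  }
}
```

The per-FID reconcile call inside the worker's processor would then be wrapped, e.g. `await withTimeout(reconcileWork, 5 * 60_000)`, with the timeout value tuned to your backfill batch size.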
I saw there's a new streamFetch hub API being used for reconciliation as of Shuttle v0.6.1. Thinking that might be the problem, I forced it off by extending Shuttle's MessageReconciliation class within our app like so:
```typescript
// Import paths are assumed here; adjust to match your setup.
import { DB, MessageReconciliation } from "@farcaster/shuttle";
import { HubRpcClient } from "@farcaster/hub-nodejs";
import pino from "pino";

export class SelectiveMessageReconciliation extends MessageReconciliation {
  constructor(client: HubRpcClient, db: DB, log: pino.Logger) {
    super(client, db, log);
    // Immediately close the stream connection because it appears to be buggy.
    // Closing the stream causes Shuttle to fall back on the old method of
    // fetching messages, which is less efficient but may be more reliable.
    this.close();
  }
}
```
The problem persisted after this, indicating the new streamFetch API isn't at fault per se. Through trial and error, though, I determined the hanging behavior was still introduced in v0.6.1. I think something is wrong with how promises/errors are handled in this commit, such that backfill/reconciliation no longer automatically recovers from hub API connection failures like it used to.
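To illustrate the class of promise-handling bug I suspect (a contrived sketch, not Shuttle's actual code): if the promise wrapping a hub stream only settles on "end" and never on "error", every upstream await hangs silently when the connection drops, which matches the symptoms:

```typescript
import { EventEmitter } from "node:events";

// Contrived sketch, not Shuttle's code: a promise wrapping a stream.
function consumeStream(stream: EventEmitter): Promise<string> {
  return new Promise((resolve, reject) => {
    stream.on("end", () => resolve("done"));
    // If this handler were missing, an errored stream would never settle
    // the promise, and callers awaiting it would hang indefinitely --
    // the behavior we observe in reconciliation after a hub failure.
    stream.on("error", (err: Error) => reject(err));
  });
}
```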
How can it be reproduced? (optional)
I don't have rock-solid repro steps, unfortunately. I'm basing this on observations of our Shuttle-powered app's metrics as our hubs intermittently experienced unrelated failures while I swapped between different Shuttle versions. I'm guessing you could repro like this, though:
1. Start the Shuttle example-app in both streaming and backfill modes, pointing at some test hub.
2. Take down the test hub, purposefully causing the example-app's connections to it to start failing.
3. Bring the test hub back, and notice the example-app's streaming process automatically recovers but the backfill jobs don't; they hang indefinitely instead.
tybook changed the title from "Shuttle reconciliation hangs indefinitely in versions >= 0.6.1" to "Shuttle hangs indefinitely after hub connection failure" on Oct 7, 2024.