|
| 1 | +# Research: First-Class Error Tracking for Collections (Issue #672) |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +GitHub Issue [#672](https://github.com/TanStack/db/issues/672) proposes standardizing error handling across TanStack DB collection types by adding first-class error tracking to `CollectionLifecycleManager`. This research document analyzes the proposal and samwillis' feedback from PR #671. |
| 6 | + |
| 7 | +## Current State of Error Handling |
| 8 | + |
| 9 | +### Query Collection (`query-db-collection`) |
| 10 | +**Location:** `packages/query-db-collection/src/query.ts:417-422` |
| 11 | + |
| 12 | +Uses **closure variables** for error tracking: |
| 13 | +```typescript |
| 14 | +let lastError: any |
| 15 | +let errorCount = 0 |
| 16 | +let lastErrorUpdatedAt = 0 |
| 17 | +``` |
| 18 | + |
| 19 | +Exposes error utilities via `utils` object: |
| 20 | +- `lastError()` - Returns the most recent error |
| 21 | +- `isError()` - Boolean indicating error state |
| 22 | +- `errorCount()` - Number of consecutive sync failures |
| 23 | +- `clearError()` - Clears error state and triggers refetch |
| 24 | + |
| 25 | +**Behavior on error:** |
| 26 | +- Marks collection as `ready` even on error (line 548) to avoid blocking apps |
| 27 | +- Increments `errorCount` only when query fails completely (not per retry) |
| 28 | +- Resets `errorCount` and `lastError` on success (lines 461-462) |
| 29 | + |
| 30 | +### Electric Collection (`electric-db-collection`) |
| 31 | +**No error tracking implementation** - identified as a gap. |
| 32 | + |
| 33 | +### Base Collection (CollectionLifecycleManager) |
| 34 | +**Location:** `packages/db/src/collection/lifecycle.ts` |
| 35 | + |
| 36 | +Currently only tracks **status**, including `error` as a status value, but: |
| 37 | +- No `error` property to store error details |
| 38 | +- No `errorCount` tracking |
| 39 | +- No error event emissions |
| 40 | +- No standardized error recovery mechanism |
| 41 | + |
| 42 | +## Proposed Solution (Issue #672) |
| 43 | + |
| 44 | +Add to `CollectionLifecycleManager`: |
| 45 | +```typescript |
| 46 | +error: Error | null = null |
| 47 | +errorCount: number = 0 |
| 48 | +markError(error?: Error): void |
| 49 | +``` |
| 50 | + |
| 51 | +**Benefits:** |
| 52 | +- Consistency across all collection types |
| 53 | +- First-class error handling in framework integrations |
| 54 | +- Better debugging with error details |
| 55 | +- Support for retry logic with exponential backoff |
| 56 | + |
| 57 | +## Critical Feedback from samwillis (PR #671) |
| 58 | + |
| 59 | +### Main Concern: Error State Transitions |
| 60 | + |
| 61 | +samwillis identified a fundamental design question about **how collections should exit error states**. |
| 62 | + |
| 63 | +#### Current Valid Transitions (lifecycle.ts:76-82) |
| 64 | +```typescript |
| 65 | +const validTransitions: Record<CollectionStatus, Array<CollectionStatus>> = { |
| 66 | + idle: [`loading`, `error`, `cleaned-up`], |
| 67 | + loading: [`ready`, `error`, `cleaned-up`], |
| 68 | + ready: [`cleaned-up`, `error`], |
| 69 | + error: [`cleaned-up`, `idle`], // ← No `loading` or `ready` |
| 70 | + "cleaned-up": [`loading`, `error`], |
| 71 | +} |
| 72 | +``` |
| 73 | + |
| 74 | +**Key observation:** `error → ready` and `error → loading` are NOT currently valid transitions. |
| 75 | + |
| 76 | +#### Problem with Original PR #671 Approach |
| 77 | + |
| 78 | +The PR suggested automatic `loading` transition in `markReady()` when recovering from error: |
| 79 | +```typescript |
| 80 | +error → markReady() → loading → ready |
| 81 | +``` |
| 82 | + |
| 83 | +**samwillis' objection:** |
| 84 | +> "live queries have no mechanism to handle `ready → loading` transitions" |
| 85 | +
|
| 86 | +This creates an architectural inconsistency where: |
| 87 | +1. Collections that are already `ready` cannot transition to `loading` |
| 88 | +2. But error recovery would require `error → loading → ready` |
| 89 | +3. Live queries subscribing to `ready` collections would see unexpected `loading` states |
| 90 | + |
| 91 | +### Two Proposed Architectural Models |
| 92 | + |
| 93 | +#### 1. Graceful Recovery Model |
| 94 | +**Concept:** Sync implementations handle errors internally without disrupting the synced state. |
| 95 | + |
| 96 | +**Characteristics:** |
| 97 | +- Similar to `truncate()` operations (which work on `ready` collections) |
| 98 | +- Allows direct `error → ready` transition |
| 99 | +- Error doesn't destroy the sync connection |
| 100 | +- Collection maintains its data and recovers gracefully |
| 101 | + |
| 102 | +**Example use case:** |
| 103 | +- Network timeout during sync |
| 104 | +- Temporary API rate limiting |
| 105 | +- Recoverable authentication issues |
| 106 | + |
| 107 | +**Implementation:** |
| 108 | +```typescript |
| 109 | +// In sync implementation |
| 110 | +try { |
| 111 | + await syncData() |
| 112 | + markReady() // Direct error → ready transition |
| 113 | +} catch (error) { |
| 114 | + markError(error) // ready → error, then can recover directly |
| 115 | +} |
| 116 | +``` |
| 117 | + |
| 118 | +#### 2. Catastrophic Restart Model |
| 119 | +**Concept:** Unrecoverable errors require full restart of the sync mechanism. |
| 120 | + |
| 121 | +**Characteristics:** |
| 122 | +- Calls `.cleanup()` for garbage collection |
| 123 | +- Moves to `cleaned-up` state first |
| 124 | +- Then calls `.startSync()` to go through normal `loading → ready` cycle |
| 125 | +- Complete teardown and rebuild of sync state |
| 126 | + |
| 127 | +**Example use case:** |
| 128 | +- Authentication revoked |
| 129 | +- Schema mismatch |
| 130 | +- Connection permanently lost |
| 131 | + |
| 132 | +**Implementation:** |
| 133 | +```typescript |
| 134 | +// Manual or automatic restart |
| 135 | +await collection.cleanup() // error → cleaned-up |
| 136 | +collection.startSync() // cleaned-up → loading → ready |
| 137 | +``` |
| 138 | + |
| 139 | +**Proposed API:** |
| 140 | +```typescript |
| 141 | +collection.restartSync() // Convenience method for cleanup + startSync |
| 142 | +``` |
| 143 | + |
| 144 | +### Design Question Raised |
| 145 | + |
| 146 | +> "How much state management responsibility should fall on sync implementations versus the framework itself?" |
| 147 | +
|
| 148 | +**Implications:** |
| 149 | +- Should `markError()` automatically handle recovery logic? |
| 150 | +- Should error state transitions be more flexible? |
| 151 | +- Should the framework distinguish between recoverable and catastrophic errors? |
| 152 | + |
| 153 | +## Current Error Handling Patterns |
| 154 | + |
| 155 | +### Operation Validation (lifecycle.ts:128-137) |
| 156 | +```typescript |
| 157 | +public validateCollectionUsable(operation: string): void { |
| 158 | + switch (this.status) { |
| 159 | + case `error`: |
| 160 | + throw new CollectionInErrorStateError(operation, this.id) |
| 161 | + case `cleaned-up`: |
| 162 | + // Automatically restart the collection |
| 163 | + this.sync.startSync() |
| 164 | + break |
| 165 | + } |
| 166 | +} |
| 167 | +``` |
| 168 | + |
| 169 | +**Behavior:** |
| 170 | +- Collections in `error` state **block all operations** |
| 171 | +- Collections in `cleaned-up` state **auto-restart** on operations |
| 172 | +- No automatic recovery from `error` state |
| 173 | + |
| 174 | +### Error Handling in Tests (collection-errors.test.ts) |
| 175 | + |
| 176 | +The test suite shows expected behaviors: |
| 177 | +1. **Cleanup errors are isolated** - thrown in microtasks to prevent blocking (lines 29-74) |
| 178 | +2. **Error state blocks operations** - Must be manually recovered (lines 250-283) |
| 179 | +3. **Cleaned-up state auto-restarts** - Operations trigger `startSync()` (lines 285-354) |
| 180 | + |
| 181 | +## Documentation Analysis (docs/guides/error-handling.md) |
| 182 | + |
| 183 | +Current documentation shows: |
| 184 | + |
| 185 | +### Collection Status Values (line 118-124) |
| 186 | +``` |
| 187 | +- idle - Not yet started |
| 188 | +- loading - Loading initial data |
| 189 | +- initialCommit - Processing initial data ← NOTE: Not in lifecycle.ts! |
| 190 | +- ready - Ready for use |
| 191 | +- error - In error state |
| 192 | +- cleaned-up - Cleaned up and no longer usable |
| 193 | +``` |
| 194 | + |
| 195 | +**Discrepancy:** `initialCommit` status is documented but not in the current type definition. |
| 196 | + |
| 197 | +### Recommended Recovery Pattern (lines 390-397) |
| 198 | +```typescript |
| 199 | +if (todoCollection.status === "error") { |
| 200 | + await todoCollection.cleanup() |
| 201 | + todoCollection.preload() // Or any other operation |
| 202 | +} |
| 203 | +``` |
| 204 | + |
| 205 | +Uses the **Catastrophic Restart Model** by default. |
| 206 | + |
| 207 | +## Analysis & Recommendations |
| 208 | + |
| 209 | +### Key Insights |
| 210 | + |
| 211 | +1. **Current error handling is inconsistent:** |
| 212 | + - Query collections have robust error tracking via closures |
| 213 | + - Electric collections have none |
| 214 | + - Base collection only tracks status |
| 215 | + |
| 216 | +2. **State transition model needs clarification:** |
| 217 | + - No consensus on `error → ready` vs `error → cleanup → loading → ready` |
| 218 | + - Live query compatibility concerns with state transitions |
| 219 | + - Auto-restart works for `cleaned-up` but not `error` |
| 220 | + |
| 221 | +3. **Two distinct error categories exist:** |
| 222 | + - **Recoverable errors** - temporary network issues, rate limiting |
| 223 | + - **Catastrophic errors** - auth revoked, schema mismatch |
| 224 | + |
| 225 | +### Questions to Resolve |
| 226 | + |
| 227 | +1. **Should `error → ready` be a valid transition?** |
| 228 | + - Pro: Enables graceful recovery without full restart |
| 229 | + - Con: May confuse consumers expecting `loading` state |
| 230 | + - samwillis concern: Live queries don't handle `ready → loading` |
| 231 | + |
| 232 | +2. **Should the framework distinguish error types?** |
| 233 | + ```typescript |
| 234 | + markError(error: Error, options?: { recoverable: boolean }) |
| 235 | + ``` |
| 236 | + - Recoverable: Allow `error → ready` |
| 237 | + - Catastrophic: Require `error → cleaned-up → loading → ready` |
| 238 | + |
| 239 | +3. **Should error state auto-restart like cleaned-up?** |
| 240 | + - Current: `error` blocks, `cleaned-up` auto-restarts |
| 241 | + - Proposed: Both could auto-restart, or both could block |
| 242 | + |
| 243 | +4. **How should `markReady()` behave when called from error state?** |
| 244 | + - Option A: Throw error (maintain strict transitions) |
| 245 | + - Option B: Allow `error → ready` (graceful recovery) |
| 246 | + - Option C: Auto-cleanup then transition (catastrophic restart) |
| 247 | + |
| 248 | +### Proposed Solution Path |
| 249 | + |
| 250 | +1. **Add first-class error tracking to CollectionLifecycleManager** ✓ |
| 251 | + - Implement `error`, `errorCount`, `markError()` as proposed |
| 252 | + |
| 253 | +2. **Support both recovery models:** |
| 254 | + ```typescript |
| 255 | + // Graceful Recovery |
| 256 | + markReady() // error → ready (if valid transition) |
| 257 | + |
| 258 | + // Catastrophic Restart |
| 259 | + restartSync() // error → cleaned-up → loading → ready |
| 260 | + ``` |
| 261 | + |
| 262 | +3. **Update valid transitions:** |
| 263 | + ```typescript |
| 264 | + error: [`cleaned-up`, `idle`, `ready`], // Add `ready` for graceful recovery |
| 265 | + ``` |
| 266 | + |
| 267 | +4. **Add transition guards:** |
| 268 | + ```typescript |
| 269 | + // Only allow error → ready if sync implementation explicitly calls markReady() |
| 270 | + // This gives sync implementations control over recovery strategy |
| 271 | + ``` |
| 272 | + |
| 273 | +5. **Document both patterns:** |
| 274 | + - Graceful recovery for temporary errors |
| 275 | + - Catastrophic restart for permanent failures |
| 276 | + |
| 277 | +### Implementation Considerations |
| 278 | + |
| 279 | +1. **Backwards compatibility:** |
| 280 | + - Query collections already expose `lastError()` and `errorCount()` |
| 281 | + - New first-class properties should be compatible |
| 282 | + - Consider deprecation path for closure-based approach |
| 283 | + |
| 284 | +2. **Event emissions:** |
| 285 | + - Should `markError()` emit error events? |
| 286 | + - Should `errorCount` changes emit events? |
| 287 | + - How do live queries react to error events? |
| 288 | + |
| 289 | +3. **Error history:** |
| 290 | + - Should past errors be tracked? |
| 291 | + - How long should error history persist? |
| 292 | + - Memory implications of error tracking |
| 293 | + |
| 294 | +4. **Testing strategy:** |
| 295 | + - Add tests for both recovery models |
| 296 | + - Test error event emissions |
| 297 | + - Test error count behavior across scenarios |
| 298 | + |
| 299 | +## References |
| 300 | + |
| 301 | +- **Issue:** https://github.com/TanStack/db/issues/672 |
| 302 | +- **PR:** https://github.com/TanStack/db/pull/671 |
| 303 | +- **Key Files:** |
| 304 | + - `packages/db/src/collection/lifecycle.ts` |
| 305 | + - `packages/query-db-collection/src/query.ts` |
| 306 | + - `packages/db/tests/collection-errors.test.ts` |
| 307 | + - `docs/guides/error-handling.md` |
| 308 | + |
| 309 | +## Next Steps |
| 310 | + |
| 311 | +1. **Clarify architectural vision** with maintainers: |
| 312 | + - Should both recovery models be supported? |
| 313 | + - What are the valid state transitions for error recovery? |
| 314 | + - How should live queries handle different transitions? |
| 315 | + |
| 316 | +2. **Design decision required:** |
| 317 | + - Single recovery model vs. both models |
| 318 | + - Error categorization (recoverable vs. catastrophic) |
| 319 | + - Auto-restart behavior for error state |
| 320 | + |
| 321 | +3. **Implementation approach:** |
| 322 | + - Prototype both recovery models |
| 323 | + - Test with live query scenarios |
| 324 | + - Gather feedback from community |
| 325 | + |
| 326 | +4. **Documentation updates:** |
| 327 | + - Clear guidance on when to use each recovery model |
| 328 | + - Examples of both patterns |
| 329 | + - Migration guide for existing code |
0 commit comments