Skip to content

Commit b433c86

Browse files
committed
Research first-class error tracking proposal and state transition concerns
Comprehensive analysis of GitHub issue #672 and samwillis' feedback on PR #671: - Document current error handling state across collection types - Analyze proposed first-class error tracking in CollectionLifecycleManager - Examine samwillis' concerns about error state transitions - Compare Graceful Recovery vs Catastrophic Restart models - Identify design questions around state transition validity - Propose solution path supporting both recovery patterns - Highlight backwards compatibility and testing considerations Key findings: - Query collections use closure-based error tracking - Electric collections lack error tracking entirely - Current transitions don't allow error → ready or error → loading - Live queries can't handle ready → loading transitions - Two distinct error recovery patterns needed for different scenarios 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 8051cbe commit b433c86

File tree

1 file changed

+329
-0
lines changed

1 file changed

+329
-0
lines changed

RESEARCH_ERROR_HANDLING.md

Lines changed: 329 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,329 @@
1+
# Research: First-Class Error Tracking for Collections (Issue #672)
2+
3+
## Summary
4+
5+
GitHub Issue [#672](https://github.com/TanStack/db/issues/672) proposes standardizing error handling across TanStack DB collection types by adding first-class error tracking to `CollectionLifecycleManager`. This research document analyzes the proposal and samwillis' feedback from PR #671.
6+
7+
## Current State of Error Handling
8+
9+
### Query Collection (`query-db-collection`)
10+
**Location:** `packages/query-db-collection/src/query.ts:417-422`
11+
12+
Uses **closure variables** for error tracking:
13+
```typescript
14+
let lastError: any
15+
let errorCount = 0
16+
let lastErrorUpdatedAt = 0
17+
```
18+
19+
Exposes error utilities via `utils` object:
20+
- `lastError()` - Returns the most recent error
21+
- `isError()` - Boolean indicating error state
22+
- `errorCount()` - Number of consecutive sync failures
23+
- `clearError()` - Clears error state and triggers refetch
24+
25+
**Behavior on error:**
26+
- Marks collection as `ready` even on error (line 548) to avoid blocking apps
27+
- Increments `errorCount` only when query fails completely (not per retry)
28+
- Resets `errorCount` and `lastError` on success (lines 461-462)
29+
30+
### Electric Collection (`electric-db-collection`)
31+
**No error tracking implementation** - identified as a gap.
32+
33+
### Base Collection (CollectionLifecycleManager)
34+
**Location:** `packages/db/src/collection/lifecycle.ts`
35+
36+
Currently only tracks **status**, including `error` as a status value, but:
37+
- No `error` property to store error details
38+
- No `errorCount` tracking
39+
- No error event emissions
40+
- No standardized error recovery mechanism
41+
42+
## Proposed Solution (Issue #672)
43+
44+
Add to `CollectionLifecycleManager`:
45+
```typescript
46+
error: Error | null = null
47+
errorCount: number = 0
48+
markError(error?: Error): void
49+
```
50+
51+
**Benefits:**
52+
- Consistency across all collection types
53+
- First-class error handling in framework integrations
54+
- Better debugging with error details
55+
- Support for retry logic with exponential backoff
56+
57+
## Critical Feedback from samwillis (PR #671)
58+
59+
### Main Concern: Error State Transitions
60+
61+
samwillis identified a fundamental design question about **how collections should exit error states**.
62+
63+
#### Current Valid Transitions (lifecycle.ts:76-82)
64+
```typescript
65+
const validTransitions: Record<CollectionStatus, Array<CollectionStatus>> = {
66+
idle: [`loading`, `error`, `cleaned-up`],
67+
loading: [`ready`, `error`, `cleaned-up`],
68+
ready: [`cleaned-up`, `error`],
69+
error: [`cleaned-up`, `idle`], // ← No `loading` or `ready`
70+
"cleaned-up": [`loading`, `error`],
71+
}
72+
```
73+
74+
**Key observation:** `error → ready` and `error → loading` are NOT currently valid transitions.
75+
76+
#### Problem with Original PR #671 Approach
77+
78+
The PR suggested automatic `loading` transition in `markReady()` when recovering from error:
79+
```typescript
80+
errormarkReady() → loadingready
81+
```
82+
83+
**samwillis' objection:**
84+
> "live queries have no mechanism to handle `ready → loading` transitions"
85+
86+
This creates an architectural inconsistency where:
87+
1. Collections that are already `ready` cannot transition to `loading`
88+
2. But error recovery would require `error → loading → ready`
89+
3. Live queries subscribing to `ready` collections would see unexpected `loading` states
90+
91+
### Two Proposed Architectural Models
92+
93+
#### 1. Graceful Recovery Model
94+
**Concept:** Sync implementations handle errors internally without disrupting the synced state.
95+
96+
**Characteristics:**
97+
- Similar to `truncate()` operations (which work on `ready` collections)
98+
- Allows direct `error → ready` transition
99+
- Error doesn't destroy the sync connection
100+
- Collection maintains its data and recovers gracefully
101+
102+
**Example use case:**
103+
- Network timeout during sync
104+
- Temporary API rate limiting
105+
- Recoverable authentication issues
106+
107+
**Implementation:**
108+
```typescript
109+
// In sync implementation
110+
try {
111+
await syncData()
112+
markReady() // Direct error → ready transition
113+
} catch (error) {
114+
markError(error) // ready → error, then can recover directly
115+
}
116+
```
117+
118+
#### 2. Catastrophic Restart Model
119+
**Concept:** Unrecoverable errors require full restart of the sync mechanism.
120+
121+
**Characteristics:**
122+
- Calls `.cleanup()` for garbage collection
123+
- Moves to `cleaned-up` state first
124+
- Then calls `.startSync()` to go through normal `loading → ready` cycle
125+
- Complete teardown and rebuild of sync state
126+
127+
**Example use case:**
128+
- Authentication revoked
129+
- Schema mismatch
130+
- Connection permanently lost
131+
132+
**Implementation:**
133+
```typescript
134+
// Manual or automatic restart
135+
await collection.cleanup() // error → cleaned-up
136+
collection.startSync() // cleaned-up → loading → ready
137+
```
138+
139+
**Proposed API:**
140+
```typescript
141+
collection.restartSync() // Convenience method for cleanup + startSync
142+
```
143+
144+
### Design Question Raised
145+
146+
> "How much state management responsibility should fall on sync implementations versus the framework itself?"
147+
148+
**Implications:**
149+
- Should `markError()` automatically handle recovery logic?
150+
- Should error state transitions be more flexible?
151+
- Should the framework distinguish between recoverable and catastrophic errors?
152+
153+
## Current Error Handling Patterns
154+
155+
### Operation Validation (lifecycle.ts:128-137)
156+
```typescript
157+
public validateCollectionUsable(operation: string): void {
158+
switch (this.status) {
159+
case `error`:
160+
throw new CollectionInErrorStateError(operation, this.id)
161+
case `cleaned-up`:
162+
// Automatically restart the collection
163+
this.sync.startSync()
164+
break
165+
}
166+
}
167+
```
168+
169+
**Behavior:**
170+
- Collections in `error` state **block all operations**
171+
- Collections in `cleaned-up` state **auto-restart** on operations
172+
- No automatic recovery from `error` state
173+
174+
### Error Handling in Tests (collection-errors.test.ts)
175+
176+
The test suite shows expected behaviors:
177+
1. **Cleanup errors are isolated** - thrown in microtasks to prevent blocking (lines 29-74)
178+
2. **Error state blocks operations** - Must be manually recovered (lines 250-283)
179+
3. **Cleaned-up state auto-restarts** - Operations trigger `startSync()` (lines 285-354)
180+
181+
## Documentation Analysis (docs/guides/error-handling.md)
182+
183+
Current documentation shows:
184+
185+
### Collection Status Values (line 118-124)
186+
```
187+
- idle - Not yet started
188+
- loading - Loading initial data
189+
- initialCommit - Processing initial data ← NOTE: Not in lifecycle.ts!
190+
- ready - Ready for use
191+
- error - In error state
192+
- cleaned-up - Cleaned up and no longer usable
193+
```
194+
195+
**Discrepancy:** `initialCommit` status is documented but not in the current type definition.
196+
197+
### Recommended Recovery Pattern (lines 390-397)
198+
```typescript
199+
if (todoCollection.status === "error") {
200+
await todoCollection.cleanup()
201+
todoCollection.preload() // Or any other operation
202+
}
203+
```
204+
205+
Uses the **Catastrophic Restart Model** by default.
206+
207+
## Analysis & Recommendations
208+
209+
### Key Insights
210+
211+
1. **Current error handling is inconsistent:**
212+
- Query collections have robust error tracking via closures
213+
- Electric collections have none
214+
- Base collection only tracks status
215+
216+
2. **State transition model needs clarification:**
217+
- No consensus on `error → ready` vs `error → cleanup → loading → ready`
218+
- Live query compatibility concerns with state transitions
219+
- Auto-restart works for `cleaned-up` but not `error`
220+
221+
3. **Two distinct error categories exist:**
222+
- **Recoverable errors** - temporary network issues, rate limiting
223+
- **Catastrophic errors** - auth revoked, schema mismatch
224+
225+
### Questions to Resolve
226+
227+
1. **Should `error → ready` be a valid transition?**
228+
- Pro: Enables graceful recovery without full restart
229+
- Con: May confuse consumers expecting `loading` state
230+
- samwillis concern: Live queries don't handle `ready → loading`
231+
232+
2. **Should the framework distinguish error types?**
233+
```typescript
234+
markError(error: Error, options?: { recoverable: boolean })
235+
```
236+
- Recoverable: Allow `error → ready`
237+
- Catastrophic: Require `error → cleaned-up → loading → ready`
238+
239+
3. **Should error state auto-restart like cleaned-up?**
240+
- Current: `error` blocks, `cleaned-up` auto-restarts
241+
- Proposed: Both could auto-restart, or both could block
242+
243+
4. **How should `markReady()` behave when called from error state?**
244+
- Option A: Throw error (maintain strict transitions)
245+
- Option B: Allow `error → ready` (graceful recovery)
246+
- Option C: Auto-cleanup then transition (catastrophic restart)
247+
248+
### Proposed Solution Path
249+
250+
1. **Add first-class error tracking to CollectionLifecycleManager**
251+
- Implement `error`, `errorCount`, `markError()` as proposed
252+
253+
2. **Support both recovery models:**
254+
```typescript
255+
// Graceful Recovery
256+
markReady() // error → ready (if valid transition)
257+
258+
// Catastrophic Restart
259+
restartSync() // error → cleaned-up → loading → ready
260+
```
261+
262+
3. **Update valid transitions:**
263+
```typescript
264+
error: [`cleaned-up`, `idle`, `ready`], // Add `ready` for graceful recovery
265+
```
266+
267+
4. **Add transition guards:**
268+
```typescript
269+
// Only allow error → ready if sync implementation explicitly calls markReady()
270+
// This gives sync implementations control over recovery strategy
271+
```
272+
273+
5. **Document both patterns:**
274+
- Graceful recovery for temporary errors
275+
- Catastrophic restart for permanent failures
276+
277+
### Implementation Considerations
278+
279+
1. **Backwards compatibility:**
280+
- Query collections already expose `lastError()` and `errorCount()`
281+
- New first-class properties should be compatible
282+
- Consider deprecation path for closure-based approach
283+
284+
2. **Event emissions:**
285+
- Should `markError()` emit error events?
286+
- Should `errorCount` changes emit events?
287+
- How do live queries react to error events?
288+
289+
3. **Error history:**
290+
- Should past errors be tracked?
291+
- How long should error history persist?
292+
- Memory implications of error tracking
293+
294+
4. **Testing strategy:**
295+
- Add tests for both recovery models
296+
- Test error event emissions
297+
- Test error count behavior across scenarios
298+
299+
## References
300+
301+
- **Issue:** https://github.com/TanStack/db/issues/672
302+
- **PR:** https://github.com/TanStack/db/pull/671
303+
- **Key Files:**
304+
- `packages/db/src/collection/lifecycle.ts`
305+
- `packages/query-db-collection/src/query.ts`
306+
- `packages/db/tests/collection-errors.test.ts`
307+
- `docs/guides/error-handling.md`
308+
309+
## Next Steps
310+
311+
1. **Clarify architectural vision** with maintainers:
312+
- Should both recovery models be supported?
313+
- What are the valid state transitions for error recovery?
314+
- How should live queries handle different transitions?
315+
316+
2. **Design decision required:**
317+
- Single recovery model vs. both models
318+
- Error categorization (recoverable vs. catastrophic)
319+
- Auto-restart behavior for error state
320+
321+
3. **Implementation approach:**
322+
- Prototype both recovery models
323+
- Test with live query scenarios
324+
- Gather feedback from community
325+
326+
4. **Documentation updates:**
327+
- Clear guidance on when to use each recovery model
328+
- Examples of both patterns
329+
- Migration guide for existing code

0 commit comments

Comments
 (0)