feat: Phase 4 - EventPipeline integration with smart retry system #300

didiergarcia wants to merge 21 commits into main
Conversation
Step 1 of Phase 4 integration:
- Add optional httpConfig parameter to EventPipeline constructor
- Initialize RetryStateMachine with config (defaults to legacy mode)
- Load persisted RetryState on startup
- Convert HttpConfig to RetryConfig in init block

No behavior changes yet - infrastructure only. Baseline tests still passing (verified legacy mode works).
Add 7 tests to verify Phase 4 Step 1 infrastructure:
1. Constructor accepts httpConfig parameter
2. RetryState loaded from storage on initialization
3. Legacy mode with httpConfig=null works (no behavior changes)
4. Legacy mode with enabled=false works
5. Persisted RATE_LIMITED state is loaded correctly
6. Default state used when no persisted state exists
7. Smart retry enabled initializes correctly

All tests verify infrastructure without testing upload behavior yet. Tests use spyk(storage) to verify loadRetryState() is called. All tests passing (BUILD SUCCESSFUL)
Step 2 of Phase 4 integration:
- Add timeProvider field (SystemTimeProvider) to EventPipeline
- Add upload gate check at beginning of upload() method
- Check retryState.isRateLimited() before processing any files
- Skip all uploads and return early if rate-limited
- Clear RATE_LIMITED state if waitUntilTime has passed
- Log when uploads are skipped due to rate limiting

Key behavior:
- When RATE_LIMITED and wait time not passed → skip ALL uploads for this flush
- When wait time passes → transition back to READY state automatically
- Legacy mode (httpConfig=null or enabled=false) → no rate limit checks

All tests still passing (no behavior change yet - rate limit state never set)
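The upload gate described in this commit can be sketched roughly as follows. This is a hypothetical, self-contained model: `RetryState`, `TimeProvider`, and the state names are stand-ins for the SDK's real types, not copied from EventPipeline.

```kotlin
// Stand-in types modeling the Step 2 upload gate (not the SDK's actual API).
enum class Phase { READY, RATE_LIMITED }

data class RetryState(var phase: Phase, var waitUntilTime: Long = 0L) {
    fun isRateLimited() = phase == Phase.RATE_LIMITED
}

fun interface TimeProvider { fun now(): Long }

// Returns true when uploads may proceed, clearing an expired rate limit.
fun uploadGate(state: RetryState, time: TimeProvider): Boolean {
    if (!state.isRateLimited()) return true
    return if (time.now() >= state.waitUntilTime) {
        state.phase = Phase.READY   // wait elapsed: transition back to READY
        true
    } else {
        false                       // still rate-limited: skip ALL uploads this flush
    }
}

fun main() {
    val clock = object : TimeProvider { var t = 0L; override fun now(): Long = t }
    val state = RetryState(Phase.RATE_LIMITED, waitUntilTime = 30_000L)
    println(uploadGate(state, clock))   // false: wait time not passed
    clock.t = 30_000L
    println(uploadGate(state, clock))   // true: limit expired, back to READY
    println(state.phase)                // READY
}
```

The key design point from the commit is that the gate runs once at the top of `upload()`, so an active rate limit skips the entire flush rather than individual batches.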
Step 3 of Phase 4 integration:
- Check shouldUploadBatch() for each batch file before uploading
- Handle UploadDecision results:
  * SkipAllBatches → stop processing remaining files
  * SkipThisBatch → skip this file, continue with next
  * DropBatch → delete batch file (exceeded retry limits)
  * Proceed → continue with upload
- RetryState.batchMetadata automatically updated by state machine
- No separate .meta files needed - metadata stored in RetryState

Key behavior:
- Each batch checked independently against retry limits
- Batches with pending nextRetryTime are skipped
- Batches exceeding max retry count or duration are dropped
- RetryState persisted with updated batchMetadata

All tests passing (no behavior change yet - state machine not updating metadata)
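The per-batch decision handling above can be sketched as a loop over a sealed decision type. The decision names follow this commit message, but the surrounding types and function signatures are illustrative stand-ins, not the SDK's actual code.

```kotlin
// Stand-in model of the Step 3 per-batch decision loop (not the SDK's real API).
sealed class UploadDecision {
    object Proceed : UploadDecision()
    object SkipThisBatch : UploadDecision()
    object SkipAllBatches : UploadDecision()
    object DropBatch : UploadDecision()
}

fun processBatches(
    files: List<String>,
    decide: (String) -> UploadDecision,   // models shouldUploadBatch()
    upload: (String) -> Unit,
    delete: (String) -> Unit,
) {
    for (file in files) {
        when (decide(file)) {
            UploadDecision.Proceed -> upload(file)
            UploadDecision.SkipThisBatch -> { /* skip, continue with next file */ }
            UploadDecision.SkipAllBatches -> return  // stop this flush entirely
            UploadDecision.DropBatch -> delete(file) // exceeded retry limits
        }
    }
}

fun main() {
    val uploaded = mutableListOf<String>()
    val deleted = mutableListOf<String>()
    val decisions = mapOf(
        "a" to UploadDecision.Proceed,
        "b" to UploadDecision.SkipThisBatch,
        "c" to UploadDecision.DropBatch,
        "d" to UploadDecision.SkipAllBatches,
        "e" to UploadDecision.Proceed,   // never reached: "d" stopped the flush
    )
    processBatches(listOf("a", "b", "c", "d", "e"),
        { decisions.getValue(it) }, uploaded::add, deleted::add)
    println(uploaded)  // [a]
    println(deleted)   // [c]
}
```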
Integrates RetryStateMachine into EventPipeline for TAPI smart retry handling.

Changes:
- EventPipeline: Add httpConfig parameter and retry state machine integration
- Upload gate: Skip uploads when rate-limited
- Per-batch logic: Check shouldUploadBatch before attempting upload
- Response handling: Track status codes, update retry state, persist state
- X-Retry-Count header: Add on retry attempts (not first upload)
- Configuration: Add httpConfig field for user-facing API
- SegmentDestination: Pass httpConfig from Configuration to EventPipeline
- RetryConfig: Set enabled=false defaults for backward compatibility
- Remove unused maxRateLimitDuration field

Tests:
- EventPipelinePhase4Test: Infrastructure verification (7 tests)
- EventPipelineSmartRetryTest: Behavior verification (8 tests)
- Updated RetryStateMachineTest: Enable configs in setup for existing tests
- Updated HttpConfigTest: Fix assertions for new defaults
- RetryIntegrationTest: Baseline tests unchanged and passing

Key features:
- Legacy mode by default (httpConfig=null or enabled=false)
- Zero behavior changes when smart retry disabled
- State persists across app restarts
- 76 tests passing (37 EventPipeline + 39 retry tests)

Phase 5 cleanup:
- Removed all Phase 4 comments from production code
- Added KDoc documentation for retry configs
- Cleaned up test file headers and TODOs
…ructureTest

Improves test file naming clarity:
- Removes the temporary 'Phase 4' reference that won't be meaningful post-merge
- Describes what the tests actually verify (retry infrastructure initialization)
- Matches the 'Retry Infrastructure Tests' comment in the file
try {
    val connection = httpClient.upload(apiHost)

    // Add X-Retry-Count header (only on retries, not first attempt)
I don't think it hurts anything, but I had asked Val whether we should just always include it, even when it's zero, as a way to distinguish new non-retries from old retries-and-non-retries. But if we're going to be moving to v2 anyway, that makes it entirely moot. The test will probably need to be changed to expect no header rather than x-retry-count=0, and I can fix the other SDKs.
We should be consistent across the libraries. Should we always add it, or add it only after the first attempt?
Let's go with only after first attempt, so no x-retry-count=0. Neater, less traffic most of the time.
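The rule agreed in this thread (send X-Retry-Count only on retries, never on the first attempt) can be sketched as below. The function and the map-based header representation are illustrative stand-ins for the real HTTP plumbing.

```kotlin
// Sketch of the agreed header rule (stand-in, not the SDK's actual code):
// omit X-Retry-Count entirely on the first attempt, so no x-retry-count=0.
fun retryHeaders(retryCount: Int): Map<String, String> =
    if (retryCount > 0) mapOf("X-Retry-Count" to retryCount.toString())
    else emptyMap()  // first attempt: no header at all

fun main() {
    println(retryHeaders(0))  // {} : first attempt, header omitted
    println(retryHeaders(2))  // {X-Retry-Count=2}
}
```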
Enables time manipulation in tests for proper retry chain testing.

Changes:
- EventPipeline: Add optional timeProvider constructor parameter
- Defaults to SystemTimeProvider() for production use
- Passes timeProvider to RetryStateMachine for a consistent time source
- Enables tests to inject FakeTimeProvider for time manipulation

Benefits:
- Tests can now verify complete retry chains (429 → wait → 200)
- Tests can verify exponential backoff delays
- Tests can verify Retry-After header compliance
- No production code behavior changes (defaults to SystemTimeProvider)

This infrastructure enables future chain tests that verify:
- 429 → (time passes) → 200 success
- 500 → backoff → 500 → backoff → 200 success
- X-Retry-Count header increments through the retry sequence
- Rate limit state expires correctly after the wait time

All existing tests pass (76 tests)
Adds 5 comprehensive chain tests that verify complete retry sequences.

Simple Tests (verify time manipulation infrastructure):
1. FakeTimeProvider integration - Confirms time manipulation works
2. Time advance blocks uploads - Uploads blocked during rate limit
3. Time advance allows uploads - Uploads resume after rate limit expires

Full Chain Tests (verify complete retry cycles):
4. 429 → 429 → 200 - Multiple rate limiting then success
   - First 429: Pipeline enters RATE_LIMITED state (30s wait)
   - Second 429: Pipeline stays rate-limited (another 30s wait)
   - Success: Pipeline returns to READY state
   - Proves: Retry-After headers respected, rate limit blocking works
5. 500 → backoff → 200 - Exponential backoff then success
   - 500 error: Creates BatchMetadata with nextRetryTime
   - Time advances past backoff: Batch becomes eligible for retry
   - Success: BatchMetadata cleared, batch deleted
   - Proves: Per-batch backoff tracking, exponential delays work

Key Testing Patterns:
- Use FakeTimeProvider.setTime() to manipulate time instantly
- Use >= assertions (not ==) for upload counts (multiple batches may queue)
- Verify state persistence (loadRetryState checks)
- Verify cleanup on success (metadata cleared)

Test Results:
- All 5 chain tests passing
- Total test count: 81 tests (76 → 81)
- No regression in existing tests

Benefits:
- Fast execution (milliseconds instead of minutes of real waits)
- Deterministic (no timing-dependent flakiness)
- Comprehensive coverage of retry chains
- Proves timeProvider infrastructure works end-to-end
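The FakeTimeProvider pattern these tests rely on can be sketched as follows. The class shape here is an assumption based on the commit messages (setTime(), a TimeProvider interface), not the SDK's actual test utility.

```kotlin
// Stand-in sketch of the FakeTimeProvider test pattern described above.
fun interface TimeProvider { fun now(): Long }

class FakeTimeProvider(private var current: Long = 0L) : TimeProvider {
    override fun now(): Long = current
    fun setTime(t: Long) { current = t }          // jump to an absolute time
    fun advanceBy(delta: Long) { current += delta } // or step forward
}

fun main() {
    val time = FakeTimeProvider()
    val waitUntil = 30_000L                  // e.g. a Retry-After: 30 rate limit
    println(time.now() >= waitUntil)         // false: still rate-limited
    time.setTime(30_000L)                    // advance instantly, no real wait
    println(time.now() >= waitUntil)         // true: batch eligible again
}
```

Because the pipeline reads time only through the injected provider, a 30-second rate-limit wait collapses to a single setTime() call, which is what makes the chain tests run in milliseconds.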
Codecov Report

❌ Patch coverage is unavailable. Please upload reports for the commit de79472 to get more accurate results.

Additional details and impacted files
@@ Coverage Diff @@
## main #300 +/- ##
============================================
- Coverage 78.13% 73.73% -4.40%
- Complexity 592 861 +269
============================================
Files 80 106 +26
Lines 7275 10024 +2749
Branches 922 1669 +747
============================================
+ Hits 5684 7391 +1707
- Misses 861 1214 +353
- Partials 730 1419 +689

☔ View full report in Codecov by Sentry.
* Fix SDK bugs found by e2e retry tests, wire e2e-cli for failure detection

EventPipeline fixes:
- Smart retry override: use RetryStateMachine.shouldDeleteBatch() instead of legacy handleUploadException for statusCodeOverrides alignment (410/460 now retry, 501/505 now drop)
- Case-insensitive Retry-After header lookup (OkHttp/HTTP2 lowercases)
- X-Retry-Count only sent on retries, omitted on first attempt
- DropBatch decision now reports error via reportInternalError

RetryStateMachine:
- Add shouldDeleteBatch() public method for EventPipeline cleanup decisions

e2e-cli:
- Wire errorHandler to detect delivery failures (HTTP errors, dropped batches)
- Wire maxRetries from test config to SDK BackoffConfig
- Clear delivery errors before each flush cycle
- Report failure when retries exhausted or batch dropped

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fixing case for "Retry-After" headers and adding a test to ensure OkHttp downcases headers.

* fix: Use retryStateMachine.shouldDeleteBatch in handleUploadException

Move the shouldDeleteBatch decision into handleUploadException instead of overriding cleanup after the fact with effectiveStatusCode. This avoids the httpConfig != null gate (which doesn't respect enabled flags) and the null-stream edge case (statusCode=0 defaulting to 500, causing retries instead of cleanup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: Enable retry test suite in e2e-config.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Didier Garcia <digarcia@twilio.com>
The 5-second delay was dominating retry timing tests, causing failures:
- Retry-After: 0 took 4916ms instead of being immediate (<1s)
- Exponential backoff delays appeared constant (~5s) instead of growing
- The 6th retry attempt timed out because the 5s ate into the 20s budget
- First/last delay comparisons failed (4502ms vs 4923ms, both ~5s)

Reducing to 500ms:
- Respects exponential backoff timing (500ms, 1s, 2s, 4s visible)
- Allows Retry-After: 0 to be immediate
- Gives enough time for 6 retry attempts within the 20s timeout
- Still provides a brief window for the initial flush/upload to complete

Expected test results:
- exponential-backoff: delays will grow visibly (500ms -> 4s)
- retry-after: Retry-After: 0 will retry in <1s
- retry-integration: outage recovery will show increasing delays
- All 6 retry attempts will complete within the timeout

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Settings error-tolerance tests were failing with the 500ms delay:
- "still sends events when settings returns 500" → 0 batch requests
- "still sends events when settings response is delayed" → 0 batch requests

Root cause: 500ms isn't enough time for:
1. The settings request to timeout/fail (~1-2s)
2. The SDK to fall back to defaultSettings
3. EventPipeline to initialize
4. Events to be processed and sent

The 2-second delay balances:
- Fast enough for retry tests (46 tests, was 5s on main)
- Slow enough for settings tests (gives the SDK time to recover from failures)

Timeline with 2s:
- t=0-1s: Settings request fails/times out
- t=1-2s: SDK initializes with defaults, sends events
- t=2s: First pendingUploads() check finds batch files

This preserves the retry test speedup (2s vs 5s) while fixing the settings test regressions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ming

Previous attempts:
- 5s fixed delay: retry tests failed (timing masked)
- 500ms: settings tests failed (not enough time for SDK recovery)
- 2s: still had failures in both categories

Root cause: Different test scenarios need different timing:
- Retry tests need to see retry behavior quickly
- Settings tests need SDK time to handle failures/timeouts

Solution: Adaptive polling strategy
1. Initial 1.5s delay for:
   - Settings requests to timeout/fail
   - SDK to fall back to defaultSettings
   - First batch file creation and upload attempt
2. Then poll at 200ms intervals (fast retry detection)
   - Catches Retry-After: 0 quickly (~200-400ms)
   - Detects exponential backoff timing changes
   - Ramps up to 500ms after 5 polls (1.3s total)

Expected results:
- Settings tests: 1.5s gives the SDK time to recover from failures
- Retry-After: 0: ~1.5s (initial) + ~200ms (poll) = ~1.7s
  - Note: may still fail the <1s assertion if timing is tight
- Exponential backoff: 200ms polling reveals timing differences

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previous approach: Fixed initial delays (500ms, 2s, 3s) to allow SDK initialization
Problem: Creates baseline overhead that breaks timing-sensitive tests

Root cause analysis:
- Settings tests need 2-3s for the SDK to handle failures → but the batch appears when ready
- Retry-After: 0 test needs <1s → but fixed delays add unavoidable overhead
- No single fixed delay satisfies both test types

New approach: Pure adaptive polling with no initial delay
- Start: 100ms polls (aggressive for quick tests)
- After 1s: ramp to 500ms polls
- After 4s: ramp to 1s polls
- Let tests naturally complete when ready instead of guessing timing

Timeline for Retry-After: 0:
- t=0: flush() called
- t=100-400ms: SDK creates batch, polls detect it
- Total: ~300-600ms (should pass the <1s assertion)

Timeline for settings slow response:
- t=0-1s: Fast polls while settings times out
- t=1-2s: SDK falls back to defaults, processes events
- t=2s: Batch created
- t=2.1s: Next poll detects it → success

Benefits:
- No wasted time on fixed delays
- Fast tests complete quickly
- Slow tests get the full timeout duration
- Polling overhead is minimal (100-500ms)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
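The adaptive polling schedule this commit describes (100ms, then 500ms, then 1s) can be sketched as a pure function of elapsed time. The thresholds mirror the commit message; the function itself is an illustrative stand-in, not the e2e-cli's actual code.

```kotlin
// Sketch of the adaptive polling schedule: no fixed initial delay,
// interval grows with elapsed time (thresholds from the commit message).
fun pollInterval(elapsedMs: Long): Long = when {
    elapsedMs < 1_000 -> 100   // aggressive early polls for quick tests
    elapsedMs < 4_000 -> 500   // ramp down once the SDK is likely initialized
    else -> 1_000              // slow polls for long-running scenarios
}

fun main() {
    println(pollInterval(0))      // 100
    println(pollInterval(2_000))  // 500
    println(pollInterval(5_000))  // 1000
}
```

This shape is what lets the Retry-After: 0 case finish in a few hundred milliseconds while slow settings-failure cases simply keep polling until the batch appears or the overall timeout fires.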
The aggressive polling approach (100ms from start) broke SDK initialization by checking for batch files before they were created. SDK needs time to: - Handle settings request/timeout - Fall back to defaultSettings - Initialize EventPipeline - Create batch files This restores a fixed 1.2s initial delay followed by adaptive polling (200ms → 500ms → 1s). This overhead is inherent to the CLI architecture. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Trying to balance SDK initialization needs with Retry-After: 0 test requirement (<1000ms). The 1212ms result with 1.2s delay suggests most overhead is in the fixed delay itself. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Reduced HTTP timeouts from 15s/20s to 2s/2s in e2e-cli. This allows settings error tests to fail fast and fall back to defaultSettings quickly, enabling shorter initial delay (300ms) that should satisfy Retry-After: 0 timing requirements (<1s between requests). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The poll loop was breaking on pendingUploads().isEmpty() before the SDK finished initialization. When the settings fetch is slow or fails, events stay queued (processing paused) and no batch files exist yet. The CLI mistook this for 'all events delivered' and exited with success=true.

Fix: track whether we've ever seen batch files. Only break when pending goes from non-empty to empty (all uploaded/dropped). If we've never seen batch files, keep polling until they appear or we time out.

This fixes all 5 settings error-tolerance tests that were failing. 59/59 e2e tests now pass (basic, retry, settings).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Read httpConfig from CDN integration settings in SegmentDestination.update()
and propagate to EventPipeline via updateHttpConfig()
- Default enabled to true for CDN-sourced configs (presence implies active)
- Add @SerialName annotations to RetryBehavior enum for lowercase JSON
compatibility ("retry"/"drop" from CDN)
- Enforce rateLimitConfig.maxRetryCount via globalRetryCount in
RetryStateMachine.shouldUploadBatch()
- Add retry-settings test suite to e2e-config.json
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Update test JSON to use lowercase "retry"/"drop" instead of "RETRY"/"DROP" to match the @SerialName annotations added for CDN JSON compatibility. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
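The lowercase wire-name mapping described in these commits can be modeled as below. In the SDK it is done with @SerialName("retry") / @SerialName("drop") annotations from kotlinx.serialization; this stand-alone sketch uses a plain lookup so it runs without the serialization plugin, and the enum shape is illustrative rather than copied from the SDK.

```kotlin
// Stand-in model of the RetryBehavior wire-name mapping: lowercase JSON
// values ("retry"/"drop") map to the uppercase enum constants.
enum class RetryBehavior(val wireName: String) {
    RETRY("retry"),
    DROP("drop");

    companion object {
        // Models what @SerialName-driven decoding does for CDN JSON.
        fun fromJson(value: String): RetryBehavior =
            values().first { it.wireName == value }
    }
}

fun main() {
    println(RetryBehavior.fromJson("retry"))  // RETRY
    println(RetryBehavior.fromJson("drop"))   // DROP
}
```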
Overview
Integrates the RetryStateMachine (from Phases 1-3) into EventPipeline to enable smart retry handling for TAPI responses.
Key Changes
Production Code
EventPipeline.kt: Integrated retry state machine with 4-step approach
Configuration.kt: Added httpConfig: HttpConfig? field for user-facing API
SegmentDestination.kt: Pass httpConfig from Configuration to EventPipeline
RetryConfig.kt: enabled=false defaults for backward compatibility; removed unused maxRateLimitDuration field

Test Code
Behavior
Legacy Mode (Default)
When httpConfig=null or both configs have enabled=false:

Smart Retry Mode
When httpConfig is provided with enabled=true:

Test Results
Usage Example
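A hedged sketch of what enabling smart retry via the new httpConfig field might look like. The config class shapes and parameter names below are stand-ins inferred from this PR's description, not the SDK's exact constructors.

```kotlin
// Stand-in config shapes modeling the PR's user-facing API (assumptions).
data class RetryConfig(val enabled: Boolean = false)          // legacy mode by default
data class HttpConfig(val retryConfig: RetryConfig = RetryConfig())

data class Configuration(
    val writeKey: String,
    val httpConfig: HttpConfig? = null,   // null → legacy mode, zero behavior change
)

fun main() {
    val legacy = Configuration(writeKey = "YOUR_WRITE_KEY")    // legacy mode
    val smart = Configuration(
        writeKey = "YOUR_WRITE_KEY",
        httpConfig = HttpConfig(RetryConfig(enabled = true)),  // smart retry on
    )
    println(legacy.httpConfig == null)                         // true
    println(smart.httpConfig?.retryConfig?.enabled)            // true
}
```

The defaulting mirrors the backward-compatibility guarantee stated above: omitting httpConfig (or leaving enabled=false) keeps the pre-existing retry behavior.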
Phase Completion
Backward Compatibility