feat: Phase 4 - EventPipeline integration with smart retry system #300

Open

didiergarcia wants to merge 21 commits into main from feat/tapi-retry-phase4-pipeline-integration

Conversation

@didiergarcia
Contributor

Overview

Integrates the RetryStateMachine (from Phases 1-3) into EventPipeline to enable smart retry handling for TAPI responses.

Key Changes

Production Code

  • EventPipeline.kt: Integrated retry state machine with 4-step approach

    • Step 1: Infrastructure - Load httpConfig and RetryState on initialization
    • Step 2: Upload Gate - Skip uploads when pipeline is rate-limited
    • Step 3: Per-batch Logic - Check shouldUploadBatch() before each upload attempt
    • Step 4: Response Handling - Track responses, update state, persist to storage, send X-Retry-Count header
  • Configuration.kt: Added httpConfig: HttpConfig? field for user-facing API

  • SegmentDestination.kt: Pass httpConfig from Configuration to EventPipeline

  • RetryConfig.kt:

    • Changed defaults to enabled=false for backward compatibility (legacy mode)
    • Removed unused maxRateLimitDuration field
    • Added KDoc documentation
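The four-step EventPipeline integration above can be sketched as one pass over the queued batch files. This is an illustrative stand-in, not the real EventPipeline code: the UploadDecision names come from this PR, but the signatures and the FakeStateMachine are simplified assumptions.

```kotlin
// Simplified stand-ins for the PR's types; signatures are illustrative only.
sealed class UploadDecision {
    object Proceed : UploadDecision()
    object SkipThisBatch : UploadDecision()
    object SkipAllBatches : UploadDecision()
    object DropBatch : UploadDecision()
}

// Fake state machine: rate-limited until a fixed time, otherwise proceeds.
class FakeStateMachine(private val rateLimitedUntil: Long) {
    fun shouldUploadBatch(file: String, now: Long): UploadDecision =
        if (now < rateLimitedUntil) UploadDecision.SkipAllBatches else UploadDecision.Proceed
}

// One flush pass: Step 2 gates the whole flush, Step 3 checks each batch,
// Step 4 (response tracking/persistence) is elided here.
fun flushOnce(files: List<String>, machine: FakeStateMachine, now: Long): List<String> {
    val uploaded = mutableListOf<String>()
    for (file in files) {
        when (machine.shouldUploadBatch(file, now)) {
            UploadDecision.SkipAllBatches -> return uploaded // rate-limited: stop entirely
            UploadDecision.SkipThisBatch -> {}               // backoff pending: next file
            UploadDecision.DropBatch -> {}                   // real code deletes the file
            UploadDecision.Proceed -> uploaded.add(file)     // real code uploads + records status
        }
    }
    return uploaded
}

val machine = FakeStateMachine(rateLimitedUntil = 1_000L)
check(flushOnce(listOf("batch-0", "batch-1"), machine, now = 500L).isEmpty())   // rate-limited
check(flushOnce(listOf("batch-0", "batch-1"), machine, now = 2_000L).size == 2) // ready
```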

Test Code

  • EventPipelinePhase4Test.kt (NEW): 7 infrastructure tests
  • EventPipelineSmartRetryTest.kt (NEW): 8 behavior tests verifying actual retry logic
  • RetryStateMachineTest.kt: Updated setup to explicitly enable configs
  • HttpConfigTest.kt: Updated assertions for new defaults
  • RetryIntegrationTest.kt: Baseline tests unchanged - all passing ✅

Behavior

Legacy Mode (Default)

When httpConfig=null or both configs have enabled=false:

  • Zero behavior changes from current EventPipeline
  • No state machine logic runs
  • Baseline tests prove identical behavior

Smart Retry Mode

When httpConfig provided with enabled=true:

  • 429 rate limiting: Pipeline respects Retry-After headers, skips all uploads during rate limit period
  • Exponential backoff: 5xx errors trigger per-batch exponential backoff with jitter
  • X-Retry-Count header: Sent on retry attempts (not first upload)
  • State persistence: RetryState survives app restarts
  • Per-batch tracking: Independent retry state for each batch file
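The exponential-backoff-with-jitter behavior can be sketched as below. The base delay, cap, and jitter fraction here are assumptions for illustration, not the actual BackoffConfig defaults.

```kotlin
import kotlin.random.Random

// Illustrative backoff: delay = min(base * 2^attempt, cap) plus up to 25% jitter.
// baseMs, capMs, and the jitter fraction are assumed values, not the SDK's defaults.
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 500,
    capMs: Long = 30_000,
    random: Random = Random.Default
): Long {
    val exp = (baseMs shl attempt.coerceAtMost(20)).coerceAtMost(capMs)
    val jitter = (exp * random.nextDouble(0.25)).toLong()
    return exp + jitter
}

// Delays grow 500ms -> 1s -> 2s -> 4s (plus jitter), then stop at the cap.
check(backoffDelayMs(0) in 500L..624L)
check(backoffDelayMs(3) in 4_000L..4_999L)
```

The jitter keeps many clients that failed at the same moment from retrying in lockstep.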

Test Results

  • 76 tests passing (0 failures)
    • EventPipelineTest: 18 tests
    • EventPipelinePhase4Test: 7 tests (infrastructure)
    • EventPipelineSmartRetryTest: 8 tests (behavior)
    • RetryIntegrationTest: 4 tests (baseline - unchanged)
    • RetryStateMachineTest: 24 tests
    • RetryStateStorageTest: passing
    • HttpConfigTest: 13 tests

Usage Example

```kotlin
// Legacy mode (default - no changes)
val analytics = Analytics("write-key", context, Configuration())

// Smart retry mode (opt-in)
val analytics = Analytics(
    "write-key",
    context,
    Configuration(
        httpConfig = HttpConfig(
            rateLimitConfig = RateLimitConfig(enabled = true),
            backoffConfig = BackoffConfig(enabled = true)
        )
    )
)
```

Phase Completion

  • ✅ Phase 1: Core state machine (merged)
  • ✅ Phase 2: Status code resolution (merged)
  • ✅ Phase 3: Batch metadata & upload decisions (merged)
  • ✅ Phase 4: EventPipeline integration (this PR)
  • ✅ Phase 5: Cleanup & documentation (this PR)

Backward Compatibility

  • ✅ All existing tests passing
  • ✅ Baseline behavior tests prove no changes
  • ✅ Default configuration uses legacy mode
  • ✅ Opt-in activation via Configuration.httpConfig

Step 1 of Phase 4 integration:
- Add optional httpConfig parameter to EventPipeline constructor
- Initialize RetryStateMachine with config (defaults to legacy mode)
- Load persisted RetryState on startup
- Convert HttpConfig to RetryConfig in init block

No behavior changes yet - infrastructure only.
Baseline tests still passing (verified legacy mode works).
Add 7 tests to verify Phase 4 Step 1 infrastructure:
1. Constructor accepts httpConfig parameter
2. RetryState loaded from storage on initialization
3. Legacy mode with httpConfig=null works (no behavior changes)
4. Legacy mode with enabled=false works
5. Persisted RATE_LIMITED state is loaded correctly
6. Default state used when no persisted state exists
7. Smart retry enabled initializes correctly

All tests verify infrastructure without testing upload behavior yet.
Tests use spyk(storage) to verify loadRetryState() is called.

All tests passing (BUILD SUCCESSFUL)
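The initialization checks above (tests 2, 5, and 6) can be sketched stdlib-only; the real tests use MockK's spyk(storage), and the types below are hypothetical stand-ins.

```kotlin
// Hand-rolled spy standing in for spyk(storage): counts loadRetryState() calls.
class SpyStorage(private val persisted: String? = null) {
    var loadCalls = 0
        private set
    fun loadRetryState(): String? { loadCalls++; return persisted }
}

// Minimal pipeline: loads persisted state on init, falls back to a default.
class MiniPipeline(storage: SpyStorage) {
    val state: String = storage.loadRetryState() ?: "READY"
}

val empty = SpyStorage()
check(MiniPipeline(empty).state == "READY") // default state when nothing persisted
check(empty.loadCalls == 1)                 // storage consulted on initialization

val limited = SpyStorage(persisted = "RATE_LIMITED")
check(MiniPipeline(limited).state == "RATE_LIMITED") // persisted state loaded
```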
Step 2 of Phase 4 integration:
- Add timeProvider field (SystemTimeProvider) to EventPipeline
- Add upload gate check at beginning of upload() method
- Check retryState.isRateLimited() before processing any files
- Skip all uploads and return early if rate-limited
- Clear RATE_LIMITED state if waitUntilTime has passed
- Log when uploads are skipped due to rate limiting

Key behavior:
- When RATE_LIMITED and wait time not passed → skip ALL uploads for this flush
- When wait time passes → transition back to READY state automatically
- Legacy mode (httpConfig=null or enabled=false) → no rate limit checks

All tests still passing (no behavior change yet - rate limit state never set)
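The Step 2 gate can be sketched as a pure function of state and time; RetryState here is a simplified stand-in whose field names are assumptions, not the PR's actual class.

```kotlin
// Simplified stand-in for the PR's RetryState; field names are assumptions.
data class RetryState(val rateLimited: Boolean, val waitUntilTime: Long)

// Upload gate: skip the whole flush while rate-limited, auto-clear once
// waitUntilTime has passed (transition back to READY).
fun gate(state: RetryState, now: Long): Pair<Boolean, RetryState> =
    when {
        !state.rateLimited -> true to state
        now < state.waitUntilTime -> false to state // skip ALL uploads this flush
        else -> true to RetryState(false, 0L)       // wait elapsed: back to READY
    }

val limited = RetryState(rateLimited = true, waitUntilTime = 30_000L)
check(!gate(limited, now = 10_000L).first)                       // still blocked
check(gate(limited, now = 31_000L) == (true to RetryState(false, 0L))) // cleared
```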
Step 3 of Phase 4 integration:
- Check shouldUploadBatch() for each batch file before uploading
- Handle UploadDecision results:
  * SkipAllBatches → stop processing remaining files
  * SkipThisBatch → skip this file, continue with next
  * DropBatch → delete batch file (exceeded retry limits)
  * Proceed → continue with upload
- RetryState.batchMetadata automatically updated by state machine
- No separate .meta files needed - metadata stored in RetryState

Key behavior:
- Each batch checked independently against retry limits
- Batches with pending nextRetryTime are skipped
- Batches exceeding max retry count or duration are dropped
- RetryState persisted with updated batchMetadata

All tests passing (no behavior change yet - state machine not updating metadata)
Integrates RetryStateMachine into EventPipeline for TAPI smart retry handling.

Changes:
- EventPipeline: Add httpConfig parameter and retry state machine integration
  - Upload gate: Skip uploads when rate-limited
  - Per-batch logic: Check shouldUploadBatch before attempting upload
  - Response handling: Track status codes, update retry state, persist state
  - X-Retry-Count header: Add on retry attempts (not first upload)

- Configuration: Add httpConfig field for user-facing API
- SegmentDestination: Pass httpConfig from Configuration to EventPipeline
- RetryConfig: Set enabled=false defaults for backward compatibility
  - Remove unused maxRateLimitDuration field

Tests:
- EventPipelinePhase4Test: Infrastructure verification (7 tests)
- EventPipelineSmartRetryTest: Behavior verification (8 tests)
- Updated RetryStateMachineTest: Enable configs in setup for existing tests
- Updated HttpConfigTest: Fix assertions for new defaults
- RetryIntegrationTest: Baseline tests unchanged and passing

Key features:
- Legacy mode by default (httpConfig=null or enabled=false)
- Zero behavior changes when smart retry disabled
- State persists across app restarts
- 76 tests passing (37 EventPipeline + 39 retry tests)

Phase 5 cleanup:
- Removed all Phase 4 comments from production code
- Added KDoc documentation for retry configs
- Cleaned up test file headers and TODOs
…ructureTest

Improves test file naming clarity:
- Removes temporary 'Phase 4' reference that won't be meaningful post-merge
- Describes what the tests actually verify (retry infrastructure initialization)
- Matches the 'Retry Infrastructure Tests' comment in the file
```kotlin
try {
    val connection = httpClient.upload(apiHost)

    // Add X-Retry-Count header (only on retries, not first attempt)
```
Contributor

I don't think it hurts anything, but I had asked Val about whether we should just always include it, even when it's zero, as a way to distinguish new non-retries from old retries-and-non-retries. But if we're going to be moving to v2 anyway, that makes it entirely moot. The test will probably need to be changed to check that no retry count of 0 is sent, and I can fix the other SDKs.

Contributor Author

We should be consistent across the libraries. Should we always add it, or only after the first attempt?

Contributor

Let's go with only after first attempt, so no x-retry-count=0. Neater, less traffic most of the time.
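The agreed behavior (no x-retry-count=0, header only from the first retry onward) reduces to a small conditional. The helper name below is hypothetical, not the SDK's actual function.

```kotlin
// Hypothetical helper: build retry headers per the agreed behavior --
// omit X-Retry-Count entirely on the first attempt, send it from retry 1 on.
fun retryHeaders(retryCount: Int): Map<String, String> =
    if (retryCount > 0) mapOf("X-Retry-Count" to retryCount.toString()) else emptyMap()

check(retryHeaders(0).isEmpty())                        // first attempt: no header
check(retryHeaders(2) == mapOf("X-Retry-Count" to "2")) // second retry
```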

wenxi-zeng
wenxi-zeng previously approved these changes Mar 13, 2026
Enables time manipulation in tests for proper retry chain testing.

Changes:
- EventPipeline: Add optional timeProvider constructor parameter
  - Defaults to SystemTimeProvider() for production use
  - Passes timeProvider to RetryStateMachine for consistent time source
  - Enables tests to inject FakeTimeProvider for time manipulation

Benefits:
- Tests can now verify complete retry chains (429 → wait → 200)
- Tests can verify exponential backoff delays
- Tests can verify Retry-After header compliance
- No production code behavior changes (defaults to SystemTimeProvider)

This infrastructure enables future chain tests that verify:
- 429 → (time passes) → 200 success
- 500 → backoff → 500 → backoff → 200 success
- X-Retry-Count header increments through retry sequence
- Rate limit state expires correctly after wait time

All existing tests pass (76 tests)
Adds 5 comprehensive chain tests that verify complete retry sequences:

Simple Tests (verify time manipulation infrastructure):
1. FakeTimeProvider integration - Confirms time manipulation works
2. Time advance blocks uploads - Uploads blocked during rate limit
3. Time advance allows uploads - Uploads resume after rate limit expires

Full Chain Tests (verify complete retry cycles):
4. 429 → 429 → 200 - Multiple rate limiting then success
   - First 429: Pipeline enters RATE_LIMITED state (30s wait)
   - Second 429: Pipeline stays rate-limited (another 30s wait)
   - Success: Pipeline returns to READY state
   - Proves: Retry-After headers respected, rate limit blocking works

5. 500 → backoff → 200 - Exponential backoff then success
   - 500 error: Creates BatchMetadata with nextRetryTime
   - Time advances past backoff: Batch becomes eligible for retry
   - Success: BatchMetadata cleared, batch deleted
   - Proves: Per-batch backoff tracking, exponential delays work

Key Testing Patterns:
- Use FakeTimeProvider.setTime() to manipulate time instantly
- Use >= assertions (not ==) for upload counts (multiple batches may queue)
- Verify state persistence (loadRetryState checks)
- Verify cleanup on success (metadata cleared)

Test Results:
- All 5 chain tests passing
- Total test count: 81 tests (76 → 81)
- No regression in existing tests

Benefits:
- Fast execution (milliseconds instead of minutes of real waits)
- Deterministic (no timing-dependent flakiness)
- Comprehensive coverage of retry chains
- Proves timeProvider infrastructure works end-to-end
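The TimeProvider/FakeTimeProvider names come from the commits above, but the interface shape below is an assumed sketch of the injection pattern, not the real API. It shows how setTime() lets a chain test cross a rate-limit window instantly.

```kotlin
// Assumed sketch of the time-injection pattern; not the SDK's actual interface.
interface TimeProvider { fun currentTimeMillis(): Long }

class FakeTimeProvider(private var now: Long = 0L) : TimeProvider {
    override fun currentTimeMillis(): Long = now
    fun setTime(t: Long) { now = t }
}

// A rate limit that expires against the injected clock.
class RateLimit(private val time: TimeProvider) {
    private var until = 0L
    fun onRetryAfterSeconds(s: Long) { until = time.currentTimeMillis() + s * 1_000 }
    fun blocked(): Boolean = time.currentTimeMillis() < until
}

// Chain-test pattern: 429 -> (clock jumps forward) -> uploads resume.
val clock = FakeTimeProvider()
val limit = RateLimit(clock)
limit.onRetryAfterSeconds(30)
check(limit.blocked())  // inside the 30s window
clock.setTime(31_000L)  // jump past the window, no real waiting
check(!limit.blocked()) // uploads eligible again
```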
@codecov

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 70.57143% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.73%. Comparing base (b958b36) to head (80dbe31).
⚠️ Report is 30 commits behind head on main.

⚠️ Current head 80dbe31 differs from pull request most recent head de79472

Please upload reports for the commit de79472 to get more accurate results.

Files with missing lines Patch % Lines
...otlin/core/platform/EventPipelineSmartRetryTest.kt 59.88% 1 Missing and 66 partials ⚠️
...nt/analytics/kotlin/core/platform/EventPipeline.kt 69.64% 11 Missing and 6 partials ⚠️
...e/platform/EventPipelineRetryInfrastructureTest.kt 85.58% 0 Missing and 16 partials ⚠️
...segment/analytics/kotlin/core/retry/RetryConfig.kt 25.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #300      +/-   ##
============================================
- Coverage     78.13%   73.73%   -4.40%     
- Complexity      592      861     +269     
============================================
  Files            80      106      +26     
  Lines          7275    10024    +2749     
  Branches        922     1669     +747     
============================================
+ Hits           5684     7391    +1707     
- Misses          861     1214     +353     
- Partials        730     1419     +689     


wenxi-zeng
wenxi-zeng previously approved these changes Mar 24, 2026
* Fix SDK bugs found by e2e retry tests, wire e2e-cli for failure detection

EventPipeline fixes:
- Smart retry override: use RetryStateMachine.shouldDeleteBatch() instead
  of legacy handleUploadException for statusCodeOverrides alignment
  (410/460 now retry, 501/505 now drop)
- Case-insensitive Retry-After header lookup (OkHttp/HTTP2 lowercases)
- X-Retry-Count only sent on retries, omitted on first attempt
- DropBatch decision now reports error via reportInternalError

RetryStateMachine:
- Add shouldDeleteBatch() public method for EventPipeline cleanup decisions

e2e-cli:
- Wire errorHandler to detect delivery failures (HTTP errors, dropped batches)
- Wire maxRetries from test config to SDK BackoffConfig
- Clear delivery errors before each flush cycle
- Report failure when retries exhausted or batch dropped

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix casing for "Retry-After" header lookup and add a test ensuring OkHttp lowercases headers.

* fix: Use retryStateMachine.shouldDeleteBatch in handleUploadException

Move shouldDeleteBatch decision into handleUploadException instead of
overriding cleanup after the fact with effectiveStatusCode. This avoids
the httpConfig!=null gate (which doesn't respect enabled flags) and the
null-stream edge case (statusCode=0 defaulting to 500 causing retries
instead of cleanup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: Enable retry test suite in e2e-config.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Didier Garcia <digarcia@twilio.com>
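The case-insensitive Retry-After fix above can be sketched as below. The map-based lookup is an illustration only; the real code reads headers off the HTTP response object.

```kotlin
// Illustrative case-insensitive header lookup: HTTP/2 (and OkHttp) lowercase
// header names, so an exact "Retry-After" match would miss "retry-after".
fun retryAfterSeconds(headers: Map<String, String>): Long? =
    headers.entries
        .firstOrNull { it.key.equals("Retry-After", ignoreCase = true) }
        ?.value?.toLongOrNull()

check(retryAfterSeconds(mapOf("retry-after" to "30")) == 30L)
check(retryAfterSeconds(mapOf("Retry-After" to "5")) == 5L)
check(retryAfterSeconds(mapOf("Content-Type" to "json")) == null)
```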
didiergarcia and others added 7 commits March 24, 2026 22:28
The 5-second delay was dominating retry timing tests, causing failures:
- Retry-After: 0 took 4916ms instead of immediate (<1s)
- Exponential backoff delays appeared constant (~5s) instead of growing
- 6th retry attempt timed out due to 5s eating into 20s budget
- First/last delay comparisons failed (4502ms vs 4923ms, both ~5s)

Reducing to 500ms:
- Respects exponential backoff timing (500ms, 1s, 2s, 4s visible)
- Allows Retry-After: 0 to be immediate
- Gives enough time for 6 retry attempts within 20s timeout
- Still provides brief window for initial flush/upload to complete

Expected test results:
- exponential-backoff: delays will grow visibly (500ms -> 4s)
- retry-after: Retry-After: 0 will retry in <1s
- retry-integration: outage recovery will show increasing delays
- All 6 retry attempts will complete within timeout

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Settings error-tolerance tests were failing with 500ms delay:
- "still sends events when settings returns 500" → 0 batch requests
- "still sends events when settings response is delayed" → 0 batch requests

Root cause: 500ms isn't enough time for:
1. Settings request to timeout/fail (~1-2s)
2. SDK to fallback to defaultSettings
3. EventPipeline to initialize
4. Events to be processed and sent

The 2-second delay balances:
- Fast enough for retry tests (46 tests, was 5s on main)
- Slow enough for settings tests (gives SDK time to recover from failures)

Timeline with 2s:
- t=0-1s: Settings request fails/times out
- t=1-2s: SDK initializes with defaults, sends events
- t=2s: First pendingUploads() check finds batch files

This preserves the retry test speedup (2s vs 5s) while fixing settings
test regressions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ming

Previous attempts:
- 5s fixed delay: retry tests failed (timing masked)
- 500ms: settings tests failed (not enough time for SDK recovery)
- 2s: still had failures in both categories

Root cause: Different test scenarios need different timing:
- Retry tests need to see retry behavior quickly
- Settings tests need SDK time to handle failures/timeouts

Solution: Adaptive polling strategy
1. Initial 1.5s delay for:
   - Settings requests to timeout/fail
   - SDK to fallback to defaultSettings
   - First batch file creation and upload attempt

2. Then poll at 200ms intervals (fast retry detection)
   - Catches Retry-After: 0 quickly (~200-400ms)
   - Detects exponential backoff timing changes
   - Ramps up to 500ms after 5 polls (1.3s total)

Expected results:
- Settings tests: 1.5s gives SDK time to recover from failures
- Retry-After: 0: ~1.5s total (initial) + ~200ms (poll) = ~1.7s
  - Note: may still fail <1s assertion if timing is tight
- Exponential backoff: 200ms polling reveals timing differences

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previous approach: Fixed initial delays (500ms, 2s, 3s) to allow SDK initialization
Problem: Creates baseline overhead that breaks timing-sensitive tests

Root cause analysis:
- Settings tests need 2-3s for SDK to handle failures → but batch appears when ready
- Retry-After: 0 test needs <1s → but fixed delays add unavoidable overhead
- No single fixed delay satisfies both test types

New approach: Pure adaptive polling with no initial delay
- Start: 100ms polls (aggressive for quick tests)
- After 1s: ramp to 500ms polls
- After 4s: ramp to 1s polls
- Let tests naturally complete when ready instead of guessing timing

Timeline for Retry-After: 0:
- t=0: flush() called
- t=100-400ms: SDK creates batch, polls detect it
- Total: ~300-600ms (should pass <1s assertion)

Timeline for settings slow response:
- t=0-1s: Fast polls while settings times out
- t=1-2s: SDK falls back to defaults, processes events
- t=2s: Batch created
- t=2.1s: Next poll detects it → success

Benefits:
- No wasted time on fixed delays
- Fast tests complete quickly
- Slow tests get full timeout duration
- Polling overhead is minimal (100-500ms)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The aggressive polling approach (100ms from start) broke SDK initialization
by checking for batch files before they were created. SDK needs time to:
- Handle settings request/timeout
- Fall back to defaultSettings
- Initialize EventPipeline
- Create batch files

This restores a fixed 1.2s initial delay followed by adaptive polling
(200ms → 500ms → 1s). This overhead is inherent to the CLI architecture.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
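The poll-interval ramp described in these commits (fast polls first, slower later) can be sketched as a pure function; the exact thresholds varied across commits, so the values below are illustrative.

```kotlin
// Illustrative poll-interval ramp: 200ms early, 500ms mid, 1s late.
// Thresholds are assumptions; the commits above tried several combinations.
fun pollIntervalMs(elapsedMs: Long): Long = when {
    elapsedMs < 1_000 -> 200L
    elapsedMs < 4_000 -> 500L
    else -> 1_000L
}

check(pollIntervalMs(0) == 200L)       // catch Retry-After: 0 quickly
check(pollIntervalMs(2_500) == 500L)   // steady state
check(pollIntervalMs(10_000) == 1_000L) // long-running settings/outage tests
```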
Trying to balance SDK initialization needs with Retry-After: 0 test
requirement (<1000ms). The 1212ms result with 1.2s delay suggests most
overhead is in the fixed delay itself.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Reduced HTTP timeouts from 15s/20s to 2s/2s in e2e-cli. This allows
settings error tests to fail fast and fall back to defaultSettings
quickly, enabling shorter initial delay (300ms) that should satisfy
Retry-After: 0 timing requirements (<1s between requests).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
wenxi-zeng
wenxi-zeng previously approved these changes Mar 25, 2026
The poll loop was breaking on pendingUploads().isEmpty() before the SDK
finished initialization. When settings fetch is slow or fails, events
stay queued (processing paused) and no batch files exist yet. The CLI
mistook this for 'all events delivered' and exited with success=true.

Fix: track whether we've ever seen batch files. Only break when pending
goes from non-empty to empty (all uploaded/dropped). If we've never
seen batch files, keep polling until they appear or we time out.

This fixes all 5 settings error-tolerance tests that were failing.
59/59 e2e tests now pass (basic, retry, settings).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
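The poll-loop fix above can be sketched as a small state machine over the observed pending counts; the function below is an illustration of the logic, not the CLI's actual code.

```kotlin
// Sketch of the fixed poll loop: 'pending is empty' only means 'delivered'
// after batch files have been seen at least once; otherwise keep polling.
fun deliveredAfterPolls(pendingCounts: List<Int>): Boolean {
    var everSawBatches = false
    for (pending in pendingCounts) {
        if (pending > 0) everSawBatches = true
        else if (everSawBatches) return true // non-empty -> empty: all uploaded/dropped
        // never seen batches yet: SDK still initializing, keep polling
    }
    return false // timed out without confirming delivery
}

check(!deliveredAfterPolls(listOf(0, 0, 0)))   // old bug: would have returned true early
check(deliveredAfterPolls(listOf(0, 3, 1, 0))) // batches appeared, then drained
```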
- Read httpConfig from CDN integration settings in SegmentDestination.update()
  and propagate to EventPipeline via updateHttpConfig()
- Default enabled to true for CDN-sourced configs (presence implies active)
- Add @SerialName annotations to RetryBehavior enum for lowercase JSON
  compatibility ("retry"/"drop" from CDN)
- Enforce rateLimitConfig.maxRetryCount via globalRetryCount in
  RetryStateMachine.shouldUploadBatch()
- Add retry-settings test suite to e2e-config.json

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
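The @SerialName change maps the CDN's lowercase "retry"/"drop" values onto the enum via kotlinx.serialization; this stdlib-only sketch shows the equivalent mapping without the serialization plugin.

```kotlin
// The PR uses kotlinx.serialization @SerialName("retry")/@SerialName("drop");
// this stdlib-only sketch shows the same lowercase-JSON-to-enum mapping.
enum class RetryBehavior {
    RETRY, DROP;

    companion object {
        fun fromJson(name: String): RetryBehavior = valueOf(name.uppercase())
    }
}

check(RetryBehavior.fromJson("retry") == RetryBehavior.RETRY)
check(RetryBehavior.fromJson("drop") == RetryBehavior.DROP)
```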
didiergarcia and others added 2 commits March 30, 2026 10:16
Update test JSON to use lowercase "retry"/"drop" instead of "RETRY"/"DROP"
to match the @SerialName annotations added for CDN JSON compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>