feat: Phase 4 - EventPipeline integration with smart retry system #300

Open

didiergarcia wants to merge 21 commits into main from feat/tapi-retry-phase4-pipeline-integration

Conversation

@didiergarcia
Contributor

Overview

Integrates the RetryStateMachine (from Phases 1-3) into EventPipeline to enable smart retry handling for TAPI responses.

Key Changes

Production Code

  • EventPipeline.kt: Integrated retry state machine with 4-step approach

    • Step 1: Infrastructure - Load httpConfig and RetryState on initialization
    • Step 2: Upload Gate - Skip uploads when pipeline is rate-limited
    • Step 3: Per-batch Logic - Check shouldUploadBatch() before each upload attempt
    • Step 4: Response Handling - Track responses, update state, persist to storage, send X-Retry-Count header
  • Configuration.kt: Added httpConfig: HttpConfig? field for user-facing API

  • SegmentDestination.kt: Pass httpConfig from Configuration to EventPipeline

  • RetryConfig.kt:

    • Changed defaults to enabled=false for backward compatibility (legacy mode)
    • Removed unused maxRateLimitDuration field
    • Added KDoc documentation
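The four-step EventPipeline integration above can be sketched as one pass over the queued batch files. This is an illustrative stand-in, not the real EventPipeline code: the UploadDecision names come from this PR, but the signatures and the FakeStateMachine are simplified assumptions.

```kotlin
// Simplified stand-ins for the PR's types; signatures are illustrative only.
sealed class UploadDecision {
    object Proceed : UploadDecision()
    object SkipThisBatch : UploadDecision()
    object SkipAllBatches : UploadDecision()
    object DropBatch : UploadDecision()
}

// Fake state machine: rate-limited until a fixed time, otherwise proceeds.
class FakeStateMachine(private val rateLimitedUntil: Long) {
    fun shouldUploadBatch(file: String, now: Long): UploadDecision =
        if (now < rateLimitedUntil) UploadDecision.SkipAllBatches else UploadDecision.Proceed
}

// One flush pass: Step 2 gates the whole flush, Step 3 checks each batch,
// Step 4 (response tracking/persistence) is elided here.
fun flushOnce(files: List<String>, machine: FakeStateMachine, now: Long): List<String> {
    val uploaded = mutableListOf<String>()
    for (file in files) {
        when (machine.shouldUploadBatch(file, now)) {
            UploadDecision.SkipAllBatches -> return uploaded // rate-limited: stop entirely
            UploadDecision.SkipThisBatch -> {}               // backoff pending: next file
            UploadDecision.DropBatch -> {}                   // real code deletes the file
            UploadDecision.Proceed -> uploaded.add(file)     // real code uploads + records status
        }
    }
    return uploaded
}

val machine = FakeStateMachine(rateLimitedUntil = 1_000L)
check(flushOnce(listOf("batch-0", "batch-1"), machine, now = 500L).isEmpty())   // rate-limited
check(flushOnce(listOf("batch-0", "batch-1"), machine, now = 2_000L).size == 2) // ready
```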

Test Code

  • EventPipelinePhase4Test.kt (NEW): 7 infrastructure tests
  • EventPipelineSmartRetryTest.kt (NEW): 8 behavior tests verifying actual retry logic
  • RetryStateMachineTest.kt: Updated setup to explicitly enable configs
  • HttpConfigTest.kt: Updated assertions for new defaults
  • RetryIntegrationTest.kt: Baseline tests unchanged - all passing ✅

Behavior

Legacy Mode (Default)

When httpConfig=null or both configs have enabled=false:

  • Zero behavior changes from current EventPipeline
  • No state machine logic runs
  • Baseline tests prove identical behavior

Smart Retry Mode

When httpConfig provided with enabled=true:

  • 429 rate limiting: Pipeline respects Retry-After headers, skips all uploads during rate limit period
  • Exponential backoff: 5xx errors trigger per-batch exponential backoff with jitter
  • X-Retry-Count header: Sent on retry attempts (not first upload)
  • State persistence: RetryState survives app restarts
  • Per-batch tracking: Independent retry state for each batch file
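The exponential-backoff-with-jitter behavior can be sketched as below. The base delay, cap, and jitter fraction here are assumptions for illustration, not the actual BackoffConfig defaults.

```kotlin
import kotlin.random.Random

// Illustrative backoff: delay = min(base * 2^attempt, cap) plus up to 25% jitter.
// baseMs, capMs, and the jitter fraction are assumed values, not the SDK's defaults.
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 500,
    capMs: Long = 30_000,
    random: Random = Random.Default
): Long {
    val exp = (baseMs shl attempt.coerceAtMost(20)).coerceAtMost(capMs)
    val jitter = (exp * random.nextDouble(0.25)).toLong()
    return exp + jitter
}

// Delays grow 500ms -> 1s -> 2s -> 4s (plus jitter), then stop at the cap.
check(backoffDelayMs(0) in 500L..624L)
check(backoffDelayMs(3) in 4_000L..4_999L)
```

The jitter keeps many clients that failed at the same moment from retrying in lockstep.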

Test Results

  • 76 tests passing (0 failures)
    • EventPipelineTest: 18 tests
    • EventPipelinePhase4Test: 7 tests (infrastructure)
    • EventPipelineSmartRetryTest: 8 tests (behavior)
    • RetryIntegrationTest: 4 tests (baseline - unchanged)
    • RetryStateMachineTest: 24 tests
    • RetryStateStorageTest: passing
    • HttpConfigTest: 13 tests

Usage Example

```kotlin
// Legacy mode (default - no changes)
val analytics = Analytics("write-key", context, Configuration())

// Smart retry mode (opt-in)
val analytics = Analytics(
    "write-key",
    context,
    Configuration(
        httpConfig = HttpConfig(
            rateLimitConfig = RateLimitConfig(enabled = true),
            backoffConfig = BackoffConfig(enabled = true)
        )
    )
)
```

Phase Completion

  • ✅ Phase 1: Core state machine (merged)
  • ✅ Phase 2: Status code resolution (merged)
  • ✅ Phase 3: Batch metadata & upload decisions (merged)
  • ✅ Phase 4: EventPipeline integration (this PR)
  • ✅ Phase 5: Cleanup & documentation (this PR)

Backward Compatibility

  • ✅ All existing tests passing
  • ✅ Baseline behavior tests prove no changes
  • ✅ Default configuration uses legacy mode
  • ✅ Opt-in activation via Configuration.httpConfig

Step 1 of Phase 4 integration:
- Add optional httpConfig parameter to EventPipeline constructor
- Initialize RetryStateMachine with config (defaults to legacy mode)
- Load persisted RetryState on startup
- Convert HttpConfig to RetryConfig in init block

No behavior changes yet - infrastructure only.
Baseline tests still passing (verified legacy mode works).
Add 7 tests to verify Phase 4 Step 1 infrastructure:
1. Constructor accepts httpConfig parameter
2. RetryState loaded from storage on initialization
3. Legacy mode with httpConfig=null works (no behavior changes)
4. Legacy mode with enabled=false works
5. Persisted RATE_LIMITED state is loaded correctly
6. Default state used when no persisted state exists
7. Smart retry enabled initializes correctly

All tests verify infrastructure without testing upload behavior yet.
Tests use spyk(storage) to verify loadRetryState() is called.

All tests passing (BUILD SUCCESSFUL)
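The initialization checks above (tests 2, 5, and 6) can be sketched stdlib-only; the real tests use MockK's spyk(storage), and the types below are hypothetical stand-ins.

```kotlin
// Hand-rolled spy standing in for spyk(storage): counts loadRetryState() calls.
class SpyStorage(private val persisted: String? = null) {
    var loadCalls = 0
        private set
    fun loadRetryState(): String? { loadCalls++; return persisted }
}

// Minimal pipeline: loads persisted state on init, falls back to a default.
class MiniPipeline(storage: SpyStorage) {
    val state: String = storage.loadRetryState() ?: "READY"
}

val empty = SpyStorage()
check(MiniPipeline(empty).state == "READY") // default state when nothing persisted
check(empty.loadCalls == 1)                 // storage consulted on initialization

val limited = SpyStorage(persisted = "RATE_LIMITED")
check(MiniPipeline(limited).state == "RATE_LIMITED") // persisted state loaded
```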
Step 2 of Phase 4 integration:
- Add timeProvider field (SystemTimeProvider) to EventPipeline
- Add upload gate check at beginning of upload() method
- Check retryState.isRateLimited() before processing any files
- Skip all uploads and return early if rate-limited
- Clear RATE_LIMITED state if waitUntilTime has passed
- Log when uploads are skipped due to rate limiting

Key behavior:
- When RATE_LIMITED and wait time not passed → skip ALL uploads for this flush
- When wait time passes → transition back to READY state automatically
- Legacy mode (httpConfig=null or enabled=false) → no rate limit checks

All tests still passing (no behavior change yet - rate limit state never set)
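The Step 2 gate can be sketched as a pure function of state and time; RetryState here is a simplified stand-in whose field names are assumptions, not the PR's actual class.

```kotlin
// Simplified stand-in for the PR's RetryState; field names are assumptions.
data class RetryState(val rateLimited: Boolean, val waitUntilTime: Long)

// Upload gate: skip the whole flush while rate-limited, auto-clear once
// waitUntilTime has passed (transition back to READY).
fun gate(state: RetryState, now: Long): Pair<Boolean, RetryState> =
    when {
        !state.rateLimited -> true to state
        now < state.waitUntilTime -> false to state // skip ALL uploads this flush
        else -> true to RetryState(false, 0L)       // wait elapsed: back to READY
    }

val limited = RetryState(rateLimited = true, waitUntilTime = 30_000L)
check(!gate(limited, now = 10_000L).first)                       // still blocked
check(gate(limited, now = 31_000L) == (true to RetryState(false, 0L))) // cleared
```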
Step 3 of Phase 4 integration:
- Check shouldUploadBatch() for each batch file before uploading
- Handle UploadDecision results:
  * SkipAllBatches → stop processing remaining files
  * SkipThisBatch → skip this file, continue with next
  * DropBatch → delete batch file (exceeded retry limits)
  * Proceed → continue with upload
- RetryState.batchMetadata automatically updated by state machine
- No separate .meta files needed - metadata stored in RetryState

Key behavior:
- Each batch checked independently against retry limits
- Batches with pending nextRetryTime are skipped
- Batches exceeding max retry count or duration are dropped
- RetryState persisted with updated batchMetadata

All tests passing (no behavior change yet - state machine not updating metadata)
Integrates RetryStateMachine into EventPipeline for TAPI smart retry handling.

Changes:
- EventPipeline: Add httpConfig parameter and retry state machine integration
  - Upload gate: Skip uploads when rate-limited
  - Per-batch logic: Check shouldUploadBatch before attempting upload
  - Response handling: Track status codes, update retry state, persist state
  - X-Retry-Count header: Add on retry attempts (not first upload)

- Configuration: Add httpConfig field for user-facing API
- SegmentDestination: Pass httpConfig from Configuration to EventPipeline
- RetryConfig: Set enabled=false defaults for backward compatibility
  - Remove unused maxRateLimitDuration field

Tests:
- EventPipelinePhase4Test: Infrastructure verification (7 tests)
- EventPipelineSmartRetryTest: Behavior verification (8 tests)
- Updated RetryStateMachineTest: Enable configs in setup for existing tests
- Updated HttpConfigTest: Fix assertions for new defaults
- RetryIntegrationTest: Baseline tests unchanged and passing

Key features:
- Legacy mode by default (httpConfig=null or enabled=false)
- Zero behavior changes when smart retry disabled
- State persists across app restarts
- 76 tests passing (37 EventPipeline + 39 retry tests)

Phase 5 cleanup:
- Removed all Phase 4 comments from production code
- Added KDoc documentation for retry configs
- Cleaned up test file headers and TODOs
…ructureTest

Improves test file naming clarity:
- Removes temporary 'Phase 4' reference that won't be meaningful post-merge
- Describes what the tests actually verify (retry infrastructure initialization)
- Matches the 'Retry Infrastructure Tests' comment in the file
```kotlin
try {
    val connection = httpClient.upload(apiHost)

    // Add X-Retry-Count header (only on retries, not first attempt)
```
Contributor

I don't think it hurts anything, but I had asked Val about whether we should just always include it, even when it's zero, as a way to distinguish new non-retries from old retries-and-non-retries. But if we're going to be moving to v2 anyway, that makes it entirely moot. The test will probably need to be changed to check that no retry count of 0 is sent, and I can fix the other SDKs.

Contributor Author

We should be consistent across the libraries. Should we always add it, or only after the first attempt?

Contributor

Let's go with only after first attempt, so no x-retry-count=0. Neater, less traffic most of the time.
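The agreed behavior (no x-retry-count=0, header only from the first retry onward) reduces to a small conditional. The helper name below is hypothetical, not the SDK's actual function.

```kotlin
// Hypothetical helper: build retry headers per the agreed behavior --
// omit X-Retry-Count entirely on the first attempt, send it from retry 1 on.
fun retryHeaders(retryCount: Int): Map<String, String> =
    if (retryCount > 0) mapOf("X-Retry-Count" to retryCount.toString()) else emptyMap()

check(retryHeaders(0).isEmpty())                        // first attempt: no header
check(retryHeaders(2) == mapOf("X-Retry-Count" to "2")) // second retry
```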

wenxi-zeng
wenxi-zeng previously approved these changes Mar 13, 2026
Enables time manipulation in tests for proper retry chain testing.

Changes:
- EventPipeline: Add optional timeProvider constructor parameter
  - Defaults to SystemTimeProvider() for production use
  - Passes timeProvider to RetryStateMachine for consistent time source
  - Enables tests to inject FakeTimeProvider for time manipulation

Benefits:
- Tests can now verify complete retry chains (429 → wait → 200)
- Tests can verify exponential backoff delays
- Tests can verify Retry-After header compliance
- No production code behavior changes (defaults to SystemTimeProvider)

This infrastructure enables future chain tests that verify:
- 429 → (time passes) → 200 success
- 500 → backoff → 500 → backoff → 200 success
- X-Retry-Count header increments through retry sequence
- Rate limit state expires correctly after wait time

All existing tests pass (76 tests)
Adds 5 comprehensive chain tests that verify complete retry sequences:

Simple Tests (verify time manipulation infrastructure):
1. FakeTimeProvider integration - Confirms time manipulation works
2. Time advance blocks uploads - Uploads blocked during rate limit
3. Time advance allows uploads - Uploads resume after rate limit expires

Full Chain Tests (verify complete retry cycles):
4. 429 → 429 → 200 - Multiple rate limiting then success
   - First 429: Pipeline enters RATE_LIMITED state (30s wait)
   - Second 429: Pipeline stays rate-limited (another 30s wait)
   - Success: Pipeline returns to READY state
   - Proves: Retry-After headers respected, rate limit blocking works

5. 500 → backoff → 200 - Exponential backoff then success
   - 500 error: Creates BatchMetadata with nextRetryTime
   - Time advances past backoff: Batch becomes eligible for retry
   - Success: BatchMetadata cleared, batch deleted
   - Proves: Per-batch backoff tracking, exponential delays work

Key Testing Patterns:
- Use FakeTimeProvider.setTime() to manipulate time instantly
- Use >= assertions (not ==) for upload counts (multiple batches may queue)
- Verify state persistence (loadRetryState checks)
- Verify cleanup on success (metadata cleared)

Test Results:
- All 5 chain tests passing
- Total test count: 81 tests (76 → 81)
- No regression in existing tests

Benefits:
- Fast execution (milliseconds instead of minutes of real waits)
- Deterministic (no timing-dependent flakiness)
- Comprehensive coverage of retry chains
- Proves timeProvider infrastructure works end-to-end
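The TimeProvider/FakeTimeProvider names come from the commits above, but the interface shape below is an assumed sketch of the injection pattern, not the real API. It shows how setTime() lets a chain test cross a rate-limit window instantly.

```kotlin
// Assumed sketch of the time-injection pattern; not the SDK's actual interface.
interface TimeProvider { fun currentTimeMillis(): Long }

class FakeTimeProvider(private var now: Long = 0L) : TimeProvider {
    override fun currentTimeMillis(): Long = now
    fun setTime(t: Long) { now = t }
}

// A rate limit that expires against the injected clock.
class RateLimit(private val time: TimeProvider) {
    private var until = 0L
    fun onRetryAfterSeconds(s: Long) { until = time.currentTimeMillis() + s * 1_000 }
    fun blocked(): Boolean = time.currentTimeMillis() < until
}

// Chain-test pattern: 429 -> (clock jumps forward) -> uploads resume.
val clock = FakeTimeProvider()
val limit = RateLimit(clock)
limit.onRetryAfterSeconds(30)
check(limit.blocked())  // inside the 30s window
clock.setTime(31_000L)  // jump past the window, no real waiting
check(!limit.blocked()) // uploads eligible again
```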
@codecov

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 70.57143% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.73%. Comparing base (b958b36) to head (80dbe31).
⚠️ Report is 30 commits behind head on main.

⚠️ Current head 80dbe31 differs from pull request most recent head de79472

Please upload reports for the commit de79472 to get more accurate results.

Files with missing lines Patch % Lines
...otlin/core/platform/EventPipelineSmartRetryTest.kt 59.88% 1 Missing and 66 partials ⚠️
...nt/analytics/kotlin/core/platform/EventPipeline.kt 69.64% 11 Missing and 6 partials ⚠️
...e/platform/EventPipelineRetryInfrastructureTest.kt 85.58% 0 Missing and 16 partials ⚠️
...segment/analytics/kotlin/core/retry/RetryConfig.kt 25.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #300      +/-   ##
============================================
- Coverage     78.13%   73.73%   -4.40%     
- Complexity      592      861     +269     
============================================
  Files            80      106      +26     
  Lines          7275    10024    +2749     
  Branches        922     1669     +747     
============================================
+ Hits           5684     7391    +1707     
- Misses          861     1214     +353     
- Partials        730     1419     +689     


wenxi-zeng
wenxi-zeng previously approved these changes Mar 24, 2026
* Fix SDK bugs found by e2e retry tests, wire e2e-cli for failure detection

EventPipeline fixes:
- Smart retry override: use RetryStateMachine.shouldDeleteBatch() instead
  of legacy handleUploadException for statusCodeOverrides alignment
  (410/460 now retry, 501/505 now drop)
- Case-insensitive Retry-After header lookup (OkHttp/HTTP2 lowercases)
- X-Retry-Count only sent on retries, omitted on first attempt
- DropBatch decision now reports error via reportInternalError

RetryStateMachine:
- Add shouldDeleteBatch() public method for EventPipeline cleanup decisions

e2e-cli:
- Wire errorHandler to detect delivery failures (HTTP errors, dropped batches)
- Wire maxRetries from test config to SDK BackoffConfig
- Clear delivery errors before each flush cycle
- Report failure when retries exhausted or batch dropped

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix casing for "Retry-After" header lookup and add a test ensuring OkHttp lowercases headers.

* fix: Use retryStateMachine.shouldDeleteBatch in handleUploadException

Move shouldDeleteBatch decision into handleUploadException instead of
overriding cleanup after the fact with effectiveStatusCode. This avoids
the httpConfig!=null gate (which doesn't respect enabled flags) and the
null-stream edge case (statusCode=0 defaulting to 500 causing retries
instead of cleanup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: Enable retry test suite in e2e-config.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Didier Garcia <digarcia@twilio.com>
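The case-insensitive Retry-After fix above can be sketched as below. The map-based lookup is an illustration only; the real code reads headers off the HTTP response object.

```kotlin
// Illustrative case-insensitive header lookup: HTTP/2 (and OkHttp) lowercase
// header names, so an exact "Retry-After" match would miss "retry-after".
fun retryAfterSeconds(headers: Map<String, String>): Long? =
    headers.entries
        .firstOrNull { it.key.equals("Retry-After", ignoreCase = true) }
        ?.value?.toLongOrNull()

check(retryAfterSeconds(mapOf("retry-after" to "30")) == 30L)
check(retryAfterSeconds(mapOf("Retry-After" to "5")) == 5L)
check(retryAfterSeconds(mapOf("Content-Type" to "json")) == null)
```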
didiergarcia and others added 7 commits March 24, 2026 22:28
The 5-second delay was dominating retry timing tests, causing failures:
- Retry-After: 0 took 4916ms instead of immediate (<1s)
- Exponential backoff delays appeared constant (~5s) instead of growing
- 6th retry attempt timed out due to 5s eating into 20s budget
- First/last delay comparisons failed (4502ms vs 4923ms, both ~5s)

Reducing to 500ms:
- Respects exponential backoff timing (500ms, 1s, 2s, 4s visible)
- Allows Retry-After: 0 to be immediate
- Gives enough time for 6 retry attempts within 20s timeout
- Still provides brief window for initial flush/upload to complete

Expected test results:
- exponential-backoff: delays will grow visibly (500ms -> 4s)
- retry-after: Retry-After: 0 will retry in <1s
- retry-integration: outage recovery will show increasing delays
- All 6 retry attempts will complete within timeout

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Settings error-tolerance tests were failing with 500ms delay:
- "still sends events when settings returns 500" → 0 batch requests
- "still sends events when settings response is delayed" → 0 batch requests

Root cause: 500ms isn't enough time for:
1. Settings request to timeout/fail (~1-2s)
2. SDK to fallback to defaultSettings
3. EventPipeline to initialize
4. Events to be processed and sent

The 2-second delay balances:
- Fast enough for retry tests (46 tests, was 5s on main)
- Slow enough for settings tests (gives SDK time to recover from failures)

Timeline with 2s:
- t=0-1s: Settings request fails/times out
- t=1-2s: SDK initializes with defaults, sends events
- t=2s: First pendingUploads() check finds batch files

This preserves the retry test speedup (2s vs 5s) while fixing settings
test regressions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ming

Previous attempts:
- 5s fixed delay: retry tests failed (timing masked)
- 500ms: settings tests failed (not enough time for SDK recovery)
- 2s: still had failures in both categories

Root cause: Different test scenarios need different timing:
- Retry tests need to see retry behavior quickly
- Settings tests need SDK time to handle failures/timeouts

Solution: Adaptive polling strategy
1. Initial 1.5s delay for:
   - Settings requests to timeout/fail
   - SDK to fallback to defaultSettings
   - First batch file creation and upload attempt

2. Then poll at 200ms intervals (fast retry detection)
   - Catches Retry-After: 0 quickly (~200-400ms)
   - Detects exponential backoff timing changes
   - Ramps up to 500ms after 5 polls (1.3s total)

Expected results:
- Settings tests: 1.5s gives SDK time to recover from failures
- Retry-After: 0: ~1.5s total (initial) + ~200ms (poll) = ~1.7s
  - Note: may still fail <1s assertion if timing is tight
- Exponential backoff: 200ms polling reveals timing differences

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previous approach: Fixed initial delays (500ms, 2s, 3s) to allow SDK initialization
Problem: Creates baseline overhead that breaks timing-sensitive tests

Root cause analysis:
- Settings tests need 2-3s for SDK to handle failures → but batch appears when ready
- Retry-After: 0 test needs <1s → but fixed delays add unavoidable overhead
- No single fixed delay satisfies both test types

New approach: Pure adaptive polling with no initial delay
- Start: 100ms polls (aggressive for quick tests)
- After 1s: ramp to 500ms polls
- After 4s: ramp to 1s polls
- Let tests naturally complete when ready instead of guessing timing

Timeline for Retry-After: 0:
- t=0: flush() called
- t=100-400ms: SDK creates batch, polls detect it
- Total: ~300-600ms (should pass <1s assertion)

Timeline for settings slow response:
- t=0-1s: Fast polls while settings times out
- t=1-2s: SDK falls back to defaults, processes events
- t=2s: Batch created
- t=2.1s: Next poll detects it → success

Benefits:
- No wasted time on fixed delays
- Fast tests complete quickly
- Slow tests get full timeout duration
- Polling overhead is minimal (100-500ms)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The aggressive polling approach (100ms from start) broke SDK initialization
by checking for batch files before they were created. SDK needs time to:
- Handle settings request/timeout
- Fall back to defaultSettings
- Initialize EventPipeline
- Create batch files

This restores a fixed 1.2s initial delay followed by adaptive polling
(200ms → 500ms → 1s). This overhead is inherent to the CLI architecture.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
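The poll-interval ramp described in these commits (fast polls first, slower later) can be sketched as a pure function; the exact thresholds varied across commits, so the values below are illustrative.

```kotlin
// Illustrative poll-interval ramp: 200ms early, 500ms mid, 1s late.
// Thresholds are assumptions; the commits above tried several combinations.
fun pollIntervalMs(elapsedMs: Long): Long = when {
    elapsedMs < 1_000 -> 200L
    elapsedMs < 4_000 -> 500L
    else -> 1_000L
}

check(pollIntervalMs(0) == 200L)       // catch Retry-After: 0 quickly
check(pollIntervalMs(2_500) == 500L)   // steady state
check(pollIntervalMs(10_000) == 1_000L) // long-running settings/outage tests
```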
Trying to balance SDK initialization needs with Retry-After: 0 test
requirement (<1000ms). The 1212ms result with 1.2s delay suggests most
overhead is in the fixed delay itself.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Reduced HTTP timeouts from 15s/20s to 2s/2s in e2e-cli. This allows
settings error tests to fail fast and fall back to defaultSettings
quickly, enabling shorter initial delay (300ms) that should satisfy
Retry-After: 0 timing requirements (<1s between requests).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
wenxi-zeng
wenxi-zeng previously approved these changes Mar 25, 2026
The poll loop was breaking on pendingUploads().isEmpty() before the SDK
finished initialization. When settings fetch is slow or fails, events
stay queued (processing paused) and no batch files exist yet. The CLI
mistook this for 'all events delivered' and exited with success=true.

Fix: track whether we've ever seen batch files. Only break when pending
goes from non-empty to empty (all uploaded/dropped). If we've never
seen batch files, keep polling until they appear or we time out.

This fixes all 5 settings error-tolerance tests that were failing.
59/59 e2e tests now pass (basic, retry, settings).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
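The poll-loop fix above can be sketched as a small state machine over the observed pending counts; the function below is an illustration of the logic, not the CLI's actual code.

```kotlin
// Sketch of the fixed poll loop: 'pending is empty' only means 'delivered'
// after batch files have been seen at least once; otherwise keep polling.
fun deliveredAfterPolls(pendingCounts: List<Int>): Boolean {
    var everSawBatches = false
    for (pending in pendingCounts) {
        if (pending > 0) everSawBatches = true
        else if (everSawBatches) return true // non-empty -> empty: all uploaded/dropped
        // never seen batches yet: SDK still initializing, keep polling
    }
    return false // timed out without confirming delivery
}

check(!deliveredAfterPolls(listOf(0, 0, 0)))   // old bug: would have returned true early
check(deliveredAfterPolls(listOf(0, 3, 1, 0))) // batches appeared, then drained
```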
- Read httpConfig from CDN integration settings in SegmentDestination.update()
  and propagate to EventPipeline via updateHttpConfig()
- Default enabled to true for CDN-sourced configs (presence implies active)
- Add @SerialName annotations to RetryBehavior enum for lowercase JSON
  compatibility ("retry"/"drop" from CDN)
- Enforce rateLimitConfig.maxRetryCount via globalRetryCount in
  RetryStateMachine.shouldUploadBatch()
- Add retry-settings test suite to e2e-config.json

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
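The @SerialName change maps the CDN's lowercase "retry"/"drop" values onto the enum via kotlinx.serialization; this stdlib-only sketch shows the equivalent mapping without the serialization plugin.

```kotlin
// The PR uses kotlinx.serialization @SerialName("retry")/@SerialName("drop");
// this stdlib-only sketch shows the same lowercase-JSON-to-enum mapping.
enum class RetryBehavior {
    RETRY, DROP;

    companion object {
        fun fromJson(name: String): RetryBehavior = valueOf(name.uppercase())
    }
}

check(RetryBehavior.fromJson("retry") == RetryBehavior.RETRY)
check(RetryBehavior.fromJson("drop") == RetryBehavior.DROP)
```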
didiergarcia and others added 2 commits March 30, 2026 10:16
Update test JSON to use lowercase "retry"/"drop" instead of "RETRY"/"DROP"
to match the @SerialName annotations added for CDN JSON compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>