Fix NOT_LEADER_FOR_PARTITION error with Confluent Cloud #8

Merged
kamir merged 1 commit into main from claude/fix-kshark-confluent-error-XOQbE on Feb 11, 2026

@kamir commented Feb 11, 2026

This commit addresses the "Not Leader For Partition" error that occurs when using kshark with Confluent Cloud clusters.

Root Cause:

  • The Writer's Transport was not properly using the configured Dialer
  • Transport created new connections without SASL/TLS from the Dialer
  • Bootstrap servers were used directly instead of allowing metadata discovery of actual partition leader brokers

Changes:

  1. Configure Transport to use Dialer's DialContext method

    • Ensures all connections use proper SASL/TLS authentication
    • Enables correct broker discovery for managed Kafka services
    • Fixes metadata refresh for Confluent Cloud's dynamic brokers
  2. Add retry logic for metadata-related errors

    • Retry up to 3 times with linearly increasing backoff (500ms, 1s, 1.5s)
    • Specifically handle NOT_LEADER_FOR_PARTITION and LeaderNotAvailable
    • Non-retryable errors fail immediately without wasted retries
    • Detailed logging for debugging retry attempts

References:

  • segmentio/kafka-go#1078
  • segmentio/kafka-go#712

https://claude.ai/code/session_01VSnKepTAD53ZUfXYDDvXwa

@kamir merged commit 478b61b into main on Feb 11, 2026
1 check passed