Fix NOT_LEADER_FOR_PARTITION error with Confluent Cloud #8
Merged
This commit addresses the "Not Leader For Partition" error that occurs when using kshark with Confluent Cloud clusters.

Root Cause:
- The Writer's Transport was not using the configured Dialer
- The Transport created new connections without the Dialer's SASL/TLS settings
- Bootstrap servers were used directly instead of allowing metadata discovery of the actual partition leader brokers

Changes:
1. Configure Transport to use Dialer's DialContext method
   - Ensures all connections use proper SASL/TLS authentication
   - Enables correct broker discovery for managed Kafka services
   - Fixes metadata refresh for Confluent Cloud's dynamic brokers
2. Add retry logic for metadata-related errors
   - Retry up to 3 times with linearly increasing backoff (500ms, 1s, 1.5s)
   - Specifically handle NOT_LEADER_FOR_PARTITION and LeaderNotAvailable
   - Non-retryable errors fail immediately without wasted retries
   - Detailed logging for debugging retry attempts

References:
- segmentio/kafka-go#1078
- segmentio/kafka-go#712

https://claude.ai/code/session_01VSnKepTAD53ZUfXYDDvXwa