@k-raina (Member) commented Nov 7, 2025

Summary

This PR fixes a critical usability issue where Consumer.poll(), Consumer.consume(), Producer.poll(), and Producer.flush() would block indefinitely and not respond to Ctrl+C (KeyboardInterrupt) signals.

Fixes: #209 and #807

Problem

When calling blocking operations with infinite or very long timeouts:

  • Consumer.poll() / Consumer.consume() - would block indefinitely waiting for messages
  • Producer.poll() / Producer.flush() - would block indefinitely waiting for delivery callbacks or queue flushing

Because these operations release the Python Global Interpreter Lock (GIL) during blocking calls to the underlying librdkafka C library, Python's signal handling mechanism couldn't detect Ctrl+C signals, making it impossible to gracefully interrupt these operations.

Solution

The fix implements a "wakeable poll" pattern that:

  1. Chunks long timeouts: Instead of making a single blocking call with the full timeout, the implementation breaks it into smaller 200ms chunks

  2. Periodic signal checking: Between chunks, the code re-acquires the Python GIL and calls PyErr_CheckSignals() to detect pending KeyboardInterrupt signals

  3. Immediate interruption: If a signal is detected, the operation returns immediately, allowing the KeyboardInterrupt to propagate to the Python code
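
The three steps above can be sketched in pure Python. This is only an analogue of the pattern, not the code in this PR: the actual change lives in the C extension and calls PyErr_CheckSignals() after re-acquiring the GIL, whereas in Python the interpreter runs pending signal handlers on its own between the short calls. The names `wakeable_wait`, `blocking_call`, and `CHUNK_SECONDS` are hypothetical.

```python
import time

CHUNK_SECONDS = 0.2  # mirrors the 200ms chunk size described above


def wakeable_wait(blocking_call, timeout):
    """Call blocking_call(chunk) until it returns a non-None result
    or the overall timeout (in seconds) expires.

    A negative timeout means "wait forever". Because each call blocks
    for at most CHUNK_SECONDS, a pending KeyboardInterrupt is raised
    within roughly one chunk instead of after the full timeout.
    """
    deadline = None if timeout < 0 else time.monotonic() + timeout
    while True:
        if deadline is None:
            chunk = CHUNK_SECONDS
        else:
            # Never wait past the caller's deadline.
            chunk = min(CHUNK_SECONDS, max(0.0, deadline - time.monotonic()))
        result = blocking_call(chunk)
        if result is not None:
            return result
        if deadline is not None and time.monotonic() >= deadline:
            return None
```

A zero timeout still makes exactly one non-blocking call, which matches the usual meaning of `poll(0)`.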

Impact

This fix significantly improves the developer and operational experience for applications using the Python client.

Before this fix:

  • Developers had to forcefully kill processes during development and testing
  • No way to gracefully stop consumer/producer loops during debugging
  • Producer flush() operations couldn't be interrupted, causing issues during shutdown

After this fix:

  • Developers can use standard Ctrl+C to interrupt all blocking operations
  • Clean shutdown during development and testing
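
As a rough illustration of what this enables, the sketch below shows a consume loop that relies on Ctrl+C reaching Python while poll() is blocking. The function name `consume_until_interrupted` and the `handle` callback are hypothetical; `consumer` is assumed to behave like a confluent_kafka Consumer (poll(), close(), and messages exposing error() and value()).

```python
def consume_until_interrupted(consumer, handle, poll_timeout=-1):
    """Poll `consumer` until Ctrl+C, then close it cleanly.

    Returns the number of messages passed to `handle`.
    """
    handled = 0
    try:
        while True:
            msg = consumer.poll(timeout=poll_timeout)
            if msg is None:
                continue  # timed out without a message
            if msg.error():
                continue  # a real application would log or raise here
            handle(msg.value())
            handled += 1
    except KeyboardInterrupt:
        # With the wakeable pattern, Ctrl+C is seen even during a
        # poll() call that uses an infinite timeout.
        pass
    finally:
        consumer.close()
    return handled
```

Before this change, a loop like this would only notice Ctrl+C if `poll_timeout` was kept short by the caller.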

Testing

Consumer Tests (tests/test_Consumer.py)

  1. Utility function tests:

    • test_calculate_chunk_timeout_utility_function(): Tests chunk timeout calculation logic
    • test_check_signals_between_chunks_utility_function(): Tests signal detection between chunks
    • test_wakeable_poll_utility_functions_interaction(): Tests interaction between both utilities
  2. Poll interruptibility tests:

    • test_wakeable_poll_interruptibility_and_messages(): Tests poll() can be interrupted and still handles messages correctly
    • test_wakeable_poll_edge_cases(): Tests edge cases (zero timeout, closed consumer, short timeouts)
  3. Consume interruptibility tests:

    • test_wakeable_consume_interruptibility_and_messages(): Tests consume() can be interrupted and still handles messages correctly
    • test_wakeable_consume_edge_cases(): Tests edge cases (zero timeout, invalid parameters, short timeouts)

Producer Tests (tests/test_Producer.py)

  1. Utility function tests:

    • test_wakeable_poll_utility_functions_interaction(): Tests chunk calculation and signal checking work together
  2. Poll interruptibility tests:

    • test_wakeable_poll_interruptibility_and_messages(): Tests poll() can be interrupted and still handles delivery callbacks correctly
    • test_wakeable_poll_edge_cases(): Tests edge cases (zero timeout, closed producer, short timeouts)
  3. Flush interruptibility tests:

    • test_wakeable_flush_interruptibility_and_messages(): Tests flush() can be interrupted and still delivers messages correctly
    • test_wakeable_flush_edge_cases(): Tests edge cases (zero timeout, closed producer, short timeouts, empty queue)

Shared Utility Tests (tests/test_wakeable_utilities.py)

  • Tests for newly added utility functions

Integration Tests

  • tests/integration/consumer/test_consumer_wakeable_poll_consume.py: Verifies wakeable pattern doesn't interfere with normal message consumption
  • tests/integration/producer/test_producer_wakeable_poll_flush.py: Verifies wakeable pattern doesn't interfere with normal message production and delivery callbacks

All tests use a helper function TestUtils.send_sigint_after_delay() to simulate Ctrl+C in automated tests.
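
For readers who have not seen the helper, a minimal standalone sketch of the idea follows. It is an assumption about the shape of such a helper, not the PR's actual TestUtils code, and it assumes a POSIX platform where os.kill() with SIGINT targets the current process.

```python
import os
import signal
import threading
import time


def send_sigint_after_delay(delay_seconds):
    """Start a background thread that sends SIGINT to this process
    after `delay_seconds`, and return the thread."""

    def _send():
        time.sleep(delay_seconds)
        # No try/except here: an OSError from os.kill() should be
        # reported rather than silently swallowed.
        os.kill(os.getpid(), signal.SIGINT)

    thread = threading.Thread(target=_send, daemon=True)
    thread.start()
    return thread
```

A test would typically start the helper, enter a blocking poll() or flush(), and assert that KeyboardInterrupt is raised shortly after the delay.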

Performance Impact

  • Minimal overhead: Only adds signal checks between 200ms chunks
  • No impact on short timeouts: Timeouts < 200ms bypass chunking entirely
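
To make the overhead concrete, the sketch below works out the per-call timeouts for a finite timeout. `chunk_schedule` is a hypothetical illustration and not the PR's calculate_chunk_timeout(); only the 200ms chunk size comes from this PR.

```python
CHUNK_TIMEOUT_MS = 200  # chunk size used by the wakeable pattern


def chunk_schedule(timeout_ms, chunk_ms=CHUNK_TIMEOUT_MS):
    """Return the list of per-call timeouts that cover `timeout_ms`."""
    if timeout_ms < 0:
        raise ValueError("infinite timeouts have no finite schedule")
    if timeout_ms < chunk_ms:
        return [timeout_ms]  # short timeout: one call, no chunking
    chunks = []
    remaining = timeout_ms
    while remaining > 0:
        step = min(chunk_ms, remaining)
        chunks.append(step)
        remaining -= step
    return chunks


print(chunk_schedule(150))         # [150]
print(chunk_schedule(500))         # [200, 200, 100]
print(len(chunk_schedule(30000)))  # 150
```

A 30-second timeout therefore splits into 150 calls, which means at most 149 signal checks between chunks over the whole wait.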

Manual Testing

Reproduction scripts

test_wakeable_consume_interrupt.py
test_wakeable_poll_interrupt.py
test_wakeable_producer_flush_interrupt.py
test_wakeable_producer_poll_interrupt.py

Example Run

 Queue before flush: 101000 messages
  (Many messages are in queue and in-flight, waiting for acknowledgment)

[20:45:54] Calling flush() with infinite timeout...
    (This will block until all messages are flushed or Ctrl+C)
    Background thread is continuously adding messages to the queue
    Current queue length: 101000 messages
    Press Ctrl+C to test interruptibility...

^C
======================================================================
✓ KeyboardInterrupt caught!
  Interrupted after 1.07 seconds
  Background thread produced 1167722 messages during this time
  (Had 53 production errors)
  With wakeable pattern, interruption should occur within ~200ms of Ctrl+C
  ⚠ Interruption took 1.07s (may indicate wakeable pattern issue)
======================================================================

Stopping background producer thread...
  Final stats: 1167722 messages produced
  (53 production errors)

Closing producer...
Producer closed.
Traceback (most recent call last):
  File "/Users/kaushikraina/projects/njc/confluent-kafka-python/test_wakeable_producer_flush_interrupt.py", line 292, in <module>
    main()
  File "/Users/kaushikraina/projects/njc/confluent-kafka-python/test_wakeable_producer_flush_interrupt.py", line 223, in main
    remaining = producer.flush(timeout=-1.0)  # Infinite timeout - will block
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

Copilot AI review requested due to automatic review settings November 7, 2025 15:53
@k-raina k-raina requested review from a team and MSeal as code owners November 7, 2025 15:53
@confluent-cla-assistant

🎉 All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

Copilot AI left a comment

Pull Request Overview

This PR fixes critical usability issues where Consumer.poll() and Consumer.consume() would block indefinitely and not respond to Ctrl+C (KeyboardInterrupt) signals. The solution implements a "wakeable poll" pattern that chunks long timeouts into 200ms intervals and periodically checks for pending signals between chunks, allowing proper signal handling and graceful interruption.

Key Changes:

  • Implemented helper functions calculate_chunk_timeout() and check_signals_between_chunks() in C code to enable interruptible polling
  • Modified Consumer.poll() and Consumer.consume() to use chunked polling with periodic signal checks
  • Added comprehensive test coverage for utility functions, interruptibility, edge cases, and message handling

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/test_Consumer.py | Added utility helper send_sigint_after_delay() and 7 new test functions covering chunk timeout calculation, signal detection, utility function interaction, poll/consume interruptibility, and edge cases |
| src/confluent_kafka/src/Consumer.c | Implemented wakeable poll pattern with helper functions and refactored Consumer_poll() and Consumer_consume() to use chunked polling with signal checking |
| CHANGELOG.md | Documented the fix for blocking poll/consume operations |


*
* Instead of a single blocking call to rd_kafka_consumer_poll() with the
* full timeout, this function:
* 1. Splits the timeout into 200ms chunks
Copilot AI Nov 7, 2025

[nitpick] The function documentation would benefit from documenting the CHUNK_TIMEOUT_MS constant value (200ms) in the description to match the implementation details mentioned in comment lines.

Suggested change
- * 1. Splits the timeout into 200ms chunks
+ * 1. Splits the timeout into 200ms chunks (CHUNK_TIMEOUT_MS = 200ms)

return NULL;
}

/* Create Python list from messages */
Copilot AI Nov 7, 2025

Extra whitespace before closing comment marker.

Suggested change
/* Create Python list from messages */
/* Create Python list from messages */

MSeal previously approved these changes Nov 7, 2025

@MSeal (Contributor) left a comment

One minor comment. I tested this locally with a 30s timeout against master, and being able to interrupt at any time is a much better experience.

    time.sleep(delay_seconds)
    try:
        os.kill(os.getpid(), signal.SIGINT)
    except Exception:
Contributor

This should probably catch KeyboardInterrupt here instead, so that other errors are not masked.

@k-raina (Member, Author) Nov 10, 2025

Thanks. os.kill() raises OSError, not KeyboardInterrupt. I have removed the try/except block, since any exception raised in this function should not be masked.

MSeal previously approved these changes Nov 10, 2025
@sonarqube-confluent

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube

@airlock-confluentinc airlock-confluentinc bot force-pushed the fix/wakeable-poll-issues-209-807 branch from dcc1639 to 6180f01 Compare November 26, 2025 14:48
@airlock-confluentinc airlock-confluentinc bot force-pushed the fix/wakeable-poll-issues-209-807 branch from 6180f01 to 5a86d02 Compare November 26, 2025 15:08
@k-raina k-raina changed the title from "Make consumer - consumer and poll wakeable" to "Make long waiting consumer and producer API wakeable" Nov 26, 2025
@k-raina k-raina requested a review from MSeal November 26, 2025 16:52
@sonarqube-confluent

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube


Development

Successfully merging this pull request may close these issues.

3 participants