Conversation


@lyqfjcs lyqfjcs commented Sep 30, 2025

[PR] Fix potential deadlock between PingCheckMgr and PingChannel
Problem:
https://gitlab.isc.org/isc-projects/kea/-/issues/4140
A potential deadlock exists due to inconsistent lock acquisition order between PingCheckMgr and PingChannel:
Path 1:
PingChannel::sendNext() (holds the PingChannel lock)
→ calls back into PingCheckMgr::nextToSend()
→ checkSuspended() attempts to acquire the PingCheckMgr lock.

Path 2:
PingCheckMgr::expirationTimedOut() (holds the PingCheckMgr lock)
→ calls channel_->startSend() / channel_->startRead()
→ these attempt to acquire the PingChannel lock.
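
For illustration, the inversion reduces to the classic two-mutex pattern below (a toy sketch with plain std::mutex and stand-in names, not the actual Kea code):

    #include <mutex>

    std::mutex channel_mtx;  // stands in for PingChannel::mutex_
    std::mutex mgr_mtx;      // stands in for PingCheckMgr::mutex_

    // Path 1: channel lock first, then manager lock.
    void sendNextPath() {
        std::lock_guard<std::mutex> ch(channel_mtx);  // PingChannel::sendNext()
        std::lock_guard<std::mutex> mgr(mgr_mtx);     // nextToSend() -> checkSuspended()
    }

    // Path 2: manager lock first, then channel lock.
    void expirationPath() {
        std::lock_guard<std::mutex> mgr(mgr_mtx);     // PingCheckMgr::expirationTimedOut()
        std::lock_guard<std::mutex> ch(channel_mtx);  // startSend() / startRead()
    }

    // Two threads running these paths concurrently can each end up holding
    // one mutex while waiting forever for the other: a lock order inversion.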

Fix

Move the calls to channel_->startSend() / channel_->startRead() outside the critical section guarded by PingCheckMgr::mutex_.

In the current implementation this was already mostly the case; this change ensures the call site in expirationTimedOut() consistently follows the same pattern of calling into the channel without holding the manager's mutex.
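
A minimal sketch of the resulting shape — haveMorePings() and PingChannelPtr are hypothetical stand-ins, not the exact Kea code:

    // Sketch: compute the decision under the manager lock, touch the
    // channel only after the lock is released.
    void PingCheckMgr::expirationTimedOut() {
        bool more_pings = false;
        PingChannelPtr channel;
        {
            MultiThreadingLock lock(*mutex_);
            // ... expire outstanding pings (elided) ...
            more_pings = haveMorePings();  // decision made inside the lock
            channel = channel_;            // snapshot taken inside the lock
        }
        // Outside the critical section: the channel's mutex is never
        // acquired while the manager's mutex is held.
        if (more_pings && channel) {       // null check guards a concurrent close
            channel->startSend();
            channel->startRead();
        }
    }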


Concurrency Considerations

  • Use of more_pings:
    • Computed inside the lock, used outside only as a flag to decide whether to wake the channel.
    • If the state changes after the lock is released, the worst outcomes are:
      • one extra wakeup, or
      • a skipped wakeup that turns out to be unnecessary.
    • Neither case breaks correctness.
  • Access to channel_:
    • Guarded with a null check before use.
    • If another thread concurrently closes the channel and sets it to nullptr, the null check prevents invalid access.
    • At most, a wakeup is skipped — not a correctness issue.

Impact

  • Eliminates the lock order inversion between PingCheckMgr and PingChannel.
  • Ensures PingCheckMgr does not call into PingChannel while holding its mutex.
  • No functional changes to ping logic; only improved concurrency safety.

@razvan-becheriu
Contributor

Thank you for finding this race. I am not sure the provided fix is correct: keeping a reference to the channel while stop is called might leave some resources unreleased (a memory leak caused by not calling io_service_->stopAndPoll() after the channel's last reference has been released). I'll try to fix the intertwined mutex locking between PingCheckMgr and PingChannel.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

> Thank you for finding this race. I am not sure the provided fix is correct: keeping a reference to the channel while stop is called might leave some resources unreleased (a memory leak caused by not calling io_service_->stopAndPoll() after the channel's last reference has been released). I'll try to fix the intertwined mutex locking between PingCheckMgr and PingChannel.

Thanks for pointing this out. I think the current order (channel_.reset() before io_service_->stopAndPoll()) is still safe:

  1. PingChannel::~PingChannel() only calls close(), which synchronously unregisters/cleans up sockets and does not post new tasks.
  2. Any tasks already posted (startSend()/startRead()) capture shared_from_this(), so the object will stay alive until those tasks run.
  3. stopAndPoll() flushes the ready tasks, ensuring proper cleanup.
  4. Since close() sets canSend()/canRead() to false, any late-running callbacks will exit without scheduling further I/O.

So while the order may look unusual, I think it does not risk leaks or use-after-free.
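
A compressed sketch of that shutdown order, assuming a simplified stop path (names are stand-ins, not the actual implementation):

    void PingCheckMgr::stop() {
        channel_.reset();            // ~PingChannel() runs close(): sockets are
                                     // torn down synchronously, no new tasks posted.
        io_service_->stopAndPoll();  // flush handlers already posted; each captured
                                     // shared_from_this(), so they run safely before
                                     // the true last reference goes away.
    }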

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

Another possible fix would be to modify PingChannel::sendNext(): release the channel lock before invoking next_to_send_cb_, then reacquire the channel lock afterward for a second check and the send. This would unify the lock acquisition order and avoid the deadlock (a sketch follows the list below).
lyqfjcs@e950f05

Why I think this might be safe:

  1. The callback runs after releasing the channel lock, so we avoid lock order inversion.
  2. After reacquiring the lock, we re-check canSend(); if the state has changed, we safely return.
  3. sending_ is still updated inside the locked region, so only one send operation is active.
  4. If the channel is closed between unlock and asyncSend(), the existing try/catch already handles it by logging and calling stopChannel().
  5. Posted IOService callbacks capture shared_from_this(), so the object lifetime is preserved.
  6. next_to_send_cb_ has its own manager-side locking, so calling it without the channel lock should be safe.
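
Here is the sketch of that shape, with simplified stand-in names rather than the exact Kea code:

    void PingChannel::sendNext() {
        {
            MultiThreadingLock lock(*mutex_);
            if (!canSend() || sending_) {
                return;
            }
        }   // Channel lock released before calling into the manager.

        // The callback takes the manager's own mutex internally; no channel
        // lock is held here, so the acquisition order stays consistent.
        IOAddress target("0.0.0.0");
        if (!(next_to_send_cb_(target))) {
            return;  // Nothing to send.
        }

        MultiThreadingLock lock(*mutex_);
        if (!canSend()) {
            return;      // State changed while unlocked; bail out safely.
        }
        sending_ = true;    // Still updated under the lock: one send in flight.
        asyncSend(target);  // In the real code, the existing try/catch around
                            // the send handles a late close via stopChannel().
    }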

Do you think this alternative would be acceptable, or do you see potential issues I may have missed?

@razvan-becheriu
Contributor

Second approach seems better. Will try to understand if there are other possible issues.

@razvan-becheriu
Contributor

I think that moving:

        // Fetch the next one to send.
        IOAddress target("0.0.0.0");
        if (!((next_to_send_cb_)(target))) {
            // Nothing to send.
            return;
        }

before MultiThreadingLock lock(*mutex_); in PingChannel::sendNext() is the best and simplest approach.

The callback just returns the next target, if one is available, regardless of socket state, so it can be called at any time.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

Thank you for your review and helpful suggestions.

I've updated the code as discussed:
PingCheckMgr::nextToSend() has been split into two parts: a fetch-only callback (next_to_send_cb_) and an update callback (update_to_send_cb_).
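
A hypothetical rendering of how the split could look on the channel side (not the merged code; the actual commit is authoritative):

    void PingChannel::sendNext() {
        // Phase 1: fetch-only callback, invoked before the channel lock is
        // taken, so the manager can lock its own mutex without inversion.
        IOAddress target("0.0.0.0");
        if (!(next_to_send_cb_(target))) {
            return;  // Nothing to send.
        }

        {
            MultiThreadingLock lock(*mutex_);
            if (!canSend() || sending_) {
                return;
            }
            sending_ = true;  // Commit to this send under the channel lock.
        }

        // Phase 2: update callback, also outside the channel lock, lets the
        // manager record the chosen target under its own mutex.
        update_to_send_cb_(target);
        asyncSend(target);
    }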

Could you please take another look and confirm if this aligns with your proposal?
Thanks again for your time and feedback!

…ingChannel

Split PingCheckMgr::nextToSend into fetch/update phases to avoid lock inversion
@razvan-becheriu
Contributor

looks alright. ticket will be triaged, updated, reviewed and tested. thank you again for finding this and for your patience.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

> looks alright. ticket will be triaged, updated, reviewed and tested. thank you again for finding this and for your patience.

I’d be glad if my contribution could be acknowledged. Appreciate your review and looking forward to seeing it in the next dev release.

@razvan-becheriu
Contributor

Of course we will acknowledge you for this. Please provide the credit details to include in the ChangeLog: name/nickname/alias/company... whatever you would like us to mention.

Thank you.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

Thanks! You can use the name liyunqing_kylin
