Conversation


@lyqfjcs lyqfjcs commented Sep 30, 2025

[PR] Fix potential deadlock between PingCheckMgr and PingChannel
Problem:
https://gitlab.isc.org/isc-projects/kea/-/issues/4140
A potential deadlock exists due to inconsistent lock acquisition order between PingCheckMgr and PingChannel:
Path 1:
PingChannel::sendNext() (holds the PingChannel lock)
→ calls back into PingCheckMgr::nextToSend()
→ checkSuspended() attempts to acquire the PingCheckMgr lock.

Path 2:
PingCheckMgr::expirationTimedOut() (holds the PingCheckMgr lock)
→ calls channel_->startSend() / channel_->startRead()
→ these attempt to acquire the PingChannel lock.
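
For illustration, the inversion reduces to the classic two-mutex pattern below (a toy sketch with plain std::mutex and stand-in names, not the actual Kea code):

    #include <mutex>

    std::mutex channel_mtx;  // stands in for PingChannel::mutex_
    std::mutex mgr_mtx;      // stands in for PingCheckMgr::mutex_

    // Path 1: channel lock first, then manager lock.
    void sendNextPath() {
        std::lock_guard<std::mutex> ch(channel_mtx);  // PingChannel::sendNext()
        std::lock_guard<std::mutex> mgr(mgr_mtx);     // nextToSend() -> checkSuspended()
    }

    // Path 2: manager lock first, then channel lock.
    void expirationPath() {
        std::lock_guard<std::mutex> mgr(mgr_mtx);     // PingCheckMgr::expirationTimedOut()
        std::lock_guard<std::mutex> ch(channel_mtx);  // startSend() / startRead()
    }

    // Two threads running these paths concurrently can each end up holding
    // one mutex while waiting forever for the other: a lock order inversion.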

Fix

Move the calls to channel_->startSend() / channel_->startRead() outside the critical section guarded by PingCheckMgr::mutex_.

In the current implementation this was already mostly the case; this change ensures the call site in expirationTimedOut() consistently follows the same pattern of calling into the channel without holding the manager's mutex.
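
A minimal sketch of the resulting shape — haveMorePings() and PingChannelPtr are hypothetical stand-ins, not the exact Kea code:

    // Sketch: compute the decision under the manager lock, touch the
    // channel only after the lock is released.
    void PingCheckMgr::expirationTimedOut() {
        bool more_pings = false;
        PingChannelPtr channel;
        {
            MultiThreadingLock lock(*mutex_);
            // ... expire outstanding pings (elided) ...
            more_pings = haveMorePings();  // decision made inside the lock
            channel = channel_;            // snapshot taken inside the lock
        }
        // Outside the critical section: the channel's mutex is never
        // acquired while the manager's mutex is held.
        if (more_pings && channel) {       // null check guards a concurrent close
            channel->startSend();
            channel->startRead();
        }
    }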


Concurrency Considerations

  • Use of more_pings:
    • Computed inside the lock, used outside only as a flag to decide whether to wake the channel.
    • If the state changes after the lock is released, the worst outcomes are:
      • one extra wakeup, or
      • a skipped wakeup that turns out to be unnecessary.
    • Neither case breaks correctness.
  • Access to channel_:
    • Guarded with a null check before use.
    • If another thread concurrently closes the channel and sets it to nullptr, the null check prevents invalid access.
    • At most, a wakeup is skipped — not a correctness issue.

Impact

  • Eliminates the lock order inversion between PingCheckMgr and PingChannel.
  • Ensures PingCheckMgr does not call into PingChannel while holding its mutex.
  • No functional changes to ping logic; only improved concurrency safety.

@razvan-becheriu
Contributor

Thank you for finding this race. I am not sure the provided fix is correct: keeping a reference to the channel while stop is called might leave some resources unreleased (a memory leak caused by not calling io_service_->stopAndPoll() after the channel's last reference has been released). I'll try to fix the intertwined mutex locking between PingCheckMgr and PingChannel.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

> Thank you for finding this race. I am not sure the provided fix is correct: keeping a reference to the channel while stop is called might leave some resources unreleased (a memory leak caused by not calling io_service_->stopAndPoll() after the channel's last reference has been released). I'll try to fix the intertwined mutex locking between PingCheckMgr and PingChannel.

Thanks for pointing this out. I think the current order (channel_.reset() before io_service_->stopAndPoll()) is still safe:

  1. PingChannel::~PingChannel() only calls close(), which synchronously unregisters/cleans up sockets and does not post new tasks.
  2. Any tasks already posted (startSend()/startRead()) capture shared_from_this(), so the object will stay alive until those tasks run.
  3. stopAndPoll() flushes the ready tasks, ensuring proper cleanup.
  4. Since close() sets canSend()/canRead() to false, any late-running callbacks will exit without scheduling further I/O.

So while the order may look unusual, I think it does not risk leaks or use-after-free.
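
A compressed sketch of that shutdown order, assuming a simplified stop path (names are stand-ins, not the actual implementation):

    void PingCheckMgr::stop() {
        channel_.reset();            // ~PingChannel() runs close(): sockets are
                                     // torn down synchronously, no new tasks posted.
        io_service_->stopAndPoll();  // flush handlers already posted; each captured
                                     // shared_from_this(), so they run safely before
                                     // the true last reference goes away.
    }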

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

Another possible fix would be to modify PingChannel::sendNext(): release the channel lock before invoking next_to_send_cb_, then reacquire the channel lock afterward for a second check and the send. This would unify the lock acquisition order and avoid the deadlock (a sketch follows the list below).
lyqfjcs@e950f05

Why I think this might be safe:

  1. The callback runs after releasing the channel lock, so we avoid lock order inversion.
  2. After reacquiring the lock, we re-check canSend(); if the state has changed, we safely return.
  3. sending_ is still updated inside the locked region, so only one send operation is active.
  4. If the channel is closed between unlock and asyncSend(), the existing try/catch already handles it by logging and calling stopChannel().
  5. Posted IOService callbacks capture shared_from_this(), so the object lifetime is preserved.
  6. next_to_send_cb_ has its own manager-side locking, so calling it without the channel lock should be safe.
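
Here is the sketch of that shape, with simplified stand-in names rather than the exact Kea code:

    void PingChannel::sendNext() {
        {
            MultiThreadingLock lock(*mutex_);
            if (!canSend() || sending_) {
                return;
            }
        }   // Channel lock released before calling into the manager.

        // The callback takes the manager's own mutex internally; no channel
        // lock is held here, so the acquisition order stays consistent.
        IOAddress target("0.0.0.0");
        if (!(next_to_send_cb_(target))) {
            return;  // Nothing to send.
        }

        MultiThreadingLock lock(*mutex_);
        if (!canSend()) {
            return;      // State changed while unlocked; bail out safely.
        }
        sending_ = true;    // Still updated under the lock: one send in flight.
        asyncSend(target);  // In the real code, the existing try/catch around
                            // the send handles a late close via stopChannel().
    }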

Do you think this alternative would be acceptable, or do you see potential issues I may have missed?

@razvan-becheriu
Contributor

Second approach seems better. Will try to understand if there are other possible issues.

@razvan-becheriu
Contributor

I think that moving:

        // Fetch the next one to send.
        IOAddress target("0.0.0.0");
        if (!((next_to_send_cb_)(target))) {
            // Nothing to send.
            return;
        }

before MultiThreadingLock lock(*mutex_); in PingChannel::sendNext() is the best and simplest approach.

The callback just returns the next target, if one is available, regardless of socket state, so it can be called at any time.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

Thank you for your review and helpful suggestions.

I've updated the code as discussed:
PingCheckMgr::nextToSend() has been split into two parts: a fetch-only callback (next_to_send_cb_) and an update callback (update_to_send_cb_).
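
A hypothetical rendering of how the split could look on the channel side (not the merged code; the actual commit is authoritative):

    void PingChannel::sendNext() {
        // Phase 1: fetch-only callback, invoked before the channel lock is
        // taken, so the manager can lock its own mutex without inversion.
        IOAddress target("0.0.0.0");
        if (!(next_to_send_cb_(target))) {
            return;  // Nothing to send.
        }

        {
            MultiThreadingLock lock(*mutex_);
            if (!canSend() || sending_) {
                return;
            }
            sending_ = true;  // Commit to this send under the channel lock.
        }

        // Phase 2: update callback, also outside the channel lock, lets the
        // manager record the chosen target under its own mutex.
        update_to_send_cb_(target);
        asyncSend(target);
    }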

Could you please take another look and confirm if this aligns with your proposal?
Thanks again for your time and feedback!

…ingChannel

Split PingCheckMgr::nextToSend into fetch/update phases to avoid lock inversion
@razvan-becheriu
Contributor

looks alright. ticket will be triaged, updated, reviewed and tested. thank you again for finding this and for your patience.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

> looks alright. ticket will be triaged, updated, reviewed and tested. thank you again for finding this and for your patience.

I’d be glad if my contribution could be acknowledged. Appreciate your review and looking forward to seeing it in the next dev release.

@razvan-becheriu
Contributor

Of course we will acknowledge you for this. Please provide the credit details to include in the ChangeLog: name/nickname/alias/company... whatever you would like us to mention.

Thank you.

@lyqfjcs
Author

lyqfjcs commented Sep 30, 2025

Thanks! You can use the name liyunqing_kylin
