watchdog: alert on ET_NET thread stalls beyond threshold #12524
Run a watchdog thread to find blocking events.
Pull Request Overview
This PR implements a thread watchdog system to detect blocking events in the traffic server. The watchdog monitors network threads and alerts when they remain awake (not sleeping) longer than a configurable timeout threshold, indicating potential performance issues or blocking operations.
Key changes:
- Introduces a Watchdog::Monitor class that runs in a separate thread to monitor event loop health
- Adds heartbeat tracking to EThread instances to record sleep/wake timestamps
- Configures the watchdog timeout through a new configuration parameter
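The heartbeat idea in the key changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names Heartbeat, mark_wake, mark_sleep, awake_for, and is_stalled are hypothetical, and the assumption is that the event loop records a timestamp when it wakes and another when it goes back to sleep, while the monitor thread only reads them.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical names throughout; these are not the definitions in Watchdog.h.
using Clock = std::chrono::steady_clock;

struct Heartbeat {
  // Timestamps stored as nanoseconds since the steady_clock epoch so they fit in an atomic.
  std::atomic<int64_t> last_sleep_ns{0};
  std::atomic<int64_t> last_wake_ns{0};

  // Called by the event loop when it starts a work phase.
  void mark_wake(Clock::time_point now) { last_wake_ns.store(to_ns(now), std::memory_order_release); }

  // Called by the event loop just before it blocks waiting for events.
  void mark_sleep(Clock::time_point now) { last_sleep_ns.store(to_ns(now), std::memory_order_release); }

  static int64_t to_ns(Clock::time_point t)
  {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count();
  }
};

// How long the thread has been awake as of `now`; zero if it is sleeping or has never woken.
// A wake and sleep recorded at the same instant is treated as sleeping.
inline std::chrono::nanoseconds
awake_for(const Heartbeat &hb, Clock::time_point now)
{
  int64_t wake  = hb.last_wake_ns.load(std::memory_order_acquire);
  int64_t sleep = hb.last_sleep_ns.load(std::memory_order_acquire);
  if (wake <= sleep) {
    return std::chrono::nanoseconds{0};
  }
  return std::chrono::nanoseconds{Heartbeat::to_ns(now) - wake};
}

// The monitor's check: has this thread been awake longer than the timeout?
inline bool
is_stalled(const Heartbeat &hb, Clock::time_point now, std::chrono::nanoseconds timeout)
{
  return awake_for(hb, now) > timeout;
}
```

Passing the current time in as a parameter keeps the check a pure function, so it can be exercised without real sleeps. A monitor thread would presumably call is_stalled with Clock::now() for each net thread on every scan.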
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/traffic_server/traffic_server.cc | Creates and manages the watchdog instance, integrating it into server startup/shutdown |
src/records/RecordsConfig.cc | Adds configuration parameter for watchdog timeout |
src/iocore/eventsystem/Watchdog.cc | Implements the watchdog monitoring logic |
src/iocore/eventsystem/UnixEThread.cc | Adds heartbeat updates to the event loop |
src/iocore/eventsystem/CMakeLists.txt | Includes the new watchdog source file in the build |
include/iocore/eventsystem/Watchdog.h | Defines watchdog interfaces and heartbeat structure |
include/iocore/eventsystem/EThread.h | Adds heartbeat state to the EThread class |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Because Ubuntu 20, macOS 14, and FreeBSD 13 don't have it
This is a PR that would be nicer if we had
// Start the watchdog
int watchdog_timeout_ms = RecGetRecordInt("proxy.config.thread_watchdog.timeout_ms").value_or(1000);
watchdog = std::make_unique<Watchdog::Monitor>(eventProcessor.thread_group[ET_NET]._thread,
Should we make this optional? That is, if proxy.config.thread_watchdog == 0, we don't set up the watchdog?
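One way the suggestion above could look, as a sketch only: treat a timeout of zero (or less) as "disabled" and skip constructing the monitor. Monitor here is a hypothetical stand-in for Watchdog::Monitor, make_watchdog_if_enabled is an invented helper, and the std::optional argument stands in for whatever the record lookup returns. Whether the PR adopts this convention is not stated here.

```cpp
#include <chrono>
#include <cstdint>
#include <memory>
#include <optional>

// Hypothetical stand-in for Watchdog::Monitor; the real constructor takes more arguments.
struct Monitor {
  explicit Monitor(std::chrono::milliseconds t) : timeout(t) {}
  std::chrono::milliseconds timeout;
};

// Returns nullptr when the configured timeout is <= 0, meaning the watchdog is disabled.
// An unset value falls back to the default, mirroring the value_or(1000) in the diff.
inline std::unique_ptr<Monitor>
make_watchdog_if_enabled(std::optional<int64_t> configured_ms, int64_t default_ms = 1000)
{
  int64_t ms = configured_ms.value_or(default_ms);
  if (ms <= 0) {
    return nullptr; // assumption: 0 disables the watchdog, as the review comment proposes
  }
  return std::make_unique<Monitor>(std::chrono::milliseconds(ms));
}
```

With this shape, startup code would only need a null check on the returned pointer, and shutdown would not have to special-case a watchdog that was never started.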
Adds a watchdog thread that warns when a net thread remains in the work phase longer than a configurable duration.
Config:
proxy.config.thread_watchdog.timeout_ms (default: 1000)
Why:
Net threads should not stall; doing so adds latency to all transactions multiplexed on that thread. Stalls may indicate a misbehaving plugin, overload, or a Traffic Server bug.
On trigger, a warning is logged with the offending thread number and elapsed time to aid targeted diagnostics.
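For illustration, the reporting step described above might look something like the following. This is a hedged sketch: stall_warnings is a hypothetical helper, the message text is invented and is not the log line this PR emits, and the input is assumed to be the time each net thread has currently spent in its work phase (zero when sleeping).

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

using std::chrono::milliseconds;

// Hypothetical helper: awake_times[i] is how long net thread i has been in its
// work phase. Returns one warning per thread that exceeds the timeout.
inline std::vector<std::string>
stall_warnings(const std::vector<milliseconds> &awake_times, milliseconds timeout)
{
  std::vector<std::string> out;
  for (std::size_t i = 0; i < awake_times.size(); ++i) {
    if (awake_times[i] > timeout) {
      // Invented message format: thread number plus elapsed time, as the description says.
      out.push_back("Watchdog: [ET_NET " + std::to_string(i) + "] has been awake for " +
                    std::to_string(awake_times[i].count()) + " ms");
    }
  }
  return out;
}
```

A thread sitting exactly at the timeout is not reported in this sketch; only a strictly longer work phase triggers a warning.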