Skip to content

Conversation

moonchen
Copy link
Contributor

@moonchen moonchen commented Sep 19, 2025

Adds a watchdog thread that warns when a net thread remains in the work phase longer than a configurable duration.

Config:
proxy.config.thread_watchdog.timeout_ms (default: 1000)

Why:
Net threads should not stall; doing so adds latency to all transactions multiplexed on that thread. Stalls may indicate a misbehaving plugin, overload, or a Traffic Server bug.

On trigger, a warning is logged with the offending thread number and elapsed time to aid targeted diagnostics.

Run a watchdog thread to find blocking events.
@moonchen moonchen requested a review from Copilot September 19, 2025 20:26
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a thread watchdog system to detect blocking events in the traffic server. The watchdog monitors network threads and alerts when they remain awake (not sleeping) longer than a configurable timeout threshold, indicating potential performance issues or blocking operations.

Key changes:

  • Introduces a Watchdog::Monitor class that runs in a separate thread to monitor event loop health
  • Adds heartbeat tracking to EThread instances to record sleep/wake timestamps
  • Configures the watchdog timeout through a new configuration parameter

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/traffic_server/traffic_server.cc Creates and manages the watchdog instance, integrating it into server startup/shutdown
src/records/RecordsConfig.cc Adds configuration parameter for watchdog timeout
src/iocore/eventsystem/Watchdog.cc Implements the watchdog monitoring logic
src/iocore/eventsystem/UnixEThread.cc Adds heartbeat updates to the event loop
src/iocore/eventsystem/CMakeLists.txt Includes the new watchdog source file in the build
include/iocore/eventsystem/Watchdog.h Defines watchdog interfaces and heartbeat structure
include/iocore/eventsystem/EThread.h Adds heartbeat state to the EThread class

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@moonchen moonchen marked this pull request as draft September 19, 2025 20:27
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@moonchen moonchen marked this pull request as ready for review September 19, 2025 20:28
Because ubuntu 20
macOS 14, FreeBSD 13 don't have it
@moonchen moonchen changed the title Thread watchdog watchdog: alert on ET_NET thread stalls beyond threshold Sep 22, 2025
@moonchen
Copy link
Contributor Author

This is a PR that would be nicer if we had std::jthread and stop_token support across all platforms. Ubuntu 20, FreeBSD 13, and macOS 14 are too old.

@moonchen moonchen marked this pull request as draft September 22, 2025 18:04

// Start the watchdog
int watchdog_timeout_ms = RecGetRecordInt("proxy.config.thread_watchdog.timeout_ms").value_or(1000);
watchdog = std::make_unique<Watchdog::Monitor>(eventProcessor.thread_group[ET_NET]._thread,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this optional, as in, if proxy.config.thread_watchdog == 0, we don't setup the watchdog ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants