
Conversation

@damodaryekkuluri

Description

This documentation change helps users implement "Logic-level" monitoring instead of just "Process-level" monitoring.

Context/Background

Users deploying heavy-compute models (specifically vLLM for LLMs) often encounter "zombie replicas." This occurs when the application's internal engine (e.g., the vLLM background loop) crashes with an AsyncEngineDeadError, but the Ray Actor process remains alive. Because the process is alive, the default Ray Serve health check remains green, and the Ray Head node continues to route traffic to the broken replica, resulting in a series of 5xx errors for the end user.

Currently, the documentation for implementing a custom check_health hook to solve this is fragmented and not easily discoverable in the core monitoring or API guides.
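The failure mode described above can be reproduced in miniature without Ray. The following sketch (all names are hypothetical, and `MockEngine` stands in for something like vLLM's background loop) contrasts a process-level liveness check, which stays green after the engine dies, with a logic-level check that inspects the engine itself:

```python
class MockEngine:
    """Stand-in for an inference engine (e.g., vLLM's background loop)."""

    def __init__(self):
        self.loop_alive = True  # flips to False when the engine dies

    def crash(self):
        # Simulates an AsyncEngineDeadError-style failure:
        # the engine stops, but the hosting process does not exit.
        self.loop_alive = False


class Replica:
    """Stand-in for a Serve replica actor."""

    def __init__(self):
        self.engine = MockEngine()

    def process_level_check(self) -> bool:
        # Default behavior: the actor process is alive, so the check passes
        # regardless of the engine's internal state.
        return True

    def logic_level_check(self) -> bool:
        # Application-level check: inspect the engine, not just the process.
        return self.engine.loop_alive


replica = Replica()
replica.engine.crash()
print(replica.process_level_check())  # True  -- the zombie replica looks healthy
print(replica.logic_level_check())    # False -- the logic-level check catches it
```

This is exactly the gap a custom `check_health` hook closes: the default check answers "is the process alive?", while the application-level check answers "can this replica actually serve requests?".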

Changes

This PR improves the discoverability of application-level health checks in two ways:

  1. doc/source/serve/monitoring.md: Added a new section, "Application-level Health Checks (Custom Health Checks)," including a clear code example and a real-world scenario (engine crashes vs. process health).
  2. python/ray/serve/deployment.py: Updated the Deployment class docstring to explicitly list check_health() as an optional user-defined hook. This ensures that developers using IDEs can discover the feature via hover-over or code completion.

Signed-off-by: Damodar Yekkuluri <damodar3sachin@gmail.com>
Added documentation for user-defined health check method.

Signed-off-by: Damodar Yekkuluri <damodar3sachin@gmail.com>
@damodaryekkuluri damodaryekkuluri requested review from a team as code owners January 7, 2026 09:30
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds valuable documentation for custom health checks in Ray Serve, addressing a key user need for logic-level monitoring. The changes in monitoring.md and deployment.py significantly improve the discoverability and understanding of the check_health hook. My review focuses on enhancing the clarity and completeness of this new documentation. I've suggested a formatting fix to a list for better logical structure and recommended adding details about configuration options to both the guide and the docstring to make them more comprehensive for users. Overall, this is a great contribution.


* viewing the Ray dashboard
* viewing the `serve status` output
* implementing custom application-level health checks
Contributor

medium

The new list item for custom health checks is indented, which makes it appear as a sub-item of 'viewing the serve status output'. To improve clarity and discoverability, it should be a top-level item in the list, at the same indentation level as the other items.

Suggested change

```diff
-   * implementing custom application-level health checks
+ * implementing custom application-level health checks
```

Author

Intentional. The custom health check is a sub-item of the `serve status` item.


### Implementing `check_health`

When you define an `async def check_health(self)` method, Ray Serve calls it periodically (defaulting to every 10 seconds). If this method raises an exception, Ray marks the replica as `UNHEALTHY`, stops routing traffic to it, and attempts to restart it.
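The behavior this section describes might be illustrated with a sketch like the one below. It deliberately avoids importing Ray so it runs standalone: `EngineDeadError` is a hypothetical stand-in for a failure such as vLLM's `AsyncEngineDeadError`, and in real code the class would be decorated with `@serve.deployment`, with `check_health` invoked by the Serve controller rather than called directly:

```python
import asyncio


class EngineDeadError(RuntimeError):
    """Hypothetical stand-in for an internal engine failure."""


class LLMDeployment:
    """Sketch of a deployment class. In Ray Serve this would carry
    @serve.deployment, and check_health would be called periodically
    by the controller instead of by user code."""

    def __init__(self):
        self._engine_alive = True  # real code would query the engine itself

    async def check_health(self):
        # Returning normally means HEALTHY. Raising any exception signals
        # UNHEALTHY: Serve stops routing traffic to this replica and
        # restarts it.
        if not self._engine_alive:
            raise EngineDeadError("inference engine background loop is dead")


d = LLMDeployment()
asyncio.run(d.check_health())  # healthy: returns without raising

d._engine_alive = False
try:
    asyncio.run(d.check_health())
except EngineDeadError as e:
    print("replica marked UNHEALTHY:", e)
```

The key design point is that the hook only needs to raise; the replica lifecycle (unrouting, restart) is handled by Serve, not by the deployment code.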
Contributor

medium

This is a great explanation of check_health. To make it even more helpful for users, I suggest mentioning how they can configure the health check frequency and timeout. You could add a sentence about the health_check_period_s and health_check_timeout_s parameters of the @serve.deployment decorator.

Suggested change
When you define an `async def check_health(self)` method, Ray Serve calls it periodically (defaulting to every 10 seconds). If this method raises an exception, Ray marks the replica as `UNHEALTHY`, stops routing traffic to it, and attempts to restart it.
When you define an `async def check_health(self)` method, Ray Serve calls it periodically. You can configure the period and timeout using the `health_check_period_s` (default: 10s) and `health_check_timeout_s` arguments in the `@serve.deployment` decorator. If this method raises an exception, Ray marks the replica as `UNHEALTHY`, stops routing traffic to it, and attempts to restart it.
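The semantics of the two suggested parameters can be sketched without Ray at all. The toy monitor loop below (names and tiny intervals are my own, chosen so the example runs instantly) approximates what `health_check_period_s` and `health_check_timeout_s` control: how often the check runs, and how long it may take before being treated as a failure:

```python
import asyncio


async def health_monitor(check_health, period_s=0.01, timeout_s=0.01, rounds=3):
    """Toy version of the controller loop implied by health_check_period_s /
    health_check_timeout_s: call check_health every period_s, and treat an
    exception or a timeout as UNHEALTHY."""
    for _ in range(rounds):
        try:
            await asyncio.wait_for(check_health(), timeout=timeout_s)
        except Exception:
            return "UNHEALTHY"
        await asyncio.sleep(period_s)
    return "HEALTHY"


async def healthy():
    pass  # returns promptly, so every round passes


async def hung():
    await asyncio.sleep(10)  # never finishes within timeout_s


print(asyncio.run(health_monitor(healthy)))  # HEALTHY
print(asyncio.run(health_monitor(hung)))     # UNHEALTHY
```

This also shows why the timeout matters independently of the period: a `check_health` that hangs (for example, awaiting a dead engine queue) is caught by the timeout even though it never raises on its own.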

Comment on lines 76 to 80
[Optional] Define this method to implement custom application-level
health checks. Ray Serve calls this periodically (default 10s).
If it raises an exception, the replica is marked as UNHEALTHY
and restarted. Useful for detecting internal engine crashes
(like vLLM AsyncEngineDeadError) that leave the actor process alive.
Contributor

medium

The docstring is helpful for discoverability. To improve it further, I suggest reformatting it for better readability and adding details about how to configure the health check period and timeout, similar to the documentation in monitoring.md.

Suggested change
[Optional] Define this method to implement custom application-level
health checks. Ray Serve calls this periodically (default 10s).
If it raises an exception, the replica is marked as UNHEALTHY
and restarted. Useful for detecting internal engine crashes
(like vLLM AsyncEngineDeadError) that leave the actor process alive.
[Optional] Define this method to implement custom application-level
health checks. Ray Serve calls this periodically. You can configure the
period and timeout using the `health_check_period_s` (default: 10s) and
`health_check_timeout_s` arguments in the `@serve.deployment` decorator.
If it raises an exception, the replica is marked as UNHEALTHY and
restarted. This is useful for detecting internal engine crashes (like
vLLM AsyncEngineDeadError) that leave the actor process alive.

Signed-off-by: Damodar Yekkuluri <damodar3sachin@gmail.com>
@ray-gardener ray-gardener bot added serve Ray Serve Related Issue docs An issue or change related to documentation community-contribution Contributed by the community labels Jan 7, 2026
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Jan 20, 2026