Add is_healthy to truss #1283

Conversation
Force-pushed from 3b008e7 to cd2d7db
truss/tests/test_model_inference.py (outdated)

def load(self) -> bool:
    time.sleep(10)

def is_ready(self) -> bool:
add a test with a non-bool return type
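For example, a hypothetical test model along these lines could exercise that case (the class name is illustrative, not from the PR):

class NonBoolReadyModel:
    def is_ready(self) -> str:  # deliberately returns a non-bool
        return "ready"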
Maybe binary/bool is not the right data model for this? It seems we have 3 states: ready, not ready, and "not defined" - that calls for an enum.
I think bool is still valid here. "Not defined" defaults to the current health check method, which just checks whether the model has successfully loaded. Otherwise, a boolean (is healthy or is not healthy) seems like the most intuitive way to reason about a health check.
Right. I think I got confused because the method of the same name on the model (not the chainlet) can return None if not defined.

if hasattr(self, "_chainlet"):
    is_ready = await self._model.is_ready()
else:
    # Offload sync functions to thread, to not block event loop.
    is_ready = await to_thread.run_sync(self._model.is_ready)
Do we expect any use case where this function is not super fast? I'm just wondering if the overhead for dual sync/async support is warranted, or if it would be a nice simplification to just make this "naively sync".
We'll enforce a 10s timeout on health checks. Async support is more for parity with the other truss methods that we allow to be called async or sync.
My thinking was more in the sense of "can we reduce complexity" (no threading, no multiple code branches, no two ways of defining the function) if there's a reasonable case that this function is super fast - but it seems 10s is not super fast. I'm still wondering what the real situations are where this check would not be virtually instant.
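For concreteness, a hedged sketch of the two user-facing spellings that dual sync/async support allows (the class names and the _model attribute are illustrative, not from the PR):

class SyncModel:
    def is_ready(self) -> bool:
        # Sync variant: the server offloads this call to a worker thread.
        return getattr(self, "_model", None) is not None

class AsyncModel:
    async def is_ready(self) -> bool:
        # Async variant: awaited directly on the event loop.
        return getattr(self, "_model", None) is not None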
NONE = enum.auto()
INPUTS_ONLY = enum.auto()
REQUEST_ONLY = enum.auto()
INPUTS_AND_REQUEST = enum.auto()
Hmmm, this helper was originally for the preprocess, predict, and postprocess functions, for which prepare_args is also useful. I'm not sure if this extension is a bit confusing (also for setup_environment). I think it only makes sense to use something like a MethodDescriptor that has no arg_config for setup_environment and is_ready.
Not sure I understand the comment here. I'm currently defining is_healthy as a MethodDescriptor - is there a better way to define is_healthy? I needed to support no args in ArgConfig to allow methods to have no argument other than self.
What I meant was that MethodDescriptor was not envisioned to be used with additional new methods added to the model wrapper - but it's not that important...
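As a rough sketch of the shape under discussion, assuming the ArgConfig values from the diff above (the MethodDescriptor fields here are simplified stand-ins, not the real class):

import enum
from dataclasses import dataclass
from typing import Callable

class ArgConfig(enum.Enum):
    NONE = enum.auto()            # no arguments besides self, e.g. is_healthy
    INPUTS_ONLY = enum.auto()
    REQUEST_ONLY = enum.auto()
    INPUTS_AND_REQUEST = enum.auto()

@dataclass
class MethodDescriptor:
    method: Callable
    arg_config: ArgConfig

# A health check method would then carry ArgConfig.NONE:
descriptor = MethodDescriptor(method=lambda self: True, arg_config=ArgConfig.NONE)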
Force-pushed from 0e62bb1 to 9c69f9f
truss-chains/tests/custom_health_checks/custom_health_checks.py (outdated, resolved)
lgtm.
Force-pushed from 38d865e to cb675f9
url = URL(path=request.url.path, query=request.url.query.encode("utf-8"))

path = request.url.path
if path == "/v1/models/model":
need to ensure we still match on /v1/models/{model_name}, will update the logic here
took a closer look at where we call these inference server endpoints, and it looks like we hardcode v1/models/model everywhere else; I think it's ok hardcoding it here as well.
clarifying what the logic here would be:

path = request.url.path
path_params = request.path_params["path"].split("/")
if len(path_params) == 2 and path_params[0] == "models":
    # Reroute health checks to the inference server's /v1/models/{model_name}/loaded endpoint
    path += "/loaded"

since we technically define {model_name} as a path parameter in the inference server, we could support this parameter in the control server routing logic I've added in this PR
chatted with @squidarth and went with keeping the checks hardcoded to v1/models/model, since we use this everywhere we call the inference server, and moved the string comparison logic to a helper fn: _reroute_if_health_check
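A minimal sketch of what that helper might look like (the name _reroute_if_health_check comes from the comment above; the body is an assumption reconstructed from this thread):

def _reroute_if_health_check(path: str) -> str:
    # Health checks arrive on the hardcoded model path; reroute them to the
    # inference server's "loaded" endpoint so that custom health checks are
    # not invoked on development models.
    if path == "/v1/models/model":
        return path + "/loaded"
    return path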
def _custom_stop_strategy(retry_state: RetryCallState) -> bool:
    # Stop after 10 attempts for ModelNotReady
How long can this end up waiting and retrying?
Should be capped at 75 seconds from the first attempt to the last attempt.
I think this could be shorter. 75s seems long, and I'm not sure anything is going to wait that long for a response on the health check. For example, the operator's retry timeout on this is 2s. The wake call might wait longer, though.
did some digging and it looks like we just call GET / on wake, which shouldn't take long. I'll go ahead and update the retries here to be once a second for 10 attempts.
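In tenacity terms, "once a second for 10 attempts" could be expressed roughly like this (a sketch of the described policy only; the PR's actual code uses the custom stop strategy shown in this thread):

from tenacity import retry, stop_after_attempt, wait_fixed

@retry(wait=wait_fixed(1), stop=stop_after_attempt(10))
def _check_model_ready() -> None:
    ...  # hypothetical probe that raises ModelNotReady until the server reports loaded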
# For all other exceptions, stop after INFERENCE_SERVER_START_WAIT_SECS
seconds_since_start = (
    retry_state.seconds_since_start
    if retry_state.seconds_since_start is not None
    else 0.0
)
return seconds_since_start >= INFERENCE_SERVER_START_WAIT_SECS
nit: It shouldn't if I understand the code correctly, but I want to double-check: does this change the retry behaviour? Is it possible for retry_state.seconds_since_start to never be set and for this to retry forever?
I don't think it's possible for seconds_since_start to never be set; it looks like it's set in the __init__ of RetryCallState: https://github.com/jd/tenacity/blob/main/tenacity/__init__.py#L532
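Putting the two fragments from this thread together, the full stop strategy plausibly looks something like the following; the ModelNotReady branch is reconstructed from the comments above (and assumes ModelNotReady and INFERENCE_SERVER_START_WAIT_SECS are in scope), so treat it as a sketch rather than the exact implementation:

from tenacity import RetryCallState

def _custom_stop_strategy(retry_state: RetryCallState) -> bool:
    # Stop after 10 attempts for ModelNotReady.
    exception = retry_state.outcome.exception() if retry_state.outcome else None
    if isinstance(exception, ModelNotReady):
        return retry_state.attempt_number >= 10
    # For all other exceptions, stop after INFERENCE_SERVER_START_WAIT_SECS.
    seconds_since_start = (
        retry_state.seconds_since_start
        if retry_state.seconds_since_start is not None
        else 0.0
    )
    return seconds_since_start >= INFERENCE_SERVER_START_WAIT_SECS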
Force-pushed from c52ed3e to 0fa72ee
🚀 What

Adds an is_healthy function to the model wrapper that users can use to define their own custom health checks. If defined, this function is called on calls to /v1/models/{model_name} (our existing health check endpoint). Otherwise we fall back to the existing health check functionality, which checks whether the load failed or the model is not in a ready state.

We also needed to introduce a new endpoint on the inference server, v1/models/{model_name}/loaded, that the control server will now explicitly call when the health check endpoint is called. This is to ensure that custom health checks are not called on development models.
🔬 Testing

Tested on staging with an RC (0.9.59rc018).

Confirmed that no health check warning logs display before the model is loaded, and that health check warnings are then logged every 10s while they fail, one for each of the liveness and readiness probes:

Confirmed that these changes won't break with the current ctx builder version in staging and production.
Ran poetry run truss push from my branch into orgs in staging and production without this RC, and confirmed that truss push works properly and doesn't break existing deploy and inference flows.

Ensured we're still able to fail fast when load fails:
Model loading:
(screenshot)

About a minute later, the deployment is terminated because load failed:
(screenshot)