Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new health check configuration fields #1290

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

spal1
Copy link
Collaborator

@spal1 spal1 commented Dec 17, 2024

🚀 What

Adds new configuration fields to truss config for models and chains. Naming and defaults can be seen in examples below.

for models, we add a new health_checks section to runtime:

runtime:
  health_checks:
    restart_check_delay_seconds: 0
    restart_failure_threshold_seconds: 1800 # 18 consecutive failures over 30 minutes
    stop_traffic_failure_threshold_seconds: 1800 # 18 consecutive failures over 30 minutes

for chains, we add a new runtime field to RemoteConfig that includes health_checks:

remote_config = chains.RemoteConfig(
  runtime=chains.Runtime(
    health_checks=truss_config.HealthCheck(
      restart_check_delay_seconds=0,
      max_time_before_restart_seconds=1800,
      max_time_before_stopping_traffic_seconds=1800
    )
  )
)		

🔬 Testing

Added a couple of integration test cases

Copy link

linear bot commented Dec 17, 2024

@spal1 spal1 marked this pull request as ready for review December 19, 2024 15:43
truss/base/truss_config.py Outdated Show resolved Hide resolved
restart_failure_threshold_seconds: int = MAX_FAILURE_THRESHOLD_SECONDS
stop_traffic_failure_threshold_seconds: int = MAX_FAILURE_THRESHOLD_SECONDS

def __post_init__(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to do these validations on the backend. That gives us more flexibility to not be stuck with these checks for old clients.

Can you look into how that might work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline--I'll use this as an opportunity to create a first version of a validate_truss_config GQL mutation that we can call from here: https://github.com/basetenlabs/truss/blob/main/truss/remote/baseten/remote.py#L124

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we write so much boilerplate code for this here when pydantic can do most of this. - more a comment on pre-existing code though...

I think we should refactor this whole file soon. It's also super error prone that you have to add key literals to to_dict methods each time you add a field.

@@ -393,6 +397,7 @@ class MyChainlet(chains.ChainletBase):
assets: Assets = Assets()
name: Optional[str] = None
options: ChainletOptions = ChainletOptions()
runtime: Runtime = Runtime()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chains doesn't map 1:1 to the structure of truss config. I wonder if it would make sense to add this field to ChainletOptions instead of creating a new class.

If you want to proceed with a new class, then there should be some differentiator, explaining which future fields go into Runtime and which ones go into ChainletOptions.

restart_failure_threshold_seconds: int = MAX_FAILURE_THRESHOLD_SECONDS
stop_traffic_failure_threshold_seconds: int = MAX_FAILURE_THRESHOLD_SECONDS

def __post_init__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we write so much boilerplate code for this here when pydantic can do most of this. - more a comment on pre-existing code though...

I think we should refactor this whole file soon. It's also super error prone that you have to add key literals to to_dict methods each time you add a field.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants