Conversation

jen-scymanski-scale

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Fixes Needed for model-engine to work on-prem
AWS Disable Logic: Added DISABLE_AWS=true environment variable support
Redis Fallback: S3 backend falls back to Redis when AWS is disabled
Lazy Initialization: Celery apps are initialized lazily to avoid import-time AWS sessions
On-Premises Redis: Added direct Redis connection support via ONPREM_REDIS_HOST
Configurable Gunicorn: Made worker timeouts and settings configurable via environment variables
Thread-Safe Logging: Fixed potential race conditions in logger initialization
Broker Disable Options: Added DISABLE_SQS_BROKER and DISABLE_SERVICEBUS_BROKER flags
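
To illustrate how the first few items above fit together, here is a minimal Python sketch of the AWS-disable check and the Redis fallback. The helper names, the ONPREM_REDIS_PORT variable, and the S3 placeholder are hypothetical, not the actual code in this PR; only DISABLE_AWS and ONPREM_REDIS_HOST are taken from the changes described above.

```python
import os


def aws_is_disabled() -> bool:
    # DISABLE_AWS=true short-circuits AWS session/client creation.
    return os.environ.get("DISABLE_AWS", "").lower() == "true"


def get_result_backend_url() -> str:
    # When AWS is disabled, fall back to the on-prem Redis instance for results.
    if aws_is_disabled():
        host = os.environ.get("ONPREM_REDIS_HOST", "localhost")
        port = os.environ.get("ONPREM_REDIS_PORT", "6379")  # assumed companion variable
        return f"redis://{host}:{port}/0"
    # Default (cloud) path keeps using the S3-backed result store.
    return "s3://<results-bucket>"  # placeholder
```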

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Tested locally with scripts
AWS Disable Logic: Tested DISABLE_AWS=true prevents AWS session creation
Redis Fallback: Verified S3 backend falls back to Redis when AWS disabled
Lazy Initialization: Confirmed Celery apps don't create AWS sessions at import time
Thread-Safe Logging: Tested logger initialization in multi-threaded environments
Direct Redis Connection: Tested ONPREM_REDIS_HOST configuration
Server Startup: Confirmed server starts with all cloud services disabled
Environment Variables: Verified all new config variables work correctly
Async Task Workflow: Tested complete task submission and result retrieval
Broker Disable Logic: Verified SQS/ServiceBus fallback to Redis
Error Handling: Tested graceful handling of missing AWS credentials
Backward Compatibility: Confirmed existing cloud functionality still works
No Breaking Changes: Verified no existing APIs are broken

return creds["cache-url"]

# Check if we're in an onprem environment with direct Redis access
if os.environ.get('ONPREM_REDIS_HOST'):

why do we not pass this in via config in the same way as the other redis configs?

updated
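
For reference, a minimal sketch of what the config-driven lookup could look like, assuming the on-prem Redis fields added to infra_config() elsewhere in this PR; the helper name is hypothetical.

```python
from typing import Optional


def onprem_cache_url_from_config() -> Optional[str]:
    # Sketch only: infra_config() is the existing model-engine config accessor.
    cfg = infra_config()
    if not cfg.onprem_redis_host:
        return None
    auth = f":{cfg.onprem_redis_password}@" if cfg.onprem_redis_password else ""
    return f"redis://{auth}{cfg.onprem_redis_host}:{cfg.onprem_redis_port}/0"
```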

@@ -0,0 +1,132 @@
# values_onprem.yaml - On-premises deployment configuration

I don't think you need to include this file here? I believe SGP maintains their own values.yaml in their own repo somewhere cc @nicolastomeo ?

ok, removed.

logger.error(e)
logger.error(f"Failed to retrieve secret: {secret_name}")
return {}
response = secret_manager.get_secret_value(SecretId=secret_name)

I think we still want to do the try/except wrapping to handle the cases where the secret_manager client errors out
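
A sketch of the try/except wrapping being suggested, assuming the surrounding module's secret_manager client and logger; the wrapper name is hypothetical, and ClientError comes from botocore.

```python
import json

from botocore.exceptions import ClientError


def fetch_secret(secret_name: str) -> dict:
    # Mirrors the snippet above, with the error handling kept around the client call.
    try:
        response = secret_manager.get_secret_value(SecretId=secret_name)
        return json.loads(response["SecretString"])
    except ClientError as e:
        logger.error(e)
        logger.error(f"Failed to retrieve secret: {secret_name}")
        return {}
```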

if aws_role is None:
    aws_session = session(infra_config().profile_ml_worker)
# Check if AWS is disabled via config - if so, fall back to Redis backend
if infra_config().disable_aws:

instead of doing this, wondering if we address this upstream and figure out how to pass in "redis" as the backend_protocol in the on-prem scenario
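
A sketch of that upstream approach: choose the backend protocol once from configuration and pass it into celery_app(), rather than branching on disable_aws at each call site. The selection helper is hypothetical, and the assumption that BrokerType has a REDIS member is mine, not the PR's.

```python
def select_backend_protocol() -> str:
    # Hypothetical helper: fall back to a Redis result backend when AWS is disabled.
    return "redis" if infra_config().disable_aws else "s3"


celery_redis = celery_app(
    None,
    broker_type=str(BrokerType.REDIS.value),
    backend_protocol=select_backend_protocol(),
)
```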

- name: CIRCLECI
  value: "true"
{{- end }}
{{- if .Values.gunicorn }}

anecdotally, we found it a lot easier to performance tune pure uvicorn, so we actually migrated most usage of gunicorn back to uvicorn. That being said, won't block your usage of it

celery_servicebus = celery_app(
    None, broker_type=str(BrokerType.SERVICEBUS.value), backend_protocol=backend_protocol
)
# Initialize celery apps lazily to avoid import-time AWS session creation

curious why we're not running into similar issues for our other non-AWS environments

On-prem, this is likely because there are no credentials and the import fails:
The container starts with no AWS credentials
The Python import hits celery_task_queue_gateway.py:19
boto3.Session() creation fails immediately
The import exception crashes the container before the application even starts
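
A sketch of the lazy-initialization pattern described above, deferring celery_app() (and therefore any boto3.Session()) until first use; the names are taken from the snippet earlier in this thread, and the accessor function itself is illustrative.

```python
_celery_servicebus = None


def get_celery_servicebus():
    # Create the Celery app on first use instead of at import time, so importing
    # the module never triggers cloud session creation in credential-less environments.
    global _celery_servicebus
    if _celery_servicebus is None:
        _celery_servicebus = celery_app(
            None,
            broker_type=str(BrokerType.SERVICEBUS.value),
            backend_protocol=backend_protocol,
        )
    return _celery_servicebus
```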

if debug:
    additional_args.extend(["--reload", "--timeout", "0"])

# Use environment variables for configuration with fallbacks

👍

format=LOG_FORMAT,
)

# Thread-safe logging configuration - only configure if not already configured

interesting, was this manifesting in a particular error?

Yes, there was a specific recursive logging error causing worker crashes:

RuntimeError: reentrant call inside <_io.BufferedWriter name=''>

This occurred during Gunicorn worker startup when multiple processes tried to initialize logging simultaneously, causing thread-unsafe logging configuration and race conditions. The error led to worker crashes, which then triggered the WORKER TIMEOUT errors we were seeing.

The issue was that multiple Gunicorn workers starting at the same time would compete to write to stderr during logging setup, causing a reentrant call error that crashed the worker processes.
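
A sketch of the "only configure if not already configured" guard described above, with a lock so threads within one process cannot race during setup; LOG_FORMAT is the module's existing format constant, and the function name is illustrative.

```python
import logging
import threading

_logging_configured_lock = threading.Lock()


def configure_logging(level: int = logging.INFO) -> None:
    # Configure the root logger exactly once per process; repeated calls are no-ops.
    with _logging_configured_lock:
        root = logging.getLogger()
        if not root.handlers:
            logging.basicConfig(level=level, format=LOG_FORMAT)
```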

firehose_stream_name: Optional[str] = None
prometheus_server_address: Optional[str] = None
# On-premises configuration
onprem_redis_host: Optional[str] = None

i'm wondering if there's an easy way to merge onprem_redis_host w/ the other redis_host arg that already exists.

We could consolidate this by using the existing redis_host field and adding logic to detect on-premises vs cloud environments. However, we kept them separate because:
redis_host is used for the message broker (Celery)
onprem_redis_host is used for the cache/result storage
They might point to different Redis instances in some deployments. Let me know if you would like me to combine them.

> redis_host is used for the message broker (Celery)
> onprem_redis_host is used for the cache/result storage

This is a good distinction. I think it's better to make that more explicit in the naming as opposed to marking one as "onprem".
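
One possible shape for that more explicit naming, with illustrative field names (not the ones in this PR):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedisSettings:
    # Named for what each endpoint is used for, not where it runs.
    celery_broker_redis_host: Optional[str] = None  # message broker (Celery)
    cache_redis_host: Optional[str] = None          # cache / result storage
    cache_redis_port: str = "6379"
    cache_redis_password: Optional[str] = None
```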

onprem_redis_port: Optional[str] = "6379"
onprem_redis_password: Optional[str] = None
# AWS disable configuration
disable_aws: bool = False

could we instead add a new cloud_provider == onprem and change logic based off that?
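
A sketch of the cloud_provider-based approach being suggested, deriving behaviour from a single provider value instead of a standalone disable_aws flag; the helper names and the "onprem" value are illustrative.

```python
def is_onprem(cfg) -> bool:
    # One provider value drives the checks that currently hang off
    # disable_aws / onprem_* fields.
    return cfg.cloud_provider == "onprem"


def should_create_aws_session(cfg) -> bool:
    return cfg.cloud_provider == "aws"
```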

global celery_servicebus
if celery_servicebus is None:
    # Check if ServiceBus broker is disabled or if we're forcing Redis via config
    if infra_config().disable_servicebus_broker or infra_config().force_celery_redis:

should we just throw an error if the app is configured to use SQS but the environment doesn't support it, instead of adding fallback logic?
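
A sketch of that fail-fast alternative, raising instead of silently falling back to Redis, using the names from the snippet above; the accessor function is illustrative.

```python
def get_celery_servicebus():
    global celery_servicebus
    if celery_servicebus is None:
        if infra_config().disable_servicebus_broker:
            # Surface a configuration error instead of silently swapping in Redis.
            raise RuntimeError(
                "ServiceBus broker is disabled in this environment; "
                "configure a supported broker explicitly."
            )
        celery_servicebus = celery_app(
            None,
            broker_type=str(BrokerType.SERVICEBUS.value),
            backend_protocol=backend_protocol,
        )
    return celery_servicebus
```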



def session(role: Optional[str], session_type: SessionT = Session) -> SessionT:
def session(role: Optional[str], session_type: SessionT = Session) -> Optional[SessionT]:

shouldn't need to touch this; this should only be used if cloud_provider == 'aws'

vllm_args.disable_log_requests = True

vllm_cmd = f"python -m vllm_server --model {final_weights_folder} --served-model-name {model_name} {final_weights_folder} --port 5005"
vllm_cmd = f"python -m vllm_server --model {final_weights_folder} --served-model-name {model_name} --port 5005"

@dmchoiboi - JFYI: this is where I've hackily changed it to support the public vllm docker image.

            vllm_cmd = f"python3 -m vllm.entrypoints.openai.api_server --model {final_weights_folder} --served-model-name {model_name} --port 5005"


protocol="http",
readiness_initial_delay_seconds=10,
healthcheck_route="/health",
readiness_initial_delay_seconds=1800, # 30 minutes for large model downloads

@saeidbarati-scale for context
