Conversation

jen-scymanski-scale

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Fixes Needed for model-engine to work on-prem
AWS Disable Logic: Added DISABLE_AWS=true environment variable support
Redis Fallback: S3 backend falls back to Redis when AWS is disabled
Lazy Initialization: Celery apps are initialized lazily to avoid import-time AWS sessions
On-Premises Redis: Added direct Redis connection support via ONPREM_REDIS_HOST
Configurable Gunicorn: Made worker timeouts and settings configurable via environment variables
Thread-Safe Logging: Fixed potential race conditions in logger initialization
Broker Disable Options: Added DISABLE_SQS_BROKER and DISABLE_SERVICEBUS_BROKER flags
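
To illustrate how the first few items above fit together, here is a minimal Python sketch of the AWS-disable check and the Redis fallback. The helper names, the ONPREM_REDIS_PORT variable, and the S3 placeholder are hypothetical, not the actual code in this PR; only DISABLE_AWS and ONPREM_REDIS_HOST are taken from the changes described above.

```python
import os


def aws_is_disabled() -> bool:
    # DISABLE_AWS=true short-circuits AWS session/client creation.
    return os.environ.get("DISABLE_AWS", "").lower() == "true"


def get_result_backend_url() -> str:
    # When AWS is disabled, fall back to the on-prem Redis instance for results.
    if aws_is_disabled():
        host = os.environ.get("ONPREM_REDIS_HOST", "localhost")
        port = os.environ.get("ONPREM_REDIS_PORT", "6379")  # assumed companion variable
        return f"redis://{host}:{port}/0"
    # Default (cloud) path keeps using the S3-backed result store.
    return "s3://<results-bucket>"  # placeholder
```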

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Tested locally with scripts
AWS Disable Logic: Tested DISABLE_AWS=true prevents AWS session creation
Redis Fallback: Verified S3 backend falls back to Redis when AWS disabled
Lazy Initialization: Confirmed Celery apps don't create AWS sessions at import time
Thread-Safe Logging: Tested logger initialization in multi-threaded environments
Direct Redis Connection: Tested ONPREM_REDIS_HOST configuration
Server Startup: Confirmed server starts with all cloud services disabled
Environment Variables: Verified all new config variables work correctly
Async Task Workflow: Tested complete task submission and result retrieval
Broker Disable Logic: Verified SQS/ServiceBus fallback to Redis
Error Handling: Tested graceful handling of missing AWS credentials
Backward Compatibility: Confirmed existing cloud functionality still works
No Breaking Changes: Verified no existing APIs are broken

return creds["cache-url"]

# Check if we're in an onprem environment with direct Redis access
if os.environ.get('ONPREM_REDIS_HOST'):

why do we not pass this in via config in the same way as the other redis configs?

updated
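
For reference, a minimal sketch of what the config-driven lookup could look like, assuming the on-prem Redis fields added to infra_config() elsewhere in this PR; the helper name is hypothetical.

```python
from typing import Optional


def onprem_cache_url_from_config() -> Optional[str]:
    # Sketch only: infra_config() is the existing model-engine config accessor.
    cfg = infra_config()
    if not cfg.onprem_redis_host:
        return None
    auth = f":{cfg.onprem_redis_password}@" if cfg.onprem_redis_password else ""
    return f"redis://{auth}{cfg.onprem_redis_host}:{cfg.onprem_redis_port}/0"
```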

@@ -0,0 +1,132 @@
# values_onprem.yaml - On-premises deployment configuration

I don't think you need to include this file here? I believe SGP maintains their own values.yaml in their own repo somewhere cc @nicolastomeo ?

ok, removed.

logger.error(e)
logger.error(f"Failed to retrieve secret: {secret_name}")
return {}
response = secret_manager.get_secret_value(SecretId=secret_name)

I think we still want to do the try/except wrapping to handle the cases where the secret_manager client errors out
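
A sketch of the try/except wrapping being suggested, assuming the surrounding module's secret_manager client and logger; the wrapper name is hypothetical, and ClientError comes from botocore.

```python
import json

from botocore.exceptions import ClientError


def fetch_secret(secret_name: str) -> dict:
    # Mirrors the snippet above, with the error handling kept around the client call.
    try:
        response = secret_manager.get_secret_value(SecretId=secret_name)
        return json.loads(response["SecretString"])
    except ClientError as e:
        logger.error(e)
        logger.error(f"Failed to retrieve secret: {secret_name}")
        return {}
```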

if aws_role is None:
    aws_session = session(infra_config().profile_ml_worker)
# Check if AWS is disabled via config - if so, fall back to Redis backend
if infra_config().disable_aws:

instead of doing this, wondering if we address this upstream and figure out how to pass in "redis" as the backend_protocol in the on-prem scenario
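
A sketch of that upstream approach: choose the backend protocol once from configuration and pass it into celery_app(), rather than branching on disable_aws at each call site. The selection helper is hypothetical, and the assumption that BrokerType has a REDIS member is mine, not the PR's.

```python
def select_backend_protocol() -> str:
    # Hypothetical helper: fall back to a Redis result backend when AWS is disabled.
    return "redis" if infra_config().disable_aws else "s3"


celery_redis = celery_app(
    None,
    broker_type=str(BrokerType.REDIS.value),
    backend_protocol=select_backend_protocol(),
)
```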

- name: CIRCLECI
  value: "true"
{{- end }}
{{- if .Values.gunicorn }}

anecdotally, we found it a lot easier to performance tune pure uvicorn, so we actually migrated most usage of gunicorn back to uvicorn. That being said, won't block your usage of it

celery_servicebus = celery_app(
    None, broker_type=str(BrokerType.SERVICEBUS.value), backend_protocol=backend_protocol
)
# Initialize celery apps lazily to avoid import-time AWS session creation

curious why we're not running into similar issues for our other non-AWS environments

On-prem, this is likely because there are no credentials and the import fails:
The container starts with no AWS credentials
The Python import hits celery_task_queue_gateway.py:19
boto3.Session() creation fails immediately
The import exception crashes the container before the application even starts
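
A sketch of the lazy-initialization pattern described above, deferring celery_app() (and therefore any boto3.Session()) until first use; the names are taken from the snippet earlier in this thread, and the accessor function itself is illustrative.

```python
_celery_servicebus = None


def get_celery_servicebus():
    # Create the Celery app on first use instead of at import time, so importing
    # the module never triggers cloud session creation in credential-less environments.
    global _celery_servicebus
    if _celery_servicebus is None:
        _celery_servicebus = celery_app(
            None,
            broker_type=str(BrokerType.SERVICEBUS.value),
            backend_protocol=backend_protocol,
        )
    return _celery_servicebus
```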

if debug:
    additional_args.extend(["--reload", "--timeout", "0"])

# Use environment variables for configuration with fallbacks

👍

format=LOG_FORMAT,
)

# Thread-safe logging configuration - only configure if not already configured

interesting, was this manifesting in a particular error?

Yes, there was a specific recursive logging error causing worker crashes:

RuntimeError: reentrant call inside <_io.BufferedWriter name=''>

This occurred during Gunicorn worker startup when multiple processes tried to initialize logging simultaneously, causing thread-unsafe logging configuration and race conditions. The error led to worker crashes, which then triggered the WORKER TIMEOUT errors we were seeing.

The issue was that multiple Gunicorn workers starting at the same time would compete to write to stderr during logging setup, causing a reentrant call error that crashed the worker processes.
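
A sketch of the "only configure if not already configured" guard described above, with a lock so threads within one process cannot race during setup; LOG_FORMAT is the module's existing format constant, and the function name is illustrative.

```python
import logging
import threading

_logging_configured_lock = threading.Lock()


def configure_logging(level: int = logging.INFO) -> None:
    # Configure the root logger exactly once per process; repeated calls are no-ops.
    with _logging_configured_lock:
        root = logging.getLogger()
        if not root.handlers:
            logging.basicConfig(level=level, format=LOG_FORMAT)
```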

firehose_stream_name: Optional[str] = None
prometheus_server_address: Optional[str] = None
# On-premises configuration
onprem_redis_host: Optional[str] = None

i'm wondering if there's an easy way to merge onprem_redis_host w/ the other redis_host arg that already exists.

We could consolidate this by using the existing redis_host field and adding logic to detect on-premises vs cloud environments. However, we kept them separate because:
redis_host is used for the message broker (Celery)
onprem_redis_host is used for the cache/result storage
They might point to different Redis instances in some deployments. Let me know if you would like me to combine them.

> redis_host is used for the message broker (Celery)
> onprem_redis_host is used for the cache/result storage

This is a good distinction. I think it's better to make that more explicit in the naming as opposed to marking one as "onprem".
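
One possible shape for that more explicit naming, with illustrative field names (not the ones in this PR):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedisSettings:
    # Named for what each endpoint is used for, not where it runs.
    celery_broker_redis_host: Optional[str] = None  # message broker (Celery)
    cache_redis_host: Optional[str] = None          # cache / result storage
    cache_redis_port: str = "6379"
    cache_redis_password: Optional[str] = None
```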

onprem_redis_port: Optional[str] = "6379"
onprem_redis_password: Optional[str] = None
# AWS disable configuration
disable_aws: bool = False

could we instead add a new cloud_provider == onprem and change logic based off that?
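
A sketch of the cloud_provider-based approach being suggested, deriving behaviour from a single provider value instead of a standalone disable_aws flag; the helper names and the "onprem" value are illustrative.

```python
def is_onprem(cfg) -> bool:
    # One provider value drives the checks that currently hang off
    # disable_aws / onprem_* fields.
    return cfg.cloud_provider == "onprem"


def should_create_aws_session(cfg) -> bool:
    return cfg.cloud_provider == "aws"
```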

global celery_servicebus
if celery_servicebus is None:
    # Check if ServiceBus broker is disabled or if we're forcing Redis via config
    if infra_config().disable_servicebus_broker or infra_config().force_celery_redis:

should we just throw an error if the app is configured to use SQS but the environment doesn't support it, instead of adding fallback logic?
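
A sketch of that fail-fast alternative, raising instead of silently falling back to Redis, using the names from the snippet above; the accessor function is illustrative.

```python
def get_celery_servicebus():
    global celery_servicebus
    if celery_servicebus is None:
        if infra_config().disable_servicebus_broker:
            # Surface a configuration error instead of silently swapping in Redis.
            raise RuntimeError(
                "ServiceBus broker is disabled in this environment; "
                "configure a supported broker explicitly."
            )
        celery_servicebus = celery_app(
            None,
            broker_type=str(BrokerType.SERVICEBUS.value),
            backend_protocol=backend_protocol,
        )
    return celery_servicebus
```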



def session(role: Optional[str], session_type: SessionT = Session) -> SessionT:
def session(role: Optional[str], session_type: SessionT = Session) -> Optional[SessionT]:

shouldn't need to touch this; this should only be used if cloud_provider == 'aws'

vllm_args.disable_log_requests = True

vllm_cmd = f"python -m vllm_server --model {final_weights_folder} --served-model-name {model_name} {final_weights_folder} --port 5005"
vllm_cmd = f"python -m vllm_server --model {final_weights_folder} --served-model-name {model_name} --port 5005"

@dmchoiboi - JFYI: this is where I've hackily changed it to support the public vllm docker image.

            vllm_cmd = f"python3 -m vllm.entrypoints.openai.api_server --model {final_weights_folder} --served-model-name {model_name} --port 5005"


protocol="http",
readiness_initial_delay_seconds=10,
healthcheck_route="/health",
readiness_initial_delay_seconds=1800, # 30 minutes for large model downloads

@saeidbarati-scale for context
