on prem changes to disable cloud solutions #700
Conversation
return creds["cache-url"]

# Check if we're in an onprem environment with direct Redis access
if os.environ.get('ONPREM_REDIS_HOST'):
why do we not pass this in via config in the same way as the other redis configs?
updated
@@ -0,0 +1,132 @@
# values_onprem.yaml - On-premises deployment configuration
I don't think you need to include this file here? I believe SGP maintains their own values.yaml in their own repo somewhere, cc @nicolastomeo?
ok, removed.
logger.error(e)
logger.error(f"Failed to retrieve secret: {secret_name}")
return {}
response = secret_manager.get_secret_value(SecretId=secret_name)
I think we still want to do the try/catch wrapping to handle the cases where the secret_manager client errors out
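For reference, a minimal sketch of that wrapping, assuming the handler looks roughly like the diff lines above (the `get_secret` function name, logger setup, and JSON parsing are illustrative, not the PR's exact code):

```python
import json
import logging

import botocore.exceptions

logger = logging.getLogger(__name__)

def get_secret(secret_manager, secret_name: str) -> dict:
    # Wrap the Secrets Manager call so client errors (missing creds,
    # unknown secret, throttling) are logged and degrade to an empty
    # dict instead of raising at the call site.
    try:
        response = secret_manager.get_secret_value(SecretId=secret_name)
        return json.loads(response["SecretString"])
    except botocore.exceptions.ClientError as e:
        logger.error(e)
        logger.error(f"Failed to retrieve secret: {secret_name}")
        return {}
```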
if aws_role is None:
    aws_session = session(infra_config().profile_ml_worker)
# Check if AWS is disabled via config - if so, fall back to Redis backend
if infra_config().disable_aws:
instead of doing this, wondering if we address this upstream and figure out how to pass in "redis" as the backend_protocol in the on-prem scenario
- name: CIRCLECI
  value: "true"
{{- end }}
{{- if .Values.gunicorn }}
anecdotally, we found it a lot easier to performance tune pure uvicorn, so we actually migrated most usage of gunicorn back to uvicorn. That being said, won't block your usage of it
celery_servicebus = celery_app(
    None, broker_type=str(BrokerType.SERVICEBUS.value), backend_protocol=backend_protocol
)
# Initialize celery apps lazily to avoid import-time AWS session creation
curious why we're not running into similar issues for our other non-AWS environments
On Prem - this is likely due to no creds and the import failing:
- Container starts with NO AWS credentials
- Python import hits celery_task_queue_gateway.py:19
- boto3.Session() creation fails immediately
- Import exception → container crash before the application even starts
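A sketch of the lazy pattern the diff describes; `celery_app` and `BrokerType` are the module's own helpers (visible in the diff above), while the getter name and module-level cache are illustrative:

```python
_celery_servicebus = None  # built on first use, not at import time

def get_celery_servicebus(backend_protocol: str):
    # boto3.Session() inside celery_app now runs on the first call, after
    # the container is up with whatever creds it has, instead of during
    # module import - so a creds failure can't crash the import itself.
    global _celery_servicebus
    if _celery_servicebus is None:
        _celery_servicebus = celery_app(
            None,
            broker_type=str(BrokerType.SERVICEBUS.value),
            backend_protocol=backend_protocol,
        )
    return _celery_servicebus
```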
if debug:
    additional_args.extend(["--reload", "--timeout", "0"])

# Use environment variables for configuration with fallbacks
👍
    format=LOG_FORMAT,
)

# Thread-safe logging configuration - only configure if not already configured
interesting, was this manifesting in a particular error?
Yes, there was a specific recursive logging error causing worker crashes:

RuntimeError: reentrant call inside <_io.BufferedWriter name=''>

This occurred during Gunicorn worker startup: multiple workers starting at the same time would compete to write to stderr while configuring logging, and the resulting reentrant call crashed the worker processes. Those crashes are what triggered the WORKER TIMEOUT errors we were seeing.
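The fix hinted at in the diff is an idempotent, lock-guarded setup. A sketch of that pattern (assumed shape, not the PR's exact code; `LOG_FORMAT` appears in the diff, its value here is a placeholder):

```python
import logging
import threading

LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"  # placeholder

_config_lock = threading.Lock()

def configure_logging() -> None:
    # Only configure if not already configured: the lock serializes threads
    # within a worker process, and the handler check makes repeated calls
    # no-ops, so setup can't re-enter the stderr writer mid-write.
    with _config_lock:
        if logging.getLogger().handlers:
            return
        logging.basicConfig(format=LOG_FORMAT)
```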
firehose_stream_name: Optional[str] = None
prometheus_server_address: Optional[str] = None
# On-premises configuration
onprem_redis_host: Optional[str] = None
i'm wondering if there's an easy way to merge `onprem_redis_host` w/ the other `redis_host` arg that already exists.
We could consolidate this by using the existing `redis_host` field and adding logic to detect on-premises vs cloud environments. However, we kept them separate because:
- `redis_host` is used for the message broker (Celery)
- `onprem_redis_host` is used for the cache/result storage
They might point to different Redis instances in some deployments. Let me know if you would like me to combine them.
> redis_host is used for the message broker (Celery)
> onprem_redis_host is used for the cache/result storage

This is a good distinction. I think it's better to make that more explicit in the naming as opposed to marking one as "onprem".
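For example, role-based names along these lines (field names are illustrative, not the PR's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedisSettings:
    # Named by role rather than by environment:
    broker_redis_host: Optional[str] = None  # message broker (Celery)
    cache_redis_host: Optional[str] = None   # cache / result storage
```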
onprem_redis_port: Optional[str] = "6379"
onprem_redis_password: Optional[str] = None
# AWS disable configuration
disable_aws: bool = False
could we instead add a new `cloud_provider == onprem` and change logic based off that?
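Roughly like this (illustrative only; a `cloud_provider` field on the config is the suggestion, not existing code):

```python
if infra_config().cloud_provider == "onprem":
    backend_protocol = "redis"  # no S3 available on-prem
else:
    backend_protocol = "s3"
```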
global celery_servicebus
if celery_servicebus is None:
    # Check if ServiceBus broker is disabled or if we're forcing Redis via config
    if infra_config().disable_servicebus_broker or infra_config().force_celery_redis:
should we just throw an error if the app is configured to use SQS but the environment doesn't support it, instead of adding fallback logic?
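i.e., fail fast with something like this (a sketch; the check location, the `cloud_provider` field, and the message are assumptions):

```python
if broker_type == str(BrokerType.SQS.value) and infra_config().cloud_provider != "aws":
    # Surface the misconfiguration at startup instead of silently
    # swapping in a different broker behind the operator's back.
    raise RuntimeError("App is configured to use the SQS broker, but this environment has no AWS support")
```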
-def session(role: Optional[str], session_type: SessionT = Session) -> SessionT:
+def session(role: Optional[str], session_type: SessionT = Session) -> Optional[SessionT]:
shouldn't need to touch this; this should only be used if cloud_provider == 'aws'
vllm_args.disable_log_requests = True

-vllm_cmd = f"python -m vllm_server --model {final_weights_folder} --served-model-name {model_name} {final_weights_folder} --port 5005"
+vllm_cmd = f"python -m vllm_server --model {final_weights_folder} --served-model-name {model_name} --port 5005"
@dmchoiboi - JFYI: this is where I've hackily changed it to support the public vLLM Docker image.
vllm_cmd = f"python3 -m vllm.entrypoints.openai.api_server --model {final_weights_folder} --served-model-name {model_name} --port 5005"
protocol="http", | ||
readiness_initial_delay_seconds=10, | ||
healthcheck_route="/health", | ||
readiness_initial_delay_seconds=1800, # 30 minutes for large model downloads |
@saeidbarati-scale for context
…build, add v2 sgp workaround adapter, add awscli for model downloads using scality
Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Fixes needed for model-engine to work on-prem:
- AWS Disable Logic: Added DISABLE_AWS=true environment variable support
- Redis Fallback: S3 backend falls back to Redis when AWS is disabled (see the sketch after this list)
- Lazy Initialization: Celery apps are initialized lazily to avoid import-time AWS sessions
- On-Premises Redis: Added direct Redis connection support via ONPREM_REDIS_HOST
- Configurable Gunicorn: Made worker timeouts and settings configurable via environment variables
- Thread-Safe Logging: Fixed potential race conditions in logger initialization
- Broker Disable Options: Added DISABLE_SQS_BROKER and DISABLE_SERVICEBUS_BROKER flags
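A hedged sketch of how the `DISABLE_AWS` flag could gate the result-backend choice (the flag and the Redis fallback are from this PR; the function name is illustrative):

```python
import os

def select_backend_protocol() -> str:
    # On-prem, DISABLE_AWS=true means no S3, so task results go to Redis.
    if os.environ.get("DISABLE_AWS", "").lower() == "true":
        return "redis"
    return "s3"
```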
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.
Tested locally with scripts:
- AWS Disable Logic: Tested DISABLE_AWS=true prevents AWS session creation
- Redis Fallback: Verified S3 backend falls back to Redis when AWS disabled
- Lazy Initialization: Confirmed Celery apps don't create AWS sessions at import time
- Thread-Safe Logging: Tested logger initialization in multi-threaded environments
- Direct Redis Connection: Tested ONPREM_REDIS_HOST configuration
- Server Startup: Confirmed server starts with all cloud services disabled
- Environment Variables: Verified all new config variables work correctly
- Async Task Workflow: Tested complete task submission and result retrieval
- Broker Disable Logic: Verified SQS/ServiceBus fallback to Redis
- Error Handling: Tested graceful handling of missing AWS credentials
- Backward Compatibility: Confirmed existing cloud functionality still works
- No Breaking Changes: Verified no existing APIs are broken