# FlashLabs Chroma 1.0 with SGLang Support

An OpenAI-compatible FastAPI server for the Chroma audio generation model, with configurable data parallelism (`--dp-size`).

## Features
- ✅ OpenAI-compatible `v1/chat/completions` API
- ✅ Configurable `--dp-size` (data parallelism)
- ✅ Audio input/output support (Base64 encoded)
- ✅ Distributed inference support
- ✅ Health check and model listing endpoints
## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

```bash
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model
```

```bash
# Use 2 GPUs for data parallelism
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 2

# Use 4 GPUs for data parallelism
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 4
```

With all options spelled out:

```bash
bash chroma_server.sh \
    --host 0.0.0.0 \
    --port 8000 \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 1
```

### Docker Deployment

```bash
docker pull flashlabs/chroma:latest
docker-compose up -d
```

## API Usage

### Health Check

```bash
curl http://localhost:8000/health
```

### List Models

```bash
curl http://localhost:8000/v1/models
```

### Chat Completions (cURL)

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chroma",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "audio",
            "audio": "assets/question_audio.wav"
          }
        ]
      }
    ],
    "prompt_text": "I have not... I'm so exhausted, I haven't slept in a very long time. It could be because... Well, I used our... Uh, I'm, I just use... This is what I use every day. I use our cleanser every day, I use serum in the morning and then the moistu- daily moisturizer. That's what I use every morning.",
    "prompt_audio": "assets/ref_audio.wav",
    "max_tokens": 1000,
    "temperature": 1.0,
    "return_audio": true
  }'
```

### Chat Completions (Python `requests`)

```python
import requests
import base64


def load_audio_as_base64(file_path):
    with open(file_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')


url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

prompt_audio_base64 = load_audio_as_base64("assets/ref_audio.wav")

payload = {
    "model": "chroma",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please respond to my question."},
                {"type": "audio", "audio": "assets/question_audio.wav"}
            ]
        }
    ],
    "prompt_text": "I have not... I'm so exhausted, I haven't slept in a very long time. It could be because... Well, I used our... Uh, I'm, I just use... This is what I use every day. I use our cleanser every day, I use serum in the morning and then the moistu- daily moisturizer. That's what I use every morning.",
    "prompt_audio": prompt_audio_base64,
    "max_tokens": 1000,
    "temperature": 1.0,
    "return_audio": True
}

response = requests.post(url, json=payload, headers=headers)
result = response.json()

if result.get("audio"):
    audio_data = base64.b64decode(result["audio"])
    with open("output.wav", "wb") as f:
        f.write(audio_data)
    print("Audio saved to output.wav")

print(f"Response: {result}")
```

### Chat Completions (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    api_key="dummy",
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="chroma",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "assets/question_audio.wav"}
            ]
        }
    ],
    extra_body={
        "prompt_text": "I have not... I'm so exhausted, I haven't slept in a very long time. It could be because... Well, I used our... Uh, I'm, I just use... This is what I use every day. I use our cleanser every day, I use serum in the morning and then the moistu- daily moisturizer. That's what I use every morning.",
        "prompt_audio": "assets/ref_audio.wav",
        "return_audio": True
    }
)

print(response)
```

## API Reference

### `GET /`

Root endpoint returning basic server information.
### `GET /health`

```json
{
  "status": "healthy",
  "model_loaded": true
}
```

### `GET /v1/models`

```json
{
  "object": "list",
  "data": [
    {
      "id": "chroma",
      "object": "model",
      "created": 1234567890,
      "owned_by": "chroma"
    }
  ]
}
```

### `POST /v1/chat/completions`

Request parameters:

- `model` (string, required)
- `messages` (array, required)
- `prompt_text` (string, optional) - must be provided together with `prompt_audio`, or both omitted
- `prompt_audio` (string, optional) - must be provided together with `prompt_text`, or both omitted
- `max_tokens` (integer, optional, default: 1000)
- `temperature` (float, optional, default: 1.0)
- `top_p` (float, optional, default: 1.0)
- `return_audio` (boolean, optional, default: true)
- `audio_format` (string, optional, default: `wav`)
Note: `prompt_text` and `prompt_audio` must be provided together. If both are omitted, default values are used.
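Because the server rejects a request that carries only one of the pair, a client can check the rule before sending. A minimal sketch (the helper name is ours, not part of the API):

```python
def validate_prompt_pair(payload: dict) -> None:
    """Raise ValueError unless prompt_text and prompt_audio are
    provided together or both omitted (the server's pairing rule)."""
    has_text = payload.get("prompt_text") is not None
    has_audio = payload.get("prompt_audio") is not None
    if has_text != has_audio:
        raise ValueError(
            "prompt_text and prompt_audio must be provided together "
            "or both omitted"
        )


# Both present: accepted
validate_prompt_pair({"prompt_text": "hi", "prompt_audio": "UklGRg=="})
# Both absent: accepted (server falls back to defaults)
validate_prompt_pair({"model": "chroma"})
```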
Example response:

```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "chroma",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Generated audio (12.24s)"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  },
  "audio": "base64_encoded_audio_data..."
}
```

## Command-Line Arguments

| Argument | Description | Required | Default |
|---|---|---|---|
| `--host` | Bind address | No | `0.0.0.0` |
| `--port` | Server port | No | `8000` |
| `--chroma-model-path` | Path to Chroma model | Yes | - |
| `--dp-size` | Data parallel size | No | `1` |
| `--workers` | Worker processes | No | `1` |
Data parallelism improves throughput for handling multiple concurrent requests:
```bash
# 2 GPUs
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 2

# 4 GPUs
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 4
```

### `--dp-size 1` (single GPU)

```bash
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 1
```

- Lowest latency
- Simplest setup
- Ideal for low concurrency

### `--dp-size 4` (multi-GPU)

```bash
bash chroma_server.sh \
    --chroma-model-path /path/to/chroma/model \
    --dp-size 4
```

- Higher throughput
- Handles concurrent requests
- Recommended for production with high load
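The intuition behind the throughput gain: with data parallelism, each incoming request is handled by one of `dp_size` model replicas, so independent requests proceed in parallel. The toy function below illustrates the idea with a simple round-robin spread; it is illustrative only (SGLang's real scheduler is load-aware, not plain round-robin):

```python
from itertools import cycle


def round_robin_assign(request_ids, dp_size):
    """Illustrative only: spread requests across dp_size replicas
    in round-robin order."""
    ranks = cycle(range(dp_size))
    return {rid: next(ranks) for rid in request_ids}


assignment = round_robin_assign(["r0", "r1", "r2", "r3"], dp_size=2)
# r0, r2 -> replica 0; r1, r3 -> replica 1
```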
## Troubleshooting

Model loading fails:

- Verify model paths
- Ensure sufficient GPU memory
- Check PyTorch and CUDA compatibility

Distributed inference fails:

- GPU count must be ≥ `dp_size`
- Ensure port `29500` is not occupied
- Verify NCCL installation

Audio problems:

- Use supported formats (WAV, MP3, etc.)
- Verify Base64 encoding
- Ensure correct sample rate (default: 24 kHz)
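For WAV input, the sample rate can be checked locally with the standard-library `wave` module before uploading. A minimal sketch (the helper name and file path are placeholders):

```python
import wave


def check_sample_rate(path, expected_hz=24000):
    """Return True if the WAV file's sample rate matches expected_hz."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == expected_hz


# Example: check_sample_rate("assets/ref_audio.wav")
```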
If you see an error about `prompt_text` and `prompt_audio`:
- Either provide both parameters
- Or provide neither (default values will be used)
- Providing only one will result in an error
## License

See the LICENSE file for details.

## Acknowledgments

- Qwen2.5-Omni Team
- SGLang Project
- FastAPI Framework