Inference deployments for TTS and image generation models on modal.com.
Each model runs as a separate FastAPI service with GPU acceleration and persistent volume storage.
Note: These are hobbyist deployments I run in my free time. They're not production-ready - missing authentication, rate limiting, proper error handling, and other production features.
Models:

- `qwen3_tts` - Qwen3-TTS with custom voice, voice design, and voice cloning
- `soprano_tts` - Soprano text-to-speech
- `higgs_audio_v2` - Higgs Audio TTS with voice cloning
- `cosmos_predict2_t2i` - Cosmos Predict2 text-to-image generation
- `omnigen2` - OmniGen2 image generation (text2img, editing, in-context)
- `qwen_image_edit` - Qwen Image Edit for natural-language image editing
- `glm_ocr` - GLM-OCR for high-quality document parsing and OCR
Clone the repository:

```bash
git clone <repository-url>
```

Set up Modal secrets:

- `huggingface-secret` (HF_TOKEN)
- `nvidia-ngc-secret` (NGC_API_KEY) - Required for cosmos_predict2_t2i, higgs_audio_v2, qwen_image_edit
- `aws-s3-secrets` (AWS credentials) - Required for S3 upload endpoints
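Secrets can be created in the Modal dashboard or from the CLI. For example, with placeholder values (the exact AWS key names here are an assumption - match whatever the upload code reads):

```bash
modal secret create huggingface-secret HF_TOKEN=<your-hf-token>
modal secret create nvidia-ngc-secret NGC_API_KEY=<your-ngc-api-key>
modal secret create aws-s3-secrets AWS_ACCESS_KEY_ID=<id> AWS_SECRET_ACCESS_KEY=<secret>  # key names assumed
```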
Each model folder contains an `upload_models.py` script. Update the model IDs in the script, then run:

```bash
modal run <model-folder>/upload_models.py
```

Example:

```bash
modal run qwen3_tts/upload_models.py
```

Deploy the inference service:

```bash
modal deploy <model-folder>/inference.py
```

Example:

```bash
modal deploy qwen3_tts/inference.py
```

Endpoints follow the pattern `https://<your-workspace>--<endpoint-label>.modal.run`.
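Each `inference.py` follows the general shape described at the top: a Modal app whose GPU function serves a FastAPI app, with weights read from a persistent volume. Below is a minimal sketch of that pattern, not code from this repo - the app name, GPU type, volume name, and route are illustrative:

```python
import modal

app = modal.App("example-tts")

# Volume that upload_models.py would have populated with model weights
volume = modal.Volume.from_name("example-models", create_if_missing=True)

image = modal.Image.debian_slim(python_version="3.11").pip_install("fastapi[standard]")

@app.function(
    image=image,
    gpu="A10G",  # illustrative; each model picks its own GPU type
    volumes={"/models": volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
)
@modal.asgi_app(label="example-web-endpoint")  # the label becomes part of the public URL
def web_endpoint():
    from fastapi import FastAPI, Response

    api = FastAPI()

    @api.post("/v1/audio/speech")
    def speech(body: dict):
        # A real service loads the model from /models and runs inference here;
        # this stub returns empty audio so the skeleton stays self-contained.
        return Response(content=b"", media_type="audio/wav")

    return api
```

Deploying this with `modal deploy` would expose it at `https://<your-workspace>--example-web-endpoint.modal.run`, matching the URL pattern above.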
Example requests (replace `<workspace>` with your Modal workspace name):

Qwen3-TTS:

```bash
# Custom voice
curl -X POST https://<workspace>--qwen3-tts-web-endpoint.modal.run/v1/audio/speech/custom-voice \
-H "Content-Type: application/json" \
-d '{"input": "Hello world", "language": "English", "speaker": "Ryan"}' \
--output speech.wav

# Voice design
curl -X POST https://<workspace>--qwen3-tts-web-endpoint.modal.run/v1/audio/speech/voice-design \
-H "Content-Type: application/json" \
-d '{"input": "Text here", "language": "English", "instruct": "Speak in a cheerful tone"}' \
--output speech.wav

# Voice clone
curl -X POST https://<workspace>--qwen3-tts-web-endpoint.modal.run/v1/audio/speech/voice-clone \
-H "Content-Type: application/json" \
-d '{"input": "Text here", "ref_audio": "https://example.com/ref.wav", "ref_text": "Reference transcript"}' \
--output speech.wav
```

Soprano TTS:

```bash
curl -X POST https://<workspace>--soprano-web-endpoint.modal.run/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello world"}' \
--output speech.wav
```

Cosmos Predict2 text-to-image:

```bash
curl -X POST https://<workspace>--cosmos-text2image-web-endpoint.modal.run/generate \
-F "prompt=a beautiful landscape" \
-F "aspect_ratio=16:9" \
--output image.jpg
```

OmniGen2:

```bash
curl -X POST https://<workspace>--omnigen2.modal.run \
-F "prompt=a beautiful landscape" \
-F "width=1024" \
-F "height=1024" \
--output image.png
```

Qwen Image Edit:

```bash
curl -X POST https://<workspace>--qwen-image-edit-endpoint.modal.run/generate \
-F "prompt=add a sunset" \
-F "input_image_url=https://example.com/image.jpg" \
--output output.png
```

Higgs Audio v2:

```bash
curl -X POST https://<workspace>--higgs-audio-web-endpoint.modal.run/generate \
-F "text=Hello world" \
--output audio.wav
```

GLM-OCR:

```bash
# Parse image from URL
curl -X POST https://<workspace>--glm-ocr-api.modal.run/glmocr/parse \
-H "Content-Type: application/json" \
-d '{"images": ["https://example.com/image.jpg"]}'
# Parse image from local file
curl -X POST https://<workspace>--glm-ocr-api.modal.run/glmocr/parse \
-F "files=@/path/to/image.png"- No authentication - endpoints are publicly accessible
- No rate limiting - can be abused
- No monitoring/alerting - failures go unnoticed
- Basic error handling - errors might not be user-friendly
- No request validation - malformed requests can crash services
These work fine for personal projects and testing, but don't use them for anything that needs reliability or security.
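One low-effort mitigation for the missing authentication, if you deploy something similar: Modal can require proxy auth tokens on a web endpoint, rejecting requests that lack valid `Modal-Key`/`Modal-Secret` headers. A sketch, assuming a recent Modal client where the `requires_proxy_auth` flag is available:

```python
import modal

app = modal.App("example-secured")
image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=image)
@modal.asgi_app(requires_proxy_auth=True)  # Modal checks Modal-Key / Modal-Secret headers
def web_endpoint():
    from fastapi import FastAPI

    api = FastAPI()

    @api.get("/health")
    def health():
        return {"ok": True}

    return api
```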
Requirements:

- Modal account with GPU access
- Python 3.11+
- Modal CLI installed (`pip install modal`)
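After installing, authenticate the CLI once so `modal run` and `modal deploy` can reach your workspace:

```bash
pip install modal
modal setup
```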