Inference optimization for LLMs, diffusion, and voice. Self-hosted or cloud. Works on NVIDIA GPUs, Apple Silicon, and edge devices.
Links:
Web App • Docs • Hugging Face • X • LinkedIn • Discord (request invite) • Email
TheStage AI is an inference optimization stack. It helps you compress, compile, and serve models. You keep control of the accuracy versus performance trade-off.
Components:
- **ANNA (Automatic Neural Network Acceleration)**: automated compression analysis under user-defined constraints (size, MACs, latency, memory). Outputs a QlipConfig for compile and serve.
- **Qlip**: full-stack optimization and inference framework. Quantization, sparsification, and compilation for NVIDIA GPUs (Apple Silicon supported). Produces pre-compiled (non-JIT) artifacts with dynamic shapes and mixed precision. Triton-based serving.
- **Elastic Models**: Qlip-optimized models with S / M / L / XL performance tiers (availability varies). L/M/S tiers may include quantization or pruning for faster inference.
- **CLI (`thestage`)**: manage projects, tokens, and hardware from the terminal. Launch and monitor jobs, rent instances, and stream logs.
- **Web App**: web UI and APIs for instances, models, and deployments. Includes the Playground to test Elastic Models, switch hardware, and compare tiers before deployment.
Key features:
- Elastic Models with S/M/L/XL tiers per model (choose cost, quality, and memory balance; availability varies).
- ANNA constraint-driven compression analysis (outputs a QlipConfig for compile and serve).
- Qlip compiler and runtime (pre-compiled engines; no runtime JIT; dynamic shapes; mixed precision).
- OpenAI-compatible HTTP serving (deploy and scale models through a standard API; see the sketch after this list).
- Playground to test models and hardware (compare performance and tiers before deployment).
- Self-host or run in the cloud (use your own infrastructure; keep data private).
- Hardware support: NVIDIA (incl. Jetson), Apple Silicon, and edge targets (NPUs, DSPs, and MCUs per model).
- Comprehensive tutorials and documentation (from setup to evaluation and production).
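Because serving is OpenAI-compatible, the standard `openai` Python client works against a deployment. A minimal sketch; the endpoint, token, and model id are placeholders, so take the real values from your deployment in the web app:

```python
# Minimal sketch: calling a deployment through its OpenAI-compatible API
# with the standard `openai` client. Endpoint, token, and model id below
# are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-deployment-endpoint>/v1",  # placeholder: your deployment URL
    api_key="<YOUR_API_TOKEN>",                        # placeholder: your API token
)

response = client.chat.completions.create(
    model="<deployed-model-name>",  # placeholder: the model id you deployed
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```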
Quick start:
- Install the CLI: `pip install thestage`
- Set your token: `thestage config set --api-token <YOUR_API_TOKEN>` (get it in the web app)
- Use `elastic_models` in your code and choose a tier (S/M/L/XL); see the sketch below.
- Diffusion and voice examples are in the docs.
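A minimal sketch of the `elastic_models` flow, assuming the package mirrors the Hugging Face `transformers` loading API with a tier-selection argument; the exact import path, argument name, and model availability may differ, so check the docs for the real signature:

```python
# Hedged sketch, not verified against the real elastic_models API:
# assumes a transformers-style loader that accepts a tier selector.
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM  # assumed import path

model_name = "mistralai/Mistral-7B-Instruct-v0.3"  # example model; availability varies

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    mode="S",  # assumed tier selector: one of S / M / L / XL
).to("cuda")

inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```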
An OpenAI-compatible API flow with Modal is documented (single- and multi-GPU).
Start here: https://docs.thestage.ai/
Supported hardware:
- NVIDIA GPUs (incl. Jetson where applicable)
- Apple Silicon
- Edge/embedded devices