# inferstack-rs

A high-performance, production-ready inference server written in Rust that supports model versioning, A/B testing, caching, and comprehensive monitoring.
## Features

- Model Versioning: Support multiple model versions with traffic allocation
- A/B Testing: Configure traffic distribution across different model versions
- Caching: Redis-based caching for inference results
- Rate Limiting: Configurable rate limiting per client
- Monitoring: Comprehensive metrics via Prometheus and Grafana dashboards
- Batch Processing: Efficient handling of batch inference requests
- Input Validation: Configurable input size limits and validation
- Health Checks: Built-in health monitoring endpoints
- Graceful Shutdown: Proper shutdown handling with cleanup
## Prerequisites

- Rust (latest stable version)
- Redis (optional, for caching)
- Docker and Docker Compose (optional, for containerization)
## Installation

```bash
# Clone the repository
git clone https://github.com/Pewpenguin/inferstack-rs
cd inferstack-rs

# Build the project
cargo build --release
```
## Configuration

Configure the server using environment variables defined in a `.env` file. Use the `.env.example` file as a reference for the required structure and variable names, and make sure all necessary variables are set before starting the server.
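As an illustrative sketch only, a minimal `.env` might look like the following; the variable names here are hypothetical, so check `.env.example` for the actual keys:

```bash
# Hypothetical variable names -- consult .env.example for the real keys
SERVER_HOST=0.0.0.0
SERVER_PORT=8080
REDIS_URL=redis://127.0.0.1:6379    # optional, enables caching
RATE_LIMIT_PER_MINUTE=120
```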
## Running the Server

```bash
# Run directly
cargo run --release

# Or using Docker
docker-compose up -d
```
## API Endpoints

### Health Check

```
GET /health
```
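To verify the server is up, you can hit the health endpoint; the port here is an assumption, so substitute whatever your configuration specifies:

```bash
# Assumes the server listens on port 8080
curl http://localhost:8080/health
```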
### Inference

```
POST /inference
Content-Type: application/json
```

```json
{
  "input": [[1.0, 2.0, 3.0]],
  "model_version": "v1" // optional
}
```
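A minimal request sketch using `curl` (again assuming port 8080; the response format is not documented here):

```bash
curl -X POST http://localhost:8080/inference \
  -H "Content-Type: application/json" \
  -d '{"input": [[1.0, 2.0, 3.0]], "model_version": "v1"}'
```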
### Batch Inference

```
POST /inference
Content-Type: application/json
```

```json
{
  "input": [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
  ],
  "batch": true,
  "model_version": "v1" // optional
}
```
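The same call as a `curl` sketch (port assumed, as above):

```bash
curl -X POST http://localhost:8080/inference \
  -H "Content-Type: application/json" \
  -d '{"input": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], "batch": true, "model_version": "v1"}'
```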
## Metrics

Metrics are exposed at the `/metrics` endpoint. Key metrics include:

- `inferstack_inference_total`: Total inference requests
- `inferstack_model_version_usage_total`: Usage by model version
- `inferstack_inference_duration_seconds`: Inference latency
- `inferstack_cache_operations_total`: Cache operation statistics
- `inferstack_batch_throughput_items_per_second`: Batch processing performance
A pre-configured Grafana dashboard is available in `monitoring/grafana/dashboards/`.
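A quick way to spot-check the exported metrics from the command line (port assumed):

```bash
# Assumes the server listens on port 8080; adjust to your configuration
curl -s http://localhost:8080/metrics | grep inferstack
```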
## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
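As a sketch of that workflow (the branch name is hypothetical):

```bash
git checkout -b feature/my-feature
git commit -am "Describe your change"
git push origin feature/my-feature
# Then open a Pull Request on GitHub
```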
## License

This project is licensed under the MIT License - see the LICENSE file for details.