This is a portfolio and demonstration project.
Multi-Provider GPU Orchestrator with Credit-based Billing
Intelligent job scheduling across multiple compute providers with double-entry accounting
A production-ready Spring Boot microservices platform for managing AI/ML workloads across multiple cloud GPU providers (RunPod, AWS, GCP, etc.) with automatic cost optimization, credit-based billing, and comprehensive observability.
- Features
- Architecture
- Prerequisites
- Quick Start
- Development
- API Documentation
- Testing
- Deployment
- Troubleshooting
- Contributing
- License
- Multi-Provider Orchestration: Automatically select optimal GPU provider based on cost, latency, and reliability
- Credit-based Billing: Double-entry accounting ledger with hold/debit/refund transactions
- Job Lifecycle Management: Submit → Queue → Provision → Run → Complete with full state tracking
- Provider Abstraction: Plug & play adapter pattern for adding new compute providers
- Security: JWT-based authentication with OAuth2 resource server and scope-based RBAC
- Idempotency: Built-in idempotency key support for safe retry of API requests
- Event-Driven: Transactional outbox pattern with RabbitMQ for reliable event delivery
- Observability: Swagger/OpenAPI documentation, structured logging, and metrics
- API Gateway: REST endpoints with validation, security, and documentation
- Orchestrator: Quote aggregation, provider selection, and job state machine
- Billing Module: Ledger-based accounting with ACID guarantees
- Adapters: RunPod integration (fake adapter for testing)
- Storage Service: Presigned URL generation for S3/blob storage I/O
- Reconciliation: Periodic sync to detect and recover stuck jobs
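The hold/debit/refund flow on a double-entry ledger can be illustrated with a small in-memory sketch. It is not the project's `LedgerService`: the class name, the account naming scheme (`user:{id}:available`, `user:{id}:held`, `platform:revenue`), and the use of whole credits as `long` values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory double-entry ledger sketch; class and account names are hypothetical. */
class LedgerSketch {
    /** One leg of an entry: a signed amount in whole credits applied to an account. */
    record Posting(String account, long amount) {}

    private final Map<String, Long> balances = new HashMap<>();
    private final List<List<Posting>> journal = new ArrayList<>();

    /** Records an entry only if its postings sum to zero (the double-entry invariant). */
    void post(List<Posting> postings) {
        long sum = postings.stream().mapToLong(Posting::amount).sum();
        if (sum != 0) {
            throw new IllegalArgumentException("unbalanced entry, sum=" + sum);
        }
        for (Posting p : postings) {
            balances.merge(p.account(), p.amount(), Long::sum);
        }
        journal.add(List.copyOf(postings));
    }

    /** Moves credits between two accounts as one balanced entry. */
    void transfer(String from, String to, long amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        post(List.of(new Posting(from, -amount), new Posting(to, amount)));
    }

    long balance(String account) { return balances.getOrDefault(account, 0L); }

    int entryCount() { return journal.size(); }

    /** Reserves credits when a job is submitted; fails if the user cannot cover it. */
    void hold(long userId, long amount) {
        if (balance(available(userId)) < amount) {
            throw new IllegalStateException("insufficient credits");
        }
        transfer(available(userId), held(userId), amount);
    }

    /** Charges metered usage against the reservation. */
    void debit(long userId, long amount) { transfer(held(userId), "platform:revenue", amount); }

    /** Releases the unused part of the reservation back to the user. */
    void refund(long userId, long amount) { transfer(held(userId), available(userId), amount); }

    static String available(long userId) { return "user:" + userId + ":available"; }

    static String held(long userId) { return "user:" + userId + ":held"; }
}
```

Because every entry nets to zero, the sum of all account balances stays at zero no matter how many holds, debits, and refunds are posted. A persistent version would write the entry and its postings in one database transaction.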
compute-as-credit/
├── api-gateway # REST API + Security + Swagger
├── orchestrator # Job lifecycle + Provider selection
├── billing # Double-entry ledger
├── domain # Core entities + JPA repositories
├── shared # RabbitMQ config + Events
├── adapters-core # Provider client interface
├── adapters-fake # Mock provider for testing
├── adapters-runpod # RunPod API integration
└── agent-sdk # Client library for AI agents
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ API Gateway│─────▶│ Orchestrator │─────▶│ Billing │
│ (REST+JWT) │ │ (Selection) │ │ (Ledger) │
└─────────────┘ └──────────────┘ └─────────────┘
│ │ │
│ ┌──────▼──────┐ │
│ │ Providers │ │
│ │ (RunPod etc)│ │
│ └─────────────┘ │
│ │
└──────────────▶ MySQL ◀───────────────────┘
│
┌─────▼─────┐
│ RabbitMQ │
│ (Events) │
└───────────┘
- Hexagonal Architecture: Domain-driven design with ports & adapters
- Transactional Outbox: Reliable event publishing without distributed transactions
- Circuit Breaker: Resilience4j for fault tolerance (WIP)
- Strategy Pattern: Pluggable provider selection policies (BalancedPolicy, etc.)
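As a rough illustration of the strategy pattern for provider selection, the sketch below scores quotes on cost, latency, and reliability and picks the lowest score within budget. The `Quote` fields, the weights, and the scoring formula are assumptions for the example; the project's actual `SelectionPolicy` and `BalancedPolicy` may look different.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Strategy-pattern sketch for provider selection; all types here are hypothetical. */
class SelectionSketch {
    /** A provider's offer for one job (fields are assumptions). */
    record Quote(String provider, double pricePerHour, double latencyMs, double successRate) {}

    /** The strategy interface: callers depend on this, not on a concrete policy. */
    interface SelectionPolicy {
        Optional<Quote> select(List<Quote> quotes, double maxPricePerHour);
    }

    /** Weighted score over normalized cost, latency, and failure rate; lower is better. */
    static class BalancedPolicy implements SelectionPolicy {
        private final double costWeight;
        private final double latencyWeight;
        private final double reliabilityWeight;

        BalancedPolicy(double costWeight, double latencyWeight, double reliabilityWeight) {
            this.costWeight = costWeight;
            this.latencyWeight = latencyWeight;
            this.reliabilityWeight = reliabilityWeight;
        }

        private double score(Quote q, double maxPrice, double maxLatency) {
            return costWeight * (q.pricePerHour() / maxPrice)
                    + latencyWeight * (q.latencyMs() / maxLatency)
                    + reliabilityWeight * (1.0 - q.successRate());
        }

        @Override
        public Optional<Quote> select(List<Quote> quotes, double maxPricePerHour) {
            // Drop quotes over budget, then normalize against the worst remaining values.
            List<Quote> affordable = quotes.stream()
                    .filter(q -> q.pricePerHour() <= maxPricePerHour)
                    .toList();
            double maxPrice = Math.max(1e-9,
                    affordable.stream().mapToDouble(Quote::pricePerHour).max().orElse(1.0));
            double maxLatency = Math.max(1e-9,
                    affordable.stream().mapToDouble(Quote::latencyMs).max().orElse(1.0));
            return affordable.stream()
                    .min(Comparator.comparingDouble((Quote q) -> score(q, maxPrice, maxLatency)));
        }
    }
}
```

A cheapest-only or fastest-only policy would be another `SelectionPolicy` implementation, which is what makes the policy pluggable.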
- Java 17+ (tested with Temurin 17.0.9)
- Docker & Docker Compose (for MySQL + RabbitMQ)
- Gradle 8.5+ (wrapper included)
- Git (for version control)
- IntelliJ IDEA or any Java IDE
- Postman or curl for API testing
- Python 3 (for JWT token generation script)
git clone <repository-url>
cd compute-as-credit
# Start MySQL + RabbitMQ
make up
# Verify containers are running
docker compose ps
# Build all modules
./gradlew clean build -x test
# Run API Gateway (http://localhost:8080)
make run
# OR
./gradlew :api-gateway:bootRun
Open Swagger UI: http://localhost:8080/swagger-ui.html
Or via curl:
# Generate JWT token (for development)
# Use https://jwt.io to create a token with:
# - Algorithm: HS256
# - Secret: dev-secret
# - Payload: {"sub": "user1", "scope": "jobs:read jobs:write"}
export TOKEN="your-generated-jwt-token"
# Submit a job
curl -X POST http://localhost:8080/v1/jobs \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-H "Idempotency-Key: unique-key-123" \
-d '{
"userId": 1,
"agentSpec": "{\"image\":\"ghcr.io/your-org/agent:1.0\",\"cmd\":[\"python\",\"train.py\"]}",
"resourceHint": "{\"gpuType\":\"A100-80G\",\"spotOk\":true}",
"maxBudget": 50.0
}'
# Get job status
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8080/v1/jobs/1
api-gateway/
- JobController - REST endpoints (submit, get, allocate I/O)
- SecurityConfig - JWT + OAuth2 resource server
- JobApiModels - DTO records (SubmitReq, SubmitRes, JobRes)
- IdempotencyService - Request deduplication
orchestrator/
- JobOrchestrator - Core job lifecycle management
- QuoteService - Provider price aggregation
- SelectionPolicy + BalancedPolicy - Provider selection
- OutboxPublisher - RabbitMQ event publishing
- StorageService - S3 presigned URL generation
- UsagePollingService - Periodic usage polling
- Reconciler - Stuck job recovery
billing/
- LedgerEntities - Account, Entry, Posting entities
- LedgerRepos - JPA repositories
- LedgerService - Double-entry accounting logic
domain/
- Job, JobStatus, Provider, OutboxEvent - Core entities
- JobRepository, OutboxEventRepository - JPA repositories
shared/
- DomainEvents - Event records (JobSubmitted, JobStarted, etc.)
- RabbitConfig - Exchange + queue setup
adapters-*
- ProviderClient - Provider abstraction interface
- RunPodClient - RunPod API integration
- FakeProviderClient - Mock for testing
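To show what request deduplication by idempotency key amounts to, here is a minimal in-memory sketch. It is not the project's `IdempotencyService`: the class name and the choice to cache responses in a `ConcurrentHashMap` are assumptions, and a real service would more likely persist keys in the database with an expiry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** In-memory idempotency sketch (hypothetical; not the project's IdempotencyService). */
class IdempotencySketch<R> {
    private final ConcurrentMap<String, R> responses = new ConcurrentHashMap<>();

    /**
     * Runs the action at most once per key and replays the stored response afterwards.
     * A missing key means the caller opted out, so the action always runs.
     * The action must return a non-null response, otherwise nothing is cached.
     */
    R execute(String idempotencyKey, Supplier<R> action) {
        if (idempotencyKey == null || idempotencyKey.isBlank()) {
            return action.get();
        }
        // computeIfAbsent is atomic per key, so concurrent retries share one result.
        return responses.computeIfAbsent(idempotencyKey, key -> action.get());
    }
}
```

With this shape, a client that retries `POST /v1/jobs` with the same `Idempotency-Key` gets the original response back instead of creating a second job.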
# Run all tests (Testcontainers needs a running Docker daemon)
./gradlew test
# Run specific module tests
./gradlew :api-gateway:test
# Skip tests during build
./gradlew build -x test
Flyway migrations are in domain/src/main/resources/db/migration/:
V1__init.sql # Initial schema (jobs, providers, ledger, outbox, etc.)
Migrations run automatically on application startup.
- Create adapter in adapters-{provider}/
- Implement ProviderClient interface
- Add @Component annotation
- Update QuoteService to fetch quotes
- Add integration test with WireMock
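The steps above can be sketched as follows. This README does not list the methods of `ProviderClient`, so the interface below is a hypothetical stand-in with made-up method names, and the price table is invented for the example. In a real module the adapter class would also carry Spring's `@Component` annotation, which is left as a comment here to keep the sketch free of dependencies.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Adapter sketch; the interface shape is an assumption, not the project's ProviderClient. */
class ProviderAdapterSketch {
    record Quote(String provider, String gpuType, double pricePerHour) {}

    /** Hypothetical port that the orchestrator would depend on. */
    interface ProviderClient {
        String name();
        Optional<Quote> quote(String gpuType);
        String launch(String image, String gpuType);
        String status(String externalId);
    }

    /** In-memory adapter. A real one would call the provider's HTTP API. */
    // @Component  (add in a Spring module so the adapter is discovered)
    static class InMemoryProviderClient implements ProviderClient {
        // Made-up hourly prices, only for the example.
        private final Map<String, Double> prices = Map.of("A100-80G", 1.89);
        private final Map<String, String> runs = new ConcurrentHashMap<>();
        private final AtomicLong sequence = new AtomicLong();

        @Override
        public String name() { return "in-memory"; }

        @Override
        public Optional<Quote> quote(String gpuType) {
            return Optional.ofNullable(prices.get(gpuType))
                    .map(price -> new Quote(name(), gpuType, price));
        }

        @Override
        public String launch(String image, String gpuType) {
            if (!prices.containsKey(gpuType)) {
                throw new IllegalArgumentException("unsupported GPU type: " + gpuType);
            }
            String externalId = name() + "-" + sequence.incrementAndGet();
            runs.put(externalId, "RUNNING");
            return externalId;
        }

        @Override
        public String status(String externalId) {
            return runs.getOrDefault(externalId, "UNKNOWN");
        }
    }
}
```

Keeping the orchestrator coded against the interface is what lets a new provider be added without touching job lifecycle code.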
All endpoints require JWT Bearer token with appropriate scopes:
- jobs:read - View job status
- jobs:write - Submit jobs
Token Generation (development only):
Visit jwt.io and create a token with:
- Algorithm: HS256
- Secret: dev-secret
- Payload: { "sub": "user1", "scope": "jobs:read jobs:write", "exp": 9999999999 }
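As an alternative to pasting values into jwt.io, the same development token can be produced locally with JDK classes only. This is a sketch under two assumptions: the class name `DevTokenSketch` is made up, and the secret is treated as raw UTF-8 bytes, which may not match how the gateway is configured to read it.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Builds an HS256-signed JWT with the JDK only. Development use; names are hypothetical. */
class DevTokenSketch {
    /** JWTs use URL-safe Base64 without padding. */
    static String base64Url(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** Returns header.payload.signature for the given JSON payload and shared secret. */
    static String sign(String payloadJson, String secret) {
        String header = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
        String signingInput = base64Url(header.getBytes(StandardCharsets.UTF_8))
                + "." + base64Url(payloadJson.getBytes(StandardCharsets.UTF_8));
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            // Assumption: the secret is used as raw UTF-8 bytes.
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] signature = mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8));
            return signingInput + "." + base64Url(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key rejected", e);
        }
    }

    public static void main(String[] args) {
        String payload = "{\"sub\":\"user1\",\"scope\":\"jobs:read jobs:write\",\"exp\":9999999999}";
        System.out.println(sign(payload, "dev-secret"));
    }
}
```

Running `java DevTokenSketch.java` prints a token that can be exported as `TOKEN`. A short secret such as `dev-secret` is only suitable for local development.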
POST /v1/jobs
Content-Type: application/json
Authorization: Bearer {token}
Idempotency-Key: {unique-key} # Optional
{
"userId": 1,
"agentSpec": "{\"image\":\"...\"}", # JSON string
"resourceHint": "{\"gpuType\":\"A100-80G\"}", # JSON string
"maxBudget": 100.0
}
Response: 200 OK
{
"jobId": 123,
"status": "QUEUED"
}
GET /v1/jobs/{id}
Authorization: Bearer {token}
Response: 200 OK
{
"jobId": 123,
"status": "RUNNING",
"providerId": 5
}
POST /v1/jobs/{id}/io
Authorization: Bearer {token}
Response: 200 OK
{
"uploadUrl": "https://...",
"downloadUrl": "https://...",
"inputUri": "s3://tenant/123/input/",
"outputUri": "s3://tenant/123/output/",
"expiresAt": "2025-10-02T12:00:00Z"
}
SUBMITTED → QUEUED → PROVISIONING → RUNNING → SUCCEEDED
↓
FAILED
↓
CANCELLED
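The lifecycle above can be encoded as an enum with an explicit transition table. The diagram does not say which states may branch to FAILED or CANCELLED, so this sketch assumes every non-terminal state can. The name `JobStateSketch` is made up to avoid implying this is the project's `JobStatus`.

```java
import java.util.EnumSet;
import java.util.Set;

/** Job state machine sketch; the FAILED/CANCELLED branches are an assumption. */
enum JobStateSketch {
    SUBMITTED, QUEUED, PROVISIONING, RUNNING, SUCCEEDED, FAILED, CANCELLED;

    /** States that may directly follow this one. */
    Set<JobStateSketch> next() {
        return switch (this) {
            case SUBMITTED -> EnumSet.of(QUEUED, FAILED, CANCELLED);
            case QUEUED -> EnumSet.of(PROVISIONING, FAILED, CANCELLED);
            case PROVISIONING -> EnumSet.of(RUNNING, FAILED, CANCELLED);
            case RUNNING -> EnumSet.of(SUCCEEDED, FAILED, CANCELLED);
            case SUCCEEDED, FAILED, CANCELLED -> EnumSet.noneOf(JobStateSketch.class);
        };
    }

    boolean canTransitionTo(JobStateSketch target) {
        return next().contains(target);
    }

    /** Terminal states have no successors. */
    boolean isTerminal() {
        return next().isEmpty();
    }

    /** Returns the target state, or throws if the move is not allowed. */
    JobStateSketch transitionTo(JobStateSketch target) {
        if (!canTransitionTo(target)) {
            throw new IllegalStateException("illegal transition " + this + " -> " + target);
        }
        return target;
    }
}
```

Rejecting illegal moves in one place keeps a late provider callback from moving a job that has already finished back to RUNNING.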
# API Gateway integration test (Testcontainers)
./gradlew :api-gateway:test
# WireMock test for RunPod adapter
./gradlew :adapters-runpod:test
- Import Swagger spec: http://localhost:8080/v3/api-docs
- Set Authorization: Bearer token (from jwt.io)
- Test endpoints
# Already includes MySQL + RabbitMQ
docker compose up -d
- Database: Use managed MySQL (AWS RDS, Cloud SQL)
- Message Queue: Use managed RabbitMQ (CloudAMQP) or switch to Kafka
- Secrets: Use AWS Secrets Manager / Vault (not .env)
- JWT Secret: Generate strong 256-bit key
- Monitoring: Add Prometheus + Grafana
- Logging: Use structured JSON logs → ELK/Splunk
- High Availability: Deploy multiple API Gateway instances behind load balancer
# Database
DB_URL=jdbc:mysql://localhost:3306/compute
DB_USER=root
DB_PASS=root
# RabbitMQ
RABBIT_HOST=localhost
RABBIT_PORT=5672
# Security
JWT_SECRET=your-strong-256-bit-secret
Cause: Gradle dependency resolution issue.
Fix:
./gradlew clean build --refresh-dependencies
Cause: JAVA_HOME not set or wrong Java version.
Fix:
export JAVA_HOME=/path/to/java17
java -version # Should show Java 17
Cause: Docker container not running or port conflict.
Fix:
docker compose ps # Check if mysql container is up
docker compose logs mysql # Check logs
lsof -i :3306 # Check if port 3306 is available
Cause: RabbitMQ not started or wrong credentials.
Fix:
docker compose ps # Check if rabbitmq container is up
# Access management UI: http://localhost:15672
# Default credentials: guest/guest
Cause: Token expired or wrong secret.
Fix:
# Regenerate token at https://jwt.io
# Paste token and verify with secret: dev-secret
# Ensure 'exp' claim is in the future
Cause: Schema already exists or migration checksum mismatch.
Fix:
# Drop and recreate database
docker compose down -v
docker compose up -d
# Wait 10s, then restart app
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'feat: add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
- Follow Spring Boot best practices
- Use meaningful variable names
- Add Javadoc for public APIs
- Write tests for new features
feat: Add new feature
fix: Bug fix
docs: Documentation update
refactor: Code refactoring
test: Add tests
chore: Build/config changes
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Spring Boot team for the excellent framework
- Testcontainers for integration testing
- WireMock for HTTP mocking
- All contributors and open-source maintainers
Built using Java 17, Spring Boot 3, and modern cloud-native practices.
For questions or support, please open an issue on GitHub.