This project is a lightweight, vanilla-Python prototype showing patterns and building blocks for distributed AI inference: KV cache management, disaggregated inference, batching, sharding, and runtime optimizations.
- Coordinator / Orchestrator — receives inference requests, performs routing, model selection, and scheduling.
- Worker nodes — run model sessions and handle batched inference. Each worker exposes a simple RPC API (TCP sockets / asyncio) implemented with stdlib.
- Load Balancer — distributes requests across worker nodes using configurable strategies (round-robin, least connections, etc.) and monitors worker health.
- KV cache management — shared request/response cache with LRU eviction; optional SQLite persistence.
- Disaggregated inference — separate processes for model serving and for pre/post-processing to demonstrate a vertical split of concerns.
- Sharding & routing — split model or requests across workers (simple hash-based sharding), plus fallback routing.
- Batching & coalescing — combine compatible queries into larger batches to improve throughput.
- Lightweight telemetry — request tracing and basic metrics (latency, throughput) using stdlib `time` and `logging`.
```
distributed-inference-engine/
├── README.md              # this file
├── src/
│   ├── coordinator.py     # main API server + scheduler
│   ├── router.py          # routing + sharding logic
│   ├── load_balancer.py   # worker load balancing and health monitoring
│   ├── batcher.py         # request coalescing and batching
│   ├── worker.py          # worker process that loads model and serves requests
│   ├── kvstore.py         # in-memory KV cache with LRU + optional sqlite persistence
│   ├── model_registry.py  # registry for model metadata, endpoints and shards
│   ├── preproc.py         # example pre-processing utilities
│   ├── postproc.py        # example post-processing utilities
│   └── utils.py           # helpers: serialization, tracing, metrics
├── examples/
│   ├── example_client.py  # simple client that sends requests
│   └── demo_config.yaml   # sample configuration
├── tests/
│   └── test_basic_flow.py
├── docs/
│   └── design.md
└── run.sh
```
- Python 3.10+ (stdlib-only prototype). Optional: `uvloop`, `aiohttp`, `grpcio` if you want to replace stdlib primitives later.
- For real model inference, plug in PyTorch/TF/ONNX; the orchestration still uses vanilla Python.
- Starts an asyncio TCP server (or HTTP endpoint) to accept inference requests.
- Validates the request format and consults the `kvstore` for cache hits.
- On a miss, pushes the request to the `batcher` and either returns a `request_id` for polling or pushes the result back to the client.
- Responsible for global scheduling decisions and retries.
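A minimal sketch of that accept loop, assuming newline-delimited JSON framing and hypothetical `kvstore`/`batcher` objects whose `get()`/`submit()` interfaces mirror the descriptions elsewhere in this README (names are illustrative, not the exact module APIs):

```python
import asyncio
import json
import uuid

async def handle_request(reader: asyncio.StreamReader,
                         writer: asyncio.StreamWriter,
                         kvstore, batcher) -> None:
    """Accept one newline-delimited JSON request; answer from cache or enqueue it."""
    request = json.loads(await reader.readline())

    cached = kvstore.get(request["key"])           # cache hit -> reply immediately
    if cached is not None:
        response = {"status": "hit", "result": cached}
    else:
        request_id = str(uuid.uuid4())             # cache miss -> hand off to the batcher
        await batcher.submit({**request, "request_id": request_id})
        response = {"status": "queued", "request_id": request_id}

    writer.write((json.dumps(response) + "\n").encode())
    await writer.drain()
    writer.close()

async def serve(kvstore, batcher, host: str = "127.0.0.1", port: int = 9000) -> None:
    """Run the coordinator's TCP accept loop."""
    server = await asyncio.start_server(
        lambda r, w: handle_request(r, w, kvstore, batcher), host, port)
    async with server:
        await server.serve_forever()
```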
- Implements model-to-worker mapping.
- Hash-based sharding: e.g., `shard = hash(request.key) % N`.
- Handles model/shard routing and failover.
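One caveat with `hash(request.key) % N`: Python's built-in `hash()` is randomized per process for strings, so the coordinator and workers would disagree on shard placement. A small sketch using a stable digest instead (function names are illustrative):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a request key to a shard index with a digest that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def route(key: str, workers: list[str]) -> str:
    """Pick the worker address that owns this key's shard."""
    return workers[shard_for(key, len(workers))]

# Example: route("user-42", ["127.0.0.1:9001", "127.0.0.1:9002"])
```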
- Distributes requests across worker nodes using configurable strategies:
- Round-robin
- Least connections
- Random
- Least latency
- Monitors worker health with configurable timeouts and failure thresholds
- Tracks performance metrics (latency, error rates, etc.)
- Supports dynamic worker registration/deregistration
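Two of the listed strategies, sketched as interchangeable picker objects (illustrative only; the real module also folds in health checks and metric tracking):

```python
import itertools

class RoundRobinBalancer:
    """Cycle through the registered workers in order."""

    def __init__(self, workers: list[str]):
        self._cycle = itertools.cycle(workers)

    def pick(self) -> str:
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send the next request to the worker with the fewest in-flight requests."""

    def __init__(self, workers: list[str]):
        self._inflight = {w: 0 for w in workers}

    def pick(self) -> str:
        return min(self._inflight, key=self._inflight.get)

    def acquire(self, worker: str) -> None:    # call when a request is dispatched
        self._inflight[worker] += 1

    def release(self, worker: str) -> None:    # call when the response (or error) arrives
        self._inflight[worker] -= 1
```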
- Buffers requests per target-model.
- Flush policy: max batch size OR max latency.
- Coalesces inputs into a batch payload; fans model outputs back out into per-request results.
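A minimal sketch of the flush policy for a single model's buffer, assuming an async `flush_fn` callback (a hypothetical name) that actually dispatches the assembled batch to a worker:

```python
import asyncio

class Batcher:
    """Buffer requests for one model; flush on max batch size OR max latency."""

    def __init__(self, flush_fn, max_batch: int = 8, max_delay: float = 0.02):
        self._flush_fn = flush_fn      # async callable taking a list of requests
        self._max_batch = max_batch
        self._max_delay = max_delay    # seconds before a partial batch is flushed
        self._pending: list = []
        self._timer: asyncio.Task | None = None

    async def submit(self, request) -> None:
        self._pending.append(request)
        if len(self._pending) >= self._max_batch:
            await self._flush()                          # size threshold reached
        elif self._timer is None:
            self._timer = asyncio.create_task(self._flush_later())

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._max_delay)             # latency threshold reached
        self._timer = None
        await self._flush()

    async def _flush(self) -> None:
        if self._timer is not None:
            self._timer.cancel()                         # cancel the pending timer
            self._timer = None
        batch, self._pending = self._pending, []
        if batch:
            await self._flush_fn(batch)
```

Size-triggered and time-triggered flushes share one code path, so whichever threshold is hit first wins.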
- Loads model artifacts (mocked in vanilla prototype).
- Runs preproc -> infer -> postproc pipeline.
- Can register with coordinator (simple registration handshake).
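A sketch of the registration handshake and the per-batch pipeline, assuming a JSON-lines handshake format (illustrative; the actual wire format is defined by `worker.py` and `coordinator.py`):

```python
import asyncio
import json

async def register_with_coordinator(coord_host: str, coord_port: int,
                                    worker_port: int, model: str) -> None:
    """Announce this worker's address and model to the coordinator."""
    reader, writer = await asyncio.open_connection(coord_host, coord_port)
    hello = {"type": "register", "port": worker_port, "model": model}
    writer.write((json.dumps(hello) + "\n").encode())
    await writer.drain()
    ack = json.loads(await reader.readline())   # wait for the coordinator's ack, e.g. {"ok": true}
    writer.close()
    await writer.wait_closed()

def run_pipeline(batch: list, preproc, infer, postproc) -> list:
    """preproc -> infer -> postproc over one batch of requests."""
    prepared = [preproc(item) for item in batch]
    return [postproc(output) for output in infer(prepared)]
```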
- In-memory dict + doubly linked list for LRU eviction.
- TTL support per entry and optional persistence.
- API: `get(key)`, `set(key, value, ttl=None)`, `delete(key)`, `exists(key)`.
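A compact sketch of that behavior using `collections.OrderedDict` (which supplies the doubly linked list internally) instead of a hand-rolled one; SQLite persistence is omitted:

```python
import time
from collections import OrderedDict

class KVCache:
    """In-memory LRU cache with optional per-entry TTL."""

    def __init__(self, capacity: int = 1024):
        self._capacity = capacity
        self._data = OrderedDict()            # key -> (value, expires_at or None)

    def get(self, key):
        if key not in self._data:
            return None
        value, expires_at = self._data[key]
        if expires_at is not None and time.monotonic() > expires_at:
            del self._data[key]               # lazily drop expired entries
            return None
        self._data.move_to_end(key)           # mark as most recently used
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)    # evict the least recently used entry

    def delete(self, key):
        self._data.pop(key, None)

    def exists(self, key):
        return self.get(key) is not None
```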
- Stores model metadata: version, shards, input schema, batchable flag, quantized flag.
- Used by router/coordinator to route correctly and optimize batching.
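One possible shape for registry entries, sketched with `dataclasses` (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    """Metadata the router and batcher consult before dispatching a request."""
    name: str
    version: str
    shards: list[str] = field(default_factory=list)   # worker addresses, one per shard
    input_schema: dict = field(default_factory=dict)
    batchable: bool = True
    quantized: bool = False

class ModelRegistry:
    def __init__(self):
        self._models: dict[str, ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        self._models[entry.name] = entry

    def lookup(self, name: str) -> ModelEntry | None:
        return self._models.get(name)
```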
- Pure Python transformations (tokenization mock, normalization).
- Demonstrates push-down of CPU-bound preprocessing to a separate process to reduce worker load.
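A sketch of the mock transformations and the push-down, using `concurrent.futures.ProcessPoolExecutor` to keep CPU-bound work off the serving process (function names are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess(text: str) -> dict:
    """Mock tokenization + normalization: lowercase, strip, split on whitespace."""
    normalized = text.strip().lower()
    return {"text": normalized, "tokens": normalized.split()}

def preprocess_batch(texts: list[str]) -> list[dict]:
    """Run preprocessing in a separate process pool so the worker stays responsive."""
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(preprocess, texts))
```

On platforms that spawn rather than fork new processes (macOS, Windows), call `preprocess_batch` from under an `if __name__ == "__main__":` guard.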
- Serialization helpers: `serialize(obj)`, `deserialize(bytes)` using `pickle` + length-prefixed frames.
- Simple tracer for request IDs and timings.
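A sketch of the framing helpers; the 4-byte big-endian length prefix is an assumption for illustration, and `pickle` should only cross the wire between trusted processes:

```python
import pickle
import struct

def serialize(obj) -> bytes:
    """Pickle the object and prepend a 4-byte big-endian length prefix."""
    payload = pickle.dumps(obj)
    return struct.pack(">I", len(payload)) + payload

def deserialize(frame: bytes):
    """Inverse of serialize(); assumes the frame holds exactly one message."""
    (length,) = struct.unpack(">I", frame[:4])
    return pickle.loads(frame[4:4 + length])

async def read_frame(reader) -> object:
    """Read one length-prefixed message from an asyncio StreamReader."""
    header = await reader.readexactly(4)
    (length,) = struct.unpack(">I", header)
    return pickle.loads(await reader.readexactly(length))
```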
- Batching vs latency: larger batches improve throughput but raise tail latency. Make the batching policy configurable per model.
- Load balancing strategies: different strategies (round-robin, least connections, etc.) offer tradeoffs between simplicity and optimal resource utilization.
- Health checking: aggressive health checking detects failures quickly but adds overhead; balance based on system requirements.
- Consistency: KV cache is eventually consistent across nodes in this prototype. For stronger consistency, replace with a consensus-backed store.
- Sharding: simple hash sharding is easy but uneven; consistent hashing or range sharding can be added later.
- Fault tolerance: load balancer automatically removes unhealthy workers; coordinator retries on worker failures; consider adding a replicated model registry or leader election for production.
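The prototype ships only the modulo scheme above; a consistent-hashing ring is one way the later upgrade could look (a sketch, with virtual nodes to smooth the key distribution):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys map to the next node clockwise on a hash ring, so adding or
    removing a worker only remaps roughly 1/N of the keys."""

    def __init__(self, nodes: list[str], vnodes: int = 64):
        points = []
        for node in nodes:
            for i in range(vnodes):                    # virtual nodes per worker
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]
```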
- Clone the repo and create a virtual environment: `python -m venv .venv && source .venv/bin/activate`
- Start one or more workers: `python src/worker.py --port 9001 --model default`
- Start the coordinator: `python src/coordinator.py --listen-port 9000 --worker 127.0.0.1:9001`
- Try the example client: `python examples/example_client.py`
`tests/test_basic_flow.py` should spin up an in-process coordinator and worker (using `multiprocessing`) and assert the request -> response flow and KV cache hits.
- Advanced load balancing:
- Weighted load balancing based on worker capabilities
- Locality-aware request routing
- Predictive autoscaling based on traffic patterns
- Enhanced monitoring:
- Real-time dashboard for system metrics
- Anomaly detection for performance issues
- Distributed tracing across components
- Integration:
- Add support for real inference frameworks (PyTorch/TensorFlow/ONNX)
- Secure channels (TLS) for RPC
- Service mesh integration (e.g., Istio, Linkerd)
- High availability:
- Leader election for the coordinator using `multiprocessing`-based consensus or `etcd`
- Multi-region deployment support
- Stateful failover for in-flight requests
- Optimizations:
- Adaptive routing for "hot" models
- GPU-aware scheduling and resource accounting
- Request prioritization and preemption