High-performance C++/CUDA inference runtime for Vision-Language-Action (VLA) models.
ServoFlow targets real-time robot control at 50 Hz — something Python-based frameworks cannot reliably achieve due to interpreter and GIL overhead. It is the foundation for a full-stack (model + software + hardware) robotics VLA deployment solution.
Status: Phase 1 — framework core + CUDA backend + RDT-1B support.
| Goal | How |
|---|---|
| Real-time (≥50 Hz) | CUDA Graph loop capture, condition caching, static memory pool |
| Hardware-agnostic | IBackend abstraction; CUDA now, ROCm / Metal / TensorRT planned |
| Open architecture | Clean layered design; model Zoo and hardware backends are plug-in |
| Production quality | Typed APIs, zero hidden allocation in hot path, comprehensive tests |
```
Python bindings (optional)      ← pybind11
C API (servoflow.h)             ← stable ABI
─────────────────────────────────────────
InferenceEngine                 ← condition cache · CUDA Graph · async D2H
  ├── FlowMatchingSampler       ← Euler ODE (RDT-1B / π0)
  │     └── DDIMSampler         ← DDIM (DDPM-trained models)
  └── IVLAModel                 ← RDT-1B · (more planned)
─────────────────────────────────────────
Operator Library
  Attention (FlashAttention v2 on CUDA) · GEMM (cuBLAS) · LayerNorm · RMSNorm
  GELU · SiLU · Embedding · Cast · Cat · Softmax
─────────────────────────────────────────
IBackend
  ├── CUDABackend (Phase 1)     ← memory pool · streams · CUDA Graph
  ├── ROCmBackend (planned)
  ├── MetalBackend (planned)
  └── TensorRTBackend (planned)
─────────────────────────────────────────
Core: Tensor · Shape · DType · Device · Storage
```
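The `IBackend` seam is what keeps the upper layers hardware-agnostic. As a rough illustration (method names here are hypothetical, not ServoFlow's actual API), each backend implements raw allocation, deallocation, and synchronization behind one abstract interface, with `CUDABackend` as the first concrete implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical sketch of the IBackend abstraction: every accelerator
// backend exposes allocation and synchronization behind one interface,
// so the engine, samplers, and operators never touch CUDA directly.
struct IBackend {
    virtual ~IBackend() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void  deallocate(void* ptr) = 0;
    virtual void  synchronize() = 0;  // e.g. stream sync on a GPU backend
    virtual std::string name() const = 0;
};

// A host-memory stand-in, occupying the slot where CUDABackend would
// plug in (which would wrap cudaMallocAsync, streams, and graphs).
struct HostBackend final : IBackend {
    void* allocate(std::size_t bytes) override { return std::malloc(bytes); }
    void  deallocate(void* ptr) override { std::free(ptr); }
    void  synchronize() override {}   // no-op: host memory is synchronous
    std::string name() const override { return "host"; }
};
```

A design like this lets the planned ROCm, Metal, and TensorRT backends slot in without touching the engine layer.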
- Condition cache — vision + language encoding runs once per scene, not once per step. Saves ~60–80 % of total compute.
- CUDA Graph capture — the entire denoising loop (N steps × DiT forward) is captured as a CUDA Graph on the first call and replayed on subsequent calls, eliminating CPU kernel-launch overhead.
- Static memory pool — all intermediate tensors are pre-allocated at engine init; zero `cudaMalloc`/`cudaFree` in the hot path.
- FlashAttention v2 — O(S) memory, 2–4× faster attention vs. standard SDPA on Ampere+.
- Multi-stream overlap — vision encoding and denoising run on separate CUDA streams, synchronised via a lightweight CUDA event.
- Pinned host output — action result is transferred from GPU to pinned host memory for minimum D2H latency.
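The condition-cache idea can be sketched in a few lines: the expensive vision and language encoders run only when the scene changes, and every other control step reuses the cached embedding. This is an illustrative CPU sketch (class and method names are hypothetical), not ServoFlow's implementation:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative condition cache: re-encode only on a scene-key change.
// At 50 Hz control with scenes lasting many steps, the encoder cost is
// amortised to near zero per step.
class ConditionCache {
public:
    using Encoder = std::function<std::vector<float>(const std::string&)>;
    explicit ConditionCache(Encoder enc) : enc_(std::move(enc)) {}

    const std::vector<float>& get(const std::string& scene_key) {
        if (scene_key != key_) {        // cache miss: new scene
            cached_ = enc_(scene_key);  // run the heavy encoders once
            key_ = scene_key;
            ++misses_;
        }
        return cached_;                 // hit: zero encoder cost
    }
    int misses() const { return misses_; }

private:
    Encoder enc_;
    std::string key_;
    std::vector<float> cached_;
    int misses_ = 0;
};
```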
| Model | Sampler | Status |
|---|---|---|
| RDT-1B | Flow Matching (Euler) | Phase 1 target |
| OpenVLA | DDIM | Planned |
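The flow-matching sampler in the table above boils down to an explicit Euler ODE loop: starting from noise, integrate dx/dt = v(x, t) over t ∈ [0, 1] in N equal steps, where each velocity evaluation is one DiT forward pass. A minimal sketch, with an illustrative velocity-field signature (ServoFlow's real sampler operates on device tensors inside a CUDA Graph):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Euler integration of the flow-matching ODE dx/dt = v(x, t).
// Each call to v corresponds to one DiT forward pass in the real runtime.
std::vector<float> euler_sample(
    std::vector<float> x, int num_steps,
    const std::function<std::vector<float>(const std::vector<float>&, float)>& v) {
    const float dt = 1.0f / static_cast<float>(num_steps);
    for (int i = 0; i < num_steps; ++i) {
        const float t = static_cast<float>(i) * dt;
        const std::vector<float> vel = v(x, t);  // model forward pass
        for (std::size_t j = 0; j < x.size(); ++j)
            x[j] += dt * vel[j];                 // Euler update
    }
    return x;
}
```

With a constant velocity field v ≡ 1 and x₀ = 0, the loop integrates to exactly 1.0 regardless of step count, which makes a handy sanity check.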
- CMake ≥ 3.22
- CUDA Toolkit ≥ 12.0 (for CUDA backend)
- GCC ≥ 11 or Clang ≥ 14
- (Optional) FlashAttention v2 for best attention performance
```bash
git clone https://github.com/your-org/servoflow.git
cd servoflow
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DSF_CUDA_ARCHS="86"   # RTX 3090 = sm_86
cmake --build build -j$(nproc)
```

With FlashAttention:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DSF_USE_FLASH_ATTN=ON \
      -DFLASH_ATTN_ROOT=/path/to/flash-attention
cmake --build build -j$(nproc)
```

Run the tests:

```bash
ctest --test-dir build --output-on-failure
```

Run the benchmarks:

```bash
# Attention microbenchmark
./build/benchmarks/bench_attention

# End-to-end pipeline benchmark (stub model, N denoising steps)
./build/benchmarks/bench_pipeline 10
```

ServoFlow achieves a 1.66x speedup over optimized PyTorch (FP16) for RDT-1B inference, thanks to aggressive operator fusion, zero-overhead memory management, and CUDA Graph execution.
| Metric | PyTorch (FP16) | ServoFlow (FP16) | Speedup |
|---|---|---|---|
| Loop Latency (10 steps) | 551.48 ms | 332.40 ms | 1.66x |
| Per-step Latency | 55.15 ms | 33.24 ms | 1.66x |
| Control Freq | 1.81 Hz | 3.01 Hz | 1.66x |
Alignment Accuracy:
- Max Error: 1.95e-03 (FP16)
- Cosine Similarity: 1.000001
- Status: Verified against the HuggingFace `rdt-1b` PyTorch implementation.
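The alignment metrics above can be reproduced with two straightforward reductions over the flattened output tensors: max absolute elementwise error and cosine similarity. A plain C++ sketch of those checks (not ServoFlow's actual test code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Max absolute elementwise difference between two output vectors.
float max_abs_error(const std::vector<float>& a, const std::vector<float>& b) {
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}

// Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means perfectly aligned.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Values marginally above 1.0 (as in the table) come from FP16/FP32 accumulation rounding in the reduction itself.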
Key Optimizations:
- CUDA Graph: Captures the entire denoising loop (10 steps × 28 blocks) into a single graph launch, eliminating CPU overhead.
- Memory Pool: Custom `cudaMallocAsync`-based memory pool ensures zero allocation overhead during inference.
- Operator Fusion: Fused `Add+RMSNorm` and `GEMM+Bias+Act` kernels minimize memory bandwidth usage.
- FlashAttention: Zero-allocation integration of FlashAttention v2.
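To make the `Add+RMSNorm` fusion concrete, here is a CPU reference of its semantics: the residual add and the normalisation share one pass over the row, so the intermediate sum is never written back to memory between the two ops. This is a reference for the math only (assumed standard RMSNorm with a learned scale); the actual CUDA kernel parallelises the row reduction:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference semantics of a fused Add+RMSNorm over one row:
//   h = x + residual
//   out = h / sqrt(mean(h^2) + eps) * weight
// Fusing means h is produced and consumed in registers/shared memory,
// halving the global-memory traffic of running the two ops separately.
std::vector<float> add_rmsnorm(const std::vector<float>& x,
                               const std::vector<float>& residual,
                               const std::vector<float>& weight,
                               float eps = 1e-6f) {
    const std::size_t n = x.size();
    std::vector<float> out(n);
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = x[i] + residual[i];            // fused residual add
        sum_sq += out[i] * out[i];
    }
    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(n) + eps);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = out[i] * inv_rms * weight[i];  // normalise + scale
    return out;
}
```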
To run benchmarks:

```bash
./run_gpu_comparison.sh

# Or run the C++ inference benchmark directly:
./build/examples/rdt1b_inference /path/to/checkpoint 50
```

Project layout:

```
servoflow/
├── include/servoflow/    # Public headers (stable API)
│   ├── core/             # Tensor, Shape, DType, Device, Storage
│   ├── backend/          # IBackend interface + CUDA header
│   ├── ops/              # Operator declarations
│   ├── models/           # IVLAModel + model configs
│   ├── sampling/         # ISampler, FlowMatchingSampler, DDIMSampler
│   └── engine/           # InferenceEngine, VLAInput/Output, EngineConfig
├── src/                  # Implementations
│   ├── backend/cuda/     # CUDABackend + CUDA kernels
│   ├── sampling/         # Sampler implementations
│   └── engine/           # InferenceEngine orchestration
├── tests/                # Unit + integration tests (GoogleTest)
├── benchmarks/           # GPU microbenchmarks + pipeline benchmark
├── examples/             # Usage examples
└── tools/convert/        # HuggingFace → ServoFlow weight converter
```
| Model | Task | PyTorch (Eager) | ServoFlow (C++) | Speedup |
|---|---|---|---|---|
| Octo-Small | Denoise Step (MLP) | 0.220 ms | 0.070 ms | 3.14x |
| RDT-1B | DiT Block (FWD) | 1.8 ms | 1.2 ms | 1.5x |
Tested on NVIDIA GeForce RTX 3090.
Phase 1 (Completed)
- Core C++ Tensor & Autograd Engine
- FlashAttention integration
- Flow Matching sampler with CUDA Graph capture
- InferenceEngine with condition cache
- RDT-1B model weight loader (safetensors)
- RDT-1B DiT block implementation
- Benchmark vs diffusers + TensorRT pipeline (PyTorch baseline)
Phase 2
- Octo Support (Standard Transformer Diffusion Policy)
  - Transformer Backbone + FiLM/Cross-Attn (Pre-LN Block implemented)
  - Diffusion Head (MLP)
  - ViT Encoder Integration (Placeholder / Interface ready)
  - Backend Ops: UnpackQKV, Permute
- Dita Support (Native DiT Policy)
  - Scalable Diffusion Transformer Architecture
- Backend support for INT8 dequantization (Weight-Only)
- INT4 / INT8 quantization for LLM backbone
- ROCm backend
- Jetson / edge hardware optimisation
- TensorRT backend
Apache 2.0 — see LICENSE.