High-performance vector database benchmarks and deployment on Google Cloud Run with GPU acceleration (NVIDIA L4).
- Overview
- Prerequisites
- Quick Start
- Step-by-Step Tutorial
- Deployment Options
- Benchmarking
- Architecture
- API Reference
- Troubleshooting
This example provides:
- GPU-Accelerated Benchmarks: SIMD (AVX-512, AVX2, NEON) and CUDA optimized operations
- Cloud Run Deployment: Scalable, serverless deployment with GPU support
- Multiple Deployment Models:
  - Single-node benchmark service
  - Attention/GNN inference service
  - Raft consensus cluster (3+ nodes)
  - Primary-replica replication
| Capability | Description | Cloud Run Support |
|---|---|---|
| Core Vector Search | HNSW indexing, k-NN search | ✅ Full GPU |
| Attention Mechanisms | Multi-head attention layers | ✅ Full GPU |
| GNN Inference | Graph neural network forward pass | ✅ Full GPU |
| Raft Consensus | Distributed consensus protocol | ✅ Multi-service |
| Replication | Primary-replica data replication | ✅ Multi-service |
| Quantization | INT8/PQ compression | ✅ GPU optimized |
```bash
# Google Cloud CLI
curl https://sdk.cloud.google.com | bash
gcloud init

# Docker
# Install from: https://docs.docker.com/get-docker/

# Rust (for local development)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

```bash
# Authenticate
gcloud auth login

# Set project
gcloud config set project YOUR_PROJECT_ID

# Enable required APIs
gcloud services enable \
  run.googleapis.com \
  containerregistry.googleapis.com \
  cloudbuild.googleapis.com \
  compute.googleapis.com
```

```bash
cd examples/google-cloud

# Setup and deploy
./deploy.sh setup
./deploy.sh build Dockerfile.gpu latest
./deploy.sh push latest
./deploy.sh deploy latest true   # true = GPU enabled

# Run benchmark
./deploy.sh benchmark ruvector-benchmark quick
```

```bash
# Get service URL
URL=$(gcloud run services describe ruvector-benchmark \
  --region=us-central1 \
  --format='value(status.url)')

# Test endpoints
curl $URL/health
curl $URL/info
curl -X POST $URL/benchmark/quick
```

```bash
# Clone the repository
git clone https://github.com/ruvnet/ruvector.git
cd ruvector/examples/google-cloud

# Set environment variables
export GCP_PROJECT_ID="your-project-id"
export GCP_REGION="us-central1"

# Run setup
./deploy.sh setup
```

Option A: Local Build (faster iteration)
```bash
# Build locally
./deploy.sh build Dockerfile.gpu latest

# Push to Container Registry
./deploy.sh push latest
```

Option B: Cloud Build (no local Docker required)
```bash
# Build in the cloud
./deploy.sh build-cloud Dockerfile.gpu latest
```

Basic Deployment (with GPU)
```bash
./deploy.sh deploy latest true
```

Custom Configuration
```bash
# High-memory configuration for large vector sets
MEMORY=16Gi CPU=8 ./deploy.sh deploy latest true

# Scale settings
MIN_INSTANCES=1 MAX_INSTANCES=20 ./deploy.sh deploy latest true
```

```bash
# Quick benchmark (128d, 10k vectors)
./deploy.sh benchmark ruvector-benchmark quick

# Distance computation benchmark
./deploy.sh benchmark ruvector-benchmark distance

# HNSW index benchmark
./deploy.sh benchmark ruvector-benchmark hnsw

# Full benchmark suite
./deploy.sh benchmark ruvector-benchmark full
```

```bash
# Get all results
./deploy.sh results ruvector-benchmark

# View logs
./deploy.sh logs ruvector-benchmark

# Check service status
./deploy.sh status
```

Best for: Development, testing, single-user benchmarks
```bash
./deploy.sh deploy latest true
```

Best for: Neural network inference, embedding generation
```bash
./deploy.sh deploy-attention latest
```

Features:
- 16GB memory for large models
- 3-layer GNN with 8 attention heads
- Optimized for batch inference
Best for: High availability, consistent distributed state
```bash
# Deploy 3-node cluster
CLUSTER_SIZE=3 ./deploy.sh deploy-raft

# Deploy 5-node cluster for higher fault tolerance
CLUSTER_SIZE=5 ./deploy.sh deploy-raft
```

Architecture:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Node 1    │◄───►│   Node 2    │◄───►│   Node 3    │
│  (Leader)   │     │ (Follower)  │     │ (Follower)  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    Raft Consensus
```
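Why 3- or 5-node clusters: Raft commits through a majority quorum, so a cluster of N nodes survives ⌊(N−1)/2⌋ failures, and even cluster sizes add cost without adding fault tolerance. A quick sketch of that arithmetic (plain Python, not part of `deploy.sh`):

```python
def raft_fault_tolerance(cluster_size: int) -> int:
    """Number of node failures a Raft cluster can survive
    while still able to form a majority quorum."""
    quorum = cluster_size // 2 + 1      # majority of the cluster
    return cluster_size - quorum        # nodes that may be lost

# 3 nodes tolerate 1 failure, 4 still only 1, 5 tolerate 2.
for n in (3, 4, 5):
    print(n, raft_fault_tolerance(n))
```

This is why `CLUSTER_SIZE=4` is rarely worth deploying: it costs a fourth node but tolerates no more failures than three.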
Configuration:
```bash
# Environment variables for Raft nodes
RUVECTOR_NODE_ID=0                    # Node identifier (0, 1, 2, ...)
RUVECTOR_CLUSTER_SIZE=3               # Total cluster size
RUVECTOR_RAFT_ELECTION_TIMEOUT=150    # Election timeout (ms)
RUVECTOR_RAFT_HEARTBEAT_INTERVAL=50   # Heartbeat interval (ms)
```

Best for: Read scaling, geographic distribution
```bash
# Deploy with 3 replicas
./deploy.sh deploy-replication 3
```

Architecture:
```
              ┌─────────────┐
   Writes───► │   Primary   │
              └──────┬──────┘
                     │ Replication
    ┌────────────────┼────────────────┐
    ▼                ▼                ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  Replica 1  │ │  Replica 2  │ │  Replica 3  │
└─────────────┘ └─────────────┘ └─────────────┘
       │               │               │
       └───────────────┴───────────────┘
             Reads (load balanced)
```
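With reads fanned out across replicas, the client (or a proxy) needs some policy for picking an endpoint. A minimal round-robin sketch — the replica URLs are placeholders, and `itertools.cycle` stands in for whatever load balancing the real deployment uses:

```python
from itertools import cycle

# Hypothetical replica endpoints; real URLs come from Cloud Run.
REPLICAS = [
    "https://ruvector-replica-0-xxx.run.app",
    "https://ruvector-replica-1-xxx.run.app",
    "https://ruvector-replica-2-xxx.run.app",
]

_next_replica = cycle(REPLICAS)

def read_endpoint() -> str:
    """Pick the next replica for a read; writes always go to the primary."""
    return next(_next_replica)

# Three consecutive reads hit three different replicas.
print([read_endpoint() for _ in range(3)])
```

With `RUVECTOR_SYNC_MODE=async` a replica may briefly lag the primary, so read-your-writes workloads should route those reads to the primary instead.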
Configuration:
```bash
# Primary node
RUVECTOR_MODE=primary
RUVECTOR_REPLICATION_FACTOR=3
RUVECTOR_SYNC_MODE=async   # or "sync" for strong consistency

# Replica nodes
RUVECTOR_MODE=replica
RUVECTOR_PRIMARY_URL=https://ruvector-primary-xxx.run.app
```

| Benchmark | Description | Dimensions | Vector Count |
|---|---|---|---|
| `quick` | Fast sanity check | 128 | 10,000 |
| `distance` | Distance computation | configurable | configurable |
| `hnsw` | HNSW index search | configurable | configurable |
| `gnn` | GNN forward pass | 256 | 10,000 nodes |
| `cuda` | CUDA kernel perf | - | - |
| `quantization` | INT8/PQ compression | configurable | configurable |
```bash
# Quick benchmark
curl -X POST https://YOUR-SERVICE-URL/benchmark/quick

# Custom distance benchmark
curl -X POST "https://YOUR-SERVICE-URL/benchmark/distance?dims=768&num_vectors=100000&batch_size=64"

# Custom HNSW benchmark
curl -X POST "https://YOUR-SERVICE-URL/benchmark/hnsw?dims=768&num_vectors=100000&k=10"

# Full custom benchmark
curl -X POST https://YOUR-SERVICE-URL/benchmark \
  -H "Content-Type: application/json" \
  -d '{
    "dims": 768,
    "num_vectors": 100000,
    "num_queries": 1000,
    "k": 10,
    "benchmark_type": "hnsw"
  }'
```

NVIDIA L4 GPU (Cloud Run default):
| Operation | Dimensions | Vectors | P99 Latency | QPS |
|---|---|---|---|---|
| L2 Distance | 128 | 10k | 0.5ms | 2,000 |
| L2 Distance | 768 | 100k | 5ms | 200 |
| HNSW Search | 128 | 100k | 1ms | 1,000 |
| HNSW Search | 768 | 1M | 10ms | 100 |
| GNN Forward | 256 | 10k nodes | 15ms | 66 |
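The QPS column here is the single-stream rate implied by the P99 latency (QPS ≈ 1000 / P99 ms), which is useful to keep in mind when comparing your own runs; a quick check of the table's arithmetic, with the values transcribed from above:

```python
# (p99_ms, reported_qps) pairs transcribed from the table above.
rows = [(0.5, 2000), (5, 200), (1, 1000), (10, 100), (15, 66)]

for p99_ms, qps in rows:
    implied = 1000 / p99_ms              # single-stream throughput
    # Within 2% for every row (66 vs 66.7 for the GNN line).
    assert abs(implied - qps) / qps < 0.02, (p99_ms, qps, implied)

print("QPS ≈ 1000 / P99 for every row")
```

Concurrent clients can exceed these single-stream figures, since multiple requests overlap on the GPU.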
The benchmark automatically detects and uses:
| Architecture | SIMD | Vector Width | Speedup |
|---|---|---|---|
| x86_64 | AVX-512 | 16 floats | 8-16x |
| x86_64 | AVX2 | 8 floats | 4-8x |
| x86_64 | SSE4.1 | 4 floats | 2-4x |
| ARM64 | NEON | 4 floats | 2-4x |
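Runtime detection amounts to inspecting CPU feature flags and picking the widest available instruction set first. A simplified model of that dispatch — the actual selection happens in the Rust sources, and the flag names here are the ones Linux reports in `/proc/cpuinfo`:

```python
# Preference order mirrors the table above: widest SIMD first.
PREFERENCE = [
    ("avx512f", "AVX-512"),
    ("avx2", "AVX2"),
    ("sse4_1", "SSE4.1"),
    ("neon", "NEON"),   # ARM64; /proc/cpuinfo reports this as "asimd"
]

def pick_simd(cpu_flags: set) -> str:
    """Return the best SIMD tier supported by the given CPU flags."""
    for flag, name in PREFERENCE:
        if flag in cpu_flags:
            return name
    return "scalar"

print(pick_simd({"sse4_1", "avx2"}))             # AVX2
print(pick_simd({"sse4_1", "avx2", "avx512f"}))  # AVX-512
```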
```
┌─────────────────────────────────────────────────────────────────┐
│                           Cloud Run                             │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ HTTP Server │  │  Benchmark  │  │    SIMD/GPU Runtime     │  │
│  │   (Axum)    │  │   Engine    │  │ AVX-512 │ CUDA │ NEON   │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────────┬────────┘  │
│         │                │                          │           │
│  ┌──────┴────────────────┴──────────────────────────┴────────┐  │
│  │                      RuVector Core                        │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────────────┐     │  │
│  │  │  HNSW  │ │  GNN   │ │ Quant  │ │   Attention    │     │  │
│  │  │ Index  │ │ Layers │ │  INT8  │ │   Multi-Head   │     │  │
│  │  └────────┘ └────────┘ └────────┘ └────────────────┘     │  │
│  └───────────────────────────────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                        NVIDIA L4 GPU                            │
└─────────────────────────────────────────────────────────────────┘
```
```
examples/google-cloud/
├── Cargo.toml          # Rust dependencies
├── Dockerfile.gpu      # GPU-optimized Docker image
├── cloudrun.yaml       # Cloud Run service configs
├── deploy.sh           # Deployment automation
├── README.md           # This file
└── src/
    ├── main.rs         # CLI entry point
    ├── benchmark.rs    # Benchmark implementations
    ├── simd.rs         # SIMD-optimized operations
    ├── cuda.rs         # GPU/CUDA operations
    ├── report.rs       # Report generation
    └── server.rs       # HTTP server for Cloud Run
```
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | API info and available endpoints |
| GET | `/health` | Health check |
| GET | `/info` | System information (GPU, SIMD, memory) |
| POST | `/benchmark` | Run custom benchmark |
| POST | `/benchmark/quick` | Run quick benchmark |
| POST | `/benchmark/distance` | Run distance benchmark |
| POST | `/benchmark/hnsw` | Run HNSW benchmark |
| GET | `/results` | Get all benchmark results |
| POST | `/results/clear` | Clear stored results |
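Before driving benchmarks, a client can confirm both health and GPU attachment in one `/health` call. A small readiness check using the response shape documented in this README — the sample JSON is hard-coded here so the sketch stays self-contained; in practice it would come from an HTTP GET against the service URL:

```python
import json

def gpu_ready(health_json: str) -> bool:
    """True if the service reports healthy AND a GPU was attached.
    Field names follow the /health response documented in this README."""
    body = json.loads(health_json)
    return body.get("status") == "healthy" and body.get("gpu_available", False)

# Sample body matching the example /health response.
sample = ('{"status": "healthy", "version": "0.1.0", "gpu_available": true,'
          ' "gpu_name": "NVIDIA L4", "simd_capability": "AVX2", "uptime_secs": 3600}')
print(gpu_ready(sample))  # True
```

A `healthy` status with `gpu_available: false` usually means the service started without the GPU annotations (see Troubleshooting below).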
Example `/health` response:

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "gpu_available": true,
  "gpu_name": "NVIDIA L4",
  "simd_capability": "AVX2",
  "uptime_secs": 3600
}
```

Example `/benchmark` request:

```json
{
  "dims": 768,
  "num_vectors": 100000,
  "num_queries": 1000,
  "k": 10,
  "benchmark_type": "hnsw"
}
```

Example `/benchmark` response:

```json
{
  "status": "success",
  "message": "Benchmark completed",
  "result": {
    "name": "hnsw_768d_100000v",
    "operation": "hnsw_search",
    "dimensions": 768,
    "num_vectors": 100000,
    "mean_time_ms": 2.5,
    "p50_ms": 2.1,
    "p95_ms": 3.8,
    "p99_ms": 5.2,
    "qps": 400.0,
    "memory_mb": 585.9,
    "gpu_enabled": true
  }
}
```

1. GPU not detected
```bash
# Check GPU availability
gcloud run services describe ruvector-benchmark \
  --region=us-central1 \
  --format='yaml(spec.template.metadata.annotations)'

# Ensure GPU annotations are present:
#   run.googleapis.com/gpu-type: nvidia-l4
#   run.googleapis.com/gpu-count: "1"
```

2. Container fails to start
```bash
# Check logs
./deploy.sh logs ruvector-benchmark 200

# Common causes:
# - Missing CUDA libraries (use nvidia/cuda base image)
# - Memory limit too low (increase MEMORY env var)
# - Health check failing (check /health endpoint)
```

3. Slow cold starts
```bash
# Set minimum instances
MIN_INSTANCES=1 ./deploy.sh deploy latest true

# Enable startup CPU boost (already in cloudrun.yaml)
```

4. Out of memory
```bash
# Increase memory allocation
MEMORY=16Gi ./deploy.sh deploy latest true

# Or reduce vector count in benchmark
curl -X POST "$URL/benchmark?num_vectors=50000"
```

- Enable CPU boost for cold starts: `run.googleapis.com/startup-cpu-boost: "true"`
- Disable CPU throttling: `run.googleapis.com/cpu-throttling: "false"`
- Use the Gen2 execution environment: `run.googleapis.com/execution-environment: gen2`
- Tune concurrency based on workload:
  - CPU-bound: lower concurrency (10-20)
  - Memory-bound: medium concurrency (50-80)
  - I/O-bound: higher concurrency (100+)
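The concurrency ranges above can be cross-checked with Little's law: requests in flight ≈ target QPS × mean latency. A back-of-envelope helper, with illustrative rather than measured numbers:

```python
import math

def concurrency_for(target_qps: float, mean_latency_ms: float) -> int:
    """Little's law: in-flight requests = arrival rate x time in system.
    Rounded up so the target QPS is actually reachable."""
    return math.ceil(target_qps * mean_latency_ms / 1000)

# 400 QPS at 2.5 ms mean latency needs only ~1 in-flight request,
# while 1000 QPS at 50 ms needs ~50.
print(concurrency_for(400, 2.5))   # 1
print(concurrency_for(1000, 50))   # 50
```

Setting the Cloud Run concurrency well above this estimate mostly buys queueing inside the container, not throughput.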
```bash
# Remove all RuVector services
./deploy.sh cleanup

# Remove specific service
gcloud run services delete ruvector-benchmark --region=us-central1

# Remove container images
gcloud container images delete gcr.io/PROJECT_ID/ruvector-benchmark
```

| Configuration | vCPU | Memory | GPU | Cost/hour |
|---|---|---|---|---|
| Basic | 2 | 4GB | None | ~$0.10 |
| GPU Standard | 4 | 8GB | L4 | ~$0.80 |
| GPU High-Mem | 8 | 16GB | L4 | ~$1.20 |
| Raft Cluster (3) | 6 | 12GB | None | ~$0.30 |
Costs are approximate and vary by region. See Cloud Run Pricing.
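Because Cloud Run bills instance time rather than wall-clock time (and scales to zero when `MIN_INSTANCES=0`), a monthly estimate only needs billed hours per day. A rough estimator using the approximate rates from the table above:

```python
RATES_PER_HOUR = {            # approximate, from the table above
    "basic": 0.10,
    "gpu-standard": 0.80,
    "gpu-highmem": 1.20,
    "raft-3": 0.30,
}

def monthly_cost(config: str, billed_hours_per_day: float, days: int = 30) -> float:
    """Estimate monthly cost from billed instance-hours per day."""
    return RATES_PER_HOUR[config] * billed_hours_per_day * days

# A GPU Standard service busy 8 h/day:
print(round(monthly_cost("gpu-standard", 8), 2))  # 192.0
```

With `MIN_INSTANCES=1` the instance is billed around the clock, so use 24 billed hours per day in that case.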
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run benchmarks to verify performance
5. Submit a pull request
MIT License - see LICENSE for details.