We provide a full documentation site (MkDocs). See docs/ and the published site: Resk-Caching Docs.
Resk-Caching is a Bun-based backend library/server designed to cache Large Language Model (LLM) responses using vector databases, significantly reducing API costs while maintaining response quality and relevance.
Resk-Caching provides four core benefits of intelligent LLM caching for building and scaling AI applications:
- Up to 90% reduction in LLM API costs through intelligent semantic caching
- Real-time cost tracking with provider-specific pricing (OpenAI, Anthropic, Google, etc.)
- ROI analysis showing exact savings from cache hits vs API calls
- Cost breakdown by provider, model, and time period
- Automatic savings calculation for every cached response
- Sub-5ms response times for cached queries vs 500ms+ for API calls
- Intelligent cache warming strategies (popular, recent, predictive)
- Real-time performance monitoring with benchmarking and optimization recommendations
- Slow query detection with automated performance suggestions
- Cache hit rate optimization through advanced similarity algorithms
- OpenAI-compatible API for offline development without API costs
- Mock LLM provider with customizable responses and scenarios
- Automated testing scenarios with validation and metrics
- Zero-cost development workflows with realistic API simulation
- Circuit breaker patterns for resilient application development
- Cache-first request handling that reduces pressure on upstream API rate limits
- Circuit breaker patterns with automatic failover and recovery
- Health monitoring and real-time system status
- Automatic scaling with proactive cache warming for traffic spikes
- Graceful degradation when external services fail
- Pre-populated Response Database: You maintain a database of high-quality LLM responses to common queries, stored as vector embeddings
- Semantic Matching: When a new query arrives, the system finds the most semantically similar cached response
- Cost Savings: Returns cached responses instead of making new API calls
- Response Selection: Advanced algorithms allow you to choose specific responses based on business logic, user preferences, or A/B testing strategies
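The four steps above amount to a cache-first control flow. The sketch below is illustrative only; embed, search, store, callLLM, and select are hypothetical stand-ins for your embedding model, vector store, LLM client, and variant selector:
// Shapes assumed for illustration only.
type StoredResponse = { id: string; text: string };
type Match = { responses: StoredResponse[]; similarity: number };

interface Deps {
  embed(text: string): Promise<number[]>;              // your embedding model
  search(embedding: number[]): Promise<Match | null>;  // your vector store (KNN)
  store(query: string, embedding: number[], text: string): Promise<void>;
  callLLM(query: string): Promise<string>;             // the real (paid) API call
  select(responses: StoredResponse[]): StoredResponse; // variant selection strategy
}

// Cache-first flow: serve a stored response on a sufficiently similar match,
// otherwise call the LLM once and cache the result for next time.
export async function answer(query: string, deps: Deps, threshold = 0.8): Promise<string> {
  const embedding = await deps.embed(query);
  const match = await deps.search(embedding);
  if (match && match.similarity >= threshold) {
    return deps.select(match.responses).text;  // cache hit: no API cost
  }
  const fresh = await deps.callLLM(query);     // cache miss: pay once
  await deps.store(query, embedding, fresh);   // future similar queries hit the cache
  return fresh;
}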
All Four GPTCache-Style Benefits Implemented:
- Massive Cost Reduction: Up to 90% savings with real-time ROI tracking
- Performance Optimization: Sub-5ms responses with intelligent cache warming
- Development Environment: OpenAI-compatible API for offline testing
- Scalability & Availability: Circuit breakers and automatic failover
- LLM Response Caching: Store and retrieve LLM responses using vector similarity matching
- Multiple Cache Backends: in-memory, SQLite (local persistence), Redis (multi-instance)
- Advanced Response Selection: Deterministic, weighted, and randomized response selection algorithms
- Vector Database Integration: Optimized for semantic search and similarity matching
- AES-GCM Encryption: Secure cache-at-rest protection (optional via env key)
- JWT-Protected API: Secure access with rate limiting and abuse prevention
- OpenAPI 3.1: Auto-generated API documentation from Zod schemas
- Performance Monitoring: Prometheus metrics and OpenTelemetry tracing
- Real-time Updates: WebSockets for instant response distribution
- GPTCache: Great Python-first cache. Resk-Caching focuses on Bun/TypeScript, ships with JWT-secured HTTP API, OpenAPI generation, built-in Prometheus/OTEL, and optional authenticated-at-rest encryption out of the box.
- ModelCache: Provides a semantic cache layer. Resk-Caching adds production concerns (rate-limit, security wrapper, metrics, tracing, OpenAPI, WebSockets) and pluggable backends with zero-code switching via CACHE_BACKEND.
- Upstash Semantic Cache: Managed vector-backed cache. Resk-Caching is open-source, self-hosted by default, and can run fully local with SQLite or purely in-memory while retaining encryption and observability.
- Redis LangCache: Managed Redis-based semantic cache. Resk-Caching supports Redis natively via Bun's RESP3 client while also offering SQLite and in-memory modes for portability and offline development.
- SemantiCache (FAISS): FAISS-native library. Resk-Caching prioritizes a secure, observable HTTP surface with variant selection strategies and can integrate external vector DBs; no GPU dependency required.
If you need a secure, auditable cache service with operational tooling for teams, Resk-Caching is purpose-built for that surface.
- LLM Response Storage: Store pre-computed LLM responses with their vector embeddings for fast retrieval
- Caching Backends: Choose between low-latency memory, local persistence (SQLite), or distributed (Redis) based on your scale
- Response Selection Algorithms: Implement deterministic, weighted, or randomized response selection based on business logic
- Vector Similarity Matching: Find the most semantically similar cached response to incoming queries
- AES-GCM Encryption: Protect sensitive LLM responses at rest with authenticated encryption
- JWT + Rate Limiting: Secure API access and prevent abuse while maintaining performance
- Zod + OpenAPI: Ensure data validation and provide always-in-sync API documentation
- Performance Monitoring: Track cache hit rates, response times, and cost savings in real-time
- Real-time Distribution: Instantly distribute responses across multiple instances and clients
Before using Resk-Caching, you need to have a vector database ready with pre-computed LLM responses. This is the foundation of the caching system:
- Response Database: Create a collection of high-quality LLM responses to common queries
- Vector Embeddings: Generate vector embeddings for each response using your preferred embedding model (see the sketch below)
- Metadata Storage: Store additional context like response quality scores, categories, or business rules
- Similarity Index: Ensure your vector database has proper indexing for fast similarity search
Recommended Vector Databases:
- Pinecone: Excellent for production use with high performance
- Weaviate: Open-source with great similarity search capabilities
- Qdrant: Fast and efficient for real-time applications
- Chroma: Simple local development and testing
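To generate the embeddings called for above, any embedding model works as long as the dimension stays consistent across your system. A minimal sketch, assuming the official openai npm package and the text-embedding-ada-002 model referenced later in the configuration section:
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Compute one embedding per response text; keep the dimension consistent everywhere.
async function embedResponses(texts: string[]): Promise<number[][]> {
  const res = await client.embeddings.create({
    model: "text-embedding-ada-002", // 1536 dimensions
    input: texts,
  });
  return res.data.map((d) => d.embedding);
}

// Example: embeddings for a few pre-approved "thank you" responses.
const vectors = await embedResponses(["You're welcome!", "My pleasure!", "No problem at all!"]);
console.log(vectors[0].length); // 1536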
# as a library (npm)
npm install resk-caching
# as a library (bun)
bun add resk-caching

# Install dependencies
bun install
# Start the server
bun run dev
# The server will be available at http://localhost:3000
- Choose your key-value cache backend:
  - CACHE_BACKEND=memory for local/dev
  - CACHE_BACKEND=sqlite for single-node durability
  - CACHE_BACKEND=redis for distributed/multi-instance
- Choose your vector search strategy for semantic features:
- Default: in-memory vector store (process-local)
- Production: external vector DB (Pinecone/Qdrant/Weaviate/Chroma)
- Alternative: Redis RediSearch vectors or SQLite vector extensions
- Ingest responses and embeddings (see Ingestion or scripts/ingest-example.ts).
- Call /api/semantic/store and /api/semantic/search.
By default, semantic embeddings live in memory. To power vector search with Redis or SQLite, see the guides below.
Use Redis Stack with RediSearch for vector similarity.
Example index and KNN search (1536-dim float32 cosine):
# Create index
redis-cli FT.CREATE idx:llm ON HASH PREFIX 1 llm: SCHEMA \
query TEXT \
embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE \
category TAG SORTABLE \
metadata TEXT
# Insert (embedding must be raw float32 bytes)
redis-cli HSET llm:thank-you query "thank you" category "gratitude" \
embedding "$BINARY_FLOAT32" metadata "{\"tone\":\"friendly\"}"
# KNN search
redis-cli FT.SEARCH idx:llm "*=>[KNN 5 @embedding $vec AS score]" \
PARAMS 2 vec "$QUERY_EMBED_FLOAT32" \
  SORTBY score DIALECT 2 RETURN 3 query category score
Notes:
- Convert number[] → Float32Array → raw bytes for the embedding field (see the sketch below).
- Keep response variants in a secondary key (e.g., llm:<id>:responses) and run variant selection after KNN.
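The conversion from the first note is a one-liner; how the bytes reach Redis depends on your client, which must send them as a binary-safe argument:
// number[] -> Float32Array -> raw bytes, as RediSearch expects for FLOAT32 vectors.
function toFloat32Bytes(embedding: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(embedding).buffer);
}

// Pass the result as the embedding HSET value and as the PARAMS query vector;
// a 1536-dim index expects exactly 1536 * 4 bytes.
const queryEmbedding: number[] = [0.11, 0.19, 0.29]; // in practice, 1536 floats
const bytes = toFloat32Bytes(queryEmbedding);
console.log(bytes.byteLength); // dimension * 4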
Ship SQLite with a vector extension, then create a VSS table and join with metadata:
CREATE VIRTUAL TABLE vss_entries USING vss0(
id TEXT PRIMARY KEY,
embedding(1536)
);
CREATE TABLE llm_entries (
id TEXT PRIMARY KEY,
query TEXT NOT NULL,
category TEXT,
metadata TEXT
);
Insert and search:
-- insert: embedding blob is Float32 (vss_f32)
INSERT INTO vss_entries(id, embedding) VALUES (?, vss_f32(?));
INSERT INTO llm_entries(id, query, category, metadata) VALUES(?, ?, ?, ?);
-- KNN
SELECT e.id, l.query, vss_distance(e.embedding, vss_f32(?)) AS score
FROM vss_entries e
JOIN llm_entries l ON l.id = e.id
ORDER BY score ASC
LIMIT 5;
Notes:
- Convert number[] to a Float32 blob for inserts and for the query embedding (see the sketch below).
- Join back to your stored responses via id or query, then apply variant selection.
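The same conversion applies on the SQLite side. A minimal sketch using bun:sqlite, assuming a vector extension is already loaded and that vss_f32() accepts the bound Float32 blob as in the SQL above:
import { Database } from "bun:sqlite";

const db = new Database("resk-cache.sqlite");
// db.loadExtension("./vss0"); // load your vector extension here (path is illustrative)

// number[] -> Float32 blob for parameter binding.
const toBlob = (v: number[]) => new Uint8Array(new Float32Array(v).buffer);

const embedding = [0.1, 0.2, 0.3]; // in practice, 1536 floats
db.query("INSERT INTO vss_entries(id, embedding) VALUES (?, vss_f32(?))")
  .run("thank-you", toBlob(embedding));
db.query("INSERT INTO llm_entries(id, query, category, metadata) VALUES (?, ?, ?, ?)")
  .run("thank-you", "thank you", "gratitude", JSON.stringify({ tone: "friendly" }));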
# Store multiple "thank you" responses with different tones
curl -X POST http://localhost:3000/api/semantic/store \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "thank you",
"query_embedding": {
"vector": [0.1, 0.2, 0.3],
"dimension": 3
},
"responses": [
{
"id": "resp1",
"text": "You're welcome!",
"metadata": { "tone": "friendly", "formality": "casual" },
"quality_score": 0.95,
"category": "gratitude",
"tags": ["polite", "casual"]
},
{
"id": "resp2",
"text": "My pleasure!",
"metadata": { "tone": "professional", "formality": "formal" },
"quality_score": 0.92,
"category": "gratitude",
"tags": ["polite", "professional"]
},
{
"id": "resp3",
"text": "No problem at all!",
"metadata": { "tone": "casual", "formality": "informal" },
"quality_score": 0.88,
"category": "gratitude",
"tags": ["casual", "friendly"]
}
],
"variant_strategy": "weighted",
"weights": [3, 2, 1],
"seed": "user:123"
}'
Response:
{
"success": true,
"message": "LLM responses stored successfully",
"entry_id": "thank you",
"responses_count": 3
}
# Search for responses to "merci" (French thank you)
curl -X POST http://localhost:3000/api/semantic/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "merci",
"query_embedding": {
"vector": [0.11, 0.19, 0.29],
"dimension": 3
},
"limit": 2,
"similarity_threshold": 0.8
}'
Response:
{
"success": true,
"search_result": {
"query": "merci",
"query_embedding": {
"vector": [0.11, 0.19, 0.29],
"dimension": 3
},
"matches": [
{
"entry": {
"query": "thank you",
"responses": [...],
"variant_strategy": "weighted",
"weights": [3, 2, 1]
},
"similarity_score": 0.997,
"selected_response": {
"id": "resp1",
"text": "You're welcome!",
"metadata": { "tone": "friendly" }
}
}
],
"total_matches": 1,
"search_time_ms": 2
}
}
# Retrieve all stored responses for "thank you"
curl -X GET "http://localhost:3000/api/semantic/responses?query=thank%20you" \
-H "Authorization: Bearer YOUR_API_KEY"Response:
{
"success": true,
"entry": {
"query": "thank you",
"query_embedding": {
"vector": [0.1, 0.2, 0.3],
"dimension": 3
},
"responses": [
{
"id": "resp1",
"text": "You're welcome!",
"metadata": { "tone": "friendly" }
},
{
"id": "resp2",
"text": "My pleasure!",
"metadata": { "tone": "professional" }
},
{
"id": "resp3",
"text": "No problem at all!",
"metadata": { "tone": "casual" }
}
],
"variant_strategy": "weighted",
"weights": [3, 2, 1],
"created_at": "2024-01-15T10:30:00.000Z",
"last_accessed": "2024-01-15T10:35:00.000Z"
}
}
# View cache performance metrics
curl -X GET http://localhost:3000/api/semantic/stats \
-H "Authorization: Bearer YOUR_API_KEY"Response:
{
"success": true,
"cache_type": "InMemoryVectorCache",
"message": "Stats endpoint - implementation needed"
}
# Store responses for different types of greetings
curl -X POST http://localhost:3000/api/semantic/store \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "hello",
"query_embedding": {
"vector": [0.9, 0.8, 0.7],
"dimension": 3
},
"responses": [
{
"id": "hello1",
"text": "Hi there!",
"metadata": { "tone": "friendly", "time_of_day": "any" }
},
{
"id": "hello2",
"text": "Hello! How are you?",
"metadata": { "tone": "polite", "time_of_day": "morning" }
}
],
"variant_strategy": "round-robin"
}'
# Strict similarity matching (only very similar queries)
curl -X POST http://localhost:3000/api/semantic/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "thanks a lot",
"query_embedding": {
"vector": [0.15, 0.25, 0.35],
"dimension": 3
},
"limit": 1,
"similarity_threshold": 0.95
}'
The system automatically tracks comprehensive metrics for all semantic operations:
- Semantic Searches: Total count, duration, and success rates
- Vector Similarity: Distribution of similarity scores
- Response Storage: Count of stored LLM responses by strategy
- Cache Performance: Entry counts and access patterns
- Response Selection: Variant strategy usage and performance
Access metrics at /api/metrics endpoint (Prometheus format).
- Search Speed: Typical semantic searches complete in <5ms
- Memory Usage: Efficient in-memory storage with configurable TTL
- Scalability: Designed for thousands of cached responses
- Accuracy: High-precision vector similarity using cosine distance
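The similarity scores and thresholds used throughout the examples are plain cosine similarity between embeddings. A minimal sketch of the metric, not the library's internal implementation:
// Cosine similarity in [-1, 1]; 1 means identical direction. Vectors must share a dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. a cached "thank you" embedding vs. an incoming "merci" embedding
console.log(cosineSimilarity([0.1, 0.2, 0.3], [0.11, 0.19, 0.29])); // ≈ 0.999 for these toy vectors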
- Vector Dimensions: Use consistent embedding dimensions across your system
- Similarity Thresholds: Start with 0.7-0.8 for production use
- Response Variety: Store 3-5 responses per query for good variant selection
- Metadata: Include rich metadata for better response selection
- TTL Management: Set appropriate expiration times for dynamic content
- PORT (default 3000)
- JWT_SECRET
- CACHE_BACKEND = memory | sqlite | redis
- REDIS_URL (for Redis backend)
- CACHE_ENCRYPTION_KEY (base64, 32 bytes)
- RATE_LIMIT_WINDOW_MS (default 900000)
- RATE_LIMIT_MAX (default 1000)
- OTEL_EXPORTER_OTLP_ENDPOINT (traces), OTEL_SERVICE_NAME
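For reference, these variables and their documented defaults could be read like this (a sketch only, not necessarily how the server's own config loader works):
// Reads the variables listed above, falling back to the documented defaults.
const config = {
  port: Number(process.env.PORT ?? 3000),
  jwtSecret: process.env.JWT_SECRET,                    // required for the JWT-protected API
  cacheBackend: process.env.CACHE_BACKEND ?? "memory",  // "memory" | "sqlite" | "redis"
  redisUrl: process.env.REDIS_URL,                      // only needed for the Redis backend
  encryptionKey: process.env.CACHE_ENCRYPTION_KEY,      // base64, 32 bytes; enables AES-GCM at rest
  rateLimitWindowMs: Number(process.env.RATE_LIMIT_WINDOW_MS ?? 900_000),
  rateLimitMax: Number(process.env.RATE_LIMIT_MAX ?? 1000),
  otlpEndpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  otelServiceName: process.env.OTEL_SERVICE_NAME,
};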
- In-memory (CACHE_BACKEND=memory):
  - Fastest single-process store (Map-based), ideal for development and ephemeral caches
  - Per-key TTL stored alongside values; expired entries are lazily evicted on access
  - No cross-process sharing and no durability
- SQLite (CACHE_BACKEND=sqlite):
  - Local durability using Bun's SQLite; table kv(key TEXT PRIMARY KEY, value TEXT, expiresAt INTEGER)
  - Upsert semantics on set; TTL computed client-side and stored in expiresAt
  - Expired rows are pruned lazily on get; clear() wipes the table
  - File path defaults to resk-cache.sqlite
- Redis (CACHE_BACKEND=redis, REDIS_URL=...):
  - Distributed, multi-instance cache using Bun's native RESP3 client
  - Values are JSON-serialized with optional TTL via EXPIRE
  - Prefix isolation via rc:; clear() scans and deletes only rc:* keys
  - Helpers for experiments (round-robin counters, sets/lists for variants, optional pub/sub)
- GET /health - Health check endpoint
- POST /api/cache (JWT) - Store simple key-value pairs
- POST /api/cache/query (JWT) - Retrieve cached values
- DELETE /api/cache (JWT) - Clear all cache
- GET /api/openapi.json - OpenAPI 3.1 specification from Zod schemas
- GET /api/metrics - Prometheus metrics exposition
- POST /api/cost/record (JWT) - Record LLM API cost for a request
- GET /api/cost/analysis (JWT) - Get comprehensive cost analysis and ROI
- GET /api/cost/breakdown (JWT) - Cost breakdown by provider and model
- GET /api/cost/recent (JWT) - Get recent cost entries
- POST /api/cost/pricing (JWT) - Add custom pricing for provider/model
- GET /api/cost/pricing (JWT) - Get all configured pricing
- POST /api/performance/record (JWT) - Record performance metrics
- GET /api/performance/benchmarks (JWT) - Get performance benchmarks
- GET /api/performance/slow-queries (JWT) - Detect slow queries
- GET /api/performance/recommendations (JWT) - Get optimization recommendations
- POST /api/performance/warming/start (JWT) - Start cache warming strategy
- GET /api/performance/warming/progress (JWT) - Get cache warming progress
- GET /api/performance/metrics (JWT) - Get recent performance metrics
- POST /api/testing/chat/completions (JWT) - OpenAI-compatible chat completions
- POST /api/testing/mock/responses (JWT) - Add custom mock responses
- GET /api/testing/mock/responses (JWT) - Get all mock responses
- POST /api/testing/scenarios (JWT) - Add test scenarios
- GET /api/testing/scenarios (JWT) - Get all test scenarios
- POST /api/testing/scenarios/run (JWT) - Run specific test scenario
- POST /api/testing/scenarios/run-all (JWT) - Run all test scenarios
- GET /api/testing/history (JWT) - Get request history
- POST /api/testing/scenarios/defaults (JWT) - Load default test scenarios
- GET /api/testing/health (JWT) - Get system health status
- GET /api/testing/circuit-breakers (JWT) - Get circuit breaker statistics
- POST /api/semantic/store (JWT) - Store LLM responses with vector embeddings
- POST /api/semantic/search (JWT) - Search for similar queries using semantic similarity
- GET /api/semantic/responses (JWT) - Get all responses for a specific query
- GET /api/semantic/stats (JWT) - Get cache statistics and performance metrics
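As an example of calling these endpoints over HTTP, the sketch below posts to /api/cost/record. The body fields mirror the library's recordCost example later in this README and are an assumption; the authoritative schema is the OpenAPI spec at /api/openapi.json:
// Record the cost of an uncached LLM call via the JWT-protected HTTP API.
const res = await fetch("http://localhost:3000/api/cost/record", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY",
  },
  body: JSON.stringify({
    provider: "openai",
    model: "gpt-4",
    inputTokens: 150,
    outputTokens: 200,
    cacheHit: false,
  }),
});
console.log(await res.json());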
- Store Responses: First, store your pre-computed LLM responses with their vector embeddings
- User Query: When a user sends a message (e.g., "merci", "merci pour ta rΓ©ponse")
- Vector Search: The system finds semantically similar queries in your database
- Response Selection: Uses advanced algorithms to choose the most appropriate response
- Return Result: Sends back a varied, contextually relevant response
Store multiple responses for "thank you" queries:
{
"query": "thank you",
"query_embedding": {
"vector": [0.1, 0.2, 0.3, 0.4, 0.5],
"dimension": 5
},
"responses": [
{
"id": "thank_1",
"text": "You're welcome! I'm glad I could help.",
"metadata": {"tone": "friendly", "formality": "casual"},
"quality_score": 0.9,
"category": "gratitude"
},
{
"id": "thank_2",
"text": "My pleasure! Feel free to ask if you need anything else.",
"metadata": {"tone": "professional", "formality": "formal"},
"quality_score": 0.85,
"category": "gratitude"
}
],
"variant_strategy": "weighted",
"weights": [3, 2],
"seed": "user:123"
}
- random: Uniform random selection for variety
- round-robin: Cycles through responses systematically
- deterministic: Stable selection based on seed (user ID, conversation ID)
- weighted: Probability-based selection according to quality scores or preferences
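The strategies above boil down to small selection functions. The sketch below illustrates weighted and deterministic selection in isolation; it is not the library's internal code:
type Variant = { id: string; text: string };

// weighted: higher weight => proportionally higher chance of being picked.
function pickWeighted(variants: Variant[], weights: number[]): Variant {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < variants.length; i++) {
    r -= weights[i];
    if (r <= 0) return variants[i];
  }
  return variants[variants.length - 1];
}

// deterministic: the same seed (e.g. "user:123") always maps to the same variant,
// keeping answers stable per user or per conversation.
function pickDeterministic(variants: Variant[], seed: string): Variant {
  let h = 0;
  for (const ch of seed) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return variants[h % variants.length];
}

const variants: Variant[] = [
  { id: "resp1", text: "You're welcome!" },
  { id: "resp2", text: "My pleasure!" },
  { id: "resp3", text: "No problem at all!" },
];
console.log(pickWeighted(variants, [3, 2, 1]).text);       // ~50% resp1, ~33% resp2, ~17% resp3
console.log(pickDeterministic(variants, "user:123").text); // always the same for user:123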
When a user sends "merci pour ta rΓ©ponse", the system:
- Converts the message to a vector embedding
- Finds similar queries in the database (e.g., "thank you", "thanks", "merci")
- Selects the best match based on similarity score
- Applies the variant strategy to choose a response
- Returns the selected response with metadata
This approach ensures users get varied, contextually appropriate responses while maintaining the high quality of pre-approved LLM outputs.
import { selectCache, globalCostTracker, globalPerformanceOptimizer } from "resk-caching";
// Basic cache usage
const cache = selectCache();
await cache.set("key", { payload: true }, 60);
const val = await cache.get("key");
// Cost tracking integration
const cacheResult = await cache.search(query);
if (cacheResult) {
// Cache hit - record savings
globalCostTracker.recordCost({
provider: "openai",
model: "gpt-4",
inputTokens: 150,
outputTokens: 200,
cacheHit: true
});
} else {
// Cache miss - record actual cost
const response = await llmApi.createCompletion(query);
globalCostTracker.recordCost({
provider: "openai",
model: "gpt-4",
inputTokens: response.usage.prompt_tokens,
outputTokens: response.usage.completion_tokens,
cacheHit: false
});
}
// Performance monitoring
globalPerformanceOptimizer.recordMetric({
operation: 'search',
duration: responseTime,
cacheHit: !!cacheResult,
backend: 'redis'
});
// examples/cost-tracking-example.ts
import { CostTracker } from "resk-caching";
const tracker = new CostTracker();
// Record API costs
tracker.recordCost({
provider: "openai",
model: "gpt-4",
inputTokens: 150,
outputTokens: 300,
cacheHit: false
});
// Get ROI analysis
const analysis = tracker.getCostAnalysis(30); // 30 days
console.log(`Total Savings: $${analysis.totalSavings}`);
console.log(`ROI: ${analysis.roiPercentage}%`);
// examples/performance-optimization-example.ts
import { PerformanceOptimizer } from "resk-caching";
const optimizer = new PerformanceOptimizer();
// Start cache warming
await optimizer.startCacheWarming({
strategy: 'popular',
batchSize: 20,
maxEntries: 1000
});
// Get optimization recommendations
const recommendations = optimizer.getOptimizationRecommendations();
recommendations.forEach(rec => {
console.log(`${rec.type}: ${rec.description}`);
});
// examples/development-testing-example.ts
import { MockLLMProvider } from "resk-caching";
const mockProvider = new MockLLMProvider();
// OpenAI-compatible API for development
const response = await mockProvider.createChatCompletion({
model: "gpt-3.5-turbo",
messages: [{ role: "user", content: "Hello!" }]
});
// Run automated test scenarios
const testResults = await mockProvider.runAllTestScenarios();
console.log(`Tests passed: ${testResults.filter(r => r.passed).length}`);
# Run the comprehensive demo showcasing all four benefits
npm run example:demo
# Or run individual examples
npm run example:cost-tracking
npm run example:performance
npm run example:development
- Fetch the spec: GET /api/openapi.json
- Use your preferred OpenAPI generator to produce clients/SDKs
- Prometheus metrics at /api/metrics
- OpenTelemetry tracing via OTLP HTTP exporter (configure OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME)
- Correlation-ID header propagated for easier debugging
- Secrets only on the server (env/secret manager). No secrets in frontend
- TLS transport; JWT short-lived; per-user/IP rate-limit
- Optional AES-GCM encryption at rest for persisted cache entries
- Structured logs with correlation-id; metrics and traces for forensics
Apache-2.0 (see LICENSE)
Resk-Caching supports multiple vector database backends for similarity search and semantic caching. The system can ingest documents, compute embeddings, and store them in vector databases for efficient retrieval.
- Chroma: Local or hosted ChromaDB instances
- Pinecone: Managed vector database service
- Weaviate: Open-source vector database
- Milvus: High-performance vector database
- Custom adapters: Extend for your specific needs
# Vector Database Type
export VECTORDB_TYPE=pinecone # or chroma, weaviate, milvus
# Embedding Provider
export EMBEDDING_PROVIDER=openai # or huggingface, sentence-transformers
export EMBEDDING_MODEL=text-embedding-ada-002 # OpenAI model name
export OPENAI_API_KEY=your_openai_key_here
# Pinecone Configuration
export PINECONE_API_KEY=your_pinecone_key
export PINECONE_INDEX_HOST=https://your-index.pinecone.io
export PINECONE_INDEX_NAME=your-index-name
# Chroma Configuration
export CHROMA_HOST=localhost
export CHROMA_PORT=8000
export CHROMA_COLLECTION_NAME=documents
# Weaviate Configuration
export WEAVIATE_URL=http://localhost:8080
export WEAVIATE_API_KEY=your_weaviate_key
export WEAVIATE_CLASS_NAME=Document
# Milvus Configuration
export MILVUS_HOST=localhost
export MILVUS_PORT=19530
export MILVUS_COLLECTION_NAME=documents
# Batch Processing
export BATCH_SIZE=100 # Documents per batch for embeddings
export UPSERT_BATCH=50 # Documents per batch for vector DB
Use the provided ingestion script to batch process documents:
# Run ingestion
bun run scripts/ingest-example.ts
The script will:
- Read documents from your source
- Compute embeddings in batches
- Store vectors in the configured database
- Handle retries and error recovery
import { createVectorDBAdapter } from 'resk-caching';
import { createEmbeddingProvider } from 'resk-caching';
async function ingestDocuments(documents: Document[]) {
const vectorDB = createVectorDBAdapter();
const embeddings = createEmbeddingProvider();
// Process in batches
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
const batch = documents.slice(i, i + BATCH_SIZE);
// Compute embeddings
const vectors = await embeddings.embedBatch(
batch.map(doc => doc.content)
);
// Prepare for storage
const vectorsWithMetadata = batch.map((doc, idx) => ({
id: doc.id,
vector: vectors[idx],
metadata: {
title: doc.title,
source: doc.source,
timestamp: doc.timestamp
}
}));
// Store in vector database
await vectorDB.upsertBatch(vectorsWithMetadata);
}
}
import { createVectorDBAdapter, createEmbeddingProvider } from 'resk-caching';
async function searchSimilar(query: string, k: number = 5) {
const vectorDB = createVectorDBAdapter();
const embeddings = createEmbeddingProvider();
// Get query embedding
const queryVector = await embeddings.embed(query);
// Search for similar vectors
const results = await vectorDB.search(queryVector, {
k,
threshold: 0.7, // Similarity threshold
filters: {
source: 'knowledge_base',
timestamp: { $gte: '2024-01-01' }
}
});
return results;
}
- Batch sizes: Larger batches (100-500) for embeddings, smaller (50-100) for vector DB operations
- Parallel processing: Use worker threads for CPU-intensive embedding computation
- Caching: Cache frequently accessed embeddings and search results
- Indexing: Ensure proper vector database indexes are created for your use case
The system provides metrics for:
- Embedding computation latency and throughput
- Vector database operation success rates
- Search query performance
- Cache hit rates for vector operations
Access metrics at /api/metrics endpoint.
- Docker image and multi-stage build for slim runtimes
- LangChain integration helper (middleware to consult cache before LLM calls)
- LlamaIndex and Vercel AI SDK adapters
- Pluggable vector stores (Qdrant, Weaviate, Pinecone) with adapters
- Background refresh policies and stale-while-revalidate
- Eviction strategies (LRU/LFU) and cache warming CLI
- Upstash Redis & Redis Cloud deployment templates
- Benchmarks and load-test recipes (k6/Artillery)