Resk-Caching is a Bun-based backend library and server designed for secure caching, embeddings orchestration, and vector database access. It prioritizes security, high performance, and deep observability.

Full documentation

We provide a full documentation site (MkDocs). See docs/ and the published site: Resk-Caching Docs.

Resk-Caching: LLM Response Caching with Vector Database Integration

Resk-Caching is a Bun-based backend library/server designed to cache Large Language Model (LLM) responses using vector databases, significantly reducing API costs while maintaining response quality and relevance.

🎯 Four Key GPTCache-Style Benefits

Resk-Caching delivers the complete value proposition of intelligent LLM caching with four core benefits that transform how you build and scale AI applications:

💰 1. Massive Cost Reduction

  • Up to 90% reduction in LLM API costs through intelligent semantic caching
  • Real-time cost tracking with provider-specific pricing (OpenAI, Anthropic, Google, etc.)
  • ROI analysis showing exact savings from cache hits vs API calls
  • Cost breakdown by provider, model, and time period
  • Automatic savings calculation for every cached response

🚀 2. Performance Optimization

  • Sub-5ms response times for cached queries vs 500ms+ for API calls
  • Intelligent cache warming strategies (popular, recent, predictive)
  • Real-time performance monitoring with benchmarking and optimization recommendations
  • Slow query detection with automated performance suggestions
  • Cache hit rate optimization through advanced similarity algorithms

🧪 3. Development & Testing Environment

  • OpenAI-compatible API for offline development without API costs
  • Mock LLM provider with customizable responses and scenarios
  • Automated testing scenarios with validation and metrics
  • Zero-cost development workflows with realistic API simulation
  • Circuit breaker patterns for resilient application development

πŸ›‘οΈ 4. Scalability & Availability

  • Reduced exposure to provider rate limits through a cache-first approach that lowers API pressure
  • Circuit breaker patterns with automatic failover and recovery
  • Health monitoring and real-time system status
  • Automatic scaling with proactive cache warming for traffic spikes
  • Graceful degradation when external services fail

πŸ” How It Works

  1. Pre-populated Response Database: You maintain a database of high-quality LLM responses to common queries, indexed by vector embeddings of those queries
  2. Semantic Matching: When a new query arrives, the system finds the most semantically similar cached response
  3. Cost Savings: Returns cached responses instead of making new API calls
  4. Response Selection: Advanced algorithms allow you to choose specific responses based on business logic, user preferences, or A/B testing strategies
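
For a concrete picture, here is a minimal cache-first sketch in TypeScript against the HTTP endpoints documented below. The request and response shapes mirror the curl examples in this README; getEmbedding and callLlm are hypothetical stand-ins for your own embedding model and LLM client, and YOUR_API_KEY is the same placeholder used in the curl examples.

// Minimal cache-first sketch. Assumes the server from "Quick Start" runs on
// localhost:3000; getEmbedding/callLlm are hypothetical application-side helpers.
declare function getEmbedding(text: string): Promise<number[]>;
declare function callLlm(prompt: string): Promise<string>;

const BASE_URL = "http://localhost:3000";
const headers = {
  "Content-Type": "application/json",
  Authorization: "Bearer YOUR_API_KEY",
};

async function answer(query: string): Promise<string> {
  const vector = await getEmbedding(query);

  // 1. Look for a semantically similar cached entry
  const res = await fetch(`${BASE_URL}/api/semantic/search`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      query,
      query_embedding: { vector, dimension: vector.length },
      limit: 1,
      similarity_threshold: 0.8,
    }),
  });
  const { search_result } = await res.json();

  // 2. Cache hit: return the pre-approved response, no LLM call needed
  if (search_result?.matches?.length) {
    return search_result.matches[0].selected_response.text;
  }

  // 3. Cache miss: fall back to the LLM (optionally store the new response)
  return callLlm(query);
}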

🚀 Key Benefits

✅ All Four GPTCache-Style Benefits Implemented:

  • 💰 Massive Cost Reduction: Up to 90% savings with real-time ROI tracking
  • 🚀 Performance Optimization: Sub-5ms responses with intelligent cache warming
  • 🧪 Development Environment: OpenAI-compatible API for offline testing
  • 🛡️ Scalability & Availability: Circuit breakers and automatic failover

Features

  • LLM Response Caching: Store and retrieve LLM responses using vector similarity matching
  • Multiple Cache Backends: in-memory, SQLite (local persistence), Redis (multi-instance)
  • Advanced Response Selection: Deterministic, weighted, and randomized response selection algorithms
  • Vector Database Integration: Optimized for semantic search and similarity matching
  • AES-GCM Encryption: Secure cache-at-rest protection (optional via env key)
  • JWT-Protected API: Secure access with rate limiting and abuse prevention
  • OpenAPI 3.1: Auto-generated API documentation from Zod schemas
  • Performance Monitoring: Prometheus metrics and OpenTelemetry tracing
  • Real-time Updates: WebSockets for instant response distribution

How we're different from other semantic caches

  • GPTCache: Great Python-first cache. Resk-Caching focuses on Bun/TypeScript, ships with JWT-secured HTTP API, OpenAPI generation, built-in Prometheus/OTEL, and optional authenticated-at-rest encryption out of the box.
  • ModelCache: Provides a semantic cache layer. Resk-Caching adds production concerns (rate-limit, security wrapper, metrics, tracing, OpenAPI, WebSockets) and pluggable backends with zero-code switching via CACHE_BACKEND.
  • Upstash Semantic Cache: Managed vector-backed cache. Resk-Caching is open-source, self-hosted by default, and can run fully local with SQLite or purely in-memory while retaining encryption and observability.
  • Redis LangCache: Managed Redis-based semantic cache. Resk-Caching supports Redis natively via Bun's RESP3 client while also offering SQLite and in-memory modes for portability and offline development.
  • SemantiCache (FAISS): FAISS-native library. Resk-Caching prioritizes a secure, observable HTTP surface with variant selection strategies and can integrate external vector DBs; no GPU dependency required.

If you need a secure, auditable cache service with operational tooling for teams, Resk-Caching is purpose-built for that surface.

What each module is for

  • LLM Response Storage: Store pre-computed LLM responses with their vector embeddings for fast retrieval
  • Caching Backends: Choose between low-latency memory, local persistence (SQLite), or distributed (Redis) based on your scale
  • Response Selection Algorithms: Implement deterministic, weighted, or randomized response selection based on business logic
  • Vector Similarity Matching: Find the most semantically similar cached response to incoming queries
  • AES-GCM Encryption: Protect sensitive LLM responses at rest with authenticated encryption
  • JWT + Rate Limiting: Secure API access and prevent abuse while maintaining performance
  • Zod + OpenAPI: Ensure data validation and provide always-in-sync API documentation
  • Performance Monitoring: Track cache hit rates, response times, and cost savings in real-time
  • Real-time Distribution: Instantly distribute responses across multiple instances and clients

Prerequisites

Vector Database Setup

Before using Resk-Caching, you need to have a vector database ready with pre-computed LLM responses. This is the foundation of the caching system:

  1. Response Database: Create a collection of high-quality LLM responses to common queries
  2. Vector Embeddings: Generate vector embeddings for each response using your preferred embedding model (a short sketch follows this list)
  3. Metadata Storage: Store additional context like response quality scores, categories, or business rules
  4. Similarity Index: Ensure your vector database has proper indexing for fast similarity search
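
For step 2, here is a minimal sketch of batch-embedding your responses with the OpenAI SDK. The client and model shown are one option rather than a requirement of Resk-Caching; any embedding provider with consistent dimensions works.

// Sketch: pre-compute embeddings for your response database.
// Uses the official openai npm package; swap in any embedding provider.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function embedTexts(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002", // matches EMBEDDING_MODEL later in this README
    input: texts,
  });
  // One vector per input, in the same order as the inputs
  return res.data.map((d) => d.embedding);
}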

Recommended Vector Databases:

  • Pinecone: Excellent for production use with high performance
  • Weaviate: Open-source with great similarity search capabilities
  • Qdrant: Fast and efficient for real-time applications
  • Chroma: Simple local development and testing

Install

# as a library (npm)
npm install resk-caching

# as a library (bun)
bun add resk-caching

Quick Start

Server Setup

# Install dependencies
bun install

# Start the server
bun run dev

# The server will be available at http://localhost:3000

Step-by-step setup

  1. Choose your key-value cache backend:
    • CACHE_BACKEND=memory for local/dev
    • CACHE_BACKEND=sqlite for single-node durability
    • CACHE_BACKEND=redis for distributed/multi-instance
  2. Choose your vector search strategy for semantic features:
    • Default: in-memory vector store (process-local)
    • Production: external vector DB (Pinecone/Qdrant/Weaviate/Chroma)
    • Alternative: Redis RediSearch vectors or SQLite vector extensions
  3. Ingest responses and embeddings (see Ingestion or scripts/ingest-example.ts).
  4. Call /api/semantic/store and /api/semantic/search.

By default, semantic embeddings live in memory. To power vector search with Redis or SQLite, see the guides below.

Vector search with Redis (RediSearch)

Use Redis Stack with RediSearch for vector similarity.

Example index and KNN search (1536-dim float32 cosine):

# Create index
redis-cli FT.CREATE idx:llm ON HASH PREFIX 1 llm: SCHEMA \
  query TEXT \
  embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE \
  category TAG SORTABLE \
  metadata TEXT

# Insert (embedding must be raw float32 bytes)
redis-cli HSET llm:thank-you query "thank you" category "gratitude" \
  embedding "$BINARY_FLOAT32" metadata "{\"tone\":\"friendly\"}"

# KNN search
redis-cli FT.SEARCH idx:llm "*=>[KNN 5 @embedding $vec AS score]" \
  PARAMS 2 vec "$QUERY_EMBED_FLOAT32" \
  SORTBY score DIALECT 2 RETURN 3 query category score

Notes:

  • Convert number[] → Float32Array → bytes for the embedding field (see the helper sketch below).
  • Keep response variants in a secondary key (e.g., llm:<id>:responses) and run variant selection after KNN.
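
A small sketch of that conversion in TypeScript. The raw bytes are what RediSearch expects for the VECTOR field; how you write them to the hash depends on your Redis client.

// number[] -> Float32Array -> raw float32 bytes for the RediSearch VECTOR field.
// Pass the result as the value of the "embedding" field when writing the hash,
// and encode the query vector for FT.SEARCH PARAMS the same way.
function toFloat32Bytes(embedding: number[]): Uint8Array {
  const f32 = new Float32Array(embedding); // number[] -> Float32Array
  return new Uint8Array(f32.buffer);       // Float32Array -> raw bytes
}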

Vector search with SQLite (sqlite-vss/sqlite-vec)

Ship SQLite with a vector extension, then create a VSS table and join with metadata:

CREATE VIRTUAL TABLE vss_entries USING vss0(
  id TEXT PRIMARY KEY,
  embedding(1536)
);

CREATE TABLE llm_entries (
  id TEXT PRIMARY KEY,
  query TEXT NOT NULL,
  category TEXT,
  metadata TEXT
);

Insert and search:

-- insert: embedding blob is Float32 (vss_f32)
INSERT INTO vss_entries(id, embedding) VALUES (?, vss_f32(?));
INSERT INTO llm_entries(id, query, category, metadata) VALUES(?, ?, ?, ?);

-- KNN
SELECT e.id, l.query, vss_distance(e.embedding, vss_f32(?)) AS score
FROM vss_entries e
JOIN llm_entries l ON l.id = e.id
ORDER BY score ASC
LIMIT 5;

Notes:

  • Convert number[] to a Float32 blob for inserts and the query embedding (see the sketch below).
  • Join back to your stored responses via id or query, then apply variant selection.
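
A minimal sketch of binding that blob with Bun's built-in SQLite driver, reusing the statements above. It assumes the sqlite-vss/sqlite-vec extension is already loaded into the connection; the database file name is just an example.

// Sketch: insert an entry using bun:sqlite and the SQL shown above.
// Assumes the vector extension (sqlite-vss/sqlite-vec) is already loaded.
import { Database } from "bun:sqlite";

const db = new Database("vectors.sqlite"); // any SQLite file with the extension loaded

function toFloat32Blob(embedding: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(embedding).buffer);
}

function insertEntry(id: string, query: string, category: string,
                     metadata: object, embedding: number[]) {
  db.query("INSERT INTO vss_entries(id, embedding) VALUES (?, vss_f32(?))")
    .run(id, toFloat32Blob(embedding));
  db.query("INSERT INTO llm_entries(id, query, category, metadata) VALUES (?, ?, ?, ?)")
    .run(id, query, category, JSON.stringify(metadata));
}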

Basic Usage Examples

1. Store LLM Responses with Vector Embeddings

# Store multiple "thank you" responses with different tones
curl -X POST http://localhost:3000/api/semantic/store \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "query": "thank you",
    "query_embedding": {
      "vector": [0.1, 0.2, 0.3],
      "dimension": 3
    },
    "responses": [
      {
        "id": "resp1",
        "text": "You're welcome!",
        "metadata": { "tone": "friendly", "formality": "casual" },
        "quality_score": 0.95,
        "category": "gratitude",
        "tags": ["polite", "casual"]
      },
      {
        "id": "resp2", 
        "text": "My pleasure!",
        "metadata": { "tone": "professional", "formality": "formal" },
        "quality_score": 0.92,
        "category": "gratitude",
        "tags": ["polite", "professional"]
      },
      {
        "id": "resp3",
        "text": "No problem at all!",
        "metadata": { "tone": "casual", "formality": "informal" },
        "quality_score": 0.88,
        "category": "gratitude",
        "tags": ["casual", "friendly"]
      }
    ],
    "variant_strategy": "weighted",
    "weights": [3, 2, 1],
    "seed": "user:123"
  }'

Response:

{
  "success": true,
  "message": "LLM responses stored successfully",
  "entry_id": "thank you",
  "responses_count": 3
}

2. Semantic Search for Similar Queries

# Search for responses to "merci" (French thank you)
curl -X POST http://localhost:3000/api/semantic/search \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "query": "merci",
    "query_embedding": {
      "vector": [0.11, 0.19, 0.29],
      "dimension": 3
    },
    "limit": 2,
    "similarity_threshold": 0.8
  }'

Response:

{
  "success": true,
  "search_result": {
    "query": "merci",
    "query_embedding": {
      "vector": [0.11, 0.19, 0.29],
      "dimension": 3
    },
    "matches": [
      {
        "entry": {
          "query": "thank you",
          "responses": [...],
          "variant_strategy": "weighted",
          "weights": [3, 2, 1]
        },
        "similarity_score": 0.997,
        "selected_response": {
          "id": "resp1",
          "text": "You're welcome!",
          "metadata": { "tone": "friendly" }
        }
      }
    ],
    "total_matches": 1,
    "search_time_ms": 2
  }
}

3. Get All Responses for a Query

# Retrieve all stored responses for "thank you"
curl -X GET "http://localhost:3000/api/semantic/responses?query=thank%20you" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "entry": {
    "query": "thank you",
    "query_embedding": {
      "vector": [0.1, 0.2, 0.3],
      "dimension": 3
    },
    "responses": [
      {
        "id": "resp1",
        "text": "You're welcome!",
        "metadata": { "tone": "friendly" }
      },
      {
        "id": "resp2",
        "text": "My pleasure!",
        "metadata": { "tone": "professional" }
      },
      {
        "id": "resp3",
        "text": "No problem at all!",
        "metadata": { "tone": "casual" }
      }
    ],
    "variant_strategy": "weighted",
    "weights": [3, 2, 1],
    "created_at": "2024-01-15T10:30:00.000Z",
    "last_accessed": "2024-01-15T10:35:00.000Z"
  }
}

4. Get Cache Statistics

# View cache performance metrics
curl -X GET http://localhost:3000/api/semantic/stats \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "cache_type": "InMemoryVectorCache",
  "message": "Stats endpoint - implementation needed"
}

Advanced Usage Examples

Store Multiple Query Types

# Store responses for different types of greetings
curl -X POST http://localhost:3000/api/semantic/store \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "query": "hello",
    "query_embedding": {
      "vector": [0.9, 0.8, 0.7],
      "dimension": 3
    },
    "responses": [
      {
        "id": "hello1",
        "text": "Hi there!",
        "metadata": { "tone": "friendly", "time_of_day": "any" }
      },
      {
        "id": "hello2",
        "text": "Hello! How are you?",
        "metadata": { "tone": "polite", "time_of_day": "morning" }
      }
    ],
    "variant_strategy": "round-robin"
  }'

Search with Different Similarity Thresholds

# Strict similarity matching (only very similar queries)
curl -X POST http://localhost:3000/api/semantic/search \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "query": "thanks a lot",
    "query_embedding": {
      "vector": [0.15, 0.25, 0.35],
      "dimension": 3
    },
    "limit": 1,
    "similarity_threshold": 0.95
  }'

Metrics and Monitoring

The system automatically tracks comprehensive metrics for all semantic operations:

  • Semantic Searches: Total count, duration, and success rates
  • Vector Similarity: Distribution of similarity scores
  • Response Storage: Count of stored LLM responses by strategy
  • Cache Performance: Entry counts and access patterns
  • Response Selection: Variant strategy usage and performance

Access metrics at /api/metrics endpoint (Prometheus format).

Performance Characteristics

  • Search Speed: Typical semantic searches complete in <5ms
  • Memory Usage: Efficient in-memory storage with configurable TTL
  • Scalability: Designed for thousands of cached responses
  • Accuracy: High-precision vector similarity using cosine distance

Best Practices

  1. Vector Dimensions: Use consistent embedding dimensions across your system
  2. Similarity Thresholds: Start with 0.7-0.8 for production use
  3. Response Variety: Store 3-5 responses per query for good variant selection
  4. Metadata: Include rich metadata for better response selection
  5. TTL Management: Set appropriate expiration times for dynamic content

Environment variables

  • PORT (default 3000)
  • JWT_SECRET
  • CACHE_BACKEND = memory | sqlite | redis
  • REDIS_URL (for Redis backend)
  • CACHE_ENCRYPTION_KEY (base64, 32 bytes)
  • RATE_LIMIT_WINDOW_MS (default 900000)
  • RATE_LIMIT_MAX (default 1000)
  • OTEL_EXPORTER_OTLP_ENDPOINT (traces), OTEL_SERVICE_NAME
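
A minimal example configuration using only the variables above; all values are placeholders to adjust for your deployment.

export PORT=3000
export JWT_SECRET=change-me
export CACHE_BACKEND=redis
export REDIS_URL=redis://localhost:6379
export CACHE_ENCRYPTION_KEY=BASE64_32_BYTE_KEY   # base64, 32 bytes
export RATE_LIMIT_WINDOW_MS=900000
export RATE_LIMIT_MAX=1000
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=resk-caching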

Cache backends explained

  • In-memory (CACHE_BACKEND=memory):

    • Fastest single-process store (Map-based), ideal for development and ephemeral caches
    • Per-key TTL stored alongside values; expired entries are lazily evicted on access
    • No cross-process sharing and no durability
  • SQLite (CACHE_BACKEND=sqlite):

    • Local durability using Bun's SQLite; table kv(key TEXT PRIMARY KEY, value TEXT, expiresAt INTEGER)
    • Upsert semantics on set, TTL computed client-side and stored in expiresAt
    • Expired rows are pruned lazily on get; clear() wipes the table
    • File path defaults to resk-cache.sqlite
  • Redis (CACHE_BACKEND=redis, REDIS_URL=...):

    • Distributed, multi-instance cache using Bun's native RESP3 client
    • Values are JSON-serialized with optional TTL via EXPIRE
    • Prefix isolation via rc:; clear() scans and deletes only rc:* keys
    • Helpers for experiments (round-robin counters, sets/lists for variants, optional pub/sub)
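
Because the backend is selected from CACHE_BACKEND, application code stays the same across all three modes. A short sketch using the selectCache helper from the Library usage section below:

// Same calls for memory, SQLite, and Redis; the backend comes from CACHE_BACKEND.
import { selectCache } from "resk-caching";

const cache = selectCache();                      // backend chosen via env
await cache.set("greeting", { text: "hi" }, 60);  // value with a 60-second TTL
const value = await cache.get("greeting");        // identical call on any backend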

🔗 API Endpoints - Complete Reference

Core Cache Endpoints

  • GET /health - Health check endpoint
  • POST /api/cache (JWT) - Store simple key-value pairs
  • POST /api/cache/query (JWT) - Retrieve cached values
  • DELETE /api/cache (JWT) - Clear all cache
  • GET /api/openapi.json - OpenAPI 3.1 specification from Zod schemas
  • GET /api/metrics - Prometheus metrics exposition

💰 Cost Tracking Endpoints (NEW!)

  • POST /api/cost/record (JWT) - Record LLM API cost for a request
  • GET /api/cost/analysis (JWT) - Get comprehensive cost analysis and ROI
  • GET /api/cost/breakdown (JWT) - Cost breakdown by provider and model
  • GET /api/cost/recent (JWT) - Get recent cost entries
  • POST /api/cost/pricing (JWT) - Add custom pricing for provider/model
  • GET /api/cost/pricing (JWT) - Get all configured pricing

🚀 Performance Optimization Endpoints (NEW!)

  • POST /api/performance/record (JWT) - Record performance metrics
  • GET /api/performance/benchmarks (JWT) - Get performance benchmarks
  • GET /api/performance/slow-queries (JWT) - Detect slow queries
  • GET /api/performance/recommendations (JWT) - Get optimization recommendations
  • POST /api/performance/warming/start (JWT) - Start cache warming strategy
  • GET /api/performance/warming/progress (JWT) - Get cache warming progress
  • GET /api/performance/metrics (JWT) - Get recent performance metrics

🧪 Development & Testing Endpoints (NEW!)

  • POST /api/testing/chat/completions (JWT) - OpenAI-compatible chat completions
  • POST /api/testing/mock/responses (JWT) - Add custom mock responses
  • GET /api/testing/mock/responses (JWT) - Get all mock responses
  • POST /api/testing/scenarios (JWT) - Add test scenarios
  • GET /api/testing/scenarios (JWT) - Get all test scenarios
  • POST /api/testing/scenarios/run (JWT) - Run specific test scenario
  • POST /api/testing/scenarios/run-all (JWT) - Run all test scenarios
  • GET /api/testing/history (JWT) - Get request history
  • POST /api/testing/scenarios/defaults (JWT) - Load default test scenarios
  • GET /api/testing/health (JWT) - Get system health status
  • GET /api/testing/circuit-breakers (JWT) - Get circuit breaker statistics

Semantic Search Endpoints

  • POST /api/semantic/store (JWT) - Store LLM responses with vector embeddings
  • POST /api/semantic/search (JWT) - Search for similar queries using semantic similarity
  • GET /api/semantic/responses (JWT) - Get all responses for a specific query
  • GET /api/semantic/stats (JWT) - Get cache statistics and performance metrics

Semantic Search & Response Selection

How It Works

  1. Store Responses: First, store your pre-computed LLM responses with their vector embeddings
  2. User Query: When a user sends a message (e.g., "merci", "merci pour ta réponse")
  3. Vector Search: The system finds semantically similar queries in your database
  4. Response Selection: Uses advanced algorithms to choose the most appropriate response
  5. Return Result: Sends back a varied, contextually relevant response

Example: Thank You Responses

Store multiple responses for "thank you" queries:

{
  "query": "thank you",
  "query_embedding": {
    "vector": [0.1, 0.2, 0.3, 0.4, 0.5],
    "dimension": 5
  },
  "responses": [
    {
      "id": "thank_1",
      "text": "You're welcome! I'm glad I could help.",
      "metadata": {"tone": "friendly", "formality": "casual"},
      "quality_score": 0.9,
      "category": "gratitude"
    },
    {
      "id": "thank_2",
      "text": "My pleasure! Feel free to ask if you need anything else.",
      "metadata": {"tone": "professional", "formality": "formal"},
      "quality_score": 0.85,
      "category": "gratitude"
    }
  ],
  "variant_strategy": "weighted",
  "weights": [3, 2],
  "seed": "user:123"
}

Response Selection Strategies

  • random: Uniform random selection for variety
  • round-robin: Cycles through responses systematically
  • deterministic: Stable selection based on seed (user ID, conversation ID)
  • weighted: Probability-based selection according to quality scores or preferences
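
As an illustration (not the library's internal code), here is a minimal sketch of the weighted and deterministic strategies using the weights and seed fields shown in the store payloads above.

// Illustration of "deterministic" and "weighted" selection. Not the library's
// implementation, just the idea behind each strategy.
type StoredResponse = { id: string; text: string };

// deterministic: hash the seed (e.g. "user:123") to a stable index
function pickDeterministic(responses: StoredResponse[], seed: string): StoredResponse {
  let hash = 0;
  for (const ch of seed) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return responses[hash % responses.length];
}

// weighted: sample proportionally to the weights array (e.g. [3, 2, 1])
function pickWeighted(responses: StoredResponse[], weights: number[]): StoredResponse {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < responses.length; i++) {
    r -= weights[i];
    if (r <= 0) return responses[i];
  }
  return responses[responses.length - 1];
}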

Search for Similar Queries

When a user sends "merci pour ta réponse", the system:

  1. Converts the message to a vector embedding
  2. Finds similar queries in the database (e.g., "thank you", "thanks", "merci")
  3. Selects the best match based on similarity score
  4. Applies the variant strategy to choose a response
  5. Returns the selected response with metadata

This approach ensures users get varied, contextually appropriate responses while maintaining the high quality of pre-approved LLM outputs.

Library usage (TypeScript)

import { selectCache, globalCostTracker, globalPerformanceOptimizer } from "resk-caching";

// Basic cache usage
const cache = selectCache();
await cache.set("key", { payload: true }, 60);
const val = await cache.get("key");

// Cost tracking integration (query, llmApi, and responseTime are provided by your application)
const cacheResult = await cache.search(query);
if (cacheResult) {
  // Cache hit - record savings
  globalCostTracker.recordCost({
    provider: "openai",
    model: "gpt-4", 
    inputTokens: 150,
    outputTokens: 200,
    cacheHit: true
  });
} else {
  // Cache miss - record actual cost
  const response = await llmApi.createCompletion(query);
  globalCostTracker.recordCost({
    provider: "openai",
    model: "gpt-4",
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cacheHit: false
  });
}

// Performance monitoring
globalPerformanceOptimizer.recordMetric({
  operation: 'search',
  duration: responseTime,
  cacheHit: !!cacheResult,
  backend: 'redis'
});

📚 Comprehensive Examples

💰 Cost Tracking Example

// examples/cost-tracking-example.ts
import { CostTracker } from "resk-caching";

const tracker = new CostTracker();

// Record API costs
tracker.recordCost({
  provider: "openai",
  model: "gpt-4",
  inputTokens: 150,
  outputTokens: 300,
  cacheHit: false
});

// Get ROI analysis
const analysis = tracker.getCostAnalysis(30); // 30 days
console.log(`Total Savings: $${analysis.totalSavings}`);
console.log(`ROI: ${analysis.roiPercentage}%`);

🚀 Performance Optimization Example

// examples/performance-optimization-example.ts
import { PerformanceOptimizer } from "resk-caching";

const optimizer = new PerformanceOptimizer();

// Start cache warming
await optimizer.startCacheWarming({
  strategy: 'popular',
  batchSize: 20,
  maxEntries: 1000
});

// Get optimization recommendations
const recommendations = optimizer.getOptimizationRecommendations();
recommendations.forEach(rec => {
  console.log(`${rec.type}: ${rec.description}`);
});

🧪 Development & Testing Example

// examples/development-testing-example.ts
import { MockLLMProvider } from "resk-caching";

const mockProvider = new MockLLMProvider();

// OpenAI-compatible API for development
const response = await mockProvider.createChatCompletion({
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "Hello!" }]
});

// Run automated test scenarios
const testResults = await mockProvider.runAllTestScenarios();
console.log(`Tests passed: ${testResults.filter(r => r.passed).length}`);

🌟 Complete Demo

# Run the comprehensive demo showcasing all four benefits
npm run example:demo

# Or run individual examples
npm run example:cost-tracking
npm run example:performance
npm run example:development

OpenAPI and clients

  • Fetch the spec: GET /api/openapi.json
  • Use your preferred OpenAPI generator to produce clients/SDKs

Observability

  • Prometheus metrics at /api/metrics
  • OpenTelemetry tracing via OTLP HTTP exporter (configure OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME)
  • Correlation-ID header propagated for easier debugging

Security model (summary)

  • Secrets only on the server (env/secret manager). No secrets in frontend
  • TLS transport; JWT short-lived; per-user/IP rate-limit
  • Optional AES-GCM encryption at rest for persisted cache entries
  • Structured logs with correlation-id; metrics and traces for forensics

License

Apache-2.0 (see LICENSE)

Vector Database Integration

Overview

Resk-Caching supports multiple vector database backends for similarity search and semantic caching. The system can ingest documents, compute embeddings, and store them in vector databases for efficient retrieval.

Supported Vector Databases

  • Chroma: Local or hosted ChromaDB instances
  • Pinecone: Managed vector database service
  • Weaviate: Open-source vector database
  • Milvus: High-performance vector database
  • Custom adapters: Extend for your specific needs

Environment Configuration

# Vector Database Type
export VECTORDB_TYPE=pinecone  # or chroma, weaviate, milvus

# Embedding Provider
export EMBEDDING_PROVIDER=openai  # or huggingface, sentence-transformers
export EMBEDDING_MODEL=text-embedding-ada-002  # OpenAI model name
export OPENAI_API_KEY=your_openai_key_here

# Pinecone Configuration
export PINECONE_API_KEY=your_pinecone_key
export PINECONE_INDEX_HOST=https://your-index.pinecone.io
export PINECONE_INDEX_NAME=your-index-name

# Chroma Configuration
export CHROMA_HOST=localhost
export CHROMA_PORT=8000
export CHROMA_COLLECTION_NAME=documents

# Weaviate Configuration
export WEAVIATE_URL=http://localhost:8080
export WEAVIATE_API_KEY=your_weaviate_key
export WEAVIATE_CLASS_NAME=Document

# Milvus Configuration
export MILVUS_HOST=localhost
export MILVUS_PORT=19530
export MILVUS_COLLECTION_NAME=documents

# Batch Processing
export BATCH_SIZE=100  # Documents per batch for embeddings
export UPSERT_BATCH=50  # Documents per batch for vector DB

Ingestion Script

Use the provided ingestion script to batch process documents:

# Run ingestion
bun run scripts/ingest-example.ts

The script will:

  1. Read documents from your source
  2. Compute embeddings in batches
  3. Store vectors in the configured database
  4. Handle retries and error recovery

Example Ingestion Code

import { createVectorDBAdapter, createEmbeddingProvider } from 'resk-caching';

// Shape of the documents being ingested (adjust to match your source data)
type Document = { id: string; title: string; content: string; source: string; timestamp: string };

// Documents per embedding batch (see BATCH_SIZE in the environment configuration above)
const BATCH_SIZE = Number(process.env.BATCH_SIZE ?? 100);

async function ingestDocuments(documents: Document[]) {
  const vectorDB = createVectorDBAdapter();
  const embeddings = createEmbeddingProvider();
  
  // Process in batches
  for (let i = 0; i < documents.length; i += BATCH_SIZE) {
    const batch = documents.slice(i, i + BATCH_SIZE);
    
    // Compute embeddings
    const vectors = await embeddings.embedBatch(
      batch.map(doc => doc.content)
    );
    
    // Prepare for storage
    const vectorsWithMetadata = batch.map((doc, idx) => ({
      id: doc.id,
      vector: vectors[idx],
      metadata: {
        title: doc.title,
        source: doc.source,
        timestamp: doc.timestamp
      }
    }));
    
    // Store in vector database
    await vectorDB.upsertBatch(vectorsWithMetadata);
  }
}

Vector Search

import { createVectorDBAdapter, createEmbeddingProvider } from 'resk-caching';

async function searchSimilar(query: string, k: number = 5) {
  const vectorDB = createVectorDBAdapter();
  const embeddings = createEmbeddingProvider();
  
  // Get query embedding
  const queryVector = await embeddings.embed(query);
  
  // Search for similar vectors
  const results = await vectorDB.search(queryVector, {
    k,
    threshold: 0.7,  // Similarity threshold
    filters: {
      source: 'knowledge_base',
      timestamp: { $gte: '2024-01-01' }
    }
  });
  
  return results;
}

Performance Considerations

  • Batch sizes: Larger batches (100-500) for embeddings, smaller (50-100) for vector DB operations
  • Parallel processing: Use worker threads for CPU-intensive embedding computation
  • Caching: Cache frequently accessed embeddings and search results
  • Indexing: Ensure proper vector database indexes are created for your use case

Monitoring and Metrics

The system provides metrics for:

  • Embedding computation latency and throughput
  • Vector database operation success rates
  • Search query performance
  • Cache hit rates for vector operations

Access metrics at /api/metrics endpoint.

Next steps

  • Docker image and multi-stage build for slim runtimes
  • LangChain integration helper (middleware to consult cache before LLM calls)
  • LlamaIndex and Vercel AI SDK adapters
  • Pluggable vector stores (Qdrant, Weaviate, Pinecone) with adapters
  • Background refresh policies and stale-while-revalidate
  • Eviction strategies (LRU/LFU) and cache warming CLI
  • Upstash Redis & Redis Cloud deployment templates
  • Benchmarks and load-test recipes (k6/Artillery)
