We provide a full documentation site (MkDocs). See docs/ and the published site: Resk-Caching Docs.
Resk-Caching is a Bun-based backend library/server designed to cache Large Language Model (LLM) responses using vector databases, significantly reducing API costs while maintaining response quality and relevance.
Resk-Caching provides four core benefits of intelligent LLM caching for building and scaling AI applications:
- Up to 90% reduction in LLM API costs through intelligent semantic caching
- Real-time cost tracking with provider-specific pricing (OpenAI, Anthropic, Google, etc.)
- ROI analysis showing exact savings from cache hits vs API calls
- Cost breakdown by provider, model, and time period
- Automatic savings calculation for every cached response
- Sub-5ms response times for cached queries vs 500ms+ for API calls
- Intelligent cache warming strategies (popular, recent, predictive)
- Real-time performance monitoring with benchmarking and optimization recommendations
- Slow query detection with automated performance suggestions
- Cache hit rate optimization through advanced similarity algorithms
- OpenAI-compatible API for offline development without API costs
- Mock LLM provider with customizable responses and scenarios
- Automated testing scenarios with validation and metrics
- Zero-cost development workflows with realistic API simulation
- Circuit breaker patterns for resilient application development
- Cache-first request handling that reduces pressure on upstream API rate limits
- Circuit breaker patterns with automatic failover and recovery
- Health monitoring and real-time system status
- Automatic scaling with proactive cache warming for traffic spikes
- Graceful degradation when external services fail
- Pre-populated Response Database: You maintain a database of high-quality LLM responses to common queries, stored as vector embeddings
- Semantic Matching: When a new query arrives, the system finds the most semantically similar cached response
- Cost Savings: Returns cached responses instead of making new API calls
- Response Selection: Advanced algorithms allow you to choose specific responses based on business logic, user preferences, or A/B testing strategies
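The four steps above amount to a cache-first control flow. The sketch below is illustrative only; embed, search, store, callLLM, and select are hypothetical stand-ins for your embedding model, vector store, LLM client, and variant selector:
// Shapes assumed for illustration only.
type StoredResponse = { id: string; text: string };
type Match = { responses: StoredResponse[]; similarity: number };

interface Deps {
  embed(text: string): Promise<number[]>;              // your embedding model
  search(embedding: number[]): Promise<Match | null>;  // your vector store (KNN)
  store(query: string, embedding: number[], text: string): Promise<void>;
  callLLM(query: string): Promise<string>;             // the real (paid) API call
  select(responses: StoredResponse[]): StoredResponse; // variant selection strategy
}

// Cache-first flow: serve a stored response on a sufficiently similar match,
// otherwise call the LLM once and cache the result for next time.
export async function answer(query: string, deps: Deps, threshold = 0.8): Promise<string> {
  const embedding = await deps.embed(query);
  const match = await deps.search(embedding);
  if (match && match.similarity >= threshold) {
    return deps.select(match.responses).text;  // cache hit: no API cost
  }
  const fresh = await deps.callLLM(query);     // cache miss: pay once
  await deps.store(query, embedding, fresh);   // future similar queries hit the cache
  return fresh;
}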
All Four GPTCache-Style Benefits Implemented:
- Massive Cost Reduction: Up to 90% savings with real-time ROI tracking
- Performance Optimization: Sub-5ms responses with intelligent cache warming
- Development Environment: OpenAI-compatible API for offline testing
- Scalability & Availability: Circuit breakers and automatic failover
- LLM Response Caching: Store and retrieve LLM responses using vector similarity matching
- Multiple Cache Backends: in-memory, SQLite (local persistence), Redis (multi-instance)
- Advanced Response Selection: Deterministic, weighted, and randomized response selection algorithms
- Vector Database Integration: Optimized for semantic search and similarity matching
- AES-GCM Encryption: Secure cache-at-rest protection (optional via env key)
- JWT-Protected API: Secure access with rate limiting and abuse prevention
- OpenAPI 3.1: Auto-generated API documentation from Zod schemas
- Performance Monitoring: Prometheus metrics and OpenTelemetry tracing
- Real-time Updates: WebSockets for instant response distribution
- GPTCache: Great Python-first cache. Resk-Caching focuses on Bun/TypeScript, ships with JWT-secured HTTP API, OpenAPI generation, built-in Prometheus/OTEL, and optional authenticated-at-rest encryption out of the box.
- ModelCache: Provides a semantic cache layer. Resk-Caching adds production concerns (rate-limit, security wrapper, metrics, tracing, OpenAPI, WebSockets) and pluggable backends with zero-code switching via CACHE_BACKEND.
- Upstash Semantic Cache: Managed vector-backed cache. Resk-Caching is open-source, self-hosted by default, and can run fully local with SQLite or purely in-memory while retaining encryption and observability.
- Redis LangCache: Managed Redis-based semantic cache. Resk-Caching supports Redis natively via Bun's RESP3 client while also offering SQLite and in-memory modes for portability and offline development.
- SemantiCache (FAISS): FAISS-native library. Resk-Caching prioritizes a secure, observable HTTP surface with variant selection strategies and can integrate external vector DBs; no GPU dependency required.
If you need a secure, auditable cache service with operational tooling for teams, Resk-Caching is purpose-built for that surface.
- LLM Response Storage: Store pre-computed LLM responses with their vector embeddings for fast retrieval
- Caching Backends: Choose between low-latency memory, local persistence (SQLite), or distributed (Redis) based on your scale
- Response Selection Algorithms: Implement deterministic, weighted, or randomized response selection based on business logic
- Vector Similarity Matching: Find the most semantically similar cached response to incoming queries
- AES-GCM Encryption: Protect sensitive LLM responses at rest with authenticated encryption
- JWT + Rate Limiting: Secure API access and prevent abuse while maintaining performance
- Zod + OpenAPI: Ensure data validation and provide always-in-sync API documentation
- Performance Monitoring: Track cache hit rates, response times, and cost savings in real-time
- Real-time Distribution: Instantly distribute responses across multiple instances and clients
Before using Resk-Caching, you need to have a vector database ready with pre-computed LLM responses. This is the foundation of the caching system:
- Response Database: Create a collection of high-quality LLM responses to common queries
- Vector Embeddings: Generate vector embeddings for each response using your preferred embedding model (see the sketch below)
- Metadata Storage: Store additional context like response quality scores, categories, or business rules
- Similarity Index: Ensure your vector database has proper indexing for fast similarity search
Recommended Vector Databases:
- Pinecone: Excellent for production use with high performance
- Weaviate: Open-source with great similarity search capabilities
- Qdrant: Fast and efficient for real-time applications
- Chroma: Simple local development and testing
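To generate the embeddings called for above, any embedding model works as long as the dimension stays consistent across your system. A minimal sketch, assuming the official openai npm package and the text-embedding-ada-002 model referenced later in the configuration section:
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Compute one embedding per response text; keep the dimension consistent everywhere.
async function embedResponses(texts: string[]): Promise<number[][]> {
  const res = await client.embeddings.create({
    model: "text-embedding-ada-002", // 1536 dimensions
    input: texts,
  });
  return res.data.map((d) => d.embedding);
}

// Example: embeddings for a few pre-approved "thank you" responses.
const vectors = await embedResponses(["You're welcome!", "My pleasure!", "No problem at all!"]);
console.log(vectors[0].length); // 1536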
# as a library (npm)
npm install resk-caching
# as a library (bun)
bun add resk-caching

# Install dependencies
bun install
# Start the server
bun run dev
# The server will be available at http://localhost:3000
- Choose your key-value cache backend:
  - CACHE_BACKEND=memory for local/dev
  - CACHE_BACKEND=sqlite for single-node durability
  - CACHE_BACKEND=redis for distributed/multi-instance
- Choose your vector search strategy for semantic features:
- Default: in-memory vector store (process-local)
- Production: external vector DB (Pinecone/Qdrant/Weaviate/Chroma)
- Alternative: Redis RediSearch vectors or SQLite vector extensions
- Ingest responses and embeddings (see Ingestion or scripts/ingest-example.ts).
- Call /api/semantic/store and /api/semantic/search.
By default, semantic embeddings live in memory. To power vector search with Redis or SQLite, see the guides below.
Use Redis Stack with RediSearch for vector similarity.
Example index and KNN search (1536-dim float32 cosine):
# Create index
redis-cli FT.CREATE idx:llm ON HASH PREFIX 1 llm: SCHEMA \
query TEXT \
embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE \
category TAG SORTABLE \
metadata TEXT
# Insert (embedding must be raw float32 bytes)
redis-cli HSET llm:thank-you query "thank you" category "gratitude" \
embedding "$BINARY_FLOAT32" metadata "{\"tone\":\"friendly\"}"
# KNN search
redis-cli FT.SEARCH idx:llm "*=>[KNN 5 @embedding $vec AS score]" \
PARAMS 2 vec "$QUERY_EMBED_FLOAT32" \
  SORTBY score DIALECT 2 RETURN 3 query category score
Notes:
- Convert number[] → Float32Array → raw bytes for the embedding field (see the sketch below).
- Keep response variants in a secondary key (e.g., llm:<id>:responses) and run variant selection after KNN.
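The conversion from the first note is a one-liner; how the bytes reach Redis depends on your client, which must send them as a binary-safe argument:
// number[] -> Float32Array -> raw bytes, as RediSearch expects for FLOAT32 vectors.
function toFloat32Bytes(embedding: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(embedding).buffer);
}

// Pass the result as the embedding HSET value and as the PARAMS query vector;
// a 1536-dim index expects exactly 1536 * 4 bytes.
const queryEmbedding: number[] = [0.11, 0.19, 0.29]; // in practice, 1536 floats
const bytes = toFloat32Bytes(queryEmbedding);
console.log(bytes.byteLength); // dimension * 4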
Ship SQLite with a vector extension, then create a VSS table and join with metadata:
CREATE VIRTUAL TABLE vss_entries USING vss0(
id TEXT PRIMARY KEY,
embedding(1536)
);
CREATE TABLE llm_entries (
id TEXT PRIMARY KEY,
query TEXT NOT NULL,
category TEXT,
metadata TEXT
);
Insert and search:
-- insert: embedding blob is Float32 (vss_f32)
INSERT INTO vss_entries(id, embedding) VALUES (?, vss_f32(?));
INSERT INTO llm_entries(id, query, category, metadata) VALUES(?, ?, ?, ?);
-- KNN
SELECT e.id, l.query, vss_distance(e.embedding, vss_f32(?)) AS score
FROM vss_entries e
JOIN llm_entries l ON l.id = e.id
ORDER BY score ASC
LIMIT 5;
Notes:
- Convert number[] to a Float32 blob for inserts and for the query embedding (see the sketch below).
- Join back to your stored responses via id or query, then apply variant selection.
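The same conversion applies on the SQLite side. A minimal sketch using bun:sqlite, assuming a vector extension is already loaded and that vss_f32() accepts the bound Float32 blob as in the SQL above:
import { Database } from "bun:sqlite";

const db = new Database("resk-cache.sqlite");
// db.loadExtension("./vss0"); // load your vector extension here (path is illustrative)

// number[] -> Float32 blob for parameter binding.
const toBlob = (v: number[]) => new Uint8Array(new Float32Array(v).buffer);

const embedding = [0.1, 0.2, 0.3]; // in practice, 1536 floats
db.query("INSERT INTO vss_entries(id, embedding) VALUES (?, vss_f32(?))")
  .run("thank-you", toBlob(embedding));
db.query("INSERT INTO llm_entries(id, query, category, metadata) VALUES (?, ?, ?, ?)")
  .run("thank-you", "thank you", "gratitude", JSON.stringify({ tone: "friendly" }));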
# Store multiple "thank you" responses with different tones
curl -X POST http://localhost:3000/api/semantic/store \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "thank you",
"query_embedding": {
"vector": [0.1, 0.2, 0.3],
"dimension": 3
},
"responses": [
{
"id": "resp1",
"text": "You're welcome!",
"metadata": { "tone": "friendly", "formality": "casual" },
"quality_score": 0.95,
"category": "gratitude",
"tags": ["polite", "casual"]
},
{
"id": "resp2",
"text": "My pleasure!",
"metadata": { "tone": "professional", "formality": "formal" },
"quality_score": 0.92,
"category": "gratitude",
"tags": ["polite", "professional"]
},
{
"id": "resp3",
"text": "No problem at all!",
"metadata": { "tone": "casual", "formality": "informal" },
"quality_score": 0.88,
"category": "gratitude",
"tags": ["casual", "friendly"]
}
],
"variant_strategy": "weighted",
"weights": [3, 2, 1],
"seed": "user:123"
}'
Response:
{
"success": true,
"message": "LLM responses stored successfully",
"entry_id": "thank you",
"responses_count": 3
}
# Search for responses to "merci" (French thank you)
curl -X POST http://localhost:3000/api/semantic/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "merci",
"query_embedding": {
"vector": [0.11, 0.19, 0.29],
"dimension": 3
},
"limit": 2,
"similarity_threshold": 0.8
}'
Response:
{
"success": true,
"search_result": {
"query": "merci",
"query_embedding": {
"vector": [0.11, 0.19, 0.29],
"dimension": 3
},
"matches": [
{
"entry": {
"query": "thank you",
"responses": [...],
"variant_strategy": "weighted",
"weights": [3, 2, 1]
},
"similarity_score": 0.997,
"selected_response": {
"id": "resp1",
"text": "You're welcome!",
"metadata": { "tone": "friendly" }
}
}
],
"total_matches": 1,
"search_time_ms": 2
}
}
# Retrieve all stored responses for "thank you"
curl -X GET "http://localhost:3000/api/semantic/responses?query=thank%20you" \
-H "Authorization: Bearer YOUR_API_KEY"Response:
{
"success": true,
"entry": {
"query": "thank you",
"query_embedding": {
"vector": [0.1, 0.2, 0.3],
"dimension": 3
},
"responses": [
{
"id": "resp1",
"text": "You're welcome!",
"metadata": { "tone": "friendly" }
},
{
"id": "resp2",
"text": "My pleasure!",
"metadata": { "tone": "professional" }
},
{
"id": "resp3",
"text": "No problem at all!",
"metadata": { "tone": "casual" }
}
],
"variant_strategy": "weighted",
"weights": [3, 2, 1],
"created_at": "2024-01-15T10:30:00.000Z",
"last_accessed": "2024-01-15T10:35:00.000Z"
}
}
# View cache performance metrics
curl -X GET http://localhost:3000/api/semantic/stats \
-H "Authorization: Bearer YOUR_API_KEY"Response:
{
"success": true,
"cache_type": "InMemoryVectorCache",
"message": "Stats endpoint - implementation needed"
}
# Store responses for different types of greetings
curl -X POST http://localhost:3000/api/semantic/store \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "hello",
"query_embedding": {
"vector": [0.9, 0.8, 0.7],
"dimension": 3
},
"responses": [
{
"id": "hello1",
"text": "Hi there!",
"metadata": { "tone": "friendly", "time_of_day": "any" }
},
{
"id": "hello2",
"text": "Hello! How are you?",
"metadata": { "tone": "polite", "time_of_day": "morning" }
}
],
"variant_strategy": "round-robin"
}'
# Strict similarity matching (only very similar queries)
curl -X POST http://localhost:3000/api/semantic/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"query": "thanks a lot",
"query_embedding": {
"vector": [0.15, 0.25, 0.35],
"dimension": 3
},
"limit": 1,
"similarity_threshold": 0.95
}'
The system automatically tracks comprehensive metrics for all semantic operations:
- Semantic Searches: Total count, duration, and success rates
- Vector Similarity: Distribution of similarity scores
- Response Storage: Count of stored LLM responses by strategy
- Cache Performance: Entry counts and access patterns
- Response Selection: Variant strategy usage and performance
Access metrics at /api/metrics endpoint (Prometheus format).
- Search Speed: Typical semantic searches complete in <5ms
- Memory Usage: Efficient in-memory storage with configurable TTL
- Scalability: Designed for thousands of cached responses
- Accuracy: High-precision vector similarity using cosine distance
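The similarity scores and thresholds used throughout the examples are plain cosine similarity between embeddings. A minimal sketch of the metric, not the library's internal implementation:
// Cosine similarity in [-1, 1]; 1 means identical direction. Vectors must share a dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. a cached "thank you" embedding vs. an incoming "merci" embedding
console.log(cosineSimilarity([0.1, 0.2, 0.3], [0.11, 0.19, 0.29])); // ≈ 0.999 for these toy vectors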
- Vector Dimensions: Use consistent embedding dimensions across your system
- Similarity Thresholds: Start with 0.7-0.8 for production use
- Response Variety: Store 3-5 responses per query for good variant selection
- Metadata: Include rich metadata for better response selection
- TTL Management: Set appropriate expiration times for dynamic content
- PORT (default 3000)
- JWT_SECRET
- CACHE_BACKEND = memory | sqlite | redis
- REDIS_URL (for Redis backend)
- CACHE_ENCRYPTION_KEY (base64, 32 bytes)
- RATE_LIMIT_WINDOW_MS (default 900000)
- RATE_LIMIT_MAX (default 1000)
- OTEL_EXPORTER_OTLP_ENDPOINT (traces), OTEL_SERVICE_NAME
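For reference, these variables and their documented defaults could be read like this (a sketch only, not necessarily how the server's own config loader works):
// Reads the variables listed above, falling back to the documented defaults.
const config = {
  port: Number(process.env.PORT ?? 3000),
  jwtSecret: process.env.JWT_SECRET,                    // required for the JWT-protected API
  cacheBackend: process.env.CACHE_BACKEND ?? "memory",  // "memory" | "sqlite" | "redis"
  redisUrl: process.env.REDIS_URL,                      // only needed for the Redis backend
  encryptionKey: process.env.CACHE_ENCRYPTION_KEY,      // base64, 32 bytes; enables AES-GCM at rest
  rateLimitWindowMs: Number(process.env.RATE_LIMIT_WINDOW_MS ?? 900_000),
  rateLimitMax: Number(process.env.RATE_LIMIT_MAX ?? 1000),
  otlpEndpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  otelServiceName: process.env.OTEL_SERVICE_NAME,
};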
- In-memory (CACHE_BACKEND=memory):
  - Fastest single-process store (Map-based), ideal for development and ephemeral caches
  - Per-key TTL stored alongside values; expired entries are lazily evicted on access
  - No cross-process sharing and no durability
- SQLite (CACHE_BACKEND=sqlite):
  - Local durability using Bun's SQLite; table kv(key TEXT PRIMARY KEY, value TEXT, expiresAt INTEGER)
  - Upsert semantics on set; TTL computed client-side and stored in expiresAt
  - Expired rows are pruned lazily on get; clear() wipes the table
  - File path defaults to resk-cache.sqlite
- Redis (CACHE_BACKEND=redis, REDIS_URL=...):
  - Distributed, multi-instance cache using Bun's native RESP3 client
  - Values are JSON-serialized with optional TTL via EXPIRE
  - Prefix isolation via rc:; clear() scans and deletes only rc:* keys
  - Helpers for experiments (round-robin counters, sets/lists for variants, optional pub/sub)
- GET /health - Health check endpoint
- POST /api/cache (JWT) - Store simple key-value pairs
- POST /api/cache/query (JWT) - Retrieve cached values
- DELETE /api/cache (JWT) - Clear all cache
- GET /api/openapi.json - OpenAPI 3.1 specification from Zod schemas
- GET /api/metrics - Prometheus metrics exposition
- POST /api/cost/record (JWT) - Record LLM API cost for a request
- GET /api/cost/analysis (JWT) - Get comprehensive cost analysis and ROI
- GET /api/cost/breakdown (JWT) - Cost breakdown by provider and model
- GET /api/cost/recent (JWT) - Get recent cost entries
- POST /api/cost/pricing (JWT) - Add custom pricing for provider/model
- GET /api/cost/pricing (JWT) - Get all configured pricing
- POST /api/performance/record (JWT) - Record performance metrics
- GET /api/performance/benchmarks (JWT) - Get performance benchmarks
- GET /api/performance/slow-queries (JWT) - Detect slow queries
- GET /api/performance/recommendations (JWT) - Get optimization recommendations
- POST /api/performance/warming/start (JWT) - Start cache warming strategy
- GET /api/performance/warming/progress (JWT) - Get cache warming progress
- GET /api/performance/metrics (JWT) - Get recent performance metrics
- POST /api/testing/chat/completions (JWT) - OpenAI-compatible chat completions
- POST /api/testing/mock/responses (JWT) - Add custom mock responses
- GET /api/testing/mock/responses (JWT) - Get all mock responses
- POST /api/testing/scenarios (JWT) - Add test scenarios
- GET /api/testing/scenarios (JWT) - Get all test scenarios
- POST /api/testing/scenarios/run (JWT) - Run specific test scenario
- POST /api/testing/scenarios/run-all (JWT) - Run all test scenarios
- GET /api/testing/history (JWT) - Get request history
- POST /api/testing/scenarios/defaults (JWT) - Load default test scenarios
- GET /api/testing/health (JWT) - Get system health status
- GET /api/testing/circuit-breakers (JWT) - Get circuit breaker statistics
- POST /api/semantic/store (JWT) - Store LLM responses with vector embeddings
- POST /api/semantic/search (JWT) - Search for similar queries using semantic similarity
- GET /api/semantic/responses (JWT) - Get all responses for a specific query
- GET /api/semantic/stats (JWT) - Get cache statistics and performance metrics
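As an example of calling these endpoints over HTTP, the sketch below posts to /api/cost/record. The body fields mirror the library's recordCost example later in this README and are an assumption; the authoritative schema is the OpenAPI spec at /api/openapi.json:
// Record the cost of an uncached LLM call via the JWT-protected HTTP API.
const res = await fetch("http://localhost:3000/api/cost/record", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY",
  },
  body: JSON.stringify({
    provider: "openai",
    model: "gpt-4",
    inputTokens: 150,
    outputTokens: 200,
    cacheHit: false,
  }),
});
console.log(await res.json());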
- Store Responses: First, store your pre-computed LLM responses with their vector embeddings
- User Query: When a user sends a message (e.g., "merci", "merci pour ta rΓ©ponse")
- Vector Search: The system finds semantically similar queries in your database
- Response Selection: Uses advanced algorithms to choose the most appropriate response
- Return Result: Sends back a varied, contextually relevant response
Store multiple responses for "thank you" queries:
{
"query": "thank you",
"query_embedding": {
"vector": [0.1, 0.2, 0.3, 0.4, 0.5],
"dimension": 5
},
"responses": [
{
"id": "thank_1",
"text": "You're welcome! I'm glad I could help.",
"metadata": {"tone": "friendly", "formality": "casual"},
"quality_score": 0.9,
"category": "gratitude"
},
{
"id": "thank_2",
"text": "My pleasure! Feel free to ask if you need anything else.",
"metadata": {"tone": "professional", "formality": "formal"},
"quality_score": 0.85,
"category": "gratitude"
}
],
"variant_strategy": "weighted",
"weights": [3, 2],
"seed": "user:123"
}
- random: Uniform random selection for variety
- round-robin: Cycles through responses systematically
- deterministic: Stable selection based on seed (user ID, conversation ID)
- weighted: Probability-based selection according to quality scores or preferences
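The strategies above boil down to small selection functions. The sketch below illustrates weighted and deterministic selection in isolation; it is not the library's internal code:
type Variant = { id: string; text: string };

// weighted: higher weight => proportionally higher chance of being picked.
function pickWeighted(variants: Variant[], weights: number[]): Variant {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < variants.length; i++) {
    r -= weights[i];
    if (r <= 0) return variants[i];
  }
  return variants[variants.length - 1];
}

// deterministic: the same seed (e.g. "user:123") always maps to the same variant,
// keeping answers stable per user or per conversation.
function pickDeterministic(variants: Variant[], seed: string): Variant {
  let h = 0;
  for (const ch of seed) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return variants[h % variants.length];
}

const variants: Variant[] = [
  { id: "resp1", text: "You're welcome!" },
  { id: "resp2", text: "My pleasure!" },
  { id: "resp3", text: "No problem at all!" },
];
console.log(pickWeighted(variants, [3, 2, 1]).text);       // ~50% resp1, ~33% resp2, ~17% resp3
console.log(pickDeterministic(variants, "user:123").text); // always the same for user:123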
When a user sends "merci pour ta rΓ©ponse", the system:
- Converts the message to a vector embedding
- Finds similar queries in the database (e.g., "thank you", "thanks", "merci")
- Selects the best match based on similarity score
- Applies the variant strategy to choose a response
- Returns the selected response with metadata
This approach ensures users get varied, contextually appropriate responses while maintaining the high quality of pre-approved LLM outputs.
import { selectCache, globalCostTracker, globalPerformanceOptimizer } from "resk-caching";
// Basic cache usage
const cache = selectCache();
await cache.set("key", { payload: true }, 60);
const val = await cache.get("key");
// Cost tracking integration
const cacheResult = await cache.search(query);
if (cacheResult) {
// Cache hit - record savings
globalCostTracker.recordCost({
provider: "openai",
model: "gpt-4",
inputTokens: 150,
outputTokens: 200,
cacheHit: true
});
} else {
// Cache miss - record actual cost
const response = await llmApi.createCompletion(query);
globalCostTracker.recordCost({
provider: "openai",
model: "gpt-4",
inputTokens: response.usage.prompt_tokens,
outputTokens: response.usage.completion_tokens,
cacheHit: false
});
}
// Performance monitoring
globalPerformanceOptimizer.recordMetric({
operation: 'search',
duration: responseTime,
cacheHit: !!cacheResult,
backend: 'redis'
});
// examples/cost-tracking-example.ts
import { CostTracker } from "resk-caching";
const tracker = new CostTracker();
// Record API costs
tracker.recordCost({
provider: "openai",
model: "gpt-4",
inputTokens: 150,
outputTokens: 300,
cacheHit: false
});
// Get ROI analysis
const analysis = tracker.getCostAnalysis(30); // 30 days
console.log(`Total Savings: $${analysis.totalSavings}`);
console.log(`ROI: ${analysis.roiPercentage}%`);
// examples/performance-optimization-example.ts
import { PerformanceOptimizer } from "resk-caching";
const optimizer = new PerformanceOptimizer();
// Start cache warming
await optimizer.startCacheWarming({
strategy: 'popular',
batchSize: 20,
maxEntries: 1000
});
// Get optimization recommendations
const recommendations = optimizer.getOptimizationRecommendations();
recommendations.forEach(rec => {
console.log(`${rec.type}: ${rec.description}`);
});
// examples/development-testing-example.ts
import { MockLLMProvider } from "resk-caching";
const mockProvider = new MockLLMProvider();
// OpenAI-compatible API for development
const response = await mockProvider.createChatCompletion({
model: "gpt-3.5-turbo",
messages: [{ role: "user", content: "Hello!" }]
});
// Run automated test scenarios
const testResults = await mockProvider.runAllTestScenarios();
console.log(`Tests passed: ${testResults.filter(r => r.passed).length}`);
# Run the comprehensive demo showcasing all four benefits
npm run example:demo
# Or run individual examples
npm run example:cost-tracking
npm run example:performance
npm run example:development
- Fetch the spec: GET /api/openapi.json
- Use your preferred OpenAPI generator to produce clients/SDKs
- Prometheus metrics at /api/metrics
- OpenTelemetry tracing via OTLP HTTP exporter (configure OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME)
- Correlation-ID header propagated for easier debugging
- Secrets only on the server (env/secret manager). No secrets in frontend
- TLS transport; JWT short-lived; per-user/IP rate-limit
- Optional AES-GCM encryption at rest for persisted cache entries
- Structured logs with correlation-id; metrics and traces for forensics
Apache-2.0 (see LICENSE)
Resk-Caching supports multiple vector database backends for similarity search and semantic caching. The system can ingest documents, compute embeddings, and store them in vector databases for efficient retrieval.
- Chroma: Local or hosted ChromaDB instances
- Pinecone: Managed vector database service
- Weaviate: Open-source vector database
- Milvus: High-performance vector database
- Custom adapters: Extend for your specific needs
# Vector Database Type
export VECTORDB_TYPE=pinecone # or chroma, weaviate, milvus
# Embedding Provider
export EMBEDDING_PROVIDER=openai # or huggingface, sentence-transformers
export EMBEDDING_MODEL=text-embedding-ada-002 # OpenAI model name
export OPENAI_API_KEY=your_openai_key_here
# Pinecone Configuration
export PINECONE_API_KEY=your_pinecone_key
export PINECONE_INDEX_HOST=https://your-index.pinecone.io
export PINECONE_INDEX_NAME=your-index-name
# Chroma Configuration
export CHROMA_HOST=localhost
export CHROMA_PORT=8000
export CHROMA_COLLECTION_NAME=documents
# Weaviate Configuration
export WEAVIATE_URL=http://localhost:8080
export WEAVIATE_API_KEY=your_weaviate_key
export WEAVIATE_CLASS_NAME=Document
# Milvus Configuration
export MILVUS_HOST=localhost
export MILVUS_PORT=19530
export MILVUS_COLLECTION_NAME=documents
# Batch Processing
export BATCH_SIZE=100 # Documents per batch for embeddings
export UPSERT_BATCH=50 # Documents per batch for vector DB
Use the provided ingestion script to batch process documents:
# Run ingestion
bun run scripts/ingest-example.ts
The script will:
- Read documents from your source
- Compute embeddings in batches
- Store vectors in the configured database
- Handle retries and error recovery
import { createVectorDBAdapter } from 'resk-caching';
import { createEmbeddingProvider } from 'resk-caching';
async function ingestDocuments(documents: Document[]) {
const vectorDB = createVectorDBAdapter();
const embeddings = createEmbeddingProvider();
// Process in batches
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
const batch = documents.slice(i, i + BATCH_SIZE);
// Compute embeddings
const vectors = await embeddings.embedBatch(
batch.map(doc => doc.content)
);
// Prepare for storage
const vectorsWithMetadata = batch.map((doc, idx) => ({
id: doc.id,
vector: vectors[idx],
metadata: {
title: doc.title,
source: doc.source,
timestamp: doc.timestamp
}
}));
// Store in vector database
await vectorDB.upsertBatch(vectorsWithMetadata);
}
}
import { createVectorDBAdapter, createEmbeddingProvider } from 'resk-caching';
async function searchSimilar(query: string, k: number = 5) {
const vectorDB = createVectorDBAdapter();
const embeddings = createEmbeddingProvider();
// Get query embedding
const queryVector = await embeddings.embed(query);
// Search for similar vectors
const results = await vectorDB.search(queryVector, {
k,
threshold: 0.7, // Similarity threshold
filters: {
source: 'knowledge_base',
timestamp: { $gte: '2024-01-01' }
}
});
return results;
}
- Batch sizes: Larger batches (100-500) for embeddings, smaller (50-100) for vector DB operations
- Parallel processing: Use worker threads for CPU-intensive embedding computation
- Caching: Cache frequently accessed embeddings and search results
- Indexing: Ensure proper vector database indexes are created for your use case
The system provides metrics for:
- Embedding computation latency and throughput
- Vector database operation success rates
- Search query performance
- Cache hit rates for vector operations
Access metrics at /api/metrics endpoint.
- Docker image and multi-stage build for slim runtimes
- LangChain integration helper (middleware to consult cache before LLM calls)
- LlamaIndex and Vercel AI SDK adapters
- Pluggable vector stores (Qdrant, Weaviate, Pinecone) with adapters
- Background refresh policies and stale-while-revalidate
- Eviction strategies (LRU/LFU) and cache warming CLI
- Upstash Redis & Redis Cloud deployment templates
- Benchmarks and load-test recipes (k6/Artillery)