Retail Support Desk - Semantic Caching Demo

An AI-powered retail customer support system demonstrating measurable cost and performance benefits through semantic caching with ElastiCache (Valkey), AWS Bedrock AgentCore, and multi-agent orchestration.

🎯 Project Goals

This application provides developers and AWS customers with a concrete, measurable demonstration of cost and performance benefits achievable through semantic caching in AI applications. The demo serves as a reference implementation showcasing how ElastiCache's vector search capabilities combined with Titan embeddings enable intelligent response caching in a modern multi-agent application deployed on Bedrock AgentCore.

Key Demonstrations:

  • Handling request surges during peak sales events (e.g., Black Friday)
  • Reducing response latency from seconds to milliseconds through semantic caching
  • Cutting LLM token costs by avoiding redundant agent calls
  • Multi-agent orchestration using AWS Bedrock AgentCore and Strands framework
  • Real-time metrics visualization via CloudWatch Dashboard

✨ Key Features

  • Semantic Caching: Uses Titan embeddings with HNSW algorithm (0.80 similarity threshold) to identify semantically similar requests
  • Multi-Agent Architecture: SupportAgent orchestrates with OrderTrackingAgent using Strands framework
  • Intelligent Cache Layer: @entrypoint intercepts requests before agent invocation, returning cached responses in <100ms
  • Agent Tooling: OrderTrackingAgent uses decorated tools to check order status and delivery information
  • Real-time Metrics: CloudWatch Dashboard visualizes latency reduction, cost savings, cache efficiency, and match scores
  • Traffic Simulation: Lambda-based ramp-up simulator (1 → 11 requests/second over 180 seconds)
  • Conference-Ready: Live metrics suitable for projection, optional cache reset between demos

🏗️ Architecture

Architecture Diagram

Data Flow

  1. Client initiates request via API Gateway to Ramp-up Simulator (Lambda)
  2. Lambda gradually increases throughput (1 → 11 requests/second over 180 seconds)
  3. @entrypoint generates Titan embedding and queries ElastiCache vector index
  4. Cache Hit (≥0.80 similarity):
    • Returns cached response immediately (<100ms)
    • Logs metrics: latency, avoided cost, match score
  5. Cache Miss:
    • Forwards to SupportAgent (Claude Sonnet 4)
    • SupportAgent analyzes request, consults OrderTrackingAgent if needed
    • OrderTrackingAgent uses @tool decorated functions to check order status
    • Response returned to @entrypoint, cached with embedding
  6. CloudWatch receives metrics for all requests (cache hits/misses, latency, costs)
  7. Dashboard visualizes cumulative effectiveness in real-time
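
To make the flow concrete, the sketch below outlines the cache-lookup half of this path in Python. It is an illustration only, not the repository's entrypoint code: it assumes a cosine-distance HNSW index queried through a redis-py style client (the demo itself uses valkey-glide), and the endpoint value is a placeholder. On a miss, the real entrypoint additionally forwards the request to the SupportAgent and writes the new response back with its embedding.

# Minimal sketch of the @entrypoint cache lookup (illustrative; not the repository's code)
import json

import boto3
import numpy as np
import redis  # stand-in client; the demo itself uses valkey-glide

SIMILARITY_THRESHOLD = 0.80
INDEX_NAME = "idx:requests"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-2")
cache = redis.Redis(host="your-cluster.cache.amazonaws.com", port=6379)  # placeholder endpoint

def embed(text: str) -> bytes:
    # Titan Embeddings v2 returns a 1024-dimensional float vector
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    vector = json.loads(resp["body"].read())["embedding"]
    return np.array(vector, dtype=np.float32).tobytes()

def cached_response(question: str):
    """Return the cached response for a semantically similar question, or None on a miss."""
    reply = cache.execute_command(
        "FT.SEARCH", INDEX_NAME,
        "*=>[KNN 1 @embedding $vec AS distance]",
        "PARAMS", "2", "vec", embed(question),
        "RETURN", "2", "request_id", "distance",
        "DIALECT", "2",
    )
    if reply[0] == 0:
        return None  # empty index
    fields = dict(zip(reply[2][0::2], reply[2][1::2]))
    similarity = 1.0 - float(fields[b"distance"])  # assuming cosine distance
    if similarity < SIMILARITY_THRESHOLD:
        return None  # miss: the real entrypoint invokes the SupportAgent and caches the result
    return cache.hgetall(f"rr:{fields[b'request_id'].decode()}")  # hit: served in well under 100 ms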

🚀 Quick Start

Prerequisites

AWS Account Requirements:

  • AWS Account with Bedrock model access enabled:
    • Claude Sonnet 4 (anthropic.claude-sonnet-4-20250514)
    • Claude 3.5 Haiku (anthropic.claude-3-5-haiku-20241022)
    • Titan Embeddings v2 (amazon.titan-embed-text-v2:0)
  • IAM permissions to create: Lambda, ElastiCache, CloudWatch, CodeBuild, ECR, IAM roles

Required Tools:

Tool    | Version | Install Command            | Verify
--------|---------|----------------------------|-------------------
AWS CLI | 2.x     | Install Guide              | aws --version
Node.js | 18+     | nodejs.org                 | node --version
AWS CDK | 2.x     | npm install -g aws-cdk     | cdk --version
Python  | 3.12+   | python.org                 | python3 --version
Go      | 1.21+   | go.dev (for Lambda builds) | go version

AWS CLI Configuration:

# Configure credentials (if not already done)
aws configure --profile semantic-cache-demo

# Set as default for this session
export AWS_PROFILE=semantic-cache-demo
export AWS_REGION=us-east-2

Running the Application

  1. Clone the repository
git clone https://github.com/vasigorc/valkey-semantic-cache-demo.git
cd valkey-semantic-cache-demo
  2. Deploy all infrastructure (single command)
# Deploy all 7 stacks + create index + deploy agent
./deploy.sh --all

# Or deploy infrastructure only (then manually deploy agent)
./deploy.sh
./scripts/trigger-agent-deploy.sh
  3. Trigger simulation
aws lambda invoke \
  --function-name semantic-cache-demo-ramp-up-simulator \
  --region us-east-2 \
  --payload '{}' \
  response.json
  4. View metrics

    Navigate to CloudWatch Dashboard in AWS Console to observe:

    • Average latency reduction
    • Bedrock cost savings
    • Cache hit ratio
    • Match score distribution

Resetting the Demo

For conference presentations, reset the cache between runs via the cache management Lambda:

# Via AWS CLI
aws lambda invoke \
  --function-name semantic-cache-demo-cache-management \
  --region us-east-2 \
  --payload '{"action": "reset-cache"}' \
  response.json && cat response.json

# Or via AWS Console: Test tab with payload {"action": "reset-cache"}

Other cache management actions:

# Health check (verify connectivity, dbsize, index status)
aws lambda invoke \
  --function-name semantic-cache-demo-cache-management \
  --region us-east-2 \
  --payload '{"action": "health-check"}' \
  response.json && cat response.json

# Create index (if needed after full cache flush)
aws lambda invoke \
  --function-name semantic-cache-demo-cache-management \
  --region us-east-2 \
  --payload '{"action": "create-index"}' \
  response.json && cat response.json
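
For orientation, the sketch below shows how such an action dispatcher can be structured. It is illustrative only: the actual handler in lambda/cache_management/ uses valkey-glide-sync, whereas this sketch uses a redis-py style client, and the ELASTICACHE_HOST variable name is an assumption.

# Illustrative action dispatcher for the cache management Lambda (the real handler in
# lambda/cache_management/ uses valkey-glide-sync; a redis-py style client is shown for brevity)
import os

import redis

cache = redis.Redis(host=os.environ["ELASTICACHE_HOST"], port=6379)  # assumed variable name

def handler(event, _context):
    action = event.get("action")
    if action == "health-check":
        return {
            "ping": cache.ping(),
            "dbsize": cache.dbsize(),
            "indexes": [name.decode() for name in cache.execute_command("FT._LIST")],
        }
    if action == "reset-cache":
        deleted = 0
        # SCAN + DEL so the reset never blocks the cluster with one large flush
        for pattern in ("request:vector:*", "rr:*", "metrics:*"):
            for key in cache.scan_iter(match=pattern, count=500):
                cache.delete(key)
                deleted += 1
        return {"deleted_keys": deleted}
    if action == "create-index":
        # HNSW index over 1024-dimensional Titan embeddings; schema matches the
        # Data Structure section below
        cache.execute_command(
            "FT.CREATE", "idx:requests", "ON", "HASH", "PREFIX", "1", "request:vector:",
            "SCHEMA",
            "request_id", "TAG",
            "embedding", "VECTOR", "HNSW", "6",
            "TYPE", "FLOAT32", "DIM", "1024", "DISTANCE_METRIC", "COSINE",
            "timestamp", "NUMERIC",
        )
        return {"created": "idx:requests"}
    return {"error": f"unknown action: {action}"}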

🛠️ Tech Stack

Core Infrastructure

  • AWS Bedrock AgentCore: Multi-agent runtime with Strands framework
  • AWS Lambda: Serverless compute for simulation and cache management
  • API Gateway: HTTP endpoint for client requests
  • ElastiCache (Valkey): Vector database with HNSW indexing
  • CloudWatch: Metrics aggregation and dashboard visualization

AI Components

  • Claude Sonnet 4: Primary SupportAgent for complex query analysis
  • Claude 3.5 Haiku: OrderTrackingAgent for fast tool invocations
  • Titan Embeddings: Generate 1024-dimensional vectors for semantic search
  • Strands Framework: Agent orchestration, @entrypoint and @tool decorators
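
The sketch below illustrates the agents-as-tools pattern this stack relies on, using Strands-style Agent and @tool constructs. It is a simplified outline rather than the repository's actual agent code (which lives in agents/); the system prompts and the canned order lookup are placeholders.

# Simplified sketch of the SupportAgent -> OrderTrackingAgent orchestration
# (illustrative; not the repository's actual agent definitions)
from strands import Agent, tool

@tool
def check_order_status(order_id: str) -> str:
    """Look up the status and delivery estimate for a retail order."""
    # The demo would query an order system here; a canned answer keeps the sketch self-contained
    return f"Order {order_id} has shipped and is expected to arrive in 2 business days."

# Fast, inexpensive model for tool-heavy lookups
order_tracking_agent = Agent(
    model="anthropic.claude-3-5-haiku-20241022",
    tools=[check_order_status],
    system_prompt="You answer order tracking questions using the provided tools.",
)

@tool
def order_tracking(question: str) -> str:
    """Delegate order tracking questions to the OrderTrackingAgent."""
    return str(order_tracking_agent(question))

# Primary agent for complex retail support queries
support_agent = Agent(
    model="anthropic.claude-sonnet-4-20250514",
    tools=[order_tracking],
    system_prompt="You are a retail support agent. Use order_tracking for shipping questions.",
)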

Development

  • Python 3.12+: Primary language for Strands agents
  • Conventional Commits: Git commit message format
  • CloudFormation/SAM: Infrastructure as Code (when applicable)

📊 Data Structure

Vector Index (Semantic Search)

Index Name: idx:requests
Key Pattern: request:vector:{uuid}
Fields:
  - request_id (TAG)
  - embedding (VECTOR HNSW, 1024 dimensions)
  - timestamp (NUMERIC)

Request-Response Store

Key: rr:{request_id}
Hash:
  - request_text (string)
  - response_text (string)
  - tokens_input (numeric)
  - tokens_output (numeric)
  - cost_dollars (numeric)
  - created_at (timestamp)
  - agent_chain (string) # e.g., "SupportAgent->OrderTrackingAgent"

Metrics Store

Key: metrics:global
Hash:
  - total_requests (numeric)
  - cache_hits (numeric)
  - cache_misses (numeric)
  - total_cost_savings_dollars (numeric)
  - avg_hit_proximity_match_score (numeric)
  - avg_latency_cached_ms (numeric)
  - avg_latency_uncached_ms (numeric)

🗺️ Project Timeline

Task 1: Foundation

  • AWS account setup and Bedrock access verification
  • IAM role configuration (Lambda, Bedrock, ElastiCache, CloudWatch)
  • Repository structure and collaboration workflow
  • Architecture document and technical narrative

Task 2: ElastiCache Integration

  • ElastiCache (Valkey) cluster provisioning
  • Vector index schema creation (HNSW, 1024 dimensions)
  • @entrypoint implementation in AgentCore
  • Titan Embeddings integration for semantic search
  • Basic cache hit/miss logic

Task 3: SupportAgent Integration

  • SupportAgent deployment on AgentCore
  • System prompt design for retail support context
  • Integration with @entrypoint for cache misses
  • End-to-end test: cache miss → SupportAgent → cache write

Task 4: CloudWatch Integration

  • Metrics emission from @entrypoint (latency, cost, hit ratio)
  • CloudWatch custom metric definitions
  • Dashboard creation with real-time visualization
  • VPC endpoint for CloudWatch Monitoring service
  • Alerts configuration (optional)
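
A minimal emission helper along these lines publishes per-request metrics with boto3; the metric and dimension names here are illustrative examples, not necessarily the names the dashboard widgets reference.

# Illustrative metric emission from the @entrypoint (metric names are examples only)
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")
NAMESPACE = "SemanticSupportDesk"

def emit_request_metrics(cache_hit: bool, latency_ms: float,
                         cost_saved_dollars: float = 0.0,
                         match_score: float | None = None) -> None:
    data = [
        {"MetricName": "CacheHit" if cache_hit else "CacheMiss", "Value": 1, "Unit": "Count"},
        {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds",
         "Dimensions": [{"Name": "CacheResult", "Value": "hit" if cache_hit else "miss"}]},
    ]
    if cache_hit:
        data.append({"MetricName": "CostSavings", "Value": cost_saved_dollars, "Unit": "None"})
        if match_score is not None:
            data.append({"MetricName": "MatchScore", "Value": match_score, "Unit": "None"})
    cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=data)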

Task 5: Multi-Agent Scenario

  • OrderTrackingAgent deployment (Claude 3.5 Haiku)
  • @tool decorator implementation for order status checks
  • Agent orchestration: SupportAgent → OrderTrackingAgent
  • Tool invocation testing and validation
  • Token accumulation from sub-agent calls

Task 6: Integration & Testing

  • End-to-end flow validation
  • Error handling (rate limits, model availability)
  • Local and AWS deployment testing
  • Performance optimization
  • Seed data creation (sample requests)

Task 7: Simulation & Presentation Layer

  • Ramp-up Lambda implementation (1 → 11 requests/second over 180s)
  • Go-based Lambda with deterministic question selection
  • S3-based seed questions (50 base scenarios + 450 variations)
  • SAM template with IAM policies for deployment
  • Rate limiting and session pooling for AgentCore stability
  • CloudWatch Logs integration for monitoring
  • API Gateway configuration (optional - direct Lambda invocation works)
  • Cache reset Lambda (optional - manual flush via redis-cli)
  • Demo script and walkthrough

Task 8: Dashboard Enhancements

  • Add Cost Reduction % single-value widget (expression-based)
  • Add Pie Chart for Cache Hits vs Misses distribution
  • Remove Similarity Score Distribution widget (not a clear KPI)
  • Reorganize layout for business impact (top row: key KPIs)
  • Deploy and validate updated dashboard

Task 9: Cache Management Lambda (Eliminate EC2 for Cache Ops)

  • Create separate Python Lambda (cache_management/) with valkey-glide-sync
  • Implement health-check action (ping, DBSIZE, FT._LIST)
  • Implement reset-cache action (SCAN + DEL for cached keys)
  • Implement create-index action (FT.CREATE with HNSW)
  • SAM template with VPC config for ElastiCache access
  • Deploy and test all actions

Task 10: CDK Consolidation + AgentCore Deployment Automation (Merged)

  • CodeBuild automation for AgentCore deployment (eliminates EC2 jump host)
  • Master deploy.sh script - single command to deploy all 7 stacks
  • Master teardown.sh script - single command to delete all stacks
  • CDK project initialized with ElastiCache and AgentCore stacks (scaffolding)
  • Complete CDK migration (deferred to post-demo, master scripts achieve the goal)

Task 11: Simple Demo UI

  • Create static HTML/JS page (Start, Reset buttons, 4 KPI cards)
  • Create Metrics API Lambda (queries CloudWatch, returns JSON)
  • Create API Gateway (POST /start, POST /reset, GET /metrics)
  • Add polling/auto-refresh logic (every 5s during demo)
  • Add CloudWatch dashboard iframe for detailed view (optional)
  • Style for conference projection (large fonts, high contrast)
  • Deploy to S3 + CloudFront (static hosting with HTTPS)
  • Add UI resources to CDK stack

Task 12: Demo Script Simplification

  • Create 5-minute script outline (Problem → Solution → Live Demo → Results)
  • Prepare business impact talking points (cost savings, latency reduction)
  • Record fallback demo video (3-minute backup recording)
  • Create simplified slides (3-4 slides, optional)
  • Practice run with timing (ensure < 5 minutes)
  • Update SCRIPT.md with new simplified version

🎪 Conference Demo Flow

  1. Fresh Start: Show CloudWatch Dashboard (all metrics at zero)
  2. Ramp-up Simulation: Invoke Lambda via AWS Console or CLI
    • Linear ramp: 1 → 11 req/s over 180 seconds (~1,080 total requests)
    • First 90s: Base questions prime the cache
    • Second 90s: Variations hit cache (80%+ hit rate)
  3. Live Metrics: CloudWatch Dashboard shows real-time results
    • Cache hit ratio: 45% → 90% in 1 minute
    • Average cache hit latency: ~90ms (vs. 2-3s for full agent chain)
    • Cost savings: ~$2.33 per 30 minutes of traffic
    • Total cache hits: 600+ requests served from cache
  4. Key Takeaway: Semantic caching enables handling traffic surges (Black Friday) with:
    • 20-30x latency reduction
    • Significant cost savings by avoiding redundant LLM calls
    • 90% cache efficiency for similar customer queries

📝 Configuration

Environment Variables

AgentCore / Lambda

AWS_REGION=us-east-2
ELASTICACHE_ENDPOINT=your-cluster.cache.amazonaws.com:6379
BEDROCK_SUPPORT_AGENT_MODEL=anthropic.claude-sonnet-4-20250514
BEDROCK_TRACKING_AGENT_MODEL=anthropic.claude-3-5-haiku-20241022
EMBEDDING_MODEL=amazon.titan-embed-text-v2:0
SIMILARITY_THRESHOLD=0.80
CLOUDWATCH_NAMESPACE=SemanticSupportDesk

Simulation Parameters

RAMP_START_RPS=1
RAMP_END_RPS=11
RAMP_DURATION_SECONDS=180
REQUEST_TEMPLATES_PATH=./templates/requests.json
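
The ramp itself is simple linear interpolation between the start and end rates. The Python sketch below mirrors the parameters above for illustration; the actual simulator is the Go Lambda described earlier.

# Linear ramp-up math, sketched in Python for illustration (the real simulator is a Go Lambda)
RAMP_START_RPS = 1
RAMP_END_RPS = 11
RAMP_DURATION_SECONDS = 180

def target_rps(elapsed_seconds: float) -> float:
    """Requests per second to send at a given point in the 180-second ramp."""
    progress = min(max(elapsed_seconds / RAMP_DURATION_SECONDS, 0.0), 1.0)
    return RAMP_START_RPS + (RAMP_END_RPS - RAMP_START_RPS) * progress

# Average rate is (1 + 11) / 2 = 6 RPS, so a full ramp sends roughly 6 * 180 = 1,080 requests,
# matching the total quoted in the Conference Demo Flow section.
total = sum(target_rps(second) for second in range(RAMP_DURATION_SECONDS))
print(f"approximate total requests: {total:.0f}")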

🤝 Development Principles

  • Incremental Progress: Small, testable steps with frequent Git commits
  • Conventional Commits: Structured commit messages (feat:, fix:, docs:)
  • Red → Green → Refactor: TDD approach for cacheable business logic
  • Error Handling First: No silent failures, comprehensive logging
  • Progress Tracking: Maintained via Progress.md at repository root
  • Immediate Error Resolution: Stop and fix blockers before proceeding

📦 Project Structure

valkey-semantic-cache-demo/
├── README.md
├── Progress.md
├── semantic_support_desk_arch.png  # Architecture diagram
├── agents/
│   ├── support_agent.py            # Main SupportAgent
│   ├── order_tracking_agent.py     # OrderTrackingAgent with @tool decorators
│   └── entrypoint.py               # @entrypoint with caching logic
├── lambda/
│   ├── ramp_up_simulator/          # Traffic simulation Lambda (Go)
│   └── cache_management/           # Cache management Lambda (Python)
├── infrastructure/
│   ├── cloudformation/             # Infrastructure as Code
│   └── elasticache_config/         # Valkey cluster configuration
├── templates/
│   └── requests.json               # Sample request templates for simulation
├── scripts/
│   ├── deploy.sh                   # Deployment automation
│   └── reset-demo.sh               # Demo reset script
└── tests/
    └── unit/                       # Unit tests for agents

🐛 Troubleshooting

ElastiCache Connection Fails

  • Verify VPC security groups allow traffic from Lambda
  • Check ElastiCache cluster status in AWS Console
  • Review Lambda CloudWatch logs for connection errors

AgentCore Agent Invocation Errors

  • Confirm Bedrock model access in IAM permissions
  • Verify model IDs match available Bedrock models in region
  • Check Lambda timeout settings (increase if needed for agent chains)

Metrics Not Appearing in CloudWatch

  • Ensure CloudWatch PutMetricData permissions in Lambda role
  • Verify custom namespace matches Dashboard configuration
  • Check for rate limiting on CloudWatch API calls

Cache Hit Ratio Lower Than Expected

  • Review the similarity threshold (default 0.80); lowering it yields more hits at the cost of looser matches
  • Examine request diversity in simulation templates
  • Verify embeddings are being generated correctly

Throughput Limits

The demo is constrained by several AWS service limits:

Factor                           | Limit                      | Impact
---------------------------------|----------------------------|------------------------------------------------------------
AgentCore InvokeAgentRuntime TPS | 25 TPS per agent           | Primary bottleneck: max 25 concurrent requests
AgentCore Active Sessions        | 500 concurrent (us-east-2) | Sessions idle for 15 min; requires session pooling
AgentCore New Session Rate       | 100 TPM per endpoint       | Limits new session creation speed
Cache Miss Latency               | ~5-10 seconds              | Slow agent responses during cache priming create a backlog
AWS SDK Rate Limiter             | Built-in retry quota       | "failed to get rate limit token" when the SDK limiter is exhausted

Effective throughput: ~5-6 RPS (~330 requests/min) due to these constraints. All limits are adjustable via AWS Service Quotas.
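
A client-side throttle is the usual way to stay under the 25 TPS InvokeAgentRuntime quota without exhausting the SDK's retry budget. The token bucket below is a generic Python sketch of that idea, not the Go simulator's actual rate limiter or session pool.

# Generic client-side throttle sketch for staying under the AgentCore TPS quota
# (illustrative only; the Go simulator's rate limiting and session pooling differ)
import threading
import time

class TokenBucket:
    """Simple token bucket: allows at most `rate` calls per second with a small burst."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.02)  # back off briefly instead of hammering the endpoint

limiter = TokenBucket(rate=20, burst=5)  # keep a safety margin below the 25 TPS quota
# Before each InvokeAgentRuntime call: limiter.acquire()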
