Evaluate how a semantic cache performs on your dataset by computing key metrics over a threshold sweep and producing plots and CSVs.
The pipeline finds nearest matches for each user query using text embeddings, optionally asks an LLM to judge similarity, computes metrics across score thresholds, and generates plots — with support for local or S3 inputs/outputs and optional GPU acceleration.
Why does the full analysis mode require an LLM? We use an LLM-as-a-Judge to produce proxy ground‑truth labels for each
`(query, match)` pair, so you can calculate precision without manual annotation.
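Concretely, the judge's verdict for each pair becomes the label against which cache decisions are scored: a match served above the threshold counts as a true positive when the judge deems the pair similar, and a false positive otherwise. A minimal sketch (illustrative only; the column names mirror the outputs described later):

```python
import pandas as pd

# Toy (query, match) pairs: embedding score plus the judge's verdict as a proxy label
df = pd.DataFrame({
    "similarity_score": [0.93, 0.88, 0.71, 0.65],
    "actual_label":     [1, 0, 1, 0],   # 1 = judge says the pair is truly similar
})

threshold = 0.8
served = df[df["similarity_score"] >= threshold]   # pairs the cache would serve as hits
precision = (served["actual_label"] == 1).mean()   # fraction of served hits the judge agreed with
print(precision)  # 0.5 in this toy example
```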
- Two evaluation modes in one script: choose full LLM-judged metrics (`evaluation.py --full`) or fast cache-hit-ratio-only analysis (`evaluation.py`).
- Conditional LLM dependency: the `llm-sim-eval` package is only required for `--full` mode. Run CHR-only analysis without it.
- Two-stage scoring (`--full` mode only): neural embedding matching followed by LLM-as-a-Judge for higher-quality similarity signals.
- Metrics & plots out of the box: saves CSVs and generates threshold-sweep visualizations to help tune decision thresholds.
- Local and S3 I/O: read inputs and write outputs to `s3://…` or local paths; no code changes needed.
- Deterministic runs: seeds are set for `random`, `numpy`, and `torch` to improve reproducibility (see the sketch after this list).
- GPU-aware: uses CUDA automatically when available; falls back to CPU otherwise.
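For context, the determinism and device handling described above amount to roughly the following (an illustrative sketch, not the script's exact code):

```python
import random

import numpy as np
import torch

# Seed the three RNG sources used by the pipeline so runs are reproducible
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Prefer CUDA when a GPU is visible, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
```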
This project uses uv for fast, reliable Python package management.
Install uv (if you don't have it yet):
# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Install project dependencies:
# Clone the repository (if needed)
git clone <your-repo-url>
cd langcache-customer-data-eval
# Install dependencies and sync the environment
uv sync
# Or if you want to include dev dependencies
uv sync --all-groups

Install LLM-as-a-Judge:
- Configure `~/.pip/pip.conf`
- Locate the package
- Set me up (Client: `pip`)
- If you want to use `uv`: `uv add llm-sim-eval==x.x.x --index=...`
The matching pipeline can use Redis for fast vector similarity search. To enable Redis-based matching, add the --use_redis flag and ensure Redis is running:
# Using Docker (recommended)
docker run -d -p 6379:6379 redis/redis-stack:latest
# Or install Redis locally
# macOS: brew install redis && redis-server
# Ubuntu: sudo apt-get install redis-server && redis-server
# Windows: Download from https://redis.io/download

Note: Redis connection defaults to `redis://localhost:6379`. You can customize this with `--redis_url`.
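If you want to verify connectivity before a long run, a quick check with the standard `redis` Python client (used here only for illustration; the script manages its own connection) looks like this:

```python
import redis

# Use the same URL you would pass via --redis_url
client = redis.from_url("redis://localhost:6379")

# ping() returns True when the server is reachable
print(client.ping())
```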
- Queries CSV (`--query_log_path`): must include your sentence column (the name is passed via `--sentence_column`).
- Cache CSV (`--cache_path`): a catalog of reference sentences/utterances. Must include at least the same sentence column.
Example (minimal)
# queries.csv
id,text
1,"how do I reset my password?"
2,"store hours on sunday"
# cache.csv
id,text
101,"reset your password"
102,"our store hours"
In all cases the script can be run with or without an explicit cache set. If you pass only `--query_log_path` and omit `--cache_path`, the script randomly samples `n_samples` rows to use as queries and uses the remaining rows as the cache (see the sketch below).
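For illustration, the split behaves roughly like the following pandas sketch (the `load_data` function in the script is the authoritative implementation; file and column names here are placeholders):

```python
import pandas as pd

# When --cache_path is omitted, queries and cache come from the same file
df = pd.read_csv("queries.csv")
n_samples = 100

queries = df.sample(n=n_samples, random_state=42)  # evaluated as incoming queries
cache = df.drop(queries.index)                     # everything else acts as the cache
```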
CHR-only mode (default - no LLM required):
# Fast cache hit ratio analysis
uv run evaluation.py \
--query_log_path ./data/queries.csv \
--cache_path ./data/cache.csv \
--sentence_column text \
--output_dir ./outputs \
--n_samples 100 \
--model_name "redis/langcache-embed-v3.1"

Full evaluation mode (requires llm-sim-eval):
# With Redis (recommended for better performance)
uv run evaluation.py \
--query_log_path ./data/queries.csv \
--cache_path ./data/cache.csv \
--sentence_column text \
--output_dir ./outputs \
--n_samples 1000 \
--model_name "redis/langcache-embed-v1" \
--llm_name "microsoft/Phi-4-mini-instruct" \
--full \
--use_redis
# Without Redis (in-memory matching)
uv run evaluation.py \
--query_log_path ./data/queries.csv \
--cache_path ./data/cache.csv \
--sentence_column text \
--output_dir ./outputs \
--n_samples 1000 \
--model_name "redis/langcache-embed-v1" \
--llm_name "microsoft/Phi-4-mini-instruct" \
--full

S3 example
# CHR-only with S3
uv run evaluation.py \
--query_log_path s3://my-bucket/eval/queries.csv \
--cache_path s3://my-bucket/eval/cache.csv \
--sentence_column text \
--output_dir s3://my-bucket/eval/results
# Full evaluation with S3
uv run evaluation.py \
--query_log_path s3://my-bucket/eval/queries.csv \
--cache_path s3://my-bucket/eval/cache.csv \
--sentence_column text \
--output_dir s3://my-bucket/eval/results \
--full

S3 access relies on your environment (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`).
High‑level flow (see code for details):
- Load data: `load_data(query_log_path, cache_path, n_samples)` → `queries` and `cache` DataFrames.
- Stage 1 — Matching: uses `NeuralEmbedding(model_name, device=auto)` and `calculate_best_matches_with_cache_large_dataset(...)` to find top matches for each query (batch size `512`).
- Threshold sweep for cache hit ratios: sweeps thresholds from the minimum score to 1.0, calculating CHR at each point (a sketch of this computation follows the list). Output: `chr_sweep.csv`, `chr_matches.csv`
- Plotting & summary: generates `chr_vs_threshold.png` and prints summary statistics (score distribution, CHR at common thresholds).
- Stage 2 — LLM-as-a-Judge: `(query, match)` pairs are scored by `run_llm_local_sim_prediction_pipeline(...)` using an empty prompt loaded via `DEFAULT_PROMPTS`. Output: `llm_as_a_judge_results.csv` with `[<sentence_column>, matches, similarity_score, actual_label]` after post-processing.
- Stage 3 — Metrics & threshold sweep: `postprocess_results_for_metrics(...)` prepares a final frame, then `sweep_thresholds_on_results(...)` evaluates metrics across thresholds. Output: `threshold_sweep_results.csv`, `matches.csv`
- Stage 4 — Plotting: `generate_plots(...)` writes `precision_vs_cache_hit_ratio.png` and `metrics_over_threshold.png`.

Finally, the script prints `Done!`.
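To make the CHR sweep concrete, here is a minimal sketch of what "cache hit ratio at a threshold" means (illustrative only; the real sweep lives in the script):

```python
import numpy as np
import pandas as pd

def chr_sweep(best_scores: pd.Series, steps: int = 200) -> pd.DataFrame:
    """CHR at a threshold = fraction of queries whose best match clears it."""
    thresholds = np.linspace(best_scores.min(), 1.0, steps)
    rows = [
        {"threshold": t, "cache_hit_ratio": float((best_scores >= t).mean())}
        for t in thresholds
    ]
    return pd.DataFrame(rows)

# Example: sweep the scores recorded in chr_matches.csv
# sweep = chr_sweep(pd.read_csv("outputs/chr_matches.csv")["best_scores"])
```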
| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| `--query_log_path` | str | ✅ | — | Path to the queries CSV (local or `s3://…`). |
| `--sentence_column` | str | ✅ | — | Name of the text column to evaluate (must exist in both CSVs). |
| `--output_dir` | str | ✅ | — | Where to write CSVs/plots (local or `s3://…`). |
| `--n_samples` | int | | `100` | Number of samples to evaluate. |
| `--model_name` | str | | `"redis/langcache-embed-v3.1"` | Embedding model passed to `NeuralEmbedding`. |
| `--cache_path` | str | | `None` | Path to the cache CSV (local or `s3://…`). |
| `--full` | flag | | `False` | Run full evaluation with LLM-as-a-Judge (requires `llm-sim-eval`). |
| `--llm_name` | str | | `"microsoft/Phi-4-mini-instruct"` | Local LLM identifier (only used with `--full`). |
| `--sweep_steps` | int | | `200` | Number of threshold steps in the sweep. |
| `--use_redis` | flag | | `False` | Use Redis for vector matching (default: in-memory matching). |
| `--redis_url` | str | | `"redis://localhost:6379"` | Redis connection URL for vector search. |
| `--redis_index_name` | str | | `"idx_cache_match"` | Redis index name for vector storage. |
| `--redis_doc_prefix` | str | | `"cache:"` | Redis document key prefix. |
| `--redis_batch_size` | int | | `256` | Batch size for Redis vector operations. |
Tip: `--model_name` and `--llm_name` must be supported by your environment/backends. The script auto-selects `cuda` when `torch.cuda.is_available()` returns true.
Inputs
- `queries.csv` — must include `--sentence_column` (e.g., `text`).
- `cache.csv` — must include the same `--sentence_column` used by queries.

Outputs (written under `--output_dir`)

- `chr_matches.csv` — `[<sentence_column>, matches, best_scores]`
- `chr_sweep.csv` — `[threshold, cache_hit_ratio]`
- `chr_vs_threshold.png` — plot of cache hit ratio vs threshold

- `matches.csv` — `[<sentence_column>, matches, best_scores]`
- `llm_as_a_judge_results.csv` — `[<sentence_column>, matches, similarity_score, actual_label]`
- `threshold_sweep_results.csv` — thresholded metrics across the sweep
- `precision_vs_cache_hit_ratio.png` and `metrics_over_threshold.png`
- Metrics (computed per threshold and saved in `threshold_sweep_results.csv`):
  - `threshold`
  - `precision`, `recall`, `f1_score`, `f0_5_score`
  - `f05_chr_score` — harmonic mean of precision and cache hit ratio (β=0.5); see the sketch after this list
  - `cache_hit_ratio`
  - `tp`, `fp`, `fn`, `tn`, `accuracy`
- Charts (saved under `--output_dir`):
  - `precision_vs_cache_hit_ratio.png` — Precision vs Cache Hit Ratio
  - `metrics_over_threshold.png` — over threshold: Precision, Cache Hit Ratio, and `precision * cache_hit_ratio`
- Files (saved under `--output_dir`):
  - `matches.csv` — `[<sentence_column>, matches, best_scores]`
  - `llm_as_a_judge_results.csv` — `[<sentence_column>, matches, similarity_score, actual_label]`
  - `threshold_sweep_results.csv` — one row per threshold with the metrics listed above
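For reference, `f05_chr_score` can be read as an F-beta-style combination of precision and cache hit ratio. A minimal sketch, assuming the standard weighted-harmonic-mean (F-beta) formulation with β = 0.5:

```python
def f_beta_chr(precision: float, cache_hit_ratio: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and cache hit ratio.

    beta < 1 weights precision more heavily than cache hit ratio.
    """
    b2 = beta ** 2
    denom = b2 * precision + cache_hit_ratio
    return (1 + b2) * precision * cache_hit_ratio / denom if denom else 0.0

# Example: precision 0.9 and cache hit ratio 0.5 at some threshold
print(f_beta_chr(0.9, 0.5))  # ≈ 0.78
```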
- GPU: the embedding stage uses CUDA if available, otherwise the CPU. Device memory is reclaimed after matching (`torch.cuda.empty_cache()`).
- Batch sizes: matching uses `batch_size=512`. The LLM stage uses a conservative `batch_size=2` to reduce memory spikes; increase it if your hardware allows.
- Early stop: `--n_samples` limits how many queries pass through the heavy stages (useful for quick iteration).
- Custom prompts: swap out the default empty prompt (`DEFAULT_PROMPTS["empty_prompt"]`) for your own prompt via `Prompt.load(...)`.
- Alternative scorers: replace `NeuralEmbedding` or its `calculate_best_matches_with_cache_large_dataset` with your own implementation (a sketch of a drop-in scorer follows this list).
- Metrics: extend `postprocess_results_for_metrics` and/or `sweep_thresholds_on_results` to add domain-specific KPIs.
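As an illustration of the "alternative scorers" point, a drop-in scorer only needs to produce, for each query, its best cache match and score. A minimal sketch using `sentence-transformers` cosine similarity (the class and method names are hypothetical; adapt them to the interface the script expects):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

class SimpleEmbeddingScorer:
    """Hypothetical stand-in for NeuralEmbedding plus the best-match calculation."""

    def __init__(self, model_name: str, device: str = "cpu"):
        self.model = SentenceTransformer(model_name, device=device)

    def best_matches(self, queries: pd.DataFrame, cache: pd.DataFrame, column: str) -> pd.DataFrame:
        # Embed both sides and take the highest cosine similarity per query
        q_emb = self.model.encode(queries[column].tolist(), convert_to_tensor=True)
        c_emb = self.model.encode(cache[column].tolist(), convert_to_tensor=True)
        sims = util.cos_sim(q_emb, c_emb)   # shape: (n_queries, n_cache)
        best = sims.max(dim=1)              # best score and index per query
        out = queries[[column]].copy()
        out["matches"] = cache[column].iloc[best.indices.cpu().numpy()].values
        out["best_scores"] = best.values.cpu().numpy()
        return out
```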
- "Connection refused" or "Error connecting to Redis": ensure Redis is running on the specified `--redis_url`. Test with `redis-cli ping` (should return `PONG`).
- My run finishes matching but shows "Number of discarded queries…": that count is how many judge calls produced unusable or failed responses; the pipeline continues with the successful ones.
- Plots are empty or flat: ensure your inputs contain valid ground-truth signals (`actual_label` or whatever your metrics function expects) and that scores vary across pairs.
- S3 permissions errors: confirm AWS credentials in the environment and that your `FileHandler` is configured for those credentials/regions.
- Redis index conflicts: if you see errors about existing indexes, change `--redis_index_name` to a unique value or manually delete the old index with `redis-cli FT.DROPINDEX <index_name> DD`.
We welcome improvements — add new metrics, plots, or backends and send a PR. A helpful contribution flow is:
- Create a branch: `git checkout -b feat/<your-feature>`
- Add tests and docs (e.g., examples under a `docs/` folder).
- Ensure type checks, formatting, and tests pass.
- Open a pull request with a clear description and before/after examples.
(Contributing steps follow the same spirit as the “development setup” section and workflow in the Redis Model Store README.)
- README structure inspired by the Redis Model Store project’s README.
Use CHR-only mode (default) when:
- You want to quickly understand cache hit ratio characteristics
- You don't need precision/recall metrics (only CHR)
- You want to avoid the overhead of running an LLM judge
- You're doing exploratory analysis on embedding model performance
- The `llm-sim-eval` package is not available in your environment
Use full mode (--full) when:
- You need precision, recall, and F-scores
- You have labeled data or need LLM-judged similarity labels
- You want the full evaluation pipeline with quality metrics
- You need to understand both cache efficiency (CHR) and accuracy (precision)
CLI help
# See all available options
uv run evaluation.py -h