Feat/benchmark hardening #18
base: main
Conversation
This commit addresses potential criticisms of our benchmark methodology:
1. Official F1 Metric (locomo_metrics.py)
- Implements exact F1 scoring from the LoCoMo paper with Porter stemming (see the sketch below)
- Category-specific evaluation (multi-answer F1, adversarial phrase match)
- Reports both original word-overlap and official F1 metrics
2. E2E QA Mode (--eval-mode e2e)
- Generates answers via LLM from retrieved context
- Matches CORE's evaluation methodology
- Configurable model (--e2e-model, default gpt-4o-mini)
3. Category 5 (Adversarial) Fix
- Properly handles questions where answer is NOT in conversation
- Checks for adversarial_answer in memories (shouldn't be found)
- No longer inflates Complex Reasoning score
4. Evidence Hint Disabling (--no-evidence-hints)
- Removes data leakage from using ground truth evidence IDs
- Provides honest retrieval-only evaluation
5. CORE Benchmark Adapter (core_adapter/)
- Node.js adapter for CORE's exact benchmark code
- Enables direct apples-to-apples comparison
- Documents methodology differences
6. Reproducibility Package (REPRODUCIBILITY.md)
- Step-by-step reproduction instructions
- All evaluation modes documented
- Known differences from CORE explained
New CLI options:
--eval-mode {retrieval,e2e}
--no-official-f1
--no-evidence-hints
--e2e-model MODEL
--f1-threshold FLOAT
Refs: CORE benchmark, LoCoMo ACL 2024 paper
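A minimal sketch of the token-level F1 referenced in item 1, assuming NLTK's Porter stemmer and simple lowercase/punctuation-stripping tokenization. The function names mirror `f1_score` and `f1_multi_answer` mentioned later in this review, but the exact normalization in locomo_metrics.py may differ.

```python
from collections import Counter
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def _tokens(text: str) -> list[str]:
    # Lowercase, replace non-alphanumerics with spaces, then Porter-stem each token.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return [_stemmer.stem(w) for w in cleaned.split()]

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred, gold_t = _tokens(prediction), _tokens(gold)
    if not pred or not gold_t:
        return float(pred == gold_t)
    overlap = sum((Counter(pred) & Counter(gold_t)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold_t)
    return 2 * precision * recall / (precision + recall)

def f1_multi_answer(prediction: str, gold: str) -> float:
    """Multi-answer categories: score against each comma-separated sub-answer, keep the best."""
    return max(f1_score(prediction, part) for part in gold.split(","))
```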
…ions

Key changes:
1. Added --lenient flag for CORE-compatible semantic evaluation
   - Uses LLM to judge if the answer "touches on the same topic"
   - Matches CORE's exact evaluation prompt
   - Fallback to 30% word match (same as CORE)
2. Updated default models to gpt-4.1
   - Better reasoning than gpt-4o-mini
   - Good cost/performance balance
   - Options: gpt-5.1 (best), gpt-4o-mini (cheapest)
3. CORE adapter now uses the exact CORE evaluation prompt
   - Found in their evaluateService.js: "be generous with grading"
   - This explains their 88% vs our 36% with strict F1
4. Added --eval-judge-model for a configurable judge model

New CLI options:
- --lenient            Use CORE-compatible lenient eval
- --eval-judge-model   Model for lenient evaluation
- --e2e-model          Now defaults to gpt-4.1

To run with CORE-comparable methodology: python test_locomo.py --eval-mode e2e --lenient

Expected result: ~70-85% accuracy instead of 36%
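A rough sketch of the lenient path this commit describes, assuming an OpenAI-style judge call with a 30% word-match fallback. The exact judge prompt and JSON shape used in test_locomo.py are assumptions here, not verbatim from the code.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def word_match_ratio(expected: str, generated: str) -> float:
    expected_words = set(expected.lower().split())
    generated_words = set(generated.lower().split())
    return len(expected_words & generated_words) / max(len(expected_words), 1)

def evaluate_lenient(question: str, expected: str, generated: str,
                     judge_model: str = "gpt-4.1") -> bool:
    """LLM judge first; fall back to a 30% word-match threshold if the judge call fails."""
    try:
        resp = client.chat.completions.create(
            model=judge_model,
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": (
                    "Be generous with grading. Does the generated answer touch on the same "
                    "topic as the expected answer?\n"
                    f"Question: {question}\nExpected: {expected}\nGenerated: {generated}\n"
                    'Reply as JSON: {"correct": true or false}'
                ),
            }],
        )
        return bool(json.loads(resp.choices[0].message.content)["correct"])
    except Exception:
        # CORE-style fallback: accept if at least 30% of expected-answer words appear.
        return word_match_ratio(expected, generated) >= 0.3
```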
- Default models now gpt-5.1 for best reasoning
- Added --conversations CLI arg to run a subset (e.g., --conversations conv-26 conv-30)
- Both Python and Node.js adapters updated
Key changes:
1. Fixed max_tokens -> max_completion_tokens for GPT-5.1 models
   - GPT-5.x requires the max_completion_tokens parameter
   - Older models still use max_tokens
2. Added --conversations flag for selective testing
   - Can now test specific conversations: --conversations conv-26 conv-30
   - Useful for quick iteration and debugging
3. Added BENCHMARK_FINDINGS_2025-12-03.md documenting results:
   - E2E + Lenient (GPT-5.1): 78.89% on conv-26
   - This is within 10% of CORE's claimed 88.24%
   - Strict F1 scores (~36%) don't reflect actual system quality
   - Evaluation methodology is the dominant factor

Results comparison:

| Mode          | Score  | Notes                        |
|---------------|--------|------------------------------|
| Retrieval     | 70.49% | Word overlap check           |
| E2E Strict F1 | 36.15% | Penalizes verbose answers    |
| E2E Lenient   | 78.89% | CORE-compatible methodology  |
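A small sketch of the token-parameter switch from change 1, assuming selection by model-name prefix; the real code may gate this differently.

```python
def completion_kwargs(model: str, limit: int = 512) -> dict:
    """GPT-5.x expects max_completion_tokens; older chat models still take max_tokens."""
    key = "max_completion_tokens" if model.startswith("gpt-5") else "max_tokens"
    return {"model": model, key: limit}

# Usage: client.chat.completions.create(messages=messages, **completion_kwargs("gpt-5.1", 512))
```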
Introduces a new experimentation framework under tests/benchmarks/experiments for optimizing memory retrieval configurations. Adds experiment configuration, runner, analysis agent, Railway deployment manager, and supporting scripts and tests. Updates .gitignore to exclude experiment results.
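A hypothetical sketch of the experiment-grid idea, using knobs that appear later in this review (recall_limit, vector_weight, graph_weight, generate_experiment_grid); the actual ExperimentConfig in experiment_config.py has more fields and different defaults.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class ExperimentConfig:
    name: str
    recall_limit: int = 20
    vector_weight: float = 0.6
    graph_weight: float = 0.4
    embedding_model: str = "text-embedding-3-large"

def generate_experiment_grid() -> list[ExperimentConfig]:
    """Cartesian product over a few retrieval knobs; each config gets a readable name."""
    configs = []
    for limit, vec_w in product([10, 20, 40], [0.5, 0.6, 0.7]):
        configs.append(ExperimentConfig(
            name=f"limit{limit}_vec{vec_w}",
            recall_limit=limit,
            vector_weight=vec_w,
            graph_weight=round(1.0 - vec_w, 2),  # keep vector + graph weights summing to 1.0
        ))
    return configs
```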
… (3072d)
- Add EMBEDDING_MODEL and CLASSIFICATION_MODEL env vars
- Default to text-embedding-3-large for better semantic precision
- Update VECTOR_SIZE default to 3072
- Fix scripts to use config instead of hardcoded models
- Add re-embedding guide to docs
- Fix FalkorDB auth in reembed script
📝 Walkthrough

This PR introduces configurable embedding and classification models, upgrades the default embedding dimensions from 768 to 3072, and adds comprehensive benchmarking infrastructure including LoCoMo evaluation metrics, AI-guided optimization, CORE-compatible adapters, and extensive documentation for reproduction and experimentation.
Sequence Diagram(s)

sequenceDiagram
participant User as User/Script
participant EvaluatorPy as LoCoMo Evaluator<br/>(Python)
participant OpenAI as OpenAI API
participant AutoMem as AutoMem Service
User->>EvaluatorPy: run_benchmark(eval_mode='e2e')
loop Per Conversation
EvaluatorPy->>AutoMem: POST /memory (store memories)
AutoMem-->>EvaluatorPy: Success
EvaluatorPy->>+AutoMem: GET /recall (retrieve context)
AutoMem-->>-EvaluatorPy: Recalled memories
alt E2E Mode
EvaluatorPy->>OpenAI: chat.completions.create<br/>(question + context)
OpenAI-->>EvaluatorPy: Generated answer
EvaluatorPy->>OpenAI: chat.completions.create<br/>(evaluate answer)
OpenAI-->>EvaluatorPy: Evaluation score
else Retrieval Mode
EvaluatorPy->>EvaluatorPy: check_answer_in_memories<br/>(official F1 + category logic)
end
EvaluatorPy->>AutoMem: DELETE /admin/memories<br/>(cleanup)
AutoMem-->>EvaluatorPy: Cleanup complete
end
EvaluatorPy-->>User: Results with accuracy breakdown
sequenceDiagram
participant User as User
participant AnalysisAgent as AnalysisAgent
participant ExperimentRunner as ExperimentRunner
participant RailwayMgr as RailwayManager
participant AutoMem as AutoMem Instance
participant LLM as OpenAI LLM
User->>AnalysisAgent: run_autonomous_optimization()
AnalysisAgent->>AnalysisAgent: Initialize grid configs
loop Iteration ≤ max_iterations
AnalysisAgent->>ExperimentRunner: run_grid_search(configs)
loop Per Config
ExperimentRunner->>RailwayMgr: deploy_instance(config)
RailwayMgr-->>ExperimentRunner: Instance URL
ExperimentRunner->>AutoMem: Execute benchmark<br/>with config
AutoMem-->>ExperimentRunner: ExperimentResult<br/>(metrics & accuracy)
ExperimentRunner->>RailwayMgr: destroy_instance()
end
ExperimentRunner-->>AnalysisAgent: List[ExperimentResult]
AnalysisAgent->>AnalysisAgent: Track best config<br/>& score
AnalysisAgent->>LLM: analyze_and_decide<br/>(results summary)
LLM-->>AnalysisAgent: JSON decision<br/>(continue?, next_experiments)
alt Continue Exploring
AnalysisAgent->>AnalysisAgent: Generate pending configs<br/>from suggestions
else Stop
AnalysisAgent->>AnalysisAgent: Exit loop
end
end
AnalysisAgent->>AnalysisAgent: _save_final_report<br/>(best config & history)
AnalysisAgent-->>User: Optimization complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
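For the E2E branch in the first sequence diagram above, a minimal sketch of generating an answer from recalled memories, assuming the OpenAI Python client. The prompt wording is illustrative; the gpt-4o-mini default comes from the original PR description (later commits move to gpt-4.1/gpt-5.1).

```python
from openai import OpenAI

client = OpenAI()

def generate_answer_e2e(question: str, recalled_memories: list[dict],
                        model: str = "gpt-4o-mini", max_context_chars: int = 8000) -> str:
    """Build a context block from recalled memories, then ask the model to answer from it."""
    context, used = [], 0
    for mem in recalled_memories:
        content = mem.get("content", "")
        if used + len(content) > max_context_chars:
            break
        context.append(content)
        used += len(content)
    prompt = (
        "Answer the question using only the conversation memories below. "
        "If the answer is not present, say so.\n\n"
        + "\n".join(f"- {c}" for c in context)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```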
Actionable comments posted: 18
🧹 Nitpick comments (35)
tests/benchmarks/REPRODUCIBILITY.md (1)
177-189: Add language specifiers to fenced code blocks. The code blocks at lines 177 and 185 should specify their language for proper syntax highlighting and better accessibility.
Apply this change:
````diff
 ### Python
-```
+```text
 nltk==3.9.1
 openai>=1.0.0
 requests>=2.28.0
 python-dateutil>=2.8.0
````

Node.js (CORE adapter)

````diff
-```
+```text
 axios>=1.6.0
 openai>=4.20.0
 dotenv>=16.3.0
````

tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md (1)
90-108: Add language specifiers to fenced code blocks. The code blocks at lines 90 and 105 should specify their language (use `text` or `plaintext` for output blocks) for proper rendering.

Apply this change:

````diff
 ### Test 1: E2E + Lenient (conv-26) ✅
-```
+```text
 Overall Accuracy: 78.89% (157/199)
 Time: 606.1s
 ...
````

Test 2: Full Dataset (10 conversations) - Previous Run

````diff
-```
+```text
 E2E + Strict F1: 36.15% (718/1986)
 Retrieval Mode: 70.49% (1400/1986)
````

tests/benchmarks/core_adapter/automem-search.js (2)
93-96: Consider surfacing errors to callers for better debugging. The current error handling silently returns empty results, which makes it difficult to distinguish between "no results found" and "API error occurred" during benchmark runs. This could mask connectivity issues, authentication failures, or API errors.
Consider one of these approaches:
- Throw errors and let callers handle: Remove the try-catch and let errors propagate
- Return error information: Include error details in the response
- Add verbose mode: Use an optional `throwOnError` flag in the constructor

Example for option 2:

```diff
 } catch (error) {
   console.error("AutoMem search error:", error.message);
-  return { episodes: [], facts: [] };
+  return {
+    episodes: [],
+    facts: [],
+    error: error.message,
+    _errorDetails: error.response?.data
+  };
 }
```

This would allow benchmark runners to detect and report infrastructure issues separately from legitimate empty result sets.
16-28: LGTM with minor suggestion: Consider exposing timeout as configuration. The implementation correctly uses environment variables with sensible defaults. The 30-second timeout is reasonable for benchmark workloads but might be too long for production use cases. Optionally expose timeout as a configuration parameter:
Optionally expose timeout as a configuration parameter:
constructor(config = {}) { this.baseUrl = config.baseUrl || process.env.AUTOMEM_BASE_URL || "http://localhost:8001"; this.apiToken = config.apiToken || process.env.AUTOMEM_API_TOKEN || "test-token"; + this.timeout = config.timeout || 30000; this.axios = axios.create({ baseURL: this.baseUrl, headers: { Authorization: `Bearer ${this.apiToken}`, "Content-Type": "application/json", }, - timeout: 30000, + timeout: this.timeout, }); }tests/benchmarks/experiments/README.md (1)
100-106: Add language specifier to fenced code block.The code block at line 100 should specify
textorplaintextas the language identifier.Apply this change:
## Output Structure -``` +```text experiment_results/ ├── 20251204_103000_baseline.json # Individual results ├── 20251204_103000_large_embeddings.json ├── 20251204_103000_report.txt # Summary report └── optimization_report_20251204.json # Agent optimization log</blockquote></details> <details> <summary>tests/benchmarks/run_all_benchmarks.sh (2)</summary><blockquote> `19-24`: **Add timeout to health check curl command.** The health check curl lacks a timeout, which could cause the script to hang indefinitely if AutoMem is unresponsive. ```diff # Check if AutoMem is running -if ! curl -s http://localhost:8001/health > /dev/null; then +if ! curl -s --connect-timeout 5 --max-time 10 http://localhost:8001/health > /dev/null; then echo "❌ AutoMem is not running. Start with: make dev" exit 1 fi
80-86: Inline Python command could fail silently on malformed JSON.The Python command for extracting accuracy uses
2>/dev/null || echo "N/A"which hides all errors. Consider more explicit error handling or usingjqif available.for f in "$OUTPUT_DIR"/results_*_$TIMESTAMP.json; do if [ -f "$f" ]; then name=$(basename "$f" | sed "s/_$TIMESTAMP.json//") - accuracy=$(python -c "import json; d=json.load(open('$f')); print(f\"{d.get('overall', {}).get('accuracy', 0)*100:.2f}%\")" 2>/dev/null || echo "N/A") + accuracy=$(python3 -c "import json, sys; d=json.load(open('$f')); print(f\"{d.get('overall', {}).get('accuracy', 0)*100:.2f}%\")" 2>/dev/null || echo "N/A") echo " $name: $accuracy" fi doneUsing
python3explicitly ensures Python 3.x is used for f-string support.tests/benchmarks/experiments/test_minimal.py (4)
34-37: Remove unnecessary f-string prefixes.Multiple print statements use f-strings without placeholders. This is flagged by Ruff (F541).
- print(f"\n📋 Test Config:") + print("\n📋 Test Config:") print(f" Name: {exp_config.name}") print(f" Recall Limit: {exp_config.recall_limit}") - print(f"\n🔧 LoCoMo Config:") + print("\n🔧 LoCoMo Config:") ... - print(f"\n📂 Loading dataset...") + print("\n📂 Loading dataset...") ... - print(f" Testing first 5 questions only") + print(" Testing first 5 questions only") ... - print(f"\n📊 RESULTS:") + print("\n📊 RESULTS:") ... - print(f"\n✅ Integration test complete!") + print("\n✅ Integration test complete!") ... - print(f"\n📥 Loading memories...") + print("\n📥 Loading memories...") ... - print(f"⏳ Waiting for enrichment...") + print("⏳ Waiting for enrichment...") ... - print(f"\n🧹 Cleaning up test memories...") + print("\n🧹 Cleaning up test memories...")Also applies to: 50-53, 58-60, 68-68, 73-76, 78-79, 87-88, 91-92, 137-138
63-63: Usenext(iter(...))instead oflist(...)[0].Per Ruff (RUF015), prefer
next(iter(evaluator.conversations.keys()))for efficiency - avoids creating an intermediate list.- conv_id = list(evaluator.conversations.keys())[0] + conv_id = next(iter(evaluator.conversations.keys()))
120-126: Prefix unused variable with underscore.The
explanationvariable is never used. Per Ruff (RUF059), prefix it with an underscore.- is_correct, confidence, explanation = evaluator._evaluate_lenient_semantic( + is_correct, confidence, _explanation = evaluator._evaluate_lenient_semantic( question, expected, generated, category ) else: - is_correct, confidence, explanation = evaluator.check_answer_in_memories( + is_correct, confidence, _explanation = evaluator.check_answer_in_memories( question, expected, recalled, category )
90-92: Hardcoded sleep for enrichment is fragile.A fixed 3-second wait may be insufficient under load or excessive in fast environments. Consider polling the enrichment status endpoint instead.
Consider implementing a polling mechanism:
# Wait for enrichment with polling max_wait = 10 # seconds poll_interval = 0.5 waited = 0 while waited < max_wait: # Could check enrichment status via API if available await asyncio.sleep(poll_interval) waited += poll_intervaltests/benchmarks/core_adapter/evaluate.js (2)
41-43: Add validation for OPENAI_API_KEY.The OpenAI client is initialized without checking if the API key exists, which will fail later with a less helpful error message.
// Initialize services const searchService = new AutoMemSearchService(); +if (!process.env.OPENAI_API_KEY) { + console.error("❌ OPENAI_API_KEY environment variable is required"); + process.exit(1); +} const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
284-288: Guard against division by zero.If a conversation has no questions,
questions.lengthwould be 0, causing a division by zero.- const convAcc = (convCorrect / questions.length) * 100; - console.log(` Accuracy: ${convAcc.toFixed(1)}% (${convCorrect}/${questions.length})`); + const convAcc = questions.length > 0 ? (convCorrect / questions.length) * 100 : 0; + console.log(` Accuracy: ${convAcc.toFixed(1)}% (${convCorrect}/${questions.length})`);tests/benchmarks/experiments/test_e2e_mini.py (3)
15-15: Remove unused import.
ExperimentConfigis imported but never used in this file.-from tests.benchmarks.experiments.experiment_config import ExperimentConfig
44-44: Remove extraneousfprefixes from strings without placeholders.These strings don't contain any format placeholders:
- print(f"\n[1/4] Storing memories...") + print("\n[1/4] Storing memories...")- print(f"\n[2/4] Testing recall and E2E answer generation...") + print("\n[2/4] Testing recall and E2E answer generation...")- print(f"\n[4/4] Cleaning up...") + print("\n[4/4] Cleaning up...")- headers={"Authorization": f"Bearer admin"}, + headers={"Authorization": "Bearer admin"},Also applies to: 91-91, 158-158, 161-161
100-100: Remove unused variable assignment.
categoryis assigned but never used.- category = qa.get('category', 1)tests/benchmarks/experiments/experiment_config.py (2)
10-10: Remove unused import.
itertoolsis imported but never used in this file.-import itertools
14-36: Enums defined but not used for type safety in dataclass.
EmbeddingModel,EnrichmentModel, andRecallStrategyenums are defined butExperimentConfigusesstrtypes for corresponding fields. Consider using the enum types for better type safety and IDE support, or document that strings are intentional for flexibility.Also applies to: 39-68
tests/benchmarks/experiments/railway_manager.py (2)
17-17: Use explicit relative or absolute import.The import
from experiment_config import ExperimentConfigrelies on the module being insys.path. For better clarity and to avoid import issues when running from different directories, use an explicit relative import.-from experiment_config import ExperimentConfig +from .experiment_config import ExperimentConfig
51-63: Synchronous subprocess call in async method.
check_railway_cliis an async method but uses synchronoussubprocess.run. While this works (it blocks the event loop briefly), consider usingasyncio.create_subprocess_execfor consistency with the async pattern.tests/benchmarks/test_locomo.py (4)
1083-1084: Use explicit Optional type hints (PEP 484).PEP 484 prohibits implicit
Optional. Parameters with defaultNoneshould useOptional[T]orT | None.def check_answer_in_memories( self, question: str, expected_answer: Any, recalled_memories: List[Dict[str, Any]], - evidence_dialog_ids: List[str] = None, - sample_id: str = None, + evidence_dialog_ids: Optional[List[str]] = None, + sample_id: Optional[str] = None, category: int = 0, ) -> Tuple[bool, float, str]:
1438-1438: Use explicit Optional type hint.- def run_benchmark(self, cleanup_after: bool = True, conversation_ids: List[str] = None) -> Dict[str, Any]: + def run_benchmark(self, cleanup_after: bool = True, conversation_ids: Optional[List[str]] = None) -> Dict[str, Any]:
699-767: Remove unused parameterquestionfrom_evaluate_adversarial.The
questionparameter is passed but never used in the method body.def _evaluate_adversarial( self, - question: str, adversarial_answer: str, recalled_memories: List[Dict[str, Any]], ) -> Tuple[bool, float, str]:Then update callers accordingly (line 1370-1371):
is_correct, confidence, explanation = self._evaluate_adversarial( - question, adversarial_answer, recalled_memories + adversarial_answer, recalled_memories )
751-752: Remove unused variable assignment.
found_in_memoryis assigned but never used.if overlap_ratio > max_overlap: max_overlap = overlap_ratio - found_in_memory = mem.get("id")tests/benchmarks/locomo_metrics.py (3)
31-56: Duplicatenormalize_answerimplementation exists.This function duplicates logic found in
tests/benchmarks/test_locomo.py(lines 537-570) with a different implementation (this one removes articles; the other does basic suffix stemming). Having two different normalization approaches could lead to inconsistent evaluation results.Consider consolidating to a single source of truth and importing where needed.
173-175: Use explicit Optional type hint and document unused parameter.def extract_answer_from_context( - memories: List[str], question: str, answer_hint: str = None + memories: List[str], question: str, answer_hint: Optional[str] = None # noqa: ARG001 ) -> str:If
questionis reserved for future use (e.g., question-aware extraction), add a comment explaining this.
358-364: Rename unused loop variable.The loop variable
catis not used within the loop body sincedatacontains all needed information.- for cat, data in summary["categories"].items(): + for _cat, data in summary["categories"].items(): print( f" {data['name']:35s}: "tests/benchmarks/experiments/experiment_runner.py (4)
101-101: Remove duplicate import ofjson.
jsonis already imported at the module level (line 15). This duplicate import inside the method is unnecessary.- import json - with open(locomo_config.data_file, 'r') as f: + with open(locomo_config.data_file, 'r') as f:
128-135: Consider logging the full traceback for debugging.While catching broad exceptions is reasonable here to ensure the experiment runner continues, consider logging the full traceback for easier debugging when experiments fail.
+ import traceback except Exception as e: - print(f"❌ Experiment failed: {e}") + print(f"❌ Experiment failed: {e}") + traceback.print_exc() experiment_result = ExperimentResult( config=config, - errors=[str(e)], + errors=[f"{e}\n{traceback.format_exc()}"],
373-373: Remove extraneousfprefix from strings without placeholders.Multiple strings use f-string syntax without any placeholders. Per static analysis hints at lines 373, 383, 390, 405, 416, 622.
- report.append(f"\n2. Multi-hop Reasoning (our weakness):") + report.append("\n2. Multi-hop Reasoning (our weakness):")Apply similar fixes to lines 383, 390, 405, 416, and 622.
625-640: Ablation mode result is unused.The
resultsvariable at line 640 is assigned but never used. Either use it for reporting or remove the assignment.- results = await runner.run_ablation_study(args.param, values) + await runner.run_ablation_study(args.param, values)Alternatively, ensure results are incorporated into the report that follows.
tests/benchmarks/experiments/analysis_agent.py (4)
139-149: Catch specific exceptions for better error handling.The broad
Exceptioncatch masks the specific failure mode. Since you're usingresponse_format={"type": "json_object"}andjson.loads(), consider catching more specific exceptions:- decision = json.loads(response.choices[0].message.content) - return decision - - except Exception as e: + return json.loads(response.choices[0].message.content) + + except json.JSONDecodeError as e: + print(f"❌ Failed to parse LLM response: {e}") + return { + "continue_exploring": False, + "reasoning": f"JSON parse error: {e}", + "next_experiments": [], + "confidence": 0.0, + } + except openai.APIError as e: print(f"❌ Analysis failed: {e}") return { "continue_exploring": False, - "reasoning": f"Analysis error: {e}", + "reasoning": f"OpenAI API error: {e}", "next_experiments": [], "confidence": 0.0, }You'll need to add
import openaior catchself.clientexceptions appropriately.
226-230:_pending_configsaccessed before explicit initialization - fragile pattern.
self._pending_configsis first accessed at line 230 but only initialized at line 276. While the current control flow ensures this works (iteration 1 usesinitial_configs), this creates a fragile pattern where refactoring could introduce anAttributeError.Initialize
_pending_configsin__init__or at the start of this method:async def run_autonomous_optimization( self, initial_configs: Optional[List[ExperimentConfig]] = None, runner: Optional[ExperimentRunner] = None, ) -> ExperimentConfig: + self._pending_configs: List[ExperimentConfig] = [] + if runner is None:
258-264: Remove extraneousfprefixes from strings without placeholders.Lines 258, 261, and 294 use f-string syntax without any placeholders per static analysis hints.
- print(f"\n🤔 Analyzing results...") + print("\n🤔 Analyzing results...") decision = await self.analyze_and_decide(all_results) - print(f"\n📋 Agent Decision:") + print("\n📋 Agent Decision:")Also at line 294:
- print(f"\nFinal Config:") + print("\nFinal Config:")
358-361:best_configis unused after assignment.The result of
run_autonomous_optimizationis assigned but never used. Consider using it in the output message.- best_config = await agent.run_autonomous_optimization(runner=runner) - - print("\n✅ Optimization complete!") - print(f"Best configuration saved to: {args.output_dir}") + best_config = await agent.run_autonomous_optimization(runner=runner) + + print("\n✅ Optimization complete!") + print(f"Best configuration: {best_config.name}") + print(f"Results saved to: {args.output_dir}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/benchmarks/core_adapter/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (29)
- .gitignore (1 hunks)
- CLAUDE.md (2 hunks)
- INSTALLATION.md (1 hunks)
- README.md (1 hunks)
- app.py (5 hunks)
- automem/api/admin.py (2 hunks)
- automem/config.py (2 hunks)
- docs/ENVIRONMENT_VARIABLES.md (3 hunks)
- requirements-dev.txt (1 hunks)
- scripts/reclassify_with_llm.py (4 hunks)
- scripts/reembed_embeddings.py (4 hunks)
- tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md (1 hunks)
- tests/benchmarks/REPRODUCIBILITY.md (1 hunks)
- tests/benchmarks/core_adapter/README.md (1 hunks)
- tests/benchmarks/core_adapter/automem-search.js (1 hunks)
- tests/benchmarks/core_adapter/evaluate.js (1 hunks)
- tests/benchmarks/core_adapter/package.json (1 hunks)
- tests/benchmarks/experiments/README.md (1 hunks)
- tests/benchmarks/experiments/__init__.py (1 hunks)
- tests/benchmarks/experiments/analysis_agent.py (1 hunks)
- tests/benchmarks/experiments/experiment_config.py (1 hunks)
- tests/benchmarks/experiments/experiment_runner.py (1 hunks)
- tests/benchmarks/experiments/railway_manager.py (1 hunks)
- tests/benchmarks/experiments/run_optimization.sh (1 hunks)
- tests/benchmarks/experiments/test_e2e_mini.py (1 hunks)
- tests/benchmarks/experiments/test_minimal.py (1 hunks)
- tests/benchmarks/locomo_metrics.py (1 hunks)
- tests/benchmarks/run_all_benchmarks.sh (1 hunks)
- tests/benchmarks/test_locomo.py (17 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use black for Python code formatting
Use flake8 for Python linting
**/*.py: Use Python with type hints. Indent 4 spaces; line length 100 (Black).
Use snake_case for module and function names in Python.
Use PascalCase for class names in Python.
Use UPPER_SNAKE_CASE for constants in Python.
Format code with Black (line length 100) and organize imports with Isort (profile=black).
Lint with Flake8 and ensure CI passes before committing.
Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.
Files:
automem/config.pyscripts/reclassify_with_llm.pytests/benchmarks/experiments/__init__.pyautomem/api/admin.pyscripts/reembed_embeddings.pytests/benchmarks/experiments/test_e2e_mini.pytests/benchmarks/experiments/experiment_config.pytests/benchmarks/experiments/analysis_agent.pytests/benchmarks/test_locomo.pytests/benchmarks/experiments/test_minimal.pytests/benchmarks/experiments/experiment_runner.pytests/benchmarks/locomo_metrics.pytests/benchmarks/experiments/railway_manager.pyapp.py
scripts/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Create utility scripts in scripts/ directory for backup, recovery, data management, and monitoring tasks
Files:
scripts/reclassify_with_llm.pyscripts/reembed_embeddings.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use pytest with DummyGraph fixture to mock FalkorDB operations for unit testing
Files:
tests/benchmarks/experiments/__init__.pytests/benchmarks/experiments/test_e2e_mini.pytests/benchmarks/experiments/experiment_config.pytests/benchmarks/experiments/analysis_agent.pytests/benchmarks/test_locomo.pytests/benchmarks/experiments/test_minimal.pytests/benchmarks/experiments/experiment_runner.pytests/benchmarks/locomo_metrics.pytests/benchmarks/experiments/railway_manager.py
app.py
📄 CodeRabbit inference engine (CLAUDE.md)
app.py: Use Flask for the main API server (port 8001) with request validation and orchestration
Implement 13 API endpoints covering core memory operations (POST /memory, GET /recall, PATCH /memory/, DELETE /memory/, GET /memory/by-tag), relationship management (POST /associate), consolidation (POST /consolidate, GET /consolidate/status), health checks (GET /health), enrichment (GET /enrichment/status, POST /enrichment/reprocess), analysis (GET /analyze), and startup recall (GET /startup-recall)
Expose enrichment metrics at GET /enrichment/status including processed counts, last success/error, queue depth, and inflight jobs
Combine vector similarity, keyword match, tag overlap, and recency in recall scoring
Support authentication via Bearer token, X-API-Key header, or query parameter
Files:
app.py
🧠 Learnings (23)
📓 Common learnings
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Use FastEmbed BAAI/bge-base-en-v1.5 model with optional 384 and 1024 dimension variants
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Use FastEmbed BAAI/bge-base-en-v1.5 model with optional 384 and 1024 dimension variants
Applied to files:
INSTALLATION.mddocs/ENVIRONMENT_VARIABLES.mdautomem/config.pyCLAUDE.mdautomem/api/admin.pyscripts/reembed_embeddings.pyapp.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Configure core services via environment variables: FALKORDB_HOST, FALKORDB_PORT, FALKORDB_GRAPH, QDRANT_URL, QDRANT_API_KEY, QDRANT_COLLECTION, VECTOR_SIZE, PORT, AUTOMEM_API_TOKEN, ADMIN_API_TOKEN
Applied to files:
INSTALLATION.mddocs/ENVIRONMENT_VARIABLES.mdautomem/config.pyCLAUDE.mdscripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Use ENRICHMENT_SPACY_MODEL environment variable to configure spaCy model for entity extraction
Applied to files:
INSTALLATION.mddocs/ENVIRONMENT_VARIABLES.mdautomem/config.pyapp.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Establish temporal (PRECEDED_BY) and semantic (SIMILAR_TO) edges in enrichment pipeline with symmetric cosine scores from Qdrant
Applied to files:
INSTALLATION.md
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Use ENRICHMENT_SIMILARITY_LIMIT and ENRICHMENT_SIMILARITY_THRESHOLD for neighbor link discovery via Qdrant
Applied to files:
INSTALLATION.mdautomem/config.pyapp.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Use Qdrant (vector database on port 6333) for 768-dimensional semantic search with optional availability (graceful degradation when Qdrant is unavailable)
Applied to files:
README.mddocs/ENVIRONMENT_VARIABLES.mdautomem/config.pyCLAUDE.md
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Never commit secrets; configure sensitive values via environment variables (`AUTOMEM_API_TOKEN`, `ADMIN_API_TOKEN`, `OPENAI_API_KEY`, `FALKORDB_PASSWORD`, `QDRANT_API_KEY`).
Applied to files:
docs/ENVIRONMENT_VARIABLES.mdautomem/config.pyscripts/reclassify_with_llm.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Control embedding provider via EMBEDDING_PROVIDER environment variable with values: auto (default), openai, local, or placeholder
Applied to files:
docs/ENVIRONMENT_VARIABLES.mdCLAUDE.mdapp.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)
Applied to files:
docs/ENVIRONMENT_VARIABLES.mdautomem/config.pyCLAUDE.mdautomem/api/admin.pyscripts/reembed_embeddings.pyapp.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Use the Agent Memory Protocol: fetch context with recall at task start, persist key steps with POST `/memory`, tag with conventions (`agent`, `repo:<name>`, `task:<id>`, `file:<path>`, `pr:<num>`, `issue:<num>`, `result`, `decision`, `docs`), and associate related memories.
Applied to files:
tests/benchmarks/experiments/README.md
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: PRs must include: clear description and scope, linked issues, test plan/output, and notes on API or config changes. Update relevant docs under `docs/`.
Applied to files:
tests/benchmarks/REPRODUCIBILITY.md
📚 Learning: 2025-11-08T10:51:31.578Z
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 0
File: :0-0
Timestamp: 2025-11-08T10:51:31.578Z
Learning: In railway-template.json for the automem project, the memory-service connects to FalkorDB using FALKORDB_PASSWORD=${{FalkorDB.FALKOR_PASSWORD}} (referencing the FalkorDB service's FALKOR_PASSWORD variable).
Applied to files:
automem/config.pyscripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Use FalkorDB (graph database on port 6379) for storing Memory nodes and relationship edges with atomic operations and automatic rollback on errors
Applied to files:
automem/config.pyscripts/reembed_embeddings.py
📚 Learning: 2025-11-08T10:51:31.578Z
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 0
File: :0-0
Timestamp: 2025-11-08T10:51:31.578Z
Learning: In railway-template.json, FalkorDB authentication should use the FALKOR_PASSWORD environment variable alone, without --requirepass in REDIS_ARGS. The FalkorDB Docker image handles password configuration internally via the FALKOR_PASSWORD env var.
Applied to files:
automem/config.pyscripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Support ENRICHMENT_ENABLE_SUMMARIES environment variable to toggle automatic summary creation
Applied to files:
automem/config.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Support ENRICHMENT_MAX_ATTEMPTS environment variable to configure retry attempts before giving up
Applied to files:
automem/config.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement EmbeddingProvider abstraction with three implementations: OpenAIEmbeddingProvider, FastEmbedProvider, and PlaceholderEmbeddingProvider
Applied to files:
CLAUDE.mdapp.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Core package is located at `automem/` with notable directories: `api/` (Flask blueprints), `utils/`, `stores/`, and `config.py`.
Applied to files:
tests/benchmarks/core_adapter/package.jsontests/benchmarks/experiments/__init__.pyapp.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to scripts/migrate*.py : Implement migration tools: migrate_mcp_sqlite.py for importing from legacy SQLite store, reembed_embeddings.py for refreshing embeddings
Applied to files:
scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to tests/**/*.py : Use pytest with DummyGraph fixture to mock FalkorDB operations for unit testing
Applied to files:
scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Applies to **/*.py : Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.
Applied to files:
tests/benchmarks/experiments/test_e2e_mini.pytests/benchmarks/experiments/test_minimal.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to app.py : Implement 13 API endpoints covering core memory operations (POST /memory, GET /recall, PATCH /memory/<id>, DELETE /memory/<id>, GET /memory/by-tag), relationship management (POST /associate), consolidation (POST /consolidate, GET /consolidate/status), health checks (GET /health), enrichment (GET /enrichment/status, POST /enrichment/reprocess), analysis (GET /analyze), and startup recall (GET /startup-recall)
Applied to files:
tests/benchmarks/experiments/test_e2e_mini.pyapp.py
🧬 Code graph analysis (9)
tests/benchmarks/core_adapter/automem-search.js (2)
mcp-sse-server/server.js (1)
userId(333-333)tests/benchmarks/core_adapter/evaluate.js (3)
response(69-80)response(144-157)results(214-218)
tests/benchmarks/core_adapter/evaluate.js (2)
tests/benchmarks/core_adapter/automem-search.js (1)
AutoMemSearchService(15-110)tests/conftest.py (1)
OpenAI(128-131)
tests/benchmarks/experiments/run_optimization.sh (2)
tests/benchmarks/experiments/experiment_runner.py (1)
main(505-653)tests/benchmarks/experiments/analysis_agent.py (1)
main(335-361)
tests/benchmarks/experiments/test_e2e_mini.py (4)
tests/benchmarks/experiments/experiment_config.py (1)
ExperimentConfig(40-92)tests/benchmarks/locomo_metrics.py (1)
f1_score(59-89)tests/benchmarks/core_adapter/evaluate.js (8)
openai(43-43)conversations(210-210)results(214-218)i(234-234)qa(235-235)question(236-236)category(238-238)context(244-244)tests/conftest.py (1)
OpenAI(128-131)
tests/benchmarks/experiments/analysis_agent.py (1)
tests/benchmarks/experiments/experiment_config.py (5)
ExperimentConfig(40-92)ExperimentResult(96-152)score(126-152)generate_experiment_grid(155-242)to_json(91-92)
tests/benchmarks/experiments/test_minimal.py (2)
tests/benchmarks/experiments/experiment_config.py (1)
ExperimentConfig(40-92)tests/benchmarks/test_locomo.py (2)
LoCoMoEvaluator(107-1587)LoCoMoConfig(54-104)
tests/benchmarks/experiments/experiment_runner.py (3)
tests/benchmarks/experiments/experiment_config.py (6)
ExperimentConfig(40-92)ExperimentResult(96-152)generate_experiment_grid(155-242)generate_focused_experiments(245-260)score(126-152)to_json(91-92)tests/benchmarks/experiments/railway_manager.py (6)
RailwayManager(32-286)LocalTestManager(289-337)RailwayInstance(21-29)deploy_instance(65-145)destroy_instance(211-237)destroy_all(239-242)tests/benchmarks/test_locomo.py (3)
LoCoMoEvaluator(107-1587)LoCoMoConfig(54-104)run_benchmark(1438-1587)
tests/benchmarks/locomo_metrics.py (1)
tests/benchmarks/test_locomo.py (1)
normalize_answer(538-571)
tests/benchmarks/experiments/railway_manager.py (1)
tests/benchmarks/experiments/experiment_config.py (2)
ExperimentConfig(40-92)to_env_vars(70-89)
🪛 LanguageTool
tests/benchmarks/experiments/README.md
[grammar] ~160-~160: Ensure spelling is correct
Context: ...Monitor response times - stay under 100ms for production 4. Watch token costs...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/ENVIRONMENT_VARIABLES.md
90-90: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md
90-90: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
tests/benchmarks/experiments/README.md
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
118-118: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
174-174: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
tests/benchmarks/REPRODUCIBILITY.md
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
185-185: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
tests/benchmarks/core_adapter/README.md
90-90: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.7)
tests/benchmarks/experiments/test_e2e_mini.py
1-1: Shebang is present but file is not executable
(EXE001)
44-44: f-string without any placeholders
Remove extraneous f prefix
(F541)
77-77: Do not catch blind exception: Exception
(BLE001)
91-91: f-string without any placeholders
Remove extraneous f prefix
(F541)
100-100: Local variable category is assigned to but never used
Remove assignment to unused variable category
(F841)
106-106: Probable use of requests call without timeout
(S113)
137-137: Do not catch blind exception: Exception
(BLE001)
158-158: f-string without any placeholders
Remove extraneous f prefix
(F541)
159-159: Probable use of requests call without timeout
(S113)
161-161: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/benchmarks/experiments/analysis_agent.py
1-1: Shebang is present but file is not executable
(EXE001)
140-140: Consider moving this statement to an else block
(TRY300)
142-142: Do not catch blind exception: Exception
(BLE001)
258-258: f-string without any placeholders
Remove extraneous f prefix
(F541)
261-261: f-string without any placeholders
Remove extraneous f prefix
(F541)
294-294: f-string without any placeholders
Remove extraneous f prefix
(F541)
358-358: Local variable best_config is assigned to but never used
Remove assignment to unused variable best_config
(F841)
tests/benchmarks/test_locomo.py
701-701: Unused method argument: question
(ARG002)
752-752: Local variable found_in_memory is assigned to but never used
Remove assignment to unused variable found_in_memory
(F841)
849-849: Consider moving this statement to an else block
(TRY300)
851-851: Do not catch blind exception: Exception
(BLE001)
935-935: Consider moving this statement to an else block
(TRY300)
937-937: Do not catch blind exception: Exception
(BLE001)
1083-1083: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1084-1084: Unused method argument: category
(ARG002)
1438-1438: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
tests/benchmarks/experiments/test_minimal.py
1-1: Shebang is present but file is not executable
(EXE001)
34-34: f-string without any placeholders
Remove extraneous f prefix
(F541)
50-50: f-string without any placeholders
Remove extraneous f prefix
(F541)
58-58: f-string without any placeholders
Remove extraneous f prefix
(F541)
63-63: Prefer next(iter(evaluator.conversations.keys())) over single element slice
Replace with next(iter(evaluator.conversations.keys()))
(RUF015)
68-68: f-string without any placeholders
Remove extraneous f prefix
(F541)
73-73: f-string without any placeholders
Remove extraneous f prefix
(F541)
78-78: f-string without any placeholders
Remove extraneous f prefix
(F541)
87-87: f-string without any placeholders
Remove extraneous f prefix
(F541)
91-91: f-string without any placeholders
Remove extraneous f prefix
(F541)
124-124: Unpacked variable explanation is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
137-137: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/benchmarks/experiments/experiment_runner.py
1-1: Shebang is present but file is not executable
(EXE001)
128-128: Do not catch blind exception: Exception
(BLE001)
279-279: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
280-280: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
330-330: Local variable worst is assigned to but never used
Remove assignment to unused variable worst
(F841)
373-373: f-string without any placeholders
Remove extraneous f prefix
(F541)
383-383: f-string without any placeholders
Remove extraneous f prefix
(F541)
390-390: f-string without any placeholders
Remove extraneous f prefix
(F541)
405-405: f-string without any placeholders
Remove extraneous f prefix
(F541)
416-416: f-string without any placeholders
Remove extraneous f prefix
(F541)
622-622: f-string without any placeholders
Remove extraneous f prefix
(F541)
640-640: Local variable results is assigned to but never used
Remove assignment to unused variable results
(F841)
tests/benchmarks/locomo_metrics.py
174-174: Unused function argument: question
(ARG001)
174-174: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
358-358: Loop control variable cat not used within loop body
Rename unused cat to _cat
(B007)
tests/benchmarks/experiments/railway_manager.py
55-55: Starting a process with a partial executable path
(S607)
60-60: Consider moving this statement to an else block
(TRY300)
61-61: Do not catch blind exception: Exception
(BLE001)
82-82: subprocess call: check for execution of untrusted input
(S603)
83-87: Starting a process with a partial executable path
(S607)
91-91: Probable insecure usage of temporary file or directory: "/tmp"
(S108)
104-104: subprocess call: check for execution of untrusted input
(S603)
105-105: Starting a process with a partial executable path
(S607)
112-112: Starting a process with a partial executable path
(S607)
141-141: Consider moving this statement to an else block
(TRY300)
143-143: Do not catch blind exception: Exception
(BLE001)
166-166: Probable insecure usage of temporary file or directory: "/tmp/"
(S108)
173-173: subprocess call: check for execution of untrusted input
(S603)
174-180: Starting a process with a partial executable path
(S607)
205-205: Consider moving this statement to an else block
(TRY300)
207-207: Do not catch blind exception: Exception
(BLE001)
216-216: subprocess call: check for execution of untrusted input
(S603)
217-217: Starting a process with a partial executable path
(S607)
224-224: Starting a process with a partial executable path
(S607)
233-233: Consider moving this statement to an else block
(TRY300)
235-235: Do not catch blind exception: Exception
(BLE001)
260-261: try-except-pass detected, consider logging the exception
(S110)
260-260: Do not catch blind exception: Exception
(BLE001)
265-265: Unused method argument: project_id
(ARG002)
268-268: Starting a process with a partial executable path
(S607)
319-319: Consider moving this statement to an else block
(TRY300)
320-320: Do not catch blind exception: Exception
(BLE001)
334-334: Consider moving this statement to an else block
(TRY300)
335-335: Do not catch blind exception: Exception
(BLE001)
🪛 Shellcheck (0.11.0)
tests/benchmarks/experiments/run_optimization.sh
[error] 143-143: -f doesn't work with globs. Use a for loop.
(SC2144)
🔇 Additional comments (36)
.gitignore (1)
25-26: LGTM!The new ignore pattern appropriately excludes ephemeral experiment results while preserving the ability to commit notable runs to the main
tests/benchmarks/results/directory. This aligns well with the new benchmark infrastructure.requirements-dev.txt (1)
10-12: LGTM!The addition of nltk for LoCoMo benchmark evaluation is appropriate. NLTK provides the Porter stemmer and tokenization needed for official F1 scoring metrics referenced in the benchmark documentation.
tests/benchmarks/core_adapter/package.json (1)
1-18: LGTM!The package.json correctly configures the CORE adapter as an ES module with appropriate scripts and dependencies. The version pins for axios (^1.6.0), dotenv (^16.3.0), and openai (^4.20.0) are reasonable for benchmark stability.
automem/config.py (1)
50-54: Update config comment to clarify embedding model selection across providers.The embedding provider architecture correctly implements the fallback pattern (OpenAI → FastEmbed → Placeholder), and FastEmbed is properly supported as a secondary backend. However, the configuration approach differs from the review comment's assumption:
FastEmbed fallback is supported ✓ (confirmed in app.py auto-selection logic)
EMBEDDING_MODEL only applies to OpenAI — Users cannot set
EMBEDDING_MODEL=BAAI/bge-base-en-v1.5for FastEmbed. Instead, FastEmbed model selection is controlled byVECTOR_SIZE:
VECTOR_SIZE=384→ BAAI/bge-small-en-v1.5VECTOR_SIZE=768→ BAAI/bge-base-en-v1.5 (default)VECTOR_SIZE=1024→ BAAI/bge-large-en-v1.5Comment should clarify provider control — Line 52's comment mentions "use VECTOR_SIZE=768 if switching" but only in the context of OpenAI models. Add a note that
VECTOR_SIZEalso controls the FastEmbed model variant, and that provider selection is controlled byEMBEDDING_PROVIDERenv var (auto/openai/local/placeholder), notEMBEDDING_MODEL.Likely an incorrect or invalid review comment.
README.md (1)
101-101: LGTM! Architecture diagram updated to reflect new default embedding dimensions.The change from 768-d to 3072-d accurately reflects the update to text-embedding-3-large as the default embedding model.
CLAUDE.md (2)
135-138: LGTM! OpenAI embedding configuration updated correctly.The documentation now accurately reflects the new default model (text-embedding-3-large) and its 3072-dimension output, along with the configurability via EMBEDDING_MODEL and VECTOR_SIZE.
205-205: LGTM! Environment variable documentation updated.The VECTOR_SIZE default correctly reflects 3072 for the large model, with helpful guidance for the small model alternative.
INSTALLATION.md (1)
371-372: LGTM! Configuration documentation updated correctly.The table accurately documents the new defaults for VECTOR_SIZE (3072) and EMBEDDING_MODEL (text-embedding-3-large), providing clear guidance for users.
scripts/reclassify_with_llm.py (3)
34-34: LGTM! Classification model now configurable.The introduction of the CLASSIFICATION_MODEL environment variable enables flexible model selection while maintaining a sensible default.
79-79: LGTM! Model parameter correctly uses configurable value.The OpenAI API call now uses the CLASSIFICATION_MODEL variable instead of a hardcoded model name, enabling runtime configuration.
172-172: LGTM! Model selection visibility improved.Logging the classification model during initialization provides helpful debugging information and confirms the active configuration.
tests/benchmarks/experiments/__init__.py (1)
1-20: LGTM! Clean package initialization.The module provides a well-defined public API by re-exporting key symbols from experiment_config, following standard Python packaging conventions.
scripts/reembed_embeddings.py (3)
48-59: LGTM! FalkorDB authentication handling improved.The connection logic now properly handles both authenticated and unauthenticated scenarios, with clear logging of the authentication state.
149-154: LGTM! Embedding generation correctly uses configurable model.The batch embedding logic now uses the EMBEDDING_MODEL environment variable with appropriate logging and dimensions configuration.
8-9: Fix inconsistent EMBEDDING_MODEL default.The default embedding model in this script is "text-embedding-3-small", but according to the PR objectives and INSTALLATION.md (line 372), the default should be "text-embedding-3-large" to match the new 3072-dimension standard.
Apply this diff:
Environment: - EMBEDDING_MODEL: OpenAI embedding model (default: text-embedding-3-small) + EMBEDDING_MODEL: OpenAI embedding model (default: text-embedding-3-large) VECTOR_SIZE: Embedding dimension (default: 768, use 3072 for large model)- embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small") + embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-large")Also applies to: 140-142
⛔ Skipped due to learnings
Learnt from: CR Repo: verygoodplugins/automem PR: 0 File: CLAUDE.md:0-0 Timestamp: 2025-12-02T05:34:12.678Z Learning: Applies to automem/embedding/*.py : Use FastEmbed BAAI/bge-base-en-v1.5 model with optional 384 and 1024 dimension variantsLearnt from: CR Repo: verygoodplugins/automem PR: 0 File: CLAUDE.md:0-0 Timestamp: 2025-12-02T05:34:12.678Z Learning: Control embedding provider via EMBEDDING_PROVIDER environment variable with values: auto (default), openai, local, or placeholderLearnt from: CR Repo: verygoodplugins/automem PR: 0 File: CLAUDE.md:0-0 Timestamp: 2025-12-02T05:34:12.678Z Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)docs/ENVIRONMENT_VARIABLES.md (2)
140-169: Well-documented model configuration section.The new Model Configuration section provides clear guidance on embedding models, vector sizes, and classification models with helpful comparison tables.
329-367: Comprehensive re-embedding workflow documentation.The step-by-step guide for re-embedding memories when changing models is thorough and includes backup steps, Qdrant collection recreation, and cost estimates.
tests/benchmarks/run_all_benchmarks.sh (1)
59-69: CORE adapter execution looks correct.The conditional execution with proper directory navigation and dependency installation is well-handled.
app.py (4)
130-131: Configuration constants properly imported.The
EMBEDDING_MODELandCLASSIFICATION_MODELimports enable dynamic model selection as intended.
738-748: Classification model now configurable.The memory type classification correctly uses
CLASSIFICATION_MODELinstead of a hardcoded value.
1116-1124: Embedding provider initialization uses configurable model.Both explicit and auto-selected OpenAI embedding provider paths now use
EMBEDDING_MODEL, enabling runtime configuration.Also applies to: 1147-1155
3238-3250: Admin blueprint receives embedding model parameter.The
EMBEDDING_MODELis correctly passed tocreate_admin_blueprint_full, completing the configuration wiring.automem/api/admin.py (2)
7-19: Function signature updated to accept embedding model.The new
embedding_model: strparameter enables dynamic model selection for re-embedding operations. The signature change is coordinated with the caller inapp.py.
88-92: Re-embedding now uses configurable model.The embeddings request correctly uses the
embedding_modelparameter instead of a hardcoded value, aligning with the PR's dynamic model selection objective.tests/benchmarks/core_adapter/evaluate.js (3)
29-39: Verify OpenAI model names in CONFIG.The default models
gpt-4.1andgpt-4.1(lines 37-38) appear to be non-standard OpenAI model names. Standard names includegpt-4o,gpt-4o-mini,gpt-4-turbo, etc.If these are intentional aliases or internal naming, please document this. Otherwise, consider using standard model names:
const CONFIG = { dataFile: path.join(__dirname, "../locomo/data/locomo10.json"), outputFile: path.join(__dirname, "evaluation_results.json"), searchLimit: 20, - model: process.env.EVAL_MODEL || "gpt-4.1", - evalModel: process.env.EVAL_JUDGE_MODEL || "gpt-4.1", + model: process.env.EVAL_MODEL || "gpt-4o", + evalModel: process.env.EVAL_JUDGE_MODEL || "gpt-4o", };
96-190: CORE-compatible evaluation logic implemented correctly.The
evaluateAnswerfunction properly implements CORE's lenient evaluation methodology with:
- Special handling for adversarial category (5)
- CORE's exact evaluation prompt
- Fallback to 30% word match ratio
195-326: Evaluation orchestration is well-structured.The
runEvaluationfunction properly orchestrates health checks, data loading, per-conversation evaluation, and results aggregation with good progress reporting.tests/benchmarks/experiments/test_e2e_mini.py (1)
20-184: Good implementation of a self-contained micro E2E test.The test provides a complete pipeline verification: memory storage → enrichment wait → recall → LLM answer generation → F1 scoring → cleanup. The error handling and progress output are helpful for debugging.
tests/benchmarks/experiments/experiment_config.py (1)
95-152: Well-structured ExperimentResult dataclass with scoring.The scoring method with configurable weights is a good approach for multi-objective optimization. The normalization of response time and cost scores is reasonable.
tests/benchmarks/experiments/railway_manager.py (1)
244-263: Good health check implementation with retry logic.The
_wait_for_healthymethod properly uses async httpx client with timeout and polling. The 5-second sleep between checks is reasonable for deployment startup times.tests/benchmarks/test_locomo.py (1)
tests/benchmarks/test_locomo.py (1)
769-854: Well-implemented E2E answer generation and lenient evaluation. The `generate_answer_e2e` method correctly handles context building with token limits, caching, and category-specific prompting. The `evaluate_lenient` method provides CORE-compatible semantic evaluation with appropriate fallback logic. Good use of structured JSON responses.
Also applies to: 855-950
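For readers unfamiliar with the context-budget step, here is an illustrative trim-to-budget helper; the function name and the 4-characters-per-token heuristic are assumptions, not the actual `generate_answer_e2e` code.

```python
def build_context(memories: list[str], max_context_tokens: int = 4000) -> str:
    """Concatenate retrieved memories until a rough token budget is exhausted."""
    budget_chars = max_context_tokens * 4  # crude heuristic: ~4 characters per token
    parts: list[str] = []
    used = 0
    for mem in memories:
        if used + len(mem) > budget_chars:
            break
        parts.append(mem)
        used += len(mem)
    return "\n".join(parts)
```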
tests/benchmarks/locomo_metrics.py (2)
59-89: Well-implemented official LoCoMo F1 metrics. The `f1_score` function correctly implements token-level F1 with Porter stemming as per the LoCoMo paper. The `f1_multi_answer` function properly handles comma-separated sub-answers with best-match scoring.
Also applies to: 92-116
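A compact sketch of token-level F1 with Porter stemming, in the spirit of the `f1_score` described above (not the exact code from `locomo_metrics.py`):

```python
from collections import Counter
from nltk.stem import PorterStemmer

_ps = PorterStemmer()

def _normalize(text: str) -> list[str]:
    return [_ps.stem(tok) for tok in text.lower().split()]

def f1(prediction: str, ground_truth: str) -> float:
    pred, gold = _normalize(prediction), _normalize(ground_truth)
    common = Counter(pred) & Counter(gold)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)
```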
211-366: Clean implementation of the `OfficialLoCoMoEvaluator` class. The evaluator properly tracks results by category and overall, with clear methods for accuracy, mean F1, and summary generation. The `print_summary` method provides well-formatted output.
tests/benchmarks/experiments/analysis_agent.py (1)
151-184: LGTM with a minor suggestion. The key normalization and mapping logic handles LLM naming variations well. Consider using `dataclasses.asdict()` for cleaner dataclass handling, though `__dict__.copy()` works fine for this use case.
tests/benchmarks/experiments/experiment_runner.py (2)
242-257: The `deploy_instance` and `destroy_instance` methods in `RailwayManager` are properly defined as async methods (lines 65 and 211 of railway_manager.py), so awaiting them in the experiment_runner.py code is correct and will not raise a TypeError.
Likely an incorrect or invalid review comment.
287-290: Weight normalization is correct; `recency_weight` is independent. The normalization of `vector_weight` and `graph_weight` to sum to 1.0 is intentional. `recency_weight` is not part of this constraint; it is applied separately as an independent weighting factor. The `_generate_mutations` method correctly excludes `recency_weight` from the mutation space, ensuring it remains at its default value while only the hybrid recall weights are normalized.
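For clarity, the constraint described above amounts to something like the following; the function is illustrative, and only the weight names come from the review.

```python
def normalize_recall_weights(vector_weight: float, graph_weight: float) -> tuple[float, float]:
    """Rescale the hybrid recall weights to sum to 1.0; recency_weight is untouched."""
    total = vector_weight + graph_weight
    if total == 0:
        return 0.5, 0.5  # fall back to an even split
    return vector_weight / total, graph_weight / total
```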
```diff
 # Qdrant / FalkorDB configuration
 COLLECTION_NAME = os.getenv("QDRANT_COLLECTION", "memories")
-VECTOR_SIZE = int(os.getenv("VECTOR_SIZE") or os.getenv("QDRANT_VECTOR_SIZE", "768"))
+VECTOR_SIZE = int(os.getenv("VECTOR_SIZE") or os.getenv("QDRANT_VECTOR_SIZE", "3072"))
```
Breaking change: Document the migration path for existing deployments.
Changing the default VECTOR_SIZE from 768 to 3072 is a breaking change that will cause dimension mismatches in existing Qdrant collections. Existing deployments with 768-dimensional embeddings cannot simply update this value without re-embedding all memories.
Consider one of these approaches:

- Add migration documentation: Create a migration guide that explicitly instructs users to:
  - Back up existing data
  - Run the re-embedding script (`scripts/reembed_embeddings.py`)
  - Verify the new embeddings before switching
- Add runtime validation: Check if the Qdrant collection's dimension matches `VECTOR_SIZE` and fail fast with a clear error message if they don't match.
Based on learnings, the system should support graceful degradation when Qdrant is unavailable, but dimension mismatches will cause runtime errors.
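A hedged sketch of the suggested startup check follows; it uses qdrant-client's public `get_collection` API and assumes the collection is configured with a single unnamed vector.

```python
from qdrant_client import QdrantClient

def assert_vector_size(client: QdrantClient, collection: str, expected: int) -> None:
    """Fail fast if the existing collection's dimension doesn't match VECTOR_SIZE."""
    info = client.get_collection(collection)
    vectors = info.config.params.vectors  # VectorParams for a single unnamed vector
    actual = getattr(vectors, "size", None)
    if actual is not None and actual != expected:
        raise RuntimeError(
            f"Qdrant collection '{collection}' has dimension {actual}, but VECTOR_SIZE "
            f"is {expected}. Re-embed via scripts/reembed_embeddings.py or recreate the "
            f"collection before starting."
        )
```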
🤖 Prompt for AI Agents
In automem/config.py around line 13, the VECTOR_SIZE default was changed to 3072
which breaks existing Qdrant collections using 768; add a migration note and
runtime validation: update docs (or a new docs/migrations.md) with explicit
steps to back up data, run scripts/reembed_embeddings.py, and verify embeddings
before switching, and implement a runtime check where the app queries the Qdrant
collection metadata on startup, compares its vector dimension to VECTOR_SIZE,
and fails fast with a clear error message (including current collection
dimension and expected VECTOR_SIZE) if they differ to prevent silent runtime
errors.
**Classification Model Pricing (Dec 2025):**

| Model | Input | Output | Notes |
|-------|-------|--------|-------|
| `gpt-4o-mini` | $0.15/1M | $0.60/1M | **Default** - Good enough for classification |
| `gpt-4.1` | ~$2/1M | ~$8/1M | Better reasoning |
| `gpt-5.1` | $1.25/1M | $10/1M | Best reasoning, use for benchmarks |
🧩 Analysis chain
🌐 Web query:
OpenAI GPT model names December 2025 current available models
💡 Result:
Here are the major OpenAI GPT-family models publicly listed/available as of Dec 6, 2025:
- GPT‑4o (and GPT‑4o mini). (openai.com)
- GPT‑4 (still available in the API; retired from ChatGPT UI in Apr 2025). (help.openai.com)
- GPT‑4.1 / GPT‑4.5-preview (preview / deprecated snapshot names appear in the API docs). (platform.openai.com)
- GPT‑3.5‑turbo (and its snapshots/aliases). (platform.openai.com)
- o3 / o3‑mini (OpenAI “o3” family listed in API docs). (platform.openai.com)
Note: press/industry reports in early December 2025 mention OpenAI work on newer frontier models (e.g., GPT‑5.x), but official releases or API availability should be confirmed on OpenAI’s site. (theverge.com)
If you want, I can open the specific OpenAI API model docs pages and list exact model aliases/pricing/snapshots.
Citations:
- 1: https://openai.com/kn-IN/index/gpt-4o-system-card/?utm_source=openai
- 2: https://help.openai.com/en/articles/6825453?utm_source=openai
- 3: https://platform.openai.com/docs/models/gpt-4.5-preview?utm_source=openai
- 4: https://platform.openai.com/docs/models/gpt-3.5-turbo?utm_source=openai
- 5: https://platform.openai.com/docs/models/gpt-4.5-preview?utm_source=openai
- 6: https://www.theverge.com/report/838857/openai-gpt-5-2-release-date-code-red-google-response?utm_source=openai
Replace non-existent model names with confirmed OpenAI models.
The pricing table references gpt-4.1 (preview/deprecated status) and gpt-5.1 (not officially released). Use confirmed model names like gpt-4o, gpt-4-turbo, or o3 instead. If experimental models must be referenced, clearly document their preview/beta status and note they may not be available via the standard API.
🤖 Prompt for AI Agents
In docs/ENVIRONMENT_VARIABLES.md around lines 162 to 168, the pricing table
lists non-existent or deprecated model names (gpt-4.1, gpt-5.1); replace those
with confirmed OpenAI model names such as gpt-4o, gpt-4-turbo, or o3 (choose the
appropriate model for the stated pricing/quality), update the Model column
values accordingly, and if you must reference experimental or preview models
mark them explicitly as “preview/beta” with a short note that they may not be
available via the standard API; ensure the Notes column reflects the new model
names and their status.
````
```javascript
// In services/search.server.js
import { AutoMemSearchService } from './automem-search.js';
export default new AutoMemSearchService();
```
````
Add language specifier to fenced code block.
The JavaScript code block should specify the language for proper syntax highlighting.
Apply this diff:

````diff
-```
+```javascript
 // In services/search.server.js
 import { AutoMemSearchService } from './automem-search.js';
 export default new AutoMemSearchService();
-```
+```
````

🤖 Prompt for AI Agents
In tests/benchmarks/core_adapter/README.md around lines 84 to 88, the fenced
code block lacks a language specifier; update the opening fence to
"```javascript" (ensure the opening line reads ```javascript) and keep the
closing fence as "```" so the block is properly marked for JavaScript syntax
highlighting.
```python
    def __init__(
        self,
        model: str = "gpt-5.1",
```
Non-existent model name "gpt-5.1" used as default.
Same issue as in experiment_runner.py - "gpt-5.1" is not a valid OpenAI model name. This will cause API errors at runtime.
```diff
 def __init__(
     self,
-    model: str = "gpt-5.1",
+    model: str = "gpt-4o",
```

Also update line 341 in main():

```diff
-    parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")
+    parser.add_argument("--model", type=str, default="gpt-4o", help="LLM model for analysis")
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    model: str = "gpt-4o",
```
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/analysis_agent.py around line 69, the default
model is set to the non-existent "gpt-5.1"; change the default to a valid OpenAI
model name (e.g., "gpt-4o" or "gpt-4") to prevent runtime API errors, and also
update the corresponding model reference at line 341 inside main() to the same
valid model string so both defaults stay consistent.
```python
    GPT_4_1 = "gpt-4.1"
    GPT_5_1 = "gpt-5.1"
    CLAUDE_HAIKU = "claude-3-haiku"
```
🧩 Analysis chain
🌐 Web query:
valid OpenAI GPT model names December 2025 gpt-4 gpt-4o gpt-5
💡 Result:
As of December 6, 2025:
- gpt-4 — still an API model (no longer the default in ChatGPT; some GPT‑4 snapshots are being deprecated). [1][2]
- gpt-4o — current OpenAI multimodal "o‑series" model (available in API/ChatGPT). [1][2]
- gpt-5 — released and available (API names include gpt-5 and smaller variants like gpt-5-mini/nano). [3]
Sources: [1] OpenAI model docs (gpt-4o). [2] OpenAI ChatGPT release notes (sunsetting GPT‑4 / GPT‑4o updates). [3] TechCrunch reporting on GPT‑5 release.
🌐 Web query:
gpt-4-turbo OpenAI December 2025 deprecated
💡 Result:
Short answer: GPT-4‑Turbo itself is not fully shut down as of Dec 6, 2025, but some older GPT‑4 snapshots/previews (including gpt-4-turbo-preview / gpt-4-0125-preview) have been marked deprecated with a scheduled shutdown on March 26, 2026 — OpenAI’s docs list recommended replacements (e.g., gpt-5 or gpt-4.1). [1][2]
Sources:
- OpenAI GPT-4 Turbo model page (snapshots & deprecation status). [1]
- OpenAI Deprecations page — legacy GPT snapshot deprecation list (includes gpt-4-turbo-preview → shutdown 2026‑03‑26). [2]
Replace invalid OpenAI model names with valid alternatives.
gpt-4.1 and gpt-5.1 are not valid OpenAI model identifiers. Use current, stable model names instead:
```diff
 class EnrichmentModel(Enum):
     """LLM model for enrichment/extraction"""
     GPT_4O_MINI = "gpt-4o-mini"
     GPT_4O = "gpt-4o"
-    GPT_4_1 = "gpt-4.1"
-    GPT_5_1 = "gpt-5.1"
+    GPT_4 = "gpt-4"
+    GPT_5 = "gpt-5"
     CLAUDE_HAIKU = "claude-3-haiku"
```

Note: gpt-4-turbo variants are being phased out (deprecated with a March 2026 shutdown). Use gpt-4o, gpt-4, or gpt-5 instead.
Committable suggestion skipped: line range outside the PR's diff.
```python
        resp = requests.get(
            f"{BASE_URL}/recall",
            headers=headers,
            params={"query": question, "tags": test_tag, "limit": 5}
        )

        if resp.status_code != 200:
            print(f"  ❌ Recall failed: {resp.status_code}")
            continue
```
Add timeout to recall request for reliability.
The recall request lacks a timeout, which could cause the test to hang indefinitely if the service is unresponsive.
```diff
         resp = requests.get(
             f"{BASE_URL}/recall",
             headers=headers,
-            params={"query": question, "tags": test_tag, "limit": 5}
+            params={"query": question, "tags": test_tag, "limit": 5},
+            timeout=10
         )
```

🧰 Tools
🪛 Ruff (0.14.7)
106-106: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 106 to 114, the
requests.get call to the /recall endpoint has no timeout which can hang tests;
add a timeout parameter (e.g. timeout=10) to the requests.get call and wrap the
request in a try/except catching requests.exceptions.Timeout (and
requests.exceptions.RequestException) to log the timeout/error and continue the
loop so the test suite remains reliable.
```python
    resp = requests.delete(
        f"{BASE_URL}/admin/memories",
        headers={"Authorization": f"Bearer admin"},
        params={"tag": test_tag, "confirm": "yes"}
    )
    print(f"  Cleanup: {resp.status_code}")
```
Add timeout to cleanup request.
The DELETE request lacks a timeout, similar to the recall request.
```diff
     resp = requests.delete(
         f"{BASE_URL}/admin/memories",
         headers={"Authorization": "Bearer admin"},
-        params={"tag": test_tag, "confirm": "yes"}
+        params={"tag": test_tag, "confirm": "yes"},
+        timeout=30
     )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    resp = requests.delete(
        f"{BASE_URL}/admin/memories",
        headers={"Authorization": "Bearer admin"},
        params={"tag": test_tag, "confirm": "yes"},
        timeout=30
    )
    print(f"  Cleanup: {resp.status_code}")
```
🧰 Tools
🪛 Ruff (0.14.7)
159-159: Probable use of requests call without timeout
(S113)
161-161: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 159-164 the
cleanup DELETE request lacks a timeout; add the same timeout parameter used for
the recall request (e.g., timeout=TIMEOUT or timeout=REQUEST_TIMEOUT) to the
requests.delete call (or use a sensible constant like timeout=5 if no shared
variable exists) so the test won’t hang on network issues.
```python
try:
    ps = PorterStemmer()
except LookupError:
    nltk.download("punkt")
    ps = PorterStemmer()
```
Incorrect exception handling for NLTK initialization.
PorterStemmer() itself doesn't raise LookupError. The download is only needed for tokenizers (like punkt), but Porter stemmer doesn't require punkt. This try/except block is misleading and the download may never be triggered when actually needed.
If you intend to ensure nltk data is available, consider a more explicit check or move the download to a setup step.
```diff
-# Initialize stemmer
-try:
-    ps = PorterStemmer()
-except LookupError:
-    nltk.download("punkt")
-    ps = PorterStemmer()
+# Initialize stemmer (PorterStemmer doesn't require nltk data downloads)
+ps = PorterStemmer()
```

🤖 Prompt for AI Agents
+ps = PorterStemmer()🤖 Prompt for AI Agents
In tests/benchmarks/locomo_metrics.py around lines 24 to 28, the try/except
around PorterStemmer() is incorrect because PorterStemmer() does not raise
LookupError and the punkt download is unrelated; replace this block by directly
instantiating PorterStemmer (remove the try/except and nltk.download call) or,
if you really need to ensure NLTK tokenizer data, explicitly check for
tokenizers/punkt using nltk.data.find('tokenizers/punkt') and only call
nltk.download('punkt') inside a LookupError handler for that check; keep the
initialization of ps = PorterStemmer() separate from any data-download logic.
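If tokenizer data ever does become necessary, the explicit check suggested above would look roughly like this (a sketch; the benchmark code itself only needs the stemmer):

```python
import nltk
from nltk.stem import PorterStemmer

# Only download punkt if a tokenizer is actually required; the check is explicit.
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

ps = PorterStemmer()  # no NLTK data download required for stemming
```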
```python
    # Models: gpt-5.1 (best), gpt-4.1 (good balance), gpt-4o-mini (cheapest)
    e2e_model: str = "gpt-5.1"  # Model for answer generation
```
Invalid OpenAI model names in defaults.
gpt-5.1 and gpt-4.1 are not valid OpenAI model identifiers. These defaults will cause API errors.
```diff
     # E2E QA settings
     # Models: gpt-5.1 (best), gpt-4.1 (good balance), gpt-4o-mini (cheapest)
-    e2e_model: str = "gpt-5.1"  # Model for answer generation
+    e2e_model: str = "gpt-4o"  # Model for answer generation
     e2e_max_context_tokens: int = 4000  # Max tokens of context to include
     # Model for lenient evaluation judging
-    eval_judge_model: str = "gpt-5.1"
+    eval_judge_model: str = "gpt-4o"
```

Also update the comment on line 91 to reflect valid model options.
Also applies to: 104-104
| "--e2e-model", | ||
| default="gpt-5.1", | ||
| help="Model for E2E answer generation (default: gpt-5.1, options: gpt-4.1, gpt-4o-mini)", |
CLI help text references invalid model names.
The argparse help text mentions gpt-5.1 and gpt-4.1 which are not valid models.
```diff
     parser.add_argument(
         "--e2e-model",
-        default="gpt-5.1",
-        help="Model for E2E answer generation (default: gpt-5.1, options: gpt-4.1, gpt-4o-mini)",
+        default="gpt-4o",
+        help="Model for E2E answer generation (default: gpt-4o, options: gpt-4-turbo, gpt-4o-mini)",
     )
     parser.add_argument(
         "--eval-judge-model",
-        default="gpt-5.1",
-        help="Model for lenient evaluation judging (default: gpt-5.1)",
+        default="gpt-4o",
+        help="Model for lenient evaluation judging (default: gpt-4o)",
     )
```
)Also applies to: 1677-1679
🤖 Prompt for AI Agents
In tests/benchmarks/test_locomo.py around lines 1661-1663 and 1677-1679, the
argparse help text lists invalid model names ("gpt-5.1" and "gpt-4.1"); update
the help strings to reference valid model names (for example "gpt-4o" and
"gpt-4o-mini" or whatever current supported models your project uses), e.g.,
replace occurrences of gpt-5.1/gpt-4.1 with the correct supported model
identifiers and ensure the help text options match the actual accepted values.
This pull request introduces major improvements to model configuration and embedding management across the codebase, focusing on flexibility, documentation, and environment variable support. The default embedding model is upgraded to `text-embedding-3-large` (3072 dimensions) for better semantic precision, with clear options to switch to smaller, cost-effective models. All relevant scripts, configuration files, and documentation are updated to support dynamic selection of embedding and classification models via environment variables. Comprehensive instructions for re-embedding memories when changing models are also provided.

**Model Configuration and Embedding Management**

- Default embedding model is now `text-embedding-3-large` (3072d), with `VECTOR_SIZE` defaulting to 3072; both are now configurable via environment variables (`EMBEDDING_MODEL`, `VECTOR_SIZE`).
- Classification and embedding model selection throughout the app and scripts now respects the `CLASSIFICATION_MODEL` and `EMBEDDING_MODEL` environment variables.

**Documentation and Environment Variables**

**Script and Codebase Updates**

- Scripts (`reembed_embeddings.py`, `reclassify_with_llm.py`) refactored to use environment-configured models and vector sizes; improved logging and FalkorDB connection handling.
- Added `nltk` to `requirements-dev.txt` for benchmark evaluation.

These changes provide greater flexibility, clearer documentation, and easier migration between embedding models, ensuring the system is production-ready and cost-optimized. An example environment configuration is sketched below.
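A minimal sketch of the resulting environment-driven configuration, assuming the variable names above (the `CLASSIFICATION_MODEL` default shown here is an assumption):

```python
import os

# Environment-driven model configuration (defaults as described in this PR).
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-large")
VECTOR_SIZE = int(os.getenv("VECTOR_SIZE") or os.getenv("QDRANT_VECTOR_SIZE", "3072"))
CLASSIFICATION_MODEL = os.getenv("CLASSIFICATION_MODEL", "gpt-4o-mini")

# Example migration to the cheaper embedding model (then re-embed existing memories):
#   EMBEDDING_MODEL=text-embedding-3-small VECTOR_SIZE=1536 python scripts/reembed_embeddings.py
```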