
Conversation

@jack-arturo
Member

This pull request introduces major improvements to model configuration and embedding management across the codebase, focusing on flexibility, documentation, and environment variable support. The default embedding model is upgraded to text-embedding-3-large (3072 dimensions) for better semantic precision, with clear options to switch to smaller, cost-effective models. All relevant scripts, configuration files, and documentation are updated to support dynamic selection of embedding and classification models via environment variables. Comprehensive instructions for re-embedding memories when changing models are also provided.

Model Configuration and Embedding Management

  • Default embedding model changed to text-embedding-3-large (3072d), with VECTOR_SIZE defaulting to 3072; both are now configurable via environment variables (EMBEDDING_MODEL, VECTOR_SIZE). [1] [2] [3] [4] [5] [6]
  • Added support for dynamic selection of classification and embedding models in both application logic and scripts, using CLASSIFICATION_MODEL and EMBEDDING_MODEL environment variables. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
  • All embedding and classification model invocations now use the configured environment variables instead of hardcoded model names. [1] [2] [3] [4] [5] [6]
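
For illustration, a minimal sketch of how these environment variables might be read, assuming the variable names from this PR (EMBEDDING_MODEL, CLASSIFICATION_MODEL, VECTOR_SIZE); the CLASSIFICATION_MODEL fallback shown is a placeholder, not necessarily the project's actual default:

```python
import os

# Defaults follow the PR description: text-embedding-3-large at 3072 dimensions.
# The CLASSIFICATION_MODEL fallback below is illustrative only.
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-large")
VECTOR_SIZE = int(os.getenv("VECTOR_SIZE", "3072"))
CLASSIFICATION_MODEL = os.getenv("CLASSIFICATION_MODEL", "gpt-4o-mini")  # placeholder default

def embedding_settings() -> dict:
    """Bundle model settings so scripts and the API read a single source of truth."""
    return {
        "embedding_model": EMBEDDING_MODEL,
        "vector_size": VECTOR_SIZE,
        "classification_model": CLASSIFICATION_MODEL,
    }
```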

Documentation and Environment Variables

  • Extensive documentation updates: environment variable tables, model comparison, cost breakdown, and instructions for switching models and re-embedding memories. [1] [2] [3] [4] [5]
  • Added detailed guide for re-embedding memories when changing embedding model or vector size, including backup and migration steps.
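
The documented re-embedding flow boils down to: fetch stored memories, regenerate vectors with the new model, and upsert them into a collection sized for the new dimensionality. Below is a rough sketch of that idea using the OpenAI and Qdrant clients; the collection name, helper function, and memory payload shape are hypothetical and this is not the actual scripts/reembed_embeddings.py code:

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

EMBEDDING_MODEL = "text-embedding-3-large"  # new default from this PR
VECTOR_SIZE = 3072
COLLECTION = "memories"  # hypothetical collection name

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

def reembed(memories: list[dict]) -> None:
    """Re-embed memory payloads into a collection sized for the new model."""
    # Recreate the collection so its vector size matches the new model (destructive).
    qdrant.recreate_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    for memory in memories:  # each dict is assumed to carry "id" and "content"
        vector = openai_client.embeddings.create(
            model=EMBEDDING_MODEL, input=memory["content"]
        ).data[0].embedding
        qdrant.upsert(
            collection_name=COLLECTION,
            points=[PointStruct(id=memory["id"], vector=vector, payload=memory)],
        )
```

Back up FalkorDB and Qdrant before running anything like this, since recreating the collection drops the existing vectors.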

Script and Codebase Updates

  • Scripts (reembed_embeddings.py, reclassify_with_llm.py) refactored to use environment-configured models and vector sizes; improved logging and FalkorDB connection handling. [1] [2] [3] [4] [5] [6] [7]
  • Added nltk to requirements-dev.txt for benchmark evaluation.

These changes provide greater flexibility, clearer documentation, and easier migration between embedding models, ensuring the system is production-ready and cost-optimized.

This commit addresses potential criticisms of our benchmark methodology:

1. Official F1 Metric (locomo_metrics.py)
   - Implements exact F1 scoring from LoCoMo paper with Porter stemmer
   - Category-specific evaluation (multi-answer F1, adversarial phrase match)
   - Reports both original word-overlap and official F1 metrics (see the F1 sketch after this list)

2. E2E QA Mode (--eval-mode e2e)
   - Generates answers via LLM from retrieved context
   - Matches CORE's evaluation methodology
   - Configurable model (--e2e-model, default gpt-4o-mini)

3. Category 5 (Adversarial) Fix
   - Properly handles questions where answer is NOT in conversation
   - Checks for adversarial_answer in memories (shouldn't be found)
   - No longer inflates Complex Reasoning score

4. Evidence Hint Disabling (--no-evidence-hints)
   - Removes data leakage from using ground truth evidence IDs
   - Provides honest retrieval-only evaluation

5. CORE Benchmark Adapter (core_adapter/)
   - Node.js adapter for CORE's exact benchmark code
   - Enables direct apples-to-apples comparison
   - Documents methodology differences

6. Reproducibility Package (REPRODUCIBILITY.md)
   - Step-by-step reproduction instructions
   - All evaluation modes documented
   - Known differences from CORE explained
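
To make item 1 concrete, here is a minimal sketch of SQuAD-style token F1 with Porter stemming, the general shape of what locomo_metrics.py implements; it is an illustration, not the file's actual code:

```python
import re
import string
from collections import Counter

from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, then stem each token."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return [_stemmer.stem(tok) for tok in text.split()]

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```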

New CLI options:
  --eval-mode {retrieval,e2e}
  --no-official-f1
  --no-evidence-hints
  --e2e-model MODEL
  --f1-threshold FLOAT

Refs: CORE benchmark, LoCoMo ACL 2024 paper
…ions

Key changes:
1. Added --lenient flag for CORE-compatible semantic evaluation
   - Uses LLM to judge if answer 'touches on the same topic'
   - Matches CORE's exact evaluation prompt
   - Fallback to 30% word match (same as CORE; see the sketch after this list)

2. Updated default models to gpt-4.1
   - Better reasoning than gpt-4o-mini
   - Good cost/performance balance
   - Options: gpt-5.1 (best), gpt-4o-mini (cheapest)

3. CORE adapter now uses exact CORE evaluation prompt
   - Found their evaluateService.js: 'be generous with grading'
   - This explains their 88% vs our 36% with strict F1

4. Added --eval-judge-model for configurable judge model
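
As a rough illustration of the fallback in item 1, a 30% word-overlap check might look like the sketch below; the helper name is hypothetical, the LLM-judge path is omitted, and this is not CORE's or this repo's exact code:

```python
import string

def lenient_word_match(expected: str, generated: str, threshold: float = 0.30) -> bool:
    """Fallback check: count the answer as correct when at least `threshold`
    of the expected answer's words appear in the generated answer."""
    expected_words = {w.strip(string.punctuation) for w in expected.lower().split()}
    expected_words.discard("")
    if not expected_words:
        return False
    generated_words = {w.strip(string.punctuation) for w in generated.lower().split()}
    overlap = len(expected_words & generated_words) / len(expected_words)
    return overlap >= threshold

# e.g. lenient_word_match("born in Boston", "She was born in Boston, I believe") -> True (3/3 words)
```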

New CLI options:
  --lenient              Use CORE-compatible lenient eval
  --eval-judge-model     Model for lenient evaluation
  --e2e-model            Now defaults to gpt-4.1

To run with CORE-comparable methodology:
  python test_locomo.py --eval-mode e2e --lenient

Expected result: Should see ~70-85% accuracy instead of 36%

- Default models now gpt-5.1 for best reasoning
- Added --conversations CLI arg to run subset (e.g., --conversations conv-26 conv-30)
- Both Python and Node.js adapters updated

Key changes:
1. Fixed max_tokens -> max_completion_tokens for GPT-5.1 models
   - GPT-5.x requires max_completion_tokens parameter
   - Older models still use max_tokens (see the sketch after this list)

2. Added --conversations flag for selective testing
   - Can now test specific conversations: --conversations conv-26 conv-30
   - Useful for quick iteration and debugging

3. Added BENCHMARK_FINDINGS_2025-12-03.md documenting results:
   - E2E + Lenient (GPT-5.1): 78.89% on conv-26
   - This is within 10% of CORE's claimed 88.24%
   - Strict F1 scores (~36%) don't reflect actual system quality
   - Evaluation methodology is the dominant factor
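
A hedged sketch of the parameter switch in item 1, assuming the benchmark calls the OpenAI chat completions API directly; the model-name check below is illustrative and the actual detection logic may differ:

```python
from openai import OpenAI

client = OpenAI()

def generate_answer(model: str, prompt: str, limit: int = 512) -> str:
    """Newer GPT-5.x-style models reject max_tokens and expect max_completion_tokens."""
    token_arg = "max_completion_tokens" if model.startswith("gpt-5") else "max_tokens"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **{token_arg: limit},
    )
    return response.choices[0].message.content
```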

Results comparison:
  Mode          | Score  | Notes
  --------------|--------|---------------------------
  Retrieval     | 70.49% | Word overlap check
  E2E Strict F1 | 36.15% | Penalizes verbose answers
  E2E Lenient   | 78.89% | CORE-compatible methodology

Introduces a new experimentation framework under tests/benchmarks/experiments for optimizing memory retrieval configurations. Adds experiment configuration, runner, analysis agent, Railway deployment manager, and supporting scripts and tests. Updates .gitignore to exclude experiment results.

… (3072d)

- Add EMBEDDING_MODEL and CLASSIFICATION_MODEL env vars
- Default to text-embedding-3-large for better semantic precision
- Update VECTOR_SIZE default to 3072
- Fix scripts to use config instead of hardcoded models
- Add re-embedding guide to docs
- Fix FalkorDB auth in reembed script
@coderabbitai

coderabbitai bot commented Dec 6, 2025

📝 Walkthrough

This PR introduces configurable embedding and classification models, upgrades default embedding dimensions from 768 to 3072, and adds comprehensive benchmarking infrastructure including LoCoMo evaluation metrics, AI-guided optimization, CORE-compatible adapters, and extensive documentation for reproduction and experimentation.

Changes

Cohort / File(s) Summary
Configuration & Constants
app.py, automem/config.py, automem/api/admin.py
Added EMBEDDING_MODEL and CLASSIFICATION_MODEL as configurable constants; updated VECTOR_SIZE default from 768 to 3072; propagated model configs through initialization and API signatures.
Documentation Updates
README.md, CLAUDE.md, INSTALLATION.md, docs/ENVIRONMENT_VARIABLES.md
Updated architecture diagrams and environment documentation to reflect 3072-d embeddings; added Model Configuration section with pricing, comparison tables, and re-embedding workflow guidance.
Script Updates
scripts/reclassify_with_llm.py, scripts/reembed_embeddings.py
Replaced hardcoded model names with configurable EMBEDDING_MODEL and CLASSIFICATION_MODEL; added environment documentation and authentication handling for FalkorDB.
LoCoMo Metrics & Evaluation
tests/benchmarks/locomo_metrics.py, tests/benchmarks/test_locomo.py
Introduced official LoCoMo evaluation metrics (F1, normalization, category-specific scoring); added E2E evaluation mode, adversarial category handling, lenient evaluation, and extended configuration for eval modes and models.
Experiment Framework
tests/benchmarks/experiments/__init__.py, tests/benchmarks/experiments/experiment_config.py, tests/benchmarks/experiments/experiment_runner.py
Introduced ExperimentConfig, ExperimentResult, and ExperimentRunner classes; defined predefined experiment configurations (baseline, large_embeddings, etc.); implemented grid search, exploration, and ablation study execution.
AI-Guided Optimization
tests/benchmarks/experiments/analysis_agent.py
Implemented autonomous AnalysisAgent leveraging LLM-based reasoning for iterative optimization; tracks best configurations, maintains history, and generates comprehensive reports.
Railway Deployment Manager
tests/benchmarks/experiments/railway_manager.py
Added RailwayManager and LocalTestManager for deploying/managing temporary experiment environments with health checks, Docker support, and asynchronous concurrent deployments.
Benchmark Test & Documentation
tests/benchmarks/experiments/test_e2e_mini.py, tests/benchmarks/experiments/test_minimal.py, tests/benchmarks/experiments/run_optimization.sh, tests/benchmarks/experiments/README.md
Added micro E2E tests, minimal integration tests, and optimization shell script; documented experiment framework capabilities, configuration, and deployment workflows.
CORE Benchmark Adapter
tests/benchmarks/core_adapter/automem-search.js, tests/benchmarks/core_adapter/evaluate.js, tests/benchmarks/core_adapter/package.json, tests/benchmarks/core_adapter/README.md
Implemented Node.js-based CORE-compatible benchmark adapter with AutoMemSearchService, CORE-style answer generation/evaluation, and detailed documentation on methodology comparison.
Benchmark Orchestration & Reporting
tests/benchmarks/run_all_benchmarks.sh, tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md, tests/benchmarks/REPRODUCIBILITY.md
Added master benchmark runner script; comprehensive benchmark findings report with LoCoMo vs CORE methodology analysis; detailed reproducibility guide with quick start and troubleshooting.
Miscellaneous
.gitignore, requirements-dev.txt
Added ignore rule for experiment results; added nltk dependency for benchmark evaluation metrics.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Script
    participant EvaluatorPy as LoCoMo Evaluator<br/>(Python)
    participant OpenAI as OpenAI API
    participant AutoMem as AutoMem Service
    
    User->>EvaluatorPy: run_benchmark(eval_mode='e2e')
    loop Per Conversation
        EvaluatorPy->>AutoMem: POST /memory (store memories)
        AutoMem-->>EvaluatorPy: Success
        EvaluatorPy->>+AutoMem: GET /recall (retrieve context)
        AutoMem-->>-EvaluatorPy: Recalled memories
        
        alt E2E Mode
            EvaluatorPy->>OpenAI: chat.completions.create<br/>(question + context)
            OpenAI-->>EvaluatorPy: Generated answer
            EvaluatorPy->>OpenAI: chat.completions.create<br/>(evaluate answer)
            OpenAI-->>EvaluatorPy: Evaluation score
        else Retrieval Mode
            EvaluatorPy->>EvaluatorPy: check_answer_in_memories<br/>(official F1 + category logic)
        end
        
        EvaluatorPy->>AutoMem: DELETE /admin/memories<br/>(cleanup)
        AutoMem-->>EvaluatorPy: Cleanup complete
    end
    EvaluatorPy-->>User: Results with accuracy breakdown
sequenceDiagram
    participant User as User
    participant AnalysisAgent as AnalysisAgent
    participant ExperimentRunner as ExperimentRunner
    participant RailwayMgr as RailwayManager
    participant AutoMem as AutoMem Instance
    participant LLM as OpenAI LLM
    
    User->>AnalysisAgent: run_autonomous_optimization()
    AnalysisAgent->>AnalysisAgent: Initialize grid configs
    
    loop Iteration ≤ max_iterations
        AnalysisAgent->>ExperimentRunner: run_grid_search(configs)
        
        loop Per Config
            ExperimentRunner->>RailwayMgr: deploy_instance(config)
            RailwayMgr-->>ExperimentRunner: Instance URL
            ExperimentRunner->>AutoMem: Execute benchmark<br/>with config
            AutoMem-->>ExperimentRunner: ExperimentResult<br/>(metrics & accuracy)
            ExperimentRunner->>RailwayMgr: destroy_instance()
        end
        
        ExperimentRunner-->>AnalysisAgent: List[ExperimentResult]
        AnalysisAgent->>AnalysisAgent: Track best config<br/>& score
        
        AnalysisAgent->>LLM: analyze_and_decide<br/>(results summary)
        LLM-->>AnalysisAgent: JSON decision<br/>(continue?, next_experiments)
        
        alt Continue Exploring
            AnalysisAgent->>AnalysisAgent: Generate pending configs<br/>from suggestions
        else Stop
            AnalysisAgent->>AnalysisAgent: Exit loop
        end
    end
    
    AnalysisAgent->>AnalysisAgent: _save_final_report<br/>(best config & history)
    AnalysisAgent-->>User: Optimization complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:

  • LoCoMo Evaluation Logic: The E2E evaluation mode, adversarial category handling, and lenient evaluation criteria in test_locomo.py introduce new scoring paths that require careful verification against official LoCoMo specification.
  • Experiment Framework Architecture: The interaction between ExperimentRunner, AnalysisAgent, and RailwayManager involves complex orchestration, state management, and error handling across multiple files (experiment_runner.py, analysis_agent.py, railway_manager.py).
  • Model Configuration Propagation: Verify that EMBEDDING_MODEL and CLASSIFICATION_MODEL are consistently threaded through all code paths—config loading, initialization, API calls, and scripts (app.py, admin.py, reembed/reclassify scripts).
  • CORE Adapter Fidelity: The Node.js CORE adapter evaluation logic must faithfully reproduce CORE's exact evaluation methodology; cross-reference with CORE's published evaluation code and results reconciliation.
  • Vector Size Migration: The upgrade from 768 to 3072 dimensions requires validation that all vector operations, Qdrant schema, and distance metrics remain compatible across old and new deployments.
  • Re-embedding Workflow: The documented re-embedding steps and cost estimates should be tested in a real migration scenario to ensure data integrity and backward compatibility handling.

Possibly related PRs

Suggested labels

enhancement, benchmarking, configuration, documentation

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.01%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title "Feat/benchmark hardening" is vague and does not clearly reflect the primary changes in the changeset, which focus on model configuration, embedding management, and documentation updates rather than benchmark hardening. | Consider a more descriptive title such as "Feat: configurable embedding and classification models with environment variables" or "Feat: upgrade default embedding model to text-embedding-3-large and add model configuration" to better convey the main changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The pull request description is well-detailed and directly related to the changeset, covering model configuration, embedding management, documentation, and script updates across multiple sections with clear explanations. |

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/benchmark-hardening

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot added the enhancement label (New feature or request) on Dec 6, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (35)
tests/benchmarks/REPRODUCIBILITY.md (1)

177-189: Add language specifiers to fenced code blocks.

The code blocks at lines 177 and 185 should specify their language for proper syntax highlighting and better accessibility.

Apply this change:

 ### Python
-```
+```text
 nltk==3.9.1
 openai>=1.0.0
 requests>=2.28.0
 python-dateutil>=2.8.0

 ### Node.js (CORE adapter)
-```
+```text
 axios>=1.6.0
 openai>=4.20.0
 dotenv>=16.3.0

tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md (1)

90-108: Add language specifiers to fenced code blocks.

The code blocks at lines 90 and 105 should specify their language (use text or plaintext for output blocks) for proper rendering.

Apply this change:

 ### Test 1: E2E + Lenient (conv-26) ✅
-```
+```text
 Overall Accuracy: 78.89% (157/199)
 Time: 606.1s
 ...

 ### Test 2: Full Dataset (10 conversations) - Previous Run
-```
+```text
 E2E + Strict F1: 36.15% (718/1986)
 Retrieval Mode: 70.49% (1400/1986)

tests/benchmarks/core_adapter/automem-search.js (2)

93-96: Consider surfacing errors to callers for better debugging.

The current error handling silently returns empty results, which makes it difficult to distinguish between "no results found" and "API error occurred" during benchmark runs. This could mask connectivity issues, authentication failures, or API errors.

Consider one of these approaches:

  1. Throw errors and let callers handle: Remove the try-catch and let errors propagate
  2. Return error information: Include error details in the response
  3. Add verbose mode: Use an optional throwOnError flag in the constructor

Example for option 2:

     } catch (error) {
       console.error("AutoMem search error:", error.message);
-      return { episodes: [], facts: [] };
+      return { 
+        episodes: [], 
+        facts: [], 
+        error: error.message,
+        _errorDetails: error.response?.data 
+      };
     }

This would allow benchmark runners to detect and report infrastructure issues separately from legitimate empty result sets.


16-28: LGTM with minor suggestion: Consider exposing timeout as configuration.

The implementation correctly uses environment variables with sensible defaults. The 30-second timeout is reasonable for benchmark workloads but might be too long for production use cases.

Optionally expose timeout as a configuration parameter:

   constructor(config = {}) {
     this.baseUrl = config.baseUrl || process.env.AUTOMEM_BASE_URL || "http://localhost:8001";
     this.apiToken = config.apiToken || process.env.AUTOMEM_API_TOKEN || "test-token";
+    this.timeout = config.timeout || 30000;

     this.axios = axios.create({
       baseURL: this.baseUrl,
       headers: {
         Authorization: `Bearer ${this.apiToken}`,
         "Content-Type": "application/json",
       },
-      timeout: 30000,
+      timeout: this.timeout,
     });
   }
tests/benchmarks/experiments/README.md (1)

100-106: Add language specifier to fenced code block.

The code block at line 100 should specify text or plaintext as the language identifier.

Apply this change:

 ## Output Structure

-```
+```text
 experiment_results/
 ├── 20251204_103000_baseline.json       # Individual results
 ├── 20251204_103000_large_embeddings.json
 ├── 20251204_103000_report.txt          # Summary report
 └── optimization_report_20251204.json   # Agent optimization log

tests/benchmarks/run_all_benchmarks.sh (2)

19-24: Add timeout to health check curl command.

The health check curl lacks a timeout, which could cause the script to hang indefinitely if AutoMem is unresponsive.


 # Check if AutoMem is running
-if ! curl -s http://localhost:8001/health > /dev/null; then
+if ! curl -s --connect-timeout 5 --max-time 10 http://localhost:8001/health > /dev/null; then
     echo "❌ AutoMem is not running. Start with: make dev"
     exit 1
 fi

80-86: Inline Python command could fail silently on malformed JSON.

The Python command for extracting accuracy uses 2>/dev/null || echo "N/A" which hides all errors. Consider more explicit error handling or using jq if available.

 for f in "$OUTPUT_DIR"/results_*_$TIMESTAMP.json; do
     if [ -f "$f" ]; then
         name=$(basename "$f" | sed "s/_$TIMESTAMP.json//")
-        accuracy=$(python -c "import json; d=json.load(open('$f')); print(f\"{d.get('overall', {}).get('accuracy', 0)*100:.2f}%\")" 2>/dev/null || echo "N/A")
+        accuracy=$(python3 -c "import json, sys; d=json.load(open('$f')); print(f\"{d.get('overall', {}).get('accuracy', 0)*100:.2f}%\")" 2>/dev/null || echo "N/A")
         echo "  $name: $accuracy"
     fi
 done

Using python3 explicitly ensures Python 3.x is used for f-string support.

tests/benchmarks/experiments/test_minimal.py (4)

34-37: Remove unnecessary f-string prefixes.

Multiple print statements use f-strings without placeholders. This is flagged by Ruff (F541).

-    print(f"\n📋 Test Config:")
+    print("\n📋 Test Config:")
     print(f"   Name: {exp_config.name}")
     print(f"   Recall Limit: {exp_config.recall_limit}")
     
-    print(f"\n🔧 LoCoMo Config:")
+    print("\n🔧 LoCoMo Config:")
...
-    print(f"\n📂 Loading dataset...")
+    print("\n📂 Loading dataset...")
...
-    print(f"   Testing first 5 questions only")
+    print("   Testing first 5 questions only")
...
-    print(f"\n📊 RESULTS:")
+    print("\n📊 RESULTS:")
...
-    print(f"\n✅ Integration test complete!")
+    print("\n✅ Integration test complete!")
...
-    print(f"\n📥 Loading memories...")
+    print("\n📥 Loading memories...")
...
-    print(f"⏳ Waiting for enrichment...")
+    print("⏳ Waiting for enrichment...")
...
-    print(f"\n🧹 Cleaning up test memories...")
+    print("\n🧹 Cleaning up test memories...")

Also applies to: 50-53, 58-60, 68-68, 73-76, 78-79, 87-88, 91-92, 137-138


63-63: Use next(iter(...)) instead of list(...)[0].

Per Ruff (RUF015), prefer next(iter(evaluator.conversations.keys())) for efficiency - avoids creating an intermediate list.

-    conv_id = list(evaluator.conversations.keys())[0]
+    conv_id = next(iter(evaluator.conversations.keys()))

120-126: Prefix unused variable with underscore.

The explanation variable is never used. Per Ruff (RUF059), prefix it with an underscore.

-            is_correct, confidence, explanation = evaluator._evaluate_lenient_semantic(
+            is_correct, confidence, _explanation = evaluator._evaluate_lenient_semantic(
                 question, expected, generated, category
             )
         else:
-            is_correct, confidence, explanation = evaluator.check_answer_in_memories(
+            is_correct, confidence, _explanation = evaluator.check_answer_in_memories(
                 question, expected, recalled, category
             )

90-92: Hardcoded sleep for enrichment is fragile.

A fixed 3-second wait may be insufficient under load or excessive in fast environments. Consider polling the enrichment status endpoint instead.

Consider implementing a polling mechanism:

# Wait for enrichment with polling
max_wait = 10  # seconds
poll_interval = 0.5
waited = 0
while waited < max_wait:
    # Could check enrichment status via API if available
    await asyncio.sleep(poll_interval)
    waited += poll_interval
tests/benchmarks/core_adapter/evaluate.js (2)

41-43: Add validation for OPENAI_API_KEY.

The OpenAI client is initialized without checking if the API key exists, which will fail later with a less helpful error message.

 // Initialize services
 const searchService = new AutoMemSearchService();
+if (!process.env.OPENAI_API_KEY) {
+  console.error("❌ OPENAI_API_KEY environment variable is required");
+  process.exit(1);
+}
 const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

284-288: Guard against division by zero.

If a conversation has no questions, questions.length would be 0, causing a division by zero.

-    const convAcc = (convCorrect / questions.length) * 100;
-    console.log(`  Accuracy: ${convAcc.toFixed(1)}% (${convCorrect}/${questions.length})`);
+    const convAcc = questions.length > 0 ? (convCorrect / questions.length) * 100 : 0;
+    console.log(`  Accuracy: ${convAcc.toFixed(1)}% (${convCorrect}/${questions.length})`);
tests/benchmarks/experiments/test_e2e_mini.py (3)

15-15: Remove unused import.

ExperimentConfig is imported but never used in this file.

-from tests.benchmarks.experiments.experiment_config import ExperimentConfig

44-44: Remove extraneous f prefixes from strings without placeholders.

These strings don't contain any format placeholders:

-    print(f"\n[1/4] Storing memories...")
+    print("\n[1/4] Storing memories...")
-    print(f"\n[2/4] Testing recall and E2E answer generation...")
+    print("\n[2/4] Testing recall and E2E answer generation...")
-    print(f"\n[4/4] Cleaning up...")
+    print("\n[4/4] Cleaning up...")
-        headers={"Authorization": f"Bearer admin"},
+        headers={"Authorization": "Bearer admin"},

Also applies to: 91-91, 158-158, 161-161


100-100: Remove unused variable assignment.

category is assigned but never used.

-        category = qa.get('category', 1)
tests/benchmarks/experiments/experiment_config.py (2)

10-10: Remove unused import.

itertools is imported but never used in this file.

-import itertools

14-36: Enums defined but not used for type safety in dataclass.

EmbeddingModel, EnrichmentModel, and RecallStrategy enums are defined but ExperimentConfig uses str types for corresponding fields. Consider using the enum types for better type safety and IDE support, or document that strings are intentional for flexibility.

Also applies to: 39-68

tests/benchmarks/experiments/railway_manager.py (2)

17-17: Use explicit relative or absolute import.

The import from experiment_config import ExperimentConfig relies on the module being in sys.path. For better clarity and to avoid import issues when running from different directories, use an explicit relative import.

-from experiment_config import ExperimentConfig
+from .experiment_config import ExperimentConfig

51-63: Synchronous subprocess call in async method.

check_railway_cli is an async method but uses synchronous subprocess.run. While this works (it blocks the event loop briefly), consider using asyncio.create_subprocess_exec for consistency with the async pattern.

tests/benchmarks/test_locomo.py (4)

1083-1084: Use explicit Optional type hints (PEP 484).

PEP 484 prohibits implicit Optional. Parameters with default None should use Optional[T] or T | None.

     def check_answer_in_memories(
         self, 
         question: str,
         expected_answer: Any,
         recalled_memories: List[Dict[str, Any]],
-        evidence_dialog_ids: List[str] = None,
-        sample_id: str = None,
+        evidence_dialog_ids: Optional[List[str]] = None,
+        sample_id: Optional[str] = None,
         category: int = 0,
     ) -> Tuple[bool, float, str]:

1438-1438: Use explicit Optional type hint.

-    def run_benchmark(self, cleanup_after: bool = True, conversation_ids: List[str] = None) -> Dict[str, Any]:
+    def run_benchmark(self, cleanup_after: bool = True, conversation_ids: Optional[List[str]] = None) -> Dict[str, Any]:

699-767: Remove unused parameter question from _evaluate_adversarial.

The question parameter is passed but never used in the method body.

     def _evaluate_adversarial(
         self,
-        question: str,
         adversarial_answer: str,
         recalled_memories: List[Dict[str, Any]],
     ) -> Tuple[bool, float, str]:

Then update callers accordingly (line 1370-1371):

                     is_correct, confidence, explanation = self._evaluate_adversarial(
-                        question, adversarial_answer, recalled_memories
+                        adversarial_answer, recalled_memories
                     )

751-752: Remove unused variable assignment.

found_in_memory is assigned but never used.

                 if overlap_ratio > max_overlap:
                     max_overlap = overlap_ratio
-                    found_in_memory = mem.get("id")
tests/benchmarks/locomo_metrics.py (3)

31-56: Duplicate normalize_answer implementation exists.

This function duplicates logic found in tests/benchmarks/test_locomo.py (lines 537-570) with a different implementation (this one removes articles; the other does basic suffix stemming). Having two different normalization approaches could lead to inconsistent evaluation results.

Consider consolidating to a single source of truth and importing where needed.


173-175: Use explicit Optional type hint and document unused parameter.

 def extract_answer_from_context(
-    memories: List[str], question: str, answer_hint: str = None
+    memories: List[str], question: str, answer_hint: Optional[str] = None  # noqa: ARG001
 ) -> str:

If question is reserved for future use (e.g., question-aware extraction), add a comment explaining this.


358-364: Rename unused loop variable.

The loop variable cat is not used within the loop body since data contains all needed information.

-        for cat, data in summary["categories"].items():
+        for _cat, data in summary["categories"].items():
             print(
                 f"  {data['name']:35s}: "
tests/benchmarks/experiments/experiment_runner.py (4)

101-101: Remove duplicate import of json.

json is already imported at the module level (line 15). This duplicate import inside the method is unnecessary.

-            import json
-            with open(locomo_config.data_file, 'r') as f:
+            with open(locomo_config.data_file, 'r') as f:

128-135: Consider logging the full traceback for debugging.

While catching broad exceptions is reasonable here to ensure the experiment runner continues, consider logging the full traceback for easier debugging when experiments fail.

+        import traceback
         except Exception as e:
-            print(f"❌ Experiment failed: {e}")
+            print(f"❌ Experiment failed: {e}")
+            traceback.print_exc()
             experiment_result = ExperimentResult(
                 config=config,
-                errors=[str(e)],
+                errors=[f"{e}\n{traceback.format_exc()}"],

373-373: Remove extraneous f prefix from strings without placeholders.

Multiple strings use f-string syntax without any placeholders. Per static analysis hints at lines 373, 383, 390, 405, 416, 622.

-        report.append(f"\n2. Multi-hop Reasoning (our weakness):")
+        report.append("\n2. Multi-hop Reasoning (our weakness):")

Apply similar fixes to lines 383, 390, 405, 416, and 622.


625-640: Ablation mode result is unused.

The results variable at line 640 is assigned but never used. Either use it for reporting or remove the assignment.

-            results = await runner.run_ablation_study(args.param, values)
+            await runner.run_ablation_study(args.param, values)

Alternatively, ensure results are incorporated into the report that follows.

tests/benchmarks/experiments/analysis_agent.py (4)

139-149: Catch specific exceptions for better error handling.

The broad Exception catch masks the specific failure mode. Since you're using response_format={"type": "json_object"} and json.loads(), consider catching more specific exceptions:

-            decision = json.loads(response.choices[0].message.content)
-            return decision
-            
-        except Exception as e:
+            return json.loads(response.choices[0].message.content)
+
+        except json.JSONDecodeError as e:
+            print(f"❌ Failed to parse LLM response: {e}")
+            return {
+                "continue_exploring": False,
+                "reasoning": f"JSON parse error: {e}",
+                "next_experiments": [],
+                "confidence": 0.0,
+            }
+        except openai.APIError as e:
             print(f"❌ Analysis failed: {e}")
             return {
                 "continue_exploring": False,
-                "reasoning": f"Analysis error: {e}",
+                "reasoning": f"OpenAI API error: {e}",
                 "next_experiments": [],
                 "confidence": 0.0,
             }

You'll need to add import openai or catch self.client exceptions appropriately.


226-230: _pending_configs accessed before explicit initialization - fragile pattern.

self._pending_configs is first accessed at line 230 but only initialized at line 276. While the current control flow ensures this works (iteration 1 uses initial_configs), this creates a fragile pattern where refactoring could introduce an AttributeError.

Initialize _pending_configs in __init__ or at the start of this method:

     async def run_autonomous_optimization(
         self,
         initial_configs: Optional[List[ExperimentConfig]] = None,
         runner: Optional[ExperimentRunner] = None,
     ) -> ExperimentConfig:
+        self._pending_configs: List[ExperimentConfig] = []
+
         if runner is None:

258-264: Remove extraneous f prefixes from strings without placeholders.

Lines 258, 261, and 294 use f-string syntax without any placeholders per static analysis hints.

-            print(f"\n🤔 Analyzing results...")
+            print("\n🤔 Analyzing results...")
             decision = await self.analyze_and_decide(all_results)
             
-            print(f"\n📋 Agent Decision:")
+            print("\n📋 Agent Decision:")

Also at line 294:

-        print(f"\nFinal Config:")
+        print("\nFinal Config:")

358-361: best_config is unused after assignment.

The result of run_autonomous_optimization is assigned but never used. Consider using it in the output message.

-    best_config = await agent.run_autonomous_optimization(runner=runner)
-    
-    print("\n✅ Optimization complete!")
-    print(f"Best configuration saved to: {args.output_dir}")
+    best_config = await agent.run_autonomous_optimization(runner=runner)
+
+    print("\n✅ Optimization complete!")
+    print(f"Best configuration: {best_config.name}")
+    print(f"Results saved to: {args.output_dir}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abc689d and b6c1be7.

⛔ Files ignored due to path filters (1)
  • tests/benchmarks/core_adapter/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (29)
  • .gitignore (1 hunks)
  • CLAUDE.md (2 hunks)
  • INSTALLATION.md (1 hunks)
  • README.md (1 hunks)
  • app.py (5 hunks)
  • automem/api/admin.py (2 hunks)
  • automem/config.py (2 hunks)
  • docs/ENVIRONMENT_VARIABLES.md (3 hunks)
  • requirements-dev.txt (1 hunks)
  • scripts/reclassify_with_llm.py (4 hunks)
  • scripts/reembed_embeddings.py (4 hunks)
  • tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md (1 hunks)
  • tests/benchmarks/REPRODUCIBILITY.md (1 hunks)
  • tests/benchmarks/core_adapter/README.md (1 hunks)
  • tests/benchmarks/core_adapter/automem-search.js (1 hunks)
  • tests/benchmarks/core_adapter/evaluate.js (1 hunks)
  • tests/benchmarks/core_adapter/package.json (1 hunks)
  • tests/benchmarks/experiments/README.md (1 hunks)
  • tests/benchmarks/experiments/__init__.py (1 hunks)
  • tests/benchmarks/experiments/analysis_agent.py (1 hunks)
  • tests/benchmarks/experiments/experiment_config.py (1 hunks)
  • tests/benchmarks/experiments/experiment_runner.py (1 hunks)
  • tests/benchmarks/experiments/railway_manager.py (1 hunks)
  • tests/benchmarks/experiments/run_optimization.sh (1 hunks)
  • tests/benchmarks/experiments/test_e2e_mini.py (1 hunks)
  • tests/benchmarks/experiments/test_minimal.py (1 hunks)
  • tests/benchmarks/locomo_metrics.py (1 hunks)
  • tests/benchmarks/run_all_benchmarks.sh (1 hunks)
  • tests/benchmarks/test_locomo.py (17 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use black for Python code formatting
Use flake8 for Python linting

**/*.py: Use Python with type hints. Indent 4 spaces; line length 100 (Black).
Use snake_case for module and function names in Python.
Use PascalCase for class names in Python.
Use UPPER_SNAKE_CASE for constants in Python.
Format code with Black (line length 100) and organize imports with Isort (profile=black).
Lint with Flake8 and ensure CI passes before committing.
Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.

Files:

  • automem/config.py
  • scripts/reclassify_with_llm.py
  • tests/benchmarks/experiments/__init__.py
  • automem/api/admin.py
  • scripts/reembed_embeddings.py
  • tests/benchmarks/experiments/test_e2e_mini.py
  • tests/benchmarks/experiments/experiment_config.py
  • tests/benchmarks/experiments/analysis_agent.py
  • tests/benchmarks/test_locomo.py
  • tests/benchmarks/experiments/test_minimal.py
  • tests/benchmarks/experiments/experiment_runner.py
  • tests/benchmarks/locomo_metrics.py
  • tests/benchmarks/experiments/railway_manager.py
  • app.py
scripts/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Create utility scripts in scripts/ directory for backup, recovery, data management, and monitoring tasks

Files:

  • scripts/reclassify_with_llm.py
  • scripts/reembed_embeddings.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest with DummyGraph fixture to mock FalkorDB operations for unit testing

Files:

  • tests/benchmarks/experiments/__init__.py
  • tests/benchmarks/experiments/test_e2e_mini.py
  • tests/benchmarks/experiments/experiment_config.py
  • tests/benchmarks/experiments/analysis_agent.py
  • tests/benchmarks/test_locomo.py
  • tests/benchmarks/experiments/test_minimal.py
  • tests/benchmarks/experiments/experiment_runner.py
  • tests/benchmarks/locomo_metrics.py
  • tests/benchmarks/experiments/railway_manager.py
app.py

📄 CodeRabbit inference engine (CLAUDE.md)

app.py: Use Flask for the main API server (port 8001) with request validation and orchestration
Implement 13 API endpoints covering core memory operations (POST /memory, GET /recall, PATCH /memory/, DELETE /memory/, GET /memory/by-tag), relationship management (POST /associate), consolidation (POST /consolidate, GET /consolidate/status), health checks (GET /health), enrichment (GET /enrichment/status, POST /enrichment/reprocess), analysis (GET /analyze), and startup recall (GET /startup-recall)
Expose enrichment metrics at GET /enrichment/status including processed counts, last success/error, queue depth, and inflight jobs
Combine vector similarity, keyword match, tag overlap, and recency in recall scoring
Support authentication via Bearer token, X-API-Key header, or query parameter

Files:

  • app.py
🧠 Learnings (23)
📓 Common learnings
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Use FastEmbed BAAI/bge-base-en-v1.5 model with optional 384 and 1024 dimension variants
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Use FastEmbed BAAI/bge-base-en-v1.5 model with optional 384 and 1024 dimension variants

Applied to files:

  • INSTALLATION.md
  • docs/ENVIRONMENT_VARIABLES.md
  • automem/config.py
  • CLAUDE.md
  • automem/api/admin.py
  • scripts/reembed_embeddings.py
  • app.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Configure core services via environment variables: FALKORDB_HOST, FALKORDB_PORT, FALKORDB_GRAPH, QDRANT_URL, QDRANT_API_KEY, QDRANT_COLLECTION, VECTOR_SIZE, PORT, AUTOMEM_API_TOKEN, ADMIN_API_TOKEN

Applied to files:

  • INSTALLATION.md
  • docs/ENVIRONMENT_VARIABLES.md
  • automem/config.py
  • CLAUDE.md
  • scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Use ENRICHMENT_SPACY_MODEL environment variable to configure spaCy model for entity extraction

Applied to files:

  • INSTALLATION.md
  • docs/ENVIRONMENT_VARIABLES.md
  • automem/config.py
  • app.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Establish temporal (PRECEDED_BY) and semantic (SIMILAR_TO) edges in enrichment pipeline with symmetric cosine scores from Qdrant

Applied to files:

  • INSTALLATION.md
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Use ENRICHMENT_SIMILARITY_LIMIT and ENRICHMENT_SIMILARITY_THRESHOLD for neighbor link discovery via Qdrant

Applied to files:

  • INSTALLATION.md
  • automem/config.py
  • app.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Use Qdrant (vector database on port 6333) for 768-dimensional semantic search with optional availability (graceful degradation when Qdrant is unavailable)

Applied to files:

  • README.md
  • docs/ENVIRONMENT_VARIABLES.md
  • automem/config.py
  • CLAUDE.md
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Never commit secrets; configure sensitive values via environment variables (`AUTOMEM_API_TOKEN`, `ADMIN_API_TOKEN`, `OPENAI_API_KEY`, `FALKORDB_PASSWORD`, `QDRANT_API_KEY`).

Applied to files:

  • docs/ENVIRONMENT_VARIABLES.md
  • automem/config.py
  • scripts/reclassify_with_llm.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Control embedding provider via EMBEDDING_PROVIDER environment variable with values: auto (default), openai, local, or placeholder

Applied to files:

  • docs/ENVIRONMENT_VARIABLES.md
  • CLAUDE.md
  • app.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)

Applied to files:

  • docs/ENVIRONMENT_VARIABLES.md
  • automem/config.py
  • CLAUDE.md
  • automem/api/admin.py
  • scripts/reembed_embeddings.py
  • app.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Use the Agent Memory Protocol: fetch context with recall at task start, persist key steps with POST `/memory`, tag with conventions (`agent`, `repo:<name>`, `task:<id>`, `file:<path>`, `pr:<num>`, `issue:<num>`, `result`, `decision`, `docs`), and associate related memories.

Applied to files:

  • tests/benchmarks/experiments/README.md
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: PRs must include: clear description and scope, linked issues, test plan/output, and notes on API or config changes. Update relevant docs under `docs/`.

Applied to files:

  • tests/benchmarks/REPRODUCIBILITY.md
📚 Learning: 2025-11-08T10:51:31.578Z
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 0
File: :0-0
Timestamp: 2025-11-08T10:51:31.578Z
Learning: In railway-template.json for the automem project, the memory-service connects to FalkorDB using FALKORDB_PASSWORD=${{FalkorDB.FALKOR_PASSWORD}} (referencing the FalkorDB service's FALKOR_PASSWORD variable).

Applied to files:

  • automem/config.py
  • scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Use FalkorDB (graph database on port 6379) for storing Memory nodes and relationship edges with atomic operations and automatic rollback on errors

Applied to files:

  • automem/config.py
  • scripts/reembed_embeddings.py
📚 Learning: 2025-11-08T10:51:31.578Z
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 0
File: :0-0
Timestamp: 2025-11-08T10:51:31.578Z
Learning: In railway-template.json, FalkorDB authentication should use the FALKOR_PASSWORD environment variable alone, without --requirepass in REDIS_ARGS. The FalkorDB Docker image handles password configuration internally via the FALKOR_PASSWORD env var.

Applied to files:

  • automem/config.py
  • scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Support ENRICHMENT_ENABLE_SUMMARIES environment variable to toggle automatic summary creation

Applied to files:

  • automem/config.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Support ENRICHMENT_MAX_ATTEMPTS environment variable to configure retry attempts before giving up

Applied to files:

  • automem/config.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement EmbeddingProvider abstraction with three implementations: OpenAIEmbeddingProvider, FastEmbedProvider, and PlaceholderEmbeddingProvider

Applied to files:

  • CLAUDE.md
  • app.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Core package is located at `automem/` with notable directories: `api/` (Flask blueprints), `utils/`, `stores/`, and `config.py`.

Applied to files:

  • tests/benchmarks/core_adapter/package.json
  • tests/benchmarks/experiments/__init__.py
  • app.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to scripts/migrate*.py : Implement migration tools: migrate_mcp_sqlite.py for importing from legacy SQLite store, reembed_embeddings.py for refreshing embeddings

Applied to files:

  • scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to tests/**/*.py : Use pytest with DummyGraph fixture to mock FalkorDB operations for unit testing

Applied to files:

  • scripts/reembed_embeddings.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Applies to **/*.py : Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.

Applied to files:

  • tests/benchmarks/experiments/test_e2e_mini.py
  • tests/benchmarks/experiments/test_minimal.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to app.py : Implement 13 API endpoints covering core memory operations (POST /memory, GET /recall, PATCH /memory/<id>, DELETE /memory/<id>, GET /memory/by-tag), relationship management (POST /associate), consolidation (POST /consolidate, GET /consolidate/status), health checks (GET /health), enrichment (GET /enrichment/status, POST /enrichment/reprocess), analysis (GET /analyze), and startup recall (GET /startup-recall)

Applied to files:

  • tests/benchmarks/experiments/test_e2e_mini.py
  • app.py
🧬 Code graph analysis (9)
tests/benchmarks/core_adapter/automem-search.js (2)
mcp-sse-server/server.js (1)
  • userId (333-333)
tests/benchmarks/core_adapter/evaluate.js (3)
  • response (69-80)
  • response (144-157)
  • results (214-218)
tests/benchmarks/core_adapter/evaluate.js (2)
tests/benchmarks/core_adapter/automem-search.js (1)
  • AutoMemSearchService (15-110)
tests/conftest.py (1)
  • OpenAI (128-131)
tests/benchmarks/experiments/run_optimization.sh (2)
tests/benchmarks/experiments/experiment_runner.py (1)
  • main (505-653)
tests/benchmarks/experiments/analysis_agent.py (1)
  • main (335-361)
tests/benchmarks/experiments/test_e2e_mini.py (4)
tests/benchmarks/experiments/experiment_config.py (1)
  • ExperimentConfig (40-92)
tests/benchmarks/locomo_metrics.py (1)
  • f1_score (59-89)
tests/benchmarks/core_adapter/evaluate.js (8)
  • openai (43-43)
  • conversations (210-210)
  • results (214-218)
  • i (234-234)
  • qa (235-235)
  • question (236-236)
  • category (238-238)
  • context (244-244)
tests/conftest.py (1)
  • OpenAI (128-131)
tests/benchmarks/experiments/analysis_agent.py (1)
tests/benchmarks/experiments/experiment_config.py (5)
  • ExperimentConfig (40-92)
  • ExperimentResult (96-152)
  • score (126-152)
  • generate_experiment_grid (155-242)
  • to_json (91-92)
tests/benchmarks/experiments/test_minimal.py (2)
tests/benchmarks/experiments/experiment_config.py (1)
  • ExperimentConfig (40-92)
tests/benchmarks/test_locomo.py (2)
  • LoCoMoEvaluator (107-1587)
  • LoCoMoConfig (54-104)
tests/benchmarks/experiments/experiment_runner.py (3)
tests/benchmarks/experiments/experiment_config.py (6)
  • ExperimentConfig (40-92)
  • ExperimentResult (96-152)
  • generate_experiment_grid (155-242)
  • generate_focused_experiments (245-260)
  • score (126-152)
  • to_json (91-92)
tests/benchmarks/experiments/railway_manager.py (6)
  • RailwayManager (32-286)
  • LocalTestManager (289-337)
  • RailwayInstance (21-29)
  • deploy_instance (65-145)
  • destroy_instance (211-237)
  • destroy_all (239-242)
tests/benchmarks/test_locomo.py (3)
  • LoCoMoEvaluator (107-1587)
  • LoCoMoConfig (54-104)
  • run_benchmark (1438-1587)
tests/benchmarks/locomo_metrics.py (1)
tests/benchmarks/test_locomo.py (1)
  • normalize_answer (538-571)
tests/benchmarks/experiments/railway_manager.py (1)
tests/benchmarks/experiments/experiment_config.py (2)
  • ExperimentConfig (40-92)
  • to_env_vars (70-89)
🪛 LanguageTool
tests/benchmarks/experiments/README.md

[grammar] ~160-~160: Ensure spelling is correct
Context: ...Monitor response times - stay under 100ms for production 4. Watch token costs...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
docs/ENVIRONMENT_VARIABLES.md

90-90: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

tests/benchmarks/BENCHMARK_FINDINGS_2025-12-03.md

90-90: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


105-105: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

tests/benchmarks/experiments/README.md

100-100: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


118-118: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


174-174: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

tests/benchmarks/REPRODUCIBILITY.md

177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


185-185: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

tests/benchmarks/core_adapter/README.md

90-90: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.7)
tests/benchmarks/experiments/test_e2e_mini.py

1-1: Shebang is present but file is not executable

(EXE001)


44-44: f-string without any placeholders

Remove extraneous f prefix

(F541)


77-77: Do not catch blind exception: Exception

(BLE001)


91-91: f-string without any placeholders

Remove extraneous f prefix

(F541)


100-100: Local variable category is assigned to but never used

Remove assignment to unused variable category

(F841)


106-106: Probable use of requests call without timeout

(S113)


137-137: Do not catch blind exception: Exception

(BLE001)


158-158: f-string without any placeholders

Remove extraneous f prefix

(F541)


159-159: Probable use of requests call without timeout

(S113)


161-161: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/benchmarks/experiments/analysis_agent.py

1-1: Shebang is present but file is not executable

(EXE001)


140-140: Consider moving this statement to an else block

(TRY300)


142-142: Do not catch blind exception: Exception

(BLE001)


258-258: f-string without any placeholders

Remove extraneous f prefix

(F541)


261-261: f-string without any placeholders

Remove extraneous f prefix

(F541)


294-294: f-string without any placeholders

Remove extraneous f prefix

(F541)


358-358: Local variable best_config is assigned to but never used

Remove assignment to unused variable best_config

(F841)

tests/benchmarks/test_locomo.py

701-701: Unused method argument: question

(ARG002)


752-752: Local variable found_in_memory is assigned to but never used

Remove assignment to unused variable found_in_memory

(F841)


849-849: Consider moving this statement to an else block

(TRY300)


851-851: Do not catch blind exception: Exception

(BLE001)


935-935: Consider moving this statement to an else block

(TRY300)


937-937: Do not catch blind exception: Exception

(BLE001)


1083-1083: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


1084-1084: Unused method argument: category

(ARG002)


1438-1438: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

tests/benchmarks/experiments/test_minimal.py

1-1: Shebang is present but file is not executable

(EXE001)


34-34: f-string without any placeholders

Remove extraneous f prefix

(F541)


50-50: f-string without any placeholders

Remove extraneous f prefix

(F541)


58-58: f-string without any placeholders

Remove extraneous f prefix

(F541)


63-63: Prefer next(iter(evaluator.conversations.keys())) over single element slice

Replace with next(iter(evaluator.conversations.keys()))

(RUF015)


68-68: f-string without any placeholders

Remove extraneous f prefix

(F541)


73-73: f-string without any placeholders

Remove extraneous f prefix

(F541)


78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)


87-87: f-string without any placeholders

Remove extraneous f prefix

(F541)


91-91: f-string without any placeholders

Remove extraneous f prefix

(F541)


124-124: Unpacked variable explanation is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


137-137: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/benchmarks/experiments/experiment_runner.py

1-1: Shebang is present but file is not executable

(EXE001)


128-128: Do not catch blind exception: Exception

(BLE001)


279-279: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


280-280: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


330-330: Local variable worst is assigned to but never used

Remove assignment to unused variable worst

(F841)


373-373: f-string without any placeholders

Remove extraneous f prefix

(F541)


383-383: f-string without any placeholders

Remove extraneous f prefix

(F541)


390-390: f-string without any placeholders

Remove extraneous f prefix

(F541)


405-405: f-string without any placeholders

Remove extraneous f prefix

(F541)


416-416: f-string without any placeholders

Remove extraneous f prefix

(F541)


622-622: f-string without any placeholders

Remove extraneous f prefix

(F541)


640-640: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)

tests/benchmarks/locomo_metrics.py

174-174: Unused function argument: question

(ARG001)


174-174: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


358-358: Loop control variable cat not used within loop body

Rename unused cat to _cat

(B007)

tests/benchmarks/experiments/railway_manager.py

55-55: Starting a process with a partial executable path

(S607)


60-60: Consider moving this statement to an else block

(TRY300)


61-61: Do not catch blind exception: Exception

(BLE001)


82-82: subprocess call: check for execution of untrusted input

(S603)


83-87: Starting a process with a partial executable path

(S607)


91-91: Probable insecure usage of temporary file or directory: "/tmp"

(S108)


104-104: subprocess call: check for execution of untrusted input

(S603)


105-105: Starting a process with a partial executable path

(S607)


112-112: Starting a process with a partial executable path

(S607)


141-141: Consider moving this statement to an else block

(TRY300)


143-143: Do not catch blind exception: Exception

(BLE001)


166-166: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)


173-173: subprocess call: check for execution of untrusted input

(S603)


174-180: Starting a process with a partial executable path

(S607)


205-205: Consider moving this statement to an else block

(TRY300)


207-207: Do not catch blind exception: Exception

(BLE001)


216-216: subprocess call: check for execution of untrusted input

(S603)


217-217: Starting a process with a partial executable path

(S607)


224-224: Starting a process with a partial executable path

(S607)


233-233: Consider moving this statement to an else block

(TRY300)


235-235: Do not catch blind exception: Exception

(BLE001)


260-261: try-except-pass detected, consider logging the exception

(S110)


260-260: Do not catch blind exception: Exception

(BLE001)


265-265: Unused method argument: project_id

(ARG002)


268-268: Starting a process with a partial executable path

(S607)


319-319: Consider moving this statement to an else block

(TRY300)


320-320: Do not catch blind exception: Exception

(BLE001)


334-334: Consider moving this statement to an else block

(TRY300)


335-335: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)
tests/benchmarks/experiments/run_optimization.sh

[error] 143-143: -f doesn't work with globs. Use a for loop.

(SC2144)

🔇 Additional comments (36)
.gitignore (1)

25-26: LGTM!

The new ignore pattern appropriately excludes ephemeral experiment results while preserving the ability to commit notable runs to the main tests/benchmarks/results/ directory. This aligns well with the new benchmark infrastructure.

requirements-dev.txt (1)

10-12: LGTM!

The addition of nltk for LoCoMo benchmark evaluation is appropriate. NLTK provides the Porter stemmer and tokenization needed for official F1 scoring metrics referenced in the benchmark documentation.

tests/benchmarks/core_adapter/package.json (1)

1-18: LGTM!

The package.json correctly configures the CORE adapter as an ES module with appropriate scripts and dependencies. The version pins for axios (^1.6.0), dotenv (^16.3.0), and openai (^4.20.0) are reasonable for benchmark stability.

automem/config.py (1)

50-54: Update config comment to clarify embedding model selection across providers.

The embedding provider architecture correctly implements the fallback pattern (OpenAI → FastEmbed → Placeholder), and FastEmbed is properly supported as a secondary backend. However, the configuration approach differs from the review comment's assumption:

  1. FastEmbed fallback is supported ✓ (confirmed in app.py auto-selection logic)

  2. EMBEDDING_MODEL only applies to OpenAI — Users cannot set EMBEDDING_MODEL=BAAI/bge-base-en-v1.5 for FastEmbed. Instead, FastEmbed model selection is controlled by VECTOR_SIZE:

    • VECTOR_SIZE=384 → BAAI/bge-small-en-v1.5
    • VECTOR_SIZE=768 → BAAI/bge-base-en-v1.5 (default)
    • VECTOR_SIZE=1024 → BAAI/bge-large-en-v1.5
  3. Comment should clarify provider control — Line 52's comment mentions "use VECTOR_SIZE=768 if switching" but only in the context of OpenAI models. Add a note that VECTOR_SIZE also controls the FastEmbed model variant, and that provider selection is controlled by EMBEDDING_PROVIDER env var (auto/openai/local/placeholder), not EMBEDDING_MODEL.

Likely an incorrect or invalid review comment.
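
For reference, a minimal sketch of the VECTOR_SIZE-driven FastEmbed selection described in points 2 and 3 above (the helper name and error handling are illustrative assumptions, not the repository's actual code):

```python
import os

# Hypothetical helper: maps VECTOR_SIZE to the FastEmbed variants listed above.
FASTEMBED_MODELS = {
    384: "BAAI/bge-small-en-v1.5",
    768: "BAAI/bge-base-en-v1.5",   # default
    1024: "BAAI/bge-large-en-v1.5",
}

def select_fastembed_model() -> str:
    vector_size = int(os.getenv("VECTOR_SIZE", "768"))
    if vector_size not in FASTEMBED_MODELS:
        raise ValueError(
            f"No FastEmbed variant for VECTOR_SIZE={vector_size}; expected 384, 768, or 1024"
        )
    return FASTEMBED_MODELS[vector_size]
```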

README.md (1)

101-101: LGTM! Architecture diagram updated to reflect new default embedding dimensions.

The change from 768-d to 3072-d accurately reflects the update to text-embedding-3-large as the default embedding model.

CLAUDE.md (2)

135-138: LGTM! OpenAI embedding configuration updated correctly.

The documentation now accurately reflects the new default model (text-embedding-3-large) and its 3072-dimension output, along with the configurability via EMBEDDING_MODEL and VECTOR_SIZE.


205-205: LGTM! Environment variable documentation updated.

The VECTOR_SIZE default correctly reflects 3072 for the large model, with helpful guidance for the small model alternative.

INSTALLATION.md (1)

371-372: LGTM! Configuration documentation updated correctly.

The table accurately documents the new defaults for VECTOR_SIZE (3072) and EMBEDDING_MODEL (text-embedding-3-large), providing clear guidance for users.

scripts/reclassify_with_llm.py (3)

34-34: LGTM! Classification model now configurable.

The introduction of the CLASSIFICATION_MODEL environment variable enables flexible model selection while maintaining a sensible default.


79-79: LGTM! Model parameter correctly uses configurable value.

The OpenAI API call now uses the CLASSIFICATION_MODEL variable instead of a hardcoded model name, enabling runtime configuration.


172-172: LGTM! Model selection visibility improved.

Logging the classification model during initialization provides helpful debugging information and confirms the active configuration.

tests/benchmarks/experiments/__init__.py (1)

1-20: LGTM! Clean package initialization.

The module provides a well-defined public API by re-exporting key symbols from experiment_config, following standard Python packaging conventions.

scripts/reembed_embeddings.py (3)

48-59: LGTM! FalkorDB authentication handling improved.

The connection logic now properly handles both authenticated and unauthenticated scenarios, with clear logging of the authentication state.


149-154: LGTM! Embedding generation correctly uses configurable model.

The batch embedding logic now uses the EMBEDDING_MODEL environment variable with appropriate logging and dimensions configuration.


8-9: Fix inconsistent EMBEDDING_MODEL default.

The default embedding model in this script is "text-embedding-3-small", but according to the PR objectives and INSTALLATION.md (line 372), the default should be "text-embedding-3-large" to match the new 3072-dimension standard.

Apply this diff:

 Environment:
-    EMBEDDING_MODEL: OpenAI embedding model (default: text-embedding-3-small)
+    EMBEDDING_MODEL: OpenAI embedding model (default: text-embedding-3-large)
     VECTOR_SIZE: Embedding dimension (default: 768, use 3072 for large model)
-    embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
+    embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-large")

Also applies to: 140-142

⛔ Skipped due to learnings
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Use FastEmbed BAAI/bge-base-en-v1.5 model with optional 384 and 1024 dimension variants
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Control embedding provider via EMBEDDING_PROVIDER environment variable with values: auto (default), openai, local, or placeholder
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)
docs/ENVIRONMENT_VARIABLES.md (2)

140-169: Well-documented model configuration section.

The new Model Configuration section provides clear guidance on embedding models, vector sizes, and classification models with helpful comparison tables.


329-367: Comprehensive re-embedding workflow documentation.

The step-by-step guide for re-embedding memories when changing models is thorough and includes backup steps, Qdrant collection recreation, and cost estimates.

tests/benchmarks/run_all_benchmarks.sh (1)

59-69: CORE adapter execution looks correct.

The conditional execution with proper directory navigation and dependency installation is well-handled.

app.py (4)

130-131: Configuration constants properly imported.

The EMBEDDING_MODEL and CLASSIFICATION_MODEL imports enable dynamic model selection as intended.


738-748: Classification model now configurable.

The memory type classification correctly uses CLASSIFICATION_MODEL instead of a hardcoded value.


1116-1124: Embedding provider initialization uses configurable model.

Both explicit and auto-selected OpenAI embedding provider paths now use EMBEDDING_MODEL, enabling runtime configuration.

Also applies to: 1147-1155


3238-3250: Admin blueprint receives embedding model parameter.

The EMBEDDING_MODEL is correctly passed to create_admin_blueprint_full, completing the configuration wiring.

automem/api/admin.py (2)

7-19: Function signature updated to accept embedding model.

The new embedding_model: str parameter enables dynamic model selection for re-embedding operations. The signature change is coordinated with the caller in app.py.


88-92: Re-embedding now uses configurable model.

The embeddings request correctly uses the embedding_model parameter instead of a hardcoded value, aligning with the PR's dynamic model selection objective.

tests/benchmarks/core_adapter/evaluate.js (3)

29-39: Verify OpenAI model names in CONFIG.

Both defaults, model and evalModel (lines 37-38), are set to gpt-4.1, which appears to be a non-standard OpenAI model name. Standard names include gpt-4o, gpt-4o-mini, gpt-4-turbo, etc.

If these are intentional aliases or internal naming, please document this. Otherwise, consider using standard model names:

 const CONFIG = {
   dataFile: path.join(__dirname, "../locomo/data/locomo10.json"),
   outputFile: path.join(__dirname, "evaluation_results.json"),
   searchLimit: 20,
-  model: process.env.EVAL_MODEL || "gpt-4.1",
-  evalModel: process.env.EVAL_JUDGE_MODEL || "gpt-4.1",
+  model: process.env.EVAL_MODEL || "gpt-4o",
+  evalModel: process.env.EVAL_JUDGE_MODEL || "gpt-4o",
 };

96-190: CORE-compatible evaluation logic implemented correctly.

The evaluateAnswer function properly implements CORE's lenient evaluation methodology with:

  • Special handling for adversarial category (5)
  • CORE's exact evaluation prompt
  • Fallback to 30% word match ratio

195-326: Evaluation orchestration is well-structured.

The runEvaluation function properly orchestrates health checks, data loading, per-conversation evaluation, and results aggregation with good progress reporting.

tests/benchmarks/experiments/test_e2e_mini.py (1)

20-184: Good implementation of a self-contained micro E2E test.

The test provides a complete pipeline verification: memory storage → enrichment wait → recall → LLM answer generation → F1 scoring → cleanup. The error handling and progress output are helpful for debugging.

tests/benchmarks/experiments/experiment_config.py (1)

95-152: Well-structured ExperimentResult dataclass with scoring.

The scoring method with configurable weights is a good approach for multi-objective optimization. The normalization of response time and cost scores is reasonable.
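
As a rough illustration of the weighted multi-objective scoring praised here (field names, weights, and normalization formulas are assumptions; the real dataclass in experiment_config.py may differ):

```python
from dataclasses import dataclass

@dataclass
class ExperimentResultSketch:
    """Illustrative stand-in for the real ExperimentResult dataclass."""
    f1: float                   # retrieval/answer quality, 0..1
    avg_response_time_s: float  # mean latency per query
    cost_usd: float             # total API spend for the run

    def score(self, w_quality: float = 0.7, w_speed: float = 0.2, w_cost: float = 0.1) -> float:
        # Map latency and cost onto 0..1 "higher is better" scales, then blend
        # with configurable weights, mirroring the approach described above.
        speed_score = 1.0 / (1.0 + self.avg_response_time_s)
        cost_score = 1.0 / (1.0 + self.cost_usd)
        return w_quality * self.f1 + w_speed * speed_score + w_cost * cost_score
```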

tests/benchmarks/experiments/railway_manager.py (1)

244-263: Good health check implementation with retry logic.

The _wait_for_healthy method properly uses async httpx client with timeout and polling. The 5-second sleep between checks is reasonable for deployment startup times.
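
For context, the polling pattern described here looks roughly like the sketch below (endpoint path, retry count, and interval are assumptions, not the actual _wait_for_healthy implementation):

```python
import asyncio
import httpx

async def wait_for_healthy(base_url: str, attempts: int = 60, interval_s: float = 5.0) -> bool:
    """Poll a /health endpoint until it returns 200 or the attempts are exhausted."""
    async with httpx.AsyncClient(timeout=10.0) as client:
        for _ in range(attempts):
            try:
                resp = await client.get(f"{base_url}/health")
                if resp.status_code == 200:
                    return True
            except httpx.HTTPError:
                pass  # deployment not reachable yet; keep polling
            await asyncio.sleep(interval_s)
    return False
```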

tests/benchmarks/test_locomo.py (1)

769-854: Well-implemented E2E answer generation and lenient evaluation.

The generate_answer_e2e method correctly handles context building with token limits, caching, and category-specific prompting. The evaluate_lenient method provides CORE-compatible semantic evaluation with appropriate fallback logic. Good use of structured JSON responses.

Also applies to: 855-950
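
For illustration, the word-overlap fallback such a lenient evaluator typically ends with might look like this sketch (threshold and tokenization are assumptions; the actual evaluate_lenient logic may differ):

```python
def lenient_word_match(predicted: str, gold: str, threshold: float = 0.3) -> bool:
    """Fallback check: accept if enough of the gold answer's words appear in the prediction."""
    gold_words = set(gold.lower().split())
    if not gold_words:
        return False
    pred_words = set(predicted.lower().split())
    overlap = len(gold_words & pred_words) / len(gold_words)
    return overlap >= threshold
```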

tests/benchmarks/locomo_metrics.py (2)

59-89: Well-implemented official LoCoMo F1 metrics.

The f1_score function correctly implements token-level F1 with Porter stemming as per the LoCoMo paper. The f1_multi_answer function properly handles comma-separated sub-answers with best-match scoring.

Also applies to: 92-116
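
For readers unfamiliar with the metric, token-level F1 with Porter stemming reduces to the following sketch (a simplified version of the idea, not the file's exact code):

```python
from collections import Counter
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def stemmed_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 over Porter-stemmed, lowercased tokens."""
    pred = [_stemmer.stem(t) for t in prediction.lower().split()]
    gold = [_stemmer.stem(t) for t in ground_truth.lower().split()]
    common = Counter(pred) & Counter(gold)  # per-token minimum counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)
```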


211-366: Clean implementation of OfficialLoCoMoEvaluator class.

The evaluator properly tracks results by category and overall, with clear methods for accuracy, mean F1, and summary generation. The print_summary method provides well-formatted output.

tests/benchmarks/experiments/analysis_agent.py (1)

151-184: LGTM with a minor suggestion.

The key normalization and mapping logic handles LLM naming variations well. Consider using dataclasses.asdict() for cleaner dataclass handling, though __dict__.copy() works fine for this use case.
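
The asdict() alternative mentioned above would look roughly like this (the dataclass name and fields are illustrative):

```python
from dataclasses import dataclass, asdict

@dataclass
class ParamSuggestion:  # illustrative; not the actual class in analysis_agent.py
    vector_weight: float
    graph_weight: float

suggestion = ParamSuggestion(vector_weight=0.6, graph_weight=0.4)

# asdict() also converts nested dataclasses recursively,
# whereas __dict__.copy() is a shallow copy of instance attributes.
nested_safe = asdict(suggestion)
shallow = suggestion.__dict__.copy()
```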

tests/benchmarks/experiments/experiment_runner.py (2)

242-257: The deploy_instance and destroy_instance methods in RailwayManager are properly defined as async methods (lines 65 and 211 of railway_manager.py), so awaiting them in the experiment_runner.py code is correct and will not raise a TypeError.

Likely an incorrect or invalid review comment.


287-290: Weight normalization is correct—recency_weight is independent.

The normalization of vector_weight and graph_weight to sum to 1.0 is intentional. recency_weight is not part of this constraint; it is applied separately as an independent weighting factor. The _generate_mutations method correctly excludes recency_weight from the mutation space, ensuring it remains at its default value while only the hybrid recall weights are normalized.
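
A small sketch of the constraint described here, assuming the weights are plain floats (the fallback for an all-zero pair is an added assumption):

```python
def normalize_recall_weights(vector_weight: float, graph_weight: float) -> tuple[float, float]:
    """Rescale the hybrid recall weights to sum to 1.0; recency_weight stays independent."""
    total = vector_weight + graph_weight
    if total == 0:
        return 0.5, 0.5  # assumed fallback: even split when both weights are zero
    return vector_weight / total, graph_weight / total
```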

# Qdrant / FalkorDB configuration
COLLECTION_NAME = os.getenv("QDRANT_COLLECTION", "memories")
-VECTOR_SIZE = int(os.getenv("VECTOR_SIZE") or os.getenv("QDRANT_VECTOR_SIZE", "768"))
+VECTOR_SIZE = int(os.getenv("VECTOR_SIZE") or os.getenv("QDRANT_VECTOR_SIZE", "3072"))

⚠️ Potential issue | 🟠 Major

Breaking change: Document the migration path for existing deployments.

Changing the default VECTOR_SIZE from 768 to 3072 is a breaking change that will cause dimension mismatches in existing Qdrant collections. Existing deployments with 768-dimensional embeddings cannot simply update this value without re-embedding all memories.

Consider one of these approaches:

  1. Add migration documentation: Create a migration guide that explicitly instructs users to:

    • Back up existing data
    • Run the re-embedding script (scripts/reembed_embeddings.py)
    • Verify the new embeddings before switching
  2. Add runtime validation: Check if the Qdrant collection's dimension matches VECTOR_SIZE and fail fast with a clear error message if they don't match (see the sketch below).

Based on learnings, the system should support graceful degradation when Qdrant is unavailable, but dimension mismatches will cause runtime errors.
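
A minimal sketch of the suggested fail-fast check, assuming the qdrant-client library and a single unnamed vector configuration (the function name and wiring are illustrative):

```python
from qdrant_client import QdrantClient

def assert_vector_size(client: QdrantClient, collection: str, expected: int) -> None:
    """Fail fast at startup if the existing collection's dimension differs from VECTOR_SIZE."""
    info = client.get_collection(collection)
    vectors = info.config.params.vectors  # VectorParams for a single unnamed vector
    actual = getattr(vectors, "size", None)
    if actual is not None and actual != expected:
        raise RuntimeError(
            f"Qdrant collection '{collection}' has dimension {actual}, but VECTOR_SIZE={expected}; "
            "re-embed with scripts/reembed_embeddings.py before switching models"
        )
```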

🤖 Prompt for AI Agents
In automem/config.py around line 13, the VECTOR_SIZE default was changed to 3072
which breaks existing Qdrant collections using 768; add a migration note and
runtime validation: update docs (or a new docs/migrations.md) with explicit
steps to back up data, run scripts/reembed_embeddings.py, and verify embeddings
before switching, and implement a runtime check where the app queries the Qdrant
collection metadata on startup, compares its vector dimension to VECTOR_SIZE,
and fails fast with a clear error message (including current collection
dimension and expected VECTOR_SIZE) if they differ to prevent silent runtime
errors.

Comment on lines +162 to +168
**Classification Model Pricing (Dec 2025):**
| Model | Input | Output | Notes |
|-------|-------|--------|-------|
| `gpt-4o-mini` | $0.15/1M | $0.60/1M | **Default** - Good enough for classification |
| `gpt-4.1` | ~$2/1M | ~$8/1M | Better reasoning |
| `gpt-5.1` | $1.25/1M | $10/1M | Best reasoning, use for benchmarks |


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

OpenAI GPT model names December 2025 current available models

💡 Result:

Here are the major OpenAI GPT-family models publicly listed/available as of Dec 6, 2025:

Note: press/industry reports in early December 2025 mention OpenAI work on newer frontier models (e.g., GPT‑5.x), but official releases or API availability should be confirmed on OpenAI’s site. (theverge.com)

If you want, I can open the specific OpenAI API model docs pages and list exact model aliases/pricing/snapshots.


Replace non-existent model names with confirmed OpenAI models.

The pricing table references gpt-4.1 (preview/deprecated status) and gpt-5.1 (not officially released). Use confirmed model names like gpt-4o, gpt-4-turbo, or o3 instead. If experimental models must be referenced, clearly document their preview/beta status and note they may not be available via the standard API.

🤖 Prompt for AI Agents
In docs/ENVIRONMENT_VARIABLES.md around lines 162 to 168, the pricing table
lists non-existent or deprecated model names (gpt-4.1, gpt-5.1); replace those
with confirmed OpenAI model names such as gpt-4o, gpt-4-turbo, or o3 (choose the
appropriate model for the stated pricing/quality), update the Model column
values accordingly, and if you must reference experimental or preview models
mark them explicitly as “preview/beta” with a short note that they may not be
available via the standard API; ensure the Notes column reflects the new model
names and their status.

Comment on lines +84 to +88
```javascript
// In services/search.server.js
import { AutoMemSearchService } from './automem-search.js';
export default new AutoMemSearchService();
```

⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block.

The JavaScript code block should specify the language for proper syntax highlighting.

Apply this diff:

-   ```
+   ```javascript
    // In services/search.server.js
    import { AutoMemSearchService } from './automem-search.js';
    export default new AutoMemSearchService();
-   ```
+   ```
🤖 Prompt for AI Agents
In tests/benchmarks/core_adapter/README.md around lines 84 to 88, the fenced
code block lacks a language specifier; update the opening fence to
"```javascript" (ensure the opening line reads ```javascript) and keep the
closing fence as "```" so the block is properly marked for JavaScript syntax
highlighting.


    def __init__(
        self,
        model: str = "gpt-5.1",

⚠️ Potential issue | 🟠 Major

Non-existent model name "gpt-5.1" used as default.

Same issue as in experiment_runner.py - "gpt-5.1" is not a valid OpenAI model name. This will cause API errors at runtime.

     def __init__(
         self,
-        model: str = "gpt-5.1",
+        model: str = "gpt-4o",

Also update line 341 in main():

-    parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")
+    parser.add_argument("--model", type=str, default="gpt-4o", help="LLM model for analysis")
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/analysis_agent.py around line 69, the default
model is set to the non-existent "gpt-5.1"; change the default to a valid OpenAI
model name (e.g., "gpt-4o" or "gpt-4") to prevent runtime API errors, and also
update the corresponding model reference at line 341 inside main() to the same
valid model string so both defaults stay consistent.

Comment on lines +26 to +28
GPT_4_1 = "gpt-4.1"
GPT_5_1 = "gpt-5.1"
CLAUDE_HAIKU = "claude-3-haiku"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

valid OpenAI GPT model names December 2025 gpt-4 gpt-4o gpt-5

💡 Result:

As of December 6, 2025:

  • gpt-4 — still an API model (no longer the default in ChatGPT; some GPT‑4 snapshots are being deprecated). [1][2]
  • gpt-4o — current OpenAI multimodal "o‑series" model (available in API/ChatGPT). [1][2]
  • gpt-5 — released and available (API names include gpt-5 and smaller variants like gpt-5-mini/nano). [3]

Sources: [1] OpenAI model docs (gpt-4o). [2] OpenAI ChatGPT release notes (sunsetting GPT‑4 / GPT‑4o updates). [3] TechCrunch reporting on GPT‑5 release.


🌐 Web query:

gpt-4-turbo OpenAI December 2025 deprecated

💡 Result:

Short answer: GPT-4‑Turbo itself is not fully shut down as of Dec 6, 2025, but some older GPT‑4 snapshots/previews (including gpt-4-turbo-preview / gpt-4-0125-preview) have been marked deprecated with a scheduled shutdown on March 26, 2026 — OpenAI’s docs list recommended replacements (e.g., gpt-5 or gpt-4.1). [1][2]

Sources:

  • OpenAI GPT-4 Turbo model page (snapshots & deprecation status). [1]
  • OpenAI Deprecations page — legacy GPT snapshot deprecation list (includes gpt-4-turbo-preview → shutdown 2026‑03‑26). [2]

Replace invalid OpenAI model names with valid alternatives.

gpt-4.1 and gpt-5.1 are not valid OpenAI model identifiers. Use current, stable model names instead:

 class EnrichmentModel(Enum):
     """LLM model for enrichment/extraction"""
     GPT_4O_MINI = "gpt-4o-mini"
     GPT_4O = "gpt-4o"
-    GPT_4_1 = "gpt-4.1"
-    GPT_5_1 = "gpt-5.1"
+    GPT_4 = "gpt-4"
+    GPT_5 = "gpt-5"
     CLAUDE_HAIKU = "claude-3-haiku"

Note: gpt-4-turbo variants are being phased out (deprecated with March 2026 shutdown). Use gpt-4o, gpt-4, or gpt-5 instead.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +106 to +114
resp = requests.get(
    f"{BASE_URL}/recall",
    headers=headers,
    params={"query": question, "tags": test_tag, "limit": 5}
)

if resp.status_code != 200:
    print(f" ❌ Recall failed: {resp.status_code}")
    continue

⚠️ Potential issue | 🟡 Minor

Add timeout to recall request for reliability.

The recall request lacks a timeout, which could cause the test to hang indefinitely if the service is unresponsive.

         resp = requests.get(
             f"{BASE_URL}/recall",
             headers=headers,
-            params={"query": question, "tags": test_tag, "limit": 5}
+            params={"query": question, "tags": test_tag, "limit": 5},
+            timeout=10
         )
🧰 Tools
🪛 Ruff (0.14.7)

106-106: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 106 to 114, the
requests.get call to the /recall endpoint has no timeout which can hang tests;
add a timeout parameter (e.g. timeout=10) to the requests.get call and wrap the
request in a try/except catching requests.exceptions.Timeout (and
requests.exceptions.RequestException) to log the timeout/error and continue the
loop so the test suite remains reliable.
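
Combining the timeout with the exception handling the prompt describes might look like this (the helper wrapper is illustrative; the test currently calls requests.get inline inside its question loop):

```python
import requests

def recall_with_timeout(base_url: str, headers: dict, question: str, test_tag: str):
    """Recall helper that cannot hang the test; returns parsed JSON or None on failure."""
    try:
        resp = requests.get(
            f"{base_url}/recall",
            headers=headers,
            params={"query": question, "tags": test_tag, "limit": 5},
            timeout=10,
        )
    except requests.exceptions.RequestException as exc:  # covers Timeout as a subclass
        print(f"  ❌ Recall request failed: {exc}")
        return None
    if resp.status_code != 200:
        print(f"  ❌ Recall failed: {resp.status_code}")
        return None
    return resp.json()
```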

Comment on lines +159 to +164
resp = requests.delete(
    f"{BASE_URL}/admin/memories",
    headers={"Authorization": f"Bearer admin"},
    params={"tag": test_tag, "confirm": "yes"}
)
print(f" Cleanup: {resp.status_code}")

⚠️ Potential issue | 🟡 Minor

Add timeout to cleanup request.

The DELETE request lacks a timeout, similar to the recall request.

     resp = requests.delete(
         f"{BASE_URL}/admin/memories",
         headers={"Authorization": "Bearer admin"},
-        params={"tag": test_tag, "confirm": "yes"}
+        params={"tag": test_tag, "confirm": "yes"},
+        timeout=30
     )
🧰 Tools
🪛 Ruff (0.14.7)

159-159: Probable use of requests call without timeout

(S113)


161-161: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 159-164 the
cleanup DELETE request lacks a timeout; add the same timeout parameter used for
the recall request (e.g., timeout=TIMEOUT or timeout=REQUEST_TIMEOUT) to the
requests.delete call (or use a sensible constant like timeout=5 if no shared
variable exists) so the test won’t hang on network issues.

Comment on lines +24 to +28
try:
    ps = PorterStemmer()
except LookupError:
    nltk.download("punkt")
    ps = PorterStemmer()

⚠️ Potential issue | 🟡 Minor

Incorrect exception handling for NLTK initialization.

PorterStemmer() itself doesn't raise LookupError. The download is only needed for tokenizers (like punkt), but Porter stemmer doesn't require punkt. This try/except block is misleading and the download may never be triggered when actually needed.

If you intend to ensure nltk data is available, consider a more explicit check or move the download to a setup step.

-# Initialize stemmer
-try:
-    ps = PorterStemmer()
-except LookupError:
-    nltk.download("punkt")
-    ps = PorterStemmer()
+# Initialize stemmer (PorterStemmer doesn't require nltk data downloads)
+ps = PorterStemmer()
🤖 Prompt for AI Agents
In tests/benchmarks/locomo_metrics.py around lines 24 to 28, the try/except
around PorterStemmer() is incorrect because PorterStemmer() does not raise
LookupError and the punkt download is unrelated; replace this block by directly
instantiating PorterStemmer (remove the try/except and nltk.download call) or,
if you really need to ensure NLTK tokenizer data, explicitly check for
tokenizers/punkt using nltk.data.find('tokenizers/punkt') and only call
nltk.download('punkt') inside a LookupError handler for that check; keep the
initialization of ps = PorterStemmer() separate from any data-download logic.
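
If tokenizer data is genuinely needed elsewhere in the module, the explicit check described in the prompt would look roughly like this sketch:

```python
import nltk
from nltk.stem import PorterStemmer

# PorterStemmer needs no corpus downloads; only tokenizers such as punkt do.
ps = PorterStemmer()

try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")
```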

Comment on lines +91 to +92
# Models: gpt-5.1 (best), gpt-4.1 (good balance), gpt-4o-mini (cheapest)
e2e_model: str = "gpt-5.1" # Model for answer generation

⚠️ Potential issue | 🟠 Major

Invalid OpenAI model names in defaults.

gpt-5.1 and gpt-4.1 are not valid OpenAI model identifiers. These defaults will cause API errors.

     # E2E QA settings
     # Models: gpt-5.1 (best), gpt-4.1 (good balance), gpt-4o-mini (cheapest)
-    e2e_model: str = "gpt-5.1"  # Model for answer generation
+    e2e_model: str = "gpt-4o"  # Model for answer generation
     e2e_max_context_tokens: int = 4000  # Max tokens of context to include
     # Model for lenient evaluation judging
-    eval_judge_model: str = "gpt-5.1"
+    eval_judge_model: str = "gpt-4o"

Also update the comment on line 91 to reflect valid model options.

Also applies to: 104-104

Comment on lines +1661 to +1663
"--e2e-model",
default="gpt-5.1",
help="Model for E2E answer generation (default: gpt-5.1, options: gpt-4.1, gpt-4o-mini)",

⚠️ Potential issue | 🟠 Major

CLI help text references invalid model names.

The argparse help text mentions gpt-5.1 and gpt-4.1 which are not valid models.

     parser.add_argument(
         "--e2e-model",
-        default="gpt-5.1",
-        help="Model for E2E answer generation (default: gpt-5.1, options: gpt-4.1, gpt-4o-mini)",
+        default="gpt-4o",
+        help="Model for E2E answer generation (default: gpt-4o, options: gpt-4-turbo, gpt-4o-mini)",
     )
     parser.add_argument(
         "--eval-judge-model",
-        default="gpt-5.1",
-        help="Model for lenient evaluation judging (default: gpt-5.1)",
+        default="gpt-4o",
+        help="Model for lenient evaluation judging (default: gpt-4o)",
     )

Also applies to: 1677-1679

🤖 Prompt for AI Agents
In tests/benchmarks/test_locomo.py around lines 1661-1663 and 1677-1679, the
argparse help text lists invalid model names ("gpt-5.1" and "gpt-4.1"); update
the help strings to reference valid model names (for example "gpt-4o" and
"gpt-4o-mini" or whatever current supported models your project uses), e.g.,
replace occurrences of gpt-5.1/gpt-4.1 with the correct supported model
identifiers and ensure the help text options match the actual accepted values.
