Why
Recent changes should improve temporal reasoning (24.8% → ?) and preference extraction (0% → ?):
- Extraction prompt now embeds session dates in memory content
- Extraction prompt has stronger preference guidance
- QA prompt handles recommendation questions
- Retrieval: top-k 10→20, keyword search augmentation, preference query expansion
Plan
Re-run only the 163 affected questions (temporal-reasoning: 133 + single-session-preference: 30) instead of all 500, then merge the results into the existing results file.
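The filter-then-merge plan could look roughly like the sketch below. It is not the harness's actual code: the function names are hypothetical, and it assumes each question and result record is a dict with `question_id` and `question_type` keys and that the results file holds a list of such records.

```python
# Question types to re-run, as listed in the plan.
RERUN_TYPES = {"temporal-reasoning", "single-session-preference"}


def select_questions(questions, types=RERUN_TYPES):
    """Keep only the questions whose type is in `types` (hypothetical helper)."""
    return [q for q in questions if q["question_type"] in types]


def merge_results(existing, rerun):
    """Replace existing records with re-run records that share a question_id.

    The order of `existing` is preserved; re-run records whose id was not
    present before are appended at the end.
    """
    fresh = {r["question_id"]: r for r in rerun}
    # pop() returns the fresh record when there is one, otherwise the old record.
    merged = [fresh.pop(r["question_id"], r) for r in existing]
    merged.extend(fresh.values())
    return merged
```

Keying the merge on `question_id` leaves the other question types untouched, so the regenerated report mixes old and new results without re-scoring anything that was not re-run.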
Speed optimizations
- `--types` flag on the harness for filtering by question type
Cache strategy
Execution
- Run `python longmemeval_harness.py --variant m --types temporal-reasoning,single-session-preference`
- Merge the results into `longmemeval_m_judged.json` and regenerate the report
Expected
- ~3.7 hours runtime (vs. ~18 hours for a full run)
- ~$330 cost (vs. $1K+ for a full run)
- Temporal reasoning should improve from date-aware extraction + keyword search
- Preference should improve from 0% to something nonzero
Acceptance criteria
- Temporal reasoning EM > 30% (up from 24.8%)
- Preference LLM-judge > 20% (up from 6.7%)
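One way to check the acceptance thresholds (temporal reasoning EM > 30%, preference LLM-judge > 20%) against the merged results is sketched below. The field names `exact_match` and `judge_correct` are assumptions made for illustration, not the harness's actual schema. Each record is assumed to carry a `question_type` and a boolean for the relevant metric.

```python
# (question type, metric field) -> minimum accuracy from the acceptance criteria.
# The metric field names are hypothetical.
THRESHOLDS = {
    ("temporal-reasoning", "exact_match"): 0.30,
    ("single-session-preference", "judge_correct"): 0.20,
}


def accuracy(records, qtype, field):
    """Fraction of `qtype` records whose `field` is truthy; None if there are none."""
    hits = [bool(r[field]) for r in records if r["question_type"] == qtype]
    return sum(hits) / len(hits) if hits else None


def check_acceptance(records, thresholds=THRESHOLDS):
    """Return {(qtype, field): (accuracy, passed)}.

    A criterion passes only when accuracy is strictly above its threshold;
    a question type with no records counts as failed.
    """
    report = {}
    for (qtype, field), minimum in thresholds.items():
        acc = accuracy(records, qtype, field)
        report[(qtype, field)] = (acc, acc is not None and acc > minimum)
    return report
```

The comparison is strict (`>`) to match the wording of the criteria, so a score exactly at the threshold does not pass.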