eval: targeted LongMemEval-M re-run (temporal + preference only) #79

@hdviettt

Description

Why

Recent changes should improve temporal reasoning (24.8% → ?) and preference extraction (0% → ?):

  • Extraction prompt now embeds session dates in memory content
  • Extraction prompt has stronger preference guidance
  • QA prompt handles recommendation questions
  • Retrieval: top-k 10→20, keyword search augmentation, preference query expansion
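The preference query expansion mentioned above could look something like the sketch below: detect preference-flavored questions and issue extra keyword-heavy variants alongside the original query. The function and cue list (`expand_preference_query`, `PREFERENCE_CUES`) are illustrative names, not the harness's actual identifiers.

```python
# Hypothetical sketch of preference query expansion: when a question looks
# preference-related, add synonym-rich variants so keyword search can match
# memories that state the preference indirectly.
PREFERENCE_CUES = ("prefer", "favorite", "like", "recommend", "enjoy")

def expand_preference_query(question: str) -> list[str]:
    """Return the original query plus expansions for preference questions."""
    queries = [question]
    lowered = question.lower()
    if any(cue in lowered for cue in PREFERENCE_CUES):
        # Expansions bias retrieval toward memories recording stated preferences.
        queries.append(question + " user preference likes dislikes")
        queries.append(question + " stated favorite previously mentioned")
    return queries
```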

Plan

Re-run only the 163 affected questions (temporal-reasoning: 133 + single-session-preference: 30) instead of all 500, then merge the results into the existing results file.

Speed optimizations

  • Add --types flag to harness for filtering by question type
  • Increase parallel workers from 10 to 30
  • Consider Anthropic Message Batches API (50% cost discount, async)
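A minimal sketch of the proposed `--types` flag, assuming each question record carries a `question_type` field (as LongMemEval records do); the flag name matches the issue, the surrounding function names are illustrative.

```python
# Sketch of the --types filter for the harness CLI (argparse-based).
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--variant", default="m")
    p.add_argument("--types", default=None,
                   help="comma-separated question types to run, e.g. "
                        "temporal-reasoning,single-session-preference")
    return p.parse_args(argv)

def filter_questions(questions, types_arg):
    """Keep only questions whose type is in the comma-separated --types value."""
    if not types_arg:
        return questions  # no flag: run everything
    wanted = {t.strip() for t in types_arg.split(",")}
    return [q for q in questions if q["question_type"] in wanted]
```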

Cache strategy

  • Clear extraction cache entries for the 163 targeted questions only (new prompt needs fresh extractions)
  • Keep cache for the 337 untouched questions

Execution

  • Run: python longmemeval_harness.py --variant m --types temporal-reasoning,single-session-preference
  • Re-run the LLM judge on the 163 new results
  • Merge into longmemeval_m_judged.json and regenerate report

Expected

  • ~3.7 hours runtime (vs 18h full)
  • ~$330 cost (vs $1K+ full)
  • Temporal reasoning should improve from date-aware extraction + keyword search
  • Preference should improve from 0% to something nonzero

Acceptance criteria

  • Temporal reasoning EM > 30% (up from 24.8%)
  • Preference LLM-judge > 20% (up from 6.7%)


Labels: eval (Evaluation pipeline)
