eval: targeted LongMemEval-M re-run (temporal + preference only) #79

@hdviettt

Description

Why

Recent changes should improve temporal reasoning (24.8% → ?) and preference extraction (0% → ?):

  • Extraction prompt now embeds session dates in memory content
  • Extraction prompt has stronger preference guidance
  • QA prompt handles recommendation questions
  • Retrieval: top-k 10→20, keyword search augmentation, preference query expansion
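The preference query expansion mentioned above could look something like the sketch below: detect preference-flavored questions and issue extra keyword-heavy variants alongside the original query. The function and cue list (`expand_preference_query`, `PREFERENCE_CUES`) are illustrative names, not the harness's actual identifiers.

```python
# Hypothetical sketch of preference query expansion: when a question looks
# preference-related, add synonym-rich variants so keyword search can match
# memories that state the preference indirectly.
PREFERENCE_CUES = ("prefer", "favorite", "like", "recommend", "enjoy")

def expand_preference_query(question: str) -> list[str]:
    """Return the original query plus expansions for preference questions."""
    queries = [question]
    lowered = question.lower()
    if any(cue in lowered for cue in PREFERENCE_CUES):
        # Expansions bias retrieval toward memories recording stated preferences.
        queries.append(question + " user preference likes dislikes")
        queries.append(question + " stated favorite previously mentioned")
    return queries
```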

Plan

Re-run only the 163 affected questions (temporal-reasoning: 133 + single-session-preference: 30) instead of all 500, then merge the results into the existing results file.

Speed optimizations

  • Add --types flag to harness for filtering by question type
  • Increase parallel workers from 10 to 30
  • Consider Anthropic Message Batches API (50% cost discount, async)
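A minimal sketch of the proposed `--types` flag, assuming each question record carries a `question_type` field (as LongMemEval records do); the flag name matches the issue, the surrounding function names are illustrative.

```python
# Sketch of the --types filter for the harness CLI (argparse-based).
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--variant", default="m")
    p.add_argument("--types", default=None,
                   help="comma-separated question types to run, e.g. "
                        "temporal-reasoning,single-session-preference")
    return p.parse_args(argv)

def filter_questions(questions, types_arg):
    """Keep only questions whose type is in the comma-separated --types value."""
    if not types_arg:
        return questions  # no flag: run everything
    wanted = {t.strip() for t in types_arg.split(",")}
    return [q for q in questions if q["question_type"] in wanted]
```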

Cache strategy

  • Clear extraction cache entries for the 163 targeted questions only (new prompt needs fresh extractions)
  • Keep cache for the 337 untouched questions

Execution

  • Run: python longmemeval_harness.py --variant m --types temporal-reasoning,single-session-preference
  • Re-run the LLM judge on the 163 new results
  • Merge into longmemeval_m_judged.json and regenerate report

Expected

  • ~3.7 hours runtime (vs 18h full)
  • ~$330 cost (vs $1K+ full)
  • Temporal reasoning should improve from date-aware extraction + keyword search
  • Preference should improve from 0% to something nonzero

Acceptance criteria

  • Temporal reasoning EM > 30% (up from 24.8%)
  • Preference LLM-judge > 20% (up from 6.7%)


Labels: eval (Evaluation pipeline)
