ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering.
Paper: According to Me: Long-Term Personalized Referential Memory QA
Project Page: https://atmbench.github.io/
- ATM-Bench: Long-Term Personalized Referential Memory QA
- 2026-03-03: arXiv paper release (2603.01990)
- 2026-03-04: Initial codebase release, including baseline implementations for MMRAG, Oracle, NIAH, and four ported third-party baselines (A-Mem, HippoRAG2, mem0, MemoryOS).
- Coming soon: ATM-Bench data release
- Coming soon: Implementations for benchmarking on OpenClaw, Codex, and OpenCode
Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. ATM-Bench addresses this gap with:
- Multimodal and multi-source data: images, videos, and emails
- Long-term horizon: ~4 years of personal memory
- Referential queries: resolving personalized references (e.g., "Show me the moments where Grace was trying to be sneaky...")
- Evidence-grounded: human-annotated QA pairs with ground-truth memory evidence
- Multi-evidence reasoning: queries requiring evidence from multiple sources
- Conflicting evidence: handling contradictory information
Memory Ingestion is decomposed into:
- Memory preprocessing (how each memory item is represented)
- Memory organization (how items are structured/linked)
We compare two preprocessing representations:
- Descriptive Memory (DM): each memory item is represented as one natural-language description.
- Schema-Guided Memory (SGM): each memory item is represented with fixed text-based key-value fields under a schema.
In SGM, schema fields are modality-aware. For example:
- Image/Video memory: `time`, `location`, `entities`, `ocr`, `tags`
- Email memory: `time`, `summary`, `body`
DM and SGM contain the same underlying information but use different formats.
In this codebase, DM is implemented as caption/description-style text, while SGM is implemented as schema-based key-value text fields.
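To make the contrast concrete, here is a minimal sketch of the same memory item under both representations. The SGM field names follow the image/video schema above; the sample data and helper function names are illustrative, not taken from the codebase.

```python
# One underlying memory item; the data values here are made up.
item = {
    "time": "2023-07-14 18:02",
    "location": "Hyde Park, London",
    "entities": ["Grace", "dog"],
    "ocr": [],
    "tags": ["picnic", "sunset"],
}

def to_descriptive_memory(item: dict) -> str:
    """DM: one natural-language description per item."""
    entities = " and ".join(item["entities"])
    tags = ", ".join(item["tags"])
    return (f"A photo taken at {item['location']} on {item['time']} "
            f"showing {entities}; scene tags: {tags}.")

def to_schema_guided_memory(item: dict) -> str:
    """SGM: fixed text-based key-value fields under the schema."""
    lines = []
    for key in ("time", "location", "entities", "ocr", "tags"):
        value = item[key]
        if isinstance(value, list):
            value = ", ".join(value) if value else "none"
        lines.append(f"{key}: {value}")
    return "\n".join(lines)

print(to_descriptive_memory(item))
print(to_schema_guided_memory(item))
```

Both outputs carry the same information; only the format handed to the retriever/reader differs.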
For organization of the memory store:
- Piled Memory: items are stored without explicit links.
- Linked Memory: items are linked with inferred relations (graph structure); agentic systems can additionally update existing items during organization.
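A toy sketch of the two organization schemes, assuming nothing about the actual implementation: a piled store is just a flat collection, while a linked store infers relations between items at insertion time (here, a deliberately simple shared-entity rule stands in for the inferred relations).

```python
from collections import defaultdict

class LinkedMemoryStore:
    """Toy linked-memory store; the linking rule is illustrative only."""

    def __init__(self):
        self.items: dict[int, dict] = {}            # a piled store is just this dict
        self.links: defaultdict[int, list] = defaultdict(list)

    def add(self, item_id: int, item: dict) -> None:
        # Infer links to existing items before storing: here, connect
        # any two items that mention a common entity.
        for other_id, other in self.items.items():
            for entity in set(item["entities"]) & set(other["entities"]):
                self.links[item_id].append((other_id, f"shares:{entity}"))
                self.links[other_id].append((item_id, f"shares:{entity}"))
        self.items[item_id] = item

store = LinkedMemoryStore()
store.add(1, {"entities": ["Grace"]})
store.add(2, {"entities": ["Grace", "dog"]})
print(store.links[1])  # → [(2, 'shares:Grace')]
```

An agentic organizer would additionally be allowed to rewrite `self.items` entries as new evidence arrives, which the piled scheme never does.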
In addition to end-to-end retrieval + generation evaluation, we provide NIAH (Needle In A Haystack):
- Each question is paired with a fixed evidence pool (`niah_evidence_ids`) that contains all ground-truth items.
- The rest of the pool is filled with realistic distractors.
- This isolates answer generation/reasoning quality from retrieval quality.
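The protocol above can be sketched as follows. The `niah_evidence_ids` field mirrors the name used in this benchmark; the record layout, store contents, and helper are hypothetical.

```python
# Hypothetical NIAH setup: retrieval is bypassed by handing the model a
# fixed pool (all ground-truth items + distractors), so remaining errors
# are attributable to answer generation/reasoning rather than retrieval.
question = {
    "qid": "q001",
    "question": "Where did Grace adopt her dog?",
    "niah_evidence_ids": ["m12", "m98"],  # all ground-truth items
}

memory_store = {
    "m12": "Email (2022-03-01): adoption papers from the shelter.",
    "m98": "Photo (2022-03-02): Grace holding a new puppy.",
    "m07": "Photo (2021-06-10): Grace hiking.",  # realistic distractor
}

def build_niah_pool(question: dict, store: dict, distractor_ids: list) -> list:
    """Evidence pool = ground-truth items + distractors, no retrieval step."""
    ids = question["niah_evidence_ids"] + distractor_ids
    return [store[i] for i in ids]

pool = build_niah_pool(question, memory_store, ["m07"])
```

The pool (not a retriever's top-k) is what gets passed to the QA model, which is what isolates generation quality.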
```shell
conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .
```

Set API keys via environment variables:
```shell
export OPENAI_API_KEY="your-key"
export VLLM_API_KEY="your-key"
```

Or use local key files (gitignored):

- `api_keys/.openai_key`
- `api_keys/.vllm_key`
Before running MMRAG or Oracle, generate the image/video `batch_results.json` files:

```shell
# Optional but recommended: preload reverse-geocoding cache
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache

# Generate memory itemization results
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh

# MMRAG (runs both ATM-Bench and ATM-Bench-hard)
bash scripts/QA_Agent/MMRAG/run.sh

# Oracle (upper bound; raw multimodal evidence)
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh
```
- Core baselines (`MMRAG`, `Oracle`, `NIAH`) are tested in the main `atmbench` environment.
- Third-party memory-system baselines in this repo include: `A-Mem`, `HippoRAG2`, `mem0`, `MemoryOS`.
- `MemoryOS` is strongly recommended to run in a separate conda environment. `A-Mem`, `HippoRAG2`, and `mem0` are tested to be compatible with the core baseline environment, but separate environments are still safer for reproducibility and dependency isolation.
- Setup references for these baselines are under `third_party/`: `third_party/A-mem/`, `third_party/HippoRAG/`, `third_party/mem0/`, `third_party/MemoryOS/`
- OpenClaw, OpenCode, and Codex baselines are compatible with this repo's evaluation workflow, but each requires its own third-party software installation.
For detailed setup, data layout, and reproducibility settings, see:
```
ATMBench/
├── memqa/          # Core memory QA implementation
├── scripts/        # Experiment scripts
├── docs/           # Documentation
├── data/           # Data directory (user-provided)
├── third_party/    # Vendored agentic memory systems
└── output/         # Experiment outputs (gitignored)
```
- `docs/README.md` - Getting started guide
- `docs/data.md` - Data format and preparation
- `docs/baseline.md` - Baseline implementations
- `docs/niah.md` - NIAH protocol and usage
- `docs/metrics.md` - Evaluation metrics
- `docs/reproducibility.md` - Reproduction instructions
- `docs/repo_structure.md` - Repository organization
If you use ATM-Bench in your research, please cite:
```bibtex
@article{mei2026atm,
  title={According to Me: Long-Term Personalized Referential Memory QA},
  author={Mei, Jingbiao and Chen, Jinghong and Yang, Guangyu and Hou, Xinyu and Li, Margaret and Byrne, Bill},
  journal={arXiv preprint arXiv:2603.01990},
  year={2026},
  url={https://arxiv.org/abs/2603.01990},
  doi={10.48550/arXiv.2603.01990}
}
```

- Paper: https://arxiv.org/abs/2603.01990
- Code: https://github.com/JingbiaoMei/ATM-Bench
- Issues: https://github.com/JingbiaoMei/ATM-Bench/issues
This project is licensed under the MIT License - see the LICENSE file for details.

