ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering.
Paper: According to Me: Long-Term Personalized Referential Memory QA
Project Page: https://atmbench.github.io/
- ATM-Bench: Long-Term Personalized Referential Memory QA
- 2026-03-03: arXiv paper release (2603.01990)
- 2026-03-04: Initial codebase release, including baseline implementations for MMRAG, Oracle, NIAH, and four ported third-party baselines (A-Mem, HippoRAG2, mem0, MemoryOS).
- Coming soon: ATM-Bench data release
- Coming soon: Implementations for benchmarking on OpenClaw, Codex, and OpenCode
Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. ATM-Bench addresses this gap with:
- Multimodal and multi-source data: images, videos, and emails
- Long-term horizon: ~4 years of personal memory
- Referential queries: resolving personalized references (e.g., "Show me the moments where Grace was trying to be sneaky...")
- Evidence-grounded: human-annotated QA pairs with ground-truth memory evidence
- Multi-evidence reasoning: queries requiring evidence from multiple sources
- Conflicting evidence: handling contradictory information
Memory Ingestion is decomposed into:
- Memory preprocessing (how each memory item is represented)
- Memory organization (how items are structured/linked)
We compare two preprocessing representations:
- Descriptive Memory (DM): each memory item is represented as one natural-language description.
- Schema-Guided Memory (SGM): each memory item is represented with fixed text-based key-value fields under a schema.
In SGM, schema fields are modality-aware. For example:
- Image/Video memory: `time`, `location`, `entities`, `ocr`, `tags`
- Email memory: `time`, `summary`, `body`
DM and SGM contain the same underlying information but use different formats.
In this codebase, DM is implemented as caption/description-style text, while SGM is implemented as schema-based key-value text fields.
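To make the contrast concrete, here is a minimal sketch of the same memory item under both representations. The SGM field names follow the image/video schema above; the sample data and helper function names are illustrative, not taken from the codebase.

```python
# One underlying memory item; the data values here are made up.
item = {
    "time": "2023-07-14 18:02",
    "location": "Hyde Park, London",
    "entities": ["Grace", "dog"],
    "ocr": [],
    "tags": ["picnic", "sunset"],
}

def to_descriptive_memory(item: dict) -> str:
    """DM: one natural-language description per item."""
    entities = " and ".join(item["entities"])
    tags = ", ".join(item["tags"])
    return (f"A photo taken at {item['location']} on {item['time']} "
            f"showing {entities}; scene tags: {tags}.")

def to_schema_guided_memory(item: dict) -> str:
    """SGM: fixed text-based key-value fields under the schema."""
    lines = []
    for key in ("time", "location", "entities", "ocr", "tags"):
        value = item[key]
        if isinstance(value, list):
            value = ", ".join(value) if value else "none"
        lines.append(f"{key}: {value}")
    return "\n".join(lines)

print(to_descriptive_memory(item))
print(to_schema_guided_memory(item))
```

Both outputs carry the same information; only the format handed to the retriever/reader differs.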
For organization of the memory store:
- Piled Memory: items are stored without explicit links.
- Linked Memory: items are linked with inferred relations (graph structure); agentic systems can additionally update existing items during organization.
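A toy sketch of the two organization schemes, assuming nothing about the actual implementation: a piled store is just a flat collection, while a linked store infers relations between items at insertion time (here, a deliberately simple shared-entity rule stands in for the inferred relations).

```python
from collections import defaultdict

class LinkedMemoryStore:
    """Toy linked-memory store; the linking rule is illustrative only."""

    def __init__(self):
        self.items: dict[int, dict] = {}            # a piled store is just this dict
        self.links: defaultdict[int, list] = defaultdict(list)

    def add(self, item_id: int, item: dict) -> None:
        # Infer links to existing items before storing: here, connect
        # any two items that mention a common entity.
        for other_id, other in self.items.items():
            for entity in set(item["entities"]) & set(other["entities"]):
                self.links[item_id].append((other_id, f"shares:{entity}"))
                self.links[other_id].append((item_id, f"shares:{entity}"))
        self.items[item_id] = item

store = LinkedMemoryStore()
store.add(1, {"entities": ["Grace"]})
store.add(2, {"entities": ["Grace", "dog"]})
print(store.links[1])  # → [(2, 'shares:Grace')]
```

An agentic organizer would additionally be allowed to rewrite `self.items` entries as new evidence arrives, which the piled scheme never does.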
In addition to end-to-end retrieval + generation evaluation, we provide NIAH (Needle In A Haystack):
- Each question is paired with a fixed evidence pool (`niah_evidence_ids`) that contains all ground-truth items.
- The rest of the pool is filled with realistic distractors.
- This isolates answer generation/reasoning quality from retrieval quality.
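The protocol above can be sketched as follows. The `niah_evidence_ids` field mirrors the name used in this benchmark; the record layout, store contents, and helper are hypothetical.

```python
# Hypothetical NIAH setup: retrieval is bypassed by handing the model a
# fixed pool (all ground-truth items + distractors), so remaining errors
# are attributable to answer generation/reasoning rather than retrieval.
question = {
    "qid": "q001",
    "question": "Where did Grace adopt her dog?",
    "niah_evidence_ids": ["m12", "m98"],  # all ground-truth items
}

memory_store = {
    "m12": "Email (2022-03-01): adoption papers from the shelter.",
    "m98": "Photo (2022-03-02): Grace holding a new puppy.",
    "m07": "Photo (2021-06-10): Grace hiking.",  # realistic distractor
}

def build_niah_pool(question: dict, store: dict, distractor_ids: list) -> list:
    """Evidence pool = ground-truth items + distractors, no retrieval step."""
    ids = question["niah_evidence_ids"] + distractor_ids
    return [store[i] for i in ids]

pool = build_niah_pool(question, memory_store, ["m07"])
```

The pool (not a retriever's top-k) is what gets passed to the QA model, which is what isolates generation quality.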
```shell
conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .
```

Set API keys via environment variables:
```shell
export OPENAI_API_KEY="your-key"
export VLLM_API_KEY="your-key"
```

Or use local key files (gitignored):

- `api_keys/.openai_key`
- `api_keys/.vllm_key`
Before running MMRAG or Oracle, generate the image/video `batch_results.json` files:

```shell
# Optional but recommended: preload reverse-geocoding cache
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache

# Generate memory itemization results
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh

# MMRAG (runs both ATM-Bench and ATM-Bench-hard)
bash scripts/QA_Agent/MMRAG/run.sh

# Oracle (upper bound; raw multimodal evidence)
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh
```
- Core baselines (`MMRAG`, `Oracle`, `NIAH`) are tested in the main `atmbench` environment.
- Third-party memory-system baselines in this repo include: `A-Mem`, `HippoRAG2`, `mem0`, `MemoryOS`.
- `MemoryOS` is strongly recommended to run in a separate conda environment. `A-Mem`, `HippoRAG2`, and `mem0` are tested to be compatible with the core baseline environment, but separate environments are still safer for reproducibility and dependency isolation.
- Setup references for these baselines are under `third_party/`: `third_party/A-mem/`, `third_party/HippoRAG/`, `third_party/mem0/`, `third_party/MemoryOS/`
- OpenClaw, OpenCode, and Codex baselines are compatible with this repo's evaluation workflow, but each requires its own third-party software installation.
For detailed setup, data layout, and reproducibility settings, see:
```
ATMBench/
├── memqa/          # Core memory QA implementation
├── scripts/        # Experiment scripts
├── docs/           # Documentation
├── data/           # Data directory (user-provided)
├── third_party/    # Vendored agentic memory systems
└── output/         # Experiment outputs (gitignored)
```
- `docs/README.md` - Getting started guide
- `docs/data.md` - Data format and preparation
- `docs/baseline.md` - Baseline implementations
- `docs/niah.md` - NIAH protocol and usage
- `docs/metrics.md` - Evaluation metrics
- `docs/reproducibility.md` - Reproduction instructions
- `docs/repo_structure.md` - Repository organization
If you use ATM-Bench in your research, please cite:
```bibtex
@article{mei2026atm,
  title={According to Me: Long-Term Personalized Referential Memory QA},
  author={Mei, Jingbiao and Chen, Jinghong and Yang, Guangyu and Hou, Xinyu and Li, Margaret and Byrne, Bill},
  journal={arXiv preprint arXiv:2603.01990},
  year={2026},
  url={https://arxiv.org/abs/2603.01990},
  doi={10.48550/arXiv.2603.01990}
}
```

- Paper: https://arxiv.org/abs/2603.01990
- Code: https://github.com/JingbiaoMei/ATM-Bench
- Issues: https://github.com/JingbiaoMei/ATM-Bench/issues
This project is licensed under the MIT License - see the LICENSE file for details.

