
ATM-Bench: Long-Term Personalized Referential Memory QA

ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering.

Paper: According to Me: Long-Term Personalized Referential Memory QA
Project Page: https://atmbench.github.io/

πŸ—“οΈ Timeline

  • 2026-03-03: arXiv paper release (2603.01990)
  • 2026-03-04: Initial codebase release, including baseline implementations for MMRAG, Oracle, NIAH, and four ported third-party baselines (A-Mem, HippoRAG2, mem0, MemoryOS).
  • Coming soon: ATM-Bench data release
  • Coming soon: Implementations for benchmarking on OpenClaw, Codex, and OpenCode

πŸ“‹ Overview

Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. ATM-Bench addresses this gap with:

  • πŸ–ΌοΈ Multimodal and multi-source data: Images, videos, emails
  • πŸ“… Long-term horizon: ~4 years of personal memory
  • 🎯 Referential queries: Resolving personalized references (e.g., "Show me the moments where Grace was trying to be sneaky...")
  • πŸ” Evidence-grounded: Human-annotated QA pairs with ground-truth memory evidence
  • 🧩 Multi-evidence reasoning: Queries requiring evidence from multiple sources
  • ⚑ Conflicting evidence: Handling contradictory information

ATM-Bench Overview

Memory Ingestion

Memory Ingestion is decomposed into:

  1. Memory preprocessing (how each memory item is represented)
  2. Memory organization (how items are structured/linked)

ATM Method

Memory Preprocessing

We compare two preprocessing representations:

  • Descriptive Memory (DM): each memory item is represented as one natural-language description.
  • Schema-Guided Memory (SGM): each memory item is represented with fixed text-based key-value fields under a schema.

In SGM, schema fields are modality-aware. For example:

  • Image/Video memory: time, location, entities, ocr, tags
  • Email memory: time, summary, body

DM and SGM contain the same underlying information but use different formats.

In this codebase, DM is implemented as caption/description-style text, while SGM is implemented as schema-based key-value text fields.
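To make the contrast concrete, here is a minimal sketch of the same memory item in both representations. The SGM field names (`time`, `location`, `entities`, `ocr`, `tags`) follow the schema listed above; the item content, the example helper `sgm_to_text`, and its exact flattening format are illustrative assumptions, not the repo's actual implementation.

```python
# The same hypothetical image memory item in both preprocessing representations.

descriptive_memory = (
    "Photo taken 2023-07-14 at Hyde Park, London: Grace hides behind a tree "
    "holding a water balloon; a sign reading 'Keep off the grass' is visible."
)

# Schema-Guided Memory: fixed key-value fields, modality-aware (image schema).
schema_guided_memory = {
    "time": "2023-07-14",
    "location": "Hyde Park, London",
    "entities": ["Grace", "tree", "water balloon"],
    "ocr": ["Keep off the grass"],
    "tags": ["outdoor", "playful", "hiding"],
}

def sgm_to_text(item: dict) -> str:
    """Flatten key-value fields into one text block for indexing/retrieval."""
    def fmt(value):
        return ", ".join(value) if isinstance(value, list) else str(value)
    return "\n".join(f"{key}: {fmt(value)}" for key, value in item.items())

print(sgm_to_text(schema_guided_memory))
```

Both forms carry the same underlying information; only the textual format handed to the retriever differs.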

Memory Organization

For organization of the memory store:

  • Piled Memory: items are stored without explicit links.
  • Linked Memory: items are linked with inferred relations (graph structure); agentic systems can additionally update existing items during organization.
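The two organization strategies can be sketched as follows, assuming only that each item has an id and a text payload. The relation label (`"same_person"`) and the adjacency-list layout are illustrative assumptions, not the repo's actual graph schema.

```python
from collections import defaultdict

# Piled Memory: a flat store with no explicit links between items.
piled_memory = []

# Linked Memory: the same store plus an undirected adjacency list of
# inferred relations between items.
linked_memory_edges = defaultdict(list)  # item_id -> [(relation, other_id)]

def add_piled(item_id, text):
    piled_memory.append({"id": item_id, "text": text})

def add_linked(item_id, text, inferred_links):
    add_piled(item_id, text)  # linked memory still stores the item itself
    for relation, other_id in inferred_links:
        linked_memory_edges[item_id].append((relation, other_id))
        linked_memory_edges[other_id].append((relation, item_id))

add_piled("img_001", "Grace hiding behind a tree")
add_linked("img_002", "Grace at the same park, one week later",
           [("same_person", "img_001")])

print(linked_memory_edges["img_001"])  # -> [('same_person', 'img_002')]
```

An agentic system would additionally be allowed to rewrite existing items while building these links, rather than only appending.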

NIAH Evaluation Setup

In addition to the end-to-end retrieval + generation evaluation, we provide a NIAH (Needle In A Haystack) setting:

  • Each question is paired with a fixed evidence pool (niah_evidence_ids) that contains all ground-truth items.
  • The rest of the pool is filled with realistic distractors.
  • This isolates answer generation/reasoning quality from retrieval quality.
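The pool construction described above can be sketched like this. The field name `niah_evidence_ids` comes from the README; `build_niah_pool` and the fixed-seed shuffling are hypothetical details of one reasonable implementation, not necessarily the repo's.

```python
import random

def build_niah_pool(gold_ids, distractor_ids, pool_size, seed=0):
    """Assemble a fixed evidence pool: all ground-truth items plus distractors."""
    assert pool_size >= len(gold_ids), "pool must contain every ground-truth item"
    rng = random.Random(seed)  # fixed seed -> the pool is stable per question
    fillers = rng.sample(distractor_ids, pool_size - len(gold_ids))
    pool = list(gold_ids) + fillers
    rng.shuffle(pool)
    return pool

pool = build_niah_pool(
    gold_ids=["m_12", "m_87"],
    distractor_ids=[f"d_{i}" for i in range(100)],
    pool_size=10,
)
# Every gold item is guaranteed present, so any answering error is a
# generation/reasoning failure rather than a retrieval miss.
assert {"m_12", "m_87"} <= set(pool) and len(pool) == 10
```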

See the docs/ directory for details.

πŸš€ Quick Start

Installation

conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .

API Keys

Set via environment variables:

export OPENAI_API_KEY="your-key"
export VLLM_API_KEY="your-key"

Or use local key files (gitignored):

  • api_keys/.openai_key
  • api_keys/.vllm_key
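The env-var-first, key-file-fallback pattern above can be sketched as follows. `load_api_key` is a hypothetical helper for illustration; the repo's actual key loader may differ.

```python
import os
from pathlib import Path

def load_api_key(env_var: str, key_file: str) -> str:
    """Prefer the environment variable; fall back to a local gitignored file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    path = Path(key_file)
    if path.exists():
        return path.read_text().strip()
    raise RuntimeError(f"Set ${env_var} or create {key_file}")

# Example usage (assumed file layout from the README):
# openai_key = load_api_key("OPENAI_API_KEY", "api_keys/.openai_key")
```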

Generate Memory Files First

Before running MMRAG or Oracle, generate the image/video batch_results.json files:

# Optional but recommended: preload reverse-geocoding cache
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache

# Generate memory itemization results
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh

Quick commands (MMRAG + Oracle)

# MMRAG (runs both ATM-bench and ATM-bench-hard)
bash scripts/QA_Agent/MMRAG/run.sh

# Oracle (upper bound; raw multimodal evidence)
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh

Baseline Compatibility and Environments

  • Core baselines (MMRAG, Oracle, NIAH) are tested in the main atmbench environment.
  • Third-party memory-system baselines in this repo include:
    • A-Mem
    • HippoRAG2
    • mem0
    • MemoryOS
  • We strongly recommend running MemoryOS in a separate conda environment.
  • A-Mem, HippoRAG2, and mem0 have been tested as compatible with the core baseline environment, but separate environments are still safer for reproducibility and dependency isolation.
  • Setup references for these baselines are under third_party/:
    • third_party/A-mem/
    • third_party/HippoRAG/
    • third_party/mem0/
    • third_party/MemoryOS/
  • OpenClaw, OpenCode, and Codex baselines are compatible with this repo’s evaluation workflow, but each requires its own third-party software installation.

For detailed setup, data layout, and reproducibility settings, see the docs/ directory.

πŸ“ Repository Structure

ATMBench/
β”œβ”€β”€ memqa/              # Core memory QA implementation
β”œβ”€β”€ scripts/            # Experiment scripts
β”œβ”€β”€ docs/               # Documentation
β”œβ”€β”€ data/               # Data directory (user-provided)
β”œβ”€β”€ third_party/        # Vendored agentic memory systems
└── output/             # Experiment outputs (gitignored)

πŸ“š Documentation

Detailed documentation lives in the docs/ directory.

πŸ“– Citation

If you use ATM-Bench in your research, please cite:

@article{mei2026atm,
  title={According to Me: Long-Term Personalized Referential Memory QA},
  author={Mei, Jingbiao and Chen, Jinghong and Yang, Guangyu and Hou, Xinyu and Li, Margaret and Byrne, Bill},
  journal={arXiv preprint arXiv:2603.01990},
  year={2026},
  url={https://arxiv.org/abs/2603.01990},
  doi={10.48550/arXiv.2603.01990}
}

πŸ“ License

This project is licensed under the MIT License; see the LICENSE file for details.
