This repository contains the code and data for the paper: Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Evaluating the abilities of large language models (LLMs) on tasks that require long-term memory, and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and test only simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M-token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%–12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
BEAM is a comprehensive dataset for evaluating long-term memory in language models. It contains multi-scale conversations (128K, 500K, 1M, and 10M tokens) spanning diverse domains, including general, coding, and mathematical topics, and is designed to assess ten distinct memory abilities. To evaluate LLMs on these abilities, we generate a set of probing questions for each conversation.
BEAM consists of 100 conversations distributed as follows:
- 128K: 20 chats
- 500K: 35 chats
- 1M: 35 chats
- 10M: 10 chats
| Chat Size | # User Messages | # Assistant Messages | # Answer Assistant Questions | # Follow-up Questions | # Turns |
|---|---|---|---|---|---|
| 128K | 144 | 144 | 27 | 216 | 107 |
| 500K | 544 | 544 | 79 | 51 | 416 |
| 1M | 1,067 | 1,067 | 105 | 120 | 842 |
| 10M | 10,435 | 10,435 | 1,151 | 1,528 | 7,757 |
Statistics of the BEAM dataset. Reported values are averages per chat in each chat size. “# User Messages” and “# Assistant Messages” denote the average number of utterances from each side. “# Answer Assistant Questions” indicates how often the assistant posed a question that the user answered. “# Follow-up Questions” counts user follow-ups, and “# Turns” refers to the total number of dialogue turns.
- Abstention: Evaluates whether a model withholds answers when evidence is missing
- Contradiction Resolution: Tests the capacity to detect and reconcile inconsistent statements across widely separated turns, maintaining global coherence
- Event Ordering: Assesses whether a model can recognize and reconstruct the sequence of evolving information in the dialogue
- Information Extraction: Measures recall of entities and factual details in long histories
- Instruction Following: Examines sustained adherence to user-specified constraints over long contexts
- Knowledge Update: Evaluates revising stored facts as new ones appear
- Multi-Session Reasoning: Probes inference that integrates evidence across multiple, non-adjacent dialogue segments
- Preference Following: Captures personalized responses that adapt to evolving preferences
- Summarization: Assesses the ability to abstract and compress dialogue content
- Temporal Reasoning: Tests reasoning about explicit and implicit time relations
| Benchmark | Domain | Chat Length | IE | MR | KU | TR | ABS | CR | EO | IF | PF | SUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSC | Casual | ~1K | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DuLeMon | Casual | ~1K | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| MemoryBank | Personal | ~5K | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| PerLTQA | Personal | N/A | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| LoCoMo | Personal | ~10K | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| DialSim | TV/Film | ~350K | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| LongMemEval | Personal | 115K, 1M | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| MemBench | Personal | ~100K | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| BEAM (This work) | Multi-domain (Coding, Math, Health, Finance, Personal, ...) | 128K, 500K, 1M, 10M | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Comparison of BEAM with existing long-term memory benchmarks.
Memory abilities — IE: Information Extraction, MR: Multi-hop Reasoning, KU: Knowledge Update, TR: Temporal Reasoning, ABS: Abstention, CR: Contradiction Resolution, EO: Event Ordering, IF: Instruction Following, PF: Preference Following, SUM: Summarization.
LIGHT is a cognitively inspired memory-augmented framework designed to enhance long-term memory in large language models.
It draws inspiration from human memory systems and integrates three complementary components that work together during inference:
- Episodic Memory – A long-term memory index that retrieves relevant information across extended contexts.
- Working Memory – A short-term buffer that retains the most recent dialogue turns, enabling continuity and contextual relevance.
- Scratchpad – An iteratively compressed semantic layer that tracks salient facts, user instructions, and contextual updates after each turn.
At inference time, LIGHT retrieves and integrates information from all three memory systems, enabling the model to produce more grounded, coherent, and contextually consistent responses even in conversations spanning millions of tokens.
LIGHT demonstrates consistent improvements across all evaluated models on the BEAM benchmark, achieving 3.5%–12.7% higher accuracy on probing questions compared to the strongest baselines.
An ablation study confirms that each component—episodic retrieval, working memory, and scratchpad—contributes complementary benefits to overall performance.
The detailed implementation of each memory component—episodic memory, working memory, and scratchpad—can be found in:
src/answer_probing_questions/light.py
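For orientation, the sketch below shows one way the three memory systems could fit together at inference time. It is a minimal conceptual sketch, not the repository's implementation (see the file above for that); the class names, method signatures, and the keyword-overlap retrieval are simplified assumptions.

```python
# Conceptual sketch of LIGHT's three memory systems (illustrative only; names,
# signatures, and the retrieval logic are simplified assumptions, not the API
# of src/answer_probing_questions/light.py).
from collections import deque
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Long-term index over past turns, searched here by simple keyword overlap."""
    episodes: list = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.episodes.append(turn)

    def retrieve(self, query: str, k: int = 3) -> list:
        query_words = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda ep: len(query_words & set(ep.lower().split())),
            reverse=True,
        )
        return scored[:k]


@dataclass
class WorkingMemory:
    """Short-term buffer that keeps only the most recent dialogue turns."""
    max_turns: int = 8
    buffer: deque = field(default_factory=deque)

    def add(self, turn: str) -> None:
        self.buffer.append(turn)
        while len(self.buffer) > self.max_turns:
            self.buffer.popleft()


@dataclass
class Scratchpad:
    """Iteratively updated record of salient facts and user instructions."""
    notes: str = ""

    def update(self, turn: str) -> None:
        # A real system would compress this with an LLM; here we simply append.
        self.notes = (self.notes + "\n" + turn).strip()


def build_prompt(query: str, episodic: EpisodicMemory,
                 working: WorkingMemory, pad: Scratchpad) -> str:
    """Combine all three memories into the context passed to the backbone LLM."""
    retrieved = "\n".join(episodic.retrieve(query))
    recent = "\n".join(working.buffer)
    return (
        f"Scratchpad:\n{pad.notes}\n\n"
        f"Retrieved episodes:\n{retrieved}\n\n"
        f"Recent turns:\n{recent}\n\n"
        f"Question: {query}"
    )
```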
The BEAM dataset is publicly available on the Hugging Face Hub and also included within this repository.
- BEAM (128K, 500K, 1M chats) – https://huggingface.co/datasets/Mohammadta/BEAM
- BEAM-10M (10M-token chats) – https://huggingface.co/datasets/Mohammadta/BEAM-10M
Each dataset contains multi-turn conversations and corresponding metadata required for memory-ability evaluation.
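If you prefer to pull the data straight from the Hub, the standard `datasets` library can be used. A minimal sketch, assuming the default configuration and split layout (check the dataset cards linked above for the exact names):

```python
# Load BEAM directly from the Hugging Face Hub.
# Note: the configuration/split layout assumed here may differ from the actual
# dataset cards; consult the URLs above for the exact structure.
from datasets import load_dataset

beam = load_dataset("Mohammadta/BEAM")          # 128K, 500K, and 1M chats
beam_10m = load_dataset("Mohammadta/BEAM-10M")  # 10M-token chats

print(beam)  # inspect available splits and features
```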
To simplify setup, we provide a script that automatically downloads and formats the dataset for use with the provided codebase.
Run the following command from the root directory:
```bash
python src/beam/download_dataset.py
```
After running it, the dataset will be ready to use for model evaluation or for reproducing the results from the paper.
To install all required dependencies, run:
```bash
pip install -r requirements.txt
```
This will install all necessary libraries for dataset generation, probing question creation, answer generation, and evaluation.
The topics of the chats are provided in the topics/ directory.
To recreate the dataset (as provided in this repository), follow these steps:
Define your model configurations — including model URLs, names, and API keys — inside:
src/llms_config.json
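The exact schema of `src/llms_config.json` is defined by the repository code; purely as an illustration, an entry covering the URL, name, and API key mentioned above might look like the sketch below (field names are hypothetical):

```python
# Hypothetical illustration of the fields src/llms_config.json might contain.
# The actual schema is defined by the repository; the keys below are guesses.
import json

config = {
    "models": [
        {
            "llm_url": "http://localhost:8000",
            "llm_name": "llama3",
            "llm_api_key": "my_api_key",
        }
    ]
}

with open("src/llms_config.json", "w") as f:
    json.dump(config, f, indent=2)
```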
The dataset generation process is divided into three stages, executed with run_pipeline.sh.
Each command follows this format:
```bash
bash src/beam/run_pipeline.sh [llm_url] [llm_name] [llm_api_key] [stage] [start_index] [end_index] [chat_directory] [chat_size]
```
For example:
```bash
# 1. Create conversation plans
bash src/beam/run_pipeline.sh http://localhost:8000 llama3 my_api_key plan 0 10 chats/1M 1M

# 2. Create user questions
bash src/beam/run_pipeline.sh http://localhost:8000 llama3 my_api_key question 0 10 chats/1M 1M

# 3. Create assistant answers
bash src/beam/run_pipeline.sh http://localhost:8000 llama3 my_api_key answer 0 10 chats/1M 1M
```
After generating the chats, the next step is to create probing questions for each conversation.
For 128K, 500K, and 1M chat sizes, run the function create_probing_questions inside:
src/beam/main.py
For 10M chat size, run the function ten_m_create_probing_questions inside:
src/beam/ten_milion_pipeline.py
These functions automatically generate probing questions that evaluate ten distinct memory abilities for each conversation.
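As a rough sketch of how these functions might be invoked programmatically, see below; the argument names are hypothetical, so check the actual function definitions in the files above for the real signatures.

```python
# Hypothetical invocation sketch; the real signatures live in
# src/beam/main.py and src/beam/ten_milion_pipeline.py.
from src.beam.main import create_probing_questions
from src.beam.ten_milion_pipeline import ten_m_create_probing_questions

# For 128K / 500K / 1M chats (arguments shown here are illustrative guesses):
create_probing_questions(chat_directory="chats/1M", chat_size="1M")

# For 10M chats:
ten_m_create_probing_questions(chat_directory="chats/10M", chat_size="10M")
```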
You can build your own multi-turn conversational datasets of any size — including 128K, 500K, 1M, 10M tokens, or even longer.
To do this:
- Prepare your own chat seed information, similar to the examples in the `topics/` directory.
- Define your LLM configurations in `src/llms_config.json`.
- Run the same three-stage pipeline described above (plan → question → answer).
- Create probing questions using the appropriate function (`create_probing_questions` or `ten_m_create_probing_questions`).
- Design evaluation rubrics.
This process automatically generates long, coherent, multi-domain dialogues ready for probing and evaluation.
After the dataset and probing questions are ready, generate answers using:
```bash
bash src/model_inference/answer_generation.sh
```
Before running, you must edit the environment variables inside `answer_generation.sh` to match your experimental setup:
- For long-context LLMs: set `EVAL_TYPE="long-context"`
- For the RAG baseline: set `EVAL_TYPE="rag"`, `RETRIEVAL_METHOD="pair_chunk"`, and `RETRIEVER="dense"`
- For LIGHT (our proposed method): set `EVAL_TYPE="rag"` and `RETRIEVAL_METHOD="light"`
To evaluate the generated answers against reference probing questions, run:
```bash
python -m src.evaluation.run_evaluation \
    --input_directory [results directory, e.g. results/1M] \
    --chat_size [chat size, e.g. 1M] \
    --start_index [start index] \
    --end_index [end index] \
    --max_workers [num workers] \
    --allowed_result_files [list of result files to evaluate]
```
This script uses an LLM-as-a-judge model to score each generated answer against the corresponding probing question. The judging LLM assigns a numerical score for each memory ability category. All evaluation scores are automatically saved in the specified results directory for later analysis and reporting.
To aggregate and export evaluation results into a structured Excel file, run:
```bash
python -m src.evaluation.report_results \
    --evaluation_directory [evaluation directory, e.g. results/1M/] \
    --row_names [evaluation file names to display] \
    --output_filename [output filename]
```
This script combines evaluation results across models and probing categories and produces a `.xlsx` report for analysis and comparison.
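The resulting spreadsheet can then be inspected programmatically, for example with pandas; a small sketch, where the file name is a placeholder for whatever you passed as `--output_filename`:

```python
# Quick look at the aggregated report produced by report_results
# (requires pandas and openpyxl; "results_report.xlsx" is a placeholder name).
import pandas as pd

report = pd.read_excel("results_report.xlsx")
print(report.head())
```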
💡 Tip: The pipeline is fully modular — you can independently run dataset creation, question generation, answer generation, and evaluation, depending on your specific experimental needs.
This repository contains both code and data, each released under a different license:
- Code: Licensed under the MIT License. You can find the full text of the license in the LICENSE file.
- BEAM Benchmark: Licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
If you find this work helpful or use the code or dataset, please cite the following paper:
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
```bibtex
@misc{tavakoli2025milliontokensbenchmarkingenhancing,
  title={Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs},
  author={Mohammad Tavakoli and Alireza Salemi and Carrie Ye and Mohamed Abdalla and Hamed Zamani and J Ross Mitchell},
  year={2025},
  eprint={2510.27246},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.27246},
}
```