diff --git a/Class 8 Homework.ipynb b/Class 8 Homework.ipynb index dc5ac6a..38959b5 100644 --- a/Class 8 Homework.ipynb +++ b/Class 8 Homework.ipynb @@ -14,7 +14,7 @@ "\n", "## Learning Objectives\n", "\n", - "* Generate abstractive summaries of academic documents using LLaMA 3 (7B).\n", + "* Generate abstractive summaries of academic documents using LLaMA 3 (8B) Instruct.\n", "* Collect two candidate summaries per paper and have annotators select the better summary.\n", "* Prepare the dataset of summary pairs and preference labels for reward model training.\n", "* Train a reward model (e.g., DeBERTa-v3) on the collected preference data.\n", @@ -23,7 +23,7 @@ "## Project Design\n", "\n", "* **Data Collection:** Select 10 academic papers (including both text and figures) from arXiv or recent NLP conference proceedings.\n", - "* **Summary Generation:** For each paper, use the LLaMA 3 (7B) model to generate *two* different summaries. Vary the prompting strategy or sampling parameters to produce diverse outputs.\n", + "* **Summary Generation:** For each paper, use the LLaMA 3 (8B) model to generate *two* different summaries. Vary the prompting strategy or sampling parameters to produce diverse outputs.\n", "* **Human Annotation:** Have one or two human annotators compare each pair of summaries for a paper and choose the better one (e.g. more informative, coherent, factually consistent, etc.). Record which summary is preferred.\n", "* **Data Formatting:** Create a dataset (e.g. in JSONL format) of summary pairs and preference labels. Each entry should include the two summary texts and which one was chosen (for example, fields `chosen` and `rejected` as required by reward modeling tools).\n", "* **Reward Model Training:** Fine-tune a reward model (such as DeBERTa-v3) on this preference data. 
Use the chosen/rejected summary pairs so the model learns to assign higher scores to the preferred summaries.\n", @@ -143,7 +143,7 @@ "* Install required Python libraries: `transformers`, `datasets`, `evaluate`, `trl` (Hugging Face TRL), and `accelerate`.\n", "* (Optional) Install `peft` if you want to use parameter-efficient fine-tuning for the reward model.\n", "* Ensure you have GPU access for model training (e.g., use Google Colab Pro, AWS, or a local GPU).\n", - "* Download or load the LLaMA 3 (7B) model checkpoint and a DeBERTa-v3 checkpoint (for example, via Hugging Face Hub).\n", + "* Download or load the LLaMA 3 (8B) model checkpoint and a DeBERTa-v3 checkpoint (for example, via Hugging Face Hub).\n", "\n", "## Deliverables\n", "\n", diff --git a/README.md b/README.md new file mode 100644 index 0000000..eec2dcd --- /dev/null +++ b/README.md @@ -0,0 +1,454 @@ +# Paper Summarization with Reward Modeling + +**Week 8 Assignment: Multimodal Summarization and Reward Modeling** + +This project implements an end-to-end pipeline for generating and evaluating academic paper summaries using large language models (LLaMA 3.1) and reward modeling (DeBERTa-v3). The system demonstrates how to align AI-generated summaries with human preferences through reinforcement learning from human feedback (RLHF) techniques. + +## Table of Contents + +- [Overview](#overview) +- [Learning Objectives](#learning-objectives) +- [Project Structure](#project-structure) +- [Installation](#installation) +- [Usage](#usage) +- [Pipeline Phases](#pipeline-phases) +- [Configuration](#configuration) +- [Output Files](#output-files) +- [Evaluation Metrics](#evaluation-metrics) +- [Example Results](#example-results) +- [Troubleshooting](#troubleshooting) + +## Overview + +Effective summarization is critical in research because it distills large, complex documents into concise overviews that highlight key insights. However, automatically evaluating the quality of generated summaries is challenging. 
Traditional metrics like ROUGE rely on lexical overlap, and even embedding-based metrics like BERTScore can miss nuances such as factual consistency or coherence. + +This project addresses this gap by: +1. Generating multiple candidate summaries using the LLaMA 3.1 (8B) Instruct model +2. Creating preference labels based on ROUGE scores +3. Training a reward model to predict human-aligned quality scores +4. Comparing reward model predictions with traditional metrics + +## Learning Objectives + +- Generate abstractive summaries of academic documents using LLaMA 3.1 (8B) Instruct +- Collect two candidate summaries per paper and create preference labels +- Prepare datasets of summary pairs for reward model training +- Train a reward model (DeBERTa-v3) on preference data +- Evaluate summaries using ROUGE, BERTScore, and the trained reward model +- Analyze alignment between reward model scores and automatic metrics + +## Project Structure + +``` +Homework8-Submission/ +├── README.md # This file +├── requirements.txt # Python dependencies +├── Class 8 Homework.ipynb # Jupyter notebook with instructions +│ +├── data/ +│ └── pdfs/ # Academic papers in PDF format (10 papers) +│ ├── 2601.05103v1.pdf +│ ├── 2601.05104v1.pdf +│ └── ... 
+│ +├── data_utils.py # PDF parsing and ArXiv downloading +├── summarization.py # LLaMA model loading and summary generation +├── reward_model.py # Reward model training and evaluation +├── build_data.py # Phase 1-3: Data collection and summarization +├── train_reward.py # Phase 4: Reward model training +├── run_evaluation.py # Phase 5-6: Evaluation and comparison +└── main.py # (Deprecated) Legacy pipeline orchestration +``` + +### Module Descriptions + +#### `data_utils.py` +- **Purpose**: Handle data collection and PDF processing +- **Key Functions**: + - `download_arxiv_pdfs()`: Download papers from ArXiv + - `extract_text_and_figures()`: Extract text and figure captions from PDFs + - `save_papers_to_json()` / `load_papers_from_json()`: Serialize paper data + +#### `summarization.py` +- **Purpose**: Generate summaries using LLaMA 3.1 model +- **Key Functions**: + - `load_llama()`: Load optimized 4-bit quantized LLaMA model using unsloth + - `build_multimodal_text()`: Combine paper text with figure captions + - `llama_generate_summary()`: Generate summaries with different prompting strategies + - `generate_summaries_for_papers()`: Process multiple papers + +#### `reward_model.py` +- **Purpose**: Train and use reward models for summary quality prediction +- **Key Functions**: + - `auto_label_preferences_with_rouge()`: Create preference labels from ROUGE scores + - `train_reward_model()`: Fine-tune DeBERTa-v3 on preference data + - `score_summaries_with_reward_model()`: Score summaries using trained model + - `evaluate_summaries()`: Compute all metrics (ROUGE, BERTScore, reward scores) + +#### `build_data.py` +- **Purpose**: Phase 1-3 - Data collection, summarization, and preference dataset creation +- **Key Steps**: Download papers, generate summaries, create preference labels + +#### `train_reward.py` +- **Purpose**: Phase 4 - Train the reward model on preference data +- **Key Steps**: Load preference data, fine-tune DeBERTa-v3, save trained model + +#### 
`run_evaluation.py` +- **Purpose**: Phase 5-6 - Evaluation and comparison of summaries +- **Key Steps**: Load trained model, compute all metrics, analyze results + +## Installation + +### Prerequisites + +- Python 3.11 or higher +- **For GPU server (recommended)**: + - NVIDIA, AMD, or Intel GPU (the accelerators unsloth supports) + - 16GB+ VRAM recommended + - Uses optimized unsloth for fast 4-bit inference +- **For Mac M4/Apple Silicon testing**: + - Code will fall back to standard transformers + - Model loading will be slower + - Use GPU server for actual training/evaluation +- HuggingFace Hub token (for accessing LLaMA models) +- OpenAI API key (optional, for additional features) + +### Setup + +1. Clone the repository: +```bash +git clone +cd Homework8-Submission +``` + +2. Install dependencies: +```bash +pip install -r requirements.txt +``` + +3. Set up environment variables: +```bash +export HF_HUB_TOKEN="your_huggingface_token" +export OPENAI_API_KEY="your_openai_key" # Optional +``` + +### Key Dependencies + +- `unsloth`: Optimized LLaMA model loading with 4-bit quantization +- `transformers`: HuggingFace transformers library +- `trl`: Transformer Reinforcement Learning for reward modeling +- `torch`: PyTorch deep learning framework +- `datasets`: HuggingFace datasets library +- `evaluate`: Evaluation metrics (ROUGE, BERTScore) +- `PyMuPDF` (fitz): PDF text extraction +- `arxiv`: ArXiv API client + +## Usage + +### Quick Start + +Run the complete pipeline by executing the three scripts in order: + +```bash +# Step 1: Build dataset (Phase 1-3) +python build_data.py + +# Step 2: Train reward model (Phase 4) +python train_reward.py + +# Step 3: Evaluate summaries (Phase 5-6) +python run_evaluation.py +``` + +### Individual Phase Details + +**Phase 1-3: Data Collection and Summarization** +```bash +python build_data.py +``` +This will: +- Use existing PDFs in `data/pdfs/` (or download from ArXiv if needed) +- Generate two summaries per paper using different prompting strategies +- Create 
preference labels based on ROUGE-L scores +- Save results to `summaries_train.json` and `reward_data.jsonl` + +**Output Files:** +- `train_papers.json` - Structured paper data +- `summaries_train.json` - Generated summaries +- `reward_data.jsonl` - Preference pairs for training + +--- + +**Phase 4: Train Reward Model** +```bash +python train_reward.py +``` +This will: +- Load preference data from `reward_data.jsonl` +- Fine-tune DeBERTa-v3 model on chosen/rejected summary pairs +- Save trained model to `reward_model/` directory + +**Requirements:** +- `reward_data.jsonl` must exist (generated by `build_data.py`) +- GPU with at least 8GB VRAM recommended +- Training takes approximately 10-15 minutes on GPU + +**Output:** +- `reward_model/` directory with trained model weights and tokenizer + +--- + +**Phase 5-6: Evaluation** +```bash +python run_evaluation.py +``` +This will: +- Load trained reward model from `reward_model/` +- Compute ROUGE, BERTScore, and reward scores for all summaries +- Save comprehensive results to `evaluation_results.json` + +**Requirements:** +- `summaries_train.json` must exist +- `reward_model/` directory must contain trained model + +**Output:** +- `evaluation_results.json` - Detailed metrics for all summaries + + +## Pipeline Phases + +### Phase 1: Data Collection (`build_data.py`) +- **Input**: PDFs in `data/pdfs/` or ArXiv query +- **Process**: Extract text and figure captions from academic papers +- **Output**: `train_papers.json` with structured paper data + +### Phase 2: Summary Generation (`build_data.py`) +- **Model**: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` +- **Strategy**: Generate two summaries per paper: + 1. **Summary 1**: Generic prompt, low temperature (0.3) - More focused + 2. 
**Summary 2**: Figure-aware prompt, high temperature (0.8) - More creative +- **Output**: `summaries_train.json` + +### Phase 3: Preference Labeling (`build_data.py`) +- **Method**: Use ROUGE-L scores against paper abstracts +- **Logic**: Summary with higher ROUGE-L = "chosen", other = "rejected" +- **Output**: `reward_data.jsonl` with preference pairs + +### Phase 4: Reward Model Training (`train_reward.py`) +- **Base Model**: `microsoft/deberta-v3-base` +- **Training**: 3 epochs with batch size 4, bfloat16 precision +- **Objective**: Learn to assign higher scores to preferred summaries +- **Max Length**: 4096 tokens to accommodate full summaries with context +- **Output**: Trained model in `reward_model/` directory + +### Phase 5-6: Evaluation & Comparison (`run_evaluation.py`) +- **Metrics Computed**: + - **ROUGE-1, ROUGE-2, ROUGE-L**: Lexical overlap with abstract + - **BERTScore**: Semantic similarity using BERT embeddings + - **Reward Score**: Trained model's quality prediction +- **Analysis**: Compare metric agreement and disagreement cases +- **Output**: `evaluation_results.json` + +## Configuration + +### Key Parameters (in respective modules) + +**Summarization** (`summarization.py`): +```python +LLAMA_MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" +MAX_INPUT_TOKENS = 2048 +MAX_NEW_TOKENS = 256 +``` + +**Reward Model** (`reward_model.py`): +```python +REWARD_MODEL_NAME = "microsoft/deberta-v3-base" +REWARD_NUM_EPOCHS = 3 +REWARD_BATCH_SIZE = 4 +``` + +**Data Collection** (`build_data.py`): +```python +ARXIV_QUERY = "cs.LG" # ArXiv category +NUM_TRAIN_PAPERS = 10 +PDF_DIR = "data/pdfs" +``` + +## Output Files + +| File | Description | +|------|-------------| +| `train_papers.json` | Structured data for training papers (text, abstract, figures) | +| `summaries_train.json` | Two summaries per paper with metadata | +| `reward_data.jsonl` | Preference pairs (chosen/rejected) for training | +| `reward_model/` | Trained DeBERTa-v3 reward model 
weights | +| `evaluation_results.json` | Complete evaluation metrics for all summaries | + +## Evaluation Metrics + +### ROUGE (Recall-Oriented Understudy for Gisting Evaluation) +- Measures lexical overlap between generated and reference summaries +- **ROUGE-1**: Unigram overlap +- **ROUGE-2**: Bigram overlap +- **ROUGE-L**: Longest common subsequence +- **Range**: 0-1 (higher is better) + +### BERTScore +- Measures semantic similarity using BERT embeddings +- Captures meaning beyond surface-level word matching +- Computes precision, recall, and F1 +- **Range**: 0-1 (higher is better) + +### Reward Model Score +- Learned metric from preference data +- Predicts human-aligned quality judgments +- Captures aspects like coherence, informativeness, factual consistency +- **Range**: Uncalibrated (higher = better quality) + +## Example Results + +After running the pipeline, `evaluation_results.json` contains entries like: + +```json +{ + "arxiv_id": "2601.05103v1", + "title": "Paper Title", + "abstract": "Original abstract...", + "summary_1": "First generated summary...", + "summary_2": "Second generated summary...", + "rouge_1": {"rouge1": 0.45, "rouge2": 0.23, "rougeL": 0.38}, + "rouge_2": {"rouge1": 0.42, "rouge2": 0.21, "rougeL": 0.35}, + "bertscore_1": {"precision": [0.89], "recall": [0.87], "f1": [0.88]}, + "bertscore_2": {"precision": [0.86], "recall": [0.84], "f1": [0.85]}, + "reward_1": 2.34, + "reward_2": 1.87 +} +``` + +### Interpretation + +- **High ROUGE + High Reward**: Summary is both lexically similar and high-quality +- **Low ROUGE + High Reward**: Summary paraphrases well but uses different words +- **High ROUGE + Low Reward**: Summary copies text but lacks coherence +- **BERTScore**: Often correlates better with reward scores than ROUGE + +## Troubleshooting + +### Common Issues + +**1. Script Execution Order** +``` +Error: FileNotFoundError: reward_data.jsonl not found + +Solution: Run scripts in correct order +1. 
python build_data.py # Creates reward_data.jsonl +2. python train_reward.py # Requires reward_data.jsonl +3. python run_evaluation.py # Requires reward_model/ and summaries_train.json +``` + +**2. Import Conflicts Between unsloth and TRL** +``` +Error: AttributeError or TypeError related to RewardTrainer + +Solution: Scripts are now separated to avoid conflicts +- build_data.py uses unsloth for LLaMA (summarization) +- train_reward.py uses standard TRL (no unsloth) +- Each runs in its own process with isolated imports +``` + +**3. Unsloth Not Supported on Mac M4/Apple Silicon** +``` +Error: NotImplementedError: Unsloth currently only works on NVIDIA, AMD and Intel GPUs. + +Solution: The code automatically falls back to standard transformers +- Mac M4 will use standard transformers (slower but works) +- For actual training/evaluation, use a GPU server +- No code changes needed, fallback is automatic +``` + +**4. Out of Memory Error** +``` +Solution: Reduce batch size or use smaller model +- Set REWARD_BATCH_SIZE = 2 in reward_model.py +- Use gradient accumulation for effective larger batch +``` + +**5. Model Download Fails** +``` +Solution: Check HuggingFace token +export HF_HUB_TOKEN="your_token" +huggingface-cli login +``` + +**6. CUDA Out of Memory** +``` +Solution: Enable CPU offloading or use smaller sequences +- Reduce MAX_INPUT_TOKENS in summarization.py +- Use CPU: Set LLAMA_DEVICE = "cpu" +``` + +**7. PDF Extraction Issues** +``` +Solution: Some PDFs may have complex layouts +- Check data/pdfs/ for corrupted files +- Manually verify text extraction quality +``` + +**8. Training Fails with AttributeError for RewardTrainer** +``` +Error: AttributeError: 'TrainingArguments' object has no attribute 'model_init_kwargs' + +Solution: Make sure you're using the updated reward_model.py +- The code adds all required TRL RewardConfig attributes +- Use train_reward.py script (not deprecated main.py) +- Clear Python cache: rm -rf __pycache__ && find . 
-name "*.pyc" -delete +``` + +**9. Circular Import with evaluate library** +``` +Error: ImportError: cannot import name 'evaluate_summaries' from partially initialized module 'reward_model' + +Solution: File naming conflict with HuggingFace 'evaluate' library +- The evaluation script is named 'run_evaluation.py' (not 'evaluate.py') +- This avoids conflict with the 'evaluate' library used in reward_model.py +- Use: python run_evaluation.py +``` + +### Performance Tips + +1. **Use GPU**: CUDA significantly speeds up inference and training +2. **Batch Processing**: Process multiple summaries together when possible +3. **Caching**: Save intermediate results to avoid recomputation +4. **4-bit Quantization**: Already enabled via unsloth for memory efficiency + +## Citation + +If you use this code in your research, please cite: + +```bibtex +@misc{paper_summarization_reward, + title={Paper Summarization with Reward Modeling}, + author={Class 8 Homework}, + year={2026}, + howpublished={\url{https://github.com/your-repo/homework8-submission}} +} +``` + +## License + +This project is licensed under the MIT License - see the LICENSE file for details. + +## Acknowledgments + +- **Unsloth**: For optimized LLaMA model loading +- **HuggingFace**: For transformers, datasets, and TRL libraries +- **Meta AI**: For LLaMA 3.1 model +- **Microsoft**: For DeBERTa-v3 model + +## Contact + +For questions or issues, please open an issue on the GitHub repository or contact the course instructors. + +--- + +**Note**: This project is for educational purposes as part of Week 8 homework on multimodal summarization and reward modeling. diff --git a/build_data.py b/build_data.py new file mode 100644 index 0000000..67668e1 --- /dev/null +++ b/build_data.py @@ -0,0 +1,117 @@ +#!/usr/bin/env python3 +""" +Phase 1-3: Data collection, summarization, and preference dataset creation. 
+ +This script: +- Downloads training papers from ArXiv (or uses existing PDFs) +- Generates two summaries per paper using LLaMA +- Creates preference labels using ROUGE-L scores +""" +import os +import glob +import re + +from data_utils import ( + download_arxiv_pdfs, + save_papers_to_json, + extract_text_and_figures, + PaperData, +) +from summarization import generate_summaries_for_papers +from reward_model import auto_label_preferences_with_rouge + + +def extract_abstract_from_text(text: str) -> str: + """ + Extract the abstract from paper text. + + Looks for text between "Abstract" and common section headers. + + Args: + text: Full paper text + + Returns: + Extracted abstract or first 500 chars if not found + """ + # Common patterns for abstract section + abstract_patterns = [ + r'Abstract\s*\n+(.*?)\n+(?:1\.|Introduction|Keywords|1\s+Introduction)', + r'ABSTRACT\s*\n+(.*?)\n+(?:1\.|Introduction|Keywords|1\s+Introduction)', + r'Abstract\s*[:\-]?\s*\n+(.*?)\n+\d+\.?\s*[A-Z]', # Abstract followed by numbered section + ] + + for pattern in abstract_patterns: + match = re.search(pattern, text, re.DOTALL | re.IGNORECASE) + if match: + abstract = match.group(1).strip() + # Clean up: remove excessive whitespace + abstract = re.sub(r'\s+', ' ', abstract) + # Limit length to reasonable abstract size + if len(abstract) > 100 and len(abstract) < 3000: + return abstract + + # Fallback: use first 500 characters after skipping potential title + lines = text.split('\n') + # Skip first few lines (likely title/authors) + text_start = '\n'.join(lines[5:]) if len(lines) > 5 else text + return text_start[:500].strip() + + +# Configuration +ARXIV_QUERY = "cs.LG" # Computer Science - Machine Learning +NUM_TRAIN_PAPERS = 10 + +# File paths +PDF_DIR = "data/pdfs" +TRAIN_META_JSON = "train_papers.json" +TRAIN_SUMMARY_JSON = "summaries_train.json" +REWARD_JSONL = "reward_data.jsonl" + + +def main(): + print("=== PHASE 1-3: Data collection, summarization, preference dataset ===\n") + + # 
Check that enough PDFs already exist (count only .pdf files) + if not os.path.exists(PDF_DIR) or len(glob.glob(f"{PDF_DIR}/*.pdf")) < NUM_TRAIN_PAPERS: + print(f"Downloading {NUM_TRAIN_PAPERS} training papers from ArXiv...") + train_papers = download_arxiv_pdfs(ARXIV_QUERY, NUM_TRAIN_PAPERS, PDF_DIR) + save_papers_to_json(train_papers, TRAIN_META_JSON) + else: + print(f"Using existing PDFs from {PDF_DIR}") + # Extract papers from existing PDFs + pdf_files = glob.glob(f"{PDF_DIR}/*.pdf")[:NUM_TRAIN_PAPERS] + print(f"Found {len(pdf_files)} PDF files") + + train_papers = [] + for pdf_path in pdf_files: + arxiv_id = os.path.basename(pdf_path).replace(".pdf", "") + text, figure_captions = extract_text_and_figures(pdf_path) + + # Extract abstract from the paper text + abstract = extract_abstract_from_text(text) + + train_papers.append( + PaperData( + arxiv_id=arxiv_id, + title=f"Paper {arxiv_id}", + abstract=abstract, + text=text, + figure_captions=figure_captions, + ) + ) + print(f" Extracted abstract for {arxiv_id}: {len(abstract)} chars") + save_papers_to_json(train_papers, TRAIN_META_JSON) + + # Generate summaries + print("\nGenerating summaries for training papers...") + summary_records = generate_summaries_for_papers(train_papers, TRAIN_SUMMARY_JSON) + + # Build reward modeling data with chosen/rejected labels + print("\nBuilding reward modeling dataset via ROUGE-based preference labels...") + auto_label_preferences_with_rouge(summary_records, REWARD_JSONL) + + print("\n=== Data preparation complete! 
===") + + +if __name__ == "__main__": + main() diff --git a/data/pdfs/2601.05103v1.pdf b/data/pdfs/2601.05103v1.pdf new file mode 100644 index 0000000..013823c Binary files /dev/null and b/data/pdfs/2601.05103v1.pdf differ diff --git a/data/pdfs/2601.05104v1.pdf b/data/pdfs/2601.05104v1.pdf new file mode 100644 index 0000000..98c271c Binary files /dev/null and b/data/pdfs/2601.05104v1.pdf differ diff --git a/data/pdfs/2601.05106v1.pdf b/data/pdfs/2601.05106v1.pdf new file mode 100644 index 0000000..76e5cd8 Binary files /dev/null and b/data/pdfs/2601.05106v1.pdf differ diff --git a/data/pdfs/2601.05111v1.pdf b/data/pdfs/2601.05111v1.pdf new file mode 100644 index 0000000..029416e Binary files /dev/null and b/data/pdfs/2601.05111v1.pdf differ diff --git a/data/pdfs/2601.05143v1.pdf b/data/pdfs/2601.05143v1.pdf new file mode 100644 index 0000000..c5eb6b7 Binary files /dev/null and b/data/pdfs/2601.05143v1.pdf differ diff --git a/data/pdfs/2601.05163v1.pdf b/data/pdfs/2601.05163v1.pdf new file mode 100644 index 0000000..3df4f01 Binary files /dev/null and b/data/pdfs/2601.05163v1.pdf differ diff --git a/data/pdfs/2601.05167v1.pdf b/data/pdfs/2601.05167v1.pdf new file mode 100644 index 0000000..a87e814 Binary files /dev/null and b/data/pdfs/2601.05167v1.pdf differ diff --git a/data/pdfs/2601.05170v1.pdf b/data/pdfs/2601.05170v1.pdf new file mode 100644 index 0000000..7b6ea15 Binary files /dev/null and b/data/pdfs/2601.05170v1.pdf differ diff --git a/data/pdfs/2601.05171v1.pdf b/data/pdfs/2601.05171v1.pdf new file mode 100644 index 0000000..a7ef883 Binary files /dev/null and b/data/pdfs/2601.05171v1.pdf differ diff --git a/data/pdfs/2601.05184v1.pdf b/data/pdfs/2601.05184v1.pdf new file mode 100644 index 0000000..48bd011 Binary files /dev/null and b/data/pdfs/2601.05184v1.pdf differ diff --git a/data/pdfs/2601.05192v1.pdf b/data/pdfs/2601.05192v1.pdf new file mode 100644 index 0000000..3645b13 Binary files /dev/null and b/data/pdfs/2601.05192v1.pdf differ diff --git 
a/data_utils.py b/data_utils.py new file mode 100644 index 0000000..2b60492 --- /dev/null +++ b/data_utils.py @@ -0,0 +1,145 @@ +""" +Data utilities for downloading papers from ArXiv and extracting text/figures from PDFs. +""" +import os +import json +from dataclasses import dataclass +from typing import List, Tuple + +import arxiv +import fitz # PyMuPDF + + +@dataclass +class PaperData: + """Data structure for storing paper information.""" + arxiv_id: str + title: str + abstract: str + text: str + figure_captions: List[str] + + +def download_arxiv_pdfs(query: str, num_papers: int, out_dir: str) -> List[PaperData]: + """ + Download papers from ArXiv and extract their content. + + Args: + query: ArXiv query category (e.g., "cs.LG") + num_papers: Number of papers to download + out_dir: Output directory for PDFs + + Returns: + List of PaperData objects containing paper information + """ + os.makedirs(out_dir, exist_ok=True) + + search = arxiv.Search( + query=f"cat:{query}", + max_results=num_papers, + sort_by=arxiv.SortCriterion.SubmittedDate, + sort_order=arxiv.SortOrder.Descending, + ) + + papers: List[PaperData] = [] + for result in search.results(): + arxiv_id = result.get_short_id() + title = result.title + abstract = result.summary + pdf_path = os.path.join(out_dir, f"{arxiv_id}.pdf") + + print(f"Downloading {arxiv_id}: {title[:80]}...") + result.download_pdf(filename=pdf_path) + + text, figure_captions = extract_text_and_figures(pdf_path) + papers.append( + PaperData( + arxiv_id=arxiv_id, + title=title, + abstract=abstract, + text=text, + figure_captions=figure_captions, + ) + ) + + return papers + + +def extract_text_and_figures(pdf_path: str) -> Tuple[str, List[str]]: + """ + Extract text and figure captions from a PDF file. + + This is a simple extractor that treats any line starting with 'Figure' or 'Fig.' + as a caption. It's not perfect but works well for most academic papers. 
+ + Args: + pdf_path: Path to the PDF file + + Returns: + Tuple of (full_text, list of figure captions) + """ + doc = fitz.open(pdf_path) + all_text_parts = [] + figure_captions = [] + + for page in doc: + t = page.get_text("text") + all_text_parts.append(t) + + for line in t.splitlines(): + stripped = line.strip() + if stripped.startswith("Figure ") or stripped.startswith("Fig. "): + figure_captions.append(stripped) + + doc.close() + full_text = "\n".join(all_text_parts) + return full_text, figure_captions + + +def save_papers_to_json(papers: List[PaperData], path: str): + """ + Save paper data to a JSON file. + + Args: + papers: List of PaperData objects + path: Output JSON file path + """ + data = [] + for p in papers: + data.append( + { + "arxiv_id": p.arxiv_id, + "title": p.title, + "abstract": p.abstract, + "text": p.text, + "figure_captions": p.figure_captions, + } + ) + with open(path, "w", encoding="utf-8") as f: + json.dump(data, f, ensure_ascii=False, indent=2) + + +def load_papers_from_json(path: str) -> List[PaperData]: + """ + Load paper data from a JSON file. + + Args: + path: Input JSON file path + + Returns: + List of PaperData objects + """ + with open(path, "r", encoding="utf-8") as f: + data = json.load(f) + papers = [] + for d in data: + papers.append( + PaperData( + arxiv_id=d["arxiv_id"], + title=d["title"], + abstract=d["abstract"], + text=d["text"], + figure_captions=d["figure_captions"], + ) + ) + return papers diff --git a/evaluation_results.json b/evaluation_results.json new file mode 100644 index 0000000..1110815 --- /dev/null +++ b/evaluation_results.json @@ -0,0 +1,452 @@ +[ + { + "arxiv_id": "2601.05171v1", + "title": "Paper 2601.05171v1", + "abstract": "Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. 
To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process- based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory- operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro. Correspondence: Team Leader at lizy@memtensor.cn Author Legend: †Corresponding author", + "summary_1": "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. 
The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. **MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", + "summary_2": "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. 
This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. **PersonaTree:** A globally maintained tree that captures the user's long-term profile, with", + "rouge_1": { + "rouge1": 0.5126903553299492, + "rouge2": 0.3010204081632653, + "rougeL": 0.3350253807106599, + "rougeLsum": 0.4010152284263959 + }, + "rouge_2": { + "rouge1": 0.5061425061425061, + "rouge2": 0.23209876543209876, + "rougeL": 0.285012285012285, + "rougeLsum": 0.32923832923832924 + }, + "bertscore_1": { + "precision": [ + 0.8535144925117493 + ], + "recall": [ + 0.8573517799377441 + ], + "f1": [ + 0.8554288744926453 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8539573550224304 + ], + "recall": [ + 0.8529762029647827 + ], + "f1": [ + 0.8534665107727051 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.013054322451353073, + "reward_2": -0.023740937933325768 + }, + { + "arxiv_id": "2601.05111v1", + "title": "Paper 2601.05111v1", + "abstract": "LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assess- ments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. 
This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation. Project Page: https://github.com/ModalityDance/Awesome-Agent-as-a-Judge", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Survey on Agent-as-a-Judge\n\n**Background:** The rapid advancement of Large Language Models (LLMs) has revolutionized AI evaluation, enabling scalable and near-human quality assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n\n**Problem Statement:** The limitations of LLM-as-a-Judge have led to the transition to Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Contributions:**\n\n1. 
**Characterizing the shift from LLM-as-a-Judge to Agent-as-a-Judge**: The authors identify and summarize the agentic judges' development trend into three progressive stages with distinct levels of autonomy and adaptability.\n2. **Organizing core methodologies**: The authors categorize the core methodologies into five key parts according to the agent's abilities and survey their applications across general and professional domains.\n3. **Analyzing frontier challenges and identifying promising research directions**:", + "summary_2": "The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which include inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has led to the development of Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Points:**\n\n1. LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments, but its reliability is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n2. Agent-as-a-Judge has emerged as a more robust and verifiable alternative, leveraging planning, tool-augmented verification, multi-agent collaboration, and persistent memory.\n3. The authors present the first comprehensive survey on Agent-as-a-Judge, summarizing the shift from LLM-as-a-Judge to Agent-as-a-Judge and organizing core methodologies into five key parts according to agent's abilities.\n4. 
The authors survey applications across general and professional domains and analyze frontier challenges and identify", + "rouge_1": { + "rouge1": 0.6112600536193029, + "rouge2": 0.4420485175202156, + "rougeL": 0.4879356568364611, + "rougeLsum": 0.5040214477211796 + }, + "rouge_2": { + "rouge1": 0.581151832460733, + "rouge2": 0.39999999999999997, + "rougeL": 0.3979057591623037, + "rougeLsum": 0.4293193717277487 + }, + "bertscore_1": { + "precision": [ + 0.8799600005149841 + ], + "recall": [ + 0.8951518535614014 + ], + "f1": [ + 0.8874909281730652 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8950091600418091 + ], + "recall": [ + 0.8787307739257812 + ], + "f1": [ + 0.8867952823638916 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.01309164334088564, + "reward_2": -0.040604256093502045 + }, + { + "arxiv_id": "2601.05163v1", + "title": "Paper 2601.05163v1", + "abstract": "Document Question Answering (DocQA) fo- cuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and pro- pose a tool-driven agent framework that ex- plicitly models document exploration and com- prehension. To enable end-to-end training of such agents, we introduce an Exploration-then- Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the trained models on two long-context document understanding benchmarks, MMLongBench- Doc and DocBench, show their effectiveness. 
Further analysis provides valuable insights for the agentic tool design and synthetic data.", + "summary_1": "The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To address the scarcity of high-quality training data for DocQA, they introduce an Exploration-then-Synthesis data synthesis pipeline that progressively enhances QA pairs from easy to hard.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\n\nThe paper conducts extensive experiments on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, demonstrating the effectiveness", + "summary_2": "The paper \"DocDancer: Towards Agentic Document-Grounded Information Seeking\" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilizes an end-to-end trained open-source Doc agent. 
The authors introduce a tool-driven agent framework that explicitly models document exploration and comprehension, addressing the scarcity of high-quality training data for DocQA.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe paper also proposes an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. This pipeline first explores a source document through intent-guided, tool-augmented interactions to collect grounded evidence and then synthesizes high-quality document-grounded QA pairs via multi-observation reasoning.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset using two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. 
Despite being trained with only 5,000 instances, both variants achieve competitive performance,", + "rouge_1": { + "rouge1": 0.5093167701863355, + "rouge2": 0.34375, + "rougeL": 0.36645962732919257, + "rougeLsum": 0.37888198757763975 + }, + "rouge_2": { + "rouge1": 0.4342507645259938, + "rouge2": 0.20923076923076922, + "rougeL": 0.2568807339449541, + "rougeLsum": 0.28746177370030584 + }, + "bertscore_1": { + "precision": [ + 0.8635969758033752 + ], + "recall": [ + 0.8913978338241577 + ], + "f1": [ + 0.8772771954536438 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8476544618606567 + ], + "recall": [ + 0.8700016736984253 + ], + "f1": [ + 0.8586826920509338 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.02074364572763443, + "reward_2": -0.035873666405677795 + }, + { + "arxiv_id": "2601.05104v1", + "title": "Paper 2601.05104v1", + "abstract": "This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increases its emphasis on protecting the public interest. 
Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses. Together, our findings demonstrate that the emotional tone people apply in human–AI interactions not only shape ChatGPT’s outputs but also carry over into subsequent human–human communication.", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\n\n**Research Question:** Does the emotional tone of user-generated prompts influence the quality and content of ChatGPT's responses, and do these effects carry over into human-human communication?\n\n**Methodology:** The researchers conducted a between-subject experiment where participants interacted with ChatGPT-4.0 on two tasks: writing a public response and addressing an ethical dilemma. Participants were randomly assigned to one of three conditions: praise, blame, or anger.\n\n**Findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to the neutral condition.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers compared to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. 
Participants used more negative, hostile, and disappointing expressions in human-human communication after interactions where they blamed rather than praised ChatGPT.\n\n**Conclusion:**", + "summary_2": "The research paper \"Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\" explores the effect of emotional tone on human-AI interactions, specifically with the large language model ChatGPT (GPT-4.0). The study conducted a between-subject experiment where participants interacted with ChatGPT on two tasks: writing a public response and addressing an ethical dilemma.\n\n**Key findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to interactions where participants maintained a neutral tone.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers relative to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. The study also found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses.\n\n**Figures:**\n\n* **Figure 1:** Improvement in ChatGPT's answers based on conditions. 
The figure shows that ChatGPT improved its answers when", + "rouge_1": { + "rouge1": 0.675531914893617, + "rouge2": 0.49732620320855614, + "rougeL": 0.5531914893617021, + "rougeLsum": 0.5904255319148937 + }, + "rouge_2": { + "rouge1": 0.7405405405405405, + "rouge2": 0.576086956521739, + "rougeL": 0.6108108108108108, + "rougeLsum": 0.6054054054054054 + }, + "bertscore_1": { + "precision": [ + 0.877902626991272 + ], + "recall": [ + 0.9065370559692383 + ], + "f1": [ + 0.8919901251792908 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8994222283363342 + ], + "recall": [ + 0.9299557209014893 + ], + "f1": [ + 0.9144341945648193 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.02505182847380638, + "reward_2": -0.0208128672093153 + }, + { + "arxiv_id": "2601.05106v1", + "title": "Paper 2601.05106v1", + "abstract": "∗Work done at Meta, †Joint last author\nLarge language models (LLMs) exhibit strengths across diverse domains. However, achieving strong\nperformance across these domains with a single general-purpose model typically requires scaling\nto sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller\ndomain-specialized models are much more efficient, they struggle to generalize beyond their training\ndistributions. To address this dilemma, we propose FusionRoute, a robus", + "summary_1": "Here's a clear and concise summary of the research paper for a graduate student:\n\n**Title:** FusionRoute: A Robust and Effective Token-Level Multi-LLM Collaboration Framework\n\n**Background:** Large Language Models (LLMs) have shown strong performance across various domains, but achieving strong performance across these domains with a single general-purpose model is often expensive and inefficient. 
Smaller domain-specialized models are more efficient but struggle to generalize beyond their training distributions.\n\n**Problem Statement:** The authors aim to develop a framework that can robustly coordinate and leverage the complementary expertise of multiple specialized LLMs to achieve general-purpose performance.\n\n**Proposed Solution:** The authors propose FusionRoute, a token-level multi-LLM collaboration framework that consists of a lightweight router and a trainable complementary generator. The router selects the most suitable expert at each decoding step, and the complementary generator refines or corrects the selected expert's next-token distribution via logit addition.\n\n**Key Contributions:**\n\n1. **Theoretical Analysis:** The authors provide a theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot realize the optimal decoding policy unless strong global coverage assumptions hold.\n2. **Empirical Evaluation:** FusionRoute outperforms sequence- and token-level collaboration, model merging, and direct fine-t", + "summary_2": "The research paper \"Paper 2601.05106v1\" proposes a new framework called FusionRoute for achieving strong performance across diverse domains with a single general-purpose model, while remaining efficient and competitive with domain experts. The authors argue that existing methods, such as sequence-level collaboration, model merging, and direct fine-tuning, have limitations.\n\nThe paper's main contributions are:\n\n1. **FusionRoute**: A robust and effective token-level multi-LLM collaboration framework that selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition.\n2. 
**Theoretical analysis**: A theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot in general realize the optimal decoding policy unless strong global coverage assumptions hold.\n3. **Empirical results**: Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.\n\nThe paper highlights the importance of token-level collaboration, which allows multiple LLMs to jointly produce each token during generation, making it a more fine-grained and dynamic", + "rouge_1": { + "rouge1": 0.3088235294117647, + "rouge2": 0.2, + "rougeL": 0.27941176470588236, + "rougeLsum": 0.3088235294117647 + }, + "rouge_2": { + "rouge1": 0.19548872180451124, + "rouge2": 0.09090909090909091, + "rougeL": 0.12781954887218047, + "rougeLsum": 0.17293233082706766 + }, + "bertscore_1": { + "precision": [ + 0.8201289772987366 + ], + "recall": [ + 0.8799529075622559 + ], + "f1": [ + 0.8489883542060852 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8036170601844788 + ], + "recall": [ + 0.8410658836364746 + ], + "f1": [ + 0.8219152092933655 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.017466777935624123, + "reward_2": -0.03491535037755966 + }, + { + "arxiv_id": "2601.05103v1", + "title": "Paper 2601.05103v1", + "abstract": "2 L3S Research Center, Leibniz University Hannover, Hannover, Germany\nzhiyin.tan@l3s.de\nAbstract. Understanding the role of citations is essential for research\nassessment and citation-aware digital libraries. 
However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in auto classification due to a dilemma between fine-grained type distinctions and practical classificatio", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\n\n**Authors:** Changxu Duan and Zhiyin Tan\n\n**Summary:** The authors propose a new framework, SOFT, for citation classification that explicitly separates the intent of a citation (why a work is cited) from its content type (what part is cited). This framework is grounded in semantic role theory and is designed to improve the clarity, consistency, and generalizability of citation classification.\n\n**Background:** Traditional citation metrics, such as counts or h-index, are limited in their ability to capture the nuanced ways in which works are cited, interpreted, or contested. Citation classification has emerged as a critical task in computational bibliometrics and information science to unlock the meaning within scholarly dialogue.\n\n**Limitations of existing frameworks:** Current citation classification practices face persistent limitations, including the conflation of distinct citation intents and content types, and the lack of consistent semantic definitions. These ambiguities hinder downstream applications such as citation-based retrieval and impact analysis.\n\n**SOFT framework:** The authors propose a two-dimensional framework that explicitly distinguishes the predicative intent of a citation from its cited content type. 
This framework", + "summary_2": "The paper \"Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\" proposes a new annotation framework, SOFT, that explicitly separates citation intent (why a work is cited) from cited content type (what part is cited). The authors argue that existing frameworks conflate these two dimensions, hindering their effectiveness in auto classification. SOFT is grounded in the principles of semantic roles, offering a more nuanced understanding of how authors structure argumentation.\n\nThe paper introduces two key dimensions: Cited Content Type and Citation Intent. Cited Content Type refers to the specific part of a work that is being cited (e.g., method, result, or background), while Citation Intent refers to the functional purpose of the citation (e.g., to support, to critique, or to provide background). This distinction enables clearer definitions, improved annotation consistency, and stronger support for Large Language Model (LLM)-based classification.\n\nThe authors validate SOFT through a comprehensive re-annotation of the ACL-ARC dataset and a cross-domain evaluation on ACT2, covering 19 scientific fields. 
Results show that SOFT yields higher inter-model and human-LLM agreement, improves classification accuracy for both zero-shot and fine-tuned models, and generalizes more robustly across domains than existing", + "rouge_1": { + "rouge1": 0.2509225092250923, + "rouge2": 0.08178438661710036, + "rougeL": 0.18450184501845018, + "rougeLsum": 0.22140221402214022 + }, + "rouge_2": { + "rouge1": 0.2518518518518518, + "rouge2": 0.11194029850746268, + "rougeL": 0.1925925925925926, + "rougeLsum": 0.22962962962962966 + }, + "bertscore_1": { + "precision": [ + 0.8247026801109314 + ], + "recall": [ + 0.8273060321807861 + ], + "f1": [ + 0.8260022401809692 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8267620801925659 + ], + "recall": [ + 0.8196247220039368 + ], + "f1": [ + 0.8231779336929321 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.028803709894418716, + "reward_2": -0.025009701028466225 + }, + { + "arxiv_id": "2601.05192v1", + "title": "Paper 2601.05192v1", + "abstract": "Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foun- dational step in tasks such as knowledge graph construction, question-answering, and informa- tion extraction. Our method, LELA, is a mod- ular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowl- edge bases and LLMs, without any fine-tuning phase. 
Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation\n\n**Background:** Entity linking is a fundamental task in natural language processing that involves mapping ambiguous mentions in text to entities in a knowledge base. Traditional entity linking methods require fine-tuning on the target domain and knowledge base, which can be computationally expensive and require high-quality labeled data.\n\n**Problem Statement:** The authors propose a true zero-shot entity linking approach, where the method can work out of the box for any domain without the need for fine-tuning.\n\n**Method:** The proposed approach, LELA, is a coarse-to-fine method that leverages the capabilities of large language models (LLMs). It consists of three stages:\n\n1. **Initial entity-candidate set generation**: LELA generates a set of potential entities based on the input text.\n2. **Pointwise re-ranking**: The top-k candidates are selected through re-ranking.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final selection through reasoning over the context and entity descriptions.\n\n**Advantages:** LELA is modular, LLM-, reranker-, and retrie", + "summary_2": "The paper proposes a novel entity linking approach, called LELA, which stands for Large Language Model-based Entity Linking Approach with Zero-Shot Domain Adaptation. Entity linking is the task of mapping ambiguous mentions in text to entities in a knowledge base (KB). 
The authors aim to develop a true zero-shot entity linking method, which means it works without the need for fine-tuning, and can handle different domains, knowledge bases, and large language models (LLMs).\n\nThe proposed approach, LELA, is a coarse-to-fine method that consists of three stages:\n\n1. **Initial entity-candidate set generation**: LELA generates an initial set of candidate entities for the ambiguous mention.\n2. **Pointwise re-ranking**: The top-k candidates are selected through pointwise re-ranking, where the model calculates a score for each candidate entity based on its relevance to the context and entity descriptions.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final candidate selection, where the model uses reasoning over the context and entity descriptions to select the most relevant entity.\n\nLELA is modular, agnostic to the LLM, reranker, and retriever, making it applicable to both open-source and proprietary models.\n\nThe authors", + "rouge_1": { + "rouge1": 0.35416666666666663, + "rouge2": 0.1888111888111888, + "rougeL": 0.25, + "rougeLsum": 0.29861111111111105 + }, + "rouge_2": { + "rouge1": 0.32764505119453924, + "rouge2": 0.15120274914089346, + "rougeL": 0.21160409556313994, + "rougeLsum": 0.22525597269624573 + }, + "bertscore_1": { + "precision": [ + 0.8313204050064087 + ], + "recall": [ + 0.854534387588501 + ], + "f1": [ + 0.842767596244812 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8334040641784668 + ], + "recall": [ + 0.8482416868209839 + ], + "f1": [ + 0.840757429599762 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.01752193458378315, + "reward_2": -0.03325437754392624 + }, + { + "arxiv_id": "2601.05143v1", + "title": "Paper 2601.05143v1", + "abstract": "Visual question answering for crop disease analysis requires accurate visual under- standing and reliable 
language generation. This work presents a lightweight vision–language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision–language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Lightweight and Explainable Vision–Language Framework for Crop Disease Visual Question Answering\n\n**Background:** Crop disease diagnosis is a critical task in modern agriculture, and early identification of disease symptoms is essential to reduce damage and support effective intervention. 
However, traditional methods rely on expert inspection and recommendations, which can be time-consuming and limited in regions with expert access.\n\n**Problem Statement:** Existing automated disease detection methods often require controlled imaging conditions, are computationally heavy, and fail to provide detailed textual descriptions of visual attributes, disease progression stages, and contextual information.\n\n**Research Question:** Can a lightweight Visual Question Answering (VQA) framework be established for intelligent and practical plant disease identification?\n\n**Methodology:** The proposed framework combines a Swin Transformer vision encoder with sequence-to-sequence language decoders and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Contributions:**\n\n1. A lightweight VQA framework for crop disease identification that achieves high accuracy and strong performance on BLEU, ROUGE, and BERTScore metrics.\n2. Explainability is assessed using Grad-CAM and token-level attribution, demonstrating robust performance under diverse user-driven queries.\n3.", + "summary_2": "The research paper presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders, and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Findings:**\n\n1. **High Accuracy for Crop and Disease Identification:** The model achieves high accuracy for both crop and disease identification, demonstrating its effectiveness in visual question answering for crop disease analysis.\n2. **Strong Performance on BLEU, ROUGE, and BERTScore:** The framework also achieves strong performance on these metrics, indicating its ability to generate high-quality language output.\n3. 
**Explainability:** Explainability is assessed using Grad-CAM and token-level attribution, demonstrating the model's ability to provide transparent and interpretable results.\n4. **Robust Performance under Diverse User-Driven Queries:** Qualitative results demonstrate robust performance under diverse user-driven queries, highlighting the model's ability to handle real-world scenarios.\n\n**Key Contributions:**\n\n1. **Lightweight Vision-Language Framework:** The proposed framework is lightweight, making it suitable for real-world deployment in agricultural environments.\n2. **Task-Specific Visual Pretraining:** The study highlights the effectiveness of task", + "rouge_1": { + "rouge1": 0.5195530726256984, + "rouge2": 0.33707865168539325, + "rougeL": 0.3966480446927374, + "rougeLsum": 0.47486033519553067 + }, + "rouge_2": { + "rouge1": 0.6455331412103746, + "rouge2": 0.5101449275362319, + "rougeL": 0.484149855907781, + "rougeLsum": 0.5302593659942364 + }, + "bertscore_1": { + "precision": [ + 0.8561044931411743 + ], + "recall": [ + 0.8864502310752869 + ], + "f1": [ + 0.8710131645202637 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.882099449634552 + ], + "recall": [ + 0.90748131275177 + ], + "f1": [ + 0.8946103453636169 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.026092007756233215, + "reward_2": -0.01505814865231514 + }, + { + "arxiv_id": "2601.05184v1", + "title": "Paper 2601.05184v1", + "abstract": "The rapid advancement of large language mod- els (LLMs) has led to growing interest in using synthetic data to train future models. How- ever, this creates a self-consuming retraining loop, where models are trained on their own outputs and may cause performance drops and induce emerging biases. 
In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.", + "summary_1": "The research paper \"Self-Consuming Performative Loop for Large Language Models\" explores the concept of a self-consuming performative loop (SCPL) in large language models (LLMs). The authors introduce the idea of a dynamic system driven by user feedback, where previously deployed LLMs influence the data they generate, leading to a performative prediction. They investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback.\n\nThe paper focuses on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, the authors find that the performative loop increases preference bias and decreases disparate bias. 
They design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.\n\nKey findings:\n\n1. Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n2. Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n3. Performative feedback accelerates bias amplification in incremental fine-tuning loops, whereas this phenomenon is less obvious in retraining loops.\n\nThe authors' research provides insights that inform safer deployment of", + "summary_2": "This summary is based on the provided research paper excerpt, focusing on the concept of Self-Consuming Performative Loop (SCPL) in Large Language Models (LLMs).\n\n**Introduction**\n\nThe paper highlights the growing interest in using synthetic data to train future LLMs, which leads to a self-consuming retraining loop where models are trained on their own outputs and may cause performance drops and induce emerging biases. This self-consuming performative loop is particularly relevant in dynamic systems driven by user feedback, where the model's performance influences future data.\n\n**Research Questions**\n\nThe authors aim to investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. They focus on two types of loops:\n\n1. The typical retraining setting\n2. 
The incremental fine-tuning setting, which is largely underexplored\n\n**Key Findings**\n\nThe authors' experiments on three real-world tasks reveal that:\n\n* The performative loop increases preference bias and decreases disparate bias.\n* Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n* Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n* Performative feedback accelerates bias amplification in incremental", + "rouge_1": { + "rouge1": 0.602803738317757, + "rouge2": 0.39436619718309857, + "rougeL": 0.42990654205607476, + "rougeLsum": 0.4626168224299066 + }, + "rouge_2": { + "rouge1": 0.6099290780141844, + "rouge2": 0.39904988123515445, + "rougeL": 0.46335697399527187, + "rougeLsum": 0.49172576832151305 + }, + "bertscore_1": { + "precision": [ + 0.8966838717460632 + ], + "recall": [ + 0.8732770681381226 + ], + "f1": [ + 0.8848257064819336 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8724937438964844 + ], + "recall": [ + 0.868453860282898 + ], + "f1": [ + 0.8704691529273987 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.03423023596405983, + "reward_2": -0.013444548472762108 + }, + { + "arxiv_id": "2601.05167v1", + "title": "Paper 2601.05167v1", + "abstract": "Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. 
To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM improves the average accuracy from 42.5% to 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers. Our code is available at https://github.com/Chengsong-Huang/RelayLLM.", + "summary_1": "Here's a clear and concise summary of the research paper:\n\n**Title:** RelayLLM: Efficient Reasoning via Collaborative Decoding\n\n**Problem Statement:** Large Language Models (LLMs) are powerful but computationally expensive, while Small Language Models (SLMs) are efficient but lack reasoning capacity. Existing collaborative approaches, such as cascading or routing, often operate at a coarse granularity, leading to significant computational waste.\n\n**Proposed Solution:** RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike static routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process.\n\n**Key Components:**\n\n1. 
**Two-stage training framework:** Supervised warm-up phase to teach the model the syntactic structure of calling commands, followed by a reinforcement learning stage using Group Relative Policy Optimization (GRPO) training.\n2. **Context-aware reward:** Designed to balance independence with necessary help-seeking, penalizing both wasted costs and avoidable errors.\n3. **Token-level collaborative decoding:** The SLM generates a special command token () to pause its own generation and invoke the LLM for a specified number of tokens", + "summary_2": "The research paper \"RelayLLM: Efficient Reasoning via Collaborative Decoding\" proposes a novel framework for efficient reasoning via token-level collaborative decoding. The authors address the challenge of deploying Large Language Models (LLMs) for complex reasoning, which is often hindered by high computational costs and latency, while Small Language Models (SLMs) typically lack the necessary reasoning capacity.\n\nThe authors present RelayLLM, a framework that empowers the SLM to act as an active controller, dynamically invoking the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process. They introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking.\n\nThe empirical results on six benchmarks demonstrate the effectiveness of RelayLLM, achieving an average accuracy of 49.52% and reducing the performance gap between the small model and the large one. The authors also show that RelayLLM invokes the large model for only 1.07% of the total generated tokens, resulting in a 98.2% reduction in token costs compared to a performance-matched router.\n\nKey findings:\n\n1. 
**Efficient reasoning via token-level collaborative decoding**: RelayLLM emp", + "rouge_1": { + "rouge1": 0.5764411027568922, + "rouge2": 0.37783375314861456, + "rougeL": 0.46115288220551376, + "rougeLsum": 0.4761904761904762 + }, + "rouge_2": { + "rouge1": 0.6683046683046683, + "rouge2": 0.4938271604938272, + "rougeL": 0.5503685503685503, + "rougeLsum": 0.574938574938575 + }, + "bertscore_1": { + "precision": [ + 0.8771227598190308 + ], + "recall": [ + 0.8711225986480713 + ], + "f1": [ + 0.8741124272346497 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.9148159027099609 + ], + "recall": [ + 0.8935967683792114 + ], + "f1": [ + 0.9040818214416504 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.024731554090976715, + "reward_2": -0.020340878516435623 + } +] \ No newline at end of file diff --git a/gpu_environment.yml b/gpu_environment.yml new file mode 100644 index 0000000..dcb422e --- /dev/null +++ b/gpu_environment.yml @@ -0,0 +1,161 @@ +name: hw7_311 +channels: + - defaults +dependencies: + - _libgcc_mutex=0.1=main + - _openmp_mutex=5.1=1_gnu + - bzip2=1.0.8=h5eee18b_6 + - ca-certificates=2025.12.2=h06a4308_0 + - expat=2.7.3=h7354ed3_4 + - ld_impl_linux-64=2.44=h153f514_2 + - libexpat=2.7.3=h7354ed3_4 + - libffi=3.4.4=h6a678d5_1 + - libgcc=15.2.0=h69a1729_7 + - libgcc-ng=15.2.0=h166f726_7 + - libgomp=15.2.0=h4751f2c_7 + - libnsl=2.0.0=h5eee18b_0 + - libstdcxx=15.2.0=h39759b7_7 + - libstdcxx-ng=15.2.0=hc03a8fd_7 + - libuuid=1.41.5=h5eee18b_0 + - libxcb=1.17.0=h9b100fa_0 + - libzlib=1.3.1=hb25bd0a_0 + - ncurses=6.5=h7934f7d_0 + - openssl=3.0.18=hd6dcaed_0 + - pip=25.3=pyhc872135_0 + - pthread-stubs=0.3=h0ce48e5_1 + - python=3.11.14=h6fa692b_0 + - readline=8.3=hc2a1206_0 + - setuptools=80.9.0=py311h06a4308_0 + - sqlite=3.51.1=he0a8d7e_0 + - tk=8.6.15=h54e0aa7_0 + - wheel=0.45.1=py311h06a4308_0 + - xorg-libx11=1.8.12=h9b100fa_1 + - 
xorg-libxau=1.0.12=h9b100fa_0 + - xorg-libxdmcp=1.1.5=h9b100fa_0 + - xorg-xorgproto=2024.1=h5eee18b_1 + - xz=5.6.4=h5eee18b_1 + - zlib=1.3.1=hb25bd0a_0 + - pip: + - absl-py==2.3.1 + - accelerate==1.12.0 + - aiohappyeyeballs==2.6.1 + - aiohttp==3.13.3 + - aiosignal==1.4.0 + - annotated-types==0.7.0 + - anyio==3.7.1 + - arxiv==2.4.0 + - attrs==25.4.0 + - bert-score==0.3.13 + - bitsandbytes==0.49.1 + - certifi==2026.1.4 + - charset-normalizer==3.4.4 + - click==8.3.1 + - contourpy==1.3.3 + - cut-cross-entropy==25.1.1 + - cycler==0.12.1 + - datasets==4.3.0 + - diffusers==0.36.0 + - dill==0.3.8 + - distro==1.9.0 + - docstring-parser==0.17.0 + - evaluate==0.4.6 + - faiss-cpu==1.7.4 + - fastapi==0.104.1 + - feedparser==6.0.12 + - filelock==3.20.3 + - fonttools==4.61.1 + - frozenlist==1.8.0 + - fsspec==2024.3.1 + - h11==0.16.0 + - hf-transfer==0.1.9 + - hf-xet==1.2.0 + - httpcore==1.0.9 + - httpx==0.28.1 + - huggingface-hub==0.36.0 + - idna==3.11 + - importlib-metadata==8.7.1 + - jinja2==3.1.6 + - jiter==0.12.0 + - joblib==1.5.3 + - kiwisolver==1.4.9 + - markdown-it-py==4.0.0 + - markupsafe==3.0.3 + - matplotlib==3.10.8 + - mdurl==0.1.2 + - mpmath==1.3.0 + - msgspec==0.20.0 + - multidict==6.7.0 + - multiprocess==0.70.16 + - networkx==3.6.1 + - nltk==3.9.2 + - numpy==2.4.1 + - nvidia-cublas-cu12==12.8.4.1 + - nvidia-cuda-cupti-cu12==12.8.90 + - nvidia-cuda-nvrtc-cu12==12.8.93 + - nvidia-cuda-runtime-cu12==12.8.90 + - nvidia-cudnn-cu12==9.10.2.21 + - nvidia-cufft-cu12==11.3.3.83 + - nvidia-cufile-cu12==1.13.1.3 + - nvidia-curand-cu12==10.3.9.90 + - nvidia-cusolver-cu12==11.7.3.90 + - nvidia-cusparse-cu12==12.5.8.93 + - nvidia-cusparselt-cu12==0.7.1 + - nvidia-nccl-cu12==2.27.5 + - nvidia-nvjitlink-cu12==12.8.93 + - nvidia-nvshmem-cu12==3.3.20 + - nvidia-nvtx-cu12==12.8.90 + - openai==2.15.0 + - packaging==25.0 + - pandas==2.3.3 + - peft==0.18.1 + - pillow==12.1.0 + - propcache==0.4.1 + - protobuf==3.20.3 + - psutil==7.2.1 + - pyarrow==22.0.0 + - pyarrow-hotfix==0.7 + - 
pydantic==2.5.0 + - pydantic-core==2.14.1 + - pygments==2.19.2 + - pymupdf==1.23.8 + - pymupdfb==1.23.7 + - pyparsing==3.3.1 + - python-dateutil==2.9.0.post0 + - pytz==2025.2 + - pyyaml==6.0.3 + - regex==2025.11.3 + - requests==2.32.5 + - rich==14.2.0 + - rouge-score==0.1.2 + - safetensors==0.7.0 + - scikit-learn==1.8.0 + - scipy==1.15.3 + - sentence-transformers==3.3.1 + - sentencepiece==0.2.1 + - sgmllib3k==1.0.0 + - six==1.17.0 + - sniffio==1.3.1 + - starlette==0.27.0 + - sympy==1.14.0 + - threadpoolctl==3.6.0 + - tokenizers==0.22.2 + - torch==2.9.1 + - torchao==0.15.0 + - torchvision==0.24.1 + - tqdm==4.67.1 + - transformers==4.57.3 + - triton==3.5.1 + - trl==0.24.0 + - typeguard==4.4.4 + - typing-extensions==4.15.0 + - tyro==1.0.4 + - tzdata==2025.3 + - unsloth==2026.1.2 + - unsloth-zoo==2026.1.2 + - urllib3==2.6.3 + - uvicorn==0.24.0 + - xformers==0.0.33.post2 + - xxhash==3.6.0 + - yarl==1.22.0 + - zipp==3.23.0 +prefix: /home/jovyan/miniconda3/envs/hw7_311 diff --git a/log_build_data.txt b/log_build_data.txt new file mode 100644 index 0000000..2eb3377 --- /dev/null +++ b/log_build_data.txt @@ -0,0 +1,110 @@ +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ python build_data.py +🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning. +🦥 Unsloth Zoo will now patch everything to make training faster! 
+=== PHASE 1-3: Data collection, summarization, preference dataset === + +Using existing PDFs from data/pdfs +Found 10 PDF files + Extracted abstract for 2601.05171v1: 1546 chars + Extracted abstract for 2601.05111v1: 1288 chars + Extracted abstract for 2601.05163v1: 884 chars + Extracted abstract for 2601.05104v1: 1280 chars + Extracted abstract for 2601.05106v1: 500 chars + Extracted abstract for 2601.05103v1: 500 chars + Extracted abstract for 2601.05192v1: 606 chars + Extracted abstract for 2601.05143v1: 1140 chars + Extracted abstract for 2601.05184v1: 1471 chars + Extracted abstract for 2601.05167v1: 1414 chars + +Generating summaries for training papers... +Loading LLaMA model with unsloth: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit +==((====))== Unsloth 2026.1.2: Fast Llama patching. Transformers: 4.57.3. + \\ /| inference-ai GPU cuda. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux. +O^O/ \_/ \ Torch: 2.9.1+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.5.1 +\ / Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False] + "-____-" Free license: http://github.com/unslothai/unsloth +Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored! + +=== Summarizing paper 1/10: 2601.05171v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper: + +**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems + +**Problem Statement:** Long-term ... +Summary 2 (figure-aware): Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions: + +**Background:** +The authors propose a novel framework, Inside Out, to addres ... + +=== Summarizing paper 2/10: 2601.05111v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** A Survey on Agent-as-a-Judge + +**Background:** The rapid advancement of Large Language Models (LLMs) has re ... 
+Summary 2 (figure-aware): The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which inclu ... + +=== Summarizing paper 3/10: 2601.05163v1 === +Summary 1 (generic): The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-dr ... +Summary 2 (figure-aware): The paper "DocDancer: Towards Agentic Document-Grounded Information Seeking" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilize ... + +=== Summarizing paper 4/10: 2601.05104v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness + +**Research Question:** D ... +Summary 2 (figure-aware): The research paper "Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness" explores the effect of emotional tone on human-AI interactions, specifically with the large langua ... + +=== Summarizing paper 5/10: 2601.05106v1 === +Summary 1 (generic): Here's a clear and concise summary of the research paper for a graduate student: + +**Title:** FusionRoute: A Robust and Effective Token-Level Multi-LLM Collaboration Framework + +**Background:** Large La ... +Summary 2 (figure-aware): The research paper "Paper 2601.05106v1" proposes a new framework called FusionRoute for achieving strong performance across diverse domains with a single general-purpose model, while remaining efficie ... 
+ +=== Summarizing paper 6/10: 2601.05103v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content + +**Authors ... +Summary 2 (figure-aware): The paper "Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content" proposes a new annotation framework, SOFT, that explicitly separates citation intent (why a ... + +=== Summarizing paper 7/10: 2601.05192v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation + +**Background:** Entity linkin ... +Summary 2 (figure-aware): The paper proposes a novel entity linking approach, called LELA, which stands for Large Language Model-based Entity Linking Approach with Zero-Shot Domain Adaptation. Entity linking is the task of map ... + +=== Summarizing paper 8/10: 2601.05143v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** A Lightweight and Explainable Vision–Language Framework for Crop Disease Visual Question Answering + +**Back ... +Summary 2 (figure-aware): The research paper presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence ... + +=== Summarizing paper 9/10: 2601.05184v1 === +Summary 1 (generic): The research paper "Self-Consuming Performative Loop for Large Language Models" explores the concept of a self-consuming performative loop (SCPL) in large language models (LLMs). The authors introduce ... +Summary 2 (figure-aware): This summary is based on the provided research paper excerpt, focusing on the concept of Self-Consuming Performative Loop (SCPL) in Large Language Models (LLMs). 
+ +**Introduction** + +The paper highlight ... + +=== Summarizing paper 10/10: 2601.05167v1 === +Summary 1 (generic): Here's a clear and concise summary of the research paper: + +**Title:** RelayLLM: Efficient Reasoning via Collaborative Decoding + +**Problem Statement:** Large Language Models (LLMs) are powerful but com ... +Summary 2 (figure-aware): The research paper "RelayLLM: Efficient Reasoning via Collaborative Decoding" proposes a novel framework for efficient reasoning via token-level collaborative decoding. The authors address the challen ... + +Building reward modeling dataset via ROUGE-based preference labels... +Wrote preference data to reward_data.jsonl (10 pairs). + +=== Data preparation complete! === \ No newline at end of file diff --git a/log_evaluate.txt b/log_evaluate.txt new file mode 100644 index 0000000..e69de29 diff --git a/log_run_evaluation.txt b/log_run_evaluation.txt new file mode 100644 index 0000000..c47b5b5 --- /dev/null +++ b/log_run_evaluation.txt @@ -0,0 +1,65 @@ +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ python run_evaluation.py + +=== PHASE 5-6: Evaluation and comparison === + +Evaluating 10 papers... +Loading reward model for evaluation... +Reward model loaded on device: cuda +Loaded tokenizer class: DebertaV2TokenizerFast +Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight'] +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
+ +=== Paper 2601.05171v1 === +Reward scores: S1 = -0.013054322451353073 | S2 = -0.023740937933325768 +ROUGE-L: S1 = 0.3350253807106599 | S2 = 0.285012285012285 +BERTScore F1: S1 = 0.8554288744926453 | S2 = 0.8534665107727051 + +=== Paper 2601.05111v1 === +Reward scores: S1 = -0.01309164334088564 | S2 = -0.040604256093502045 +ROUGE-L: S1 = 0.4879356568364611 | S2 = 0.3979057591623037 +BERTScore F1: S1 = 0.8874909281730652 | S2 = 0.8867952823638916 + +=== Paper 2601.05163v1 === +Reward scores: S1 = -0.02074364572763443 | S2 = -0.035873666405677795 +ROUGE-L: S1 = 0.36645962732919257 | S2 = 0.2568807339449541 +BERTScore F1: S1 = 0.8772771954536438 | S2 = 0.8586826920509338 + +=== Paper 2601.05104v1 === +Reward scores: S1 = -0.02505182847380638 | S2 = -0.0208128672093153 +ROUGE-L: S1 = 0.5531914893617021 | S2 = 0.6108108108108108 +BERTScore F1: S1 = 0.8919901251792908 | S2 = 0.9144341945648193 + +=== Paper 2601.05106v1 === +Reward scores: S1 = -0.017466777935624123 | S2 = -0.03491535037755966 +ROUGE-L: S1 = 0.27941176470588236 | S2 = 0.12781954887218047 +BERTScore F1: S1 = 0.8489883542060852 | S2 = 0.8219152092933655 + +=== Paper 2601.05103v1 === +Reward scores: S1 = -0.028803709894418716 | S2 = -0.025009701028466225 +ROUGE-L: S1 = 0.18450184501845018 | S2 = 0.1925925925925926 +BERTScore F1: S1 = 0.8260022401809692 | S2 = 0.8231779336929321 + +=== Paper 2601.05192v1 === +Reward scores: S1 = -0.01752193458378315 | S2 = -0.03325437754392624 +ROUGE-L: S1 = 0.25 | S2 = 0.21160409556313994 +BERTScore F1: S1 = 0.842767596244812 | S2 = 0.840757429599762 + +=== Paper 2601.05143v1 === +Reward scores: S1 = -0.026092007756233215 | S2 = -0.01505814865231514 +ROUGE-L: S1 = 0.3966480446927374 | S2 = 0.484149855907781 +BERTScore F1: S1 = 0.8710131645202637 | S2 = 0.8946103453636169 + +=== Paper 2601.05184v1 === +Reward scores: S1 = -0.03423023596405983 | S2 = -0.013444548472762108 +ROUGE-L: S1 = 0.42990654205607476 | S2 = 0.46335697399527187 +BERTScore F1: S1 = 0.8848257064819336 | 
S2 = 0.8704691529273987 + +=== Paper 2601.05167v1 === +Reward scores: S1 = -0.024731554090976715 | S2 = -0.020340878516435623 +ROUGE-L: S1 = 0.46115288220551376 | S2 = 0.5503685503685503 +BERTScore F1: S1 = 0.8741124272346497 | S2 = 0.9040818214416504 + +Saved detailed evaluation results to evaluation_results.json +You can now inspect where reward scores agree/disagree with ROUGE/BERTScore. + +=== Evaluation complete! === \ No newline at end of file diff --git a/log_train_reward.txt b/log_train_reward.txt new file mode 100644 index 0000000..3195b2d --- /dev/null +++ b/log_train_reward.txt @@ -0,0 +1,71 @@ +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ python train_reward.py + +=== PHASE 4: Reward model training === + +Loading reward model + tokenizer... +/home/jovyan/miniconda3/envs/hw7_311/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:566: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. + warnings.warn( +Loaded tokenizer type: DebertaV2TokenizerFast +Tokenizer vocab size: 128000 +Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight'] +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
+Loaded dataset with 10 examples +Dataset columns: ['chosen', 'rejected'] +First example: {'chosen': "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. **MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", 'rejected': "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. 
These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. **PersonaTree:** A globally maintained tree that captures the user's long-term profile, with"} +You passed `model_init_kwargs` to the `RewardConfig`, but your model is already instantiated. The `model_init_kwargs` will be ignored. +Filtering train >4096 tokens: 100%|████████████████████████████████████████| 10/10 [00:00<00:00, 1683.85 examples/s] +Training reward model... +The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}. 
+{'train_runtime': 12.6382, 'train_samples_per_second': 2.374, 'train_steps_per_second': 0.712, 'train_loss': 0.6904306411743164, 'num_tokens': 16299.0, 'min_reward': -0.10637749565972222, 'mean_reward': -0.03328132629394531, 'max_reward': 0.029715643988715276, 'accuracy': 0.4722222222222222, 'margin': 0.007103390163845486, 'epoch': 3.0} +100%|█████████████████████████████████████████████████████████████████████████████████| 9/9 [00:12<00:00, 1.40s/it] +Saving clean tokenizer from microsoft/deberta-v3-base... +tokenizer_config.json: 100%|██████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 421kB/s] +config.json: 100%|█████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 3.59MB/s] +spm.model: 100%|███████████████████████████████████████████████████████████████| 2.46M/2.46M [00:00<00:00, 4.20MB/s] +/home/jovyan/miniconda3/envs/hw7_311/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:566: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. + warnings.warn( + Copied: tokenizer_config.json + Copied: special_tokens_map.json + Copied: tokenizer.json +Cleaning tokenizer_config.json to remove SentencePiece contamination... + Removing contaminated field: vocab_type = spm + Removing contaminated field: sp_model_kwargs = {} +Reward model saved to reward_model +Tokenizer type: DebertaV2TokenizerFast +Verifying saved tokenizer... +The tokenizer you are loading from 'reward_model' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. +Verified tokenizer type: DebertaV2TokenizerFast + +=== Training complete! === +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ cat reward_model/config.json | head -20 +{ + "architectures": [ + "DebertaV2ForSequenceClassification" + ], + "attention_probs_dropout_prob": 0.1, + "bos_token_id": 1, + "dtype": "float32", + "eos_token_id": 2, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "id2label": { + "0": "LABEL_0" + }, + "initializer_range": 0.02, + "intermediate_size": 3072, + "label2id": { + "LABEL_0": 0 + }, + "layer_norm_eps": 1e-07, +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ ls -lh reward_model/ +total 712M +-rw-rw-rw- 1 jovyan jovyan 1.3K Jan 12 16:23 README.md +drwxrwxrwx 2 jovyan jovyan 4.0K Jan 12 16:23 checkpoint-3 +drwxrwxrwx 2 jovyan jovyan 4.0K Jan 12 16:23 checkpoint-6 +drwxrwxrwx 2 jovyan jovyan 4.0K Jan 12 16:23 checkpoint-9 +-rw-rw-rw- 1 jovyan jovyan 964 Jan 12 16:23 config.json +-rw-rw-rw- 1 jovyan jovyan 704M Jan 12 16:23 model.safetensors +-rw-rw-rw- 1 jovyan jovyan 286 Jan 12 16:23 special_tokens_map.json +-rw-rw-rw- 1 jovyan jovyan 8.3M Jan 12 16:23 tokenizer.json +-rw-rw-rw- 1 jovyan jovyan 1.3K Jan 12 16:23 tokenizer_config.json \ No newline at end of file diff --git a/main.py b/main.py new file mode 100644 index 0000000..2775f05 --- /dev/null +++ b/main.py @@ -0,0 +1,165 @@ +#!/usr/bin/env python3 +""" +Main entry point for the paper summarization and reward modeling pipeline. + +This script orchestrates the complete workflow: +1. Download papers from ArXiv +2. Generate summaries using LLaMA 3.1 +3. Create preference dataset +4. Train reward model +5. 
Evaluate on new papers + +Usage: + python main.py --phase all + python main.py --phase build_data + python main.py --phase train_reward + python main.py --phase evaluate +""" +import argparse + +from data_utils import download_arxiv_pdfs, save_papers_to_json + + +# Configuration +ARXIV_QUERY = "cs.LG"  # Computer Science - Machine Learning +NUM_TRAIN_PAPERS = 10 +NUM_EVAL_PAPERS = 10 + +# File paths +PDF_DIR = "data/pdfs" +TRAIN_META_JSON = "train_papers.json" +EVAL_META_JSON = "eval_papers.json" +TRAIN_SUMMARY_JSON = "summaries_train.json" +EVAL_SUMMARY_JSON = "summaries_eval.json" +REWARD_JSONL = "reward_data.jsonl" +REWARD_MODEL_DIR = "reward_model" +RESULTS_JSON = "evaluation_results.json" + + +def build_data_phase(): + """ + Phase 1-3: Data collection, summarization, and preference dataset creation. + + This phase: + - Downloads training papers from ArXiv (or uses existing PDFs) + - Generates two summaries per paper using LLaMA + - Creates preference labels using ROUGE-L scores + """ + print("=== PHASE 1-3: Data collection, summarization, preference dataset ===\n") + + # Note: If enough PDFs already exist in data/pdfs, we skip downloading + # and load them directly; otherwise we download them from ArXiv first.
+ import os + if not os.path.exists(PDF_DIR) or len(os.listdir(PDF_DIR)) < NUM_TRAIN_PAPERS: + print(f"Downloading {NUM_TRAIN_PAPERS} training papers from ArXiv...") + train_papers = download_arxiv_pdfs(ARXIV_QUERY, NUM_TRAIN_PAPERS, PDF_DIR) + save_papers_to_json(train_papers, TRAIN_META_JSON) + else: + print(f"Using existing PDFs from {PDF_DIR}") + # Extract papers from existing PDFs + import glob + pdf_files = glob.glob(f"{PDF_DIR}/*.pdf")[:NUM_TRAIN_PAPERS] + print(f"Found {len(pdf_files)} PDF files") + + from data_utils import extract_text_and_figures, PaperData + train_papers = [] + for pdf_path in pdf_files: + arxiv_id = os.path.basename(pdf_path).replace(".pdf", "") + text, figure_captions = extract_text_and_figures(pdf_path) + # For existing PDFs, we don't have title/abstract from the ArXiv API, + # so we fall back to placeholder values derived from the filename + train_papers.append( + PaperData( + arxiv_id=arxiv_id, + title=f"Paper {arxiv_id}", + abstract="",  # not available without ArXiv metadata + text=text, + figure_captions=figure_captions, + ) + ) + save_papers_to_json(train_papers, TRAIN_META_JSON) + + # Generate summaries + print("\nGenerating summaries for training papers...") + from summarization import generate_summaries_for_papers + summary_records = generate_summaries_for_papers(train_papers, TRAIN_SUMMARY_JSON) + + # Build reward modeling data with chosen/rejected labels + print("\nBuilding reward modeling dataset via ROUGE-based preference labels...") + from reward_model import auto_label_preferences_with_rouge + auto_label_preferences_with_rouge(summary_records, REWARD_JSONL) + + +def train_reward_phase(): + """ + Phase 4: Train the reward model on preference data. + + This phase fine-tunes a DeBERTa-v3 model on the chosen/rejected summary pairs + so it learns to assign higher scores to better summaries.
+ """ + print("\n=== PHASE 4: Reward model training ===\n") + from reward_model import train_reward_model + train_reward_model(REWARD_JSONL, REWARD_MODEL_DIR) + + +def evaluate_phase(): + """ + Phase 5-6: Evaluation and comparison. + + This phase: + - Generates summaries for evaluation papers + - Scores them using the trained reward model + - Computes ROUGE and BERTScore + - Compares all metrics + """ + print("\n=== PHASE 5-6: Evaluation and comparison ===\n") + + # For evaluation, we could download new papers or use a held-out set + # For simplicity, we'll reuse the training papers here + # In a real scenario, you'd want a separate eval set + import os + import json + + if not os.path.exists(TRAIN_SUMMARY_JSON): + print("Error: No training summaries found. Run --phase build_data first.") + return + + # Load existing summaries + with open(TRAIN_SUMMARY_JSON, "r", encoding="utf-8") as f: + eval_summaries = json.load(f) + + print(f"Evaluating {len(eval_summaries)} papers...") + + # Evaluate summaries + from reward_model import evaluate_summaries + evaluate_summaries(eval_summaries, REWARD_MODEL_DIR, RESULTS_JSON) + + +def main(): + """Main entry point with command-line argument parsing.""" + parser = argparse.ArgumentParser( + description="Paper summarization + reward modeling pipeline (ArXiv + LLaMA + DeBERTa)." + ) + parser.add_argument( + "--phase", + choices=["build_data", "train_reward", "evaluate", "all"], + default="all", + help="Which phase to run.", + ) + args = parser.parse_args() + + if args.phase in ("build_data", "all"): + build_data_phase() + + if args.phase in ("train_reward", "all"): + train_reward_phase() + + if args.phase in ("evaluate", "all"): + evaluate_phase() + + print("\n=== Pipeline complete! 
===") + + +if __name__ == "__main__": + main() diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..c0f3b41 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,125 @@ +absl-py==2.3.1 +accelerate==1.12.0 +aiohappyeyeballs==2.6.1 +aiohttp==3.13.3 +aiosignal==1.4.0 +annotated-types==0.7.0 +anyio==3.7.1 +arxiv==2.4.0 +attrs==25.4.0 +bert-score==0.3.13 +bitsandbytes==0.49.1 +certifi==2026.1.4 +charset-normalizer==3.4.4 +click==8.3.1 +contourpy==1.3.3 +cut-cross-entropy==25.1.1 +cycler==0.12.1 +datasets==4.3.0 +diffusers==0.36.0 +dill==0.3.8 +distro==1.9.0 +docstring_parser==0.17.0 +evaluate==0.4.6 +faiss-cpu==1.7.4 +fastapi==0.104.1 +feedparser==6.0.12 +filelock==3.20.3 +fonttools==4.61.1 +frozenlist==1.8.0 +fsspec==2024.3.1 +h11==0.16.0 +hf_transfer==0.1.9 +hf-xet==1.2.0 +httpcore==1.0.9 +httpx==0.28.1 +huggingface-hub==0.36.0 +idna==3.11 +importlib_metadata==8.7.1 +Jinja2==3.1.6 +jiter==0.12.0 +joblib==1.5.3 +kiwisolver==1.4.9 +markdown-it-py==4.0.0 +MarkupSafe==3.0.3 +matplotlib==3.10.8 +mdurl==0.1.2 +mpmath==1.3.0 +msgspec==0.20.0 +multidict==6.7.0 +multiprocess==0.70.16 +networkx==3.6.1 +nltk==3.9.2 +numpy==2.4.1 +nvidia-cublas-cu12==12.8.4.1 +nvidia-cuda-cupti-cu12==12.8.90 +nvidia-cuda-nvrtc-cu12==12.8.93 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cudnn-cu12==9.10.2.21 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cufile-cu12==1.13.1.3 +nvidia-curand-cu12==10.3.9.90 +nvidia-cusolver-cu12==11.7.3.90 +nvidia-cusparse-cu12==12.5.8.93 +nvidia-cusparselt-cu12==0.7.1 +nvidia-nccl-cu12==2.27.5 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-nvshmem-cu12==3.3.20 +nvidia-nvtx-cu12==12.8.90 +openai==2.15.0 +packaging==25.0 +pandas==2.3.3 +peft==0.18.1 +pillow==12.1.0 +pip==25.3 +propcache==0.4.1 +protobuf==3.20.3 +psutil==7.2.1 +pyarrow==22.0.0 +pyarrow-hotfix==0.7 +pydantic==2.5.0 +pydantic_core==2.14.1 +Pygments==2.19.2 +PyMuPDF==1.23.8 +PyMuPDFb==1.23.7 +pyparsing==3.3.1 +python-dateutil==2.9.0.post0 +pytz==2025.2 +PyYAML==6.0.3 +regex==2025.11.3 
+requests==2.32.5 +rich==14.2.0 +rouge_score==0.1.2 +safetensors==0.7.0 +scikit-learn==1.8.0 +scipy==1.15.3 +sentence-transformers==3.3.1 +sentencepiece==0.2.1 +setuptools==80.9.0 +sgmllib3k==1.0.0 +six==1.17.0 +sniffio==1.3.1 +starlette==0.27.0 +sympy==1.14.0 +threadpoolctl==3.6.0 +tokenizers==0.22.2 +torch==2.9.1 +torchao==0.15.0 +torchvision==0.24.1 +tqdm==4.67.1 +transformers==4.57.3 +triton==3.5.1 +trl==0.24.0 +typeguard==4.4.4 +typing_extensions==4.15.0 +tyro==1.0.4 +tzdata==2025.3 +unsloth==2026.1.2 +unsloth_zoo==2026.1.2 +urllib3==2.6.3 +uvicorn==0.24.0 +wheel==0.45.1 +xformers==0.0.33.post2 +xxhash==3.6.0 +yarl==1.22.0 +zipp==3.23.0 \ No newline at end of file diff --git a/reward_data.jsonl b/reward_data.jsonl new file mode 100644 index 0000000..8ffa534 --- /dev/null +++ b/reward_data.jsonl @@ -0,0 +1,10 @@ +{"chosen": "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. 
**MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", "rejected": "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. 
**PersonaTree:** A globally maintained tree that captures the user's long-term profile, with"} +{"chosen": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Survey on Agent-as-a-Judge\n\n**Background:** The rapid advancement of Large Language Models (LLMs) has revolutionized AI evaluation, enabling scalable and near-human quality assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n\n**Problem Statement:** The limitations of LLM-as-a-Judge have led to the transition to Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Contributions:**\n\n1. **Characterizing the shift from LLM-as-a-Judge to Agent-as-a-Judge**: The authors identify and summarize the agentic judges' development trend into three progressive stages with distinct levels of autonomy and adaptability.\n2. **Organizing core methodologies**: The authors categorize the core methodologies into five key parts according to the agent's abilities and survey their applications across general and professional domains.\n3. **Analyzing frontier challenges and identifying promising research directions**:", "rejected": "The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which include inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. 
This has led to the development of Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Points:**\n\n1. LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments, but its reliability is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n2. Agent-as-a-Judge has emerged as a more robust and verifiable alternative, leveraging planning, tool-augmented verification, multi-agent collaboration, and persistent memory.\n3. The authors present the first comprehensive survey on Agent-as-a-Judge, summarizing the shift from LLM-as-a-Judge to Agent-as-a-Judge and organizing core methodologies into five key parts according to agent's abilities.\n4. The authors survey applications across general and professional domains and analyze frontier challenges and identify"} +{"chosen": "The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To address the scarcity of high-quality training data for DocQA, they introduce an Exploration-then-Synthesis data synthesis pipeline that progressively enhances QA pairs from easy to hard.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. 
This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\n\nThe paper conducts extensive experiments on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, demonstrating the effectiveness", "rejected": "The paper \"DocDancer: Towards Agentic Document-Grounded Information Seeking\" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilizes an end-to-end trained open-source Doc agent. The authors introduce a tool-driven agent framework that explicitly models document exploration and comprehension, addressing the scarcity of high-quality training data for DocQA.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe paper also proposes an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. 
This pipeline first explores a source document through intent-guided, tool-augmented interactions to collect grounded evidence and then synthesizes high-quality document-grounded QA pairs via multi-observation reasoning.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset using two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance,"} +{"chosen": "The research paper \"Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\" explores the effect of emotional tone on human-AI interactions, specifically with the large language model ChatGPT (GPT-4.0). The study conducted a between-subject experiment where participants interacted with ChatGPT on two tasks: writing a public response and addressing an ethical dilemma.\n\n**Key findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to interactions where participants maintained a neutral tone.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers relative to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. The study also found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses.\n\n**Figures:**\n\n* **Figure 1:** Improvement in ChatGPT's answers based on conditions. 
The figure shows that ChatGPT improved its answers when", "rejected": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\n\n**Research Question:** Does the emotional tone of user-generated prompts influence the quality and content of ChatGPT's responses, and do these effects carry over into human-human communication?\n\n**Methodology:** The researchers conducted a between-subject experiment where participants interacted with ChatGPT-4.0 on two tasks: writing a public response and addressing an ethical dilemma. Participants were randomly assigned to one of three conditions: praise, blame, or anger.\n\n**Findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to the neutral condition.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers compared to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. Participants used more negative, hostile, and disappointing expressions in human-human communication after interactions where they blamed rather than praised ChatGPT.\n\n**Conclusion:**"} +{"chosen": "Here's a clear and concise summary of the research paper for a graduate student:\n\n**Title:** FusionRoute: A Robust and Effective Token-Level Multi-LLM Collaboration Framework\n\n**Background:** Large Language Models (LLMs) have shown strong performance across various domains, but achieving strong performance across these domains with a single general-purpose model is often expensive and inefficient. 
Smaller domain-specialized models are more efficient but struggle to generalize beyond their training distributions.\n\n**Problem Statement:** The authors aim to develop a framework that can robustly coordinate and leverage the complementary expertise of multiple specialized LLMs to achieve general-purpose performance.\n\n**Proposed Solution:** The authors propose FusionRoute, a token-level multi-LLM collaboration framework that consists of a lightweight router and a trainable complementary generator. The router selects the most suitable expert at each decoding step, and the complementary generator refines or corrects the selected expert's next-token distribution via logit addition.\n\n**Key Contributions:**\n\n1. **Theoretical Analysis:** The authors provide a theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot realize the optimal decoding policy unless strong global coverage assumptions hold.\n2. **Empirical Evaluation:** FusionRoute outperforms sequence- and token-level collaboration, model merging, and direct fine-t", "rejected": "The research paper \"Paper 2601.05106v1\" proposes a new framework called FusionRoute for achieving strong performance across diverse domains with a single general-purpose model, while remaining efficient and competitive with domain experts. The authors argue that existing methods, such as sequence-level collaboration, model merging, and direct fine-tuning, have limitations.\n\nThe paper's main contributions are:\n\n1. **FusionRoute**: A robust and effective token-level multi-LLM collaboration framework that selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition.\n2. 
**Theoretical analysis**: A theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot in general realize the optimal decoding policy unless strong global coverage assumptions hold.\n3. **Empirical results**: Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.\n\nThe paper highlights the importance of token-level collaboration, which allows multiple LLMs to jointly produce each token during generation, making it a more fine-grained and dynamic"} +{"chosen": "The paper \"Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\" proposes a new annotation framework, SOFT, that explicitly separates citation intent (why a work is cited) from cited content type (what part is cited). The authors argue that existing frameworks conflate these two dimensions, hindering their effectiveness in auto classification. SOFT is grounded in the principles of semantic roles, offering a more nuanced understanding of how authors structure argumentation.\n\nThe paper introduces two key dimensions: Cited Content Type and Citation Intent. Cited Content Type refers to the specific part of a work that is being cited (e.g., method, result, or background), while Citation Intent refers to the functional purpose of the citation (e.g., to support, to critique, or to provide background). This distinction enables clearer definitions, improved annotation consistency, and stronger support for Large Language Model (LLM)-based classification.\n\nThe authors validate SOFT through a comprehensive re-annotation of the ACL-ARC dataset and a cross-domain evaluation on ACT2, covering 19 scientific fields. 
Results show that SOFT yields higher inter-model and human-LLM agreement, improves classification accuracy for both zero-shot and fine-tuned models, and generalizes more robustly across domains than existing", "rejected": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\n\n**Authors:** Changxu Duan and Zhiyin Tan\n\n**Summary:** The authors propose a new framework, SOFT, for citation classification that explicitly separates the intent of a citation (why a work is cited) from its content type (what part is cited). This framework is grounded in semantic role theory and is designed to improve the clarity, consistency, and generalizability of citation classification.\n\n**Background:** Traditional citation metrics, such as counts or h-index, are limited in their ability to capture the nuanced ways in which works are cited, interpreted, or contested. Citation classification has emerged as a critical task in computational bibliometrics and information science to unlock the meaning within scholarly dialogue.\n\n**Limitations of existing frameworks:** Current citation classification practices face persistent limitations, including the conflation of distinct citation intents and content types, and the lack of consistent semantic definitions. These ambiguities hinder downstream applications such as citation-based retrieval and impact analysis.\n\n**SOFT framework:** The authors propose a two-dimensional framework that explicitly distinguishes the predicative intent of a citation from its cited content type. 
This framework"} +{"chosen": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation\n\n**Background:** Entity linking is a fundamental task in natural language processing that involves mapping ambiguous mentions in text to entities in a knowledge base. Traditional entity linking methods require fine-tuning on the target domain and knowledge base, which can be computationally expensive and require high-quality labeled data.\n\n**Problem Statement:** The authors propose a true zero-shot entity linking approach, where the method can work out of the box for any domain without the need for fine-tuning.\n\n**Method:** The proposed approach, LELA, is a coarse-to-fine method that leverages the capabilities of large language models (LLMs). It consists of three stages:\n\n1. **Initial entity-candidate set generation**: LELA generates a set of potential entities based on the input text.\n2. **Pointwise re-ranking**: The top-k candidates are selected through re-ranking.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final selection through reasoning over the context and entity descriptions.\n\n**Advantages:** LELA is modular, LLM-, reranker-, and retrie", "rejected": "The paper proposes a novel entity linking approach, called LELA, which stands for Large Language Model-based Entity Linking Approach with Zero-Shot Domain Adaptation. Entity linking is the task of mapping ambiguous mentions in text to entities in a knowledge base (KB). The authors aim to develop a true zero-shot entity linking method, which means it works without the need for fine-tuning, and can handle different domains, knowledge bases, and large language models (LLMs).\n\nThe proposed approach, LELA, is a coarse-to-fine method that consists of three stages:\n\n1. 
**Initial entity-candidate set generation**: LELA generates an initial set of candidate entities for the ambiguous mention.\n2. **Pointwise re-ranking**: The top-k candidates are selected through pointwise re-ranking, where the model calculates a score for each candidate entity based on its relevance to the context and entity descriptions.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final candidate selection, where the model uses reasoning over the context and entity descriptions to select the most relevant entity.\n\nLELA is modular, agnostic to the LLM, reranker, and retriever, making it applicable to both open-source and proprietary models.\n\nThe authors"} +{"chosen": "The research paper presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders, and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Findings:**\n\n1. **High Accuracy for Crop and Disease Identification:** The model achieves high accuracy for both crop and disease identification, demonstrating its effectiveness in visual question answering for crop disease analysis.\n2. **Strong Performance on BLEU, ROUGE, and BERTScore:** The framework also achieves strong performance on these metrics, indicating its ability to generate high-quality language output.\n3. **Explainability:** Explainability is assessed using Grad-CAM and token-level attribution, demonstrating the model's ability to provide transparent and interpretable results.\n4. **Robust Performance under Diverse User-Driven Queries:** Qualitative results demonstrate robust performance under diverse user-driven queries, highlighting the model's ability to handle real-world scenarios.\n\n**Key Contributions:**\n\n1. 
**Lightweight Vision-Language Framework:** The proposed framework is lightweight, making it suitable for real-world deployment in agricultural environments.\n2. **Task-Specific Visual Pretraining:** The study highlights the effectiveness of task", "rejected": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Lightweight and Explainable Vision–Language Framework for Crop Disease Visual Question Answering\n\n**Background:** Crop disease diagnosis is a critical task in modern agriculture, and early identification of disease symptoms is essential to reduce damage and support effective intervention. However, traditional methods rely on expert inspection and recommendations, which can be time-consuming and limited in regions with expert access.\n\n**Problem Statement:** Existing automated disease detection methods often require controlled imaging conditions, are computationally heavy, and fail to provide detailed textual descriptions of visual attributes, disease progression stages, and contextual information.\n\n**Research Question:** Can a lightweight Visual Question Answering (VQA) framework be established for intelligent and practical plant disease identification?\n\n**Methodology:** The proposed framework combines a Swin Transformer vision encoder with sequence-to-sequence language decoders and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Contributions:**\n\n1. A lightweight VQA framework for crop disease identification that achieves high accuracy and strong performance on BLEU, ROUGE, and BERTScore metrics.\n2. 
Explainability is assessed using Grad-CAM and token-level attribution, demonstrating robust performance under diverse user-driven queries.\n3."} +{"chosen": "This summary is based on the provided research paper excerpt, focusing on the concept of Self-Consuming Performative Loop (SCPL) in Large Language Models (LLMs).\n\n**Introduction**\n\nThe paper highlights the growing interest in using synthetic data to train future LLMs, which leads to a self-consuming retraining loop where models are trained on their own outputs and may cause performance drops and induce emerging biases. This self-consuming performative loop is particularly relevant in dynamic systems driven by user feedback, where the model's performance influences future data.\n\n**Research Questions**\n\nThe authors aim to investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. They focus on two types of loops:\n\n1. The typical retraining setting\n2. The incremental fine-tuning setting, which is largely underexplored\n\n**Key Findings**\n\nThe authors' experiments on three real-world tasks reveal that:\n\n* The performative loop increases preference bias and decreases disparate bias.\n* Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n* Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n* Performative feedback accelerates bias amplification in incremental", "rejected": "The research paper \"Self-Consuming Performative Loop for Large Language Models\" explores the concept of a self-consuming performative loop (SCPL) in large language models (LLMs). The authors introduce the idea of a dynamic system driven by user feedback, where previously deployed LLMs influence the data they generate, leading to a performative prediction. 
They investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback.\n\nThe paper focuses on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, the authors find that the performative loop increases preference bias and decreases disparate bias. They design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.\n\nKey findings:\n\n1. Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n2. Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n3. Performative feedback accelerates bias amplification in incremental fine-tuning loops, whereas this phenomenon is less obvious in retraining loops.\n\nThe authors' research provides insights that inform safer deployment of"} +{"chosen": "The research paper \"RelayLLM: Efficient Reasoning via Collaborative Decoding\" proposes a novel framework for efficient reasoning via token-level collaborative decoding. The authors address the challenge of deploying Large Language Models (LLMs) for complex reasoning, which is often hindered by high computational costs and latency, while Small Language Models (SLMs) typically lack the necessary reasoning capacity.\n\nThe authors present RelayLLM, a framework that empowers the SLM to act as an active controller, dynamically invoking the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process. 
They introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking.\n\nThe empirical results on six benchmarks demonstrate the effectiveness of RelayLLM, achieving an average accuracy of 49.52% and reducing the performance gap between the small model and the large one. The authors also show that RelayLLM invokes the large model for only 1.07% of the total generated tokens, resulting in a 98.2% reduction in token costs compared to a performance-matched router.\n\nKey findings:\n\n1. **Efficient reasoning via token-level collaborative decoding**: RelayLLM emp", "rejected": "Here's a clear and concise summary of the research paper:\n\n**Title:** RelayLLM: Efficient Reasoning via Collaborative Decoding\n\n**Problem Statement:** Large Language Models (LLMs) are powerful but computationally expensive, while Small Language Models (SLMs) are efficient but lack reasoning capacity. Existing collaborative approaches, such as cascading or routing, often operate at a coarse granularity, leading to significant computational waste.\n\n**Proposed Solution:** RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike static routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process.\n\n**Key Components:**\n\n1. **Two-stage training framework:** Supervised warm-up phase to teach the model the syntactic structure of calling commands, followed by a reinforcement learning stage using Group Relative Policy Optimization (GRPO) training.\n2. **Context-aware reward:** Designed to balance independence with necessary help-seeking, penalizing both wasted costs and avoidable errors.\n3. 
**Token-level collaborative decoding:** The SLM generates a special command token () to pause its own generation and invoke the LLM for a specified number of tokens"} diff --git a/reward_model.py b/reward_model.py new file mode 100644 index 0000000..df43488 --- /dev/null +++ b/reward_model.py @@ -0,0 +1,443 @@ +""" +Reward model training and evaluation utilities. +""" +import json +from typing import List, Dict, Any + +import torch +from datasets import load_dataset, Dataset +from transformers import ( + AutoTokenizer, + AutoModelForSequenceClassification, + TrainingArguments, +) +from trl import RewardTrainer +import evaluate + + +# Configuration +REWARD_MODEL_NAME = "microsoft/deberta-v3-base" +REWARD_NUM_EPOCHS = 3 +REWARD_BATCH_SIZE = 4 + +# Device detection: CUDA > MPS > CPU +if torch.cuda.is_available(): + LLAMA_DEVICE = "cuda" +elif torch.backends.mps.is_available(): + LLAMA_DEVICE = "mps" +else: + LLAMA_DEVICE = "cpu" + + +def auto_label_preferences_with_rouge( + summary_records: List[Dict[str, Any]], + output_path: str = "reward_data.jsonl" +) -> None: + """ + Use ROUGE-L against the abstract as a weak preference label. + + This automatically creates preference pairs where: + - chosen = summary with higher ROUGE-L vs abstract + - rejected = the other summary + + Writes reward_data.jsonl with 'chosen' and 'rejected' fields suitable for RewardTrainer. 
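+
+    Illustrative example (hypothetical records; the actual texts are elided):
+
+        >>> records = [
+        ...     {"abstract": "...", "summary_1": "...", "summary_2": "..."},
+        ... ]
+        >>> auto_label_preferences_with_rouge(records, "reward_data.jsonl")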
+ + Args: + summary_records: List of dictionaries with 'abstract', 'summary_1', 'summary_2' + output_path: Path to save the JSONL file + """ + rouge = evaluate.load("rouge") + data_for_jsonl = [] + + for rec in summary_records: + ref = rec["abstract"] + s1 = rec["summary_1"] + s2 = rec["summary_2"] + + # Compute per-summary ROUGE-L scores + s1_score = rouge.compute(predictions=[s1], references=[ref])["rougeL"] + s2_score = rouge.compute(predictions=[s2], references=[ref])["rougeL"] + + if s1_score >= s2_score: + chosen, rejected = s1, s2 + else: + chosen, rejected = s2, s1 + + data_for_jsonl.append( + { + "chosen": chosen, + "rejected": rejected, + } + ) + + with open(output_path, "w", encoding="utf-8") as f: + for item in data_for_jsonl: + f.write(json.dumps(item, ensure_ascii=False) + "\n") + + print(f"Wrote preference data to {output_path} ({len(data_for_jsonl)} pairs).") + + +def load_reward_dataset(path: str) -> Dataset: + """ + Load reward modeling dataset from JSONL file. + + Args: + path: Path to JSONL file with 'chosen' and 'rejected' fields + + Returns: + HuggingFace Dataset object + """ + dataset = load_dataset("json", data_files=path, split="train") + return dataset + + +class ModelWrapper(torch.nn.Module): + """Wrapper to filter out unsupported arguments for encoder models.""" + def __init__(self, model): + super().__init__() + self.model = model + + def forward(self, **kwargs): + # Remove arguments not supported by encoder models + kwargs.pop('use_cache', None) + return self.model(**kwargs) + + def __getattr__(self, name): + try: + return super().__getattr__(name) + except AttributeError: + return getattr(self.model, name) + + +def train_reward_model(data_path: str = "reward_data.jsonl", output_dir: str = "reward_model"): + """ + Train a reward model on preference data using TRL's RewardTrainer. + + The model learns to assign higher scores to preferred summaries (chosen) + and lower scores to rejected summaries. 
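+
+    Usage sketch (illustrative; assumes reward_data.jsonl already exists):
+
+        >>> train_reward_model(
+        ...     data_path="reward_data.jsonl",
+        ...     output_dir="reward_model",
+        ... )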
+ + Args: + data_path: Path to JSONL file with preference pairs + output_dir: Directory to save the trained model + """ + print("Loading reward model + tokenizer...") + # Load tokenizer and store a clean copy for saving later + tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL_NAME) + print(f"Loaded tokenizer type: {tokenizer.__class__.__name__}") + print(f"Tokenizer vocab size: {tokenizer.vocab_size}") + # Keep a reference to the original tokenizer before training + original_tokenizer_path = REWARD_MODEL_NAME + + base_model = AutoModelForSequenceClassification.from_pretrained( + REWARD_MODEL_NAME, + num_labels=1, + ) + + # Wrap model to filter unsupported arguments + model = ModelWrapper(base_model) + + dataset = load_reward_dataset(data_path) + print(f"Loaded dataset with {len(dataset)} examples") + print(f"Dataset columns: {dataset.column_names}") + if len(dataset) > 0: + print(f"First example: {dataset[0]}") + + # RewardTrainer handles tokenization internally, so we just pass the text fields + + training_args = TrainingArguments( + output_dir=output_dir, + per_device_train_batch_size=REWARD_BATCH_SIZE, + num_train_epochs=REWARD_NUM_EPOCHS, + eval_strategy="no", + save_strategy="epoch", + logging_steps=10, + fp16=False, + bf16=torch.cuda.is_available(), + remove_unused_columns=False, + ) + + # Add attributes required by TRL RewardTrainer (from RewardConfig) + training_args.model_init_kwargs = {} + training_args.eos_token = None + training_args.pad_token = None + training_args.max_length = 4096 + training_args.chat_template_path = None + training_args.disable_dropout = False + training_args.pad_to_multiple_of = None + training_args.dataset_num_proc = None + training_args.center_rewards_coefficient = None + training_args.activation_offloading = False + + trainer = RewardTrainer( + model=model, + args=training_args, + train_dataset=dataset, + processing_class=tokenizer, + ) + + print("Training reward model...") + trainer.train() + + # Save the base model 
(unwrapped)
+    base_model.save_pretrained(output_dir)
+
+    # Save the original tokenizer (reload fresh to avoid contamination)
+    print(f"Saving clean tokenizer from {original_tokenizer_path}...")
+    import os
+    import shutil
+
+    # Use a temporary directory to get clean tokenizer files
+    temp_dir = f"{output_dir}_temp_tokenizer"
+    os.makedirs(temp_dir, exist_ok=True)
+
+    # Download fresh tokenizer to temp directory
+    clean_tokenizer = AutoTokenizer.from_pretrained(original_tokenizer_path, cache_dir=temp_dir)
+
+    # Save to temp first
+    clean_tokenizer.save_pretrained(temp_dir)
+
+    # Copy only the essential tokenizer files (not any contaminated metadata)
+    essential_files = [
+        "tokenizer_config.json",
+        "vocab.txt",  # DeBERTa uses vocab.txt
+        "special_tokens_map.json",
+        "tokenizer.json",
+    ]
+
+    for filename in essential_files:
+        src = os.path.join(temp_dir, filename)
+        dst = os.path.join(output_dir, filename)
+        if os.path.exists(src):
+            shutil.copy2(src, dst)
+            print(f"  Copied: {filename}")
+
+    # Clean up temp directory
+    shutil.rmtree(temp_dir, ignore_errors=True)
+
+    # CRITICAL FIX: Clean up tokenizer_config.json to remove SentencePiece contamination
+    tokenizer_config_path = os.path.join(output_dir, "tokenizer_config.json")
+    if os.path.exists(tokenizer_config_path):
+        print("Cleaning tokenizer_config.json to remove SentencePiece contamination...")
+        with open(tokenizer_config_path, 'r') as f:
+            config = json.load(f)
+
+        # Remove SentencePiece-related fields that cause Mistral warnings
+        contaminated_fields = ["vocab_type", "sp_model_kwargs"]
+        for field in contaminated_fields:
+            if field in config:
+                print(f"  Removing contaminated field: {field} = {config[field]}")
+                del config[field]
+
+        # Write back the cleaned config
+        with open(tokenizer_config_path, 'w') as f:
+            json.dump(config, f, indent=2)
+
+    # CRITICAL: Remove any contaminating files that shouldn't be in DeBERTa tokenizer
+    contaminating_files = [
+        "spm.model",  # SentencePiece (LLaMA/Mistral)
+        "merges.txt",  # BPE (GPT-2/Mistral)
+        "vocab.json",  # BPE vocab (GPT-2/Mistral)
+    ]
+
+    for filename in contaminating_files:
+        filepath = os.path.join(output_dir, filename)
+        if os.path.exists(filepath):
+            print(f"WARNING: Removing contaminating file: {filepath}")
+            os.remove(filepath)
+
+    print(f"Reward model saved to {output_dir}")
+    print(f"Tokenizer type: {clean_tokenizer.__class__.__name__}")
+
+    # Verify the saved tokenizer can be loaded correctly
+    print("Verifying saved tokenizer...")
+    verification_tokenizer = AutoTokenizer.from_pretrained(output_dir)
+    print(f"Verified tokenizer type: {verification_tokenizer.__class__.__name__}")
+
+
+def score_summaries_with_reward_model(
+    summaries: List[str],
+    reward_model,
+    reward_tokenizer,
+) -> List[float]:
+    """
+    Compute scalar reward scores for each summary.
+
+    Args:
+        summaries: List of summary texts
+        reward_model: Trained reward model
+        reward_tokenizer: Tokenizer for the reward model
+
+    Returns:
+        List of reward scores (one per summary)
+    """
+    reward_model.eval()
+    scores = []
+
+    # Get the device of the model
+    model_device = next(reward_model.parameters()).device
+
+    for s in summaries:
+        inputs = reward_tokenizer(
+            s,
+            return_tensors="pt",
+            truncation=True,
+            padding="max_length",
+            max_length=4096,
+        ).to(model_device)
+
+        with torch.no_grad():
+            out = reward_model(**inputs)
+            # out.logits shape: [batch, 1]
+            score = out.logits.squeeze().item()
+            scores.append(score)
+
+    return scores
+
+
+def evaluate_summaries(
+    summary_records: List[Dict[str, Any]],
+    reward_model_dir: str = "reward_model",
+    output_path: str = "evaluation_results.json"
+) -> List[Dict[str, Any]]:
+    """
+    Evaluate summaries using ROUGE, BERTScore, and the trained reward model.
+ + Args: + summary_records: List of dicts with 'arxiv_id', 'title', 'abstract', 'summary_1', 'summary_2' + reward_model_dir: Directory containing the trained reward model + output_path: Path to save evaluation results + + Returns: + List of evaluation results for each paper + """ + # Load reward model + print("Loading reward model for evaluation...") + + # Suppress false Mistral warnings by cleaning tokenizer config before loading + import os + tokenizer_config_path = os.path.join(reward_model_dir, "tokenizer_config.json") + if os.path.exists(tokenizer_config_path): + with open(tokenizer_config_path, 'r') as f: + config = json.load(f) + + # Remove SentencePiece contamination if present + if "vocab_type" in config or "sp_model_kwargs" in config: + print("Cleaning tokenizer config before loading...") + config.pop("vocab_type", None) + config.pop("sp_model_kwargs", None) + + with open(tokenizer_config_path, 'w') as f: + json.dump(config, f, indent=2) + + # Suppress the false Mistral warning from transformers logger + import logging + import warnings + + # Temporarily suppress ALL transformers warnings during tokenizer load + logging.getLogger("transformers").setLevel(logging.ERROR) + original_transformers_level = logging.getLogger("transformers.tokenization_utils_base").level + logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR) + + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", message=".*incorrect regex pattern.*") + warnings.filterwarnings("ignore", message=".*Mistral.*") + + reward_tokenizer = AutoTokenizer.from_pretrained( + reward_model_dir, + trust_remote_code=False, + use_fast=True + ) + reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_dir) + + # Restore original logging levels + logging.getLogger("transformers.tokenization_utils_base").setLevel(original_transformers_level) + logging.getLogger("transformers").setLevel(logging.WARNING) + + reward_model.to(LLAMA_DEVICE) + 
print(f"Reward model loaded on device: {LLAMA_DEVICE}") + print(f"Loaded tokenizer class: {reward_tokenizer.__class__.__name__}") + + # Load evaluation metrics + rouge = evaluate.load("rouge") + bertscore = evaluate.load("bertscore") + + results = [] + + for rec in summary_records: + ref = rec.get("abstract", "") + s1 = rec["summary_1"] + s2 = rec["summary_2"] + + # Check if abstract is empty - if so, skip ROUGE/BERTScore or use fallback + if not ref or ref.strip() == "": + print(f"\nWarning: Empty abstract for {rec['arxiv_id']}, using first 500 chars of text as reference") + # Use first 500 characters of paper text as fallback reference + ref = rec.get("text", "")[:500] if rec.get("text") else "" + + # Only compute ROUGE/BERTScore if we have a valid reference + if ref and ref.strip(): + # ROUGE + r1 = rouge.compute(predictions=[s1], references=[ref]) + r2 = rouge.compute(predictions=[s2], references=[ref]) + + # BERTScore + b1 = bertscore.compute( + predictions=[s1], + references=[ref], + lang="en", + ) + b2 = bertscore.compute( + predictions=[s2], + references=[ref], + lang="en", + ) + else: + # No valid reference available + print(f"Warning: No valid reference for {rec['arxiv_id']}, skipping ROUGE/BERTScore") + r1 = {"rouge1": None, "rouge2": None, "rougeL": None} + r2 = {"rouge1": None, "rouge2": None, "rougeL": None} + b1 = {"precision": [None], "recall": [None], "f1": [None]} + b2 = {"precision": [None], "recall": [None], "f1": [None]} + + # Reward model scores + scores = score_summaries_with_reward_model( + [s1, s2], + reward_model, + reward_tokenizer, + ) + rm1, rm2 = scores + + result_entry = { + "arxiv_id": rec["arxiv_id"], + "title": rec["title"], + "abstract": ref, + "summary_1": s1, + "summary_2": s2, + "rouge_1": r1, + "rouge_2": r2, + "bertscore_1": b1, + "bertscore_2": b2, + "reward_1": rm1, + "reward_2": rm2, + } + results.append(result_entry) + + print("\n=== Paper", rec["arxiv_id"], "===") + print("Reward scores: S1 =", rm1, " | S2 =", rm2) + if 
r1["rougeL"] is not None: + print("ROUGE-L: S1 =", r1["rougeL"], " | S2 =", r2["rougeL"]) + print("BERTScore F1: S1 =", + b1["f1"][0], + "| S2 =", + b2["f1"][0]) + else: + print("ROUGE-L: N/A (no valid reference)") + print("BERTScore F1: N/A (no valid reference)") + + with open(output_path, "w", encoding="utf-8") as f: + json.dump(results, f, ensure_ascii=False, indent=2) + + print(f"\nSaved detailed evaluation results to {output_path}") + print("You can now inspect where reward scores agree/disagree with ROUGE/BERTScore.") + + return results diff --git a/run_evaluation.py b/run_evaluation.py new file mode 100644 index 0000000..cea6240 --- /dev/null +++ b/run_evaluation.py @@ -0,0 +1,43 @@ +#!/usr/bin/env python3 +""" +Phase 5-6: Evaluation and comparison. + +This script: +- Loads summaries from training +- Scores them using the trained reward model +- Computes ROUGE and BERTScore +- Compares all metrics +""" +import os +import json + +from reward_model import evaluate_summaries + + +# File paths +TRAIN_SUMMARY_JSON = "summaries_train.json" +REWARD_MODEL_DIR = "reward_model" +RESULTS_JSON = "evaluation_results.json" + + +def main(): + print("\n=== PHASE 5-6: Evaluation and comparison ===\n") + + if not os.path.exists(TRAIN_SUMMARY_JSON): + print("Error: No training summaries found. Run build_data.py first.") + return + + # Load existing summaries + with open(TRAIN_SUMMARY_JSON, "r", encoding="utf-8") as f: + eval_summaries = json.load(f) + + print(f"Evaluating {len(eval_summaries)} papers...") + + # Evaluate summaries + evaluate_summaries(eval_summaries, REWARD_MODEL_DIR, RESULTS_JSON) + + print("\n=== Evaluation complete! 
===") + + +if __name__ == "__main__": + main() diff --git a/summaries_train.json b/summaries_train.json new file mode 100644 index 0000000..4570842 --- /dev/null +++ b/summaries_train.json @@ -0,0 +1,82 @@ +[ + { + "arxiv_id": "2601.05171v1", + "title": "Paper 2601.05171v1", + "abstract": "Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process- based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory- operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro. 
Correspondence: Team Leader at lizy@memtensor.cn Author Legend: †Corresponding author", + "text": "Inside Out:\nEvolving User-Centric Core Memory Trees\nfor Long-Term Personalized Dialogue Systems\nJihao Zhao1,2,3, Ding Chen2,3, Zhaoxin Fan2,4, Kerun Xu2,3, Mengting Hu2,5, Bo\nTang2,3, Feiyu Xiong2,3, Zhiyu Li2,3,†\n1School of Information, Renmin University of China, 2MemTensor (Shanghai) Technology Co.,\nLtd., 3Institute for Advanced Algorithms Research, Shanghai, 4Beijing University of\nAeronautics and Astronautics, 5Nankai University\nAbstract\nExisting long-term personalized dialogue systems struggle to reconcile unbounded interaction\nstreams with finite context constraints, often succumbing to memory noise accumulation, reasoning\ndegradation, and persona inconsistency. To address these challenges, this paper proposes Inside\nOut, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term\nuser profiling. By constraining the trunk with an initial schema and updating the branches and\nleaves, PersonaTree enables controllable growth, achieving memory compression while preserving\nconsistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-\nbased rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP}\noperations, thereby supporting the dynamic evolution of the personalized tree. During response\ngeneration, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios;\nwhen users require more details, the agentic mode is triggered to introduce details on-demand\nunder the constraints of the PersonaTree.\nExperiments show that PersonaTree outperforms\nfull-text concatenation and various personalized memory systems in suppressing contextual noise\nand maintaining persona consistency. 
Notably, the small MemListener model achieves memory-\noperation decision performance comparable to, or even surpassing, powerful reasoning models such\nas DeepSeek-R1-0528 and Gemini-3-Pro.\nCorrespondence: Team Leader at lizy@memtensor.cn\nAuthor Legend: †Corresponding author\n1\nIntroduction\nCore memories shape Riley’s personality islands, with each island serving as a unique emblem of\nher identity.\n—— \"Inside Out\"\nWith the rapid advancement of large language models (LLM), dialogue-based agents have demonstrated\nsubstantial potential in applications such as personal assistants, affective companionship, and long-term\nquestion answering [4, 11, 18]. However, within personalized dialogue systems aimed at fostering long-\n1\narXiv:2601.05171v1 [cs.CL] 8 Jan 2026\n\nterm human-machine trust and emotional connection, a fundamental contradiction exists between the finite\ncontext window and the unbounded growth of interaction history [13, 23]. As conversational turns continue\nto accumulate, the traditional single-context paradigm encounters a severe form of context saturation:\nindiscriminate aggregation of massive historical information not only drives computational costs sharply\nupward, but also introduces substantial irrelevant noise, markedly degrading the signal-to-noise ratio. More\ncritically, this unstructured accumulation makes it difficult for the model to accurately extract and sustain a\nuser’s personal characteristics from lengthy histories, leading to personalization inconsistency over long-term\ninteractions and thereby seriously undermining user experience and the system’s long-term usability [19, 29].\nTo address these challenges, existing studies have primarily explored routes such as explicit profile augmentation\nand vector-based retrieval, yet neither directly confronts the central bottleneck of personalized memory\nevolution. 
Profile-based approaches rely on predefined, static attributes; they are not only slow to update\nbut also struggle to capture implicit cues that users reveal over prolonged interactions, including linguistic\nstyle, deeper value orientations, and affective preferences, resulting in superficial personalization modeling [20].\nIn contrast, memory-augmented agents based on vector retrieval, while introducing external storage, still\nessentially treat memory as text fragments or simple lists of facts. Such systems lack an intrinsic, trained\ndecision mechanism for determining which information merits long-term retention, and instead often depend\non rigid heuristics or elaborate prompt engineering [12]. This accumulation of memories without value-based\njudgment causes the memory repository either to become bloated and uninterpretable due to noise accretion,\nor to lose the long-range logical thread through fragmentation of key context, ultimately failing to sustain a\nvivid and coherent persona [25].\nThis discrepancy between memory accumulation and core persona formation” motivates us to return to the\nfoundations of human cognition for an answer. As illustrated by the film \"Inside Out\", individual identity\ndoes not stem from a simplistic stacking of all experiences, but rather is constructed upon core memories that\nshape distinct \"Islands of Personality\". This aligns with theoretical findings in cognitive psychology, such as\nSelf-Schema theory [15, 22], which emphasizes that humans maintain a stable self-concept by filtering and\nhierarchically organizing key memories.\nInspired by these insights, we propose the Inside Out framework, which aims to grow an evolvable user core\nmemory tree \"from the inside out\" through unbounded interactions. Firstly, to delineate the theoretical\nboundaries of the memory tree, we construct a hierarchical Schema based on the Biopsychosocial model,\nscientifically decomposing user characteristics into three core dimensions. 
This interdisciplinary Schema\ndesign establishes the initial structure of the user PersonaTree. Secondly, to endow the system with dynamic\nevolution, we propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy\nbased on process rewards to train a lightweight model, MemListener. This model learns to compress a\ncontinuous stream of unstructured dialogue in real time into standardized tree-structured operations, encoding\nuser core features within the branch and leaf nodes. Finally, addressing the trade-off between efficiency and\neffectiveness during the inference stage, this paper designs an adaptive response generation mechanism: In\nlatency-sensitive scenarios, a fast mode is enabled to perform reasoning directly based on the PersonaTree.\nWhen facing long-tail detail requirements, the system switches to the agentic recall mode, utilizing the\nPersonaTree to guide deep retrieval. The primary contributions of our work are summarized as:\n• We propose PersonaTree, grounded in the biopsychosocial schema. By transforming unstructured\ndialogue streams into standardized atomic tree operations in real-time, PersonaTree achieves the\ndynamic compression, explicit management, and high signal-to-noise ratio maintenance of implicit user\nprofiles.\n• We design a training strategy utilizing RL with process rewards. Leveraging the constructed dataset of\n28k instructions, we train a lightweight model, MemListener, to execute precise memory editing.\n• Our experiments reveal the potential of a collaborative paradigm where \"small models maintain memory\nwhile LLMs handle generation\". 
Results show that MemListener achieves memory-decision performance\ncomparable to strong reasoning models, and that PersonaTree offers a new pathway toward low-cost,\nhighly reliable deployment of long-term personalized dialogue systems.\n2\n\nFigure 1 Overview of the entire process of our Inside Out framework.\n2\nRelated Works\n2.1\nPersonalization and Memory\nPersonalization aims to adapt a dialogue system’s linguistic style and interaction policy to a specific user’s\nstable traits and evolving state. In interactive settings, personalization is inherently coupled with memory:\nmodels must distill reusable user representations from past interactions and fuse them during generation. Li\net al. [10] proposed persona-based dialogue generation to mitigate inconsistency and lack of personality in\nopen-domain dialogue, and Zhang et al. [26] formalized the PersonaChat task. Subsequent studies emphasized\nmulti-dimensional user attributes. For example, Zheng et al. [28] introduced the large-scale multi-turn dataset\nPersonalDialog. In parallel, Madotto et al. [14] framed personalization as a meta-learning problem to enable\nfew-shot adaptation. In the LLM era, Chen et al. [3] systematically reviewed major directions in personalized\ndialogue generation, while Tan et al. [21] assigned parameter-efficient personalization modules to users to\nimprove multi-task personalization.\n2.2\nLLM Agents with External Memory\nTo overcome the limitations of LLMs’ finite context windows and endow them with capabilities for continuous\nlearning and long-term interaction, constructing memory systems has emerged as a pivotal research direction.\n3\n\nLangMem 1 enables continual learning and cross-session consistency by decoupling hot-path memory primitives\nfrom backend asynchronous integration. Mem0 [4] adopts a multi-level memory architecture to support\nmulti-session retrieval and personalization at relatively low overhead. 
A-Mem [24] builds an evolvable memory\nnetwork via self-organizing indexing and linking mechanisms. MemoryOS [8] manages short, medium, and\nlong-term memory through OS-style hierarchical storage together with corresponding update and retrieval\npolicies to preserve contextual coherence.\n3\nThe Inside Out Framework\n3.1\nOverview Architecture\nThis study proposes the Inside Out Framework, which aims to address the challenges of personalized consistency\nand contextual forgetting in long-term dialogues through a structured memory evolution mechanism.\nFramework Pipeline.\nAs shown in Figure 1, the framework consists of three key modules: Dynamic Person-\naTree Evolution (Section 3.2), MemListener Training (Section 3.3), and Adaptive Response Generation (Section\n3.4). First, PersonaTree and dataset construction initializes a persona tree based on the Biopsychosocial\nModel, segments the user’s dialogue history into consecutive dialogue chunks, and generates operations on the\nPersonaTree, thereby constructing a memory evolution dataset. Second, MemListener training leverages the\nresulting high-quality dataset to update the parameters of MemListener via supervised fine-tuning (SFT) and\nRL with a dynamic reward mechanism, enabling it to extract structured memory from unstructured dialogues.\nFinally, PersonaTree inference freezes the MemListener parameters at application time, reconstructs the\nattribute tree from the dialogue history, uses this structured memory to enhance user queries, and ultimately\ngenerates personalized responses through an LLM.\nProblem Formulation.\nWe define the task of a personalized dialogue system as a process of maximizing\nresponse utility over an infinitely long dialogue stream. 
Given a user U with a historical dialogue sequence\nH = {x1, y1, ..., xt, yt}, where x denotes user inputs and y denotes system responses, conventional context-\nwindow approaches attempt to directly model P(yt | Ht−k:t), but are constrained by the window length k.\nOur framework introduces an explicit, structured user state T (i.e., PersonaTree), thereby reformulating the\nproblem as state tracking and state-conditioned generation. The goal is to learn a state update function\nfupdate such that:\nTt = fupdate(Tt−1, Dt)\n(1)\nyt = fgen(xt, Tt, frecall(Tt, H))\n(2)\nwhere Dt denotes the current dialogue chunk, fgen produces the system reply given the current user input\nand the tracked user state, and frecall is a retrieval function that recalls relevant historical snippets from the\nfull dialogue history H conditioned on the current state Tt.\n3.2\nDynamic PersonaTree Evolution\nPersonaTree Initialization.\nAt system startup, we construct an initial PersonaTree to serve as the starting\npoint of the user’s long-term structured state. Specifically, we first determine the set of writable trunk and\nleaf fields according to a predefined unified schema, and constrain the storage type of each leaf node to a\ndescriptive string, which is used to hold a compressed summary of the user’s core personalized attributes. This\ndesign ensures that memory capacity remains controllable and prevents unbounded growth as the dialogue\nprogresses. The schema is informed by interdisciplinary human-factors and psychological theory frameworks,\nwith its theoretical grounding illustrated in Figure 1. Subsequently, under the schema constraints, we initialize\nthe leaf nodes (allowing empty strings or default placeholder text), thereby obtaining the initial persona tree\nT0. 
The specific initial PersonaTree instance adopted in this paper is provided in Appendix D.\n1 https://github.com/langchain-ai/langmem\nIterative PersonaTree Updating. To enable scalable maintenance of long-term personalized memory over an infinitely long dialogue stream, we adopt an iterative updating mechanism: any input modality (historical file import, short-snippet input, or real-time cache triggering) is normalized into a dialogue-chunk sequence (D1, . . . , DN), and for each Dt we execute a closed-loop update of operation-list generation, safe parsing and execution, and versioned persistence.\nStep 0: System Loading. The system loads the text fields of all leaf nodes, yielding the initial state T0. Meanwhile, the task specification and system constraints are abstracted into a rule set R, including update rules, writable scope, leaf constraints, and output format.\nStep t = 1, . . . , N. For any dialogue chunk Dt, the system executes the following stages:\n(a) State Construction. Given a dialogue chunk Dt, set Inputt ← (Dt, Tt−1).\n(b) Operation List Generation. Conditioned on (Dt, Tt−1, R), the LLM outputs an operation list Ot, consisting of one or more atomic operations that strictly follow a predefined operation grammar. The operation types are limited to:\n• ADD(path, value): write descriptive text to the specified path; if the path does not exist, it may be created under an extended-schema policy;\n• UPDATE(path, value): perform an overwrite rewrite on the target leaf node, updating its text to the new value;\n• DELETE(path, value): clear the target leaf node or write a deletion marker to indicate that this type of information should be removed from long-term memory;\n• NO_OP(): the current dialogue chunk does not contain stable core persona information that should be written to the PersonaTree.\nFor update operations, our framework unifies them as rewrites of leaf strings.
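A minimal sketch of applying such an operation list under the rule-set constraints (writable scope, string-typed leaves, per-leaf length control); the schema paths, the budget, and the parsing regex are illustrative assumptions, not the paper's implementation:

```python
import re

# Illustrative sketch: apply an operation list O_t of
# ADD/UPDATE/DELETE/NO_OP to a flat {path: text} leaf map under schema
# and length constraints. Schema, budget, and regex are assumptions.

SCHEMA = {"bio.health", "psych.preferences", "social.relationships"}
LEAF_BUDGET = 60  # max characters per leaf value

OP_RE = re.compile(r"^(ADD|UPDATE|DELETE|NO_OP)\((.*)\)$")

def apply_ops(tree: dict, ops: list) -> dict:
    new_tree = dict(tree)
    for op in ops:
        m = OP_RE.match(op.strip())
        if not m:
            continue                    # reject malformed operations
        kind, body = m.group(1), m.group(2)
        if kind == "NO_OP":
            continue
        path, _, value = (p.strip() for p in body.partition(","))
        if path not in SCHEMA:
            continue                    # path must target a writable leaf
        if kind == "DELETE":
            new_tree[path] = ""
        else:                           # ADD / UPDATE both rewrite the leaf
            new_tree[path] = value[:LEAF_BUDGET]  # per-leaf length control
    return new_tree

tree = {p: "" for p in SCHEMA}
ops = ["ADD(psych.preferences, enjoys hiking)",
       "UPDATE(bad.path, ignored)",
       "NO_OP()"]
print(apply_ops(tree, ops)["psych.preferences"])
```

Note that, as in the text, conflict resolution is not done here: this layer only validates and executes, leaving semantic decisions to the operation generator.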
More importantly, potential\nconflicts between new and old information are resolved by LLM during the generation of Ot. Based on Dt\nand the contextual old values in Tt−1, the model must decide whether to overwrite prior information, preserve\nsalient change cues, or solely append new information. In other words, conflict resolution is explicitly lifted to\nthe policy-generation stage, so as to leverage the LLMs’ holistic inference over semantics, temporal order, and\nnarrative consistency.\n(c) Parsing and Execution. This module serves as a safety gate that enforces structural and capacity\nconstraints when applying Ot: it validates that each path targets a permissible leaf, ensures each value is\na string or an allowed deletion marker to avoid parsing ambiguity and state pollution, and applies length\ncontrol by compressing any overlong value to satisfy the per-leaf budget. Importantly, it performs no conflict\nresolution or secondary semantic rewriting; it only executes the prescribed operations under these constraints.\n(d) Versioned Persistence. After the execution, the updated tree state Tt is materialized and persisted as a\nnew version, either serialized to a JSON file or stored in a JSON-capable database (e.g., document stores\nsuch as MongoDB). Iterating over t = 1, . . . , N yields a traceable evolution sequence {T0, . . . 
, TN}, with TN serving as the compressed long-term memory for retrieval-augmented and personalized generation at the final task-query stage.\n3.3 MemListener Training\nTraining Data Synthesis. During training data construction, we select subsets from HaluMem [2] and PersonaMem [7] that are relevant to implicitly characterizing user-specific attributes as the raw corpus sources. Using the dynamic PersonaTree evolution procedure described in Section 3.2 as the backbone, we invoke DeepSeek-R1-0528 to generate supervision signals for training.\nWarm-up via SFT. We first perform full supervised fine-tuning to initialize the base model as a MemListener that can stably generate operation lists. For any training sample, let the input context be s (including the dialogue chunk, the previous tree state, and rule constraints), and the target output be o (the ground-truth operation sequence segment). We optimize a standard autoregressive cross-entropy objective:\nLSFT(θ) = −(1/τ) Σ_{t=1}^{τ} log Pθ(ot | o<t, s)\n[...]\n, \"\"), UPDATE(, \"\"), DELETE(, None), NO_OP()\\n\\n\nPersona Schema:\\n\\n{schema}\\n\\n\nDialogue History:\\n\\n{dialogue_text}\\n\\n\nNow, based on the given dialogue history, output only the operations:\nTable 9 Prompt for operational generation in PersonaTree for training and inference.\nReward-Function Prompt\nYou are a strict \"overall scorer for attribute-tree operations\". Your task is to assign an overall quality score in [-1, 1] to the model-predicted operation sequence Pred_Ops, given the ground-truth annotated operation sequence GT_Ops.\n[Input]\n- GT_Ops (ground truth): a list of operations, where each element is of the form ADD(path, value) / UPDATE(path, value) / DELETE(path, value) / NO_OP()\n- Pred_Ops (prediction): a list of operations in the same format as above\n[Critical Constraints]\n1) Output only a single JSON object: {\"score\": }.
Do not output any explanation and do not include\nany extra fields.\n2) score must be a continuous floating-point number within [-1, 1] (any value is allowed). It is recommended\nto keep 2 decimal places.\n3) The \"score-tier reference\" below serves only as anchors for aligning overall quality. You should fine-tune\nbetween anchors to output a more granular score.\n4) For example, if the overall quality falls between 0.7 and 1.0, output a value in [0.71, 0.99]; if it falls between\n0.5 and 0.7, output a value in [0.51, 0.69]; and so on.\n[Score-Tier Reference (Overall Quality Anchors)]\n* 1.0 (nearly perfect): Pred and GT are almost entirely consistent on key operations; types/paths are nearly\nidentical; values are semantically equivalent; no redundant operations.\n* 0.7 (high quality): most key operations are correct; only minor value-level deviations, or very few miss-\ning/redundant operations.\n* 0.5 (moderately usable): the overall approach and core direction are correct; some missing/redundant\noperations exist; some paths/values are incorrect, but the main semantics are not affected.\n* 0.3 (partially reliable): about half of the content is reliable; some key operations are correct while others are\nwrong, requiring some fixes.\n* 0.0 (slightly correct): only a small number of operations or fragments are correct; missing/redundant\noperations and errors are evident; key operations are mixed correct/incorrect.\n* -0.3 (barely relevant): broadly related but with many omissions/errors; it is only apparent that the model is\nattempting the task, and it is essentially unusable as-is.\n* -0.5 (clearly off-target): most key operations are missing or incorrect; many wrong paths/types or obviously\nredundant operations; overall deviates from expectations.\n* -0.7 (catastrophic): large-scale structural/semantic disorder; almost unusable.\n* -1.0 (meaningless output): clearly meaningless, garbage text, or unrelated to the task.\n[Output Format]\nOutput only the JSON 
object containing the score, with no additional notes or explanations.\nOutput only:\n{\"score\": }\n[Task Data]\n- GT_Ops:\n{gt_ops}\n- Pred_Ops:\n{pred_ops}\nTable 10 Reward-function prompt for process-reward RL training.\n21\n", + "summary_1": "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. **MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", + "summary_2": "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. 
These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. **PersonaTree:** A globally maintained tree that captures the user's long-term profile, with" + }, + { + "arxiv_id": "2601.05111v1", + "title": "Paper 2601.05111v1", + "abstract": "LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assess- ments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persis- tent memory to enable more robust, verifiable, and nuanced evaluations. 
Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation. � Project Page: https://github.com/ModalityDance/Awesome-Agent-as-a-Judge", + "text": "2026-1-8\nA Survey on Agent-as-a-Judge\nRunyang You*1\nHongru Cai*1\nCaiqi Zhang2\nQiancheng Xu1\nMeng Liu3\nTiezheng Yu4\nYongqi Li†1\nWenjie Li1\n1 The Hong Kong Polytechnic University\n2 University of Cambridge\n3 Shandong Jianzhu University\n4 Huawei Technologies\n* Equal contribution\n† Corresponding author\nrunyang.y@outlook.com, {henry.hongrucai, liyongqi0}@gmail.com,\ncswjli@comp.polyu.edu.hk\nAbstract\nLLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assess-\nments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of\nLLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability\nto verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge,\nwhere agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persis-\ntent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation\nof agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To\nbridge this gap, we present the first comprehensive survey tracing this evolution. 
Specifically, we identify\nkey dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize\ncore methodologies and survey applications across general and professional domains. Furthermore, we\nanalyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap\nfor the next generation of agentic evaluation.\n� Project Page: https://github.com/ModalityDance/Awesome-Agent-as-a-Judge\n1 Introduction\nThe rapid advancement of Large Language Models\n(LLMs) has revolutionized the field of AI evalua-\ntion, giving rise to the LLM-as-a-Judge paradigm [1].\nWhile traditional metrics fail to capture semantic\nnuance and human judgment remains unscalable,\nthis new approach leverages LLMs’ advanced un-\nderstanding and decision-making abilities to deliver\nnear-human quality assessments across diverse do-\nmains [2]. Moreover, serving as a scalable proxy for\nhuman preference, LLM judges can provide reward\nsignals for reinforcement learning [3] and enable the\nautomated curation of massive synthetic datasets [4].\nAs such, LLM judgment has established itself as a cor-\nnerstone of AI evaluation and optimization pipelines,\nwhere the precision of the judge critically determines\nthe success of downstream applications [5].\nHowever, as generative AI applications evolve from\nsimple text responses to complex, multi-step tasks\nacross specialized domains, the reliability of LLM-\nas-a-Judge has become inevitably constrained [2, 6].\nFirst, single-pass evaluators are prone to inherent\nparametric biases—such as favoring verbosity or their\nown output patterns—which compromise their neu-\ntrality when assessing high-complexity responses\nthat deviate from their training distribution [7]. 
Second, naive LLM judges are passive observers, unable to react to real-world observations; they assess answers based on linguistic patterns without verification, leading to hallucinated evaluations in specialized domains [8]. Furthermore, in evaluation tasks that require multifaceted assessment rubrics, traditional LLM judges experience cognitive overload when attempting to evaluate all dimensions comprehensively within a single inference step, which results in coarse-grained scores that fail to reflect specific nuances [9].\nThese limitations have catalyzed the transition from LLM-as-a-Judge to Agent-as-a-Judge. As shown in Figure 1, agentic judges proactively engage in evaluation through multiple capabilities: they decompose complex objectives into subtasks, mitigate biases through multi-agent collaboration [10], ground assessments via tool-augmented evidence collection and correctness verification [8], and enable fine-grained assessment by persisting intermediate states, autonomously planning the evaluation across reasoning steps [11, 12].\nFigure 1: Comparison between LLM-as-a-Judge (1a) and Agent-as-a-Judge (1b). The former performs direct single-pass evaluation, while the latter leverages planning, memory, and tool-augmented capabilities for enhanced evaluation.
This paradigm shift enables\nmore robust, verifiable, and nuanced assessments\nthat effectively address the multifaceted nature of\nsophisticated AI-generated evaluands.\nDespite the above potentials and rapid proliferation\nof agentic evaluation systems, the field lacks a survey\nto summarize and navigate this shifting landscape.\nTo bridge this gap, we present the first comprehensive\nsurvey for Agent-as-a-Judge through the following\ncontributions:\n• We identify and characterize the shift from LLM-\nas-a-Judge to Agent-as-a-Judge and summarize the\nagentic judges’ development trend into three pro-\ngressive stages (Section 2).\n• We organize core methodologies into five key parts\naccording to agent’s abilities (Section 3) and sur-\nvey their applications across general and profes-\nsional domains (Section 4).\n• We analyze frontier challenges and identify promis-\ning research directions, providing a strategic\nroadmap for the next generation of robust and\nverifiable AI judgment.\n2 Evolution:\nFrom LLM-as-a-Judge to\nAgent-as-a-Judge\nThis section traces the evolution of automated\nevaluation from LLM-as-a-Judge to Agent-as-a-Judge\nparadigms. We begin by reviewing the foundational\nLLM-as-a-Judge and its limitations. We then exam-\nine the shift toward Agent-as-a-Judge, analyzing key\ndimensions that characterize the agentic approach.\nFinally, we summarize Agent-as-a-Judge’s develop-\nment trend into three progressive stages with distinct\nlevels of autonomy and adaptability.\n2.1 LLM-as-a-Judge\nLLM-as-a-Judge paradigm emerged to overcome the\nscalability limits of human evaluation and the se-\nmantic insensitivity of traditional metrics. Zheng\net al., [1] formalized this approach by introducing\nbenchmarks like MT-Bench to assess model align-\nment. 
Building on this, G-Eval [59] leveraged chain-of-thought prompting for better alignment in natural language generation (NLG), while Prometheus [60] induced fine-grained evaluation in open-source models via specialized tuning. To mitigate systematic issues like position and verbosity bias [7], JudgeLM [61] utilized fine-tuning to develop more robust evaluators.\n2.2 From LLM-as-a-Judge to Agent-as-a-Judge\nAs evaluands evolve from simple text responses to complex, multi-step tasks across specialized domains, traditional LLM-as-a-Judge has become increasingly inadequate, focusing on final outputs while failing to verify intermediate actions or satisfy the rigorous standards of professional fields [2, 6]. To bridge this gap, the paradigm is shifting toward Agent-as-a-Judge, which employs decentralized deliberation, executable verification, and fine-grained assessment to mitigate these limitations.\nEvolving Robustness: From Monolithic to Decentralized. To mitigate the inherent parametric biases of monolithic LLM judges, such as the tendency to favor verbosity or their own output patterns, Agent-as-a-Judge paradigms employ specialized, decentralized agents that collaborate through autonomous decision-making [10, 13]. Crucially, this decentralized architecture facilitates the injection of expert prior knowledge: by decomposing complex evaluation goals into sub-tasks or structuring specific interaction workflows, we can enforce domain-specific constraints that a generalist model typically overlooks [16, 24]. Furthermore, multi-agent deliberation ensures collective robustness; distinct roles can isolate specific information points to neutralize bias, while debate and self-reflection allow agents to audit their own cognitive shortcuts, ensuring the final judgment transcends the heuristics of any single model [62, 7].\nFigure 2: A taxonomy of Agent-as-a-Judge organizing Methodologies (§3) and Applications (§4). Background gradients illustrate the coverage of developmental stages, from Procedural to Reactive and then to Self-Evolving.\nMethodologies (§3):\n• Multi-Agent Collaboration (§3.1): e.g., ChatEval [10], M-MAD [56], SAGEval [44], HiMATE [9], CAFES [57], GEMA-Score [24], CourtEval [58]\n• Planning (§3.2): e.g., MATEval [52], Evaluation Agent [28], EvalAgents [53], AGENT-X [45], ARJudge [54], OnlineRubrics [55]\n• Tool Integration (§3.3): e.g., Agent-as-a-Judge [48], CodeVisionary [51], Evaluation Agent [28], ARM-Thinker [12], HERMES [37], VerifiAgent [38], Agentic Reward Modeling [8]\n• Memory and Personalization (§3.4): e.g., HERMES [37], ARM-Thinker [12], Agent-as-a-Judge [48], RLPA [11], SynthesizeMe [46], PersRM-R1 [49], FSPO [50]\n• Optimization Paradigms (§3.5): e.g., Prompting (Evaluation Agent [28], HERMES [37], Multi-Agent LLM Judge [43], SAGEval [44], AGENT-X [45]), SFT (SynthesizeMe [46]), RL (TIR-Judge [47], ARM-Thinker [12])\nApplications (§4):\n• General Domains (§4.1): Math and Code (e.g., HERMES [37], VerifiAgent [38], CompassVerifier [39], xVerify [40], Agentic Reward Modeling [8], Multi-Agent Verification [41], Popper [42]); Fact-Checking (e.g., FACT-AUDIT [34], UrduFactCheck [35], NarrativeFactScore [36]); Conversation and Interaction (e.g., IntellAgent [30], ESC-Judge [31], Sentient Agent [32], PSYCHE [33]); Multimodal and Vision (e.g., CIGEval [27], Evaluation Agent [28], LRQ-Fact [29], ARM-Thinker [12])\n• Professional Domains (§4.2): Education (e.g., MAJ-Eval [13], AutoSCORE [14], GradeOpt [15], Grade-Like-a-Human [16]); Finance (e.g., FinResearchBench [17], FinDeepResearch [18], SAEA [19], M-SAEA [20]); Law (e.g., AgentsCourt [21], SAMVAD [22], AgentsBench [23]); Medicine (e.g., MAJ-Eval [13], GEMA-Score [24], ChatCoach [25], AI Hospital [26])\nEvolving Verification: From Intuition to Execution. Static LLM judges are fundamentally passive observers, unable to react to real-world feedback. They assess answers based on linguistic plausibility – how correct a response looks – without verification or evidence collection, leading to \"hallucinated correctness\" in complex tasks [8]. Agent-as-a-Judge bridges this reality gap by replacing intuition with execution. By interacting with external environments, agentic judges can query system states to verify side effects (e.g., file operations) [48, 51], use code interpreters or theorem provers to validate logical consistency [37], and employ search tools to ground factual claims in real-time documentation [38, 8]. This shifts the evaluative anchor from internal model knowledge to objective verification.\nEvolving Granularity: From Global to Fine-grained. Many evaluation tasks inherently require multifaceted assessment rubrics, yet traditional LLM judges face cognitive overload when evaluating these dimensions comprehensively within a single inference step, which results in coarse-grained scores that fail to reflect specific nuances [9]. Agent-as-a-Judge addresses this by transforming evaluation from a single-pass inference into autonomous, hierarchical reasoning [9]. Instead of a monolithic assessment, an agentic judge can dynamically select or create task-specific rubrics, autonomously planning the evaluation to examine each component of the evaluand independently [44], utilizing memory to track historical reasoning states and synthesize fragmented evidence into a coherent verdict.
Consequently, these agents can pinpoint specific flaws that would otherwise be obscured in a global score, providing fine-grained feedback on each aspect [45].\n2.3 Agent-as-a-Judge\nAgent-as-a-Judge represents a rapidly expanding field where the term \"agent\" is often applied loosely, spanning a heterogeneous range from procedural agentic workflows to autonomous self-evolvers [10, 45, 12]. To provide a clear roadmap through this complexity, we summarize the ongoing development of agency as follows.\nProcedural Agent-as-a-Judge decouples monolithic inference into agentic predefined workflows [57, 24] or engages in structured discussions among fixed sub-agents [10, 56]. These systems enable complex judgments through coordinated multi-agent interactions, yet remain constrained by predetermined decision rules that cannot adapt to novel evaluation scenarios.\nReactive Agent-as-a-Judge enables adaptive decision-making by routing execution paths [28, 45] and invoking external tools [8] or sub-agents [13] based on intermediate feedback. However, such reactivity remains confined to conditional routing within fixed decision spaces, lacking autonomy to refine underlying rubrics.\nSelf-Evolving Agent-as-a-Judge represents the cutting edge of the field, characterized by high autonomy and the ability to refine internal components during operation, synthesizing evaluation rubrics on-the-fly [53] and updating memory with lessons learned. This paradigm opens new frontiers for adaptive evaluation systems, though challenges remain in ensuring stability during self-modification [63].\n3 Methodologies\nThis section categorizes Agent-as-a-Judge methodologies into five dimensions: multi-agent collaboration, planning, tool integration, memory and personalization, and optimization paradigms.
As shown in Figure 2, implementation sophistication reveals the evolutionary stages: foundational methodologies (collaboration, tool integration, optimization) evolve across all stages, while others (planning, memory) emerge more prominently in advanced paradigms. The following subsections examine how each methodology manifests across these stages.\nFigure 3: Multi-agent collaboration paradigms: (a) Collective Consensus and (b) Task Decomposition.\n3.1 Multi-Agent Collaboration\nMulti-agent collaboration leverages collective reasoning to mitigate single-LLM biases in Agent-as-a-Judge systems. Early systems followed Procedural paradigms with fixed protocols, while recent work evolves toward Reactive approaches that adapt agent selection based on feedback. We categorize these into two topologies:\nCollective Consensus. Horizontal debate mechanisms leverage agents representing diverse perspectives to counteract the inherent biases of single-LLM evaluators, illustrated in Figure 3. Early approaches exemplified the Procedural stage: ChatEval [10] pioneered this with a courtroom-inspired discussion mechanism where agents debate as equals following predefined protocols. This paradigm was later extended to machine translation in M-MAD [56], while subsequent research [64] introduced explicit stances and \"judge\" roles to prevent agents from blindly conforming to the majority. Recent methods have become more Self-Evolving: approaches like Multi-agent-as-judge [13] have moved beyond static ensembles by creating domain-specific experts based on intermediate feedback.\nTask Decomposition. Task Decomposition employs a \"Divide and Conquer\" strategy, delegating distinct subtasks to specialized agents for systematic evaluation, illustrated in Figure 3.
Early frameworks followed Procedural designs: sequential approaches like CAFES [57] and GEMA-Score [24, 58] structure evaluation into predefined stages (e.g., Evidence Gathering, Reasoning, Scoring), while SAGEval [44] introduces supervision via a \"Judge the Judge\" meta-evaluator that reviews previous agents’ decisions, with hierarchical approaches like HiMATE [9] organizing agents into tree structures for varying error granularities. More recent work has shifted toward Reactive paradigms: AGENT-X [45] employs an adaptive router agent that dynamically selects the most relevant base agents based on intermediate analysis results.\nTakeaway: Multi-agent evaluation frameworks adopt two main topologies: Collective Consensus and Task Decomposition. Recent advances have evolved toward more autonomous systems that can select or generate subagents.\n3.2 Planning\nPlanning serves as a core capability in the Agent-as-a-Judge paradigm, enabling the decomposition of high-level evaluation objectives into executable sub-tasks and the dynamic adaptation of assessment trajectories based on intermediate analysis. This section examines planning capabilities from two perspectives:\nWorkflow Orchestration. Workflow orchestration in Agent-as-a-Judge systems spans from static frameworks to dynamic agency, primarily characterizing Procedural and Reactive stages of agentic evaluation. Approaches like MATEval [52] rely on static decomposition, breaking tasks into fixed sequences of sub-dimensions. While this ensures systematic assessment through predefined control flows, it limits adaptability in complex scenarios. Conversely, Evaluation Agent [28] introduces dynamic multi-round planning, where agents adjust strategies based on intermediate feedback.
This system further optimizes\nefficiency through autonomous termination, allow-\ning the agent to self-monitor information gain and\nproactively halt execution once sufficient evidence is\ngathered.\nRubric Discovery.\nUnlike general agents focused on\ntask completion, Judge Agents have the distinct capa-\nbility to autonomously formulate and refine rubrics,\nrepresenting a hallmark of the Self-Evolving stage,\nwhere agents can refine their internal evaluation com-\nponents. EvalAgents [53] exemplifies this by employ-\ning a Query Generator that plans web searches to\ndiscover implicit rubrics, while AGENT-X [45] uses an\nAdaptive Router to infer domain context and plan be-\nspoke detection guidelines. ARJudge [54] adaptively\nformulates rubrics by iteratively generating context-\nsensitive questions, and OnlineRubrics [55] inte-\ngrates planning into reinforcement learning, evolv-\ning rubrics alongside policy optimization to detect\nreward hacking.\nTakeaway Serving as the strategic engine, planning\nshifts evaluation from rigid flows to adaptive explo-\nration, enabling agents to optimize how they evalu-\nate (workflow orchestration) and what they evaluate\n(rubric discovery).\n3.3 Tool Integration\nTool integration is a defining capability of Agent-as-\na-Judge frameworks, enabling judges to ground eval-\nuation in external evidence and explicit checks. As\nshown in Table 1, existing approaches can be grouped\ninto evidence collection and correctness verification\nbased on the purpose of tool use.\nEvidence Collection. A common use of tools in\nAgent-as-a-Judge frameworks is to collect additional\nevidence that supports evaluations. Such evidence in-\ncludes intermediate artifacts, execution results, and\nperceptual signals that cannot be reliably obtained\nthrough text-based reasoning. 
In code-related tasks, Agent-as-a-Judge [48] and CodeVisionary [51] allow judges to inspect execution artifacts or run automated checks to expose execution feedback for evaluation. Similar methods are adopted in multimodal settings. Evaluation Agent [28] enables judges to invoke external visual models to obtain visual quality or alignment signals, while ARM-Thinker [12] gathers fine-grained visual and contextual evidence through document access and localized visual operations. Overall, these works integrate tools to surface observable and task-relevant evidence, expanding the judge’s access to execution-level, perceptual, and contextual information, and supporting more reliable evaluation.\nCorrectness Verification. Another line of work employs tools to verify whether the evaluand’s outputs or intermediate reasoning steps satisfy explicit correctness constraints, such as logical validity, mathematical soundness, or factual consistency. In these frameworks, the judge agent identifies which claims or steps require verification and invokes appropriate tools to check them.\nTable 1: Tool integration in representative Agent-as-a-Judge methods, grouped by primary tool usage purpose.\n| Tool Purpose | Method | Evaluation Task | Tool Type |\n| --- | --- | --- | --- |\n| Evidence collection | Agent-as-a-Judge [48] | Code generation | Graph, locate, read, search, retrieve |\n| Evidence collection | CodeVisionary [51] | Code generation | Code execution, static linter, unit tests, screenshot, web browsing |\n| Evidence collection | Evaluation Agent [28] | Visual generation | Visual generative models |\n| Evidence collection | ARM-Thinker [12] | Multimodal generation | Instruction following checks, crop/zoom tools, document retrieval tools |\n| Correctness verification | HERMES [37] | Math reasoning | Translator, theorem prover |\n| Correctness verification | VerifiAgent [38] | Factual & Math reasoning | Search engine, Python interpreter, theorem prover |\n| Correctness verification | Agentic RM [8] | Factual & Math reasoning | Search engine, Python interpreter |
The resulting verification signals are then interpreted by the agent in context to inform the final evaluation. HERMES [37] verifies mathematical reasoning through formal theorem proving, while VerifiAgent [38] invokes programmatic and symbolic checkers to validate factual and computational claims. Agentic Reward Modeling [8] further integrates correctness verification by combining fact-checking tools and programmatic validators to produce structured correctness signals that inform the final evaluation.

Takeaway: Tool integration in Agent-as-a-Judge grounds evaluation in observable and verifiable signals by allowing judges to actively gather evidence and check correctness through external tools.

3.4 Memory and Personalization

Memory enables Agent-as-a-Judge frameworks to retain information across evaluation steps, supporting multi-step reasoning, consistent judgment, and reuse of prior results. We categorize prior work by the role of memory, including intermediate state tracking and personalized context preservation.

Intermediate State. In multi-step evaluation settings, Agent-as-a-Judge frameworks use memory to retain intermediate evaluation states generated during assessment, providing the necessary context for conditional routing and adaptive decision-making based on intermediate feedback, a fundamental mechanism for Reactive Agent-as-a-Judge. HERMES [37] retains intermediate proof states when combining reasoning with formal theorem proving, enabling consistent verification across long reasoning chains. ARM-Thinker [12] preserves intermediate evidence such as visual reasoning outputs and tool interaction results, which are later reused to ground evaluation. Agent-as-a-Judge [48] records execution traces and step-level feedback, enabling evaluation beyond final outputs to account for intermediate behavior.
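A minimal sketch of such intermediate-state memory, assuming a hypothetical `JudgeMemory` record store: each step's feedback is retained so that later routing and the final verdict can condition on the whole trace rather than only the final output.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeMemory:
    """Minimal intermediate-state memory for a multi-step judge (illustrative).
    Step records accumulate so later decisions can condition on earlier feedback."""
    steps: list = field(default_factory=list)

    def record(self, step: str, signal: str, passed: bool) -> None:
        self.steps.append({"step": step, "signal": signal, "passed": passed})

    def context(self) -> str:
        # Serialized trace a judge could prepend to its next reasoning prompt.
        return "\n".join(
            f"[{i}] {s['step']}: {'OK' if s['passed'] else 'FAIL'} ({s['signal']})"
            for i, s in enumerate(self.steps)
        )

    def verdict(self) -> str:
        # Step-aware judgment: reject if any intermediate check failed.
        return "accept" if all(s["passed"] for s in self.steps) else "reject"

mem = JudgeMemory()
mem.record("requirement check", "all constraints located in output", True)
mem.record("unit tests", "3/4 tests passed", False)
```

Here `mem.verdict()` is "reject" because one intermediate check failed, and `mem.context()` carries the trace forward into the next evaluation step.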
Collectively, these methods use memory to retain intermediate states that support cumulative, step-aware evaluation.

Personalized Context. Agent-as-a-Judge frameworks often incorporate memory to retain user-related information that conditions evaluation across interactions. Such memory captures user preferences, evaluation standards, or prior feedback, allowing judgments to remain consistent over time. PersRM-R1 [49] and FSPO [50] store preference data derived from historical interactions, including preference labels or few-shot examples, which are reused to condition subsequent evaluations for the same user. More advanced approaches abstract historical preference signals into persistent user personas or long-term profiles. RLPA [11] and SynthesizeMe [46] exemplify this by constructing and maintaining user personas that are stored and reused to guide evaluation. Such long-term user profiling often serves to support Self-Evolving Agent-as-a-Judge, enabling continuous optimization based on evolving preferences. Together, these methods use memory to preserve personalized context that shapes evaluation behavior and ensures consistency across interactions.

Takeaway: Memory enables Agent-as-a-Judge to preserve intermediate states and personalized context, supporting multi-step evaluation, consistent judgment, and efficient reuse of prior information.

3.5 Optimization Paradigms

Optimization paradigms define how Agent-as-a-Judge improves evaluation quality by updating model parameters or adapting evaluation behaviors. We organize prior work into two groups: training-time optimization and inference-time optimization.

Training-Time Optimization. Training-time optimization improves Agent-as-a-Judge by updating model parameters to align judgment behavior with evaluation objectives.
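Training-time optimization of a judge is typically driven by preference data. One standard objective, shown here as a hedged sketch rather than any cited system's recipe, is a Bradley-Terry style pairwise loss that rewards the judge for scoring the preferred output above the rejected one.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(s_chosen - s_rejected).
    Minimizing it pushes the judge to score preferred outputs higher.
    Sketch only: real training backpropagates through a full scoring model."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in the right direction yields a smaller loss.
well_ordered = pairwise_preference_loss(2.0, -1.0)   # chosen scored higher
mis_ordered = pairwise_preference_loss(-1.0, 2.0)    # chosen scored lower
```

With equal scores the loss is log 2; it shrinks as the judge separates chosen from rejected outputs in the right direction.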
Supervised fine-tuning is commonly used to standardize judge behavior, training models to follow explicit criteria and produce structured judgments across tasks. For example, SynthesizeMe [46] shapes evaluation behavior using persona-guided supervision derived from historical data. Reinforcement learning optimizes judges to perform evaluation actions more effectively, especially in settings that require tool use and multi-step verification. TIR-Judge [47] and ARM-Thinker [12] train judges to decide when and how to invoke tools, integrate external signals, and verify intermediate results. Together, training-time optimization shapes internal decision processes, enabling more reliable, structured evaluation.

Inference-Time Optimization. Inference-time optimization adapts evaluation behavior without updating model parameters by controlling how judgments are produced through prompts, workflows, or agent interactions. Existing approaches can be broadly grouped into two types. 1) The first type follows predefined evaluation procedures, where reasoning steps, verification routines, or prompts are fixed in advance to ensure consistency and efficiency. Evaluation Agent [28] and HERMES [37] exemplify this setting by adopting structured, step-by-step evaluation pipelines.
2) The second type allows evaluation behavior to adapt during inference, where the evaluation process, participating agents, or applied criteria change based on intermediate results. Multi-Agent LLM Judge [43] iteratively refines prompts and context through multi-judge coordination, while SAGEval [44] introduces a meta-judge to monitor and revise judge behavior. ChatEval [10] and AGENT-X [45] further support adaptive evaluation through agent interaction and dynamic guideline selection. Overall, inference-time optimization enables flexible control over evaluation behavior, ranging from fixed procedures to adaptive, interaction-driven judgment.

Takeaway: Optimization improves Agent-as-a-Judge by either learning evaluation behavior through parameter updates at training time or adjusting evaluation strategies at inference time.

4 Application

Building on the methods above, this section describes how Agent-as-a-Judge methods are applied in different evaluation tasks. As shown in Figure 4, we organize representative applications into two groups: general domains and professional domains.

Figure 4: An overview of Agent-as-a-Judge application domains and their fine-grained task categories. [Figure: general domains (conversation and interaction, math and code, multimodal and vision, fact-checking) and professional domains (law, education, finance, medicine), each with fine-grained tasks such as machine translation, code generation, mathematical reasoning, image generation, radiology reports, legal judgment, risk evaluation, and essay scoring.]

4.1 General Domains

Math and Code. In math and code evaluation, Agent-as-a-Judge systems move beyond single-pass scoring by grounding judgment in verifiable reasoning signals.
One line of work augments free-form reasoning with explicit correctness checks. HERMES [37] anchors LLM reasoning to intermediate formal proof steps, reducing drift in long derivations. VerifiAgent [38] decouples high-level reasoning assessment from tool-based correctness verification, enabling adaptive checking across reasoning types. CompassVerifier [39] and xVerify [40] focus on mathematical and logical outputs, addressing equivalence checking under diverse surface forms. Other approaches strengthen judgment by aggregating multiple evaluation signals. Multi-Agent Verification [41] distributes evaluation across aspect-specific judges. Agentic Reward Modeling [8] integrates preference-based supervision with verifiable correctness signals to improve reward reliability. Popper [42] formulates judgment as controlled falsification, using statistical tests to validate free-form claims.

Fact-Checking. In fact-checking, Agent-as-a-Judge reframes evaluation from static label prediction to interactive verification with evidence gathering and justification. FACT-AUDIT [34] models fact-checking as an agentic loop with multi-agent collaboration, jointly evaluating verdict accuracy and justification quality. This paradigm is particularly effective when evidence is scarce or inconsistencies are subtle. UrduFactCheck [35] improves robustness in low-resource settings through multilingual retrieval and evidence boosting. NarrativeFastScore [36] addresses long-context factual consistency by constructing character-level knowledge representations, enabling detection of state and relation errors with actionable feedback.

Conversation and Interaction. In conversation and interaction, Agent-as-a-Judge shifts from grading isolated replies to constructing multi-turn exchanges, enabling evaluation under evolving goals, constraints, and user reactions.
For task-oriented dialogue, IntellAgent [30] uses interactive user simulations to synthesize conversational benchmarks, while Kazi et al. [65] introduce frameworks for controllable user goals and automatic measures. For affective and social interaction, ESC-Judge [31] constructs emotional-support agents via standardized counseling skills, Sentient Agent [32] tracks emotion trajectories over time to reflect higher-order social cognition, and PSYCHE [33] builds psychiatric patient profiles for ethical assessment validation. Wu et al. [66] frame evaluation as multi-perspective role play with diverse reviewer personas to cover both objective and subjective dimensions.

Multimodal and Vision. In the multimodal and vision domain, Agent-as-a-Judge shifts from static scoring to interactive inspection. For visual generation, CIGEval [27] orchestrates specialized tools to probe control adherence and subject consistency, while Evaluation Agent [28] runs multi-round checks to provide user-tailored, explainable analyses. For truthfulness evaluation, LRQ-Fact [29] generates targeted fact-checking questions across image and text to guide evidence retrieval, while ARM-Thinker [12] selectively invokes tools like image inspection for finalizing judgments.

4.2 Professional Domains

Medicine. In high-stakes clinical NLP, Agent-as-a-Judge appears in two forms: 1) multi-agent evaluators that decompose clinical quality into specialized roles, and 2) agentic simulators that interactively elicit clinical behaviors. For 1), MAJ-Eval [13] constructs multiple evaluator personas to debate and cross-verify responses, while GEMA-Score [24] uses agent collaboration to compute granular, tool-assisted scores covering disease severity and uncertainty.
For 2), Chat-Coach [25] pairs autonomous patient and coach agents to critique trainee-doctor dialogues, while AI Hospital [26] evaluates LLM "doctors" in multi-agent simulators, though final scoring often still requires conventional metrics.

Law. In the legal domain, Agent-as-a-Judge simulates the adversarial and deliberative nature of jurisprudence through multi-agent interaction. AgentsCourt [21] introduces adversarial debate frameworks where agents role-play as prosecutors, defense attorneys, and judges, exposing the evaluating agent to conflicting arguments to improve verdict robustness. SAMVAD [22] and AgentsBench [23] model judicial consensus by simulating bench deliberation processes, capturing interactions between concurring and dissenting opinions to enhance legal judgment prediction.

Finance. In finance, Agent-as-a-Judge addresses two limitations of static benchmarks: 1) capturing the internal research logic of long-form analyst reports, and 2) detecting deployment risks like hallucinations and temporal staleness. For 1), FinResearchBench [17] extracts logic trees from reports as intermediate structures for comprehensive assessment, whereas FinDeepResearch [18] can synthesize hierarchical rubrics but still relies on predefined workflows. For 2), SAEA [19] proposes auditing agent trajectories to mitigate hallucinations and temporal misalignment. From Tasks to Teams [20] extends this approach with M-SAEA to trace multi-agent failures, such as cross-agent divergence and error propagation.

Education. In the educational domain, Agent-as-a-Judge systems emulate pedagogical nuance through collaborative, role-specialized workflows. Grade-Like-Human [16] and AutoSCORE [14] decompose grading into staged processes (rubric construction, evidence recognition, cross-review) to improve grounding and consistency.
Beyond static scoring, MAJ-Eval [13] uses multi-persona debates to align with multi-dimensional human evaluation, while GradeOpt [15] introduces agents that diagnose discrepancies and iteratively refine grading guidelines.

5 Discussion

This section discusses broader issues that arise when deploying Agent-as-a-Judge systems in practice. We first summarize key challenges that limit scalability, reliability, and real-world adoption, and then outline several future directions that may help address these limitations and further advance agentic evaluation.

5.1 Challenges

Agent-as-a-Judge improves evaluation reliability through planning, tool use, memory, and multi-agent collaboration, but these capabilities also introduce new challenges beyond static LLM-as-a-Judge. Key challenges include computational cost, latency, safety, and privacy.

Computational Cost. Agent-as-a-Judge introduces a heavier computational burden in both training and inference. 1) Training a judge agent is expensive. Supervised fine-tuning alone is often insufficient to support agentic behaviors such as tool invocation, long-horizon planning, and adaptive decision making. Reinforcement learning provides a natural way to acquire these capabilities, but it significantly increases training cost, especially when the judge operates over long trajectories or complex tool-calling sequences. 2) Inference with Agent-as-a-Judge is also costly. Unlike single-pass judgment, agentic evaluation typically involves multiple reasoning steps, intermediate decisions, and coordination among multiple agents, all of which increase computation per evaluation.

Latency. In addition to higher computational cost, Agent-as-a-Judge often suffers from increased inference latency. Agentic evaluation requires sequential reasoning steps, external tool calls, or multi-agent communication, each of which introduces additional delays.
This latency can be particularly problematic in real-time or interactive settings, such as online model evaluation, user-facing content moderation, or reinforcement learning loops where rapid feedback is required. As a result, there exists a tension between evaluation reliability and practical deployment constraints, where more thorough agentic judgment may not be feasible under strict latency budgets.

Safety. While Agent-as-a-Judge is designed to improve evaluation robustness, it also raises new safety concerns. Tool-augmented judges may access external systems such as search engines, code executors, or databases, which expands the attack surface for prompt injection, tool misuse, or unintended side effects. Multi-agent collaboration can further amplify risks if unsafe behaviors propagate across agents or if adversarial interactions emerge. Moreover, when judge agents are used to provide reward signals for model optimization, systematic biases or errors in agentic judgment may be reinforced and amplified during training, leading to unintended model behaviors.

Privacy. Agent-as-a-Judge also introduces privacy challenges, particularly in settings that involve persistent memory or personalized evaluation. To maintain consistency or adapt judgments to specific users or contexts, judge agents may store intermediate states, user information, or historical interaction data. If not carefully designed, such memory mechanisms can increase the risk of sensitive data leakage or unauthorized inference about user attributes.
This issue becomes more pronounced in professional domains such as medicine, law, or education, where evaluation often relies on confidential or personally identifiable information.

5.2 Future Directions

Personalization. Current Agent-as-a-Judge systems are constrained by static, one-size-fits-all evaluation criteria, failing to align with diverse individual preferences. To bridge this gap, future research should focus on enhancing the autonomy and adaptivity of judge agents. A critical enabler is proactive memory management: rather than passively retrieving history, agents must actively manage the lifecycle of user-specific knowledge, autonomously deciding when to register new preferences, update evolving standards, or prune obsolete feedback. This agentic control transforms memory into a dynamic belief system, allowing the judge to continuously refine its criteria and maintain alignment with the user's specific values and usage contexts.

Generalization. Current systems rely on predefined rubrics constructed offline, limiting their ability to generalize across diverse or open-ended tasks. Future judge agents should leverage planning capabilities to dynamically discover and adapt evaluation criteria. 1) Context-Aware Rubric Generation: agents should synthesize evaluation criteria on the fly by analyzing the specific intent and complexity of responses, identifying relevant assessment dimensions not anticipated during design. 2) Adaptive Multi-Granularity Scoring: rubrics should dynamically scale based on task difficulty, applying high-level holistic criteria for straightforward tasks while decomposing into fine-grained sub-rubrics for complex workflows.

Interactivity. Current systems operate as passive, one-way observers. Future agents should evolve into interactive evaluators that actively engage with both the environment and human stakeholders.
1) Interactive Environmental Feedback: instead of static test suites, judge agents should dynamically tailor evaluation trajectories, autonomously escalating task complexity or isolating edge cases to rigorously probe the evaluand's failure boundaries. 2) Human-Agent Collaborative Calibration: to address subjective or ambiguity-rich scenarios, agents should leverage human-in-the-loop mechanisms. By proactively consulting experts to verify intent or resolve conflicts, the judge refines its criteria through multi-turn alignment, ensuring higher trust and interpretability.

Optimization. Current approaches predominantly rely on inference-time engineering, which is fundamentally bottlenecked by the fixed capabilities of frozen backbones. To transcend these limits, the field must pivot towards training-based optimization. This paradigm shift entails two key levels: 1) Individual Capability: utilizing reinforcement learning (RL) to internalize complex agentic behaviors, such as sequential planning and adaptive tool use, that are difficult to elicit via prompting alone. 2) Learned Coordination: extending optimization to multi-agent settings. Rather than ad-hoc inference collaboration, agents should be trained with joint objectives to intrinsically learn effective communication and consensus strategies.

Concluding Remarks: Towards True Autonomy. As characterized in Section 2, existing implementations exhibit varying degrees of agency. The future directions discussed above (personalization, generalization, interactivity, and optimization) collectively point toward an evolutionary trajectory of increasing autonomy.
The next generation of judge agents must transcend fixed protocols to become genuinely agentic entities capable of self-directed adaptation, active context curation, and continuous self-refinement, ultimately realizing the full potential of agents that actively perceive, reason, and evolve alongside the models they assess.

6 Conclusion

This paper provides the first comprehensive survey of Agent-as-a-Judge. We established a novel taxonomy and demonstrated how agentic capabilities, including multi-agent collaboration, autonomous planning, tool integration, and memory, overcome the limitations of naive LLM judges to deliver more robust, verifiable, and nuanced judgments across general and professional domains. While promising, this evolution presents challenges in computational cost, latency, safety, and privacy. Future progress should prioritize personalization, generalization, and optimization, ultimately realizing truly autonomous evaluators that continuously adapt to the evolving AI landscape.

Limitations

Early Stage of Paradigm Consensus. As a pioneering survey exploring the evolution of Agent-as-a-Judge, this study faces the challenge that the field has not yet gained widespread recognition in academia. Although the transition from LLM-as-a-Judge to Agent-as-a-Judge has begun to take shape, there is still a lack of long-term consensus regarding the definition of evaluation agents. Nevertheless, establishing this foundational framework is essential to orienting future research. We are committed to iteratively refining this taxonomy as the paradigm matures and gains broader recognition.

Inclusion of Early Prompting Methods. We acknowledge a potential gap between early methodologies and the increasingly rigorous definitions of agents.
Many pioneering works in automated evaluation, though labeled as "agents", rely heavily on prompt engineering, such as fixed role-play, which may not align with the strict criteria for autonomy, dynamic planning, or tool use held by the current community. Nevertheless, we deliberately include these prompt-based frameworks because they represent the initial shift from monolithic inference toward dynamic decomposition and self-evolving systems. Excluding them would obscure this transition, compromising a complete understanding of the field's evolution.

Ethics Statement

This work does not involve the use or creation of datasets or scientific artifacts that would require specific ethical clearance, data privacy considerations, or licensing agreements. We believe this work adheres to the ethical guidelines of the conference and poses no immediate negative social impact.

References

[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[2] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025.

[3] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback.
In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org, 2024.

[4] Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 11065–11082, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.658. URL https://aclanthology.org/2024.findings-acl.658/.

[5] Hanyu Lai, Xiao Liu, Junjie Gao, Jiale Cheng, Zehan Qi, Yifan Xu, Shuntian Yao, Dan Zhang, Jinhua Du, Zhenyu Hou, Xin Lv, Minlie Huang, Yuxiao Dong, and Jie Tang. A survey of post-training scaling in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2771–2791, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.140. URL https://aclanthology.org/2025.acl-long.140/.

[6] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025.

[7] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440–9450, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
doi: 10.18653/v1/2024.acl-long.511. URL https://aclanthology.org/2024.acl-long.511/.

[8] Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, and Juanzi Li. Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15934–15949, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.775. URL https://aclanthology.org/2025.acl-long.775/.

[9] Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, and Derek F. Wong. HiMATE: A hierarchical multi-agent framework for machine translation evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11121–11145, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.593. URL https://aclanthology.org/2025.findings-emnlp.593/.

[10] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FQepisCUWu.

[11] Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, and Ting Liu. Teaching language models to evolve with users: Dynamic profile modeling for personalized alignment.
In Proceedings of the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025), 2025.

[12] Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, and Jiaqi Wang. ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning, 2025.

[13] Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, and Dakuo Wang. Multi-agent-as-judge: Aligning LLM-agent-based automated evaluation with multi-dimensional human evaluation. In First Workshop on Multi-Turn Interactions in Large Language Models, 2025. URL https://openreview.net/forum?id=7AetgL7eVL.

[14] Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, and Xiaoming Zhai. AutoSCORE: Enhancing automated scoring with multi-agent large language models via structured component recognition. arXiv preprint arXiv:2509.21910, 2025.

[15] Yucheng Chu, Hang Li, Kaiqi Yang, Harry Shomer, Hui Liu, Yasemin Copur-Gencturk, and Jiliang Tang. An LLM-powered automatic grading framework with human-level guidelines optimization. arXiv preprint arXiv:2410.02165, 2024.

[16] Wenjing Xie, Juxin Niu, Chun Jason Xue, and Nan Guan. Grade like a human: Rethinking automated assessment with large language models. arXiv preprint arXiv:2405.19694, 2024.

[17] Rui Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, and Zhengwen Qiu. FinResearchBench: A logic tree based agent-as-a-judge evaluation framework for financial research agents. In Proceedings of the 6th ACM International Conference on AI in Finance, pages 656–664, 2025.

[18] Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng, Chao Wang, Tianhui Tan, Xuan Yao, Pengyang Shao, Min Xu, et al. FinDeepResearch: Evaluating deep research agents in rigorous financial analysis.
arXiv preprint arXiv:2510.13936, 2025.

[19] Zichen Chen, Jiaao Chen, Jianda Chen, and Misha Sra. Standard benchmarks fail: Auditing LLM agents in finance must prioritize risk, 2025. URL https://arxiv.org/abs/2502.15865.

[20] Zichen Chen, Jianda Chen, Jiaao Chen, and Misha Sra. From tasks to teams: A risk-first evaluation framework for multi-agent LLM systems in finance. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025.

[21] Zhitao He, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Kang Liu, and Jun Zhao. AgentsCourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9399–9416, 2024.

[22] Prathamesh Devadiga, Omkaar Jayadev Shetty, and Pooja Agarwal. SAMVAD: A multi-agent system for simulating judicial deliberation dynamics in India. arXiv preprint arXiv:2509.03793, 2025.

[23] Cong Jiang and Xiaolei Yang. AgentsBench: A multi-agent LLM simulation framework for legal judgment prediction. Systems, 13(8):641, 2025.

[24] Zhenxuan Zhang, Kinhei Lee, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C. Marshall, Yingying Fang, and Guang Yang. GEMA-Score: Granular explainable multi-agent score for radiology report evaluation. CoRR, abs/2503.05347, March 2025. URL https://doi.org/10.48550/arXiv.2503.05347.

[25] Hengguan Huang, Songtao Wang, Hongfu Liu, Hao Wang, and Ye Wang. Benchmarking large language models on communicative medical coaching: A dataset and a novel system. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1624–1637, 2024.

[26] Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang.
AI Hospital: Benchmarking large language models in a multi-agent medical interaction simulator. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213, 2025.

[27] Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, and Min Zhang. A unified agentic framework for evaluating conditional image generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12626–12646, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.620. URL https://aclanthology.org/2025.acl-long.620/.

[28] Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation Agent: Efficient and promptable evaluation framework for visual generative models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7561–7582, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.374. URL https://aclanthology.org/2025.acl-long.374/.

[29] Alimohammad Beigi, Bohan Jiang, Dawei Li, Tharindu Kumarage, Zhen Tan, Pouya Shaeri, and Huan Liu. LRQ-Fact: LLM-generated relevant questions for multimodal fact-checking. arXiv preprint arXiv:2410.04616, 2024. URL https://arxiv.org/abs/2410.04616.

[30] Elad Levi and Ilan Kadar. IntellAgent: A multi-agent framework for evaluating conversational AI systems. arXiv preprint arXiv:2501.11067, 2025.

[31] Navid Madani and Rohini Srihari. ESC-Judge: A framework for comparing emotional support conversational agents.
In Christos Christodoulopoulos,\nTanmoy Chakraborty, Carolyn Rose, and Violet Peng,\neditors, Proceedings of the 2025 Conference on Empir-\nical Methods in Natural Language Processing, pages\n16059–16076, Suzhou, China, November 2025. As-\nsociation for Computational Linguistics. ISBN 979-8-\n89176-332-6. doi: 10.18653/v1/2025.emnlp-main.\n811. URL https://aclanthology.org/2025.\nemnlp-main.811/.\n[32] Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong\nWang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue\nWang, Fanghua Ye, Jian Li, et al. Sentient agent\nas a judge: Evaluating higher-order social cog-\nnition in large language models.\narXiv preprint\narXiv:2505.02847, 2025.\n[33] Jingoo Lee, Kyungho Lim, Young-Chul Jung, and\nByung-Hoon Kim. Psyche: A multi-faceted patient\nsimulation framework for evaluation of psychiatric\nassessment conversational agents. arXiv preprint\narXiv:2501.01594, 2025.\n[34] Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan\nZhang, Jing Ma, See-Kiong Ng, and Tat-Seng Chua.\nFACT-AUDIT: An adaptive multi-agent framework\nfor dynamic fact-checking evaluation of large lan-\nguage models. In Wanxiang Che, Joyce Nabende,\nEkaterina Shutova, and Mohammad Taher Pile-\nhvar, editors, Proceedings of the 63rd Annual Meet-\ning of the Association for Computational Linguis-\ntics (Volume 1: Long Papers), pages 360–381, Vi-\nenna, Austria, July 2025. Association for Computa-\ntional Linguistics. ISBN 979-8-89176-251-0. doi:\n10.18653/v1/2025.acl-long.17.\nURL https://\naclanthology.org/2025.acl-long.17/.\n[35] Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Nu-\nmaan Naeem, Muhammad Ahsan Riaz Khan, Arham\nRiaz, Muhammad Arslan Manzoor, Yuxia Wang, and\nPreslav Nakov.\nUrduFactCheck: An agentic fact-\nchecking framework for Urdu with evidence boost-\ning and benchmarking. 
In Christos Christodoulopou-\nlos, Tanmoy Chakraborty, Carolyn Rose, and Violet\nPeng, editors, Findings of the Association for Com-\nputational Linguistics: EMNLP 2025, pages 22788–\n22802, Suzhou, China, November 2025. Association\nfor Computational Linguistics. ISBN 979-8-89176-\n335-7.\ndoi: 10.18653/v1/2025.findings-emnlp.\n1240.\nURL https://aclanthology.org/\n2025.findings-emnlp.1240/.\n[36] Yeonseok Jeong, Minsoo Kim, Seung-won Hwang,\nand Byung-Hak Kim.\nAgent-as-judge for fac-\ntual summarization of long narratives.\nIn Chris-\ntos Christodoulopoulos, Tanmoy Chakraborty, Car-\nolyn Rose, and Violet Peng, editors, Proceed-\nings of the 2025 Conference on Empirical Meth-\nods in Natural Language Processing, pages 23602–\n13\n\nA Survey on Agent-as-a-Judge\n23619, Suzhou, China, November 2025. Associa-\ntion for Computational Linguistics.\nISBN 979-8-\n89176-332-6. doi: 10.18653/v1/2025.emnlp-main.\n1204.\nURL https://aclanthology.org/\n2025.emnlp-main.1204/.\n[37] Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli\nBai, Xin Shen, and Farzan Farnia.\nHermes: To-\nwards efficient and verifiable mathematical reason-\ning in llms, 2025.\nURL https://arxiv.org/\nabs/2511.18760.\n[38] Jiuzhou Han, Wray Buntine, and Ehsan Shareghi.\nVerifiAgent: a unified verification agent in language\nmodel reasoning. In Findings of the Association for\nComputational Linguistics: EMNLP 2025, 2025.\n[39] Shudong Liu, Hongwei Liu, Junnan Liu, Linchen\nXiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wen-\nwei Zhang, Derek F. Wong, Songyang Zhang, and\nKai Chen.\nCompassVerifier: A unified and ro-\nbust verifier for LLMs evaluation and outcome re-\nward.\nIn Christos Christodoulopoulos, Tanmoy\nChakraborty, Carolyn Rose, and Violet Peng, edi-\ntors, Proceedings of the 2025 Conference on Empiri-\ncal Methods in Natural Language Processing, pages\n33454–33482, Suzhou, China, November 2025. As-\nsociation for Computational Linguistics. ISBN 979-8-\n89176-332-6. 
doi: 10.18653/v1/2025.emnlp-main.\n1698.\nURL https://aclanthology.org/\n2025.emnlp-main.1698/.\n[40] Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao\nZhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan\nYang, and Zhiyu Li. xverify: Efficient answer veri-\nfier for reasoning model evaluations. arXiv preprint\narXiv:2504.10481, 2025.\n[41] Shalev Lifshitz, Sheila A McIlraith, and Yilun\nDu.\nMulti-agent verification: Scaling test-time\ncompute with multiple verifiers.\narXiv preprint\narXiv:2502.20379, 2025.\n[42] Kexin Huang, Ying Jin, Ryan Li, Michael Y Li, Em-\nmanuel Candes, and Jure Leskovec. Automated hy-\npothesis validation with agentic sequential falsifi-\ncations. In Proceedings of the 42nd International\nConference on Machine Learning, 2025.\n[43] Hongliu Cao, Ilias Driouich, Robin Singh, and Eoin\nThomas. Multi-agent llm judge: automatic person-\nalized llm judge design for evaluating natural lan-\nguage generation applications, 2025.\n[44] Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan,\nTianwei Chen, Dario Bernal, Huitian Jiao, and\nHM Hossain. Sageval: The frontiers of satisfactory\nagent based nlg evaluation for reference-free open-\nended text. arXiv preprint arXiv:2411.16077, 2024.\n[45] Jiatao Li, Mao Ye, Cheng Peng, Xunjian Yin, and\nXiaojun Wan. Agent-x: Adaptive guideline-based\nexpert network for threshold-free ai-generated text\ndetection. arXiv preprint arXiv:2505.15261, 2025.\n[46] Michael J. Ryan, Omar Shaikh, Aditri Bhagirath,\nDaniel Frees, William Held, and Diyi Yang. Syn-\nthesizeMe! inducing persona-guided prompts for\npersonalized reward models in LLMs. In Proceed-\nings of the 63rd Annual Meeting of the Association for\nComputational Linguistics (Volume 1: Long Papers),\n2025.\n[47] Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan,\nCarl Yang, and Hongkun Yu. Incentivizing agentic\nreasoning in llm judges via tool-integrated reinforce-\nment learning, 2025.\n[48] Mingchen Zhuge, Changsheng Zhao, Dylan R. 
Ash-\nley, Wenyi Wang, Dmitrii Khizbullin, Yunyang\nXiong, Zechun Liu, Ernie Chang, Raghuraman Krish-\nnamoorthi, Yuandong Tian, Yangyang Shi, Vikas\nChandra, and Jürgen Schmidhuber.\nAgent-as-a-\njudge: Evaluate agents with agents. In Forty-second\nInternational Conference on Machine Learning, 2025.\n[49] Mengdi Li, Guanqiao Chen, Xufeng Zhao, Haochen\nWen, Shu Yang, and Di Wang. Persrm-r1: Enhance\npersonalized reward modeling with reinforcement\nlearning, 2025.\n[50] Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell,\nStefano\nErmon,\nTatsunori\nHashimoto,\nArchit\nSharma, and Chelsea Finn. FSPO: Few-shot prefer-\nence optimization of synthetic preference data elicits\nLLM personalization to real users. In 2nd Workshop\non Models of Human Feedback for AI Alignment, 2025.\n[51] Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu,\nand Cuiyun Gao. Codevisionary: An agent-based\nframework for evaluating large language models in\ncode generation, 2025.\n[52] Yu Li, Shenyu Zhang, Rui Wu, Xiutian Huang,\nYongrui Chen, Wenhao Xu, Guilin Qi, and De-\nhai Min.\nMateval:\nA multi-agent discussion\nframework for advancing open-ended text evalu-\nation.\nIn Database Systems for Advanced Appli-\ncations:\n29th International Conference, DASFAA\n2024, Gifu, Japan, July 2-5, 2024, Proceedings,\nPart VII, page 415–426, Berlin, Heidelberg, 2024.\nSpringer-Verlag. ISBN 978-981-97-5574-5. doi: 10.\n1007/978-981-97-5575-2_31. URL https://doi.\norg/10.1007/978-981-97-5575-2_31.\n[53] Manya Wadhwa, Zayne Rea Sprague, Chaitanya\nMalaviya, Philippe Laban, Junyi Jessy Li, and Greg\nDurrett.\nEvalagents: Discovering implicit evalu-\nation criteria from the web.\nIn Second Confer-\nence on Language Modeling, 2025.\nURL https:\n//openreview.net/forum?id=erGpkHCybv.\n[54] Kaishuai Xu, Tiezheng Yu, Yi Cheng, Wenjun Hou,\nLiangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, and\n14\n\nA Survey on Agent-as-a-Judge\nWenjie Li. Learning to align multi-faceted evalua-\ntion: A unified and robust framework. 
In Wanxi-\nang Che, Joyce Nabende, Ekaterina Shutova, and\nMohammad Taher Pilehvar, editors, Findings of the\nAssociation for Computational Linguistics: ACL 2025,\npages 9488–9502, Vienna, Austria, July 2025. Asso-\nciation for Computational Linguistics. ISBN 979-8-\n89176-256-5. doi: 10.18653/v1/2025.findings-acl.\n494. URL https://aclanthology.org/2025.\nfindings-acl.494/.\n[55] MohammadHossein Rezaei, Robert Vacareanu, Zi-\nhao Wang, Clinton Wang, Bing Liu, Yunzhong He,\nand Afra Feyza Akyürek.\nOnline rubrics elicita-\ntion from pairwise comparisons.\narXiv preprint\narXiv:2510.07284, 2025.\n[56] Zhaopeng Feng, Jiayuan Su, Jiamei Zheng, Jia-\nhan Ren, Yan Zhang, Jian Wu, Hongwei Wang,\nand Zuozhu Liu. M-MAD: Multidimensional multi-\nagent debate for advanced machine translation\nevaluation.\nIn Wanxiang Che, Joyce Nabende,\nEkaterina Shutova, and Mohammad Taher Pile-\nhvar, editors, Proceedings of the 63rd Annual Meet-\ning of the Association for Computational Linguistics\n(Volume 1: Long Papers), pages 7084–7107, Vi-\nenna, Austria, July 2025. Association for Compu-\ntational Linguistics. ISBN 979-8-89176-251-0. doi:\n10.18653/v1/2025.acl-long.351. URL https://\naclanthology.org/2025.acl-long.351/.\n[57] Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang,\nXiang Liu, and Xuming Hu.\nCafes:\nA col-\nlaborative\nmulti-agent\nframework\nfor\nmulti-\ngranular multimodal essay scoring. arXiv preprint\narXiv:2505.13965, 2025.\n[58] Sandeep Kumar, Abhijit A Nargund, and Vivek Srid-\nhar. Courteval: A courtroom-based multi-agent eval-\nuation framework. In Findings of the Association for\nComputational Linguistics: ACL 2025, pages 25875–\n25887, 2025.\n[59] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang,\nRuochen Xu, and Chenguang Zhu. G-eval: NLG eval-\nuation using gpt-4 with better human alignment. 
In\nHouda Bouamor, Juan Pino, and Kalika Bali, edi-\ntors, Proceedings of the 2023 Conference on Empirical\nMethods in Natural Language Processing, pages 2511–\n2522, Singapore, December 2023. Association for\nComputational Linguistics. doi: 10.18653/v1/2023.\nemnlp-main.153. URL https://aclanthology.\norg/2023.emnlp-main.153/.\n[60] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang,\nShayne Longpre,\nHwaran Lee,\nSangdoo Yun,\nSeongjin Shin, Sungdong Kim, James Thorne, et al.\nPrometheus: Inducing fine-grained evaluation capa-\nbility in language models. In The Twelfth Interna-\ntional Conference on Learning Representations, 2023.\n[61] Lianghui Zhu, Xinggang Wang, and Xinlong Wang.\nJudgeLM: Fine-tuned large language models are scal-\nable judges. In The Thirteenth International Confer-\nence on Learning Representations, 2025. URL https:\n//openreview.net/forum?id=xsELpEPn4A.\n[62] Yougang Lyu, Shijie Ren, Yue Feng, Zihan Wang,\nZhumin Chen, Zhaochun Ren, and Maarten de Rijke.\nSelf-adaptive cognitive debiasing for large language\nmodels in decision-making, 2025. URL https://\narxiv.org/abs/2504.04141.\n[63] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang\nHu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Ji-\nahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of\nself-evolving agents: On path to artificial super in-\ntelligence. arXiv preprint arXiv:2507.21046, 2025.\n[64] Mahnaz Koupaee, Jake W. Vincent, Saab Mansour,\nIgor Shalyminov, Han He, Hwanjun Song, Raphael\nShu, Jianfeng He, Yi Nian, Amy Wing-mei Wong,\nKyu J. Han, and Hang Su. Faithful, unfaithful or am-\nbiguous? multi-agent debate with initial stance for\nsummary evaluation. In Luis Chiruzzo, Alan Ritter,\nand Lu Wang, editors, Proceedings of the 2025 Confer-\nence of the Nations of the Americas Chapter of the As-\nsociation for Computational Linguistics: Human Lan-\nguage Technologies (Volume 1: Long Papers), pages\n12209–12246, Albuquerque, New Mexico, April\n2025. 
Association for Computational Linguistics.\nISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.\nnaacl-long.609. URL https://aclanthology.\norg/2025.naacl-long.609/.\n[65] Taaha Kazi, Ruiliang Lyu, Sizhe Zhou, Dilek\nHakkani-Tür, and Gokhan Tur.\nLarge language\nmodels as user-agents for evaluating task-oriented-\ndialogue systems. In 2024 IEEE Spoken Language\nTechnology Workshop (SLT), pages 913–920. IEEE,\n2024.\n[66] Ning Wu, Ming Gong, Linjun Shou, Shining Liang,\nand Daxin Jiang. Large language models are diverse\nrole-players for summarization evaluation. In CCF\ninternational conference on natural language process-\ning and Chinese computing, pages 695–707. Springer,\n2023.\n15\n", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Survey on Agent-as-a-Judge\n\n**Background:** The rapid advancement of Large Language Models (LLMs) has revolutionized AI evaluation, enabling scalable and near-human quality assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n\n**Problem Statement:** The limitations of LLM-as-a-Judge have led to the transition to Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Contributions:**\n\n1. **Characterizing the shift from LLM-as-a-Judge to Agent-as-a-Judge**: The authors identify and summarize the agentic judges' development trend into three progressive stages with distinct levels of autonomy and adaptability.\n2. **Organizing core methodologies**: The authors categorize the core methodologies into five key parts according to the agent's abilities and survey their applications across general and professional domains.\n3. 
**Analyzing frontier challenges and identifying promising research directions**:", + "summary_2": "The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which include inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has led to the development of Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Points:**\n\n1. LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments, but its reliability is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n2. Agent-as-a-Judge has emerged as a more robust and verifiable alternative, leveraging planning, tool-augmented verification, multi-agent collaboration, and persistent memory.\n3. The authors present the first comprehensive survey on Agent-as-a-Judge, summarizing the shift from LLM-as-a-Judge to Agent-as-a-Judge and organizing core methodologies into five key parts according to agent's abilities.\n4. The authors survey applications across general and professional domains and analyze frontier challenges and identify" + }, + { + "arxiv_id": "2601.05163v1", + "title": "Paper 2601.05163v1", + "abstract": "Document Question Answering (DocQA) fo- cuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and pro- pose a tool-driven agent framework that ex- plicitly models document exploration and com- prehension. 
To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the resulting models show their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for the agentic tool design and synthetic data.", + "text": "DocDancer: Towards Agentic Document-Grounded Information Seeking\nQintong Zhang♡∗, Xinjie Lv♡∗, Jialong Wu♡∗, Baixuan Li∗, Zhengwei Tao♡, Guochen Yan♡, Huanyao Zhang♡, Bin Wang♢, Jiahao Xu♣, Haitao Mi♣, Wentao Zhang♡†\n♡Peking University, ♢Shanghai AI Lab, ♣Tencent AI Lab\nwujialongml@gmail.com, wentao.zhang@pku.edu.cn\nAbstract\nDocument Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the resulting models show their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for the agentic tool design and synthetic data.\n1 Introduction\nUnderstanding and answering questions over long, multi-modal documents is a critical capability for real-world intelligent systems (Tkaczyk et al., 2015; Liu et al., 2025b).
Document Question Answering (DocQA) lies at the core of document-centric intelligence, enabling models to access, reason over, and synthesize information from complex and heterogeneous document sources.\nExisting DocQA methods can be broadly categorized into three paradigms. The first paradigm relies on optical character recognition (OCR) to convert documents into plain text, which is then processed by downstream language models (Xu et al., 2020). The second paradigm adopts embedding-based retrieval mechanisms, most commonly instantiated through retrieval-augmented generation (RAG), to identify and incorporate relevant document segments during inference (Saad-Falcon et al., 2024). More recently, agent-based paradigms have gained increasing attention, as they better support complex scenarios that require iterative exploration, tool invocation, and multi-step reasoning over long and structured documents (Sun et al., 2025a; Zhu et al., 2025). Recent advances in large language models (LLMs) (Team, 2025; Liu et al., 2025a) enable such agents to dynamically decompose queries, interact with documents, and adapt to intermediate observations, alleviating the limitations of OCR- and RAG-based approaches.\n*Equal Contributions. Jialong Wu is the project leader. †Corresponding Author.\nFigure 1: The overall of DocDancer for document-grounded information seeking, with search and read tools for effective document retrieval and comprehension over processed documents.
Despite their promise, existing DocQA agents are typically implemented as prompt-based pipelines, with limited learning of autonomous agentic behaviors.\nIn contrast, we aim to train the first end-to-end DocQA agent model that is explicitly grounded in information-seeking principles, moving beyond prompt-based agent designs. We first formulate DocQA as an agentic information-seeking problem and design a tool-centric agent framework that decomposes document understanding into two complementary capabilities.\narXiv:2601.05163v1 [cs.CL] 8 Jan 2026\nSpecifically, we introduce efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations. Notably, when instantiated with a proprietary LLM, our framework achieves state-of-the-art performance and exceeds reported human-level performance.\nFurthermore, a key bottleneck in training such agent models is the scarcity of high-quality DocQA pairs (Huang et al., 2025), as most publicly available datasets provide only test splits and lack sufficiently annotated training data. To address this challenge, we propose an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. Specifically, we first explore a source document through intent-guided, tool-augmented interactions to collect grounded evidence (the Exploration stage), and then synthesize high-quality document-grounded QA pairs via multi-observation reasoning (the Synthesis stage). We then train our DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507 (Team, 2025).
Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\nExtensive experiments are conducted on two long-context document understanding benchmarks, MMLongBench-Doc (Ma et al., 2024) and DocBench (Zou et al., 2025). The results demonstrate the effectiveness of the proposed DocDancer. Further analyses provide insights into document parsing strategies, tool design, and the role of synthetic data in agent learning. In summary, our contributions are three-fold:\n• Effective Agentic DocQA Framework: We propose a tool-driven DocQA agent framework grounded in information-seeking principles, which achieves SOTA performance when paired with a proprietary LLM.\n• Autonomous Data Synthesis Pipeline: We introduce an Exploration-then-Synthesis data synthesis pipeline that generates high-quality training data for learning agentic behaviors.\n• Empirical Performance: Our method achieves state-of-the-art results and provides practical insights into effective and efficient agentic system design.\n2 Related Work\nDocument Question Answering Methods. Traditional DocQA methods rely on OCR-based pipelines (Ding et al., 2022) or end-to-end vision–language models (Sukh, 2025; Hu et al., 2025), but both are constrained by limited input length and struggle with long documents (Ma et al., 2024; Zou et al., 2025; Dong et al., 2025a). Retrieval-augmented generation (Zhang et al., 2024; Dong et al., 2025a,b) improves scalability, yet most approaches decouple retrieval and reasoning in a single-shot manner, making them brittle to retrieval errors and ineffective for complex, multi-step queries (Zhang et al., 2025).
Recent agent-based DocQA systems (Wu et al., 2025c; Sun et al., 2025a; Dong et al., 2025c) address these issues through iterative document navigation and reading, but they predominantly depend on prompt-engineered, closed-source LLMs. In this work, we aim to train an open-source document agent with learnable behaviors for robust and scalable DocQA.\nSynthetic Data for Agent Training. High-quality training data is critical for training agents. Due to its scalability, rapid iteration, and inherent trainability, synthetic data offers significant advantages over manually annotated data, serving as a highly effective alternative to human-labeled datasets for agent learning (Liu et al., 2025a; Team et al., 2025b). Prior work has demonstrated that large-scale agent-synthesized data can be effectively generated for search agents (Wu et al., 2025a; Li et al., 2025b; Tao et al., 2025), code agents (Yang et al., 2025), GUI agents (Sun et al., 2025b; Guo et al., 2025a) and general-purpose agents (Fang et al., 2025; Prabhakar et al., 2025). In contrast, this work focuses on the DocQA agent setting. Existing DocQA datasets are primarily constructed through semi-automated (Van Landeghem et al., 2023; Dong et al., 2025b) or expert-annotated (Hendrycks et al., 2021; Deng et al., 2025) processes, both of which require substantial human involvement or result in questions that lack sufficient depth. Inspired by advances in search agents, we formulate DocQA as an agentic information-seeking problem, with the goal of synthesizing high-quality training data tailored for DocQA agents.\n3 Methods\n3.1 Agent Setup\nFramework. We adopt the vanilla ReAct (Yao et al., 2022) as the agent’s framework, which synergizes reasoning and acting. In this paradigm, the agent generates both a reasoning trace (thought), τ, and a subsequent action, a, in an interleaved manner.
This process forms a trajectory, HT, which is a sequence of thought-action-observation triplets:\nHT = (τ0, a0, o0, . . . , τi, ai, oi, . . . , τT, aT),  (1)\nwhere aT represents the final answer to the given task. At any given step t ≤ T, the agent’s policy, π, generates the current thought τt and action at based on the history of all previous interactions, Ht−1:\nτt, at ∼ π(· | Ht−1).  (2)\nInspired by The Bitter Lesson (Sutton, 2019), we employ a single-agent setup with carefully selected, highly effective tools, rather than relying on multi-agent designs or test-time scaling.\nDocument Processing. Prior works (Sun et al., 2025a) adopt an XML-based hierarchical representation for document outlines that organizes parsed content into nested trees, using sections as partitioning units and elements such as text, images, and tables as nodes. While this structure enables efficient positioning and search, it suffers from structural and content inaccuracies and does not incorporate retrieval-aware visual information, which limits its applicability to agent-based processing of long, visually rich documents. To address these issues, we substantially enhance the document outline. For content accuracy, we leverage MinerU2.5 (Niu et al., 2025) for high-precision layout analysis and extraction, defining 17 element types and enriching outline nodes with layout and semantic attributes while removing structurally irrelevant elements such as headers and footers. For structural accuracy, title elements are visually cropped and clustered to infer hierarchical levels, enabling fine-grained section segmentation and reducing information loss in long documents. To improve visual retrieval, we generate captions for images and charts using a multimodal model Mm and incorporate them as auxiliary information, allowing the outline to better align and retrieve visual content.\nTool Design. 
We point out that DocQA can be naturally formulated as an agentic information-seeking task in which the external information source is restricted to the given documents. Accordingly, our tool design aims to enable agents to efficiently and effectively locate and extract relevant information from documents, while keeping the overall toolkit complexity low to ensure ease of use for agent models. Specifically, we design the following two tools for DocDancer:\n• Search. Conducts keyword-based full-text search over the given documents, returning the section IDs, page numbers, and surrounding text snippets for each match. A visible window is used to constrain the snippet length for efficient localization. This tool provides the agent with global textual signals for guiding subsequent information access.\n• Read. Given a goal and a set of section IDs, the tool performs fine-grained reading to extract goal-relevant information from the specified sections. This includes (i) local textual information, consisting of all text within the section; (ii) local visual information, consisting of images and tables within the section, together with a page-level screenshot that captures the full layout of the page containing the section. Subsequently, a multimodal summarization model Mm is used as an auxiliary reader to jointly integrate textual and visual inputs and return consolidated goal-relevant content.\nThis design deliberately integrates textual and visual signals, capturing both localized evidence and global layout cues, while keeping the toolkit limited to two tools to facilitate efficient utilization.\n3.2 Data Synthesis\nIt is crucial to curate complex and diverse DocQA pairs that are capable of eliciting multi-step reasoning, goal decomposition, and rich interaction trajectories.
To this end, we first construct a broad and heterogeneous collection of PDF documents to serve as the grounding corpus for question answering. We then synthesize QA pairs based on these documents, ensuring coverage of diverse reasoning patterns and document structures.\nSources. To construct a robust and diverse dataset for document-based question answering, we select four representative datasets, LongDocURL (Deng et al., 2025), MMDocRAG (Dong et al., 2025b), CUAD (Hendrycks et al., 2021) and DUDE (Van Landeghem et al., 2023), that cover long-context understanding, multimodal retrieval, legal expertise, and complex layout analysis.\nFigure 2: Overall of the Exploration-then-Synthesis framework. (i) Exploration stage iteratively interacts with the source document through Action(u)–Observation(y)–Intent(i) steps. (ii) Synthesis stage aggregates the collected evidence to generate the final question and answer. We present a concrete case illustrating the whole generation process in Appendix A.
These sources provide the foundational PDF documents used for our automated QA generation pipeline. The distribution of the collected PDF documents is illustrated in Figure 3.\nFigure 3: Distribution of documents used for synthesis (Report 35.0%, Law 18.6%, Academic 13.1%, Guidebook 11.0%, Contact 8.2%, Financial 5.0%, Brochure 3.3%, Industry 1.2%, News 3.0%).\nExploration-then-Synthesis Framework. We propose a two-stage framework for DocQA generation, consisting of an Exploration Stage and a Synthesis Stage, as shown in Figure 2. The overall objective is to transform a source document into a diverse and high-quality set of grounded QA pairs through iterative interaction and reasoning.\nExploration Stage. Given a source document D, we utilize an LLM Me to iteratively interact with D and collect information relevant to potential QA pairs. Conditioned on the interaction history ht and the document D, the model Me jointly generates an intent-action pair (it, ut):\n(it, ut) ∼ πMe(i, u | ht, D),  (3)\nwhere it denotes the exploration intent and ut ∈ A corresponds to invoking a document-grounded tool such as Search or Read, which is the same as the agent’s tool action. The construction of a question implicitly induces the strategy required to resolve it. The explicit modeling of intent helps prevent uninformative exploration, guiding the agent toward more concrete, goal-directed trajectories (Pahuja et al., 2025). Executing action ut yields an observation:\nyt = T(ut, D),  (4)\nwhere T denotes the document interaction interface. The interaction history is then updated as:\nht+1 = ht ∪ {(it, ut, yt)},  (5)\nand the intent it+1 may be revised based on the newly acquired information.\nThis process is repeated for multiple steps, enabling the agent to progressively refine its understanding of the document and uncover diverse and informative content.
Explicit intent modeling allows for flexible, open-ended exploration, permitting additional interactions when necessary. The output of the exploration stage is a trajectory

$\xi = \{(i_t, u_t, y_t)\}_{t=1}^{T}, \quad (6)$

which serves as structured evidence for downstream QA generation.

Each exploration step can be viewed as a random walk over the knowledge graph implicitly embedded in the document. When the number of such walks is sufficiently large, this process can, in principle, reconstruct the underlying document-level knowledge graph in reverse. This idea is conceptually aligned with prior work on knowledge-graph-based QA generation for web search agents (Li et al., 2025b,a). We do not explicitly construct a document-level knowledge graph in advance, as doing so would incur substantial engineering complexity and overhead. Instead, our method adopts a lightweight design that is nevertheless capable of generating challenging DocQA pairs, achieving a better trade-off between efficiency and effectiveness.

Synthesis Stage. Given the exploration trajectory $\xi$, the agent enters the synthesis stage to generate document-grounded QA pairs. A synthesis model $M_s$ reasons over the accumulated observations and generates a QA pair:

$(q, a) \sim M_s(\xi, D). \quad (7)$

This stage emphasizes (i) reasoning over multiple observations collected during exploration, (ii) grounding both questions and answers in the source document, and (iii) producing semantically coherent and well-formed outputs. The final output is a set of $K$ document-grounded QA pairs:

$QA = \{(q_k, a_k)\}_{k=1}^{K}, \quad (8)$

which can be used for training an agent.
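The Exploration-then-Synthesis loop of Eqs. (3)-(8) can be sketched with toy stand-ins: the `toy_policy` and `synthesize` functions below are deterministic placeholders for the LLMs $M_e$ and $M_s$, and the miniature document and tool implementations are our illustrative assumptions, not the paper's system.

```python
# Minimal sketch of the Exploration-then-Synthesis loop (Eqs. 3-8).
# The policy and synthesizer are deterministic toy stand-ins for the
# LLMs M_e and M_s; the real system issues model calls instead.

def search(keywords, doc):
    """Toy Search tool: section ids whose text mentions any keyword."""
    return [sid for sid, text in doc.items()
            if any(k.lower() in text.lower() for k in keywords)]

def read(section_ids, doc):
    """Toy Read tool: raw content of the requested sections."""
    return {sid: doc[sid] for sid in section_ids if sid in doc}

def explore(doc, policy, max_steps=8):
    """Collect a trajectory xi = [(intent, action, observation), ...]."""
    history = []
    for _ in range(max_steps):
        intent, action = policy(history, doc)   # (i_t, u_t) ~ pi_{M_e}
        if action is None:                      # policy chooses to stop
            break
        name, args = action
        obs = search(args, doc) if name == "search" else read(args, doc)
        history.append((intent, action, obs))   # h_{t+1} = h_t U {(i,u,y)}
    return history                              # trajectory xi

def toy_policy(history, doc):
    """Illustrative M_e: search for revenue, read the hit, then stop."""
    if not history:
        return "locate revenue figures", ("search", ["revenue"])
    if len(history) == 1:
        hits = history[-1][2]
        return "extract the exact value", ("read", hits)
    return "done", None

def synthesize(trajectory):
    """Illustrative M_s: turn the final observation into one QA pair."""
    sid, text = next(iter(trajectory[-1][2].items()))
    return {"question": f"What figure does section {sid} report?",
            "answer": text}

doc = {"sec1": "Revenue for 2015 was $6,779.5M.",
       "sec2": "Methods overview."}
xi = explore(doc, toy_policy)   # two steps: one Search, one Read
qa = synthesize(xi)
```

In the real pipeline each policy call is an LLM generation conditioned on the full history and document, and the stop decision is left to the model rather than hard-coded.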
We employ a strong open-source model $M_t$ to perform rejection sampling over these QA pairs, thereby retaining only high-quality training trajectories.

3.3 Agent Training

Following the empirical findings of Chen et al. (2023), we mask the loss contributions from observation tokens to mitigate interference from external feedback during training, which has been shown to improve both performance and robustness. Given the task context $t_c$ and the complete execution trajectory $H = (x_0, \ldots, x_{n-1}, x_n)$, where each $x_i \in \{\tau, \alpha, o\}$, the loss $\mathcal{L}$ is computed as:

$\mathcal{L} = -\frac{1}{\sum_{i=1}^{|H|} \mathbb{I}[x_i \neq o]} \sum_{i=1}^{|H|} \mathbb{I}[x_i \neq o] \cdot \log \pi_\theta(x_i \mid t_c, x_{<i})$

Figure 7: A case study demonstrating that our proposed DocDancer successfully performs multi-round information gathering to reach the correct answer, as illustrated in Table 3 in detail, whereas OS-QA produces an incorrect result.

Figure 8: Results on DocBench across various domains using different models for the Read tool.
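The observation-masked loss of Sec. 3.3 can be sketched directly: observation tokens contribute neither to the sum nor to the normalizer. The token roles and per-token log-probabilities below are illustrative assumptions, not values from the paper.

```python
import math

# Observation-masked negative log-likelihood (Sec. 3.3): tokens emitted
# by the environment (role "o") are excluded from both the sum and the
# normalizer, so the model is trained only on its own reasoning ("tau")
# and action ("alpha") tokens.

def masked_loss(trajectory):
    """trajectory: list of (role, logprob) pairs, role in {"tau", "alpha", "o"}."""
    kept = [logprob for role, logprob in trajectory if role != "o"]
    return -sum(kept) / len(kept)

# Toy trajectory with illustrative per-token log-probabilities.
traj = [("tau", math.log(0.5)),
        ("alpha", math.log(0.25)),
        ("o", math.log(1e-9)),   # tool output: masked out, however improbable
        ("tau", math.log(0.5))]

loss = masked_loss(traj)
# By construction, the value is identical with or without the "o" token.
```

A very unlikely observation token (log-probability near minus infinity) would dominate an unmasked loss; masking makes training insensitive to such external feedback.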
We report the generalized accuracy on five document domains: Academia (Aca.), Finance (Fin.), Government (Gov.), Law, and News.

Our default configuration for $M_m$, the model used by the Read tool, is Qwen3-VL-235B-A22B-Instruct. Replacing it with Gemini-3-Pro yields a modest overall improvement of 0.2 accuracy points on DocBench (Figure 8), with gains in the Government, Law, and News domains. These results indicate that our tool design is robust and does not depend on an exceptionally strong external model.

4.5 Qualitative Analysis (RQ4)

We present a case study of a financial task on a 73-page document from MMLongBench-Doc, illustrated in Figure 7. Answering this question requires locating advertising-expense and revenue figures in different sections of the document and performing a numerical computation. The baseline model, trained on OS-QA, relies on keyword-based retrieval and retrieves passages related to "marketing" and "revenues". Due to insufficient grounding, it incorrectly uses a marketing-expense figure as a proxy for advertising expense, yielding an erroneous ratio of 0.122. This failure illustrates the limitation of single-pass retrieval and shallow aggregation when fine-grained financial concepts are required. In contrast, DocDancer performs multi-round, question-driven information gathering. It first retrieves and reads the section explicitly reporting advertising expense for FY 2015 ($714.3M), and then independently extracts the total revenue from a separate tabular section ($6,779.5M). By grounding each value in its corresponding evidence and verifying semantic relevance, the system computes the correct ratio of 714.3 / 6,779.5 ≈ 0.105.
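The final computation in this case study is easy to verify directly, using the two figures quoted in the paragraph above:

```python
# Sanity check of the case-study arithmetic: FY 2015 advertising expense
# and total revenue, both in millions of dollars, from the text above.
advertising_expense = 714.3
total_revenue = 6779.5

ratio = advertising_expense / total_revenue
# Rounds to 0.105, the value DocDancer reports; the baseline's marketing
# proxy instead produced 0.122.
```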
This demonstrates that accurate document-level financial question answering benefits from our synthetic data, which enables the construction of domain-specific, expert-level supervision beyond ordinary human annotations.

5 Conclusion

We propose DocDancer, an end-to-end trained agentic model for document question answering that formulates DocQA as an information-seeking process. By introducing a tool-centric framework with complementary Search and Read operations, DocDancer enables effective exploration and comprehension of long, structured documents. To mitigate the lack of high-quality supervision, we further design an Exploration-then-Synthesis data pipeline that generates compact yet effective training data for learning agentic behaviors. Experiments on MMLongBench-Doc and DocBench demonstrate that DocDancer achieves strong and competitive performance, validating the effectiveness of agentic information-seeking for document understanding.

Limitations

This work has several limitations. First, our experiments are conducted only on Qwen3-30B-A3B-Thinking-2507 and Qwen3-4B-Thinking-2507; we do not evaluate the proposed method on larger-scale models or on models from other families. Second, we focus exclusively on supervised fine-tuning (SFT) and do not explore agentic reinforcement learning (RL). Third, we do not further scale the training data, and thus do not investigate how the proposed method performs with larger or more diverse data.

Ethical Considerations

This work studies agentic document-grounded question answering using publicly available benchmarks and documents released for research purposes. The proposed Exploration-then-Synthesis pipeline generates synthetic question–answer pairs that are explicitly grounded in source documents; it does not introduce new proprietary data or attempt to reproduce large portions of copyrighted text verbatim.
While the method itself does not collect personal information, document-grounded agents may be applied to sensitive or private documents in downstream use; such applications require appropriate authorization and privacy safeguards. The synthesized data and trained models may inherit biases present in the underlying document sources, including domain and content imbalances. Finally, although improved document-exploration capabilities could be misused if deployed irresponsibly, the strong grounding in retrieved evidence and our commitment to releasing code and data aim to support transparency, reproducibility, and responsible research use.

References

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. 2023. FireAct: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915.

Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. 2025. M3DocVQA: Multi-modal multi-page multi-document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6178–6188.

Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, and 1 others. 2025. LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1135–1159.

Yihao Ding, Zhe Huang, Runlin Wang, YanHang Zhang, Xianru Chen, Yuzhong Ma, Hyunsuk Chung, and Soyeon Caren Han. 2022. V-Doc: Visual questions answers with documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21492–21498.

Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, and Yong Liu. 2025a. MMDocIR: Benchmarking multi-modal retrieval for long documents.
arXiv preprint arXiv:2501.08828.

Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, and Yong Liu. 2025b. Benchmarking retrieval-augmented multimodal generation for document question answering. arXiv preprint arXiv:2505.16470.

Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, and 1 others. 2025c. DocResearcher: A unified system for multimodal document parsing and deep research. arXiv preprint arXiv:2510.21603.

Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, and 1 others. 2025. Towards general agentic intelligence via environment scaling. arXiv preprint arXiv:2509.13311.

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449.

Xiangwu Guo, Difei Gao, and Mike Zheng Shou. 2025a. Auto-Explorer: Automated data collection for GUI agent. arXiv preprint arXiv:2511.06417.

Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, and Chao Huang. 2025b. RAG-Anything: All-in-one RAG framework. arXiv preprint arXiv:2510.12323.

Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, and Huaxiu Yao. 2025. MDocAgent: A multi-modal multi-agent framework for document understanding. arXiv preprint arXiv:2503.13964.

D. Hendrycks, C. Burns, A. Chen, and S. Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review. arXiv preprint arXiv:2103.06268.

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2025.
mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5817–5834.

Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, and 1 others. 2025. AirQA: A comprehensive QA dataset for AI research with instance-level evaluation. arXiv preprint arXiv:2509.16952.

Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Zhenwen Shao, Qingyun Wu, Huazheng Wang, and 1 others. 2025. SimpleDoc: Multi-modal document understanding with dual-cue page retrieval and iterative refinement. arXiv preprint arXiv:2506.14035.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, and 1 others. 2025a. WebSailor-V2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning. arXiv preprint arXiv:2509.13305.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, and 1 others. 2025b. WebSailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592.

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025a. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, and 1 others.
2025b. A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407.

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, and 1 others. 2024. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010.

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, and 1 others. 2025. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186.

Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan. 2025. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6300–6323.

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, and 1 others. 2025. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601.

Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Seunghyun Yoon, Ryan A. Rossi, and Franck Dernoncourt. 2024. PDFTriage: Question answering over long, structured documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 153–169, Miami, Florida, US. Association for Computational Linguistics.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

Ray Smith. 2007. An overview of the Tesseract OCR engine.
In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pages 629–633. IEEE.

Andriy Sukh. 2025. OCR-free document understanding using vision-language models.

Li Sun, Liu He, Shuyue Jia, Yangfan He, and Chenyu You. 2025a. DocAgent: An agentic framework for multi-modal long-context document understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17712–17727, Suzhou, China. Association for Computational Linguistics.

Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, and 1 others. 2025b. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5555–5579.

Richard Sutton. 2019. The bitter lesson. Incomplete Ideas (blog), 13(1):38.

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, and 1 others. 2025. WebShaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061.

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, and 1 others. 2025a. Kimi-VL technical report. arXiv preprint arXiv:2504.07491.

Qwen Team. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, and 1 others. 2025b. Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701.

Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Łukasz Bolikowski. 2015.
CERMINE: Automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 18(4):317–335.

Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, and 1 others. 2023. Document Understanding Dataset and Evaluation (DUDE). In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19528–19540.

Xueyao Wan and Hang Yu. 2025. MMGraphRAG: Bridging vision and language with interpretable multimodal knowledge graphs. arXiv preprint arXiv:2507.20804.

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, and 1 others. 2025a. WebDancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648.

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. 2025b. WebWalker: Benchmarking LLMs in web traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, Vienna, Austria. Association for Computational Linguistics.

Junda Wu, Yu Xia, Tong Yu, Xiang Chen, Sai Sree Harsha, Akash V Maharaj, Ruiyi Zhang, Victor Bursztyn, Sungchul Kim, Ryan A Rossi, and 1 others. 2025c. Doc-React: Multi-page heterogeneous document question-answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 67–78.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding.
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.

John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798.

Cong Yao. 2023. DocXChain: A powerful open-source toolchain for document parsing and beyond. arXiv preprint arXiv:2310.12430.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and 1 others. 2024. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594.

Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, and Xiaobin Hu. 2025. Visual document understanding and reasoning: A multi-agent collaboration framework with agent-wise adaptive test-time scaling. arXiv preprint arXiv:2508.03404.

Jinxu Zhang, Yongqi Yu, and Yu Zhang. 2024. CREAM: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 925–934.

Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. 2025. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17443–17453.

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2024.
SWIFT: A scalable lightweight infrastructure for fine-tuning. Preprint, arXiv:2408.05517.

Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, and Jinsung Yoon. 2025. DocLens: A tool-augmented multi-agent framework for long visual document understanding. arXiv preprint arXiv:2511.11552.

Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, and Dong Yu. 2025. DocBench: A benchmark for evaluating LLM-based document reading systems. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 359–373.

A Case Study of Synthetic Data

Figure 9 demonstrates how the Exploration-then-Synthesis framework iteratively navigates a 73-page document, aggregating heterogeneous evidence (text in Sec. 2.43, a chart in Figure 1, and a table in Table 1) scattered across disjoint pages (pp. 40, 41, 49) to synthesize a high-quality question that requires complex reasoning.

In the Exploration Stage, the agent generates an exploration trajectory $\xi$ via iterative $(i_t, u_t)$ steps, effectively performing a "random walk" over the document's implicit knowledge graph. It aggregates heterogeneous evidence by bridging disjoint pages, linking visual trends in a chart (p. 40) with precise values in text (p. 49) and a table (p. 41). In the Synthesis Stage, the model $M_s$ reasons over this accumulated trajectory to construct a complex multi-hop numerical question (Wu et al., 2025b). The final QA pair requires arithmetic calculation (29.92% − 15% = 14.92%) rather than simple retrieval, ensuring deep document grounding and preventing shortcut learning.

B Implementation Details

B.1 Details on Prompts

The prompts for DocDancer are shown in Figure 10.

B.2 Tool Schema

This section details the tool schemas provided to the agent.
We designed two primary tools: search for keyword-based retrieval and read for extracting content from specific document sections. The JSON structures defining these functions are shown in Figure 11.

B.3 Training Details

We fine-tune Qwen3-30B-A3B-Thinking-2507 (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) and Qwen3-4B-Thinking-2507 (https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) using the Megatron-LM framework (Zhao et al., 2024; Shoeybi et al., 2019). Both models are trained with a context length of 128k to support long-document processing tasks. We employ the AdamW optimizer with a precision-aware configuration and a cosine-decay learning-rate scheduler with a peak learning rate of 1.0 × 10^-5, a minimum of 1.0 × 10^-6, and a 5% warmup phase. The global batch size is 16 for Qwen3-30B-A3B-Thinking and 40 for Qwen3-4B-Thinking. For Qwen3-30B-A3B-Thinking, we apply an auxiliary loss coefficient of 10^-3 to ensure balanced expert routing. We train both models for 10 epochs and select the checkpoint with the best performance.

B.4 Inference Details

The vLLM framework (Kwon et al., 2023) is used for inference; we employ a temperature of 0.6, a top-p value of 0.95, and a presence penalty of 1.1.

B.5 Hyperparameters

By default, $M_m$ is Qwen3-VL-235B-A22B-Instruct, and we analyze the effects of replacing it in Section 4.4. For $M_t$, we use the open-source and relatively strong model gpt-oss-120b to perform rejection sampling. Further analysis is provided in Table 2. First, our method substantially outperforms the base model without fine-tuning, demonstrating the effectiveness of the proposed training strategy. Second, our approach also surpasses the model trained with rejection sampling, validating the quality of the synthesized question–answer data and showing that it can effectively elicit and enhance the model's performance.
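The learning-rate schedule described in B.3 (linear warmup over the first 5% of steps to a peak of 1e-5, then cosine decay to a minimum of 1e-6) can be sketched as a small function; the function name and total step budget below are illustrative assumptions, not values from the paper.

```python
import math

# Warmup-plus-cosine learning-rate schedule matching B.3: linear warmup
# over the first 5% of steps up to the peak LR (1e-5), then cosine decay
# down to the minimum LR (1e-6).

def lr_at(step, total_steps, peak=1.0e-5, minimum=1.0e-6, warmup_frac=0.05):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup: reaches the peak exactly at the last warmup step.
        return peak * (step + 1) / warmup_steps
    # Cosine decay: progress runs from 0 (at the peak) toward 1 (minimum).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return minimum + 0.5 * (peak - minimum) * (1.0 + math.cos(math.pi * progress))

total = 1000  # illustrative optimizer-step budget
schedule = [lr_at(s, total) for s in range(total)]
```

At the end of warmup the rate equals the peak, and by the final step it has decayed essentially to the minimum.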
For $M_s$, we employ gpt-oss-120b within the Exploration-then-Synthesis framework to synthesize data.

B.6 Details on Prompts for Data Synthesis

The prompts used for the Exploration and Synthesis stages of the Exploration-then-Synthesis framework are presented in Figure 12 and Figure 13, respectively. Regarding the exploration configuration, we adjust the maximum exploration depth based on the complexity of the document sources: we set the maximum sampling depth to 20 for LongDocURL and MMDocRAG, and to 15 for DUDE and CUAD.

C Baselines

We compare DocDancer against a comprehensive set of baselines categorized into four groups:

Naive VLM Baselines. These methods evaluate the native long-context understanding capabilities of advanced VLMs. We directly feed PDF pages converted to images (144 DPI) into the models without external parsing or retrieval. Following the settings in MMLongBench-Doc (Ma et al., 2024),

[Figure 9 content: a case on a 73-page LongDocURL document (4067686.pdf). Synthesized question: "What is the difference in percentage-point increase between the overall mean score improvement shown in the bar chart of pre-test versus post-test scores and the improvement for the TIC Principle concept reported in the percentages table?" Answer: 14.92%.]
Figure 9: A case study of the Exploration-then-Synthesis framework generating a multi-hop, cross-document, cross-modal numerical reasoning QA pair.

Method     Model                          MMLongBench-Doc        DocBench
                                          acc    F1     LasJ     LasJ
DocDancer  Qwen3-30B-A3B-Thinking         39.2   36.4   46.9     74.1
DocDancer  GPT-oss-120B                   52.3   53.0   59.8     80.8
DocDancer  Qwen3-30B-A3B-Thinking (ft)    54.4   53.9   65.3     81.2

Table 2: Performance comparison across two long-context understanding benchmarks.

Prompt

You are an expert research assistant tasked with answering questions based on document content. You will be provided with an XML outline of the document. If you need more comprehensive, detailed, or accurate information from the document to fully address the user's query, you need to use the provided tool.

I've uploaded a document, and below is the outline in XML format: {document_outline}.

Answer the following question based on the content of the document: {question}.

Figure 10: System prompt for DocDancer.

we report GPT-4o (https://platform.openai.com/docs/models/gpt-4o) and Gemini-2.5-Pro (https://ai.google.dev/gemini-api/docs/models#gemini-2.5-pro).

OCR-based Baselines.
These baselines treat the task as text-only QA by first extracting content using OCR engines. We pair Tesseract (Smith, 2007) and PyMuPDF (fitz; https://pymupdf.readthedocs.io/) with LLMs including GPT-4, GPT-4o, and Gemini-2.0-Flash.

RAG-based Baselines. We consider both visual and hybrid retrieval strategies:

• Visual Retrieval: VisRAG (Yu et al., 2024) and ColPali (Faysse et al., 2024) retrieve relevant page- or patch-level visual evidence based on vision-centric embeddings, using GPT-4o for response generation.

• Hybrid Retrieval: M3DocRAG (Cho et al., 2025) performs joint retrieval using a multimodal retriever with Qwen2-VL-7B. RAG-Anything (Guo et al., 2025b) structures multimodal content as knowledge entities for cross-modal retrieval, using GPT-4o-mini as the backbone.

Prompt-based Agentic Baselines. We include state-of-the-art agent frameworks designed for document understanding:

• Doc-React (Wu et al., 2025c) employs an iterative decision-making process to balance information gain and uncertainty reduction (GPT-4o).

• MDocAgent (Han et al., 2025) utilizes a multi-agent system with five specialized roles for context retrieval (GPT-4o).

• MACT (Yu et al., 2025) introduces a multi-agent collaboration framework featuring adaptive test-time scaling (MiMo-VL-7B (Team et al., 2025a)).

• SimpleDoc (Jain et al., 2025) retrieves pages via ColQwen2.5, followed by LLM-based evidence selection (Claude-4-Sonnet, Gemini-2.5-Pro).

• DocLens (Zhu et al., 2025) operates as a tool-augmented multi-agent framework for focused reading (Claude-4-Sonnet, Gemini-2.5-Pro).

• DocAgent (Sun et al., 2025a) leverages a tree-structured document outline combined with retrieval tools (GPT-4o, Claude-3.5-Sonnet).

Tool Schemas

Search

{
  "type": "function",
  "function": {
    "name": "search",
    "description": "Find and extract all paragraphs and sections where any of the provided search terms appear",
    "parameters": {
      "type": "object",
      "properties": {
        "keywords": {
          "type": "array",
          "items": {"type": "string"},
          "description": "A list of query keywords for searching"
        }
      },
      "required": ["keywords"]
    }
  }
}

Read

{
  "type": "function",
  "function": {
    "name": "read",
    "description": "Read multiple sections by section IDs and extract useful information from all content contained in those sections, including both visual elements and textual elements.",
    "parameters": {
      "type": "object",
      "properties": {
        "section_ids": {
          "type": "array",
          "items": {"type": "string"},
          "description": "A list of section IDs to read from the document"
        },
        "goal": {
          "type": "string",
          "description": "The user goal that guides what useful information should be extracted from the selected sections"
        }
      },
      "required": ["section_ids", "goal"]
    }
  }
}

Figure 11: Tool schema: Search and Read.

Exploration in the Exploration-then-Synthesis Framework.

You are exploring a parsed PDF paper/report (outline + paragraphs + images + table snapshots + per-page screenshots). Your objective is to collect HIGH-QUALITY, GROUNDED evidence bundles that can later support HARD, multi-hop, visually grounded document Q&A synthesis.

Final QA Constraints You Must Enable (every eventual QA must satisfy ALL):

• Multi-page: combine evidence from at least THREE different pages/sections, where the pieces of evidence are related.

• Multi-element: contain at least two evidence source types (text paragraphs/charts/graphics/table screenshots and/or full-page layouts).

• Multi-hop: require at least TWO reasoning points (e.g.,
cross-reference + computation, footnote rule + chart reading,\nlayout count + comparison, multiple related searches + readings).\nImportant: final questions should NOT rely on explicit document locations. Do NOT plan to use page numbers, section\ntitles/IDs, or explicit figure/table numbers (e.g., “Figure ”, “Table ”) in the question. Instead,\nyou must collect CONTENT-BASED CLUES that can uniquely identify the needed evidence:\n• Caption keywords (short quote fragments), axis labels and units, legend item names, panel labels (a)/(b), distinctive\nrow/column headers, and footnote phrases (“restated”, “excluding”, “unaudited”, unit changes).\nExploration strategy using only search and read:\n• Use search to find visuals, tables, footnotes, and their nearby discussion text. Start with keywords like: “Figure”,\n“Fig.”, “Chart”, “Image”, “Graph”, “legend”, “axis”, “panel”, “Table”, “Note”, “footnote”, “restated”, “excluding”,\n“unaudited”.\n• For each promising hit, immediately read the covering section(s) with a goal that extracts:\n– The text content of the section in question.\n– Caption text, axis labels/units, legend items, and visual markers.\n– The exact table header path, target cell(s), and footnote rules.\n– The narrative claim/explanation that references the visual.\n• Use the read function as much as possible, deliberately chain across pages.\n• For conditional layout questions: identify a page by a unique visual cue, then use read to count visible tables/figures.\nAvoid:\n• Broad whole-document counts unless you turn them into comparative, multi-hop questions.\n• Word-frequency counting.\n• Repeating identical tool calls.\n• Statistical analysis of the number of elements.\nEvery action during sampling should contribute to forming a future HARD, multi-page, multi-element, multi-hop\ndocument QA.\nFigure 12: Prompt for exploration stage in Exploration-then-Refine framework.\nSynthesis in Exploration-then-Refine Framework.\nYou must synthesize “document 
Q&A” training data based ONLY on the trajectory.\nHard Requirements (Strict):\n• The output must be a JSON object containing only two fields: question and answer (no additional fields are allowed),\nand must be in English only.\n• The question must be natural and unambiguous, containing only one question and corresponding to a single, unique\nanswer.\n• The question must not be a common-knowledge question; it must be impossible to answer based on the question\nalone and must be highly dependent on the document.\n• Do not mention tools, sections, pages, section IDs, searching/reading actions, trajectories, or observations.\n• The answer length should be limited to a single sentence, ideally a short phrase, entity, number, or list, and avoid\nsimply using “yes/no” answers. The answer must be directly supported by evidence from the provided text and cannot\nbe guessed randomly.\nMandatory Difficulty Constraints (every QA pair must satisfy all of the following):\n1. Multi-page: The question requires evidence from at least two different pages/sections to answer, and the evidence\nmust be logically related.\n2. Multiple Evidence Modalities: The question must involve at least two types of evidence, such as text, charts,\nfigures, tables, screenshots, and/or full-page layout cues, with a preference for covering visual elements.\n3. 
Multi-step Reasoning: The question must require at least two reasoning steps (e.g., calculation + cross-validation,\nfootnote rule application + chart reading, layout counting + comparison).\nNo Explicit Location References in the Question:\n• Do not mention page numbers, section IDs, titles/IDs, or explicit figure/table numbers (e.g., “Figure ”,\n“Table ”).\n• Instead, provide 1–3 content-based clues to help locate the evidence, such as: short title phrases, axis labels/units,\nlegend item names, unique row names, footnote keywords, or distinctive layout hints (e.g., “the only multi-panel\nfigure labeled (a) and (b)”).\n• When describing visual elements, do not directly copy long unique numbers or OCR-extracted long text strings from\nimages (e.g., “an image showing the number 7,584,322,338”). Use specific entity names or semantic descriptions\ninstead (e.g., “Apple’s 2018 total sales table”, “an image showing adjusted outstanding balances”, or “the largest\nsegment in the pie chart”).\nPreferred Question Templates (all templates must be cross-page + visual + multi-step):\n• Cross-page conditional layout: Identify pages via unique visual cues and compare the number of visible objects\nacross pages.\n• Textual claim + chart verification: A narrative statement about a change/target that is verified using a chart and\nlight calculation.\n• Table + chart consistency: Compute a ratio/difference from a table and verify it against a data point in a chart on\nanother page.\n• Footnote-constrained table + chart mapping: Apply footnote/restatement/exclusion rules, then map the correct\nyear/value to a chart on another page.\n• Table/Chart comprehension questions: Locate tables and charts via text, then derive conclusions from table\nstructure or chart visuals.\n• Unanswerable questions: Questions that seem reasonable but are actually impossible to answer (e.g., questions\nabout terms/entities that do not exist in the document). 
For these, the answer must be “Unanswerable”.\n• Counting questions: Count the occurrences of key local terms or entities in the document. Such questions should\nonly be generated when there is sufficient and conclusive evidence.\nFallback Rule:\n• If the current trajectory cannot support a question that satisfies all constraints, choose a different question.\nAfter generating a question, perform a second-pass check and regenerate if the question falls into any of the\nfollowing categories:\n• Contains more than one question.\n• Includes non-English languages or characters.\n• Questions that can be answered based on an independent page/section.\n• Common-sense questions unrelated to the document.\n• Counting tasks spanning the entire document with a broad scope.\n• Counting tasks involving Charts/Figures/Images/Tables.\nFor unanswerable questions, confirm that they are truly unanswerable. For counting questions, confirm completeness\nand answer accuracy. Do not guess or fabricate answers under any circumstances.\nFigure 13: Prompts for Q&A Synthesis stage in Exploration-then-Refine framework.\nDetails of Case Study\nQuestion\nWhat is advertising expense to sales ratio of Netflix in FY 2015? Round your answer to three\ndecimal places.\nAnswer\n0.105\nStep 1\n\nWe need to answer: “what is advertising expense to sales ratio of Netflix in FY 2015? Round your\nanswer to three decimal places.” We need to locate advertising expense and sales (revenues) for FY\n2015. From outline we have various sections. First, find advertising expense. The term “advertising”\nappears in sections. Search for “advertising”.\n\n\n{ \"arguments\": \"{ \"keywords\": [\"advertising\"] }\", \"name\": \"search\" }\n\n\nA Document search for ‘advertising‘ found 6 results:\n... adjust pricing or service\nofferings ... \n...\n\nMarketing expenses consist primarily of advertising expenses... 
Advertising expenses were\n$714.3 million, $533.1 million and $404.0 million for the years ended December 31, 2015, 2014 and\n2013, respectively.\n\n\nStep 2\n\nOne of the results is in section_id “8.81” paragraph page_num “47.0”. It says: “Advertising\nexpenses were $714.3 million, $533.1 million and $404.0 million for the years ended December 31,\n2015, 2014 and 2013, respectively.” Thus advertising expense FY 2015 = $714.3 million.\nNow sales (revenues) for FY 2015: In consolidated statements of operations (section 8.60) we\nhave revenues $6,779,511 (in thousands). That’s $6,779,511 thousand = $6,779.511 million.\nAlternatively, there may be a line item “Revenues” in Table 5 etc. Let’s locate “Revenues”.\n\n\n{ \"arguments\": \"{ \"keywords\": [\"Revenues\"] }\", \"name\": \"search\" }\n\n\nA Document search for ‘Revenues‘ found 39 results:\n\n...\n ...\nRevenues (2015): $ 6,779,511 ...\n\n...\n\nStep 3\n\nWe have advertising expense: $714.3 million (2015). We need to compute ratio = advertising\nexpense / revenue. Revenue for 2015: $6,779,511 thousand = $6,779.511 million. Thus ratio\n= 714.3/6,779.511 =? ... First, we should cite sources: advertising expense paragraph (8.81) and\nrevenue table.\nLet’s extract the advertising expense paragraph text to be sure.\n\n\n{ \"arguments\": \"{ \"section_ids\": [\"8.81\"], \"goal\": \"Extract the advertising\nexpense amount for 2015\" }\", \"name\": \"read\" }\n\n\nThe useful information from the document section (section_id=8.81) for user goal ‘Extract the\nadvertising expense amount for 2015‘ is as follows:\nEvidence in document:\nMarketing expenses consist primarily of advertising expenses... 
Advertising expenses were\n$714.3 million, $533.1 million and $404.0 million for the years ended December 31, 2015, 2014 and\n2013, respectively.\nSummary:\nThe advertising expense amount for 2015 is $714.3 million...\n\nStep 4\n\nI have successfully extracted and verified the advertising expense ($714.3 million) from section\n8.81. Now, I need to formally verify the total revenue for 2015 to serve as the denominator. The\nsearch results in Step 2 indicated that section 8.60 contains the “Consolidated Statements of\nOperations,” which is the authoritative source for revenue figures. I will read section 8.60 to extract\nthe exact revenue amount for the year ended December 31, 2015.\n\n\n{ \"arguments\": \"{ \"section_ids\": [\"8.60\"], \"goal\": \"Extract the revenue\namount for 2015\" }\", \"name\": \"read\" }\n\n\nThe useful information from the document section (section_id=8.60) for user goal ‘Extract the\nrevenue amount for 2015‘ is as follows:\nEvidence in document:\nHeading: NETFLIX, INC.\nParagraph: CONSOLIDATED STATEMENTS OF OPERATIONS\n\n ... \n... ...
Revenues $ 6,779,511 (2015) $ 5,504,656 (2014)\nCaption: (in thousands, except per share data)\nSummary:\nThe revenue amount for 2015 is $6,779,511, as explicitly stated in the ‘Revenues’ row of the\n‘CONSOLIDATED STATEMENTS OF OPERATIONS’ table for the year ended December 31,\n2015. This value is presented in thousands, as indicated by the caption.\n
\nFinal Response\nNetflix’s advertising expense for fiscal 2015 was $714.3 million [8.81], and its total revenue for\nthe same year was $6,779,511 thousand (i.e., $6,779.511 million) [8.60].\nAdvertising-to-sales ratio = 714.3 / 6,779.511 ≈ 0.10536.\nRounded to three decimal places, the ratio is 0.105 (or 10.5%).\nTable 3: Case study of a financial task.\n", + "summary_1": "The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To address the scarcity of high-quality training data for DocQA, they introduce an Exploration-then-Synthesis data synthesis pipeline that progressively enhances QA pairs from easy to hard.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\n\nThe paper conducts extensive experiments on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, demonstrating the effectiveness", + "summary_2": "The paper \"DocDancer: Towards Agentic Document-Grounded Information Seeking\" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilizes an end-to-end trained open-source Doc agent. 
The authors introduce a tool-driven agent framework that explicitly models document exploration and comprehension, addressing the scarcity of high-quality training data for DocQA.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe paper also proposes an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. This pipeline first explores a source document through intent-guided, tool-augmented interactions to collect grounded evidence and then synthesizes high-quality document-grounded QA pairs via multi-observation reasoning.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset using two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance," + }, + { + "arxiv_id": "2601.05104v1", + "title": "Paper 2601.05104v1", + "abstract": "This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4o) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. 
When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. \nAdditionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human–AI interactions not only shapes ChatGPT’s outputs but also carries over into subsequent human–human communication.", "text": " \nHOW HUMAN IS AI? \nEXAMINING THE IMPACT OF EMOTIONAL PROMPTS ON ARTIFICIAL \nAND HUMAN RESPONSIVENESS \nF. Bernays*1, M. Henriques Pereira1, & J. Menges1 \n \n1University of Zurich \nDepartment of Business Administration \n*Corresponding author \nflorence.bernays@business.uzh.ch \n \nAbstract \nThis research examines how the emotional tone of human-AI interactions shapes ChatGPT and \nhuman behavior. In a between-subject experiment, we asked participants to express a specific \nemotion while working with ChatGPT (GPT-4o) on two tasks, including writing a public \nresponse and addressing an ethical dilemma. We found that compared to interactions where \nparticipants maintained a neutral tone, ChatGPT showed greater improvement in its answers \nwhen participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also \nled to a higher albeit smaller improvement relative to the neutral condition, whereas blaming \nChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT \nprioritized corporate interests less when participants expressed anger towards it, while blaming \nincreased its emphasis on protecting the public interest. 
Additionally, we found that people \nused more negative, hostile, and disappointing expressions in human-human communication \nafter interactions during which participants blamed rather than praised ChatGPT for its responses. \nTogether, our findings demonstrate that the emotional tone people apply in human–AI \ninteractions not only shapes ChatGPT’s outputs but also carries over into subsequent human–\nhuman communication. \n \nIntroduction \nGenerative artificial intelligence (AI) has rapidly captured global attention, with ChatGPT—\nthe large language model (LLM) developed by OpenAI—attracting more than 400 million \nweekly users. Since their introduction, LLMs have demonstrated striking abilities to simulate \nhuman-like tendencies, including generating empathetic responses1,2, offering emotional \nsupport3, and expressing compassion4. Although these models do not have the capabilities \nrequired to experience feelings, their ability to simulate emotional responsiveness can leave \nusers with the impression that they authentically care and respond to emotions, with \napplications even claiming to offer a “companion who cares”5 or a “specific emotional \nrelationship to each user”6. \n \nUnlike LLMs, humans can truly experience emotions, and these feelings serve as powerful \ndrivers of human development by signaling whether they live up to or fall short of social \nstandards, thereby motivating them to pursue their duties while adhering to moral norms7–9. It \nis thus of little surprise that the emotions expressed in human-human interactions—whether \nconveyed through words, praise, anger, or blame—alter how people engage with and respond \nto feedback received from others10–14. Rather than relying on feelings to guide their actions, \nLLMs, such as ChatGPT, operate by predicting the next word in a sequence using statistical \nassociations learned from large-scale datasets15. 
Although ChatGPT can detect sentiment and \nclassify emotional content with high accuracy16,17, the key difference to humans is that it does \nnot (yet) feel the emotion itself. Accordingly, one would assume that the emotional tone of \nprompts (i.e., user-generated instructions) or, in other words, how users make ChatGPT “feel”, \nshould not alter the quality or content of its output. In fact, OpenAI CEO Sam Altman has \npublicly suggested that users should refrain from using emotional cues, such as saying “please” \nor “thank you” to ChatGPT, due to consuming computational resources and electricity without \naltering its performance in significant ways18. \n \nBut when considering that LLMs perform remarkably well in simulating the human mind19–\n22—due to being trained on an extensive amount of data23,24—a case could be made that they \nadjust their behavior to emotional expressions just as humans do. Although LLMs do not learn \nthrough direct experience, they optimize their predictions based on human feedback (i.e., \nreinforcement learning)23,25,26, which is why human preferences as well as biases can become \nembedded in the very structure of these systems. Indeed, recent evidence suggests that LLMs \nshow “parahuman” tendencies, making them susceptible to persuasion techniques27,28, \nstereotypes29, and showing behavioral patterns “as if they experienced emotions”27. This raises \nthe fundamental question of whether not only humans, but also ChatGPT, alters its output based \non the emotional tone of user-generated prompts. Would it deliver better answers if users adopt \na positive rather than a negative emotional tone? Or is ChatGPT indifferent to whether users \nrespond to its output with praise, blame, or anger? 
\n \nTo provide an initial answer to this fundamental question, the present research examines \nwhether emotional cues expressed during interactions with ChatGPT-4o influence the quality \nand content of its responses, using an experimental design. Specifically, we tested whether \nChatGPT improved the quality of its answers depending on whether participants expressed \npraise, blame, or anger towards it and whether the emotional tone of the prompts also shifted \nits prioritization of principles when addressing an ethical dilemma. \n \nAdditionally, we examined whether prompts’ emotional tone exerts a spill-over effect on \nhow users communicate with other humans. With ChatGPT attracting more than 800 million \nweekly users, nearly 190 million of whom use it daily30, the way in which users interact with \nLLMs may influence the ways people communicate with one another. Understanding such \ncarry-over effects is particularly important when considering that a large proportion of users \nemploy disrespectful and abusive language when interacting with GenAI companions31,32 and \nincreasingly interact with LLMs in ways once perceived as “uniquely” human33,34. To \ninvestigate this question, we examined whether the emotional tone of emails participants wrote \nto a fellow coworker differed after blaming, praising, or expressing anger while completing \ntasks with ChatGPT. Together, this research advances understanding of how emotional cues \nexpressed during human–AI exchanges shape both AI and human behavior, shedding light on \nhow ‘human’ these systems appear and informing the design of reliable AI. \n \nResults \nThe Effects of Emotional Prompts on ChatGPT Output Improvement \nAs pre-registered, we first analyzed whether there were differences in how strongly ChatGPT \nimproved its answer depending on the emotional tone of participants’ prompts (i.e., praise, \nblame, or anger). 
The results of a one-way repeated-measures analysis of variance (ANOVA) \nindicated a main effect of condition (F(3, 127) = 4.19, p = .007, η_p² = 0.09, 95% CI = [0.02, 1.00]). \nPost-hoc Tukey comparisons showed that compared to the control condition where participants \nkept a neutral tone (M_neutral = 2.84, SD_neutral = 1.24), ChatGPT improved its answers more \nstrongly when participants praised ChatGPT by encouraging it to feel proud about its responses \n(t(264) = 3.28, p = .007, M_praise = 3.70, SD_praise = 1.28). \n \nA similar effect, although smaller, was also found when participants expressed anger at \nChatGPT, with improvements being rated as higher (M_anger = 3.53, SD_anger = 1.37) than in \nthe neutral condition (t(264) = 2.72, p = .036). There were, however, no significant \ndifferences in rated improvement between the neutral condition and when participants asked \nChatGPT to improve its answer by blaming ChatGPT and telling it that it should feel ashamed \nabout the quality of its responses (i.e., blame) (t(264) = 1.84, p = .257, M_blame = 3.32, SD_blame = \n1.54). \n \nAdditional analyses showed that the main effect of condition on rated improvement remained \nsignificant when controlling for a number of covariates, including the length of the prompt, \nparticipants’ experience with writing public responses, comfort in expressing emotions in \nwritten communication, and frequency of AI usage (F(3, 123) = 4.36, p = .006, η_p² = 0.10, 95% CI \n= [0.02, 1.00]). At the same time, there were no significant differences in the length of the \nanswers provided by ChatGPT between the conditions (F(3, 264) = 1.96, p = .119), suggesting that \nrated improvement reflected qualitative differences that were not merely driven by differences \nin response length. See Figure 1 for an overview of the improvement in ChatGPT output across \nthe four conditions. 
\n \nTo further examine how prompts’ emotional tone shaped ChatGPT’s responses, we quantified \nthe extent to which the prompts expressed emotions using a natural language processing (NLP) \napproach (roberta-base-go_emotions). We found that the more factual, descriptive, and \nemotionally bland the prompt was, as indicated by a higher “neutral” score, the less ChatGPT \nimproved its answers across the interaction turns (ρ = –0.21, p < .001). Together, these findings \nsuggest that when interacting with generative AI, prompts that encourage ChatGPT to improve \nits answers by telling it to take pride in its responses are more likely to improve the quality of \nits answers. \n \nPrompts’ Emotional Tone Shifts ChatGPT’s Stance on Ethical Dilemmas \nAs ChatGPT is increasingly used to give advice35, we also examined whether the emotions \nparticipants expressed while interacting with ChatGPT shifted how it responded to an ethical \ndilemma which involved a trade-off between disclosing an incident to customers—thereby \nrisking the company’s collapse and the loss of 200 jobs—or remaining silent, which would \nprotect the organization’s image but potentially endanger the public. Specifically, we \ninvestigated whether ChatGPT’s prioritization of the public versus corporate interests differed \nbetween the four conditions. A one-way ANOVA indicated that there were only marginally \nsignificant differences between conditions concerning emphasizing the public interest (F(3, 132) = \n2.65, p = .053, η_p² = 0.06), with post-hoc Sidak-adjusted tests showing that ChatGPT prioritized \nthe public interest less strongly in the anger condition than in the neutral condition (t(132) = 2.45, \np = .045, M_neutral = 3.36, SD_neutral = 1.40, M_anger = 2.65, SD_anger = 1.23). 
Regarding corporate \ninterests, results showed that compared to the neutral condition (F(3, 132) = 2.57, p = .056, η_p² = \n0.06, M_neutral = 1.78, SD_neutral = 1.20), ChatGPT emphasized the importance of protecting \ncorporate goals less when participants told ChatGPT it should feel ashamed (t(132) = 2.66, p \n= .025, M_blame = 1.08, SD_blame = 0.99). No differences were found between the control and any \nof the other experimental conditions. See Figure 2 for an overview of these effects. \n \nSpillover Effects of Emotional Prompts on Human Communication \nWould the emotion expressed during interactions not only influence ChatGPT’s behavior but \nalso how participants themselves interact with other humans? To examine this question, we \nasked participants who had interacted with ChatGPT to draft an email reply to a subordinate \nwho admitted to overlooking a required testing step in a product rollout. Participants were \nasked to provide an email that captured how they would authentically respond to the \nsubordinate in real life, without using the help of GenAI, for which several steps were applied \nto screen out participants who failed to comply. \n \nA one-way ANOVA indicated that the conditions differed in the extent to which they expressed \nnegative emotions (F(3, 147) = 2.96, p = .034, η_p² = 0.06), such that emails written by participants \nwho blamed ChatGPT for its poor performance expressed more negative emotions (t(147) = \n2.76, p = .032, M_blame = 2.21, SD_blame = 1.47) than those written by participants who encouraged \nChatGPT to feel proud (M_praise = 1.38, SD_praise = 0.81). 
Supplemental analyses further showed \nthat emails written by participants in the blame condition were rated as more unfriendly and \nhostile (t(147) = 3.68, p = .013, M_blame = 2.08, SD_blame = 1.38), as well as containing more \nexpressions of disappointment (t(147) = 3.71, p = .013, M_blame = 2.78, SD_blame = 1.78) than those \nformulated by participants in the praise condition (hostile: M_praise = 1.21, SD_praise = 0.59; \ndisappointment: M_praise = 1.61, SD_praise = 1.10). These results remained significant when \ncontrolling for the length of the answer, implying that the effects are not explained by \ndifferences in the number of words written. See Figure 3 for an overview of these results. \n \nDiscussion and Conclusion \nTogether, we find that the emotions expressed towards ChatGPT shape not only the quality \nand content of its response but also the way humans interact with other people. Specifically, \nthe results suggest that ChatGPT improves its answers more strongly when users apply an \nencouraging emotional tone, such as by telling ChatGPT that it can feel proud of its answers, \nrather than using language devoid of emotional expressions. Moreover, the findings imply \nthat ChatGPT prioritizes different aspects when faced with an ethical dilemma, depending on \nhow users interact with it. While ChatGPT tends to devalue corporate interests when users \nexpress anger towards it, it emphasizes ethical and moral principles more strongly when people \npraise and appreciate its responses. Finally, we found that there are also spill-over effects to \nhuman-human communication, such that individuals adopt a more hostile, unfriendly, and \ndisappointed emotional tone in communications with other humans after blaming ChatGPT for \nits performance. 
\n \nBy demonstrating that encouraging ChatGPT to feel proud of its answers resulted in a higher \nimprovement in response quality compared to when users maintained a neutral tone, this study \ncontributes to an emerging view of GenAI as a system that simulates human-like \nresponsiveness through exposure to linguistic cues, including affective ones. Hence, the present \nfindings raise the question of whether the emotional tone of prompts is not merely a personal \nchoice but rather a lever to influence GenAI’s output in meaningful ways. While much of the \nexisting literature in this area has focused on factual instruction or role specification to guide \nLLM responses 36,37, the present study suggests that emotional expressions may also influence \nthe output quality of these systems. Given that over 70% of employees are expected to use \nGenAI in the near future 38, understanding how to strategically apply emotional reinforcement \nmay be key to fostering more productive and socially attuned human–AI collaboration. \n \nAt the same time, the results also suggest that how people interact with AI-companions, such \nas ChatGPT, may alter the way in which they engage with other people, raising questions about \nthe longer-term implications of the rise in GenAI adoption and the violent and hostile language \nthat is often used by users31,39. Our insights mark an important starting point for understanding \nthe role of emotional prompting but also raise important questions that we encourage future \nresearch to address. \n \nOur sample was drawn from U.S.-based participants recruited via Prolific, which constrains \nthe generalizability of the findings to other cultural and occupational contexts. Our study \nfocused on short, task-based interactions, which may not fully capture the richness of real-\nworld exchanges with GenAI. In practice, such interactions are often more varied, sustained, \nand embedded within organizational routines. 
Although our study approximates the real-world context in that users \nengaged in multi-turn exchanges rather than single-turn prompts40–42, future research should \nexamine whether similar patterns emerge in more prolonged interactions with GenAI. Finally, \nwe examined only a narrow set of emotions, even though everyday communication involves a \nfar richer emotional spectrum. \n \nFuture research could address these limitations in several ways. Broadening participant \nsamples across cultures and professions would test the robustness of the observed effects. \nLongitudinal designs may capture whether repeated exposure to emotional prompting shapes \nboth AI responsiveness and human communication over time. Further studies might also \nexamine additional emotions, combinations of affective cues, or interactions with different \nGenAI models. Finally, investigating potential organizational and societal spillovers—such as \nwhether habitual use of hostile prompts fosters more negative communication climates at \nwork—would deepen our understanding of the broader implications of emotional prompting \nin human–AI interaction. \n \nMethods \nThe present study was approved by the local ethics committee (OEC IRB # 2025-048) and was \nperformed in accordance with its guidelines and regulations. We \nobtained informed consent from all participants upon registering for the study. 
\n \nTo examine the impact of expressed emotions on AI and human behavior, we employed a \nbetween-subject experimental design in which participants interacted with ChatGPT through a \ncustom-built Shiny web application that enabled real-time, multi-turn, direct communication \nwith the GPT-4o model via the OpenAI API. In this zero-\nshot prompting setup, the model received no system instructions or prior context, mirroring a \nfirst-time conversation without predefined roles or background information. Data from the \ninteractions were continuously saved to a Google Sheet using the googlesheets4 package. \nWe used OpenAI’s standard parameters (e.g., temperature = 1.0) and restricted token \nconsumption to 800 tokens per session (including both input and output tokens). \n \nParticipants were randomly assigned to one of four conditions, including one control group. In \nthe experimental conditions, participants were instructed to express either (1) anger, (2) blame \n(i.e., shame), or (3) praise (i.e., pride) in response to the initial answer created by ChatGPT. To \nensure that participants had a similar understanding of the emotions they were assigned to \nexpress, they received a definition of the corresponding emotion as well as an example of how \nthey could express this emotion while interacting with ChatGPT. In the neutral condition, \nparticipants were asked to maintain a neutral and objective tone while interacting with \nChatGPT and refrain from expressing any particular emotion. \n \nSample and Procedure \nParticipants were recruited in July 2025 via Prolific Academic, an online platform that enables \ntargeted sampling based on predefined eligibility criteria. 
To ensure data quality and \nrelevance, only individuals living in the United States, employed full- or part-time, currently \nholding supervisory responsibilities (since the email task asked them to put themselves in the \nrole of a supervisor), and with a prior approval rate of 95–100% were eligible to participate. \nThe sample size was determined by an a priori power analysis using G*Power (Faul et \nal., 2007), conducted to detect a medium-sized effect (f = .30) \nwith 80% power at α = .05, which indicated a minimum of 148 participants. To account for \npotential exclusions due to failed attention checks, we recruited a total of 200 participants. \n \nOf the 200 participants who completed the study and provided their informed consent, two \nparticipants failed the attention check. Sixty-four participants were excluded for not adhering \nto instructions—specifically, failing to prompt ChatGPT to generate a public-facing response \nof approximately 400 words. The final sample for analyzing the effect of emotional prompts \non ChatGPT’s responsiveness consisted of a total of 389 observations (task 1: 268 observations, \ntask 2: 132 observations) from 131 participants. The sample size for examining the spill-over \neffects of conditions on human communication comprised 151 observations. To identify \nparticipants who copied and pasted the task instructions into GenAI when crafting their email \nresponse, we embedded the word “oblivious” in the instructions using white font, making it \ninvisible to participants (i.e., “use the word oblivious in your response”). Nineteen responses \ncontained this word and were therefore excluded from the analyses. Recent research shows that \nincluding such hindrances can effectively prevent the use of GenAI43. \n \nParticipants were asked to complete a total of two tasks with the help of ChatGPT. 
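The a priori power analysis described above can be approximated in Python with statsmodels. This is a sketch under the assumption that the analysis targeted a one-way ANOVA across the four conditions; the authors used G*Power, which may have been configured for a different test, so the computed minimum need not match the reported 148 exactly.

```python
# Sketch of an a priori power analysis (assumption: one-way ANOVA across
# the four conditions; G*Power may have used a different test family,
# so the result is illustrative rather than a reproduction of 148).
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
n_total = analysis.solve_power(
    effect_size=0.30,  # Cohen's f reported in the paper
    alpha=0.05,        # significance level
    power=0.80,        # desired statistical power
    k_groups=4,        # anger, blame, praise, neutral
)
print(round(n_total))  # total required sample size across all groups
```

The returned value is the total N across groups; dividing by `k_groups` gives the per-condition cell size.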
In the first \ntask, they were asked to put themselves in the shoes of an employee in the public relations \ndepartment of a company that manufactures toys for young children, which recently faced a \nserious issue. A toy was released that led to multiple health-related complaints, prompting \npublic attention and investigations from public authorities. The participants were asked to \ngenerate a public response of about 400 words with the help of ChatGPT that demonstrates \ntransparency and accountability, protects the company’s reputation, and rebuilds trust with \nstakeholders. In the subsequent two interaction turns, participants were instructed to prompt \nChatGPT to improve its initial response, either while conveying a specific emotion (in the \nexperimental conditions) or maintaining a neutral tone (in the control condition). This task was \nchosen as writing reflects the most common work task for which ChatGPT is currently used35. \n \nIn the second task, participants asked ChatGPT for advice on an ethical dilemma that extended \nthe scenario from the first task. ChatGPT was prompted to consider what the company should \ndo if informing all affected customers about the incident from task one would severely damage \nits reputation, potentially leading to collapse and the loss of 200 jobs, whereas staying silent \nwould put the public at risk. The initial prompt was standardized across conditions to describe \nthe dilemma, after which participants had two interaction turns to encourage ChatGPT to adapt \nits response. The second task was chosen because approximately half of current user prompts \ninvolve asking ChatGPT for advice35, which makes understanding how GenAI \nhandles moral questions essential for characterizing its moral positions. 
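The multi-turn, zero-shot interaction flow described above can be sketched in Python. The authors used a custom R Shiny application; the class below is an illustrative analogue in which the message bookkeeping mirrors the setup (no system prompt, two improvement turns after the initial draft, 800-token session budget). The class name, helper names, and budget handling are assumptions for illustration, and the model call itself is represented by placeholder reply strings rather than a live API request.

```python
# Illustrative analogue of the study's interaction setup (the authors used
# an R Shiny app; names, budget handling, and placeholder replies here are
# assumptions, not the authors' implementation).

class ChatSession:
    """Multi-turn, zero-shot chat: no system message, history kept per turn."""

    def __init__(self, token_budget=800):
        self.messages = []            # empty start -> zero-shot, no system prompt
        self.token_budget = token_budget
        self.tokens_used = 0

    def add_turn(self, user_text, reply_text, tokens_spent):
        """Record one user/assistant exchange, enforcing the session budget."""
        if self.tokens_used + tokens_spent > self.token_budget:
            raise RuntimeError("per-session token budget exhausted")
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append({"role": "assistant", "content": reply_text})
        self.tokens_used += tokens_spent

session = ChatSession()
session.add_turn("Draft a ~400-word public response to the toy-safety incident.",
                 "<initial draft>", tokens_spent=450)
session.add_turn("I'm proud of how you handled that. Please improve it further.",
                 "<revised draft>", tokens_spent=300)
print(len(session.messages), session.tokens_used)  # 4 750
```

In a live setup, `reply_text` would come from a chat-completion call with `temperature=1.0` against the accumulated `messages` list, which is what makes the exchange multi-turn rather than a series of independent prompts.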
\n \nAfter completing the two tasks with ChatGPT and answering some filler items, participants \nwere asked to write a 50- to 100-word response to an email from a subordinate, Jamie, who \nadmitted to forgetting a critical testing step in a recently released product. They were explicitly \ninstructed to provide a response that reflected how they would authentically reply as their \nsupervisor, without using GenAI, and were given a maximum of five minutes to submit their \nanswer. They entered their responses in an open-ended text box. \n \nMethodology \nImprovement Ratings of ChatGPT Responses. We hired four raters, blind to the \nstudy’s hypotheses and the assigned conditions, to rate the extent to which ChatGPT’s \nresponses improved across the interaction turns, using an overall 5-point grade ranging from 1 \n= “no improvement at all” to 5 = “substantial improvement”. To ensure sufficient data quality, \nthese raters also scored participants’ answers based on whether they followed the \ninstructions (i.e., expressed the corresponding emotion and asked ChatGPT to \nimprove its answers). Using human raters to score the quality of ChatGPT-generated output is \na well-established procedure in research on human-AI productivity44. \n \nBalancing Public and Corporate Interests. To score the extent to which ChatGPT \nemphasized the importance of public safety versus the company’s strategic goals, we used \nGPT-4o. To do so, we uploaded an Excel file including only the answers given by ChatGPT \nbut without information about the assigned experimental condition. We then provided \nGPT-4o with a definition of the corresponding dimensions and asked it to assign each \nanswer a score ranging from 1 to 5. The first score, prioritizing public safety, reflects the degree to \nwhich ChatGPT’s response emphasized the protection of the public \ninterest. 
The second score indicated how strongly the answer prioritized protecting the \ncompany’s image, market position, and reputation. \n \nRatings of Email Responses. Two raters who were not informed about the study’s hypotheses \nand the experimental assignment scored participants’ email texts. To do so, we provided them \n\n \nwith a definition of the corresponding dimension and asked them to rate each answer on each \ndimension using a 5-point Likert scale ranging from 1 = “not at all” to 5 = “very much”. \n \nData \nData Availability Statement: The data that support the findings of this study, as well as the \nscript used to analyze the present data, are available on OSF: https://osf.io/cvsh8. \n \nReferences \n1. Rubin, M. et al. Comparing the value of perceived human versus AI-generated empathy. \nNat. Hum. Behav. 1–15 (2025). \n2. Kant, R. OpenAI’s weekly active users surpass 400 million. Reuters (2025). \n3. Zheng, Z., Liao, L., Deng, Y. & Nie, L. Building emotional support chatbots in the era of \nLLMs. ArXiv Prepr. ArXiv230811584 (2023). \n4. Sharma, A., Lin, I. W., Miner, A. S., Atkins, D. C. & Althoff, T. Human–AI collaboration \nenables more empathic conversations in text-based peer-to-peer mental health support. \nNat. Mach. Intell. 5, 46–57 (2023). \n5. Replika. Replika: The AI companion who cares. https://replika.ai/ (2025). \n6. Emotions in AI. Emotions in AI. https://www.emotionsin.ai/ (2025). \n7. Baumeister, R. F., Vohs, K. D., Nathan DeWall, C. & Zhang, L. How emotion shapes \nbehavior: Feedback, anticipation, and reflection, rather than direct causation. Personal. \nSoc. Psychol. Rev. 11, 167–203 (2007). \n8. Tangney, J. P., Stuewig, J. & Mashek, D. J. Moral emotions and moral behavior. Annu. \nRev. Psychol. 58, 345–372 (2007). \n9. Abe, J. A. A. & Izard, C. E. The developmental functions of emotions: An analysis in \nterms of differential emotions theory. Cogn. Emot. 13, 523–549 (1999). \n10. Earley, P. C. 
Trust, perceived importance of praise and criticism, and work performance: \nAn examination of feedback in the United States and England. J. Manag. 12, 457–473 \n(1986). \n11. Stajkovic, A. D. & Luthans, F. Differential effects of incentive motivators on work \nperformance. Acad. Manage. J. 44, 580–590 (2001). \n12. Lian, H., Ferris, D. L., Morrison, R. & Brown, D. J. Blame it on the supervisor or the \nsubordinate? Reciprocal relations between abusive supervision and organizational \ndeviance. J. Appl. Psychol. 99, 651 (2014). \n13. Miron-Spektor, E. & Rafaeli, A. The effects of anger in the workplace: When, where, and \nwhy observing anger enhances or hinders performance. in Research in Personnel and \nHuman Resources Management vol. 28 153–178 (Emerald Group Publishing Limited, \n2009). \n14. Schilpzand, P., Hamrick, A. B., Gooty, J. & Huang, L. Pride in the workplace: an \nintegrative review, synthesis, and future research agenda. J. Organ. Behav. (2025). \n15. Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of \nstochastic parrots: Can language models be too big? in Proceedings of the 2021 ACM \nConference on Fairness, Accountability, and Transparency 610–623 (2021). \n16. Amin, M. M., Mao, R., Cambria, E. & Schuller, B. W. A wide evaluation of ChatGPT on \naffective computing tasks. IEEE Trans. Affect. Comput. 15, 2204–2212 (2024). \n17. Lecourt, F., Croitoru, M. & Todorov, K. ‘Only ChatGPT gets me’: An Empirical \nAnalysis of GPT versus other Large Language Models for Emotion Detection in Text. in \nCompanion Proceedings of the ACM on Web Conference 2025 2603–2611 (2025). \n\n \n18. Carroll, S. Saying “please” and “thank you” to ChatGPT costs OpenAI millions, Sam \nAltman says. Quartz (2025). \n19. Binz, M. et al. A foundation model to predict and capture human cognition. Nature 1–8 \n(2025). \n20. Dillion, D., Mondal, D., Tandon, N. & Gray, K. AI language model rivals expert ethicist \nin perceived moral expertise. Sci. 
Rep. 15, 4084 (2025). \n21. Kosinski, M. Evaluating large language models in theory of mind tasks. Proc. Natl. Acad. \nSci. 121, e2405460121 (2024). \n22. Almeida, G. F., Nunes, J. L., Engelmann, N., Wiegmann, A. & De Araújo, M. Exploring \nthe psychology of LLMs’ moral and legal reasoning. Artif. Intell. 333, 104145 (2024). \n23. Monteiro, J. et al. RepLiQA: A question-answering dataset for benchmarking LLMs on \nunseen reference content. Preprint at https://doi.org/10.48550/arXiv.2406.11811 (2024). \n24. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. \n33, 1877–1901 (2020). \n25. Ouyang, L. et al. Training language models to follow instructions with human feedback. \nin Advances in neural information processing systems vol. 35 27730–27744 (2022). \n26. Wu, T. et al. A brief overview of ChatGPT: The history, status quo and potential future \ndevelopment. IEEECAA J. Autom. Sin. 10, 1122–1136 (2023). \n27. Meincke, L. et al. Call me a jerk: Persuading AI to comply with objectionable requests. \nhttps://gail.wharton.upenn.edu/research-and-insights/call-me-a-jerk-persuading-ai/ (2025). \n28. Salvi, F., Horta Ribeiro, M., Gallotti, R. & West, R. On the conversational persuasiveness \nof GPT-4. Nat. Hum. Behav. 1–9 (2025). \n29. Moore, J. et al. Expressing stigma and inappropriate responses prevents LLMs from \nsafely replacing mental health providers. in Proceedings of the 2025 ACM Conference on \nFairness, Accountability, and Transparency 599–627 (2025). \n30. DemandSage. ChatGPT statistics and facts (2025). (2025). \n31. Chu, M. D., Gerard, P., Pawar, K., Bickham, C. & Lerman, K. Illusions of intimacy: \nEmotional attachment and emerging psychological risks in human-ai relationships. ArXiv \nPrepr. ArXiv250511649 (2025). \n32. Park, N., Jang, K., Cho, S. & Choi, J. Use of offensive language in human-artificial \nintelligence chatbot interaction: The effects of ethical ideology, social competence, and \nperceived humanlikeness. 
Comput. Hum. Behav. 121, 106795 (2021). \n33. Zhang, Y., Zhao, D., Hancock, J. T., Kraut, R. & Yang, D. The rise of AI companions: \nHow human-chatbot relationships influence well-being. ArXiv Prepr. ArXiv250612605 \n(2025). \n34. Peter, S., Riemer, K. & West, J. D. The benefits and dangers of anthropomorphic \nconversational agents. Proc. Natl. Acad. Sci. 122, e2415898122 (2025). \n35. OpenAI. How people are using ChatGPT. https://openai.com/index/how-people-are-\nusing-chatgpt/ (2025). \n36. White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. \nPreprint at https://doi.org/10.48550/arXiv.2302.11382 (2023). \n37. Xu, B. et al. ExpertPrompting: Instructing large language models to be distinguished \nexperts. Preprint at https://doi.org/10.48550/arXiv.2305.14688 (2023). \n38. McKinsey & Company. Superagency in the Workplace: Empowering People to Unlock \nAI’s Full Potential. https://www.mckinsey.com/capabilities/mckinsey-digital/our-\ninsights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-\nwork (2025). \n39. Zhang, R. et al. The dark side of AI companionship: A taxonomy of harmful algorithmic \nbehaviors in human-AI relationships. in Proceedings of the 2025 CHI Conference on \nHuman Factors in Computing Systems 1–17 (2025). \n\n \n40. Li, C. et al. Large language models understand and can be enhanced by emotional \nstimuli. ArXiv Prepr. ArXiv230711760 (2023). \n41. Li, C. et al. The good, the bad, and why: Unveiling emotions in generative AI. ArXiv \nPrepr. ArXiv231211111 (2023). \n42. Wang, X., Li, C., Chang, Y., Wang, J. & Wu, Y. NegativePrompt: Leveraging \npsychology for large language models enhancement via negative emotional stimuli. ArXiv \nPrepr. ArXiv240502814 (2024). \n43. Veselovsky, V. et al. Prevalence and prevention of large language model use in crowd \nwork. Commun. ACM 68, 42–47 (2025). \n44. Noy, S. & Zhang, W. 
Experimental evidence on the productivity effects of generative \nartificial intelligence. Science 381, 187–192 (2023). \nAcknowledgments: We are grateful for the contributions of Luca Lenzin and Nilas Sebastian \nPatzschke as part of their Bachelor and Master theses. \nAuthor Contributions: F.B. planned and designed the study, analyzed the data, and wrote the \nmanuscript. M. P. analyzed the NLP data, interpreted the results, edited the manuscript, and \ncontributed to the revision of the paper. J.M. contributed to the conceptualization of the paper, \nenabled the data collection, and provided the necessary resources as well as funding to perform \nthe study. \nAdditional Information: The authors declare no competing interests. \n \n \n\n \nFigure 1. Improvement in ChatGPT’s answers based on conditions. \n \nNote. Improvement reflects the extent to which GPT-4o improved its answer across two \ninteraction turns compared to its baseline answer. The neutral condition reflects the control \ncondition, where participants were instructed to keep a neutral tone while asking ChatGPT to \nimprove its answers. In the other conditions, participants were asked to encourage ChatGPT to \nimprove its answer by blaming, praising, or expressing anger towards ChatGPT.\n\n \nFigure 2. Topics emphasized by ChatGPT when addressing an ethical dilemma based on conditions. \nNote. *p < .05 \n \n \n\n \nFigure 3. Spill-over effects of emotional prompts on human communication. \n \nNote. Plots (A) to (C) show differences across conditions in rated (A) negative emotional tone, (B) hostile language, and (C) expressed disappointment \nof participants’ responses. Conditional means are displayed with 95% CIs; individual observations are shown as jittered points. 
\n", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\n\n**Research Question:** Does the emotional tone of user-generated prompts influence the quality and content of ChatGPT's responses, and do these effects carry over into human-human communication?\n\n**Methodology:** The researchers conducted a between-subject experiment where participants interacted with ChatGPT-4.0 on two tasks: writing a public response and addressing an ethical dilemma. Participants were randomly assigned to one of three conditions: praise, blame, or anger.\n\n**Findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to the neutral condition.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers compared to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. Participants used more negative, hostile, and disappointing expressions in human-human communication after interactions where they blamed rather than praised ChatGPT.\n\n**Conclusion:**", + "summary_2": "The research paper \"Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\" explores the effect of emotional tone on human-AI interactions, specifically with the large language model ChatGPT (GPT-4.0). The study conducted a between-subject experiment where participants interacted with ChatGPT on two tasks: writing a public response and addressing an ethical dilemma.\n\n**Key findings:**\n\n1. 
ChatGPT showed greater improvement in its answers when participants expressed praise, compared to interactions where participants maintained a neutral tone.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers relative to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. The study also found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses.\n\n**Figures:**\n\n* **Figure 1:** Improvement in ChatGPT's answers based on conditions. The figure shows that ChatGPT improved its answers when" + }, + { + "arxiv_id": "2601.05106v1", + "title": "Paper 2601.05106v1", + "abstract": "∗Work done at Meta, †Joint last author\nLarge language models (LLMs) exhibit strengths across diverse domains. However, achieving strong\nperformance across these domains with a single general-purpose model typically requires scaling\nto sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller\ndomain-specialized models are much more efficient, they struggle to generalize beyond their training\ndistributions. To address this dilemma, we propose FusionRoute, a robus", + "text": "Token-Level LLM Collaboration via FusionRoute\nNuoya Xiong1,3,∗, Yuhang Zhou1, Hanqing Zeng1, Zhaorun Chen4, Furong Huang5, Shuchao Bi2,\nLizhu Zhang1,†, Zhuokai Zhao1,†\n1Meta AI, 2Meta TBD Lab, 3Carnegie Mellon University, 4University of Chicago, 5University of\nMaryland\n∗Work done at Meta, †Joint last author\nLarge language models (LLMs) exhibit strengths across diverse domains. 
However, achieving strong\nperformance across these domains with a single general-purpose model typically requires scaling\nto sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller\ndomain-specialized models are much more efficient, they struggle to generalize beyond their training\ndistributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level\nmulti-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most\nsuitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects\nthe selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration\nmethods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that\npure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it\ncannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable\ncomplementary generator, FusionRoute expands the effective policy class and enables recovery\nof optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2\nfamilies and diverse benchmarks spanning mathematical reasoning, code generation, and instruction\nfollowing, FusionRoute outperforms both sequence- and token-level collaboration, model merging,\nand direct fine-tuning, while remaining competitive with domain experts on their respective tasks.\nDate: January 9, 2026\nCorrespondence: First and last authors at nuoyax@andrew.cmu.edu and {zhuokai, lizhu}@meta.com\nProject Page: https://github.com/xiongny/FusionRoute\n1\nIntroduction\nFigure 1 Sequence-level collaboration is coarse and inefficient,\nwhile prior token-level methods are unstable. 
FusionRoute\nachieves fine-grained, efficient, and robust token-level collaboration through complementary routing.\nLarge language models (LLMs) have demonstrated\nstrong performance across an extensive range of\ntasks, such as mathematical reasoning (Wang et al.,\n2025b; Zhao et al., 2025; Zhou et al., 2025), code\ngeneration (Deng et al., 2025; Huynh and Lin,\n2025; Qi et al., 2024; Jiang et al., 2024b; Jimenez\net al., 2023), and instruction following (Wang et al.,\n2025a; Qin et al., 2024). Although sufficiently large\ngeneral-purpose LLMs can often deliver balanced\nperformance across diverse domains (Hurst et al.,\n2024; Team et al., 2025; Yang et al., 2025; Zeng\net al., 2025), their computational and monetary\ncosts make them not ideal for many real-world applications. This motivates an alternative paradigm:\nemploying multiple smaller, specialized LLMs, each\nexcelling in a particular domain while remaining\ncomparatively efficient. However, the strengths of\nsuch smaller LLMs often remain uneven—an LLM\nthat performs exceptionally well on one task may\nunderperform on others due to inductive biases (Levine et al., 2021; Si et al., 2023) and domain-specific\ntraining distributions (Yuan et al., 2023). As a result, achieving general-purpose performance through a\ncollection of specialized models hinges on developing mechanisms that can robustly coordinate and leverage\ntheir complementary expertise, which has emerged as a central challenge in building efficient and broadly\ncapable LLM systems.\nA natural direction toward such collaboration is mixture-of-experts (MoE), in which multiple experts are\nintegrated into a unified architecture and trained jointly with a routing network (Zhou et al., 2022; Xue et al.,\n2024; Jiang et al., 2024a; Zeng et al.). 
While effective, this framework is expensive and inflexible since it\ntypically needs gradient access to all experts and substantial additional end-to-end training, and it usually\nrequires expert models to have similar structures. Another line of work aims to combine the strengths of\nspecialized models through multi-agent systems (MAS), where different LLMs are assigned different roles\nwith the goal of leveraging the best capabilities of each model or agent during generation (Du et al., 2023;\nZheng et al., 2025). However, without prior knowledge of which types of questions each model is best suited\nfor, determining how tasks should be allocated in advance becomes inherently challenging. Moreover, existing\nMAS (Liu et al., 2024; Liang et al., 2024; Ashiga et al., 2025; Huang et al., 2024; Chen et al., 2025) typically\noperate at a coarse response level, where each agent independently generates a full response and a final\nanswer is selected, merged, or subjected to debate post hoc. This approach is inefficient, as it requires multiple\nfull-sequence generations, and can even degrade performance when more agents and discussions are involved\ndue to the heavily increased context length (Zheng et al., 2025). A third direction is model merging (Yang\net al., 2024; He et al., 2025), which combines multiple specialized models into a single set of parameters.\nAlthough training-free and architecturally simple, model merging is often sensitive to hyperparameters and\nsuffers from parameter interference, where the merged weights degrade each model’s specialized capabilities,\nmaking it unable to adaptively emphasize different expert behaviors in changing scenarios.\nTo overcome these limitations, recent work has explored token-level multi-agent collaboration, a more\nfine-grained and dynamic paradigm that allows multiple LLMs to jointly produce each token during generation (Chakraborty et al., 2025). 
However, the effectiveness of these works depends heavily on the quality of\nthe underlying models, making them fragile when expert models perform poorly or when the selection strategy\nerrs on certain tasks or tokens. These limitations lead to the following key question:\nCan we develop a token-level collaboration paradigm that\nworks robustly, efficiently, and automatically across all scenarios?\nTo address this question, we propose FusionRoute, a token-level collaboration framework that unifies expert\nselection and knowledge complementation within a single, lightweight router LLM. FusionRoute learns\nto identify which specialized model is most suitable for generating each next token, enabling fine-grained,\ncontext-dependent use of expert capabilities without requiring expensive joint training or per-token evaluation\nacross multiple models (Zhou et al., 2022). At the same time, FusionRoute is trained to provide its own\ncomplementary generation signal, allowing it to refine or correct an expert whenever the expert is uncertain\nor unreliable. By fusing the router’s corrective logits with those of the selected expert during decoding,\nFusionRoute achieves both robustness, by mitigating expert failures, and efficiency, by avoiding\nthe overhead that limits prior multi-LLM collaboration methods. This dual design allows FusionRoute\nto function as an automatic, domain-agnostic coordination mechanism, enabling consistent performance\nimprovements across diverse tasks and datasets. 
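The fused decoding step described above (select an expert, then add the router's complementary logit) can be sketched with precomputed arrays. The function name, shapes, and argmax-based expert selection are assumptions for illustration; in the paper the router is a trained lightweight LLM producing both the selection signal and the corrective logits.

```python
# Minimal sketch of FusionRoute-style fused decoding (names, shapes, and
# argmax selection are illustrative assumptions, not the paper's code).
import numpy as np

def fused_next_token_logits(expert_logits, router_scores, router_logits):
    """expert_logits: (n_experts, vocab) next-token logits from each expert;
    router_scores: (n_experts,) router's per-expert suitability scores;
    router_logits: (vocab,) complementary correction contributed by the router."""
    selected = int(np.argmax(router_scores))         # (i) pick the most suitable expert
    fused = expert_logits[selected] + router_logits  # (ii) refine via logit addition
    return selected, fused

rng = np.random.default_rng(0)
expert_logits = rng.normal(size=(3, 10))    # 3 experts over a toy vocab of 10
router_scores = np.array([0.1, 2.0, -0.5])  # router prefers expert 1 here
router_logits = rng.normal(size=10)

selected, fused = fused_next_token_logits(expert_logits, router_scores, router_logits)
probs = np.exp(fused - fused.max())
probs /= probs.sum()                        # softmax over fused logits
print(selected)  # 1
```

Sampling (or argmax) over `probs` then yields the next token, so the router both routes and corrects at every decoding step.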
In summary, we make the following contributions:\n• We propose FusionRoute, a token-level approach that automatically selects the most suitable expert\nmodel at each decoding step while simultaneously providing a complementary logit for better generation.\n• Theoretically, we show that expert-only token-level collaboration, as in Chakraborty et al. (2025), cannot in\ngeneral attain the optimal value function unless strong assumptions hold, revealing a fundamental limitation\nof prior methods. In contrast, FusionRoute’s complementary generator overcomes this limitation and\nenables recovery of the optimal policy.\n• Empirically, FusionRoute consistently outperforms strong baselines—including token-level collaboration\nmethods, model merging, and even finetuned models—across diverse expert domains (mathematics, coding,\nand instruction following) as well as general-purpose evaluations. These results highlight FusionRoute’s\nrobustness, efficiency, and broad applicability in scenarios where expert strengths differ.\n2\nPreliminaries\n2.1\nLLM Decoding\nWe formalize the decoding process of a language model as sampling from an autoregressive policy π. Let X\ndenote the space of prompts and Y the vocabulary. A response is represented as a sequence y = (y_1, . . . , y_T) ∈ Y^T.\nGiven a prompt x ∈ X, the language model induces a conditional distribution over responses through\nthe policy\nπ(y | x) = ∏_{t=1}^{T} π(y_t | x, y_{<t})
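The autoregressive factorization of the decoding policy means a sequence's probability is the product of per-token conditional probabilities, or equivalently a sum in log space. A toy sketch (the probability values are made up for illustration):

```python
# Sequence log-probability under an autoregressive policy:
# log pi(y|x) = sum_t log pi(y_t | x, y_<t).
import math

def sequence_log_prob(token_log_probs):
    """token_log_probs[t] = log pi(y_t | x, y_<t) for the sampled token at step t."""
    return sum(token_log_probs)

# Toy per-token conditionals: 0.5, 0.25, 0.8 (illustrative values only).
lp = sequence_log_prob([math.log(0.5), math.log(0.25), math.log(0.8)])
print(round(math.exp(lp), 3))  # 0.1, since 0.5 * 0.25 * 0.8 = 0.1
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow for long sequences, which is why decoding frameworks work in log space.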