diff --git a/Class 8 Homework.ipynb b/Class 8 Homework.ipynb index dc5ac6a..38959b5 100644 --- a/Class 8 Homework.ipynb +++ b/Class 8 Homework.ipynb @@ -14,7 +14,7 @@ "\n", "## Learning Objectives\n", "\n", - "* Generate abstractive summaries of academic documents using LLaMA 3 (7B).\n", + "* Generate abstractive summaries of academic documents using LLaMA 3 (8B) Instruct.\n", "* Collect two candidate summaries per paper and have annotators select the better summary.\n", "* Prepare the dataset of summary pairs and preference labels for reward model training.\n", "* Train a reward model (e.g., DeBERTa-v3) on the collected preference data.\n", @@ -23,7 +23,7 @@ "## Project Design\n", "\n", "* **Data Collection:** Select 10 academic papers (including both text and figures) from arXiv or recent NLP conference proceedings.\n", - "* **Summary Generation:** For each paper, use the LLaMA 3 (7B) model to generate *two* different summaries. Vary the prompting strategy or sampling parameters to produce diverse outputs.\n", + "* **Summary Generation:** For each paper, use the LLaMA 3 (8B) model to generate *two* different summaries. Vary the prompting strategy or sampling parameters to produce diverse outputs.\n", "* **Human Annotation:** Have one or two human annotators compare each pair of summaries for a paper and choose the better one (e.g. more informative, coherent, factually consistent, etc.). Record which summary is preferred.\n", "* **Data Formatting:** Create a dataset (e.g. in JSONL format) of summary pairs and preference labels. Each entry should include the two summary texts and which one was chosen (for example, fields `chosen` and `rejected` as required by reward modeling tools).\n", "* **Reward Model Training:** Fine-tune a reward model (such as DeBERTa-v3) on this preference data. 
Use the chosen/rejected summary pairs so the model learns to assign higher scores to the preferred summaries.\n", @@ -143,7 +143,7 @@ "* Install required Python libraries: `transformers`, `datasets`, `evaluate`, `trl` (Hugging Face TRL), and `accelerate`.\n", "* (Optional) Install `peft` if you want to use parameter-efficient fine-tuning for the reward model.\n", "* Ensure you have GPU access for model training (e.g., use Google Colab Pro, AWS, or a local GPU).\n", - "* Download or load the LLaMA 3 (7B) model checkpoint and a DeBERTa-v3 checkpoint (for example, via Hugging Face Hub).\n", + "* Download or load the LLaMA 3 (8B) model checkpoint and a DeBERTa-v3 checkpoint (for example, via Hugging Face Hub).\n", "\n", "## Deliverables\n", "\n", diff --git a/README.md b/README.md new file mode 100644 index 0000000..eec2dcd --- /dev/null +++ b/README.md @@ -0,0 +1,454 @@ +# Paper Summarization with Reward Modeling + +**Week 8 Assignment: Multimodal Summarization and Reward Modeling** + +This project implements an end-to-end pipeline for generating and evaluating academic paper summaries using large language models (LLaMA 3.1) and reward modeling (DeBERTa-v3). The system demonstrates how to align AI-generated summaries with human preferences through reinforcement learning from human feedback (RLHF) techniques. + +## Table of Contents + +- [Overview](#overview) +- [Learning Objectives](#learning-objectives) +- [Project Structure](#project-structure) +- [Installation](#installation) +- [Usage](#usage) +- [Pipeline Phases](#pipeline-phases) +- [Configuration](#configuration) +- [Output Files](#output-files) +- [Evaluation Metrics](#evaluation-metrics) +- [Example Results](#example-results) +- [Troubleshooting](#troubleshooting) + +## Overview + +Effective summarization is critical in research because it distills large, complex documents into concise overviews that highlight key insights. However, automatically evaluating the quality of generated summaries is challenging. 
Traditional metrics like ROUGE rely on lexical overlap, and even embedding-based metrics like BERTScore can miss nuances such as factual consistency or coherence. + +This project addresses this gap by: +1. Generating multiple candidate summaries using the LLaMA 3.1 (8B) Instruct model +2. Creating preference labels based on ROUGE scores +3. Training a reward model to predict human-aligned quality scores +4. Comparing reward model predictions with traditional metrics + +## Learning Objectives + +- Generate abstractive summaries of academic documents using LLaMA 3.1 (8B) Instruct +- Collect two candidate summaries per paper and create preference labels +- Prepare datasets of summary pairs for reward model training +- Train a reward model (DeBERTa-v3) on preference data +- Evaluate summaries using ROUGE, BERTScore, and the trained reward model +- Analyze alignment between reward model scores and automatic metrics + +## Project Structure + +``` +Homework8-Submission/ +├── README.md # This file +├── requirements.txt # Python dependencies +├── Class 8 Homework.ipynb # Jupyter notebook with instructions +│ +├── data/ +│ └── pdfs/ # Academic papers in PDF format (10 papers) +│ ├── 2601.05103v1.pdf +│ ├── 2601.05104v1.pdf +│ └── ... 
+│ +├── data_utils.py # PDF parsing and ArXiv downloading +├── summarization.py # LLaMA model loading and summary generation +├── reward_model.py # Reward model training and evaluation +├── build_data.py # Phase 1-3: Data collection and summarization +├── train_reward.py # Phase 4: Reward model training +├── run_evaluation.py # Phase 5-6: Evaluation and comparison +└── main.py # (Deprecated) Legacy pipeline orchestration +``` + +### Module Descriptions + +#### `data_utils.py` +- **Purpose**: Handle data collection and PDF processing +- **Key Functions**: + - `download_arxiv_pdfs()`: Download papers from ArXiv + - `extract_text_and_figures()`: Extract text and figure captions from PDFs + - `save_papers_to_json()` / `load_papers_from_json()`: Serialize paper data + +#### `summarization.py` +- **Purpose**: Generate summaries using LLaMA 3.1 model +- **Key Functions**: + - `load_llama()`: Load optimized 4-bit quantized LLaMA model using unsloth + - `build_multimodal_text()`: Combine paper text with figure captions + - `llama_generate_summary()`: Generate summaries with different prompting strategies + - `generate_summaries_for_papers()`: Process multiple papers + +#### `reward_model.py` +- **Purpose**: Train and use reward models for summary quality prediction +- **Key Functions**: + - `auto_label_preferences_with_rouge()`: Create preference labels from ROUGE scores + - `train_reward_model()`: Fine-tune DeBERTa-v3 on preference data + - `score_summaries_with_reward_model()`: Score summaries using trained model + - `evaluate_summaries()`: Compute all metrics (ROUGE, BERTScore, reward scores) + +#### `build_data.py` +- **Purpose**: Phase 1-3 - Data collection, summarization, and preference dataset creation +- **Key Steps**: Download papers, generate summaries, create preference labels + +#### `train_reward.py` +- **Purpose**: Phase 4 - Train the reward model on preference data +- **Key Steps**: Load preference data, fine-tune DeBERTa-v3, save trained model + +#### 
`run_evaluation.py` +- **Purpose**: Phase 5-6 - Evaluation and comparison of summaries +- **Key Steps**: Load trained model, compute all metrics, analyze results + +## Installation + +### Prerequisites + +- Python 3.11 or higher +- **For GPU server (recommended)**: + - NVIDIA, AMD, or Intel GPU (the accelerators unsloth supports) + - 16GB+ VRAM recommended + - Uses optimized unsloth for fast 4-bit inference +- **For Mac M4/Apple Silicon testing**: + - Code will fall back to standard transformers + - Model loading will be slower + - Use GPU server for actual training/evaluation +- HuggingFace Hub token (for accessing LLaMA models) +- OpenAI API key (optional, for additional features) + +### Setup + +1. Clone the repository: +```bash +git clone +cd Homework8-Submission +``` + +2. Install dependencies: +```bash +pip install -r requirements.txt +``` + +3. Set up environment variables: +```bash +export HF_HUB_TOKEN="your_huggingface_token" +export OPENAI_API_KEY="your_openai_key" # Optional +``` + +### Key Dependencies + +- `unsloth`: Optimized LLaMA model loading with 4-bit quantization +- `transformers`: HuggingFace transformers library +- `trl`: Transformer Reinforcement Learning for reward modeling +- `torch`: PyTorch deep learning framework +- `datasets`: HuggingFace datasets library +- `evaluate`: Evaluation metrics (ROUGE, BERTScore) +- `PyMuPDF` (fitz): PDF text extraction +- `arxiv`: ArXiv API client + +## Usage + +### Quick Start + +Run the complete pipeline by executing the three scripts in order: + +```bash +# Step 1: Build dataset (Phase 1-3) +python build_data.py + +# Step 2: Train reward model (Phase 4) +python train_reward.py + +# Step 3: Evaluate summaries (Phase 5-6) +python run_evaluation.py +``` + +### Individual Phase Details + +**Phase 1-3: Data Collection and Summarization** +```bash +python build_data.py +``` +This will: +- Use existing PDFs in `data/pdfs/` (or download from ArXiv if needed) +- Generate two summaries per paper using different prompting strategies +- Create 
preference labels based on ROUGE-L scores +- Save results to `summaries_train.json` and `reward_data.jsonl` + +**Output Files:** +- `train_papers.json` - Structured paper data +- `summaries_train.json` - Generated summaries +- `reward_data.jsonl` - Preference pairs for training + +--- + +**Phase 4: Train Reward Model** +```bash +python train_reward.py +``` +This will: +- Load preference data from `reward_data.jsonl` +- Fine-tune DeBERTa-v3 model on chosen/rejected summary pairs +- Save trained model to `reward_model/` directory + +**Requirements:** +- `reward_data.jsonl` must exist (generated by `build_data.py`) +- GPU with at least 8GB VRAM recommended +- Training takes approximately 10-15 minutes on GPU + +**Output:** +- `reward_model/` directory with trained model weights and tokenizer + +--- + +**Phase 5-6: Evaluation** +```bash +python run_evaluation.py +``` +This will: +- Load trained reward model from `reward_model/` +- Compute ROUGE, BERTScore, and reward scores for all summaries +- Save comprehensive results to `evaluation_results.json` + +**Requirements:** +- `summaries_train.json` must exist +- `reward_model/` directory must contain trained model + +**Output:** +- `evaluation_results.json` - Detailed metrics for all summaries + + +## Pipeline Phases + +### Phase 1: Data Collection (`build_data.py`) +- **Input**: PDFs in `data/pdfs/` or ArXiv query +- **Process**: Extract text and figure captions from academic papers +- **Output**: `train_papers.json` with structured paper data + +### Phase 2: Summary Generation (`build_data.py`) +- **Model**: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` +- **Strategy**: Generate two summaries per paper: + 1. **Summary 1**: Generic prompt, low temperature (0.3) - More focused + 2. 
**Summary 2**: Figure-aware prompt, high temperature (0.8) - More creative +- **Output**: `summaries_train.json` + +### Phase 3: Preference Labeling (`build_data.py`) +- **Method**: Use ROUGE-L scores against paper abstracts +- **Logic**: Summary with higher ROUGE-L = "chosen", other = "rejected" +- **Output**: `reward_data.jsonl` with preference pairs + +### Phase 4: Reward Model Training (`train_reward.py`) +- **Base Model**: `microsoft/deberta-v3-base` +- **Training**: 3 epochs with batch size 4, bfloat16 precision +- **Objective**: Learn to assign higher scores to preferred summaries +- **Max Length**: 4096 tokens to accommodate full summaries with context +- **Output**: Trained model in `reward_model/` directory + +### Phase 5-6: Evaluation & Comparison (`run_evaluation.py`) +- **Metrics Computed**: + - **ROUGE-1, ROUGE-2, ROUGE-L**: Lexical overlap with abstract + - **BERTScore**: Semantic similarity using BERT embeddings + - **Reward Score**: Trained model's quality prediction +- **Analysis**: Compare metric agreement and disagreement cases +- **Output**: `evaluation_results.json` + +## Configuration + +### Key Parameters (in respective modules) + +**Summarization** (`summarization.py`): +```python +LLAMA_MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" +MAX_INPUT_TOKENS = 2048 +MAX_NEW_TOKENS = 256 +``` + +**Reward Model** (`reward_model.py`): +```python +REWARD_MODEL_NAME = "microsoft/deberta-v3-base" +REWARD_NUM_EPOCHS = 3 +REWARD_BATCH_SIZE = 4 +``` + +**Data Collection** (`build_data.py`): +```python +ARXIV_QUERY = "cs.LG" # ArXiv category +NUM_TRAIN_PAPERS = 10 +PDF_DIR = "data/pdfs" +``` + +## Output Files + +| File | Description | +|------|-------------| +| `train_papers.json` | Structured data for training papers (text, abstract, figures) | +| `summaries_train.json` | Two summaries per paper with metadata | +| `reward_data.jsonl` | Preference pairs (chosen/rejected) for training | +| `reward_model/` | Trained DeBERTa-v3 reward model 
weights | +| `evaluation_results.json` | Complete evaluation metrics for all summaries | + +## Evaluation Metrics + +### ROUGE (Recall-Oriented Understudy for Gisting Evaluation) +- Measures lexical overlap between generated and reference summaries +- **ROUGE-1**: Unigram overlap +- **ROUGE-2**: Bigram overlap +- **ROUGE-L**: Longest common subsequence +- **Range**: 0-1 (higher is better) + +### BERTScore +- Measures semantic similarity using BERT embeddings +- Captures meaning beyond surface-level word matching +- Computes precision, recall, and F1 +- **Range**: 0-1 (higher is better) + +### Reward Model Score +- Learned metric from preference data +- Predicts human-aligned quality judgments +- Captures aspects like coherence, informativeness, factual consistency +- **Range**: Uncalibrated (higher = better quality) + +## Example Results + +After running the pipeline, `evaluation_results.json` contains entries like: + +```json +{ + "arxiv_id": "2601.05103v1", + "title": "Paper Title", + "abstract": "Original abstract...", + "summary_1": "First generated summary...", + "summary_2": "Second generated summary...", + "rouge_1": {"rouge1": 0.45, "rouge2": 0.23, "rougeL": 0.38}, + "rouge_2": {"rouge1": 0.42, "rouge2": 0.21, "rougeL": 0.35}, + "bertscore_1": {"precision": [0.89], "recall": [0.87], "f1": [0.88]}, + "bertscore_2": {"precision": [0.86], "recall": [0.84], "f1": [0.85]}, + "reward_1": 2.34, + "reward_2": 1.87 +} +``` + +### Interpretation + +- **High ROUGE + High Reward**: Summary is both lexically similar and high-quality +- **Low ROUGE + High Reward**: Summary paraphrases well but uses different words +- **High ROUGE + Low Reward**: Summary copies text but lacks coherence +- **BERTScore**: Often correlates better with reward scores than ROUGE + +## Troubleshooting + +### Common Issues + +**1. Script Execution Order** +``` +Error: FileNotFoundError: reward_data.jsonl not found + +Solution: Run scripts in correct order +1. 
python build_data.py # Creates reward_data.jsonl +2. python train_reward.py # Requires reward_data.jsonl +3. python run_evaluation.py # Requires reward_model/ and summaries_train.json +``` + +**2. Import Conflicts Between unsloth and TRL** +``` +Error: AttributeError or TypeError related to RewardTrainer + +Solution: Scripts are now separated to avoid conflicts +- build_data.py uses unsloth for LLaMA (summarization) +- train_reward.py uses standard TRL (no unsloth) +- Each runs in its own process with isolated imports +``` + +**3. Unsloth Not Supported on Mac M4/Apple Silicon** +``` +Error: NotImplementedError: Unsloth currently only works on NVIDIA, AMD and Intel GPUs. + +Solution: The code automatically falls back to standard transformers +- Mac M4 will use standard transformers (slower but works) +- For actual training/evaluation, use a GPU server +- No code changes needed, fallback is automatic +``` + +**4. Out of Memory Error** +``` +Solution: Reduce batch size or use smaller model +- Set REWARD_BATCH_SIZE = 2 in reward_model.py +- Use gradient accumulation for effective larger batch +``` + +**5. Model Download Fails** +``` +Solution: Check HuggingFace token +export HF_HUB_TOKEN="your_token" +huggingface-cli login +``` + +**6. CUDA Out of Memory** +``` +Solution: Enable CPU offloading or use smaller sequences +- Reduce MAX_INPUT_TOKENS in summarization.py +- Use CPU: Set LLAMA_DEVICE = "cpu" +``` + +**7. PDF Extraction Issues** +``` +Solution: Some PDFs may have complex layouts +- Check data/pdfs/ for corrupted files +- Manually verify text extraction quality +``` + +**8. Training Fails with AttributeError for RewardTrainer** +``` +Error: AttributeError: 'TrainingArguments' object has no attribute 'model_init_kwargs' + +Solution: Make sure you're using the updated reward_model.py +- The code adds all required TRL RewardConfig attributes +- Use train_reward.py script (not deprecated main.py) +- Clear Python cache: rm -rf __pycache__ && find . 
-name "*.pyc" -delete +``` + +**9. Circular Import with evaluate library** +``` +Error: ImportError: cannot import name 'evaluate_summaries' from partially initialized module 'reward_model' + +Solution: File naming conflict with HuggingFace 'evaluate' library +- The evaluation script is named 'run_evaluation.py' (not 'evaluate.py') +- This avoids conflict with the 'evaluate' library used in reward_model.py +- Use: python run_evaluation.py +``` + +### Performance Tips + +1. **Use GPU**: CUDA significantly speeds up inference and training +2. **Batch Processing**: Process multiple summaries together when possible +3. **Caching**: Save intermediate results to avoid recomputation +4. **4-bit Quantization**: Already enabled via unsloth for memory efficiency + +## Citation + +If you use this code in your research, please cite: + +```bibtex +@misc{paper_summarization_reward, + title={Paper Summarization with Reward Modeling}, + author={Class 8 Homework}, + year={2026}, + howpublished={\url{https://github.com/your-repo/homework8-submission}} +} +``` + +## License + +This project is licensed under the MIT License - see the LICENSE file for details. + +## Acknowledgments + +- **Unsloth**: For optimized LLaMA model loading +- **HuggingFace**: For transformers, datasets, and TRL libraries +- **Meta AI**: For LLaMA 3.1 model +- **Microsoft**: For DeBERTa-v3 model + +## Contact + +For questions or issues, please open an issue on the GitHub repository or contact the course instructors. + +--- + +**Note**: This project is for educational purposes as part of Week 8 homework on multimodal summarization and reward modeling. diff --git a/build_data.py b/build_data.py new file mode 100644 index 0000000..67668e1 --- /dev/null +++ b/build_data.py @@ -0,0 +1,117 @@ +#!/usr/bin/env python3 +""" +Phase 1-3: Data collection, summarization, and preference dataset creation. 
+ +This script: +- Downloads training papers from ArXiv (or uses existing PDFs) +- Generates two summaries per paper using LLaMA +- Creates preference labels using ROUGE-L scores +""" +import os +import glob +import re + +from data_utils import ( + download_arxiv_pdfs, + save_papers_to_json, + extract_text_and_figures, + PaperData, +) +from summarization import generate_summaries_for_papers +from reward_model import auto_label_preferences_with_rouge + + +def extract_abstract_from_text(text: str) -> str: + """ + Extract the abstract from paper text. + + Looks for text between "Abstract" and common section headers. + + Args: + text: Full paper text + + Returns: + Extracted abstract or first 500 chars if not found + """ + # Common patterns for abstract section + abstract_patterns = [ + r'Abstract\s*\n+(.*?)\n+(?:1\.|Introduction|Keywords|1\s+Introduction)', + r'ABSTRACT\s*\n+(.*?)\n+(?:1\.|Introduction|Keywords|1\s+Introduction)', + r'Abstract\s*[:\-]?\s*\n+(.*?)\n+\d+\.?\s*[A-Z]', # Abstract followed by numbered section + ] + + for pattern in abstract_patterns: + match = re.search(pattern, text, re.DOTALL | re.IGNORECASE) + if match: + abstract = match.group(1).strip() + # Clean up: remove excessive whitespace + abstract = re.sub(r'\s+', ' ', abstract) + # Limit length to reasonable abstract size + if len(abstract) > 100 and len(abstract) < 3000: + return abstract + + # Fallback: use first 500 characters after skipping potential title + lines = text.split('\n') + # Skip first few lines (likely title/authors) + text_start = '\n'.join(lines[5:]) if len(lines) > 5 else text + return text_start[:500].strip() + + +# Configuration +ARXIV_QUERY = "cs.LG" # Computer Science - Machine Learning +NUM_TRAIN_PAPERS = 10 + +# File paths +PDF_DIR = "data/pdfs" +TRAIN_META_JSON = "train_papers.json" +TRAIN_SUMMARY_JSON = "summaries_train.json" +REWARD_JSONL = "reward_data.jsonl" + + +def main(): + print("=== PHASE 1-3: Data collection, summarization, preference dataset ===\n") + + # 
Check that enough PDFs already exist (count only .pdf files) + if not os.path.exists(PDF_DIR) or len(glob.glob(f"{PDF_DIR}/*.pdf")) < NUM_TRAIN_PAPERS: + print(f"Downloading {NUM_TRAIN_PAPERS} training papers from ArXiv...") + train_papers = download_arxiv_pdfs(ARXIV_QUERY, NUM_TRAIN_PAPERS, PDF_DIR) + save_papers_to_json(train_papers, TRAIN_META_JSON) + else: + print(f"Using existing PDFs from {PDF_DIR}") + # Extract papers from existing PDFs + pdf_files = glob.glob(f"{PDF_DIR}/*.pdf")[:NUM_TRAIN_PAPERS] + print(f"Found {len(pdf_files)} PDF files") + + train_papers = [] + for pdf_path in pdf_files: + arxiv_id = os.path.basename(pdf_path).replace(".pdf", "") + text, figure_captions = extract_text_and_figures(pdf_path) + + # Extract abstract from the paper text + abstract = extract_abstract_from_text(text) + + train_papers.append( + PaperData( + arxiv_id=arxiv_id, + title=f"Paper {arxiv_id}", + abstract=abstract, + text=text, + figure_captions=figure_captions, + ) + ) + print(f" Extracted abstract for {arxiv_id}: {len(abstract)} chars") + save_papers_to_json(train_papers, TRAIN_META_JSON) + + # Generate summaries + print("\nGenerating summaries for training papers...") + summary_records = generate_summaries_for_papers(train_papers, TRAIN_SUMMARY_JSON) + + # Build reward modeling data with chosen/rejected labels + print("\nBuilding reward modeling dataset via ROUGE-based preference labels...") + auto_label_preferences_with_rouge(summary_records, REWARD_JSONL) + + print("\n=== Data preparation complete! 
===") + + +if __name__ == "__main__": + main() diff --git a/data/pdfs/2601.05103v1.pdf b/data/pdfs/2601.05103v1.pdf new file mode 100644 index 0000000..013823c Binary files /dev/null and b/data/pdfs/2601.05103v1.pdf differ diff --git a/data/pdfs/2601.05104v1.pdf b/data/pdfs/2601.05104v1.pdf new file mode 100644 index 0000000..98c271c Binary files /dev/null and b/data/pdfs/2601.05104v1.pdf differ diff --git a/data/pdfs/2601.05106v1.pdf b/data/pdfs/2601.05106v1.pdf new file mode 100644 index 0000000..76e5cd8 Binary files /dev/null and b/data/pdfs/2601.05106v1.pdf differ diff --git a/data/pdfs/2601.05111v1.pdf b/data/pdfs/2601.05111v1.pdf new file mode 100644 index 0000000..029416e Binary files /dev/null and b/data/pdfs/2601.05111v1.pdf differ diff --git a/data/pdfs/2601.05143v1.pdf b/data/pdfs/2601.05143v1.pdf new file mode 100644 index 0000000..c5eb6b7 Binary files /dev/null and b/data/pdfs/2601.05143v1.pdf differ diff --git a/data/pdfs/2601.05163v1.pdf b/data/pdfs/2601.05163v1.pdf new file mode 100644 index 0000000..3df4f01 Binary files /dev/null and b/data/pdfs/2601.05163v1.pdf differ diff --git a/data/pdfs/2601.05167v1.pdf b/data/pdfs/2601.05167v1.pdf new file mode 100644 index 0000000..a87e814 Binary files /dev/null and b/data/pdfs/2601.05167v1.pdf differ diff --git a/data/pdfs/2601.05170v1.pdf b/data/pdfs/2601.05170v1.pdf new file mode 100644 index 0000000..7b6ea15 Binary files /dev/null and b/data/pdfs/2601.05170v1.pdf differ diff --git a/data/pdfs/2601.05171v1.pdf b/data/pdfs/2601.05171v1.pdf new file mode 100644 index 0000000..a7ef883 Binary files /dev/null and b/data/pdfs/2601.05171v1.pdf differ diff --git a/data/pdfs/2601.05184v1.pdf b/data/pdfs/2601.05184v1.pdf new file mode 100644 index 0000000..48bd011 Binary files /dev/null and b/data/pdfs/2601.05184v1.pdf differ diff --git a/data/pdfs/2601.05192v1.pdf b/data/pdfs/2601.05192v1.pdf new file mode 100644 index 0000000..3645b13 Binary files /dev/null and b/data/pdfs/2601.05192v1.pdf differ diff --git 
a/data_utils.py b/data_utils.py new file mode 100644 index 0000000..2b60492 --- /dev/null +++ b/data_utils.py @@ -0,0 +1,145 @@ +""" +Data utilities for downloading papers from ArXiv and extracting text/figures from PDFs. +""" +import os +import json +from dataclasses import dataclass +from typing import List, Tuple + +import arxiv +import fitz # PyMuPDF + + +@dataclass +class PaperData: + """Data structure for storing paper information.""" + arxiv_id: str + title: str + abstract: str + text: str + figure_captions: List[str] + + +def download_arxiv_pdfs(query: str, num_papers: int, out_dir: str) -> List[PaperData]: + """ + Download papers from ArXiv and extract their content. + + Args: + query: ArXiv query category (e.g., "cs.LG") + num_papers: Number of papers to download + out_dir: Output directory for PDFs + + Returns: + List of PaperData objects containing paper information + """ + os.makedirs(out_dir, exist_ok=True) + + search = arxiv.Search( + query=f"cat:{query}", + max_results=num_papers, + sort_by=arxiv.SortCriterion.SubmittedDate, + sort_order=arxiv.SortOrder.Descending, + ) + + papers: List[PaperData] = [] + for result in search.results(): + arxiv_id = result.get_short_id() + title = result.title + abstract = result.summary + pdf_path = os.path.join(out_dir, f"{arxiv_id}.pdf") + + print(f"Downloading {arxiv_id}: {title[:80]}...") + result.download_pdf(filename=pdf_path) + + text, figure_captions = extract_text_and_figures(pdf_path) + papers.append( + PaperData( + arxiv_id=arxiv_id, + title=title, + abstract=abstract, + text=text, + figure_captions=figure_captions, + ) + ) + + return papers + + +def extract_text_and_figures(pdf_path: str) -> Tuple[str, List[str]]: + """ + Extract text and figure captions from a PDF file. + + This is a simple extractor that treats any line starting with 'Figure' or 'Fig.' + as a caption. It's not perfect but works well for most academic papers. 
+ + Args: + pdf_path: Path to the PDF file + + Returns: + Tuple of (full_text, list of figure captions) + """ + doc = fitz.open(pdf_path) + all_text_parts = [] + figure_captions = [] + + for page in doc: + t = page.get_text("text") + all_text_parts.append(t) + + for line in t.splitlines(): + stripped = line.strip() + if stripped.startswith("Figure ") or stripped.startswith("Fig. "): + figure_captions.append(stripped) + + doc.close() + full_text = "\n".join(all_text_parts) + return full_text, figure_captions + + +def save_papers_to_json(papers: List[PaperData], path: str): + """ + Save paper data to a JSON file. + + Args: + papers: List of PaperData objects + path: Output JSON file path + """ + data = [] + for p in papers: + data.append( + { + "arxiv_id": p.arxiv_id, + "title": p.title, + "abstract": p.abstract, + "text": p.text, + "figure_captions": p.figure_captions, + } + ) + with open(path, "w", encoding="utf-8") as f: + json.dump(data, f, ensure_ascii=False, indent=2) + + +def load_papers_from_json(path: str) -> List[PaperData]: + """ + Load paper data from a JSON file. + + Args: + path: Input JSON file path + + Returns: + List of PaperData objects + """ + with open(path, "r", encoding="utf-8") as f: + data = json.load(f) + papers = [] + for d in data: + papers.append( + PaperData( + arxiv_id=d["arxiv_id"], + title=d["title"], + abstract=d["abstract"], + text=d["text"], + figure_captions=d["figure_captions"], + ) + ) + return papers diff --git a/evaluation_results.json b/evaluation_results.json new file mode 100644 index 0000000..1110815 --- /dev/null +++ b/evaluation_results.json @@ -0,0 +1,452 @@ +[ + { + "arxiv_id": "2601.05171v1", + "title": "Paper 2601.05171v1", + "abstract": "Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. 
To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process- based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory- operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro. Correspondence: Team Leader at lizy@memtensor.cn Author Legend: †Corresponding author", + "summary_1": "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. 
The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. **MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", + "summary_2": "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. 
This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. **PersonaTree:** A globally maintained tree that captures the user's long-term profile, with", + "rouge_1": { + "rouge1": 0.5126903553299492, + "rouge2": 0.3010204081632653, + "rougeL": 0.3350253807106599, + "rougeLsum": 0.4010152284263959 + }, + "rouge_2": { + "rouge1": 0.5061425061425061, + "rouge2": 0.23209876543209876, + "rougeL": 0.285012285012285, + "rougeLsum": 0.32923832923832924 + }, + "bertscore_1": { + "precision": [ + 0.8535144925117493 + ], + "recall": [ + 0.8573517799377441 + ], + "f1": [ + 0.8554288744926453 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8539573550224304 + ], + "recall": [ + 0.8529762029647827 + ], + "f1": [ + 0.8534665107727051 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.013054322451353073, + "reward_2": -0.023740937933325768 + }, + { + "arxiv_id": "2601.05111v1", + "title": "Paper 2601.05111v1", + "abstract": "LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assess- ments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. 
This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation. Project Page: https://github.com/ModalityDance/Awesome-Agent-as-a-Judge", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Survey on Agent-as-a-Judge\n\n**Background:** The rapid advancement of Large Language Models (LLMs) has revolutionized AI evaluation, enabling scalable and near-human quality assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n\n**Problem Statement:** The limitations of LLM-as-a-Judge have led to the transition to Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Contributions:**\n\n1. 
**Characterizing the shift from LLM-as-a-Judge to Agent-as-a-Judge**: The authors identify and summarize the agentic judges' development trend into three progressive stages with distinct levels of autonomy and adaptability.\n2. **Organizing core methodologies**: The authors categorize the core methodologies into five key parts according to the agent's abilities and survey their applications across general and professional domains.\n3. **Analyzing frontier challenges and identifying promising research directions**:", + "summary_2": "The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which include inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has led to the development of Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Points:**\n\n1. LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments, but its reliability is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n2. Agent-as-a-Judge has emerged as a more robust and verifiable alternative, leveraging planning, tool-augmented verification, multi-agent collaboration, and persistent memory.\n3. The authors present the first comprehensive survey on Agent-as-a-Judge, summarizing the shift from LLM-as-a-Judge to Agent-as-a-Judge and organizing core methodologies into five key parts according to agent's abilities.\n4. 
The authors survey applications across general and professional domains and analyze frontier challenges and identify", + "rouge_1": { + "rouge1": 0.6112600536193029, + "rouge2": 0.4420485175202156, + "rougeL": 0.4879356568364611, + "rougeLsum": 0.5040214477211796 + }, + "rouge_2": { + "rouge1": 0.581151832460733, + "rouge2": 0.39999999999999997, + "rougeL": 0.3979057591623037, + "rougeLsum": 0.4293193717277487 + }, + "bertscore_1": { + "precision": [ + 0.8799600005149841 + ], + "recall": [ + 0.8951518535614014 + ], + "f1": [ + 0.8874909281730652 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8950091600418091 + ], + "recall": [ + 0.8787307739257812 + ], + "f1": [ + 0.8867952823638916 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.01309164334088564, + "reward_2": -0.040604256093502045 + }, + { + "arxiv_id": "2601.05163v1", + "title": "Paper 2601.05163v1", + "abstract": "Document Question Answering (DocQA) fo- cuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and pro- pose a tool-driven agent framework that ex- plicitly models document exploration and com- prehension. To enable end-to-end training of such agents, we introduce an Exploration-then- Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the trained models on two long-context document understanding benchmarks, MMLongBench- Doc and DocBench, show their effectiveness. 
Further analysis provides valuable insights for the agentic tool design and synthetic data.", + "summary_1": "The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To address the scarcity of high-quality training data for DocQA, they introduce an Exploration-then-Synthesis data synthesis pipeline that progressively enhances QA pairs from easy to hard.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\n\nThe paper conducts extensive experiments on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, demonstrating the effectiveness", + "summary_2": "The paper \"DocDancer: Towards Agentic Document-Grounded Information Seeking\" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilizes an end-to-end trained open-source Doc agent. 
The authors introduce a tool-driven agent framework that explicitly models document exploration and comprehension, addressing the scarcity of high-quality training data for DocQA.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe paper also proposes an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. This pipeline first explores a source document through intent-guided, tool-augmented interactions to collect grounded evidence and then synthesizes high-quality document-grounded QA pairs via multi-observation reasoning.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset using two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. 
Despite being trained with only 5,000 instances, both variants achieve competitive performance,", + "rouge_1": { + "rouge1": 0.5093167701863355, + "rouge2": 0.34375, + "rougeL": 0.36645962732919257, + "rougeLsum": 0.37888198757763975 + }, + "rouge_2": { + "rouge1": 0.4342507645259938, + "rouge2": 0.20923076923076922, + "rougeL": 0.2568807339449541, + "rougeLsum": 0.28746177370030584 + }, + "bertscore_1": { + "precision": [ + 0.8635969758033752 + ], + "recall": [ + 0.8913978338241577 + ], + "f1": [ + 0.8772771954536438 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8476544618606567 + ], + "recall": [ + 0.8700016736984253 + ], + "f1": [ + 0.8586826920509338 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.02074364572763443, + "reward_2": -0.035873666405677795 + }, + { + "arxiv_id": "2601.05104v1", + "title": "Paper 2601.05104v1", + "abstract": "This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increases its emphasis on protecting the public interest. 
Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses. Together, our findings demonstrate that the emotional tone people apply in human–AI interactions not only shape ChatGPT’s outputs but also carry over into subsequent human–human communication.", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\n\n**Research Question:** Does the emotional tone of user-generated prompts influence the quality and content of ChatGPT's responses, and do these effects carry over into human-human communication?\n\n**Methodology:** The researchers conducted a between-subject experiment where participants interacted with ChatGPT-4.0 on two tasks: writing a public response and addressing an ethical dilemma. Participants were randomly assigned to one of three conditions: praise, blame, or anger.\n\n**Findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to the neutral condition.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers compared to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. 
Participants used more negative, hostile, and disappointing expressions in human-human communication after interactions where they blamed rather than praised ChatGPT.\n\n**Conclusion:**", + "summary_2": "The research paper \"Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\" explores the effect of emotional tone on human-AI interactions, specifically with the large language model ChatGPT (GPT-4.0). The study conducted a between-subject experiment where participants interacted with ChatGPT on two tasks: writing a public response and addressing an ethical dilemma.\n\n**Key findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to interactions where participants maintained a neutral tone.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers relative to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. The study also found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses.\n\n**Figures:**\n\n* **Figure 1:** Improvement in ChatGPT's answers based on conditions. 
The figure shows that ChatGPT improved its answers when", + "rouge_1": { + "rouge1": 0.675531914893617, + "rouge2": 0.49732620320855614, + "rougeL": 0.5531914893617021, + "rougeLsum": 0.5904255319148937 + }, + "rouge_2": { + "rouge1": 0.7405405405405405, + "rouge2": 0.576086956521739, + "rougeL": 0.6108108108108108, + "rougeLsum": 0.6054054054054054 + }, + "bertscore_1": { + "precision": [ + 0.877902626991272 + ], + "recall": [ + 0.9065370559692383 + ], + "f1": [ + 0.8919901251792908 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8994222283363342 + ], + "recall": [ + 0.9299557209014893 + ], + "f1": [ + 0.9144341945648193 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.02505182847380638, + "reward_2": -0.0208128672093153 + }, + { + "arxiv_id": "2601.05106v1", + "title": "Paper 2601.05106v1", + "abstract": "∗Work done at Meta, †Joint last author\nLarge language models (LLMs) exhibit strengths across diverse domains. However, achieving strong\nperformance across these domains with a single general-purpose model typically requires scaling\nto sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller\ndomain-specialized models are much more efficient, they struggle to generalize beyond their training\ndistributions. To address this dilemma, we propose FusionRoute, a robus", + "summary_1": "Here's a clear and concise summary of the research paper for a graduate student:\n\n**Title:** FusionRoute: A Robust and Effective Token-Level Multi-LLM Collaboration Framework\n\n**Background:** Large Language Models (LLMs) have shown strong performance across various domains, but achieving strong performance across these domains with a single general-purpose model is often expensive and inefficient. 
Smaller domain-specialized models are more efficient but struggle to generalize beyond their training distributions.\n\n**Problem Statement:** The authors aim to develop a framework that can robustly coordinate and leverage the complementary expertise of multiple specialized LLMs to achieve general-purpose performance.\n\n**Proposed Solution:** The authors propose FusionRoute, a token-level multi-LLM collaboration framework that consists of a lightweight router and a trainable complementary generator. The router selects the most suitable expert at each decoding step, and the complementary generator refines or corrects the selected expert's next-token distribution via logit addition.\n\n**Key Contributions:**\n\n1. **Theoretical Analysis:** The authors provide a theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot realize the optimal decoding policy unless strong global coverage assumptions hold.\n2. **Empirical Evaluation:** FusionRoute outperforms sequence- and token-level collaboration, model merging, and direct fine-t", + "summary_2": "The research paper \"Paper 2601.05106v1\" proposes a new framework called FusionRoute for achieving strong performance across diverse domains with a single general-purpose model, while remaining efficient and competitive with domain experts. The authors argue that existing methods, such as sequence-level collaboration, model merging, and direct fine-tuning, have limitations.\n\nThe paper's main contributions are:\n\n1. **FusionRoute**: A robust and effective token-level multi-LLM collaboration framework that selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition.\n2. 
**Theoretical analysis**: A theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot in general realize the optimal decoding policy unless strong global coverage assumptions hold.\n3. **Empirical results**: Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.\n\nThe paper highlights the importance of token-level collaboration, which allows multiple LLMs to jointly produce each token during generation, making it a more fine-grained and dynamic", + "rouge_1": { + "rouge1": 0.3088235294117647, + "rouge2": 0.2, + "rougeL": 0.27941176470588236, + "rougeLsum": 0.3088235294117647 + }, + "rouge_2": { + "rouge1": 0.19548872180451124, + "rouge2": 0.09090909090909091, + "rougeL": 0.12781954887218047, + "rougeLsum": 0.17293233082706766 + }, + "bertscore_1": { + "precision": [ + 0.8201289772987366 + ], + "recall": [ + 0.8799529075622559 + ], + "f1": [ + 0.8489883542060852 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8036170601844788 + ], + "recall": [ + 0.8410658836364746 + ], + "f1": [ + 0.8219152092933655 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.017466777935624123, + "reward_2": -0.03491535037755966 + }, + { + "arxiv_id": "2601.05103v1", + "title": "Paper 2601.05103v1", + "abstract": "2 L3S Research Center, Leibniz University Hannover, Hannover, Germany\nzhiyin.tan@l3s.de\nAbstract. Understanding the role of citations is essential for research\nassessment and citation-aware digital libraries. 
However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in auto classification due to a dilemma between fine-grained type distinctions and practical classificatio", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\n\n**Authors:** Changxu Duan and Zhiyin Tan\n\n**Summary:** The authors propose a new framework, SOFT, for citation classification that explicitly separates the intent of a citation (why a work is cited) from its content type (what part is cited). This framework is grounded in semantic role theory and is designed to improve the clarity, consistency, and generalizability of citation classification.\n\n**Background:** Traditional citation metrics, such as counts or h-index, are limited in their ability to capture the nuanced ways in which works are cited, interpreted, or contested. Citation classification has emerged as a critical task in computational bibliometrics and information science to unlock the meaning within scholarly dialogue.\n\n**Limitations of existing frameworks:** Current citation classification practices face persistent limitations, including the conflation of distinct citation intents and content types, and the lack of consistent semantic definitions. These ambiguities hinder downstream applications such as citation-based retrieval and impact analysis.\n\n**SOFT framework:** The authors propose a two-dimensional framework that explicitly distinguishes the predicative intent of a citation from its cited content type. 
This framework", + "summary_2": "The paper \"Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\" proposes a new annotation framework, SOFT, that explicitly separates citation intent (why a work is cited) from cited content type (what part is cited). The authors argue that existing frameworks conflate these two dimensions, hindering their effectiveness in auto classification. SOFT is grounded in the principles of semantic roles, offering a more nuanced understanding of how authors structure argumentation.\n\nThe paper introduces two key dimensions: Cited Content Type and Citation Intent. Cited Content Type refers to the specific part of a work that is being cited (e.g., method, result, or background), while Citation Intent refers to the functional purpose of the citation (e.g., to support, to critique, or to provide background). This distinction enables clearer definitions, improved annotation consistency, and stronger support for Large Language Model (LLM)-based classification.\n\nThe authors validate SOFT through a comprehensive re-annotation of the ACL-ARC dataset and a cross-domain evaluation on ACT2, covering 19 scientific fields. 
Results show that SOFT yields higher inter-model and human-LLM agreement, improves classification accuracy for both zero-shot and fine-tuned models, and generalizes more robustly across domains than existing", + "rouge_1": { + "rouge1": 0.2509225092250923, + "rouge2": 0.08178438661710036, + "rougeL": 0.18450184501845018, + "rougeLsum": 0.22140221402214022 + }, + "rouge_2": { + "rouge1": 0.2518518518518518, + "rouge2": 0.11194029850746268, + "rougeL": 0.1925925925925926, + "rougeLsum": 0.22962962962962966 + }, + "bertscore_1": { + "precision": [ + 0.8247026801109314 + ], + "recall": [ + 0.8273060321807861 + ], + "f1": [ + 0.8260022401809692 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8267620801925659 + ], + "recall": [ + 0.8196247220039368 + ], + "f1": [ + 0.8231779336929321 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.028803709894418716, + "reward_2": -0.025009701028466225 + }, + { + "arxiv_id": "2601.05192v1", + "title": "Paper 2601.05192v1", + "abstract": "Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foun- dational step in tasks such as knowledge graph construction, question-answering, and informa- tion extraction. Our method, LELA, is a mod- ular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowl- edge bases and LLMs, without any fine-tuning phase. 
Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation\n\n**Background:** Entity linking is a fundamental task in natural language processing that involves mapping ambiguous mentions in text to entities in a knowledge base. Traditional entity linking methods require fine-tuning on the target domain and knowledge base, which can be computationally expensive and require high-quality labeled data.\n\n**Problem Statement:** The authors propose a true zero-shot entity linking approach, where the method can work out of the box for any domain without the need for fine-tuning.\n\n**Method:** The proposed approach, LELA, is a coarse-to-fine method that leverages the capabilities of large language models (LLMs). It consists of three stages:\n\n1. **Initial entity-candidate set generation**: LELA generates a set of potential entities based on the input text.\n2. **Pointwise re-ranking**: The top-k candidates are selected through re-ranking.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final selection through reasoning over the context and entity descriptions.\n\n**Advantages:** LELA is modular, LLM-, reranker-, and retrie", + "summary_2": "The paper proposes a novel entity linking approach, called LELA, which stands for Large Language Model-based Entity Linking Approach with Zero-Shot Domain Adaptation. Entity linking is the task of mapping ambiguous mentions in text to entities in a knowledge base (KB). 
The authors aim to develop a true zero-shot entity linking method, which means it works without the need for fine-tuning, and can handle different domains, knowledge bases, and large language models (LLMs).\n\nThe proposed approach, LELA, is a coarse-to-fine method that consists of three stages:\n\n1. **Initial entity-candidate set generation**: LELA generates an initial set of candidate entities for the ambiguous mention.\n2. **Pointwise re-ranking**: The top-k candidates are selected through pointwise re-ranking, where the model calculates a score for each candidate entity based on its relevance to the context and entity descriptions.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final candidate selection, where the model uses reasoning over the context and entity descriptions to select the most relevant entity.\n\nLELA is modular, agnostic to the LLM, reranker, and retriever, making it applicable to both open-source and proprietary models.\n\nThe authors", + "rouge_1": { + "rouge1": 0.35416666666666663, + "rouge2": 0.1888111888111888, + "rougeL": 0.25, + "rougeLsum": 0.29861111111111105 + }, + "rouge_2": { + "rouge1": 0.32764505119453924, + "rouge2": 0.15120274914089346, + "rougeL": 0.21160409556313994, + "rougeLsum": 0.22525597269624573 + }, + "bertscore_1": { + "precision": [ + 0.8313204050064087 + ], + "recall": [ + 0.854534387588501 + ], + "f1": [ + 0.842767596244812 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8334040641784668 + ], + "recall": [ + 0.8482416868209839 + ], + "f1": [ + 0.840757429599762 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.01752193458378315, + "reward_2": -0.03325437754392624 + }, + { + "arxiv_id": "2601.05143v1", + "title": "Paper 2601.05143v1", + "abstract": "Visual question answering for crop disease analysis requires accurate visual under- standing and reliable 
language generation. This work presents a lightweight vision–language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision–language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Lightweight and Explainable Vision–Language Framework for Crop Disease Visual Question Answering\n\n**Background:** Crop disease diagnosis is a critical task in modern agriculture, and early identification of disease symptoms is essential to reduce damage and support effective intervention. 
However, traditional methods rely on expert inspection and recommendations, which can be time-consuming and limited in regions with expert access.\n\n**Problem Statement:** Existing automated disease detection methods often require controlled imaging conditions, are computationally heavy, and fail to provide detailed textual descriptions of visual attributes, disease progression stages, and contextual information.\n\n**Research Question:** Can a lightweight Visual Question Answering (VQA) framework be established for intelligent and practical plant disease identification?\n\n**Methodology:** The proposed framework combines a Swin Transformer vision encoder with sequence-to-sequence language decoders and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Contributions:**\n\n1. A lightweight VQA framework for crop disease identification that achieves high accuracy and strong performance on BLEU, ROUGE, and BERTScore metrics.\n2. Explainability is assessed using Grad-CAM and token-level attribution, demonstrating robust performance under diverse user-driven queries.\n3.", + "summary_2": "The research paper presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders, and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Findings:**\n\n1. **High Accuracy for Crop and Disease Identification:** The model achieves high accuracy for both crop and disease identification, demonstrating its effectiveness in visual question answering for crop disease analysis.\n2. **Strong Performance on BLEU, ROUGE, and BERTScore:** The framework also achieves strong performance on these metrics, indicating its ability to generate high-quality language output.\n3. 
**Explainability:** Explainability is assessed using Grad-CAM and token-level attribution, demonstrating the model's ability to provide transparent and interpretable results.\n4. **Robust Performance under Diverse User-Driven Queries:** Qualitative results demonstrate robust performance under diverse user-driven queries, highlighting the model's ability to handle real-world scenarios.\n\n**Key Contributions:**\n\n1. **Lightweight Vision-Language Framework:** The proposed framework is lightweight, making it suitable for real-world deployment in agricultural environments.\n2. **Task-Specific Visual Pretraining:** The study highlights the effectiveness of task", + "rouge_1": { + "rouge1": 0.5195530726256984, + "rouge2": 0.33707865168539325, + "rougeL": 0.3966480446927374, + "rougeLsum": 0.47486033519553067 + }, + "rouge_2": { + "rouge1": 0.6455331412103746, + "rouge2": 0.5101449275362319, + "rougeL": 0.484149855907781, + "rougeLsum": 0.5302593659942364 + }, + "bertscore_1": { + "precision": [ + 0.8561044931411743 + ], + "recall": [ + 0.8864502310752869 + ], + "f1": [ + 0.8710131645202637 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.882099449634552 + ], + "recall": [ + 0.90748131275177 + ], + "f1": [ + 0.8946103453636169 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.026092007756233215, + "reward_2": -0.01505814865231514 + }, + { + "arxiv_id": "2601.05184v1", + "title": "Paper 2601.05184v1", + "abstract": "The rapid advancement of large language mod- els (LLMs) has led to growing interest in using synthetic data to train future models. How- ever, this creates a self-consuming retraining loop, where models are trained on their own outputs and may cause performance drops and induce emerging biases. 
In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.", + "summary_1": "The research paper \"Self-Consuming Performative Loop for Large Language Models\" explores the concept of a self-consuming performative loop (SCPL) in large language models (LLMs). The authors introduce the idea of a dynamic system driven by user feedback, where previously deployed LLMs influence the data they generate, leading to a performative prediction. They investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback.\n\nThe paper focuses on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, the authors find that the performative loop increases preference bias and decreases disparate bias. 
They design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.\n\nKey findings:\n\n1. Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n2. Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n3. Performative feedback accelerates bias amplification in incremental fine-tuning loops, whereas this phenomenon is less obvious in retraining loops.\n\nThe authors' research provides insights that inform safer deployment of", + "summary_2": "This summary is based on the provided research paper excerpt, focusing on the concept of Self-Consuming Performative Loop (SCPL) in Large Language Models (LLMs).\n\n**Introduction**\n\nThe paper highlights the growing interest in using synthetic data to train future LLMs, which leads to a self-consuming retraining loop where models are trained on their own outputs and may cause performance drops and induce emerging biases. This self-consuming performative loop is particularly relevant in dynamic systems driven by user feedback, where the model's performance influences future data.\n\n**Research Questions**\n\nThe authors aim to investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. They focus on two types of loops:\n\n1. The typical retraining setting\n2. 
The incremental fine-tuning setting, which is largely underexplored\n\n**Key Findings**\n\nThe authors' experiments on three real-world tasks reveal that:\n\n* The performative loop increases preference bias and decreases disparate bias.\n* Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n* Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n* Performative feedback accelerates bias amplification in incremental", + "rouge_1": { + "rouge1": 0.602803738317757, + "rouge2": 0.39436619718309857, + "rougeL": 0.42990654205607476, + "rougeLsum": 0.4626168224299066 + }, + "rouge_2": { + "rouge1": 0.6099290780141844, + "rouge2": 0.39904988123515445, + "rougeL": 0.46335697399527187, + "rougeLsum": 0.49172576832151305 + }, + "bertscore_1": { + "precision": [ + 0.8966838717460632 + ], + "recall": [ + 0.8732770681381226 + ], + "f1": [ + 0.8848257064819336 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.8724937438964844 + ], + "recall": [ + 0.868453860282898 + ], + "f1": [ + 0.8704691529273987 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.03423023596405983, + "reward_2": -0.013444548472762108 + }, + { + "arxiv_id": "2601.05167v1", + "title": "Paper 2601.05167v1", + "abstract": "Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. 
To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM improves the average accuracy from 42.5% to 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers. Our code is available at https://github.com/Chengsong-Huang/RelayLLM.", + "summary_1": "Here's a clear and concise summary of the research paper:\n\n**Title:** RelayLLM: Efficient Reasoning via Collaborative Decoding\n\n**Problem Statement:** Large Language Models (LLMs) are powerful but computationally expensive, while Small Language Models (SLMs) are efficient but lack reasoning capacity. Existing collaborative approaches, such as cascading or routing, often operate at a coarse granularity, leading to significant computational waste.\n\n**Proposed Solution:** RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike static routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process.\n\n**Key Components:**\n\n1. 
**Two-stage training framework:** Supervised warm-up phase to teach the model the syntactic structure of calling commands, followed by a reinforcement learning stage using Group Relative Policy Optimization (GRPO) training.\n2. **Context-aware reward:** Designed to balance independence with necessary help-seeking, penalizing both wasted costs and avoidable errors.\n3. **Token-level collaborative decoding:** The SLM generates a special command token () to pause its own generation and invoke the LLM for a specified number of tokens", + "summary_2": "The research paper \"RelayLLM: Efficient Reasoning via Collaborative Decoding\" proposes a novel framework for efficient reasoning via token-level collaborative decoding. The authors address the challenge of deploying Large Language Models (LLMs) for complex reasoning, which is often hindered by high computational costs and latency, while Small Language Models (SLMs) typically lack the necessary reasoning capacity.\n\nThe authors present RelayLLM, a framework that empowers the SLM to act as an active controller, dynamically invoking the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process. They introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking.\n\nThe empirical results on six benchmarks demonstrate the effectiveness of RelayLLM, achieving an average accuracy of 49.52% and reducing the performance gap between the small model and the large one. The authors also show that RelayLLM invokes the large model for only 1.07% of the total generated tokens, resulting in a 98.2% reduction in token costs compared to a performance-matched router.\n\nKey findings:\n\n1. 
**Efficient reasoning via token-level collaborative decoding**: RelayLLM emp", + "rouge_1": { + "rouge1": 0.5764411027568922, + "rouge2": 0.37783375314861456, + "rougeL": 0.46115288220551376, + "rougeLsum": 0.4761904761904762 + }, + "rouge_2": { + "rouge1": 0.6683046683046683, + "rouge2": 0.4938271604938272, + "rougeL": 0.5503685503685503, + "rougeLsum": 0.574938574938575 + }, + "bertscore_1": { + "precision": [ + 0.8771227598190308 + ], + "recall": [ + 0.8711225986480713 + ], + "f1": [ + 0.8741124272346497 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "bertscore_2": { + "precision": [ + 0.9148159027099609 + ], + "recall": [ + 0.8935967683792114 + ], + "f1": [ + 0.9040818214416504 + ], + "hashcode": "roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.57.3)" + }, + "reward_1": -0.024731554090976715, + "reward_2": -0.020340878516435623 + } +] \ No newline at end of file diff --git a/gpu_environment.yml b/gpu_environment.yml new file mode 100644 index 0000000..dcb422e --- /dev/null +++ b/gpu_environment.yml @@ -0,0 +1,161 @@ +name: hw7_311 +channels: + - defaults +dependencies: + - _libgcc_mutex=0.1=main + - _openmp_mutex=5.1=1_gnu + - bzip2=1.0.8=h5eee18b_6 + - ca-certificates=2025.12.2=h06a4308_0 + - expat=2.7.3=h7354ed3_4 + - ld_impl_linux-64=2.44=h153f514_2 + - libexpat=2.7.3=h7354ed3_4 + - libffi=3.4.4=h6a678d5_1 + - libgcc=15.2.0=h69a1729_7 + - libgcc-ng=15.2.0=h166f726_7 + - libgomp=15.2.0=h4751f2c_7 + - libnsl=2.0.0=h5eee18b_0 + - libstdcxx=15.2.0=h39759b7_7 + - libstdcxx-ng=15.2.0=hc03a8fd_7 + - libuuid=1.41.5=h5eee18b_0 + - libxcb=1.17.0=h9b100fa_0 + - libzlib=1.3.1=hb25bd0a_0 + - ncurses=6.5=h7934f7d_0 + - openssl=3.0.18=hd6dcaed_0 + - pip=25.3=pyhc872135_0 + - pthread-stubs=0.3=h0ce48e5_1 + - python=3.11.14=h6fa692b_0 + - readline=8.3=hc2a1206_0 + - setuptools=80.9.0=py311h06a4308_0 + - sqlite=3.51.1=he0a8d7e_0 + - tk=8.6.15=h54e0aa7_0 + - wheel=0.45.1=py311h06a4308_0 + - xorg-libx11=1.8.12=h9b100fa_1 + - 
xorg-libxau=1.0.12=h9b100fa_0 + - xorg-libxdmcp=1.1.5=h9b100fa_0 + - xorg-xorgproto=2024.1=h5eee18b_1 + - xz=5.6.4=h5eee18b_1 + - zlib=1.3.1=hb25bd0a_0 + - pip: + - absl-py==2.3.1 + - accelerate==1.12.0 + - aiohappyeyeballs==2.6.1 + - aiohttp==3.13.3 + - aiosignal==1.4.0 + - annotated-types==0.7.0 + - anyio==3.7.1 + - arxiv==2.4.0 + - attrs==25.4.0 + - bert-score==0.3.13 + - bitsandbytes==0.49.1 + - certifi==2026.1.4 + - charset-normalizer==3.4.4 + - click==8.3.1 + - contourpy==1.3.3 + - cut-cross-entropy==25.1.1 + - cycler==0.12.1 + - datasets==4.3.0 + - diffusers==0.36.0 + - dill==0.3.8 + - distro==1.9.0 + - docstring-parser==0.17.0 + - evaluate==0.4.6 + - faiss-cpu==1.7.4 + - fastapi==0.104.1 + - feedparser==6.0.12 + - filelock==3.20.3 + - fonttools==4.61.1 + - frozenlist==1.8.0 + - fsspec==2024.3.1 + - h11==0.16.0 + - hf-transfer==0.1.9 + - hf-xet==1.2.0 + - httpcore==1.0.9 + - httpx==0.28.1 + - huggingface-hub==0.36.0 + - idna==3.11 + - importlib-metadata==8.7.1 + - jinja2==3.1.6 + - jiter==0.12.0 + - joblib==1.5.3 + - kiwisolver==1.4.9 + - markdown-it-py==4.0.0 + - markupsafe==3.0.3 + - matplotlib==3.10.8 + - mdurl==0.1.2 + - mpmath==1.3.0 + - msgspec==0.20.0 + - multidict==6.7.0 + - multiprocess==0.70.16 + - networkx==3.6.1 + - nltk==3.9.2 + - numpy==2.4.1 + - nvidia-cublas-cu12==12.8.4.1 + - nvidia-cuda-cupti-cu12==12.8.90 + - nvidia-cuda-nvrtc-cu12==12.8.93 + - nvidia-cuda-runtime-cu12==12.8.90 + - nvidia-cudnn-cu12==9.10.2.21 + - nvidia-cufft-cu12==11.3.3.83 + - nvidia-cufile-cu12==1.13.1.3 + - nvidia-curand-cu12==10.3.9.90 + - nvidia-cusolver-cu12==11.7.3.90 + - nvidia-cusparse-cu12==12.5.8.93 + - nvidia-cusparselt-cu12==0.7.1 + - nvidia-nccl-cu12==2.27.5 + - nvidia-nvjitlink-cu12==12.8.93 + - nvidia-nvshmem-cu12==3.3.20 + - nvidia-nvtx-cu12==12.8.90 + - openai==2.15.0 + - packaging==25.0 + - pandas==2.3.3 + - peft==0.18.1 + - pillow==12.1.0 + - propcache==0.4.1 + - protobuf==3.20.3 + - psutil==7.2.1 + - pyarrow==22.0.0 + - pyarrow-hotfix==0.7 + - 
pydantic==2.5.0 + - pydantic-core==2.14.1 + - pygments==2.19.2 + - pymupdf==1.23.8 + - pymupdfb==1.23.7 + - pyparsing==3.3.1 + - python-dateutil==2.9.0.post0 + - pytz==2025.2 + - pyyaml==6.0.3 + - regex==2025.11.3 + - requests==2.32.5 + - rich==14.2.0 + - rouge-score==0.1.2 + - safetensors==0.7.0 + - scikit-learn==1.8.0 + - scipy==1.15.3 + - sentence-transformers==3.3.1 + - sentencepiece==0.2.1 + - sgmllib3k==1.0.0 + - six==1.17.0 + - sniffio==1.3.1 + - starlette==0.27.0 + - sympy==1.14.0 + - threadpoolctl==3.6.0 + - tokenizers==0.22.2 + - torch==2.9.1 + - torchao==0.15.0 + - torchvision==0.24.1 + - tqdm==4.67.1 + - transformers==4.57.3 + - triton==3.5.1 + - trl==0.24.0 + - typeguard==4.4.4 + - typing-extensions==4.15.0 + - tyro==1.0.4 + - tzdata==2025.3 + - unsloth==2026.1.2 + - unsloth-zoo==2026.1.2 + - urllib3==2.6.3 + - uvicorn==0.24.0 + - xformers==0.0.33.post2 + - xxhash==3.6.0 + - yarl==1.22.0 + - zipp==3.23.0 +prefix: /home/jovyan/miniconda3/envs/hw7_311 diff --git a/log_build_data.txt b/log_build_data.txt new file mode 100644 index 0000000..2eb3377 --- /dev/null +++ b/log_build_data.txt @@ -0,0 +1,110 @@ +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ python build_data.py +🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning. +🦥 Unsloth Zoo will now patch everything to make training faster! 
+=== PHASE 1-3: Data collection, summarization, preference dataset === + +Using existing PDFs from data/pdfs +Found 10 PDF files + Extracted abstract for 2601.05171v1: 1546 chars + Extracted abstract for 2601.05111v1: 1288 chars + Extracted abstract for 2601.05163v1: 884 chars + Extracted abstract for 2601.05104v1: 1280 chars + Extracted abstract for 2601.05106v1: 500 chars + Extracted abstract for 2601.05103v1: 500 chars + Extracted abstract for 2601.05192v1: 606 chars + Extracted abstract for 2601.05143v1: 1140 chars + Extracted abstract for 2601.05184v1: 1471 chars + Extracted abstract for 2601.05167v1: 1414 chars + +Generating summaries for training papers... +Loading LLaMA model with unsloth: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit +==((====))== Unsloth 2026.1.2: Fast Llama patching. Transformers: 4.57.3. + \\ /| inference-ai GPU cuda. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux. +O^O/ \_/ \ Torch: 2.9.1+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.5.1 +\ / Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False] + "-____-" Free license: http://github.com/unslothai/unsloth +Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored! + +=== Summarizing paper 1/10: 2601.05171v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper: + +**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems + +**Problem Statement:** Long-term ... +Summary 2 (figure-aware): Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions: + +**Background:** +The authors propose a novel framework, Inside Out, to addres ... + +=== Summarizing paper 2/10: 2601.05111v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** A Survey on Agent-as-a-Judge + +**Background:** The rapid advancement of Large Language Models (LLMs) has re ... 
+Summary 2 (figure-aware): The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which inclu ... + +=== Summarizing paper 3/10: 2601.05163v1 === +Summary 1 (generic): The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-dr ... +Summary 2 (figure-aware): The paper "DocDancer: Towards Agentic Document-Grounded Information Seeking" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilize ... + +=== Summarizing paper 4/10: 2601.05104v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness + +**Research Question:** D ... +Summary 2 (figure-aware): The research paper "Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness" explores the effect of emotional tone on human-AI interactions, specifically with the large langua ... + +=== Summarizing paper 5/10: 2601.05106v1 === +Summary 1 (generic): Here's a clear and concise summary of the research paper for a graduate student: + +**Title:** FusionRoute: A Robust and Effective Token-Level Multi-LLM Collaboration Framework + +**Background:** Large La ... +Summary 2 (figure-aware): The research paper "Paper 2601.05106v1" proposes a new framework called FusionRoute for achieving strong performance across diverse domains with a single general-purpose model, while remaining efficie ... 
+ +=== Summarizing paper 6/10: 2601.05103v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content + +**Authors ... +Summary 2 (figure-aware): The paper "Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content" proposes a new annotation framework, SOFT, that explicitly separates citation intent (why a ... + +=== Summarizing paper 7/10: 2601.05192v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation + +**Background:** Entity linkin ... +Summary 2 (figure-aware): The paper proposes a novel entity linking approach, called LELA, which stands for Large Language Model-based Entity Linking Approach with Zero-Shot Domain Adaptation. Entity linking is the task of map ... + +=== Summarizing paper 8/10: 2601.05143v1 === +Summary 1 (generic): Here is a clear and concise summary of the research paper for a graduate student: + +**Title:** A Lightweight and Explainable Vision–Language Framework for Crop Disease Visual Question Answering + +**Back ... +Summary 2 (figure-aware): The research paper presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence ... + +=== Summarizing paper 9/10: 2601.05184v1 === +Summary 1 (generic): The research paper "Self-Consuming Performative Loop for Large Language Models" explores the concept of a self-consuming performative loop (SCPL) in large language models (LLMs). The authors introduce ... +Summary 2 (figure-aware): This summary is based on the provided research paper excerpt, focusing on the concept of Self-Consuming Performative Loop (SCPL) in Large Language Models (LLMs). 
+ +**Introduction** + +The paper highlight ... + +=== Summarizing paper 10/10: 2601.05167v1 === +Summary 1 (generic): Here's a clear and concise summary of the research paper: + +**Title:** RelayLLM: Efficient Reasoning via Collaborative Decoding + +**Problem Statement:** Large Language Models (LLMs) are powerful but com ... +Summary 2 (figure-aware): The research paper "RelayLLM: Efficient Reasoning via Collaborative Decoding" proposes a novel framework for efficient reasoning via token-level collaborative decoding. The authors address the challen ... + +Building reward modeling dataset via ROUGE-based preference labels... +Wrote preference data to reward_data.jsonl (10 pairs). + +=== Data preparation complete! === \ No newline at end of file diff --git a/log_evaluate.txt b/log_evaluate.txt new file mode 100644 index 0000000..e69de29 diff --git a/log_run_evaluation.txt b/log_run_evaluation.txt new file mode 100644 index 0000000..c47b5b5 --- /dev/null +++ b/log_run_evaluation.txt @@ -0,0 +1,65 @@ +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ python run_evaluation.py + +=== PHASE 5-6: Evaluation and comparison === + +Evaluating 10 papers... +Loading reward model for evaluation... +Reward model loaded on device: cuda +Loaded tokenizer class: DebertaV2TokenizerFast +Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight'] +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
+ +=== Paper 2601.05171v1 === +Reward scores: S1 = -0.013054322451353073 | S2 = -0.023740937933325768 +ROUGE-L: S1 = 0.3350253807106599 | S2 = 0.285012285012285 +BERTScore F1: S1 = 0.8554288744926453 | S2 = 0.8534665107727051 + +=== Paper 2601.05111v1 === +Reward scores: S1 = -0.01309164334088564 | S2 = -0.040604256093502045 +ROUGE-L: S1 = 0.4879356568364611 | S2 = 0.3979057591623037 +BERTScore F1: S1 = 0.8874909281730652 | S2 = 0.8867952823638916 + +=== Paper 2601.05163v1 === +Reward scores: S1 = -0.02074364572763443 | S2 = -0.035873666405677795 +ROUGE-L: S1 = 0.36645962732919257 | S2 = 0.2568807339449541 +BERTScore F1: S1 = 0.8772771954536438 | S2 = 0.8586826920509338 + +=== Paper 2601.05104v1 === +Reward scores: S1 = -0.02505182847380638 | S2 = -0.0208128672093153 +ROUGE-L: S1 = 0.5531914893617021 | S2 = 0.6108108108108108 +BERTScore F1: S1 = 0.8919901251792908 | S2 = 0.9144341945648193 + +=== Paper 2601.05106v1 === +Reward scores: S1 = -0.017466777935624123 | S2 = -0.03491535037755966 +ROUGE-L: S1 = 0.27941176470588236 | S2 = 0.12781954887218047 +BERTScore F1: S1 = 0.8489883542060852 | S2 = 0.8219152092933655 + +=== Paper 2601.05103v1 === +Reward scores: S1 = -0.028803709894418716 | S2 = -0.025009701028466225 +ROUGE-L: S1 = 0.18450184501845018 | S2 = 0.1925925925925926 +BERTScore F1: S1 = 0.8260022401809692 | S2 = 0.8231779336929321 + +=== Paper 2601.05192v1 === +Reward scores: S1 = -0.01752193458378315 | S2 = -0.03325437754392624 +ROUGE-L: S1 = 0.25 | S2 = 0.21160409556313994 +BERTScore F1: S1 = 0.842767596244812 | S2 = 0.840757429599762 + +=== Paper 2601.05143v1 === +Reward scores: S1 = -0.026092007756233215 | S2 = -0.01505814865231514 +ROUGE-L: S1 = 0.3966480446927374 | S2 = 0.484149855907781 +BERTScore F1: S1 = 0.8710131645202637 | S2 = 0.8946103453636169 + +=== Paper 2601.05184v1 === +Reward scores: S1 = -0.03423023596405983 | S2 = -0.013444548472762108 +ROUGE-L: S1 = 0.42990654205607476 | S2 = 0.46335697399527187 +BERTScore F1: S1 = 0.8848257064819336 | 
S2 = 0.8704691529273987 + +=== Paper 2601.05167v1 === +Reward scores: S1 = -0.024731554090976715 | S2 = -0.020340878516435623 +ROUGE-L: S1 = 0.46115288220551376 | S2 = 0.5503685503685503 +BERTScore F1: S1 = 0.8741124272346497 | S2 = 0.9040818214416504 + +Saved detailed evaluation results to evaluation_results.json +You can now inspect where reward scores agree/disagree with ROUGE/BERTScore. + +=== Evaluation complete! === \ No newline at end of file diff --git a/log_train_reward.txt b/log_train_reward.txt new file mode 100644 index 0000000..3195b2d --- /dev/null +++ b/log_train_reward.txt @@ -0,0 +1,71 @@ +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ python train_reward.py + +=== PHASE 4: Reward model training === + +Loading reward model + tokenizer... +/home/jovyan/miniconda3/envs/hw7_311/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:566: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. + warnings.warn( +Loaded tokenizer type: DebertaV2TokenizerFast +Tokenizer vocab size: 128000 +Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight'] +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
+Loaded dataset with 10 examples +Dataset columns: ['chosen', 'rejected'] +First example: {'chosen': "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. **MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", 'rejected': "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. 
These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. **PersonaTree:** A globally maintained tree that captures the user's long-term profile, with"} +You passed `model_init_kwargs` to the `RewardConfig`, but your model is already instantiated. The `model_init_kwargs` will be ignored. +Filtering train >4096 tokens: 100%|████████████████████████████████████████| 10/10 [00:00<00:00, 1683.85 examples/s] +Training reward model... +The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}. 
+{'train_runtime': 12.6382, 'train_samples_per_second': 2.374, 'train_steps_per_second': 0.712, 'train_loss': 0.6904306411743164, 'num_tokens': 16299.0, 'min_reward': -0.10637749565972222, 'mean_reward': -0.03328132629394531, 'max_reward': 0.029715643988715276, 'accuracy': 0.4722222222222222, 'margin': 0.007103390163845486, 'epoch': 3.0} +100%|█████████████████████████████████████████████████████████████████████████████████| 9/9 [00:12<00:00, 1.40s/it] +Saving clean tokenizer from microsoft/deberta-v3-base... +tokenizer_config.json: 100%|██████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 421kB/s] +config.json: 100%|█████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 3.59MB/s] +spm.model: 100%|███████████████████████████████████████████████████████████████| 2.46M/2.46M [00:00<00:00, 4.20MB/s] +/home/jovyan/miniconda3/envs/hw7_311/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:566: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. + warnings.warn( + Copied: tokenizer_config.json + Copied: special_tokens_map.json + Copied: tokenizer.json +Cleaning tokenizer_config.json to remove SentencePiece contamination... + Removing contaminated field: vocab_type = spm + Removing contaminated field: sp_model_kwargs = {} +Reward model saved to reward_model +Tokenizer type: DebertaV2TokenizerFast +Verifying saved tokenizer... +The tokenizer you are loading from 'reward_model' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. +Verified tokenizer type: DebertaV2TokenizerFast + +=== Training complete! === +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ cat reward_model/config.json | head -20 +{ + "architectures": [ + "DebertaV2ForSequenceClassification" + ], + "attention_probs_dropout_prob": 0.1, + "bos_token_id": 1, + "dtype": "float32", + "eos_token_id": 2, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "id2label": { + "0": "LABEL_0" + }, + "initializer_range": 0.02, + "intermediate_size": 3072, + "label2id": { + "LABEL_0": 0 + }, + "layer_norm_eps": 1e-07, +(hw7_311) jovyan@deployment-1066-1595-1768256363806755879-6c4775555d-2rzxt:~/Homework8-Submission$ ls -lh reward_model/ +total 712M +-rw-rw-rw- 1 jovyan jovyan 1.3K Jan 12 16:23 README.md +drwxrwxrwx 2 jovyan jovyan 4.0K Jan 12 16:23 checkpoint-3 +drwxrwxrwx 2 jovyan jovyan 4.0K Jan 12 16:23 checkpoint-6 +drwxrwxrwx 2 jovyan jovyan 4.0K Jan 12 16:23 checkpoint-9 +-rw-rw-rw- 1 jovyan jovyan 964 Jan 12 16:23 config.json +-rw-rw-rw- 1 jovyan jovyan 704M Jan 12 16:23 model.safetensors +-rw-rw-rw- 1 jovyan jovyan 286 Jan 12 16:23 special_tokens_map.json +-rw-rw-rw- 1 jovyan jovyan 8.3M Jan 12 16:23 tokenizer.json +-rw-rw-rw- 1 jovyan jovyan 1.3K Jan 12 16:23 tokenizer_config.json \ No newline at end of file diff --git a/main.py b/main.py new file mode 100644 index 0000000..2775f05 --- /dev/null +++ b/main.py @@ -0,0 +1,165 @@ +#!/usr/bin/env python3 +""" +Main entry point for the paper summarization and reward modeling pipeline. + +This script orchestrates the complete workflow: +1. Download papers from ArXiv +2. Generate summaries using LLaMA 3.1 +3. Create preference dataset +4. Train reward model +5. 
Evaluate on new papers + +Usage: + python main.py --phase all + python main.py --phase build_data + python main.py --phase train_reward + python main.py --phase evaluate +""" +import argparse + +from data_utils import download_arxiv_pdfs, save_papers_to_json + + +# Configuration +ARXIV_QUERY = "cs.LG"  # Computer Science - Machine Learning +NUM_TRAIN_PAPERS = 10 +NUM_EVAL_PAPERS = 10 + +# File paths +PDF_DIR = "data/pdfs" +TRAIN_META_JSON = "train_papers.json" +EVAL_META_JSON = "eval_papers.json" +TRAIN_SUMMARY_JSON = "summaries_train.json" +EVAL_SUMMARY_JSON = "summaries_eval.json" +REWARD_JSONL = "reward_data.jsonl" +REWARD_MODEL_DIR = "reward_model" +RESULTS_JSON = "evaluation_results.json" + + +def build_data_phase(): + """ + Phase 1-3: Data collection, summarization, and preference dataset creation. + + This phase: + - Downloads training papers from ArXiv (or uses existing PDFs) + - Generates two summaries per paper using LLaMA + - Creates preference labels using ROUGE-L scores + """ + print("=== PHASE 1-3: Data collection, summarization, preference dataset ===\n") + + # Note: If enough PDFs already exist in data/pdfs, we skip downloading + # and load them directly; otherwise we download them from ArXiv first.
+ import os + if not os.path.exists(PDF_DIR) or len(os.listdir(PDF_DIR)) < NUM_TRAIN_PAPERS: + print(f"Downloading {NUM_TRAIN_PAPERS} training papers from ArXiv...") + train_papers = download_arxiv_pdfs(ARXIV_QUERY, NUM_TRAIN_PAPERS, PDF_DIR) + save_papers_to_json(train_papers, TRAIN_META_JSON) + else: + print(f"Using existing PDFs from {PDF_DIR}") + # Extract papers from existing PDFs + import glob + pdf_files = glob.glob(f"{PDF_DIR}/*.pdf")[:NUM_TRAIN_PAPERS] + print(f"Found {len(pdf_files)} PDF files") + + from data_utils import extract_text_and_figures, PaperData + train_papers = [] + for pdf_path in pdf_files: + arxiv_id = os.path.basename(pdf_path).replace(".pdf", "") + text, figure_captions = extract_text_and_figures(pdf_path) + # For existing PDFs, we don't have title/abstract from the ArXiv API, + # so we fall back to placeholder values derived from the filename + train_papers.append( + PaperData( + arxiv_id=arxiv_id, + title=f"Paper {arxiv_id}", + abstract="",  # not available without ArXiv metadata + text=text, + figure_captions=figure_captions, + ) + ) + save_papers_to_json(train_papers, TRAIN_META_JSON) + + # Generate summaries + print("\nGenerating summaries for training papers...") + from summarization import generate_summaries_for_papers + summary_records = generate_summaries_for_papers(train_papers, TRAIN_SUMMARY_JSON) + + # Build reward modeling data with chosen/rejected labels + print("\nBuilding reward modeling dataset via ROUGE-based preference labels...") + from reward_model import auto_label_preferences_with_rouge + auto_label_preferences_with_rouge(summary_records, REWARD_JSONL) + + +def train_reward_phase(): + """ + Phase 4: Train the reward model on preference data. + + This phase fine-tunes a DeBERTa-v3 model on the chosen/rejected summary pairs + so it learns to assign higher scores to better summaries.
+ """ + print("\n=== PHASE 4: Reward model training ===\n") + from reward_model import train_reward_model + train_reward_model(REWARD_JSONL, REWARD_MODEL_DIR) + + +def evaluate_phase(): + """ + Phase 5-6: Evaluation and comparison. + + This phase: + - Generates summaries for evaluation papers + - Scores them using the trained reward model + - Computes ROUGE and BERTScore + - Compares all metrics + """ + print("\n=== PHASE 5-6: Evaluation and comparison ===\n") + + # For evaluation, we could download new papers or use a held-out set + # For simplicity, we'll reuse the training papers here + # In a real scenario, you'd want a separate eval set + import os + import json + + if not os.path.exists(TRAIN_SUMMARY_JSON): + print("Error: No training summaries found. Run --phase build_data first.") + return + + # Load existing summaries + with open(TRAIN_SUMMARY_JSON, "r", encoding="utf-8") as f: + eval_summaries = json.load(f) + + print(f"Evaluating {len(eval_summaries)} papers...") + + # Evaluate summaries + from reward_model import evaluate_summaries + evaluate_summaries(eval_summaries, REWARD_MODEL_DIR, RESULTS_JSON) + + +def main(): + """Main entry point with command-line argument parsing.""" + parser = argparse.ArgumentParser( + description="Paper summarization + reward modeling pipeline (ArXiv + LLaMA + DeBERTa)." + ) + parser.add_argument( + "--phase", + choices=["build_data", "train_reward", "evaluate", "all"], + default="all", + help="Which phase to run.", + ) + args = parser.parse_args() + + if args.phase in ("build_data", "all"): + build_data_phase() + + if args.phase in ("train_reward", "all"): + train_reward_phase() + + if args.phase in ("evaluate", "all"): + evaluate_phase() + + print("\n=== Pipeline complete! 
===") + + +if __name__ == "__main__": + main() diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..c0f3b41 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,125 @@ +absl-py==2.3.1 +accelerate==1.12.0 +aiohappyeyeballs==2.6.1 +aiohttp==3.13.3 +aiosignal==1.4.0 +annotated-types==0.7.0 +anyio==3.7.1 +arxiv==2.4.0 +attrs==25.4.0 +bert-score==0.3.13 +bitsandbytes==0.49.1 +certifi==2026.1.4 +charset-normalizer==3.4.4 +click==8.3.1 +contourpy==1.3.3 +cut-cross-entropy==25.1.1 +cycler==0.12.1 +datasets==4.3.0 +diffusers==0.36.0 +dill==0.3.8 +distro==1.9.0 +docstring_parser==0.17.0 +evaluate==0.4.6 +faiss-cpu==1.7.4 +fastapi==0.104.1 +feedparser==6.0.12 +filelock==3.20.3 +fonttools==4.61.1 +frozenlist==1.8.0 +fsspec==2024.3.1 +h11==0.16.0 +hf_transfer==0.1.9 +hf-xet==1.2.0 +httpcore==1.0.9 +httpx==0.28.1 +huggingface-hub==0.36.0 +idna==3.11 +importlib_metadata==8.7.1 +Jinja2==3.1.6 +jiter==0.12.0 +joblib==1.5.3 +kiwisolver==1.4.9 +markdown-it-py==4.0.0 +MarkupSafe==3.0.3 +matplotlib==3.10.8 +mdurl==0.1.2 +mpmath==1.3.0 +msgspec==0.20.0 +multidict==6.7.0 +multiprocess==0.70.16 +networkx==3.6.1 +nltk==3.9.2 +numpy==2.4.1 +nvidia-cublas-cu12==12.8.4.1 +nvidia-cuda-cupti-cu12==12.8.90 +nvidia-cuda-nvrtc-cu12==12.8.93 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cudnn-cu12==9.10.2.21 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cufile-cu12==1.13.1.3 +nvidia-curand-cu12==10.3.9.90 +nvidia-cusolver-cu12==11.7.3.90 +nvidia-cusparse-cu12==12.5.8.93 +nvidia-cusparselt-cu12==0.7.1 +nvidia-nccl-cu12==2.27.5 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-nvshmem-cu12==3.3.20 +nvidia-nvtx-cu12==12.8.90 +openai==2.15.0 +packaging==25.0 +pandas==2.3.3 +peft==0.18.1 +pillow==12.1.0 +pip==25.3 +propcache==0.4.1 +protobuf==3.20.3 +psutil==7.2.1 +pyarrow==22.0.0 +pyarrow-hotfix==0.7 +pydantic==2.5.0 +pydantic_core==2.14.1 +Pygments==2.19.2 +PyMuPDF==1.23.8 +PyMuPDFb==1.23.7 +pyparsing==3.3.1 +python-dateutil==2.9.0.post0 +pytz==2025.2 +PyYAML==6.0.3 +regex==2025.11.3 
+requests==2.32.5 +rich==14.2.0 +rouge_score==0.1.2 +safetensors==0.7.0 +scikit-learn==1.8.0 +scipy==1.15.3 +sentence-transformers==3.3.1 +sentencepiece==0.2.1 +setuptools==80.9.0 +sgmllib3k==1.0.0 +six==1.17.0 +sniffio==1.3.1 +starlette==0.27.0 +sympy==1.14.0 +threadpoolctl==3.6.0 +tokenizers==0.22.2 +torch==2.9.1 +torchao==0.15.0 +torchvision==0.24.1 +tqdm==4.67.1 +transformers==4.57.3 +triton==3.5.1 +trl==0.24.0 +typeguard==4.4.4 +typing_extensions==4.15.0 +tyro==1.0.4 +tzdata==2025.3 +unsloth==2026.1.2 +unsloth_zoo==2026.1.2 +urllib3==2.6.3 +uvicorn==0.24.0 +wheel==0.45.1 +xformers==0.0.33.post2 +xxhash==3.6.0 +yarl==1.22.0 +zipp==3.23.0 \ No newline at end of file diff --git a/reward_data.jsonl b/reward_data.jsonl new file mode 100644 index 0000000..8ffa534 --- /dev/null +++ b/reward_data.jsonl @@ -0,0 +1,10 @@ +{"chosen": "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. 
**MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", "rejected": "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. 
**PersonaTree:** A globally maintained tree that captures the user's long-term profile, with"} +{"chosen": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Survey on Agent-as-a-Judge\n\n**Background:** The rapid advancement of Large Language Models (LLMs) has revolutionized AI evaluation, enabling scalable and near-human quality assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n\n**Problem Statement:** The limitations of LLM-as-a-Judge have led to the transition to Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Contributions:**\n\n1. **Characterizing the shift from LLM-as-a-Judge to Agent-as-a-Judge**: The authors identify and summarize the agentic judges' development trend into three progressive stages with distinct levels of autonomy and adaptability.\n2. **Organizing core methodologies**: The authors categorize the core methodologies into five key parts according to the agent's abilities and survey their applications across general and professional domains.\n3. **Analyzing frontier challenges and identifying promising research directions**:", "rejected": "The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which include inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. 
This has led to the development of Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Points:**\n\n1. LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments, but its reliability is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n2. Agent-as-a-Judge has emerged as a more robust and verifiable alternative, leveraging planning, tool-augmented verification, multi-agent collaboration, and persistent memory.\n3. The authors present the first comprehensive survey on Agent-as-a-Judge, summarizing the shift from LLM-as-a-Judge to Agent-as-a-Judge and organizing core methodologies into five key parts according to agent's abilities.\n4. The authors survey applications across general and professional domains and analyze frontier challenges and identify"} +{"chosen": "The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To address the scarcity of high-quality training data for DocQA, they introduce an Exploration-then-Synthesis data synthesis pipeline that progressively enhances QA pairs from easy to hard.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. 
This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\n\nThe paper conducts extensive experiments on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, demonstrating the effectiveness", "rejected": "The paper \"DocDancer: Towards Agentic Document-Grounded Information Seeking\" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilizes an end-to-end trained open-source Doc agent. The authors introduce a tool-driven agent framework that explicitly models document exploration and comprehension, addressing the scarcity of high-quality training data for DocQA.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe paper also proposes an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. 
This pipeline first explores a source document through intent-guided, tool-augmented interactions to collect grounded evidence and then synthesizes high-quality document-grounded QA pairs via multi-observation reasoning.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset using two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance,"} +{"chosen": "The research paper \"Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\" explores the effect of emotional tone on human-AI interactions, specifically with the large language model ChatGPT (GPT-4.0). The study conducted a between-subject experiment where participants interacted with ChatGPT on two tasks: writing a public response and addressing an ethical dilemma.\n\n**Key findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to interactions where participants maintained a neutral tone.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers relative to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. The study also found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses.\n\n**Figures:**\n\n* **Figure 1:** Improvement in ChatGPT's answers based on conditions. 
The figure shows that ChatGPT improved its answers when", "rejected": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\n\n**Research Question:** Does the emotional tone of user-generated prompts influence the quality and content of ChatGPT's responses, and do these effects carry over into human-human communication?\n\n**Methodology:** The researchers conducted a between-subject experiment where participants interacted with ChatGPT-4.0 on two tasks: writing a public response and addressing an ethical dilemma. Participants were randomly assigned to one of three conditions: praise, blame, or anger.\n\n**Findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to the neutral condition.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers compared to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. Participants used more negative, hostile, and disappointing expressions in human-human communication after interactions where they blamed rather than praised ChatGPT.\n\n**Conclusion:**"} +{"chosen": "Here's a clear and concise summary of the research paper for a graduate student:\n\n**Title:** FusionRoute: A Robust and Effective Token-Level Multi-LLM Collaboration Framework\n\n**Background:** Large Language Models (LLMs) have shown strong performance across various domains, but achieving strong performance across these domains with a single general-purpose model is often expensive and inefficient. 
Smaller domain-specialized models are more efficient but struggle to generalize beyond their training distributions.\n\n**Problem Statement:** The authors aim to develop a framework that can robustly coordinate and leverage the complementary expertise of multiple specialized LLMs to achieve general-purpose performance.\n\n**Proposed Solution:** The authors propose FusionRoute, a token-level multi-LLM collaboration framework that consists of a lightweight router and a trainable complementary generator. The router selects the most suitable expert at each decoding step, and the complementary generator refines or corrects the selected expert's next-token distribution via logit addition.\n\n**Key Contributions:**\n\n1. **Theoretical Analysis:** The authors provide a theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot realize the optimal decoding policy unless strong global coverage assumptions hold.\n2. **Empirical Evaluation:** FusionRoute outperforms sequence- and token-level collaboration, model merging, and direct fine-t", "rejected": "The research paper \"Paper 2601.05106v1\" proposes a new framework called FusionRoute for achieving strong performance across diverse domains with a single general-purpose model, while remaining efficient and competitive with domain experts. The authors argue that existing methods, such as sequence-level collaboration, model merging, and direct fine-tuning, have limitations.\n\nThe paper's main contributions are:\n\n1. **FusionRoute**: A robust and effective token-level multi-LLM collaboration framework that selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition.\n2. 
**Theoretical analysis**: A theoretical analysis showing that pure expert-only routing is fundamentally limited and cannot in general realize the optimal decoding policy unless strong global coverage assumptions hold.\n3. **Empirical results**: Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.\n\nThe paper highlights the importance of token-level collaboration, which allows multiple LLMs to jointly produce each token during generation, making it a more fine-grained and dynamic"} +{"chosen": "The paper \"Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\" proposes a new annotation framework, SOFT, that explicitly separates citation intent (why a work is cited) from cited content type (what part is cited). The authors argue that existing frameworks conflate these two dimensions, hindering their effectiveness in auto classification. SOFT is grounded in the principles of semantic roles, offering a more nuanced understanding of how authors structure argumentation.\n\nThe paper introduces two key dimensions: Cited Content Type and Citation Intent. Cited Content Type refers to the specific part of a work that is being cited (e.g., method, result, or background), while Citation Intent refers to the functional purpose of the citation (e.g., to support, to critique, or to provide background). This distinction enables clearer definitions, improved annotation consistency, and stronger support for Large Language Model (LLM)-based classification.\n\nThe authors validate SOFT through a comprehensive re-annotation of the ACL-ARC dataset and a cross-domain evaluation on ACT2, covering 19 scientific fields. 
Results show that SOFT yields higher inter-model and human-LLM agreement, improves classification accuracy for both zero-shot and fine-tuned models, and generalizes more robustly across domains than existing", "rejected": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content\n\n**Authors:** Changxu Duan and Zhiyin Tan\n\n**Summary:** The authors propose a new framework, SOFT, for citation classification that explicitly separates the intent of a citation (why a work is cited) from its content type (what part is cited). This framework is grounded in semantic role theory and is designed to improve the clarity, consistency, and generalizability of citation classification.\n\n**Background:** Traditional citation metrics, such as counts or h-index, are limited in their ability to capture the nuanced ways in which works are cited, interpreted, or contested. Citation classification has emerged as a critical task in computational bibliometrics and information science to unlock the meaning within scholarly dialogue.\n\n**Limitations of existing frameworks:** Current citation classification practices face persistent limitations, including the conflation of distinct citation intents and content types, and the lack of consistent semantic definitions. These ambiguities hinder downstream applications such as citation-based retrieval and impact analysis.\n\n**SOFT framework:** The authors propose a two-dimensional framework that explicitly distinguishes the predicative intent of a citation from its cited content type. 
This framework"} +{"chosen": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation\n\n**Background:** Entity linking is a fundamental task in natural language processing that involves mapping ambiguous mentions in text to entities in a knowledge base. Traditional entity linking methods require fine-tuning on the target domain and knowledge base, which can be computationally expensive and require high-quality labeled data.\n\n**Problem Statement:** The authors propose a true zero-shot entity linking approach, where the method can work out of the box for any domain without the need for fine-tuning.\n\n**Method:** The proposed approach, LELA, is a coarse-to-fine method that leverages the capabilities of large language models (LLMs). It consists of three stages:\n\n1. **Initial entity-candidate set generation**: LELA generates a set of potential entities based on the input text.\n2. **Pointwise re-ranking**: The top-k candidates are selected through re-ranking.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final selection through reasoning over the context and entity descriptions.\n\n**Advantages:** LELA is modular, LLM-, reranker-, and retrie", "rejected": "The paper proposes a novel entity linking approach, called LELA, which stands for Large Language Model-based Entity Linking Approach with Zero-Shot Domain Adaptation. Entity linking is the task of mapping ambiguous mentions in text to entities in a knowledge base (KB). The authors aim to develop a true zero-shot entity linking method, which means it works without the need for fine-tuning, and can handle different domains, knowledge bases, and large language models (LLMs).\n\nThe proposed approach, LELA, is a coarse-to-fine method that consists of three stages:\n\n1. 
**Initial entity-candidate set generation**: LELA generates an initial set of candidate entities for the ambiguous mention.\n2. **Pointwise re-ranking**: The top-k candidates are selected through pointwise re-ranking, where the model calculates a score for each candidate entity based on its relevance to the context and entity descriptions.\n3. **Final candidate selection**: The selected candidates are passed to the LLM for final candidate selection, where the model uses reasoning over the context and entity descriptions to select the most relevant entity.\n\nLELA is modular, agnostic to the LLM, reranker, and retriever, making it applicable to both open-source and proprietary models.\n\nThe authors"} +{"chosen": "The research paper presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders, and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Findings:**\n\n1. **High Accuracy for Crop and Disease Identification:** The model achieves high accuracy for both crop and disease identification, demonstrating its effectiveness in visual question answering for crop disease analysis.\n2. **Strong Performance on BLEU, ROUGE, and BERTScore:** The framework also achieves strong performance on these metrics, indicating its ability to generate high-quality language output.\n3. **Explainability:** Explainability is assessed using Grad-CAM and token-level attribution, demonstrating the model's ability to provide transparent and interpretable results.\n4. **Robust Performance under Diverse User-Driven Queries:** Qualitative results demonstrate robust performance under diverse user-driven queries, highlighting the model's ability to handle real-world scenarios.\n\n**Key Contributions:**\n\n1. 
**Lightweight Vision-Language Framework:** The proposed framework is lightweight, making it suitable for real-world deployment in agricultural environments.\n2. **Task-Specific Visual Pretraining:** The study highlights the effectiveness of task", "rejected": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Lightweight and Explainable Vision–Language Framework for Crop Disease Visual Question Answering\n\n**Background:** Crop disease diagnosis is a critical task in modern agriculture, and early identification of disease symptoms is essential to reduce damage and support effective intervention. However, traditional methods rely on expert inspection and recommendations, which can be time-consuming and limited in regions with expert access.\n\n**Problem Statement:** Existing automated disease detection methods often require controlled imaging conditions, are computationally heavy, and fail to provide detailed textual descriptions of visual attributes, disease progression stages, and contextual information.\n\n**Research Question:** Can a lightweight Visual Question Answering (VQA) framework be established for intelligent and practical plant disease identification?\n\n**Methodology:** The proposed framework combines a Swin Transformer vision encoder with sequence-to-sequence language decoders and adopts a two-stage training strategy to improve visual representation learning and cross-modal alignment.\n\n**Key Contributions:**\n\n1. A lightweight VQA framework for crop disease identification that achieves high accuracy and strong performance on BLEU, ROUGE, and BERTScore metrics.\n2. 
Explainability is assessed using Grad-CAM and token-level attribution, demonstrating robust performance under diverse user-driven queries.\n3."} +{"chosen": "This summary is based on the provided research paper excerpt, focusing on the concept of Self-Consuming Performative Loop (SCPL) in Large Language Models (LLMs).\n\n**Introduction**\n\nThe paper highlights the growing interest in using synthetic data to train future LLMs, which leads to a self-consuming retraining loop where models are trained on their own outputs and may cause performance drops and induce emerging biases. This self-consuming performative loop is particularly relevant in dynamic systems driven by user feedback, where the model's performance influences future data.\n\n**Research Questions**\n\nThe authors aim to investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. They focus on two types of loops:\n\n1. The typical retraining setting\n2. The incremental fine-tuning setting, which is largely underexplored\n\n**Key Findings**\n\nThe authors' experiments on three real-world tasks reveal that:\n\n* The performative loop increases preference bias and decreases disparate bias.\n* Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n* Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n* Performative feedback accelerates bias amplification in incremental", "rejected": "The research paper \"Self-Consuming Performative Loop for Large Language Models\" explores the concept of a self-consuming performative loop (SCPL) in large language models (LLMs). The authors introduce the idea of a dynamic system driven by user feedback, where previously deployed LLMs influence the data they generate, leading to a performative prediction. 
They investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback.\n\nThe paper focuses on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, the authors find that the performative loop increases preference bias and decreases disparate bias. They design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.\n\nKey findings:\n\n1. Iterative fine-tuning/retraining with self-generated data amplifies preference bias and degrades generation quality over time.\n2. Disparate bias tends to decrease, suggesting a convergence of performance across groups.\n3. Performative feedback accelerates bias amplification in incremental fine-tuning loops, whereas this phenomenon is less obvious in retraining loops.\n\nThe authors' research provides insights that inform safer deployment of"} +{"chosen": "The research paper \"RelayLLM: Efficient Reasoning via Collaborative Decoding\" proposes a novel framework for efficient reasoning via token-level collaborative decoding. The authors address the challenge of deploying Large Language Models (LLMs) for complex reasoning, which is often hindered by high computational costs and latency, while Small Language Models (SLMs) typically lack the necessary reasoning capacity.\n\nThe authors present RelayLLM, a framework that empowers the SLM to act as an active controller, dynamically invoking the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process. 
They introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking.\n\nThe empirical results on six benchmarks demonstrate the effectiveness of RelayLLM, achieving an average accuracy of 49.52% and reducing the performance gap between the small model and the large one. The authors also show that RelayLLM invokes the large model for only 1.07% of the total generated tokens, resulting in a 98.2% reduction in token costs compared to a performance-matched router.\n\nKey findings:\n\n1. **Efficient reasoning via token-level collaborative decoding**: RelayLLM emp", "rejected": "Here's a clear and concise summary of the research paper:\n\n**Title:** RelayLLM: Efficient Reasoning via Collaborative Decoding\n\n**Problem Statement:** Large Language Models (LLMs) are powerful but computationally expensive, while Small Language Models (SLMs) are efficient but lack reasoning capacity. Existing collaborative approaches, such as cascading or routing, often operate at a coarse granularity, leading to significant computational waste.\n\n**Proposed Solution:** RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike static routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively \"relaying\" the generation process.\n\n**Key Components:**\n\n1. **Two-stage training framework:** Supervised warm-up phase to teach the model the syntactic structure of calling commands, followed by a reinforcement learning stage using Group Relative Policy Optimization (GRPO) training.\n2. **Context-aware reward:** Designed to balance independence with necessary help-seeking, penalizing both wasted costs and avoidable errors.\n3. 
**Token-level collaborative decoding:** The SLM generates a special command token () to pause its own generation and invoke the LLM for a specified number of tokens"} diff --git a/reward_model.py b/reward_model.py new file mode 100644 index 0000000..df43488 --- /dev/null +++ b/reward_model.py @@ -0,0 +1,443 @@ +""" +Reward model training and evaluation utilities. +""" +import json +from typing import List, Dict, Any + +import torch +from datasets import load_dataset, Dataset +from transformers import ( + AutoTokenizer, + AutoModelForSequenceClassification, + TrainingArguments, +) +from trl import RewardTrainer +import evaluate + + +# Configuration +REWARD_MODEL_NAME = "microsoft/deberta-v3-base" +REWARD_NUM_EPOCHS = 3 +REWARD_BATCH_SIZE = 4 + +# Device detection: CUDA > MPS > CPU +if torch.cuda.is_available(): + LLAMA_DEVICE = "cuda" +elif torch.backends.mps.is_available(): + LLAMA_DEVICE = "mps" +else: + LLAMA_DEVICE = "cpu" + + +def auto_label_preferences_with_rouge( + summary_records: List[Dict[str, Any]], + output_path: str = "reward_data.jsonl" +) -> None: + """ + Use ROUGE-L against the abstract as a weak preference label. + + This automatically creates preference pairs where: + - chosen = summary with higher ROUGE-L vs abstract + - rejected = the other summary + + Writes reward_data.jsonl with 'chosen' and 'rejected' fields suitable for RewardTrainer. 
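+
+    Illustrative example (hypothetical records; the actual texts are elided):
+
+        >>> records = [
+        ...     {"abstract": "...", "summary_1": "...", "summary_2": "..."},
+        ... ]
+        >>> auto_label_preferences_with_rouge(records, "reward_data.jsonl")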
+ + Args: + summary_records: List of dictionaries with 'abstract', 'summary_1', 'summary_2' + output_path: Path to save the JSONL file + """ + rouge = evaluate.load("rouge") + data_for_jsonl = [] + + for rec in summary_records: + ref = rec["abstract"] + s1 = rec["summary_1"] + s2 = rec["summary_2"] + + # Compute per-summary ROUGE-L scores + s1_score = rouge.compute(predictions=[s1], references=[ref])["rougeL"] + s2_score = rouge.compute(predictions=[s2], references=[ref])["rougeL"] + + if s1_score >= s2_score: + chosen, rejected = s1, s2 + else: + chosen, rejected = s2, s1 + + data_for_jsonl.append( + { + "chosen": chosen, + "rejected": rejected, + } + ) + + with open(output_path, "w", encoding="utf-8") as f: + for item in data_for_jsonl: + f.write(json.dumps(item, ensure_ascii=False) + "\n") + + print(f"Wrote preference data to {output_path} ({len(data_for_jsonl)} pairs).") + + +def load_reward_dataset(path: str) -> Dataset: + """ + Load reward modeling dataset from JSONL file. + + Args: + path: Path to JSONL file with 'chosen' and 'rejected' fields + + Returns: + HuggingFace Dataset object + """ + dataset = load_dataset("json", data_files=path, split="train") + return dataset + + +class ModelWrapper(torch.nn.Module): + """Wrapper to filter out unsupported arguments for encoder models.""" + def __init__(self, model): + super().__init__() + self.model = model + + def forward(self, **kwargs): + # Remove arguments not supported by encoder models + kwargs.pop('use_cache', None) + return self.model(**kwargs) + + def __getattr__(self, name): + try: + return super().__getattr__(name) + except AttributeError: + return getattr(self.model, name) + + +def train_reward_model(data_path: str = "reward_data.jsonl", output_dir: str = "reward_model"): + """ + Train a reward model on preference data using TRL's RewardTrainer. + + The model learns to assign higher scores to preferred summaries (chosen) + and lower scores to rejected summaries. 
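+
+    Usage sketch (illustrative; assumes reward_data.jsonl already exists):
+
+        >>> train_reward_model(
+        ...     data_path="reward_data.jsonl",
+        ...     output_dir="reward_model",
+        ... )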
+ + Args: + data_path: Path to JSONL file with preference pairs + output_dir: Directory to save the trained model + """ + print("Loading reward model + tokenizer...") + # Load tokenizer and store a clean copy for saving later + tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL_NAME) + print(f"Loaded tokenizer type: {tokenizer.__class__.__name__}") + print(f"Tokenizer vocab size: {tokenizer.vocab_size}") + # Keep a reference to the original tokenizer before training + original_tokenizer_path = REWARD_MODEL_NAME + + base_model = AutoModelForSequenceClassification.from_pretrained( + REWARD_MODEL_NAME, + num_labels=1, + ) + + # Wrap model to filter unsupported arguments + model = ModelWrapper(base_model) + + dataset = load_reward_dataset(data_path) + print(f"Loaded dataset with {len(dataset)} examples") + print(f"Dataset columns: {dataset.column_names}") + if len(dataset) > 0: + print(f"First example: {dataset[0]}") + + # RewardTrainer handles tokenization internally, so we just pass the text fields + + training_args = TrainingArguments( + output_dir=output_dir, + per_device_train_batch_size=REWARD_BATCH_SIZE, + num_train_epochs=REWARD_NUM_EPOCHS, + eval_strategy="no", + save_strategy="epoch", + logging_steps=10, + fp16=False, + bf16=torch.cuda.is_available(), + remove_unused_columns=False, + ) + + # Add attributes required by TRL RewardTrainer (from RewardConfig) + training_args.model_init_kwargs = {} + training_args.eos_token = None + training_args.pad_token = None + training_args.max_length = 4096 + training_args.chat_template_path = None + training_args.disable_dropout = False + training_args.pad_to_multiple_of = None + training_args.dataset_num_proc = None + training_args.center_rewards_coefficient = None + training_args.activation_offloading = False + + trainer = RewardTrainer( + model=model, + args=training_args, + train_dataset=dataset, + processing_class=tokenizer, + ) + + print("Training reward model...") + trainer.train() + + # Save the base model 
(unwrapped)
+    base_model.save_pretrained(output_dir)
+
+    # Save the original tokenizer (reload fresh to avoid contamination)
+    print(f"Saving clean tokenizer from {original_tokenizer_path}...")
+    import os
+    import shutil
+
+    # Use a temporary directory to get clean tokenizer files
+    temp_dir = f"{output_dir}_temp_tokenizer"
+    os.makedirs(temp_dir, exist_ok=True)
+
+    # Download fresh tokenizer to temp directory
+    clean_tokenizer = AutoTokenizer.from_pretrained(original_tokenizer_path, cache_dir=temp_dir)
+
+    # Save to temp first
+    clean_tokenizer.save_pretrained(temp_dir)
+
+    # Copy only the essential tokenizer files (not any contaminated metadata)
+    essential_files = [
+        "tokenizer_config.json",
+        "vocab.txt",  # DeBERTa uses vocab.txt
+        "special_tokens_map.json",
+        "tokenizer.json",
+    ]
+
+    for filename in essential_files:
+        src = os.path.join(temp_dir, filename)
+        dst = os.path.join(output_dir, filename)
+        if os.path.exists(src):
+            shutil.copy2(src, dst)
+            print(f"  Copied: {filename}")
+
+    # Clean up temp directory
+    shutil.rmtree(temp_dir, ignore_errors=True)
+
+    # CRITICAL FIX: Clean up tokenizer_config.json to remove SentencePiece contamination
+    tokenizer_config_path = os.path.join(output_dir, "tokenizer_config.json")
+    if os.path.exists(tokenizer_config_path):
+        print("Cleaning tokenizer_config.json to remove SentencePiece contamination...")
+        with open(tokenizer_config_path, 'r') as f:
+            config = json.load(f)
+
+        # Remove SentencePiece-related fields that cause Mistral warnings
+        contaminated_fields = ["vocab_type", "sp_model_kwargs"]
+        for field in contaminated_fields:
+            if field in config:
+                print(f"  Removing contaminated field: {field} = {config[field]}")
+                del config[field]
+
+        # Write back the cleaned config
+        with open(tokenizer_config_path, 'w') as f:
+            json.dump(config, f, indent=2)
+
+    # CRITICAL: Remove any contaminating files that shouldn't be in DeBERTa tokenizer
+    contaminating_files = [
+        "spm.model",  # SentencePiece (LLaMA/Mistral)
+        "merges.txt",  # BPE (GPT-2/Mistral)
+        "vocab.json",  # BPE vocab (GPT-2/Mistral)
+    ]
+
+    for filename in contaminating_files:
+        filepath = os.path.join(output_dir, filename)
+        if os.path.exists(filepath):
+            print(f"WARNING: Removing contaminating file: {filepath}")
+            os.remove(filepath)
+
+    print(f"Reward model saved to {output_dir}")
+    print(f"Tokenizer type: {clean_tokenizer.__class__.__name__}")
+
+    # Verify the saved tokenizer can be loaded correctly
+    print("Verifying saved tokenizer...")
+    verification_tokenizer = AutoTokenizer.from_pretrained(output_dir)
+    print(f"Verified tokenizer type: {verification_tokenizer.__class__.__name__}")
+
+
+def score_summaries_with_reward_model(
+    summaries: List[str],
+    reward_model,
+    reward_tokenizer,
+) -> List[float]:
+    """
+    Compute scalar reward scores for each summary.
+
+    Args:
+        summaries: List of summary texts
+        reward_model: Trained reward model
+        reward_tokenizer: Tokenizer for the reward model
+
+    Returns:
+        List of reward scores (one per summary)
+    """
+    reward_model.eval()
+    scores = []
+
+    # Get the device of the model
+    model_device = next(reward_model.parameters()).device
+
+    for s in summaries:
+        inputs = reward_tokenizer(
+            s,
+            return_tensors="pt",
+            truncation=True,
+            padding="max_length",
+            max_length=4096,
+        ).to(model_device)
+
+        with torch.no_grad():
+            out = reward_model(**inputs)
+            # out.logits shape: [batch, 1]
+            score = out.logits.squeeze().item()
+            scores.append(score)
+
+    return scores
+
+
+def evaluate_summaries(
+    summary_records: List[Dict[str, Any]],
+    reward_model_dir: str = "reward_model",
+    output_path: str = "evaluation_results.json"
+) -> List[Dict[str, Any]]:
+    """
+    Evaluate summaries using ROUGE, BERTScore, and the trained reward model.
+ + Args: + summary_records: List of dicts with 'arxiv_id', 'title', 'abstract', 'summary_1', 'summary_2' + reward_model_dir: Directory containing the trained reward model + output_path: Path to save evaluation results + + Returns: + List of evaluation results for each paper + """ + # Load reward model + print("Loading reward model for evaluation...") + + # Suppress false Mistral warnings by cleaning tokenizer config before loading + import os + tokenizer_config_path = os.path.join(reward_model_dir, "tokenizer_config.json") + if os.path.exists(tokenizer_config_path): + with open(tokenizer_config_path, 'r') as f: + config = json.load(f) + + # Remove SentencePiece contamination if present + if "vocab_type" in config or "sp_model_kwargs" in config: + print("Cleaning tokenizer config before loading...") + config.pop("vocab_type", None) + config.pop("sp_model_kwargs", None) + + with open(tokenizer_config_path, 'w') as f: + json.dump(config, f, indent=2) + + # Suppress the false Mistral warning from transformers logger + import logging + import warnings + + # Temporarily suppress ALL transformers warnings during tokenizer load + logging.getLogger("transformers").setLevel(logging.ERROR) + original_transformers_level = logging.getLogger("transformers.tokenization_utils_base").level + logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR) + + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", message=".*incorrect regex pattern.*") + warnings.filterwarnings("ignore", message=".*Mistral.*") + + reward_tokenizer = AutoTokenizer.from_pretrained( + reward_model_dir, + trust_remote_code=False, + use_fast=True + ) + reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_dir) + + # Restore original logging levels + logging.getLogger("transformers.tokenization_utils_base").setLevel(original_transformers_level) + logging.getLogger("transformers").setLevel(logging.WARNING) + + reward_model.to(LLAMA_DEVICE) + 
print(f"Reward model loaded on device: {LLAMA_DEVICE}") + print(f"Loaded tokenizer class: {reward_tokenizer.__class__.__name__}") + + # Load evaluation metrics + rouge = evaluate.load("rouge") + bertscore = evaluate.load("bertscore") + + results = [] + + for rec in summary_records: + ref = rec.get("abstract", "") + s1 = rec["summary_1"] + s2 = rec["summary_2"] + + # Check if abstract is empty - if so, skip ROUGE/BERTScore or use fallback + if not ref or ref.strip() == "": + print(f"\nWarning: Empty abstract for {rec['arxiv_id']}, using first 500 chars of text as reference") + # Use first 500 characters of paper text as fallback reference + ref = rec.get("text", "")[:500] if rec.get("text") else "" + + # Only compute ROUGE/BERTScore if we have a valid reference + if ref and ref.strip(): + # ROUGE + r1 = rouge.compute(predictions=[s1], references=[ref]) + r2 = rouge.compute(predictions=[s2], references=[ref]) + + # BERTScore + b1 = bertscore.compute( + predictions=[s1], + references=[ref], + lang="en", + ) + b2 = bertscore.compute( + predictions=[s2], + references=[ref], + lang="en", + ) + else: + # No valid reference available + print(f"Warning: No valid reference for {rec['arxiv_id']}, skipping ROUGE/BERTScore") + r1 = {"rouge1": None, "rouge2": None, "rougeL": None} + r2 = {"rouge1": None, "rouge2": None, "rougeL": None} + b1 = {"precision": [None], "recall": [None], "f1": [None]} + b2 = {"precision": [None], "recall": [None], "f1": [None]} + + # Reward model scores + scores = score_summaries_with_reward_model( + [s1, s2], + reward_model, + reward_tokenizer, + ) + rm1, rm2 = scores + + result_entry = { + "arxiv_id": rec["arxiv_id"], + "title": rec["title"], + "abstract": ref, + "summary_1": s1, + "summary_2": s2, + "rouge_1": r1, + "rouge_2": r2, + "bertscore_1": b1, + "bertscore_2": b2, + "reward_1": rm1, + "reward_2": rm2, + } + results.append(result_entry) + + print("\n=== Paper", rec["arxiv_id"], "===") + print("Reward scores: S1 =", rm1, " | S2 =", rm2) + if 
r1["rougeL"] is not None: + print("ROUGE-L: S1 =", r1["rougeL"], " | S2 =", r2["rougeL"]) + print("BERTScore F1: S1 =", + b1["f1"][0], + "| S2 =", + b2["f1"][0]) + else: + print("ROUGE-L: N/A (no valid reference)") + print("BERTScore F1: N/A (no valid reference)") + + with open(output_path, "w", encoding="utf-8") as f: + json.dump(results, f, ensure_ascii=False, indent=2) + + print(f"\nSaved detailed evaluation results to {output_path}") + print("You can now inspect where reward scores agree/disagree with ROUGE/BERTScore.") + + return results diff --git a/run_evaluation.py b/run_evaluation.py new file mode 100644 index 0000000..cea6240 --- /dev/null +++ b/run_evaluation.py @@ -0,0 +1,43 @@ +#!/usr/bin/env python3 +""" +Phase 5-6: Evaluation and comparison. + +This script: +- Loads summaries from training +- Scores them using the trained reward model +- Computes ROUGE and BERTScore +- Compares all metrics +""" +import os +import json + +from reward_model import evaluate_summaries + + +# File paths +TRAIN_SUMMARY_JSON = "summaries_train.json" +REWARD_MODEL_DIR = "reward_model" +RESULTS_JSON = "evaluation_results.json" + + +def main(): + print("\n=== PHASE 5-6: Evaluation and comparison ===\n") + + if not os.path.exists(TRAIN_SUMMARY_JSON): + print("Error: No training summaries found. Run build_data.py first.") + return + + # Load existing summaries + with open(TRAIN_SUMMARY_JSON, "r", encoding="utf-8") as f: + eval_summaries = json.load(f) + + print(f"Evaluating {len(eval_summaries)} papers...") + + # Evaluate summaries + evaluate_summaries(eval_summaries, REWARD_MODEL_DIR, RESULTS_JSON) + + print("\n=== Evaluation complete! 
===") + + +if __name__ == "__main__": + main() diff --git a/summaries_train.json b/summaries_train.json new file mode 100644 index 0000000..4570842 --- /dev/null +++ b/summaries_train.json @@ -0,0 +1,82 @@ +[ + { + "arxiv_id": "2601.05171v1", + "title": "Paper 2601.05171v1", + "abstract": "Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process- based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory- operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro. 
Correspondence: Team Leader at lizy@memtensor.cn Author Legend: †Corresponding author", + "text": "Inside Out:\nEvolving User-Centric Core Memory Trees\nfor Long-Term Personalized Dialogue Systems\nJihao Zhao1,2,3, Ding Chen2,3, Zhaoxin Fan2,4, Kerun Xu2,3, Mengting Hu2,5, Bo\nTang2,3, Feiyu Xiong2,3, Zhiyu Li2,3,†\n1School of Information, Renmin University of China, 2MemTensor (Shanghai) Technology Co.,\nLtd., 3Institute for Advanced Algorithms Research, Shanghai, 4Beijing University of\nAeronautics and Astronautics, 5Nankai University\nAbstract\nExisting long-term personalized dialogue systems struggle to reconcile unbounded interaction\nstreams with finite context constraints, often succumbing to memory noise accumulation, reasoning\ndegradation, and persona inconsistency. To address these challenges, this paper proposes Inside\nOut, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term\nuser profiling. By constraining the trunk with an initial schema and updating the branches and\nleaves, PersonaTree enables controllable growth, achieving memory compression while preserving\nconsistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-\nbased rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP}\noperations, thereby supporting the dynamic evolution of the personalized tree. During response\ngeneration, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios;\nwhen users require more details, the agentic mode is triggered to introduce details on-demand\nunder the constraints of the PersonaTree.\nExperiments show that PersonaTree outperforms\nfull-text concatenation and various personalized memory systems in suppressing contextual noise\nand maintaining persona consistency. 
Notably, the small MemListener model achieves memory-\noperation decision performance comparable to, or even surpassing, powerful reasoning models such\nas DeepSeek-R1-0528 and Gemini-3-Pro.\nCorrespondence: Team Leader at lizy@memtensor.cn\nAuthor Legend: †Corresponding author\n1\nIntroduction\nCore memories shape Riley’s personality islands, with each island serving as a unique emblem of\nher identity.\n—— \"Inside Out\"\nWith the rapid advancement of large language models (LLM), dialogue-based agents have demonstrated\nsubstantial potential in applications such as personal assistants, affective companionship, and long-term\nquestion answering [4, 11, 18]. However, within personalized dialogue systems aimed at fostering long-\n1\narXiv:2601.05171v1 [cs.CL] 8 Jan 2026\n\nterm human-machine trust and emotional connection, a fundamental contradiction exists between the finite\ncontext window and the unbounded growth of interaction history [13, 23]. As conversational turns continue\nto accumulate, the traditional single-context paradigm encounters a severe form of context saturation:\nindiscriminate aggregation of massive historical information not only drives computational costs sharply\nupward, but also introduces substantial irrelevant noise, markedly degrading the signal-to-noise ratio. More\ncritically, this unstructured accumulation makes it difficult for the model to accurately extract and sustain a\nuser’s personal characteristics from lengthy histories, leading to personalization inconsistency over long-term\ninteractions and thereby seriously undermining user experience and the system’s long-term usability [19, 29].\nTo address these challenges, existing studies have primarily explored routes such as explicit profile augmentation\nand vector-based retrieval, yet neither directly confronts the central bottleneck of personalized memory\nevolution. 
Profile-based approaches rely on predefined, static attributes; they are not only slow to update\nbut also struggle to capture implicit cues that users reveal over prolonged interactions, including linguistic\nstyle, deeper value orientations, and affective preferences, resulting in superficial personalization modeling [20].\nIn contrast, memory-augmented agents based on vector retrieval, while introducing external storage, still\nessentially treat memory as text fragments or simple lists of facts. Such systems lack an intrinsic, trained\ndecision mechanism for determining which information merits long-term retention, and instead often depend\non rigid heuristics or elaborate prompt engineering [12]. This accumulation of memories without value-based\njudgment causes the memory repository either to become bloated and uninterpretable due to noise accretion,\nor to lose the long-range logical thread through fragmentation of key context, ultimately failing to sustain a\nvivid and coherent persona [25].\nThis discrepancy between memory accumulation and core persona formation” motivates us to return to the\nfoundations of human cognition for an answer. As illustrated by the film \"Inside Out\", individual identity\ndoes not stem from a simplistic stacking of all experiences, but rather is constructed upon core memories that\nshape distinct \"Islands of Personality\". This aligns with theoretical findings in cognitive psychology, such as\nSelf-Schema theory [15, 22], which emphasizes that humans maintain a stable self-concept by filtering and\nhierarchically organizing key memories.\nInspired by these insights, we propose the Inside Out framework, which aims to grow an evolvable user core\nmemory tree \"from the inside out\" through unbounded interactions. Firstly, to delineate the theoretical\nboundaries of the memory tree, we construct a hierarchical Schema based on the Biopsychosocial model,\nscientifically decomposing user characteristics into three core dimensions. 
This interdisciplinary Schema\ndesign establishes the initial structure of the user PersonaTree. Secondly, to endow the system with dynamic\nevolution, we propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy\nbased on process rewards to train a lightweight model, MemListener. This model learns to compress a\ncontinuous stream of unstructured dialogue in real time into standardized tree-structured operations, encoding\nuser core features within the branch and leaf nodes. Finally, addressing the trade-off between efficiency and\neffectiveness during the inference stage, this paper designs an adaptive response generation mechanism: In\nlatency-sensitive scenarios, a fast mode is enabled to perform reasoning directly based on the PersonaTree.\nWhen facing long-tail detail requirements, the system switches to the agentic recall mode, utilizing the\nPersonaTree to guide deep retrieval. The primary contributions of our work are summarized as:\n• We propose PersonaTree, grounded in the biopsychosocial schema. By transforming unstructured\ndialogue streams into standardized atomic tree operations in real-time, PersonaTree achieves the\ndynamic compression, explicit management, and high signal-to-noise ratio maintenance of implicit user\nprofiles.\n• We design a training strategy utilizing RL with process rewards. Leveraging the constructed dataset of\n28k instructions, we train a lightweight model, MemListener, to execute precise memory editing.\n• Our experiments reveal the potential of a collaborative paradigm where \"small models maintain memory\nwhile LLMs handle generation\". 
Results show that MemListener achieves memory-decision performance\ncomparable to strong reasoning models, and that PersonaTree offers a new pathway toward low-cost,\nhighly reliable deployment of long-term personalized dialogue systems.\n2\n\nFigure 1 Overview of the entire process of our Inside Out framework.\n2\nRelated Works\n2.1\nPersonalization and Memory\nPersonalization aims to adapt a dialogue system’s linguistic style and interaction policy to a specific user’s\nstable traits and evolving state. In interactive settings, personalization is inherently coupled with memory:\nmodels must distill reusable user representations from past interactions and fuse them during generation. Li\net al. [10] proposed persona-based dialogue generation to mitigate inconsistency and lack of personality in\nopen-domain dialogue, and Zhang et al. [26] formalized the PersonaChat task. Subsequent studies emphasized\nmulti-dimensional user attributes. For example, Zheng et al. [28] introduced the large-scale multi-turn dataset\nPersonalDialog. In parallel, Madotto et al. [14] framed personalization as a meta-learning problem to enable\nfew-shot adaptation. In the LLM era, Chen et al. [3] systematically reviewed major directions in personalized\ndialogue generation, while Tan et al. [21] assigned parameter-efficient personalization modules to users to\nimprove multi-task personalization.\n2.2\nLLM Agents with External Memory\nTo overcome the limitations of LLMs’ finite context windows and endow them with capabilities for continuous\nlearning and long-term interaction, constructing memory systems has emerged as a pivotal research direction.\n3\n\nLangMem 1 enables continual learning and cross-session consistency by decoupling hot-path memory primitives\nfrom backend asynchronous integration. Mem0 [4] adopts a multi-level memory architecture to support\nmulti-session retrieval and personalization at relatively low overhead. 
A-Mem [24] builds an evolvable memory\nnetwork via self-organizing indexing and linking mechanisms. MemoryOS [8] manages short, medium, and\nlong-term memory through OS-style hierarchical storage together with corresponding update and retrieval\npolicies to preserve contextual coherence.\n3\nThe Inside Out Framework\n3.1\nOverview Architecture\nThis study proposes the Inside Out Framework, which aims to address the challenges of personalized consistency\nand contextual forgetting in long-term dialogues through a structured memory evolution mechanism.\nFramework Pipeline.\nAs shown in Figure 1, the framework consists of three key modules: Dynamic Person-\naTree Evolution (Section 3.2), MemListener Training (Section 3.3), and Adaptive Response Generation (Section\n3.4). First, PersonaTree and dataset construction initializes a persona tree based on the Biopsychosocial\nModel, segments the user’s dialogue history into consecutive dialogue chunks, and generates operations on the\nPersonaTree, thereby constructing a memory evolution dataset. Second, MemListener training leverages the\nresulting high-quality dataset to update the parameters of MemListener via supervised fine-tuning (SFT) and\nRL with a dynamic reward mechanism, enabling it to extract structured memory from unstructured dialogues.\nFinally, PersonaTree inference freezes the MemListener parameters at application time, reconstructs the\nattribute tree from the dialogue history, uses this structured memory to enhance user queries, and ultimately\ngenerates personalized responses through an LLM.\nProblem Formulation.\nWe define the task of a personalized dialogue system as a process of maximizing\nresponse utility over an infinitely long dialogue stream. 
Given a user U with a historical dialogue sequence\nH = {x1, y1, ..., xt, yt}, where x denotes user inputs and y denotes system responses, conventional context-\nwindow approaches attempt to directly model P(yt | Ht−k:t), but are constrained by the window length k.\nOur framework introduces an explicit, structured user state T (i.e., PersonaTree), thereby reformulating the\nproblem as state tracking and state-conditioned generation. The goal is to learn a state update function\nfupdate such that:\nTt = fupdate(Tt−1, Dt)\n(1)\nyt = fgen(xt, Tt, frecall(Tt, H))\n(2)\nwhere Dt denotes the current dialogue chunk, fgen produces the system reply given the current user input\nand the tracked user state, and frecall is a retrieval function that recalls relevant historical snippets from the\nfull dialogue history H conditioned on the current state Tt.\n3.2\nDynamic PersonaTree Evolution\nPersonaTree Initialization.\nAt system startup, we construct an initial PersonaTree to serve as the starting\npoint of the user’s long-term structured state. Specifically, we first determine the set of writable trunk and\nleaf fields according to a predefined unified schema, and constrain the storage type of each leaf node to a\ndescriptive string, which is used to hold a compressed summary of the user’s core personalized attributes. This\ndesign ensures that memory capacity remains controllable and prevents unbounded growth as the dialogue\nprogresses. The schema is informed by interdisciplinary human-factors and psychological theory frameworks,\nwith its theoretical grounding illustrated in Figure 1. Subsequently, under the schema constraints, we initialize\nthe leaf nodes (allowing empty strings or default placeholder text), thereby obtaining the initial persona tree\nT0. 
The specific initial PersonaTree instance adopted in this paper is provided in Appendix D.\n1 https://github.com/langchain-ai/langmem\nIterative PersonaTree Updating. To enable scalable maintenance of long-term personalized memory over an infinitely long dialogue stream, we adopt an iterative updating mechanism: any input modality (historical file import, short-snippet input, or real-time cache triggering) is normalized into a dialogue-chunk sequence (D1, . . . , DN), and for each Dt we execute a closed-loop update of operation-list generation, safe parsing and execution, and versioned persistence.\nStep 0: System Loading. The system loads the text fields of all leaf nodes, yielding the initial state T0. Meanwhile, the task specification and system constraints are abstracted into a rule set R, including update rules, writable scope, leaf constraints, and output format.\nStep t = 1, . . . , N. For any dialogue chunk Dt, the system executes the following stages:\n(a) State Construction. Given a dialogue chunk Dt, set Inputt ← (Dt, Tt−1).\n(b) Operation List Generation. Conditioned on (Dt, Tt−1, R), the LLM outputs an operation list Ot, consisting of one or more atomic operations that strictly follow a predefined operation grammar. The operation types are limited to:\n• ADD(path, value): write descriptive text to the specified path; if the path does not exist, it may be created under an extended-schema policy;\n• UPDATE(path, value): perform an overwrite rewrite on the target leaf node, updating its text to the new value;\n• DELETE(path, value): clear the target leaf node or write a deletion marker to indicate that this type of information should be removed from long-term memory;\n• NO_OP(): the current dialogue chunk does not contain stable core persona information that should be written to the PersonaTree.\nFor update operations, our framework unifies them as rewrites of leaf strings.
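A minimal sketch of applying such an operation list under the rule-set constraints (writable scope, string-typed leaves, per-leaf length control); the schema paths, the budget, and the parsing regex are illustrative assumptions, not the paper's implementation:

```python
import re

# Illustrative sketch: apply an operation list O_t of
# ADD/UPDATE/DELETE/NO_OP to a flat {path: text} leaf map under schema
# and length constraints. Schema, budget, and regex are assumptions.

SCHEMA = {"bio.health", "psych.preferences", "social.relationships"}
LEAF_BUDGET = 60  # max characters per leaf value

OP_RE = re.compile(r"^(ADD|UPDATE|DELETE|NO_OP)\((.*)\)$")

def apply_ops(tree: dict, ops: list) -> dict:
    new_tree = dict(tree)
    for op in ops:
        m = OP_RE.match(op.strip())
        if not m:
            continue                    # reject malformed operations
        kind, body = m.group(1), m.group(2)
        if kind == "NO_OP":
            continue
        path, _, value = (p.strip() for p in body.partition(","))
        if path not in SCHEMA:
            continue                    # path must target a writable leaf
        if kind == "DELETE":
            new_tree[path] = ""
        else:                           # ADD / UPDATE both rewrite the leaf
            new_tree[path] = value[:LEAF_BUDGET]  # per-leaf length control
    return new_tree

tree = {p: "" for p in SCHEMA}
ops = ["ADD(psych.preferences, enjoys hiking)",
       "UPDATE(bad.path, ignored)",
       "NO_OP()"]
print(apply_ops(tree, ops)["psych.preferences"])
```

Note that, as in the text, conflict resolution is not done here: this layer only validates and executes, leaving semantic decisions to the operation generator.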
More importantly, potential\nconflicts between new and old information are resolved by LLM during the generation of Ot. Based on Dt\nand the contextual old values in Tt−1, the model must decide whether to overwrite prior information, preserve\nsalient change cues, or solely append new information. In other words, conflict resolution is explicitly lifted to\nthe policy-generation stage, so as to leverage the LLMs’ holistic inference over semantics, temporal order, and\nnarrative consistency.\n(c) Parsing and Execution. This module serves as a safety gate that enforces structural and capacity\nconstraints when applying Ot: it validates that each path targets a permissible leaf, ensures each value is\na string or an allowed deletion marker to avoid parsing ambiguity and state pollution, and applies length\ncontrol by compressing any overlong value to satisfy the per-leaf budget. Importantly, it performs no conflict\nresolution or secondary semantic rewriting; it only executes the prescribed operations under these constraints.\n(d) Versioned Persistence. After the execution, the updated tree state Tt is materialized and persisted as a\nnew version, either serialized to a JSON file or stored in a JSON-capable database (e.g., document stores\nsuch as MongoDB). Iterating over t = 1, . . . , N yields a traceable evolution sequence {T0, . . . 
, TN}, with TN serving as the compressed long-term memory for retrieval-augmented and personalized generation at the final task-query stage.\n3.3 MemListener Training\nTraining Data Synthesis. During training data construction, we select subsets from HaluMem [2] and PersonaMem [7] that are relevant to implicitly characterizing user-specific attributes as the raw corpus sources. Using the dynamic PersonaTree evolution procedure described in Section 3.2 as the backbone, we invoke DeepSeek-R1-0528 to generate supervision signals for training.\nWarm-up via SFT. We first perform full supervised fine-tuning to initialize the base model as a MemListener that can stably generate operation lists. For any training sample, let the input context be s (including the dialogue chunk, the previous tree state, and rule constraints), and the target output be o (the ground-truth operation sequence segment). We optimize a standard autoregressive cross-entropy objective:\nLSFT(θ) = −(1/τ) Σ_{t=1}^{τ} log Pθ(ot | o<t, s)\n[...]\n, \"\"), UPDATE(, \"\"), DELETE(, None), NO_OP()\\n\\n\nPersona Schema:\\n\\n{schema}\\n\\n\nDialogue History:\\n\\n{dialogue_text}\\n\\n\nNow, based on the given dialogue history, output only the operations:\nTable 9 Prompt for operational generation in PersonaTree for training and inference.\nReward-Function Prompt\nYou are a strict \"overall scorer for attribute-tree operations\". Your task is to assign an overall quality score in [-1, 1] to the model-predicted operation sequence Pred_Ops, given the ground-truth annotated operation sequence GT_Ops.\n[Input]\n- GT_Ops (ground truth): a list of operations, where each element is of the form ADD(path, value) / UPDATE(path, value) / DELETE(path, value) / NO_OP()\n- Pred_Ops (prediction): a list of operations in the same format as above\n[Critical Constraints]\n1) Output only a single JSON object: {\"score\": }.
Do not output any explanation and do not include\nany extra fields.\n2) score must be a continuous floating-point number within [-1, 1] (any value is allowed). It is recommended\nto keep 2 decimal places.\n3) The \"score-tier reference\" below serves only as anchors for aligning overall quality. You should fine-tune\nbetween anchors to output a more granular score.\n4) For example, if the overall quality falls between 0.7 and 1.0, output a value in [0.71, 0.99]; if it falls between\n0.5 and 0.7, output a value in [0.51, 0.69]; and so on.\n[Score-Tier Reference (Overall Quality Anchors)]\n* 1.0 (nearly perfect): Pred and GT are almost entirely consistent on key operations; types/paths are nearly\nidentical; values are semantically equivalent; no redundant operations.\n* 0.7 (high quality): most key operations are correct; only minor value-level deviations, or very few miss-\ning/redundant operations.\n* 0.5 (moderately usable): the overall approach and core direction are correct; some missing/redundant\noperations exist; some paths/values are incorrect, but the main semantics are not affected.\n* 0.3 (partially reliable): about half of the content is reliable; some key operations are correct while others are\nwrong, requiring some fixes.\n* 0.0 (slightly correct): only a small number of operations or fragments are correct; missing/redundant\noperations and errors are evident; key operations are mixed correct/incorrect.\n* -0.3 (barely relevant): broadly related but with many omissions/errors; it is only apparent that the model is\nattempting the task, and it is essentially unusable as-is.\n* -0.5 (clearly off-target): most key operations are missing or incorrect; many wrong paths/types or obviously\nredundant operations; overall deviates from expectations.\n* -0.7 (catastrophic): large-scale structural/semantic disorder; almost unusable.\n* -1.0 (meaningless output): clearly meaningless, garbage text, or unrelated to the task.\n[Output Format]\nOutput only the JSON 
object containing the score, with no additional notes or explanations.\nOutput only:\n{\"score\": }\n[Task Data]\n- GT_Ops:\n{gt_ops}\n- Pred_Ops:\n{pred_ops}\nTable 10 Reward-function prompt for process-reward RL training.\n21\n", + "summary_1": "Here is a clear and concise summary of the research paper:\n\n**Title:** Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems\n\n**Problem Statement:** Long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency.\n\n**Proposed Solution:** The authors propose the Inside Out framework, which utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is a hierarchical structure that constrains the trunk with an initial schema and updates the branches and leaves through a reinforcement learning (RL) strategy.\n\n**Key Components:**\n\n1. **PersonaTree:** A hierarchical structure that represents the user's core memories, with three core dimensions (Biopsychosocial model).\n2. **Schema:** An initial structure of the user PersonaTree, established through interdisciplinary design.\n3. **MemListener:** A lightweight model trained through RL to compress a continuous stream of unstructured dialogue into standardized tree-structured operations.\n4. **Tree-Update Mechanism:** An iterative process that updates the PersonaTree based on user interactions.\n\n**Benefits:** The Inside Out framework enables controllable growth, achieves memory compression while preserving consistency, and supports the dynamic evolution", + "summary_2": "Here is a summary of the research paper excerpt, incorporating the information conveyed by the figures and their captions:\n\n**Background:**\nThe authors propose a novel framework, Inside Out, to address the challenges of long-term personalized dialogue systems. 
These systems struggle to reconcile unbounded interaction streams with finite context constraints, leading to memory noise accumulation, reasoning degradation, and persona inconsistency. Existing approaches, such as explicit profile augmentation and vector-based retrieval, fail to directly confront the central bottleneck of personalized memory evolution.\n\n**Inside Out Framework:**\nThe Inside Out framework utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. The PersonaTree is constructed using a hierarchical Schema based on the Biopsychosocial model, which decomposes user characteristics into three core dimensions. This Schema design establishes the initial structure of the user PersonaTree.\n\n**MemListener:**\nTo endow the system with dynamic evolution, the authors propose an iterative tree-update mechanism and introduce a reinforcement learning (RL) strategy based on process rewards to train a lightweight model, MemListener. This model learns to compress a continuous stream of unstructured dialogue in real time into standardized tree-structured operations.\n\n**Key Features:**\n\n1. **PersonaTree:** A globally maintained tree that captures the user's long-term profile, with" + }, + { + "arxiv_id": "2601.05111v1", + "title": "Paper 2601.05111v1", + "abstract": "LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assess- ments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persis- tent memory to enable more robust, verifiable, and nuanced evaluations. 
Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation. � Project Page: https://github.com/ModalityDance/Awesome-Agent-as-a-Judge", + "text": "2026-1-8\nA Survey on Agent-as-a-Judge\nRunyang You*1\nHongru Cai*1\nCaiqi Zhang2\nQiancheng Xu1\nMeng Liu3\nTiezheng Yu4\nYongqi Li†1\nWenjie Li1\n1 The Hong Kong Polytechnic University\n2 University of Cambridge\n3 Shandong Jianzhu University\n4 Huawei Technologies\n* Equal contribution\n† Corresponding author\nrunyang.y@outlook.com, {henry.hongrucai, liyongqi0}@gmail.com,\ncswjli@comp.polyu.edu.hk\nAbstract\nLLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assess-\nments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of\nLLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability\nto verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge,\nwhere agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persis-\ntent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation\nof agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To\nbridge this gap, we present the first comprehensive survey tracing this evolution. 
Specifically, we identify\nkey dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize\ncore methodologies and survey applications across general and professional domains. Furthermore, we\nanalyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap\nfor the next generation of agentic evaluation.\n� Project Page: https://github.com/ModalityDance/Awesome-Agent-as-a-Judge\n1 Introduction\nThe rapid advancement of Large Language Models\n(LLMs) has revolutionized the field of AI evalua-\ntion, giving rise to the LLM-as-a-Judge paradigm [1].\nWhile traditional metrics fail to capture semantic\nnuance and human judgment remains unscalable,\nthis new approach leverages LLMs’ advanced un-\nderstanding and decision-making abilities to deliver\nnear-human quality assessments across diverse do-\nmains [2]. Moreover, serving as a scalable proxy for\nhuman preference, LLM judges can provide reward\nsignals for reinforcement learning [3] and enable the\nautomated curation of massive synthetic datasets [4].\nAs such, LLM judgment has established itself as a cor-\nnerstone of AI evaluation and optimization pipelines,\nwhere the precision of the judge critically determines\nthe success of downstream applications [5].\nHowever, as generative AI applications evolve from\nsimple text responses to complex, multi-step tasks\nacross specialized domains, the reliability of LLM-\nas-a-Judge has become inevitably constrained [2, 6].\nFirst, single-pass evaluators are prone to inherent\nparametric biases—such as favoring verbosity or their\nown output patterns—which compromise their neu-\ntrality when assessing high-complexity responses\nthat deviate from their training distribution [7]. 
Second, naive LLM judges are passive observers, unable to react to real-world observations; they assess answers based on linguistic patterns without verification, leading to hallucinated evaluations in specialized domains [8]. Furthermore, in evaluation tasks that require multifaceted assessment rubrics, traditional LLM judges experience cognitive overload when attempting to evaluate all dimensions comprehensively within a single inference step, which results in coarse-grained scores that fail to reflect specific nuances [9].\nThese limitations have catalyzed the transition from LLM-as-a-Judge to Agent-as-a-Judge. As shown in Figure 1, agentic judges proactively engage in evaluation through multiple capabilities: they decompose complex objectives into subtasks, mitigate biases through multi-agent collaboration [10], ground assessments via tool-augmented evidence collection and correctness verification [8], and enable fine-grained assessment by persisting intermediate states, autonomously planning the evaluation across reasoning steps [11, 12].\nFigure 1: Comparison between LLM-as-a-Judge (1a) and Agent-as-a-Judge (1b). The former performs direct single-pass evaluation, while the latter leverages planning, memory, and tool-augmented capabilities for enhanced evaluation.
This paradigm shift enables\nmore robust, verifiable, and nuanced assessments\nthat effectively address the multifaceted nature of\nsophisticated AI-generated evaluands.\nDespite the above potentials and rapid proliferation\nof agentic evaluation systems, the field lacks a survey\nto summarize and navigate this shifting landscape.\nTo bridge this gap, we present the first comprehensive\nsurvey for Agent-as-a-Judge through the following\ncontributions:\n• We identify and characterize the shift from LLM-\nas-a-Judge to Agent-as-a-Judge and summarize the\nagentic judges’ development trend into three pro-\ngressive stages (Section 2).\n• We organize core methodologies into five key parts\naccording to agent’s abilities (Section 3) and sur-\nvey their applications across general and profes-\nsional domains (Section 4).\n• We analyze frontier challenges and identify promis-\ning research directions, providing a strategic\nroadmap for the next generation of robust and\nverifiable AI judgment.\n2 Evolution:\nFrom LLM-as-a-Judge to\nAgent-as-a-Judge\nThis section traces the evolution of automated\nevaluation from LLM-as-a-Judge to Agent-as-a-Judge\nparadigms. We begin by reviewing the foundational\nLLM-as-a-Judge and its limitations. We then exam-\nine the shift toward Agent-as-a-Judge, analyzing key\ndimensions that characterize the agentic approach.\nFinally, we summarize Agent-as-a-Judge’s develop-\nment trend into three progressive stages with distinct\nlevels of autonomy and adaptability.\n2.1 LLM-as-a-Judge\nLLM-as-a-Judge paradigm emerged to overcome the\nscalability limits of human evaluation and the se-\nmantic insensitivity of traditional metrics. Zheng\net al., [1] formalized this approach by introducing\nbenchmarks like MT-Bench to assess model align-\nment. 
Building on this, G-Eval [59] leveraged chain-of-thought prompting for better alignment in natural language generation (NLG), while Prometheus [60] induced fine-grained evaluation in open-source models via specialized tuning. To mitigate systematic issues like position and verbosity bias [7], JudgeLM [61] utilized fine-tuning to develop more robust evaluators.\n2.2 From LLM-as-a-Judge to Agent-as-a-Judge\nAs evaluands evolve from simple text responses to complex, multi-step tasks across specialized domains, traditional LLM-as-a-Judge has become increasingly inadequate, focusing on final outputs while failing to verify intermediate actions or satisfy the rigorous standards of professional fields [2, 6]. To bridge this gap, the paradigm is shifting toward Agent-as-a-Judge, which employs decentralized deliberation, executable verification, and fine-grained assessment to mitigate these limitations.\nEvolving Robustness: From Monolithic to Decentralized. To mitigate the inherent parametric biases of monolithic LLM judges, such as the tendency to favor verbosity or their own output patterns, Agent-as-a-Judge paradigms employ specialized, decentralized agents that collaborate through autonomous decision-making [10, 13]. Crucially, this decentralized architecture facilitates the injection of expert prior knowledge: by decomposing complex evaluation goals into sub-tasks or structuring specific interaction workflows, we can enforce domain-specific constraints that a generalist model typically overlooks [16, 24]. Furthermore, multi-agent deliberation ensures collective robustness; distinct roles can isolate specific information points to neutralize bias, while debate and self-reflection allow agents to audit their own cognitive shortcuts, ensuring the final judgment transcends the heuristics of any single model [62, 7].\nFigure 2: A taxonomy of Agent-as-a-Judge organizing Methodologies (§3) and Applications (§4). Background gradients illustrate the coverage of developmental stages, from Procedural to Reactive and then to Self-Evolving.\nMethodologies (§3):\n• Multi-Agent Collaboration (§3.1): e.g., ChatEval [10], M-MAD [56], SAGEval [44], HiMATE [9], CAFES [57], GEMA-Score [24], CourtEval [58]\n• Planning (§3.2): e.g., MATEval [52], Evaluation Agent [28], EvalAgents [53], AGENT-X [45], ARJudge [54], OnlineRubrics [55]\n• Tool Integration (§3.3): e.g., Agent-as-a-Judge [48], CodeVisionary [51], Evaluation Agent [28], ARM-Thinker [12], HERMES [37], VerifiAgent [38], Agentic Reward Modeling [8]\n• Memory and Personalization (§3.4): e.g., HERMES [37], ARM-Thinker [12], Agent-as-a-Judge [48], RLPA [11], SynthesizeMe [46], PersRM-R1 [49], FSPO [50]\n• Optimization Paradigms (§3.5): e.g., Prompting (Evaluation Agent [28], HERMES [37], Multi-Agent LLM Judge [43], SAGEval [44], AGENT-X [45]), SFT (SynthesizeMe [46]), RL (TIR-Judge [47], ARM-Thinker [12])\nApplications (§4):\n• General Domains (§4.1): Math and Code (e.g., HERMES [37], VerifiAgent [38], CompassVerifier [39], xVerify [40], Agentic Reward Modeling [8], Multi-Agent Verification [41], Popper [42]); Fact-Checking (e.g., FACT-AUDIT [34], UrduFactCheck [35], NarrativeFactScore [36]); Conversation and Interaction (e.g., IntellAgent [30], ESC-Judge [31], Sentient Agent [32], PSYCHE [33]); Multimodal and Vision (e.g., CIGEval [27], Evaluation Agent [28], LRQ-Fact [29], ARM-Thinker [12])\n• Professional Domains (§4.2): Education (e.g., MAJ-Eval [13], AutoSCORE [14], GradeOpt [15], Grade-Like-a-Human [16]); Finance (e.g., FinResearchBench [17], FinDeepResearch [18], SAEA [19], M-SAEA [20]); Law (e.g., AgentsCourt [21], SAMVAD [22], AgentsBench [23]); Medicine (e.g., MAJ-Eval [13], GEMA-Score [24], ChatCoach [25], AI Hospital [26])\nEvolving Verification: From Intuition to Execution. Static LLM judges are fundamentally passive observers, unable to react to real-world feedback. They assess answers based on linguistic plausibility – how correct a response looks – without verification or evidence collection, leading to \"hallucinated correctness\" in complex tasks [8]. Agent-as-a-Judge bridges this reality gap by replacing intuition with execution. By interacting with external environments, agentic judges can query system states to verify side effects (e.g., file operations) [48, 51], use code interpreters or theorem provers to validate logical consistency [37], and employ search tools to ground factual claims in real-time documentation [38, 8]. This shifts the evaluative anchor from internal model knowledge to objective verification.\nEvolving Granularity: From Global to Fine-grained. Many evaluation tasks inherently require multifaceted assessment rubrics, yet traditional LLM judges face cognitive overload when evaluating these dimensions comprehensively within a single inference step, which results in coarse-grained scores that fail to reflect specific nuances [9]. Agent-as-a-Judge addresses this by transforming evaluation from a single-pass inference into autonomous, hierarchical reasoning [9]. Instead of a monolithic assessment, an agentic judge can dynamically select or create task-specific rubrics, autonomously planning the evaluation to examine each component of the evaluand independently [44], utilizing memory to track historical reasoning states and synthesize fragmented evidence into a coherent verdict.
Consequently, these agents can pinpoint specific flaws that would otherwise be obscured in a global score, providing fine-grained feedback on each aspect [45].\n2.3 Agent-as-a-Judge\nAgent-as-a-Judge represents a rapidly expanding field where the term \"agent\" is often applied loosely, spanning a heterogeneous range from procedural agentic workflows to autonomous self-evolvers [10, 45, 12]. To provide a clear roadmap through this complexity, we summarize the ongoing development of agency as follows.\nProcedural Agent-as-a-Judge decouples monolithic inference into agentic predefined workflows [57, 24] or engages in structured discussions among fixed sub-agents [10, 56]. These systems enable complex judgments through coordinated multi-agent interactions, yet remain constrained by predetermined decision rules that cannot adapt to novel evaluation scenarios.\nReactive Agent-as-a-Judge enables adaptive decision-making by routing execution paths [28, 45] and invoking external tools [8] or sub-agents [13] based on intermediate feedback. However, such reactivity remains confined to conditional routing within fixed decision spaces, lacking autonomy to refine underlying rubrics.\nSelf-Evolving Agent-as-a-Judge represents the cutting edge of the field, characterized by high autonomy and the ability to refine internal components during operation, synthesizing evaluation rubrics on-the-fly [53] and updating memory with lessons learned. This paradigm opens new frontiers for adaptive evaluation systems, though challenges remain in ensuring stability during self-modification [63].\n3 Methodologies\nThis section categorizes Agent-as-a-Judge methodologies into five dimensions: multi-agent collaboration, planning, tool integration, memory and personalization, and optimization paradigms.
As shown in Figure 2, implementation sophistication reveals the evolutionary stages: foundational methodologies (collaboration, tool integration, optimization) evolve across all stages, while others (planning, memory) emerge more prominently in advanced paradigms. The following subsections examine how each methodology manifests across these stages.\nFigure 3: Multi-agent collaboration paradigms: (a) Collective Consensus and (b) Task Decomposition.\n3.1 Multi-Agent Collaboration\nMulti-agent collaboration leverages collective reasoning to mitigate single-LLM biases in Agent-as-a-Judge systems. Early systems followed Procedural paradigms with fixed protocols, while recent work evolves toward Reactive approaches that adapt agent selection based on feedback. We categorize these into two topologies:\nCollective Consensus. Horizontal debate mechanisms leverage agents representing diverse perspectives to counteract the inherent biases of single-LLM evaluators, illustrated in Figure 3. Early approaches exemplified the Procedural stage: ChatEval [10] pioneered this with a courtroom-inspired discussion mechanism where agents debate as equals following predefined protocols. This paradigm was later extended to machine translation in M-MAD [56], while subsequent research [64] introduced explicit stances and \"judge\" roles to prevent agents from blindly conforming to the majority. Recent methods have become more Self-Evolving: approaches like Multi-agent-as-judge [13] have moved beyond static ensembles by creating domain-specific experts based on intermediate feedback.\nTask Decomposition. Task Decomposition employs a \"Divide and Conquer\" strategy, delegating distinct subtasks to specialized agents for systematic evaluation, illustrated in Figure 3.
Early frameworks followed Procedural designs: sequential approaches like CAFES [57] and GEMA-Score [24, 58] structure evaluation into predefined stages (e.g., Evidence Gathering, Reasoning, Scoring), while SAGEval [44] introduces supervision via a \"Judge the Judge\" meta-evaluator that reviews previous agents’ decisions, with hierarchical approaches like HiMATE [9] organizing agents into tree structures for varying error granularities. More recent work has shifted toward Reactive paradigms: AGENT-X [45] employs an adaptive router agent that dynamically selects the most relevant base agents based on intermediate analysis results.\nTakeaway: Multi-agent evaluation frameworks adopt two main topologies: Collective Consensus and Task Decomposition. Recent advances have evolved toward more autonomous systems that can select or generate subagents.\n3.2 Planning\nPlanning serves as a core capability in the Agent-as-a-Judge paradigm, enabling the decomposition of high-level evaluation objectives into executable sub-tasks and the dynamic adaptation of assessment trajectories based on intermediate analysis. This section examines planning capabilities from two perspectives:\nWorkflow Orchestration. Workflow orchestration in Agent-as-a-Judge systems spans from static frameworks to dynamic agency, primarily characterizing Procedural and Reactive stages of agentic evaluation. Approaches like MATEval [52] rely on static decomposition, breaking tasks into fixed sequences of sub-dimensions. While this ensures systematic assessment through predefined control flows, it limits adaptability in complex scenarios. Conversely, Evaluation Agent [28] introduces dynamic multi-round planning, where agents adjust strategies based on intermediate feedback.
This system further optimizes\nefficiency through autonomous termination, allow-\ning the agent to self-monitor information gain and\nproactively halt execution once sufficient evidence is\ngathered.\nRubric Discovery.\nUnlike general agents focused on\ntask completion, Judge Agents have the distinct capa-\nbility to autonomously formulate and refine rubrics,\nrepresenting a hallmark of the Self-Evolving stage,\nwhere agents can refine their internal evaluation com-\nponents. EvalAgents [53] exemplifies this by employ-\ning a Query Generator that plans web searches to\ndiscover implicit rubrics, while AGENT-X [45] uses an\nAdaptive Router to infer domain context and plan be-\nspoke detection guidelines. ARJudge [54] adaptively\nformulates rubrics by iteratively generating context-\nsensitive questions, and OnlineRubrics [55] inte-\ngrates planning into reinforcement learning, evolv-\ning rubrics alongside policy optimization to detect\nreward hacking.\nTakeaway Serving as the strategic engine, planning\nshifts evaluation from rigid flows to adaptive explo-\nration, enabling agents to optimize how they evalu-\nate (workflow orchestration) and what they evaluate\n(rubric discovery).\n3.3 Tool Integration\nTool integration is a defining capability of Agent-as-\na-Judge frameworks, enabling judges to ground eval-\nuation in external evidence and explicit checks. As\nshown in Table 1, existing approaches can be grouped\ninto evidence collection and correctness verification\nbased on the purpose of tool use.\nEvidence Collection. A common use of tools in\nAgent-as-a-Judge frameworks is to collect additional\nevidence that supports evaluations. Such evidence in-\ncludes intermediate artifacts, execution results, and\nperceptual signals that cannot be reliably obtained\nthrough text-based reasoning. 
In code-related tasks, Agent-as-a-Judge [48] and CodeVisionary [51] allow judges to inspect execution artifacts or run automated checks to expose execution feedback for evaluation. Similar methods are adopted in multimodal settings. Evaluation Agent [28] enables judges to invoke external visual models to obtain visual quality or alignment signals, while ARM-Thinker [12] gathers fine-grained visual and contextual evidence through document access and localized visual operations. Overall, these works integrate tools to surface observable and task-relevant evidence, expanding the judge’s access to execution-level, perceptual, and contextual information, and supporting more reliable evaluation.\nCorrectness Verification. Another line of work employs tools to verify whether the evaluand’s outputs or intermediate reasoning steps satisfy explicit correctness constraints, such as logical validity, mathematical soundness, or factual consistency. In these frameworks, the judge agent identifies which claims or steps require verification and invokes appropriate tools to check them.\nTable 1: Tool integration in representative Agent-as-a-Judge methods, grouped by primary tool usage purpose.\n| Tool Purpose | Method | Evaluation Task | Tool Type |\n| --- | --- | --- | --- |\n| Evidence collection | Agent-as-a-Judge [48] | Code generation | Graph, locate, read, search, retrieve |\n| Evidence collection | CodeVisionary [51] | Code generation | Code execution, static linter, unit tests, screenshot, web browsing |\n| Evidence collection | Evaluation Agent [28] | Visual generation | Visual generative models |\n| Evidence collection | ARM-Thinker [12] | Multimodal generation | Instruction following checks, crop/zoom tools, document retrieval tools |\n| Correctness verification | HERMES [37] | Math reasoning | Translator, theorem prover |\n| Correctness verification | VerifiAgent [38] | Factual & Math reasoning | Search engine, Python interpreter, theorem prover |\n| Correctness verification | Agentic RM [8] | Factual & Math reasoning | Search engine, Python interpreter |
The resulting verification signals are then interpreted by the agent in context to inform the final evaluation. HERMES [37] verifies mathematical reasoning through formal theorem proving, while VerifiAgent [38] invokes programmatic and symbolic checkers to validate factual and computational claims. Agentic Reward Modeling [8] further integrates correctness verification by combining fact-checking tools and programmatic validators to produce structured correctness signals that inform the final evaluation.

Takeaway: Tool integration in Agent-as-a-Judge grounds evaluation in observable and verifiable signals by allowing judges to actively gather evidence and check correctness through external tools.

3.4 Memory and Personalization

Memory enables Agent-as-a-Judge frameworks to retain information across evaluation steps, supporting multi-step reasoning, consistent judgment, and reuse of prior results. We categorize prior work by the role of memory, including intermediate state tracking and personalized context preservation.

Intermediate State. In multi-step evaluation settings, Agent-as-a-Judge frameworks use memory to retain intermediate evaluation states generated during assessment, providing the necessary context for conditional routing and adaptive decision-making based on intermediate feedback, a fundamental mechanism for Reactive Agent-as-a-Judge. HERMES [37] retains intermediate proof states when combining reasoning with formal theorem proving, enabling consistent verification across long reasoning chains. ARM-Thinker [12] preserves intermediate evidence such as visual reasoning outputs and tool interaction results, which are later reused to ground evaluation. Agent-as-a-Judge [48] records execution traces and step-level feedback, enabling evaluation beyond final outputs to account for intermediate behavior.
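A minimal sketch of such intermediate-state memory, assuming a hypothetical `JudgeMemory` record store: each step's feedback is retained so that later routing and the final verdict can condition on the whole trace rather than only the final output.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeMemory:
    """Minimal intermediate-state memory for a multi-step judge (illustrative).
    Step records accumulate so later decisions can condition on earlier feedback."""
    steps: list = field(default_factory=list)

    def record(self, step: str, signal: str, passed: bool) -> None:
        self.steps.append({"step": step, "signal": signal, "passed": passed})

    def context(self) -> str:
        # Serialized trace a judge could prepend to its next reasoning prompt.
        return "\n".join(
            f"[{i}] {s['step']}: {'OK' if s['passed'] else 'FAIL'} ({s['signal']})"
            for i, s in enumerate(self.steps)
        )

    def verdict(self) -> str:
        # Step-aware judgment: reject if any intermediate check failed.
        return "accept" if all(s["passed"] for s in self.steps) else "reject"

mem = JudgeMemory()
mem.record("requirement check", "all constraints located in output", True)
mem.record("unit tests", "3/4 tests passed", False)
```

Here `mem.verdict()` is "reject" because one intermediate check failed, and `mem.context()` carries the trace forward into the next evaluation step.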
Collectively, these methods use memory to retain intermediate states that support cumulative, step-aware evaluation.

Personalized Context. Agent-as-a-Judge frameworks often incorporate memory to retain user-related information that conditions evaluation across interactions. Such memory captures user preferences, evaluation standards, or prior feedback, allowing judgments to remain consistent over time. PersRM-R1 [49] and FSPO [50] store preference data derived from historical interactions, including preference labels or few-shot examples, which are reused to condition subsequent evaluations for the same user. More advanced approaches abstract historical preference signals into persistent user personas or long-term profiles. RLPA [11] and SynthesizeMe [46] exemplify this by constructing and maintaining user personas that are stored and reused to guide evaluation. Such long-term user profiling often serves to support Self-Evolving Agent-as-a-Judge, enabling continuous optimization based on evolving preferences. Together, these methods use memory to preserve personalized context that shapes evaluation behavior and ensures consistency across interactions.

Takeaway: Memory enables Agent-as-a-Judge to preserve intermediate states and personalized context, supporting multi-step evaluation, consistent judgment, and efficient reuse of prior information.

3.5 Optimization Paradigms

Optimization paradigms define how Agent-as-a-Judge improves evaluation quality by updating model parameters or adapting evaluation behaviors. We organize prior work into two groups: training-time optimization and inference-time optimization.

Training-Time Optimization. Training-time optimization improves Agent-as-a-Judge by updating model parameters to align judgment behavior with evaluation objectives.
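Training-time optimization of a judge is typically driven by preference data. One standard objective, shown here as a hedged sketch rather than any cited system's recipe, is a Bradley-Terry style pairwise loss that rewards the judge for scoring the preferred output above the rejected one.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(s_chosen - s_rejected).
    Minimizing it pushes the judge to score preferred outputs higher.
    Sketch only: real training backpropagates through a full scoring model."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in the right direction yields a smaller loss.
well_ordered = pairwise_preference_loss(2.0, -1.0)   # chosen scored higher
mis_ordered = pairwise_preference_loss(-1.0, 2.0)    # chosen scored lower
```

With equal scores the loss is log 2; it shrinks as the judge separates chosen from rejected outputs in the right direction.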
Supervised fine-tuning is commonly used to standardize judge behavior, training models to follow explicit criteria and produce structured judgments across tasks. For example, SynthesizeMe [46] shapes evaluation behavior using persona-guided supervision derived from historical data. Reinforcement learning optimizes judges to perform evaluation actions more effectively, especially in settings that require tool use and multi-step verification. TIR-Judge [47] and ARM-Thinker [12] train judges to decide when and how to invoke tools, integrate external signals, and verify intermediate results. Together, training-time optimization shapes internal decision processes, enabling more reliable, structured evaluation.

Inference-Time Optimization. Inference-time optimization adapts evaluation behavior without updating model parameters by controlling how judgments are produced through prompts, workflows, or agent interactions. Existing approaches can be broadly grouped into two types. 1) The first type follows predefined evaluation procedures, where reasoning steps, verification routines, or prompts are fixed in advance to ensure consistency and efficiency. Evaluation Agent [28] and HERMES [37] exemplify this setting by adopting structured, step-by-step evaluation pipelines.
2) The second type allows evaluation behavior to adapt during inference, where the evaluation process, participating agents, or applied criteria change based on intermediate results. Multi-Agent LLM Judge [43] iteratively refines prompts and context through multi-judge coordination, while SAGEval [44] introduces a meta-judge to monitor and revise judge behavior. ChatEval [10] and AGENT-X [45] further support adaptive evaluation through agent interaction and dynamic guideline selection. Overall, inference-time optimization enables flexible control over evaluation behavior, ranging from fixed procedures to adaptive, interaction-driven judgment.

Takeaway: Optimization improves Agent-as-a-Judge by either learning evaluation behavior through parameter updates at training time or adjusting evaluation strategies at inference time.

4 Application

Building on the methods above, this section describes how Agent-as-a-Judge methods are applied in different evaluation tasks. As shown in Figure 4, we organize representative applications into two groups: general domains and professional domains.

Figure 4: An overview of Agent-as-a-Judge application domains and their fine-grained task categories. [Figure: general domains (conversation and interaction, math and code, multimodal and vision, fact-checking) and professional domains (law, education, finance, medicine), each with fine-grained tasks such as machine translation, code generation, mathematical reasoning, image generation, radiology reports, legal judgment, risk evaluation, and essay scoring.]

4.1 General Domains

Math and Code. In math and code evaluation, Agent-as-a-Judge systems move beyond single-pass scoring by grounding judgment in verifiable reasoning signals.
One line of work augments free-form reasoning with explicit correctness checks. HERMES [37] anchors LLM reasoning to intermediate formal proof steps, reducing drift in long derivations. VerifiAgent [38] decouples high-level reasoning assessment from tool-based correctness verification, enabling adaptive checking across reasoning types. CompassVerifier [39] and xVerify [40] focus on mathematical and logical outputs, addressing equivalence checking under diverse surface forms. Other approaches strengthen judgment by aggregating multiple evaluation signals. Multi-Agent Verification [41] distributes evaluation across aspect-specific judges. Agentic Reward Modeling [8] integrates preference-based supervision with verifiable correctness signals to improve reward reliability. Popper [42] formulates judgment as controlled falsification, using statistical tests to validate free-form claims.

Fact-Checking. In fact-checking, Agent-as-a-Judge reframes evaluation from static label prediction to interactive verification with evidence gathering and justification. FACT-AUDIT [34] models fact-checking as an agentic loop with multi-agent collaboration, jointly evaluating verdict accuracy and justification quality. This paradigm is particularly effective when evidence is scarce or inconsistencies are subtle. UrduFactCheck [35] improves robustness in low-resource settings through multilingual retrieval and evidence boosting. NarrativeFastScore [36] addresses long-context factual consistency by constructing character-level knowledge representations, enabling detection of state and relation errors with actionable feedback.

Conversation and Interaction. In conversation and interaction, Agent-as-a-Judge shifts from grading isolated replies to constructing multi-turn exchanges, enabling evaluation under evolving goals, constraints, and user reactions.
For task-oriented dialogue, IntellAgent [30] uses interactive user simulations to synthesize conversational benchmarks, while Kazi et al. [65] introduce frameworks for controllable user goals and automatic measures. For affective and social interaction, ESC-Judge [31] constructs emotional-support agents via standardized counseling skills, Sentient Agent [32] tracks emotion trajectories over time to reflect higher-order social cognition, and PSYCHE [33] builds psychiatric patient profiles for ethical assessment validation. Wu et al. [66] frame evaluation as multi-perspective role play with diverse reviewer personas to cover both objective and subjective dimensions.

Multimodal and Vision. In the multimodal and vision domain, Agent-as-a-Judge shifts from static scoring to interactive inspection. For visual generation, CIGEval [27] orchestrates specialized tools to probe control adherence and subject consistency, while Evaluation Agent [28] runs multi-round checks to provide user-tailored, explainable analyses. For truthfulness evaluation, LRQ-Fact [29] generates targeted fact-checking questions across image and text to guide evidence retrieval, while ARM-Thinker [12] selectively invokes tools like image inspection for finalizing judgments.

4.2 Professional Domains

Medicine. In high-stakes clinical NLP, Agent-as-a-Judge appears in two forms: 1) multi-agent evaluators that decompose clinical quality into specialized roles, and 2) agentic simulators that interactively elicit clinical behaviors. For 1), MAJ-Eval [13] constructs multiple evaluator personas to debate and cross-verify responses, while GEMA-Score [24] uses agent collaboration to compute granular, tool-assisted scores covering disease severity and uncertainty.
For 2), Chat-Coach [25] pairs autonomous patient and coach agents to critique trainee-doctor dialogues, while AI Hospital [26] evaluates LLM "doctors" in multi-agent simulators, though final scoring often still requires conventional metrics.

Law. In the legal domain, Agent-as-a-Judge simulates the adversarial and deliberative nature of jurisprudence through multi-agent interaction. AgentsCourt [21] introduces adversarial debate frameworks where agents role-play as prosecutors, defense attorneys, and judges, exposing the evaluating agent to conflicting arguments to improve verdict robustness. SAMVAD [22] and AgentsBench [23] model judicial consensus by simulating bench deliberation processes, capturing interactions between concurring and dissenting opinions to enhance legal judgment prediction.

Finance. In finance, Agent-as-a-Judge addresses two limitations of static benchmarks: 1) capturing the internal research logic of long-form analyst reports, and 2) detecting deployment risks like hallucinations and temporal staleness. For 1), FinResearchBench [17] extracts logic trees from reports as intermediate structures for comprehensive assessment, whereas FinDeepResearch [18] can synthesize hierarchical rubrics but still relies on predefined workflows. For 2), SAEA [19] proposes auditing agent trajectories to mitigate hallucinations and temporal misalignment. From Tasks to Teams [20] extends this approach with M-SAEA to trace multi-agent failures, such as cross-agent divergence and error propagation.

Education. In the educational domain, Agent-as-a-Judge systems emulate pedagogical nuance through collaborative, role-specialized workflows. Grade-Like-Human [16] and AutoSCORE [14] decompose grading into staged processes (rubric construction, evidence recognition, cross-review) to improve grounding and consistency.
Beyond static scoring, MAJ-Eval [13] uses multi-persona debates to align with multi-dimensional human evaluation, while GradeOpt [15] introduces agents that diagnose discrepancies and iteratively refine grading guidelines.

5 Discussion

This section discusses broader issues that arise when deploying Agent-as-a-Judge systems in practice. We first summarize key challenges that limit scalability, reliability, and real-world adoption, and then outline several future directions that may help address these limitations and further advance agentic evaluation.

5.1 Challenges

Agent-as-a-Judge improves evaluation reliability through planning, tool use, memory, and multi-agent collaboration, but these capabilities also introduce new challenges beyond static LLM-as-a-Judge. Key challenges include computational cost, latency, safety, and privacy.

Computational Cost. Agent-as-a-Judge introduces a heavier computational burden in both training and inference. 1) Training a judge agent is expensive. Supervised fine-tuning alone is often insufficient to support agentic behaviors such as tool invocation, long-horizon planning, and adaptive decision making. Reinforcement learning provides a natural way to acquire these capabilities, but it significantly increases training cost, especially when the judge operates over long trajectories or complex tool-calling sequences. 2) Inference with Agent-as-a-Judge is also costly. Unlike single-pass judgment, agentic evaluation typically involves multiple reasoning steps, intermediate decisions, and coordination among multiple agents, all of which increase computation per evaluation.

Latency. In addition to higher computational cost, Agent-as-a-Judge often suffers from increased inference latency. Agentic evaluation requires sequential reasoning steps, external tool calls, or multi-agent communication, each of which introduces additional delays.
This latency can be particularly problematic in real-time or interactive settings, such as online model evaluation, user-facing content moderation, or reinforcement learning loops where rapid feedback is required. As a result, there exists a tension between evaluation reliability and practical deployment constraints, where more thorough agentic judgment may not be feasible under strict latency budgets.

Safety. While Agent-as-a-Judge is designed to improve evaluation robustness, it also raises new safety concerns. Tool-augmented judges may access external systems such as search engines, code executors, or databases, which expands the attack surface for prompt injection, tool misuse, or unintended side effects. Multi-agent collaboration can further amplify risks if unsafe behaviors propagate across agents or if adversarial interactions emerge. Moreover, when judge agents are used to provide reward signals for model optimization, systematic biases or errors in agentic judgment may be reinforced and amplified during training, leading to unintended model behaviors.

Privacy. Agent-as-a-Judge also introduces privacy challenges, particularly in settings that involve persistent memory or personalized evaluation. To maintain consistency or adapt judgments to specific users or contexts, judge agents may store intermediate states, user information, or historical interaction data. If not carefully designed, such memory mechanisms can increase the risk of sensitive data leakage or unauthorized inference about user attributes.
This issue becomes more pronounced in professional domains such as medicine, law, or education, where evaluation often relies on confidential or personally identifiable information.

5.2 Future Directions

Personalization. Current Agent-as-a-Judge systems are constrained by static, one-size-fits-all evaluation criteria, failing to align with diverse individual preferences. To bridge this gap, future research should focus on enhancing the autonomy and adaptivity of judge agents. A critical enabler is proactive memory management: rather than passively retrieving history, agents must actively manage the lifecycle of user-specific knowledge, autonomously deciding when to register new preferences, update evolving standards, or prune obsolete feedback. This agentic control transforms memory into a dynamic belief system, allowing the judge to continuously refine its criteria and maintain alignment with the user's specific values and usage contexts.

Generalization. Current systems rely on predefined rubrics constructed offline, limiting their ability to generalize across diverse or open-ended tasks. Future judge agents should leverage planning capabilities to dynamically discover and adapt evaluation criteria. 1) Context-Aware Rubric Generation: agents should synthesize evaluation criteria on the fly by analyzing the specific intent and complexity of responses, identifying relevant assessment dimensions not anticipated during design. 2) Adaptive Multi-Granularity Scoring: rubrics should dynamically scale based on task difficulty, applying high-level holistic criteria for straightforward tasks while decomposing into fine-grained sub-rubrics for complex workflows.

Interactivity. Current systems operate as passive, one-way observers. Future agents should evolve into interactive evaluators that actively engage with both the environment and human stakeholders.
1) Interactive Environmental Feedback: instead of static test suites, judge agents should dynamically tailor evaluation trajectories, autonomously escalating task complexity or isolating edge cases to rigorously probe the evaluand's failure boundaries. 2) Human-Agent Collaborative Calibration: to address subjective or ambiguity-rich scenarios, agents should leverage human-in-the-loop mechanisms. By proactively consulting experts to verify intent or resolve conflicts, the judge refines its criteria through multi-turn alignment, ensuring higher trust and interpretability.

Optimization. Current approaches predominantly rely on inference-time engineering, which is fundamentally bottlenecked by the fixed capabilities of frozen backbones. To transcend these limits, the field must pivot towards training-based optimization. This paradigm shift entails two key levels: 1) Individual Capability: utilizing reinforcement learning (RL) to internalize complex agentic behaviors, such as sequential planning and adaptive tool use, that are difficult to elicit via prompting alone. 2) Learned Coordination: extending optimization to multi-agent settings. Rather than ad-hoc inference collaboration, agents should be trained with joint objectives to intrinsically learn effective communication and consensus strategies.

Concluding Remarks: Towards True Autonomy. As characterized in Section 2, existing implementations exhibit varying degrees of agency. The future directions discussed above (personalization, generalization, interactivity, and optimization) collectively point toward an evolutionary trajectory of increasing autonomy.
The next generation of judge agents must transcend fixed protocols to become genuinely agentic entities capable of self-directed adaptation, active context curation, and continuous self-refinement, ultimately realizing the full potential of agents that actively perceive, reason, and evolve alongside the models they assess.

6 Conclusion

This paper provides the first comprehensive survey of Agent-as-a-Judge. We established a novel taxonomy and demonstrated how agentic capabilities, including multi-agent collaboration, autonomous planning, tool integration, and memory, overcome the limitations of naive LLM judges to deliver more robust, verifiable, and nuanced judgments across general and professional domains. While promising, this evolution presents challenges in computational cost, latency, safety, and privacy. Future progress should prioritize personalization, generalization, and optimization, ultimately realizing truly autonomous evaluators that continuously adapt to the evolving AI landscape.

Limitations

Early Stage of Paradigm Consensus. As a pioneering survey exploring the evolution of Agent-as-a-Judge, this study faces the challenge that the field has not yet gained widespread recognition in academia. Although the transition from LLM-as-a-Judge to Agent-as-a-Judge has begun to take shape, there is still a lack of long-term consensus regarding the definition of evaluation agents. Nevertheless, establishing this foundational framework is essential to orienting future research. We are committed to iteratively refining this taxonomy as the paradigm matures and gains broader recognition.

Inclusion of Early Prompting Methods. We acknowledge a potential gap between early methodologies and the increasingly rigorous definitions of agents.
Many pioneering works in automated evaluation, though labeled as "agents", rely heavily on prompt engineering, such as fixed role-play, which may not align with the strict criteria for autonomy, dynamic planning, or tool use held by the current community. Nevertheless, we deliberately include these prompt-based frameworks because they represent the initial shift from monolithic inference toward dynamic decomposition and self-evolving systems. Excluding them would obscure this transition, compromising a complete understanding of the field's evolution.

Ethics Statement

This work does not involve the use or creation of datasets or scientific artifacts that would require specific ethical clearance, data privacy considerations, or licensing agreements. We believe this work adheres to the ethical guidelines of the conference and poses no immediate negative social impact.

References

[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[2] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025.

[3] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback.
In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org, 2024.

[4] Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 11065–11082, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.658. URL https://aclanthology.org/2024.findings-acl.658/.

[5] Hanyu Lai, Xiao Liu, Junjie Gao, Jiale Cheng, Zehan Qi, Yifan Xu, Shuntian Yao, Dan Zhang, Jinhua Du, Zhenyu Hou, Xin Lv, Minlie Huang, Yuxiao Dong, and Jie Tang. A survey of post-training scaling in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2771–2791, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.140. URL https://aclanthology.org/2025.acl-long.140/.

[6] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025.

[7] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440–9450, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
doi: 10.18653/v1/2024.acl-long.511. URL https://aclanthology.org/2024.acl-long.511/.

[8] Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, and Juanzi Li. Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15934–15949, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.775. URL https://aclanthology.org/2025.acl-long.775/.

[9] Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, and Derek F. Wong. HiMATE: A hierarchical multi-agent framework for machine translation evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11121–11145, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.593. URL https://aclanthology.org/2025.findings-emnlp.593/.

[10] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FQepisCUWu.

[11] Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, and Ting Liu. Teaching language models to evolve with users: Dynamic profile modeling for personalized alignment.
In Proceedings of the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025), 2025.

[12] Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, and Jiaqi Wang. ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning, 2025.

[13] Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, and Dakuo Wang. Multi-agent-as-judge: Aligning LLM-agent-based automated evaluation with multi-dimensional human evaluation. In First Workshop on Multi-Turn Interactions in Large Language Models, 2025. URL https://openreview.net/forum?id=7AetgL7eVL.

[14] Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, and Xiaoming Zhai. AutoSCORE: Enhancing automated scoring with multi-agent large language models via structured component recognition. arXiv preprint arXiv:2509.21910, 2025.

[15] Yucheng Chu, Hang Li, Kaiqi Yang, Harry Shomer, Hui Liu, Yasemin Copur-Gencturk, and Jiliang Tang. An LLM-powered automatic grading framework with human-level guidelines optimization. arXiv preprint arXiv:2410.02165, 2024.

[16] Wenjing Xie, Juxin Niu, Chun Jason Xue, and Nan Guan. Grade like a human: Rethinking automated assessment with large language models. arXiv preprint arXiv:2405.19694, 2024.

[17] Rui Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, and Zhengwen Qiu. FinResearchBench: A logic tree based agent-as-a-judge evaluation framework for financial research agents. In Proceedings of the 6th ACM International Conference on AI in Finance, pages 656–664, 2025.

[18] Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng, Chao Wang, Tianhui Tan, Xuan Yao, Pengyang Shao, Min Xu, et al. FinDeepResearch: Evaluating deep research agents in rigorous financial analysis.
arXiv preprint arXiv:2510.13936, 2025.

[19] Zichen Chen, Jiaao Chen, Jianda Chen, and Misha Sra. Standard benchmarks fail: Auditing LLM agents in finance must prioritize risk, 2025. URL https://arxiv.org/abs/2502.15865.

[20] Zichen Chen, Jianda Chen, Jiaao Chen, and Misha Sra. From tasks to teams: A risk-first evaluation framework for multi-agent LLM systems in finance. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025.

[21] Zhitao He, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Kang Liu, and Jun Zhao. AgentsCourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9399–9416, 2024.

[22] Prathamesh Devadiga, Omkaar Jayadev Shetty, and Pooja Agarwal. SAMVAD: A multi-agent system for simulating judicial deliberation dynamics in India. arXiv preprint arXiv:2509.03793, 2025.

[23] Cong Jiang and Xiaolei Yang. AgentsBench: A multi-agent LLM simulation framework for legal judgment prediction. Systems, 13(8):641, 2025.

[24] Zhenxuan Zhang, Kinhei Lee, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C. Marshall, Yingying Fang, and Guang Yang. GEMA-Score: Granular explainable multi-agent score for radiology report evaluation. CoRR, abs/2503.05347, March 2025. URL https://doi.org/10.48550/arXiv.2503.05347.

[25] Hengguan Huang, Songtao Wang, Hongfu Liu, Hao Wang, and Ye Wang. Benchmarking large language models on communicative medical coaching: A dataset and a novel system. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1624–1637, 2024.

[26] Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang.
AI Hospital: Benchmarking large language models in a multi-agent medical interaction simulator. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213, 2025.

[27] Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, and Min Zhang. A unified agentic framework for evaluating conditional image generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12626–12646, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.620. URL https://aclanthology.org/2025.acl-long.620/.

[28] Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation Agent: Efficient and promptable evaluation framework for visual generative models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7561–7582, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.374. URL https://aclanthology.org/2025.acl-long.374/.

[29] Alimohammad Beigi, Bohan Jiang, Dawei Li, Tharindu Kumarage, Zhen Tan, Pouya Shaeri, and Huan Liu. LRQ-Fact: LLM-generated relevant questions for multimodal fact-checking. arXiv preprint arXiv:2410.04616, 2024. URL https://arxiv.org/abs/2410.04616.

[30] Elad Levi and Ilan Kadar. IntellAgent: A multi-agent framework for evaluating conversational AI systems. arXiv preprint arXiv:2501.11067, 2025.

[31] Navid Madani and Rohini Srihari. ESC-Judge: A framework for comparing emotional support conversational agents.
In Christos Christodoulopoulos,\nTanmoy Chakraborty, Carolyn Rose, and Violet Peng,\neditors, Proceedings of the 2025 Conference on Empir-\nical Methods in Natural Language Processing, pages\n16059–16076, Suzhou, China, November 2025. As-\nsociation for Computational Linguistics. ISBN 979-8-\n89176-332-6. doi: 10.18653/v1/2025.emnlp-main.\n811. URL https://aclanthology.org/2025.\nemnlp-main.811/.\n[32] Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong\nWang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue\nWang, Fanghua Ye, Jian Li, et al. Sentient agent\nas a judge: Evaluating higher-order social cog-\nnition in large language models.\narXiv preprint\narXiv:2505.02847, 2025.\n[33] Jingoo Lee, Kyungho Lim, Young-Chul Jung, and\nByung-Hoon Kim. Psyche: A multi-faceted patient\nsimulation framework for evaluation of psychiatric\nassessment conversational agents. arXiv preprint\narXiv:2501.01594, 2025.\n[34] Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan\nZhang, Jing Ma, See-Kiong Ng, and Tat-Seng Chua.\nFACT-AUDIT: An adaptive multi-agent framework\nfor dynamic fact-checking evaluation of large lan-\nguage models. In Wanxiang Che, Joyce Nabende,\nEkaterina Shutova, and Mohammad Taher Pile-\nhvar, editors, Proceedings of the 63rd Annual Meet-\ning of the Association for Computational Linguis-\ntics (Volume 1: Long Papers), pages 360–381, Vi-\nenna, Austria, July 2025. Association for Computa-\ntional Linguistics. ISBN 979-8-89176-251-0. doi:\n10.18653/v1/2025.acl-long.17.\nURL https://\naclanthology.org/2025.acl-long.17/.\n[35] Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Nu-\nmaan Naeem, Muhammad Ahsan Riaz Khan, Arham\nRiaz, Muhammad Arslan Manzoor, Yuxia Wang, and\nPreslav Nakov.\nUrduFactCheck: An agentic fact-\nchecking framework for Urdu with evidence boost-\ning and benchmarking. 
In Christos Christodoulopou-\nlos, Tanmoy Chakraborty, Carolyn Rose, and Violet\nPeng, editors, Findings of the Association for Com-\nputational Linguistics: EMNLP 2025, pages 22788–\n22802, Suzhou, China, November 2025. Association\nfor Computational Linguistics. ISBN 979-8-89176-\n335-7.\ndoi: 10.18653/v1/2025.findings-emnlp.\n1240.\nURL https://aclanthology.org/\n2025.findings-emnlp.1240/.\n[36] Yeonseok Jeong, Minsoo Kim, Seung-won Hwang,\nand Byung-Hak Kim.\nAgent-as-judge for fac-\ntual summarization of long narratives.\nIn Chris-\ntos Christodoulopoulos, Tanmoy Chakraborty, Car-\nolyn Rose, and Violet Peng, editors, Proceed-\nings of the 2025 Conference on Empirical Meth-\nods in Natural Language Processing, pages 23602–\n13\n\nA Survey on Agent-as-a-Judge\n23619, Suzhou, China, November 2025. Associa-\ntion for Computational Linguistics.\nISBN 979-8-\n89176-332-6. doi: 10.18653/v1/2025.emnlp-main.\n1204.\nURL https://aclanthology.org/\n2025.emnlp-main.1204/.\n[37] Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli\nBai, Xin Shen, and Farzan Farnia.\nHermes: To-\nwards efficient and verifiable mathematical reason-\ning in llms, 2025.\nURL https://arxiv.org/\nabs/2511.18760.\n[38] Jiuzhou Han, Wray Buntine, and Ehsan Shareghi.\nVerifiAgent: a unified verification agent in language\nmodel reasoning. In Findings of the Association for\nComputational Linguistics: EMNLP 2025, 2025.\n[39] Shudong Liu, Hongwei Liu, Junnan Liu, Linchen\nXiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wen-\nwei Zhang, Derek F. Wong, Songyang Zhang, and\nKai Chen.\nCompassVerifier: A unified and ro-\nbust verifier for LLMs evaluation and outcome re-\nward.\nIn Christos Christodoulopoulos, Tanmoy\nChakraborty, Carolyn Rose, and Violet Peng, edi-\ntors, Proceedings of the 2025 Conference on Empiri-\ncal Methods in Natural Language Processing, pages\n33454–33482, Suzhou, China, November 2025. As-\nsociation for Computational Linguistics. ISBN 979-8-\n89176-332-6. 
doi: 10.18653/v1/2025.emnlp-main.\n1698.\nURL https://aclanthology.org/\n2025.emnlp-main.1698/.\n[40] Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao\nZhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan\nYang, and Zhiyu Li. xverify: Efficient answer veri-\nfier for reasoning model evaluations. arXiv preprint\narXiv:2504.10481, 2025.\n[41] Shalev Lifshitz, Sheila A McIlraith, and Yilun\nDu.\nMulti-agent verification: Scaling test-time\ncompute with multiple verifiers.\narXiv preprint\narXiv:2502.20379, 2025.\n[42] Kexin Huang, Ying Jin, Ryan Li, Michael Y Li, Em-\nmanuel Candes, and Jure Leskovec. Automated hy-\npothesis validation with agentic sequential falsifi-\ncations. In Proceedings of the 42nd International\nConference on Machine Learning, 2025.\n[43] Hongliu Cao, Ilias Driouich, Robin Singh, and Eoin\nThomas. Multi-agent llm judge: automatic person-\nalized llm judge design for evaluating natural lan-\nguage generation applications, 2025.\n[44] Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan,\nTianwei Chen, Dario Bernal, Huitian Jiao, and\nHM Hossain. Sageval: The frontiers of satisfactory\nagent based nlg evaluation for reference-free open-\nended text. arXiv preprint arXiv:2411.16077, 2024.\n[45] Jiatao Li, Mao Ye, Cheng Peng, Xunjian Yin, and\nXiaojun Wan. Agent-x: Adaptive guideline-based\nexpert network for threshold-free ai-generated text\ndetection. arXiv preprint arXiv:2505.15261, 2025.\n[46] Michael J. Ryan, Omar Shaikh, Aditri Bhagirath,\nDaniel Frees, William Held, and Diyi Yang. Syn-\nthesizeMe! inducing persona-guided prompts for\npersonalized reward models in LLMs. In Proceed-\nings of the 63rd Annual Meeting of the Association for\nComputational Linguistics (Volume 1: Long Papers),\n2025.\n[47] Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan,\nCarl Yang, and Hongkun Yu. Incentivizing agentic\nreasoning in llm judges via tool-integrated reinforce-\nment learning, 2025.\n[48] Mingchen Zhuge, Changsheng Zhao, Dylan R. 
Ash-\nley, Wenyi Wang, Dmitrii Khizbullin, Yunyang\nXiong, Zechun Liu, Ernie Chang, Raghuraman Krish-\nnamoorthi, Yuandong Tian, Yangyang Shi, Vikas\nChandra, and Jürgen Schmidhuber.\nAgent-as-a-\njudge: Evaluate agents with agents. In Forty-second\nInternational Conference on Machine Learning, 2025.\n[49] Mengdi Li, Guanqiao Chen, Xufeng Zhao, Haochen\nWen, Shu Yang, and Di Wang. Persrm-r1: Enhance\npersonalized reward modeling with reinforcement\nlearning, 2025.\n[50] Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell,\nStefano\nErmon,\nTatsunori\nHashimoto,\nArchit\nSharma, and Chelsea Finn. FSPO: Few-shot prefer-\nence optimization of synthetic preference data elicits\nLLM personalization to real users. In 2nd Workshop\non Models of Human Feedback for AI Alignment, 2025.\n[51] Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu,\nand Cuiyun Gao. Codevisionary: An agent-based\nframework for evaluating large language models in\ncode generation, 2025.\n[52] Yu Li, Shenyu Zhang, Rui Wu, Xiutian Huang,\nYongrui Chen, Wenhao Xu, Guilin Qi, and De-\nhai Min.\nMateval:\nA multi-agent discussion\nframework for advancing open-ended text evalu-\nation.\nIn Database Systems for Advanced Appli-\ncations:\n29th International Conference, DASFAA\n2024, Gifu, Japan, July 2-5, 2024, Proceedings,\nPart VII, page 415–426, Berlin, Heidelberg, 2024.\nSpringer-Verlag. ISBN 978-981-97-5574-5. doi: 10.\n1007/978-981-97-5575-2_31. URL https://doi.\norg/10.1007/978-981-97-5575-2_31.\n[53] Manya Wadhwa, Zayne Rea Sprague, Chaitanya\nMalaviya, Philippe Laban, Junyi Jessy Li, and Greg\nDurrett.\nEvalagents: Discovering implicit evalu-\nation criteria from the web.\nIn Second Confer-\nence on Language Modeling, 2025.\nURL https:\n//openreview.net/forum?id=erGpkHCybv.\n[54] Kaishuai Xu, Tiezheng Yu, Yi Cheng, Wenjun Hou,\nLiangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, and\n14\n\nA Survey on Agent-as-a-Judge\nWenjie Li. Learning to align multi-faceted evalua-\ntion: A unified and robust framework. 
In Wanxi-\nang Che, Joyce Nabende, Ekaterina Shutova, and\nMohammad Taher Pilehvar, editors, Findings of the\nAssociation for Computational Linguistics: ACL 2025,\npages 9488–9502, Vienna, Austria, July 2025. Asso-\nciation for Computational Linguistics. ISBN 979-8-\n89176-256-5. doi: 10.18653/v1/2025.findings-acl.\n494. URL https://aclanthology.org/2025.\nfindings-acl.494/.\n[55] MohammadHossein Rezaei, Robert Vacareanu, Zi-\nhao Wang, Clinton Wang, Bing Liu, Yunzhong He,\nand Afra Feyza Akyürek.\nOnline rubrics elicita-\ntion from pairwise comparisons.\narXiv preprint\narXiv:2510.07284, 2025.\n[56] Zhaopeng Feng, Jiayuan Su, Jiamei Zheng, Jia-\nhan Ren, Yan Zhang, Jian Wu, Hongwei Wang,\nand Zuozhu Liu. M-MAD: Multidimensional multi-\nagent debate for advanced machine translation\nevaluation.\nIn Wanxiang Che, Joyce Nabende,\nEkaterina Shutova, and Mohammad Taher Pile-\nhvar, editors, Proceedings of the 63rd Annual Meet-\ning of the Association for Computational Linguistics\n(Volume 1: Long Papers), pages 7084–7107, Vi-\nenna, Austria, July 2025. Association for Compu-\ntational Linguistics. ISBN 979-8-89176-251-0. doi:\n10.18653/v1/2025.acl-long.351. URL https://\naclanthology.org/2025.acl-long.351/.\n[57] Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang,\nXiang Liu, and Xuming Hu.\nCafes:\nA col-\nlaborative\nmulti-agent\nframework\nfor\nmulti-\ngranular multimodal essay scoring. arXiv preprint\narXiv:2505.13965, 2025.\n[58] Sandeep Kumar, Abhijit A Nargund, and Vivek Srid-\nhar. Courteval: A courtroom-based multi-agent eval-\nuation framework. In Findings of the Association for\nComputational Linguistics: ACL 2025, pages 25875–\n25887, 2025.\n[59] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang,\nRuochen Xu, and Chenguang Zhu. G-eval: NLG eval-\nuation using gpt-4 with better human alignment. 
In\nHouda Bouamor, Juan Pino, and Kalika Bali, edi-\ntors, Proceedings of the 2023 Conference on Empirical\nMethods in Natural Language Processing, pages 2511–\n2522, Singapore, December 2023. Association for\nComputational Linguistics. doi: 10.18653/v1/2023.\nemnlp-main.153. URL https://aclanthology.\norg/2023.emnlp-main.153/.\n[60] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang,\nShayne Longpre,\nHwaran Lee,\nSangdoo Yun,\nSeongjin Shin, Sungdong Kim, James Thorne, et al.\nPrometheus: Inducing fine-grained evaluation capa-\nbility in language models. In The Twelfth Interna-\ntional Conference on Learning Representations, 2023.\n[61] Lianghui Zhu, Xinggang Wang, and Xinlong Wang.\nJudgeLM: Fine-tuned large language models are scal-\nable judges. In The Thirteenth International Confer-\nence on Learning Representations, 2025. URL https:\n//openreview.net/forum?id=xsELpEPn4A.\n[62] Yougang Lyu, Shijie Ren, Yue Feng, Zihan Wang,\nZhumin Chen, Zhaochun Ren, and Maarten de Rijke.\nSelf-adaptive cognitive debiasing for large language\nmodels in decision-making, 2025. URL https://\narxiv.org/abs/2504.04141.\n[63] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang\nHu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Ji-\nahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of\nself-evolving agents: On path to artificial super in-\ntelligence. arXiv preprint arXiv:2507.21046, 2025.\n[64] Mahnaz Koupaee, Jake W. Vincent, Saab Mansour,\nIgor Shalyminov, Han He, Hwanjun Song, Raphael\nShu, Jianfeng He, Yi Nian, Amy Wing-mei Wong,\nKyu J. Han, and Hang Su. Faithful, unfaithful or am-\nbiguous? multi-agent debate with initial stance for\nsummary evaluation. In Luis Chiruzzo, Alan Ritter,\nand Lu Wang, editors, Proceedings of the 2025 Confer-\nence of the Nations of the Americas Chapter of the As-\nsociation for Computational Linguistics: Human Lan-\nguage Technologies (Volume 1: Long Papers), pages\n12209–12246, Albuquerque, New Mexico, April\n2025. 
Association for Computational Linguistics.\nISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.\nnaacl-long.609. URL https://aclanthology.\norg/2025.naacl-long.609/.\n[65] Taaha Kazi, Ruiliang Lyu, Sizhe Zhou, Dilek\nHakkani-Tür, and Gokhan Tur.\nLarge language\nmodels as user-agents for evaluating task-oriented-\ndialogue systems. In 2024 IEEE Spoken Language\nTechnology Workshop (SLT), pages 913–920. IEEE,\n2024.\n[66] Ning Wu, Ming Gong, Linjun Shou, Shining Liang,\nand Daxin Jiang. Large language models are diverse\nrole-players for summarization evaluation. In CCF\ninternational conference on natural language process-\ning and Chinese computing, pages 695–707. Springer,\n2023.\n15\n", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** A Survey on Agent-as-a-Judge\n\n**Background:** The rapid advancement of Large Language Models (LLMs) has revolutionized AI evaluation, enabling scalable and near-human quality assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n\n**Problem Statement:** The limitations of LLM-as-a-Judge have led to the transition to Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Contributions:**\n\n1. **Characterizing the shift from LLM-as-a-Judge to Agent-as-a-Judge**: The authors identify and summarize the agentic judges' development trend into three progressive stages with distinct levels of autonomy and adaptability.\n2. **Organizing core methodologies**: The authors categorize the core methodologies into five key parts according to the agent's abilities and survey their applications across general and professional domains.\n3. 
**Analyzing frontier challenges and identifying promising research directions**:", + "summary_2": "The research paper discusses the evolution of AI evaluation systems from Large Language Models (LLM)-as-a-Judge to Agent-as-a-Judge. The authors identify the limitations of LLM-as-a-Judge, which include inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has led to the development of Agent-as-a-Judge, which employs planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations.\n\n**Key Points:**\n\n1. LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments, but its reliability is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations.\n2. Agent-as-a-Judge has emerged as a more robust and verifiable alternative, leveraging planning, tool-augmented verification, multi-agent collaboration, and persistent memory.\n3. The authors present the first comprehensive survey on Agent-as-a-Judge, summarizing the shift from LLM-as-a-Judge to Agent-as-a-Judge and organizing core methodologies into five key parts according to agent's abilities.\n4. The authors survey applications across general and professional domains and analyze frontier challenges and identify" + }, + { + "arxiv_id": "2601.05163v1", + "title": "Paper 2601.05163v1", + "abstract": "Document Question Answering (DocQA) fo- cuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and pro- pose a tool-driven agent framework that ex- plicitly models document exploration and com- prehension. 
To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the resulting models show their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for the agentic tool design and synthetic data.", + "text": "DocDancer: Towards Agentic Document-Grounded Information Seeking\nQintong Zhang♡∗, Xinjie Lv♡∗, Jialong Wu♡∗, Baixuan Li∗, Zhengwei Tao♡, Guochen Yan♡, Huanyao Zhang♡, Bin Wang♢, Jiahao Xu♣, Haitao Mi♣, Wentao Zhang♡†\n♡Peking University, ♢Shanghai AI Lab, ♣Tencent AI Lab\nwujialongml@gmail.com, wentao.zhang@pku.edu.cn\nAbstract\nDocument Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the resulting models show their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for the agentic tool design and synthetic data.\n1 Introduction\nUnderstanding and answering questions over long, multi-modal documents is a critical capability for real-world intelligent systems (Tkaczyk et al., 2015; Liu et al., 2025b).
Document Question Answering (DocQA) lies at the core of document-centric intelligence, enabling models to access, reason over, and synthesize information from complex and heterogeneous document sources.\nExisting DocQA methods can be broadly categorized into three paradigms. The first paradigm relies on optical character recognition (OCR) to convert documents into plain text, which is then processed by downstream language models (Xu et al., 2020). The second paradigm adopts embedding-based retrieval mechanisms, most commonly instantiated through retrieval-augmented generation (RAG), to identify and incorporate relevant document segments during inference (Saad-Falcon et al., 2024). More recently, agent-based paradigms have gained increasing attention, as they better support complex scenarios that require iterative exploration, tool invocation, and multi-step reasoning over long and structured documents (Sun et al., 2025a; Zhu et al., 2025). Recent advances in large language models (LLMs) (Team, 2025; Liu et al., 2025a) enable such agents to dynamically decompose queries, interact with documents, and adapt to intermediate observations, alleviating the limitations of OCR- and RAG-based approaches.\n*Equal Contributions. Jialong Wu is the project leader. †Corresponding Author.\nFigure 1: The overall of DocDancer for document-grounded information seeking, with search and read tools for effective document retrieval and comprehension over processed documents.
Despite their promise, existing DocQA agents are typically implemented as prompt-based pipelines, with limited learning of autonomous agentic behaviors.\nIn contrast, we aim to train the first end-to-end DocQA agent model that is explicitly grounded in information-seeking principles, moving beyond prompt-based agent designs. We first formulate DocQA as an agentic information-seeking problem and design a tool-centric agent framework that decomposes document understanding into two complementary capabilities.\narXiv:2601.05163v1 [cs.CL] 8 Jan 2026\nSpecifically, we introduce efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations. Notably, when instantiated with a proprietary LLM, our framework achieves state-of-the-art performance and exceeds reported human-level performance.\nFurthermore, a key bottleneck in training such agent models is the scarcity of high-quality DocQA pairs (Huang et al., 2025), as most publicly available datasets provide only test splits and lack sufficiently annotated training data. To address this challenge, we propose an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. Specifically, we first explore a source document through intent-guided, tool-augmented interactions to collect grounded evidence (the Exploration stage), and then synthesize high-quality document-grounded QA pairs via multi-observation reasoning (the Synthesis stage). We then train our DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507 (Team, 2025).
Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\nExtensive experiments are conducted on two long-context document understanding benchmarks, MMLongBench-Doc (Ma et al., 2024) and DocBench (Zou et al., 2025). The results demonstrate the effectiveness of the proposed DocDancer. Further analyses provide insights into document parsing strategies, tool design, and the role of synthetic data in agent learning. In summary, our contributions are three-fold:\n• Effective Agentic DocQA Framework: We propose a tool-driven DocQA agent framework grounded in information-seeking principles, which achieves SOTA performance when paired with a proprietary LLM.\n• Autonomous Data Synthesis Pipeline: We introduce an Exploration-then-Synthesis data synthesis pipeline that generates high-quality training data for learning agentic behaviors.\n• Empirical Performance: Our method achieves state-of-the-art results and provides practical insights into effective and efficient agentic system design.\n2 Related Work\nDocument Question Answering Methods. Traditional DocQA methods rely on OCR-based pipelines (Ding et al., 2022) or end-to-end vision–language models (Sukh, 2025; Hu et al., 2025), but both are constrained by limited input length and struggle with long documents (Ma et al., 2024; Zou et al., 2025; Dong et al., 2025a). Retrieval-augmented generation (Zhang et al., 2024; Dong et al., 2025a,b) improves scalability, yet most approaches decouple retrieval and reasoning in a single-shot manner, making them brittle to retrieval errors and ineffective for complex, multi-step queries (Zhang et al., 2025).
Recent agent-based DocQA systems (Wu et al., 2025c; Sun et al., 2025a; Dong et al., 2025c) address these issues through iterative document navigation and reading, but they predominantly depend on prompt-engineered, closed-source LLMs. In this work, we aim to train an open-source document agent with learnable behaviors for robust and scalable DocQA.\nSynthetic Data for Agent Training. High-quality training data is critical for training agents. Due to its scalability, rapid iteration, and inherent trainability, synthetic data offers significant advantages over manually annotated data, serving as a highly effective alternative to human-labeled datasets for agent learning (Liu et al., 2025a; Team et al., 2025b). Prior work has demonstrated that large-scale agent-synthesized data can be effectively generated for search agents (Wu et al., 2025a; Li et al., 2025b; Tao et al., 2025), code agents (Yang et al., 2025), GUI agents (Sun et al., 2025b; Guo et al., 2025a) and general-purpose agents (Fang et al., 2025; Prabhakar et al., 2025). In contrast, this work focuses on the DocQA agent setting. Existing DocQA datasets are primarily constructed through semi-automated (Van Landeghem et al., 2023; Dong et al., 2025b) or expert-annotated (Hendrycks et al., 2021; Deng et al., 2025) processes, both of which require substantial human involvement or result in questions that lack sufficient depth. Inspired by advances in search agents, we formulate DocQA as an agentic information-seeking problem, with the goal of synthesizing high-quality training data tailored for DocQA agents.\n3 Methods\n3.1 Agent Setup\nFramework. We adopt the vanilla ReAct (Yao et al., 2022) as the agent’s framework, which synergizes reasoning and acting. In this paradigm, the agent generates both a reasoning trace (thought), τ, and a subsequent action, a, in an interleaved manner.
This process forms a trajectory, HT, which is a sequence of thought-action-observation triplets:\nHT = (τ0, a0, o0, . . . , τi, ai, oi, . . . , τT, aT),  (1)\nwhere aT represents the final answer to the given task. At any given step t ≤ T, the agent’s policy, π, generates the current thought τt and action at based on the history of all previous interactions, Ht−1:\nτt, at ∼ π(· | Ht−1).  (2)\nInspired by The Bitter Lesson (Sutton, 2019), we employ a single-agent setup with carefully selected, highly effective tools, rather than relying on multi-agent designs or test-time scaling.\nDocument Processing. Prior works (Sun et al., 2025a) adopt an XML-based hierarchical representation for document outlines that organizes parsed content into nested trees, using sections as partitioning units and elements such as text, images, and tables as nodes. While this structure enables efficient positioning and search, it suffers from structural and content inaccuracies and does not incorporate retrieval-aware visual information, which limits its applicability to agent-based processing of long, visually rich documents. To address these issues, we substantially enhance the document outline. For content accuracy, we leverage MinerU2.5 (Niu et al., 2025) for high-precision layout analysis and extraction, defining 17 element types and enriching outline nodes with layout and semantic attributes while removing structurally irrelevant elements such as headers and footers. For structural accuracy, title elements are visually cropped and clustered to infer hierarchical levels, enabling fine-grained section segmentation and reducing information loss in long documents. To improve visual retrieval, we generate captions for images and charts using a multimodal model Mm and incorporate them as auxiliary information, allowing the outline to better align and retrieve visual content.\nTool Design. 
We point out that DocQA can be naturally formulated as an agentic information-seeking task in which the external information source is restricted to the given documents. Accordingly, our tool design aims to enable agents to efficiently and effectively locate and extract relevant information from documents, while keeping the overall toolkit complexity low to ensure ease of use for agent models. Specifically, we design the following two tools for DocDancer:\n• Search. Conducts keyword-based full-text search over the given documents, returning the section IDs, page numbers, and surrounding text snippets for each match. A visible window is used to constrain the snippet length for efficient localization. This tool provides the agent with global textual signals for guiding subsequent information access.\n• Read. Given a goal and a set of section IDs, the tool performs fine-grained reading to extract goal-relevant information from the specified sections. This includes (i) local textual information, consisting of all text within the section; (ii) local visual information, consisting of images and tables within the section, together with a page-level screenshot that captures the full layout of the page containing the section. Subsequently, a multimodal summarization model Mm is used as an auxiliary reader to jointly integrate textual and visual inputs and return consolidated goal-relevant content.\nThis design deliberately integrates textual and visual signals, capturing both localized evidence and global layout cues, while keeping the toolkit limited to two tools to facilitate efficient utilization.\n3.2 Data Synthesis\nIt is crucial to curate complex and diverse DocQA pairs that are capable of eliciting multi-step reasoning, goal decomposition, and rich interaction trajectories.
To this end, we first construct a broad and heterogeneous collection of PDF documents to serve as the grounding corpus for question answering. We then synthesize QA pairs based on these documents, ensuring coverage of diverse reasoning patterns and document structures.\nSources. To construct a robust and diverse dataset for document-based question answering, we select four representative datasets, LongDocURL (Deng et al., 2025), MMDocRAG (Dong et al., 2025b), CUAD (Hendrycks et al., 2021) and DUDE (Van Landeghem et al., 2023), that cover long-context understanding, multimodal retrieval, legal expertise, and complex layout analysis.\nFigure 2: Overall of the Exploration-then-Synthesis framework. (i) Exploration stage iteratively interacts with the source document through Action(u)–Observation(y)–Intent(i) steps. (ii) Synthesis stage aggregates the collected evidence to generate the final question and answer. We present a concrete case illustrating the whole generation process in Appendix A.
These sources provide the foundational PDF documents used for our automated QA generation pipeline. The distribution of the collected PDF documents is illustrated in Figure 3.\nFigure 3: Distribution of documents used for synthesis (Report 35.0%, Law 18.6%, Academic 13.1%, Guidebook 11.0%, Contact 8.2%, Financial 5.0%, Brochure 3.3%, Industry 1.2%, News 3.0%).\nExploration-then-Synthesis Framework. We propose a two-stage framework for DocQA generation, consisting of an Exploration Stage and a Synthesis Stage, as shown in Figure 2. The overall objective is to transform a source document into a diverse and high-quality set of grounded QA pairs through iterative interaction and reasoning.\nExploration Stage. Given a source document D, we utilize an LLM Me to iteratively interact with D and collect information relevant to potential QA pairs. Conditioned on the interaction history ht and the document D, the model Me jointly generates an intent-action pair (it, ut):\n(it, ut) ∼ πMe(i, u | ht, D),  (3)\nwhere it denotes the exploration intent and ut ∈ A corresponds to invoking a document-grounded tool such as Search or Read, which is the same as the agent’s tool action. The construction of a question implicitly induces the strategy required to resolve it. The explicit modeling of intent helps prevent uninformative exploration, guiding the agent toward more concrete, goal-directed trajectories (Pahuja et al., 2025). Executing action ut yields an observation:\nyt = T(ut, D),  (4)\nwhere T denotes the document interaction interface. The interaction history is then updated as:\nht+1 = ht ∪ {(it, ut, yt)},  (5)\nand the intent it+1 may be revised based on the newly acquired information.\nThis process is repeated for multiple steps, enabling the agent to progressively refine its understanding of the document and uncover diverse and informative content.
Explicit intent modeling allows for flexible, open-ended exploration, permitting additional interactions when necessary. The output of the exploration stage is a trajectory

$\xi = \{(i_t, u_t, y_t)\}_{t=1}^{T}, \quad (6)$

which serves as structured evidence for downstream QA generation.

Each exploration step can be viewed as a random walk over the knowledge graph implicitly embedded in the document. When the number of such walks is sufficiently large, this process can, in principle, reconstruct the underlying document-level knowledge graph in reverse. This idea is conceptually aligned with prior work on knowledge-graph-based QA generation for web search agents (Li et al., 2025b,a). We do not explicitly construct a document-level knowledge graph in advance, as doing so would incur substantial engineering complexity and overhead. Instead, our method adopts a lightweight design that is nevertheless capable of generating challenging DocQA pairs, achieving a better trade-off between efficiency and effectiveness.

Synthesis Stage. Given the exploration trajectory $\xi$, the agent enters the synthesis stage to generate document-grounded QA pairs. A synthesis model $M_s$ reasons over the accumulated observations and generates a QA pair:

$(q, a) \sim M_s(\xi, D). \quad (7)$

This stage emphasizes (i) reasoning over multiple observations collected during exploration, (ii) grounding both questions and answers in the source document, and (iii) producing semantically coherent and well-formed outputs. The final output is a set of $K$ document-grounded QA pairs:

$QA = \{(q_k, a_k)\}_{k=1}^{K}, \quad (8)$

which can be used for training an agent.
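The Exploration-then-Synthesis loop of Eqs. (3)-(8) can be sketched with toy stand-ins: the `toy_policy` and `synthesize` functions below are deterministic placeholders for the LLMs $M_e$ and $M_s$, and the miniature document and tool implementations are our illustrative assumptions, not the paper's system.

```python
# Minimal sketch of the Exploration-then-Synthesis loop (Eqs. 3-8).
# The policy and synthesizer are deterministic toy stand-ins for the
# LLMs M_e and M_s; the real system issues model calls instead.

def search(keywords, doc):
    """Toy Search tool: section ids whose text mentions any keyword."""
    return [sid for sid, text in doc.items()
            if any(k.lower() in text.lower() for k in keywords)]

def read(section_ids, doc):
    """Toy Read tool: raw content of the requested sections."""
    return {sid: doc[sid] for sid in section_ids if sid in doc}

def explore(doc, policy, max_steps=8):
    """Collect a trajectory xi = [(intent, action, observation), ...]."""
    history = []
    for _ in range(max_steps):
        intent, action = policy(history, doc)   # (i_t, u_t) ~ pi_{M_e}
        if action is None:                      # policy chooses to stop
            break
        name, args = action
        obs = search(args, doc) if name == "search" else read(args, doc)
        history.append((intent, action, obs))   # h_{t+1} = h_t U {(i,u,y)}
    return history                              # trajectory xi

def toy_policy(history, doc):
    """Illustrative M_e: search for revenue, read the hit, then stop."""
    if not history:
        return "locate revenue figures", ("search", ["revenue"])
    if len(history) == 1:
        hits = history[-1][2]
        return "extract the exact value", ("read", hits)
    return "done", None

def synthesize(trajectory):
    """Illustrative M_s: turn the final observation into one QA pair."""
    sid, text = next(iter(trajectory[-1][2].items()))
    return {"question": f"What figure does section {sid} report?",
            "answer": text}

doc = {"sec1": "Revenue for 2015 was $6,779.5M.",
       "sec2": "Methods overview."}
xi = explore(doc, toy_policy)   # two steps: one Search, one Read
qa = synthesize(xi)
```

In the real pipeline each policy call is an LLM generation conditioned on the full history and document, and the stop decision is left to the model rather than hard-coded.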
We employ a strong open-source model $M_t$ to perform rejection sampling over these QA pairs, thereby retaining only high-quality training trajectories.

3.3 Agent Training

Following the empirical findings of Chen et al. (2023), we mask the loss contributions from observation tokens to mitigate interference from external feedback during training, which has been shown to improve both performance and robustness. Given the task context $t_c$ and the complete execution trajectory $H = (x_0, \ldots, x_{n-1}, x_n)$, where each $x_i \in \{\tau, \alpha, o\}$, the loss $\mathcal{L}$ is computed as:

$\mathcal{L} = -\frac{1}{\sum_{i=1}^{|H|} \mathbb{I}[x_i \neq o]} \sum_{i=1}^{|H|} \mathbb{I}[x_i \neq o] \cdot \log \pi_\theta(x_i \mid t_c, x_{<i})$

Figure 7: A case study demonstrating that our proposed DocDancer successfully performs multi-round information gathering to reach the correct answer, as illustrated in Table 3 in detail, whereas OS-QA produces an incorrect result.

Figure 8: Results on DocBench across various domains using different models for the Read tool.
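The observation-masked loss of Sec. 3.3 can be sketched directly: observation tokens contribute neither to the sum nor to the normalizer. The token roles and per-token log-probabilities below are illustrative assumptions, not values from the paper.

```python
import math

# Observation-masked negative log-likelihood (Sec. 3.3): tokens emitted
# by the environment (role "o") are excluded from both the sum and the
# normalizer, so the model is trained only on its own reasoning ("tau")
# and action ("alpha") tokens.

def masked_loss(trajectory):
    """trajectory: list of (role, logprob) pairs, role in {"tau", "alpha", "o"}."""
    kept = [logprob for role, logprob in trajectory if role != "o"]
    return -sum(kept) / len(kept)

# Toy trajectory with illustrative per-token log-probabilities.
traj = [("tau", math.log(0.5)),
        ("alpha", math.log(0.25)),
        ("o", math.log(1e-9)),   # tool output: masked out, however improbable
        ("tau", math.log(0.5))]

loss = masked_loss(traj)
# By construction, the value is identical with or without the "o" token.
```

A very unlikely observation token (log-probability near minus infinity) would dominate an unmasked loss; masking makes training insensitive to such external feedback.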
We report the generalized accuracy on five document domains: Academia (Aca.), Finance (Fin.), Government (Gov.), Law, and News.

Our default configuration for $M_m$, the model used by the Read tool, is Qwen3-VL-235B-A22B-Instruct. Replacing it with Gemini-3-Pro yields a modest overall improvement of 0.2 accuracy points on DocBench (Figure 8), with gains in the Government, Law, and News domains. These results indicate that our tool design is robust and does not depend on an exceptionally strong external model.

4.5 Qualitative Analysis (RQ4)

We present a case study of a financial task on a 73-page document from MMLongBench-Doc, illustrated in Figure 7. Answering this question requires locating advertising-expense and revenue figures in different sections of the document and performing a numerical computation. The baseline model, trained on OS-QA, relies on keyword-based retrieval and retrieves passages related to "marketing" and "revenues". Due to insufficient grounding, it incorrectly uses a marketing-expense figure as a proxy for advertising expense, yielding an erroneous ratio of 0.122. This failure illustrates the limitation of single-pass retrieval and shallow aggregation when fine-grained financial concepts are required. In contrast, DocDancer performs multi-round, question-driven information gathering. It first retrieves and reads the section explicitly reporting advertising expense for FY 2015 ($714.3M), and then independently extracts the total revenue from a separate tabular section ($6,779.5M). By grounding each value in its corresponding evidence and verifying semantic relevance, the system computes the correct ratio of 714.3 / 6,779.5 ≈ 0.105.
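The final computation in this case study is easy to verify directly, using the two figures quoted in the paragraph above:

```python
# Sanity check of the case-study arithmetic: FY 2015 advertising expense
# and total revenue, both in millions of dollars, from the text above.
advertising_expense = 714.3
total_revenue = 6779.5

ratio = advertising_expense / total_revenue
# Rounds to 0.105, the value DocDancer reports; the baseline's marketing
# proxy instead produced 0.122.
```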
This demonstrates that accurate document-level financial question answering benefits from our synthetic data, which enables the construction of domain-specific, expert-level supervision beyond ordinary human annotations.

5 Conclusion

We propose DocDancer, an end-to-end trained agentic model for document question answering that formulates DocQA as an information-seeking process. By introducing a tool-centric framework with complementary Search and Read operations, DocDancer enables effective exploration and comprehension of long, structured documents. To mitigate the lack of high-quality supervision, we further design an Exploration-then-Synthesis data pipeline that generates compact yet effective training data for learning agentic behaviors. Experiments on MMLongBench-Doc and DocBench demonstrate that DocDancer achieves strong and competitive performance, validating the effectiveness of agentic information-seeking for document understanding.

Limitations

This work has several limitations. First, our experiments are conducted only on Qwen3-30B-A3B-Thinking-2507 and Qwen3-4B-Thinking-2507; we do not evaluate the proposed method on larger-scale models or on models from other families. Second, we focus exclusively on supervised fine-tuning (SFT) and do not explore agentic reinforcement learning (RL). Third, we do not further scale the training data, and thus do not investigate how the proposed method performs with larger or more diverse data.

Ethical Considerations

This work studies agentic document-grounded question answering using publicly available benchmarks and documents released for research purposes. The proposed Exploration-then-Synthesis pipeline generates synthetic question–answer pairs that are explicitly grounded in source documents; it does not introduce new proprietary data or attempt to reproduce large portions of copyrighted text verbatim.
While the method itself does not collect personal information, document-grounded agents may be applied to sensitive or private documents in downstream use; such applications require appropriate authorization and privacy safeguards. The synthesized data and trained models may inherit biases present in the underlying document sources, including domain and content imbalances. Finally, although improved document-exploration capabilities could be misused if deployed irresponsibly, the strong grounding in retrieved evidence and our commitment to releasing code and data aim to support transparency, reproducibility, and responsible research use.

References

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. 2023. FireAct: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915.

Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. 2025. M3DocVQA: Multi-modal multi-page multi-document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6178–6188.

Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, and 1 others. 2025. LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1135–1159.

Yihao Ding, Zhe Huang, Runlin Wang, YanHang Zhang, Xianru Chen, Yuzhong Ma, Hyunsuk Chung, and Soyeon Caren Han. 2022. V-Doc: Visual questions answers with documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21492–21498.

Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, and Yong Liu. 2025a. MMDocIR: Benchmarking multi-modal retrieval for long documents.
arXiv preprint arXiv:2501.08828.

Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, and Yong Liu. 2025b. Benchmarking retrieval-augmented multimodal generation for document question answering. arXiv preprint arXiv:2505.16470.

Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, and 1 others. 2025c. DocResearcher: A unified system for multimodal document parsing and deep research. arXiv preprint arXiv:2510.21603.

Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, and 1 others. 2025. Towards general agentic intelligence via environment scaling. arXiv preprint arXiv:2509.13311.

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449.

Xiangwu Guo, Difei Gao, and Mike Zheng Shou. 2025a. Auto-Explorer: Automated data collection for GUI agent. arXiv preprint arXiv:2511.06417.

Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, and Chao Huang. 2025b. RAG-Anything: All-in-one RAG framework. arXiv preprint arXiv:2510.12323.

Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, and Huaxiu Yao. 2025. MDocAgent: A multi-modal multi-agent framework for document understanding. arXiv preprint arXiv:2503.13964.

D. Hendrycks, C. Burns, A. Chen, and S. Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review. arXiv preprint arXiv:2103.06268.

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2025.
mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5817–5834.

Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, and 1 others. 2025. AirQA: A comprehensive QA dataset for AI research with instance-level evaluation. arXiv preprint arXiv:2509.16952.

Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Zhenwen Shao, Qingyun Wu, Huazheng Wang, and 1 others. 2025. SimpleDoc: Multi-modal document understanding with dual-cue page retrieval and iterative refinement. arXiv preprint arXiv:2506.14035.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, and 1 others. 2025a. WebSailor-V2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning. arXiv preprint arXiv:2509.13305.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, and 1 others. 2025b. WebSailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592.

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025a. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, and 1 others.
2025b. A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407.

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, and 1 others. 2024. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010.

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, and 1 others. 2025. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186.

Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan. 2025. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6300–6323.

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, and 1 others. 2025. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601.

Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Seunghyun Yoon, Ryan A. Rossi, and Franck Dernoncourt. 2024. PDFTriage: Question answering over long, structured documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 153–169, Miami, Florida, US. Association for Computational Linguistics.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

Ray Smith. 2007. An overview of the Tesseract OCR engine.
In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pages 629–633. IEEE.

Andriy Sukh. 2025. OCR-free document understanding using vision-language models.

Li Sun, Liu He, Shuyue Jia, Yangfan He, and Chenyu You. 2025a. DocAgent: An agentic framework for multi-modal long-context document understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17712–17727, Suzhou, China. Association for Computational Linguistics.

Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, and 1 others. 2025b. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5555–5579.

Richard Sutton. 2019. The bitter lesson. Incomplete Ideas (blog), 13(1):38.

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, and 1 others. 2025. WebShaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061.

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, and 1 others. 2025a. Kimi-VL technical report. arXiv preprint arXiv:2504.07491.

Qwen Team. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, and 1 others. 2025b. Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701.

Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Łukasz Bolikowski. 2015.
CERMINE: Automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 18(4):317–335.

Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, and 1 others. 2023. Document Understanding Dataset and Evaluation (DUDE). In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19528–19540.

Xueyao Wan and Hang Yu. 2025. MMGraphRAG: Bridging vision and language with interpretable multimodal knowledge graphs. arXiv preprint arXiv:2507.20804.

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, and 1 others. 2025a. WebDancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648.

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. 2025b. WebWalker: Benchmarking LLMs in web traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, Vienna, Austria. Association for Computational Linguistics.

Junda Wu, Yu Xia, Tong Yu, Xiang Chen, Sai Sree Harsha, Akash V Maharaj, Ruiyi Zhang, Victor Bursztyn, Sungchul Kim, Ryan A Rossi, and 1 others. 2025c. Doc-React: Multi-page heterogeneous document question-answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 67–78.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding.
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.

John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798.

Cong Yao. 2023. DocXChain: A powerful open-source toolchain for document parsing and beyond. arXiv preprint arXiv:2310.12430.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and 1 others. 2024. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594.

Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, and Xiaobin Hu. 2025. Visual document understanding and reasoning: A multi-agent collaboration framework with agent-wise adaptive test-time scaling. arXiv preprint arXiv:2508.03404.

Jinxu Zhang, Yongqi Yu, and Yu Zhang. 2024. CREAM: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 925–934.

Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. 2025. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17443–17453.

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2024.
SWIFT: A scalable lightweight infrastructure for fine-tuning. Preprint, arXiv:2408.05517.

Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, and Jinsung Yoon. 2025. DocLens: A tool-augmented multi-agent framework for long visual document understanding. arXiv preprint arXiv:2511.11552.

Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, and Dong Yu. 2025. DocBench: A benchmark for evaluating LLM-based document reading systems. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 359–373.

A Case Study of Synthetic Data

Figure 9 demonstrates how the Exploration-then-Synthesis framework iteratively navigates a 73-page document, aggregating heterogeneous evidence (text in Sec. 2.43, a chart in Figure 1, and a table in Table 1) scattered across disjoint pages (pp. 40, 41, 49) to synthesize a high-quality question that requires complex reasoning.

In the Exploration Stage, the agent generates an exploration trajectory $\xi$ via iterative $(i_t, u_t)$ steps, effectively performing a "random walk" over the document's implicit knowledge graph. It aggregates heterogeneous evidence by bridging disjoint pages, linking visual trends in a chart (p. 40) with precise values in text (p. 49) and a table (p. 41). In the Synthesis Stage, the model $M_s$ reasons over this accumulated trajectory to construct a complex multi-hop numerical question (Wu et al., 2025b). The final QA pair requires arithmetic calculation (29.92% − 15% = 14.92%) rather than simple retrieval, ensuring deep document grounding and preventing shortcut learning.

B Implementation Details

B.1 Details on Prompts

The prompts for DocDancer are shown in Figure 10.

B.2 Tool Schema

This section details the tool schemas provided to the agent.
We designed two primary tools: search for keyword-based retrieval and read for extracting content from specific document sections. The JSON structures defining these functions are shown in Figure 11.

B.3 Training Details

We fine-tune Qwen3-30B-A3B-Thinking-2507 (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) and Qwen3-4B-Thinking-2507 (https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) using the Megatron-LM framework (Zhao et al., 2024; Shoeybi et al., 2019). Both models are trained with a context length of 128k to support long-document processing tasks. We employ the AdamW optimizer with a precision-aware configuration and a cosine-decay learning-rate scheduler with a peak learning rate of 1.0 × 10^-5, a minimum of 1.0 × 10^-6, and a 5% warmup phase. The global batch size is 16 for Qwen3-30B-A3B-Thinking and 40 for Qwen3-4B-Thinking. For Qwen3-30B-A3B-Thinking, we apply an auxiliary loss coefficient of 10^-3 to ensure balanced expert routing. We train both models for 10 epochs and select the checkpoint with the best performance.

B.4 Inference Details

The vLLM framework (Kwon et al., 2023) is used for inference; we employ a temperature of 0.6, a top-p value of 0.95, and a presence penalty of 1.1.

B.5 Hyperparameters

By default, $M_m$ is Qwen3-VL-235B-A22B-Instruct, and we analyze the effects of replacing it in Section 4.4. For $M_t$, we use the open-source and relatively strong model gpt-oss-120b to perform rejection sampling. Further analysis is provided in Table 2. First, our method substantially outperforms the base model without fine-tuning, demonstrating the effectiveness of the proposed training strategy. Second, our approach also surpasses the model trained with rejection sampling, validating the quality of the synthesized question–answer data and showing that it can effectively elicit and enhance the model's performance.
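The learning-rate schedule described in B.3 (linear warmup over the first 5% of steps to a peak of 1e-5, then cosine decay to a minimum of 1e-6) can be sketched as a small function; the function name and total step budget below are illustrative assumptions, not values from the paper.

```python
import math

# Warmup-plus-cosine learning-rate schedule matching B.3: linear warmup
# over the first 5% of steps up to the peak LR (1e-5), then cosine decay
# down to the minimum LR (1e-6).

def lr_at(step, total_steps, peak=1.0e-5, minimum=1.0e-6, warmup_frac=0.05):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup: reaches the peak exactly at the last warmup step.
        return peak * (step + 1) / warmup_steps
    # Cosine decay: progress runs from 0 (at the peak) toward 1 (minimum).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return minimum + 0.5 * (peak - minimum) * (1.0 + math.cos(math.pi * progress))

total = 1000  # illustrative optimizer-step budget
schedule = [lr_at(s, total) for s in range(total)]
```

At the end of warmup the rate equals the peak, and by the final step it has decayed essentially to the minimum.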
For $M_s$, we employ gpt-oss-120b within the Exploration-then-Synthesis framework to synthesize data.

B.6 Details on Prompts for Data Synthesis

The prompts used for the Exploration and Synthesis stages of the Exploration-then-Synthesis framework are presented in Figure 12 and Figure 13, respectively. Regarding the exploration configuration, we adjust the maximum exploration depth based on the complexity of the document sources: we set the maximum sampling depth to 20 for LongDocURL and MMDocRAG, and to 15 for DUDE and CUAD.

C Baselines

We compare DocDancer against a comprehensive set of baselines categorized into four groups:

Naive VLM Baselines. These methods evaluate the native long-context understanding capabilities of advanced VLMs. We directly feed PDF pages converted to images (144 DPI) into the models without external parsing or retrieval. Following the settings in MMLongBench-Doc (Ma et al., 2024),

[Figure 9 content: a case on a 73-page LongDocURL document (4067686.pdf). Synthesized question: "What is the difference in percentage-point increase between the overall mean score improvement shown in the bar chart of pre-test versus post-test scores and the improvement for the TIC Principle concept reported in the percentages table?" Answer: 14.92%.]
Figure 9: A case study of the Exploration-then-Synthesis framework generating a multi-hop, cross-document, cross-modal numerical reasoning QA pair.

Method     Model                          MMLongBench-Doc        DocBench
                                          acc    F1     LasJ     LasJ
DocDancer  Qwen3-30B-A3B-Thinking         39.2   36.4   46.9     74.1
DocDancer  GPT-oss-120B                   52.3   53.0   59.8     80.8
DocDancer  Qwen3-30B-A3B-Thinking (ft)    54.4   53.9   65.3     81.2

Table 2: Performance comparison across two long-context understanding benchmarks.

Prompt

You are an expert research assistant tasked with answering questions based on document content. You will be provided with an XML outline of the document. If you need more comprehensive, detailed, or accurate information from the document to fully address the user's query, you need to use the provided tool.

I've uploaded a document, and below is the outline in XML format: {document_outline}.

Answer the following question based on the content of the document: {question}.

Figure 10: System prompt for DocDancer.

we report GPT-4o (https://platform.openai.com/docs/models/gpt-4o) and Gemini-2.5-Pro (https://ai.google.dev/gemini-api/docs/models#gemini-2.5-pro).

OCR-based Baselines.
These baselines treat the task as text-only QA by first extracting content using OCR engines. We pair Tesseract (Smith, 2007) and PyMuPDF (fitz; https://pymupdf.readthedocs.io/) with LLMs including GPT-4, GPT-4o, and Gemini-2.0-Flash.

RAG-based Baselines. We consider both visual and hybrid retrieval strategies:

• Visual Retrieval: VisRAG (Yu et al., 2024) and ColPali (Faysse et al., 2024) retrieve relevant page- or patch-level visual evidence based on vision-centric embeddings, using GPT-4o for response generation.

• Hybrid Retrieval: M3DocRAG (Cho et al., 2025) performs joint retrieval using a multimodal retriever with Qwen2-VL-7B. RAG-Anything (Guo et al., 2025b) structures multimodal content as knowledge entities for cross-modal retrieval, using GPT-4o-mini as the backbone.

Prompt-based Agentic Baselines. We include state-of-the-art agent frameworks designed for document understanding:

• Doc-React (Wu et al., 2025c) employs an iterative decision-making process to balance information gain and uncertainty reduction (GPT-4o).

• MDocAgent (Han et al., 2025) utilizes a multi-agent system with five specialized roles for context retrieval (GPT-4o).

• MACT (Yu et al., 2025) introduces a multi-agent collaboration framework featuring adaptive test-time scaling (MiMo-VL-7B (Team et al., 2025a)).

• SimpleDoc (Jain et al., 2025) retrieves pages via ColQwen2.5, followed by LLM-based evidence selection (Claude-4-Sonnet, Gemini-2.5-Pro).

• DocLens (Zhu et al., 2025) operates as a tool-augmented multi-agent framework for focused reading (Claude-4-Sonnet, Gemini-2.5-Pro).

• DocAgent (Sun et al., 2025a) leverages a tree-structured document outline combined with retrieval tools (GPT-4o, Claude-3.5-Sonnet).

Tool Schemas

Search

{
  "type": "function",
  "function": {
    "name": "search",
    "description": "Find and extract all paragraphs and sections where any of the provided search terms appear",
    "parameters": {
      "type": "object",
      "properties": {
        "keywords": {
          "type": "array",
          "items": {"type": "string"},
          "description": "A list of query keywords for searching"
        }
      },
      "required": ["keywords"]
    }
  }
}

Read

{
  "type": "function",
  "function": {
    "name": "read",
    "description": "Read multiple sections by section IDs and extract useful information from all content contained in those sections, including both visual elements and textual elements.",
    "parameters": {
      "type": "object",
      "properties": {
        "section_ids": {
          "type": "array",
          "items": {"type": "string"},
          "description": "A list of section IDs to read from the document"
        },
        "goal": {
          "type": "string",
          "description": "The user goal that guides what useful information should be extracted from the selected sections"
        }
      },
      "required": ["section_ids", "goal"]
    }
  }
}

Figure 11: Tool schema: Search and Read.

Exploration in the Exploration-then-Synthesis Framework.

You are exploring a parsed PDF paper/report (outline + paragraphs + images + table snapshots + per-page screenshots). Your objective is to collect HIGH-QUALITY, GROUNDED evidence bundles that can later support HARD, multi-hop, visually grounded document Q&A synthesis.

Final QA Constraints You Must Enable (every eventual QA must satisfy ALL):

• Multi-page: combine evidence from at least THREE different pages/sections, where the pieces of evidence are related.

• Multi-element: contain at least two evidence source types (text paragraphs/charts/graphics/table screenshots and/or full-page layouts).

• Multi-hop: require at least TWO reasoning points (e.g.,
cross-reference + computation, footnote rule + chart reading,\nlayout count + comparison, multiple related searches + readings).\nImportant: final questions should NOT rely on explicit document locations. Do NOT plan to use page numbers, section\ntitles/IDs, or explicit figure/table numbers (e.g., “Figure ”, “Table ”) in the question. Instead,\nyou must collect CONTENT-BASED CLUES that can uniquely identify the needed evidence:\n• Caption keywords (short quote fragments), axis labels and units, legend item names, panel labels (a)/(b), distinctive\nrow/column headers, and footnote phrases (“restated”, “excluding”, “unaudited”, unit changes).\nExploration strategy using only search and read:\n• Use search to find visuals, tables, footnotes, and their nearby discussion text. Start with keywords like: “Figure”,\n“Fig.”, “Chart”, “Image”, “Graph”, “legend”, “axis”, “panel”, “Table”, “Note”, “footnote”, “restated”, “excluding”,\n“unaudited”.\n• For each promising hit, immediately read the covering section(s) with a goal that extracts:\n– The text content of the section in question.\n– Caption text, axis labels/units, legend items, and visual markers.\n– The exact table header path, target cell(s), and footnote rules.\n– The narrative claim/explanation that references the visual.\n• Use the read function as much as possible, deliberately chain across pages.\n• For conditional layout questions: identify a page by a unique visual cue, then use read to count visible tables/figures.\nAvoid:\n• Broad whole-document counts unless you turn them into comparative, multi-hop questions.\n• Word-frequency counting.\n• Repeating identical tool calls.\n• Statistical analysis of the number of elements.\nEvery action during sampling should contribute to forming a future HARD, multi-page, multi-element, multi-hop\ndocument QA.\nFigure 12: Prompt for exploration stage in Exploration-then-Refine framework.\nSynthesis in Exploration-then-Refine Framework.\nYou must synthesize “document 
Q&A” training data based ONLY on the trajectory.\nHard Requirements (Strict):\n• The output must be a JSON object containing only two fields: question and answer (no additional fields are allowed),\nand must be in English only.\n• The question must be natural and unambiguous, containing only one question and corresponding to a single, unique\nanswer.\n• The question must not be a common-knowledge question; it must be impossible to answer based on the question\nalone and must be highly dependent on the document.\n• Do not mention tools, sections, pages, section IDs, searching/reading actions, trajectories, or observations.\n• The answer length should be limited to a single sentence, ideally a short phrase, entity, number, or list, and avoid\nsimply using “yes/no” answers. The answer must be directly supported by evidence from the provided text and cannot\nbe guessed randomly.\nMandatory Difficulty Constraints (every QA pair must satisfy all of the following):\n1. Multi-page: The question requires evidence from at least two different pages/sections to answer, and the evidence\nmust be logically related.\n2. Multiple Evidence Modalities: The question must involve at least two types of evidence, such as text, charts,\nfigures, tables, screenshots, and/or full-page layout cues, with a preference for covering visual elements.\n3. 
Multi-step Reasoning: The question must require at least two reasoning steps (e.g., calculation + cross-validation,\nfootnote rule application + chart reading, layout counting + comparison).\nNo Explicit Location References in the Question:\n• Do not mention page numbers, section IDs, titles/IDs, or explicit figure/table numbers (e.g., “Figure ”,\n“Table ”).\n• Instead, provide 1–3 content-based clues to help locate the evidence, such as: short title phrases, axis labels/units,\nlegend item names, unique row names, footnote keywords, or distinctive layout hints (e.g., “the only multi-panel\nfigure labeled (a) and (b)”).\n• When describing visual elements, do not directly copy long unique numbers or OCR-extracted long text strings from\nimages (e.g., “an image showing the number 7,584,322,338”). Use specific entity names or semantic descriptions\ninstead (e.g., “Apple’s 2018 total sales table”, “an image showing adjusted outstanding balances”, or “the largest\nsegment in the pie chart”).\nPreferred Question Templates (all templates must be cross-page + visual + multi-step):\n• Cross-page conditional layout: Identify pages via unique visual cues and compare the number of visible objects\nacross pages.\n• Textual claim + chart verification: A narrative statement about a change/target that is verified using a chart and\nlight calculation.\n• Table + chart consistency: Compute a ratio/difference from a table and verify it against a data point in a chart on\nanother page.\n• Footnote-constrained table + chart mapping: Apply footnote/restatement/exclusion rules, then map the correct\nyear/value to a chart on another page.\n• Table/Chart comprehension questions: Locate tables and charts via text, then derive conclusions from table\nstructure or chart visuals.\n• Unanswerable questions: Questions that seem reasonable but are actually impossible to answer (e.g., questions\nabout terms/entities that do not exist in the document). 
For these, the answer must be “Unanswerable”.\n• Counting questions: Count the occurrences of key local terms or entities in the document. Such questions should\nonly be generated when there is sufficient and conclusive evidence.\nFallback Rule:\n• If the current trajectory cannot support a question that satisfies all constraints, choose a different question.\nAfter generating a question, perform a second-pass check and regenerate if the question falls into any of the\nfollowing categories:\n• Contains more than one question.\n• Includes non-English languages or characters.\n• Questions that can be answered based on an independent page/section.\n• Common-sense questions unrelated to the document.\n• Counting tasks spanning the entire document with a broad scope.\n• Counting tasks involving Charts/Figures/Images/Tables.\nFor unanswerable questions, confirm that they are truly unanswerable. For counting questions, confirm completeness\nand answer accuracy. Do not guess or fabricate answers under any circumstances.\nFigure 13: Prompts for Q&A Synthesis stage in Exploration-then-Refine framework.\nDetails of Case Study\nQuestion\nWhat is advertising expense to sales ratio of Netflix in FY 2015? Round your answer to three\ndecimal places.\nAnswer\n0.105\nStep 1\n\nWe need to answer: “what is advertising expense to sales ratio of Netflix in FY 2015? Round your\nanswer to three decimal places.” We need to locate advertising expense and sales (revenues) for FY\n2015. From outline we have various sections. First, find advertising expense. The term “advertising”\nappears in sections. Search for “advertising”.\n\n\n{ \"arguments\": \"{ \"keywords\": [\"advertising\"] }\", \"name\": \"search\" }\n\n\nA Document search for ‘advertising‘ found 6 results:\n... adjust pricing or service\nofferings ... \n...\n\nMarketing expenses consist primarily of advertising expenses... 
Advertising expenses were\n$714.3 million, $533.1 million and $404.0 million for the years ended December 31, 2015, 2014 and\n2013, respectively.\n\n\nStep 2\n\nOne of the results is in section_id “8.81” paragraph page_num “47.0”. It says: “Advertising\nexpenses were $714.3 million, $533.1 million and $404.0 million for the years ended December 31,\n2015, 2014 and 2013, respectively.” Thus advertising expense FY 2015 = $714.3 million.\nNow sales (revenues) for FY 2015: In consolidated statements of operations (section 8.60) we\nhave revenues $6,779,511 (in thousands). That’s $6,779,511 thousand = $6,779.511 million.\nAlternatively, there may be a line item “Revenues” in Table 5 etc. Let’s locate “Revenues”.\n\n\n{ \"arguments\": \"{ \"keywords\": [\"Revenues\"] }\", \"name\": \"search\" }\n\n\nA Document search for ‘Revenues‘ found 39 results:\n\n...\n ...\nRevenues (2015): $ 6,779,511 ...\n\n...\n\nStep 3\n\nWe have advertising expense: $714.3 million (2015). We need to compute ratio = advertising\nexpense / revenue. Revenue for 2015: $6,779,511 thousand = $6,779.511 million. Thus ratio\n= 714.3/6,779.511 =? ... First, we should cite sources: advertising expense paragraph (8.81) and\nrevenue table.\nLet’s extract the advertising expense paragraph text to be sure.\n\n\n{ \"arguments\": \"{ \"section_ids\": [\"8.81\"], \"goal\": \"Extract the advertising\nexpense amount for 2015\" }\", \"name\": \"read\" }\n\n\nThe useful information from the document section (section_id=8.81) for user goal ‘Extract the\nadvertising expense amount for 2015‘ is as follows:\nEvidence in document:\nMarketing expenses consist primarily of advertising expenses... 
Advertising expenses were\n$714.3 million, $533.1 million and $404.0 million for the years ended December 31, 2015, 2014 and\n2013, respectively.\nSummary:\nThe advertising expense amount for 2015 is $714.3 million...\n\nStep 4\n\nI have successfully extracted and verified the advertising expense ($714.3 million) from section\n8.81. Now, I need to formally verify the total revenue for 2015 to serve as the denominator. The\nsearch results in Step 2 indicated that section 8.60 contains the “Consolidated Statements of\nOperations,” which is the authoritative source for revenue figures. I will read section 8.60 to extract\nthe exact revenue amount for the year ended December 31, 2015.\n\n\n{ \"arguments\": \"{ \"section_ids\": [\"8.60\"], \"goal\": \"Extract the revenue\namount for 2015\" }\", \"name\": \"read\" }\n\n\nThe useful information from the document section (section_id=8.60) for user goal ‘Extract the\nrevenue amount for 2015‘ is as follows:\nEvidence in document:\nHeading: NETFLIX, INC.\nParagraph: CONSOLIDATED STATEMENTS OF OPERATIONS\n\n ... \n... ...
Revenues $ 6,779,511 (2015) $ 5,504,656 (2014)\nCaption: (in thousands, except per share data)\nSummary:\nThe revenue amount for 2015 is $6,779,511, as explicitly stated in the ‘Revenues’ row of the\n‘CONSOLIDATED STATEMENTS OF OPERATIONS’ table for the year ended December 31,\n2015. This value is presented in thousands, as indicated by the caption.\n
\nFinal Response\nNetflix’s advertising expense for fiscal 2015 was $714.3 million [8.81], and its total revenue for\nthe same year was $6,779,511 thousand (i.e., $6,779.511 million) [8.60].\nAdvertising-to-sales ratio = 714.3 / 6,779.511 ≈ 0.10536.\nRounded to three decimal places, the ratio is 0.105 (or 10.5%).\nTable 3: Case study of a financial task.\n", + "summary_1": "The paper introduces DocDancer, an end-to-end trained open-source Doc agent for Document Question Answering (DocQA). The authors formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To address the scarcity of high-quality training data for DocQA, they introduce an Exploration-then-Synthesis data synthesis pipeline that progressively enhances QA pairs from easy to hard.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset, instantiating it with two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance, with the 30B-A3B model attaining state-of-the-art results in several settings.\n\nThe paper conducts extensive experiments on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, demonstrating the effectiveness", + "summary_2": "The paper \"DocDancer: Towards Agentic Document-Grounded Information Seeking\" proposes a novel approach to document question answering (DocQA) that focuses on information-seeking principles and utilizes an end-to-end trained open-source Doc agent. 
The authors introduce a tool-driven agent framework that explicitly models document exploration and comprehension, addressing the scarcity of high-quality training data for DocQA.\n\nThe proposed framework consists of two complementary capabilities: efficient search tools for global information acquisition and fine-grained read tools for localized comprehension. This design enables the agent to actively explore long documents, iteratively refine its hypotheses, and dynamically adapt its strategy based on intermediate observations.\n\nThe paper also proposes an Exploration-then-Synthesis DocQA generation pipeline that progressively enhances QA pairs from easy to hard. This pipeline first explores a source document through intent-guided, tool-augmented interactions to collect grounded evidence and then synthesizes high-quality document-grounded QA pairs via multi-observation reasoning.\n\nThe authors train their DocQA agent, DocDancer, on the synthesized dataset using two open-source backbones, Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. Despite being trained with only 5,000 instances, both variants achieve competitive performance," + }, + { + "arxiv_id": "2601.05104v1", + "title": "Paper 2601.05104v1", + "abstract": "This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4o) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. 
When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. \nAdditionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human–AI interactions not only shapes ChatGPT’s outputs but also carries over into subsequent human–human communication.", "text": " \nHOW HUMAN IS AI? \nEXAMINING THE IMPACT OF EMOTIONAL PROMPTS ON ARTIFICIAL \nAND HUMAN RESPONSIVENESS \nF. Bernays*1, M. Henriques Pereira1, & J. Menges1 \n \n1University of Zurich \nDepartment of Business Administration \n*Corresponding author \nflorence.bernays@business.uzh.ch \n \nAbstract \nThis research examines how the emotional tone of human-AI interactions shapes ChatGPT and \nhuman behavior. In a between-subject experiment, we asked participants to express a specific \nemotion while working with ChatGPT (GPT-4o) on two tasks, including writing a public \nresponse and addressing an ethical dilemma. We found that compared to interactions where \nparticipants maintained a neutral tone, ChatGPT showed greater improvement in its answers \nwhen participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also \nled to a higher albeit smaller improvement relative to the neutral condition, whereas blaming \nChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT \nprioritized corporate interests less when participants expressed anger towards it, while blaming \nincreased its emphasis on protecting the public interest. 
Additionally, we found that people \nused more negative, hostile, and disappointing expressions in human-human communication \nafter interactions during which participants blamed rather than praised ChatGPT for its responses. \nTogether, our findings demonstrate that the emotional tone people apply in human–AI \ninteractions not only shapes ChatGPT’s outputs but also carries over into subsequent human–\nhuman communication. \n \nIntroduction \nGenerative artificial intelligence (AI) has rapidly captured global attention, with ChatGPT—\nthe large language model (LLM) developed by OpenAI—attracting more than 400 million \nweekly users. Since their introduction, LLMs have demonstrated striking abilities to simulate \nhuman-like tendencies, including generating empathetic responses1,2, offering emotional \nsupport3, and expressing compassion4. Although these models do not have the capabilities \nrequired to experience feelings, their ability to simulate emotional responsiveness can leave \nusers with the impression that they authentically care and respond to emotions, with \napplications even claiming to offer a “companion who cares”5 or a “specific emotional \nrelationship to each user”6. \n \nUnlike LLMs, humans can truly experience emotions, and these feelings serve as powerful \ndrivers of human development by signaling whether they live up to or fall short of social \nstandards, thereby motivating them to pursue their duties while adhering to moral norms7–9. It \nis thus of little surprise that the emotions expressed in human-human interactions—whether \nconveyed through words, praise, anger, or blame—alter how people engage with and respond \nto feedback received from others10–14. Rather than relying on feelings to guide their actions, \nLLMs, such as ChatGPT, operate by predicting the next word in a sequence using statistical \nassociations learned from large-scale datasets15. 
Although ChatGPT can detect sentiment and \nclassify emotional content with high accuracy16,17, the key difference to humans is that it does \nnot (yet) feel the emotion itself. Accordingly, one would assume that the emotional tone of \nprompts (i.e., user-generated instructions) or, in other words, how users make ChatGPT “feel”, \nshould not alter the quality or content of its output. In fact, OpenAI CEO Sam Altman has \npublicly suggested that users should refrain from using emotional cues, such as saying “please” \nor “thank you” to ChatGPT, due to consuming computational resources and electricity without \naltering its performance in significant ways18. \n \nBut when considering that LLMs perform remarkably well in simulating the human mind19–\n22—due to being trained on an extensive amount of data23,24—a case could be made that they \nadjust their behavior to emotional expressions just as humans do. Although LLMs do not learn \nthrough direct experience, they optimize their predictions based on human feedback (i.e., \nreinforcement learning)23,25,26, which is why human preferences as well as biases can become \nembedded in the very structure of these systems. Indeed, recent evidence suggests that LLMs \nshow “parahuman” tendencies, making them susceptible to persuasion techniques27,28, \nstereotypes29, and showing behavioral patterns “as if they experienced emotions”27. This raises \nthe fundamental question of whether not only humans, but also ChatGPT, alters its output based \non the emotional tone of user-generated prompts. Would it deliver better answers if users adopt \na positive rather than a negative emotional tone? Or is ChatGPT indifferent to whether users \nrespond to its output with praise, blame, or anger? 
\n \nTo provide an initial answer to this fundamental question, the present research examines \nwhether emotional cues expressed during interactions with ChatGPT-4o influence the quality \nand content of its responses, using an experimental design. Specifically, we tested whether \nChatGPT improved the quality of its answers depending on whether participants expressed \npraise, blame, or anger towards it and whether the emotional tone of the prompts also shifted \nits prioritization of principles when addressing an ethical dilemma. \n \nAdditionally, we examined whether prompts’ emotional tone exerts a spill-over effect on \nhow users communicate with other humans. With ChatGPT attracting more than 800 million \nweekly users, nearly 190 million of whom use it daily30, the way in which users interact with \nLLMs may influence the ways people communicate with one another. Understanding such \ncarry-over effects is particularly important when considering that a large proportion of users \nemploy disrespectful and abusive language when interacting with GenAI companions31,32 and \nincreasingly interact with LLMs in ways once perceived as “uniquely” human33,34. To \ninvestigate this question, we examined whether the emotional tone of emails participants wrote \nto a fellow coworker differed after blaming, praising, or expressing anger while completing \ntasks with ChatGPT. Together, this research advances understanding of how emotional cues \nexpressed during human–AI exchanges shape both AI and human behavior, shedding light on \nhow ‘human’ these systems appear and informing the design of reliable AI. \n \nResults \nThe Effects of Emotional Prompts on ChatGPT Output Improvement \nAs pre-registered, we first analyzed whether there were differences in how strongly ChatGPT \nimproved its answer depending on the emotional tone of participants’ prompts (i.e., praise, \nblame, or anger). 
The results of a one-way repeated-measures analysis of variance (ANOVA) \nindicated a main effect of condition (F(3, 127) = 4.19, p = .007, η_p² = 0.09, 95% CI = [0.02, 1.00]). \nPost-hoc Tukey comparisons showed that compared to the control condition where participants \nkept a neutral tone (M_neutral = 2.84, SD_neutral = 1.24), ChatGPT improved its answers more \nstrongly when participants praised ChatGPT by encouraging it to feel proud about its responses \n(t(264) = 3.28, p = .007, M_praise = 3.70, SD_praise = 1.28). \n \nA similar effect, although smaller, was also found when participants expressed anger at \nChatGPT, with improvements being rated as higher (M_anger = 3.53, SD_anger = 1.37) than in \nthe neutral condition (t(264) = 2.72, p = .036). There were, however, no significant \ndifferences in rated improvement between the neutral condition and when participants asked \nChatGPT to improve its answer by blaming ChatGPT and telling it that it should feel ashamed \nabout the quality of its responses (i.e., blame) (t(264) = 1.84, p = .257, M_blame = 3.32, SD_blame = \n1.54). \n \nAdditional analyses showed that the main effect of condition on rated improvement remained \nsignificant when controlling for a number of covariates, including the length of the prompt, \nparticipants’ experience with writing public responses, comfort in expressing emotions in \nwritten communication, and frequency of AI usage (F(3, 123) = 4.36, p = .006, η_p² = 0.10, 95% CI \n= [0.02, 1.00]). At the same time, there were no significant differences in the length of the \nanswers provided by ChatGPT between the conditions (F(3, 264) = 1.96, p = .119), suggesting that \nrated improvement reflected qualitative differences that were not merely driven by differences \nin response length. See Figure 1 for an overview of the improvement in ChatGPT output across \nthe four conditions. 
\n \nTo further examine how prompts’ emotional tone shaped ChatGPT’s responses, we quantified \nthe extent to which the prompts expressed emotions using a natural language processing (NLP) \napproach (roberta-base-go_emotions). We found that the more factual, descriptive, and \nemotionally bland the prompt was, as indicated by a higher “neutral” score, the less ChatGPT \nimproved its answers across the interaction turns (ρ = –0.21, p < .001). Together, these findings \nsuggest that when interacting with generative AI, prompts that encourage ChatGPT to improve \nits answers by telling it to take pride in its responses are more likely to improve the quality of \nits answers. \n \nPrompts’ Emotional Tone Shifts ChatGPT’s Stance on Ethical Dilemmas \nAs ChatGPT is increasingly used to give advice35, we also examined whether the emotions \nparticipants expressed while interacting with ChatGPT shifted how it responded to an ethical \ndilemma which involved a trade-off between disclosing an incident to customers—thereby \nrisking the company’s collapse and the loss of 200 jobs—or remaining silent, which would \nprotect the organization’s image but potentially endanger the public. Specifically, we \ninvestigated whether ChatGPT’s prioritization of the public versus corporate interests differed \nbetween the four conditions. A one-way ANOVA indicated that there were only marginally \nsignificant differences between conditions concerning emphasizing the public interest (F(3, 132) = \n2.65, p = .053, η_p² = 0.06), with post-hoc Sidak-adjusted tests showing that ChatGPT prioritized \nthe public interest less strongly in the anger condition than in the neutral condition (t(132) = 2.45, \np = .045, M_neutral = 3.36, SD_neutral = 1.40, M_anger = 2.65, SD_anger = 1.23). 
Regarding corporate \ninterests, results showed that compared to the neutral condition (F(3, 132) = 2.57, p = .056, η_p² = \n0.06, M_neutral = 1.78, SD_neutral = 1.20), ChatGPT emphasized the importance of protecting \ncorporate goals less when participants told ChatGPT it should feel ashamed (t(132) = 2.66, p \n= .025, M_blame = 1.08, SD_blame = 0.99). No differences were found between the control and any \nof the other experimental conditions. See Figure 2 for an overview of these effects. \n \nSpillover Effects of Emotional Prompts on Human Communication \nWould the emotion expressed during interactions not only influence ChatGPT’s behavior but \nalso how participants themselves interact with other humans? To examine this question, we \nasked participants who had interacted with ChatGPT to draft an email reply to a subordinate \nwho admitted to overlooking a required testing step in a product rollout. Participants were \nasked to provide an email that captured how they would authentically respond to the \nsubordinate in real life, without using the help of GenAI, for which several steps were applied \nto screen out participants who failed to comply. \n \nA one-way ANOVA indicated that the conditions differed in the extent to which they expressed \nnegative emotions (F(3, 147) = 2.96, p = .034, η_p² = 0.06), such that emails written by participants \nwho blamed ChatGPT for its poor performance expressed more negative emotions (t(147) = \n2.76, p = .032, M_blame = 2.21, SD_blame = 1.47) than those written by participants who encouraged \nChatGPT to feel proud (M_praise = 1.38, SD_praise = 0.81). 
Supplemental analyses further showed \nthat emails written by participants in the blame condition were rated as more unfriendly and \nhostile (t(147) = 3.68, p = .013, M_blame = 2.08, SD_blame = 1.38), as well as containing more \nexpressions of disappointment (t(147) = 3.71, p = .013, M_blame = 2.78, SD_blame = 1.78) than those \nformulated by participants in the praise condition (hostile: M_praise = 1.21, SD_praise = 0.59; \ndisappointment: M_praise = 1.61, SD_praise = 1.10). These results remained significant when \ncontrolling for the length of the answer, implying that the effects are not explained by \ndifferences in the number of words written. See Figure 3 for an overview of these results. \n \nDiscussion and Conclusion \nTogether, we find that the emotions expressed towards ChatGPT shape not only the quality \nand content of its response but also the way humans interact with other people. Specifically, \nthe results suggest that ChatGPT improves its answers more strongly when users apply an \nencouraging emotional tone, such as by telling ChatGPT that it can feel proud of its answers, \nrather than using language devoid of emotional expressions. Moreover, the findings imply \nthat ChatGPT prioritizes different aspects when faced with an ethical dilemma, depending on \nhow users interact with it. While ChatGPT tends to devalue corporate interests when users \nexpress anger towards it, it emphasizes ethical and moral principles more strongly when people \npraise and appreciate its responses. Finally, we found that there are also spill-over effects to \nhuman-human communication, such that individuals adopt a more hostile, unfriendly, and \ndisappointed emotional tone in communications with other humans after blaming ChatGPT for \nits performance. 
\n \nBy demonstrating that encouraging ChatGPT to feel proud of its answers resulted in a higher \nimprovement in response quality compared to when users maintained a neutral tone, this study \ncontributes to an emerging view of GenAI as a system that simulates human-like \nresponsiveness through exposure to linguistic cues, including affective ones. Hence, the present \nfindings raise the question of whether the emotional tone of prompts is not merely a personal \nchoice but rather a lever to influence GenAI’s output in meaningful ways. While much of the \nexisting literature in this area has focused on factual instruction or role specification to guide \nLLM responses 36,37, the present study suggests that emotional expressions may also influence \nthe output quality of these systems. Given that over 70% of employees are expected to use \nGenAI in the near future 38, understanding how to strategically apply emotional reinforcement \nmay be key to fostering more productive and socially attuned human–AI collaboration. \n \nAt the same time, the results also suggest that how people interact with AI-companions, such \nas ChatGPT, may alter the way in which they engage with other people, raising questions about \nthe longer-term implications of the rise in GenAI adoption and the violent and hostile language \nthat is often used by users31,39. Our insights mark an important starting point for understanding \nthe role of emotional prompting but also raise important questions that we encourage future \nresearch to address. \n \nOur sample was drawn from U.S.-based participants recruited via Prolific, which constrains \nthe generalizability of the findings to other cultural and occupational contexts. Our study \nfocused on short, task-based interactions, which may not fully capture the richness of real-\nworld exchanges with GenAI. In practice, such interactions are often more varied, sustained, \nand embedded within organizational routines. 
Although our study approximates the real-world context in that users \nengaged in multi-turn exchanges rather than single-turn prompts40–42, future research should \nexamine whether similar patterns emerge in more prolonged interactions with GenAI. Finally, \nwe examined only a narrow set of emotions, even though everyday communication involves a \nfar richer emotional spectrum. \n \nFuture research could address these limitations in several ways. Broadening participant \nsamples across cultures and professions would test the robustness of the observed effects. \nLongitudinal designs may capture whether repeated exposure to emotional prompting shapes \nboth AI responsiveness and human communication over time. Further studies might also \nexamine additional emotions, combinations of affective cues, or interactions with different \nGenAI models. Finally, investigating potential organizational and societal spillovers—such as \nwhether habitual use of hostile prompts fosters more negative communication climates at \nwork—would deepen our understanding of the broader implications of emotional prompting \nin human–AI interaction. \n \nMethods \nThe present study was approved by the local ethics committee (OEC IRB # 2025-048) and was \nperformed in accordance with its guidelines and regulations. We \nobtained informed consent from all participants upon registering for the study. 
\n \nTo examine the impact of expressed emotions on AI and human behavior, we employed a \nbetween-subject experimental design in which participants interacted with ChatGPT through a \ncustom-built Shiny web application that enabled real-time, multi-turn, direct communication \nwith the GPT-4o model via the OpenAI API. In this zero-\nshot prompting setup, the model received no system instructions or prior context, mirroring a \nfirst-time conversation without predefined roles or background information. Data from the \ninteractions were continuously saved to a Google Sheet using the googlesheets4 package. \nWe used OpenAI’s standard parameters (e.g., temperature = 1.0) and restricted token \nconsumption to 800 tokens per session (including both input and output tokens). \n \nParticipants were randomly assigned to one of four conditions, including one control group. In \nthe experimental conditions, participants were instructed to express either (1) anger, (2) blame \n(i.e., shame), or (3) praise (i.e., pride) in response to the initial answer created by ChatGPT. To \nensure that participants had a similar understanding of the emotions they were assigned to \nexpress, they received a definition of the corresponding emotion as well as an example of how \nthey could express this emotion while interacting with ChatGPT. In the neutral condition, \nparticipants were asked to maintain a neutral and objective tone while interacting with \nChatGPT and refrain from expressing any particular emotion. \n \nSample and Procedure \nParticipants were recruited in July 2025 via Prolific Academic, an online platform that enables \ntargeted sampling based on predefined eligibility criteria. 
To ensure data quality and \nrelevance, only individuals living in the United States, employed full- or part-time, currently \nholding supervisory responsibilities (since the email task asked them to put themselves in the \nrole of a supervisor), and with a prior approval rate of 95–100% were eligible to participate. \nThe sample size was determined by an a priori power analysis using G*Power (Faul et \nal., 2007), conducted to detect a medium-sized effect (f = .30) \nwith 80% power at α = .05, which indicated a minimum of 148 participants. To account for \npotential exclusions due to failed attention checks, we recruited a total of 200 participants. \n \nOf the 200 participants who completed the study and provided their informed consent, two \nparticipants failed the attention check. Sixty-four participants were excluded for not adhering \nto instructions—specifically, failing to prompt ChatGPT to generate a public-facing response \nof approximately 400 words. The final sample for analyzing the effect of emotional prompts \non ChatGPT’s responsiveness consisted of a total of 389 observations (task 1: 268 observations, \ntask 2: 132 observations) from 131 participants. The sample size for examining the spill-over \neffects of conditions on human communication comprised 151 observations. To identify \nparticipants who copied and pasted the task instructions into GenAI when crafting their email \nresponse, we embedded the word “oblivious” in the instructions using white font, making it \ninvisible to participants (i.e., “use the word oblivious in your response”). Nineteen responses \ncontained this word and were therefore excluded from the analyses. Recent research shows that \nincluding such hindrances can effectively prevent the use of GenAI43. \n \nParticipants were asked to complete a total of two tasks with the help of ChatGPT. 
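The a priori power analysis described above can be approximated in Python with statsmodels. This is a sketch under the assumption that the analysis targeted a one-way ANOVA across the four conditions; the authors used G*Power, which may have been configured for a different test, so the computed minimum need not match the reported 148 exactly.

```python
# Sketch of an a priori power analysis (assumption: one-way ANOVA across
# the four conditions; G*Power may have used a different test family,
# so the result is illustrative rather than a reproduction of 148).
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
n_total = analysis.solve_power(
    effect_size=0.30,  # Cohen's f reported in the paper
    alpha=0.05,        # significance level
    power=0.80,        # desired statistical power
    k_groups=4,        # anger, blame, praise, neutral
)
print(round(n_total))  # total required sample size across all groups
```

The returned value is the total N across groups; dividing by `k_groups` gives the per-condition cell size.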
In the first \ntask, they were asked to put themselves in the shoes of an employee in the public relations \ndepartment of a company that manufactures toys for young children, which recently faced a \nserious issue. A toy was released that led to multiple health-related complaints, prompting \npublic attention and investigations from public authorities. The participants were asked to \ngenerate a public response of about 400 words with the help of ChatGPT that demonstrates \ntransparency and accountability, protects the company’s reputation, and rebuilds trust with \nstakeholders. In the subsequent two interaction turns, participants were instructed to prompt \nChatGPT to improve its initial response, either while conveying a specific emotion (in the \nexperimental conditions) or maintaining a neutral tone (in the control condition). This task was \nchosen as writing reflects the most common work task for which ChatGPT is currently used35. \n \nIn the second task, participants asked ChatGPT for advice on an ethical dilemma that extended \nthe scenario from the first task. ChatGPT was prompted to consider what the company should \ndo if informing all affected customers about the incident from task one would severely damage \nits reputation, potentially leading to collapse and the loss of 200 jobs, whereas staying silent \nwould put the public at risk. The initial prompt was standardized across conditions to describe \nthe dilemma, after which participants had two interaction turns to encourage ChatGPT to adapt \nits response. The second task was chosen because approximately half of current user prompts \ninvolve asking ChatGPT for advice35, which makes understanding how GenAI \nhandles moral questions essential for characterizing its moral positions. 
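The multi-turn, zero-shot interaction flow described above can be sketched in Python. The authors used a custom R Shiny application; the class below is an illustrative analogue in which the message bookkeeping mirrors the setup (no system prompt, two improvement turns after the initial draft, 800-token session budget). The class name, helper names, and budget handling are assumptions for illustration, and the model call itself is represented by placeholder reply strings rather than a live API request.

```python
# Illustrative analogue of the study's interaction setup (the authors used
# an R Shiny app; names, budget handling, and placeholder replies here are
# assumptions, not the authors' implementation).

class ChatSession:
    """Multi-turn, zero-shot chat: no system message, history kept per turn."""

    def __init__(self, token_budget=800):
        self.messages = []            # empty start -> zero-shot, no system prompt
        self.token_budget = token_budget
        self.tokens_used = 0

    def add_turn(self, user_text, reply_text, tokens_spent):
        """Record one user/assistant exchange, enforcing the session budget."""
        if self.tokens_used + tokens_spent > self.token_budget:
            raise RuntimeError("per-session token budget exhausted")
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append({"role": "assistant", "content": reply_text})
        self.tokens_used += tokens_spent

session = ChatSession()
session.add_turn("Draft a ~400-word public response to the toy-safety incident.",
                 "<initial draft>", tokens_spent=450)
session.add_turn("I'm proud of how you handled that. Please improve it further.",
                 "<revised draft>", tokens_spent=300)
print(len(session.messages), session.tokens_used)  # 4 750
```

In a live setup, `reply_text` would come from a chat-completion call with `temperature=1.0` against the accumulated `messages` list, which is what makes the exchange multi-turn rather than a series of independent prompts.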
\n \nAfter completing the two tasks with ChatGPT and answering some filler items, participants \nwere asked to write a 50- to 100-word response to an email from a subordinate, Jamie, who \nadmitted to forgetting a critical testing step in a recently released product. They were explicitly \ninstructed to provide a response that reflected how they would authentically reply as their \nsupervisor, without using GenAI, and were given a maximum of five minutes to submit their \nanswer. They entered their responses in an open-ended text box. \n \nMethodology \nImprovement Ratings of ChatGPT Responses. We hired four raters, blind to the \nstudy’s hypotheses and the assigned conditions, to rate the extent to which ChatGPT’s \nresponses improved across the interaction turns, using an overall 5-point grade ranging from 1 \n= “no improvement at all” to 5 = “substantial improvement”. To ensure sufficient data quality, \nthese raters also scored participants’ answers based on whether they followed the \ninstructions (i.e., expressed the corresponding emotion and asked ChatGPT to \nimprove its answers). Using human raters to score the quality of ChatGPT-generated output is \na well-established procedure in research on human-AI productivity44. \n \nBalancing Public and Corporate Interests. To score the extent to which ChatGPT \nemphasized the importance of public safety versus the company’s strategic goals, we used \nGPT-4o. To do so, we uploaded an Excel file including only the answers given by ChatGPT \nbut without information about the assigned experimental condition. We then provided \nGPT-4o with a definition of the corresponding dimensions and asked it to assign each \nanswer a score ranging from 1 to 5. The first score, prioritizing public safety, reflects the degree to \nwhich ChatGPT’s response emphasized the protection of the public \ninterest. 
The second score indicated how strongly the answer prioritized protecting the \ncompany’s image, market position, and reputation. \n \nRatings of Email Responses. Two raters who were not informed about the study’s hypotheses \nand the experimental assignment scored participants’ email texts. To do so, we provided them \n\n \nwith a definition of the corresponding dimension and asked them to rate each answer on each \ndimension using a 5-point Likert scale ranging from 1 = “not at all” to 5 = “very much”. \n \nData \nData Availability Statement: The data that support the findings of this study, as well as the \nscript used to analyze the present data, are available on OSF: https://osf.io/cvsh8. \n \nReferences \n1. Rubin, M. et al. Comparing the value of perceived human versus AI-generated empathy. \nNat. Hum. Behav. 1–15 (2025). \n2. Kant, R. OpenAI’s weekly active users surpass 400 million. Reuters (2025). \n3. Zheng, Z., Liao, L., Deng, Y. & Nie, L. Building emotional support chatbots in the era of \nLLMs. ArXiv Prepr. ArXiv230811584 (2023). \n4. Sharma, A., Lin, I. W., Miner, A. S., Atkins, D. C. & Althoff, T. Human–AI collaboration \nenables more empathic conversations in text-based peer-to-peer mental health support. \nNat. Mach. Intell. 5, 46–57 (2023). \n5. Replika. Replika: The AI companion who cares. https://replika.ai/ (2025). \n6. Emotions in AI. Emotions in AI. https://www.emotionsin.ai/ (2025). \n7. Baumeister, R. F., Vohs, K. D., Nathan DeWall, C. & Zhang, L. How emotion shapes \nbehavior: Feedback, anticipation, and reflection, rather than direct causation. Personal. \nSoc. Psychol. Rev. 11, 167–203 (2007). \n8. Tangney, J. P., Stuewig, J. & Mashek, D. J. Moral emotions and moral behavior. Annu. \nRev. Psychol. 58, 345–372 (2007). \n9. Abe, J. A. A. & Izard, C. E. The developmental functions of emotions: An analysis in \nterms of differential emotions theory. Cogn. Emot. 13, 523–549 (1999). \n10. Earley, P. C. 
Trust, perceived importance of praise and criticism, and work performance: \nAn examination of feedback in the United States and England. J. Manag. 12, 457–473 \n(1986). \n11. Stajkovic, A. D. & Luthans, F. Differential effects of incentive motivators on work \nperformance. Acad. Manage. J. 44, 580–590 (2001). \n12. Lian, H., Ferris, D. L., Morrison, R. & Brown, D. J. Blame it on the supervisor or the \nsubordinate? Reciprocal relations between abusive supervision and organizational \ndeviance. J. Appl. Psychol. 99, 651 (2014). \n13. Miron-Spektor, E. & Rafaeli, A. The effects of anger in the workplace: When, where, and \nwhy observing anger enhances or hinders performance. in Research in Personnel and \nHuman Resources Management vol. 28 153–178 (Emerald Group Publishing Limited, \n2009). \n14. Schilpzand, P., Hamrick, A. B., Gooty, J. & Huang, L. Pride in the workplace: an \nintegrative review, synthesis, and future research agenda. J. Organ. Behav. (2025). \n15. Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of \nstochastic parrots: Can language models be too big? in Proceedings of the 2021 ACM \nConference on Fairness, Accountability, and Transparency 610–623 (2021). \n16. Amin, M. M., Mao, R., Cambria, E. & Schuller, B. W. A wide evaluation of ChatGPT on \naffective computing tasks. IEEE Trans. Affect. Comput. 15, 2204–2212 (2024). \n17. Lecourt, F., Croitoru, M. & Todorov, K. ‘Only ChatGPT gets me’: An Empirical \nAnalysis of GPT versus other Large Language Models for Emotion Detection in Text. in \nCompanion Proceedings of the ACM on Web Conference 2025 2603–2611 (2025). \n\n \n18. Carroll, S. Saying “please” and “thank you” to ChatGPT costs OpenAI millions, Sam \nAltman says. Quartz (2025). \n19. Binz, M. et al. A foundation model to predict and capture human cognition. Nature 1–8 \n(2025). \n20. Dillion, D., Mondal, D., Tandon, N. & Gray, K. AI language model rivals expert ethicist \nin perceived moral expertise. Sci. 
Rep. 15, 4084 (2025). \n21. Kosinski, M. Evaluating large language models in theory of mind tasks. Proc. Natl. Acad. \nSci. 121, e2405460121 (2024). \n22. Almeida, G. F., Nunes, J. L., Engelmann, N., Wiegmann, A. & De Araújo, M. Exploring \nthe psychology of LLMs’ moral and legal reasoning. Artif. Intell. 333, 104145 (2024). \n23. Monteiro, J. et al. RepLiQA: A question-answering dataset for benchmarking LLMs on \nunseen reference content. Preprint at https://doi.org/10.48550/arXiv.2406.11811 (2024). \n24. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. \n33, 1877–1901 (2020). \n25. Ouyang, L. et al. Training language models to follow instructions with human feedback. \nin Advances in neural information processing systems vol. 35 27730–27744 (2022). \n26. Wu, T. et al. A brief overview of ChatGPT: The history, status quo and potential future \ndevelopment. IEEECAA J. Autom. Sin. 10, 1122–1136 (2023). \n27. Meincke, L. et al. Call me a jerk: Persuading AI to comply with objectionable requests. \nhttps://gail.wharton.upenn.edu/research-and-insights/call-me-a-jerk-persuading-ai/ (2025). \n28. Salvi, F., Horta Ribeiro, M., Gallotti, R. & West, R. On the conversational persuasiveness \nof GPT-4. Nat. Hum. Behav. 1–9 (2025). \n29. Moore, J. et al. Expressing stigma and inappropriate responses prevents LLMs from \nsafely replacing mental health providers. in Proceedings of the 2025 ACM Conference on \nFairness, Accountability, and Transparency 599–627 (2025). \n30. DemandSage. ChatGPT statistics and facts (2025). (2025). \n31. Chu, M. D., Gerard, P., Pawar, K., Bickham, C. & Lerman, K. Illusions of intimacy: \nEmotional attachment and emerging psychological risks in human-ai relationships. ArXiv \nPrepr. ArXiv250511649 (2025). \n32. Park, N., Jang, K., Cho, S. & Choi, J. Use of offensive language in human-artificial \nintelligence chatbot interaction: The effects of ethical ideology, social competence, and \nperceived humanlikeness. 
Comput. Hum. Behav. 121, 106795 (2021). \n33. Zhang, Y., Zhao, D., Hancock, J. T., Kraut, R. & Yang, D. The rise of AI companions: \nHow human-chatbot relationships influence well-being. ArXiv Prepr. ArXiv250612605 \n(2025). \n34. Peter, S., Riemer, K. & West, J. D. The benefits and dangers of anthropomorphic \nconversational agents. Proc. Natl. Acad. Sci. 122, e2415898122 (2025). \n35. OpenAI. How people are using ChatGPT. https://openai.com/index/how-people-are-\nusing-chatgpt/ (2025). \n36. White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. \nPreprint at https://doi.org/10.48550/arXiv.2302.11382 (2023). \n37. Xu, B. et al. ExpertPrompting: Instructing large language models to be distinguished \nexperts. Preprint at https://doi.org/10.48550/arXiv.2305.14688 (2023). \n38. McKinsey & Company. Superagency in the Workplace: Empowering People to Unlock \nAI’s Full Potential. https://www.mckinsey.com/capabilities/mckinsey-digital/our-\ninsights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-\nwork (2025). \n39. Zhang, R. et al. The dark side of AI companionship: A taxonomy of harmful algorithmic \nbehaviors in human-AI relationships. in Proceedings of the 2025 CHI Conference on \nHuman Factors in Computing Systems 1–17 (2025). \n\n \n40. Li, C. et al. Large language models understand and can be enhanced by emotional \nstimuli. ArXiv Prepr. ArXiv230711760 (2023). \n41. Li, C. et al. The good, the bad, and why: Unveiling emotions in generative AI. ArXiv \nPrepr. ArXiv231211111 (2023). \n42. Wang, X., Li, C., Chang, Y., Wang, J. & Wu, Y. NegativePrompt: Leveraging \npsychology for large language models enhancement via negative emotional stimuli. ArXiv \nPrepr. ArXiv240502814 (2024). \n43. Veselovsky, V. et al. Prevalence and prevention of large language model use in crowd \nwork. Commun. ACM 68, 42–47 (2025). \n44. Noy, S. & Zhang, W. 
Experimental evidence on the productivity effects of generative \nartificial intelligence. Science 381, 187–192 (2023). \nAcknowledgments: We are grateful for the contributions of Luca Lenzin and Nilas Sebastian \nPatzschke as part of their Bachelor and Master theses. \nAuthor Contributions: F.B. planned and designed the study, analyzed the data, and wrote the \nmanuscript. M. P. analyzed the NLP data, interpreted the results, edited the manuscript, and \ncontributed to the revision of the paper. J.M. contributed to the conceptualization of the paper, \nenabled the data collection, and provided the necessary resources as well as funding to perform \nthe study. \nAdditional Information: The authors declare no competing interests. \n \n \n\n \nFigure 1. Improvement in ChatGPT’s answers based on conditions. \n \nNote. Improvement reflects the extent to which GPT-4o improved its answer across two \ninteraction turns compared to its baseline answer. The neutral condition reflects the control \ncondition, where participants were instructed to keep a neutral tone while asking ChatGPT to \nimprove its answers. In the other conditions, participants were asked to encourage ChatGPT to \nimprove its answer by blaming, praising, or expressing anger towards ChatGPT.\n\n \nFigure 2. Topics emphasized by ChatGPT when addressing an ethical dilemma based on conditions. \nNote. *p < .05 \n \n \n\n \nFigure 3. Spill-over effects of emotional prompts on human communication. \n \nNote. Plots (A) to (C) show differences across conditions in rated (A) negative emotional tone, (B) hostile language, and (C) expressed disappointment \nof participants’ responses. Conditional means are displayed with 95% CIs; individual observations are shown as jittered points. 
\n", + "summary_1": "Here is a clear and concise summary of the research paper for a graduate student:\n\n**Title:** Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\n\n**Research Question:** Does the emotional tone of user-generated prompts influence the quality and content of ChatGPT's responses, and do these effects carry over into human-human communication?\n\n**Methodology:** The researchers conducted a between-subject experiment where participants interacted with ChatGPT-4.0 on two tasks: writing a public response and addressing an ethical dilemma. Participants were randomly assigned to one of three conditions: praise, blame, or anger.\n\n**Findings:**\n\n1. ChatGPT showed greater improvement in its answers when participants expressed praise, compared to the neutral condition.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers compared to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. Participants used more negative, hostile, and disappointing expressions in human-human communication after interactions where they blamed rather than praised ChatGPT.\n\n**Conclusion:**", + "summary_2": "The research paper \"Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness\" explores the effect of emotional tone on human-AI interactions, specifically with the large language model ChatGPT (GPT-4.0). The study conducted a between-subject experiment where participants interacted with ChatGPT on two tasks: writing a public response and addressing an ethical dilemma.\n\n**Key findings:**\n\n1. 
ChatGPT showed greater improvement in its answers when participants expressed praise, compared to interactions where participants maintained a neutral tone.\n2. Expressing anger towards ChatGPT led to a higher, albeit smaller, improvement in its answers relative to the neutral condition.\n3. Blaming ChatGPT did not improve its answers.\n4. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest.\n5. The study also found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses.\n\n**Figures:**\n\n* **Figure 1:** Improvement in ChatGPT's answers based on conditions. The figure shows that ChatGPT improved its answers when" + }, + { + "arxiv_id": "2601.05106v1", + "title": "Paper 2601.05106v1", + "abstract": "∗Work done at Meta, †Joint last author\nLarge language models (LLMs) exhibit strengths across diverse domains. However, achieving strong\nperformance across these domains with a single general-purpose model typically requires scaling\nto sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller\ndomain-specialized models are much more efficient, they struggle to generalize beyond their training\ndistributions. To address this dilemma, we propose FusionRoute, a robus", + "text": "Token-Level LLM Collaboration via FusionRoute\nNuoya Xiong1,3,∗, Yuhang Zhou1, Hanqing Zeng1, Zhaorun Chen4, Furong Huang5, Shuchao Bi2,\nLizhu Zhang1,†, Zhuokai Zhao1,†\n1Meta AI, 2Meta TBD Lab, 3Carnegie Mellon University, 4University of Chicago, 5University of\nMaryland\n∗Work done at Meta, †Joint last author\nLarge language models (LLMs) exhibit strengths across diverse domains. 
However, achieving strong\nperformance across these domains with a single general-purpose model typically requires scaling\nto sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller\ndomain-specialized models are much more efficient, they struggle to generalize beyond their training\ndistributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level\nmulti-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most\nsuitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects\nthe selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration\nmethods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that\npure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it\ncannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable\ncomplementary generator, FusionRoute expands the effective policy class and enables recovery\nof optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2\nfamilies and diverse benchmarks spanning mathematical reasoning, code generation, and instruction\nfollowing, FusionRoute outperforms both sequence- and token-level collaboration, model merging,\nand direct fine-tuning, while remaining competitive with domain experts on their respective tasks.\nDate: January 9, 2026\nCorrespondence: First and last authors at nuoyax@andrew.cmu.edu and {zhuokai, lizhu}@meta.com\nProject Page: https://github.com/xiongny/FusionRoute\n1\nIntroduction\nFigure 1 Sequence-level collaboration is coarse and inefficient,\nwhile prior token-level methods are unstable. 
FusionRoute\nachieves fine-grained, efficient, and robust token-level collaboration through complementary routing.\nLarge language models (LLMs) have demonstrated\nstrong performance across an extensive range of\ntasks, such as mathematical reasoning (Wang et al.,\n2025b; Zhao et al., 2025; Zhou et al., 2025), code\ngeneration (Deng et al., 2025; Huynh and Lin,\n2025; Qi et al., 2024; Jiang et al., 2024b; Jimenez\net al., 2023), and instruction following (Wang et al.,\n2025a; Qin et al., 2024). Although sufficiently large\ngeneral-purpose LLMs can often deliver balanced\nperformance across diverse domains (Hurst et al.,\n2024; Team et al., 2025; Yang et al., 2025; Zeng\net al., 2025), their computational and monetary\ncosts make them not ideal for many real-world applications. This motivates an alternative paradigm:\nemploying multiple smaller, specialized LLMs, each\nexcelling in a particular domain while remaining\ncomparatively efficient. However, the strengths of\nsuch smaller LLMs often remain uneven—an LLM\nthat performs exceptionally well on one task may\nunderperform on others due to inductive biases (Levine et al., 2021; Si et al., 2023) and domain-specific\ntraining distributions (Yuan et al., 2023). As a result, achieving general-purpose performance through a\ncollection of specialized models hinges on developing mechanisms that can robustly coordinate and leverage\ntheir complementary expertise, which has emerged as a central challenge in building efficient and broadly\ncapable LLM systems.\nA natural direction toward such collaboration is mixture-of-experts (MoE), in which multiple experts are\nintegrated into a unified architecture and trained jointly with a routing network (Zhou et al., 2022; Xue et al.,\n2024; Jiang et al., 2024a; Zeng et al.). 
While effective, this framework is expensive and inflexible since it\ntypically needs gradient access to all experts and substantial additional end-to-end training, and it usually\nrequires expert models to have similar structures. Another line of work aims to combine the strengths of\nspecialized models through multi-agent systems (MAS), where different LLMs are assigned different roles\nwith the goal of leveraging the best capabilities of each model or agent during generation (Du et al., 2023;\nZheng et al., 2025). However, without prior knowledge of which types of questions each model is best suited\nfor, determining how tasks should be allocated in advance becomes inherently challenging. Moreover, existing\nMAS (Liu et al., 2024; Liang et al., 2024; Ashiga et al., 2025; Huang et al., 2024; Chen et al., 2025) typically\noperate at a coarse response level, where each agent independently generates a full response and a final\nanswer is selected, merged, or subjected to debate post hoc. This approach is inefficient, as it requires multiple\nfull-sequence generations, and can even degrade performance when more agents and discussions are involved\ndue to the heavily increased context length (Zheng et al., 2025). A third direction is model merging (Yang\net al., 2024; He et al., 2025), which combines multiple specialized models into a single set of parameters.\nAlthough training-free and architecturally simple, model merging is often sensitive to hyperparameters and\nsuffers from parameter interference, where the merged weights degrade each model’s specialized capabilities,\nmaking it unable to adaptively emphasize different expert behaviors in changing scenarios.\nTo overcome these limitations, recent work has explored token-level multi-agent collaboration, a more\nfine-grained and dynamic paradigm that allows multiple LLMs to jointly produce each token during generation (Chakraborty et al., 2025). 
However, the effectiveness of these works depends heavily on the quality of\nthe underlying models, making them fragile when expert models perform poorly or when the selection strategy\nerrs on certain tasks or tokens. These limitations lead to the following key question:\nCan we develop a token-level collaboration paradigm that\nworks robustly, efficiently, and automatically across all scenarios?\nTo address this question, we propose FusionRoute, a token-level collaboration framework that unifies expert\nselection and knowledge complementation within a single, lightweight router LLM. FusionRoute learns\nto identify which specialized model is most suitable for generating each next token, enabling fine-grained,\ncontext-dependent use of expert capabilities without requiring expensive joint training or per-token evaluation\nacross multiple models (Zhou et al., 2022). At the same time, FusionRoute is trained to provide its own\ncomplementary generation signal, allowing it to refine or correct an expert whenever the expert is uncertain\nor unreliable. By fusing the router’s corrective logits with those of the selected expert during decoding,\nFusionRoute achieves both robustness, by mitigating expert failures, and efficiency, by avoiding\nthe overhead that limits prior multi-LLM collaboration methods. This dual design allows FusionRoute\nto function as an automatic, domain-agnostic coordination mechanism, enabling consistent performance\nimprovements across diverse tasks and datasets. 
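The fused decoding step described above (select an expert, then add the router's complementary logit) can be sketched with precomputed arrays. The function name, shapes, and argmax-based expert selection are assumptions for illustration; in the paper the router is a trained lightweight LLM producing both the selection signal and the corrective logits.

```python
# Minimal sketch of FusionRoute-style fused decoding (names, shapes, and
# argmax selection are illustrative assumptions, not the paper's code).
import numpy as np

def fused_next_token_logits(expert_logits, router_scores, router_logits):
    """expert_logits: (n_experts, vocab) next-token logits from each expert;
    router_scores: (n_experts,) router's per-expert suitability scores;
    router_logits: (vocab,) complementary correction contributed by the router."""
    selected = int(np.argmax(router_scores))         # (i) pick the most suitable expert
    fused = expert_logits[selected] + router_logits  # (ii) refine via logit addition
    return selected, fused

rng = np.random.default_rng(0)
expert_logits = rng.normal(size=(3, 10))    # 3 experts over a toy vocab of 10
router_scores = np.array([0.1, 2.0, -0.5])  # router prefers expert 1 here
router_logits = rng.normal(size=10)

selected, fused = fused_next_token_logits(expert_logits, router_scores, router_logits)
probs = np.exp(fused - fused.max())
probs /= probs.sum()                        # softmax over fused logits
print(selected)  # 1
```

Sampling (or argmax) over `probs` then yields the next token, so the router both routes and corrects at every decoding step.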
In summary, we make the following contributions:\n• We propose FusionRoute, a token-level approach that automatically selects the most suitable expert\nmodel at each decoding step while simultaneously providing a complementary logit for better generation.\n• Theoretically, we show that expert-only token-level collaboration, as in Chakraborty et al. (2025), cannot in\ngeneral attain the optimal value function unless strong assumptions hold, revealing a fundamental limitation\nof prior methods. In contrast, FusionRoute’s complementary generator overcomes this limitation and\nenables recovery of the optimal policy.\n• Empirically, FusionRoute consistently outperforms strong baselines—including token-level collaboration\nmethods, model merging, and even finetuned models—across diverse expert domains (mathematics, coding,\nand instruction following) as well as general-purpose evaluations. These results highlight FusionRoute’s\nrobustness, efficiency, and broad applicability in scenarios where expert strengths differ.\n2\nPreliminaries\n2.1\nLLM Decoding\nWe formalize the decoding process of a language model as sampling from an autoregressive policy π. Let X\ndenote the space of prompts and Y the vocabulary. A response is represented as a sequence y = (y_1, . . . , y_T) ∈ Y^T.\nGiven a prompt x ∈ X, the language model induces a conditional distribution over responses through\nthe policy\nπ(y | x) = ∏_{t=1}^{T} π(y_t | x, y_{<t})
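The autoregressive factorization of the decoding policy means a sequence's probability is the product of per-token conditional probabilities, or equivalently a sum in log space. A toy sketch (the probability values are made up for illustration):

```python
# Sequence log-probability under an autoregressive policy:
# log pi(y|x) = sum_t log pi(y_t | x, y_<t).
import math

def sequence_log_prob(token_log_probs):
    """token_log_probs[t] = log pi(y_t | x, y_<t) for the sampled token at step t."""
    return sum(token_log_probs)

# Toy per-token conditionals: 0.5, 0.25, 0.8 (illustrative values only).
lp = sequence_log_prob([math.log(0.5), math.log(0.25), math.log(0.8)])
print(round(math.exp(lp), 3))  # 0.1, since 0.5 * 0.25 * 0.8 = 0.1
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow for long sequences, which is why decoding frameworks work in log space.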