=== AGENTS.md (new file, 133 additions) ===
# Agent Data Protocol - Repository Guidelines

This document captures key patterns and best practices for contributing to the Agent Data Protocol repository.

## Repository Structure

```
agent-data-protocol/
├── datasets/                  # Dataset implementations
│   └── $DATASET_NAME/
│       ├── README.md
│       ├── extract_raw.py
│       ├── raw_to_standardized.py
│       ├── schema_raw.py (optional)
│       ├── api.py (optional)
│       ├── sample_raw.json
│       ├── sample_std.json
│       ├── sample_sft.json
│       └── sample_sft/
│           └── sample_sft_$AGENT.json
├── agents/                    # Agent-specific SFT converters
├── schema/                    # ADP standardized format definitions
├── scripts/                   # Utility scripts
└── tests/                     # Validation tests
```

## Data Flow Pipeline

```
Raw Dataset      →   Standardized Format   →   Agent-Specific SFT Format
     ↓                        ↓                           ↓
sample_raw.json  →    sample_std.json       →    sample_sft.json
```
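As a concrete illustration, a single record might move through the three stages like this. The raw fields are hypothetical (they vary per source dataset); the standardized and SFT shapes follow the schemas described later in this document, and the `"human"` role is an assumption based on common SFT conventions:

```python
# Hypothetical raw record; real field names depend on the source dataset.
raw = {"instruction": "List the files.", "response": "ls -la"}

# Standardized stage: a sequence of actions/observations (field names
# beyond the action types are illustrative).
std = [
    {"type": "MessageAction", "content": "List the files."},
    {"type": "CodeAction", "content": "ls -la"},
]

# SFT stage: conversation turns. Note that the turn containing a
# function-call pattern uses "from": "function_call", not "from": "gpt".
sft = {
    "conversations": [
        {"from": "human", "value": "List the files."},
        {
            "from": "function_call",
            "value": "<function=execute_bash>\n<parameter=command>ls -la</parameter>\n</function>",
        },
    ]
}
```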

## Key Requirements

### File Naming
- Only these JSON files are allowed in dataset directories:
- `sample_raw.json`
- `sample_std.json`
- `sample_sft.json`
- `generated_thoughts.json`
- All JSON files MUST have a trailing newline
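A minimal sketch of how these naming and trailing-newline rules could be checked locally before committing; the helper names are ours, not part of the repository's test suite:

```python
# JSON filenames permitted at the root of a dataset directory,
# per the file-naming rules above.
ALLOWED_JSON = {
    "sample_raw.json",
    "sample_std.json",
    "sample_sft.json",
    "generated_thoughts.json",
}


def json_name_allowed(name: str) -> bool:
    """Return True if a JSON filename is allowed in a dataset directory."""
    return name in ALLOWED_JSON


def has_trailing_newline(text: str) -> bool:
    """Return True if file content ends with the required trailing newline."""
    return text.endswith("\n")
```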

### SFT Format Requirements

**Critical**: Messages containing function call patterns MUST use `"from": "function_call"`, not `"from": "gpt"`.

Function call patterns that trigger this requirement:
- `<function=`
- `<function_calls>`
- `<invoke name=`

Example correct format:
```json
{
  "from": "function_call",
  "value": "I'll run the command.\n\n<function=execute_bash>\n<parameter=command>ls -la</parameter>\n</function>"
}
```
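A small helper along these lines can build such `value` strings consistently; the function name `render_function_call` is an assumption for illustration, not an existing repository utility:

```python
def render_function_call(function: str, **kwargs) -> str:
    """Serialize a tool call into the <function=...> text format shown
    above. Each keyword argument becomes one <parameter=...> element."""
    params = "\n".join(
        f"<parameter={key}>{value}</parameter>" for key, value in kwargs.items()
    )
    return f"<function={function}>\n{params}\n</function>"
```

For example, `render_function_call("execute_bash", command="ls -la")` yields the call text used in the message above.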

### Standardized Schema Components

**Actions:**
- `MessageAction`: Text-based communication
- `CodeAction`: Code execution requests
- `ApiAction`: API/function calls with `function` and `kwargs` fields

**Observations:**
- `TextObservation`: Text-based responses with `source` field (user/environment)
- `WebObservation`: Web page content
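To make the field layout concrete, here is a sketch of two standardized records built as plain dicts. Only the fields named above (`function`, `kwargs`, `source`) come from the schema description; the `type` and `content` keys are illustrative assumptions:

```python
# Hedged sketch of an ApiAction: an API/function call with its arguments.
api_action = {
    "type": "ApiAction",
    "function": "execute_bash",        # which function/API to invoke
    "kwargs": {"command": "ls -la"},   # keyword arguments for the call
}

# Hedged sketch of a TextObservation: a text response with its source.
text_obs = {
    "type": "TextObservation",
    "source": "environment",           # per the schema: "user" or "environment"
    "content": "total 0",
}
```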

## Commands

### Generate sample files
```bash
export MY_DATASET=your_dataset
export PYTHONPATH=`pwd`:$PYTHONPATH

# Extract raw data (5 samples)
python datasets/$MY_DATASET/extract_raw.py | head -5 | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_raw.json

# Convert to standardized format
cat datasets/$MY_DATASET/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/$MY_DATASET/raw_to_standardized.py | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_std.json

# Convert to SFT format (OpenHands)
cat datasets/$MY_DATASET/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=execute_bash | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_sft/sample_sft_openhands.json
```

### Run tests
```bash
# All tests
python -m pytest tests/ -v

# Tests for specific dataset
python -m pytest tests/ -v -k "dataset_name"

# Key validation tests
python -m pytest tests/test_dataset_structure.py -v
python -m pytest tests/test_datasets_from_parameter.py -v
python -m pytest tests/test_standardized_schemas.py -v
```

## Common Issues

1. **Missing trailing newline**: All JSON files must end with `\n`
2. **Wrong `from` field**: Function calls must use `"from": "function_call"`
3. **Extra JSON files**: Remove any temporary `.json` files before committing
4. **Missing `sample_sft.json`**: Required at root level if `sample_std.json` exists

## Post-Processing SFT Files

If your SFT conversion produces `"from": "gpt"` for function calls, apply this fix:

```python
import json

# Message values containing any of these markers are function calls.
function_patterns = ['<function=', '<function_calls>', '<invoke name=']

with open('sample_sft.json', 'r') as f:
    data = json.load(f)

for item in data:
    for message in item.get('conversations', []):
        value = message.get('value', '')
        # Re-label mis-tagged function-call turns.
        if any(p in value for p in function_patterns):
            if message.get('from') == 'gpt':
                message['from'] = 'function_call'

with open('sample_sft.json', 'w') as f:
    json.dump(data, f, indent=2)
    f.write('\n')  # required trailing newline
```
=== datasets/coderforge_preview/README.md (new file, 33 additions) ===
# CoderForge-Preview Dataset

## Description

CoderForge-Preview is the largest open, test-verified coding-agent dataset, built for training efficient software engineering agents. It contains agent trajectories that solve real-world coding tasks, and every trajectory is test-verified for quality.

Fine-tuning Qwen-3 32B on this dataset boosts SWE-Bench Verified performance from 23.0% to 59.4% pass@1, ranking #1 among open-data and #2 among open-weight models ≤32B parameters.

The dataset focuses on:
- Large-scale agentic data generation from 51K distinct open-source tasks
- Long-horizon, multi-step SFT trajectories
- Test-verified coding agent trajectories
- Data collected using OpenHands agent framework

## Paper Citation

```bibtex
@misc{CoderForge2026,
  title     = {CoderForge-Preview: SOTA Open Dataset for Training Efficient Agents},
  author    = {Ariyak, Alpay and Zhang, Junda and Wang, Junxiong and Zhu, Shang and Bianchi, Federico and Srivastava, Sanjana and Panda, Ashwinee and Bharti, Siddhant and Xu, Chenfeng and Heo, John and Wu, Xiaoxia Shirley and Zhou, James and Liang, Percy and Song, Leon and Zhang, Ce and Athiwaratkun, Ben and Zhou, Zhongzhu and Wu, Qingyang},
  year      = {2026},
  month     = feb,
  publisher = {TogetherAI Blog},
  url       = {https://www.together.ai/blog/coderforge-preview},
  note      = {Project core leads: Alpay Ariyak; Zhongzhu Zhou; Qingyang Wu}
}
```

## Dataset Information

**Source URL**: https://huggingface.co/datasets/togethercomputer/CoderForge-Preview

**License**: Apache-2.0
=== datasets/coderforge_preview/api.py (new file, 56 additions) ===
def str_replace_editor(
    command: str,
    path: str,
    file_text: str = None,
    old_str: str = None,
    new_str: str = None,
    insert_line: int = None,
    view_range: list = None,
) -> None:
    """View, create, and edit files with this custom editing tool.

    Args:
    ----
        command (str): The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.
        path (str): Absolute path to file or directory, e.g. `/repo/file.py` or `/repo`.
        file_text (str): Required parameter of `create` command, with the content of the file to be created.
        old_str (str): Required parameter of `str_replace` command containing the string in `path` to replace.
        new_str (str): Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.
        insert_line (int): Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.
        view_range (list): Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.

    """
    pass


def finish(message: str = None):
    """Finish the interaction when the task is complete OR if the assistant cannot proceed further with the task.

    Args:
    ----
        message (str): Optional final message to the user.

    """
    pass


def execute_bash(command: str):
    """Execute a bash command in the terminal.

    Args:
    ----
        command (str): The bash command to execute.

    """
    pass


def think(thought: str):
    """Log a thought for reasoning.

    Args:
    ----
        thought (str): The thought to log.

    """
    pass
=== datasets/coderforge_preview/extract_raw.py (new file, 16 additions) ===
import json

from datasets import load_dataset

# Load all splits from the trajectories config
dataset = load_dataset("togethercomputer/CoderForge-Preview", "trajectories")
ids = {}  # occurrence count per trajectory_id, used to de-duplicate

for split in ["SWE_Rebench", "SWE_Smith", "R2E_Gym", "filtered_reward1"]:
    for item in dataset[split]:
        traj_id = str(item["trajectory_id"])
        if traj_id not in ids:
            ids[traj_id] = 0
        # Make the id unique by suffixing the occurrence count
        item["id"] = f"{traj_id}_{ids[traj_id]}"
        ids[traj_id] += 1
        # Emit one JSON object per line (JSONL) on stdout
        print(json.dumps(item))