=== AGENTS.md (new file, 133 additions) ===
# Agent Data Protocol - Repository Guidelines

This document captures key patterns and best practices for contributing to the Agent Data Protocol repository.

## Repository Structure

```
agent-data-protocol/
├── datasets/                  # Dataset implementations
│   └── $DATASET_NAME/
│       ├── README.md
│       ├── extract_raw.py
│       ├── raw_to_standardized.py
│       ├── schema_raw.py (optional)
│       ├── api.py (optional)
│       ├── sample_raw.json
│       ├── sample_std.json
│       ├── sample_sft.json
│       └── sample_sft/
│           └── sample_sft_$AGENT.json
├── agents/                    # Agent-specific SFT converters
├── schema/                    # ADP standardized format definitions
├── scripts/                   # Utility scripts
└── tests/                     # Validation tests
```

## Data Flow Pipeline

```
Raw Dataset      →   Standardized Format   →   Agent-Specific SFT Format
     ↓                        ↓                           ↓
sample_raw.json  →    sample_std.json       →    sample_sft.json
```
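As a concrete illustration, a single record might move through the three stages like this. The raw fields are hypothetical (they vary per source dataset); the standardized and SFT shapes follow the schemas described later in this document, and the `"human"` role is an assumption based on common SFT conventions:

```python
# Hypothetical raw record; real field names depend on the source dataset.
raw = {"instruction": "List the files.", "response": "ls -la"}

# Standardized stage: a sequence of actions/observations (field names
# beyond the action types are illustrative).
std = [
    {"type": "MessageAction", "content": "List the files."},
    {"type": "CodeAction", "content": "ls -la"},
]

# SFT stage: conversation turns. Note that the turn containing a
# function-call pattern uses "from": "function_call", not "from": "gpt".
sft = {
    "conversations": [
        {"from": "human", "value": "List the files."},
        {
            "from": "function_call",
            "value": "<function=execute_bash>\n<parameter=command>ls -la</parameter>\n</function>",
        },
    ]
}
```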

## Key Requirements

### File Naming
- Only these JSON files are allowed in dataset directories:
- `sample_raw.json`
- `sample_std.json`
- `sample_sft.json`
- `generated_thoughts.json`
- All JSON files MUST have a trailing newline
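A minimal sketch of how these naming and trailing-newline rules could be checked locally before committing; the helper names are ours, not part of the repository's test suite:

```python
# JSON filenames permitted at the root of a dataset directory,
# per the file-naming rules above.
ALLOWED_JSON = {
    "sample_raw.json",
    "sample_std.json",
    "sample_sft.json",
    "generated_thoughts.json",
}


def json_name_allowed(name: str) -> bool:
    """Return True if a JSON filename is allowed in a dataset directory."""
    return name in ALLOWED_JSON


def has_trailing_newline(text: str) -> bool:
    """Return True if file content ends with the required trailing newline."""
    return text.endswith("\n")
```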

### SFT Format Requirements

**Critical**: Messages containing function call patterns MUST use `"from": "function_call"`, not `"from": "gpt"`.

Function call patterns that trigger this requirement:
- `<function=`
- `<function_calls>`
- `<invoke name=`

Example correct format:
```json
{
  "from": "function_call",
  "value": "I'll run the command.\n\n<function=execute_bash>\n<parameter=command>ls -la</parameter>\n</function>"
}
```
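A small helper along these lines can build such `value` strings consistently; the function name `render_function_call` is an assumption for illustration, not an existing repository utility:

```python
def render_function_call(function: str, **kwargs) -> str:
    """Serialize a tool call into the <function=...> text format shown
    above. Each keyword argument becomes one <parameter=...> element."""
    params = "\n".join(
        f"<parameter={key}>{value}</parameter>" for key, value in kwargs.items()
    )
    return f"<function={function}>\n{params}\n</function>"
```

For example, `render_function_call("execute_bash", command="ls -la")` yields the call text used in the message above.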

### Standardized Schema Components

**Actions:**
- `MessageAction`: Text-based communication
- `CodeAction`: Code execution requests
- `ApiAction`: API/function calls with `function` and `kwargs` fields

**Observations:**
- `TextObservation`: Text-based responses with `source` field (user/environment)
- `WebObservation`: Web page content
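To make the field layout concrete, here is a sketch of two standardized records built as plain dicts. Only the fields named above (`function`, `kwargs`, `source`) come from the schema description; the `type` and `content` keys are illustrative assumptions:

```python
# Hedged sketch of an ApiAction: an API/function call with its arguments.
api_action = {
    "type": "ApiAction",
    "function": "execute_bash",        # which function/API to invoke
    "kwargs": {"command": "ls -la"},   # keyword arguments for the call
}

# Hedged sketch of a TextObservation: a text response with its source.
text_obs = {
    "type": "TextObservation",
    "source": "environment",           # per the schema: "user" or "environment"
    "content": "total 0",
}
```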

## Commands

### Generate sample files
```bash
export MY_DATASET=your_dataset
export PYTHONPATH=`pwd`:$PYTHONPATH

# Extract raw data (5 samples)
python datasets/$MY_DATASET/extract_raw.py | head -5 | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_raw.json

# Convert to standardized format
cat datasets/$MY_DATASET/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/$MY_DATASET/raw_to_standardized.py | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_std.json

# Convert to SFT format (OpenHands)
cat datasets/$MY_DATASET/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=execute_bash | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_sft/sample_sft_openhands.json
```

### Run tests
```bash
# All tests
python -m pytest tests/ -v

# Tests for specific dataset
python -m pytest tests/ -v -k "dataset_name"

# Key validation tests
python -m pytest tests/test_dataset_structure.py -v
python -m pytest tests/test_datasets_from_parameter.py -v
python -m pytest tests/test_standardized_schemas.py -v
```

## Common Issues

1. **Missing trailing newline**: All JSON files must end with `\n`
2. **Wrong `from` field**: Function calls must use `"from": "function_call"`
3. **Extra JSON files**: Remove any temporary `.json` files before committing
4. **Missing `sample_sft.json`**: Required at root level if `sample_std.json` exists

## Post-Processing SFT Files

If your SFT conversion produces `"from": "gpt"` for function calls, apply this fix:

```python
import json

# Message values containing any of these markers are function calls.
function_patterns = ['<function=', '<function_calls>', '<invoke name=']

with open('sample_sft.json', 'r') as f:
    data = json.load(f)

for item in data:
    for message in item.get('conversations', []):
        value = message.get('value', '')
        # Re-label mis-tagged function-call turns.
        if any(p in value for p in function_patterns):
            if message.get('from') == 'gpt':
                message['from'] = 'function_call'

with open('sample_sft.json', 'w') as f:
    json.dump(data, f, indent=2)
    f.write('\n')  # required trailing newline
```
=== datasets/coderforge_preview/README.md (new file, 33 additions) ===
# CoderForge-Preview Dataset

## Description

CoderForge-Preview is the largest open, test-verified coding-agent dataset, built for training efficient software engineering agents. It contains agent trajectories that solve real-world coding tasks, and every trajectory is test-verified for quality.

Fine-tuning Qwen-3 32B on this dataset boosts SWE-Bench Verified performance from 23.0% to 59.4% pass@1, ranking #1 among open-data and #2 among open-weight models ≤32B parameters.

The dataset focuses on:
- Large-scale agentic data generation from 51K distinct open-source tasks
- Long-horizon, multi-step SFT trajectories
- Test-verified coding agent trajectories
- Data collected using OpenHands agent framework

## Paper Citation

```bibtex
@misc{CoderForge2026,
  title     = {CoderForge-Preview: SOTA Open Dataset for Training Efficient Agents},
  author    = {Ariyak, Alpay and Zhang, Junda and Wang, Junxiong and Zhu, Shang and Bianchi, Federico and Srivastava, Sanjana and Panda, Ashwinee and Bharti, Siddhant and Xu, Chenfeng and Heo, John and Wu, Xiaoxia Shirley and Zhou, James and Liang, Percy and Song, Leon and Zhang, Ce and Athiwaratkun, Ben and Zhou, Zhongzhu and Wu, Qingyang},
  year      = {2026},
  month     = feb,
  publisher = {TogetherAI Blog},
  url       = {https://www.together.ai/blog/coderforge-preview},
  note      = {Project core leads: Alpay Ariyak; Zhongzhu Zhou; Qingyang Wu}
}
```

## Dataset Information

**Source URL**: https://huggingface.co/datasets/togethercomputer/CoderForge-Preview

**License**: Apache-2.0
=== datasets/coderforge_preview/api.py (new file, 56 additions) ===
def str_replace_editor(
    command: str,
    path: str,
    file_text: str = None,
    old_str: str = None,
    new_str: str = None,
    insert_line: int = None,
    view_range: list = None,
) -> None:
    """View, create, and edit files with this custom editing tool.

    Args:
    ----
        command (str): The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.
        path (str): Absolute path to file or directory, e.g. `/repo/file.py` or `/repo`.
        file_text (str): Required parameter of `create` command, with the content of the file to be created.
        old_str (str): Required parameter of `str_replace` command containing the string in `path` to replace.
        new_str (str): Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.
        insert_line (int): Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.
        view_range (list): Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.

    """
    pass


def finish(message: str = None):
    """Finish the interaction when the task is complete OR if the assistant cannot proceed further with the task.

    Args:
    ----
        message (str): Optional final message to the user.

    """
    pass


def execute_bash(command: str):
    """Execute a bash command in the terminal.

    Args:
    ----
        command (str): The bash command to execute.

    """
    pass


def think(thought: str):
    """Log a thought for reasoning.

    Args:
    ----
        thought (str): The thought to log.

    """
    pass
=== datasets/coderforge_preview/extract_raw.py (new file, 16 additions) ===
import json

from datasets import load_dataset

# Load all splits from the trajectories config
dataset = load_dataset("togethercomputer/CoderForge-Preview", "trajectories")
ids = {}  # occurrence count per trajectory_id, used to de-duplicate

for split in ["SWE_Rebench", "SWE_Smith", "R2E_Gym", "filtered_reward1"]:
    for item in dataset[split]:
        traj_id = str(item["trajectory_id"])
        if traj_id not in ids:
            ids[traj_id] = 0
        # Make the id unique by suffixing the occurrence count
        item["id"] = f"{traj_id}_{ids[traj_id]}"
        ids[traj_id] += 1
        # Emit one JSON object per line (JSONL) on stdout
        print(json.dumps(item))