This repository serves as a standardized collection of agent training data from the research community. We welcome contributions of new datasets and agent implementations that follow the standardized format.
- Overview
- Development Setup
- Contributing New Datasets
- Contributing New Agents
- Testing and Validation
- Code Quality Standards
- Submission Guidelines
This repository contains:
- Datasets: Agent training data in standardized format (datasets/)
- Agents: Agent-specific implementations and conversion scripts (agents/)
- Schema: ADP standardized format definitions (schema/)
- Scripts: General utility scripts (scripts/)
Raw Dataset → Standardized Format → Agent Specific SFT Format
↓ ↓ ↓
sample_raw.json → sample_std.json → sample_sft.json
Our standardized format uses these main components:
- MessageAction: Text-based communication
- CodeAction: Code execution requests
- ApiAction: API calls
- TextObservation: Text-based responses
- WebObservation: Web page content
- Python 3.12+
- Clone the repository:
git clone https://github.com/neulab/agent-data-protocol.git
cd agent-data-protocol
- Install dependencies:
pip install -r requirements.txt
- Install pre-commit hooks:
./install_hooks.sh
This will install pre-commit and set up the hooks to ensure code quality.
Note: Read schema/SCHEMA.md first for a better understanding of the ADP standardized data format.
Each dataset must follow this directory structure:
datasets/$YOUR_DATASET_NAME/
├── README.md # Dataset description and usage
├── LICENSE # Dataset license
├── requirements.txt # Dataset-specific dependencies (optional)
├── schema_raw.py # Raw data schema (optional)
├── extract_raw.py # Script to extract raw data
├── raw_to_standardized.py # Conversion script
├── api.py # API definitions (optional)
├── sample_raw.json # 5 raw samples
├── sample_std.json # 5 standardized samples
└── sample_sft # SFT format samples
├── sample_sft_$AGENT1.json # 5 SFT format samples (if applicable)
├── sample_sft_$AGENT2.json # 5 SFT format samples (if applicable)
└── sample_sft_$AGENT3.json # 5 SFT format samples (if applicable)
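Before opening a pull request, the layout above can be checked locally with a short script. The following is a minimal sketch, not the repository's own check (tests/test_dataset_structure.py remains authoritative). It assumes that every entry not marked optional is required and that each sample file holds a JSON list of five samples; the names `REQUIRED_FILES` and `check_dataset_dir` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Assumption: entries not marked "(optional)" in the layout above are required.
REQUIRED_FILES = [
    "README.md",
    "LICENSE",
    "extract_raw.py",
    "raw_to_standardized.py",
    "sample_raw.json",
    "sample_std.json",
]


def check_dataset_dir(dataset_dir: Path, expected_samples: int = 5) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for name in REQUIRED_FILES:
        if not (dataset_dir / name).is_file():
            problems.append(f"missing file: {name}")
    for name in ("sample_raw.json", "sample_std.json"):
        path = dataset_dir / name
        if not path.is_file():
            continue  # Already reported as missing above.
        try:
            samples = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{name}: invalid JSON ({exc})")
            continue
        if not isinstance(samples, list) or len(samples) != expected_samples:
            problems.append(f"{name}: expected a list of {expected_samples} samples")
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory that only contains a README.
    with tempfile.TemporaryDirectory() as tmp:
        demo_dir = Path(tmp)
        (demo_dir / "README.md").write_text("# Demo dataset\n", encoding="utf-8")
        for problem in check_dataset_dir(demo_dir):
            print(problem)
```

To check a real contribution, call `check_dataset_dir(Path("datasets") / name)` with your dataset's directory name and fix anything it reports.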
- Create your dataset directory:
export MY_DATASET=$YOUR_DATASET_NAME
mkdir -p datasets/$MY_DATASET
- Create extract_raw.py that outputs raw data in JSONL format to stdout. Example:
#!/usr/bin/env python3
"""Extract raw data from the original dataset source."""
import json
# Placeholder: replace with an iterable over the original data source.
your_data_source = []
def extract_raw_data():
    """Extract and yield raw data samples."""
    # Your extraction logic here.
    # This should read from the original data source
    # and yield individual samples.
    for sample in your_data_source:
        yield {
            "id": sample.id,
            "content": sample.content,
            # ... other fields
        }
if __name__ == "__main__":
    for sample in extract_raw_data():
        print(json.dumps(sample))  # One JSON object per line, so stdout is in JSONL format.
- Create raw data schema (optional): If your raw data has a complex or well-defined structure, create a schema_raw.py file to define Pydantic models for validation and type safety.
When to include schema_raw.py:
- Your raw data has a complex nested structure with many fields
- You want type validation and better error messages during processing
- The original dataset has a well-documented schema you want to preserve
- You're working with structured data formats (JSON with specific schemas, API responses, etc.)
How to create schema_raw.py:
from typing import List, Optional
from pydantic import BaseModel
class YourDataItem(BaseModel):
    """Define structure for individual data items."""
    field1: str
    field2: Optional[int] = None
    nested_data: List[dict]
class SchemaRaw(BaseModel):
    """Main schema for your raw data format."""
    id: str
    items: List[YourDataItem]
    metadata: Optional[dict] = None
- Generate sample raw data:
python datasets/$MY_DATASET/extract_raw.py | python scripts/jsonl_to_json.py | jq '.[0:5]' > datasets/$MY_DATASET/sample_raw.json
- Create raw_to_standardized.py that converts raw data to ADP standardized schemas. Essentially, you should map each action / observation in the raw data to an action / observation in ADP schemas.
Brief conversion examples:
# Raw data examples → Standardized actions
# Text message: {"type": "message", "text": "Hello"} →
MessageAction(class_="MessageAction", message="Hello")
# Code execution: {"type": "code", "language": "python", "code": "print('hi')"} →
CodeAction(class_="CodeAction", language="python", content="print('hi')")
# Function call: {"type": "action", "function": "click", "args": {"xpath": "//button"}} →
ApiAction(class_="api_action", function="click", kwargs={"xpath": "//button"})
# Environment response: {"type": "output", "text": "Command executed"} →
TextObservation(class_="TextObservation", text="Command executed", source="environment")
Complete example:
#!/usr/bin/env python3
"""Convert raw data to standardized format."""
import json
import sys
from schema.trajectory import Trajectory
from schema.action.code import CodeAction
from schema.action.message import MessageAction
from schema.action.api import ApiAction
from schema.observation.text import TextObservation
def convert_raw_to_standardized(raw_data):
    """Convert a raw data sample to standardized format."""
    content = []
    # Example conversion logic
    for item in raw_data["content"]:
        if item["type"] == "message":
            content.append(MessageAction(
                class_="MessageAction",
                message=item["message"],
                description=item.get("thoughts", "")
            ))
        elif item["type"] == "python":
            content.append(CodeAction(
                class_="CodeAction",
                language="python",
                content=item["code"],
                description=item.get("thoughts", "")
            ))
        elif item["type"] == "action":
            content.append(ApiAction(
                class_="api_action",
                function=item["function"],
                kwargs=item["args"],
                description=item.get("thoughts", "")
            ))
        elif item["type"] == "observation":
            content.append(TextObservation(
                class_="TextObservation",
                text=item["text"],
                source="environment"
            ))
    return Trajectory(
        id=raw_data["id"],
        content=content,
        details=raw_data.get("metadata", {})
    )
if __name__ == "__main__":
    for line in sys.stdin:
        raw_data = json.loads(line.strip())
        standardized = convert_raw_to_standardized(raw_data)
        print(standardized.model_dump_json())
Using schema_raw.py in raw_to_standardized.py (if applicable):
from schema_raw import SchemaRaw
def convert_raw_to_standardized(raw_data):
    # Validate and parse raw data
    data = SchemaRaw(**raw_data)
    # Now use typed data with IDE support and validation
    content = []
    for item in data.items:
        # Process with type safety
        pass
- Create API definitions (when applicable): If your standardized dataset contains ApiAction, create an api.py file to define the available functions. This is crucial for later conversion to agent-specific formats.
When to include api.py:
- Your dataset contains structured actions (e.g., go("bedroom"), click("button_id"), search("query"))
- Actions have specific parameters and can be represented as function calls
- The data contains interactions with a specific environment, tool, or API
- You want to convert actions to ApiAction schema objects in your standardized format
Why api.py is important for agents:
- Function Call Training: Agents learn to make structured function calls rather than generating free-form text
- Parameter Validation: Ensures agents learn correct parameter types and formats
- Standardization: Provides a consistent representation for tool calls across different datasets
- SFT Conversion: Many agent training frameworks expect function call formats for fine-tuning
What to include in api.py:
def function_name(param1: type, param2: type) -> return_type:
    """Clear description of what the function does.

    Args:
        param1 (type): Description of parameter 1
        param2 (type): Description of parameter 2

    Example:
        function_name("example_value", 123)
    """
    pass
Example api.py for a web browsing dataset:
def click(xpath: str) -> None:
    """Click on a web element.

    Args:
        xpath (str): The xpath of the element to click.

    Example:
        click("//button[@id='submit']")
    """
    pass
def type(xpath: str, text: str) -> None:
    """Type text into an input field.

    Args:
        xpath (str): The xpath of the input element.
        text (str): The text to type.

    Example:
        type("//input[@name='username']", "john_doe")
    """
    pass
Then in your raw_to_standardized.py, use ApiAction for structured actions:
from schema.action.api import ApiAction
# Convert structured actions to ApiAction
if action_type == "click":
    content.append(ApiAction(
        class_="api_action",
        function="click",
        kwargs={"xpath": action_data["xpath"]},
        description=action_data.get("thought", "")
    ))
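Some raw datasets store structured actions as call strings such as click("button_id") rather than as separate function and argument fields. The following is a minimal sketch of one way to turn such a string into the function name and kwargs that an ApiAction needs. It uses only the standard library; `parse_call` is a hypothetical helper, and the `click` function here is a stand-in for a function defined in your dataset's api.py.

```python
import ast
import inspect


def click(xpath: str) -> None:
    """Stand-in for a function defined in a dataset's api.py."""


def parse_call(call_text: str, api: dict) -> dict:
    """Parse a call string such as click("//button") into a name and kwargs."""
    node = ast.parse(call_text, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {call_text!r}")
    name = node.func.id
    if name not in api:
        raise ValueError(f"unknown API function: {name}")
    # literal_eval only accepts literals, so arbitrary expressions are rejected.
    args = [ast.literal_eval(arg) for arg in node.args]
    kwargs = {}
    for keyword in node.keywords:
        if keyword.arg is None:
            raise ValueError("**kwargs unpacking is not supported")
        kwargs[keyword.arg] = ast.literal_eval(keyword.value)
    # Bind positional arguments to the parameter names declared in api.py.
    bound = inspect.signature(api[name]).bind(*args, **kwargs)
    return {"function": name, "kwargs": dict(bound.arguments)}


if __name__ == "__main__":
    print(parse_call('click("//button[@id=\'submit\']")', {"click": click}))
```

The returned dictionary can be passed on as the function and kwargs arguments of ApiAction. Binding against the api.py signature also surfaces calls with missing or unexpected arguments early, as a TypeError.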
- Generate standardized samples:
export PYTHONPATH=`pwd`:$PYTHONPATH
cat datasets/$MY_DATASET/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/$MY_DATASET/raw_to_standardized.py | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_std.json
Run validation tests to ensure your standardized format is correct:
python -m pytest tests/test_standardized_schemas.py::test_sample_standardized_against_schema -v
Or to test a specific dataset:
python -m pytest tests/test_standardized_schemas.py -v -k $MY_DATASET
Create a comprehensive README.md for your dataset:
- Original Paper: Paper Title
- Original Repository: Repository
- License: License information
- Size: Number of samples, average steps per trajectory
- Description: Description of the dataset and its purpose.
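The sample-generation commands above pipe data through scripts/json_to_jsonl.py and scripts/jsonl_to_json.py, because the conversion scripts read and write one JSON object per line while the sample files are JSON arrays. The following sketch illustrates that round trip on in-memory streams. It is an illustration of the idea only; the actual scripts in scripts/ may differ in options and output formatting.

```python
import io
import json


def jsonl_to_json(src, dst) -> None:
    """Read one JSON object per line from src and write a JSON array to dst."""
    records = [json.loads(line) for line in src if line.strip()]
    json.dump(records, dst, indent=2)


def json_to_jsonl(src, dst) -> None:
    """Read a JSON array from src and write one JSON object per line to dst."""
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")


# Round trip: JSONL -> JSON array -> JSONL.
jsonl_input = io.StringIO('{"id": "a"}\n{"id": "b"}\n')
array_buffer = io.StringIO()
jsonl_to_json(jsonl_input, array_buffer)
array_buffer.seek(0)
jsonl_output = io.StringIO()
json_to_jsonl(array_buffer, jsonl_output)
print(jsonl_output.getvalue(), end="")
```

Passing `sys.stdin` and `sys.stdout` instead of the `io.StringIO` objects gives the same behavior inside a shell pipeline.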
Each agent implementation should follow this structure:
agents/YOUR_AGENT_NAME/
├── __init__.py # Agent module initialization
├── api.py # Agent-specific API definitions
├── std_to_sft.py # Standardized to SFT conversion
├── system_prompt/ # System prompts and templates
└── README.md # Agent documentation
export AGENT_NAME=YOUR_AGENT_NAME
mkdir -p agents/$AGENT_NAME
Create api.py to load APIs from datasets, using any format that suits your agent.
The loaded specs can be used for training function-calling models.
Example:
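import importlib.util
import inspect
import os
dataset = "YOUR_DATASET_NAME"  # Placeholder: the dataset whose api.py should be loaded
api_file_path = os.path.expanduser(f"datasets/{dataset}/api.py")
sigs = {}
if os.path.exists(api_file_path):
    # Import the dataset's api.py as a standalone module
    spec = importlib.util.spec_from_file_location("api", api_file_path)
    api_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(api_module)
    functions = inspect.getmembers(api_module, inspect.isfunction)
    for name, func in functions:
        docstring = "\n" + (inspect.getdoc(func) or "")  # getdoc returns None without a docstring
        sig = inspect.signature(func)
        sigs[name] = f"{name}{sig}{docstring}"  # One possible format; adapt to your agent
Create system prompt templates in system_prompt/default.txt: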
You are an AI assistant specialized in [your agent's domain].
Your capabilities include:
- Capability 1
- Capability 2
- Capability 3
Guidelines:
- Guideline 1
- Guideline 2
- Guideline 3
When responding, always:
- Be clear and concise
- Follow the specified format
- Provide reasoning for your actions
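Agents that train on function calls often need the available functions listed in the system prompt. The following is a minimal sketch of combining a prompt template with documentation rendered from API functions. The helpers `render_api_docs` and `build_system_prompt`, the "Available functions:" heading, and the `search` function are all hypothetical; use whatever layout your agent expects.

```python
import inspect


def search(query: str) -> None:
    """Search the web for a query.

    Args:
        query (str): The search query.
    """


def render_api_docs(functions) -> str:
    """Render each function as its signature followed by its docstring."""
    blocks = []
    for func in functions:
        doc = inspect.getdoc(func) or ""
        blocks.append(f"{func.__name__}{inspect.signature(func)}\n{doc}".rstrip())
    return "\n\n".join(blocks)


def build_system_prompt(template: str, functions) -> str:
    """Append the rendered API documentation to the prompt template."""
    api_docs = render_api_docs(functions)
    return f"{template.rstrip()}\n\nAvailable functions:\n{api_docs}\n"


if __name__ == "__main__":
    template = "You are an AI assistant specialized in web browsing."
    print(build_system_prompt(template, [search]))
```

In a real agent, the template text would be read from system_prompt/default.txt and the functions loaded from the dataset's api.py rather than defined inline.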
Create std_to_sft.py for converting standardized format to your agent's SFT format.
The output data should ideally be directly usable for training using LLaMA Factory.
#!/usr/bin/env python3
"""Convert standardized format to SFT format for Your Agent."""
import json
import sys
from typing import Dict, Any
from schema.trajectory import Trajectory
def convert_trajectory_to_sft(trajectory: Trajectory) -> Dict[str, Any]:
    """Convert a standardized trajectory to SFT format."""
    messages = []
    # Add system message
    messages.append({
        "role": "system",
        "content": "Your agent's system prompt here"
    })
    # Convert ADP action and observation content to agent-specific messages
    for item in trajectory.content:
        if item.class_ == "MessageAction":
            messages.append({
                "role": "user" if item.source == "user" else "assistant",
                "content": item.message
            })
        elif item.class_ == "TextObservation":
            messages.append({
                "role": "system",
                "content": f"Observation: {item.text}"
            })
        # Handle other observation types...
    return {
        "id": trajectory.id,
        "conversations": messages,
        "metadata": trajectory.details
    }
if __name__ == "__main__":
    for line in sys.stdin:
        trajectory_data = json.loads(line.strip())
        trajectory = Trajectory(**trajectory_data)
        sft_data = convert_trajectory_to_sft(trajectory)
        print(json.dumps(sft_data))
Generate SFT samples:
export PYTHONPATH=`pwd`:$PYTHONPATH
cat datasets/$MY_DATASET/sample_std.json | python scripts/json_to_jsonl.py | python agents/YOUR_AGENT_NAME/std_to_sft.py | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_sft/sample_sft_YOUR_AGENT_NAME.json
For example, for OpenHands:
export PYTHONPATH=`pwd`:$PYTHONPATH
cat datasets/$MY_DATASET/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=yes --api_env=browser | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_sft/sample_sft_openhands.json
Create a comprehensive README.md:
# Your Agent Name
Description of your agent and its capabilities.
## Overview
- **Domain**: Agent's specialized domain
- **Capabilities**: List of key capabilities
- **API Environment**: Required API environment
- **Supported Actions**: List of supported actions
- **Supported Datasets**: List of supported datasets given this implementation
- **Instructions**: Command to generate the full SFT data for supported datasets
## Testing and Validation
### Running Tests
Run the full test suite:
```bash
pytest tests/
```
Run specific validation tests:
# Test raw schemas
pytest tests/test_raw_schemas.py
# Test standardized schemas
pytest tests/test_standardized_schemas.py
# Test dataset structure
pytest tests/test_dataset_structure.py
Before submitting your contribution:
- Raw data extraction script works correctly
- Standardized format conversion passes validation
- SFT format conversion produces valid output
- All required files are present
- README is comprehensive and accurate
- Tests pass
- Code follows style guidelines
- Follow PEP 8 for Python code
- Use type hints where appropriate
- Include docstrings for functions and classes
- Keep line length under 100 characters
The repository uses pre-commit hooks to ensure code quality:
- ruff: Code formatting and linting
- mypy: Type checking (where applicable)
- Keep imports at the top of files
- Group imports: standard library, third-party, local
- Use absolute imports where possible
- Avoid circular imports
- Fork the repository and create a feature branch
- Implement your changes following the guidelines above
- Test thoroughly using the validation scripts
- Update documentation including README files
- Submit a pull request with a clear description
## Description
Brief description of the changes made.
## Type of Change
- [ ] New dataset
- [ ] New agent
- [ ] Bug fix
- [ ] Documentation update
- [ ] Other (please describe)
## Dataset/Agent Information
- **Name**:
- **Source**:
- **Size**:
- **Domain**:
## Testing
- [ ] All tests pass
- [ ] Validation scripts run successfully
- [ ] Sample files generated correctly
## Checklist
- [ ] Code follows style guidelines
- [ ] Documentation is complete
- [ ] All required files are present
- [ ] No sensitive data included
- Automated checks will run on your PR
- Manual review by maintainers
- Feedback incorporation if needed
- Final approval and merge
- Schema validation failures: Ensure your standardized format follows the schema exactly
- Missing files: Check that all required files are present
- Import errors: Verify that all dependencies are properly specified
- Test failures: Run tests locally before submitting
- Documentation: Check existing dataset READMEs for examples
- Schema definitions: Review files in the schema/ directory
- Example implementations: Look at existing datasets and agents
- Issues: Open a GitHub issue for bugs or questions
- Discussions: Use GitHub discussions for general questions
- Email: Contact the maintainers directly for sensitive issues
Q: How do I handle datasets with special requirements?
A: Create a requirements.txt file in your dataset directory and document any special setup in the README.
Q: What if my dataset doesn't fit the standard schema?
A: Open an issue to discuss potential solutions.
Q: How do I handle large datasets?
A: Focus on the conversion scripts and sample files. Full datasets should be processed separately.
Q: Can I contribute datasets that require special licenses?
A: Yes, but clearly document the license requirements in your README.
By contributing to this project, you agree that your contributions will be licensed under the same license as the project.