Contributing to Agent Data Protocol

This repository serves as a standardized collection of agent training data from the research community. We welcome contributions of new datasets and agent implementations that follow the standardized format.

Overview
Development Setup
Contributing New Datasets
Contributing New Agents
Testing and Validation
Code Quality Standards
Submission Guidelines

Overview

This repository contains:

Datasets: Agent training data in standardized format (datasets/)
Agents: Agent-specific implementations and conversion scripts (agents/)
Schema: ADP standardized format definitions (schema/)
Scripts: General utility scripts (scripts/)

Data Flow

Raw Dataset      →  Standardized Format  →  Agent Specific SFT Format
      ↓                   ↓                       ↓
sample_raw.json  →  sample_std.json      →  sample_sft.json

Standardized Schema Components

Our standardized format uses these main components:

Actions

MessageAction: Text-based communication
CodeAction: Code execution requests
ApiAction: API calls

Observations

TextObservation: Text-based responses
WebObservation: Web page content
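
For orientation, the sketch below mirrors these components as plain dataclasses, using only the fields and class_ values that appear in the examples later in this guide. It is an illustrative assumption, not the real definitions: the authoritative models live in schema/ and may carry more fields, and WebObservation is left out because its fields are not shown in this guide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


# Stand-ins for the real schema classes; field sets are assumptions
# taken from the conversion examples in this guide.
@dataclass
class MessageAction:
    message: str
    description: str = ""
    class_: str = "MessageAction"


@dataclass
class CodeAction:
    language: str
    content: str
    description: str = ""
    class_: str = "CodeAction"


@dataclass
class ApiAction:
    function: str
    kwargs: Dict[str, Any] = field(default_factory=dict)
    description: str = ""
    class_: str = "api_action"


@dataclass
class TextObservation:
    text: str
    source: str = "environment"
    class_: str = "TextObservation"


Step = Union[MessageAction, CodeAction, ApiAction, TextObservation]

# A trajectory is an ordered list of actions and the observations they produce.
steps: List[Step] = [
    CodeAction(language="python", content="print('hi')"),
    TextObservation(text="hi"),
    MessageAction(message="The script printed 'hi'."),
]
kinds = [step.class_ for step in steps]
```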

Development Setup

Prerequisites

Python 3.12+

Installation

Clone the repository:

git clone https://github.com/neulab/agent-data-protocol.git
cd agent-data-protocol

Install dependencies:

pip install -r requirements.txt

Install pre-commit hooks:

./install_hooks.sh

This will install pre-commit and set up the hooks to ensure code quality.

Contributing New Datasets

Note: Read schema/SCHEMA.md first for a better understanding of the ADP standardized data format.

Dataset Structure

Each dataset must follow this directory structure:

datasets/$YOUR_DATASET_NAME/
├── README.md                   # Dataset description and usage
├── LICENSE                     # Dataset license
├── requirements.txt            # Dataset-specific dependencies (optional)
├── schema_raw.py               # Raw data schema (optional)
├── extract_raw.py              # Script to extract raw data
├── raw_to_standardized.py      # Conversion script
├── api.py                      # API definitions (optional)
├── sample_raw.json             # 5 raw samples
├── sample_std.json             # 5 standardized samples
└── sample_sft                  # SFT format samples
  ├── sample_sft_$AGENT1.json   # 5 SFT format samples (if applicable)
  ├── sample_sft_$AGENT2.json   # 5 SFT format samples (if applicable)
  └── sample_sft_$AGENT3.json   # 5 SFT format samples (if applicable)
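
Before opening a pull request, it can help to check the layout locally. The sketch below assumes that every entry in the tree above that is not marked "(optional)" or "(if applicable)" is required; `REQUIRED_FILES` and `missing_files` are hypothetical names, and tests/test_dataset_structure.py remains the authoritative check.

```python
from pathlib import Path
from typing import List

# Assumption: entries in the tree that are not marked optional are required.
REQUIRED_FILES = [
    "README.md",
    "LICENSE",
    "extract_raw.py",
    "raw_to_standardized.py",
    "sample_raw.json",
    "sample_std.json",
]


def missing_files(dataset_dir: Path) -> List[str]:
    """Return the required file names that are absent from dataset_dir."""
    return [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]
```

Calling `missing_files(Path("datasets") / "my_dataset")` returns an empty list when the required files are all in place.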

Step-by-Step Guide

Step 1: Create Sample Raw Data

Create your dataset directory:

export MY_DATASET=$YOUR_DATASET_NAME
mkdir -p datasets/$MY_DATASET

Create extract_raw.py that outputs raw data in JSONL format to stdout. Example:

#!/usr/bin/env python3
"""Extract raw data from the original dataset source."""

import json
import sys
from pathlib import Path

def extract_raw_data():
    """Extract and yield raw data samples."""
    # Your extraction logic here
    # This should read from the original data source
    # and yield individual samples

    for sample in your_data_source:
        yield {
            "id": sample.id,
            "content": sample.content,
            # ... other fields
        }

if __name__ == "__main__":
    for sample in extract_raw_data():
        print(json.dumps(sample)) # The output file will be in jsonl format.
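
As a concrete variant of the template above, the sketch below extracts samples from a CSV source. The CSV layout and the column names `id` and `content` are assumptions made for illustration; adapt the reader to whatever your original dataset provides.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO


def extract_raw_data(source: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one raw sample per CSV row, keeping only the fields we need."""
    for row in csv.DictReader(source):
        yield {"id": row["id"], "content": row["content"]}


# In a real script the source would be an opened file or a downloaded stream.
demo_source = io.StringIO("id,content\n1,hello\n2,world\n")
jsonl_lines = [json.dumps(sample) for sample in extract_raw_data(demo_source)]
```

Printing each entry of `jsonl_lines` on its own line gives the JSONL stream that the next command consumes.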

Create raw data schema (optional): If your raw data has a complex or well-defined structure, create a schema_raw.py file to define Pydantic models for validation and type safety.

When to include schema_raw.py:
- Your raw data has a complex nested structure with many fields
- You want type validation and better error messages during processing
- The original dataset has a well-documented schema you want to preserve
- You're working with structured data formats (JSON with specific schemas, API responses, etc.)

How to create schema_raw.py:

```python
from typing import List, Optional
from pydantic import BaseModel

class YourDataItem(BaseModel):
    """Define structure for individual data items."""
    field1: str
    field2: Optional[int] = None
    nested_data: List[dict]

class SchemaRaw(BaseModel):
    """Main schema for your raw data format."""
    id: str
    items: List[YourDataItem]
    metadata: Optional[dict] = None
```

Generate sample raw data:

python datasets/$MY_DATASET/extract_raw.py | python scripts/jsonl_to_json.py | jq '.[0:5]' > datasets/$MY_DATASET/sample_raw.json
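
The command above pipes the JSONL stream through scripts/jsonl_to_json.py so that jq can slice a JSON array. The sketch below shows the idea behind that conversion and its inverse. It is a hypothetical stand-in with made-up function names; the actual scripts in scripts/ may differ in detail.

```python
import json
from typing import Any, Iterable, List


def jsonl_to_list(lines: Iterable[str]) -> List[Any]:
    """Parse one JSON value per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def list_to_jsonl(records: Iterable[Any]) -> str:
    """Serialize each record as one JSON document per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


jsonl_text = '{"id": "a"}\n{"id": "b"}\n'
records = jsonl_to_list(jsonl_text.splitlines())
first_five = records[:5]  # the same slice that jq '.[0:5]' takes
round_trip = list_to_jsonl(records)
```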

Step 2: Create Standardized Format Converter

Create raw_to_standardized.py that converts raw data to ADP standardized schemas. Essentially, you should map each action / observation in the raw data to an action / observation in ADP schemas.

Brief conversion examples:

# Raw data examples → Standardized actions

# Text message: {"type": "message", "text": "Hello"} →
MessageAction(class_="MessageAction", message="Hello")

# Code execution: {"type": "code", "language": "python", "code": "print('hi')"} →
CodeAction(class_="CodeAction", language="python", content="print('hi')")

# Function call: {"type": "action", "function": "click", "args": {"xpath": "//button"}} →
ApiAction(class_="api_action", function="click", kwargs={"xpath": "//button"})

# Environment response: {"type": "output", "text": "Command executed"} →
TextObservation(class_="TextObservation", text="Command executed", source="environment")

Complete example:

#!/usr/bin/env python3
"""Convert raw data to standardized format."""

import json
import sys
from schema.trajectory import Trajectory
from schema.action.code import CodeAction
from schema.action.message import MessageAction
from schema.action.api import ApiAction
from schema.observation.text import TextObservation

def convert_raw_to_standardized(raw_data):
    """Convert a raw data sample to standardized format."""

    content = []

    # Example conversion logic
    for item in raw_data["content"]:
        if item["type"] == "message":
            content.append(MessageAction(
                class_="MessageAction",
                message=item["message"],
                description=item.get("thoughts", "")
            ))
        elif item["type"] == "python":
            content.append(CodeAction(
                class_="CodeAction",
                language="python",
                content=item["code"],
                description=item.get("thoughts", "")
            ))
        elif item["type"] == "action":
            content.append(ApiAction(
                class_="api_action",
                function=item["function"],
                kwargs=item["args"],
                description=item.get("thoughts", "")
            ))
        elif item["type"] == "observation":
            content.append(TextObservation(
                class_="TextObservation",
                text=item["text"],
                source="environment"
            ))

    return Trajectory(
        id=raw_data["id"],
        content=content,
        details=raw_data.get("metadata", {})
    )

if __name__ == "__main__":
    for line in sys.stdin:
        raw_data = json.loads(line.strip())
        standardized = convert_raw_to_standardized(raw_data)
        print(standardized.model_dump_json())

Using schema_raw.py in raw_to_standardized.py (if applicable):

from schema_raw import SchemaRaw

def convert_raw_to_standardized(raw_data):
    # Validate and parse raw data
    data = SchemaRaw(**raw_data)

    # Now use typed data with IDE support and validation
    content = []
    for item in data.items:
        # Process with type safety
        pass

Create API definitions (when applicable): If your standardized dataset contains ApiAction items, create an api.py file that defines the available functions. This is crucial for later conversion to agent-specific formats.

When to include api.py:

Your dataset contains structured actions (e.g., go("bedroom"), click("button_id"), search("query"))
Actions have specific parameters and can be represented as function calls
The data contains interactions with a specific environment, tool, or API
You want to convert actions to ApiAction schema objects in your standardized format

Why api.py is important for agents:

Function Call Training: Agents learn to make structured function calls rather than generating free-form text
Parameter Validation: Ensures agents learn correct parameter types and formats
Standardization: Provides a consistent representation for tool calls across different datasets
SFT Conversion: Many agent training frameworks expect function call formats for fine-tuning

What to include in api.py:

def function_name(param1: type, param2: type) -> return_type:
    """Clear description of what the function does.

    Args:
        param1 (type): Description of parameter 1
        param2 (type): Description of parameter 2

    Example:
        function_name("example_value", 123)
    """
    pass

Example api.py for a web browsing dataset:

def click(xpath: str) -> None:
    """Click on a web element.

    Args:
        xpath (str): The xpath of the element to click.

    Example:
        click("//button[@id='submit']")
    """
    pass

def type(xpath: str, text: str) -> None:
    """Type text into an input field.

    Args:
        xpath (str): The xpath of the input element.
        text (str): The text to type.

    Example:
        type("//input[@name='username']", "john_doe")
    """
    pass

Then in your raw_to_standardized.py, use ApiAction for structured actions:

from schema.action.api import ApiAction

# Convert structured actions to ApiAction
if action_type == "click":
    content.append(ApiAction(
        class_="api_action",
        function="click",
        kwargs={"xpath": action_data["xpath"]},
        description=action_data.get("thought", "")
    ))
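
Because every ApiAction names a function from api.py, a converter can check that the kwargs it emits fit that function's signature. The sketch below does this with `inspect.signature(...).bind`; `check_call` is a hypothetical helper, and the inline `click` stands in for a function imported from your api.py.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def click(xpath: str) -> None:
    """Click on a web element (stand-in for a function defined in api.py)."""


def check_call(func: Callable[..., Any], kwargs: Dict[str, Any]) -> Optional[str]:
    """Return None if kwargs fit func's signature, else a short error message."""
    try:
        inspect.signature(func).bind(**kwargs)
    except TypeError as exc:
        return f"{func.__name__}: {exc}"
    return None


ok = check_call(click, {"xpath": "//button"})
bad = check_call(click, {"selector": "#submit"})
```

Note that `bind` only checks parameter names and arity; it does not verify that the values match the annotated types.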

Generate standardized samples:

export PYTHONPATH=`pwd`:$PYTHONPATH
cat datasets/$MY_DATASET/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/$MY_DATASET/raw_to_standardized.py | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_std.json

Step 3: Validate Standardized Format

Run validation tests to ensure your standardized format is correct:

python -m pytest tests/test_standardized_schemas.py::test_sample_standardized_against_schema -v

Or to test a specific dataset:

python -m pytest tests/test_standardized_schemas.py -v -k $MY_DATASET
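
If the tests fail early, a quick structural check of sample_std.json can narrow things down. The sketch below assumes the file holds a JSON array of five trajectories, each with at least the `id` and `content` keys used in the converter example; `quick_check` is a hypothetical helper and does not replace the pytest validation.

```python
import json
from typing import Any, List


def quick_check(samples: Any, expected_count: int = 5) -> List[str]:
    """Return a list of problems found in the parsed contents of sample_std.json."""
    if not isinstance(samples, list):
        return ["top-level value is not a JSON array"]
    problems = []
    if len(samples) != expected_count:
        problems.append(f"expected {expected_count} samples, found {len(samples)}")
    for index, sample in enumerate(samples):
        for key in ("id", "content"):
            if not isinstance(sample, dict) or key not in sample:
                problems.append(f"sample {index} is missing '{key}'")
    return problems


# Small inline example; in practice, pass json.load() of your sample_std.json.
samples = json.loads('[{"id": "a", "content": []}, {"content": []}]')
problems = quick_check(samples, expected_count=2)
```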

Step 4: Create Dataset README

Create a comprehensive README.md for your dataset:

Original Paper: Paper Title
Original Repository: Repository
License: License information
Size: Number of samples, average steps per trajectory
Description: Description of the dataset and its purpose.

Contributing New Agents

Agent Structure

Each agent implementation should follow this structure:

agents/YOUR_AGENT_NAME/
├── __init__.py                # Agent module initialization
├── api.py                     # Agent-specific API definitions
├── std_to_sft.py              # Standardized to SFT conversion
├── system_prompt/             # System prompts and templates
└── README.md                  # Agent documentation

Step-by-Step Guide

Step 1: Create Agent Directory

export AGENT_NAME=YOUR_AGENT_NAME
mkdir -p agents/$AGENT_NAME

Step 2: Define Agent APIs

Create api.py to load the API definitions from datasets; use any format that suits your agent. The loaded specs can be used to train function-calling models.

Example:

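import importlib.util
import inspect
import os

dataset = "your_dataset_name"  # placeholder: a directory name under datasets/
api_file_path = os.path.expanduser(f"datasets/{dataset}/api.py")
sigs = {}
if os.path.exists(api_file_path):
    # Import the dataset's api.py as a standalone module
    spec = importlib.util.spec_from_file_location("api", api_file_path)
    api_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(api_module)
    functions = inspect.getmembers(api_module, inspect.isfunction)
    for name, func in functions:
        # getdoc() returns None for undocumented functions
        docstring = "\n" + (inspect.getdoc(func) or "")
        sig = inspect.signature(func)
        # Store in any layout that suits your agent; this pairing is only an example
        sigs[name] = (str(sig), docstring)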
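
Once the functions are loaded, their signatures can be turned into structured specs for function-calling training. The sketch below builds a generic JSON-schema-like dict from one function. The output layout, the `JSON_TYPES` mapping, and the `function_to_spec` helper are all assumptions; adapt them to the format your training framework expects.

```python
import inspect
from typing import Any, Callable, Dict

# Hypothetical mapping from Python annotations to JSON-schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def click(xpath: str) -> None:
    """Click on a web element."""


def function_to_spec(func: Callable[..., Any]) -> Dict[str, Any]:
    """Describe func as a dict with its name, docstring summary, and parameters."""
    sig = inspect.signature(func)
    properties = {
        name: {"type": JSON_TYPES.get(param.annotation, "string")}
        for name, param in sig.parameters.items()
    }
    # Parameters without a default value are treated as required.
    required = [
        name
        for name, param in sig.parameters.items()
        if param.default is inspect.Parameter.empty
    ]
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.split("\n")[0],
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


spec = function_to_spec(click)
```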

Step 3: Create System Prompts

Create system prompt templates in system_prompt/default.txt:

You are an AI assistant specialized in [your agent's domain].

Your capabilities include:
- Capability 1
- Capability 2
- Capability 3

Guidelines:
- Guideline 1
- Guideline 2
- Guideline 3

When responding, always:
- Be clear and concise
- Follow the specified format
- Provide reasoning for your actions
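
The conversion script in the next step needs this prompt as a string. A minimal loader might look like the sketch below; `load_system_prompt` is a hypothetical helper that assumes the system_prompt/<name>.txt layout described above.

```python
from pathlib import Path


def load_system_prompt(agent_dir: Path, name: str = "default") -> str:
    """Read <agent_dir>/system_prompt/<name>.txt and strip trailing whitespace."""
    prompt_path = agent_dir / "system_prompt" / f"{name}.txt"
    return prompt_path.read_text(encoding="utf-8").rstrip()
```

For example, `load_system_prompt(Path("agents") / "my_agent")` would read agents/my_agent/system_prompt/default.txt.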

Step 4: Create SFT Conversion Script

Create std_to_sft.py for converting standardized format to your agent's SFT format. The output data should ideally be directly usable for training using LLaMA Factory.

#!/usr/bin/env python3
"""Convert standardized format to SFT format for Your Agent."""

import json
import sys
from typing import Dict, Any
from schema.trajectory import Trajectory

def convert_trajectory_to_sft(trajectory: Trajectory) -> Dict[str, Any]:
    """Convert a standardized trajectory to SFT format."""

    messages = []

    # Add system message
    messages.append({
        "role": "system",
        "content": "Your agent's system prompt here"
    })

    # Convert ADP action and observation content to agent specific messages
    for item in trajectory.content:
        if item.class_ == "MessageAction":
            messages.append({
                "role": "user" if item.source == "user" else "assistant",
                "content": item.message
            })
        elif item.class_ == "TextObservation":
            messages.append({
                "role": "system",
                "content": f"Observation: {item.text}"
            })
        # Handle other observation types...

    return {
        "id": trajectory.id,
        "conversations": messages,
        "metadata": trajectory.details
    }

if __name__ == "__main__":
    for line in sys.stdin:
        trajectory_data = json.loads(line.strip())
        trajectory = Trajectory(**trajectory_data)
        sft_data = convert_trajectory_to_sft(trajectory)
        print(json.dumps(sft_data))
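
The template above only handles MessageAction and TextObservation. One possible way to render the remaining action types is sketched below on plain dicts, so it runs without the schema package: code goes into a fenced block and API calls become a Python-style call string. The rendering format and the `render_action` helper are assumptions; use whatever your agent's training format expects.

```python
from typing import Any, Dict


def render_action(item: Dict[str, Any]) -> str:
    """Render one standardized action (as a dict) to assistant-message text."""
    kind = item["class_"]
    if kind == "MessageAction":
        return item["message"]
    if kind == "CodeAction":
        fence = "`" * 3  # built at runtime to avoid nesting a literal fence here
        return f"{fence}{item['language']}\n{item['content']}\n{fence}"
    if kind == "api_action":
        args = ", ".join(f"{key}={value!r}" for key, value in item["kwargs"].items())
        return f"{item['function']}({args})"
    raise ValueError(f"unsupported action class: {kind}")


call_text = render_action(
    {"class_": "api_action", "function": "click", "kwargs": {"xpath": "//button"}}
)
code_text = render_action(
    {"class_": "CodeAction", "language": "python", "content": "print('hi')"}
)
```

Raising on unknown classes, rather than skipping them, makes it obvious when a dataset contains a step type the agent converter does not yet support.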

Step 5: Generate SFT Format Samples

Generate SFT format samples with the appropriate conversion script ($AGENT_NAME was set in Step 1):

export PYTHONPATH=`pwd`:$PYTHONPATH
mkdir -p datasets/$MY_DATASET/sample_sft
cat datasets/$MY_DATASET/sample_std.json | python scripts/json_to_jsonl.py | python agents/$AGENT_NAME/std_to_sft.py | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_sft/sample_sft_$AGENT_NAME.json

For example, for OpenHands:

export PYTHONPATH=`pwd`:$PYTHONPATH
cat datasets/$MY_DATASET/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=yes --api_env=browser | python scripts/jsonl_to_json.py > datasets/$MY_DATASET/sample_sft/sample_sft_openhands.json

Step 6: Create Agent README

Create a comprehensive README.md:

# Your Agent Name

Description of your agent and its capabilities.

## Overview

- **Domain**: Agent's specialized domain
- **Capabilities**: List of key capabilities
- **API Environment**: Required API environment
- **Supported Actions**: List of supported actions
- **Supported Datasets**: List of supported datasets given this implementation
- **Instructions**: Command to generate the full SFT data for supported datasets

Testing and Validation

Running Tests

Run the full test suite:

pytest tests/

Run specific validation tests:

# Test raw schemas
pytest tests/test_raw_schemas.py

# Test standardized schemas
pytest tests/test_standardized_schemas.py

# Test dataset structure
pytest tests/test_dataset_structure.py

Validation Checklist

Before submitting your contribution:

Raw data extraction script works correctly
Standardized format conversion passes validation
SFT format conversion produces valid output
All required files are present
README is comprehensive and accurate
Tests pass
Code follows style guidelines

Code Quality Standards

Style Guidelines

Follow PEP 8 for Python code
Use type hints where appropriate
Include docstrings for functions and classes
Keep line length under 100 characters

Pre-commit Hooks

The repository uses pre-commit hooks to ensure code quality:

ruff: Code formatting and linting
mypy: Type checking (where applicable)

File Organization

Keep imports at the top of files
Group imports: standard library, third-party, local
Use absolute imports where possible
Avoid circular imports

Submission Guidelines

Pull Request Process

Fork the repository and create a feature branch
Implement your changes following the guidelines above
Test thoroughly using the validation scripts
Update documentation including README files
Submit a pull request with a clear description

Pull Request Template

## Description
Brief description of the changes made.

## Type of Change
- [ ] New dataset
- [ ] New agent
- [ ] Bug fix
- [ ] Documentation update
- [ ] Other (please describe)

## Dataset/Agent Information
- **Name**:
- **Source**:
- **Size**:
- **Domain**:

## Testing
- [ ] All tests pass
- [ ] Validation scripts run successfully
- [ ] Sample files generated correctly

## Checklist
- [ ] Code follows style guidelines
- [ ] Documentation is complete
- [ ] All required files are present
- [ ] No sensitive data included

Review Process

Automated checks will run on your PR
Manual review by maintainers
Feedback incorporation if needed
Final approval and merge

Common Issues

Schema validation failures: Ensure your standardized format follows the schema exactly
Missing files: Check that all required files are present
Import errors: Verify that all dependencies are properly specified
Test failures: Run tests locally before submitting

Getting Help

Resources

Documentation: Check existing dataset READMEs for examples
Schema definitions: Review files in schema/ directory
Example implementations: Look at existing datasets and agents

Contact

Issues: Open a GitHub issue for bugs or questions
Discussions: Use GitHub discussions for general questions
Email: Contact the maintainers directly for sensitive issues

FAQ

Q: How do I handle datasets with special requirements?
A: Create a requirements.txt file in your dataset directory and document any special setup in the README.

Q: What if my dataset doesn't fit the standard schema?
A: Open an issue to discuss potential solutions.

Q: How do I handle large datasets?
A: Focus on the conversion scripts and sample files. Full datasets should be processed separately.

Q: Can I contribute datasets that require special licenses?
A: Yes, but clearly document the license requirements in your README.

License

By contributing to this project, you agree that your contributions will be licensed under the same license as the project.
