Skip to content

Add CoderForge-Preview dataset#167

Open
neubig wants to merge 3 commits intomainfrom
openhands/add-coderforge-preview-dataset
Open

Add CoderForge-Preview dataset#167
neubig wants to merge 3 commits intomainfrom
openhands/add-coderforge-preview-dataset

Conversation

@neubig
Copy link
Contributor

@neubig neubig commented Feb 26, 2026

Description

This PR adds the CoderForge-Preview dataset from TogetherAI to the agent-data-protocol repository.

Fixes #166

Dataset Information

CoderForge-Preview is the largest open test-verified coding agent dataset designed for training efficient software engineering agents. Fine-tuning Qwen-3 32B on this dataset boosts SWE-Bench Verified performance from 23.0% to 59.4% pass@1.

Key Features

  • Large-scale agentic data generation from 51K distinct open-source tasks
  • Long-horizon, multi-step SFT trajectories
  • Test-verified coding agent trajectories
  • Data collected using OpenHands agent framework

Source URL: https://huggingface.co/datasets/togethercomputer/CoderForge-Preview

License: Apache-2.0

Files Added

datasets/coderforge_preview/
├── README.md                   # Dataset description and citation
├── api.py                      # API definitions for tools
├── extract_raw.py              # Script to extract raw data
├── raw_to_standardized.py      # Conversion script
├── schema_raw.py               # Raw data schema
├── sample_raw.json             # Sample raw data
├── sample_std.json             # Sample standardized data
├── sample_sft.json             # Sample SFT format data
└── sample_sft/
    └── sample_sft_openhands.json  # OpenHands SFT format

Testing

All tests pass:

tests/test_dataset_structure.py::test_dataset_structure[coderforge_preview] PASSED
tests/test_raw_schemas.py::test_sample_raw_against_schema[coderforge_preview] PASSED
tests/test_standardized_schemas.py::test_sample_standardized_against_schema[...coderforge_preview/sample_std.json] PASSED
tests/test_std_to_sft_conversion.py::test_std_to_sft_conversion[coderforge_preview] PASSED

Checklist

  • Created README.md with dataset description
  • Created extract_raw.py for data extraction
  • Created schema_raw.py for raw data schema
  • Created raw_to_standardized.py for conversion
  • Created sample_raw.json with sample data
  • Created sample_std.json with standardized data
  • Created sample_sft.json with SFT format data
  • All tests pass

Add the CoderForge-Preview dataset from TogetherAI, which is the largest
open test-verified coding agent dataset designed for training efficient
software engineering agents.

Dataset features:
- Large-scale agentic data generation from 51K distinct open-source tasks
- Long-horizon, multi-step SFT trajectories
- Test-verified coding agent trajectories
- Data collected using OpenHands agent framework

Source: https://huggingface.co/datasets/togethercomputer/CoderForge-Preview

Closes #166

Co-authored-by: openhands <openhands@all-hands.dev>
- Add trailing newline to sample_raw.json (pre-commit fix)
- Update sample_sft.json to use 'from': 'function_call' for function calls
- Add AGENTS.md with repository best practices and guidelines

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 1, 2026 13:23
@neubig neubig marked this pull request as draft March 1, 2026 13:29
@neubig neubig marked this pull request as ready for review March 2, 2026 03:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CoderForge-Preview dataset

2 participants