Add RL on Verifiable Reward (RLVR) reference implementations #5

jacobthebanana · 2025-09-17T14:44:38Z

This pull request includes:

A minimal PPO optimization step, pared down from TRL/VeRL: given a token array, a per-token advantage array, a model checkpoint, and PPO hyperparameters, it runs a single optimization step.
Agent SDK integration: define the environment using the familiar OpenAI Agents SDK, then run RL on the LLM powering the agent. Not yet tested on multi-agent setups (agent as tool or handoff).
Extensive typing for simpler function signatures and IDE support: static type checking, pyright lints, and proper autocompletion even within the training loop.
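
For reviewers who want the math behind the first item, here is a minimal sketch of the quantities a single step of this kind typically works with. It is not the code in this pull request. It assumes the standard PPO clipped surrogate objective and GRPO-style group-normalized rewards, and it uses plain Python floats instead of tensors. The names `group_relative_advantages`, `ppo_clipped_loss`, and `clip_eps` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of rollouts (GRPO-style assumption).

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation. `eps` avoids division by zero when all
    rewards in the group are equal.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def ppo_clipped_loss(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped-surrogate PPO loss over tokens (a value to minimize).

    All three sequences hold one entry per token. This is the textbook
    objective, assumed here for illustration only.
    """
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("all inputs must have one entry per token")
    if not new_logprobs:
        raise ValueError("need at least one token")

    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the updated and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (lower) surrogate, negated so it can be minimized.
        total += -min(ratio * adv, clipped * adv)
    return total / len(new_logprobs)


if __name__ == "__main__":
    # Two rollouts in one group; one was rewarded, one was not.
    advantages = group_relative_advantages([1.0, 0.0])
    # Broadcast the first rollout's advantage to each of its three tokens.
    per_token_adv = [advantages[0]] * 3
    loss = ppo_clipped_loss(
        new_logprobs=[-1.0, -0.5, -2.0],
        old_logprobs=[-1.1, -0.5, -1.9],
        advantages=per_token_adv,
    )
    print(f"loss = {loss:.4f}")
```

In a real training step the log-probabilities would come from a forward pass of the model checkpoint, and the loss would be backpropagated by an autograd framework rather than summed in a Python loop.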

Vec-Inf wrapper integration will come in a separate pull request.

jacobthebanana added 4 commits September 8, 2025 16:17

Minimalistic GRPO optimizer step implementation. (to be numerically verified.)

7baf106

GRPO for Openai-Agents implementation. (Missing vec-inf component.)

acbba68

Added LangFuse integration to GRPO Agent RL.

658c4c2

Added AGENTS.md for programming agents.

c1a088f
