- [2025.10.15] We release the Tinker version of Reinforce-Ada.
- [2025.10.07] We release the verl version (main version) of Reinforce-Ada.
This repository contains the official implementation of Reinforce-Ada on Tinker, an adaptive sampling framework designed to resolve the "signal collapse" problem in Reinforce-style algorithms with group baselines, such as GRPO, making training more efficient and effective.
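To make "signal collapse" concrete, the snippet below (an illustration added here, not code from this repo) computes a GRPO-style group-normalized advantage: when every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient signal, which is exactly the case that adaptive sampling targets.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one group of responses."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed rewards -> non-zero advantages
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # identical rewards -> all zeros, no signal
```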
All results shown in our paper were obtained with verl. Here we further validate the effectiveness of Reinforce-Ada on another easy-to-use training API, Tinker.
Figure 1: Training dynamics with Qwen3-4B-Instruct-2507.
We observe that Reinforce-Ada achieves a significantly higher reward (measured on i.i.d. samples without adaptive sampling) than GRPO on Tinker with LoRA finetuning.
- We apply full finetuning in our paper, while Tinker only supports LoRA finetuning. The LoRA results above are slightly worse than full finetuning on verl, and might improve with further tuning of the LoRA rank and learning rate.
- For full finetuning, we use a small learning rate (lr=1e-6). This small lr makes learning very slow with LoRA. Tinker suggests using a 40x larger lr (i.e., lr=4e-5) for LoRA on Qwen3-4B-Instruct-2507, but that results in reward collapse (the reward increases first, then drops to 0) for both GRPO and Reinforce-Ada. Here we set lr=5e-6; dedicated tuning might yield better performance.
- Create a new environment.
```bash
python -m venv ~/.python/reinforce_ada_tinker
source ~/.python/reinforce_ada_tinker/bin/activate

# You can also use conda
# conda create -n reinforce_ada_tinker python==3.10
# conda activate reinforce_ada_tinker
```
- Install dependencies
```bash
pip install --upgrade pip
pip install uv
python -m uv pip install tinker
git clone https://github.com/RLHFlow/Reinforce-Ada-Tinker
cd Reinforce-Ada-Tinker
python -m uv pip install -e .
python -m uv pip install wandb
```
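As an optional sanity check (our suggestion, not part of the original instructions), you can verify that the Tinker client imports cleanly in the new environment:

```bash
python -c "import tinker; print('tinker imported OK')"
```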
- Prepare the training and test datasets

If you have a custom dataset, or want to train a new LLM, please go to the Reinforce-Ada verl version for this step. Otherwise, you can use our open-sourced training sets listed in the table below.
- Start the training
```bash
# Set your key in this file
bash scripts/run_reinforce_ada.sh
```

The key hyperparameters of Reinforce-Ada are:

- `multiround_adaptive_downsampling=True`: use adaptive sampling.
- `reinforce_ada_choice=balanced`: how to balance the positive and negative prompts within a batch; one of [`balanced`, `positive-focused`].
- `global_stat_est=True`: use global statistics to calculate the mean and std.

A rough sketch of the adaptive sampling loop these options control is given after the note below.
Note: Tinker counts one policy update as a step, while the verl version counts 16 updates as one step. `total_steps`, `max_steps_off_policy`, and `groups_per_batch` have been set accordingly to match the verl version's 400 training steps (i.e., 400 x 16 = 6,400 Tinker steps).
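For intuition, here is a minimal sketch of the multi-round adaptive down-sampling idea behind these options. It is not the actual training code: `sample_group` is a placeholder for generating and scoring responses, and the exact stopping rule, down-sampling scheme, and statistics used in the repo and paper differ in detail.

```python
import random

def sample_group(prompt, n):
    """Placeholder: draw n responses for `prompt` and score each with a 0/1 reward.
    In the real pipeline the policy generates responses and a verifier scores them."""
    return [("response text", random.randint(0, 1)) for _ in range(n)]

def adaptive_sample(prompt, group_size=8, max_rounds=4):
    """Over-sample in rounds until the pool contains both positive and negative
    rewards, then down-sample back to `group_size` with a balanced split."""
    pool = []
    for _ in range(max_rounds):
        pool += sample_group(prompt, group_size)
        pos = [x for x in pool if x[1] == 1]
        neg = [x for x in pool if x[1] == 0]
        if pos and neg:  # mixed rewards -> a usable learning signal
            half = group_size // 2
            return pos[:half] + neg[:half]
    return pool[:group_size]  # still all-positive or all-negative after the budget
```

With `global_stat_est=True`, our understanding is that the reward mean and std used for advantage normalization are estimated from all responses drawn during adaptive sampling rather than only the down-sampled group; see the paper for the exact definition.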
- Evaluate
We do not provide evaluation code for Tinker. We suggest downloading the trained weights and evaluating them with our Reinforce-Ada verl version. You can evaluate a checkpoint every 800 Tinker steps (equal to every 50 steps in the verl version).
We provide the processed/selected training prompts on Hugging Face.
Note: not all LLMs are supported by Tinker; you can check the supported models here.
| Model to train | Prompt level | Algorithm | Training set |
|---|---|---|---|
| Qwen/Qwen2.5-Math-1.5B | easy | Reinforce-Ada-balance | RLHFlow/reinforce_ada_easy_prompt_1.5b |
| Qwen/Qwen2.5-Math-1.5B | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt_1.5b |
| Qwen/Qwen2.5-Math-7B | easy | Reinforce-Ada-balance | RLHFlow/reinforce_ada_easy_prompt |
| Qwen/Qwen2.5-Math-7B | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt |
| Qwen/Qwen3-4B-Instruct-2507 | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt |
| meta-llama/Llama-3.2-3B-Instruct | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt_llama |
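For example, one of the training sets above can be inspected with the `datasets` library (a usage sketch on our part, assuming the repository is public on the Hugging Face Hub; split and column names may differ):

```python
from datasets import load_dataset

# Hard-prompt training set listed in the table above
ds = load_dataset("RLHFlow/reinforce_ada_hard_prompt")
print(ds)                           # available splits and columns
print(next(iter(ds.values()))[0])   # peek at the first example
```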
We thank Tinker for providing this awesome training API.
If you find our paper or code helpful, feel free to give us a citation.
@misc{xiong2025reinforceada,
title={Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training},
author={Wei Xiong and Chenlu Ye and Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian and Nan Jiang and Tong Zhang},
year={2025},
eprint={2510.04996},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.04996},
}