- [2025.10.15] We release the Tinker version of Reinforce-Ada.
- [2025.10.07] We release the verl version (main version) of Reinforce-Ada.
This repository contains the official implementation of Reinforce-Ada on Tinker, an adaptive sampling framework designed to resolve the "signal collapse" problem in Reinforce-style algorithms with group baselines, such as GRPO, making training more efficient and effective.
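To make "signal collapse" concrete, the snippet below (an illustration added here, not code from this repo) computes a GRPO-style group-normalized advantage: when every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient signal, which is exactly the case that adaptive sampling targets.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one group of responses."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed rewards -> non-zero advantages
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # identical rewards -> all zeros, no signal
```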
All results shown in our paper were obtained with verl. Here we further validate the effectiveness of Reinforce-Ada on another easy-to-use training API, Tinker.
Figure 1: Training dynamics with Qwen3-4B-Instruct-2507.
We observe that Reinforce-Ada achieves a significantly higher reward (measured on i.i.d. samples without adaptive sampling) than GRPO on Tinker with LoRA finetuning.
- We apply full finetuning in our paper, while Tinker only supports LoRA finetuning. The LoRA results above are slightly worse than full finetuning on verl, and might improve with further tuning of the LoRA rank and learning rate.
- For full finetuning, we use a small learning rate (lr=1e-6). This small lr makes learning very slow with LoRA. Tinker suggests using a 40x larger lr (i.e., lr=4e-5) for LoRA on Qwen3-4B-Instruct-2507, but that results in reward collapse (the reward increases first, then drops to 0) for both GRPO and Reinforce-Ada. Here we set lr=5e-6; dedicated tuning might yield better performance.
- Create a new environment.
```bash
python -m venv ~/.python/reinforce_ada_tinker
source ~/.python/reinforce_ada_tinker/bin/activate

# You can also use conda
# conda create -n reinforce_ada_tinker python==3.10
# conda activate reinforce_ada_tinker
```
- Install dependencies
```bash
pip install --upgrade pip
pip install uv
python -m uv pip install tinker
git clone https://github.com/RLHFlow/Reinforce-Ada-Tinker
cd Reinforce-Ada-Tinker
python -m uv pip install -e .
python -m uv pip install wandb
```
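As an optional sanity check (our suggestion, not part of the original instructions), you can verify that the Tinker client imports cleanly in the new environment:

```bash
python -c "import tinker; print('tinker imported OK')"
```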
- Prepare the training and test datasets

If you have a custom dataset, or want to train a new LLM, please go to the Reinforce-Ada verl version for this step. Otherwise, you can use our open-sourced training sets listed in the table below.
- Start the training
```bash
# Set your key in this file
bash scripts/run_reinforce_ada.sh
```

The key hyperparameters of Reinforce-Ada are:

- `multiround_adaptive_downsampling=True`: use adaptive sampling.
- `reinforce_ada_choice=balanced`: how to balance the positive and negative prompts within a batch; one of [`balanced`, `positive-focused`].
- `global_stat_est=True`: use global statistics to calculate the mean and std.

A rough sketch of the adaptive sampling loop these options control is given after the note below.
Note: Tinker counts one policy update as a step, while the verl version counts 16 updates as one step. `total_steps`, `max_steps_off_policy`, and `groups_per_batch` have been set accordingly to match the verl version's 400 training steps (i.e., 400 x 16 = 6,400 Tinker steps).
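For intuition, here is a minimal sketch of the multi-round adaptive down-sampling idea behind these options. It is not the actual training code: `sample_group` is a placeholder for generating and scoring responses, and the exact stopping rule, down-sampling scheme, and statistics used in the repo and paper differ in detail.

```python
import random

def sample_group(prompt, n):
    """Placeholder: draw n responses for `prompt` and score each with a 0/1 reward.
    In the real pipeline the policy generates responses and a verifier scores them."""
    return [("response text", random.randint(0, 1)) for _ in range(n)]

def adaptive_sample(prompt, group_size=8, max_rounds=4):
    """Over-sample in rounds until the pool contains both positive and negative
    rewards, then down-sample back to `group_size` with a balanced split."""
    pool = []
    for _ in range(max_rounds):
        pool += sample_group(prompt, group_size)
        pos = [x for x in pool if x[1] == 1]
        neg = [x for x in pool if x[1] == 0]
        if pos and neg:  # mixed rewards -> a usable learning signal
            half = group_size // 2
            return pos[:half] + neg[:half]
    return pool[:group_size]  # still all-positive or all-negative after the budget
```

With `global_stat_est=True`, our understanding is that the reward mean and std used for advantage normalization are estimated from all responses drawn during adaptive sampling rather than only the down-sampled group; see the paper for the exact definition.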
- Evaluate
We do not provide evaluation code for Tinker. We suggest downloading the trained weights and evaluating them with our Reinforce-Ada verl version. You can evaluate a checkpoint every 800 Tinker steps (equal to every 50 steps in the verl version).
We provide the processed/selected training prompts on Hugging Face.
Note: not all LLMs are supported by Tinker; you can check the supported models here.
| Model to train | Prompt level | Algorithm | Training set |
|---|---|---|---|
| Qwen/Qwen2.5-Math-1.5B | easy | Reinforce-Ada-balance | RLHFlow/reinforce_ada_easy_prompt_1.5b |
| Qwen/Qwen2.5-Math-1.5B | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt_1.5b |
| Qwen/Qwen2.5-Math-7B | easy | Reinforce-Ada-balance | RLHFlow/reinforce_ada_easy_prompt |
| Qwen/Qwen2.5-Math-7B | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt |
| Qwen/Qwen3-4B-Instruct-2507 | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt |
| meta-llama/Llama-3.2-3B-Instruct | hard | Reinforce-Ada-balance | RLHFlow/reinforce_ada_hard_prompt_llama |
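For example, one of the training sets above can be inspected with the `datasets` library (a usage sketch on our part, assuming the repository is public on the Hugging Face Hub; split and column names may differ):

```python
from datasets import load_dataset

# Hard-prompt training set listed in the table above
ds = load_dataset("RLHFlow/reinforce_ada_hard_prompt")
print(ds)                           # available splits and columns
print(next(iter(ds.values()))[0])   # peek at the first example
```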
We thank Tinker for providing this awesome training API.
If you find our paper or code helpful, feel free to give us a citation.
@misc{xiong2025reinforceada,
title={Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training},
author={Wei Xiong and Chenlu Ye and Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian and Nan Jiang and Tong Zhang},
year={2025},
eprint={2510.04996},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.04996},
}