LMM-R1 is a fork of OpenRLHF, aimed at providing high-performance LMM Reinforcement Learning infrastructure for reproduction of DeepSeek-R1 on multimodal tasks.
We currently support PPO/REINFORCE++/RLOO training for LMM, and achieve 4.7x speedup (RLOO) compared with R1-V (GRPO).
Team:
- [2025/2/13] We release code of LMM-R1!
We train Qwen2.5-VL-3B-Instruct on 8k text-only MATH (level3-5) dataset using RLOO with a rule-based reward function. We find it gains significant improvement on challenging multi-modal math benchmarks (MathVision, MathVerse, Olympiadbench_en).
| MathVision | MathVerse | Olympiadbench_en | |
|---|---|---|---|
| Qwen2.5-VL-3B-Instruct | 23.09 | 27.99 | 10.15 |
| Qwen2.5-VL-3B-Instruct-rloo-math | 27.47 | 35.1 | 13.32 |
This result reminds us that the existing rich high-quaity text-modality reasoning data maybe beneficial to train a strong multimodal reasoning model, especially at this very moment when high-quality multimodal reasoning data is scarce.
We provide the data examples/data/mathlv345_8k_chatml.json and script examples/scripts/r1_scripts/train_rloo_qwenvl2_5_math.sh for reproduction. Note that the system prompt of our model is consistent with that of training during evaluation.
- Support LMM training (Qwen2-VL, Qwen2.5-VL).
- Distributed PPO and REINFORCE++/RLOO implementations based on Ray.
- Ray-based Reinforced Finetuning
- Support Ray-based PPO and REINFORCE++/RLOO using Hybrid Engine (
--colocate_all_models,--vllm_enable_sleepand--vllm_gpu_memory_utilization 0.5) - Full RLHF fine-tuning support for models with over 70 billion parameters.
- Integration with vLLM for accelerated generation in RLHF tasks (
--vllm_num_engines). - Support for multiple reward models (
--reward_pretrain model1,model2...) and remote reward models (--remote_rm_url). - Integration of FlashAttention2 (
--flash_attn). - Support for QLoRA (
--load_in_4bit) and LoRA (--lora_rank,--target_modules). - Compatibility with HuggingFace's
tokenizer.apply_chat_templatefor datasets (--apply_chat_templateand--input_key). - Logging support with Wandb (
--use_wandb) and TensorBoard (--use_tensorboard). - Checkpoint recovery functionality (
--load_checkpointand--save_steps). - Provided multi-node training scripts, such as Ray PPO.
git clone https://github.com/TideDra/lmm-r1.git
cd lmm-r1
pip install -e .[vllm]
pip install flash_attn --no-build-isolationNote
We recommend using vLLM 0.7.2 or higher. We also provided the Dockerfiles for vLLM and One-Click Installation Script of Nvidia-Docker.
LMM-R1 requires the multimodal prompt dataset to be in OpenAI-compatible message format:
[
{
"message":"[
{
\"role\": \"user\",
\"content\": [
{ \
\"type\": \"image\",
\"image\": \"file:///path/to/your/image.jpg\",
}, \
{\"type\": \"text\", \"text\": \"How many cats in the image?\"},
],
}
]",
"answer": "$3$"
},
]Note that message is a stringfied list.
An example dataset examples/data/test_message.jsonl is for reference.
- We can use
--input_keyto specify theJSON key nameof the input datasets--prompt_data {name or path}(PPO) or--dataset {name or path}. Do not use--apply_chat_templatefor multimodal prompt, the message will be processed internally. - OpenRLHF also support mixing multiple datasets using
--prompt_data_probs 0.1,0.4,0.5(PPO) or--dataset_probs 0.1,0.4,0.5.
How to specify training and test datasets ?
You can specify it using the data_type@data_dir format. For example, the dataset can be set as --dataset json@./data.
data
├── test.jsonl
└── train.jsonl
Note
By default, we use train and test as splits to distinguish training and testing datasets from Huggingface.
The JSON key options depends on the specific datasets. See Reward Dataset and SFT Dataset
Note
Set --train-vlm for LMM training.
DATASET="test_message.jsonl"
MODEL_CPK_NAME="qwenvl25_3B_ins_rloo_mathvision"
PRETRAIN_MODEL="Qwen/Qwen2.5-VL-3B-Instruct"
SAVE_PATH="/ckpts"
mkdir -p "${SAVE_PATH}/${MODEL_CPK_NAME}"
# deploy remote reward function at 127.0.0.1:5000
python -m openrlhf.models.remote_rm.math_verifier --dataset $DATASET --input_key message --prompt-template chatml > "${SAVE_PATH}/${MODEL_CPK_NAME}/remote_rm.log" 2>&1 &
childpid=$!
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8 --temp-dir ~/.cache/ray
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/root/projects/OpenRLHF"}' \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 8 \
--remote_rm_url http://127.0.0.1:5000/get_reward \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 8 \
--vllm_num_engines 8 \
--vllm_tensor_parallel_size 1 \
--colocate_all_models \
--vllm_enable_sleep \
--vllm_gpu_memory_utilization 0.7 \
--vllm_sync_backend gloo \
--enable_prefix_caching \
--pretrain $PRETRAIN_MODEL \
--save_path $SAVE_PATH/$MODEL_CPK_NAME \
--micro_train_batch_size 2 \
--train_batch_size 128 \
--micro_rollout_batch_size 4 \
--rollout_batch_size 256 \
--temperature 1 \
--n_samples_per_prompt 16 \
--max_epochs 1 \
--num_episodes 30 \
--prompt_max_len 4096 \
--max_samples 100000 \
--generate_max_len 4096 \
--advantage_estimator rloo \
--zero_stage 3 \
--bf16 \
--actor_learning_rate 1e-6 \
--init_kl_coef 0.01 \
--prompt_data $DATASET \
--input_key message \
--normalize_reward \
--flash_attn \
--gradient_checkpointing \
--save_steps 10 \
--ckpt_path $SAVE_PATH/$MODEL_CPK_NAME/ckpt \
--save_hf_ckpt \
--use_tensorboard $SAVE_PATH/$MODEL_CPK_NAME/logs \
--train_vlm
ray stop
kill $childpidNote
Not set --vllm_num_engines means not using the vLLM engine.
Note
RLOO and REINFORCE++-baseline in OPENRLHF are a modification based on REINFORCE++:
- REINFORCE++ integrates key optimization techniques from PPO while eliminating the need for a critic network.
- REINFORCE++-baseline uses the mean reward of multiple samples from the same prompt as the baseline.
- RLOO in OpenRLHF modifies the original version by incorporating the
per-token KL rewardand utilizing thePPO-clip loss.
Note
If you encounter an error related to index out of range when deepspeed sets up the GPU devices, you can try to set the environment variable RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES as a workaround.
# For NVIDIA GPUs:
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1To achieve optimal performance, we recommend allocating nodes number vLLM:Actor:Critic = 1:1:1. For example, for a 70B model with 48 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 16 GPUs to the Actor model, and the remaining 16 GPUs to the Critic model. Additionally, enable the --colocate_critic_reward, --colocate_actor_ref options to merge nodes. Finally, you should increase the rollout_micro_batch_size (and minimize the TP size of vLLM engine) as much as possible. During the training phase, a larger --micro_train_batch_size is better and enable --packing_samples (Not supported for LMM). When there are enough GPUs, please disable --adam_offload and enable --overlap_comm. For multi-nodes RLHF, please use --vllm_sync_backend nccl with vLLM 0.7.2+. Enable enable_prefix_caching in vLLM generation when n_samples_per_prompts > 1. Using hybrid engine --colocate_all_models and –vllm_enable_sleeprather than distributed RLHF when the model size and context length are small values.
We sincerely thank DeepSeek for their exploration on LLM reasoning, and OpenRLHF for their incredible RL infrastructure. We also thank open-r1 and simpleRL-reason which give us insights on reproduction of R1. Special thanks to Kai Yang, Jie Liu, ZhiYuan You for their valuable suggestions.
@misc{peng2025lmmr1,
author = {YingZhe Peng and Gongrui Zhang and Xin Geng and Xu Yang},
title = {LMM-R1},
howpublished = {\url{https://github.com/TideDra/lmm-r1}},
note = {Accessed: 2025-02-13},
year = {2025}
}

