Process Reward Models That Think 🧠

🎉 News

  • [2025-07-03] Gave a talk at Tsinghua University on ThinkPRM. Slides are here.
  • [2025-07-02] We have released code for running test-time scaling experiments with ThinkPRM!
  • [2025-04-23] Our paper is released on arXiv.
  • [2025-04-24] The synthetic verification CoTs used to train ThinkPRM are now on Hugging Face.
  • [2025-04-25] Our trained PRMs are released in two sizes: 1.5B and 14B, finetuned from R1-Distill-Qwen models.
  • [2025-05-16] To offer a balance between the 1.5B and 14B models, we have uploaded ThinkPRM-7B, based on DeepSeek-R1-Distill-Qwen-7B and finetuned on our data.

📖 Introduction

We introduce ThinkPRM, a collection of generative long-CoT process reward models. Our verifiers are obtained by finetuning reasoning models on 1K synthetic verification CoTs, filtered using only 8K process labels from PRM800K. The resulting verifiers outperform LLM-as-a-judge and discriminative PRMs on most in- and out-of-domain setups. ThinkPRM enables scaling up verifier compute either in parallel or sequentially by thinking longer.


📑 Data Collection

ThinkPRM was trained on synthetic verification CoTs. This dataset contains 1,000 high-quality synthetic verification chains-of-thought (CoTs) designed for training generative Process Reward Models (PRMs), as used in the paper "Process Reward Models That Think". The goal was to create a data-efficient alternative to traditional PRM training, which often requires extensive human annotation or expensive rollouts.

Each instance consists of a math problem, a corresponding multi-step solution prefix (sourced from PRM800K [Lightman et al., 2023]), and a detailed verification CoT generated by QwQ-32B-Preview. The verification CoT critiques each step of the solution prefix and provides a step-level correctness judgment (\boxed{correct} or \boxed{incorrect}).

To ensure high-quality synthetic CoTs, only chains whose step-level judgments all matched the ground-truth human annotations from the PRM800K dataset were retained. Chains were also filtered on formatting and length constraints to avoid issues like the excessive overthinking observed in unfiltered generations. The figure below summarizes the synthetic CoT collection process. Refer to our paper for more details on data collection.

(Figure: overview of the synthetic verification CoT collection and filtering process.)

The dataset was created to enable efficient training of powerful generative PRMs. The core idea is that finetuning strong reasoning models on carefully curated synthetic verification CoTs can yield verifiers that outperform models trained on much larger, traditionally labeled datasets. Process-based filtering (matching gold step labels) proved crucial for generating high-quality training data, compared to outcome-based filtering.
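For illustration, the process-based filter can be sketched in a few lines of Python. This is only a rough sketch: the \boxed{...} parsing, the argument names, and the length threshold below are assumptions made for readability, not the exact pipeline used in the paper.

import re

def parse_step_judgments(verification_cot: str) -> list[int]:
    """Extract per-step judgments from a CoT that marks each step with \\boxed{correct} or \\boxed{incorrect}."""
    judgments = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [1 if j == "correct" else 0 for j in judgments]

def keep_cot(verification_cot: str, gold_step_labels: list[int], max_chars: int = 8000) -> bool:
    """Process-based filter: keep a synthetic CoT only if its judgments parse cleanly,
    the chain is not excessively long (assumed threshold), and every step-level
    judgment matches the gold PRM800K label."""
    predicted = parse_step_judgments(verification_cot)
    if len(predicted) != len(gold_step_labels):   # malformed or missing judgments
        return False
    if len(verification_cot) > max_chars:         # crude length constraint against overthinking
        return False
    return predicted == gold_step_labels          # all step-level labels must match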

✨ Getting Started

Installation

We are going to create two virtual environments: one dedicated to serving ThinkPRM with sglang, and another for running the test-time scaling experiments. This is necessary to avoid dependency conflicts between sglang and other packages. We will be using uv, so make sure you have it installed by following the instructions.

uv python install 3.11
uv venv sglang-env
source sglang-env/bin/activate
uv pip install "sglang[all]>=0.4.8.post1"

Now let's set up the main environment:

uv venv
uv sync

Basic Usage

from prm import ThinkPRM

# Initialize ThinkPRM
prm = ThinkPRM(
    model_name_or_path="launch/ThinkPRM-1.5B",
    max_length=4096,
    temperature=0.1,
    n=1
)

# Example question and reasoning steps
question = "What is 15% of 200?"
reasoning_steps = [
    "To find 15% of 200, I need to multiply 200 by 0.15",
    "200 × 0.15 = 30",
    "Therefore, 15% of 200 is 30"
]

# Evaluate correctness
results = prm.predict_correctness_batch([question], [reasoning_steps])

# results will contain detailed information about the verification:
# results = [
#     {
#         'step_labels': [[1, 1, 1]],  # 1 for correct steps, 0 for incorrect
#         'inputs': 'What is 15% of 200?...',  # Original input prompt
#         'outputs': ['Let me verify this step by step...</think>...'],  # Generated verification text
#         'prefix_score': 0.85  # correctness score of the whole prefix. If n>1, this will be averaged across different CoTs (parallel scaling)
#     }
# ]
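
Continuing the example, the fields shown in the comment block above can be consumed roughly as follows (a small sketch that assumes results has exactly that shape):

result = results[0]
print(f"Prefix score: {result['prefix_score']:.2f}")

# step_labels contains one list of 0/1 labels per sampled verification CoT
for cot_idx, labels in enumerate(result['step_labels']):
    for step_idx, label in enumerate(labels, start=1):
        verdict = "correct" if label == 1 else "incorrect"
        print(f"CoT {cot_idx}: step {step_idx} judged {verdict}")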

Sequential scaling through multi-round verification

Multi-round verification, i.e., budget forcing (see the paper for more details), scales verification compute sequentially by letting the verifier think for additional rounds. For a more thorough evaluation, you can enable it as follows:

prm = ThinkPRM(
    model_name_or_path="launch/ThinkPRM-1.5B",
    multiround_verifier=True,
    n_thinking_rounds=3,  # Allow 3 rounds of thinking with budget forcing
    temperature=0.1,
    n=1 
)

problem = "Solve for x: 2x + 3 = 7"
# Note that the second step is incorrect (2x = 4 gives x = 2, not 7), so the verifier should flag it
steps = ["Subtract 3 from both sides: 2x = 4", "Divide by 2: x = 7"]
results = prm.predict_correctness_batch_sequential_scaling(questions=[problem], prefix_steps_batch=[steps])

Running ThinkPRM: Use the provided recipes to run ThinkPRM with different models and configurations:

First, we need to serve one of the ThinkPRM models. The available models are ThinkPRM-1.5B, ThinkPRM-7B, and ThinkPRM-14B. For example, to serve ThinkPRM-1.5B:

source sglang-env/bin/activate
uv run --active python -m sglang.launch_server --grammar-backend xgrammar --model-path launch/ThinkPRM-1.5B --port 31111 --host 127.0.0.1

This should start an Sglang server with ThinkPRM-1.5B ready to roll.

Now we can run best-of-N or verifier-guided beam search as follows. Note that every experiment in the paper is described by a .yaml recipe file; these are located in search-and-learn/recipes/.

Now switch to a new terminal and cd into the project directory. To run best-of-N on MATH-500 with Llama-3.2-3B-Instruct as the generator, use the recipe in search-and-learn/recipes/MATH-500/Llama-3.2-3B-Instruct/best_of_n/thinkprm.yaml:

source .venv/bin/activate
CUDA_VISIBLE_DEVICES=1 uv run python search-and-learn/scripts/test_time_compute.py search-and-learn/recipes/MATH-500/Llama-3.2-3B-Instruct/best_of_n/thinkprm.yaml

This samples solutions from the generator model on the MATH-500 dataset, scores the samples using ThinkPRM, then computes accuracy for different values of N and stores the results in config.output_dir.

Test-Time Scaling with ThinkPRM


ThinkPRM supports two types of scaling:

  • Parallel Scaling: Multiple PRM evaluations run simultaneously
  • Sequential Scaling: Multiple thinking rounds in sequence

Configuration Differences

Parallel Scaling

prm_temperature: 0.4    
prm_n: 4               # 4 parallel verification CoTs

Sequential Scaling

prm_temperature: 0.0    
prm_n: 1               # only supports a single chain
n_thinking_rounds: 1   # Number of sequential thinking rounds

Note: we do not support both types of scaling at the same time. Also note that sequential scaling is substantially slower than parallel scaling.
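
For reference, here is how the two settings above map onto the Python API from the earlier examples (a sketch; the constructor arguments follow the usage shown above):

# Parallel scaling: sample 4 independent verification CoTs; prefix_score is averaged across them
parallel_prm = ThinkPRM(
    model_name_or_path="launch/ThinkPRM-1.5B",
    temperature=0.4,
    n=4,
)

# Sequential scaling: a single chain extended over multiple thinking rounds (budget forcing)
sequential_prm = ThinkPRM(
    model_name_or_path="launch/ThinkPRM-1.5B",
    multiround_verifier=True,
    n_thinking_rounds=3,
    temperature=0.0,
    n=1,
)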

Running the Experiments

As described in the paper, we scale verifier compute with ThinkPRM in two ways: in parallel, by sampling K independent verification chains and aggregating scores over them, or sequentially, via some form of budget forcing. We include two example recipes here and here.

Acknowledgements

This project would not be possible without the efforts of the open-source community, particularly repos like search-and-learn, vllm, and the awesome sglang.

🎈 Citation

If you find ThinkPRM helpful, please cite us.

@article{khalifa2025,
      title={Process Reward Models That Think}, 
      author={Muhammad Khalifa and Rishabh Agarwal and Lajanugen Logeswaran and Jaekyeom Kim and Hao Peng and Moontae Lee and Honglak Lee and Lu Wang},
      year={2025},
      journal={arXiv preprint arXiv:2504.16828},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.16828}, 
}
