- [2025-07-3] Gave a talk at Tsinghua University on ThinkPRM. Slides are here.
- [2025-07-2] We have released code for running test-time scaling experiments with ThinkPRM!
- [2025-04-23] Our paper is released on arXiv.
- [2025-04-24] Our synthetic verification CoTs used to train ThinkPRM are now on Hugging Face.
- [2025-04-25] Our trained PRMs are released in two sizes, 1.5B and 14B, finetuned from R1-Distill-Qwen models.
- [2025-05-16] To provide a balance between the 1.5B and 14B models, we have uploaded ThinkPRM-7B, based on DeepSeek-R1-Distill-Qwen-7B and finetuned on our data.
We introduce ThinkPRM, a collection of generative long CoT process reward models. Our verifiers are obtained by finetuning reasoning models over 1K synthetic verification CoTs, filtered based only on 8K process labels from PRM800K. The resulting verifiers outperform LLM-as-a-judge and discriminative PRMs on most in- and out-of-domain setups. ThinkPRM enables scaling up verifier compute either in parallel or sequentially by thinking longer.
ThinkPRM was trained on synthetic verification CoTs. This dataset contains 1,000 high-quality synthetic verification chains-of-thought (CoTs) designed for training generative Process Reward Models (PRMs), as used in the paper "Process Reward Models that Think". The goal was to create a data-efficient alternative to traditional PRM training, which often requires extensive human annotation or expensive rollouts.
Each instance consists of a math problem, a corresponding multi-step solution prefix (sourced from PRM800K [Lightman et al., 2023]), and a detailed verification CoT generated by QwQ-32B-Preview. The verification CoT critiques each step of the solution prefix and provides a step-level correctness judgment (\boxed{correct} or \boxed{incorrect}).
To ensure high-quality synthetic CoTs, only chains in which all step-level judgments matched the ground-truth human annotations from the PRM800K dataset were retained. Chains were also filtered for correct formatting and for length, to avoid issues such as the excessive overthinking observed in unfiltered generations. The figure below summarizes the synthetic CoT collection process. Refer to our paper for more details on data collection.
The dataset was created to enable efficient training of powerful generative PRMs. The core idea is that fine-tuning strong reasoning models on carefully curated, synthetic verification CoTs can yield verifiers that outperform models trained on much larger, traditionally labeled datasets. The process-based filtering (matching gold step labels) was shown to be crucial for generating high-quality training data compared to outcome-based filtering.
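The filtering idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used in the paper: the function names, the regular expression, and the `max_chars` length cap are hypothetical, and the sketch assumes each step's verdict appears in order as `\boxed{correct}` or `\boxed{incorrect}`. The outcome-based variant is one plausible reading of "outcome-based filtering" (agreeing only on whether the prefix contains an error).

```python
import re

# Matches step-level verdicts such as \boxed{correct} or \boxed{incorrect}.
VERDICT_RE = re.compile(r"\\boxed\{(correct|incorrect)\}")


def extract_step_judgments(cot: str) -> list[int]:
    """Return 1 for each 'correct' verdict and 0 for each 'incorrect' one, in order."""
    return [1 if verdict == "correct" else 0 for verdict in VERDICT_RE.findall(cot)]


def passes_process_filter(cot: str, gold_labels: list[int], max_chars: int = 8000) -> bool:
    """Keep a chain only if every step judgment matches its gold label.

    max_chars is a hypothetical length cap standing in for the length filter.
    """
    if len(cot) > max_chars:
        return False
    return extract_step_judgments(cot) == list(gold_labels)


def passes_outcome_filter(cot: str, gold_labels: list[int]) -> bool:
    """Weaker check (assumed definition): the chain only has to agree with the
    gold labels on whether the prefix as a whole is error-free."""
    judgments = extract_step_judgments(cot)
    if not judgments:
        return False
    return all(judgments) == all(gold_labels)


# A chain that flags the wrong step: rejected by the process filter,
# but accepted by the outcome filter because the prefix does contain an error.
chain = r"Step 1 looks wrong. \boxed{incorrect} Step 2 also fails. \boxed{incorrect}"
gold = [1, 0]
print(passes_process_filter(chain, gold), passes_outcome_filter(chain, gold))
```

The example chain shows why matching every step label is the stricter criterion: a chain can reach the right overall conclusion while misjudging individual steps.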
We are going to create two virtual environments: one dedicated to serving ThinkPRM with sglang, and the other for running test-time scaling experiments. This is necessary to avoid dependency conflicts between sglang and other packages. We will be using uv, so make sure you have it installed by following the instructions.
uv python install 3.11
uv venv sglang-env
uv pip install "sglang[all]>=0.4.8.post1"
Now let's set up the main environment:
uv venv
uv sync

from prm import ThinkPRM
# Initialize ThinkPRM
prm = ThinkPRM(
    model_name_or_path="launch/ThinkPRM-1.5B",
    max_length=4096,
    temperature=0.1,
    n=1
)
# Example question and reasoning steps
question = "What is 15% of 200?"
reasoning_steps = [
    "To find 15% of 200, I need to multiply 200 by 0.15",
    "200 × 0.15 = 30",
    "Therefore, 15% of 200 is 30"
]
# Evaluate correctness
results = prm.predict_correctness_batch([question], [reasoning_steps])
# results will contain detailed information about the verification:
# results = [
#     {
#         'step_labels': [[1, 1, 1]], # 1 for correct steps, 0 for incorrect
#         'inputs': 'What is 15% of 200?...', # Original input prompt
#         'outputs': ['Let me verify this step by step...</think>...'], # Generated verification text
#         'prefix_score': 0.85 # correctness score of the whole prefix. If n>1, this will be averaged across different CoTs (parallel scaling)
#     }
# ]

For more thorough evaluation, you can enable multi-round verification:
prm = ThinkPRM(
    model_name_or_path="launch/ThinkPRM-1.5B",
    multiround_verifier=True,
    n_thinking_rounds=3, # Allow 3 rounds of thinking with budget forcing
    temperature=0.1,
    n=1
)
problem = "Solve for x: 2x + 3 = 7"
steps = ["Subtract 3 from both sides: 2x = 4", "Divide by 2: x = 7"]
# Use the verifier created above (named prm, not model)
results = prm.predict_correctness_batch_sequential_scaling(questions=[problem], prefix_steps_batch=[steps])

Running ThinkPRM: Use the provided recipes to run ThinkPRM with different models and configurations:
First, we need to serve one of the ThinkPRM models.
The released models come in three sizes: 1.5B, 7B, and 14B. The commands below serve launch/ThinkPRM-1.5B:
source sglang-env/bin/activate
uv run --active python -m sglang.launch_server --grammar-backend xgrammar --model-path launch/ThinkPRM-1.5B --port 31111 --host 127.0.0.1

This should start an SGLang server with ThinkPRM-1.5B ready to roll.
Now we can run best-of-N or verifier-guided beam search. Every experiment in the paper is described by a .yaml recipe file; these are located in search-and-learn/recipes/.
Switch to a new terminal and cd into the project directory. To run best-of-N on MATH-500 with Llama-3.2-3B-Instruct as the generator, use the recipe at search-and-learn/recipes/MATH-500/Llama-3.2-3B-Instruct/best_of_n/thinkprm.yaml:
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=1 uv run python search-and-learn/scripts/test_time_compute.py search-and-learn/recipes/MATH-500/Llama-3.2-3B-Instruct/best_of_n/thinkprm.yaml

This should sample solutions from the generator model over the MATH-500 dataset, score the samples using ThinkPRM, and finally compute accuracy for different values of N and store the results in config.output_dir.
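For intuition, the selection and evaluation logic behind a best-of-N run can be sketched as below. This is a simplified illustration, not the code in `test_time_compute.py`: the function names and the `answers` / `scores` / `gold` record layout are hypothetical, and the actual script may use a different aggregation (for example, weighting by answer frequency).

```python
from collections.abc import Sequence


def best_of_n(answers: Sequence[str], scores: Sequence[float], n: int) -> str:
    """Return the answer with the highest verifier score among the first n samples."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    if not 1 <= n <= len(answers):
        raise ValueError("n must be between 1 and the number of samples")
    best_index = max(range(n), key=lambda i: scores[i])
    return answers[best_index]


def accuracy_at_n(problems: Sequence[dict], n: int) -> float:
    """Fraction of problems whose best-of-n answer equals the gold answer."""
    correct = sum(
        best_of_n(p["answers"], p["scores"], n) == p["gold"] for p in problems
    )
    return correct / len(problems)


# Toy data: four sampled answers per problem, each with a verifier score.
toy_problems = [
    {"answers": ["35", "30", "30", "45"], "scores": [0.4, 0.9, 0.8, 0.2], "gold": "30"},
    {"answers": ["2", "7", "2", "2"], "scores": [0.7, 0.3, 0.6, 0.95], "gold": "2"},
]

for n in (1, 2, 4):
    print(n, accuracy_at_n(toy_problems, n))
```

Evaluating the same pool of samples at several values of N, as in the loop above, is what lets a single run report accuracy as a function of N.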
ThinkPRM supports two types of scaling:
- Parallel Scaling: Multiple PRM evaluations run simultaneously
- Sequential Scaling: Multiple thinking rounds in sequence
Parallel scaling:
prm_temperature: 0.4
prm_n: 4 # 4 parallel verification CoTs

Sequential scaling:
prm_temperature: 0.0
prm_n: 1 # only supports a single chain
n_thinking_rounds: 1 # Number of sequential thinking rounds

Note: we do not support both types at the same time.
Note: sequential scaling is substantially slower than parallel scaling.
As described in the paper, we scale verifier compute with ThinkPRM in two ways: in parallel, by sampling K independent verification chains and aggregating their scores, or sequentially, via some form of budget forcing. We include two example recipes here and here.
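The parallel case reduces to averaging the per-chain scores, as the `prefix_score` comment in the usage example indicates. The following is a minimal sketch of that aggregation; the function name and the example scores are hypothetical, and it does not reproduce how ThinkPRM derives each chain's score.

```python
from statistics import mean


def aggregate_parallel_scores(chain_scores: list[float]) -> float:
    """Average the prefix scores from K independently sampled verification chains."""
    if not chain_scores:
        raise ValueError("at least one verification chain is required")
    return mean(chain_scores)


# Four hypothetical chain scores for one solution prefix (K = 4).
scores_for_prefix = [1.0, 0.5, 0.75, 0.75]
print(aggregate_parallel_scores(scores_for_prefix))
```

Averaging over several chains smooths out the variance of any single sampled verification, which is why the parallel recipe pairs a larger `prm_n` with a nonzero `prm_temperature`.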
This project would not be possible without the efforts of the open-source community, particularly repos like search-and-learn, vLLM, and the awesome SGLang.
If you find ThinkPRM helpful, please cite us.
@article{khalifa2025,
title={Process Reward Models That Think},
author={Muhammad Khalifa and Rishabh Agarwal and Lajanugen Logeswaran and Jaekyeom Kim and Hao Peng and Moontae Lee and Honglak Lee and Lu Wang},
year={2025},
journal={arXiv preprint arXiv:2504.16828},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.16828},
}

