RL provides significant gains on top of pre-trained models, with larger relative improvements on smaller models.
This plot shows how GSM8K accuracy scales with total compute (pre-training + RL fine-tuning) across model sizes from 0.6B to 14B parameters. Each colored segment represents a model:
- Square markers: Pre-trained baseline performance
- Star markers: Performance after RL fine-tuning
- Compute labels: Total FLOPs invested (pre-training dominates, RL adds ~0.003%)
Key insight: Small models (0.6B-1.7B) see 25-28% absolute accuracy gains from minimal RL compute, while larger models (8B-14B) see 6-8% gains. This suggests different optimal compute allocation strategies depending on model size.
Given N FLOPs, how should compute be split between pre-training and RL to maximize performance?
Most scaling law work focuses on pre-training only. This project explores the compute-performance trade-offs when adding RL fine-tuning to pre-trained models.
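For intuition on why RL is such a thin slice of the budget, here is a rough FLOP-accounting sketch. It assumes the standard ~6·N·D estimate for dense-transformer pre-training and a simple rollout-plus-update cost model for RL; the cost constants, function names, and 512-token rollout length are illustrative assumptions, not measured numbers from this project.

```python
# Rough FLOP accounting for the pre-train vs. RL split (illustrative assumptions).
# Pre-training: ~6 * params * tokens (standard dense-transformer estimate).
# RL per generated token: ~2 * params for rollout generation (forward only)
# plus ~6 * params for the policy update on that token.

def pretrain_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

def rl_flops(params: float, steps: int, prompts_per_step: int,
             rollouts_per_prompt: int, tokens_per_rollout: int) -> float:
    rollout_tokens = steps * prompts_per_step * rollouts_per_prompt * tokens_per_rollout
    return (2 + 6) * params * rollout_tokens

params, pretrain_tokens = 4e9, 36e12          # 4B model, 36T pre-training tokens
pre = pretrain_flops(params, pretrain_tokens)
rl = rl_flops(params, steps=56, prompts_per_step=512,
              rollouts_per_prompt=3, tokens_per_rollout=512)
# The exact ratio depends on rollout lengths and what gets counted, but RL is a
# vanishingly small fraction of the total budget either way.
print(f"pre-train ≈ {pre:.2e} FLOPs, RL ≈ {rl:.2e} FLOPs, ratio ≈ {rl / pre:.1e}")
```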
Models: Qwen3-base (0.6B, 1.7B, 4B, 8B, 14B) trained on 36T tokens
RL Algorithm: GRPO with batch size 512, 3 rollouts/prompt, learning rate 1e-6
Configuration: 36T pre-training tokens, 512 token context, ~56 RL training steps
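As a reference point for what GRPO does with those 3 rollouts per prompt, here is a minimal sketch of the group-relative advantage computation at the core of the algorithm: each rollout's reward is normalized against the other rollouts for the same prompt. The array shapes mirror the batch size and rollout count above; the function name is illustrative, not Verl's API.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages for rewards of shape (num_prompts, rollouts_per_prompt):
    each rollout is scored relative to the other rollouts for the same prompt."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# 512 prompts per step, 3 rollouts each (matching the config above);
# the random rewards stand in for the reward-function scores.
rewards = np.random.rand(512, 3)
advantages = group_relative_advantages(rewards)
print(advantages.shape)  # (512, 3)
```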
Three reward functions tested:
- strict: requires the exact `#### <number>` format
- flexible: extracts the last number from the response
- custom_flexible: stops at the `####` marker, then extracts the number
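A minimal sketch of the three scoring rules, based on the descriptions above; the regex and the exact truncation behavior of custom_flexible are my reading, not the project's verbatim implementation.

```python
import re

# Matches integers/decimals with optional commas and sign, e.g. "1,234", "-3.5".
NUM = r"-?\d[\d,]*(?:\.\d+)?"

def _norm(s: str) -> str:
    return s.replace(",", "").rstrip(".")

def strict_reward(response: str, answer: str) -> float:
    # Correct only if the response ends with the exact "#### <number>" terminator.
    m = re.search(rf"####\s*({NUM})\s*$", response.strip())
    return float(m is not None and _norm(m.group(1)) == _norm(answer))

def flexible_reward(response: str, answer: str) -> float:
    # Correct if the last number anywhere in the response matches.
    nums = re.findall(NUM, response)
    return float(bool(nums) and _norm(nums[-1]) == _norm(answer))

def custom_flexible_reward(response: str, answer: str) -> float:
    # One reading of "stops at the #### marker": look only at the text right
    # after the first "####"; fall back to flexible extraction otherwise.
    if "####" in response:
        tail = response.split("####", 1)[1]
        nums = re.findall(NUM, tail)
        if nums:
            return float(_norm(nums[0]) == _norm(answer))
    return flexible_reward(response, answer)

print(strict_reward("... so the answer is 72.\n#### 72", "72"))               # 1.0
print(flexible_reward("the total is 72 apples", "72"))                        # 1.0
print(custom_flexible_reward("reasoning... #### 72\n\n(extra text 5)", "72")) # 1.0
```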
Critical finding: Base models fail at instruction following, not reasoning. Using strict rewards wastes RL compute teaching formatting instead of improving math ability.
GSM8K gains by model size (0-shot, custom_flexible):
| Model | Baseline | After RL | Gain | MATH Baseline | MATH After RL | MATH Gain |
|---|---|---|---|---|---|---|
| 0.6B | 40.7% | 68.3% | +27.6% | 25.3% | 33.8% | +8.5% |
| 1.7B | 54.8% | 80.4% | +25.6% | 33.8% | 41.3% | +7.5% |
| 4B | 80.4% | 88.5% | +8.1% | 41.0% | 44.7% | +3.7% |
| 8B | 81.5% | 89.8% | +8.3% | 44.6% | 46.9% | +2.3% |
| 14B | 86.3% | 92.6% | +6.3% | - | 47.9% | - |
Power law fit: Gain ≈ 10.7 × (Model_Size)^(−0.194), with model size in billions of parameters
Smaller models see larger relative gains from RL, with gains diminishing as base model quality increases.
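As a quick usage example of the fitted trend, evaluate it at the tested sizes and extrapolate one step further (the 32B point anticipates the validation run listed under future work). The exact fitting procedure behind the reported coefficients isn't specified; this snippet only evaluates the reported formula.

```python
# Evaluate the fitted trend Gain ≈ 10.7 * size^(-0.194), size in billions of params.
# The 32B value is an extrapolation -- checking it is exactly the planned 32B run.
for size_b in [0.6, 1.7, 4.0, 8.0, 14.0, 32.0]:
    predicted_gain = 10.7 * size_b ** -0.194
    print(f"{size_b:>5.1f}B: predicted GSM8K gain ≈ +{predicted_gain:.1f} points")
```

Note that this simple trend sits well below the observed +27.6 points at 0.6B and tracks the 4B-14B rows much more closely, so small-model extrapolations should be treated with care.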
GSM8K → MATH: Limited but measurable transfer (0.6B: +8.5%, 14B: minimal)
GSM8K → MMLU: No meaningful transfer to knowledge-based tasks
- RL provides diminishing returns as base model quality increases (power-law exponent ≈ −0.19)
- Smaller models benefit more from RL proportionally (0.6B: +27.6% vs 14B: +6.3%)
- Evaluation method critically matters - strict rewards waste compute on formatting, not reasoning
- Limited cross-domain transfer - GSM8K → MATH shows gains, GSM8K → MMLU shows none
- Few-shot instruction collapse - base models fail dramatically at 1-2 shots (4B: 67% → 1%)
- 512 token context clips ~7% of GSM8K problems
- Only 56 RL steps tested (plateau effects not fully characterized)
- No pass@k evaluation (greedy decoding only)
- Longer RL training - characterize full sample efficiency and plateau effects
- 32B model validation - test if power law holds at larger scale
- Reverse transfer - train on MATH, evaluate on GSM8K
- Optimal compute allocation - find best pre-train/RL split for fixed FLOP budget
- Pass@k evaluation - test whether RL genuinely improves capability at k > 1 or merely collapses diversity onto already-reachable solutions (estimator sketched below)
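For the pass@k item, the standard unbiased estimator (as used in the Codex/HumanEval evaluation setup) can be computed directly from n samples per problem with c correct; a minimal sketch, with the sample counts chosen purely for illustration:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn without
    replacement from n generations (c of them correct) is correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 16 samples per problem with 5 correct completions
print(pass_at_k(16, 5, 1))   # ≈ 0.3125 (equals c/n for k = 1)
print(pass_at_k(16, 5, 8))   # probability that at least one of 8 samples is correct
```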
Train:

    ./train.sh 4B 1e-6 512    # model_size, lr, batch_size

Evaluate:

    python -m eval.run --model Qwen/Qwen3-4B-base --shots 0,1,2,3,4,5

Measure the right thing. Training RL on metrics that confound reasoning and formatting wastes compute. Custom flexible rewards achieve 2% better generalization to MATH compared to strict rewards, despite similar GSM8K scores.
Models: Qwen3-base | Framework: GRPO via Verl | Compute: 36T pre-training tokens + ~56 RL steps


