- 2025.02: We released our survey paper "From System 1 to System 2: A Survey of Reasoning Large Language Models". Feel free to cite it or open a pull request.
Welcome to the repository for our survey paper, "From System 1 to System 2: A Survey of Reasoning Large Language Models". This repository provides resources and updates related to our research. For a detailed introduction, please refer to our survey paper.
Achieving human-level intelligence requires enhancing the transition from System 1 (fast, intuitive) to System 2 (slow, deliberate) reasoning. While foundational Large Language Models (LLMs) have made significant strides, they still fall short of human-like reasoning in complex tasks. Recent reasoning LLMs, like OpenAI’s o1, have demonstrated expert-level performance in domains such as mathematics and coding, resembling System 2 thinking. This survey explores the development of reasoning LLMs, their foundational technologies, benchmarks, and future directions. We maintain an up-to-date GitHub repository to track the latest developments in this rapidly evolving field.
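To make the System 1 / System 2 distinction concrete, here is a minimal, illustrative sketch contrasting a fast, direct-answer prompt with a deliberate, step-by-step prompt of the kind reasoning LLMs internalize. The `query_llm` function and the prompt wording are hypothetical placeholders, not an API or method from any paper in this list.

```python
# Minimal illustration of System 1 vs. System 2 prompting styles discussed in the survey.
# `query_llm` is a hypothetical placeholder for any chat-completion client.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your own model endpoint."""
    raise NotImplementedError("Plug in a real client here.")

QUESTION = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# System 1 style: fast, intuitive, single-shot answer.
system1_prompt = f"{QUESTION}\nAnswer with just the final number."

# System 2 style: slow, deliberate reasoning before the answer.
system2_prompt = (
    f"{QUESTION}\n"
    "Think step by step: set up the equations, solve them, check the result, "
    "then state the final answer on its own line."
)

if __name__ == "__main__":
    for name, prompt in [("System 1", system1_prompt), ("System 2", system2_prompt)]:
        print(f"--- {name} ---\n{prompt}\n")
        # print(query_llm(prompt))  # uncomment once query_llm is implemented
```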
This image highlights the progression of AI systems, emphasizing the shift from rapid, intuitive approaches to deliberate, reasoning-driven models. It shows how AI has evolved to handle a broader range of real-world challenges.
The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.
- Awesome-System-2-AI
- Part 1: O1 Replication
- Part 2: Process Reward Models
- Part 3: Reinforcement Learning
- Part 4: MCTS/Tree Search
- Part 5: Self-Training / Self-Improvement
- Part 6: Reflection
- Part 7: Efficient System 2
- Part 8: Explainability
- Part 9: Multimodal Agent-Related Slow-Fast Systems
- Part 10: Benchmarks and Datasets
- Part 11: Reasoning and Safety
- Part 12: R1-Driven Multimodal Reasoning Enhancement
- O1 Replication Journey: A Strategic Progress Report -- Part 1 [Paper]
- Enhancing LLM Reasoning with Reward-guided Tree Search [Paper]
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
- O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [Paper]
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [Paper]
- o1-Coder: an o1 Replication for Coding [Paper]
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
- DRT: Deep Reasoning Translation via Long Chain-of-Thought [Paper]
- mini-deepseek-r1 [Blog]
- Run DeepSeek R1 Dynamic 1.58-bit [Blog]
- Simple Reinforcement Learning for Reasoning [Notion]
- TinyZero [github]
- Open R1 [github]
- Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
- The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [Paper]
- Open-Reasoner-Zero [Paper]
- X-R1 [github]
- Unlock-Deepseek [Blog]
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
- LLM-R1 [github]
- Solving Math Word Problems with Process and Outcome-Based Feedback [Paper]
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
- Making Large Language Models Better Reasoners with Step-Aware Verifier [Paper]
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [Paper]
- OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [Paper]
- Let's Verify Step by Step [Paper]
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper]
- AutoPSV: Automated Process-Supervised Verifier [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
- Free Process Rewards without Process Labels [Paper]
- Outcome-Refining Process Supervision for Code Generation [Paper]
- PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [Paper]
- ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [Paper]
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [Paper]
- ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [Paper]
- ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [Paper]
- Uncertainty-Aware Step-wise Verification with Generative Reward Models [Paper]
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [Paper]
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [Paper]
- Unified Reward Model for Multimodal Understanding and Generation [Paper]
- Reward Shaping to Mitigate Reward Hacking in RLHF [Paper]
- Multi-head Reward Aggregation Guided by Entropy [Paper]
- Better Process Supervision with Bi-directional Rewarding Signals [Paper]
- Inference-Time Scaling for Generalist Reward Modeling [Paper]
- Improve Vision Language Model Chain-of-thought Reasoning [Paper]
- Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [Paper]
- Offline Reinforcement Learning for LLM Multi-Step Reasoning [Paper]
- ReFT: Representation Finetuning for Language Models [Paper]
- InfAlign: Inference-aware language model alignment [Paper]
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
- Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [Paper]
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
- Kimi k1.5: Scaling Reinforcement Learning with LLMs [Paper]
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [Paper]
- Reasoning with Reinforced Functional Token Tuning [Paper]
- Value-Based Deep RL Scales Predictably [Paper]
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [Paper]
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [Paper]
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Paper]
- LIMR: Less is More for RL Scaling [Paper]
- A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [Paper]
- Med-RLVR: Emerging Medical Reasoning from a 3B Base Model via Reinforcement Learning [Paper]
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [Paper]
- Process Reinforcement through Implicit Rewards [Paper]
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [Paper]
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [Paper]
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
- Visual-RFT: Visual Reinforcement Fine-Tuning [Paper]
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [Paper]
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [Paper]
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
- Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [Paper]
- VLAA-Thinker [github]
- Concise Reasoning via Reinforcement Learning [Paper]
- d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [github]
- Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning [Paper]
- Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [Paper]
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [Paper]
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [Paper]
- RAISE: Reinforced Adaptive Instruction Selection For Large Language Models [Paper]
- MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning [Paper]
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [Paper]
- Reasoning with Language Model is Planning with World Model [Paper]
- Fine-grained Conversational Decoding via Isotropic and Proximal Search [Paper]
- Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [Paper]
- Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [Paper]
- Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding [Paper]
- Look-back Decoding for Open-Ended Text Generation [Paper]
- Stream of Search (SoS): Learning to Search in Language [Paper]
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [Paper]
- Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [Paper]
- AlphaMath Almost Zero: Process Supervision without Process [Paper]
- Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [Paper]
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [Paper]
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [Paper]
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [Paper]
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [Paper]
- LiteSearch: Efficacious Tree Search for LLM [Paper]
- Tree Search for Language Model Agents [Paper]
- Uncertainty-Guided Optimization on Large Language Model Search Trees [Paper]
- Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [Paper]
- RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
- AFlow: Automating Agentic Workflow Generation [Paper]
- Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper]
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
- Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [Paper]
- TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [Paper]
- Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [Paper]
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [Paper]
- GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [Paper]
- MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [Paper]
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [Paper]
- Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
- Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper]
- Proposing and solving olympiad geometry with guided tree search [Paper]
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [Paper]
- Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [Paper]
- Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [Paper]
- Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [Paper]
- PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament [Paper]
- ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [Paper]
- On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [Paper]
- Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
- LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [Paper]
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [Paper]
- DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [Paper]
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [Paper]
- VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [Paper]
- Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [Paper]
- STaR: Bootstrapping Reasoning With Reasoning [Paper]
- Large Language Models are Better Reasoners with Self-Verification [Paper]
- Self-Evaluation Guided Beam Search for Reasoning [Paper]
- Self-Refine: Iterative Refinement with Self-Feedback [Paper]
- ReST: Reinforced Self-Training for Language Modeling [Paper]
- V-STaR: Training Verifiers for Self-Taught Reasoners [Paper]
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper]
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [Paper]
- Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [Paper]
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [Paper]
- Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [Paper]
- Self-Improvement in Language Models: The Sharpening Mechanism [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [Paper]
- ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [Paper]
- ReFT: Representation Finetuning for Language Models [Paper]
- Enabling Scalable Oversight via Self-Evolving Critic [Paper]
- S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
- ProgCo: Program Helps Self-Correction of Large Language Models [Paper]
- Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (rStar-Math) [Paper]
- Self-Training Elicits Concise Reasoning in Large Language Models [Paper]
- Language Models can Self-Improve at State-Value Estimation for Better Search [Paper]
- Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning [Paper]
- START: Self-taught Reasoner with Tools [Paper]
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [Paper]
- Reflection-Tuning: An Approach for Data Recycling [Paper]
- Learning From Mistakes Makes LLM Better Reasoner [Paper]
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper]
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper]
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [Paper]
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
- Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
- Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities [Paper]
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
- RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [Paper]
- Perception in Reflection [Paper]
- Guiding Language Model Reasoning with Planning Tokens [Paper]
- AutoReason: Automatic Few-Shot Reasoning Decomposition [Paper]
- DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [Paper]
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [Paper]
- Token-Budget-Aware LLM Reasoning [Paper]
- Training Large Language Models to Reason in a Continuous Latent Space [Paper]
- From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs [Paper]
- MALT: Improving Reasoning with Multi-Agent LLM Training [Paper]
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [Paper]
- Efficient Reasoning with Hidden Thinking [Paper]
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [Paper]
- Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [Paper]
- Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [Paper]
- Titans: Learning to Memorize at Test Time [Paper]
- MoBA: Mixture of Block Attention for Long-Context LLMs [Paper]
- One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [Paper]
- Small Models Struggle to Learn from Strong Reasoners [Paper]
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs [Paper]
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [Paper]
- Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning [Paper]
- Thinking Preference Optimization [Paper]
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [Paper]
- Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options [Paper]
- CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [Paper]
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [Paper]
- LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [Paper]
- Atom of Thoughts for Markov LLM Test-Time Scaling [Paper]
- Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [Paper]
- Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models [Paper]
- Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [Paper]
- Scalable Language Models with Posterior Inference of Latent Thought Vectors [Paper]
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [Paper]
- LightThinker: Thinking Step-by-Step Compression [Paper]
- The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities [Paper]
- Reasoning with Latent Thoughts: On the Power of Looped Transformers [Paper]
- Efficient Reasoning with Hidden Thinking [Paper]
- Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models [Paper]
- Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study [Paper]
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [Paper]
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [Paper]
- MixLLM: Dynamic Routing in Mixed Large Language Models [Paper]
- PEARL: Towards Permutation-Resilient LLMs [Paper]
- Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [Paper]
- Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [Paper]
- Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [Paper]
- Training Large Language Models to be Better Rule Followers [Paper]
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [Paper]
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [Paper]
- SIFT: Grounding LLM Reasoning in Contexts via Stickers [Paper]
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
- How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [Paper]
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [Paper]
- DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [Paper]
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [Paper]
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [Paper]
- TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation [Paper]
- Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [Paper]
- Entropy-based Exploration Conduction for Multi-step Reasoning [Paper]
- MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion [Paper]
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [Paper]
- ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [Paper]
- Agent models: Internalizing Chain-of-Action Generation into Reasoning models [Paper]
- StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error [Paper]
- Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [Paper]
- Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [Paper]
- Shared Global and Local Geometry of Language Model Embeddings [Paper]
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [Paper]
- Effectively Controlling Reasoning Models through Thinking Intervention [Paper]
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [Paper]
- TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning [Paper]
- Lemmanaid: Neuro-Symbolic Lemma Conjecturing [Paper]
- ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning [Paper]
- Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought [Paper]
- Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [Paper]
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? [Paper]
- Decentralizing AI Memory: SHIMI, a Semantic Hierarchical Memory Index for Scalable Agent Reasoning [Paper]
- Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [Paper]
- Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [Paper]
- RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [Paper]
- Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [Paper]
- Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [Paper]
- Distilling System 2 into System 1 [Paper]
- The Impact of Reasoning Step Length on Large Language Models [Paper]
- What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [Paper]
- When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [Paper]
- System 2 Attention (is something you might need too) [Paper]
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [Paper]
- Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [Paper]
- Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [Paper]
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [Paper]
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
- Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
- Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [Paper]
- Slow Perception: Let's Perceive Geometric Figures Step-by-Step [Paper]
- Diving into Self-Evolving Training for Multimodal Reasoning [Paper]
- Visual Agents as Fast and Slow Thinkers [Paper]
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [Paper]
- RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [Paper]
- Evaluation of OpenAI o1: Opportunities and Challenges of AGI [Paper]
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [Paper]
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [Paper]
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [Paper]
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [Paper]
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [Paper]
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [Paper]
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [Paper]
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [Paper]
- LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [Paper]
- Humanity's Last Exam [Paper]
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [Paper]
- PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [Paper]
- Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [Paper]
- ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [paper]
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [paper]
- MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [paper]
- LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [Paper]
- BIG-Bench Extra Hard [Paper]
- MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [paper]
- MastermindEval: A Simple But Scalable Reasoning Benchmark [paper]
- DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs [paper]
- V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [github]
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [paper]
- S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models [paper]
- When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks [paper]
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [paper]
- MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [paper]
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [paper]
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [paper]
- ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [paper]
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [paper]
- PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [paper]
- Text2World: Benchmarking Large Language Models for Symbolic World Model Generation [paper]
- WebGames: Challenging General-Purpose Web-Browsing AI Agents [paper]
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [paper]
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [paper]
- Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots [paper]
- M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [paper]
- PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [paper]
- Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation [paper]
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [paper]
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [paper]
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation [paper]
- Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios [paper]
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [paper]
- Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities [paper]
- Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [paper]
- MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems [paper]
- LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? [paper]
- On the Measure of Intelligence [paper]
- Competition-Level Code Generation with AlphaCode [paper]
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [paper]
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [paper]
- Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning [paper]
- Let's Verify Step by Step [paper]
- MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation [paper]
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [paper]
- LiveBench: A Challenging, Contamination-Limited LLM Benchmark [paper]
- JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [paper]
- MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [paper]
- Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [paper]
- AIME 2025 [huggingface]
- ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning [paper]
- ProBench: Benchmarking Large Language Models in Competitive Programming [paper]
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [paper]
- DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization [paper]
- QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? [paper]
- Benchmarking Reasoning Robustness in Large Language Models [paper]
- Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges [paper]
- Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [paper]
- RewardBench: Evaluating Reward Models for Language Modeling [paper]
- Evaluating LLMs at Detecting Errors in LLM Responses [paper]
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [paper]
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [paper]
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models via Error Detection [paper]
- ProcessBench: Identifying Process Errors in Mathematical Reasoning [paper]
- MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes [paper]
- CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [paper]
- Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [paper]
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [paper]
- Measuring Faithfulness in Chain-of-Thought Reasoning [Blog]
- Deliberative Alignment: Reasoning Enables Safer Language Models [Paper]
- OpenAI trained o1 and o3 to ‘think’ about its safety policy [Blog]
- Why AI Safety Researchers Are Worried About DeepSeek [Blog]
- OverThink: Slowdown Attacks on Reasoning LLMs [Paper]
- GuardReasoner: Towards Reasoning-based LLM Safeguards [Paper]
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
- ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [Paper]
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
- H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [Paper]
- BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [Paper]
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [Paper]
- Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [Blog]
- Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [Paper]
- DeepSeek-R1 Thoughtology: Let's `<think>` about LLM Reasoning [Paper]
- STAR-1: Safer Alignment of Reasoning LLMs with 1K Data [Paper]
- Open R1 Video [github]
- R1-Vision: Let's first take a look at the image [github]
- MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [paper]
- Efficient-R1-VLLM: Efficient RL-Tuned MoE Vision-Language Model For Reasoning [github]
- MMR1: Advancing the Frontiers of Multimodal Reasoning [github]
- Skywork-R1V: Pioneering Multimodal Reasoning with CoT [github]
- VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Blog]
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [paper]
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [paper]
- MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [paper]
- R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [paper]
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [paper]
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [paper]
- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [paper]
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [paper]
- TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [paper]
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [paper]
- Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [paper]
- Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [paper]
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [paper]
- VLAA-Thinking [github]
- SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [paper]
- Perception-R1: Pioneering Perception Policy with Reinforcement Learning [paper]
- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [paper]
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [paper]
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [paper]
If you find this work useful, please consider citing us.
@misc{li202512surveyreasoning,
title={From System 1 to System 2: A Survey of Reasoning Large Language Models},
author={Zhong-Zhi Li and Duzhen Zhang and Ming-Liang Zhang and Jiaxin Zhang and Zengyan Liu and Yuxuan Yao and Haotian Xu and Junhao Zheng and Pei-Jie Wang and Xiuyi Chen and Yingying Zhang and Fei Yin and Jiahua Dong and Zhijiang Guo and Le Song and Cheng-Lin Liu},
year={2025},
eprint={2502.17419},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.17419},
}