This repository is a reading list on Reinforcement Learning from Human Feedback for Mathematical Reasoning (RLHF4Math). It collects related papers, blog posts, datasets, and classic books.
Contributors: yehangcheng @XXX, Wenhao Yan @NEU, Chen Bo @XXX
If you have any suggestions or notice something we missed, please don't hesitate to let us know. You can email Wenhao Yan (wenhao19990616@gmail.com) directly, or open an issue on this repo.
This section mainly collects surveys on RLHF for mathematical reasoning, but also includes surveys from closely related areas.
- Reasoning with Language Model Prompting: A Survey, [paper]
- Towards Reasoning in Large Language Models: A Survey, [paper]
- A Survey of Deep Learning for Mathematical Reasoning, [paper]
- A Survey on In-context Learning, [paper]
- A Survey on Transformers in Reinforcement Learning, [paper]
This section covers RLHF for mathematical reasoning; a short sketch contrasting outcome- and process-based reward signals follows the paper list.
- Solving Math Word Problems via Cooperative Reasoning induced Language Models, [paper]
- Interpretable Math Word Problem Solution Generation via Step-by-step Planning, [paper]
- Compositional Mathematical Encoding for Math Word Problems, [paper]
- Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, [paper]
- GeoDRL: A Self-Learning Framework for Geometry Problem Solving using Reinforcement Learning in Deductive Reasoning, [paper]
- A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models, [paper]
- Math Word Problem Solving by Generating Linguistic Variants of Problem Statements, [paper]
- Making Language Models Better Reasoners with Step-Aware Verifier, [paper]
- ALERT: Adapting Language Models to Reasoning Tasks, [paper]
- Distilling Reasoning Capabilities into Smaller Language Models, [paper]
- Preference Ranking Optimization for Human Alignment, [paper]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, [paper]
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning, [paper]
- Self-Instruct: Aligning Language Models with Self-Generated Instructions, [paper]
- Offline Reinforcement Learning with Implicit Q-Learning, [paper]
- Offline RL for Natural Language Generation with Implicit Language Q Learning, [paper]
- ⭐ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, [paper]
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, [paper]
- ⭐ Let's Verify Step by Step, [paper]
- MathPrompter: Mathematical Reasoning using Large Language Models, [paper]
- Measuring Mathematical Problem Solving With the MATH Dataset, [paper]
- ⭐ Solving math word problems with process- and outcome-based feedback, [paper]
- ⭐ STaR (Self-Taught Reasoner): Bootstrapping Reasoning With Reasoning, [paper]
- ⭐ Training Verifiers to Solve Math Word Problems, [paper]
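As a quick illustration of the distinction several of the starred papers above draw (e.g., Training Verifiers to Solve Math Word Problems vs. Let's Verify Step by Step), here is a minimal sketch of outcome-based vs. process-based reward signals. The helper names and the trivial step scorer are purely illustrative assumptions, not code from any of the papers; real systems use trained reward models for both signals.

```python
# Minimal sketch (illustrative only): outcome-supervised vs. process-supervised rewards.
from typing import Callable, List

def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """ORM-style signal: reward only the final answer (1.0 if it matches the reference)."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def process_reward(steps: List[str], score_step: Callable[[str], float]) -> float:
    """PRM-style signal: average per-step scores in [0, 1] from a step-level verifier."""
    if not steps:
        return 0.0
    return sum(score_step(s) for s in steps) / len(steps)

# Toy usage with a dummy step scorer; a trained process reward model would replace it.
solution_steps = ["48 / 2 = 24", "24 + 3 = 27"]
print(outcome_reward("27", "27"))                     # 1.0
print(process_reward(solution_steps, lambda s: 1.0))  # 1.0
```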
This section covers other research and general content related to RLHF beyond math reasoning.
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, [paper]
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, [paper]
- Is Reinforcement Learning (Not) for Natural Language Processing? Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization, [paper]
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct, [paper]
- Constitutional AI: Harmlessness from AI Feedback, [paper]
This section covers the RL foundations needed for RLHF; PPO's clipped objective is reproduced after the list.
- Trust Region Policy Optimization, [paper]
- ⭐ Proximal Policy Optimization Algorithms, [paper]
- Mastering the Game of Go without Human Knowledge, [paper]
- Deep reinforcement learning from human preferences, [paper]
- Thinking Fast and Slow with Deep Learning and Tree Search, [paper]
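For readers new to PPO (starred above), the clipped surrogate objective from the Proximal Policy Optimization paper is:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range. In RLHF the same objective is typically applied with the language model as the policy and a learned reward model (often plus a KL penalty to a reference model) supplying the reward.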
This section covers the NLP foundations needed for RLHF.
- add title, [[paper](add link)]
This section covers supplementary content on LLMs.
- Improving Language Understanding by Generative Pre-Training, [paper]
- Language Models are Unsupervised Multitask Learners, [paper]
- Language Models are Few-Shot Learners, [paper]
- GPT-4 Technical Report, [paper]
- ⭐ Training language models to follow instructions with human feedback, [paper]
- ⭐ GLM: General Language Model Pretraining with Autoregressive Blank Infilling, [paper]
- ⭐ GLM-130B: An Open Bilingual Pre-trained Model, [paper]
- ⭐ [blog] Lil'Log, [RL, NLP, and so on]
- ⭐ [blog] Yao Fu's Blog, [Open-source the knowledge]
- [docs] Stable-Baselines3 Docs, [Reliable Reinforcement Learning Implementations]
- [docs] OpenAI Spinning Up, [An educational resource produced by OpenAI to make learning RL easier]
- [book] Reinforcement Learning: An Introduction (Chinese Version), [A classic work in RL]
- Anthropic Helpful and Harmless RLHF (HH-RLHF), reward modeling [git] [paper] (see the loading sketch after this list)
- OpenAI Summarize, reward modeling [git] [paper]
- OpenAI WebGPT, reward modeling [blog] [paper] [hugging face]
- stack-exchange-preferences, reward modeling [hugging face]
- Stanford Human Preferences Dataset (SHP), reward modeling [hugging face]
- TruthfulQA, evaluates truthfulness [hugging face]
- ToxiGen, evaluates toxicity [hugging face]
- Bias in Open-ended Language Generation Dataset (BOLD), evaluates fairness [hugging face]
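Most of the preference datasets above can be pulled through the Hugging Face `datasets` library; below is a minimal loading sketch. The Hub IDs and field names are best-effort assumptions and may differ from the current dataset cards, so check each card before relying on them.

```python
# Minimal sketch: loading two of the preference datasets above with Hugging Face `datasets`.
# Hub IDs ("Anthropic/hh-rlhf", "stanfordnlp/SHP") and field names are assumptions;
# consult each dataset card for the exact identifiers, splits, and schemas.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")   # chosen/rejected dialogue pairs
shp = load_dataset("stanfordnlp/SHP", split="train")    # posts with preferred vs. dispreferred replies

# Pairwise reward-model training typically consumes one preferred and one
# dispreferred continuation per example.
print(hh[0]["chosen"][:200])
print(hh[0]["rejected"][:200])
print(shp[0]["history"][:200])
```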
- [WizardLM] WizardLM: An Instruction-following LLM Using Evol-Instruct, [git]
- [Self-Instruct] Self-Instruct: Aligning LM with Self Generated Instructions, [git]
- [ConstitutionalHarmlessness] Constitutional AI: Harmlessness from AI Feedback, [git]
- [project-name] project full name, [where]
If you find this repo useful, please kindly cite the survey below:
@article{lu2022dl4math,
  title={A Survey of Deep Learning for Mathematical Reasoning},
  author={Lu, Pan and Qiu, Liang and Yu, Wenhao and Welleck, Sean and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2212.10535},
  year={2022}
}