Reinforcement Learning for Mathematical Reasoning (RLHF4Math)

License: MIT

This repository is a reading list on Reinforcement Learning from Human Feedback for Mathematical Reasoning (RLHF4Math). It collects related papers, blog posts, datasets, and classic books.

Contributors: yehangcheng @XXX, Wenhao Yan @NEU, Chen Bo @XXX

🔔 If you have any suggestions or notice something we missed, please don't hesitate to let us know. You can email Wenhao Yan directly (wenhao19990616@gmail.com) or open an issue on this repo.

๐Ÿฆ Papers

Related Surveys

This section mainly collects surveys on RLHF for mathematical reasoning, along with surveys from neighboring areas such as prompting, in-context learning, and reinforcement learning.

  • Reasoning with Language Model Prompting: A Survey, [paper]
  • Towards Reasoning in Large Language Models: A Survey, [paper]
  • A Survey of Deep Learning for Mathematical Reasoning, [paper]
  • A Survey on In-context Learning, [paper]
  • A Survey on Transformers in Reinforcement Learning, [paper]

Math Reasoning

This section covers RLHF and related training and prompting methods for math reasoning; as one concrete example, the DPO objective from the list below is sketched after it.

  • Solving Math Word Problems via Cooperative Reasoning induced Language Models, [paper]
  • Interpretable Math Word Problem Solution Generation via Step-by-step Planning, [paper]
  • Compositional Mathematical Encoding for Math Word Problems, [paper]
  • Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, [paper]
  • GeoDRL: A Self-Learning Framework for Geometry Problem Solving using Reinforcement Learning in Deductive Reasoning, [paper]
  • A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models, [paper]
  • Math Word Problem Solving by Generating Linguistic Variants of Problem Statements, [paper]
  • Making Language Models Better Reasoners with Step-Aware Verifier, [paper]
  • ALERT: Adapting Language Models to Reasoning Tasks, [paper]
  • Distilling Reasoning Capabilities into Smaller Language Models, [paper]
  • Preference Ranking Optimization for Human Alignment, [paper]
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model, [paper]
  • Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning, [paper]
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions, [paper]
  • Offline Reinforcement Learning with Implicit Q-Learning, [paper]
  • Offline RL for Natural Language Generation with Implicit Language Q Learning, [paper]
  • 🍒 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, [paper]
  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, [paper]
  • 🍒 Let's Verify Step by Step, [paper]
  • MathPrompter: Mathematical Reasoning using Large Language Models, [paper]
  • Measuring Mathematical Problem Solving With the MATH Dataset, [paper]
  • ๐Ÿ’ Solving math word problems with process and outcome-based feedback, [paper]
  • ๐Ÿ’ STaR Self-Taught Reasoner Bootstrapping Reasoning With Reasoning, [paper]
  • ๐Ÿ’ Training Verifiers to Solve Math Word Problems, [paper]

RLHF Supplement

This section covers other research and general background on RLHF beyond math; the pairwise reward-modeling objective much of this work shares is sketched after the list.

  • CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, [paper]
  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, [paper]
  • Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization, [paper]
  • WizardCoder: Empowering Code Large Language Models with Evol-Instruct, [paper]
  • Constitutional AI: Harmlessness from AI Feedback, [paper]
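
Most of the pipelines above begin with the same pairwise reward-modeling step: fit a reward model $r_\phi$ on human comparisons so that preferred responses score higher. A sketch of the standard Bradley-Terry style loss, as popularized by the InstructGPT line of work:

$$\mathcal{L}_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

The trained $r_\phi$ then serves as the optimization target for the policy, whether via PPO, reward-ranked fine-tuning (RAFT), or rejection sampling.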

RL Supplement

This section covers the classical RL foundations that RLHF builds on; PPO's clipped objective, the workhorse of most RLHF training loops, is sketched after the list.

  • Trust Region Policy Optimization, [paper]
  • ๐Ÿ’ Proximal Policy Optimization Algorithms, [paper]
  • Mastering the Game of Go without Human Knowledge, [paper]
  • Deep reinforcement learning from human preferences, [paper]
  • Thinking Fast and Slow with Deep Learning and Tree Search, [paper]
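
PPO, listed above, is the policy-gradient algorithm behind most RLHF pipelines. Its clipped surrogate objective, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

The clipping keeps each update close to the previous policy, which is exactly the property RLHF relies on when optimizing a learned (and easily exploited) reward model.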

NLP Supplement

This section covers NLP prerequisites for RLHF.

  • add title, [[paper](add link)]

LLM Supplement

This section covers background papers on large language models; the autoregressive pretraining objective they share is sketched after the list.

  • Improving Language Understanding by Generative Pre-Training, [paper]
  • Language Models are Unsupervised Multitask Learners, [paper]
  • Language Models are Few-Shot Learners, [paper]
  • GPT-4 Technical Report, [paper]
  • ๐Ÿ’ Training language models to follow instructions with human feedback, [paper]
  • ๐Ÿ’ GLM: General Language Model Pretraining with Autoregressive Blank Infilling, [paper]
  • ๐Ÿ’ GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL, [paper]

🎶 Blogs & Docs & Books

๐Ÿฎ Datasets

Math

Other

  • Anthropic Helpful and Harmless RLHF, reward modeling [git] [paper]
  • OpenAI Summarize, reward modeling [git] [paper]
  • OpenAI WebGPT, reward modeling [blog] [paper] [hugging face]
  • stack-exchange-preferences, reward modeling [hugging face]
  • Stanford Human Preferences Dataset (SHP), reward modeling [hugging face]
  • TruthfulQA, evaluate truthfulness [hugging face]
  • ToxiGen, evaluate toxicity [hugging face]
  • Bias in Open-ended Language Generation Dataset (BOLD), evaluate fairness [hugging face]
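
For a quick look at what these preference datasets contain, below is a minimal sketch using the Hugging Face datasets library (an assumption: the library is installed, and the public Anthropic/hh-rlhf dataset ID is used, whose examples pair a "chosen" and a "rejected" conversation):

from datasets import load_dataset

# Anthropic Helpful and Harmless data: each example holds two full
# conversations, one preferred ("chosen") and one dispreferred ("rejected").
hh = load_dataset("Anthropic/hh-rlhf", split="train")

example = hh[0]
print(example["chosen"][:200])    # preferred conversation, truncated
print(example["rejected"][:200])  # dispreferred conversation, truncated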

🎨 Code

  • [WizardLM] WizardLM: An Instruction-following LLM Using Evol-Instruct, [git]
  • [Self-Instruct] Self-Instruct: Aligning LM with Self Generated Instructions, [git]
  • [ConstitutionalHarmlessness] Constitutional AI: Harmlessness from AI Feedback, [git]
  • [project-name] project full name, [where]

Citation

If you find this repo useful, please cite our survey:

@article{lu2022dl4math,
  title={A Survey of Deep Learning for Mathematical Reasoning},
  author={Lu, Pan and Qiu, Liang and Yu, Wenhao and Welleck, Sean and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2212.10535},
  year={2022}
}
