From d86526d83351c84f1b33e145ac233af9ea1276d2 Mon Sep 17 00:00:00 2001
From: NanoCode012
Date: Thu, 13 Mar 2025 19:58:24 +0700
Subject: [PATCH] chore(docs): add cookbook/blog link to docs

---
 docs/lora_optims.qmd      | 4 ++++
 docs/reward_modelling.qmd | 4 ++++
 docs/rlhf.qmd             | 4 ++++
 3 files changed, 12 insertions(+)

diff --git a/docs/lora_optims.qmd b/docs/lora_optims.qmd
index 8bee20402e..a7555a0a33 100644
--- a/docs/lora_optims.qmd
+++ b/docs/lora_optims.qmd
@@ -66,6 +66,10 @@ logic to be compatible with more of them.
 
+::: {.callout-tip}
+Check out our [LoRA optimizations blog](https://axolotlai.substack.com/p/accelerating-lora-fine-tuning-with).
+:::
+
 ## Usage
 
 These optimizations can be enabled in your Axolotl config YAML file. The
diff --git a/docs/reward_modelling.qmd b/docs/reward_modelling.qmd
index c9ac5f8019..386dc1f572 100644
--- a/docs/reward_modelling.qmd
+++ b/docs/reward_modelling.qmd
@@ -41,6 +41,10 @@ Bradley-Terry chat templates expect single-turn conversations in the following f
 
 ### Process Reward Models (PRM)
 
+::: {.callout-tip}
+Check out our [PRM blog](https://axolotlai.substack.com/p/process-reward-models).
+:::
+
 Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
 ```yaml
 base_model: Qwen/Qwen2.5-3B
diff --git a/docs/rlhf.qmd b/docs/rlhf.qmd
index 773b159e84..ac1cf03938 100644
--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -497,6 +497,10 @@ The input format is a simple JSON input with customizable fields based on the ab
 
 ### GRPO
 
+::: {.callout-tip}
+Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
+:::
+
 GRPO uses custom reward functions and transformations. Please have them
 ready locally. For ex, to load OpenAI's GSM8K and use a random reward for
 completions:
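
The final hunk ends mid-sentence; the upstream doc continues with the reward-function example it refers to. Purely as an illustration of the "random reward for completions" idea mentioned there (not the actual content of `docs/rlhf.qmd`), a minimal sketch might look like the following, assuming a TRL-style reward callable that receives a batch of completions and returns one score per completion; the file name `rewards.py` and the function name are hypothetical:

```python
# rewards.py -- hypothetical module name; keep it somewhere importable from your config.
import random


def random_reward_func(completions, **kwargs):
    """Return a random reward in [0, 1] for each completion.

    This provides no useful learning signal; it only serves as a smoke test
    that the GRPO pipeline can call a locally defined reward function.
    """
    return [random.random() for _ in completions]
```

A real reward function for GSM8K would instead parse each completion and score it against the dataset's reference answer.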