
Commit 45b8d1c

docs: update
1 parent 5f14183 commit 45b8d1c

File tree

7 files changed: +18 -18 lines changed


static/CoMLRL/docs/dev/changelog/index.html

Lines changed: 4 additions & 4 deletions
Large diffs are not rendered by default.

static/CoMLRL/docs/dev/index.xml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@
 <h2 id="version-126">Version 1.2.6<a class="anchor" href="#version-126">#</a></h2>
 <p>The first release of CoMLRL:</p>
 <ul>
-<li>Including MAGRPO, MAREINFORCE, MARLOO, MAREMAX, and IPPO trainers for multi-agent reinforcement learning with LLMs.</li>
+<li>Including MAGRPO, MAREINFORCE, MARLOO, MAREMAX, and IAC trainers for multi-agent reinforcement learning with LLMs.</li>
 <li>Support for multi-turn training with custom external feedback mechanisms.</li>
 <li>LLM collaboration environments for various tasks.</li>
 <li>Comprehensive documentation and examples for getting started.</li>
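
The changelog entry above lists the trainers shipped in this release. As a rough orientation only, a usage sketch might look like the following; `comlrl.trainers.magrpo.MAGRPOTrainer` and `MAGRPOConfig` are named elsewhere in these docs, but every constructor argument, model name, and method call shown here is an assumption, not the documented API.

```python
# Hypothetical sketch only: MAGRPOTrainer / MAGRPOConfig are referenced in the
# docs, but the arguments and method names below are assumptions.
from comlrl.trainers.magrpo import MAGRPOTrainer, MAGRPOConfig


def joint_reward(responses):
    """Toy joint reward: +1 when all agents produce the same response (illustrative only)."""
    return float(len(set(responses)) == 1)


config = MAGRPOConfig()                        # assumed: default configuration
trainer = MAGRPOTrainer(                       # assumed keyword arguments
    models=["Qwen/Qwen2.5-0.5B-Instruct"] * 2, # placeholder model IDs, two agents
    reward_fn=joint_reward,                    # assumed: joint reward over all agents' outputs
    config=config,
)
trainer.train()                                # assumed entry point
```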

static/CoMLRL/docs/user-guide/index.xml

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G}
 \]</div><link rel="stylesheet" href="../../katex/katex.min.css" /><script defer src="../../katex/katex.min.js"></script><script defer src="../../katex/auto-render.min.js" onload="renderMathInElement(document.body, {"delimiters":[{"left":"$$","right":"$$","display":true},{"left":"\\(","right":"\\)","display":false},{"left":"\\[","right":"\\]","display":true},{"left":"\\begin{equation}","right":"\\end{equation}","display":true},{"left":"\\begin{align}","right":"\\end{align}","display":true},{"left":"\\begin{gather}","right":"\\end{gather}","display":true}],"throwOnError":false});"></script>
 <blockquote class="book-hint success">
 &lt;p&gt;These classes are derived from &lt;code&gt;comlrl.trainers.magrpo.MAGRPOTrainer&lt;/code&gt;. Interfaces for the trainer and configuration classes are the same as &lt;code&gt;MAGRPOTrainer&lt;/code&gt; and &lt;code&gt;MAGRPOConfig&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Multi-Agent PPO</title><link>/docs/user-guide/ppo-finetuning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/user-guide/ppo-finetuning/</guid><description>&lt;p&gt;PPO is a widely used policy gradient method that employs generalized advantage estimation to estimate advantages, reducing the high variance and long rollout times in Monte Carlo methods, e.g., REINFORCE. PPO has also been used for LLM fine-tuning, e.g., &lt;a href="https://huggingface.co/docs/trl/main/en/ppo_trainer"&gt;trl&lt;/a&gt;, &lt;a href="https://verl.readthedocs.io/en/latest/algo/ppo.html"&gt;verl&lt;/a&gt;, &lt;a href="https://llamafactory.readthedocs.io/en/latest/advanced/trainers.html#ppo"&gt;LLaMA Factory&lt;/a&gt;.&lt;/p&gt;
-&lt;h2 id="ippo"&gt;IPPO&lt;a class="anchor" href="#ippo"&gt;#&lt;/a&gt;&lt;/h2&gt;
-&lt;p&gt;Independent PPO (&lt;a href="https://arxiv.org/abs/2011.09533"&gt;IPPO&lt;/a&gt;) optimizes each agent&amp;rsquo;s policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic, other agents serve as part of the environment. The policy objective is:&lt;/p&gt;</description></item><item><title>Multi-Turn Training</title><link>/docs/user-guide/multi-turn/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/user-guide/multi-turn/</guid><description>&lt;p&gt;Many complex problems cannot be solved in a single turn. Agents need to interact with the environment to obtain useful feedback from other models or tools involved in the system, enabling iterative refinement and exploration of multiple solution paths.&lt;/p&gt;
+&lt;h2 id="iac"&gt;IAC&lt;a class="anchor" href="#iac"&gt;#&lt;/a&gt;&lt;/h2&gt;
+&lt;p&gt;Independent PPO (&lt;a href="https://arxiv.org/pdf/1705.08926"&gt;IAC&lt;/a&gt;) optimizes each agent&amp;rsquo;s policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic, other agents serve as part of the environment. The policy objective is:&lt;/p&gt;</description></item><item><title>Multi-Turn Training</title><link>/docs/user-guide/multi-turn/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/user-guide/multi-turn/</guid><description>&lt;p&gt;Many complex problems cannot be solved in a single turn. Agents need to interact with the environment to obtain useful feedback from other models or tools involved in the system, enabling iterative refinement and exploration of multiple solution paths.&lt;/p&gt;
 &lt;h2 id="multi-turn-magrpo"&gt;Multi-Turn MAGRPO&lt;a class="anchor" href="#multi-turn-magrpo"&gt;#&lt;/a&gt;&lt;/h2&gt;
 &lt;p&gt;MAGRPO in the multi-turn setting (&lt;strong&gt;MAGRPO-MT&lt;/strong&gt;) forms a tree-structured rollout expansion where branches represent different joint responses (&lt;a href="https://arxiv.org/abs/2506.05183"&gt;TreeRPO&lt;/a&gt;).&lt;/p&gt;
 &lt;p align="center"&gt;
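
The renamed IAC section in this hunk describes each agent keeping its own actor and critic while being trained on the joint return, with the other agents treated as part of the environment. A minimal, generic sketch of that update pattern, not CoMLRL's actual trainer code; all class and variable names here are made up for illustration.

```python
# Minimal independent actor-critic (IAC-style) sketch: each agent has its own
# actor and critic, updated independently against a shared joint return.
import torch
import torch.nn as nn


class Agent:
    def __init__(self, obs_dim, n_actions):
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.opt = torch.optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=3e-4
        )

    def act(self, obs):
        dist = torch.distributions.Categorical(logits=self.actor(obs))
        action = dist.sample()
        return action, dist.log_prob(action)

    def update(self, obs, log_prob, joint_return):
        # Advantage uses only this agent's critic, but the return is the joint
        # return produced by all agents acting together.
        value = self.critic(obs).squeeze(-1)
        advantage = joint_return - value.detach()
        policy_loss = -(log_prob * advantage)
        value_loss = (joint_return - value) ** 2
        loss = (policy_loss + 0.5 * value_loss).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()


# Toy usage: random observations and a pretend joint return for one step.
agents = [Agent(obs_dim=8, n_actions=4) for _ in range(2)]
obs = [torch.randn(1, 8) for _ in agents]
steps = [agent.act(o) for agent, o in zip(agents, obs)]
joint_return = torch.tensor([1.0])  # shared reward of the joint response
for agent, o, (action, log_prob) in zip(agents, obs, steps):
    agent.update(o, log_prob, joint_return)
```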
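
The same hunk also mentions MAGRPO-MT's tree-structured rollout, where each turn branches into several candidate joint responses. A toy data-structure sketch of that expansion; the names and structure are illustrative and not taken from CoMLRL.

```python
# Toy tree-structured multi-turn rollout: each node holds one joint response
# and branches into several candidate joint responses at the next turn.
from dataclasses import dataclass, field


@dataclass
class RolloutNode:
    joint_response: tuple[str, ...]                       # one response per agent
    children: list["RolloutNode"] = field(default_factory=list)


def expand(node: RolloutNode, sample_joint, branching: int, depth: int) -> None:
    """Recursively grow the rollout tree for `depth` more turns."""
    if depth == 0:
        return
    for _ in range(branching):
        child = RolloutNode(joint_response=sample_joint(node.joint_response))
        node.children.append(child)
        expand(child, sample_joint, branching, depth - 1)


# Usage with a stand-in sampler that just appends a turn marker.
root = RolloutNode(joint_response=("prompt", "prompt"))
expand(root, lambda parent: tuple(r + "+turn" for r in parent), branching=2, depth=2)
print(len(root.children))  # 2 branches at the first turn
```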
