
Commit 45b8d1c

docs: update
1 parent 5f14183 commit 45b8d1c

File tree

7 files changed: +18 -18 lines changed


static/CoMLRL/docs/dev/changelog/index.html

Lines changed: 4 additions & 4 deletions
Large diffs are not rendered by default.

static/CoMLRL/docs/dev/index.xml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@
 <h2 id="version-126">Version 1.2.6<a class="anchor" href="#version-126">#</a></h2>
 <p>The first release of CoMLRL:</p>
 <ul>
-<li>Including MAGRPO, MAREINFORCE, MARLOO, MAREMAX, and IPPO trainers for multi-agent reinforcement learning with LLMs.</li>
+<li>Including MAGRPO, MAREINFORCE, MARLOO, MAREMAX, and IAC trainers for multi-agent reinforcement learning with LLMs.</li>
 <li>Support for multi-turn training with custom external feedback mechanisms.</li>
 <li>LLM collaboration environments for various tasks.</li>
 <li>Comprehensive documentation and examples for getting started.</li>
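
The changelog entry above lists the trainers shipped in this release. As a rough orientation only, a usage sketch might look like the following; `comlrl.trainers.magrpo.MAGRPOTrainer` and `MAGRPOConfig` are named elsewhere in these docs, but every constructor argument, model name, and method call shown here is an assumption, not the documented API.

```python
# Hypothetical sketch only: MAGRPOTrainer / MAGRPOConfig are referenced in the
# docs, but the arguments and method names below are assumptions.
from comlrl.trainers.magrpo import MAGRPOTrainer, MAGRPOConfig


def joint_reward(responses):
    """Toy joint reward: +1 when all agents produce the same response (illustrative only)."""
    return float(len(set(responses)) == 1)


config = MAGRPOConfig()                        # assumed: default configuration
trainer = MAGRPOTrainer(                       # assumed keyword arguments
    models=["Qwen/Qwen2.5-0.5B-Instruct"] * 2, # placeholder model IDs, two agents
    reward_fn=joint_reward,                    # assumed: joint reward over all agents' outputs
    config=config,
)
trainer.train()                                # assumed entry point
```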

static/CoMLRL/docs/user-guide/index.xml

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G}
 \]</div><link rel="stylesheet" href="../../katex/katex.min.css" /><script defer src="../../katex/katex.min.js"></script><script defer src="../../katex/auto-render.min.js" onload="renderMathInElement(document.body, {"delimiters":[{"left":"$$","right":"$$","display":true},{"left":"\\(","right":"\\)","display":false},{"left":"\\[","right":"\\]","display":true},{"left":"\\begin{equation}","right":"\\end{equation}","display":true},{"left":"\\begin{align}","right":"\\end{align}","display":true},{"left":"\\begin{gather}","right":"\\end{gather}","display":true}],"throwOnError":false});"></script>
 <blockquote class="book-hint success">
 &lt;p&gt;These classes are derived from &lt;code&gt;comlrl.trainers.magrpo.MAGRPOTrainer&lt;/code&gt;. Interfaces for the trainer and configuration classes are the same as &lt;code&gt;MAGRPOTrainer&lt;/code&gt; and &lt;code&gt;MAGRPOConfig&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Multi-Agent PPO</title><link>/docs/user-guide/ppo-finetuning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/user-guide/ppo-finetuning/</guid><description>&lt;p&gt;PPO is a widely used policy gradient method that employs generalized advantage estimation to estimate advantages, reducing the high variance and long rollout times in Monte Carlo methods, e.g., REINFORCE. PPO has also been used for LLM fine-tuning, e.g., &lt;a href="https://huggingface.co/docs/trl/main/en/ppo_trainer"&gt;trl&lt;/a&gt;, &lt;a href="https://verl.readthedocs.io/en/latest/algo/ppo.html"&gt;verl&lt;/a&gt;, &lt;a href="https://llamafactory.readthedocs.io/en/latest/advanced/trainers.html#ppo"&gt;LLaMA Factory&lt;/a&gt;.&lt;/p&gt;
-&lt;h2 id="ippo"&gt;IPPO&lt;a class="anchor" href="#ippo"&gt;#&lt;/a&gt;&lt;/h2&gt;
-&lt;p&gt;Independent PPO (&lt;a href="https://arxiv.org/abs/2011.09533"&gt;IPPO&lt;/a&gt;) optimizes each agent&amp;rsquo;s policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic, other agents serve as part of the environment. The policy objective is:&lt;/p&gt;</description></item><item><title>Multi-Turn Training</title><link>/docs/user-guide/multi-turn/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/user-guide/multi-turn/</guid><description>&lt;p&gt;Many complex problems cannot be solved in a single turn. Agents need to interact with the environment to obtain useful feedback from other models or tools involved in the system, enabling iterative refinement and exploration of multiple solution paths.&lt;/p&gt;
+&lt;h2 id="iac"&gt;IAC&lt;a class="anchor" href="#iac"&gt;#&lt;/a&gt;&lt;/h2&gt;
+&lt;p&gt;Independent PPO (&lt;a href="https://arxiv.org/pdf/1705.08926"&gt;IAC&lt;/a&gt;) optimizes each agent&amp;rsquo;s policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic, other agents serve as part of the environment. The policy objective is:&lt;/p&gt;</description></item><item><title>Multi-Turn Training</title><link>/docs/user-guide/multi-turn/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/user-guide/multi-turn/</guid><description>&lt;p&gt;Many complex problems cannot be solved in a single turn. Agents need to interact with the environment to obtain useful feedback from other models or tools involved in the system, enabling iterative refinement and exploration of multiple solution paths.&lt;/p&gt;
 &lt;h2 id="multi-turn-magrpo"&gt;Multi-Turn MAGRPO&lt;a class="anchor" href="#multi-turn-magrpo"&gt;#&lt;/a&gt;&lt;/h2&gt;
 &lt;p&gt;MAGRPO in the multi-turn setting (&lt;strong&gt;MAGRPO-MT&lt;/strong&gt;) forms a tree-structured rollout expansion where branches represent different joint responses (&lt;a href="https://arxiv.org/abs/2506.05183"&gt;TreeRPO&lt;/a&gt;).&lt;/p&gt;
 &lt;p align="center"&gt;
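
The renamed IAC section in this hunk describes each agent keeping its own actor and critic while being trained on the joint return, with the other agents treated as part of the environment. A minimal, generic sketch of that update pattern, not CoMLRL's actual trainer code; all class and variable names here are made up for illustration.

```python
# Minimal independent actor-critic (IAC-style) sketch: each agent has its own
# actor and critic, updated independently against a shared joint return.
import torch
import torch.nn as nn


class Agent:
    def __init__(self, obs_dim, n_actions):
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.opt = torch.optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=3e-4
        )

    def act(self, obs):
        dist = torch.distributions.Categorical(logits=self.actor(obs))
        action = dist.sample()
        return action, dist.log_prob(action)

    def update(self, obs, log_prob, joint_return):
        # Advantage uses only this agent's critic, but the return is the joint
        # return produced by all agents acting together.
        value = self.critic(obs).squeeze(-1)
        advantage = joint_return - value.detach()
        policy_loss = -(log_prob * advantage)
        value_loss = (joint_return - value) ** 2
        loss = (policy_loss + 0.5 * value_loss).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()


# Toy usage: random observations and a pretend joint return for one step.
agents = [Agent(obs_dim=8, n_actions=4) for _ in range(2)]
obs = [torch.randn(1, 8) for _ in agents]
steps = [agent.act(o) for agent, o in zip(agents, obs)]
joint_return = torch.tensor([1.0])  # shared reward of the joint response
for agent, o, (action, log_prob) in zip(agents, obs, steps):
    agent.update(o, log_prob, joint_return)
```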
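
The same hunk also mentions MAGRPO-MT's tree-structured rollout, where each turn branches into several candidate joint responses. A toy data-structure sketch of that expansion; the names and structure are illustrative and not taken from CoMLRL.

```python
# Toy tree-structured multi-turn rollout: each node holds one joint response
# and branches into several candidate joint responses at the next turn.
from dataclasses import dataclass, field


@dataclass
class RolloutNode:
    joint_response: tuple[str, ...]                       # one response per agent
    children: list["RolloutNode"] = field(default_factory=list)


def expand(node: RolloutNode, sample_joint, branching: int, depth: int) -> None:
    """Recursively grow the rollout tree for `depth` more turns."""
    if depth == 0:
        return
    for _ in range(branching):
        child = RolloutNode(joint_response=sample_joint(node.joint_response))
        node.children.append(child)
        expand(child, sample_joint, branching, depth - 1)


# Usage with a stand-in sampler that just appends a turn marker.
root = RolloutNode(joint_response=("prompt", "prompt"))
expand(root, lambda parent: tuple(r + "+turn" for r in parent), branching=2, depth=2)
print(len(root.children))  # 2 branches at the first turn
```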
