
Conversation

@pseudo-rnd-thoughts (Member)

Description

Updates the IMPALA examples and premerge tests to use CartPole and TicTacToe.
We keep only a minimal set of examples, since most users should use PPO or APPO rather than IMPALA.
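
For orientation, the kind of minimal config these examples revolve around looks roughly like the sketch below. This is a hedged illustration, not a copy of the scripts added in this PR, and API names differ across Ray versions (`IMPALAConfig` vs. `ImpalaConfig`, `.env_runners()` vs. `.rollouts()`):

from ray.rllib.algorithms.impala import IMPALAConfig  # `ImpalaConfig` on older Ray

config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=2)  # `.rollouts(num_rollout_workers=2)` on older Ray
    .training(lr=0.0005)
)
algo = config.build()
for _ in range(5):
    result = algo.train()
    # On the new API stack the mean episode return is reported under
    # result["env_runners"]["episode_return_mean"]; older stacks use
    # result["episode_reward_mean"].
    print(result.get("env_runners", {}).get("episode_return_mean"))
algo.stop()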

Mark Towers and others added 30 commits November 21, 2025 17:06
Mark Towers added 9 commits December 17, 2025 16:16

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly improves the IMPALA and APPO examples and associated tests. The changes include refactoring release tests for clarity, renaming cluster configuration files for better descriptiveness, and replacing old example scripts with new, well-documented ones. The new examples for APPO and IMPALA are a great addition.

My review focuses on ensuring consistency in the new documentation and configurations. I've pointed out a few places where comments, tables, and default values are out of sync. Addressing these will improve the maintainability and usability of these examples and tests. Specifically, I've noted:

  • Inconsistent test matrices for APPO between release_tests.yaml and BUILD.bazel.
  • A copy-paste error in the IMPALA test matrix header in BUILD.bazel.
  • Several new example scripts have placeholders or outdated information in their docstrings regarding expected results.

Overall, this is a valuable cleanup and improvement. Great work!

Comment on lines +2023 to +2031
# | APPO (14 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) Single Node | Multi (GPU) Multi Node |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------------------|------------------------|
# | (SA/D) Cartpole | 450 | ✅ | ✅ | ❌ | ❌ | ❌ |
# | (SA/D/LSTM) Stateless Cartpole | 350 | ✅ | ❌ | ✅ | ❌ | ❌ |
# | (MA/D) TicTacToe | 0 | ✅ | ✅ | ❌ | ✅ | ❌ |
# | (SA/D) Atari (Pong) | 18 | ❌ | ❌ | ❌ | ✅ | ✅ |
# | (SA/C) IsaacLab (Humanoid) | ?? | ❌ | ✅ (with 1 GPU) | ⚠️ | ❌ | ❌ |
# | (MA/D) Footsies | ?? | ❌ | ❌ | ✅ | ❌ | ❌ |

medium

The APPO test matrix table here seems to be inconsistent with the one in rllib/BUILD.bazel. For example, the success criterion for CartPole is 450 here, but 200 in BUILD.bazel. Similarly for Stateless Cartpole (350 vs 150), TicTacToe (0 vs -0.2), and Atari (18 vs 5). Also, the column structure is different. It would be great to unify these tables to avoid confusion and improve maintainability.

Comment on lines +671 to +675
# | APPO (14 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------|
# | (SA/D) Cartpole | 200 | ✅ | ❌ | ❌ | ❌ |
# | (MA/D) TicTacToe | -2.0 | ❌ | ❌ | ❌ | ✅ |

medium

It seems there's a copy-paste error in the comment table for IMPALA tests. The header says "APPO (14 total tests)" but the tests below are for IMPALA. This should be corrected to "IMPALA".

Suggested change
# | APPO (14 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------|
# | (SA/D) Cartpole | 200 | ✅ | ❌ | ❌ | ❌ |
# | (MA/D) TicTacToe | -2.0 | ❌ | ❌ | ❌ | ✅ |
# | IMPALA (2 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------|
# | (SA/D) Cartpole | 200 | ✅ | ❌ | ❌ | ❌ |
# | (MA/D) TicTacToe | -2.0 | ❌ | ❌ | ❌ | ✅ |

Comment on lines +47 to +49
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).

medium

The docstring mentions that the algorithm should reach a reward threshold of "XX on Breakout". This seems to be a placeholder. Could you please update "XX" with the actual expected reward (e.g., 18.0 as per default_reward) and change "Breakout" to "Pong" to match the default environment for this script?

Suggested change
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
The algorithm should reach the default reward threshold of 18.0 on Pong
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
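
As background for the `default_reward` / `default_timesteps` references above: RLlib's example scripts typically declare these stopping criteria through a shared argparse helper, roughly as in this hedged sketch (helper name and signature may differ between Ray versions; the values simply mirror the suggestion above):

from ray.rllib.utils.test_utils import add_rllib_example_script_args

parser = add_rllib_example_script_args(
    default_reward=18.0,           # stop once the mean episode return reaches 18.0 (Pong)
    default_timesteps=10_000_000,  # or after 10M env timesteps, whichever comes first
)
args = parser.parse_args()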

Comment on lines +56 to +58
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The number of environment steps can be changed through

medium

The "Results to expect" section in the docstring seems to contain information for HalfCheetah ("reward threshold of 9000.0 within 2 million timesteps"), but the script defaults are for Humanoid-v4 (default_reward=800.0, default_timesteps=3_000_000). This looks like a copy-paste from the old halfcheetah_appo.py example. Please update the docstring to reflect the correct environment and expected results for Humanoid-v4.

Suggested change
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The number of environment steps can be changed through
The algorithm should reach the default reward threshold of 800.0 on Humanoid-v4
within 3 million timesteps (see: `default_timesteps` in the code).

Comment on lines +51 to +54
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.

medium

The docstring mentions an expected reward of 300.0 within 2 million timesteps, but the script's defaults are default_reward=350.0 and default_timesteps=5_000_000. Please update the docstring to be consistent with the code defaults.

Suggested change
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.
The algorithm should reach the default reward threshold of 350.0 (500.0 is the
maximum) within approximately 5 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.

Comment on lines +49 to +52
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their

medium

The docstring has a placeholder "XX" for the expected mean return and an outdated value for default_timesteps. Could you please replace "XX" with the actual expected value (e.g., -0.5) and update the number of timesteps to match the script's default of 10,000,000?

Suggested change
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
Training will run for 10 million timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of -0.5 compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
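
For readers unfamiliar with how a "p0 versus a random policy" setup like the one this docstring describes is typically wired up, here is a hedged sketch using RLlib's multi-agent config (the environment ID, agent IDs, and policy names are illustrative placeholders, not taken from the PR):

from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("tictactoe_env")  # placeholder; the real script registers its own env
    .multi_agent(
        # Two policies: "p0" learns, "random" acts as the fixed baseline opponent.
        policies={"p0", "random"},
        # Map each agent ID in the env to one of the two policies.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "p0" if agent_id == "player_0" else "random"
        ),
        # Only "p0" is optimized, so its return against the random opponent is
        # the metric the docstring's threshold refers to.
        policies_to_train=["p0"],
    )
)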
