
Conversation

@pseudo-rnd-thoughts (Member)

Description

Updates the IMPALA examples and premerge tests to use CartPole and TicTacToe.
We keep only a minimal set of examples, since most users should use PPO or APPO rather than IMPALA.
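
For orientation, the kind of minimal config these examples revolve around looks roughly like the sketch below. This is a hedged illustration, not a copy of the scripts added in this PR, and API names differ across Ray versions (`IMPALAConfig` vs. `ImpalaConfig`, `.env_runners()` vs. `.rollouts()`):

from ray.rllib.algorithms.impala import IMPALAConfig  # `ImpalaConfig` on older Ray

config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=2)  # `.rollouts(num_rollout_workers=2)` on older Ray
    .training(lr=0.0005)
)
algo = config.build()
for _ in range(5):
    result = algo.train()
    # On the new API stack the mean episode return is reported under
    # result["env_runners"]["episode_return_mean"]; older stacks use
    # result["episode_reward_mean"].
    print(result.get("env_runners", {}).get("episode_return_mean"))
algo.stop()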

Mark Towers and others added 30 commits November 21, 2025 17:06
Mark Towers added 9 commits December 17, 2025 16:16

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly improves the IMPALA and APPO examples and associated tests. The changes include refactoring release tests for clarity, renaming cluster configuration files for better descriptiveness, and replacing old example scripts with new, well-documented ones. The new examples for APPO and IMPALA are a great addition.

My review focuses on ensuring consistency in the new documentation and configurations. I've pointed out a few places where comments, tables, and default values are out of sync. Addressing these will improve the maintainability and usability of these examples and tests. Specifically, I've noted:

  • Inconsistent test matrices for APPO between release_tests.yaml and BUILD.bazel.
  • A copy-paste error in the IMPALA test matrix header in BUILD.bazel.
  • Several new example scripts have placeholders or outdated information in their docstrings regarding expected results.

Overall, this is a valuable cleanup and improvement. Great work!

Comment on lines +2023 to +2031
# | APPO (14 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) Single Node | Multi (GPU) Multi Node |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------------------|------------------------|
# | (SA/D) Cartpole | 450 | ✅ | ✅ | ❌ | ❌ | ❌ |
# | (SA/D/LSTM) Stateless Cartpole | 350 | ✅ | ❌ | ✅ | ❌ | ❌ |
# | (MA/D) TicTacToe | 0 | ✅ | ✅ | ❌ | ✅ | ❌ |
# | (SA/D) Atari (Pong) | 18 | ❌ | ❌ | ❌ | ✅ | ✅ |
# | (SA/C) IsaacLab (Humanoid) | ?? | ❌ | ✅ (with 1 GPU) | ⚠️ | ❌ | ❌ |
# | (MA/D) Footsies | ?? | ❌ | ❌ | ✅ | ❌ | ❌ |

medium

The APPO test matrix table here seems to be inconsistent with the one in rllib/BUILD.bazel. For example, the success criterion for CartPole is 450 here, but 200 in BUILD.bazel. Similarly for Stateless Cartpole (350 vs 150), TicTacToe (0 vs -0.2), and Atari (18 vs 5). Also, the column structure is different. It would be great to unify these tables to avoid confusion and improve maintainability.

Comment on lines +671 to +675
# | APPO (14 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------|
# | (SA/D) Cartpole | 200 | ✅ | ❌ | ❌ | ❌ |
# | (MA/D) TicTacToe | -2.0 | ❌ | ❌ | ❌ | ✅ |

medium

It seems there's a copy-paste error in the comment table for IMPALA tests. The header says "APPO (14 total tests)" but the tests below are for IMPALA. This should be corrected to "IMPALA".

Suggested change
# | APPO (14 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------|
# | (SA/D) Cartpole | 200 | ✅ | ❌ | ❌ | ❌ |
# | (MA/D) TicTacToe | -2.0 | ❌ | ❌ | ❌ | ✅ |
# | IMPALA (2 total tests) | | Number of Learners (Device) |
# | Environment | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------|
# | (SA/D) Cartpole | 200 | ✅ | ❌ | ❌ | ❌ |
# | (MA/D) TicTacToe | -2.0 | ❌ | ❌ | ❌ | ✅ |

Comment on lines +47 to +49
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).

medium

The docstring mentions that the algorithm should reach a reward threshold of "XX on Breakout". This seems to be a placeholder. Could you please update "XX" with the actual expected reward (e.g., 18.0 as per default_reward) and change "Breakout" to "Pong" to match the default environment for this script?

Suggested change
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
The algorithm should reach the default reward threshold of 18.0 on Pong
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
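
As background for the `default_reward` / `default_timesteps` references above: RLlib's example scripts typically declare these stopping criteria through a shared argparse helper, roughly as in this hedged sketch (helper name and signature may differ between Ray versions; the values simply mirror the suggestion above):

from ray.rllib.utils.test_utils import add_rllib_example_script_args

parser = add_rllib_example_script_args(
    default_reward=18.0,           # stop once the mean episode return reaches 18.0 (Pong)
    default_timesteps=10_000_000,  # or after 10M env timesteps, whichever comes first
)
args = parser.parse_args()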

Comment on lines +56 to +58
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The number of environment steps can be changed through

medium

The "Results to expect" section in the docstring seems to contain information for HalfCheetah ("reward threshold of 9000.0 within 2 million timesteps"), but the script defaults are for Humanoid-v4 (default_reward=800.0, default_timesteps=3_000_000). This looks like a copy-paste from the old halfcheetah_appo.py example. Please update the docstring to reflect the correct environment and expected results for Humanoid-v4.

Suggested change
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The number of environment steps can be changed through
The algorithm should reach the default reward threshold of 800.0 on Humanoid-v4
within 3 million timesteps (see: `default_timesteps` in the code).

Comment on lines +51 to +54
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.

medium

The docstring mentions an expected reward of 300.0 within 2 million timesteps, but the script's defaults are default_reward=350.0 and default_timesteps=5_000_000. Please update the docstring to be consistent with the code defaults.

Suggested change
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.
The algorithm should reach the default reward threshold of 350.0 (500.0 is the
maximum) within approximately 5 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.

Comment on lines +49 to +52
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their

medium

The docstring has a placeholder "XX" for the expected mean return and an outdated value for default_timesteps. Could you please replace "XX" with the actual expected value (e.g., -0.5) and update the number of timesteps to match the script's default of 10,000,000?

Suggested change
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
Training will run for 10 million timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of -0.5 compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
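
For readers unfamiliar with how a "p0 versus a random policy" setup like the one this docstring describes is typically wired up, here is a hedged sketch using RLlib's multi-agent config (the environment ID, agent IDs, and policy names are illustrative placeholders, not taken from the PR):

from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("tictactoe_env")  # placeholder; the real script registers its own env
    .multi_agent(
        # Two policies: "p0" learns, "random" acts as the fixed baseline opponent.
        policies={"p0", "random"},
        # Map each agent ID in the env to one of the two policies.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "p0" if agent_id == "player_0" else "random"
        ),
        # Only "p0" is optimized, so its return against the random opponent is
        # the metric the docstring's threshold refers to.
        policies_to_train=["p0"],
    )
)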
