[rllib] Improve IMPALA examples and premerge #59927
Conversation
Commits
All commits are signed off by Mark Towers <mark@anyscale.com>. Several are merge commits that resolved conflicts in:
- rllib/BUILD.bazel
- rllib/utils/test_utils.py
- rllib/examples/algorithms/bc/cartpole_bc.py, cartpole_bc_with_offline_evaluation.py, pendulum_bc.py
- rllib/examples/algorithms/iql/pendulum_iql.py
- rllib/examples/algorithms/marwil/cartpole_marwil.py
- rllib/examples/algorithms/appo/cartpole_appo.py, halfcheetah_appo.py, multi_agent_cartpole_appo.py, multi_agent_pong_appo.py, multi_agent_stateless_cartpole_appo.py, pendulum_appo.py, pong_appo.py, stateless_cartpole_appo.py
Code Review
This pull request significantly improves the IMPALA and APPO examples and associated tests. The changes include refactoring release tests for clarity, renaming cluster configuration files for better descriptiveness, and replacing old example scripts with new, well-documented ones. The new examples for APPO and IMPALA are a great addition.
My review focuses on ensuring consistency in the new documentation and configurations. I've pointed out a few places where comments, tables, and default values are out of sync. Addressing these will improve the maintainability and usability of these examples and tests. Specifically, I've noted:
- Inconsistent test matrices for APPO between `release_tests.yaml` and `BUILD.bazel`.
- A copy-paste error in the IMPALA test matrix header in `BUILD.bazel`.
- Several new example scripts have placeholders or outdated information in their docstrings regarding expected results.
Overall, this is a valuable cleanup and improvement. Great work!
# | APPO (14 total tests)          |         | Number of Learners (Device)                                                                     |
# | Environment                    | Success | Local (CPU) | Single (CPU)    | Single (GPU) | Multi (GPU) Single Node | Multi (GPU) Multi Node |
# |--------------------------------|---------|-------------|-----------------|--------------|-------------------------|------------------------|
# | (SA/D) Cartpole                | 450     | ✅          | ✅              | ❌           | ❌                      | ❌                     |
# | (SA/D/LSTM) Stateless Cartpole | 350     | ✅          | ❌              | ✅           | ❌                      | ❌                     |
# | (MA/D) TicTacToe               | 0       | ✅          | ✅              | ❌           | ✅                      | ❌                     |
# | (SA/D) Atari (Pong)            | 18      | ❌          | ❌              | ❌           | ✅                      | ✅                     |
# | (SA/C) IsaacLab (Humanoid)     | ??      | ❌          | ✅ (with 1 GPU) | ⚠️           | ❌                      | ❌                     |
# | (MA/D) Footsies                | ??      | ❌          | ❌              | ✅           | ❌                      | ❌                     |
The APPO test matrix table here seems to be inconsistent with the one in rllib/BUILD.bazel. For example, the success criteria for CartPole is 450 here, but 200 in BUILD.bazel. Similarly for Stateless Cartpole (350 vs 150), TicTacToe (0 vs -0.2), and Atari (18 vs 5). Also, the column structure is different. It would be great to unify these tables to avoid confusion and improve maintainability.
# | APPO (14 total tests)          |         | Number of Learners (Device)                             |
# | Environment                    | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|--------------|--------------|-------------|
# | (SA/D) Cartpole                | 200     | ✅          | ❌           | ❌           | ❌          |
# | (MA/D) TicTacToe               | -2.0    | ❌          | ❌           | ❌           | ✅          |
It seems there's a copy-paste error in the comment table for IMPALA tests. The header says "APPO (14 total tests)" but the tests below are for IMPALA. This should be corrected to "IMPALA".
# | APPO (14 total tests)          |         | Number of Learners (Device)                             |
# | Environment                    | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|--------------|--------------|-------------|
# | (SA/D) Cartpole                | 200     | ✅          | ❌           | ❌           | ❌          |
# | (MA/D) TicTacToe               | -2.0    | ❌          | ❌           | ❌           | ✅          |
# | IMPALA (2 total tests)         |         | Number of Learners (Device)                             |
# | Environment                    | Success | Local (CPU) | Single (CPU) | Single (GPU) | Multi (GPU) |
# |--------------------------------|---------|-------------|--------------|--------------|-------------|
# | (SA/D) Cartpole                | 200     | ✅          | ❌           | ❌           | ❌          |
# | (MA/D) TicTacToe               | -2.0    | ❌          | ❌           | ❌           | ✅          |
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
The docstring mentions that the algorithm should reach a reward threshold of "XX on Breakout". This seems to be a placeholder. Could you please update "XX" with the actual expected reward (e.g., 18.0 as per default_reward) and change "Breakout" to "Pong" to match the default environment for this script?
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
The algorithm should reach the default reward threshold of 18.0 on Pong
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The number of environment steps can be changed through
The "Results to expect" section in the docstring seems to contain information for HalfCheetah ("reward threshold of 9000.0 within 2 million timesteps"), but the script defaults are for Humanoid-v4 (default_reward=800.0, default_timesteps=3_000_000). This looks like a copy-paste from the old halfcheetah_appo.py example. Please update the docstring to reflect the correct environment and expected results for Humanoid-v4.
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The number of environment steps can be changed through
The algorithm should reach the default reward threshold of 800.0 on Humanoid-v4
within 3 million timesteps (see: `default_timesteps` in the code).
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.
The docstring mentions an expected reward of 300.0 within 2 million timesteps, but the script's defaults are default_reward=350.0 and default_timesteps=5_000_000. Please update the docstring to be consistent with the code defaults.
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.
The algorithm should reach the default reward threshold of 350.0 (500.0 is the
maximum) within approximately 5 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
The docstring has a placeholder "XX" for the expected mean return and an outdated value for default_timesteps. Could you please replace "XX" with the actual expected value (e.g., -0.5) and update the number of timesteps to match the script's default of 10,000,000?
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
Training will run for 10 million timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of -0.5 compared to the
random policy. The number of environment steps can be changed through
`default_timesteps`. The trainable policies should gradually improve their
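To make the p0-versus-random setup concrete, here is a rough, hypothetical sketch of a multi-agent APPO config that trains only `p0` while leaving the opponent fixed. The env name, agent IDs, and policy IDs are assumptions; the actual script presumably uses a dedicated random RLModule for the opponent rather than an untrained network:

```python
from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    # Assumed registration name for the TicTacToe example env.
    .environment("TicTacToe-v0")
    .multi_agent(
        policies={"p0", "p1"},
        # Assumed agent IDs; map the first player to the trainable policy.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "p0" if agent_id == "player_0" else "p1"
        ),
        # Only p0 learns; p1 stays fixed and serves as the baseline opponent.
        policies_to_train=["p0"],
    )
)
```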
Description
Updates the IMPALA examples and premerge tests with CartPole and TicTacToe.
We keep only a minimal number of examples, as most users should use PPO or APPO rather than IMPALA.
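As a rough illustration (not the script itself), a minimal IMPALA setup on CartPole along the lines of the new example might look like the following; the hyperparameters are placeholders and the result key assumes the new API stack:

```python
from ray.rllib.algorithms.impala import IMPALAConfig

# Minimal IMPALA config for CartPole (illustrative values only).
config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=2)
    .training(lr=0.0005)
)

algo = config.build()
for _ in range(100):
    results = algo.train()
    # Stop once the mean return clears the 200 threshold used in the premerge matrix.
    if results["env_runners"]["episode_return_mean"] >= 200.0:
        break
```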