diff --git a/CLAUDE.md b/CLAUDE.md index 3d2e569a76..25d1dcb183 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,5 +6,5 @@ - Always run the linter and make sure the tests pass before finishing a task. - Prefer running single tests, not the whole suite, when developing. - To run the `./scripts/train/build_image_and_launch.sh` script, you must commit the current changes. -- Launch tool use experiments by running `./scripts/train/build_image_and_launch.sh scripts/train/debug/tool_grpo_fast.sh`. +- Launch tool use experiments by running `./scripts/train/build_image_and_launch.sh scripts/train/debug/tool_grpo.sh`. - Launch multi-node non-tool experiments by running `./scripts/train/build_image_and_launch.sh scripts/train/debug/large_test_script.sh`. diff --git a/docs/algorithms/grpo.md b/docs/algorithms/grpo.md index 0ef8bf82c3..67817f5b23 100644 --- a/docs/algorithms/grpo.md +++ b/docs/algorithms/grpo.md @@ -6,12 +6,12 @@ GRPO is an online RL method used in [DeepSeek R1 paper](https://arxiv.org/abs/25 ## Implemented Variants -- `grpo_fast.py` is a faster variant using [packing techniques](https://huggingface.co/blog/sirluk/llm-sequence-packing). +- `grpo.py` is a faster variant using [packing techniques](https://huggingface.co/blog/sirluk/llm-sequence-packing). - `grpo_vllm_thread_ray_gtrl.py` is a more vanilla GRPO implementation, using vLLM and Ray. -## `grpo_fast.py` +## `grpo.py` This implementation has the following features: @@ -19,17 +19,17 @@ This implementation has the following features: - Uses a thread-based approach to parallelize the training and inference processes, based on [Asynchronous RLHF](https://arxiv.org/abs/2410.18252). - Uses a data preparation thread to prepare the data for the training process. -In simpler tasks, we see 2x faster training, and even 10x faster for more complex tasks. With `grpo_fast.py`, we can run crank up `number_samples_per_prompt` and train on really large batch sizes. +In simpler tasks, we see 2x faster training, and even 10x faster for more complex tasks. With `grpo.py`, we can crank up `number_samples_per_prompt` and train on really large batch sizes. It implements additional optimizations: -* `grpo_fast.py` also implements an optimization to skip zero gradient batches. If we solve a prompt 100% correct or 0% correct, the std of the group is 0. So `adv = (score - score.mean()) / (score.std + 1e-5) = 0 / 1e-5 = 0`, causing 0 gradients. `grpo_fast.py` will skip these batches before packing the sequences. +* `grpo.py` also implements an optimization to skip zero-gradient batches. If we solve a prompt 100% correct or 0% correct, the std of the group is 0. So `adv = (score - score.mean()) / (score.std() + 1e-5) = 0 / 1e-5 = 0`, causing 0 gradients. `grpo.py` will skip these batches before packing the sequences. -![](grpo/grpo_fast_gradient.png) +![](grpo/grpo_gradient.png) Figure taken from [this discord thread by @the_real_jrb](https://discord.com/channels/1179127597926469703/1208183216843005962/1357712190957682839) -* `grpo_fast.py` only applies the verification reward if the format reward is enabled (via `--additive_format_reward False` by default). See ([allenai/open-instruct/pull/659](https://github.com/allenai/open-instruct/pull/659)). A direct additive format reward is undesirable. In GRPO, the scale of the rewards is not relevant due to group normalization. For example, a group of [0, 0, 0, 0, 10], [0, 0, 0, 0, 11], [0, 0, 0, 0, 1] reward will have the same advantage.
+* `grpo.py` only applies the verification reward if the format reward is enabled (via `--additive_format_reward False` by default). See [allenai/open-instruct/pull/659](https://github.com/allenai/open-instruct/pull/659). A direct additive format reward is undesirable. In GRPO, the scale of the rewards is not relevant due to group normalization. For example, reward groups of [0, 0, 0, 0, 10], [0, 0, 0, 0, 11], and [0, 0, 0, 0, 1] will all have the same advantages. Now imagine a case where the model generates a really long response (8k tokens) but only gets the format reward right: GRPO will push up the probabilities of this long response even though it is not actually correct. As a result, when using the format reward directly, we see the response length of unsolved prompts fluctuate significantly, causing stability issues. @@ -41,9 +41,9 @@ You can run the script in single GPU mode to debug the training process. ```bash # single GPU -bash scripts/train/debug/grpo_fast.sh +bash scripts/train/debug/grpo.sh # 3 GPU: 2 for training, 1 for inference (a more realistic setting for async training) -bash scripts/train/debug/grpo_fast_3_gpu.sh +bash scripts/train/debug/grpo_3_gpu.sh ``` ### Reproduce `allenai/Llama-3.1-Tulu-3.1-8B` (1 Node) @@ -51,16 +51,16 @@ bash scripts/train/debug/grpo_fast_3_gpu.sh You can reproduce our `allenai/Llama-3.1-Tulu-3.1-8B` model by running the following command: ```bash -bash scripts/train/tulu3/grpo_fast_8b_single_node.sh +bash scripts/train/tulu3/grpo_8b_single_node.sh ``` ???+ info - Here the `grpo_fast.py` actually use 6 GPUs for training and 2 GPUs for inference, so it's using less hardware but runs faster than `grpo_vllm_thread_ray_gtrl.py` which uses 2 nodes (12 GPUs for training and 4 GPUs for inference). + Here `grpo.py` actually uses 6 GPUs for training and 2 GPUs for inference, so it uses less hardware but runs faster than `grpo_vllm_thread_ray_gtrl.py`, which uses 2 nodes (12 GPUs for training and 4 GPUs for inference). -![grpo_tulu3_8b](grpo/tulu3.1_8b_grpo_fast.png) -![grpo_tulu3_8b_time](grpo/tulu3.1_8b_grpo_fast-time.png) +![grpo_tulu3_8b](grpo/tulu3.1_8b_grpo.png) +![grpo_tulu3_8b_time](grpo/tulu3.1_8b_grpo-time.png) ??? note "👉 Tracked WandB Experiments (Click to expand)" @@ -70,13 +70,13 @@ bash scripts/train/tulu3/grpo_fast_8b_single_node.sh Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up. - ![grpo_plot](grpo/tulu3.1_8b_grpo_fast_eval_curve.png) + ![grpo_plot](grpo/tulu3.1_8b_grpo_eval_curve.png) ???+ info Based on our internal evaluation, the GRPO model is roughly on par with the original `allenai/Llama-3.1-Tulu-3.1-8B` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training. - ![grpo_plot](grpo/tulu3.1_8b_grpo_fast_eval.png) + ![grpo_plot](grpo/tulu3.1_8b_grpo_eval.png) ???+ info @@ -89,12 +89,12 @@ bash scripts/train/tulu3/grpo_fast_8b_single_node.sh We have ```bash -bash scripts/train/qwen/grpo_fast_7b.sh +bash scripts/train/qwen/grpo_7b.sh ``` -![grpo_qwen2.5_7B_works](grpo/qwen2.5_7b_grpo_fast_zero.png) -![grpo_qwen2.5_7B_works_time](grpo/qwen2.5_7b_grpo_fast_zero-time.png) +![grpo_qwen2.5_7B_works](grpo/qwen2.5_7b_grpo_zero.png) +![grpo_qwen2.5_7B_works_time](grpo/qwen2.5_7b_grpo_zero-time.png) ???
note "👉 Tracked WandB Experiments (Click to expand)" @@ -106,7 +106,7 @@ bash scripts/train/qwen/grpo_fast_7b.sh Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up. - ![grpo_plot](grpo/qwen2.5_7b_grpo_fast_zero_eval_curve.png) + ![grpo_plot](grpo/qwen2.5_7b_grpo_zero_eval_curve.png) ???+ info @@ -120,12 +120,12 @@ bash scripts/train/qwen/grpo_fast_7b.sh We have ```bash -bash scripts/train/olmo2/grpo_fast_7b_zero.sh +bash scripts/train/olmo2/grpo_7b_zero.sh ``` -![grpo_olmo2_7b_zero](grpo/olmo2_7b_grpo_fast_zero.png) -![grpo_olmo2_7b_zero_time](grpo/olmo2_7b_grpo_fast_zero-time.png) +![grpo_olmo2_7b_zero](grpo/olmo2_7b_grpo_zero.png) +![grpo_olmo2_7b_zero_time](grpo/olmo2_7b_grpo_zero-time.png) ??? note "👉 Tracked WandB Experiments (Click to expand)" @@ -135,7 +135,7 @@ bash scripts/train/olmo2/grpo_fast_7b_zero.sh Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up. - ![grpo_plot](grpo/olmo2_7b_grpo_fast_zero_eval_curve.png) + ![grpo_plot](grpo/olmo2_7b_grpo_zero_eval_curve.png) ???+ info @@ -148,12 +148,12 @@ bash scripts/train/olmo2/grpo_fast_7b_zero.sh We have ```bash -bash scripts/train/olmo2/grpo_fast_13b_zero.sh +bash scripts/train/olmo2/grpo_13b_zero.sh ``` -![grpo_olmo2_13b_zero](grpo/olmo2_13b_grpo_fast_zero.png) -![grpo_olmo2_13b_zero_time](grpo/olmo2_13b_grpo_fast_zero-time.png) +![grpo_olmo2_13b_zero](grpo/olmo2_13b_grpo_zero.png) +![grpo_olmo2_13b_zero_time](grpo/olmo2_13b_grpo_zero-time.png) ??? note "👉 Tracked WandB Experiments (Click to expand)" @@ -163,7 +163,7 @@ bash scripts/train/olmo2/grpo_fast_13b_zero.sh Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up. - ![grpo_plot](grpo/olmo2_13b_grpo_fast_zero_eval_curve.png) + ![grpo_plot](grpo/olmo2_13b_grpo_zero_eval_curve.png) ???+ info @@ -175,7 +175,7 @@ bash scripts/train/olmo2/grpo_fast_13b_zero.sh ### Training Metrics -See the Training Metrics for `grpo_vllm_thread_ray_gtrl.py` below for general metrics. `grpo_fast.py` includes the following additional metrics: +See the Training Metrics for `grpo_vllm_thread_ray_gtrl.py` below for general metrics. `grpo.py` includes the following additional metrics: * `other/real_batch_size_ratio`: In GRPO, as we train we actually get smaller and smaller batch sizes. This is because if we solve a prompt 100% correct or 0% correct, the std of the group is 0. So `adv = (score - score.mean()) / (score.std + 1e-5) = 0 / 1e-5 = 0`, causing 0 gradients. 
This metric is the ratio of the samples that have gradients vs the total number of samples, diff --git a/docs/algorithms/grpo/grpo_fast_gradient.png b/docs/algorithms/grpo/grpo_gradient.png similarity index 100% rename from docs/algorithms/grpo/grpo_fast_gradient.png rename to docs/algorithms/grpo/grpo_gradient.png diff --git a/docs/algorithms/grpo/olmo2_13b_grpo_fast_zero-time.png b/docs/algorithms/grpo/olmo2_13b_grpo_zero-time.png similarity index 100% rename from docs/algorithms/grpo/olmo2_13b_grpo_fast_zero-time.png rename to docs/algorithms/grpo/olmo2_13b_grpo_zero-time.png diff --git a/docs/algorithms/grpo/olmo2_13b_grpo_fast_zero.png b/docs/algorithms/grpo/olmo2_13b_grpo_zero.png similarity index 100% rename from docs/algorithms/grpo/olmo2_13b_grpo_fast_zero.png rename to docs/algorithms/grpo/olmo2_13b_grpo_zero.png diff --git a/docs/algorithms/grpo/olmo2_13b_grpo_fast_zero_eval_curve.png b/docs/algorithms/grpo/olmo2_13b_grpo_zero_eval_curve.png similarity index 100% rename from docs/algorithms/grpo/olmo2_13b_grpo_fast_zero_eval_curve.png rename to docs/algorithms/grpo/olmo2_13b_grpo_zero_eval_curve.png diff --git a/docs/algorithms/grpo/olmo2_7b_grpo_fast_zero-time.png b/docs/algorithms/grpo/olmo2_7b_grpo_zero-time.png similarity index 100% rename from docs/algorithms/grpo/olmo2_7b_grpo_fast_zero-time.png rename to docs/algorithms/grpo/olmo2_7b_grpo_zero-time.png diff --git a/docs/algorithms/grpo/olmo2_7b_grpo_fast_zero.png b/docs/algorithms/grpo/olmo2_7b_grpo_zero.png similarity index 100% rename from docs/algorithms/grpo/olmo2_7b_grpo_fast_zero.png rename to docs/algorithms/grpo/olmo2_7b_grpo_zero.png diff --git a/docs/algorithms/grpo/olmo2_7b_grpo_fast_zero_eval_curve.png b/docs/algorithms/grpo/olmo2_7b_grpo_zero_eval_curve.png similarity index 100% rename from docs/algorithms/grpo/olmo2_7b_grpo_fast_zero_eval_curve.png rename to docs/algorithms/grpo/olmo2_7b_grpo_zero_eval_curve.png diff --git a/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero-time.png b/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero-time.png deleted file mode 100644 index ad859f4444..0000000000 Binary files a/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero-time.png and /dev/null differ diff --git a/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero.png b/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero.png deleted file mode 100644 index cdea046244..0000000000 Binary files a/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero.png and /dev/null differ diff --git a/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero_eval_curve.png b/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero_eval_curve.png deleted file mode 100644 index e60e7dc2b6..0000000000 Binary files a/docs/algorithms/grpo/qwen2.5_7b_grpo_fast_zero_eval_curve.png and /dev/null differ diff --git a/docs/algorithms/grpo/qwen2.5_7b_grpo_zero-time.png b/docs/algorithms/grpo/qwen2.5_7b_grpo_zero-time.png deleted file mode 100644 index c92cb1e261..0000000000 Binary files a/docs/algorithms/grpo/qwen2.5_7b_grpo_zero-time.png and /dev/null differ diff --git a/docs/algorithms/grpo/qwen2.5_7b_grpo_zero.png b/docs/algorithms/grpo/qwen2.5_7b_grpo_zero.png deleted file mode 100644 index f6e2c82699..0000000000 Binary files a/docs/algorithms/grpo/qwen2.5_7b_grpo_zero.png and /dev/null differ diff --git a/docs/algorithms/grpo/qwen2.5_7b_grpo_zero_eval_curve.png b/docs/algorithms/grpo/qwen2.5_7b_grpo_zero_eval_curve.png deleted file mode 100644 index 903203a798..0000000000 Binary files a/docs/algorithms/grpo/qwen2.5_7b_grpo_zero_eval_curve.png and /dev/null differ
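To make the zero-gradient skip and the `real_batch_size_ratio` metric described above concrete, here is a minimal sketch of the group normalization the docs quote (illustrative only: `group_advantages` is a hypothetical helper, not code from `grpo.py`; the `1e-5` epsilon is taken from the formula in the docs):

```python
import numpy as np

def group_advantages(scores: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # adv = (score - score.mean()) / (score.std() + eps), computed per group.
    return (scores - scores.mean()) / (scores.std() + eps)

# A group that is 100% or 0% solved has std 0, so every advantage is
# 0 / eps = 0 and the whole group contributes zero gradient; grpo.py
# skips such groups before packing the sequences.
print(group_advantages(np.array([1.0, 1.0, 1.0, 1.0])))  # [0. 0. 0. 0.]

# Scale invariance: the three reward groups from the docs all normalize to
# (approximately) the same advantages, which is why the magnitude of an
# additive format reward would not matter under group normalization.
for rewards in ([0, 0, 0, 0, 10], [0, 0, 0, 0, 11], [0, 0, 0, 0, 1]):
    print(group_advantages(np.array(rewards, dtype=float)).round(3))
    # -> approximately [-0.5 -0.5 -0.5 -0.5  2. ] for each group

# real_batch_size_ratio is then the fraction of samples that keep nonzero
# gradients (sketched here at group granularity, assuming equal group sizes).
groups = [np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0, 1.0])]
print(sum(g.std() > 0 for g in groups) / len(groups))  # 0.5
```

diff --git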
a/docs/algorithms/grpo/tulu3.1_8b_grpo-time.png b/docs/algorithms/grpo/tulu3.1_8b_grpo-time.png index 7f59ef4074..f0af87aa8f 100644 Binary files a/docs/algorithms/grpo/tulu3.1_8b_grpo-time.png and b/docs/algorithms/grpo/tulu3.1_8b_grpo-time.png differ diff --git a/docs/algorithms/grpo/tulu3.1_8b_grpo.png b/docs/algorithms/grpo/tulu3.1_8b_grpo.png index a8dbd983e4..b80d9f912a 100644 Binary files a/docs/algorithms/grpo/tulu3.1_8b_grpo.png and b/docs/algorithms/grpo/tulu3.1_8b_grpo.png differ diff --git a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast-time.png b/docs/algorithms/grpo/tulu3.1_8b_grpo_fast-time.png deleted file mode 100644 index f0af87aa8f..0000000000 Binary files a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast-time.png and /dev/null differ diff --git a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast.png b/docs/algorithms/grpo/tulu3.1_8b_grpo_fast.png deleted file mode 100644 index b80d9f912a..0000000000 Binary files a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast.png and /dev/null differ diff --git a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast_eval.png b/docs/algorithms/grpo/tulu3.1_8b_grpo_fast_eval.png deleted file mode 100644 index 4e93fc2249..0000000000 Binary files a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast_eval.png and /dev/null differ diff --git a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast_eval_curve.png b/docs/algorithms/grpo/tulu3.1_8b_grpo_fast_eval_curve.png deleted file mode 100644 index d8024d123c..0000000000 Binary files a/docs/algorithms/grpo/tulu3.1_8b_grpo_fast_eval_curve.png and /dev/null differ diff --git a/docs/get_started/ai2_internal_setup.md b/docs/get_started/ai2_internal_setup.md index 47fb2d413d..4721eb2ae6 100644 --- a/docs/get_started/ai2_internal_setup.md +++ b/docs/get_started/ai2_internal_setup.md @@ -105,7 +105,7 @@ When submitting to the `ai2/augusta` cluster, mason will try to read your model The [/scripts/train](/scripts/train) directory contains many examples on how to launch jobs with mason.py. Sometimes the commands can get long and hard to manage, so we wrote a script called [update_command_args.py](/update_command_args.py) that can be used to add or update arguments in a shell script. 
For example, ```bash -python update_command_args.py scripts/train/tulu3/grpo_fast_8b.sh \ +python update_command_args.py scripts/train/tulu3/grpo_8b.sh \ --cluster ai2/augusta \ --priority normal \ --image costah/open_instruct_dev0320_11 --non_stop_penalty False | uv run bash @@ -118,8 +118,8 @@ As another example, you can run something like this for a learning rate search: ```bash for lr in 1e-6 1e-5 1e-4; do - python update_command_args.py scripts/train/tulu3/grpo_fast_8b.sh \ - --exp_name grpo_fast_8b_lr_${lr} \ + python update_command_args.py scripts/train/tulu3/grpo_8b.sh \ + --exp_name grpo_8b_lr_${lr} \ --learning_rate $lr \ --image costah/open_instruct_dev0320_11 --non_stop_penalty False | uv run bash done diff --git a/mason.py b/mason.py index 447286ba56..dcecc61857 100644 --- a/mason.py +++ b/mason.py @@ -22,14 +22,14 @@ OPEN_INSTRUCT_COMMANDS = [ "open_instruct/finetune.py", "open_instruct/dpo_tune_cache.py", - "open_instruct/grpo_fast.py", + "open_instruct/grpo.py", "open_instruct/ppo.py", "open_instruct/grpo_vllm_thread_ray_gtrl.py", "open_instruct/ppo_vllm_thread_ray_gtrl.py", "open_instruct/reward_modeling.py", ] -OPEN_INSTRUCT_RESUMABLES = ["open_instruct/grpo_fast.py"] +OPEN_INSTRUCT_RESUMABLES = ["open_instruct/grpo.py"] # ---------------------------------------------------------------------- diff --git a/open_instruct/benchmark_generators.py b/open_instruct/benchmark_generators.py index 65274a4aa9..be1a23aa07 100644 --- a/open_instruct/benchmark_generators.py +++ b/open_instruct/benchmark_generators.py @@ -2,8 +2,8 @@ """ Benchmark script for testing vLLM generator performance. -This script loads datasets in the same way as grpo_fast.py, sets up a generator -like in test_grpo_fast.py, and streams results to/from the generator to measure +This script loads datasets in the same way as grpo.py, sets up a generator +like in test_grpo.py, and streams results to/from the generator to measure performance. 
""" @@ -27,7 +27,7 @@ import vllm from ray.util import queue as ray_queue -from open_instruct import dataset_transformation, grpo_fast, logger_utils, model_utils, utils, vllm_utils +from open_instruct import dataset_transformation, grpo, logger_utils, model_utils, utils, vllm_utils from open_instruct.actor_manager import ActorManager from open_instruct.queue_types import PromptRequest @@ -116,7 +116,7 @@ def get_git_commit() -> str: def save_benchmark_results_to_csv( - results: list[dict[str, Any]], total_time: float, args: grpo_fast.Args, model_config: model_utils.ModelConfig + results: list[dict[str, Any]], total_time: float, args: grpo.Args, model_config: model_utils.ModelConfig ) -> None: """Save benchmark results to CSV file.""" git_commit = get_git_commit() @@ -199,8 +199,8 @@ def free_all_gpu_memory(device: int | str = 0) -> None: logger.info(f"[GPU {dev.index}] {free / gib:.2f} GiB free of {total / gib:.2f} GiB after cleanup") -def setup_dataset(args: grpo_fast.Args, tokenizer_config: dataset_transformation.TokenizerConfig) -> datasets.Dataset: - """Set up the dataset using the same pipeline as grpo_fast.py.""" +def setup_dataset(args: grpo.Args, tokenizer_config: dataset_transformation.TokenizerConfig) -> datasets.Dataset: + """Set up the dataset using the same pipeline as grpo.py.""" logger.info("Loading and processing dataset...") # Transform function arguments @@ -229,7 +229,7 @@ def setup_dataset(args: grpo_fast.Args, tokenizer_config: dataset_transformation def setup_vllm_engines( - args: grpo_fast.Args, + args: grpo.Args, tokenizer_config: dataset_transformation.TokenizerConfig, model_config: model_utils.ModelConfig, max_model_len: int, @@ -274,7 +274,7 @@ def setup_vllm_engines( def simulate_weight_sync( - actor_manager: ray.actor.ActorHandle, vllm_engines: list[ray.actor.ActorHandle], args: grpo_fast.Args + actor_manager: ray.actor.ActorHandle, vllm_engines: list[ray.actor.ActorHandle], args: grpo.Args ) -> float: """Simulate weight sync by pausing all actors. @@ -348,7 +348,7 @@ def run_benchmark( param_prompt_Q: ray_queue.Queue, inference_results_Q: ray_queue.Queue, actor_manager: ray.actor.ActorHandle, - args: grpo_fast.Args, + args: grpo.Args, model_config: model_utils.ModelConfig, timestamp: int, num_batches: int = 5, @@ -565,7 +565,7 @@ def aggregate_results(results: list[dict[str, Any]]) -> dict[str, Any]: def print_summary( results: list[dict[str, Any]], total_time: float, - args: grpo_fast.Args, + args: grpo.Args, model_config: model_utils.ModelConfig, model_dims: utils.ModelDims, ) -> None: @@ -642,9 +642,7 @@ def cleanup(vllm_engines: list[ray.actor.ActorHandle], actor_manager: ray.actor. 
def main() -> None: """Main benchmark function.""" # Parse arguments using ArgumentParserPlus - parser = utils.ArgumentParserPlus( - (grpo_fast.Args, dataset_transformation.TokenizerConfig, model_utils.ModelConfig) - ) + parser = utils.ArgumentParserPlus((grpo.Args, dataset_transformation.TokenizerConfig, model_utils.ModelConfig)) args, tokenizer_config, model_config = parser.parse_args_into_dataclasses() diff --git a/open_instruct/grpo_fast.py b/open_instruct/grpo.py similarity index 100% rename from open_instruct/grpo_fast.py rename to open_instruct/grpo.py diff --git a/open_instruct/test_grpo_fast.py b/open_instruct/test_grpo.py similarity index 96% rename from open_instruct/test_grpo_fast.py rename to open_instruct/test_grpo.py index 978c201d98..fbc2988db6 100644 --- a/open_instruct/test_grpo_fast.py +++ b/open_instruct/test_grpo.py @@ -14,12 +14,12 @@ from transformers import AutoTokenizer from vllm import SamplingParams -from open_instruct import grpo_fast, model_utils, utils +from open_instruct import grpo, model_utils, utils from open_instruct.queue_types import GenerationResult, PromptRequest, RequestInfo, TokenStatistics from open_instruct.vllm_utils import create_vllm_engines -class TestGrpoFastBase(unittest.TestCase): +class TestGrpoBase(unittest.TestCase): """Base class with common test utilities.""" def _get_resource_tracker_state(self): @@ -192,7 +192,7 @@ def setup_and_split_batch( queue_size = max(len(queries), num_engines * 2) param_prompt_Q = ray_queue.Queue(maxsize=queue_size) inference_results_Q = ray_queue.Queue(maxsize=queue_size) - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() # Track queues for cleanup self._ray_queues.extend([param_prompt_Q, inference_results_Q]) @@ -209,14 +209,14 @@ def setup_and_split_batch( # Calculate inference_batch_size based on number of queries and engines mock_args.inference_batch_size = max(1, len(queries) // num_engines) - grpo_fast.split_and_insert_batch( + grpo.split_and_insert_batch( batch, 0, training_step, pending_queries_map, param_prompt_Q, mock_generation_config, False ) return param_prompt_Q, inference_results_Q, pending_queries_map -class TestGrpoFastVLLM(TestGrpoFastBase): +class TestGrpoFastVLLM(TestGrpoBase): def test_vllm_queue_system_single_prompt(self): """Test the new queue-based vLLM system with a single prompt 'What is the capital of France?'""" # Check if CUDA is available @@ -497,7 +497,7 @@ def test_multiple_samples_per_prompt(self, vllm_num_engines: int, num_samples_pe self.assertEqual(len(combined_result.responses), expected_responses) -class GrpoIntegrationTests(TestGrpoFastBase): +class GrpoIntegrationTests(TestGrpoBase): """Integration tests for GRPO with parallel processing.""" @ray.remote @@ -566,7 +566,7 @@ def test_out_of_order_processing(self): mock_generation_config.n = num_samples_per_prompt mock_model_dims = self.create_mock_model_dims() - combined_result, batch, prompt_lengths, response_lengths = grpo_fast.accumulate_inference_batches( + combined_result, batch, prompt_lengths, response_lengths = grpo.accumulate_inference_batches( inference_results_Q, pending_queries_map, mock_args, @@ -582,7 +582,7 @@ def test_out_of_order_processing(self): def test_thread_safety_pending_queries_map(self): """Test concurrent access to pending_queries_map.""" - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() errors = [] num_threads = 4 entries_per_thread = 50 @@ -637,7 +637,7 @@ def 
test_accumulate_waits_for_all_engines(self): # Track queue for cleanup self._ray_queues.append(inference_results_Q) - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() # Add entries to map for i in range(num_prompts): @@ -661,7 +661,7 @@ def run_accumulate(): mock_generation_config.n = 1 mock_model_dims = self.create_mock_model_dims() - grpo_fast.accumulate_inference_batches( + grpo.accumulate_inference_batches( inference_results_Q, pending_queries_map, mock_args, @@ -686,7 +686,7 @@ def run_accumulate(): self.assertEqual(len(pending_queries_map), 4) -class TestStreamingAccumulation(TestGrpoFastBase): +class TestStreamingAccumulation(TestGrpoBase): """Test the new streaming accumulation functionality.""" def test_more_engines_than_queries(self): @@ -697,7 +697,7 @@ def test_more_engines_than_queries(self): queries, ground_truths, datasets, raw_queries, indices = self.create_test_data(num_queries) param_prompt_Q = ray_queue.Queue(maxsize=num_queries) - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() # Track queue for cleanup self._ray_queues.append(param_prompt_Q) @@ -713,7 +713,7 @@ def test_more_engines_than_queries(self): mock_args = MagicMock() mock_args.inference_batch_size = max(1, num_queries // num_engines) - grpo_fast.split_and_insert_batch( + grpo.split_and_insert_batch( batch, epoch_number=0, training_step=1, @@ -748,7 +748,7 @@ def test_uneven_distribution_no_empty_batches(self): queries, ground_truths, datasets, raw_queries, indices = self.create_test_data(num_queries) param_prompt_Q = ray_queue.Queue(maxsize=num_queries) - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() # Track queue for cleanup self._ray_queues.append(param_prompt_Q) @@ -764,7 +764,7 @@ def test_uneven_distribution_no_empty_batches(self): mock_args = MagicMock() mock_args.inference_batch_size = max(1, num_queries // num_engines + (1 if num_queries % num_engines else 0)) - grpo_fast.split_and_insert_batch( + grpo.split_and_insert_batch( batch, epoch_number=0, training_step=1, @@ -797,7 +797,7 @@ def test_streaming_accumulation_basic(self): # Create queues and maps inference_results_Q = ray_queue.Queue(maxsize=num_prompts) - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() # Track queue for cleanup self._ray_queues.append(inference_results_Q) @@ -849,7 +849,7 @@ def test_streaming_with_multiple_samples(self): # Create queues and maps inference_results_Q = ray_queue.Queue(maxsize=num_prompts) - pending_queries_map = grpo_fast.PendingQueriesMap() + pending_queries_map = grpo.PendingQueriesMap() # Track queue for cleanup self._ray_queues.append(inference_results_Q) @@ -892,7 +892,7 @@ def test_basic_iteration(self): data = np.arange(100) batch_size = 10 - iterator = grpo_fast.ShufflingIterator(data, batch_size, seed=42) + iterator = grpo.ShufflingIterator(data, batch_size, seed=42) # Get first batch batch1 = next(iterator) @@ -913,7 +913,7 @@ def test_state_preservation_and_restoration(self): seed = 42 # Create original iterator - iter1 = grpo_fast.ShufflingIterator(data, batch_size, seed=seed) + iter1 = grpo.ShufflingIterator(data, batch_size, seed=seed) # Get a few batches _ = next(iter1) @@ -934,7 +934,7 @@ def test_state_preservation_and_restoration(self): batch5_original = next(iter1) # Create new iterator with different seed and restore state - iter2 = grpo_fast.ShufflingIterator(data, batch_size, 
seed=999) + iter2 = grpo.ShufflingIterator(data, batch_size, seed=999) iter2.set_state(state) # Get batches from restored iterator @@ -952,7 +952,7 @@ def test_epoch_boundary_state(self): batch_size = 5 # Create iterator and complete one epoch - iterator = grpo_fast.ShufflingIterator(data, batch_size, seed=123) + iterator = grpo.ShufflingIterator(data, batch_size, seed=123) for _ in range(4): # 20 / 5 = 4 batches per epoch next(iterator) @@ -962,7 +962,7 @@ def test_epoch_boundary_state(self): self.assertEqual(state["index"], 20) # Create new iterator and restore state - iter2 = grpo_fast.ShufflingIterator(data, batch_size, seed=456) + iter2 = grpo.ShufflingIterator(data, batch_size, seed=456) iter2.set_state(state) # Next batches should match @@ -977,8 +977,8 @@ def test_rng_state_preservation(self): batch_size = 50 # Create two iterators with same seed - iter1 = grpo_fast.ShufflingIterator(data, batch_size, seed=42) - _ = grpo_fast.ShufflingIterator(data, batch_size, seed=42) + iter1 = grpo.ShufflingIterator(data, batch_size, seed=42) + _ = grpo.ShufflingIterator(data, batch_size, seed=42) # Advance first iterator for _ in range(5): @@ -986,7 +986,7 @@ # Save state and create new iterator with different seed state = iter1.get_state() - iter3 = grpo_fast.ShufflingIterator(data, batch_size, seed=999) + iter3 = grpo.ShufflingIterator(data, batch_size, seed=999) # Restore state - this should override the different seed iter3.set_state(state) diff --git a/scripts/benchmarking/olmo3_infra.sh b/scripts/benchmarking/olmo3_infra.sh index 30a4b6419b..f5b0fb0cdf 100644 --- a/scripts/benchmarking/olmo3_infra.sh +++ b/scripts/benchmarking/olmo3_infra.sh @@ -32,7 +32,7 @@ for split_var in mixin_it_up; do --env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ --env LITELLM_LOG="ERROR" \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo_fast.py \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \ --exp_name ${exp_name} \ --beta 0.0 \ --async_steps=4 \ diff --git a/scripts/train/benchmark.sh b/scripts/train/benchmark.sh index 20d4cdcec5..cb29266267 100644 --- a/scripts/train/benchmark.sh +++ b/scripts/train/benchmark.sh @@ -47,7 +47,7 @@ python update_command_args.py scripts/train/tulu3/ppo_8b.sh \ --image costah/open_instruct_dev_uv13 | uv run bash # 2 nodes -python update_command_args.py scripts/train/tulu3/grpo_fast_8b.sh \ +python update_command_args.py scripts/train/tulu3/grpo_8b.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority normal \ @@ -55,7 +55,7 @@ python update_command_args.py scripts/train/tulu3/grpo_fast_8b.sh \ --image costah/open_instruct_dev0320_11 | uv run bash # 1 node -python update_command_args.py scripts/train/tulu3/grpo_fast_8b_single_node.sh \ +python update_command_args.py scripts/train/tulu3/grpo_8b_single_node.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority normal \ @@ -67,13 +67,13 @@ python update_command_args.py scripts/train/tulu3/grpo_fast_8b_single_node.sh \ # Qwen # 2 nodes -python update_command_args.py scripts/train/qwen/grpo_fast_7b.sh \ +python update_command_args.py scripts/train/qwen/grpo_7b.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority urgent | uv run bash
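The `ShufflingIterator` tests renamed earlier in this patch pin down a resumability contract: `get_state()` captures the position, the current permutation, and the RNG state, and `set_state()` makes any instance, even one constructed with a different seed, reproduce the exact same subsequent batches. A minimal sketch satisfying that contract, assuming numpy-based shuffling (the actual implementation in `grpo.py` may differ):

```python
import numpy as np

class ShufflingIterator:
    """Resumable shuffling iterator (illustrative sketch, not grpo.py's code)."""

    def __init__(self, data: np.ndarray, batch_size: int, seed: int | None = None):
        self.data = data.copy()
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)
        self.rng.shuffle(self.data)
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self) -> np.ndarray:
        # Reshuffle and wrap around once the current epoch is exhausted.
        if self.index + self.batch_size > len(self.data):
            self.rng.shuffle(self.data)
            self.index = 0
        batch = self.data[self.index : self.index + self.batch_size].copy()
        self.index += self.batch_size
        return batch

    def get_state(self) -> dict:
        # Everything needed to reproduce future batches exactly.
        return {
            "index": self.index,
            "data": self.data.copy(),
            "rng_state": self.rng.bit_generator.state,
        }

    def set_state(self, state: dict) -> None:
        # Restoring overrides whatever seed this instance was built with.
        self.index = state["index"]
        self.data = state["data"].copy()
        self.rng.bit_generator.state = state["rng_state"]
```

# 4 nodes -python update_command_args.py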
scripts/train/qwen/grpo_fast_7b_orz.sh \ +python update_command_args.py scripts/train/qwen/grpo_7b_orz.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --image costah/open_instruct_dev_0405 \ @@ -96,7 +96,7 @@ python update_command_args.py scripts/train/qwen/grpo_7b.sh \ --image costah/open_instruct_dev_0410_ww_1 | uv run bash # 1 node -python update_command_args.py scripts/train/qwen/grpo_fast_3b_single_node.sh \ +python update_command_args.py scripts/train/qwen/grpo_3b_single_node.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority normal \ @@ -104,7 +104,7 @@ python update_command_args.py scripts/train/qwen/grpo_fast_3b_single_node.sh \ # 8 nodes -python update_command_args.py scripts/train/qwen/grpo_fast_32b.sh \ +python update_command_args.py scripts/train/qwen/grpo_32b.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority high | uv run bash @@ -113,7 +113,7 @@ python update_command_args.py scripts/train/qwen/grpo_fast_32b.sh \ # Llama3 # 4 nodes -python update_command_args.py scripts/train/llama3/grpo_fast_7b_math.sh \ +python update_command_args.py scripts/train/llama3/grpo_7b_math.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority high | uv run bash @@ -157,14 +157,14 @@ python update_command_args.py scripts/train/olmo2/dpo_13b.sh \ --image costah/open_instruct_dev0320_11 | uv run bash # 2 nodes -python update_command_args.py scripts/train/olmo2/grpo_fast_7b_zero.sh \ +python update_command_args.py scripts/train/olmo2/grpo_7b_zero.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority urgent \ --image costah/open_instruct_dev0327_4 | uv run bash # 2 nodes -python update_command_args.py scripts/train/olmo2/grpo_fast_13b_zero.sh \ +python update_command_args.py scripts/train/olmo2/grpo_13b_zero.sh \ --cluster ai2/augusta \ --wandb_project_name open_instruct_public \ --priority urgent \ diff --git a/scripts/train/debug/code.sh b/scripts/train/debug/code.sh index c8e4720a90..5344aa2cf4 100644 --- a/scripts/train/debug/code.sh +++ b/scripts/train/debug/code.sh @@ -1,4 +1,4 @@ -python open_instruct/grpo_fast.py \ +python open_instruct/grpo.py \ --exp_name "test" \ --beta 0.01 \ --num_unique_prompts_rollout 48 \ diff --git a/scripts/train/debug/full_integration_test.sh b/scripts/train/debug/full_integration_test.sh index 09c1a58141..ffd3c3083b 100644 --- a/scripts/train/debug/full_integration_test.sh +++ b/scripts/train/debug/full_integration_test.sh @@ -27,7 +27,7 @@ for split_var in split_int_mix_3; do --env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ --env LITELLM_LOG="ERROR" \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo_fast.py \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \ --exp_name ${exp_name} \ --beta 0.0 \ --num_samples_per_prompt_rollout 8 \ diff --git a/scripts/train/debug/grpo.sh b/scripts/train/debug/grpo.sh old mode 100644 new mode 100755 index a56d378318..3f4b642d44 --- a/scripts/train/debug/grpo.sh +++ b/scripts/train/debug/grpo.sh @@ -1,36 +1,39 @@ -python open_instruct/grpo_vllm_thread_ray_gtrl.py \ - --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \ +export VLLM_ALLOW_INSECURE_SERIALIZATION=1 +export VLLM_DISABLE_COMPILE_CACHE=1 +export VLLM_USE_V1=1 +uv run python open_instruct/grpo.py \ + 
--dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \ --dataset_mixer_eval_list_splits train \ --max_prompt_token_length 512 \ --response_length 512 \ - --model_name_or_path Qwen/Qwen3-1.7B \ - --number_samples_per_prompt 4 \ - --non_stop_penalty \ - --stop_token eos \ - --temperature 1.0 \ - --chat_template_name tulu \ - --learning_rate 3e-7 \ - --total_episodes 32 \ - --penalty_reward_value -10.0 \ - --deepspeed_stage 3 \ + --pack_length 1024 \ --per_device_train_batch_size 1 \ - --local_rollout_forward_batch_size 1 \ - --local_mini_batch_size 4 \ - --local_rollout_batch_size 4 \ + --num_unique_prompts_rollout 8 \ + --num_samples_per_prompt_rollout 4 \ + --model_name_or_path Qwen/Qwen3-0.6B \ + --stop_strings "" \ + --apply_r1_style_format_reward \ + --apply_verifiable_reward true \ + --temperature 0.7 \ + --ground_truths_key ground_truth \ + --chat_template_name r1_simple_chat_postpend_think \ + --learning_rate 3e-7 \ + --total_episodes 200 \ + --deepspeed_stage 2 \ --num_epochs 1 \ - --actor_num_gpus_per_node 1 \ + --num_learners_per_node 1 \ --vllm_tensor_parallel_size 1 \ - --beta 0.05 \ - --apply_verifiable_reward true \ + --beta 0.01 \ --seed 3 \ - --num_evals 3 \ - --save_freq 100 \ - --reward_model_multiplier 0.0 \ - --gradient_checkpointing \ - --single_gpu_mode \ + --local_eval_every 1 \ --vllm_sync_backend gloo \ --vllm_gpu_memory_utilization 0.3 \ + --save_traces \ --vllm_enforce_eager \ + --gradient_checkpointing \ + --single_gpu_mode \ + --push_to_hub false \ + --system_prompt_override_file scripts/train/debug/cute_debug_system_prompt.txt \ # --with_tracking diff --git a/scripts/train/debug/grpo_fast_3_gpu.sh b/scripts/train/debug/grpo_3_gpu.sh similarity index 96% rename from scripts/train/debug/grpo_fast_3_gpu.sh rename to scripts/train/debug/grpo_3_gpu.sh index 3ee5cb899e..6a7d05a062 100644 --- a/scripts/train/debug/grpo_fast_3_gpu.sh +++ b/scripts/train/debug/grpo_3_gpu.sh @@ -1,4 +1,4 @@ -python open_instruct/grpo_fast.py \ +python open_instruct/grpo.py \ --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \ diff --git a/scripts/train/debug/grpo_fast.sh b/scripts/train/debug/grpo_fast.sh deleted file mode 100755 index a1b6dc6ca4..0000000000 --- a/scripts/train/debug/grpo_fast.sh +++ /dev/null @@ -1,39 +0,0 @@ -export VLLM_ALLOW_INSECURE_SERIALIZATION=1 -export VLLM_DISABLE_COMPILE_CACHE=1 -export VLLM_USE_V1=1 -uv run python open_instruct/grpo_fast.py \ - --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \ - --dataset_mixer_list_splits train \ - --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \ - --dataset_mixer_eval_list_splits train \ - --max_prompt_token_length 512 \ - --response_length 512 \ - --pack_length 1024 \ - --per_device_train_batch_size 1 \ - --num_unique_prompts_rollout 8 \ - --num_samples_per_prompt_rollout 4 \ - --model_name_or_path Qwen/Qwen3-0.6B \ - --stop_strings "" \ - --apply_r1_style_format_reward \ - --apply_verifiable_reward true \ - --temperature 0.7 \ - --ground_truths_key ground_truth \ - --chat_template_name r1_simple_chat_postpend_think \ - --learning_rate 3e-7 \ - --total_episodes 200 \ - --deepspeed_stage 2 \ - --num_epochs 1 \ - --num_learners_per_node 1 \ - --vllm_tensor_parallel_size 1 \ - --beta 0.01 \ - --seed 3 \ - --local_eval_every 1 \ - --vllm_sync_backend gloo \ - --vllm_gpu_memory_utilization 0.3 \ - --save_traces \ - 
--vllm_enforce_eager \ - --gradient_checkpointing \ - --single_gpu_mode \ - --push_to_hub false \ - --system_prompt_override_file scripts/train/debug/cute_debug_system_prompt.txt \ - # --with_tracking diff --git a/scripts/train/debug/grpo_fast_llm_judge.sh b/scripts/train/debug/grpo_llm_judge.sh similarity index 97% rename from scripts/train/debug/grpo_fast_llm_judge.sh rename to scripts/train/debug/grpo_llm_judge.sh index d2606e3e96..247d5f10e1 100755 --- a/scripts/train/debug/grpo_fast_llm_judge.sh +++ b/scripts/train/debug/grpo_llm_judge.sh @@ -1,7 +1,7 @@ # note: judge may not be alive, internal ai2 host. export HOSTED_VLLM_API_BASE=http://saturn-cs-aus-234.reviz.ai2.in:8001/v1 -uv run python open_instruct/grpo_fast.py \ +uv run python open_instruct/grpo.py \ --dataset_mixer_list hamishivi/virtuoussy_multi_subject_rlvr 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list hamishivi/virtuoussy_multi_subject_rlvr 16 \ diff --git a/scripts/train/debug/grpo_fast_tool.sh b/scripts/train/debug/grpo_tool.sh similarity index 96% rename from scripts/train/debug/grpo_fast_tool.sh rename to scripts/train/debug/grpo_tool.sh index 70e7b71b96..92aa1ded83 100644 --- a/scripts/train/debug/grpo_fast_tool.sh +++ b/scripts/train/debug/grpo_tool.sh @@ -1,4 +1,4 @@ -python open_instruct/grpo_fast.py \ +python open_instruct/grpo.py \ --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \ diff --git a/scripts/train/debug/judge.sh b/scripts/train/debug/judge.sh index 7332c63c3a..ae3b5bbdc9 100644 --- a/scripts/train/debug/judge.sh +++ b/scripts/train/debug/judge.sh @@ -1,7 +1,7 @@ export HOSTED_VLLM_API_BASE=http://saturn-cs-aus-230.reviz.ai2.in:8001/v1 # new version -python open_instruct/grpo_fast.py \ +python open_instruct/grpo.py \ --dataset_mixer_list faezeb/tulu_3_rewritten_100k-no-math 20000 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list hamishivi/tulu_3_rewritten_100k 32 \ @@ -46,7 +46,7 @@ python open_instruct/grpo_fast.py \ # 8192 # initial saurabh version -# python open_instruct/grpo_fast.py \ +# python open_instruct/grpo.py \ # --dataset_mixer_list faezeb/tulu_3_rewritten_100k-no-math 512 \ # --dataset_mixer_list_splits train \ # --dataset_mixer_eval_list ai2-adapt-dev/general-thoughts-100k-rewritten-v2-ifeval 16 \ diff --git a/scripts/train/debug/large_test_script.sh b/scripts/train/debug/large_test_script.sh index b7f35d21cd..7f1ff6d6e7 100755 --- a/scripts/train/debug/large_test_script.sh +++ b/scripts/train/debug/large_test_script.sh @@ -17,7 +17,7 @@ uv run python mason.py \ --max_retries 0 \ --env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\&python open_instruct/grpo_fast.py \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\&python open_instruct/grpo.py \ --exp_name ${exp_name} \ --beta 0.0 \ --num_samples_per_prompt_rollout 16 \ diff --git a/scripts/train/debug/local_tool_grpo_fast.sh b/scripts/train/debug/local_tool_grpo.sh similarity index 97% rename from scripts/train/debug/local_tool_grpo_fast.sh rename to scripts/train/debug/local_tool_grpo.sh index e70e98d202..96a3e33cf8 100755 --- a/scripts/train/debug/local_tool_grpo_fast.sh +++ b/scripts/train/debug/local_tool_grpo.sh @@ -1,5 +1,5 @@ #!/bin/bash -uv run open_instruct/grpo_fast.py \ +uv run open_instruct/grpo.py \ 
--dataset_mixer_list hamishivi/tulu_3_rewritten_100k_with_tool_prompt 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list hamishivi/tulu_3_rewritten_100k_with_tool_prompt 16 \ diff --git a/scripts/train/debug/single_gpu_integration_test.sh b/scripts/train/debug/single_gpu_integration_test.sh index f39ab2fbf6..859249bba3 100755 --- a/scripts/train/debug/single_gpu_integration_test.sh +++ b/scripts/train/debug/single_gpu_integration_test.sh @@ -22,7 +22,7 @@ uv run python mason.py \ --budget ai2/oe-adapt \ --no-host-networking \ --gpus 1 \ - -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ + -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \ diff --git a/scripts/train/debug/single_gpu_on_beaker.sh b/scripts/train/debug/single_gpu_on_beaker.sh index fbbe0f5901..b0fecb7a9e 100755 --- a/scripts/train/debug/single_gpu_on_beaker.sh +++ b/scripts/train/debug/single_gpu_on_beaker.sh @@ -21,7 +21,7 @@ uv run python mason.py \ --env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ --budget ai2/oe-adapt \ --gpus 1 \ - -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ + -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \ diff --git a/scripts/train/debug/tool_grpo_fast.sh b/scripts/train/debug/tool_grpo.sh similarity index 98% rename from scripts/train/debug/tool_grpo_fast.sh rename to scripts/train/debug/tool_grpo.sh index 0ada90d92f..1e82056185 100755 --- a/scripts/train/debug/tool_grpo_fast.sh +++ b/scripts/train/debug/tool_grpo.sh @@ -27,7 +27,7 @@ uv run python mason.py \ --budget ai2/oe-adapt \ --no-host-networking \ --gpus 1 \ - -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ + -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ --dataset_mixer_list hamishivi/tulu_3_rewritten_100k_with_tool_prompt 1.0 \ --dataset_mixer_list_splits train \ --dataset_mixer_eval_list hamishivi/tulu_3_rewritten_100k_with_tool_prompt 32 \ diff --git a/scripts/train/olmo2/grpo_13b.sh b/scripts/train/olmo2/grpo_13b.sh deleted file mode 100644 index 3adfc600b9..0000000000 --- a/scripts/train/olmo2/grpo_13b.sh +++ /dev/null @@ -1,54 +0,0 @@ -python mason.py \ - --cluster ai2/jupiter \ - --workspace ai2/tulu-3-dev \ - --priority high \ - --image nathanl/open_instruct_auto --pure_docker_mode \ - --preemptible \ - --num_nodes 2 \ - --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_vllm_thread_ray_gtrl.py \ - --exp_name olmo2_13b_grpo \ - --beta 0.01 \ - --local_mini_batch_size 32 \ - --number_samples_per_prompt 16 \ - --local_rollout_batch_size 4 \ - --kl_estimator kl3 \ - --learning_rate 5e-7 \ - --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \ - --dataset_mixer_list_splits train \ - --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \ - --dataset_mixer_eval_list_splits train \ - --max_token_length 2048 \ - --max_prompt_token_length 2048 \ - --response_length 2048 \ - --model_name_or_path allenai/OLMo-2-1124-13B-DPO \ - --model_revision main \ - --tokenizer_name allenai/OLMo-2-1124-13B-DPO \ - 
--tokenizer_revision main \ - --use_slow_tokenizer False \ - --add_bos \ - --non_stop_penalty \ - --stop_token eos \ - --temperature 1.0 \ - --ground_truths_key ground_truth \ - --chat_template_name tulu \ - --sft_messages_key messages \ - --total_episodes 2000000 \ - --penalty_reward_value 0.0 \ - --deepspeed_stage 2 \ - --per_device_train_batch_size 1 \ - --local_rollout_forward_batch_size 2 \ - --actor_num_gpus_per_node 4 8 \ - --num_epochs 1 \ - --vllm_tensor_parallel_size 4 \ - --lr_scheduler_type constant \ - --apply_verifiable_reward true \ - --seed 1 \ - --num_evals 100 \ - --save_freq 40 \ - --reward_model_multiplier 0.0 \ - --no_try_launch_beaker_eval_jobs \ - --try_launch_beaker_eval_jobs_on_weka \ - --gradient_checkpointing \ - --gather_whole_model False \ - --with_tracking \ No newline at end of file diff --git a/scripts/train/olmo2/grpo_fast_13b_zero.sh b/scripts/train/olmo2/grpo_13b_zero.sh similarity index 95% rename from scripts/train/olmo2/grpo_fast_13b_zero.sh rename to scripts/train/olmo2/grpo_13b_zero.sh index 110decd1a2..934c8b7ece 100644 --- a/scripts/train/olmo2/grpo_fast_13b_zero.sh +++ b/scripts/train/olmo2/grpo_13b_zero.sh @@ -6,8 +6,8 @@ python mason.py \ --preemptible \ --num_nodes 2 \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ - --exp_name olmo2_13b_grpo_fast_zero \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ + --exp_name olmo2_13b_grpo_zero \ --beta 0.0 \ --num_unique_prompts_rollout 48 \ --num_samples_per_prompt_rollout 16 \ diff --git a/scripts/train/olmo2/grpo_fast_32b.sh b/scripts/train/olmo2/grpo_32b.sh similarity index 95% rename from scripts/train/olmo2/grpo_fast_32b.sh rename to scripts/train/olmo2/grpo_32b.sh index 90e23e4abf..ad74911f0f 100644 --- a/scripts/train/olmo2/grpo_fast_32b.sh +++ b/scripts/train/olmo2/grpo_32b.sh @@ -6,8 +6,8 @@ python mason.py \ --preemptible \ --num_nodes 8 \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ - --exp_name olmo2_32b_grpo_fast_zero \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ + --exp_name olmo2_32b_grpo_zero \ --beta 0.0 \ --num_unique_prompts_rollout 256 \ --num_samples_per_prompt_rollout 64 \ diff --git a/scripts/train/olmo2/grpo_fast_32b_tulu.sh b/scripts/train/olmo2/grpo_32b_tulu.sh similarity index 95% rename from scripts/train/olmo2/grpo_fast_32b_tulu.sh rename to scripts/train/olmo2/grpo_32b_tulu.sh index a222c558be..4df8766ce1 100644 --- a/scripts/train/olmo2/grpo_fast_32b_tulu.sh +++ b/scripts/train/olmo2/grpo_32b_tulu.sh @@ -6,8 +6,8 @@ python mason.py \ --preemptible \ --num_nodes 8 \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ - --exp_name olmo2_32b_grpo_fast_zero \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ + --exp_name olmo2_32b_grpo_zero \ --beta 0.0 \ --num_unique_prompts_rollout 256 \ --num_samples_per_prompt_rollout 64 \ diff --git a/scripts/train/olmo2/grpo_7b.sh b/scripts/train/olmo2/grpo_7b.sh deleted file mode 100644 index 373daf0138..0000000000 --- a/scripts/train/olmo2/grpo_7b.sh +++ /dev/null @@ -1,53 +0,0 @@ -python mason.py \ - --cluster ai2/jupiter \ - --workspace ai2/tulu-3-dev \ - --priority high \ - --image nathanl/open_instruct_auto --pure_docker_mode \ - --preemptible \ - 
--num_nodes 2 \ - --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_vllm_thread_ray_gtrl.py \ - --exp_name olmo2_7b_grpo \ - --beta 0.01 \ - --local_mini_batch_size 32 \ - --number_samples_per_prompt 16 \ - --local_rollout_batch_size 4 \ - --kl_estimator kl3 \ - --learning_rate 5e-7 \ - --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \ - --dataset_mixer_list_splits train \ - --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \ - --dataset_mixer_eval_list_splits train \ - --max_token_length 2048 \ - --max_prompt_token_length 2048 \ - --response_length 2048 \ - --model_name_or_path allenai/OLMo-2-1124-7B-DPO \ - --model_revision main \ - --tokenizer_name_or_path allenai/OLMo-2-1124-7B-DPO \ - --tokenizer_revision main \ - --use_slow_tokenizer False \ - --add_bos \ - --non_stop_penalty \ - --stop_token eos \ - --temperature 1.0 \ - --ground_truths_key ground_truth \ - --chat_template_name tulu \ - --sft_messages_key messages \ - --total_episodes 2000000 \ - --penalty_reward_value 0.0 \ - --deepspeed_stage 2 \ - --per_device_train_batch_size 1 \ - --local_rollout_forward_batch_size 2 \ - --actor_num_gpus_per_node 4 8 \ - --num_epochs 1 \ - --vllm_tensor_parallel_size 4 \ - --lr_scheduler_type constant \ - --apply_verifiable_reward true \ - --seed 1 \ - --num_evals 100 \ - --save_freq 40 \ - --reward_model_multiplier 0.0 \ - --no_try_launch_beaker_eval_jobs \ - --try_launch_beaker_eval_jobs_on_weka \ - --gradient_checkpointing \ - --with_tracking \ No newline at end of file diff --git a/scripts/train/olmo2/grpo_fast_7b_zero.sh b/scripts/train/olmo2/grpo_7b_zero.sh similarity index 96% rename from scripts/train/olmo2/grpo_fast_7b_zero.sh rename to scripts/train/olmo2/grpo_7b_zero.sh index 067e2fda25..00032ad9bc 100644 --- a/scripts/train/olmo2/grpo_fast_7b_zero.sh +++ b/scripts/train/olmo2/grpo_7b_zero.sh @@ -6,8 +6,8 @@ python mason.py \ --preemptible \ --num_nodes 2 \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ - --exp_name olmo2_7b_grpo_fast_zero \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ + --exp_name olmo2_7b_grpo_zero \ --beta 0.0 \ --num_unique_prompts_rollout 48 \ --num_samples_per_prompt_rollout 16 \ diff --git a/scripts/train/qwen/grpo_fast_32b.sh b/scripts/train/qwen/grpo_32b.sh similarity index 95% rename from scripts/train/qwen/grpo_fast_32b.sh rename to scripts/train/qwen/grpo_32b.sh index 7660c0ddb6..3c2aea0fa1 100644 --- a/scripts/train/qwen/grpo_fast_32b.sh +++ b/scripts/train/qwen/grpo_32b.sh @@ -6,8 +6,8 @@ python mason.py \ --preemptible \ --num_nodes 8 \ --budget ai2/oe-adapt \ - --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \ - --exp_name qwen2.5_32b_grpo_fast_zero \ + --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \ + --exp_name qwen2.5_32b_grpo_zero \ --beta 0.0 \ --num_unique_prompts_rollout 256 \ --num_samples_per_prompt_rollout 64 \ diff --git a/scripts/train/qwen/grpo_fast_3b_single_node.sh b/scripts/train/qwen/grpo_3b_single_node.sh similarity index 95% rename from scripts/train/qwen/grpo_fast_3b_single_node.sh rename to scripts/train/qwen/grpo_3b_single_node.sh index bef1d90e34..812918f69d 100644 --- a/scripts/train/qwen/grpo_fast_3b_single_node.sh +++ b/scripts/train/qwen/grpo_3b_single_node.sh @@ -6,8 +6,8 @@ python 
mason.py \
     --preemptible \
     --num_nodes 1 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name qwen2.5_3b_grpo_fast_zero \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
+    --exp_name qwen2.5_3b_grpo_zero \
     --beta 0.0 \
     --num_unique_prompts_rollout 48 \
     --num_samples_per_prompt_rollout 16 \
diff --git a/scripts/train/qwen/grpo_7b.sh b/scripts/train/qwen/grpo_7b.sh
index 1e048eef9d..41b61f57e0 100644
--- a/scripts/train/qwen/grpo_7b.sh
+++ b/scripts/train/qwen/grpo_7b.sh
@@ -1,4 +1,3 @@
-# https://wandb.ai/ai2-llm/open_instruct_internal/runs/96221yio/overview
 python mason.py \
     --cluster ai2/jupiter \
     --workspace ai2/tulu-3-dev \
@@ -7,14 +6,11 @@ python mason.py \
     --preemptible \
     --num_nodes 2 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_vllm_thread_ray_gtrl.py \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
     --exp_name qwen2.5_7b_grpo_zero \
     --beta 0.0 \
-    --local_mini_batch_size 32 \
-    --number_samples_per_prompt 16 \
-    --oe_eval_tasks minerva_math::hamish_zs_reasoning,bbh:cot::hamish_zs_reasoning,gsm8k::hamish_zs_reasoning,minerva_math_500::hamish_zs_reasoning,zebralogic::hamish_zs_reasoning,aime::hamish_zs_reasoning,agi_eval_english:0shot_cot::hamish_zs_reasoning,gpqa:0shot_cot::hamish_zs_reasoning \
-    --oe_eval_max_length 8192 \
-    --local_rollout_batch_size 4 \
+    --num_unique_prompts_rollout 48 \
+    --num_samples_per_prompt_rollout 16 \
     --kl_estimator kl3 \
     --learning_rate 5e-7 \
     --dataset_mixer_list ai2-adapt-dev/math_ground_truth_zs 1.0 \
@@ -24,29 +20,28 @@ python mason.py \
     --max_token_length 2048 \
     --max_prompt_token_length 2048 \
     --response_length 4096 \
+    --pack_length 6144 \
     --model_name_or_path Qwen/Qwen2.5-7B \
     --stop_strings "" \
-    --add_r1_style_format_reward True \
+    --apply_r1_style_format_reward True \
     --apply_verifiable_reward True \
+    --non_stop_penalty True \
+    --non_stop_penalty_value 0.0 \
     --chat_template_name r1_simple_chat_postpend_think \
-    --non_stop_penalty False \
-    --stop_token eos \
-    --penalty_reward_value 0.0 \
+    --oe_eval_tasks minerva_math::hamish_zs_reasoning,bbh:cot::hamish_zs_reasoning,gsm8k::hamish_zs_reasoning,minerva_math_500::hamish_zs_reasoning,zebralogic::hamish_zs_reasoning,aime::hamish_zs_reasoning,agi_eval_english:0shot_cot::hamish_zs_reasoning,gpqa:0shot_cot::hamish_zs_reasoning \
+    --oe_eval_max_length 8192 \
     --temperature 1.0 \
-    --ground_truths_key ground_truth \
-    --sft_messages_key messages \
-    --total_episodes 10000000 \
-    --deepspeed_stage 2 \
-    --per_device_train_batch_size 2 \
-    --local_rollout_forward_batch_size 2 \
-    --actor_num_gpus_per_node 8 4 \
+    --total_episodes 5000000 \
+    --deepspeed_stage 3 \
+    --per_device_train_batch_size 1 \
+    --num_mini_batches 1 \
+    --num_learners_per_node 4 \
     --num_epochs 1 \
     --vllm_tensor_parallel_size 1 \
-    --vllm_num_engines 4 \
+    --vllm_num_engines 12 \
     --lr_scheduler_type linear \
     --seed 1 \
-    --num_evals 200 \
-    --reward_model_multiplier 0.0 \
+    --local_eval_every 30 \
     --save_freq 40 \
     --try_launch_beaker_eval_jobs_on_weka \
     --gradient_checkpointing \
diff --git a/scripts/train/qwen/grpo_fast_7b_code.sh b/scripts/train/qwen/grpo_7b_code.sh
similarity index 95%
rename from scripts/train/qwen/grpo_fast_7b_code.sh
rename to scripts/train/qwen/grpo_7b_code.sh
index e12cce582f..9230546329 100644
--- a/scripts/train/qwen/grpo_fast_7b_code.sh
+++ b/scripts/train/qwen/grpo_7b_code.sh
@@ -6,8 +6,8 @@ python mason.py \
     --preemptible \
     --num_nodes 4 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name qwen2.5_7b_grpo_fast_zero_orz \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \
+    --exp_name qwen2.5_7b_grpo_zero_orz \
     --beta 0.0 \
     --num_unique_prompts_rollout 128 \
     --num_samples_per_prompt_rollout 64 \
diff --git a/scripts/train/qwen/grpo_fast_7b_orz.sh b/scripts/train/qwen/grpo_7b_orz.sh
similarity index 95%
rename from scripts/train/qwen/grpo_fast_7b_orz.sh
rename to scripts/train/qwen/grpo_7b_orz.sh
index 4f29dd16cd..50e40f3008 100644
--- a/scripts/train/qwen/grpo_fast_7b_orz.sh
+++ b/scripts/train/qwen/grpo_7b_orz.sh
@@ -6,8 +6,8 @@ python mason.py \
     --preemptible \
     --num_nodes 4 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name qwen2.5_7b_grpo_fast_zero_orz \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
+    --exp_name qwen2.5_7b_grpo_zero_orz \
     --beta 0.0 \
     --num_unique_prompts_rollout 128 \
     --num_samples_per_prompt_rollout 64 \
diff --git a/scripts/train/qwen/grpo_fast_7b.sh b/scripts/train/qwen/grpo_fast_7b.sh
deleted file mode 100644
index 690f8f172f..0000000000
--- a/scripts/train/qwen/grpo_fast_7b.sh
+++ /dev/null
@@ -1,48 +0,0 @@
-python mason.py \
-    --cluster ai2/jupiter \
-    --workspace ai2/tulu-3-dev \
-    --priority high \
-    --image nathanl/open_instruct_auto --pure_docker_mode \
-    --preemptible \
-    --num_nodes 2 \
-    --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name qwen2.5_7b_grpo_fast_zero \
-    --beta 0.0 \
-    --num_unique_prompts_rollout 48 \
-    --num_samples_per_prompt_rollout 16 \
-    --kl_estimator kl3 \
-    --learning_rate 5e-7 \
-    --dataset_mixer_list ai2-adapt-dev/math_ground_truth_zs 1.0 \
-    --dataset_mixer_list_splits train \
-    --dataset_mixer_eval_list ai2-adapt-dev/math_ground_truth_zs 16 \
-    --dataset_mixer_eval_list_splits train \
-    --max_token_length 2048 \
-    --max_prompt_token_length 2048 \
-    --response_length 4096 \
-    --pack_length 6144 \
-    --model_name_or_path Qwen/Qwen2.5-7B \
-    --stop_strings "" \
-    --apply_r1_style_format_reward True \
-    --apply_verifiable_reward True \
-    --non_stop_penalty True \
-    --non_stop_penalty_value 0.0 \
-    --chat_template_name r1_simple_chat_postpend_think \
-    --oe_eval_tasks minerva_math::hamish_zs_reasoning,bbh:cot::hamish_zs_reasoning,gsm8k::hamish_zs_reasoning,minerva_math_500::hamish_zs_reasoning,zebralogic::hamish_zs_reasoning,aime::hamish_zs_reasoning,agi_eval_english:0shot_cot::hamish_zs_reasoning,gpqa:0shot_cot::hamish_zs_reasoning \
-    --oe_eval_max_length 8192 \
-    --temperature 1.0 \
-    --total_episodes 5000000 \
-    --deepspeed_stage 3 \
-    --per_device_train_batch_size 1 \
-    --num_mini_batches 1 \
-    --num_learners_per_node 4 \
-    --num_epochs 1 \
-    --vllm_tensor_parallel_size 1 \
-    --vllm_num_engines 12 \
-    --lr_scheduler_type linear \
-    --seed 1 \
-    --local_eval_every 30 \
-    --save_freq 40 \
-    --try_launch_beaker_eval_jobs_on_weka \
-    --gradient_checkpointing \
-    --with_tracking
\ No newline at end of file
diff --git a/scripts/train/rlvr/grpo_llama3.1-8b.sh b/scripts/train/rlvr/grpo_llama3.1-8b.sh
index 373a8c352f..5ee236c165 100644
--- a/scripts/train/rlvr/grpo_llama3.1-8b.sh
+++ b/scripts/train/rlvr/grpo_llama3.1-8b.sh
@@ -1,4 +1,4 @@
-exp_name="0302_qwen2.5_7B_math_grpo_fast1_${RANDOM}"
+exp_name="0302_qwen2.5_7B_math_grpo1_${RANDOM}"
 python mason.py \
     --cluster ai2/jupiter \
     --workspace ai2/tulu-3-dev \
diff --git a/scripts/train/rlvr/grpo_mini_base.sh b/scripts/train/rlvr/grpo_mini_base.sh
index 81264dacb6..014bfd6ce6 100644
--- a/scripts/train/rlvr/grpo_mini_base.sh
+++ b/scripts/train/rlvr/grpo_mini_base.sh
@@ -1,16 +1,17 @@
-exp_name="base_grpo_${RANDOM}"
-python open_instruct/grpo_vllm_thread_ray_gtrl.py \
+exp_name="base_smollm_grpo_${RANDOM}"
+python open_instruct/grpo.py \
     --exp_name $exp_name \
     --output_dir /weka/oe-adapt-default/costah/models/$exp_name \
     --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
     --dataset_mixer_list_splits train \
     --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
     --dataset_mixer_eval_list_splits train \
-    --max_token_length 512 \
-    --max_prompt_token_length 512 \
-    --response_length 512 \
+    --max_token_length 256 \
+    --max_prompt_token_length 256 \
+    --response_length 128 \
+    --pack_length 2048 \
     --number_samples_per_prompt 4 \
-    --model_name_or_path EleutherAI/pythia-14m \
+    --model_name_or_path HuggingFaceTB/SmolLM2-135M \
     --stop_strings "" \
     --add_r1_style_format_reward \
     --non_stop_penalty False \
@@ -22,7 +23,7 @@ python open_instruct/grpo_vllm_thread_ray_gtrl.py \
     --sft_messages_key messages \
     --learning_rate 3e-7 \
     --total_episodes 1000000 \
-    --deepspeed_stage 3 \
+    --deepspeed_stage 2 \
     --per_device_train_batch_size 1 \
     --local_rollout_forward_batch_size 1 \
     --local_mini_batch_size 16 \
@@ -30,16 +31,16 @@ python open_instruct/grpo_vllm_thread_ray_gtrl.py \
     --num_epochs 1 \
     --actor_num_gpus_per_node 1 \
     --vllm_tensor_parallel_size 1 \
-    --beta 0.0 \
+    --beta 0.01 \
     --apply_verifiable_reward true \
     --seed 3 \
-    --num_evals 100 \
+    --local_eval_every 150 \
     --save_freq 100 \
     --reward_model_multiplier 0.0 \
     --no_try_launch_beaker_eval_jobs \
-    --single_gpu_mode \
     --vllm_sync_backend gloo \
-    --vllm_gpu_memory_utilization 0.5 \
+    --vllm_gpu_memory_utilization 0.3 \
     --vllm_enforce_eager \
     --gradient_checkpointing \
-    # --with_tracking
+    --single_gpu_mode \
+    --with_tracking
diff --git a/scripts/train/rlvr/grpo_mini_base_fast1.sh b/scripts/train/rlvr/grpo_mini_base1.sh
similarity index 97%
rename from scripts/train/rlvr/grpo_mini_base_fast1.sh
rename to scripts/train/rlvr/grpo_mini_base1.sh
index 3b3c8f0378..5affc37c1e 100644
--- a/scripts/train/rlvr/grpo_mini_base_fast1.sh
+++ b/scripts/train/rlvr/grpo_mini_base1.sh
@@ -1,5 +1,5 @@
 exp_name="base_smollm_grpo_${RANDOM}"
-python open_instruct/grpo_fast.py \
+python open_instruct/grpo.py \
     --exp_name $exp_name \
     --output_dir output/dummy \
     --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
diff --git a/scripts/train/rlvr/grpo_mini_base_fast1_test_oom.sh b/scripts/train/rlvr/grpo_mini_base1_test_oom.sh
similarity index 97%
rename from scripts/train/rlvr/grpo_mini_base_fast1_test_oom.sh
rename to scripts/train/rlvr/grpo_mini_base1_test_oom.sh
index aa06b4608b..bcfa02da97 100644
--- a/scripts/train/rlvr/grpo_mini_base_fast1_test_oom.sh
+++ b/scripts/train/rlvr/grpo_mini_base1_test_oom.sh
@@ -1,5 +1,5 @@
 exp_name="base_smollm_grpo_${RANDOM}"
-python open_instruct/grpo_fast.py \
+python open_instruct/grpo.py \
     --exp_name $exp_name \
     --output_dir /weka/oe-adapt-default/costah/models/$exp_name \
     --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
diff --git a/scripts/train/rlvr/grpo_mini_base_fast.sh b/scripts/train/rlvr/grpo_mini_base_fast.sh
deleted file mode 100644
index c16d7cafce..0000000000
--- a/scripts/train/rlvr/grpo_mini_base_fast.sh
+++ /dev/null
@@ -1,46 +0,0 @@
-exp_name="base_smollm_grpo_${RANDOM}"
-python open_instruct/grpo_fast.py \
-    --exp_name $exp_name \
-    --output_dir /weka/oe-adapt-default/costah/models/$exp_name \
-    --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
-    --dataset_mixer_list_splits train \
-    --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
-    --dataset_mixer_eval_list_splits train \
-    --max_token_length 256 \
-    --max_prompt_token_length 256 \
-    --response_length 128 \
-    --pack_length 2048 \
-    --number_samples_per_prompt 4 \
-    --model_name_or_path HuggingFaceTB/SmolLM2-135M \
-    --stop_strings "" \
-    --add_r1_style_format_reward \
-    --non_stop_penalty False \
-    --stop_token eos \
-    --penalty_reward_value 0.0 \
-    --temperature 0.7 \
-    --ground_truths_key ground_truth \
-    --chat_template_name r1_simple_chat_postpend_think \
-    --sft_messages_key messages \
-    --learning_rate 3e-7 \
-    --total_episodes 1000000 \
-    --deepspeed_stage 2 \
-    --per_device_train_batch_size 1 \
-    --local_rollout_forward_batch_size 1 \
-    --local_mini_batch_size 16 \
-    --local_rollout_batch_size 16 \
-    --num_epochs 1 \
-    --actor_num_gpus_per_node 1 \
-    --vllm_tensor_parallel_size 1 \
-    --beta 0.01 \
-    --apply_verifiable_reward true \
-    --seed 3 \
-    --local_eval_every 150 \
-    --save_freq 100 \
-    --reward_model_multiplier 0.0 \
-    --no_try_launch_beaker_eval_jobs \
-    --vllm_sync_backend gloo \
-    --vllm_gpu_memory_utilization 0.3 \
-    --vllm_enforce_eager \
-    --gradient_checkpointing \
-    --single_gpu_mode \
-    --with_tracking
diff --git a/scripts/train/rlvr/grpo_fast_mini copy.sh b/scripts/train/rlvr/grpo_mini_copy.sh
similarity index 97%
rename from scripts/train/rlvr/grpo_fast_mini copy.sh
rename to scripts/train/rlvr/grpo_mini_copy.sh
index 15bef52add..3300861dd7 100644
--- a/scripts/train/rlvr/grpo_fast_mini copy.sh
+++ b/scripts/train/rlvr/grpo_mini_copy.sh
@@ -1,5 +1,5 @@
 exp_name="base_smollm_grpo_${RANDOM}"
-python open_instruct/grpo_fast.py \
+python open_instruct/grpo.py \
     --exp_name $exp_name \
     --output_dir output/dummy \
     --dataset_mixer_list nouhad/multiplication_test_100_2x2 1.0 \
diff --git a/scripts/train/rlvr/grpo_fast_mini_old.sh b/scripts/train/rlvr/grpo_mini_old.sh
similarity index 97%
rename from scripts/train/rlvr/grpo_fast_mini_old.sh
rename to scripts/train/rlvr/grpo_mini_old.sh
index d07c51efe8..ce27d7c4f8 100644
--- a/scripts/train/rlvr/grpo_fast_mini_old.sh
+++ b/scripts/train/rlvr/grpo_mini_old.sh
@@ -1,5 +1,5 @@
 exp_name="base_smollm_grpo_${RANDOM}"
-python open_instruct/grpo_fast.py \
+python open_instruct/grpo.py \
     --exp_name $exp_name \
     --output_dir output/dummy \
     --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 1.0 \
diff --git a/scripts/train/rlvr/grpo_qwen_fast_2.5_7B_best.sh b/scripts/train/rlvr/grpo_qwen_2.5_7B_best.sh
similarity index 91%
rename from scripts/train/rlvr/grpo_qwen_fast_2.5_7B_best.sh
rename to scripts/train/rlvr/grpo_qwen_2.5_7B_best.sh
index 98f6293caf..1692b2ed67 100644
--- a/scripts/train/rlvr/grpo_qwen_fast_2.5_7B_best.sh
+++ b/scripts/train/rlvr/grpo_qwen_2.5_7B_best.sh
@@ -1,4 +1,4 @@
-exp_name="0302_qwen2.5_7B_math_grpo_fast1_${RANDOM}"
+exp_name="0302_qwen2.5_7B_math_grpo1_${RANDOM}"
 python mason.py \
     --cluster ai2/jupiter \
     --workspace ai2/tulu-3-dev \
@@ -7,11 +7,11 @@ python mason.py \
     --num_nodes 2 \
     --max_retries 0 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name 0302_qwen2.5_7B_math_grpo_fast1_1317 \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
+    --exp_name 0302_qwen2.5_7B_math_grpo1_1317 \
     --beta 0.0 \
     --number_samples_per_prompt 16 \
-    --output_dir /weka/oe-adapt-default/costah/models/0302_qwen2.5_7B_math_grpo_fast1_1317 \
+    --output_dir /weka/oe-adapt-default/costah/models/0302_qwen2.5_7B_math_grpo1_1317 \
     --oe_eval_tasks minerva_math::hamish_zs_reasoning,bbh:cot::hamish_zs_reasoning,gsm8k::hamish_zs_reasoning,minerva_math_500::hamish_zs_reasoning,zebralogic::hamish_zs_reasoning,aime::hamish_zs_reasoning,agi_eval_english:0shot_cot::hamish_zs_reasoning,gpqa:0shot_cot::hamish_zs_reasoning \
     --save_freq 40 \
     --no_try_launch_beaker_eval_jobs \
diff --git a/scripts/train/rlvr/judge_general_verifier.sh b/scripts/train/rlvr/judge_general_verifier.sh
index a83af210af..18879f82b0 100755
--- a/scripts/train/rlvr/judge_general_verifier.sh
+++ b/scripts/train/rlvr/judge_general_verifier.sh
@@ -20,7 +20,7 @@ python mason.py \
     --env HOSTED_VLLM_API_BASE=${JUDGE_BASE_URL} \
     --env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
     --exp_name 0906rl_judge_test_${RANDOM} \
     --dataset_mixer_list hamishivi/WebInstruct-verified-general-verifier-judge 1.0 \
     --dataset_mixer_list_splits train \
diff --git a/scripts/train/rlvr/valpy_if_grpo_fast.sh b/scripts/train/rlvr/valpy_if_grpo.sh
similarity index 97%
rename from scripts/train/rlvr/valpy_if_grpo_fast.sh
rename to scripts/train/rlvr/valpy_if_grpo.sh
index 5afa0a85f6..a57fafa5a2 100644
--- a/scripts/train/rlvr/valpy_if_grpo_fast.sh
+++ b/scripts/train/rlvr/valpy_if_grpo.sh
@@ -6,7 +6,7 @@ python mason.py \
     --preemptible \
     --num_nodes 2 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
     --exp_name valpy_if_multi_tulu3.1_8b_grpo \
     --beta 0.01 \
     --num_unique_prompts_rollout 48 \
diff --git a/scripts/train/tulu3/grpo_8b.sh b/scripts/train/tulu3/grpo_8b.sh
index 35bcfc4c27..08b7716ee6 100644
--- a/scripts/train/tulu3/grpo_8b.sh
+++ b/scripts/train/tulu3/grpo_8b.sh
@@ -6,12 +6,12 @@ python mason.py \
     --preemptible \
     --num_nodes 2 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_vllm_thread_ray_gtrl.py \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
     --exp_name tulu3.1_8b_grpo \
     --beta 0.01 \
-    --local_mini_batch_size 32 \
-    --number_samples_per_prompt 16 \
-    --local_rollout_batch_size 4 \
+    --num_unique_prompts_rollout 48 \
+    --num_samples_per_prompt_rollout 16 \
+    --try_launch_beaker_eval_jobs_on_weka \
     --kl_estimator kl3 \
     --learning_rate 5e-7 \
     --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
@@ -21,27 +21,25 @@ python mason.py \
     --max_token_length 2048 \
     --max_prompt_token_length 2048 \
     --response_length 2048 \
+    --pack_length 4096 \
     --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-DPO \
-    --non_stop_penalty \
-    --stop_token eos \
+    --apply_verifiable_reward True \
+    --non_stop_penalty True \
+    --non_stop_penalty_value 0.0 \
     --temperature 1.0 \
     --chat_template_name tulu \
     --total_episodes 2000000 \
-    --penalty_reward_value 0.0 \
     --deepspeed_stage 2 \
-    --per_device_train_batch_size 2 \
-    --local_rollout_forward_batch_size 2 \
-    --actor_num_gpus_per_node 4 8 \
+    --per_device_train_batch_size 1 \
+    --num_mini_batches 2 \
+    --num_learners_per_node 6 \
     --num_epochs 1 \
     --vllm_tensor_parallel_size 1 \
-    --vllm_num_engines 4 \
+    --vllm_num_engines 10 \
     --lr_scheduler_type constant \
     --apply_verifiable_reward true \
     --seed 1 \
-    --num_evals 100 \
+    --local_eval_every 25 \
     --save_freq 40 \
-    --reward_model_multiplier 0.0 \
-    --no_try_launch_beaker_eval_jobs \
-    --try_launch_beaker_eval_jobs_on_weka \
     --gradient_checkpointing \
     --with_tracking
\ No newline at end of file
diff --git a/scripts/train/tulu3/grpo_fast_8b_code_dpo.sh b/scripts/train/tulu3/grpo_8b_code_dpo.sh
similarity index 97%
rename from scripts/train/tulu3/grpo_fast_8b_code_dpo.sh
rename to scripts/train/tulu3/grpo_8b_code_dpo.sh
index 6cd0c1712a..c2db8fba40 100755
--- a/scripts/train/tulu3/grpo_fast_8b_code_dpo.sh
+++ b/scripts/train/tulu3/grpo_8b_code_dpo.sh
@@ -1,6 +1,6 @@
 base=DPO
 description="4 dataset code mix (ocr personas algorithm acecoder) on top of Tulu ${base}"
-exp_name=rlvr_tulu3.1_8b_${base}_grpo_fast_code
+exp_name=rlvr_tulu3.1_8b_${base}_grpo_code
 python mason.py \
     --cluster ai2/augusta \
     --image saurabhs/code \
@@ -11,7 +11,7 @@ python mason.py \
     --num_nodes 2 \
     --description "${description}" \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo_fast.py \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \
     --exp_name $exp_name \
     --beta 0.01 \
     --num_unique_prompts_rollout 48 \
diff --git a/scripts/train/tulu3/grpo_fast_8b_code_sft.sh b/scripts/train/tulu3/grpo_8b_code_sft.sh
similarity index 97%
rename from scripts/train/tulu3/grpo_fast_8b_code_sft.sh
rename to scripts/train/tulu3/grpo_8b_code_sft.sh
index 05d8cd7fc4..9cbd7247f8 100755
--- a/scripts/train/tulu3/grpo_fast_8b_code_sft.sh
+++ b/scripts/train/tulu3/grpo_8b_code_sft.sh
@@ -1,6 +1,6 @@
 base=SFT
 description="test of https://github.com/allenai/open-instruct/pull/631"
-exp_name=rlvr_tulu3.1_8b_${base}_grpo_fast_code
+exp_name=rlvr_tulu3.1_8b_${base}_grpo_code
 python mason.py \
     --cluster ai2/augusta \
     --image saurabhs/code_dev \
@@ -11,7 +11,7 @@ python mason.py \
     --num_nodes 4 \
     --description "${description}" \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo_fast.py \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \
     --exp_name $exp_name \
     --beta 0.01 \
     --num_unique_prompts_rollout 48 \
diff --git a/scripts/train/tulu3/grpo_fast_8b_single_node.sh b/scripts/train/tulu3/grpo_8b_single_node.sh
similarity index 95%
rename from scripts/train/tulu3/grpo_fast_8b_single_node.sh
rename to scripts/train/tulu3/grpo_8b_single_node.sh
index 4305eb875d..0fc6bc5e64 100644
--- a/scripts/train/tulu3/grpo_fast_8b_single_node.sh
+++ b/scripts/train/tulu3/grpo_8b_single_node.sh
@@ -6,8 +6,8 @@ python mason.py \
     --preemptible \
     --num_nodes 1 \
     --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name tulu3.1_8b_grpo_fast \
+    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo.py \
+    --exp_name tulu3.1_8b_grpo \
     --beta 0.01 \
     --num_unique_prompts_rollout 64 \
     --num_samples_per_prompt_rollout 16 \
diff --git a/scripts/train/tulu3/grpo_fast_8b.sh b/scripts/train/tulu3/grpo_fast_8b.sh
deleted file mode 100644
index 1e41937403..0000000000
--- a/scripts/train/tulu3/grpo_fast_8b.sh
+++ /dev/null
@@ -1,45 +0,0 @@
-python mason.py \
-    --cluster ai2/jupiter \
-    --workspace ai2/tulu-3-dev \
-    --priority high \
-    --image nathanl/open_instruct_auto --pure_docker_mode \
-    --preemptible \
-    --num_nodes 2 \
-    --budget ai2/oe-adapt \
-    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
-    --exp_name tulu3.1_8b_grpo_fast \
-    --beta 0.01 \
-    --num_unique_prompts_rollout 48 \
-    --num_samples_per_prompt_rollout 16 \
-    --try_launch_beaker_eval_jobs_on_weka \
-    --kl_estimator kl3 \
-    --learning_rate 5e-7 \
-    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
-    --dataset_mixer_list_splits train \
-    --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \
-    --dataset_mixer_eval_list_splits train \
-    --max_token_length 2048 \
-    --max_prompt_token_length 2048 \
-    --response_length 2048 \
-    --pack_length 4096 \
-    --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-DPO \
-    --apply_verifiable_reward True \
-    --non_stop_penalty True \
-    --non_stop_penalty_value 0.0 \
-    --temperature 1.0 \
-    --chat_template_name tulu \
-    --total_episodes 2000000 \
-    --deepspeed_stage 2 \
-    --per_device_train_batch_size 1 \
-    --num_mini_batches 2 \
-    --num_learners_per_node 6 \
-    --num_epochs 1 \
-    --vllm_tensor_parallel_size 1 \
-    --vllm_num_engines 10 \
-    --lr_scheduler_type constant \
-    --apply_verifiable_reward true \
-    --seed 1 \
-    --local_eval_every 25 \
-    --save_freq 40 \
-    --gradient_checkpointing \
-    --with_tracking
\ No newline at end of file