Conversation

@mnoukhov (Contributor) commented Nov 4, 2025

All integration tests passing.

True dynamic sampling, aka active sampling: we can filter out zero-std prompts and continuously sample until we have a full batch_size of non-zero-std prompts.

major changes:

  • moves the reward function into accumulate_inference_batches, simplifying the workflow while making it more fully async
  • active sampling now always waits to collect the exact batch size; there is no max_retries

minor changes:

  • we now call the reward function for every sample, so I've made its Timer logging a no-op and downgraded the reward logging to debug so that it doesn't clutter the logs.
  • extended the Batch dataclass to include all the fields necessary for a training batch. Alternatively, we could add a new dataclass with exactly the fields needed for a training batch and rename the current Batch to GenerationBatch.
  • episode now refers to a training episode (number of samples we train on) instead of a generation episode (number of samples we've generated), since everything runs completely async. We could move it back to a good approximation of a generation episode, but different runs would no longer have synced episode numbers for steps/updates.
  • reward_fn no longer takes a Batch, which didn't make much sense since we only used ground_truths and datasets. Instead, we match ppo.py in explicitly passing ground_truths and datasets (see the sketch after this list).
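A hedged sketch of the new call shape: reward_fn receives ground_truths and datasets explicitly instead of a Batch. Every name below (the toy verifier, the argument values) is illustrative, not the actual open-instruct implementation.

import asyncio

async def reward_fn(responses, decoded_responses, ground_truths, datasets,
                    finish_reasons, request_info, queries=None):
    # Toy verifier: score 1.0 when the decoded response contains the ground truth.
    scores = [float(gt in resp) for resp, gt in zip(decoded_responses, ground_truths)]
    reward_metrics = {"verifiable_reward_mean": sum(scores) / max(len(scores), 1)}
    return scores, reward_metrics

scores, reward_metrics = asyncio.run(
    reward_fn(
        responses=[[101, 102, 103]],          # token ids
        decoded_responses=["the answer is 42"],
        ground_truths=["42"],
        datasets=["gsm8k"],
        finish_reasons=["stop"],
        request_info=None,
    )
)
print(scores, reward_metrics)  # [1.0] {'verifiable_reward_mean': 1.0}

Passing the two lists directly keeps grpo_fast aligned with ppo.py's calling convention and avoids building a mostly-empty Batch just to carry two fields.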

Note

Adds active sampling with per-sample reward computed during accumulation, extends Batch and metrics, updates training/eval flow, tests, and scripts.

  • Core RL changes:
    • Introduce Args.active_sampling (replaces fill_completions) to keep sampling until batches have non-zero-std rewards; assert async_steps > 1.
    • Move reward computation into accumulate_inference_batches; now returns (GenerationResult, Batch, reward_metrics, BatchStatistics).
    • Compute and filter zero-std prompts during accumulation; track filtered counts in BatchStatistics (see the sketch after this list).
    • Update truncated completion masking to filter responses and align all related tensors in-place.
  • Data structures & APIs:
    • Extend model_utils.Batch with decoded_responses and scores.
    • Add BatchStatistics dataclass for prompt/response lengths and filtering stats.
    • Change apply_verifiable_reward and reward_fn signatures to accept ground_truths and datasets instead of Batch.
    • Add utils.combine_reward_metrics to aggregate per-prompt reward metrics.
    • Compute Args.max_possible_score for solved/unsolved stats.
  • Metrics & logging:
    • Log aggregated reward metrics and batch filtering stats; include time/reward and updated "real/unsolved batch size" ratios.
    • Evaluation uses eval_batch.scores and eval_batch.decoded_responses directly.
  • Training loop:
    • Main thread refills prompts based on num_filtered_prompts; adjusts episode counting and packing outputs accordingly.
    • load_data_from_packing_thread now returns num_filtered_prompts.
  • Tests:
    • Update tests to provide tokenizer/reward_fn and to validate new accumulation outputs and batch sizing.
  • Scripts:
    • Enable --active_sampling and set --async_steps in debug/integration scripts; minor parameter tweaks.
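Below is a minimal, hedged sketch of the active-sampling accumulation described in the summary above. get_next_result, score_prompt, and batch_size are illustrative stand-ins, not the actual accumulate_inference_batches internals.

import numpy as np

def accumulate_non_zero_std(get_next_result, score_prompt, batch_size):
    """Keep pulling prompt results until batch_size informative prompts are kept."""
    kept, num_filtered_prompts = [], 0
    while len(kept) < batch_size:
        prompt_result = get_next_result()     # one prompt with k sampled responses
        scores = score_prompt(prompt_result)  # k per-response rewards
        if np.std(scores) == 0:               # identical rewards -> zero advantage, no signal
            num_filtered_prompts += 1
            continue
        kept.append((prompt_result, scores))
    return kept, num_filtered_prompts

# Toy usage: even-numbered prompts get uniform scores and are filtered out.
stream = iter(range(100))
kept, filtered = accumulate_non_zero_std(
    get_next_result=lambda: next(stream),
    score_prompt=lambda r: [0.0, 0.0] if r % 2 == 0 else [0.0, 1.0],
    batch_size=4,
)
print(len(kept), filtered)  # 4 kept prompts, 4 filtered

The loop only returns once batch_size informative prompts have been collected, which is why there is no max_retries; the discarded count is what feeds num_filtered_prompts and the batch filtering stats.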

Written by Cursor Bugbot for commit 9c5ba65. This will update automatically on new commits.

@mnoukhov marked this pull request as ready for review November 7, 2025 17:52
raw_queries: list[str] | None
decoded_responses: list[str] | None
indices: list[int] | None
scores: list[float] | None

Bug: Batch Slicing: Incomplete Data Causes Errors

The __getitem__ method in the Batch dataclass doesn't include the newly added decoded_responses and scores fields when creating sliced/indexed batches. This causes a TypeError because the Batch constructor requires all fields, but slicing operations only pass the original fields and omit the new ones.
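A field-agnostic __getitem__ along the following lines would avoid dropping newly added fields. This is a sketch built against the field list quoted above, not the repository's actual fix.

from dataclasses import dataclass, fields

@dataclass
class Batch:
    queries: list | None
    ground_truths: list | None
    datasets: list | None
    raw_queries: list[str] | None
    decoded_responses: list[str] | None
    indices: list[int] | None
    scores: list[float] | None

    def __getitem__(self, key):
        # Slice every field that is present; leave None fields as None, so any
        # field added to the dataclass later is forwarded automatically.
        return Batch(**{
            f.name: (getattr(self, f.name)[key] if getattr(self, f.name) is not None else None)
            for f in fields(self)
        })

batch = Batch(queries=[[1], [2]], ground_truths=["a", "b"], datasets=["d1", "d2"],
              raw_queries=["q1", "q2"], decoded_responses=["r1", "r2"],
              indices=[0, 1], scores=[0.0, 1.0])
print(batch[0:1].scores)  # [0.0], including the newly added field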


Commits pushed:
  • move weight sync directly after update
  • episode now refers to "training episode", not "generation episode" as previously
  • becomes the same as reward when num_responses_per_prompt is 1
  • just because cursor keeps complaining

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant refactoring to simplify active sampling. The reward calculation is now integrated into accumulate_inference_batches, which streamlines the data preparation thread. Active sampling is also improved to continuously sample until a full batch is collected. My review focuses on improving code clarity and maintainability. I've suggested renaming a confusing variable and refactoring a function call to make the data flow more consistent. Overall, the changes are a good improvement to the codebase.

Comment on lines 1711 to 1729
scores, reward_metrics = asyncio.run(
    reward_fn(
        result.responses,
        decoded_responses,
        # note that you only need ground_truths and datasets for the reward model
        Batch(
            queries=None,
            ground_truths=k_ground_truths,
            datasets=k_datasets,
            raw_queries=None,
            decoded_responses=None,
            indices=None,
            scores=None,
        ),
        result.finish_reasons,
        result.request_info,
        k_raw_queries,
    )
)

Severity: medium

The way reward_fn is called with a partially constructed Batch object and k_raw_queries as a separate argument is a bit inconsistent. The Batch dataclass has a raw_queries field, which is set to None here, while the actual queries are passed separately.

For better code clarity and data flow consistency, I suggest including k_raw_queries in the Batch object. This would look like:

scores, reward_metrics = asyncio.run(
    reward_fn(
        result.responses,
        decoded_responses,
        Batch(
            queries=None,
            ground_truths=k_ground_truths,
            datasets=k_datasets,
            raw_queries=k_raw_queries,
            decoded_responses=None,
            indices=None,
            scores=None,
        ),
        result.finish_reasons,
        result.request_info,
    )
)

This would require a small change in make_reward_fn to use batch.raw_queries instead of the separate queries argument, and you could then remove the queries argument from reward_fn's signature.

Commits pushed:
  • makes grpo and ppo reward functions the same
  • we now return k repeats of a prompt, not just 1 in the batch
)

# Filter out zero std prompts
if filter_zero_std_samples and np.array(scores).std() == 0:
Collaborator:

We can just do np.std(scores) without the array!
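For example, both forms compute the same value on a plain list of floats, so the array conversion is unnecessary:

import numpy as np

scores = [0.0, 1.0, 1.0]
assert np.std(scores) == np.array(scores).std()  # both ≈ 0.4714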

Contributor Author:

done!

eval_result.request_info,
)
)
# eval_decoded_responses = tokenizer.batch_decode(eval_result.responses, skip_special_tokens=True)
Collaborator:

Why comment these out?

Contributor Author:

This should be deleted, not commented out; it was removed in a later commit. We're doing all of this in accumulate_inference_batches now.

decoded_responses = tokenizer.batch_decode(result.responses, skip_special_tokens=True)

k_queries = [query for _ in range(generation_config.n)]
k_ground_truths = [ground_truth for _ in range(generation_config.n)]
Collaborator:

Can you use the repeat_each function here? like we do in packing!
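For reference, a hedged sketch of what the suggestion amounts to. It assumes repeat_each(seq, n) repeats each element of seq n times back to back (the same effect as the comprehensions above); the helper is defined locally here for illustration, while the real one lives in the repo's utils.

def repeat_each(seq, n):
    # Repeat each element of seq n times, preserving order.
    return [item for item in seq for _ in range(n)]

n = 2  # stands in for generation_config.n
query, ground_truth, dataset, raw_query = [101, 102], "42", "gsm8k", "what is 6*7?"

k_queries = repeat_each([query], n)
k_ground_truths = repeat_each([ground_truth], n)
k_datasets = repeat_each([dataset], n)
k_raw_queries = repeat_each([raw_query], n)
print(k_ground_truths)  # ['42', '42']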

Contributor Author:

done!

k_datasets = [dataset for _ in range(generation_config.n)]
k_raw_queries = [raw_query for _ in range(generation_config.n)]

# with Timer("💰 [Data Preparation Thread] Calculating rewards and advantages"):
Collaborator:

Delete it!

Contributor Author:

done!

mnoukhov and others added 5 commits November 7, 2025 16:16
Co-authored-by: Finbarr Timbers <finbarrtimbers@gmail.com>
@finbarrtimbers self-requested a review November 7, 2025 21:32
@mnoukhov added this pull request to the merge queue Nov 9, 2025
Merged via the queue into main with commit da87d77 Nov 9, 2025
4 checks passed