Support the packing algorithm to speed up training #1

Open
JasonCZH4 wants to merge 6 commits into wxhcore:main from JasonCZH4:main
Conversation

@JasonCZH4

No description provided.

JasonCZH4 and others added 2 commits February 7, 2026 14:47
Removed commented-out SFTDataset class and its methods.
@wxhcore
Owner

wxhcore commented Feb 8, 2026

@JasonCZH4 Thank you very much for submitting a PR to this project! I think adding the sequence packing feature is a great optimization that can significantly improve GPU utilization and training efficiency. After reviewing the newly added code, I have the following suggestions:

  1. Several utility functions have been added in dataset.py; could we move these utility functions into a new file for unified management?

  2. I notice that only SFTDataset has been modified. Considering potential future support for other dataset types, I suggest using inheritance or the strategy pattern for extensibility. Alternatively, if you have a better approach, you could also refer to the implementation in supervised.py from LLaMA-Factory.

  3. Actually, your current code doesn’t work correctly: in base_trainer.py, the loss computation explicitly passes input_ids and attention_mask, but your code doesn’t pass position_ids to the model, so it may need to be changed to pass the batch using **batch.

  4. I see you’ve added two new parameters, so the corresponding YAML config and argparse setup likely need to be updated to ensure the parameters are passed correctly.

  5. Since I haven’t integrated CI on GitHub yet, I’ll need you to run the existing tests and add unit or functional tests for the new feature to ensure everything works as expected.
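
To make point 3 concrete, here is a minimal sketch of the difference. The function names, batch keys, and dummy model are illustrative assumptions, not the actual base_trainer.py code:

```python
class DummyModel:
    """Stand-in model that records which keyword arguments it receives."""
    def __call__(self, **kwargs):
        self.received = set(kwargs)
        return type("Out", (), {"loss": 0.0})()

batch = {"input_ids": [1, 2], "attention_mask": [1, 1],
         "labels": [1, 2], "position_ids": [0, 1]}
model = DummyModel()

# Before: position_ids is silently dropped, so packed samples
# are not isolated from each other during attention.
model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
assert "position_ids" not in model.received

# After: forwarding the whole batch passes every collated field,
# including position_ids, through to the model.
model(**batch)
assert "position_ids" in model.received
```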

Finally, thank you again for contributing to this project!

@JasonCZH4
Author

Thanks for reviewing! This is my first time submitting a PR. There are still many issues with the current code, and I am working hard to fix them. In fact, I am currently running a number of tests; once everything is OK, I will let you know. Thanks again!

@wxhcore
Owner

wxhcore commented Feb 8, 2026

Thanks for reviewing! This is my first time submitting a PR. There are still many issues with the current code, and I am working hard to fix them. In fact, I am currently running a number of tests; once everything is OK, I will let you know. Thanks again!

Thank you very much for your efforts! This is also BumbleCore's first PR, so I attach great importance to it. Wish you all the best!

Owner

@wxhcore wxhcore left a comment


Thank you for your contribution! We still have a few issues to address. Additionally, have you compared the training performance of the same model and training configuration with and without sequence packing, to ensure that samples are properly isolated?

action="store_true",
default=cfg.get("enable_gradient_checkpointing", False),
)
parser.add_argument("--packing", type=bool, default=cfg.get("packing", True))
Owner


When type=bool is used, it doesn't work as expected: argparse calls bool() on the raw string, and any non-empty string (including "False") evaluates to True.
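
A quick illustration of the pitfall and one common fix. The `str2bool` helper below is just a sketch; `argparse.BooleanOptionalAction` (Python 3.9+) is another option:

```python
import argparse

# Pitfall: argparse applies bool() to the raw string, and
# bool("False") is True because the string is non-empty.
broken = argparse.ArgumentParser()
broken.add_argument("--packing", type=bool, default=True)
print(broken.parse_args(["--packing", "False"]).packing)  # True, not False

# One common fix: parse the string explicitly.
def str2bool(v: str) -> bool:
    if v.lower() in ("yes", "true", "t", "1"):
        return True
    if v.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"invalid boolean value: {v!r}")

fixed = argparse.ArgumentParser()
fixed.add_argument("--packing", type=str2bool, default=True)
print(fixed.parse_args(["--packing", "False"]).packing)  # False
```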

def _is_dist():
return dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1

class SFTDataset(Dataset):
Owner


Here are three suggestions for the sequence packing feature:

  1. Encapsulate the logic into a PackingMixin that can be optionally inherited by SFTDataset, PretrainDataset, etc.

  2. If it’s only needed for SFT, create a new class that inherits from SFTDataset and overrides the relevant methods.

  3. Alternatively, you could introduce a new PackingSFTDataset that inherits from both PackingMixin and SFTDataset—this would be a clean and elegant way to compose the functionality without modifying existing code.
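
A minimal sketch of option 3. The class names come from the suggestion above, but the greedy packing logic, the sample fields, and the plain-class base (used instead of torch.utils.data.Dataset to keep the sketch self-contained) are all illustrative assumptions:

```python
class SFTDataset:
    """Simplified stand-in for the existing dataset: one tokenized sample per item."""
    def __init__(self, samples):
        self.samples = samples  # each: {"input_ids": [...], "labels": [...]}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


class PackingMixin:
    """Greedily packs consecutive samples into sequences of at most cutoff_len tokens."""
    @staticmethod
    def pack(samples, cutoff_len):
        packed, ids, labels, pos = [], [], [], []
        for s in samples:
            if ids and len(ids) + len(s["input_ids"]) > cutoff_len:
                packed.append({"input_ids": ids, "labels": labels, "position_ids": pos})
                ids, labels, pos = [], [], []
            ids = ids + s["input_ids"]
            labels = labels + s["labels"]
            # position_ids restart at 0 for each original sample, which is
            # what lets the model keep packed samples isolated.
            pos = pos + list(range(len(s["input_ids"])))
        if ids:
            packed.append({"input_ids": ids, "labels": labels, "position_ids": pos})
        return packed


class PackingSFTDataset(PackingMixin, SFTDataset):
    """Composes packing with the SFT dataset without modifying either class."""
    def __init__(self, samples, cutoff_len):
        super().__init__(samples)
        self.samples = self.pack(self.samples, cutoff_len)
```

With this layout, PretrainDataset could later gain packing the same way, by mixing in PackingMixin rather than duplicating the logic.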

@wxhcore wxhcore self-assigned this Feb 13, 2026
@JasonCZH4
Author

Additionally, have you compared the training performance of the same model and training configuration with and without sequence packing, to ensure that samples are properly isolated?

Thanks for the review! I have tested the performance on Qwen2.5-0.5B-Instruct using the following configuration:

dataset_path: Magpie-Align/Magpie-Pro-300K-Filtered (randomly select 10K to save time)        
cutoff_len: 8192

# Training
num_epochs: 3.0
learning_rate: 2e-5                             
weight_decay: 0.01
lr_scheduler_type: "cosine"                            
enable_gradient_checkpointing: true                   
warmup_ratio: 0.1
packing: true
packing_num_proc: 4

# Batch Size
train_micro_batch_size_per_gpu: 1          
gradient_accumulation_steps: 8             

# Precision
train_model_precision: "bf16"                   

# DeepSpeed
deepspeed_config_path: "./configs/deepspeed/ds_z3_config.json"  

average_tokens_across_devices: true
  

I validated the models using lm-eval and got the following results.
[image: lm-eval evaluation results]

Compared to training without packing, packing gives a slight improvement in model performance and a significant reduction in training time.

Next I will move on to implementing the packing feature via inheritance. Thanks for reviewing!
