
Conversation

kohankhaki
Collaborator

  • This pull request introduces a new distributed finetuning template for LLMs, enabling scalable training with either DDP or FSDP, orchestrated by Hydra and Submitit. It adds a complete configuration, launch, and training pipeline, along with documentation and a compute config for multi-GPU training (see the sketch after this list).
  • Added a Slurm compute configuration (bon_echo/a40_4x.yaml) for running jobs on 4x A40 GPU nodes, including resource and partition settings (an illustrative version follows below).
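
To give a rough sense of how these pieces fit together, here is a minimal sketch of a Hydra entry point that could be dispatched through Submitit and wraps a model in either DDP or FSDP. Everything here is an assumption for illustration, not the template's actual API: `build_model`, `train_loop`, the `cfg.strategy` key, and the config paths are all hypothetical stand-ins.

```python
# Sketch only: Hydra entry point, launched via the Submitit Slurm launcher,
# selecting DDP or FSDP. Helpers and config keys below are hypothetical.
import os

import hydra
import torch
import torch.distributed as dist
from omegaconf import DictConfig
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.parallel import DistributedDataParallel as DDP


def build_model(cfg: DictConfig) -> torch.nn.Module:
    # Stand-in for the template's real model factory.
    return torch.nn.Linear(16, 16)


def train_loop(model: torch.nn.Module, cfg: DictConfig) -> None:
    # Stand-in for the template's real training loop.
    pass


@hydra.main(config_path="configs", config_name="finetune", version_base=None)
def main(cfg: DictConfig) -> None:
    # One process per GPU; the launcher sets RANK, LOCAL_RANK, WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model(cfg).cuda()
    if cfg.strategy == "fsdp":
        # FSDP shards parameters, gradients, and optimizer state across ranks.
        model = FSDP(model)
    else:
        # DDP replicates the model per rank and all-reduces gradients.
        model = DDP(model, device_ids=[local_rank])

    train_loop(model, cfg)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With Hydra's Submitit plugin installed, a launch might then look like `python finetune.py -m hydra/launcher=submitit_slurm` (again, the script name and config group are assumptions, not the template's documented commands).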
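
A compute config in the spirit of bon_echo/a40_4x.yaml might look like the following. Every key here is illustrative, mirroring the Hydra Submitit launcher's common Slurm parameters rather than the file's actual contents.

```yaml
# Illustrative only; the real bon_echo/a40_4x.yaml may use different keys.
hydra:
  launcher:
    nodes: 1
    gpus_per_node: 4
    tasks_per_node: 4      # one process per GPU
    cpus_per_task: 8
    mem_gb: 128
    partition: a40
    timeout_min: 240
```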

@kohankhaki kohankhaki requested a review from jwilles September 18, 2025 14:02