Add GPT-OSS 20B LoRA + GRPO finetuning recipe #997

Open
mijanur132 wants to merge 8 commits into main from gpt-oss-lora-grpo

Conversation

@mijanur132 (Collaborator)

Purpose

Fine-tune GPT-OSS 20B with LoRA and GRPO for multilingual reasoning on Amazon SageMaker HyperPod EKS. Includes SFT training, GRPO reinforcement learning, inference, and evaluation scripts.

Changes

New test case.

Test Plan

Environment:

  • AWS Service: Amazon SageMaker HyperPod, EKS
  • Instance type: ml.g6e.12xlarge
  • Number of nodes: 4

Test commands:

Test Results

Directory Structure

3.test_cases/
└── pytorch/                  # e.g. pytorch, megatron, jax
    └── trl/                  # e.g. picotron, FSDP, megatron-lm
        └── gpt-oss/          # e.g. SmolLM-1.7B (may be omitted for single-model cases)
            ├── artifacts/    # source code, configs
            ├── README.md     # overview, prerequisites, usage
            ├── train/        # training manifests
            ├── inference/    # inference manifests
            └── evaluation/   # evaluation manifests

  • Top-level files (Dockerfile, README.md, training scripts, configs) cover general setup.
  • Subdirectories (slurm/, kubernetes/, hyperpod-eks/) contain service-specific launch instructions.
  • Not all service subdirectories are required — include only the ones relevant to your test case.

Checklist

  • [x] I have read the contributing guidelines.
  • [x] I am working against the latest main branch.
  • [x] I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • [x] The contribution is self-contained with documentation and scripts.
  • [x] External dependencies are pinned to a specific version or tag (no latest).
  • [x] A README is included or updated with prerequisites, instructions, and known issues.
  • [x] New test cases follow the expected directory structure.

@KeitaW KeitaW self-assigned this Feb 26, 2026
@KeitaW KeitaW added the "enhancement" (New feature or request) label Feb 26, 2026
KeitaW added a commit that referenced this pull request Feb 26, 2026
Rename grpo/ to grpo-math-reasoning/ for clarity, create a unified
trl/Dockerfile with all TRL example dependencies, add a HyperPod EKS
Dockerfile extension, and add a library-level README with test case
table. This establishes the pattern for PR #997 and future TRL test
cases to share a single base image rather than duplicating Dockerfiles.

@KeitaW KeitaW left a comment


Review Batch 1/5 — Structure & Repository Hygiene

Rebase onto PR #1000 and reuse shared base image

Important — merge order: PR #1000 restructures the trl/ directory (renames grpo/ to grpo-math-reasoning/, creates a shared trl/Dockerfile, and adds a library-level trl/README.md). PR #1000 should merge first, then this PR should rebase and add gpt-oss-lora-grpo/ alongside grpo-math-reasoning/, reusing the shared base image rather than duplicating dependencies.

After rebasing:

  1. Reuse the shared trl/Dockerfile — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., hyperpod-elastic-agent) can go in an extended Dockerfile.
  2. Share common scripts: inference.py, eval.py, and reward-function utilities could be generalized rather than duplicated.
  3. Add platform-specific subdirectories: kubernetes/ for generic K8s manifests (including a standard PyTorchJob for vanilla EKS) and hyperpod-eks/ for HyperPod-specific configs.
  4. Update the trl/README.md test case table — add a row for gpt-oss-lora-grpo.

Directory name typo: evalution/ → evaluation/

The evaluation directory is misspelled as evalution/ (missing an "a"). Rename to evaluation/ and update all references.

Missing license headers on most files

All Python files except sft.py, all YAML files, and build_push.sh lack the repo's required header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

Note: sft.py has a HuggingFace Apache-2.0 header, not the repo's MIT-0.

Missing __init__.py in src/ directory

The src/ directory is used as a Python package (via PYTHONPATH) and evaluate_grpo.py imports from inference_g6e. Adding an __init__.py would make the package structure explicit.
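As a quick illustration of why the imports resolve today even without __init__.py — with src/ on PYTHONPATH, sibling modules are found by bare name — here is a minimal, self-contained sketch. Only the file name inference_g6e.py mirrors the PR; the stub body and model string are invented:

```python
import os
import sys
import tempfile

# Recreate the PR's layout in a temp dir: a src/ directory on sys.path,
# containing a module that siblings import by bare name. The stub body
# below is invented for illustration.
root = tempfile.mkdtemp()
src = os.path.join(root, "src")
os.makedirs(src)
with open(os.path.join(src, "inference_g6e.py"), "w") as f:
    f.write("def load_model(base_model_path, checkpoint=None):\n"
            "    return base_model_path, checkpoint\n")

# Equivalent to the Dockerfile's ENV PYTHONPATH="${PYTHONPATH}:/app"
sys.path.insert(0, src)

import inference_g6e  # resolves even without src/__init__.py

model_path, ckpt = inference_g6e.load_model("gpt-oss-20b")
```

An empty src/__init__.py would not change this behavior, but it makes the package boundary explicit to readers and tooling.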


@KeitaW KeitaW left a comment


Review Batch 2/5 — Deployment Pipeline (K8s / Shell)

GRPO and evaluation use plain Pod instead of PyTorchJob

train/train-grpo-multinode.yaml and evalution/eval-grpo.yaml use plain Pod specs. The GRPO manifest is named train-grpo-multinode.yaml but deploys a single pod named grpo-single — if multi-node GRPO is intended, a PyTorchJob or HyperPodPyTorchJob would handle torchrun coordination. If single-node is intentional, the filename is misleading.

envsubst calls without explicit variable whitelists

All envsubst calls in the README use the bare form (lines 83, 98, 152, 199-201, 249). Without a whitelist, envsubst replaces every $VAR pattern — including things like $(date +%Y%m%d_%H%M%S) in the eval manifest (which should be shell-expanded, not envsubst-expanded). Use explicit whitelists:

envsubst '$REGISTRY $IMAGE $TAG $HF_TOKEN' < train/fsx-storage-manager.yaml | kubectl apply -f -

Ref: envsubst pitfalls
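To make the pitfall concrete, here is a small self-contained demonstration; the manifest fragment and the $NODE_NAME placeholder are invented for illustration, not taken from the PR's manifests:

```shell
# A stand-in manifest fragment; $NODE_NAME is meant to be expanded later
# (inside the pod), not at deploy time.
export IMAGE=my-registry/gpt-oss:v1
manifest='image: $IMAGE
logfile: /fsx/logs/$NODE_NAME.log'

# Bare envsubst: $NODE_NAME is not set in this shell, so it is silently
# replaced with an empty string, corrupting the manifest.
printf '%s\n' "$manifest" | envsubst

# Whitelisted envsubst: only $IMAGE is substituted; $NODE_NAME survives.
printf '%s\n' "$manifest" | envsubst '$IMAGE'
```

The whitelist argument is envsubst's SHELL-FORMAT parameter: only variables named in it are replaced, everything else passes through verbatim.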


@KeitaW KeitaW left a comment


Review Batch 3/5 — Training Code Quality

trust_remote_code=True used without acknowledgment

trust_remote_code=True appears in every from_pretrained call across grpo_multinode.py, evaluate_grpo.py, inference_g6e.py, inference_grpo_new.py, and sft_teacher_data.py. This flag allows arbitrary code execution from the HuggingFace Hub. For GPT-OSS this may be necessary, but a brief comment (e.g., # Required: GPT-OSS model uses custom code on HF Hub) would acknowledge the security implication.

Bare except: clauses throughout

Multiple bare except: clauses catch all exceptions including KeyboardInterrupt and SystemExit:

  • grpo_multinode.py (lines 87, 101, 135)
  • evaluate_grpo.py (line 210)
  • inference_g6e.py (line 114)
  • inference_grpo_new.py (lines 86, 121)

Consider using except Exception: at minimum to allow clean interruption.
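A minimal sketch of the suggested pattern; detect_language here is an invented stand-in for whatever call sits inside the try (e.g. langdetect in the reward functions), not the PR's actual code:

```python
def detect_language(text):
    # Invented stand-in for the real call; raises on bad input.
    if not text:
        raise ValueError("empty text")
    return "en"

def safe_detect(text):
    try:
        return detect_language(text)
    # `except Exception` still catches ordinary failures, but lets
    # KeyboardInterrupt and SystemExit propagate, so Ctrl-C and sys.exit()
    # can still stop the run cleanly.
    except Exception:
        return "unknown"
```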

generate_fn = lambda m, t, p, lang: generate_response_grpo(m, t, p, reasoning_language=lang)
else:
from inference_g6e import generate_response
generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, show_full=True, reasoning_language=lang)

Bug: generate_response called with unsupported show_full parameter

inference_g6e.generate_response has signature generate_response(model, tokenizer, prompt, max_new_tokens=2048, reasoning_language=None) — it does not accept a show_full parameter. This will raise a TypeError at runtime.

Suggested change
generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, show_full=True, reasoning_language=lang)
generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, reasoning_language=lang)

Collaborator Author

corrected.

# Standard mode: use inference_g6e
from inference_g6e import load_model
print(f"Loading model (use_trained={args.use_trained})...")
model, tokenizer = load_model(args.base_model, use_trained=args.use_trained, checkpoint_dir=args.checkpoint_dir)

Bug: load_model called with wrong keyword arguments

inference_g6e.load_model has signature load_model(base_model_path, checkpoint=None) — it does not accept use_trained or checkpoint_dir. This call will raise a TypeError at runtime.

The correct call would be something like:

checkpoint_path = os.path.join(args.checkpoint_dir, "converted-peft/lora-checkpoint-1000-peft") if args.use_trained else None
model, tokenizer = load_model(args.base_model, checkpoint=checkpoint_path)

Collaborator Author

corrected.


@KeitaW KeitaW left a comment


Review Batch 4/5 — Infrastructure & NCCL Configuration

Dockerfile runs as root

The Dockerfile does not include a USER directive, so the container runs as root. While this is common for GPU training containers (NCCL/EFA access often requires it), consider adding a comment explaining why root is needed.

Comment on lines +1 to +6
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0

Missing upper-bound version pins

All packages use only lower-bound pins (>=). Without upper bounds, future installs may pull incompatible major versions. Also, scipy, sentencepiece, and protobuf (lines 8-10) have no version pin at all.

Consider adding upper bounds for critical packages:

Suggested change
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0
torch>=2.4.0,<3.0
transformers>=4.48.0,<5.0
accelerate>=0.27.0,<1.0
datasets>=2.17.0,<3.0
peft>=0.9.0,<1.0
trl>=0.8.0,<1.0


Additionally, the minimum PyTorch version should be bumped to torch>=2.6.0 — PyTorch 2.6 is the first release with official NVIDIA Blackwell (sm_100) support. The Dockerfile already builds NCCL tests with -gencode=arch=compute_100,code=sm_100 and uses a CUDA 12.9 base image, so the PyTorch floor should match:

Suggested change
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0
torch>=2.6.0,<3.0

PyTorch 2.4 does not include Blackwell kernels and will either fail or fall back to slow codepaths on B-series instances.

Collaborator Author

corrected.

protobuf
flash-attn>=2.5.0
deepspeed>=0.14.0
hyperpod-elastic-agent

Duplicate hyperpod-elastic-agent install

hyperpod-elastic-agent is installed both here and separately in the Dockerfile (line 45: RUN pip install hyperpod-elastic-agent ...). This means it gets installed twice, and the versions could conflict. Consider removing it from one location.

Collaborator Author

corrected.

# 7. Copy Code & Set Entrypoint
COPY src/ /app/
ENV PYTHONPATH="${PYTHONPATH}:/app"
#CMD ["python", "sft.py", "--config", "configs/sft_lora.yaml"]

Remove commented-out CMD

Dead code. Since the K8s manifests provide the command, this commented line should be removed.

Suggested change
#CMD ["python", "sft.py", "--config", "configs/sft_lora.yaml"]

Collaborator Author

corrected.


@KeitaW KeitaW left a comment


Review Batch 5/5 — Documentation Consistency

sft_lora.yaml config values overridden by training manifest

The YAML config sets per_device_train_batch_size: 4 and gradient_accumulation_steps: 4, but the training manifest overrides these (e.g., --per_device_train_batch_size=1). Similarly, output_dir: gpt-oss-20b-multilingual-reasoner is overridden to --output_dir=/fsx/checkpoints. While CLI args take precedence, aligning the config or adding comments noting which values are overridden at launch would reduce confusion.
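The precedence behavior being flagged can be sketched as follows; the values mirror the ones cited above, but the merge helper is illustrative, not the PR's actual launcher code:

```python
import argparse

# Values from sft_lora.yaml, as cited in the review.
yaml_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "output_dir": "gpt-oss-20b-multilingual-reasoner",
}

parser = argparse.ArgumentParser()
parser.add_argument("--per_device_train_batch_size", type=int)
parser.add_argument("--output_dir")
# Flags as passed by the training manifest.
args = parser.parse_args(["--per_device_train_batch_size=1",
                          "--output_dir=/fsx/checkpoints"])

# CLI flags win wherever provided; the YAML value is only a fallback.
effective = dict(yaml_config)
effective.update({k: v for k, v in vars(args).items() if v is not None})
```

With this merge order, per_device_train_batch_size ends up as 1 and output_dir as /fsx/checkpoints, while gradient_accumulation_steps keeps its YAML value — exactly the silent divergence the comment asks to document.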

Files Reference table in README is incomplete

The table at README lines 270-284 omits several files: sft.py, sft_teacher_data.py, convert_grpo_checkpoint.py, build_push.sh, requirements.txt, Dockerfile, sft_lora.yaml, and env_vars.example. Consider expanding the table or noting that artifacts/ contains additional scripts.


Things That Look Great

  • Excellent pipeline overview — the ASCII diagram showing Base → SFT → GRPO with corresponding inference endpoints makes the workflow immediately clear.
  • Well-structured GRPO reward function — the three-component reward (answer language, reasoning language, brevity) with clear weights is easy to understand and modify.
  • Good use of HyperPodPyTorchJob for SFT training — elastic scaling with minReplicas/maxReplicas and auto-resume tolerations is the right pattern for HyperPod EKS.
  • TestPromptCallback in grpo_multinode.py — generating test outputs every 10 steps provides excellent training visibility.
  • Correct use of HF_HOME throughout all manifests — no deprecated TRANSFORMERS_CACHE.
  • EFA configuration is solid: FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1, proper LD_LIBRARY_PATH for aws-ofi-nccl.
  • Self-contained test case with clear separation of concerns.
  • Checkpoint conversion scripts are a thoughtful addition.

Suggested Priority

  1. Bugs (P0): Fix load_model and generate_response signature mismatches in evaluate_grpo.py — these will fail at runtime.
  2. Correctness (P1): Fix env_vars.example bash syntax, add set -euo pipefail to build_push.sh, remove os.system("pip install langdetect"), quote shell variables.
  3. Conventions (P2): Add license headers, fix directory typo (evalution/), update clone URL, fix relative link.
  4. Coordination (P3): Rebase onto PR #1000, reuse shared base image, add platform subdirectories.

@@ -0,0 +1,278 @@
# GPT-OSS 20B Training & Inference Guide

This guide explains how to train the GPT-OSS 20B model with LoRA, then improve it using Group Relative Policy Optimization (GRPO) for better language compliance in reasoning and final answers. It is designed to be as simple as possible, requires no data preparation, and uses a container image. For further background information look at https://developers.openai.com/cookbook/articles/gpt-oss/fine-tune-transfomers

Possible typo in external URL

The URL ends with fine-tune-transfomers — missing an "r" (should probably be fine-tune-transformers). Worth verifying this is the correct URL path.

Collaborator Author

corrected

### 6.2. Check results

```bash
kubectl logs eval-grpo --tail=50
```

evalution/ typo preserved in command

This matches the typo'd directory name so it currently works, but both should be fixed to evaluation/ together.

Collaborator Author

corrected.


@KeitaW KeitaW left a comment


Left some comments.

@KeitaW KeitaW assigned mijanur132 and unassigned KeitaW Feb 26, 2026
mijanur132 pushed a commit that referenced this pull request Feb 27, 2026
…#1000)

* Restructure trl/ test cases: shared base image and descriptive naming

Rename grpo/ to grpo-math-reasoning/ for clarity, create a unified
trl/Dockerfile with all TRL example dependencies, add a HyperPod EKS
Dockerfile extension, and add a library-level README with test case
table. This establishes the pattern for PR #997 and future TRL test
cases to share a single base image rather than duplicating Dockerfiles.

* Remove hyperpod-eks directory from trl/ test cases

HyperPod EKS support will be added separately when needed.
@mijanur132 (Collaborator Author)

  1. Reuse the shared trl/Dockerfile — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., hyperpod-elastic-agent) can go in an extended Dockerfile.

response:
This recipe intentionally uses a separate Dockerfile because it requires the AWS Deep Learning Container base image (public.ecr.aws/deep-learning-containers/base:12.9.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.31) which comes with EFA drivers, aws-ofi-nccl, and CUDA 12.9 pre-installed — these are needed for HyperPod EKS elastic training with EFA networking. The shared trl/Dockerfile uses public.ecr.aws/hpc-cloud/nccl-tests:latest as its base, which has a different set of pre-installed libraries and driver versions. Switching base images could introduce compatibility issues with EFA and the HyperPod elastic agent. I've added a comment in the Dockerfile noting the shared alternative for visibility.

@KeitaW (Collaborator)

KeitaW commented Feb 27, 2026

  1. Reuse the shared trl/Dockerfile — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., hyperpod-elastic-agent) can go in an extended Dockerfile.

response: This recipe intentionally uses a separate Dockerfile because it requires the AWS Deep Learning Container base image (public.ecr.aws/deep-learning-containers/base:12.9.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.31) which comes with EFA drivers, aws-ofi-nccl, and CUDA 12.9 pre-installed — these are needed for HyperPod EKS elastic training with EFA networking. The shared trl/Dockerfile uses public.ecr.aws/hpc-cloud/nccl-tests:latest as its base, which has a different set of pre-installed libraries and driver versions. Switching base images could introduce compatibility issues with EFA and the HyperPod elastic agent. I've added a comment in the Dockerfile noting the shared alternative for visibility.

The base image also includes them (cf. https://github.com/awslabs/awsome-distributed-training/blob/gpt-oss-lora-grpo/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile). Could you please test to see if that works?
