Add GPT-OSS 20B LoRA + GRPO finetuning recipe (#997)
Conversation
Rename grpo/ to grpo-math-reasoning/ for clarity, create a unified trl/Dockerfile with all TRL example dependencies, add a HyperPod EKS Dockerfile extension, and add a library-level README with test case table. This establishes the pattern for PR #997 and future TRL test cases to share a single base image rather than duplicating Dockerfiles.
KeitaW
left a comment
Review Batch 1/5 — Structure & Repository Hygiene
Rebase onto PR #1000 and reuse shared base image
Important — merge order: PR #1000 restructures the `trl/` directory (renames `grpo/` → `grpo-math-reasoning/`, creates a shared `trl/Dockerfile`, and adds a library-level `trl/README.md`). PR #1000 should merge first, then this PR should rebase and add `gpt-oss-lora-grpo/` alongside `grpo-math-reasoning/`, reusing the shared base image rather than duplicating dependencies.
After rebasing:
- Reuse the shared `trl/Dockerfile` — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., `hyperpod-elastic-agent`) can go in an extended Dockerfile.
- Share common scripts — `inference.py`, `eval.py`, and reward-function utilities could be generalized rather than duplicated.
- Add platform-specific subdirectories — `kubernetes/` for generic K8s manifests (including a standard `PyTorchJob` for vanilla EKS) and `hyperpod-eks/` for HyperPod-specific configs.
- Update the `trl/README.md` test case table — add a row for `gpt-oss-lora-grpo`.
Directory name typo: evalution/ → evaluation/
The evaluation directory is misspelled as evalution/ (missing an "a"). Rename to evaluation/ and update all references.
Missing license headers on most files
All Python files except sft.py, all YAML files, and build_push.sh lack the repo's required header:
```python
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
```
Note: sft.py has a HuggingFace Apache-2.0 header, not the repo's MIT-0.
Missing __init__.py in src/ directory
The src/ directory is used as a Python package (via PYTHONPATH) and evaluate_grpo.py imports from inference_g6e. Adding an __init__.py would make the package structure explicit.
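As a minimal sketch (directory layout and stub contents are hypothetical), adding `src/__init__.py` lets the import in `evaluate_grpo.py` become an explicit package import rather than relying on `PYTHONPATH` pointing inside the package:

```python
import os
import sys
import tempfile

# Hypothetical layout mirroring artifacts/src/, with an added __init__.py.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "src")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "inference_g6e.py"), "w") as f:
    # Stub standing in for the real module, for illustration only.
    f.write("def load_model(base_model_path, checkpoint=None):\n"
            "    return base_model_path, checkpoint\n")

sys.path.insert(0, root)
from src.inference_g6e import load_model  # explicit package import

print(load_model("gpt-oss-20b"))  # -> ('gpt-oss-20b', None)
```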
`3.test_cases/pytorch/trl/gpt-oss-lora-grpo/artifacts/src/configs/sft_lora.yaml` (resolved)
KeitaW
left a comment
Review Batch 2/5 — Deployment Pipeline (K8s / Shell)
GRPO and evaluation use plain Pod instead of PyTorchJob
train/train-grpo-multinode.yaml and evalution/eval-grpo.yaml use plain Pod specs. The GRPO manifest is named train-grpo-multinode.yaml but deploys a single pod named grpo-single — if multi-node GRPO is intended, a PyTorchJob or HyperPodPyTorchJob would handle torchrun coordination. If single-node is intentional, the filename is misleading.
envsubst calls without explicit variable whitelists
All envsubst calls in the README use the bare form (lines 83, 98, 152, 199-201, 249). Without a whitelist, envsubst replaces every $VAR pattern — including things like $(date +%Y%m%d_%H%M%S) in the eval manifest (which should be shell-expanded, not envsubst-expanded). Use explicit whitelists:
```bash
envsubst '$REGISTRY $IMAGE $TAG $HF_TOKEN' < train/fsx-storage-manager.yaml | kubectl apply -f -
```

Ref: envsubst pitfalls
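As a stdlib-only sketch of the behavior (the `envsubst` function below is illustrative, not the GNU gettext binary), a whitelist replaces only the listed variables and leaves every other `$` pattern, including `$(date ...)`, untouched for the shell to expand at runtime:

```python
import re

def envsubst(text, env, whitelist):
    """Replace only whitelisted $VARs; leave everything else verbatim."""
    def repl(match):
        name = match.group(1)
        if name in whitelist and name in env:
            return str(env[name])
        return match.group(0)  # keep non-whitelisted patterns untouched
    # $(date ...) never matches: "(" is not a word character.
    return re.sub(r"\$(\w+)", repl, text)

manifest = "image: $REGISTRY/$IMAGE:$TAG  # log: eval-$(date +%Y%m%d_%H%M%S).log"
env = {"REGISTRY": "123456789012.dkr.ecr.us-west-2.amazonaws.com",
       "IMAGE": "gpt-oss-lora-grpo", "TAG": "v1"}

out = envsubst(manifest, env, whitelist={"REGISTRY", "IMAGE", "TAG"})
print(out)  # $(date ...) survives for shell expansion at runtime
```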
`3.test_cases/pytorch/trl/gpt-oss-lora-grpo/artifacts/build_push.sh` (resolved, outdated)
`3.test_cases/pytorch/trl/gpt-oss-lora-grpo/inference/inference-g6e-base.yaml` (resolved, outdated)
KeitaW
left a comment
Review Batch 3/5 — Training Code Quality
trust_remote_code=True used without acknowledgment
trust_remote_code=True appears in every from_pretrained call across grpo_multinode.py, evaluate_grpo.py, inference_g6e.py, inference_grpo_new.py, and sft_teacher_data.py. This flag allows arbitrary code execution from the HuggingFace Hub. For GPT-OSS this may be necessary, but a brief comment (e.g., # Required: GPT-OSS model uses custom code on HF Hub) would acknowledge the security implication.
Bare except: clauses throughout
Multiple bare except: clauses catch all exceptions including KeyboardInterrupt and SystemExit:
- `grpo_multinode.py` (lines 87, 101, 135)
- `evaluate_grpo.py` (line 210)
- `inference_g6e.py` (line 114)
- `inference_grpo_new.py` (lines 86, 121)
Consider using except Exception: at minimum to allow clean interruption.
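A minimal self-contained illustration of the difference (not code from the PR): a bare `except:` catches `BaseException` subclasses such as `KeyboardInterrupt`, while `except Exception:` lets them propagate.

```python
def swallow_all():
    # Bare except: also catches BaseException subclasses like
    # KeyboardInterrupt and SystemExit, blocking clean shutdown.
    try:
        raise KeyboardInterrupt
    except:
        return "swallowed"

def swallow_errors_only():
    # except Exception: lets KeyboardInterrupt/SystemExit propagate.
    try:
        raise KeyboardInterrupt
    except Exception:
        return "swallowed"

print(swallow_all())  # -> swallowed
try:
    swallow_errors_only()
except KeyboardInterrupt:
    print("KeyboardInterrupt propagated")
```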
```python
    generate_fn = lambda m, t, p, lang: generate_response_grpo(m, t, p, reasoning_language=lang)
else:
    from inference_g6e import generate_response
    generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, show_full=True, reasoning_language=lang)
```
Bug: generate_response called with unsupported show_full parameter
inference_g6e.generate_response has signature generate_response(model, tokenizer, prompt, max_new_tokens=2048, reasoning_language=None) — it does not accept a show_full parameter. This will raise a TypeError at runtime.
```diff
- generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, show_full=True, reasoning_language=lang)
+ generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, reasoning_language=lang)
```
`3.test_cases/pytorch/trl/gpt-oss-lora-grpo/artifacts/src/evaluate_grpo.py` (resolved, outdated)
```python
    # Standard mode: use inference_g6e
    from inference_g6e import load_model
    print(f"Loading model (use_trained={args.use_trained})...")
    model, tokenizer = load_model(args.base_model, use_trained=args.use_trained, checkpoint_dir=args.checkpoint_dir)
```
Bug: load_model called with wrong keyword arguments
inference_g6e.load_model has signature load_model(base_model_path, checkpoint=None) — it does not accept use_trained or checkpoint_dir. This call will raise a TypeError at runtime.
The correct call would be something like:
```python
checkpoint_path = os.path.join(args.checkpoint_dir, "converted-peft/lora-checkpoint-1000-peft") if args.use_trained else None
model, tokenizer = load_model(args.base_model, checkpoint=checkpoint_path)
```
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 4/5 — Infrastructure & NCCL Configuration
Dockerfile runs as root
The Dockerfile does not include a USER directive, so the container runs as root. While this is common for GPU training containers (NCCL/EFA access often requires it), consider adding a comment explaining why root is needed.
```text
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0
```
Missing upper-bound version pins
All packages use only lower-bound pins (>=). Without upper bounds, future installs may pull incompatible major versions. Also, scipy, sentencepiece, and protobuf (lines 8-10) have no version pin at all.
Consider adding upper bounds for critical packages:
```diff
- torch>=2.4.0
- transformers>=4.48.0
- accelerate>=0.27.0
- datasets>=2.17.0
- peft>=0.9.0
- trl>=0.8.0
+ torch>=2.4.0,<3.0
+ transformers>=4.48.0,<5.0
+ accelerate>=0.27.0,<1.0
+ datasets>=2.17.0,<3.0
+ peft>=0.9.0,<1.0
+ trl>=0.8.0,<1.0
```
Additionally, the minimum PyTorch version should be bumped to torch>=2.6.0 — PyTorch 2.6 is the first release with official NVIDIA Blackwell (sm_100) support. The Dockerfile already builds NCCL tests with -gencode=arch=compute_100,code=sm_100 and uses a CUDA 12.9 base image, so the PyTorch floor should match:
```diff
- torch>=2.4.0
+ torch>=2.6.0,<3.0
  transformers>=4.48.0
  accelerate>=0.27.0
  datasets>=2.17.0
  peft>=0.9.0
  trl>=0.8.0
```
PyTorch 2.4 does not include Blackwell kernels and will either fail or fall back to slow codepaths on B-series instances.
```text
protobuf
flash-attn>=2.5.0
deepspeed>=0.14.0
hyperpod-elastic-agent
```
Duplicate hyperpod-elastic-agent install
hyperpod-elastic-agent is installed both here and separately in the Dockerfile (line 45: RUN pip install hyperpod-elastic-agent ...). This means it gets installed twice, and the versions could conflict. Consider removing it from one location.
```dockerfile
# 7. Copy Code & Set Entrypoint
COPY src/ /app/
ENV PYTHONPATH="${PYTHONPATH}:/app"
#CMD ["python", "sft.py", "--config", "configs/sft_lora.yaml"]
```
Remove commented-out CMD
Dead code. Since the K8s manifests provide the command, this commented line should be removed.
```diff
- #CMD ["python", "sft.py", "--config", "configs/sft_lora.yaml"]
```
KeitaW
left a comment
Review Batch 5/5 — Documentation Consistency
sft_lora.yaml config values overridden by training manifest
The YAML config sets per_device_train_batch_size: 4 and gradient_accumulation_steps: 4, but the training manifest overrides these (e.g., --per_device_train_batch_size=1). Similarly, output_dir: gpt-oss-20b-multilingual-reasoner is overridden to --output_dir=/fsx/checkpoints. While CLI args take precedence, aligning the config or adding comments noting which values are overridden at launch would reduce confusion.
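The precedence being described can be sketched with `argparse` (the parser below is illustrative of the pattern, not the recipe's actual code): the YAML values act as defaults, and the manifest's CLI flags override them.

```python
import argparse

# Values as they appear in sft_lora.yaml, used here as parser defaults.
config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "output_dir": "gpt-oss-20b-multilingual-reasoner",
}

parser = argparse.ArgumentParser()
parser.add_argument("--per_device_train_batch_size", type=int,
                    default=config["per_device_train_batch_size"])
parser.add_argument("--gradient_accumulation_steps", type=int,
                    default=config["gradient_accumulation_steps"])
parser.add_argument("--output_dir", default=config["output_dir"])

# Flags passed by the training manifest win over the YAML defaults;
# values not overridden fall through to the config.
args = parser.parse_args(["--per_device_train_batch_size=1",
                          "--output_dir=/fsx/checkpoints"])
print(args.per_device_train_batch_size,
      args.gradient_accumulation_steps,
      args.output_dir)  # -> 1 4 /fsx/checkpoints
```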
Files Reference table in README is incomplete
The table at README lines 270-284 omits several files: sft.py, sft_teacher_data.py, convert_grpo_checkpoint.py, build_push.sh, requirements.txt, Dockerfile, sft_lora.yaml, and env_vars.example. Consider expanding the table or noting that artifacts/ contains additional scripts.
Things That Look Great
- Excellent pipeline overview — the ASCII diagram showing Base → SFT → GRPO with corresponding inference endpoints makes the workflow immediately clear.
- Well-structured GRPO reward function — the three-component reward (answer language, reasoning language, brevity) with clear weights is easy to understand and modify.
- Good use of `HyperPodPyTorchJob` for SFT training — elastic scaling with `minReplicas`/`maxReplicas` and auto-resume tolerations is the right pattern for HyperPod EKS.
- `TestPromptCallback` in `grpo_multinode.py` — generating test outputs every 10 steps provides excellent training visibility.
- Correct use of `HF_HOME` throughout all manifests — no deprecated `TRANSFORMERS_CACHE`.
- EFA configuration is solid — `FI_PROVIDER=efa`, `FI_EFA_USE_DEVICE_RDMA=1`, proper `LD_LIBRARY_PATH` for aws-ofi-nccl.
- Self-contained test case with clear separation of concerns.
- Checkpoint conversion scripts are a thoughtful addition.
Suggested Priority
- Bugs (P0): Fix `load_model` and `generate_response` signature mismatches in `evaluate_grpo.py` — these will fail at runtime.
- Correctness (P1): Fix `env_vars.example` bash syntax, add `set -euo pipefail` to `build_push.sh`, remove `os.system("pip install langdetect")`, quote shell variables.
- Conventions (P2): Add license headers, fix directory typo (`evalution/`), update clone URL, fix relative link.
- Coordination (P3): Rebase onto PR #1000, reuse shared base image, add platform subdirectories.
```markdown
# GPT-OSS 20B Training & Inference Guide

This guide explains how to train the GPT-OSS 20B model with LoRA, then improve it using Group Relative Policy Optimization (GRPO) for better language compliance in reasoning and final answers. It is designed to be as simple as possible, requires no data preparation, and uses a container image. For further background information look at https://developers.openai.com/cookbook/articles/gpt-oss/fine-tune-transfomers
```
Possible typo in external URL
The URL ends with fine-tune-transfomers — missing an "r" (should probably be fine-tune-transformers). Worth verifying this is the correct URL path.
```markdown
### 6.2. Check results
```

```bash
kubectl logs eval-grpo --tail=50
```
evalution/ typo preserved in command
This matches the typo'd directory name so it currently works, but both should be fixed to evaluation/ together.
…#1000)

* Restructure trl/ test cases: shared base image and descriptive naming. Rename grpo/ to grpo-math-reasoning/ for clarity, create a unified trl/Dockerfile with all TRL example dependencies, add a HyperPod EKS Dockerfile extension, and add a library-level README with test case table. This establishes the pattern for PR #997 and future TRL test cases to share a single base image rather than duplicating Dockerfiles.
* Remove hyperpod-eks directory from trl/ test cases. HyperPod EKS support will be added separately when needed.
response:

the base image also includes them, cf. https://github.com/awslabs/awsome-distributed-training/blob/gpt-oss-lora-grpo/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile — could you please test to see if that works?
Purpose
Fine-tune GPT-OSS 20B with LoRA and GRPO for multilingual reasoning on Amazon SageMaker HyperPod EKS. Includes SFT training, GRPO reinforcement learning, inference, and evaluation scripts.
Changes
New test case.
Test Plan
Environment:
Test commands:
Test Results
Directory Structure
- Shared assets (`Dockerfile`, `README.md`, training scripts, configs) cover general setup.
- Platform subdirectories (`slurm/`, `kubernetes/`, `hyperpod-eks/`) contain service-specific launch instructions.

Checklist