Add GPT-OSS 20B LoRA + GRPO finetuning recipe #997

Open
mijanur132 wants to merge 8 commits into main from gpt-oss-lora-grpo

Conversation

@mijanur132 (Collaborator)

Purpose

Fine-tune GPT-OSS 20B with LoRA and GRPO for multilingual reasoning on Amazon SageMaker HyperPod EKS. Includes SFT training, GRPO reinforcement learning, inference, and evaluation scripts.

Changes

New test case.

Test Plan

Environment:

  • AWS Service: Amazon SageMaker HyperPod, EKS
  • Instance type: ml.g6e.12xlarge
  • Number of nodes: 4

Test commands:

Test Results

Directory Structure

3.test_cases/
└── pytorch/                  # e.g. pytorch, megatron, jax
    └── trl/                  # e.g. picotron, FSDP, megatron-lm
        └── gpt-oss/          # e.g. SmolLM-1.7B (may be omitted for single-model cases)
            ├── artifacts/    # source code, configs
            ├── README.md     # overview, prerequisites, usage
            ├── train/        # training manifests
            ├── inference/    # inference manifests
            └── evaluation/   # evaluation manifests

  • Top-level files (Dockerfile, README.md, training scripts, configs) cover general setup.
  • Subdirectories (slurm/, kubernetes/, hyperpod-eks/) contain service-specific launch instructions.
  • Not all service subdirectories are required — include only the ones relevant to your test case.

Checklist

  • [x] I have read the contributing guidelines.
  • [x] I am working against the latest main branch.
  • [x] I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • [x] The contribution is self-contained with documentation and scripts.
  • [x] External dependencies are pinned to a specific version or tag (no latest).
  • [x] A README is included or updated with prerequisites, instructions, and known issues.
  • [x] New test cases follow the expected directory structure.

@KeitaW KeitaW self-assigned this Feb 26, 2026
@KeitaW KeitaW added the "enhancement" (New feature or request) label Feb 26, 2026
KeitaW added a commit that referenced this pull request Feb 26, 2026
Rename grpo/ to grpo-math-reasoning/ for clarity, create a unified
trl/Dockerfile with all TRL example dependencies, add a HyperPod EKS
Dockerfile extension, and add a library-level README with test case
table. This establishes the pattern for PR #997 and future TRL test
cases to share a single base image rather than duplicating Dockerfiles.

@KeitaW KeitaW left a comment


Review Batch 1/5 — Structure & Repository Hygiene

Rebase onto PR #1000 and reuse shared base image

Important — merge order: PR #1000 restructures the trl/ directory (renames grpo/ to grpo-math-reasoning/, creates a shared trl/Dockerfile, and adds a library-level trl/README.md). PR #1000 should merge first, then this PR should rebase and add gpt-oss-lora-grpo/ alongside grpo-math-reasoning/, reusing the shared base image rather than duplicating dependencies.

After rebasing:

  1. Reuse the shared trl/Dockerfile — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., hyperpod-elastic-agent) can go in an extended Dockerfile.
  2. Share common scripts: inference.py, eval.py, and reward-function utilities could be generalized rather than duplicated.
  3. Add platform-specific subdirectories: kubernetes/ for generic K8s manifests (including a standard PyTorchJob for vanilla EKS) and hyperpod-eks/ for HyperPod-specific configs.
  4. Update the trl/README.md test case table — add a row for gpt-oss-lora-grpo.

Directory name typo: evalution/ → evaluation/

The evaluation directory is misspelled as evalution/ (missing an "a"). Rename to evaluation/ and update all references.

Missing license headers on most files

All Python files except sft.py, all YAML files, and build_push.sh lack the repo's required header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

Note: sft.py has a HuggingFace Apache-2.0 header, not the repo's MIT-0.

Missing __init__.py in src/ directory

The src/ directory is used as a Python package (via PYTHONPATH) and evaluate_grpo.py imports from inference_g6e. Adding an __init__.py would make the package structure explicit.
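As a quick illustration of why the imports resolve today even without __init__.py — with src/ on PYTHONPATH, sibling modules are found by bare name — here is a minimal, self-contained sketch. Only the file name inference_g6e.py mirrors the PR; the stub body and model string are invented:

```python
import os
import sys
import tempfile

# Recreate the PR's layout in a temp dir: a src/ directory on sys.path,
# containing a module that siblings import by bare name. The stub body
# below is invented for illustration.
root = tempfile.mkdtemp()
src = os.path.join(root, "src")
os.makedirs(src)
with open(os.path.join(src, "inference_g6e.py"), "w") as f:
    f.write("def load_model(base_model_path, checkpoint=None):\n"
            "    return base_model_path, checkpoint\n")

# Equivalent to the Dockerfile's ENV PYTHONPATH="${PYTHONPATH}:/app"
sys.path.insert(0, src)

import inference_g6e  # resolves even without src/__init__.py

model_path, ckpt = inference_g6e.load_model("gpt-oss-20b")
```

An empty src/__init__.py would not change this behavior, but it makes the package boundary explicit to readers and tooling.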


@KeitaW KeitaW left a comment


Review Batch 2/5 — Deployment Pipeline (K8s / Shell)

GRPO and evaluation use plain Pod instead of PyTorchJob

train/train-grpo-multinode.yaml and evalution/eval-grpo.yaml use plain Pod specs. The GRPO manifest is named train-grpo-multinode.yaml but deploys a single pod named grpo-single — if multi-node GRPO is intended, a PyTorchJob or HyperPodPyTorchJob would handle torchrun coordination. If single-node is intentional, the filename is misleading.

envsubst calls without explicit variable whitelists

All envsubst calls in the README use the bare form (lines 83, 98, 152, 199-201, 249). Without a whitelist, envsubst replaces every $VAR pattern — including things like $(date +%Y%m%d_%H%M%S) in the eval manifest (which should be shell-expanded, not envsubst-expanded). Use explicit whitelists:

envsubst '$REGISTRY $IMAGE $TAG $HF_TOKEN' < train/fsx-storage-manager.yaml | kubectl apply -f -

Ref: envsubst pitfalls
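To make the pitfall concrete, here is a small self-contained demonstration; the manifest fragment and the $NODE_NAME placeholder are invented for illustration, not taken from the PR's manifests:

```shell
# A stand-in manifest fragment; $NODE_NAME is meant to be expanded later
# (inside the pod), not at deploy time.
export IMAGE=my-registry/gpt-oss:v1
manifest='image: $IMAGE
logfile: /fsx/logs/$NODE_NAME.log'

# Bare envsubst: $NODE_NAME is not set in this shell, so it is silently
# replaced with an empty string, corrupting the manifest.
printf '%s\n' "$manifest" | envsubst

# Whitelisted envsubst: only $IMAGE is substituted; $NODE_NAME survives.
printf '%s\n' "$manifest" | envsubst '$IMAGE'
```

The whitelist argument is envsubst's SHELL-FORMAT parameter: only variables named in it are replaced, everything else passes through verbatim.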


@KeitaW KeitaW left a comment


Review Batch 3/5 — Training Code Quality

trust_remote_code=True used without acknowledgment

trust_remote_code=True appears in every from_pretrained call across grpo_multinode.py, evaluate_grpo.py, inference_g6e.py, inference_grpo_new.py, and sft_teacher_data.py. This flag allows arbitrary code execution from the HuggingFace Hub. For GPT-OSS this may be necessary, but a brief comment (e.g., # Required: GPT-OSS model uses custom code on HF Hub) would acknowledge the security implication.

Bare except: clauses throughout

Multiple bare except: clauses catch all exceptions including KeyboardInterrupt and SystemExit:

  • grpo_multinode.py (lines 87, 101, 135)
  • evaluate_grpo.py (line 210)
  • inference_g6e.py (line 114)
  • inference_grpo_new.py (lines 86, 121)

Consider using except Exception: at minimum to allow clean interruption.
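A minimal sketch of the suggested pattern; detect_language here is an invented stand-in for whatever call sits inside the try (e.g. langdetect in the reward functions), not the PR's actual code:

```python
def detect_language(text):
    # Invented stand-in for the real call; raises on bad input.
    if not text:
        raise ValueError("empty text")
    return "en"

def safe_detect(text):
    try:
        return detect_language(text)
    # `except Exception` still catches ordinary failures, but lets
    # KeyboardInterrupt and SystemExit propagate, so Ctrl-C and sys.exit()
    # can still stop the run cleanly.
    except Exception:
        return "unknown"
```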

generate_fn = lambda m, t, p, lang: generate_response_grpo(m, t, p, reasoning_language=lang)
else:
from inference_g6e import generate_response
generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, show_full=True, reasoning_language=lang)

Bug: generate_response called with unsupported show_full parameter

inference_g6e.generate_response has signature generate_response(model, tokenizer, prompt, max_new_tokens=2048, reasoning_language=None) — it does not accept a show_full parameter. This will raise a TypeError at runtime.

Suggested change
generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, show_full=True, reasoning_language=lang)
generate_fn = lambda m, t, p, lang: generate_response(m, t, p, max_new_tokens=1024, reasoning_language=lang)

Collaborator Author

corrected.

# Standard mode: use inference_g6e
from inference_g6e import load_model
print(f"Loading model (use_trained={args.use_trained})...")
model, tokenizer = load_model(args.base_model, use_trained=args.use_trained, checkpoint_dir=args.checkpoint_dir)

Bug: load_model called with wrong keyword arguments

inference_g6e.load_model has signature load_model(base_model_path, checkpoint=None) — it does not accept use_trained or checkpoint_dir. This call will raise a TypeError at runtime.

The correct call would be something like:

checkpoint_path = os.path.join(args.checkpoint_dir, "converted-peft/lora-checkpoint-1000-peft") if args.use_trained else None
model, tokenizer = load_model(args.base_model, checkpoint=checkpoint_path)

Collaborator Author

corrected.


@KeitaW KeitaW left a comment


Review Batch 4/5 — Infrastructure & NCCL Configuration

Dockerfile runs as root

The Dockerfile does not include a USER directive, so the container runs as root. While this is common for GPU training containers (NCCL/EFA access often requires it), consider adding a comment explaining why root is needed.

Comment on lines +1 to +6
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0

Missing upper-bound version pins

All packages use only lower-bound pins (>=). Without upper bounds, future installs may pull incompatible major versions. Also, scipy, sentencepiece, and protobuf (lines 8-10) have no version pin at all.

Consider adding upper bounds for critical packages:

Suggested change
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0
torch>=2.4.0,<3.0
transformers>=4.48.0,<5.0
accelerate>=0.27.0,<1.0
datasets>=2.17.0,<3.0
peft>=0.9.0,<1.0
trl>=0.8.0,<1.0


Additionally, the minimum PyTorch version should be bumped to torch>=2.6.0 — PyTorch 2.6 is the first release with official NVIDIA Blackwell (sm_100) support. The Dockerfile already builds NCCL tests with -gencode=arch=compute_100,code=sm_100 and uses a CUDA 12.9 base image, so the PyTorch floor should match:

Suggested change
torch>=2.4.0
transformers>=4.48.0
accelerate>=0.27.0
datasets>=2.17.0
peft>=0.9.0
trl>=0.8.0
torch>=2.6.0,<3.0

PyTorch 2.4 does not include Blackwell kernels and will either fail or fall back to slow codepaths on B-series instances.

Collaborator Author

corrected.

protobuf
flash-attn>=2.5.0
deepspeed>=0.14.0
hyperpod-elastic-agent

Duplicate hyperpod-elastic-agent install

hyperpod-elastic-agent is installed both here and separately in the Dockerfile (line 45: RUN pip install hyperpod-elastic-agent ...). This means it gets installed twice, and the versions could conflict. Consider removing it from one location.

Collaborator Author

corrected.

# 7. Copy Code & Set Entrypoint
COPY src/ /app/
ENV PYTHONPATH="${PYTHONPATH}:/app"
#CMD ["python", "sft.py", "--config", "configs/sft_lora.yaml"]

Remove commented-out CMD

Dead code. Since the K8s manifests provide the command, this commented line should be removed.

Suggested change
#CMD ["python", "sft.py", "--config", "configs/sft_lora.yaml"]

Collaborator Author

corrected.


@KeitaW KeitaW left a comment


Review Batch 5/5 — Documentation Consistency

sft_lora.yaml config values overridden by training manifest

The YAML config sets per_device_train_batch_size: 4 and gradient_accumulation_steps: 4, but the training manifest overrides these (e.g., --per_device_train_batch_size=1). Similarly, output_dir: gpt-oss-20b-multilingual-reasoner is overridden to --output_dir=/fsx/checkpoints. While CLI args take precedence, aligning the config or adding comments noting which values are overridden at launch would reduce confusion.
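The precedence behavior being flagged can be sketched as follows; the values mirror the ones cited above, but the merge helper is illustrative, not the PR's actual launcher code:

```python
import argparse

# Values from sft_lora.yaml, as cited in the review.
yaml_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "output_dir": "gpt-oss-20b-multilingual-reasoner",
}

parser = argparse.ArgumentParser()
parser.add_argument("--per_device_train_batch_size", type=int)
parser.add_argument("--output_dir")
# Flags as passed by the training manifest.
args = parser.parse_args(["--per_device_train_batch_size=1",
                          "--output_dir=/fsx/checkpoints"])

# CLI flags win wherever provided; the YAML value is only a fallback.
effective = dict(yaml_config)
effective.update({k: v for k, v in vars(args).items() if v is not None})
```

With this merge order, per_device_train_batch_size ends up as 1 and output_dir as /fsx/checkpoints, while gradient_accumulation_steps keeps its YAML value — exactly the silent divergence the comment asks to document.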

Files Reference table in README is incomplete

The table at README lines 270-284 omits several files: sft.py, sft_teacher_data.py, convert_grpo_checkpoint.py, build_push.sh, requirements.txt, Dockerfile, sft_lora.yaml, and env_vars.example. Consider expanding the table or noting that artifacts/ contains additional scripts.


Things That Look Great

  • Excellent pipeline overview — the ASCII diagram showing Base → SFT → GRPO with corresponding inference endpoints makes the workflow immediately clear.
  • Well-structured GRPO reward function — the three-component reward (answer language, reasoning language, brevity) with clear weights is easy to understand and modify.
  • Good use of HyperPodPyTorchJob for SFT training — elastic scaling with minReplicas/maxReplicas and auto-resume tolerations is the right pattern for HyperPod EKS.
  • TestPromptCallback in grpo_multinode.py — generating test outputs every 10 steps provides excellent training visibility.
  • Correct use of HF_HOME throughout all manifests — no deprecated TRANSFORMERS_CACHE.
  • EFA configuration is solid: FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1, proper LD_LIBRARY_PATH for aws-ofi-nccl.
  • Self-contained test case with clear separation of concerns.
  • Checkpoint conversion scripts are a thoughtful addition.

Suggested Priority

  1. Bugs (P0): Fix load_model and generate_response signature mismatches in evaluate_grpo.py — these will fail at runtime.
  2. Correctness (P1): Fix env_vars.example bash syntax, add set -euo pipefail to build_push.sh, remove os.system("pip install langdetect"), quote shell variables.
  3. Conventions (P2): Add license headers, fix directory typo (evalution/), update clone URL, fix relative link.
  4. Coordination (P3): Rebase onto PR #1000, reuse shared base image, add platform subdirectories.

@@ -0,0 +1,278 @@
# GPT-OSS 20B Training & Inference Guide

This guide explains how to train the GPT-OSS 20B model with LoRA, then improve it using Group Relative Policy Optimization (GRPO) for better language compliance in reasoning and final answers. It is designed to be as simple as possible, requires no data preparation, and uses a container image. For further background information look at https://developers.openai.com/cookbook/articles/gpt-oss/fine-tune-transfomers

Possible typo in external URL

The URL ends with fine-tune-transfomers — missing an "r" (should probably be fine-tune-transformers). Worth verifying this is the correct URL path.

Collaborator Author

corrected

### 6.2. Check results

```bash
kubectl logs eval-grpo --tail=50
```

evalution/ typo preserved in command

This matches the typo'd directory name so it currently works, but both should be fixed to evaluation/ together.

Collaborator Author

corrected.


@KeitaW KeitaW left a comment


Left some comments.

@KeitaW KeitaW assigned mijanur132 and unassigned KeitaW Feb 26, 2026
mijanur132 pushed a commit that referenced this pull request Feb 27, 2026
…#1000)

* Restructure trl/ test cases: shared base image and descriptive naming

Rename grpo/ to grpo-math-reasoning/ for clarity, create a unified
trl/Dockerfile with all TRL example dependencies, add a HyperPod EKS
Dockerfile extension, and add a library-level README with test case
table. This establishes the pattern for PR #997 and future TRL test
cases to share a single base image rather than duplicating Dockerfiles.

* Remove hyperpod-eks directory from trl/ test cases

HyperPod EKS support will be added separately when needed.
@mijanur132 (Collaborator Author)

  1. Reuse the shared trl/Dockerfile — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., hyperpod-elastic-agent) can go in an extended Dockerfile.

response:
This recipe intentionally uses a separate Dockerfile because it requires the AWS Deep Learning Container base image (public.ecr.aws/deep-learning-containers/base:12.9.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.31) which comes with EFA drivers, aws-ofi-nccl, and CUDA 12.9 pre-installed — these are needed for HyperPod EKS elastic training with EFA networking. The shared trl/Dockerfile uses public.ecr.aws/hpc-cloud/nccl-tests:latest as its base, which has a different set of pre-installed libraries and driver versions. Switching base images could introduce compatibility issues with EFA and the HyperPod elastic agent. I've added a comment in the Dockerfile noting the shared alternative for visibility.

@KeitaW (Collaborator)

KeitaW commented Feb 27, 2026

  1. Reuse the shared trl/Dockerfile — it already installs TRL, PyTorch, DeepSpeed, Flash Attention, FlashInfer, and common dependencies. GPT-OSS-specific deps (e.g., hyperpod-elastic-agent) can go in an extended Dockerfile.

response: This recipe intentionally uses a separate Dockerfile because it requires the AWS Deep Learning Container base image (public.ecr.aws/deep-learning-containers/base:12.9.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.31) which comes with EFA drivers, aws-ofi-nccl, and CUDA 12.9 pre-installed — these are needed for HyperPod EKS elastic training with EFA networking. The shared trl/Dockerfile uses public.ecr.aws/hpc-cloud/nccl-tests:latest as its base, which has a different set of pre-installed libraries and driver versions. Switching base images could introduce compatibility issues with EFA and the HyperPod elastic agent. I've added a comment in the Dockerfile noting the shared alternative for visibility.

The base image also includes them (cf. https://github.com/awslabs/awsome-distributed-training/blob/gpt-oss-lora-grpo/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile). Could you please test to see if that works?
