From ccbe1c92769acd91e003b562beb5c875786a6bd9 Mon Sep 17 00:00:00 2001 From: Peng Sun Date: Tue, 24 Feb 2026 10:09:24 -0600 Subject: [PATCH] Add Claude Code skills for ATOM development Add 4 skills for Claude Code AI assistant: - triage-accuracy: Debug wrong/garbage model output - triage-perf: Debug slow inference performance - docker-serve: Docker build, patch, serve workflow - benchmark: InferenceX benchmarking and comparison --- .claude/skills/benchmark/SKILL.md | 284 ++++++++++++++++++++++++ .claude/skills/docker-serve/SKILL.md | 284 ++++++++++++++++++++++++ .claude/skills/triage-accuracy/SKILL.md | 229 +++++++++++++++++++ .claude/skills/triage-perf/SKILL.md | 229 +++++++++++++++++++ 4 files changed, 1026 insertions(+) create mode 100644 .claude/skills/benchmark/SKILL.md create mode 100644 .claude/skills/docker-serve/SKILL.md create mode 100644 .claude/skills/triage-accuracy/SKILL.md create mode 100644 .claude/skills/triage-perf/SKILL.md diff --git a/.claude/skills/benchmark/SKILL.md b/.claude/skills/benchmark/SKILL.md new file mode 100644 index 00000000..ac62cffa --- /dev/null +++ b/.claude/skills/benchmark/SKILL.md @@ -0,0 +1,284 @@ +# Skill: ATOM Inference Benchmarking + +## Description +Run ATOM inference benchmarks using InferenceX benchmark_serving, compare results across Docker images, and analyze throughput/latency metrics. Covers the full workflow from server launch to results analysis. + +## When to Use +- Benchmarking a new ATOM Docker image against a reference +- Comparing performance across different TP configurations, sequence lengths, or concurrency levels +- Validating that performance fixes actually improve throughput +- Generating benchmark data for reports + +--- + +## 1. 
Benchmark Setup + +### Prerequisites +```bash +# Inside Docker container +pip install -q aiohttp requests tqdm transformers +export OMP_NUM_THREADS=1 +``` + +### Directory structure +``` +/workspace/results/ # Inside container +/mnt/m2m_nobackup/pensun/results/ # Host mount + public/ # Reference image results + clean_v7/ # Clean image results + dsr1_public/ # DeepSeek-R1 reference +``` + +--- + +## 2. Running Benchmarks + +### Single benchmark +```bash +python3 -m atom.benchmarks.benchmark_serving \ + --model /models/MODEL \ + --backend vllm \ + --base-url http://localhost:8000 \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --random-range-ratio 0.8 \ + --num-prompts 80 \ + --max-concurrency 8 \ + --request-rate inf \ + --ignore-eos \ + --save-result \ + --result-dir /workspace/results \ + --result-filename result_tp8_isl1024_osl1024_conc8.json \ + --percentile-metrics ttft,tpot,itl,e2el +``` + +### Key parameters +| Parameter | Description | Typical Values | +|-----------|------------|----------------| +| `--random-input-len` | Input sequence length | 1024, 8192 | +| `--random-output-len` | Output sequence length | 1024, 8192 | +| `--max-concurrency` | Concurrent requests | 4, 8, 16, 32, 64, 128 | +| `--num-prompts` | Total requests to send | concurrency * 10 | +| `--random-range-ratio` | Min/max ratio for random lengths | 0.8 | +| `--request-rate inf` | Send requests as fast as possible | inf | +| `--ignore-eos` | Don't stop at EOS (measure full output length) | Always use | + +### Benchmark matrix +Standard benchmark suite covers these configurations: + +**Phase 1: Short context (1k/1k)** +``` +TP=8, ISL=1024, OSL=1024, CONC={4,8,16,32} +``` + +**Phase 2: Long output (1k/8k)** +``` +TP=8, ISL=1024, OSL=8192, CONC={4,8,16,32,64,128} +``` + +**Phase 3: Long input (8k/1k)** +``` +TP=8, ISL=8192, OSL=1024, CONC={4,8,16,32,64,128} +``` + +**Phase 4: Lower TP (if model fits)** +``` +TP=4 or TP=1, same ISL/OSL/CONC combos +``` + 
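The fixed-TP phases above can be driven from one generated matrix. This is a minimal sketch (`PHASES` and `benchmark_matrix` are illustrative names, not part of `atom.benchmarks.benchmark_serving`) that enumerates the `(tp, isl, osl, conc)` combinations and the `num_prompts = concurrency * 10` convention from the parameter table:

```python
# Sketch: enumerate the standard benchmark matrix above as config dicts.
# PHASES/benchmark_matrix are illustrative, not an ATOM API.
PHASES = [
    # (tp, input_len, output_len, concurrency sweep)
    (8, 1024, 1024, (4, 8, 16, 32)),            # Phase 1: short context
    (8, 1024, 8192, (4, 8, 16, 32, 64, 128)),   # Phase 2: long output
    (8, 8192, 1024, (4, 8, 16, 32, 64, 128)),   # Phase 3: long input
]

def benchmark_matrix():
    for tp, isl, osl, concs in PHASES:
        for conc in concs:
            yield {
                "tp": tp, "isl": isl, "osl": osl, "conc": conc,
                "num_prompts": conc * 10,  # convention from the table above
                "result_filename": f"result_tp{tp}_isl{isl}_osl{osl}_conc{conc}.json",
            }

configs = list(benchmark_matrix())
print(len(configs))  # 16 runs across the three fixed-TP phases
```

Phase 4 reuses the same ISL/OSL/CONC combinations with TP set to 4 or 1, so the same generator can be re-run with a different TP column.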
+--- + +## 3. Helper Functions + +### start_server +```bash +start_server() { + local tp=$1; shift; local extra_args="$@" + + # Kill previous server + pkill -9 -f "atom.entrypoints" 2>/dev/null || true + pkill -9 -f "ModelRunner" 2>/dev/null || true + sleep 5 + + # Clear GPU memory + python3 -c " +import torch +for i in range(8): + try: + torch.cuda.set_device(i); torch.cuda.empty_cache() + except: pass +" 2>/dev/null || true + sleep 15 + + # Launch server + python3 -m atom.entrypoints.openai_server \ + --model $MODEL --server-port $PORT -tp $tp \ + --kv_cache_dtype fp8 --block-size 16 \ + $extra_args > /workspace/server.log 2>&1 & + SERVER_PID=$! + + # Wait for health (up to 20 min) + for i in $(seq 1 240); do + curl -s http://localhost:$PORT/health > /dev/null 2>&1 && return 0 + kill -0 $SERVER_PID 2>/dev/null || return 1 + sleep 5 + done + return 1 +} +``` + +### run_bench +```bash +run_bench() { + local isl=$1 osl=$2 conc=$3 tp=$4 + local num_prompts=$((conc * 10)) + local result_file="${MODEL_PREFIX}_tp${tp}_isl${isl}_osl${osl}_conc${conc}" + + # Skip if already done + [ -f "$RESULT_DIR/${result_file}.json" ] && return 0 + + timeout 7200 python3 -m atom.benchmarks.benchmark_serving \ + --model $MODEL --backend vllm --base-url http://localhost:$PORT \ + --dataset-name random \ + --random-input-len $isl --random-output-len $osl \ + --random-range-ratio 0.8 \ + --num-prompts $num_prompts --max-concurrency $conc \ + --request-rate inf --ignore-eos \ + --save-result --result-dir $RESULT_DIR \ + --result-filename ${result_file}.json \ + --percentile-metrics ttft,tpot,itl,e2el +} +``` + +--- + +## 4. 
Results Analysis + +### Result JSON format +Each benchmark produces a JSON file with: +```json +{ + "output_throughput": 1234.5, // output tokens/sec + "total_token_throughput": 2345.6, // total tokens/sec (input+output) + "mean_ttft_ms": 45.2, // time to first token (ms) + "median_tpot_ms": 8.1, // time per output token (ms) + "mean_itl_ms": 8.3, // inter-token latency (ms) + "mean_e2el_ms": 850.0 // end-to-end latency (ms) +} +``` + +### Key metrics +| Metric | What It Measures | Lower/Higher Better | +|--------|-----------------|-------------------| +| `output_throughput` | Output tokens per second | Higher | +| `total_token_throughput` | All tokens per second | Higher | +| `mean_ttft_ms` | Time to first token | Lower | +| `median_tpot_ms` | Time per output token | Lower | + +### Quick summary script +```bash +for f in /workspace/results/*.json; do + [ -f "$f" ] || continue + base=$(basename "$f" .json) + python3 -c " +import json; d=json.load(open('$f')) +print(f'$base: output={d.get(\"output_throughput\",0):.1f} total={d.get(\"total_token_throughput\",0):.1f} ttft={d.get(\"mean_ttft_ms\",0):.1f}ms tpot={d.get(\"median_tpot_ms\",0):.1f}ms') +" +done +``` + +### Comparison table generator +```python +import json, glob, os + +def load_results(result_dir, prefix): + results = {} + for f in glob.glob(f"{result_dir}/{prefix}_*.json"): + base = os.path.basename(f).replace('.json', '') + # Extract config from filename: prefix_tpN_islN_oslN_concN + parts = base.split('_') + key = '_'.join(parts[1:]) # remove prefix + with open(f) as fh: + results[key] = json.load(fh) + return results + +public = load_results('/workspace/results', 'public') +clean = load_results('/workspace/results', 'clean') + +print(f"{'Config':<35} {'Public':>10} {'Clean':>10} {'Ratio':>8}") +print("-" * 65) +for key in sorted(public.keys()): + if key in clean: + pub_tp = public[key].get('output_throughput', 0) + cln_tp = clean[key].get('output_throughput', 0) + ratio = cln_tp / pub_tp if pub_tp > 0 
else 0 + print(f"{key:<35} {pub_tp:>10.1f} {cln_tp:>10.1f} {ratio:>7.1%}") +``` + +--- + +## 5. Remote Node Execution + +### Deploy and run on remote nodes +```bash +NODE="uswslocpm2m-106-1236.amd.com" + +# 1. Write benchmark script locally +cat > /tmp/bench.sh << 'EOF' +#!/bin/bash +# ... benchmark script content ... +EOF + +# 2. Deploy +scp /tmp/bench.sh ${NODE}:/mnt/m2m_nobackup/pensun/bench_serving/ + +# 3. Launch (never inline complex Docker commands via SSH) +ssh ${NODE} 'bash /mnt/m2m_nobackup/pensun/bench_serving/bench.sh' + +# 4. Monitor +ssh ${NODE} 'docker logs -f atom-bench 2>&1 | tail -20' + +# 5. Collect results +scp ${NODE}:/mnt/m2m_nobackup/pensun/results/*.json /home/pensun/results/ +``` + +### Multi-node deployment +When running on multiple nodes (e.g., public vs clean on separate nodes): +```bash +NODE1="uswslocpm2m-106-881.amd.com" # Public image +NODE2="uswslocpm2m-106-1236.amd.com" # Clean image + +# Deploy to both +for NODE in $NODE1 $NODE2; do + scp /tmp/bench.sh ${NODE}:/mnt/m2m_nobackup/pensun/bench_serving/ +done + +# Launch in parallel +ssh ${NODE1} 'nohup bash /mnt/.../bench.sh --prefix public > /mnt/.../run.log 2>&1 &' +ssh ${NODE2} 'nohup bash /mnt/.../bench.sh --prefix clean > /mnt/.../run.log 2>&1 &' +``` + +--- + +## 6. Common Issues + +| Issue | Symptom | Fix | +|-------|---------|-----| +| Server OOM | Killed during model load | Reduce `--max-model-len`, or increase TP | +| Benchmark timeout | `timeout 7200` exits 124 | High concurrency + long output overwhelms server. Reduce concurrency or output length | +| Stale GPU processes | Server won't start, GPU memory full | `pkill -9 -f atom.entrypoints && sleep 15` | +| Results already exist | `run_bench` silently skips | Delete old JSON or use different prefix | +| SSH escaping | `Ambiguous output redirect` | Never inline Docker commands via SSH. Use script files. 
| +| Node 1 GPU issues | Memory faults on all shapes | Prefer Node 2 (1236) for reliable benchmarks | + +--- + +## Anti-Patterns + +1. **Don't compare results from different hardware** — MI300X vs MI355X have different performance profiles +2. **Don't skip the warmup** — first few requests are always slower (JIT compilation, Triton caching) +3. **Don't forget `--ignore-eos`** — without it, output length varies and throughput numbers aren't comparable +4. **Don't use `--request-rate` below inf for throughput tests** — rate limiting artificially caps throughput +5. **Don't compare decode and prefill numbers directly** — they measure fundamentally different things +6. **Don't inline Docker commands via SSH** — write scripts, scp them, launch via `bash /path/to/script.sh` diff --git a/.claude/skills/docker-serve/SKILL.md b/.claude/skills/docker-serve/SKILL.md new file mode 100644 index 00000000..ffc84994 --- /dev/null +++ b/.claude/skills/docker-serve/SKILL.md @@ -0,0 +1,284 @@ +# Skill: ATOM Docker Build, Serve, and Debug + +## Description +End-to-end workflow for building, patching, serving, and debugging ATOM Docker images for LLM inference on AMD GPUs. Covers the Docker run template, in-container patching, image tagging, server launch, and health checking. + +## When to Use +- Setting up ATOM inference for a new model or Docker image +- Patching an existing Docker image without rebuilding from scratch +- Debugging server startup failures or configuration issues +- Comparing behavior between two Docker images + +--- + +## 1. 
Docker Run Template + +### Full server launch +```bash +docker run -d \ + --name atom-server \ + --network host \ + --device /dev/kfd \ + --device /dev/dri \ + --group-add video \ + --ipc host \ + --shm-size 256g \ + --cap-add SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /path/to/models:/models:ro \ + -v /path/to/results:/workspace/results \ + IMAGE_NAME \ + python3 -m atom.entrypoints.openai_server \ + --model /models/MODEL_NAME \ + --server-port 8000 \ + -tp 8 \ + --kv_cache_dtype fp8 \ + --block-size 16 \ + --max-model-len 4096 \ + --trust-remote-code \ + --enforce-eager +``` + +### Debug container (sleep infinity for patching) +```bash +docker run -d \ + --name atom-debug \ + --network host \ + --device /dev/kfd \ + --device /dev/dri \ + --group-add video \ + --ipc host \ + --shm-size 256g \ + --cap-add SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /path/to/models:/models:ro \ + IMAGE_NAME \ + sleep infinity +``` + +### Required flags explained +| Flag | Why | +|------|-----| +| `--device /dev/kfd --device /dev/dri` | GPU access on ROCm | +| `--group-add video` | GPU device permissions | +| `--ipc host` | Shared memory for multi-GPU | +| `--shm-size 256g` | Large shared memory for tensor parallel | +| `--cap-add SYS_PTRACE` | ROCm profiler/debugger access | +| `--security-opt seccomp=unconfined` | Required for some ROCm operations | +| `--network host` | Direct port access (no -p mapping needed) | + +--- + +## 2. Patch-Commit-Test Workflow + +**Never rebuild Docker for debugging.** Patch inside running containers instead. + +### Step 1: Start debug container +```bash +docker run -d --name atom-debug ... 
IMAGE sleep infinity +``` + +### Step 2: Apply patches + +**Option A: sed for simple string replacements** +```bash +docker exec atom-debug sed -i 's/old_string/new_string/' /app/ATOM/atom/model_ops/moe.py +``` + +**Option B: Python script for complex patches** +```bash +docker exec atom-debug python3 -c " +content = open('/app/ATOM/atom/model_ops/moe.py').read() +content = content.replace('old_pattern', 'new_pattern') +open('/app/ATOM/atom/model_ops/moe.py', 'w').write(content) +print('Patched') +" +``` + +**Option C: docker cp for file replacement** +```bash +docker cp local_fixed_file.py atom-debug:/app/ATOM/atom/model_ops/moe.py +``` + +### Step 3: Commit as new image +```bash +docker commit atom-debug IMAGE:patched-v1 +``` + +### Step 4: Test with new image +```bash +docker rm -f atom-debug +docker run -d --name atom-test ... IMAGE:patched-v1 \ + python3 -m atom.entrypoints.openai_server --model /models/MODEL ... +``` + +### Step 5: Verify +```bash +# Wait for server health +for i in $(seq 1 60); do + curl -s http://localhost:8000/health > /dev/null 2>&1 && break + sleep 5 +done + +# Test completions +curl -s http://localhost:8000/v1/completions \ + -d '{"model":"/models/MODEL","prompt":"The capital of France is","max_tokens":30,"temperature":0}' +``` + +--- + +## 3. 
Server Configuration
+
+### Model-specific configurations
+
+**DeepSeek-R1 671B (MXFP4)**
+```bash
+python3 -m atom.entrypoints.openai_server \
+  --model /models/DeepSeek-R1-0528-MXFP4 \
+  -tp 8 \
+  --kv_cache_dtype fp8 \
+  --block-size 16 \
+  --max-model-len 4096  # or 10240 for long context
+```
+
+**GPT-OSS 120B (MXFP4)**
+```bash
+# -tp 1 also works for single-GPU serving
+python3 -m atom.entrypoints.openai_server \
+  --model /models/gpt-oss-120b \
+  -tp 8 \
+  --kv_cache_dtype fp8 \
+  --block-size 16 \
+  --max-model-len 4096
+```
+
+### Common server flags
+| Flag | Description | Default |
+|------|------------|---------|
+| `-tp N` | Tensor parallel degree | 1 |
+| `--kv_cache_dtype fp8` | FP8 KV cache (saves memory) | auto |
+| `--block-size 16` | KV cache block size | 16 |
+| `--max-model-len N` | Max sequence length | Model default |
+| `--enforce-eager` | Disable CUDA graphs (debug) | False |
+| `--trust-remote-code` | Allow custom model code | False |
+
+### Health check
+```bash
+curl -s http://localhost:8000/health
+# Returns 200 when ready
+```
+
+### Server startup wait pattern
+```bash
+start_server() {
+  python3 -m atom.entrypoints.openai_server ... &
+  SERVER_PID=$!
+  for i in $(seq 1 240); do
+    curl -s http://localhost:8000/health > /dev/null 2>&1 && return 0
+    kill -0 $SERVER_PID 2>/dev/null || { echo "Server died"; return 1; }
+    sleep 5
+  done
+  echo "Timeout (20 min)"; return 1
+}
+```
+
+---
+
+## 4. Image Tagging Convention
+
+```
+username_date_rocmversion_aiterversion_atomversion
+```
+
+**Examples:**
+- `pensun_20260219_rocm720_aiter011_atom010` — base clean image
+- `IMAGE:patched-v1` — patched variant
+
+**Registry:** `rocm/atom-private`
+
+---
+
+## 5. 
GPU Cleanup Between Runs + +When restarting the server inside the same container: +```bash +# Kill all related processes +pkill -9 -f "atom.entrypoints" 2>/dev/null || true +pkill -9 -f "ModelRunner" 2>/dev/null || true +pkill -9 -f "from multiprocessing" 2>/dev/null || true +sleep 5 + +# Clear GPU memory +python3 -c " +import torch +for i in range(8): + try: + torch.cuda.set_device(i) + torch.cuda.empty_cache() + torch.cuda.reset_peak_memory_stats(i) + except: pass +" +sleep 15 # Wait for GPU memory release +``` + +--- + +## 6. Updating ATOM/AITER from GitHub Inside Container + +```bash +# Update ATOM +cd /app/ATOM +pip uninstall -y atom 2>/dev/null || true +git fetch origin main && git checkout origin/main +pip install -e . --no-deps + +# Update AITER +cd /app/aiter-test # or /root/aiter +pip uninstall -y amd-aiter 2>/dev/null || true +git fetch origin main && git checkout origin/main +rm -rf aiter/jit/build aiter/jit/*.so +git submodule sync && git submodule update --init --recursive +python setup.py develop +``` + +--- + +## 7. Troubleshooting + +### Server won't start +```bash +# Check logs +docker logs atom-server 2>&1 | tail -100 + +# Common issues: +# - OOM: reduce --max-model-len or increase --shm-size +# - GPU not found: check --device flags and --group-add video +# - Module import error: check AITER/ATOM versions match +``` + +### Server starts but output is wrong +Use the `triage-accuracy` skill. + +### Server starts but output is slow +Use the `triage-perf` skill. + +### Remote node deployment +```bash +# Deploy scripts to remote node +scp bench.sh NODE:/path/to/bench.sh + +# Launch via SSH (never inline complex Docker commands) +ssh NODE 'bash /path/to/bench.sh' + +# Monitor +ssh NODE 'docker logs -f atom-server 2>&1 | tail -50' +``` + +--- + +## Anti-Patterns + +1. **Don't rebuild Docker for each test** — patch + commit is 100x faster +2. **Don't inline complex Docker commands via SSH** — shell escaping breaks. 
Write scripts, scp, then `ssh node 'bash script.sh'` +3. **Don't forget GPU cleanup between server restarts** — leftover processes hold GPU memory +4. **Don't skip the health check** — server takes 5-20 min to load large models +5. **Don't mount model directories read-write** — use `:ro` to prevent accidental modification diff --git a/.claude/skills/triage-accuracy/SKILL.md b/.claude/skills/triage-accuracy/SKILL.md new file mode 100644 index 00000000..000e0e79 --- /dev/null +++ b/.claude/skills/triage-accuracy/SKILL.md @@ -0,0 +1,229 @@ +# Skill: Triage Accuracy Issues in LLM Inference + +## Description +Systematic methodology for diagnosing and fixing correctness/accuracy issues in LLM inference — garbage output, wrong answers, numerical instability, or degraded quality. Works across any backend (CK, Triton, ASM, hipBLASLt), any quantization format (FP8, FP4, INT8, BF16), and any model architecture. + +## When to Use +- Model produces garbage, spaces, random tokens, or incoherent text +- Model gives wrong answers but no errors/crashes/NaN +- Output quality degraded after a code change, build change, or environment change +- Numerical instability (NaN/Inf) during inference +- Suspected kernel, quantization, or routing bugs + +--- + +## Core Principle + +**Check DIRECTIONS, not magnitudes.** Norms, means, min/max can all look reasonable while the hidden state points in a completely wrong direction. **Cosine similarity against a known-good reference** is the gold standard metric. + +| Cosine Similarity | Verdict | +|-------------------|---------| +| > 0.999 | Correct | +| 0.99 – 0.999 | Suspicious — investigate | +| 0.9 – 0.99 | Broken — will degrade over layers | +| < 0.9 | Catastrophically wrong | + +A per-layer cos_sim of 0.85 looks tolerable but compounds across N layers. For a 61-layer model, the final hidden state can be completely uncorrelated with the reference (cos_sim ~ 0.05). 
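The verdict table, as a minimal self-contained sketch. The pure-Python `cosine_similarity` is a stand-in for `torch.nn.functional.cosine_similarity`, and `classify_cos_sim` is an illustrative helper, not an ATOM API:

```python
import math

# Pure-Python stand-in for F.cosine_similarity on flattened tensors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The verdict table from above, as code.
def classify_cos_sim(cos):
    if cos > 0.999:
        return "correct"
    if cos > 0.99:
        return "suspicious"      # investigate
    if cos > 0.9:
        return "broken"          # will degrade over layers
    return "catastrophically wrong"
```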
+
+---
+
+## Phase 1: Reproduce and Characterize
+
+### 1.1 Use completions API, not chat
+Chat APIs apply templates, special-token handling, and streaming that can mask issues:
+```bash
+curl -s http://localhost:PORT/v1/completions \
+  -d '{"model":"MODEL","prompt":"The capital of France is","max_tokens":30,"temperature":0}'
+```
+
+### 1.2 Use deterministic prompts with known answers
+| Prompt | Expected | Tests |
+|--------|----------|-------|
+| "The capital of France is" | " Paris" | Basic knowledge |
+| "1+1=" | "2" | Arithmetic |
+| "John went to the store. John" | "bought" or similar | Coherence |
+| "The" | Any reasonable continuation | Not garbage |
+
+### 1.3 Compare with a known-good reference
+Run identical prompts on a reference build/image. If both are wrong, the issue is upstream (model weights, config). If only one is wrong, the issue is in the code delta between them.
+
+### 1.4 Check for build/JIT failures
+Look for silent JIT failures in stderr:
+```bash
+# AITER JIT
+grep -i "failed jit build" server.log
+# General
+grep -i "fallback\|warning\|error" server.log | head -50
+```
+Silent fallback to a less-tested code path is a top root cause.
+
+---
+
+## Phase 2: Binary Search Through the Model Pipeline
+
+Start from both ends (input and output) and narrow inward.
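This narrowing can be sketched as a bisection over layer indices. Here `cos_sim_at_layer` is a hypothetical probe that returns the per-layer cosine similarity against the reference, not a real ATOM function:

```python
def first_bad_layer(cos_sim_at_layer, num_layers, threshold=0.999):
    """Bisect for the first layer whose output diverges from the reference.

    Assumes divergence is monotone: once a layer is bad, later layers stay
    bad (true for a single localized bug; a systematic bug makes every
    layer bad and this returns layer 0).
    """
    lo, hi = 0, num_layers - 1
    if cos_sim_at_layer(hi) >= threshold:
        return None  # final hidden state matches: nothing diverges
    while lo < hi:
        mid = (lo + hi) // 2
        if cos_sim_at_layer(mid) >= threshold:
            lo = mid + 1  # still matches reference; bug is later
        else:
            hi = mid      # already diverged; bug is here or earlier
    return lo
```

Each probe is expensive (a forward pass with hidden-state capture), so bisection keeps the count to O(log N) instead of checking all layers.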
+ +### Step 1: Verify Embeddings +```python +from safetensors import safe_open +with safe_open(path, framework='pt', device='cpu') as sf: + embed_w = sf.get_tensor('model.embed_tokens.weight') + +ref = embed_w[token_ids] # should match exactly +cos = F.cosine_similarity(ref.flatten(), model_hidden.flatten(), dim=0) +# Expected: 1.000000 (exact match) +``` + +### Step 2: Verify lm_head (output projection) +```python +logits_manual = hidden_states @ lm_head_weight.T +cos = F.cosine_similarity(logits_manual.flatten(), model_logits.flatten(), dim=0) +# If this matches but final output is wrong -> bug is in the transformer layers +``` + +### Step 3: Narrow to layer granularity +```python +for layer_idx in [0, 15, 30, 45, 60]: + cos = F.cosine_similarity( + test_hidden[layer_idx].flatten(), + ref_hidden[layer_idx].flatten(), dim=0 + ) + print(f"Layer {layer_idx}: cos_sim={cos:.6f}") +``` +Find where cos_sim drops — the bug is in that layer's subcomponents. + +### Step 4: Narrow within a layer +1. **RMSNorm / LayerNorm** — recompute manually +2. **Attention** (Q/K/V projections, RoPE, softmax, output projection) +3. **MoE routing** (expert selection, gating scores) +4. **FP8/INT8 GEMM** (quantize -> matmul -> dequantize) +5. **Residual connection** (addition, not a common failure point) + +--- + +## Phase 3: Build Standalone Verification + +```python +import torch +from safetensors import safe_open + +# 1. Load actual model weights (bypass model loading) +with safe_open(safetensors_path, framework='pt', device='cpu') as sf: + weight = sf.get_tensor('model.layers.0.mlp.gate_proj.weight') + scale = sf.get_tensor('model.layers.0.mlp.gate_proj.weight_scale_inv') + +# 2. Compute FP32 reference on CPU +weight_f32 = dequantize(weight, scale) # manual block dequant +ref_output = input_f32 @ weight_f32.T + +# 3. Call actual kernel with same inputs +kernel_output = suspect_kernel(input_quant, weight, input_scale, weight_scale) + +# 4. 
Compare +cos = F.cosine_similarity(ref_output.flatten(), kernel_output.float().flatten(), dim=0) +print(f"cos_sim = {cos:.6f}") # < 0.999 = bug confirmed +``` + +### Key: Test the FULL chain, not components in isolation +The most dangerous bugs live at **interfaces between components**. A GEMM kernel can be correct, and a quantization kernel can be correct, but if the scale layout from quant doesn't match what GEMM expects, the result is wrong. Always test quant->GEMM as one unit. + +--- + +## Phase 4: Trace the Data Flow + +### For quantized GEMM paths (ATOM): +``` +model.forward() + -> LinearBase.forward() # atom/model_ops/linear.py + -> quant_function(input, ...) # produces (input_fp8, input_scale) + -> [primary kernel OR fallback] + -> gemm_function(input_fp8, weight, input_scale, weight_scale) + -> [primary kernel OR fallback] +``` + +### What to check at each hop: +1. **Tensor shapes** — are they what the next function expects? +2. **Memory layout** — row-major vs column-major, contiguous vs strided +3. **Scale tensor layout** — this is the #1 source of silent correctness bugs +4. **Dtype** — especially FP8 variants (e4m3fn vs e4m3fnuz on AMD) +5. **Fallback path** — does the fallback handle ALL parameters the primary does? + +### The "Silent Parameter" Bug Pattern +```python +# DANGEROUS: function accepts parameter but ignores it +def fallback_quant(out, input, scales, shuffle_scale=False): + triton_quant(out, input, scales) + # shuffle_scale is SILENTLY IGNORED +``` +Every fallback function should either implement the parameter fully, or `raise NotImplementedError`. 
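A sketch of the safe version of the same fallback, keeping the hypothetical `fallback_quant`/`triton_quant` names from the snippet above: the unsupported parameter fails loudly instead of being silently dropped.

```python
def triton_quant(out, input, scales):
    pass  # stand-in for the real Triton quant kernel

def fallback_quant(out, input, scales, shuffle_scale=False):
    if shuffle_scale:
        # No shuffled-scale support in this path: raising here turns a
        # silent wrong-layout bug into an immediate, debuggable error.
        raise NotImplementedError(
            "fallback_quant does not implement shuffle_scale=True"
        )
    triton_quant(out, input, scales)
```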
+ +--- + +## Phase 5: Fix and Verify + +### 5.1 Apply minimal fix +### 5.2 Verify with standalone test (cos_sim should jump from 0.7-0.9 to 0.9999+) +### 5.3 Verify end-to-end on deterministic prompts from Phase 1 +### 5.4 Docker patch workflow +```bash +docker cp fix.py container:/tmp/ +docker exec container python3 /tmp/fix.py +docker commit container image:fixed +curl -s http://localhost:PORT/v1/completions -d '{"prompt":"The capital of France is",...}' +``` + +--- + +## Common Root Causes (Ranked by Frequency) + +1. **Scale/metadata layout mismatch** — quant writes one layout, GEMM expects another. Produces cos_sim 0.7-0.9 per layer, compounds to garbage. +2. **Silent fallback to untested code path** — JIT build fails, fallback ignores parameters. +3. **Dtype mismatch** — FP8 variants (e4m3fn vs e4m3fnuz). gfx942=FNUZ, gfx950=FN, safetensors=FN. +4. **Transpose/permutation errors** — correct shape, wrong values. +5. **Precision loss in accumulation** — FP8->FP16 instead of FP8->FP32. +6. **RoPE / positional encoding bugs** — wrong frequency, dimension ordering, or position indices. + +--- + +## Anti-Patterns + +1. **Don't check only magnitudes** — check directions (cosine similarity). +2. **Don't test components in isolation only** — test the full quant->kernel chain. +3. **Don't assume fallbacks are complete** — verify every parameter is handled. +4. **Don't use chat API for initial triage** — use `/v1/completions`. +5. **Don't assume the first wrong layer IS the root cause** — systematic bugs affect ALL layers equally. +6. **Don't rebuild containers for each test** — patch in-place, commit, test. +7. **Don't trust "no NaN/Inf" as proof of correctness** — the most dangerous bugs produce finite, wrong-direction outputs. + +--- + +## Decision Tree + +``` +Model output is wrong/garbage +| ++- NaN/Inf in output? +| +- YES -> dtype mismatch, overflow, missing eps +| +- NO -> continue +| ++- First token wrong? 
(completions API) +| +- YES -> Prefill broken (GEMM/quant/attention) +| +- NO -> Decode broken (KV cache, decode kernels) +| ++- Embeddings correct? (compare with safetensors) +| +- NO -> Weight loading or dtype conversion bug +| +- YES -> continue +| ++- lm_head correct? (manual matmul matches logits?) +| +- NO -> lm_head kernel or weight bug +| +- YES -> Bug in transformer layers +| ++- Binary search layers: where does cos_sim drop? +| +- EVERY LAYER -> Systematic bug (quant, scale layout, dispatch) +| +- SPECIFIC LAYER -> Layer-specific bug (attention, RoPE, routing) +| ++- Found the buggy kernel? + +- Using a fallback? Does fallback handle ALL params? + +- Scale layout match between producer and consumer? + +- Correct dtype for this GPU architecture? +``` diff --git a/.claude/skills/triage-perf/SKILL.md b/.claude/skills/triage-perf/SKILL.md new file mode 100644 index 00000000..7c1fc7a3 --- /dev/null +++ b/.claude/skills/triage-perf/SKILL.md @@ -0,0 +1,229 @@ +# Skill: Triage Performance Issues in LLM Inference + +## Description +Systematic methodology for diagnosing and fixing performance regressions in LLM inference — slow token generation, high latency, or throughput far below expectations. Distilled from a debugging campaign where a clean AITER/ATOM Docker image was 570x slower than the public image, and 5 root causes were identified and fixed to reach 80% of reference performance. + +## When to Use +- Inference throughput far below expectations or a known-good reference +- Decode or prefill latency regression after a build/config/environment change +- New Docker image or new hardware (e.g., gfx942 -> gfx950) shows unexplained slowness +- Suspected GEMM dispatch, kernel fallback, or architecture detection issues + +--- + +## Core Principle + +**Slow inference is almost always a dispatch problem, not a kernel problem.** The kernels themselves are fast — the question is whether the right kernel is being called. 
Silent fallbacks to unoptimized paths (torch.mm, Python loops, wrong tile sizes) cause 10-100x slowdowns that compound across layers. + +--- + +## Phase 1: Baseline and Compare + +### 1.1 Establish a reference +- Public/production Docker image on same hardware +- Published benchmark numbers for same model/hardware +- Previous build that was fast + +### 1.2 Run identical benchmarks +```bash +curl -s http://localhost:PORT/v1/completions \ + -d '{"model":"MODEL","prompt":"The capital of France is","max_tokens":100,"temperature":0}' +``` + +### 1.3 Measure decode and prefill separately +| Phase | Bottleneck | Common Root Cause | +|-------|-----------|-------------------| +| Decode (M=1) | Memory-bandwidth, GEMM dispatch | Wrong default (torch.mm vs hipBLASLt vs ASM) | +| Prefill (M=batch) | Compute, GEMM throughput | Missing tuned configs, wrong tile size | +| Both | MOE dispatch | Architecture detection miss, LDS constraint | + +--- + +## Phase 2: Check for Silent Fallbacks + +### 2.1 JIT module build failures +```bash +grep -i "failed jit build" server.log +grep -i "fallback" server.log | head -20 +``` + +### 2.2 Architecture detection +```python +# BAD (misses gfx950): +target.arch == 'gfx942' +get_gfx().startswith("gfx94") + +# GOOD: +target.arch in ('gfx942', 'gfx950') +``` +Search: `grep -rn "gfx94\|gfx942\|get_gfx" atom/model_ops/ aiter/ops/` + +### 2.3 GEMM dispatch default +```python +# BAD: default to torch.mm (50-100x slower) +default_config["libtype"] = "torch" +# GOOD: default to hipBLASLt +default_config["libtype"] = "hipblaslt" +``` +Check `aiter/tuned_gemm.py` for default fallback logic. + +### 2.4 Tuned GEMM config coverage +Common gap: tuned configs cover M<=256 (decode) but not M>256 (prefill). 
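One quick way to spot that gap is to diff the M values in a tuned CSV against the benchmark shapes. The `M,N,K,solution` column layout here is an assumption for illustration; check the real files under `aiter/configs/` for the actual schema:

```python
import csv
import io

# Hypothetical tuned-config CSV; layout is illustrative only.
TUNED_CSV = """M,N,K,solution
1,7168,2048,asm_v1
64,7168,2048,ck_tile_a
256,7168,2048,ck_tile_b
"""

def covered_ms(csv_text):
    """Return the set of M values that have a tuned entry."""
    return {int(row["M"]) for row in csv.DictReader(io.StringIO(csv_text))}

ms = covered_ms(TUNED_CSV)
for m in (1, 64, 256, 1024, 8192):  # decode and prefill shapes
    print(f"M={m}: {'tuned' if m in ms else 'UNTUNED -> default fallback'}")
```

Any prefill M that prints as untuned will fall through to the default `libtype`, which is exactly where the torch.mm slowdown hides.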
+ +### 2.5 MOE kernel dispatch +- Architecture gating on MOE Triton kernels +- LDS size constraints (gfx950 != gfx942) +- Sorting/routing fallback paths + +--- + +## Phase 3: Component-Level Profiling + +```python +import torch, time + +def bench(name, fn, warmup=5, runs=20): + for _ in range(warmup): fn() + torch.cuda.synchronize() + t0 = time.perf_counter() + for _ in range(runs): fn() + torch.cuda.synchronize() + elapsed = (time.perf_counter() - t0) / runs * 1000 + print(f"{name}: {elapsed:.2f} ms") + +bench("torch.mm M=1", lambda: torch.mm(a1, b)) +bench("hipblaslt M=1", lambda: hipblaslt_gemm(a1, b)) +bench("tuned_gemm M=1", lambda: tuned_gemm(a1, b)) +bench("fused_moe", lambda: fused_moe(hidden, gate, experts)) +bench("paged_attn", lambda: paged_attention(q, k_cache, v_cache)) +``` + +**What to look for**: +- torch.mm vs hipBLASLt >10x gap -> GEMM default is wrong +- hipBLASLt vs tuned >2x gap -> tuned configs missing +- MOE Triton vs Python >5x gap -> dispatch failed + +--- + +## Phase 4: Docker-Specific Debugging + +### 4.1 Patch inside running containers +```bash +docker run -d --name debug ... 
IMAGE sleep infinity +docker exec debug sed -i 's/old/new/' /path/to/file.py +docker commit debug image:patched-v1 +``` + +### 4.2 Incremental validation (one fix at a time) +``` +Baseline: 0.17 tok/s (570x slower) ++ Fix gfx950: 12 tok/s (70x — dispatch was worst) ++ Fix LDS: 45 tok/s (3.8x — MOE crashes eliminated) ++ Fix GEMM: 82 tok/s (1.8x — hipBLASLt vs torch.mm) ++ Fix layout: 95 tok/s (1.15x — correct scale strides) +Reference: 118 tok/s (80% achieved) +``` + +### 4.3 Compare Docker images +```bash +# Diff packages +docker run --rm image1 pip list > /tmp/pkgs1.txt +docker run --rm image2 pip list > /tmp/pkgs2.txt +diff /tmp/pkgs1.txt /tmp/pkgs2.txt + +# Diff AITER configs +docker cp container1:/app/aiter/configs/ /tmp/configs1/ +docker cp container2:/app/aiter/configs/ /tmp/configs2/ +diff -r /tmp/configs1 /tmp/configs2 + +# Check ASM kernels +docker run --rm image1 find /app/aiter/hsa/ -name "*.co" | sort > /tmp/asm1.txt +docker run --rm image2 find /app/aiter/hsa/ -name "*.co" | sort > /tmp/asm2.txt +diff /tmp/asm1.txt /tmp/asm2.txt +``` + +--- + +## Common Root Causes (Ranked by Impact) + +1. **GEMM dispatch default (50-100x)** — untuned shapes -> torch.mm instead of hipBLASLt +2. **Architecture detection miss (10-100x)** — new GPU arch not in dispatch conditions +3. **LDS/resource constraint violation (crash -> fallback)** — tile sizes don't fit on new arch +4. **Missing ASM backend (1.5-3x on decode)** — M=1 GEMM falls to CK-Tile or hipBLASLt +5. **Tuned CSV coverage gap (1.5-2x on prefill)** — configs only cover decode shapes +6. 
**JIT build failure (2-5x per kernel)** — C++/ASM -> Triton/Python fallback + +--- + +## Known ATOM Fixes for MI355X (gfx950) + +These patches are commonly needed when running ATOM Docker images built for gfx942 on gfx950: + +### Fix 0: gfx950 Triton MOE detection +```python +# atom/model_ops/moe.py +# Change: get_gfx().startswith("gfx94") -> get_gfx() in ("gfx942", "gfx950") +``` + +### Fix 2: CDNA4MXScaleLayout rename +```python +# atom/model_ops/fused_moe_triton.py +# Change: GFX950MXScaleLayout -> CDNA4MXScaleLayout +``` + +### Fix 3: gfx950 LDS constraint +```python +# atom/model_ops/fused_moe_triton.py +# Add: update_opt_flags_constraints({"block_m": 128}) +``` + +### Fix 4: GEMM default fallback (AITER) +```python +# aiter/tuned_gemm.py +# Change: default_config["libtype"] = "torch" -> "hipblaslt" +``` + +### Fix 5: JIT non-fatal errors (AITER) +```python +# aiter/jit/core.py +# Change: raise SystemExit(...) -> raise RuntimeError(...) +``` + +### Fix 6: MOE sorting Triton fallback +```python +# aiter/fused_moe.py +# Wrap moe_sorting_fwd in try/except with Triton fallback +``` + +--- + +## Decision Tree + +``` +Inference is slower than expected +| ++- How much slower? +| +- >10x -> dispatch/fallback bug (Phase 2) +| +- 2-10x -> missing tuned configs or wrong default (Phase 2.3-2.4) +| +- <2x -> missing ASM backend or partial tuned coverage +| ++- Decode or prefill slow? +| +- DECODE -> GEMM default for M=1, ASM backend presence +| +- PREFILL -> tuned CSV coverage for large M, MOE dispatch +| +- BOTH -> architecture detection miss or JIT failures +| ++- JIT build failures in stderr? +| +- YES -> identify which modules, check fallbacks +| +- NO -> continue +| ++- GPU arch in all dispatch conditions? +| +- NO -> add arch to dispatch checks +| +- YES -> continue +| ++- GEMM default fallback? +| +- "torch" -> change to "hipblaslt" +| +- "hipblaslt" -> check tuned config coverage +| ++- All dispatch correct? + +- Component-level profiling (Phase 3) +```
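The "GPU arch in all dispatch conditions?" branch can be sanity-checked in isolation. This sketch contrasts the prefix match from 2.2 with an explicit allow-list (`SUPPORTED_ARCHES` and the function names are illustrative):

```python
SUPPORTED_ARCHES = ("gfx942", "gfx950")

def prefix_dispatch(arch):
    return arch.startswith("gfx94")   # BAD: silently misses gfx950

def allowlist_dispatch(arch):
    return arch in SUPPORTED_ARCHES   # GOOD: new arches added explicitly

for arch in ("gfx942", "gfx950"):
    print(arch, prefix_dispatch(arch), allowlist_dispatch(arch))
```

The allow-list also fails visibly when a genuinely new architecture appears: the kernel is skipped for an arch not in the tuple, which shows up in profiling, rather than half-matching on a prefix.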