From ccbe1c92769acd91e003b562beb5c875786a6bd9 Mon Sep 17 00:00:00 2001 From: Peng Sun Date: Tue, 24 Feb 2026 10:09:24 -0600 Subject: [PATCH] Add Claude Code skills for ATOM development Add 4 skills for Claude Code AI assistant: - triage-accuracy: Debug wrong/garbage model output - triage-perf: Debug slow inference performance - docker-serve: Docker build, patch, serve workflow - benchmark: InferenceX benchmarking and comparison --- .claude/skills/benchmark/SKILL.md | 284 ++++++++++++++++++++++++ .claude/skills/docker-serve/SKILL.md | 284 ++++++++++++++++++++++++ .claude/skills/triage-accuracy/SKILL.md | 229 +++++++++++++++++++ .claude/skills/triage-perf/SKILL.md | 229 +++++++++++++++++++ 4 files changed, 1026 insertions(+) create mode 100644 .claude/skills/benchmark/SKILL.md create mode 100644 .claude/skills/docker-serve/SKILL.md create mode 100644 .claude/skills/triage-accuracy/SKILL.md create mode 100644 .claude/skills/triage-perf/SKILL.md diff --git a/.claude/skills/benchmark/SKILL.md b/.claude/skills/benchmark/SKILL.md new file mode 100644 index 00000000..ac62cffa --- /dev/null +++ b/.claude/skills/benchmark/SKILL.md @@ -0,0 +1,284 @@ +# Skill: ATOM Inference Benchmarking + +## Description +Run ATOM inference benchmarks using InferenceX benchmark_serving, compare results across Docker images, and analyze throughput/latency metrics. Covers the full workflow from server launch to results analysis. + +## When to Use +- Benchmarking a new ATOM Docker image against a reference +- Comparing performance across different TP configurations, sequence lengths, or concurrency levels +- Validating that performance fixes actually improve throughput +- Generating benchmark data for reports + +--- + +## 1. 
Benchmark Setup + +### Prerequisites +```bash +# Inside Docker container +pip install -q aiohttp requests tqdm transformers +export OMP_NUM_THREADS=1 +``` + +### Directory structure +``` +/workspace/results/ # Inside container +/mnt/m2m_nobackup/pensun/results/ # Host mount + public/ # Reference image results + clean_v7/ # Clean image results + dsr1_public/ # DeepSeek-R1 reference +``` + +--- + +## 2. Running Benchmarks + +### Single benchmark +```bash +python3 -m atom.benchmarks.benchmark_serving \ + --model /models/MODEL \ + --backend vllm \ + --base-url http://localhost:8000 \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --random-range-ratio 0.8 \ + --num-prompts 80 \ + --max-concurrency 8 \ + --request-rate inf \ + --ignore-eos \ + --save-result \ + --result-dir /workspace/results \ + --result-filename result_tp8_isl1024_osl1024_conc8.json \ + --percentile-metrics ttft,tpot,itl,e2el +``` + +### Key parameters +| Parameter | Description | Typical Values | +|-----------|------------|----------------| +| `--random-input-len` | Input sequence length | 1024, 8192 | +| `--random-output-len` | Output sequence length | 1024, 8192 | +| `--max-concurrency` | Concurrent requests | 4, 8, 16, 32, 64, 128 | +| `--num-prompts` | Total requests to send | concurrency * 10 | +| `--random-range-ratio` | Min/max ratio for random lengths | 0.8 | +| `--request-rate inf` | Send requests as fast as possible | inf | +| `--ignore-eos` | Don't stop at EOS (measure full output length) | Always use | + +### Benchmark matrix +Standard benchmark suite covers these configurations: + +**Phase 1: Short context (1k/1k)** +``` +TP=8, ISL=1024, OSL=1024, CONC={4,8,16,32} +``` + +**Phase 2: Long output (1k/8k)** +``` +TP=8, ISL=1024, OSL=8192, CONC={4,8,16,32,64,128} +``` + +**Phase 3: Long input (8k/1k)** +``` +TP=8, ISL=8192, OSL=1024, CONC={4,8,16,32,64,128} +``` + +**Phase 4: Lower TP (if model fits)** +``` +TP=4 or TP=1, same ISL/OSL/CONC combos +``` + 
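The fixed-TP phases above can be driven from one generated matrix. This is a minimal sketch (`PHASES` and `benchmark_matrix` are illustrative names, not part of `atom.benchmarks.benchmark_serving`) that enumerates the `(tp, isl, osl, conc)` combinations and the `num_prompts = concurrency * 10` convention from the parameter table:

```python
# Sketch: enumerate the standard benchmark matrix above as config dicts.
# PHASES/benchmark_matrix are illustrative, not an ATOM API.
PHASES = [
    # (tp, input_len, output_len, concurrency sweep)
    (8, 1024, 1024, (4, 8, 16, 32)),            # Phase 1: short context
    (8, 1024, 8192, (4, 8, 16, 32, 64, 128)),   # Phase 2: long output
    (8, 8192, 1024, (4, 8, 16, 32, 64, 128)),   # Phase 3: long input
]

def benchmark_matrix():
    for tp, isl, osl, concs in PHASES:
        for conc in concs:
            yield {
                "tp": tp, "isl": isl, "osl": osl, "conc": conc,
                "num_prompts": conc * 10,  # convention from the table above
                "result_filename": f"result_tp{tp}_isl{isl}_osl{osl}_conc{conc}.json",
            }

configs = list(benchmark_matrix())
print(len(configs))  # 16 runs across the three fixed-TP phases
```

Phase 4 reuses the same ISL/OSL/CONC combinations with TP set to 4 or 1, so the same generator can be re-run with a different TP column.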
+--- + +## 3. Helper Functions + +### start_server +```bash +start_server() { + local tp=$1; shift; local extra_args="$@" + + # Kill previous server + pkill -9 -f "atom.entrypoints" 2>/dev/null || true + pkill -9 -f "ModelRunner" 2>/dev/null || true + sleep 5 + + # Clear GPU memory + python3 -c " +import torch +for i in range(8): + try: + torch.cuda.set_device(i); torch.cuda.empty_cache() + except: pass +" 2>/dev/null || true + sleep 15 + + # Launch server + python3 -m atom.entrypoints.openai_server \ + --model $MODEL --server-port $PORT -tp $tp \ + --kv_cache_dtype fp8 --block-size 16 \ + $extra_args > /workspace/server.log 2>&1 & + SERVER_PID=$! + + # Wait for health (up to 20 min) + for i in $(seq 1 240); do + curl -s http://localhost:$PORT/health > /dev/null 2>&1 && return 0 + kill -0 $SERVER_PID 2>/dev/null || return 1 + sleep 5 + done + return 1 +} +``` + +### run_bench +```bash +run_bench() { + local isl=$1 osl=$2 conc=$3 tp=$4 + local num_prompts=$((conc * 10)) + local result_file="${MODEL_PREFIX}_tp${tp}_isl${isl}_osl${osl}_conc${conc}" + + # Skip if already done + [ -f "$RESULT_DIR/${result_file}.json" ] && return 0 + + timeout 7200 python3 -m atom.benchmarks.benchmark_serving \ + --model $MODEL --backend vllm --base-url http://localhost:$PORT \ + --dataset-name random \ + --random-input-len $isl --random-output-len $osl \ + --random-range-ratio 0.8 \ + --num-prompts $num_prompts --max-concurrency $conc \ + --request-rate inf --ignore-eos \ + --save-result --result-dir $RESULT_DIR \ + --result-filename ${result_file}.json \ + --percentile-metrics ttft,tpot,itl,e2el +} +``` + +--- + +## 4. 
Results Analysis + +### Result JSON format +Each benchmark produces a JSON file with: +```json +{ + "output_throughput": 1234.5, // output tokens/sec + "total_token_throughput": 2345.6, // total tokens/sec (input+output) + "mean_ttft_ms": 45.2, // time to first token (ms) + "median_tpot_ms": 8.1, // time per output token (ms) + "mean_itl_ms": 8.3, // inter-token latency (ms) + "mean_e2el_ms": 850.0 // end-to-end latency (ms) +} +``` + +### Key metrics +| Metric | What It Measures | Lower/Higher Better | +|--------|-----------------|-------------------| +| `output_throughput` | Output tokens per second | Higher | +| `total_token_throughput` | All tokens per second | Higher | +| `mean_ttft_ms` | Time to first token | Lower | +| `median_tpot_ms` | Time per output token | Lower | + +### Quick summary script +```bash +for f in /workspace/results/*.json; do + [ -f "$f" ] || continue + base=$(basename "$f" .json) + python3 -c " +import json; d=json.load(open('$f')) +print(f'$base: output={d.get(\"output_throughput\",0):.1f} total={d.get(\"total_token_throughput\",0):.1f} ttft={d.get(\"mean_ttft_ms\",0):.1f}ms tpot={d.get(\"median_tpot_ms\",0):.1f}ms') +" +done +``` + +### Comparison table generator +```python +import json, glob, os + +def load_results(result_dir, prefix): + results = {} + for f in glob.glob(f"{result_dir}/{prefix}_*.json"): + base = os.path.basename(f).replace('.json', '') + # Extract config from filename: prefix_tpN_islN_oslN_concN + parts = base.split('_') + key = '_'.join(parts[1:]) # remove prefix + with open(f) as fh: + results[key] = json.load(fh) + return results + +public = load_results('/workspace/results', 'public') +clean = load_results('/workspace/results', 'clean') + +print(f"{'Config':<35} {'Public':>10} {'Clean':>10} {'Ratio':>8}") +print("-" * 65) +for key in sorted(public.keys()): + if key in clean: + pub_tp = public[key].get('output_throughput', 0) + cln_tp = clean[key].get('output_throughput', 0) + ratio = cln_tp / pub_tp if pub_tp > 0 
else 0 + print(f"{key:<35} {pub_tp:>10.1f} {cln_tp:>10.1f} {ratio:>7.1%}") +``` + +--- + +## 5. Remote Node Execution + +### Deploy and run on remote nodes +```bash +NODE="uswslocpm2m-106-1236.amd.com" + +# 1. Write benchmark script locally +cat > /tmp/bench.sh << 'EOF' +#!/bin/bash +# ... benchmark script content ... +EOF + +# 2. Deploy +scp /tmp/bench.sh ${NODE}:/mnt/m2m_nobackup/pensun/bench_serving/ + +# 3. Launch (never inline complex Docker commands via SSH) +ssh ${NODE} 'bash /mnt/m2m_nobackup/pensun/bench_serving/bench.sh' + +# 4. Monitor +ssh ${NODE} 'docker logs -f atom-bench 2>&1 | tail -20' + +# 5. Collect results +scp ${NODE}:/mnt/m2m_nobackup/pensun/results/*.json /home/pensun/results/ +``` + +### Multi-node deployment +When running on multiple nodes (e.g., public vs clean on separate nodes): +```bash +NODE1="uswslocpm2m-106-881.amd.com" # Public image +NODE2="uswslocpm2m-106-1236.amd.com" # Clean image + +# Deploy to both +for NODE in $NODE1 $NODE2; do + scp /tmp/bench.sh ${NODE}:/mnt/m2m_nobackup/pensun/bench_serving/ +done + +# Launch in parallel +ssh ${NODE1} 'nohup bash /mnt/.../bench.sh --prefix public > /mnt/.../run.log 2>&1 &' +ssh ${NODE2} 'nohup bash /mnt/.../bench.sh --prefix clean > /mnt/.../run.log 2>&1 &' +``` + +--- + +## 6. Common Issues + +| Issue | Symptom | Fix | +|-------|---------|-----| +| Server OOM | Killed during model load | Reduce `--max-model-len`, or increase TP | +| Benchmark timeout | `timeout 7200` exits 124 | High concurrency + long output overwhelms server. Reduce concurrency or output length | +| Stale GPU processes | Server won't start, GPU memory full | `pkill -9 -f atom.entrypoints && sleep 15` | +| Results already exist | `run_bench` silently skips | Delete old JSON or use different prefix | +| SSH escaping | `Ambiguous output redirect` | Never inline Docker commands via SSH. Use script files. 
| +| Node 1 GPU issues | Memory faults on all shapes | Prefer Node 2 (1236) for reliable benchmarks | + +--- + +## Anti-Patterns + +1. **Don't compare results from different hardware** — MI300X vs MI355X have different performance profiles +2. **Don't skip the warmup** — first few requests are always slower (JIT compilation, Triton caching) +3. **Don't forget `--ignore-eos`** — without it, output length varies and throughput numbers aren't comparable +4. **Don't use `--request-rate` below inf for throughput tests** — rate limiting artificially caps throughput +5. **Don't compare decode and prefill numbers directly** — they measure fundamentally different things +6. **Don't inline Docker commands via SSH** — write scripts, scp them, launch via `bash /path/to/script.sh` diff --git a/.claude/skills/docker-serve/SKILL.md b/.claude/skills/docker-serve/SKILL.md new file mode 100644 index 00000000..ffc84994 --- /dev/null +++ b/.claude/skills/docker-serve/SKILL.md @@ -0,0 +1,284 @@ +# Skill: ATOM Docker Build, Serve, and Debug + +## Description +End-to-end workflow for building, patching, serving, and debugging ATOM Docker images for LLM inference on AMD GPUs. Covers the Docker run template, in-container patching, image tagging, server launch, and health checking. + +## When to Use +- Setting up ATOM inference for a new model or Docker image +- Patching an existing Docker image without rebuilding from scratch +- Debugging server startup failures or configuration issues +- Comparing behavior between two Docker images + +--- + +## 1. 
Docker Run Template + +### Full server launch +```bash +docker run -d \ + --name atom-server \ + --network host \ + --device /dev/kfd \ + --device /dev/dri \ + --group-add video \ + --ipc host \ + --shm-size 256g \ + --cap-add SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /path/to/models:/models:ro \ + -v /path/to/results:/workspace/results \ + IMAGE_NAME \ + python3 -m atom.entrypoints.openai_server \ + --model /models/MODEL_NAME \ + --server-port 8000 \ + -tp 8 \ + --kv_cache_dtype fp8 \ + --block-size 16 \ + --max-model-len 4096 \ + --trust-remote-code \ + --enforce-eager +``` + +### Debug container (sleep infinity for patching) +```bash +docker run -d \ + --name atom-debug \ + --network host \ + --device /dev/kfd \ + --device /dev/dri \ + --group-add video \ + --ipc host \ + --shm-size 256g \ + --cap-add SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /path/to/models:/models:ro \ + IMAGE_NAME \ + sleep infinity +``` + +### Required flags explained +| Flag | Why | +|------|-----| +| `--device /dev/kfd --device /dev/dri` | GPU access on ROCm | +| `--group-add video` | GPU device permissions | +| `--ipc host` | Shared memory for multi-GPU | +| `--shm-size 256g` | Large shared memory for tensor parallel | +| `--cap-add SYS_PTRACE` | ROCm profiler/debugger access | +| `--security-opt seccomp=unconfined` | Required for some ROCm operations | +| `--network host` | Direct port access (no -p mapping needed) | + +--- + +## 2. Patch-Commit-Test Workflow + +**Never rebuild Docker for debugging.** Patch inside running containers instead. + +### Step 1: Start debug container +```bash +docker run -d --name atom-debug ... 
IMAGE sleep infinity +``` + +### Step 2: Apply patches + +**Option A: sed for simple string replacements** +```bash +docker exec atom-debug sed -i 's/old_string/new_string/' /app/ATOM/atom/model_ops/moe.py +``` + +**Option B: Python script for complex patches** +```bash +docker exec atom-debug python3 -c " +content = open('/app/ATOM/atom/model_ops/moe.py').read() +content = content.replace('old_pattern', 'new_pattern') +open('/app/ATOM/atom/model_ops/moe.py', 'w').write(content) +print('Patched') +" +``` + +**Option C: docker cp for file replacement** +```bash +docker cp local_fixed_file.py atom-debug:/app/ATOM/atom/model_ops/moe.py +``` + +### Step 3: Commit as new image +```bash +docker commit atom-debug IMAGE:patched-v1 +``` + +### Step 4: Test with new image +```bash +docker rm -f atom-debug +docker run -d --name atom-test ... IMAGE:patched-v1 \ + python3 -m atom.entrypoints.openai_server --model /models/MODEL ... +``` + +### Step 5: Verify +```bash +# Wait for server health +for i in $(seq 1 60); do + curl -s http://localhost:8000/health > /dev/null 2>&1 && break + sleep 5 +done + +# Test completions +curl -s http://localhost:8000/v1/completions \ + -d '{"model":"/models/MODEL","prompt":"The capital of France is","max_tokens":30,"temperature":0}' +``` + +--- + +## 3. 
Server Configuration
+
+### Model-specific configurations
+
+**DeepSeek-R1 671B (MXFP4)**
+```bash
+python3 -m atom.entrypoints.openai_server \
+  --model /models/DeepSeek-R1-0528-MXFP4 \
+  -tp 8 \
+  --kv_cache_dtype fp8 \
+  --block-size 16 \
+  --max-model-len 4096  # or 10240 for long context
+```
+
+**GPT-OSS 120B (MXFP4)**
+```bash
+# -tp 1 also works for single-GPU serving
+python3 -m atom.entrypoints.openai_server \
+  --model /models/gpt-oss-120b \
+  -tp 8 \
+  --kv_cache_dtype fp8 \
+  --block-size 16 \
+  --max-model-len 4096
+```
+
+### Common server flags
+| Flag | Description | Default |
+|------|------------|---------|
+| `-tp N` | Tensor parallel degree | 1 |
+| `--kv_cache_dtype fp8` | FP8 KV cache (saves memory) | auto |
+| `--block-size 16` | KV cache block size | 16 |
+| `--max-model-len N` | Max sequence length | Model default |
+| `--enforce-eager` | Disable CUDA graphs (debug) | False |
+| `--trust-remote-code` | Allow custom model code | False |
+
+### Health check
+```bash
+curl -s http://localhost:8000/health
+# Returns 200 when ready
+```
+
+### Server startup wait pattern
+```bash
+start_server() {
+  python3 -m atom.entrypoints.openai_server ... &
+  SERVER_PID=$!
+  for i in $(seq 1 240); do
+    curl -s http://localhost:8000/health > /dev/null 2>&1 && return 0
+    kill -0 $SERVER_PID 2>/dev/null || { echo "Server died"; return 1; }
+    sleep 5
+  done
+  echo "Timeout (20 min)"; return 1
+}
+```
+
+---
+
+## 4. Image Tagging Convention
+
+```
+username_date_rocmversion_aiterversion_atomversion
+```
+
+**Examples:**
+- `pensun_20260219_rocm720_aiter011_atom010` — base clean image
+- `IMAGE:patched-v1` — patched variant
+
+**Registry:** `rocm/atom-private`
+
+---
+
+## 5. 
GPU Cleanup Between Runs + +When restarting the server inside the same container: +```bash +# Kill all related processes +pkill -9 -f "atom.entrypoints" 2>/dev/null || true +pkill -9 -f "ModelRunner" 2>/dev/null || true +pkill -9 -f "from multiprocessing" 2>/dev/null || true +sleep 5 + +# Clear GPU memory +python3 -c " +import torch +for i in range(8): + try: + torch.cuda.set_device(i) + torch.cuda.empty_cache() + torch.cuda.reset_peak_memory_stats(i) + except: pass +" +sleep 15 # Wait for GPU memory release +``` + +--- + +## 6. Updating ATOM/AITER from GitHub Inside Container + +```bash +# Update ATOM +cd /app/ATOM +pip uninstall -y atom 2>/dev/null || true +git fetch origin main && git checkout origin/main +pip install -e . --no-deps + +# Update AITER +cd /app/aiter-test # or /root/aiter +pip uninstall -y amd-aiter 2>/dev/null || true +git fetch origin main && git checkout origin/main +rm -rf aiter/jit/build aiter/jit/*.so +git submodule sync && git submodule update --init --recursive +python setup.py develop +``` + +--- + +## 7. Troubleshooting + +### Server won't start +```bash +# Check logs +docker logs atom-server 2>&1 | tail -100 + +# Common issues: +# - OOM: reduce --max-model-len or increase --shm-size +# - GPU not found: check --device flags and --group-add video +# - Module import error: check AITER/ATOM versions match +``` + +### Server starts but output is wrong +Use the `triage-accuracy` skill. + +### Server starts but output is slow +Use the `triage-perf` skill. + +### Remote node deployment +```bash +# Deploy scripts to remote node +scp bench.sh NODE:/path/to/bench.sh + +# Launch via SSH (never inline complex Docker commands) +ssh NODE 'bash /path/to/bench.sh' + +# Monitor +ssh NODE 'docker logs -f atom-server 2>&1 | tail -50' +``` + +--- + +## Anti-Patterns + +1. **Don't rebuild Docker for each test** — patch + commit is 100x faster +2. **Don't inline complex Docker commands via SSH** — shell escaping breaks. 
Write scripts, scp, then `ssh node 'bash script.sh'` +3. **Don't forget GPU cleanup between server restarts** — leftover processes hold GPU memory +4. **Don't skip the health check** — server takes 5-20 min to load large models +5. **Don't mount model directories read-write** — use `:ro` to prevent accidental modification diff --git a/.claude/skills/triage-accuracy/SKILL.md b/.claude/skills/triage-accuracy/SKILL.md new file mode 100644 index 00000000..000e0e79 --- /dev/null +++ b/.claude/skills/triage-accuracy/SKILL.md @@ -0,0 +1,229 @@ +# Skill: Triage Accuracy Issues in LLM Inference + +## Description +Systematic methodology for diagnosing and fixing correctness/accuracy issues in LLM inference — garbage output, wrong answers, numerical instability, or degraded quality. Works across any backend (CK, Triton, ASM, hipBLASLt), any quantization format (FP8, FP4, INT8, BF16), and any model architecture. + +## When to Use +- Model produces garbage, spaces, random tokens, or incoherent text +- Model gives wrong answers but no errors/crashes/NaN +- Output quality degraded after a code change, build change, or environment change +- Numerical instability (NaN/Inf) during inference +- Suspected kernel, quantization, or routing bugs + +--- + +## Core Principle + +**Check DIRECTIONS, not magnitudes.** Norms, means, min/max can all look reasonable while the hidden state points in a completely wrong direction. **Cosine similarity against a known-good reference** is the gold standard metric. + +| Cosine Similarity | Verdict | +|-------------------|---------| +| > 0.999 | Correct | +| 0.99 – 0.999 | Suspicious — investigate | +| 0.9 – 0.99 | Broken — will degrade over layers | +| < 0.9 | Catastrophically wrong | + +A per-layer cos_sim of 0.85 looks tolerable but compounds across N layers. For a 61-layer model, the final hidden state can be completely uncorrelated with the reference (cos_sim ~ 0.05). 
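The verdict table, as a minimal self-contained sketch. The pure-Python `cosine_similarity` is a stand-in for `torch.nn.functional.cosine_similarity`, and `classify_cos_sim` is an illustrative helper, not an ATOM API:

```python
import math

# Pure-Python stand-in for F.cosine_similarity on flattened tensors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The verdict table from above, as code.
def classify_cos_sim(cos):
    if cos > 0.999:
        return "correct"
    if cos > 0.99:
        return "suspicious"      # investigate
    if cos > 0.9:
        return "broken"          # will degrade over layers
    return "catastrophically wrong"
```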
+
+---
+
+## Phase 1: Reproduce and Characterize
+
+### 1.1 Use completions API, not chat
+Chat APIs apply templates, special-token handling, and streaming that can mask issues:
+```bash
+curl -s http://localhost:PORT/v1/completions \
+  -d '{"model":"MODEL","prompt":"The capital of France is","max_tokens":30,"temperature":0}'
+```
+
+### 1.2 Use deterministic prompts with known answers
+| Prompt | Expected | Tests |
+|--------|----------|-------|
+| "The capital of France is" | " Paris" | Basic knowledge |
+| "1+1=" | "2" | Arithmetic |
+| "John went to the store. John" | "bought" or similar | Coherence |
+| "The" | Any reasonable continuation | Not garbage |
+
+### 1.3 Compare with a known-good reference
+Run identical prompts on a reference build/image. If both are wrong, the issue is upstream (model weights, config). If only one is wrong, the issue is in the code delta between them.
+
+### 1.4 Check for build/JIT failures
+Look for silent JIT failures in stderr:
+```bash
+# AITER JIT
+grep -i "failed jit build" server.log
+# General
+grep -i "fallback\|warning\|error" server.log | head -50
+```
+Silent fallback to a less-tested code path is a top root cause.
+
+---
+
+## Phase 2: Binary Search Through the Model Pipeline
+
+Start from both ends (input and output) and narrow inward.
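This narrowing can be sketched as a bisection over layer indices. Here `cos_sim_at_layer` is a hypothetical probe that returns the per-layer cosine similarity against the reference, not a real ATOM function:

```python
def first_bad_layer(cos_sim_at_layer, num_layers, threshold=0.999):
    """Bisect for the first layer whose output diverges from the reference.

    Assumes divergence is monotone: once a layer is bad, later layers stay
    bad (true for a single localized bug; a systematic bug makes every
    layer bad and this returns layer 0).
    """
    lo, hi = 0, num_layers - 1
    if cos_sim_at_layer(hi) >= threshold:
        return None  # final hidden state matches: nothing diverges
    while lo < hi:
        mid = (lo + hi) // 2
        if cos_sim_at_layer(mid) >= threshold:
            lo = mid + 1  # still matches reference; bug is later
        else:
            hi = mid      # already diverged; bug is here or earlier
    return lo
```

Each probe is expensive (a forward pass with hidden-state capture), so bisection keeps the count to O(log N) instead of checking all layers.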
+ +### Step 1: Verify Embeddings +```python +from safetensors import safe_open +with safe_open(path, framework='pt', device='cpu') as sf: + embed_w = sf.get_tensor('model.embed_tokens.weight') + +ref = embed_w[token_ids] # should match exactly +cos = F.cosine_similarity(ref.flatten(), model_hidden.flatten(), dim=0) +# Expected: 1.000000 (exact match) +``` + +### Step 2: Verify lm_head (output projection) +```python +logits_manual = hidden_states @ lm_head_weight.T +cos = F.cosine_similarity(logits_manual.flatten(), model_logits.flatten(), dim=0) +# If this matches but final output is wrong -> bug is in the transformer layers +``` + +### Step 3: Narrow to layer granularity +```python +for layer_idx in [0, 15, 30, 45, 60]: + cos = F.cosine_similarity( + test_hidden[layer_idx].flatten(), + ref_hidden[layer_idx].flatten(), dim=0 + ) + print(f"Layer {layer_idx}: cos_sim={cos:.6f}") +``` +Find where cos_sim drops — the bug is in that layer's subcomponents. + +### Step 4: Narrow within a layer +1. **RMSNorm / LayerNorm** — recompute manually +2. **Attention** (Q/K/V projections, RoPE, softmax, output projection) +3. **MoE routing** (expert selection, gating scores) +4. **FP8/INT8 GEMM** (quantize -> matmul -> dequantize) +5. **Residual connection** (addition, not a common failure point) + +--- + +## Phase 3: Build Standalone Verification + +```python +import torch +from safetensors import safe_open + +# 1. Load actual model weights (bypass model loading) +with safe_open(safetensors_path, framework='pt', device='cpu') as sf: + weight = sf.get_tensor('model.layers.0.mlp.gate_proj.weight') + scale = sf.get_tensor('model.layers.0.mlp.gate_proj.weight_scale_inv') + +# 2. Compute FP32 reference on CPU +weight_f32 = dequantize(weight, scale) # manual block dequant +ref_output = input_f32 @ weight_f32.T + +# 3. Call actual kernel with same inputs +kernel_output = suspect_kernel(input_quant, weight, input_scale, weight_scale) + +# 4. 
Compare +cos = F.cosine_similarity(ref_output.flatten(), kernel_output.float().flatten(), dim=0) +print(f"cos_sim = {cos:.6f}") # < 0.999 = bug confirmed +``` + +### Key: Test the FULL chain, not components in isolation +The most dangerous bugs live at **interfaces between components**. A GEMM kernel can be correct, and a quantization kernel can be correct, but if the scale layout from quant doesn't match what GEMM expects, the result is wrong. Always test quant->GEMM as one unit. + +--- + +## Phase 4: Trace the Data Flow + +### For quantized GEMM paths (ATOM): +``` +model.forward() + -> LinearBase.forward() # atom/model_ops/linear.py + -> quant_function(input, ...) # produces (input_fp8, input_scale) + -> [primary kernel OR fallback] + -> gemm_function(input_fp8, weight, input_scale, weight_scale) + -> [primary kernel OR fallback] +``` + +### What to check at each hop: +1. **Tensor shapes** — are they what the next function expects? +2. **Memory layout** — row-major vs column-major, contiguous vs strided +3. **Scale tensor layout** — this is the #1 source of silent correctness bugs +4. **Dtype** — especially FP8 variants (e4m3fn vs e4m3fnuz on AMD) +5. **Fallback path** — does the fallback handle ALL parameters the primary does? + +### The "Silent Parameter" Bug Pattern +```python +# DANGEROUS: function accepts parameter but ignores it +def fallback_quant(out, input, scales, shuffle_scale=False): + triton_quant(out, input, scales) + # shuffle_scale is SILENTLY IGNORED +``` +Every fallback function should either implement the parameter fully, or `raise NotImplementedError`. 
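A sketch of the safe version of the same fallback, keeping the hypothetical `fallback_quant`/`triton_quant` names from the snippet above: the unsupported parameter fails loudly instead of being silently dropped.

```python
def triton_quant(out, input, scales):
    pass  # stand-in for the real Triton quant kernel

def fallback_quant(out, input, scales, shuffle_scale=False):
    if shuffle_scale:
        # No shuffled-scale support in this path: raising here turns a
        # silent wrong-layout bug into an immediate, debuggable error.
        raise NotImplementedError(
            "fallback_quant does not implement shuffle_scale=True"
        )
    triton_quant(out, input, scales)
```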
+ +--- + +## Phase 5: Fix and Verify + +### 5.1 Apply minimal fix +### 5.2 Verify with standalone test (cos_sim should jump from 0.7-0.9 to 0.9999+) +### 5.3 Verify end-to-end on deterministic prompts from Phase 1 +### 5.4 Docker patch workflow +```bash +docker cp fix.py container:/tmp/ +docker exec container python3 /tmp/fix.py +docker commit container image:fixed +curl -s http://localhost:PORT/v1/completions -d '{"prompt":"The capital of France is",...}' +``` + +--- + +## Common Root Causes (Ranked by Frequency) + +1. **Scale/metadata layout mismatch** — quant writes one layout, GEMM expects another. Produces cos_sim 0.7-0.9 per layer, compounds to garbage. +2. **Silent fallback to untested code path** — JIT build fails, fallback ignores parameters. +3. **Dtype mismatch** — FP8 variants (e4m3fn vs e4m3fnuz). gfx942=FNUZ, gfx950=FN, safetensors=FN. +4. **Transpose/permutation errors** — correct shape, wrong values. +5. **Precision loss in accumulation** — FP8->FP16 instead of FP8->FP32. +6. **RoPE / positional encoding bugs** — wrong frequency, dimension ordering, or position indices. + +--- + +## Anti-Patterns + +1. **Don't check only magnitudes** — check directions (cosine similarity). +2. **Don't test components in isolation only** — test the full quant->kernel chain. +3. **Don't assume fallbacks are complete** — verify every parameter is handled. +4. **Don't use chat API for initial triage** — use `/v1/completions`. +5. **Don't assume the first wrong layer IS the root cause** — systematic bugs affect ALL layers equally. +6. **Don't rebuild containers for each test** — patch in-place, commit, test. +7. **Don't trust "no NaN/Inf" as proof of correctness** — the most dangerous bugs produce finite, wrong-direction outputs. + +--- + +## Decision Tree + +``` +Model output is wrong/garbage +| ++- NaN/Inf in output? +| +- YES -> dtype mismatch, overflow, missing eps +| +- NO -> continue +| ++- First token wrong? 
(completions API) +| +- YES -> Prefill broken (GEMM/quant/attention) +| +- NO -> Decode broken (KV cache, decode kernels) +| ++- Embeddings correct? (compare with safetensors) +| +- NO -> Weight loading or dtype conversion bug +| +- YES -> continue +| ++- lm_head correct? (manual matmul matches logits?) +| +- NO -> lm_head kernel or weight bug +| +- YES -> Bug in transformer layers +| ++- Binary search layers: where does cos_sim drop? +| +- EVERY LAYER -> Systematic bug (quant, scale layout, dispatch) +| +- SPECIFIC LAYER -> Layer-specific bug (attention, RoPE, routing) +| ++- Found the buggy kernel? + +- Using a fallback? Does fallback handle ALL params? + +- Scale layout match between producer and consumer? + +- Correct dtype for this GPU architecture? +``` diff --git a/.claude/skills/triage-perf/SKILL.md b/.claude/skills/triage-perf/SKILL.md new file mode 100644 index 00000000..7c1fc7a3 --- /dev/null +++ b/.claude/skills/triage-perf/SKILL.md @@ -0,0 +1,229 @@ +# Skill: Triage Performance Issues in LLM Inference + +## Description +Systematic methodology for diagnosing and fixing performance regressions in LLM inference — slow token generation, high latency, or throughput far below expectations. Distilled from a debugging campaign where a clean AITER/ATOM Docker image was 570x slower than the public image, and 5 root causes were identified and fixed to reach 80% of reference performance. + +## When to Use +- Inference throughput far below expectations or a known-good reference +- Decode or prefill latency regression after a build/config/environment change +- New Docker image or new hardware (e.g., gfx942 -> gfx950) shows unexplained slowness +- Suspected GEMM dispatch, kernel fallback, or architecture detection issues + +--- + +## Core Principle + +**Slow inference is almost always a dispatch problem, not a kernel problem.** The kernels themselves are fast — the question is whether the right kernel is being called. 
Silent fallbacks to unoptimized paths (torch.mm, Python loops, wrong tile sizes) cause 10-100x slowdowns that compound across layers. + +--- + +## Phase 1: Baseline and Compare + +### 1.1 Establish a reference +- Public/production Docker image on same hardware +- Published benchmark numbers for same model/hardware +- Previous build that was fast + +### 1.2 Run identical benchmarks +```bash +curl -s http://localhost:PORT/v1/completions \ + -d '{"model":"MODEL","prompt":"The capital of France is","max_tokens":100,"temperature":0}' +``` + +### 1.3 Measure decode and prefill separately +| Phase | Bottleneck | Common Root Cause | +|-------|-----------|-------------------| +| Decode (M=1) | Memory-bandwidth, GEMM dispatch | Wrong default (torch.mm vs hipBLASLt vs ASM) | +| Prefill (M=batch) | Compute, GEMM throughput | Missing tuned configs, wrong tile size | +| Both | MOE dispatch | Architecture detection miss, LDS constraint | + +--- + +## Phase 2: Check for Silent Fallbacks + +### 2.1 JIT module build failures +```bash +grep -i "failed jit build" server.log +grep -i "fallback" server.log | head -20 +``` + +### 2.2 Architecture detection +```python +# BAD (misses gfx950): +target.arch == 'gfx942' +get_gfx().startswith("gfx94") + +# GOOD: +target.arch in ('gfx942', 'gfx950') +``` +Search: `grep -rn "gfx94\|gfx942\|get_gfx" atom/model_ops/ aiter/ops/` + +### 2.3 GEMM dispatch default +```python +# BAD: default to torch.mm (50-100x slower) +default_config["libtype"] = "torch" +# GOOD: default to hipBLASLt +default_config["libtype"] = "hipblaslt" +``` +Check `aiter/tuned_gemm.py` for default fallback logic. + +### 2.4 Tuned GEMM config coverage +Common gap: tuned configs cover M<=256 (decode) but not M>256 (prefill). 
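One quick way to spot that gap is to diff the M values in a tuned CSV against the benchmark shapes. The `M,N,K,solution` column layout here is an assumption for illustration; check the real files under `aiter/configs/` for the actual schema:

```python
import csv
import io

# Hypothetical tuned-config CSV; layout is illustrative only.
TUNED_CSV = """M,N,K,solution
1,7168,2048,asm_v1
64,7168,2048,ck_tile_a
256,7168,2048,ck_tile_b
"""

def covered_ms(csv_text):
    """Return the set of M values that have a tuned entry."""
    return {int(row["M"]) for row in csv.DictReader(io.StringIO(csv_text))}

ms = covered_ms(TUNED_CSV)
for m in (1, 64, 256, 1024, 8192):  # decode and prefill shapes
    print(f"M={m}: {'tuned' if m in ms else 'UNTUNED -> default fallback'}")
```

Any prefill M that prints as untuned will fall through to the default `libtype`, which is exactly where the torch.mm slowdown hides.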
+ +### 2.5 MOE kernel dispatch +- Architecture gating on MOE Triton kernels +- LDS size constraints (gfx950 != gfx942) +- Sorting/routing fallback paths + +--- + +## Phase 3: Component-Level Profiling + +```python +import torch, time + +def bench(name, fn, warmup=5, runs=20): + for _ in range(warmup): fn() + torch.cuda.synchronize() + t0 = time.perf_counter() + for _ in range(runs): fn() + torch.cuda.synchronize() + elapsed = (time.perf_counter() - t0) / runs * 1000 + print(f"{name}: {elapsed:.2f} ms") + +bench("torch.mm M=1", lambda: torch.mm(a1, b)) +bench("hipblaslt M=1", lambda: hipblaslt_gemm(a1, b)) +bench("tuned_gemm M=1", lambda: tuned_gemm(a1, b)) +bench("fused_moe", lambda: fused_moe(hidden, gate, experts)) +bench("paged_attn", lambda: paged_attention(q, k_cache, v_cache)) +``` + +**What to look for**: +- torch.mm vs hipBLASLt >10x gap -> GEMM default is wrong +- hipBLASLt vs tuned >2x gap -> tuned configs missing +- MOE Triton vs Python >5x gap -> dispatch failed + +--- + +## Phase 4: Docker-Specific Debugging + +### 4.1 Patch inside running containers +```bash +docker run -d --name debug ... 
IMAGE sleep infinity +docker exec debug sed -i 's/old/new/' /path/to/file.py +docker commit debug image:patched-v1 +``` + +### 4.2 Incremental validation (one fix at a time) +``` +Baseline: 0.17 tok/s (570x slower) ++ Fix gfx950: 12 tok/s (70x — dispatch was worst) ++ Fix LDS: 45 tok/s (3.8x — MOE crashes eliminated) ++ Fix GEMM: 82 tok/s (1.8x — hipBLASLt vs torch.mm) ++ Fix layout: 95 tok/s (1.15x — correct scale strides) +Reference: 118 tok/s (80% achieved) +``` + +### 4.3 Compare Docker images +```bash +# Diff packages +docker run --rm image1 pip list > /tmp/pkgs1.txt +docker run --rm image2 pip list > /tmp/pkgs2.txt +diff /tmp/pkgs1.txt /tmp/pkgs2.txt + +# Diff AITER configs +docker cp container1:/app/aiter/configs/ /tmp/configs1/ +docker cp container2:/app/aiter/configs/ /tmp/configs2/ +diff -r /tmp/configs1 /tmp/configs2 + +# Check ASM kernels +docker run --rm image1 find /app/aiter/hsa/ -name "*.co" | sort > /tmp/asm1.txt +docker run --rm image2 find /app/aiter/hsa/ -name "*.co" | sort > /tmp/asm2.txt +diff /tmp/asm1.txt /tmp/asm2.txt +``` + +--- + +## Common Root Causes (Ranked by Impact) + +1. **GEMM dispatch default (50-100x)** — untuned shapes -> torch.mm instead of hipBLASLt +2. **Architecture detection miss (10-100x)** — new GPU arch not in dispatch conditions +3. **LDS/resource constraint violation (crash -> fallback)** — tile sizes don't fit on new arch +4. **Missing ASM backend (1.5-3x on decode)** — M=1 GEMM falls to CK-Tile or hipBLASLt +5. **Tuned CSV coverage gap (1.5-2x on prefill)** — configs only cover decode shapes +6. 
**JIT build failure (2-5x per kernel)** — C++/ASM -> Triton/Python fallback + +--- + +## Known ATOM Fixes for MI355X (gfx950) + +These patches are commonly needed when running ATOM Docker images built for gfx942 on gfx950: + +### Fix 0: gfx950 Triton MOE detection +```python +# atom/model_ops/moe.py +# Change: get_gfx().startswith("gfx94") -> get_gfx() in ("gfx942", "gfx950") +``` + +### Fix 2: CDNA4MXScaleLayout rename +```python +# atom/model_ops/fused_moe_triton.py +# Change: GFX950MXScaleLayout -> CDNA4MXScaleLayout +``` + +### Fix 3: gfx950 LDS constraint +```python +# atom/model_ops/fused_moe_triton.py +# Add: update_opt_flags_constraints({"block_m": 128}) +``` + +### Fix 4: GEMM default fallback (AITER) +```python +# aiter/tuned_gemm.py +# Change: default_config["libtype"] = "torch" -> "hipblaslt" +``` + +### Fix 5: JIT non-fatal errors (AITER) +```python +# aiter/jit/core.py +# Change: raise SystemExit(...) -> raise RuntimeError(...) +``` + +### Fix 6: MOE sorting Triton fallback +```python +# aiter/fused_moe.py +# Wrap moe_sorting_fwd in try/except with Triton fallback +``` + +--- + +## Decision Tree + +``` +Inference is slower than expected +| ++- How much slower? +| +- >10x -> dispatch/fallback bug (Phase 2) +| +- 2-10x -> missing tuned configs or wrong default (Phase 2.3-2.4) +| +- <2x -> missing ASM backend or partial tuned coverage +| ++- Decode or prefill slow? +| +- DECODE -> GEMM default for M=1, ASM backend presence +| +- PREFILL -> tuned CSV coverage for large M, MOE dispatch +| +- BOTH -> architecture detection miss or JIT failures +| ++- JIT build failures in stderr? +| +- YES -> identify which modules, check fallbacks +| +- NO -> continue +| ++- GPU arch in all dispatch conditions? +| +- NO -> add arch to dispatch checks +| +- YES -> continue +| ++- GEMM default fallback? +| +- "torch" -> change to "hipblaslt" +| +- "hipblaslt" -> check tuned config coverage +| ++- All dispatch correct? + +- Component-level profiling (Phase 3) +```
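The "GPU arch in all dispatch conditions?" branch can be sanity-checked in isolation. This sketch contrasts the prefix match from 2.2 with an explicit allow-list (`SUPPORTED_ARCHES` and the function names are illustrative):

```python
SUPPORTED_ARCHES = ("gfx942", "gfx950")

def prefix_dispatch(arch):
    return arch.startswith("gfx94")   # BAD: silently misses gfx950

def allowlist_dispatch(arch):
    return arch in SUPPORTED_ARCHES   # GOOD: new arches added explicitly

for arch in ("gfx942", "gfx950"):
    print(arch, prefix_dispatch(arch), allowlist_dispatch(arch))
```

The allow-list also fails visibly when a genuinely new architecture appears: the kernel is skipped for an arch not in the tuple, which shows up in profiling, rather than half-matching on a prefix.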