
Conversation


@msaroufim msaroufim commented Feb 10, 2026

Summary

End-to-end support for model competitions where users submit vLLM forks and are benchmarked on serving throughput/latency. This mirrors the existing kernel submission flow but for full model inference serving.

  • New Language.Model type with ModelTaskData config (model name, tensor parallel, benchmark shapes, perplexity baseline) — a config sketch follows this list
  • run_model_benchmark() — staged pipeline: extract archive → install fork (fast overlay or pip) → start vLLM server → perplexity check → serving benchmark
  • GitHub Actions workflow (nvidia_model_workflow.yml) for B200 self-hosted runners
  • Modal runner with CUDA 12.8 base image, pre-installed vLLM wheel, persistent model weights volume
  • API support for 50MB binary archive uploads (tar.gz/zip)
  • score_ascending field for higher-is-better metrics (e.g., throughput)
  • Security: tar path traversal validation, metrics namespacing, perplexity success threshold
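
For orientation, here is a hypothetical sketch of the configuration surface this adds. Field names are inferred from the bullets above and from the example task in examples/llama_8b_serving/task.yml, so treat them as illustrative rather than the exact schema:

```python
# Hypothetical sketch of a model-task configuration; not the real task.yml schema.
model_task = {
    "lang": "model",
    "ranking_by": "custom",
    "config": {
        "model_name": "meta-llama/Llama-3.1-8B",
        "tensor_parallel": 1,                      # assumed value
        "benchmark_shapes": [
            {"num_prompts": 1000, "input_len": 512, "output_len": 128},
        ],
        "perplexity_baseline": 1.80,
        "perplexity_tolerance": 0.02,              # within 2% of baseline
        "ranking_metric": "request_throughput",    # req/s, higher is better
    },
    "score_ascending": False,
}
```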

E2E Testing Status

Full API → Modal → DB pipeline (H100) — FULLY WORKING

Tested the complete round-trip: HTTP submission → API server → Background Manager → Modal dispatch → H100 runner → FullResult → score computation → DB storage → leaderboard ranking.

| Phase | Result | Details |
| --- | --- | --- |
| API submission | PASS | POST /submission/llama_8b_serving-dev/H100/leaderboard returns 202 |
| Modal dispatch | PASS | run_model_benchmark_h100 found and invoked |
| Install | PASS | Fast overlay: copies user's .py files onto pre-installed vLLM (~instant) |
| Server startup | PASS | vLLM 0.15.1 starts in ~30s with pre-downloaded weights |
| Perplexity | PASS | 1.7785 measured vs 1.80 baseline (within 2% tolerance) |
| Benchmark | PASS | 42.30 req/s, 5,414 output tok/s, 1000/1000 requests successful |
| Score computation | PASS | request_throughput = 42.10 extracted via RankCriterion.CUSTOM |
| DB storage | PASS | 6 runs stored (test, benchmark, leaderboard × public+secret) |
| Job status | PASS | submission_job_status.status = 'succeeded' |
| User API | PASS | GET /user/submissions returns submission with score |
| Leaderboard | PASS | testuser ranked #1 on llama_8b_serving-dev with score 42.10 |

Total pipeline time: ~3 minutes (warm container with cached weights).
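
For reference, a minimal client-side sketch of the submission call exercised above. The endpoint path and 202 response come from the table; the multipart field name and any auth headers are assumptions, not the actual API contract:

```python
# Hedged sketch of submitting a vLLM fork archive; field name "file" is assumed.
import requests

with open("my_vllm_fork.tar.gz", "rb") as f:
    resp = requests.post(
        "https://<api-host>/submission/llama_8b_serving-dev/H100/leaderboard",
        files={"file": ("my_vllm_fork.tar.gz", f, "application/gzip")},
        timeout=60,
    )
assert resp.status_code == 202  # accepted; benchmark runs asynchronously
```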

Popcorn-CLI → SSE → Modal → DB (H100) — FULLY WORKING

Tested the CLI streaming flow: popcorn-cli submit → SSE endpoint → Background Manager → Modal dispatch → result callback → DB storage.

| Mode | Result | Details |
| --- | --- | --- |
| --mode test | PASS | Perplexity 1.7785 (within 2% of 1.80 baseline) |
| --mode leaderboard | PASS | Public: 41.71 req/s, Secret: 41.55 req/s |
| DB storage | PASS | 6 runs stored correctly for leaderboard mode |
| Leaderboard ranking | PASS | Best score 42.10 preserved from earlier submission |

CLI uses streaming SSE endpoint (POST /{leaderboard}/{gpu}/{mode}) with --no-tui for non-interactive use.

GitHub Actions route (B200) — FULLY WORKING

Tested manually on B200 self-hosted runner (l-bgx-01, 8x B200). All 4 phases pass:

| Phase | Result | Details |
| --- | --- | --- |
| Install | PASS | Fast overlay: 0.0s, 1 Python file copied onto base vLLM |
| Server startup | PASS | 119.4s (cold start, first run) |
| Perplexity | PASS | 1.7979 measured vs 1.80 baseline (within 2% tolerance) |
| Benchmark | PASS | 51.71 req/s, 1000 prompts |

Key fixes for B200 route:

  • Torch cu128 (vLLM pip wheel needs libcudart.so.12)
  • Keep vLLM installed (don't uninstall) — enables fast overlay path
  • Use sys.executable instead of "python3" for subprocesses (venv python)
  • CUDA_VISIBLE_DEVICES=4,5,6,7 (GPUs 0-3 occupied)
  • Pre-downloaded model weights at /models/meta-llama/Llama-3.1-8B

Bugs found and fixed during E2E

  • task.py — gpus keyword error: the gpus field from task.yml was passed to LeaderboardTask.__init__(), which doesn't accept it. Fixed by popping gpus before from_dict().
  • leaderboard_db.py — binary archive decode crash: get_submission_by_id() tried to UTF-8 decode binary tar.gz archives, causing UnicodeDecodeError. Fixed with errors="replace".
  • Modal workspace mismatch: .env tokens pointed to gpu-mode workspace but deploy went to msaroufim workspace via profile. API server must use matching tokens.
  • run_eval.py — subprocess used system python: _start_vllm_server() and benchmark used "python3" which resolved to /usr/bin/python3 (no vLLM). Fixed with sys.executable.
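
A minimal sketch of the sys.executable fix; the module path and arguments are illustrative, not the exact command used by _start_vllm_server():

```python
# Launch the vLLM server with the interpreter that actually has vLLM installed
# (the venv python), rather than whatever "python3" resolves to on PATH.
import subprocess
import sys

server_proc = subprocess.Popen(
    [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
     "--model", "meta-llama/Llama-3.1-8B", "--port", "8000"],
)
```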

Key implementation fixes (from earlier iterations)

  • Switched from CUDA 13.1 to CUDA 12.8 base image so the vLLM wheel works natively (no source build needed)
  • Use vllm bench serve CLI (python3 -m vllm.entrypoints.cli.main bench serve) instead of deprecated benchmark_serving.py
  • Use --backend openai with /v1/completions (not openai-chat) for base models like Llama-3.1-8B
  • Added GPU cleanup (pkill + torch.cuda.empty_cache()) before server start for container reuse — see the sketch after this list
  • Added overlay backup/restore safety: if a user's overlay breaks vLLM imports, original files are restored
  • Added HF secret to Modal functions for gated model access
  • Fixed download_model.py to save weights at /models/<org>/<model> matching _resolve_model_ref()
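
As referenced in the GPU-cleanup item above, a minimal sketch of the idea (not the exact code in run_eval.py):

```python
# Kill any leftover vLLM processes and release cached CUDA memory before
# starting a fresh server in a reused container.
import subprocess

subprocess.run(["pkill", "-f", "vllm"], check=False)  # ignore "no process found" exit code
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
except ImportError:
    pass
```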

How correctness is defined

Model submissions are validated through a two-phase gate defined in task.yml:

Phase 1: Perplexity check (correctness gate)

  • Runs 10 fixed prompts against the vLLM server's /v1/completions endpoint
  • Computes measured_ppl = exp(-total_log_prob / total_tokens)
  • Pass criteria: abs(measured - baseline) / baseline <= tolerance (within 2% of baseline 1.80)
  • If perplexity fails, submission is rejected — no benchmark runs
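
A minimal sketch of the gate described above, using the same formula and pass criterion:

```python
# Correctness gate: convert aggregated token log-probs into perplexity and
# compare against the calibrated baseline within the configured tolerance.
import math

def perplexity_gate(total_log_prob: float, total_tokens: int,
                    baseline: float = 1.80, tolerance: float = 0.02) -> tuple[bool, float]:
    measured_ppl = math.exp(-total_log_prob / total_tokens)
    passed = abs(measured_ppl - baseline) / baseline <= tolerance
    return passed, measured_ppl
```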

Phase 2: Serving benchmark (ranking metric)

  • Uses vllm bench serve with specified shapes (1000 prompts, 512 input len, 128 output len)
  • Extracts request_throughput (req/s) as the leaderboard score
  • Higher is better (score_ascending: false, ranking_by: custom)
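
A sketch of the benchmark invocation assembled from the details above; the flag names mirror the old benchmark_serving script and may differ slightly between vLLM versions:

```python
# Illustrative invocation of the serving benchmark (timeout value is arbitrary).
import subprocess
import sys

cmd = [
    sys.executable, "-m", "vllm.entrypoints.cli.main", "bench", "serve",
    "--backend", "openai",
    "--base-url", "http://localhost:8000",
    "--model", "meta-llama/Llama-3.1-8B",
    "--endpoint", "/v1/completions",
    "--num-prompts", "1000",
    "--random-input-len", "512",
    "--random-output-len", "128",
    "--save-result",
]
subprocess.run(cmd, check=True, timeout=3600)
```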

The perplexity baseline (1.80) was established by running unmodified vLLM against Llama-3.1-8B on H100.

Remaining work

Perplexity / determinism

  • Perplexity baseline stability — measured 1.7785 on H100, 1.7979 on B200. Need to understand:
    • Whether the baseline should be per-GPU-type (H100 vs B200 may produce different values)
    • Whether to set a CUDA/cuBLAS deterministic mode or seed for reproducibility
    • What tolerance is appropriate (currently 2%) — may need to be wider if inter-run variance is high
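
If a deterministic mode is adopted, one possible starting point is sketched below; whether this is sufficient (or even appropriate) for vLLM serving runs is exactly the open question above:

```python
# One possible seeding/determinism setup; not currently part of the pipeline.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Request deterministic cuBLAS/cuDNN kernels where available.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)
```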

Performance — GitHub Actions route

  • vLLM source build for cu130 — currently using cu128 pip wheel. For native CUDA 13 support, need to build from source once and cache the wheel.
  • Server cold start — 119s on first run. Subsequent runs should be faster with warm process caches.

Nice to have

  • sccache for CUDA compilation — Modal runner has this, GitHub runner doesn't yet
  • Smaller test model — use a non-gated model (e.g., facebook/opt-125m) for CI smoke tests

Test plan

  • Unit tests pass (test_backend.py, test_task.py)
  • GitHub Actions workflow dispatches and runs on B200 runner
  • Server log capture works (stderr visible in result on failure)
  • HF_TOKEN works — model weights download successfully
  • Perplexity check executes and passes on H100 (baseline 1.80, measured 1.7785)
  • Perplexity check executes and passes on B200 (baseline 1.80, measured 1.7979)
  • Modal runner deployment and test
  • Full passing E2E on Modal H100 (perplexity + benchmark, 42.30 req/s)
  • Full passing E2E on B200 (perplexity + benchmark, 51.71 req/s)
  • Fast overlay path works (Python-only submissions skip pip install)
  • Overlay safety: broken overlays detected and rolled back
  • GPU cleanup handles container reuse (no memory exhaustion)
  • Full API → Modal → DB round-trip (submission, score computation, leaderboard ranking)
  • Dev leaderboard creation from task.yml with gpus field
  • Binary archive storage and retrieval in DB
  • User submissions API and leaderboard ranking API
  • Popcorn-CLI submission flow (test + leaderboard modes via SSE endpoint)
  • sys.executable fix for venv subprocess resolution
  • Full GH Actions workflow dispatch → artifact → DB round-trip

Extend the platform to support model-level competitions where users submit
vLLM forks as tarballs. The system pip installs the fork, starts a vLLM
server, runs serving benchmarks, and checks perplexity against a baseline.

- Add Language.Model and RankCriterion.CUSTOM to support model tasks
- Add ModelTaskData with benchmark shapes, perplexity config, timeouts
- Add run_model_benchmark() with 5-phase pipeline (install, server, perplexity, benchmark, cleanup)
- Add score_ascending field for higher-is-better ranking (throughput vs time)
- Add tarball upload support (50MB limit) in API
- Add Modal image with vLLM deps, sccache, and model weights volume
- Add download_model.py for pre-populating model weights
- Add example task definition for Llama-3.1-8B serving
- Add reuse documentation listing unchanged components
Copilot AI review requested due to automatic review settings February 10, 2026 20:38

Copilot AI left a comment


Pull request overview

Adds end-to-end “model competition” support where users submit vLLM forks as archives that are installed and benchmarked via a new runner path, with leaderboard ranking able to support both lower-is-better and higher-is-better scores.

Changes:

  • Introduces Language.Model + ModelTaskData, plus run_model_benchmark() pipeline (install → serve → perplexity → benchmark → cleanup).
  • Adds score direction (score_ascending) wiring through task config, DB ranking queries, and API responses.
  • Extends submission handling to accept binary archives (50MB) and adds Modal infra (new image + volumes) and a weight pre-download script.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tests/test_task.py | Updates expected task config dicts to include score_ascending. |
| src/runners/modal_runner_archs.py | Registers Modal functions for model benchmarking on selected GPUs with volumes mounted. |
| src/runners/modal_runner.py | Adds dedicated model_image and Modal Volumes for model weights + sccache. |
| src/runners/download_model.py | Adds a Modal app to pre-download HF model weights into a shared volume. |
| src/libkernelbot/task.py | Adds ModelTaskData, extends LeaderboardTask to support model tasks + score_ascending. |
| src/libkernelbot/submission.py | Adds custom metric scoring, and threads score_ascending into competition/ranking display. |
| src/libkernelbot/run_eval.py | Routes lang=model to new run_model_benchmark() implementation. |
| src/libkernelbot/leaderboard_db.py | Stores bytes submissions and adds ranking direction support to leaderboard queries. |
| src/libkernelbot/launchers/modal.py | Dispatches Modal function name based on lang including model. |
| src/libkernelbot/consts.py | Adds Language.Model and RankCriterion.CUSTOM. |
| src/libkernelbot/backend.py | Base64-encodes model archives for transport and avoids .lower() on bytes. |
| src/kernelbot/api/main.py | Ensures /submissions endpoint uses correct score ordering for the given leaderboard. |
| src/kernelbot/api/api_utils.py | Accepts larger binary uploads for model tasks (50MB) and validates archive extension. |
| examples/llama_8b_serving/task.yml | Adds an example model task configuration (custom ranking metric + descending score). |
| docs/model-competitions-reuse.md | Documents which existing components are reused unchanged for model competitions. |
Comments suppressed due to low confidence (1)

src/runners/modal_runner.py:1

  • These pins look risky: I’m not aware of a torch==2.9.1 release or a cu130 wheel index in the standard PyTorch distribution scheme. If this is intentional for your environment, consider documenting/validating it; otherwise, pin to a known-available Torch/CUDA combo (or make it configurable) to avoid Modal image build failures.
import signal


Comment on lines 899 to 907
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"


Copilot AI Feb 10, 2026


tar.extractall() / ZipFile.extractall() are vulnerable to path traversal (e.g., ../../...) and can write outside extract_dir. Use a safe extraction routine that validates each member path stays within extract_dir (reject absolute paths and .. segments) before extracting.

Suggested change
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
    def _safe_extract_tar(tar: tarfile.TarFile, dest_dir: str) -> None:
        base_dir = os.path.abspath(dest_dir)
        for member in tar.getmembers():
            name = member.name
            # Disallow absolute paths
            if os.path.isabs(name):
                raise ValueError(f"Unsafe absolute path in tar archive: {name!r}")
            # Disallow parent directory traversal
            if ".." in Path(name).parts:
                raise ValueError(f"Unsafe relative path in tar archive: {name!r}")
            target_path = os.path.abspath(os.path.join(base_dir, name))
            # Ensure the target path is within dest_dir
            if os.path.commonpath([base_dir, target_path]) != base_dir:
                raise ValueError(f"Tar path escapes destination directory: {name!r}")
        tar.extractall(path=dest_dir)

    def _safe_extract_zip(zf: zipfile.ZipFile, dest_dir: str) -> None:
        base_dir = os.path.abspath(dest_dir)
        for name in zf.namelist():
            # Disallow absolute paths
            if os.path.isabs(name):
                raise ValueError(f"Unsafe absolute path in zip archive: {name!r}")
            # Disallow parent directory traversal
            if ".." in Path(name).parts:
                raise ValueError(f"Unsafe relative path in zip archive: {name!r}")
            target_path = os.path.abspath(os.path.join(base_dir, name))
            # Ensure the target path is within dest_dir
            if os.path.commonpath([base_dir, target_path]) != base_dir:
                raise ValueError(f"Zip path escapes destination directory: {name!r}")
        zf.extractall(path=dest_dir)

    try:
        if tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path, "r:*") as tar:
                _safe_extract_tar(tar, extract_dir)
        elif zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path, "r") as zf:
                _safe_extract_zip(zf, extract_dir)
        else:
            return False, "", "Submission archive is not a valid tar.gz or zip file"
    except ValueError as e:
        return False, "", f"Submission archive contains unsafe paths: {e}"

Comment on lines 886 to 923
    work_dir = tempfile.mkdtemp(prefix="model_submission_")
    archive_path = os.path.join(work_dir, "submission.tar.gz")

    with open(archive_path, "wb") as f:
        f.write(archive_bytes)

    # Extract
    import tarfile
    import zipfile

    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"

    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir

    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )

    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)

Copilot AI Feb 10, 2026


tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).

Suggested change
    work_dir = tempfile.mkdtemp(prefix="model_submission_")
    archive_path = os.path.join(work_dir, "submission.tar.gz")
    with open(archive_path, "wb") as f:
        f.write(archive_bytes)
    # Extract
    import tarfile
    import zipfile
    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir
    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )
    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
    with tempfile.TemporaryDirectory(prefix="model_submission_") as work_dir:
        archive_path = os.path.join(work_dir, "submission.tar.gz")
        with open(archive_path, "wb") as f:
            f.write(archive_bytes)
        # Extract
        import tarfile
        import zipfile
        extract_dir = os.path.join(work_dir, "src")
        os.makedirs(extract_dir, exist_ok=True)
        if tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path, "r:*") as tar:
                tar.extractall(path=extract_dir)
        elif zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path, "r") as zf:
                zf.extractall(path=extract_dir)
        else:
            return False, "", "Submission archive is not a valid tar.gz or zip file"
        # Find the actual package directory (may be nested one level)
        entries = os.listdir(extract_dir)
        if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
            pkg_dir = os.path.join(extract_dir, entries[0])
        else:
            pkg_dir = extract_dir
        # pip install
        result = subprocess.run(
            ["pip", "install", "-e", pkg_dir],
            capture_output=True,
            text=True,
            timeout=install_timeout,
        )
        return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)

Comment on lines +896 to +897
extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)

Copilot AI Feb 10, 2026


tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).

Comment on lines 943 to 944
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,

Copilot AI Feb 10, 2026


Starting the server with stdout=PIPE and stderr=PIPE without continuously draining them risks blocking the vLLM process once its output buffers fill, potentially hanging runs. Redirect to files/DEVNULL, merge streams, or spawn reader threads to drain and store logs safely.

Suggested change
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,

Comment on lines 979 to 1034
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]

        # Prefer the benchmark_serving script approach
        cmd = [
            "python3", "-m", "vllm.benchmarks.benchmark_serving",
            "--backend", "openai-chat",
            "--base-url", f"http://localhost:{port}",
            "--model", model_name,
            "--endpoint", "/v1/chat/completions",
            "--num-prompts", str(shape.get("num_prompts", 100)),
            "--random-input-len", str(shape.get("input_len", 512)),
            "--random-output-len", str(shape.get("output_len", 128)),
            "--save-result",
        ]

        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=benchmark_timeout,
        )

        if result.returncode != 0:
            all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
            continue

        # Parse the saved JSON result file
        # vLLM saves to a json file in current directory
        import glob
        json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
        if json_files:
            try:
                with open(json_files[0]) as f:
                    bench_result = json.load(f)
                for key in [
                    "request_throughput",
                    "output_throughput",
                    "mean_ttft_ms",
                    "median_ttft_ms",
                    "p99_ttft_ms",
                    "mean_tpot_ms",
                    "median_tpot_ms",
                    "p99_tpot_ms",
                    "mean_itl_ms",
                    "median_itl_ms",
                    "p99_itl_ms",
                ]:
                    if key in bench_result:
                        all_metrics[key] = bench_result[key]
                os.remove(json_files[0])
            except (json.JSONDecodeError, OSError):
                pass

        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)

Copilot AI Feb 10, 2026


Metrics are overwritten across shapes because all_metrics[key] is reused for every shape; only the last shape’s values will survive. Also, glob('*.json') in the current working directory can pick up unrelated files and is race-prone. Write results to a per-shape, known filepath (or run in a temp working directory) and namespace metrics per shape (e.g., shape_{i}_{key}) or return a list keyed by shape.

Suggested change
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]
        # Prefer the benchmark_serving script approach
        cmd = [
            "python3", "-m", "vllm.benchmarks.benchmark_serving",
            "--backend", "openai-chat",
            "--base-url", f"http://localhost:{port}",
            "--model", model_name,
            "--endpoint", "/v1/chat/completions",
            "--num-prompts", str(shape.get("num_prompts", 100)),
            "--random-input-len", str(shape.get("input_len", 512)),
            "--random-output-len", str(shape.get("output_len", 128)),
            "--save-result",
        ]
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=benchmark_timeout,
        )
        if result.returncode != 0:
            all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
            continue
        # Parse the saved JSON result file
        # vLLM saves to a json file in current directory
        import glob
        json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
        if json_files:
            try:
                with open(json_files[0]) as f:
                    bench_result = json.load(f)
                for key in [
                    "request_throughput",
                    "output_throughput",
                    "mean_ttft_ms",
                    "median_ttft_ms",
                    "p99_ttft_ms",
                    "mean_tpot_ms",
                    "median_tpot_ms",
                    "p99_tpot_ms",
                    "mean_itl_ms",
                    "median_itl_ms",
                    "p99_itl_ms",
                ]:
                    if key in bench_result:
                        all_metrics[key] = bench_result[key]
                os.remove(json_files[0])
            except (json.JSONDecodeError, OSError):
                pass
        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
        with tempfile.TemporaryDirectory() as tmpdir:
            cmd = [
                "python3", "-m", "vllm.entrypoints.openai.run_batch",
            ]
            # Prefer the benchmark_serving script approach
            cmd = [
                "python3", "-m", "vllm.benchmarks.benchmark_serving",
                "--backend", "openai-chat",
                "--base-url", f"http://localhost:{port}",
                "--model", model_name,
                "--endpoint", "/v1/chat/completions",
                "--num-prompts", str(shape.get("num_prompts", 100)),
                "--random-input-len", str(shape.get("input_len", 512)),
                "--random-output-len", str(shape.get("output_len", 128)),
                "--save-result",
            ]
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=benchmark_timeout,
                cwd=tmpdir,
            )
            if result.returncode != 0:
                all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
                all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
                continue
            # Parse the saved JSON result file
            # vLLM saves to a json file in the working directory
            import glob
            json_files = sorted(
                glob.glob(os.path.join(tmpdir, "*.json")),
                key=os.path.getmtime,
                reverse=True,
            )
            if json_files:
                try:
                    with open(json_files[0]) as f:
                        bench_result = json.load(f)
                    for key in [
                        "request_throughput",
                        "output_throughput",
                        "mean_ttft_ms",
                        "median_ttft_ms",
                        "p99_ttft_ms",
                        "mean_tpot_ms",
                        "median_tpot_ms",
                        "p99_tpot_ms",
                        "mean_itl_ms",
                        "median_itl_ms",
                        "p99_itl_ms",
                    ]:
                        if key in bench_result:
                            all_metrics[f"shape_{i}_{key}"] = bench_result[key]
                    os.remove(json_files[0])
                except (json.JSONDecodeError, OSError):
                    pass
            all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)

Comment on lines 979 to 982
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]


Copilot AI Feb 10, 2026


The initial cmd assignment to vllm.entrypoints.openai.run_batch is immediately overwritten and has no effect. Remove the dead assignment to reduce confusion and keep the benchmark invocation single-sourced.

Suggested change
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]

Comment on lines +1085 to +1087
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                data = json.loads(resp.read())

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.

Comment on lines 1094 to 1095
        except Exception:
            continue

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.


def compute_score(result: FullResult, task: LeaderboardTask, submission_id: int) -> float:
    if task.ranking_by == RankCriterion.CUSTOM:
        ranking_metric = task.config.ranking_metric

Copilot AI Feb 10, 2026


RankCriterion.CUSTOM implicitly assumes task.config has ranking_metric, but LeaderboardTask.config can also be CudaTaskData/PythonTaskData, which don’t define it. Enforce CUSTOM only for Language.Model (e.g., in LeaderboardTask.__post_init__) or store ranking_metric at the task level so this doesn’t depend on a specific config dataclass.

Suggested change
        ranking_metric = task.config.ranking_metric
        # Some task configurations (e.g., CudaTaskData/PythonTaskData) may not
        # define a `ranking_metric` attribute. Guard against that here so we
        # don't rely on a specific config dataclass shape.
        config = getattr(task, "config", None)
        if config is None or not hasattr(config, "ranking_metric"):
            raise KernelBotError(
                "RankCriterion.CUSTOM requires task.config to define a 'ranking_metric' "
                f"attribute; got config type '{type(config).__name__}' instead."
            )
        ranking_metric = getattr(config, "ranking_metric")

    return passed, measured_ppl


def run_model_benchmark(config: dict) -> FullResult:  # noqa: C901

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new run_model_benchmark() path (install, server startup/timeout handling, perplexity pass/fail, benchmark parsing, and cleanup) introduces substantial logic but isn’t covered by unit tests. Since the repo already has pytest coverage (e.g., tests/test_task.py), add focused tests that mock subprocess.run / subprocess.Popen and urllib.request.urlopen to deterministically validate success and failure modes.

- Fix path traversal vulnerability in tar/zip extraction (validate members)
- Fix metrics overwritten across shapes (namespace by shape index)
- Fix vLLM server stdout/stderr PIPE blocking (redirect to DEVNULL)
- Fix perplexity check silently swallowing errors (require >50% success)
- Remove dead cmd assignment in benchmark runner
- Add hasattr guard for CUSTOM ranking_metric in compute_score
- Remove docs/model-competitions-reuse.md
- Fix lang_name KeyError crash for model submissions in GitHub launcher
- Upload model archives as Git blobs to bypass workflow dispatch size limits
- Add nvidia_model_workflow.yml with 60-min timeout for model benchmarking
- Update github-runner.py to download blob archives before running
- Add model-specific timeout computation from model_config
- Add expected run name pattern for model workflow dispatch
- Block model competitions on AMD GPUs (NVIDIA only for now)
@github-actions

github-actions bot commented Feb 10, 2026

Coverage report


| File | Lines missing (new stmts) |
| --- | --- |
| src/libkernelbot/backend.py | 198 |
| src/libkernelbot/consts.py | — |
| src/libkernelbot/leaderboard_db.py | — |
| src/libkernelbot/submission.py | 178-190, 235 |
| src/libkernelbot/task.py | 83, 87, 101, 156, 194, 221 |
| src/libkernelbot/utils.py | — |
| Project Total | — |

This report was generated by python-coverage-comment-action

Isolates model benchmark dependencies in a venv instead of
polluting the runner's system Python. Falls back to pip if
uv is not available.
- Persistent venv at /opt/model-venv with torch + vLLM deps pre-cached
  (mirrors Modal model_image pattern: install vllm for deps, uninstall)
- Set SETUPTOOLS_SCM_PRETEND_VERSION for tarball submissions without .git
- Pin Python 3.10 in venv, add sccache for CUDA compilation caching
Drop /opt persistent venv (permission issues on containerized runners).
Bootstrap fresh venv each run with torch + vllm deps. Optimize later.
- Only use --download-dir /models if the path exists (Modal volume).
  On GitHub runners, fall back to HF cache default.
- Capture server stdout/stderr to a log file instead of DEVNULL.
- Include server log in result on startup failure for debugging.
Calibrated from actual B200 E2E test run with stock vLLM.
…afety

- Switch model_image from CUDA 13.1 to CUDA 12.8 base so the vLLM
  wheel (compiled for CUDA 12) works natively without compat libraries
  or source builds. CUDA 12.8 supports H100 (SM 9.0) and B200 (SM 10.0).

- Use vllm bench serve CLI (python3 -m vllm.entrypoints.cli.main bench serve)
  instead of the deprecated benchmarks/benchmark_serving.py script.
  Use --backend openai with /v1/completions for base models.

- Add fast overlay path: for Python-only submissions, copy .py files
  directly onto the pre-installed vLLM package instead of doing a full
  pip install from source. Includes backup/restore safety to detect and
  recover if an overlay breaks vLLM imports.

- Add GPU cleanup (pkill + torch.cuda.empty_cache) before server start
  to handle Modal container reuse where previous vLLM processes left
  GPU memory allocated.

- Add HF secret to model benchmark functions for gated model access.

- Fix download_model.py to save weights at /models/<org>/<model> path
  matching what _resolve_model_ref() expects.

Tested E2E on Modal H100: perplexity 1.7785 (pass), benchmark 34.54 req/s
with 100/100 successful requests.
- Pop gpus from raw dict before LeaderboardTask.from_dict() to prevent
  unexpected keyword argument error when creating dev leaderboards
- Handle binary model archives in get_submission_by_id() with
  errors="replace" to prevent UnicodeDecodeError on tar.gz data
- Add .claude/skills/model-competition-testing.md with full E2E testing
  instructions, correctness criteria, and troubleshooting
…guide

Adds Step 5b covering the streaming SSE endpoint flow via popcorn-cli,
including config backup, build, submit with --no-tui, and config restore.
- Switch GH workflow torch from cu130 to cu128 (vLLM pip wheel needs
  libcudart.so.12)
- Keep vLLM installed instead of uninstalling — enables fast overlay
  path for Python-only submissions (~instant vs ~20 min)
- Use sys.executable instead of "python3" for vLLM server and benchmark
  subprocesses so they use the venv Python
- Add CUDA_VISIBLE_DEVICES=4,5,6,7 to workflow (GPUs 0-3 occupied)
- Add B200 machine entry to remote-gpu-testing skill
- Add GH Actions B200 section to model-competition-testing skill