
Conversation


@msaroufim msaroufim commented Feb 10, 2026

Summary

End-to-end support for model competitions where users submit vLLM forks and are benchmarked on serving throughput/latency. This mirrors the existing kernel submission flow but for full model inference serving.

  • New Language.Model type with ModelTaskData config (model name, tensor parallel, benchmark shapes, perplexity baseline) — a config sketch follows this list
  • run_model_benchmark() — staged pipeline: extract archive → install fork (fast overlay or pip) → start vLLM server → perplexity check → serving benchmark
  • GitHub Actions workflow (nvidia_model_workflow.yml) for B200 self-hosted runners
  • Modal runner with CUDA 12.8 base image, pre-installed vLLM wheel, persistent model weights volume
  • API support for 50MB binary archive uploads (tar.gz/zip)
  • score_ascending field for higher-is-better metrics (e.g., throughput)
  • Security: tar path traversal validation, metrics namespacing, perplexity success threshold
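
For orientation, here is a hypothetical sketch of the configuration surface this adds. Field names are inferred from the bullets above and from the example task in examples/llama_8b_serving/task.yml, so treat them as illustrative rather than the exact schema:

```python
# Hypothetical sketch of a model-task configuration; not the real task.yml schema.
model_task = {
    "lang": "model",
    "ranking_by": "custom",
    "config": {
        "model_name": "meta-llama/Llama-3.1-8B",
        "tensor_parallel": 1,                      # assumed value
        "benchmark_shapes": [
            {"num_prompts": 1000, "input_len": 512, "output_len": 128},
        ],
        "perplexity_baseline": 1.80,
        "perplexity_tolerance": 0.02,              # within 2% of baseline
        "ranking_metric": "request_throughput",    # req/s, higher is better
    },
    "score_ascending": False,
}
```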

E2E Testing Status

Full API → Modal → DB pipeline (H100) — FULLY WORKING

Tested the complete round-trip: HTTP submission → API server → Background Manager → Modal dispatch → H100 runner → FullResult → score computation → DB storage → leaderboard ranking.

| Phase | Result | Details |
| --- | --- | --- |
| API submission | PASS | POST /submission/llama_8b_serving-dev/H100/leaderboard returns 202 |
| Modal dispatch | PASS | run_model_benchmark_h100 found and invoked |
| Install | PASS | Fast overlay: copies user's .py files onto pre-installed vLLM (~instant) |
| Server startup | PASS | vLLM 0.15.1 starts in ~30s with pre-downloaded weights |
| Perplexity | PASS | 1.7785 measured vs 1.80 baseline (within 2% tolerance) |
| Benchmark | PASS | 42.30 req/s, 5,414 output tok/s, 1000/1000 requests successful |
| Score computation | PASS | request_throughput = 42.10 extracted via RankCriterion.CUSTOM |
| DB storage | PASS | 6 runs stored (test, benchmark, leaderboard × public+secret) |
| Job status | PASS | submission_job_status.status = 'succeeded' |
| User API | PASS | GET /user/submissions returns submission with score |
| Leaderboard | PASS | testuser ranked #1 on llama_8b_serving-dev with score 42.10 |

Total pipeline time: ~3 minutes (warm container with cached weights).
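
For reference, a minimal client-side sketch of the submission call exercised above. The endpoint path and 202 response come from the table; the multipart field name and any auth headers are assumptions, not the actual API contract:

```python
# Hedged sketch of submitting a vLLM fork archive; field name "file" is assumed.
import requests

with open("my_vllm_fork.tar.gz", "rb") as f:
    resp = requests.post(
        "https://<api-host>/submission/llama_8b_serving-dev/H100/leaderboard",
        files={"file": ("my_vllm_fork.tar.gz", f, "application/gzip")},
        timeout=60,
    )
assert resp.status_code == 202  # accepted; benchmark runs asynchronously
```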

Popcorn-CLI → SSE → Modal → DB (H100) — FULLY WORKING

Tested the CLI streaming flow: popcorn-cli submit → SSE endpoint → Background Manager → Modal dispatch → result callback → DB storage.

| Mode | Result | Details |
| --- | --- | --- |
| --mode test | PASS | Perplexity 1.7785 (within 2% of 1.80 baseline) |
| --mode leaderboard | PASS | Public: 41.71 req/s, Secret: 41.55 req/s |
| DB storage | PASS | 6 runs stored correctly for leaderboard mode |
| Leaderboard ranking | PASS | Best score 42.10 preserved from earlier submission |

CLI uses streaming SSE endpoint (POST /{leaderboard}/{gpu}/{mode}) with --no-tui for non-interactive use.

GitHub Actions route (B200) — FULLY WORKING

Tested manually on B200 self-hosted runner (l-bgx-01, 8x B200). All 4 phases pass:

| Phase | Result | Details |
| --- | --- | --- |
| Install | PASS | Fast overlay: 0.0s, 1 Python file copied onto base vLLM |
| Server startup | PASS | 119.4s (cold start, first run) |
| Perplexity | PASS | 1.7979 measured vs 1.80 baseline (within 2% tolerance) |
| Benchmark | PASS | 51.71 req/s, 1000 prompts |

Key fixes for B200 route:

  • Torch cu128 (vLLM pip wheel needs libcudart.so.12)
  • Keep vLLM installed (don't uninstall) — enables fast overlay path
  • Use sys.executable instead of "python3" for subprocesses (venv python)
  • CUDA_VISIBLE_DEVICES=4,5,6,7 (GPUs 0-3 occupied)
  • Pre-downloaded model weights at /models/meta-llama/Llama-3.1-8B

Bugs found and fixed during E2E

  • task.py — gpus keyword error: the gpus field from task.yml was passed to LeaderboardTask.__init__(), which doesn't accept it. Fixed by popping gpus before from_dict().
  • leaderboard_db.py — binary archive decode crash: get_submission_by_id() tried to UTF-8 decode binary tar.gz archives, causing UnicodeDecodeError. Fixed with errors="replace".
  • Modal workspace mismatch: .env tokens pointed to gpu-mode workspace but deploy went to msaroufim workspace via profile. API server must use matching tokens.
  • run_eval.py — subprocess used system python: _start_vllm_server() and benchmark used "python3" which resolved to /usr/bin/python3 (no vLLM). Fixed with sys.executable.
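
A minimal sketch of the sys.executable fix; the module path and arguments are illustrative, not the exact command used by _start_vllm_server():

```python
# Launch the vLLM server with the interpreter that actually has vLLM installed
# (the venv python), rather than whatever "python3" resolves to on PATH.
import subprocess
import sys

server_proc = subprocess.Popen(
    [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
     "--model", "meta-llama/Llama-3.1-8B", "--port", "8000"],
)
```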

Key implementation fixes (from earlier iterations)

  • Switched from CUDA 13.1 to CUDA 12.8 base image so the vLLM wheel works natively (no source build needed)
  • Use vllm bench serve CLI (python3 -m vllm.entrypoints.cli.main bench serve) instead of deprecated benchmark_serving.py
  • Use --backend openai with /v1/completions (not openai-chat) for base models like Llama-3.1-8B
  • Added GPU cleanup (pkill + torch.cuda.empty_cache()) before server start for container reuse — see the sketch after this list
  • Added overlay backup/restore safety: if a user's overlay breaks vLLM imports, original files are restored
  • Added HF secret to Modal functions for gated model access
  • Fixed download_model.py to save weights at /models/<org>/<model> matching _resolve_model_ref()
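
As referenced in the GPU-cleanup item above, a minimal sketch of the idea (not the exact code in run_eval.py):

```python
# Kill any leftover vLLM processes and release cached CUDA memory before
# starting a fresh server in a reused container.
import subprocess

subprocess.run(["pkill", "-f", "vllm"], check=False)  # ignore "no process found" exit code
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
except ImportError:
    pass
```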

How correctness is defined

Model submissions are validated through a two-phase gate defined in task.yml:

Phase 1: Perplexity check (correctness gate)

  • Runs 10 fixed prompts against the vLLM server's /v1/completions endpoint
  • Computes measured_ppl = exp(-total_log_prob / total_tokens)
  • Pass criteria: abs(measured - baseline) / baseline <= tolerance (within 2% of baseline 1.80)
  • If perplexity fails, submission is rejected — no benchmark runs
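
A minimal sketch of the gate described above, using the same formula and pass criterion:

```python
# Correctness gate: convert aggregated token log-probs into perplexity and
# compare against the calibrated baseline within the configured tolerance.
import math

def perplexity_gate(total_log_prob: float, total_tokens: int,
                    baseline: float = 1.80, tolerance: float = 0.02) -> tuple[bool, float]:
    measured_ppl = math.exp(-total_log_prob / total_tokens)
    passed = abs(measured_ppl - baseline) / baseline <= tolerance
    return passed, measured_ppl
```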

Phase 2: Serving benchmark (ranking metric)

  • Uses vllm bench serve with specified shapes (1000 prompts, 512 input len, 128 output len)
  • Extracts request_throughput (req/s) as the leaderboard score
  • Higher is better (score_ascending: false, ranking_by: custom)
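
A sketch of the benchmark invocation assembled from the details above; the flag names mirror the old benchmark_serving script and may differ slightly between vLLM versions:

```python
# Illustrative invocation of the serving benchmark (timeout value is arbitrary).
import subprocess
import sys

cmd = [
    sys.executable, "-m", "vllm.entrypoints.cli.main", "bench", "serve",
    "--backend", "openai",
    "--base-url", "http://localhost:8000",
    "--model", "meta-llama/Llama-3.1-8B",
    "--endpoint", "/v1/completions",
    "--num-prompts", "1000",
    "--random-input-len", "512",
    "--random-output-len", "128",
    "--save-result",
]
subprocess.run(cmd, check=True, timeout=3600)
```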

The perplexity baseline (1.80) was established by running unmodified vLLM against Llama-3.1-8B on H100.

Remaining work

Perplexity / determinism

  • Perplexity baseline stability — measured 1.7785 on H100, 1.7979 on B200. Need to understand:
    • Whether the baseline should be per-GPU-type (H100 vs B200 may produce different values)
    • Whether to set a CUDA/cuBLAS deterministic mode or seed for reproducibility
    • What tolerance is appropriate (currently 2%) — may need to be wider if inter-run variance is high
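
If a deterministic mode is adopted, one possible starting point is sketched below; whether this is sufficient (or even appropriate) for vLLM serving runs is exactly the open question above:

```python
# One possible seeding/determinism setup; not currently part of the pipeline.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Request deterministic cuBLAS/cuDNN kernels where available.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)
```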

Performance — GitHub Actions route

  • vLLM source build for cu130 — currently using cu128 pip wheel. For native CUDA 13 support, need to build from source once and cache the wheel.
  • Server cold start — 119s on first run. Subsequent runs should be faster with warm process caches.

Nice to have

  • sccache for CUDA compilation — Modal runner has this, GitHub runner doesn't yet
  • Smaller test model — use a non-gated model (e.g., facebook/opt-125m) for CI smoke tests

Test plan

  • Unit tests pass (test_backend.py, test_task.py)
  • GitHub Actions workflow dispatches and runs on B200 runner
  • Server log capture works (stderr visible in result on failure)
  • HF_TOKEN works — model weights download successfully
  • Perplexity check executes and passes on H100 (baseline 1.80, measured 1.7785)
  • Perplexity check executes and passes on B200 (baseline 1.80, measured 1.7979)
  • Modal runner deployment and test
  • Full passing E2E on Modal H100 (perplexity + benchmark, 42.30 req/s)
  • Full passing E2E on B200 (perplexity + benchmark, 51.71 req/s)
  • Fast overlay path works (Python-only submissions skip pip install)
  • Overlay safety: broken overlays detected and rolled back
  • GPU cleanup handles container reuse (no memory exhaustion)
  • Full API → Modal → DB round-trip (submission, score computation, leaderboard ranking)
  • Dev leaderboard creation from task.yml with gpus field
  • Binary archive storage and retrieval in DB
  • User submissions API and leaderboard ranking API
  • Popcorn-CLI submission flow (test + leaderboard modes via SSE endpoint)
  • sys.executable fix for venv subprocess resolution
  • Full GH Actions workflow dispatch → artifact → DB round-trip

Extend the platform to support model-level competitions where users submit
vLLM forks as tarballs. The system pip installs the fork, starts a vLLM
server, runs serving benchmarks, and checks perplexity against a baseline.

- Add Language.Model and RankCriterion.CUSTOM to support model tasks
- Add ModelTaskData with benchmark shapes, perplexity config, timeouts
- Add run_model_benchmark() with 5-phase pipeline (install, server, perplexity, benchmark, cleanup)
- Add score_ascending field for higher-is-better ranking (throughput vs time)
- Add tarball upload support (50MB limit) in API
- Add Modal image with vLLM deps, sccache, and model weights volume
- Add download_model.py for pre-populating model weights
- Add example task definition for Llama-3.1-8B serving
- Add reuse documentation listing unchanged components
Copilot AI review requested due to automatic review settings February 10, 2026 20:38

Copilot AI left a comment


Pull request overview

Adds end-to-end “model competition” support where users submit vLLM forks as archives that are installed and benchmarked via a new runner path, with leaderboard ranking able to support both lower-is-better and higher-is-better scores.

Changes:

  • Introduces Language.Model + ModelTaskData, plus run_model_benchmark() pipeline (install → serve → perplexity → benchmark → cleanup).
  • Adds score direction (score_ascending) wiring through task config, DB ranking queries, and API responses.
  • Extends submission handling to accept binary archives (50MB) and adds Modal infra (new image + volumes) and a weight pre-download script.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tests/test_task.py | Updates expected task config dicts to include score_ascending. |
| src/runners/modal_runner_archs.py | Registers Modal functions for model benchmarking on selected GPUs with volumes mounted. |
| src/runners/modal_runner.py | Adds dedicated model_image and Modal Volumes for model weights + sccache. |
| src/runners/download_model.py | Adds a Modal app to pre-download HF model weights into a shared volume. |
| src/libkernelbot/task.py | Adds ModelTaskData, extends LeaderboardTask to support model tasks + score_ascending. |
| src/libkernelbot/submission.py | Adds custom metric scoring, and threads score_ascending into competition/ranking display. |
| src/libkernelbot/run_eval.py | Routes lang=model to new run_model_benchmark() implementation. |
| src/libkernelbot/leaderboard_db.py | Stores bytes submissions and adds ranking direction support to leaderboard queries. |
| src/libkernelbot/launchers/modal.py | Dispatches Modal function name based on lang including model. |
| src/libkernelbot/consts.py | Adds Language.Model and RankCriterion.CUSTOM. |
| src/libkernelbot/backend.py | Base64-encodes model archives for transport and avoids .lower() on bytes. |
| src/kernelbot/api/main.py | Ensures /submissions endpoint uses correct score ordering for the given leaderboard. |
| src/kernelbot/api/api_utils.py | Accepts larger binary uploads for model tasks (50MB) and validates archive extension. |
| examples/llama_8b_serving/task.yml | Adds an example model task configuration (custom ranking metric + descending score). |
| docs/model-competitions-reuse.md | Documents which existing components are reused unchanged for model competitions. |
Comments suppressed due to low confidence (1)

src/runners/modal_runner.py:1

  • These pins look risky: I’m not aware of a torch==2.9.1 release or a cu130 wheel index in the standard PyTorch distribution scheme. If this is intentional for your environment, consider documenting/validating it; otherwise, pin to a known-available Torch/CUDA combo (or make it configurable) to avoid Modal image build failures.
import signal


Comment on lines 899 to 907
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"


Copilot AI Feb 10, 2026


tar.extractall() / ZipFile.extractall() are vulnerable to path traversal (e.g., ../../...) and can write outside extract_dir. Use a safe extraction routine that validates each member path stays within extract_dir (reject absolute paths and .. segments) before extracting.

Suggested change
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
    def _safe_extract_tar(tar: tarfile.TarFile, dest_dir: str) -> None:
        base_dir = os.path.abspath(dest_dir)
        for member in tar.getmembers():
            name = member.name
            # Disallow absolute paths
            if os.path.isabs(name):
                raise ValueError(f"Unsafe absolute path in tar archive: {name!r}")
            # Disallow parent directory traversal
            if ".." in Path(name).parts:
                raise ValueError(f"Unsafe relative path in tar archive: {name!r}")
            target_path = os.path.abspath(os.path.join(base_dir, name))
            # Ensure the target path is within dest_dir
            if os.path.commonpath([base_dir, target_path]) != base_dir:
                raise ValueError(f"Tar path escapes destination directory: {name!r}")
        tar.extractall(path=dest_dir)

    def _safe_extract_zip(zf: zipfile.ZipFile, dest_dir: str) -> None:
        base_dir = os.path.abspath(dest_dir)
        for name in zf.namelist():
            # Disallow absolute paths
            if os.path.isabs(name):
                raise ValueError(f"Unsafe absolute path in zip archive: {name!r}")
            # Disallow parent directory traversal
            if ".." in Path(name).parts:
                raise ValueError(f"Unsafe relative path in zip archive: {name!r}")
            target_path = os.path.abspath(os.path.join(base_dir, name))
            # Ensure the target path is within dest_dir
            if os.path.commonpath([base_dir, target_path]) != base_dir:
                raise ValueError(f"Zip path escapes destination directory: {name!r}")
        zf.extractall(path=dest_dir)

    try:
        if tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path, "r:*") as tar:
                _safe_extract_tar(tar, extract_dir)
        elif zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path, "r") as zf:
                _safe_extract_zip(zf, extract_dir)
        else:
            return False, "", "Submission archive is not a valid tar.gz or zip file"
    except ValueError as e:
        return False, "", f"Submission archive contains unsafe paths: {e}"

Comment on lines 886 to 923
    work_dir = tempfile.mkdtemp(prefix="model_submission_")
    archive_path = os.path.join(work_dir, "submission.tar.gz")

    with open(archive_path, "wb") as f:
        f.write(archive_bytes)

    # Extract
    import tarfile
    import zipfile

    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"

    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir

    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )

    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)

Copilot AI Feb 10, 2026


tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).

Suggested change
    work_dir = tempfile.mkdtemp(prefix="model_submission_")
    archive_path = os.path.join(work_dir, "submission.tar.gz")
    with open(archive_path, "wb") as f:
        f.write(archive_bytes)
    # Extract
    import tarfile
    import zipfile
    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir
    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )
    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
    with tempfile.TemporaryDirectory(prefix="model_submission_") as work_dir:
        archive_path = os.path.join(work_dir, "submission.tar.gz")
        with open(archive_path, "wb") as f:
            f.write(archive_bytes)
        # Extract
        import tarfile
        import zipfile
        extract_dir = os.path.join(work_dir, "src")
        os.makedirs(extract_dir, exist_ok=True)
        if tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path, "r:*") as tar:
                tar.extractall(path=extract_dir)
        elif zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path, "r") as zf:
                zf.extractall(path=extract_dir)
        else:
            return False, "", "Submission archive is not a valid tar.gz or zip file"
        # Find the actual package directory (may be nested one level)
        entries = os.listdir(extract_dir)
        if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
            pkg_dir = os.path.join(extract_dir, entries[0])
        else:
            pkg_dir = extract_dir
        # pip install
        result = subprocess.run(
            ["pip", "install", "-e", pkg_dir],
            capture_output=True,
            text=True,
            timeout=install_timeout,
        )
        return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)

Comment on lines +896 to +897
extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)

Copilot AI Feb 10, 2026


tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).

Comment on lines 943 to 944
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,

Copilot AI Feb 10, 2026


Starting the server with stdout=PIPE and stderr=PIPE without continuously draining them risks blocking the vLLM process once its output buffers fill, potentially hanging runs. Redirect to files/DEVNULL, merge streams, or spawn reader threads to drain and store logs safely.

Suggested change
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,

Comment on lines 979 to 1034
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]

        # Prefer the benchmark_serving script approach
        cmd = [
            "python3", "-m", "vllm.benchmarks.benchmark_serving",
            "--backend", "openai-chat",
            "--base-url", f"http://localhost:{port}",
            "--model", model_name,
            "--endpoint", "/v1/chat/completions",
            "--num-prompts", str(shape.get("num_prompts", 100)),
            "--random-input-len", str(shape.get("input_len", 512)),
            "--random-output-len", str(shape.get("output_len", 128)),
            "--save-result",
        ]

        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=benchmark_timeout,
        )

        if result.returncode != 0:
            all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
            continue

        # Parse the saved JSON result file
        # vLLM saves to a json file in current directory
        import glob
        json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
        if json_files:
            try:
                with open(json_files[0]) as f:
                    bench_result = json.load(f)
                for key in [
                    "request_throughput",
                    "output_throughput",
                    "mean_ttft_ms",
                    "median_ttft_ms",
                    "p99_ttft_ms",
                    "mean_tpot_ms",
                    "median_tpot_ms",
                    "p99_tpot_ms",
                    "mean_itl_ms",
                    "median_itl_ms",
                    "p99_itl_ms",
                ]:
                    if key in bench_result:
                        all_metrics[key] = bench_result[key]
                os.remove(json_files[0])
            except (json.JSONDecodeError, OSError):
                pass

        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)

Copilot AI Feb 10, 2026


Metrics are overwritten across shapes because all_metrics[key] is reused for every shape; only the last shape’s values will survive. Also, glob('*.json') in the current working directory can pick up unrelated files and is race-prone. Write results to a per-shape, known filepath (or run in a temp working directory) and namespace metrics per shape (e.g., shape_{i}_{key}) or return a list keyed by shape.

Suggested change
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]
        # Prefer the benchmark_serving script approach
        cmd = [
            "python3", "-m", "vllm.benchmarks.benchmark_serving",
            "--backend", "openai-chat",
            "--base-url", f"http://localhost:{port}",
            "--model", model_name,
            "--endpoint", "/v1/chat/completions",
            "--num-prompts", str(shape.get("num_prompts", 100)),
            "--random-input-len", str(shape.get("input_len", 512)),
            "--random-output-len", str(shape.get("output_len", 128)),
            "--save-result",
        ]
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=benchmark_timeout,
        )
        if result.returncode != 0:
            all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
            continue
        # Parse the saved JSON result file
        # vLLM saves to a json file in current directory
        import glob
        json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
        if json_files:
            try:
                with open(json_files[0]) as f:
                    bench_result = json.load(f)
                for key in [
                    "request_throughput",
                    "output_throughput",
                    "mean_ttft_ms",
                    "median_ttft_ms",
                    "p99_ttft_ms",
                    "mean_tpot_ms",
                    "median_tpot_ms",
                    "p99_tpot_ms",
                    "mean_itl_ms",
                    "median_itl_ms",
                    "p99_itl_ms",
                ]:
                    if key in bench_result:
                        all_metrics[key] = bench_result[key]
                os.remove(json_files[0])
            except (json.JSONDecodeError, OSError):
                pass
        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
        with tempfile.TemporaryDirectory() as tmpdir:
            cmd = [
                "python3", "-m", "vllm.entrypoints.openai.run_batch",
            ]
            # Prefer the benchmark_serving script approach
            cmd = [
                "python3", "-m", "vllm.benchmarks.benchmark_serving",
                "--backend", "openai-chat",
                "--base-url", f"http://localhost:{port}",
                "--model", model_name,
                "--endpoint", "/v1/chat/completions",
                "--num-prompts", str(shape.get("num_prompts", 100)),
                "--random-input-len", str(shape.get("input_len", 512)),
                "--random-output-len", str(shape.get("output_len", 128)),
                "--save-result",
            ]
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=benchmark_timeout,
                cwd=tmpdir,
            )
            if result.returncode != 0:
                all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
                all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
                continue
            # Parse the saved JSON result file
            # vLLM saves to a json file in the working directory
            import glob
            json_files = sorted(
                glob.glob(os.path.join(tmpdir, "*.json")),
                key=os.path.getmtime,
                reverse=True,
            )
            if json_files:
                try:
                    with open(json_files[0]) as f:
                        bench_result = json.load(f)
                    for key in [
                        "request_throughput",
                        "output_throughput",
                        "mean_ttft_ms",
                        "median_ttft_ms",
                        "p99_ttft_ms",
                        "mean_tpot_ms",
                        "median_tpot_ms",
                        "p99_tpot_ms",
                        "mean_itl_ms",
                        "median_itl_ms",
                        "p99_itl_ms",
                    ]:
                        if key in bench_result:
                            all_metrics[f"shape_{i}_{key}"] = bench_result[key]
                    os.remove(json_files[0])
                except (json.JSONDecodeError, OSError):
                    pass
            all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)

Comment on lines 979 to 982
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]


Copilot AI Feb 10, 2026


The initial cmd assignment to vllm.entrypoints.openai.run_batch is immediately overwritten and has no effect. Remove the dead assignment to reduce confusion and keep the benchmark invocation single-sourced.

Suggested change
        cmd = [
            "python3", "-m", "vllm.entrypoints.openai.run_batch",
        ]

Comment on lines +1085 to +1087
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                data = json.loads(resp.read())

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.

Comment on lines 1094 to 1095
        except Exception:
            continue

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.


def compute_score(result: FullResult, task: LeaderboardTask, submission_id: int) -> float:
    if task.ranking_by == RankCriterion.CUSTOM:
        ranking_metric = task.config.ranking_metric

Copilot AI Feb 10, 2026


RankCriterion.CUSTOM implicitly assumes task.config has ranking_metric, but LeaderboardTask.config can also be CudaTaskData/PythonTaskData, which don’t define it. Enforce CUSTOM only for Language.Model (e.g., in LeaderboardTask.__post_init__) or store ranking_metric at the task level so this doesn’t depend on a specific config dataclass.

Suggested change
        ranking_metric = task.config.ranking_metric
        # Some task configurations (e.g., CudaTaskData/PythonTaskData) may not
        # define a `ranking_metric` attribute. Guard against that here so we
        # don't rely on a specific config dataclass shape.
        config = getattr(task, "config", None)
        if config is None or not hasattr(config, "ranking_metric"):
            raise KernelBotError(
                "RankCriterion.CUSTOM requires task.config to define a 'ranking_metric' "
                f"attribute; got config type '{type(config).__name__}' instead."
            )
        ranking_metric = getattr(config, "ranking_metric")

    return passed, measured_ppl


def run_model_benchmark(config: dict) -> FullResult:  # noqa: C901

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new run_model_benchmark() path (install, server startup/timeout handling, perplexity pass/fail, benchmark parsing, and cleanup) introduces substantial logic but isn’t covered by unit tests. Since the repo already has pytest coverage (e.g., tests/test_task.py), add focused tests that mock subprocess.run / subprocess.Popen and urllib.request.urlopen to deterministically validate success and failure modes.

- Fix path traversal vulnerability in tar/zip extraction (validate members)
- Fix metrics overwritten across shapes (namespace by shape index)
- Fix vLLM server stdout/stderr PIPE blocking (redirect to DEVNULL)
- Fix perplexity check silently swallowing errors (require >50% success)
- Remove dead cmd assignment in benchmark runner
- Add hasattr guard for CUSTOM ranking_metric in compute_score
- Remove docs/model-competitions-reuse.md
- Fix lang_name KeyError crash for model submissions in GitHub launcher
- Upload model archives as Git blobs to bypass workflow dispatch size limits
- Add nvidia_model_workflow.yml with 60-min timeout for model benchmarking
- Update github-runner.py to download blob archives before running
- Add model-specific timeout computation from model_config
- Add expected run name pattern for model workflow dispatch
- Block model competitions on AMD GPUs (NVIDIA only for now)
@github-actions

github-actions bot commented Feb 10, 2026

Coverage report


| File | Lines missing (new stmts) |
| --- | --- |
| src/libkernelbot/backend.py | 198 |
| src/libkernelbot/consts.py | — |
| src/libkernelbot/leaderboard_db.py | — |
| src/libkernelbot/submission.py | 178-190, 235 |
| src/libkernelbot/task.py | 83, 87, 101, 156, 194, 221 |
| src/libkernelbot/utils.py | — |
| Project Total | — |

This report was generated by python-coverage-comment-action

Isolates model benchmark dependencies in a venv instead of
polluting the runner's system Python. Falls back to pip if
uv is not available.
- Persistent venv at /opt/model-venv with torch + vLLM deps pre-cached
  (mirrors Modal model_image pattern: install vllm for deps, uninstall)
- Set SETUPTOOLS_SCM_PRETEND_VERSION for tarball submissions without .git
- Pin Python 3.10 in venv, add sccache for CUDA compilation caching
Drop /opt persistent venv (permission issues on containerized runners).
Bootstrap fresh venv each run with torch + vllm deps. Optimize later.
- Only use --download-dir /models if the path exists (Modal volume).
  On GitHub runners, fall back to HF cache default.
- Capture server stdout/stderr to a log file instead of DEVNULL.
- Include server log in result on startup failure for debugging.
Calibrated from actual B200 E2E test run with stock vLLM.
…afety

- Switch model_image from CUDA 13.1 to CUDA 12.8 base so the vLLM
  wheel (compiled for CUDA 12) works natively without compat libraries
  or source builds. CUDA 12.8 supports H100 (SM 9.0) and B200 (SM 10.0).

- Use vllm bench serve CLI (python3 -m vllm.entrypoints.cli.main bench serve)
  instead of the deprecated benchmarks/benchmark_serving.py script.
  Use --backend openai with /v1/completions for base models.

- Add fast overlay path: for Python-only submissions, copy .py files
  directly onto the pre-installed vLLM package instead of doing a full
  pip install from source. Includes backup/restore safety to detect and
  recover if an overlay breaks vLLM imports.

- Add GPU cleanup (pkill + torch.cuda.empty_cache) before server start
  to handle Modal container reuse where previous vLLM processes left
  GPU memory allocated.

- Add HF secret to model benchmark functions for gated model access.

- Fix download_model.py to save weights at /models/<org>/<model> path
  matching what _resolve_model_ref() expects.

Tested E2E on Modal H100: perplexity 1.7785 (pass), benchmark 34.54 req/s
with 100/100 successful requests.
- Pop gpus from raw dict before LeaderboardTask.from_dict() to prevent
  unexpected keyword argument error when creating dev leaderboards
- Handle binary model archives in get_submission_by_id() with
  errors="replace" to prevent UnicodeDecodeError on tar.gz data
- Add .claude/skills/model-competition-testing.md with full E2E testing
  instructions, correctness criteria, and troubleshooting
…guide

Adds Step 5b covering the streaming SSE endpoint flow via popcorn-cli,
including config backup, build, submit with --no-tui, and config restore.
- Switch GH workflow torch from cu130 to cu128 (vLLM pip wheel needs
  libcudart.so.12)
- Keep vLLM installed instead of uninstalling — enables fast overlay
  path for Python-only submissions (~instant vs ~20 min)
- Use sys.executable instead of "python3" for vLLM server and benchmark
  subprocesses so they use the venv Python
- Add CUDA_VISIBLE_DEVICES=4,5,6,7 to workflow (GPUs 0-3 occupied)
- Add B200 machine entry to remote-gpu-testing skill
- Add GH Actions B200 section to model-competition-testing skill