Commit 0bcb540

Add Buildkite backend integration and pre-built image support
Backend integration:
- Register BuildkiteLauncher in create_backend() when BUILDKITE_API_TOKEN is set
- Add BUILDKITE_API_TOKEN, BUILDKITE_ORG, BUILDKITE_PIPELINE env vars
- Results now flow to database same as GitHub/Modal

Pre-built Docker image for fast cold starts:
- Add Dockerfile with all dependencies pre-installed
- Add build-image.sh script for local image building
- Add pipeline-fast.yml for using pre-built image (~5s vs ~40s cold start)
- Update setup-node-simple.sh with BUILD_IMAGE=true option

Update skills doc with operational model for both approaches
1 parent 29f1239 commit 0bcb540

7 files changed

+200
-37
lines changed

SKILLS/buildkite.md

Lines changed: 39 additions & 13 deletions
@@ -554,16 +554,15 @@ BUILDKITE_API_TOKEN=xxx uv run python scripts/submit_buildkite_job.py --eval ide
554554

555555
## Operational Model
556556

557-
### No Pre-Built Docker Image (Current Setup)
557+
### Option 1: No Pre-Built Image (Current Default)
558558

559-
The pipeline does **NOT** use a pre-built Docker image. Each job:
559+
The pipeline installs dependencies at runtime. Each job:
560560

561561
1. Uses base `nvidia/cuda:12.4.0-devel-ubuntu22.04` image
562-
2. Installs dependencies at runtime:
562+
2. Installs dependencies at runtime (~30-40 seconds):
563563
- `uv` for Python package management
564-
- Clones kernelbot repo from `buildkite-infrastructure` branch
565-
- Runs `uv sync` to install project dependencies
566-
- Runs `uv pip install torch triton numpy` for GPU packages
564+
- Clones kernelbot repo
565+
- Runs `uv sync` and `uv pip install torch triton numpy`
567566
3. Runs the evaluation
568567

569568
**Advantages:**
@@ -573,17 +572,44 @@ The pipeline does **NOT** use a pre-built Docker image. Each job:
573572
- **No admin action needed** after code updates
574573

575574
**Trade-off:**
576-
- Slightly longer job startup time (~30-40 seconds for dependency installation)
575+
- Slower cold starts (~40 seconds)
577576

578-
### When Admin Action Is Needed
577+
### Option 2: Pre-Built Image (Fast Cold Starts)
578+
579+
For faster cold starts (~5 seconds), build the Docker image on each node:
579580

580-
The only time the machine admin needs to run anything is:
581+
```bash
582+
# During initial setup:
583+
sudo BUILDKITE_AGENT_TOKEN=xxx GPU_TYPE=test BUILD_IMAGE=true ./deployment/buildkite/setup-node-simple.sh
581584

582-
1. **Initial setup**: Run `setup-node-simple.sh` once when onboarding a new node
583-
2. **Buildkite agent updates**: If Buildkite releases a new agent version (rare)
584-
3. **System-level changes**: NVIDIA driver updates, OS updates, etc.
585+
# Or build separately:
586+
./deployment/buildkite/build-image.sh
587+
```
588+
589+
Then update the Buildkite pipeline config to use the local image:
590+
```yaml
591+
image: "kernelbot:latest"
592+
```
593+
594+
**When to rebuild the image:**
595+
- When dependencies change (new PyTorch version, new packages)
596+
- When you want the latest kernelbot code baked in
597+
- NOT needed for problem/task changes (those come via config)
598+
599+
**Rebuild command:**
600+
```bash
601+
./deployment/buildkite/build-image.sh
602+
```
603+
604+
### When Admin Action Is Needed
585605

586-
Code changes to kernelbot require **no admin action** - the pipeline clones fresh code each run.
606+
| Scenario | Action Required |
607+
|----------|-----------------|
608+
| Code changes (no deps) | None - pipeline clones fresh code |
609+
| Dependency changes | Rebuild image: `./build-image.sh` |
610+
| Initial node setup | Run `setup-node-simple.sh` once |
611+
| NVIDIA driver updates | May need to rebuild image |
612+
| Buildkite agent updates | Rare - Buildkite handles this |
587613

588614
### Shared Evaluation Logic
589615

deployment/buildkite/Dockerfile

Lines changed: 23 additions & 23 deletions
@@ -1,42 +1,42 @@
11
# Kernelbot evaluation image
2+
# Pre-built with all dependencies for fast cold starts
23
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
34

45
ENV DEBIAN_FRONTEND=noninteractive
56
ENV PYTHONUNBUFFERED=1
67

78
# System packages
89
RUN apt-get update && apt-get install -y --no-install-recommends \
9-
python3.11 \
10-
python3.11-dev \
11-
python3.11-venv \
12-
python3-pip \
13-
git \
14-
wget \
1510
curl \
11+
ca-certificates \
12+
git \
1613
build-essential \
1714
ninja-build \
1815
cmake \
1916
&& rm -rf /var/lib/apt/lists/*
2017

21-
# Set Python 3.11 as default
22-
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
23-
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
18+
# Install uv
19+
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
20+
ENV PATH="/root/.local/bin:$PATH"
21+
22+
# Clone and install kernelbot
23+
WORKDIR /opt/kernelbot
24+
RUN git clone --depth 1 --branch buildkite-infrastructure https://github.com/gpu-mode/kernelbot.git .
25+
26+
# Install dependencies with uv
27+
RUN uv sync
2428

25-
# Upgrade pip
26-
RUN python -m pip install --no-cache-dir --upgrade pip setuptools wheel
29+
# Install PyTorch and GPU packages
30+
RUN uv pip install torch triton numpy --index-url https://download.pytorch.org/whl/cu124
2731

28-
# PyTorch + CUDA
29-
RUN pip install --no-cache-dir \
30-
torch==2.4.0 \
31-
triton \
32-
numpy \
33-
scipy
32+
# Ensure venv is activated for any commands
33+
ENV VIRTUAL_ENV=/opt/kernelbot/.venv
34+
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
3435

35-
# Copy kernelbot
36-
WORKDIR /app
37-
COPY pyproject.toml .
38-
COPY src/ src/
39-
RUN pip install --no-cache-dir -e .
36+
# Verify installation
37+
RUN python -c "import torch; print(f'PyTorch {torch.__version__}')" && \
38+
python -c "import triton; print(f'Triton installed')" && \
39+
python -c "from libkernelbot.run_eval import run_config; print('kernelbot installed')"
4040

4141
# Default command
42-
CMD ["python", "/app/src/runners/buildkite-runner.py"]
42+
CMD ["python", "/opt/kernelbot/src/runners/buildkite-runner.py"]
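
Reviewer note: the verification step above imports each package during the build. A lighter check that only locates the modules, without importing them, could look like the sketch below. It is a minimal illustration under stated assumptions: `missing_modules` and `REQUIRED` are hypothetical names, not part of the commit, and the module names are taken from the `python -c` lines in the Dockerfile.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be located on the import path."""
    # find_spec() only locates a top-level module; it does not execute it.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules that the Dockerfile's verification step imports.
REQUIRED = ["torch", "triton", "libkernelbot"]

if __name__ == "__main__":
    print("missing:", missing_modules(REQUIRED))
```

Locating a module is weaker than importing it, so this would complement the `RUN python -c ...` check rather than replace it.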

deployment/buildkite/build-image.sh

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
#!/bin/bash
2+
# Build the kernelbot Docker image locally on a GPU node
3+
# Usage: ./build-image.sh [--push]
4+
5+
set -euo pipefail
6+
7+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
8+
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
9+
10+
IMAGE_NAME="${KERNELBOT_IMAGE:-kernelbot:latest}"
11+
BRANCH="${KERNELBOT_BRANCH:-buildkite-infrastructure}"
12+
13+
echo "=== Building Kernelbot Image ==="
14+
echo "Image: $IMAGE_NAME"
15+
echo "Branch: $BRANCH"
16+
echo ""
17+
18+
# Update Dockerfile to use correct branch
19+
sed -i "s|--branch [a-zA-Z0-9_-]*|--branch $BRANCH|g" "$SCRIPT_DIR/Dockerfile" 2>/dev/null || \
20+
sed -i '' "s|--branch [a-zA-Z0-9_-]*|--branch $BRANCH|g" "$SCRIPT_DIR/Dockerfile"
21+
22+
echo "Building image..."
23+
docker build -t "$IMAGE_NAME" -f "$SCRIPT_DIR/Dockerfile" "$REPO_ROOT"
24+
25+
echo ""
26+
echo "=== Build Complete ==="
27+
echo "Image: $IMAGE_NAME"
28+
docker images "$IMAGE_NAME"
29+
30+
# Optional: push to registry
31+
if [[ "${1:-}" == "--push" ]]; then
32+
REGISTRY="${KERNELBOT_REGISTRY:-ghcr.io/gpu-mode}"
33+
REMOTE_IMAGE="$REGISTRY/kernelbot:latest"
34+
echo ""
35+
echo "Pushing to $REMOTE_IMAGE..."
36+
docker tag "$IMAGE_NAME" "$REMOTE_IMAGE"
37+
docker push "$REMOTE_IMAGE"
38+
echo "Pushed: $REMOTE_IMAGE"
39+
fi
40+
41+
echo ""
42+
echo "To use this image, update your pipeline config:"
43+
echo " image: \"$IMAGE_NAME\""
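
Reviewer note: the `sed` call in `build-image.sh` rewrites the `--branch` argument of the Dockerfile in place. The sketch below models that substitution in Python to show what the pattern matches. It is an illustration only: `retarget_branch` is a hypothetical helper, and Python's `re` stands in for `sed`, although the character class behaves the same way in both.

```python
import re

# Same pattern that build-image.sh passes to sed.
BRANCH_PATTERN = re.compile(r"--branch [a-zA-Z0-9_-]*")


def retarget_branch(dockerfile_text: str, branch: str) -> str:
    """Replace every `--branch <name>` occurrence with the requested branch."""
    # A function replacement keeps re.sub from interpreting backslashes in `branch`.
    return BRANCH_PATTERN.sub(lambda _match: f"--branch {branch}", dockerfile_text)


clone_line = (
    "RUN git clone --depth 1 --branch buildkite-infrastructure "
    "https://github.com/gpu-mode/kernelbot.git ."
)
retargeted = retarget_branch(clone_line, "main")
print(retargeted)

# '/' and '.' are outside the character class, so a name such as "feature/x"
# is matched only up to the slash and a second rewrite leaves a tail behind.
twice = retarget_branch(retarget_branch(clone_line, "feature/x"), "main")
print(twice)
```

Because the script edits the tracked Dockerfile in place, building with a non-default `KERNELBOT_BRANCH` leaves the working tree modified. Branch names containing `/` or `.` would probably need a wider character class.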

pipeline-fast.yml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
# Kernelbot Fast Evaluation Pipeline
2+
# Uses pre-built image for fast cold starts (~5s vs ~40s)
3+
#
4+
# Prerequisites:
5+
# 1. Build image on node: ./deployment/buildkite/build-image.sh
6+
# 2. Or pull from registry: docker pull ghcr.io/gpu-mode/kernelbot:latest
7+
8+
steps:
9+
- label: ":rocket: Kernel Evaluation"
10+
agents:
11+
queue: "${KERNELBOT_QUEUE:-test}"
12+
13+
plugins:
14+
- docker#v5.11.0:
15+
image: "${KERNELBOT_IMAGE:-kernelbot:latest}"
16+
always-pull: false
17+
gpus: "all"
18+
propagate-environment: true
19+
shell: ["/bin/bash", "-e", "-c"]
20+
environment:
21+
- NVIDIA_VISIBLE_DEVICES
22+
- CUDA_VISIBLE_DEVICES
23+
- KERNELBOT_PAYLOAD
24+
- KERNELBOT_RUN_ID
25+
cpus: "${KERNELBOT_CPUS:-8}"
26+
memory: "${KERNELBOT_MEMORY:-64g}"
27+
workdir: /workdir
28+
29+
command: |
30+
set -e
31+
32+
echo "=== Environment ==="
33+
echo "NVIDIA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES"
34+
echo "KERNELBOT_RUN_ID=$KERNELBOT_RUN_ID"
35+
nvidia-smi -L
36+
37+
echo ""
38+
echo "=== Running Evaluation ==="
39+
cd /opt/kernelbot
40+
python src/runners/buildkite-runner.py
41+
42+
echo ""
43+
echo "=== Copying Artifacts ==="
44+
cp result.json /workdir/result.json
45+
cp -r profile_data /workdir/profile_data 2>/dev/null || true
46+
47+
echo "=== Done ==="
48+
49+
artifact_paths:
50+
- "result.json"
51+
- "profile_data/*"
52+
53+
timeout_in_minutes: 15
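
Reviewer note: the pipeline uses `${NAME:-fallback}` expressions for the queue, image, CPU, and memory settings. The sketch below is a simplified model of how such an expression resolves. It assumes shell-style semantics (the fallback applies when the variable is unset or empty), and `expand_defaults` is a hypothetical helper, not Buildkite's implementation.

```python
import re

# Matches ${NAME:-fallback}; the fallback may contain anything except '}'.
_DEFAULT_EXPR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}")


def expand_defaults(text: str, environ: dict) -> str:
    """Expand ${NAME:-fallback} using environ[NAME] when it is set and non-empty."""

    def _substitute(match: re.Match) -> str:
        name, fallback = match.group(1), match.group(2)
        return environ.get(name) or fallback

    return _DEFAULT_EXPR.sub(_substitute, text)


image_field = 'image: "${KERNELBOT_IMAGE:-kernelbot:latest}"'

# No override: the locally built image name is used.
print(expand_defaults(image_field, {}))

# Override pointing at the registry image named in the pipeline's header comment.
print(expand_defaults(image_field, {"KERNELBOT_IMAGE": "ghcr.io/gpu-mode/kernelbot:latest"}))
```

With `always-pull: false`, the default `kernelbot:latest` must already exist on the node, which is why the prerequisites list a local build or a manual pull.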

deployment/buildkite/setup-node-simple.sh

Lines changed: 24 additions & 0 deletions
@@ -203,3 +203,27 @@ echo ' - label: "GPU Test"'
203203
echo ' command: "echo NVIDIA_VISIBLE_DEVICES=$$NVIDIA_VISIBLE_DEVICES && nvidia-smi -L"'
204204
echo ' agents:'
205205
echo " queue: \"${GPU_TYPE}\""
206+
207+
# === BUILD DOCKER IMAGE (optional) ===
208+
if [[ "${BUILD_IMAGE:-}" == "true" ]]; then
209+
echo ""
210+
echo "=== Building Docker Image ==="
211+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
212+
213+
if [[ -f "$SCRIPT_DIR/Dockerfile" ]]; then
214+
docker build -t kernelbot:latest -f "$SCRIPT_DIR/Dockerfile" "$SCRIPT_DIR/../.."
215+
echo "Docker image built: kernelbot:latest"
216+
echo ""
217+
echo "To use the fast pipeline, update Buildkite config to use:"
218+
echo " image: \"kernelbot:latest\""
219+
else
220+
echo "WARNING: Dockerfile not found at $SCRIPT_DIR/Dockerfile"
221+
echo "Clone the repo first: git clone https://github.com/gpu-mode/kernelbot.git"
222+
fi
223+
fi
224+
225+
echo ""
226+
echo "For faster cold starts, build the Docker image:"
227+
echo " BUILD_IMAGE=true $0"
228+
echo "Or manually:"
229+
echo " ./deployment/buildkite/build-image.sh"

src/kernelbot/env.py

Lines changed: 5 additions & 0 deletions
@@ -33,6 +33,11 @@
3333
env.GITHUB_WORKFLOW_BRANCH = os.getenv("GITHUB_WORKFLOW_BRANCH", get_github_branch_name())
3434
env.PROBLEMS_REPO = os.getenv("PROBLEMS_REPO")
3535

36+
# Buildkite-specific constants
37+
env.BUILDKITE_API_TOKEN = os.getenv("BUILDKITE_API_TOKEN")
38+
env.BUILDKITE_ORG = os.getenv("BUILDKITE_ORG", "gpu-mode")
39+
env.BUILDKITE_PIPELINE = os.getenv("BUILDKITE_PIPELINE", "kernelbot")
40+
3641
# Directory that will be used for local problem development.
3742
env.PROBLEM_DEV_DIR = os.getenv("PROBLEM_DEV_DIR", "examples")
3843

src/kernelbot/main.py

Lines changed: 13 additions & 1 deletion
@@ -16,7 +16,8 @@
1616
from libkernelbot import consts
1717
from libkernelbot.backend import KernelBackend
1818
from libkernelbot.background_submission_manager import BackgroundSubmissionManager
19-
from libkernelbot.launchers import GitHubLauncher, ModalLauncher
19+
from libkernelbot.launchers import BuildkiteLauncher, GitHubLauncher, ModalLauncher
20+
from libkernelbot.launchers.buildkite import BuildkiteConfig
2021
from libkernelbot.utils import setup_logging
2122

2223
logger = setup_logging(__name__)
@@ -29,6 +30,17 @@ def create_backend(debug_mode: bool = False) -> KernelBackend:
2930
backend.register_launcher(
3031
GitHubLauncher(env.GITHUB_REPO, env.GITHUB_TOKEN, env.GITHUB_WORKFLOW_BRANCH)
3132
)
33+
34+
# Register Buildkite launcher if API token is configured
35+
if env.BUILDKITE_API_TOKEN:
36+
buildkite_config = BuildkiteConfig(
37+
org_slug=env.BUILDKITE_ORG,
38+
pipeline_slug=env.BUILDKITE_PIPELINE,
39+
api_token=env.BUILDKITE_API_TOKEN,
40+
)
41+
backend.register_launcher(BuildkiteLauncher(buildkite_config))
42+
logger.info("Buildkite launcher registered")
43+
3244
return backend
3345

3446
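
Reviewer note: taken together, the `env.py` and `main.py` changes make the Buildkite launcher opt-in, keyed on `BUILDKITE_API_TOKEN`. The sketch below restates that logic in a self-contained form. `BuildkiteSettings` is a hypothetical stand-in for `BuildkiteConfig` (only the three field names shown in the diff are assumed), and `buildkite_settings_from_env` is an illustrative helper that is not in the commit.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BuildkiteSettings:
    """Hypothetical stand-in for BuildkiteConfig; field names follow the diff."""

    org_slug: str
    pipeline_slug: str
    api_token: str


def buildkite_settings_from_env(environ: Mapping[str, str]) -> Optional[BuildkiteSettings]:
    """Return settings when an API token is present, otherwise None."""
    token = environ.get("BUILDKITE_API_TOKEN")
    if not token:
        # Mirrors `if env.BUILDKITE_API_TOKEN:` so an empty token also disables it.
        return None
    return BuildkiteSettings(
        org_slug=environ.get("BUILDKITE_ORG", "gpu-mode"),
        pipeline_slug=environ.get("BUILDKITE_PIPELINE", "kernelbot"),
        api_token=token,
    )


if __name__ == "__main__":
    print(buildkite_settings_from_env({}))
    print(buildkite_settings_from_env({"BUILDKITE_API_TOKEN": "xxx"}))
```

Deployments that never set the token see no behaviour change. The org and pipeline defaults apply only once the token is present.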
