diff --git a/README.md b/README.md index b7fe4871..92a8ecd9 100644 --- a/README.md +++ b/README.md @@ -1,371 +1,93 @@ # KernelBot -[![nvidia-on-prem](https://github.com/gpu-mode/discord-cluster-manager/actions/workflows/nvidia-on-prem-health.yml/badge.svg)](https://github.com/gpu-mode/discord-cluster-manager/actions/workflows/nvidia-on-prem-health.yml) +Backend service for the GPU Mode kernel competition platform. -This is the code for the Discord bot we'll be using to queue jobs to a cluster of GPUs that our generous sponsors have provided. Our goal is to be able to queue kernels that can run end to end in seconds that way things feel interactive and social. +**For users:** Submit kernels via the [popcorn-cli](https://github.com/gpu-mode/popcorn-cli). -The key idea is that we're using Github Actions as a job scheduling engine and primarily making the Discord bot interact with the cluster via issuing Github Actions and and monitoring their status and while we're focused on having a nice user experience on discord.gg/gpumode, [we're happy to accept PRs](#local-development) that make it easier for other Discord communities to hook GPUs. 
- -## Table of Contents - -- [Supported Schedulers](#supported-schedulers) -- [Local Development](#local-development) - - [Clone Repository](#clone-repository) - - [Discord Bot](#discord-bot) - - [Database](#database) - - [Environment Variables](#environment-variables) - - [Verify Setup](#verify-setup) -- [Available Commands](#available-commands) -- [Using the Leaderboard](#using-the-leaderboard) - - [Creating a New Leaderboard](#creating-a-new-leaderboard) - - [Reference Code Requirements (Python)](#reference-code-requirements-python) - - [Reference Code Requirements (CUDA)](#reference-code-requirements-cuda) - - [Submitting to a Leaderboard](#submitting-to-a-leaderboard) - - [Other Available Leaderboard Commands](#other-available-leaderboard-commands) - - [GPU Kernel-Specific Commands](#gpu-kernel-specific-commands) -- [Testing the Discord Bot](#testing-the-discord-bot) -- [How to Add a New GPU to the Cluster](#how-to-add-a-new-gpu-to-the-cluster) -- [Acknowledgements](#acknowledgements) - -## Supported schedulers - -- GitHub Actions -- Modal -- Slurm (not implemented yet) +**For problem authors:** See [reference-kernels](https://github.com/gpu-mode/reference-kernels) for problem configuration and examples. ## Local Development -### Clone Repository - -> [!IMPORTANT] -> Do not fork this repository. Instead, directly clone this repository to your local machine. - -> [!IMPORTANT] -> Python 3.11 or higher is required. - -After, install the dependencies with `pip install -r requirements-dev.txt`. - -### Setup Discord Bot - -To run and develop the bot locally, you need to add it to your own "staging" server. Follow the steps [here](https://discordjs.guide/preparations/setting-up-a-bot-application.html#creating-your-bot) and [here](https://discordjs.guide/preparations/adding-your-bot-to-servers.html#bot-invite-links) to create a bot application and then add it to your staging server. 
- -Below is a visual walk-through of the steps linked above: - -- The bot needs the `Message Content Intent` and `Server Members Intent` permissions turned on. -
- Click here for visual. - DCS_bot_perms -
- -- The bot needs `applications.commands` and `bot` scopes. - -
- Click here for visual. - Screenshot 2024-11-24 at 12 34 09 PM -
- -- Finally, generate an invite link for the bot and enter it into any browser. +### Prerequisites -
- Click here for visual. - Screenshot 2024-11-24 at 12 44 08 PM -
+- Python 3.11+ +- PostgreSQL +- A Discord bot application (optional, for Discord integration - see [docs/discord.md](docs/discord.md)) -> [!NOTE] -> Bot permissions involving threads/mentions/messages should suffice, but you can naively give it `Administrator` since it's just a test bot in your own testing Discord server. - -### Database - -The leaderboard persists information in a Postgres database. To develop locally, set Postgres up on your machine. Then start a Postgres shell with `psql`, and create a database: - -``` -$ psql -U postgres -Password for user postgres: ******** -psql (16.6 (Ubuntu 16.6-1.pgdg22.04+1)) -Type "help" for help. - -postgres=# CREATE DATABASE clusterdev; -``` - -We are using [Yoyo Migrations](https://ollycope.com/software/yoyo/) to manage tables, indexes, etc. in our database. To create tables in your local database, apply the migrations in `src/discord-cluster-manager/migrations` with the following command line: +### Clone and Install +```bash +git clone https://github.com/gpu-mode/discord-cluster-manager.git +cd discord-cluster-manager +pip install -e . ``` -yoyo apply src/migrations \ - -d postgresql://user:password@localhost/clusterdev -``` - -
- Click here for a transcript of a yoyo apply session - - $ yoyo apply . -d postgresql://user:password@localhost/clusterdev - - [20241208_01_p3yuR-initial-leaderboard-schema] - Shall I apply this migration? [Ynvdaqjk?]: y - - Selected 1 migration: - [20241208_01_p3yuR-initial-leaderboard-schema] - Apply this migration to postgresql://user:password@localhost/clusterdev [Yn]: y - Save migration configuration to yoyo.ini? - This is saved in plain text and contains your database password. - - Answering 'y' means you do not have to specify the migration source or database connection for future runs [yn]: n - -
-Applying migrations to our staging and prod environments also happens using `yoyo apply`, just with a different database URL. +### Database Setup -To make changes to the structure of the database, create a new migration: +Create a local Postgres database and apply migrations: +```bash +psql -U postgres -c "CREATE DATABASE clusterdev;" +yoyo apply src/migrations -d postgresql://user:password@localhost/clusterdev ``` -yoyo new src/discord-cluster-manager/migrations -m "short_description" -``` - -...and then edit the generated file. Please do not edit existing migration files: the existing migration files form a sort of changelog that is supposed to be immutable, and so yoyo will refuse to reapply the changes. - -We are following an expand/migrate/contract pattern to allow database migrations without downtime. When you want to make a change to the structure of the database, first determine if it is expansive or contractive. - -- _Expansive changes_ are those that have no possibility of breaking a running application. Examples include: adding a new nullable column, adding a non-null column with a default value, adding an index, adding a table, etc. -- _Contractive changes_ are those that could break a running application. Examples include: dropping a table, dropping a column, adding a not null constraint to a column, adding a unique index, etc. -After an expansive phase, data gets migrated to the newly added elements. Code also begins using the newly added elements. This is the migration step. Finally, when all code is no longer using elements that are obsolete, these can be removed. (Or, if adding a unique or not null constraint, after checking that the data satisfies the constraint, then the constraint can be safely added.) - -Expand, migrate, and contract steps may all be written using yoyo. +See [docs/database.md](docs/database.md) for migration patterns and creating new migrations. 
### Environment Variables -Create a `.env` file with the following environment variables: - -- `DISCORD_DEBUG_TOKEN` : The token of the bot you want to run locally -- `DISCORD_TOKEN` : The token of the bot you want to run in production -- `DISCORD_DEBUG_CLUSTER_STAGING_ID` : The ID of the "staging" server you want to connect to -- `DISCORD_CLUSTER_STAGING_ID` : The ID of the "production" server you want to connect to -- `GITHUB_TOKEN` : A Github token with permissions to trigger workflows, for now only new branches from [discord-cluster-manager](https://github.com/gpu-mode/discord-cluster-manager) are tested, since the bot triggers workflows on your behalf -- `GITHUB_REPO` : The repository where the cluster manager is hosted. -- `GITHUB_WORKFLOW_BRANCH` : The branch to start the GitHub Actions jobs from when submitting a task. -- `DATABASE_URL` : The URL you use to connect to Postgres. -- `DISABLE_SSL` : (Optional) set if you want to disable SSL when connecting to Postgres. - -Below is where to find these environment variables: - -> [!NOTE] -> For now, you can naively set `DISCORD_DEBUG_TOKEN` and `DISCORD_DEBUG_CLUSTER_STAGING_ID` to the same values as `DISCORD_TOKEN` and `DISCORD_CLUSTER_STAGING_ID` respectively. - -- `DISCORD_DEBUG_TOKEN` or `DISCORD_TOKEN`: Found in your bot's page within the [Discord Developer Portal](https://discord.com/developers/applications/): - -
- Click here for visual. - Screenshot 2024-11-24 at 11 01 19 AM -
- -- `DISCORD_DEBUG_CLUSTER_STAGING_ID` or `DISCORD_CLUSTER_STAGING_ID`: Right-click your staging Discord server and select `Copy Server ID`: - -
- Click here for visual. - Screenshot 2024-11-24 at 10 58 27 AM -
- -- `GITHUB_TOKEN`: Found in Settings -> Developer Settings (or [here](https://github.com/settings/tokens?type=beta)). Create a new (preferably classic) personal access token with an expiration date to any day less than a year from the current date, and the scopes `repo` and `workflow`. - -
- Click here for visual. - Screenshot 2024-12-30 at 8 51 59 AM -
- -- `GITHUB_REPO`: This should be set to this repository, which is usually `gpu-mode/discord-cluster-manager`. - -- `GITHUB_WORKFLOW_BRANCH`: Usually `main` or the branch you are working from. - -- `DATABASE_URL`: This contains the connection details for your local database, and has the form `postgresql://user:password@localhost/clusterdev`. - -- `DISABLE_SSL`: Set to `1` when developing. - -### Verify Setup - -Install the kernel bot as editable using `pip install -e .` - -Run the following command to run the bot: - -``` -python src/kernelbot/main.py --debug -``` - -Then in your staging server, use the `/verifyruns` command to test basic functionalities of the bot and the `/verifydb` command to check database connectivity. - -> [!NOTE] -> To test functionality of the Modal runner, you also need to be authenticated with Modal. Modal provides free credits to get started. -> To test functionality of the GitHub runner, you may need direct access to this repo which you can ping us for. - -## Available Commands +Create a `.env` file: -TODO. This is currently a work in progress. 
+```bash +# Required +GITHUB_TOKEN= # GitHub PAT with repo and workflow scopes +GITHUB_REPO=gpu-mode/discord-cluster-manager +PROBLEMS_REPO=gpu-mode/reference-kernels +DATABASE_URL=postgresql://user:password@localhost/clusterdev -`/run modal ` which you can use to pick a specific gpu, right now defaults to T4 +# Optional - defaults shown +GITHUB_WORKFLOW_BRANCH=main +PROBLEM_DEV_DIR=examples +DISABLE_SSL=1 # Set for local development +GITHUB_TOKEN_BACKUP= # Fallback token for rate limiting +ADMIN_TOKEN= # Token for admin API endpoints -`/run github ` which picks one of two workflow files +# Discord bot (only needed if testing Discord integration) +# See docs/discord.md for setup instructions +DISCORD_TOKEN= +DISCORD_DEBUG_TOKEN= +DISCORD_CLUSTER_STAGING_ID= +DISCORD_DEBUG_CLUSTER_STAGING_ID= -`/resync` to clear all the commands and resync them - -`/ping` to check if the bot is online - -## Using the Leaderboard - -The main purpose of the Discord bot is to allow servers to host coding competitions through Discord. -The leaderboard was designed for evaluating GPU kernels, but can be adapted easily for other -competitions. The rest of this section will mostly refer to leaderboard submissions in the context -of our GPU Kernel competition. - -> [!NOTE] -> All leaderboard commands have the prefix `/leaderboard`, and center around creating, submitting to, -> and viewing leaderboard statistics and information. - -### Creating a new Leaderboard - -``` -/leaderboard create {name: str} {deadline: str} {reference_code: .cu or .py file} -``` - -The above command creates a leaderboard named `name` that ends at `deadline`. The `reference_code` -has strict function signature requirements, and is required to contain an input generator and a -reference implementation for the desired GPU kernel. We import these functions in our evaluation -scripts for verifying leaderboard submissions and measuring runtime. 
In the next mini-section, we -discuss the exact requirements for the `reference_code` script. - -Each leaderboard `name` can also specify the types of hardware that users can run their kernels on. -For example, a softmax kernel on an RTX 4090 can have different performance characteristics on an -H100. After running the leaderboard creation command, a prompt will pop up where the creator can -specify the available GPUs that the leaderboard evaluates on. - -![Leaderboard GPU](assets/img/lb_gpu.png) - -#### Reference Code Requirements (Python) - -The Discord bot internally contains an `eval.py` script that handles the correctness and timing -analysis for the leaderboard. The `reference_code` that the leaderboard creator submits must have -the following function signatures with their implementations filled out. `InputType` and -`OutputType` are generics that could be a `torch.Tensor`, `List[torch.Tensor]`, etc. -depending on the reference code specifications. We leave this flexibility to the leaderboard creator. - -```python -# Reference kernel implementation. -def ref_kernel(input: InputType) -> OutputType: - # Implement me... - -# Generate a list of tensors as input to the kernel -def generate_input() -> InputType: - # Implement me... - -# Verify correctness of reference and output -def check_implementation(custom_out: OutputType, reference_out: OutputType) -> bool: - # Implement me... -``` - -#### Reference Code Requirements (CUDA) - -The Discord bot internally contains an `eval.cu` script that handles the correctness and timing -analysis for the leaderboard. The difficult of CUDA evaluation scripts is we need to explicitly -handle the typing system for tensors. The `reference.cu` that the leaderboard creator submits must have -the following function signatures with their implementations filled out: - -The main difference is we now need to define an alias for the type that the input / outputs are. 
A -simple and common example is a list of FP32 tensors, which can be defined using a pre-defined array of -`const int`s called `N_SIZES`, then define an array of containers, e.g. -`std::array, N_SIZES>`. - -```cuda -// User-defined type for inputs, e.g. using input_t = std::array, IN_SIZES>; -using input_t = ...; - -// User-defined type for outputs, e.g. using output_t = std::array, OUT_SIZES>; -using output_t = ...; - -// Generate random data of type input_t -input_t generate_input() { - // Implement me... -} - - -// Reference kernel host code. -output_t reference(input_t data) { - // Implement me... -} - - -// Verify correctness of reference and output -bool check_implementation(output_t out, output_t ref) { - // Implement me... -} -``` - -### Submitting to a Leaderboard - -``` -/leaderboard submit {github / modal} {leaderboard_name: str} {script: .cu or .py file} -``` - -The leaderboard submission for _Python code_ requires the following function signatures, where -`InputType` and `OutputType` are generics that could be a `torch.Tensor`, `List[torch.Tensor]`, -etc. depending on the reference code specifications. - -```python -# User kernel implementation. -def custom_kernel(input: InputType) -> OutputType: - # Implement me... -``` - -### Other Available Leaderboard Commands - -Deleting a leaderboard: - -``` -/leaderboard delete {name: str} +# CLI OAuth (only needed if running CLI against local instance) +CLI_DISCORD_CLIENT_ID= +CLI_DISCORD_CLIENT_SECRET= +CLI_TOKEN_URL= +CLI_GITHUB_CLIENT_ID= +CLI_GITHUB_CLIENT_SECRET= ``` -List all active leaderboards and which GPUs they can run on: +### Run the Bot +```bash +python src/kernelbot/main.py --debug ``` -/leaderboard list -``` - -List all leaderboard scores (runtime) for a particular leaderboard. (currently deprecated. Doesn't -support multiple GPU types yet) - -``` -/leaderboard show {name: str} -``` - -Display all personal scores (runtime) from a specific leaderboard. 
- -``` -/leaderboard show-personal {name: str} -``` - -### Submitting via a CLI -Moving forward we also allow submissions without logging in to Discord via a CLI tool we wrote in Rust https://github.com/gpu-mode/popcorn-cli +Use `/verifyruns` to test GitHub Actions integration and `/verifydb` to check database connectivity. -#### GPU Kernel-specific Commands +## Adding GPUs to the Cluster -We plan to add support for the PyTorch profiler and CUDA NSight Compute CLI to allow users to -profile their kernels. These commands are not specific to the leaderboard, but may be helpful for -leaderboard submissions. - -## How to add a new GPU to the cluster - -If you'd like to donate a GPU to our efforts, we can make you a CI admin in Github and have you add an org level runner https://github.com/organizations/gpu-mode/settings/actions/runners +To donate a GPU, contact us to become a CI admin and add an org-level runner at https://github.com/organizations/gpu-mode/settings/actions/runners ## Acknowledgements -- Thank you to AMD for sponsoring an MI250 node -- Thank you to NVIDIA for sponsoring an H100 node -- Thank you to Nebius for sponsoring credits and an H100 node -- Thank you Modal for credits and speedy spartup times -- Luca Antiga did something very similar for the NeurIPS LLM efficiency competition, it was great! -- Midjourney was a similar inspiration in terms of UX +- Modal for credits +- AMD for sponsoring an MI250 node +- NVIDIA for sponsoring an H100 node +- Nebius for credits and an H100 node ## Citation -If you used our software please cite it as -``` +```bibtex @inproceedings{ kernelbot2025, title={KernelBot: A Competition Platform for Writing Heterogeneous {GPU} Code}, diff --git a/docs/database.md b/docs/database.md new file mode 100644 index 00000000..c68b5803 --- /dev/null +++ b/docs/database.md @@ -0,0 +1,78 @@ +# Database Setup + +The leaderboard persists information in a Postgres database. 
+
+## Local Development
+
+Set up Postgres on your machine, then create a database:
+
+```bash
+psql -U postgres -c "CREATE DATABASE clusterdev;"
+```
+
+## Migrations
+
+We use [Yoyo Migrations](https://ollycope.com/software/yoyo/) to manage tables, indexes, etc.
+
+### Applying Migrations
+
+```bash
+yoyo apply src/migrations -d postgresql://user:password@localhost/clusterdev
+```
+
+
+Example yoyo apply session
+
+```
+$ yoyo apply . -d postgresql://user:password@localhost/clusterdev
+
+[20241208_01_p3yuR-initial-leaderboard-schema]
+Shall I apply this migration? [Ynvdaqjk?]: y
+
+Selected 1 migration:
+ [20241208_01_p3yuR-initial-leaderboard-schema]
+Apply this migration to postgresql://user:password@localhost/clusterdev [Yn]: y
+Save migration configuration to yoyo.ini?
+This is saved in plain text and contains your database password.
+
+Answering 'y' means you do not have to specify the migration source or database connection for future runs [yn]: n
+```
+
+
+Staging and prod environments use `yoyo apply` with a different database URL.
+
+### Creating New Migrations
+
+```bash
+yoyo new src/migrations -m "short_description"
+```
+
+Edit the generated file. Do not edit existing migration files - they form an immutable changelog, and yoyo will refuse to reapply changes.
+
+## Expand/Migrate/Contract Pattern
+
+We follow an expand/migrate/contract pattern to allow database migrations without downtime.
+
+### Expansive Changes
+
+Changes that cannot break a running application:
+- Adding a new nullable column
+- Adding a non-null column with a default value
+- Adding an index
+- Adding a table
+
+### Contractive Changes
+
+Changes that could break a running application:
+- Dropping a table or column
+- Adding a NOT NULL constraint
+- Adding a unique index
+
+### Workflow
+
+1. **Expand**: Add new elements (nullable columns, new tables, etc.)
+2. **Migrate**: Move data to new elements; update code to use them
+3. **Contract**: Remove obsolete elements (or add constraints after verifying data satisfies them)
+
+All steps can be written using yoyo migrations.
diff --git a/docs/discord.md b/docs/discord.md
new file mode 100644
index 00000000..b6a6fd32
--- /dev/null
+++ b/docs/discord.md
@@ -0,0 +1,71 @@
+# Discord Bot Setup
+
+To run and develop the bot locally, you need to create a Discord bot application and add it to your own "staging" server.
+
+## Create a Bot Application
+
+Follow the Discord.js guides:
+1. [Creating your bot](https://discordjs.guide/preparations/setting-up-a-bot-application.html#creating-your-bot)
+2. [Adding your bot to servers](https://discordjs.guide/preparations/adding-your-bot-to-servers.html#bot-invite-links)
+
+## Required Permissions
+
+The bot needs the `Message Content Intent` and `Server Members Intent` permissions turned on.
+
+
+Click for visual +DCS_bot_perms +
+
+## Required Scopes
+
+The bot needs `applications.commands` and `bot` scopes.
+
+
+Click for visual +Screenshot 2024-11-24 at 12 34 09 PM +
+
+## Generate Invite Link
+
+Generate an invite link for the bot and enter it into any browser.
+
+
+Click for visual +Screenshot 2024-11-24 at 12 44 08 PM +
+
+> [!NOTE]
+> Bot permissions involving threads/mentions/messages should suffice, but you can give it `Administrator` since it's just a test bot in your own testing Discord server.
+
+## Environment Variables
+
+Add these to your `.env` file:
+
+```bash
+DISCORD_TOKEN= # Bot token (production)
+DISCORD_DEBUG_TOKEN= # Bot token (local development)
+DISCORD_CLUSTER_STAGING_ID= # Server ID (production)
+DISCORD_DEBUG_CLUSTER_STAGING_ID= # Server ID (local development)
+```
+
+> [!NOTE]
+> For local development, you can set the DEBUG variants to the same values as the production ones.
+
+### Finding Your Bot Token
+
+Found in your bot's page within the [Discord Developer Portal](https://discord.com/developers/applications):
+
+
+Click for visual +Screenshot 2024-11-24 at 11 01 19 AM +
+
+### Finding Your Server ID
+
+Right-click your staging Discord server and select `Copy Server ID`:
+
+
+Click for visual +Screenshot 2024-11-24 at 10 58 27 AM +
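
As a companion to the `.env` template this diff adds to README.md, here is a minimal sketch of a pre-flight check that reports which of the four variables listed under `# Required` are missing or empty. It is not part of the repository: the names `parse_env`, `missing_required`, and `REQUIRED_KEYS` are hypothetical, and the inline-comment handling only mirrors the layout of the template (`KEY=value # comment`), not the behaviour of any particular dotenv library.

```python
# Hypothetical helper, not part of KernelBot: list required .env keys that are unset.
REQUIRED_KEYS = ("GITHUB_TOKEN", "GITHUB_REPO", "PROBLEMS_REPO", "DATABASE_URL")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and full-line # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: an inline comment starts at the first " #" after the value,
        # as in `DISABLE_SSL=1 # Set for local development`.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_required(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent or empty, in declaration order."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    sample = (
        "# Required\n"
        "GITHUB_TOKEN= # GitHub PAT with repo and workflow scopes\n"
        "GITHUB_REPO=gpu-mode/discord-cluster-manager\n"
        "DATABASE_URL=postgresql://user:password@localhost/clusterdev\n"
    )
    print(missing_required(sample))
```

Running a check like this before `python src/kernelbot/main.py --debug` would surface an empty `GITHUB_TOKEN` or a forgotten `PROBLEMS_REPO` up front. The optional, Discord, and CLI OAuth groups are deliberately left out because the template marks them as not always needed.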