Add comprehensive GitHub Copilot instructions for tritonBLAS development #12
Open: Copilot wants to merge 5 commits into main from copilot/fix-11
Commits (5):
- 47b79aa Initial plan (Copilot)
- 0464fa2 Create comprehensive copilot instructions with validated build proces… (Copilot)
- 6ed5fd9 Update .github/copilot-instructions.md (neoblizz)
- 847d04b Update .github/copilot-instructions.md (neoblizz)
- a8fd078 Update .github/copilot-instructions.md (neoblizz)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| # tritonBLAS Development Instructions | ||
|
|
||
| **Always reference these instructions first, and fall back to search or bash commands only when you encounter unexpected information that does not match what is documented here.** | ||
|
|
||
| tritonBLAS is a lightweight Triton-based BLAS library for general matrix multiplication (GEMM) that uses analytical models instead of autotuning. It requires a ROCm or CUDA GPU environment, and its build fetches additional ROCm dependencies. | ||
|
|
||
| ## Working Effectively | ||
|
|
||
| ### Prerequisites and Environment Setup | ||
|
|
||
| **CRITICAL: tritonBLAS requires a ROCm or CUDA GPU environment. CPU-only environments will fail during installation when trying to install triton.** | ||
|
|
||
| #### Docker Setup (RECOMMENDED) | ||
| ```bash | ||
| docker compose up --build -d | ||
| docker attach tritonBLAS-dev | ||
| pip3 install -e . | ||
| export PYTHONPATH=$(pwd)/include/:$PYTHONPATH | ||
| ``` | ||
|
|
||
| **TIMING: Docker build typically takes 7-10 minutes. NEVER CANCEL. Set timeout to at least 15 minutes to allow for slower machines, network delays, or dependency fetching.** | ||
|
|
||
| #### Local Development (Advanced Users Only) | ||
| **WARNING: Local installation often fails due to triton/CUDA dependencies. Use Docker unless you have a specific need.** | ||
|
|
||
| ```bash | ||
| # Only works in ROCm/CUDA environments | ||
| pip install -e ".[dev]" | ||
| export PYTHONPATH=$(pwd)/include/:$PYTHONPATH | ||
| ``` | ||
|
|
||
| **TIMING: pip install takes 2-5 minutes depending on network. NEVER CANCEL. Set timeout to 10+ minutes.** | ||
|
|
||
| ### Build and Installation Process | ||
|
|
||
| The setup.py performs these steps automatically: | ||
| 1. **Dependency Fetching** (~4 seconds total): | ||
|    - Clones ROCm libraries from GitHub (~2.5 seconds) | ||
|    - Sparse checkout of specific paths (~1 second) | ||
|    - Builds origami utility from hipBLASLt | ||
|
|
||
| **NEVER CANCEL the build process. The dependency fetching is critical and must complete.** | ||
|
|
||
| ### Code Quality and Linting | ||
|
|
||
| **ALWAYS run these before committing changes:** | ||
|
|
||
| ```bash | ||
| # Report lint issues, then apply formatting (each command takes <1 second) | ||
| ruff check . | ||
| ruff format . | ||
| ``` | ||
|
|
||
| **TIMING: Both ruff commands complete in <1 second each.** | ||
|
|
||
| ### Testing | ||
|
|
||
| **CRITICAL: Tests require a ROCm or CUDA GPU and cannot run in CPU-only environments.** | ||
|
|
||
| ```bash | ||
| # Run tests (requires a ROCm or CUDA GPU) | ||
| pytest | ||
| ``` | ||
|
|
||
| **TIMING: Test collection takes ~2 seconds. Full test execution timing varies by GPU.** | ||
|
|
||
| **WARNING: Tests will fail with "ModuleNotFoundError: No module named 'triton'" if run outside a ROCm/CUDA environment.** | ||
|
|
||
| ### Running Examples | ||
|
|
||
| ```bash | ||
| # Basic matrix multiplication example | ||
| cd examples | ||
| python3 example_matmul.py --m 8192 --n 8192 --k 8192 | ||
|
|
||
| # Performance-focused example | ||
| python3 example_matmul_lt.py | ||
| ``` | ||
|
|
||
| **NOTE: Examples require a GPU and a working triton installation to run.** | ||
|
|
||
| ## Validation and Development Workflow | ||
|
|
||
| ### After Making Changes | ||
|
|
||
| 1. **ALWAYS run linting first:** | ||
| ```bash | ||
| ruff check . | ||
| ruff format . | ||
| ``` | ||
|
|
||
| 2. **Test your changes in Docker environment:** | ||
| ```bash | ||
| # Starts build in background (7-10 minutes). Command returns immediately due to -d (detached mode). | ||
| # Monitor build progress with 'docker compose logs -f' and wait for build to finish before proceeding. NEVER CANCEL. | ||
| docker compose up --build -d | ||
| docker attach tritonBLAS-dev | ||
| pip3 install -e . # 2-5 minutes in container | ||
| ``` | ||
|
|
||
| 3. **Run examples to validate functionality:** | ||
| ```bash | ||
| cd examples | ||
| python3 example_matmul.py | ||
| ``` | ||
|
|
||
| 4. **Run tests if you have GPU access:** | ||
| ```bash | ||
| pytest # Requires a ROCm or CUDA GPU | ||
| ``` | ||
|
|
||
| ### Common Build Issues and Solutions | ||
|
|
||
| 1. **"No space left on device"**: Clean up Docker images and system cache | ||
| 2. **"ModuleNotFoundError: No module named 'triton'"**: Must use ROCm/CUDA environment | ||
| 3. **Timeout during Docker build**: Increase timeout to 15+ minutes, NEVER CANCEL | ||
| 4. **pip install network timeouts**: Retry with longer timeout (10+ minutes) | ||
|
|
||
| ## Project Structure and Navigation | ||
|
|
||
| ### Key Directories | ||
|
|
||
| - **`include/tritonblas/`**: Main package source code | ||
|
||
|   - `__init__.py`: Package entry point | ||
|   - `matmul.py`: Core matrix multiplication functions | ||
|   - `origami.py`: Heuristic selection logic | ||
|   - `internal/`: Internal implementation details | ||
|
|
||
| - **`examples/`**: Working examples demonstrating usage | ||
|   - `example_matmul.py`: Basic usage example | ||
|   - `example_matmul_lt.py`: Performance-focused API example | ||
|
|
||
| - **`tests/`**: Test suite (requires GPU) | ||
|   - `test_matmul.py`: Basic matmul tests | ||
|   - `test_matmul_lt.py`: Performance API tests | ||
|
|
||
| - **`benchmarks/`**: Performance benchmarking tools | ||
|   - `benchmark_autotuning.py`: Autotuning overhead comparison | ||
|   - `heuristic_benchmark.py`: Heuristic selection timing | ||
|
|
||
| ### Important Files | ||
|
|
||
| - **`setup.py`**: Custom build script with ROCm dependency fetching | ||
| - **`pyproject.toml`**: Project configuration and dependencies | ||
| - **`docker-compose.yml`**: Docker environment setup | ||
| - **`Dockerfile`**: ROCm/PyTorch base image setup | ||
|
|
||
| ## API Usage Patterns | ||
|
|
||
| ### Peak Performance API (Recommended) | ||
| ```python | ||
| import tritonblas | ||
|
|
||
| # Create heuristic selector (one-time setup) | ||
| selector = tritonblas.MatmulHeuristicResult(m, n, k, a_dtype, b_dtype, c_dtype) | ||
|
|
||
| # Use for actual computation | ||
| result = tritonblas.matmul_lt(A, B, selector=selector, enable_streamk=False) | ||
| ``` | ||
|
|
||
| ### Drop-in Replacement API | ||
| ```python | ||
| import tritonblas | ||
|
|
||
| # Direct usage (performs heuristic selection internally) | ||
| result = tritonblas.matmul(A, B, enable_streamk=False) | ||
| ``` | ||
|
|
||
| ## Development Notes | ||
|
|
||
| ### Code Style Requirements | ||
| - **Line length**: 120 characters (configured in pyproject.toml) | ||
| - **Auto-formatting**: Use `ruff format` before committing | ||
| - **Linting**: Fix all `ruff check` issues before committing | ||
|
|
||
| ### Testing Requirements | ||
| - All tests require a ROCm or CUDA GPU environment | ||
| - Tests use parametrized inputs for different data types and matrix sizes | ||
| - Tests include stream-k algorithm validation | ||
|
|
||
| ### Performance Considerations | ||
| - The library eliminates autotuning overhead through analytical models | ||
| - Heuristic selection is cached for previously seen problem sizes | ||
| - Docker environment includes optimized ROCm/PyTorch stack | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### "Cannot import triton" | ||
| - **Cause**: CPU-only environment or missing ROCm/CUDA stack | ||
| - **Solution**: Use Docker with ROCm/CUDA support | ||
|
|
||
| ### "No space left on device" | ||
| - **Cause**: Docker build exhausts disk space | ||
| - **Solution**: `docker system prune -af && sudo apt-get clean` | ||
|
|
||
| ### Docker build timeout | ||
| - **Cause**: Large base image download (ROCm/PyTorch) | ||
| - **Solution**: Increase timeout to 15+ minutes, be patient | ||
|
|
||
| ### pip install fails with network errors | ||
| - **Cause**: PyPI timeouts or network issues | ||
| - **Solution**: Retry with increased timeout, check network connectivity | ||
|
|
||
| **Remember: This is a research project intended for ROCm/CUDA environments. Always validate your changes in the proper GPU environment before submitting.** | ||
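
Review note on the troubleshooting entries in this diff: several of the listed failures come down to running in an environment where `triton` cannot be imported. A minimal preflight sketch in Python is given here, using only the standard library. `missing_modules` is a hypothetical helper name and is not part of tritonBLAS; the default module list is an assumption taken from the error message quoted in the instructions.

```python
import importlib.util
import sys


def missing_modules(names=("triton",)):
    """Return the top-level module names from `names` that cannot be located.

    Hypothetical helper for illustration; not part of tritonBLAS.
    Only top-level names are supported (no dotted paths).
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        # Mirrors the guidance in the instructions: fall back to the Docker setup.
        print(f"Missing modules: {', '.join(missing)}. Use the ROCm/CUDA Docker environment.")
        sys.exit(1)
    print("triton is importable; pytest and the examples can be attempted.")
```

This only checks that the module can be located, not that a GPU is present or usable, so a passing check does not guarantee that the tests will run.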