Labels: question, enhancement, help wanted
Issue Description
I'm unable to compile the low-latency-llama demo on RTX 4090 GPUs due to ThunderKittens compatibility issues. The project seems designed primarily for H100+ architectures.
Environment
- GPU: RTX 4090
- CUDA: 12.4
- OS: Linux
- Python: 3.12
Compilation Errors
error: barrier is not a template
error: identifier "semaphore" is undefined
error: name followed by "::" must be a class or namespace name (move<T>::lds, etc.)
static_assert(NUM_PAGES == 13, "NUM_PAGES must be 13"); // Fails on RTX 4090
Root Cause
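ThunderKittens uses Hopper/Blackwell-specific features that are not available on the RTX 4090 (Ada Lovelace, compute capability 8.9):
- TMA (Tensor Memory Accelerator) operations
- Advanced barrier/semaphore primitives
- Architecture-specific memory limits (RTX 4090: up to 99 KB of shared memory per thread block, 100 KB per SM; H100: up to 227 KB per thread block)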
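To give a rough sense of the shared-memory gap, here is the arithmetic as a small host-side C++ sketch. The per-block limits are the opt-in maximums for each compute capability; the 16 KiB page size and the helper name are my own assumptions for illustration, not values I have confirmed in the demo's source.

```cpp
#include <cstddef>

// Opt-in maximum shared memory per thread block, in bytes.
constexpr std::size_t kSmemPerBlockAda    = 99 * 1024;   // compute capability 8.9 (RTX 4090)
constexpr std::size_t kSmemPerBlockHopper = 227 * 1024;  // compute capability 9.0 (H100)

// ASSUMPTION: 16 KiB is a hypothetical page size, chosen for illustration only.
constexpr std::size_t kAssumedPageBytes = 16 * 1024;

// Number of whole pages that fit in a given shared-memory budget.
constexpr std::size_t pages_that_fit(std::size_t budget_bytes, std::size_t page_bytes) {
    return budget_bytes / page_bytes;
}

// Under the assumed page size, 13 pages fit in the Hopper budget but not in the Ada one.
static_assert(pages_that_fit(kSmemPerBlockHopper, kAssumedPageBytes) >= 13,
              "13 pages should fit in the Hopper per-block budget");
static_assert(pages_that_fit(kSmemPerBlockAda, kAssumedPageBytes) < 13,
              "13 pages should not fit in the Ada per-block budget");
```

If the real page size is close to this, a hard-coded NUM_PAGES of 13 cannot fit on the RTX 4090 no matter what else is changed, which would explain why that static_assert is the first thing to fail.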
Questions
- Is RTX 4090 officially supported?
- Are there plans to add RTX 4090 support?
- Would contributions for RTX 4090 compatibility be welcome?
Workaround
I've created a basic CUDA test that compiles and runs successfully on RTX 4090, confirming the environment is correct. The issue is specifically with ThunderKittens advanced features.
Potential Solutions
- Conditional compilation for different architectures
- Fallback implementations using standard CUDA operations
- Architecture-specific configuration constants
- Clear documentation of supported GPUs
RTX 4090 is widely used in research/development, so adding support would significantly expand the user base. Happy to contribute if there's interest!
Additional Info
- Basic CUDA compilation works fine
- Python bindings work correctly
- Issue is specifically with ThunderKittens library features
- Already fixed some Makefile and config issues locally