Operator Description
The current state-of-the-art Fast Walsh-Hadamard Transform (FWHT) implementations use Tensor Cores on GPUs for significant performance improvements, as described in this paper. This issue focuses on implementing a TileLang Tensor Core-accelerated Hadamard transform on GPUs. The operator should mirror existing SOTA methods but be implemented within the TileLang framework and interface.
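For reference, the FWHT is an O(n log n) butterfly over a power-of-two-length axis. A minimal NumPy sketch of the unnormalized transform (a correctness reference only, not the TileLang kernel itself):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Unnormalized Fast Walsh-Hadamard Transform over the last axis.

    O(n log n) in-place butterfly; the last-axis length must be a
    power of two. Equivalent to multiplying by the Sylvester
    Hadamard matrix H_n.
    """
    x = np.array(x, dtype=np.float64)  # fresh contiguous copy
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Pair index j with j + h inside each block of 2h elements.
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :].copy()
        b = y[..., 1, :].copy()
        y[..., 0, :] = a + b  # writes through the view into x
        y[..., 1, :] = a - b
        h *= 2
    return x
```

For example, `fwht([1, 2, 3, 4])` yields `[10, -2, -4, 0]`, matching `H_4 @ [1, 2, 3, 4]`.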
Implementation Plan
1. Kernel Implementation (L1)
- Kernel: Implement the TileLang kernel for the FWHT operator in `top/kernels/<Hadamard>/`.
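The key idea the Tensor Core papers exploit is that the butterfly can be re-expressed as dense matrix multiplies: since the Sylvester construction gives H_{t·t} = H_t ⊗ H_t, a length-256 transform is two 16×16 GEMMs, which map directly onto Tensor Core fragment shapes. A NumPy sketch of this decomposition (the tile size 16 and function names are illustrative assumptions, not the actual kernel API):

```python
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    # Sylvester construction: H_{2k} = [[H_k, H_k], [H_k, -H_k]]
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fwht_via_matmul(x: np.ndarray, tile: int = 16) -> np.ndarray:
    """Length tile*tile FWHT as two dense GEMMs.

    Because H_{t*t} = H_t (kron) H_t and H_t is symmetric, reshaping x
    row-major into a (t, t) matrix X gives y = H_t @ X @ H_t, flattened.
    This is the matmul form a Tensor Core kernel would execute.
    """
    Ht = hadamard_matrix(tile)
    X = x.reshape(tile, tile)
    return (Ht @ X @ Ht).reshape(-1)
```

Longer transforms factor the same way into more GEMM stages; a TileLang kernel would keep `Ht` resident in shared memory or registers and feed the tiles through `T.gemm`-style primitives.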
2. Op Definition (L2)
- Interface: Define the `torch.ops` interface for the FWHT operator in `top/ops/<Hadamard>.py`.
- Provide a clear and efficient API for users to call the operator within their TileLang-based code.
- Support FP16 and BF16 precision as part of the interface for optimization on modern GPUs.
- Unit Tests: Implement unit tests for correctness in `tests/test_<Hadamard>.py`.
- FP16: Ensure the output is close to the reference values, within an error margin of 1e-3.
- BF16: Ensure the output is close to the reference values, within an error margin of 1.6e-2.
- Compare the output with PyTorch's FWHT implementation for verification.
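A sketch of what such a correctness check could look like, using a float64 NumPy reference with orthonormal scaling and the tolerances above. The call to the actual TileLang operator is simulated here by down-casting the reference to FP16; a real test would invoke the op instead (`fwht_ref` and `check_close` are illustrative names):

```python
import numpy as np

FP16_ATOL = 1e-3   # tolerances stated in this issue
BF16_ATOL = 1.6e-2

def fwht_ref(x: np.ndarray) -> np.ndarray:
    """Orthonormal (1/sqrt(n)-scaled) FWHT reference in float64."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    h = 1
    while h < n:
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :].copy()
        b = y[..., 1, :].copy()
        y[..., 0, :] = a + b
        y[..., 1, :] = a - b
        h *= 2
    return x / np.sqrt(n)

def check_close(out: np.ndarray, x: np.ndarray, atol: float) -> None:
    ref = fwht_ref(x)
    assert np.max(np.abs(out.astype(np.float64) - ref)) < atol

# Normalized input keeps outputs in [-1, 1], so FP16 rounding error
# stays well under the 1e-3 budget. Replace the cast with the real op.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x /= np.linalg.norm(x)
check_close(fwht_ref(x).astype(np.float16), x, FP16_ATOL)
```

Note the normalization convention must match whatever the kernel produces (unnormalized vs. 1/sqrt(n)-scaled), otherwise the tolerance comparison is meaningless.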
- Benchmarks: Implement benchmarking scripts for performance in `benchmarks/benchmark_<op_name>.py`.
- Latency: Measure the time taken to compute the FWHT using the TileLang operator.
- TFLOPS: Report throughput in tera-floating-point operations per second.
- DRAM Bandwidth: Measure the data transfer rates between GPU memory and the processor to assess the memory bottleneck.
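All three metrics can be derived from one measured latency. A sketch of the arithmetic, assuming n·log2(n) additions per length-n transform and one read plus one write per element (`measure` and `fwht_metrics` are illustrative helpers, not part of the proposed interface):

```python
import time
import numpy as np

def measure(fn, iters: int = 100) -> float:
    """Average wall-clock latency of fn() in seconds.

    For a GPU op, a device synchronization belongs before each
    timestamp; omitted here to stay framework-agnostic.
    """
    fn()  # warmup
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

def fwht_metrics(batch: int, n: int, seconds: float,
                 dtype_bytes: int = 2) -> tuple[float, float]:
    """(TFLOPS, GB/s) for a batch of length-n FWHTs taking `seconds`.

    FLOP count: n * log2(n) additions per transform.
    Traffic: each element read once and written once (2 bytes for
    FP16/BF16), the lower bound a memory-bound kernel should approach.
    """
    flops = batch * n * np.log2(n)
    bytes_moved = 2 * batch * n * dtype_bytes
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9
```

Comparing the achieved GB/s against the GPU's peak DRAM bandwidth shows how close the kernel is to the memory roofline, which is the relevant bound for a transform this arithmetic-light.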
3. Benchmark Results
- Report the performance of the TileLang FWHT operator compared to existing SOTA implementations utilizing Tensor Cores.
- Quantify the improvements in latency, throughput (TFLOPS), and memory bandwidth.