Operator Description
The current state-of-the-art Fast Walsh-Hadamard Transform (FWHT) implementations use Tensor Cores on GPUs for significant performance improvements, as described in this paper. This issue focuses on implementing a TileLang Tensor Core-accelerated Hadamard transform on GPUs. The operator should mirror existing SOTA methods but be implemented within the TileLang framework and interface.
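For reference, the FWHT is an O(n log n) butterfly over a power-of-two-length axis. A minimal NumPy sketch of the unnormalized transform (a correctness reference only, not the TileLang kernel itself):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Unnormalized Fast Walsh-Hadamard Transform over the last axis.

    O(n log n) in-place butterfly; the last-axis length must be a
    power of two. Equivalent to multiplying by the Sylvester
    Hadamard matrix H_n.
    """
    x = np.array(x, dtype=np.float64)  # fresh contiguous copy
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Pair index j with j + h inside each block of 2h elements.
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :].copy()
        b = y[..., 1, :].copy()
        y[..., 0, :] = a + b  # writes through the view into x
        y[..., 1, :] = a - b
        h *= 2
    return x
```

For example, `fwht([1, 2, 3, 4])` yields `[10, -2, -4, 0]`, matching `H_4 @ [1, 2, 3, 4]`.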
Implementation Plan
1. Kernel Implementation (L1)
- Kernel: Implement the TileLang kernel for the FWHT operator in `top/kernels/<Hadamard>/`.
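The key idea the Tensor Core papers exploit is that the butterfly can be re-expressed as dense matrix multiplies: since the Sylvester construction gives H_{t·t} = H_t ⊗ H_t, a length-256 transform is two 16×16 GEMMs, which map directly onto Tensor Core fragment shapes. A NumPy sketch of this decomposition (the tile size 16 and function names are illustrative assumptions, not the actual kernel API):

```python
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    # Sylvester construction: H_{2k} = [[H_k, H_k], [H_k, -H_k]]
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fwht_via_matmul(x: np.ndarray, tile: int = 16) -> np.ndarray:
    """Length tile*tile FWHT as two dense GEMMs.

    Because H_{t*t} = H_t (kron) H_t and H_t is symmetric, reshaping x
    row-major into a (t, t) matrix X gives y = H_t @ X @ H_t, flattened.
    This is the matmul form a Tensor Core kernel would execute.
    """
    Ht = hadamard_matrix(tile)
    X = x.reshape(tile, tile)
    return (Ht @ X @ Ht).reshape(-1)
```

Longer transforms factor the same way into more GEMM stages; a TileLang kernel would keep `Ht` resident in shared memory or registers and feed the tiles through `T.gemm`-style primitives.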
2. Op Definition (L2)
- Interface: Define the `torch.ops` interface for the FWHT operator in `top/ops/<Hadamard>.py`.
- Provide a clear and efficient API for users to call the operator within their TileLang-based code.
- Support FP16 and BF16 precision as part of the interface for optimization on modern GPUs.
- Unit Tests: Implement unit tests for correctness in `tests/test_<Hadamard>.py`.
- FP16: Ensure the output is close to the reference values, within an error margin of 1e-3.
- BF16: Ensure the output is close to the reference values, within an error margin of 1.6e-2.
- Compare the output with PyTorch's FWHT implementation for verification.
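A sketch of what such a correctness check could look like, using a float64 NumPy reference with orthonormal scaling and the tolerances above. The call to the actual TileLang operator is simulated here by down-casting the reference to FP16; a real test would invoke the op instead (`fwht_ref` and `check_close` are illustrative names):

```python
import numpy as np

FP16_ATOL = 1e-3   # tolerances stated in this issue
BF16_ATOL = 1.6e-2

def fwht_ref(x: np.ndarray) -> np.ndarray:
    """Orthonormal (1/sqrt(n)-scaled) FWHT reference in float64."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    h = 1
    while h < n:
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :].copy()
        b = y[..., 1, :].copy()
        y[..., 0, :] = a + b
        y[..., 1, :] = a - b
        h *= 2
    return x / np.sqrt(n)

def check_close(out: np.ndarray, x: np.ndarray, atol: float) -> None:
    ref = fwht_ref(x)
    assert np.max(np.abs(out.astype(np.float64) - ref)) < atol

# Normalized input keeps outputs in [-1, 1], so FP16 rounding error
# stays well under the 1e-3 budget. Replace the cast with the real op.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x /= np.linalg.norm(x)
check_close(fwht_ref(x).astype(np.float16), x, FP16_ATOL)
```

Note the normalization convention must match whatever the kernel produces (unnormalized vs. 1/sqrt(n)-scaled), otherwise the tolerance comparison is meaningless.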
- Benchmarks: Implement benchmarking scripts for performance in `benchmarks/benchmark_<op_name>.py`.
- Latency: Measure the time taken to compute the FWHT using the TileLang operator.
- TFLOPS: Report throughput in tera-floating-point operations per second.
- DRAM Bandwidth: Measure the data transfer rates between GPU memory and the processor to assess the memory bottleneck.
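All three metrics can be derived from one measured latency. A sketch of the arithmetic, assuming n·log2(n) additions per length-n transform and one read plus one write per element (`measure` and `fwht_metrics` are illustrative helpers, not part of the proposed interface):

```python
import time
import numpy as np

def measure(fn, iters: int = 100) -> float:
    """Average wall-clock latency of fn() in seconds.

    For a GPU op, a device synchronization belongs before each
    timestamp; omitted here to stay framework-agnostic.
    """
    fn()  # warmup
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

def fwht_metrics(batch: int, n: int, seconds: float,
                 dtype_bytes: int = 2) -> tuple[float, float]:
    """(TFLOPS, GB/s) for a batch of length-n FWHTs taking `seconds`.

    FLOP count: n * log2(n) additions per transform.
    Traffic: each element read once and written once (2 bytes for
    FP16/BF16), the lower bound a memory-bound kernel should approach.
    """
    flops = batch * n * np.log2(n)
    bytes_moved = 2 * batch * n * dtype_bytes
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9
```

Comparing the achieved GB/s against the GPU's peak DRAM bandwidth shows how close the kernel is to the memory roofline, which is the relevant bound for a transform this arithmetic-light.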
3. Benchmark Results
- Report the performance of the TileLang FWHT operator compared to existing SOTA implementations utilizing Tensor Cores.
- Quantify the improvements in latency, throughput (TFLOPS), and memory bandwidth.