## Parent Issue
<!-- Link to the main tracking issue for this operator using #IssueID -->
Part of #111

## Task Type
<!-- Please check the relevant component for this sub-issue -->
- [ ] **L2: Op Implementation** (Wrapper + Unit Tests + Benchmarks)

## Description
<!-- Detailed description of what needs to be implemented in this step -->

## Checklist
<!-- Refer to docs/DEVELOPMENT.md for specific requirements for each layer -->
- [ ] Implementation follows **Google Python Style** for code and docstrings.
- [ ] **(L2 Only)** Unit tests match PyTorch reference (FP16/BF16).
- [ ] **(L2 Only)** Benchmarks implemented (Latency/TFLOPS/Bandwidth).