[Question] Clarification on FP8 Micro-block Scaling and FP4 Support Timeline #47

@ultranationalism

Hi cuTile team,

I have two specific questions regarding the support for Blackwell-specific hardware features:

  1. Automatic Micro-block Scaling for FP8
    When using fp8 with ct.matmul, how is the Micro-block Scaling (1x16) handled?

Automation: Does the Tile IR compiler automatically handle the scaling logic and hardware invocation (5th-gen Tensor Cores) under the hood?

Explicit Scaling: If it is not fully automatic, how should we provide the scale-factor tiles to the ct.matmul operator? Currently, the ct.matmul(A, B) signature seems to accept only data tiles. Is there a plan for a signature like ct.matmul(A, B, A_scale, B_scale)?
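For concreteness, here is the semantics I have in mind for such a signature, sketched in plain NumPy. Everything here is hypothetical: ct.matmul(A, B, A_scale, B_scale) is the proposed signature, not an existing API, and the float arrays merely stand in for fp8 data tiles; the block size of 16 along K follows the 1x16 layout mentioned above.

```python
import numpy as np

BLOCK = 16  # 1x16 micro-block along the K dimension (assumed from the question)

def block_scaled_matmul(A_q, B_q, A_scale, B_scale):
    """Reference semantics for a block-scaled matmul (hypothetical sketch).

    A_q:     (M, K) quantized values (stand-in for an fp8 data tile)
    B_q:     (K, N) quantized values
    A_scale: (M, K // BLOCK) one scale per 1x16 block of a row of A
    B_scale: (K // BLOCK, N) one scale per 16x1 block of a column of B
    """
    # Broadcast each per-block scale across its 16 elements, dequantize,
    # then perform an ordinary matmul on the dequantized operands.
    A = A_q * np.repeat(A_scale, BLOCK, axis=1)
    B = B_q * np.repeat(B_scale, BLOCK, axis=0)
    return A @ B
```

In other words, the hardware would be expected to fold the per-block scales into the accumulation, so the result matches dequantize-then-matmul up to rounding. My question is whether cuTile exposes the scale tiles explicitly like this, or derives them internally.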

  2. NVFP4 (FP4) Support Roadmap
    The current documentation and samples focus on fp8 and bf16. Since Blackwell's peak throughput is tied to NVFP4:
    When can we expect support for 4-bit narrow-precision tiles in cuTile Python?

Thanks for this great library!

Metadata

Assignees: no one assigned
Labels: status: triaged (reviewed by maintainers and assigned)
Milestone: none