There are often significant performance gains from changing the tile size when a smaller data type is used, and setting the entry hint occupancy=2 sometimes has a non-trivial impact as well. It would be nice to be able to find a good combination of these parameters automatically, even if the search only covers a hand-picked set of candidates.
See _autotuner.py, with example usage in AttentionFMHA.py.
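
For readers who want the general idea without opening the files, here is a minimal sketch of a search over a hand-picked parameter set. It is not the code in _autotuner.py; the names `candidate_configs`, `time_call`, and `autotune`, and the keyword names `tile_m`, `tile_n`, and `occupancy`, are assumptions made up for illustration. It also assumes the kernel launch can be wrapped in a plain Python callable that accepts the parameters as keyword arguments.

```python
import itertools
import time
from typing import Any, Callable, Dict, Iterable, List, Tuple

Config = Dict[str, Any]


def candidate_configs(search_space: Dict[str, Iterable[Any]]) -> List[Config]:
    """Expand a hand-picked search space into every combination of values."""
    keys = list(search_space)
    value_lists = [list(search_space[k]) for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def time_call(fn: Callable[..., Any], config: Config,
              warmup: int = 1, reps: int = 5) -> float:
    """Return the best wall-clock time (seconds) of fn(**config).

    Assumption: fn blocks until the work is finished. For an asynchronous
    GPU launch, fn would need to synchronize before returning.
    """
    for _ in range(warmup):
        fn(**config)
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        fn(**config)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(fn: Callable[..., Any],
             search_space: Dict[str, Iterable[Any]],
             measure: Callable[[Callable[..., Any], Config], float] = time_call,
             ) -> Tuple[Config, float]:
    """Try every candidate and return (best_config, best_cost)."""
    best_config, best_cost = None, float("inf")
    for config in candidate_configs(search_space):
        try:
            cost = measure(fn, config)
        except Exception:
            # Skip candidates that fail to build or launch
            # (for example, a tile size the kernel does not support).
            continue
        if cost < best_cost:
            best_config, best_cost = config, cost
    if best_config is None:
        raise RuntimeError("no candidate configuration ran successfully")
    return best_config, best_cost
```

A call might look like `autotune(launch, {"tile_m": [64, 128], "tile_n": [64, 128], "occupancy": [1, 2]})`, where `launch` is a hypothetical wrapper around the kernel. The `measure` argument is pluggable so that the timing method can be swapped out, which also makes the search logic easy to check with a synthetic cost function.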