CUDA limits the `.y` and `.z` grid dimensions to 65535 blocks each. We should handle cases where the local matrix is wider than that.
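One possible way to handle this is to split the launch into several column chunks, each small enough for `gridDim.y`. The sketch below is host-side C++ only and assumes the kernel maps one matrix column to one block along `.y`, which this note does not confirm. `ColumnChunk`, `plan_column_chunks`, and `kMaxGridDimYZ` are hypothetical names introduced for illustration, not part of the existing code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Documented CUDA limit for gridDim.y and gridDim.z (in blocks).
constexpr std::int64_t kMaxGridDimYZ = 65535;

// Hypothetical descriptor for one kernel launch covering
// columns [col_offset, col_offset + num_cols).
struct ColumnChunk {
    std::int64_t col_offset;
    std::int64_t num_cols;
};

// Split total_cols columns into chunks of at most max_cols each.
// Assumption: one column maps to one block along the grid's y dimension,
// so num_cols can be used directly as gridDim.y for that launch.
std::vector<ColumnChunk> plan_column_chunks(std::int64_t total_cols,
                                            std::int64_t max_cols = kMaxGridDimYZ) {
    assert(max_cols > 0);
    std::vector<ColumnChunk> chunks;
    for (std::int64_t offset = 0; offset < total_cols; offset += max_cols) {
        chunks.push_back({offset, std::min(max_cols, total_cols - offset)});
    }
    return chunks;
}
```

Each chunk would then correspond to one kernel launch with `gridDim.y = num_cols`, with `col_offset` passed to the kernel so it can add it to `blockIdx.y`. An alternative that avoids multiple launches is a stride loop over columns inside the kernel; which one fits better depends on how the existing kernel indexes the matrix.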