Number of dims and results of reindexed AffineMap doesn't match on Vectorization #17591
Inlining the mlir input below. In the beginning, I thought that the
hal.executable public @main_graph$async_dispatch_1 {
hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>) {
hal.executable.export public @main_graph$async_dispatch_1_generic_9x1024_f32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 2, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
^bb0(%arg0: !hal.device):
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
hal.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @main_graph$async_dispatch_1_generic_9x1024_f32() {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.interface.constant.load[0] : i32
%1 = hal.interface.constant.load[1] : i32
%2 = arith.index_castui %0 : i32 to index
%3 = arith.index_castui %1 : i32 to index
%4 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%2) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<9x1024xf32>>
%5 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%3) : !flow.dispatch.tensor<writeonly:tensor<1x9xf32>>
%6 = flow.dispatch.tensor.load %4, offsets = [0, 0], sizes = [9, 1024], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<9x1024xf32>> -> tensor<9x1024xf32>
%7 = tensor.empty() : tensor<1x9xf32>
%8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<1x9xf32>) -> tensor<1x9xf32>
%9 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (0, d0)>], iterator_types = ["parallel", "reduction"]} ins(%6 : tensor<9x1024xf32>) outs(%8 : tensor<1x9xf32>) {
^bb0(%in: f32, %out: f32):
%10 = arith.addf %in, %out : f32
linalg.yield %10 : f32
} -> tensor<1x9xf32>
flow.dispatch.tensor.store %9, %5, offsets = [0, 0], sizes = [1, 9], strides = [1, 1] : tensor<1x9xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x9xf32>>
return
}
}
}
}
I worked with @jinchen62 and we got a smaller repro: https://gist.github.com/hanhanW/b3652f5887b93fb8f0df6c6c39c1ef87
To repro, run
Then you'll see
#map2 = affine_map<(d0, d1) -> (d0, d1)>
#map8 = affine_map<(d0, d1) -> (0, d0)>
// ...
%29 = linalg.generic {indexing_maps = [#map2, #map8], iterator_types = ["parallel", "reduction"]} ins(%collapsed_12 : tensor<9x1024xf32>) outs(%28 : tensor<?x9xf32>) {
^bb0(%in: f32, %out: f32):
%35 = arith.addf %in, %out : f32
linalg.yield %35 : f32
} -> tensor<?x9xf32>
// ...
Actually, the input reduction op looks weird. The size of
#map5 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map10 = affine_map<(d0, d1, d2) -> (d0, d1, 0)>
%25 = tensor.empty(%12) : tensor<?x9x1xf32>
%26 = linalg.fill ins(%cst_7 : f32) outs(%25 : tensor<?x9x1xf32>) -> tensor<?x9x1xf32>
%27 = linalg.generic {indexing_maps = [#map5, #map10], iterator_types = ["parallel", "parallel", "reduction"]} ins(%24 : tensor<1x9x1024xf32>) outs(%26 : tensor<?x9x1xf32>) {
^bb0(%in: f32, %out: f32):
%31 = arith.addf %in, %out : f32
linalg.yield %31 : f32
} -> tensor<?x9x1xf32>
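For comparison, a hand-written sketch (purely illustrative; %src, %d0, and %cst are placeholders, not taken from the repro) of how this reduction is more conventionally expressed: the reduced dimension d2 is dropped from the output indexing map instead of being pinned to a constant 0 result.

#map_in = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map_out = affine_map<(d0, d1, d2) -> (d0, d1)>
// The init tensor carries only the non-reduced dims (d0, d1).
// %src, %d0, and %cst are assumed to be defined earlier.
%empty = tensor.empty(%d0) : tensor<?x9xf32>
%fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<?x9xf32>) -> tensor<?x9xf32>
%sum = linalg.generic {indexing_maps = [#map_in, #map_out], iterator_types = ["parallel", "parallel", "reduction"]} ins(%src : tensor<1x9x1024xf32>) outs(%fill : tensor<?x9xf32>) {
^bb0(%in: f32, %out: f32):
  %add = arith.addf %in, %out : f32
  linalg.yield %add : f32
} -> tensor<?x9xf32>

In that form both indexing maps are projected permutations, which presumably avoids the dangling constant-0 result described in the issue body below.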
smaller repro: https://gist.github.com/jinchen62/91e216fb39abbb9ba4c0461346d2bb5a
command:
@jinchen62 did you get a chance to see which op is generating the IR? The generic op looks invalid to me, as I explained in the comment above.
I think it's
I'd suggest checking whether there are bugs in the torch -> linalg lowering, or in the lowerings from other high-level dialects to torch.
torch level repro: https://gist.github.com/jinchen62/601cfce290b81e037383fc49b604a68a
part of the dump of the torch repro:
@hanhanW I think with a
I'm not convinced that the issue is
I don't know why we're still triaging the issue at model level; perhaps I did not make it clear. Let me put it this way: instead of compiling the whole model, are you able to compile a single dispatch?
I don't think it's a lowering issue. The
@raikonenfnu and I think there might be an optimization bug in
@jinchen62 So what's the plan to fix this issue? The bart-large model needs this anyway.
What happened?
dispatch: https://gist.github.com/jinchen62/5e2af98f9b5bfc3b55e949f964459815
error log: https://gist.github.com/jinchen62/df2038b5a43ed4680804a3d7d0647d95
The failing op dumped at https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Common/GenericVectorization.cpp#L336 is
At the assertion failing point https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp#L474, the map is changed from (d0, d1) -> (0, d0) to (d0) -> (0, d0), so the number of dims no longer matches the number of results.
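To make the mismatch concrete, here is a minimal hand-written sketch of the two maps (the alias names are made up for illustration; the maps themselves are the ones from the dump):

// Before reindexing: two dims (d0, d1) and two results, so the counts line up.
#before = affine_map<(d0, d1) -> (0, d0)>
// After the reduction dim d1 is dropped from the dim list, the constant-0
// result is still present: one dim but two results, hence the failed assertion.
#after = affine_map<(d0) -> (0, d0)>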
Steps to reproduce your issue
Run
iree-compile --iree-input-demote-i64-to-i32 --iree-hal-target-backends=llvm-cpu dispatch_1.mlir -o test.vmfb 2> dump.mlir
with TOM IREE.

What component(s) does this issue relate to?
No response
Version information
No response
Additional context
No response