Commit
[Benchmarking option] Make outlined functions do no compute (#1010)
This PR adds an experiment that measures how much of the run time is due to compute versus data movement. It adds an option to elide the computation of the outlined function entirely, effectively replacing the matmul with a no-op. This gives a lower bound on run time, answering the question: how fast would we be if the kernel running on the AIE core were 100% efficient?

Scraping the results from CI (see the copy-pasted output below), replacing the matmul with a no-op takes the mean time from 1932 us to 1859 us, a 4% speedup. So the actual speed-up we'd get from a 100% efficient kernel is in the range 0% - 4%, i.e. not very much.

With matmul:

```
matmul_benchmark_4096_512_512_bf16_f32_O3_outline.mlir
--------------------------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_matmul/process_time/real_time_mean      1932 us         94.5 us            5 items_per_second=517.698/s
BM_matmul/process_time/real_time_median    1933 us         91.1 us            5 items_per_second=517.365/s
BM_matmul/process_time/real_time_stddev    7.62 us         7.53 us            5 items_per_second=2.04055/s
--------------------------------------------------------------------------------------------------
The largest program memory size (read from byte 72 of elf files) is 8928 bytes
```

With noop replacing matmul:

```
matmul_benchmark_4096_512_512_bf16_f32_O3_outline_empty.mlir
--------------------------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_matmul/process_time/real_time_mean      1859 us         90.4 us            5 items_per_second=537.837/s
BM_matmul/process_time/real_time_median    1854 us         86.6 us            5 items_per_second=539.255/s
BM_matmul/process_time/real_time_stddev    19.6 us         13.9 us            5 items_per_second=5.60814/s
--------------------------------------------------------------------------------------------------
The largest program memory size (read from byte 72 of elf files) is 2816 bytes
```

---------

Co-authored-by: Jorn Tuyls <jtuyls@users.noreply.github.com>
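The 4% figure can be re-derived from the two reported means. The snippet below is only a small sketch of that arithmetic; the variable names are illustrative and not part of the PR, and the inputs are the `real_time_mean` values copied from the benchmark output above.

```python
# Mean wall-clock times from the two benchmark runs above, in microseconds.
with_matmul_us = 1932.0
with_noop_us = 1859.0

# Time that disappears when the outlined function does no compute.
saved_us = with_matmul_us - with_noop_us

# Upper bound on the fraction of run time a perfect kernel could remove.
max_time_reduction = saved_us / with_matmul_us

# The same bound expressed as a throughput gain (throughput scales as 1 / time).
max_throughput_gain = with_matmul_us / with_noop_us - 1.0

print(f"time saved by eliding compute: {saved_us:.0f} us")
print(f"upper bound on time reduction: {max_time_reduction:.1%}")
print(f"upper bound on throughput gain: {max_throughput_gain:.1%}")
```

Both ways of expressing the bound round to roughly 4%, which matches the figure quoted in the description. A real kernel cannot do zero work, so the achievable gain lies somewhere below this bound.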
Showing 10 changed files with 144 additions and 27 deletions.
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/linalg_function_outlining.mlir (2 changes: 1 addition & 1 deletion)
...ugins/target/AMD-AIE/iree-amd-aie/Transforms/test/linalg_function_outlining_to_empty.mlir (28 changes: 28 additions & 0 deletions)
@@ -0,0 +1,28 @@
// RUN: iree-opt --split-input-file --pass-pipeline="builtin.module(iree-amdaie-linalg-function-outlining{empty-functions=true})" --verify-diagnostics --split-input-file %s | FileCheck %s --check-prefix=EMPTY
// RUN: iree-opt --split-input-file --pass-pipeline="builtin.module(iree-amdaie-linalg-function-outlining{empty-functions=false})" --verify-diagnostics --split-input-file %s | FileCheck %s --check-prefix=NOT_EMPTY

func.func @reduction(%A: memref<4xbf16>, %B: memref<bf16>) {
  %c2 = arith.constant 2 : index
  %tile = amdaie.tile(%c2, %c2)
  %1 = amdaie.core(%tile, in : [], out : []) {
    linalg.generic {
      indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>],
      iterator_types = ["reduction"]
    } ins(%A: memref<4xbf16>) outs(%B : memref<bf16>) {
    ^bb0(%in: bf16, %out: bf16):
      linalg.yield %in : bf16
    }
    amdaie.end
  }
  return
}

// The (default) case where empty-functions is false, outlining works as usual.
// NOT_EMPTY: func.func private
// NOT_EMPTY: linalg.generic
// NOT_EMPTY: return

// When empty-functions=true, the outlined function shouldn't contain compute.
// EMPTY: func.func private
// EMPTY-NOT: linalg.generic
// EMPTY: return
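
To make the intent of the `EMPTY` checks concrete, here is a simplified Python model of what that check sequence asserts: find `func.func private`, then the next `return`, and require that `linalg.generic` does not appear between them. This is only a sketch and not how FileCheck is implemented. The helper name, the IR strings, and the function name `@outlined_example` are hypothetical stand-ins, not actual output of the pass.

```python
def passes_empty_checks(ir_text: str) -> bool:
    """Simplified model of: CHECK 'func.func private', CHECK-NOT 'linalg.generic', CHECK 'return'."""
    start = ir_text.find("func.func private")
    if start == -1:
        return False  # first positive check found no match
    start += len("func.func private")
    end = ir_text.find("return", start)
    if end == -1:
        return False  # second positive check found no match
    # The negative check applies only to the text between the two positive matches.
    return "linalg.generic" not in ir_text[start:end]


# Hypothetical, abbreviated IR: an outlined function that still contains compute.
OUTLINED_WITH_COMPUTE = """\
func.func private @outlined_example(%arg0: memref<4xbf16>, %arg1: memref<bf16>) {
  linalg.generic  // body elided
  return
}
"""

# Hypothetical, abbreviated IR: an outlined function with the compute removed.
OUTLINED_EMPTY = """\
func.func private @outlined_example(%arg0: memref<4xbf16>, %arg1: memref<bf16>) {
  return
}
"""

print(passes_empty_checks(OUTLINED_WITH_COMPUTE))
print(passes_empty_checks(OUTLINED_EMPTY))
```

Under this model the first string fails the `EMPTY` checks and the second passes. The `NOT_EMPTY` prefix is the mirror image: it requires `linalg.generic` to appear between the same two anchors.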