# XLA Tooling

The XLA development workflow is usually centered around
[HLO](./operation_semantics) IR, which represents isolated functional
computation fed into the compiler. HLO can be dumped from a running program
using:

```
$ XLA_FLAGS=--xla_dump_to=/tmp/myfolder ./myprogram-entry-point
```

which stores all before-optimization HLO files in the folder specified, along
with many other useful artifacts.
## [`run_hlo_module`] Run HLO Modules

```
$ bazel run //xla/tools:run_hlo_module -- [flags] <filename>
```
The tool `run_hlo_module` operates on pre-optimization HLO, and by default
bundles compilation, running, and comparison with the reference interpreter
implementation. For example, the usual invocation to run an input file is:

```
$ run_hlo_module --platform=CUDA --reference_platform=Interpreter computation.hlo
```
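For reference, HLO has a simple textual syntax. A minimal `computation.hlo` input for the invocation above might look like the following (a hand-written sketch, not a dump from a real program):

```
HloModule add

ENTRY main {
  p0 = f32[4] parameter(0)
  p1 = f32[4] parameter(1)
  ROOT sum = f32[4] add(p0, p1)
}
```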
### Run Multiple HLO Modules

`run_hlo_module` supports invocation with multiple HLO modules. To run all
HLO modules from a directory:

```
$ bazel run //xla/tools:run_hlo_module -- [flags] /dump/*before_optimizations*
```
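The wildcard here is ordinary shell globbing, so any subset of a dump directory can be selected. A quick sketch with hypothetical file names (real XLA dump names additionally include module numbers and entry computation names):

```shell
# Create a mock dump folder (file names are illustrative only).
mkdir -p /tmp/dump_demo
touch /tmp/dump_demo/module_0001.before_optimizations.txt
touch /tmp/dump_demo/module_0002.before_optimizations.txt
touch /tmp/dump_demo/module_0001.after_optimizations.txt

# The glob selects only the pre-optimization dumps.
ls /tmp/dump_demo/*before_optimizations*
```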
## [`multihost_hlo_runner`] Run HLO Modules With SPMD Support

Note: the binary name is `hlo_runner_main`.

```
$ bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] <filename>
```

Multihost HLO runner is a very similar tool, with the caveat that it supports
SPMD, including cross-host communication. See
[Multi-Host HLO Runner](./tools_multihost_hlo_runner) for details.
### Run Multiple HLO Modules With SPMD Support

Similar to `run_hlo_module`, `multihost_hlo_runner` also supports invocation
with multiple modules:

```
$ bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] /dump/*before_optimizations*
```
## [`hlo-opt`] Compile HLO Modules

```
$ bazel run //xla/tools:hlo-opt -- --platform=[gpu|cpu|...] [more flags] <filename>
```

When debugging or understanding the workings of the compiler, it is often useful
to get the expansion for a particular hardware at a particular point in the
pipeline (be it HLO, optimized HLO, TritonIR, or LLVM), for a given HLO or
StableHLO input.
`hlo-opt` supports multiple output stages: be it PTX, HLO after optimizations,
LLVM IR before optimizations, or TritonIR. The exact set of stages supported
depends on the platform, and can be obtained using the `--list-stages` command:

```
$ hlo-opt --platform=CUDA --list-stages
buffer-assignment
hlo
hlo-backend
html
llvm
llvm-after-optimizations
llvm-before-optimizations
ptx
```
After selecting a stage, the user can write the result of the conversion for a
given platform to a given stream:

```
$ hlo-opt --platform=cpu --stage=hlo input.hlo
```

which would print the dump to stdout (or to a given file if `-o` was specified).
### Deviceless Compilation for GPU

Deviceless compilation does not need access to a GPU. For stages that would
otherwise require a GPU device, a GPU spec can instead be supplied on the
command line via `--xla_gpu_target_config_filename`, eliminating the need for
a GPU device.

For example, to get LLVM IR output without access to a GPU device:

```
$ hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=/xla/tools/hlo_opt/gpu_specs/a100_pcie_80.txtpb input.hlo
```
Specs for popular GPUs are shipped with the compiler, and the provided file is a
string serialization of `device_description.proto`:
```
gpu_device_info {
  …
}
platform_name: "CUDA"
```
More GPU specs are located at `/xla/tools/hlo_opt/gpu_specs`.

Note: **Autotuning**. Depending on the compilation `--stage`, compilation may
involve autotuning. For deviceless compilation to work, the user needs to
either **disable** autotuning with `--xla_gpu_autotune_level=0`, or **load
pre-existing autotuning results** with
`--xla_gpu_load_autotune_results_from=<filename>` (obtained with
`--xla_gpu_dump_autotune_results_to=<filename>`).

Deviceless compilation might run into issues if autotuning is required. Luckily,
we can also provide those on the command line:
```
results {
  …
}
```

The autotuning database can be serialized using
`XLA_FLAGS=--xla_gpu_dump_autotune_results_to=<myfile.pbtxt>`.
## [`hlo-opt`] HLO Pass Development And Debugging

If you are working with hardware-independent passes from the
`xla/hlo/transforms/` directory, prefer the light-weight version of the
`hlo-opt` tool with fewer dependencies:

```
$ bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
```

Otherwise, for hardware-independent, CPU, and GPU passes, use the same binary
from the "Compile HLO Modules" section above:

```
$ bazel run //xla/tools:hlo-opt -- [flags] <filename>
```
The `hlo-opt` tool allows execution of individual passes independent of a
given platform's compilation stages. This isolation helps to quickly run
passes on an input HLO module and pinpoint the root cause of failures.

```
$ hlo-opt --passes=schedule-aware-collective-cse input.hlo
```

Note: the `--platform` option is not required.

The `hlo-opt` tool also supports [`DebugOptions` `XLA_FLAGS`](https://github.com/openxla/xla/blob/5bf1e6420d250dce5eb840889096bdf8aad6f432/xla/xla.proto#L40-L1197):

```
$ hlo-opt --passes=schedule-aware-collective-cse \
    --xla_gpu_experimental_collective_cse_distance_threshold=20 input.hlo
```

Use the `--list-passes` option to get the pass name strings:

```
$ hlo-opt --list-passes
```
Users can create a custom pipeline by specifying more than one pass in the
`--passes` option:

```
$ hlo-opt --passes=pass1,pass2,pass3 input.hlo
```
### New HLO Pass Development

1.  First, write your pass.
1.  Register the new pass in the `hlo-opt` tool pass registry:

    ```
    RegisterPass<FooPass>(FooPassInputOptions)
    ```

    Based on the pass type, choose one of the following locations for
    registration:

    - [`opt_lib.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/hlo/tools/hlo_opt/opt_lib.cc): hardware-independent passes.
    - [`cpu_opt.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/tools/hlo_opt/cpu_opt.cc): CPU-specific passes.
    - [`gpu_opt.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/tools/hlo_opt/gpu_opt.cc): GPU-specific passes.
    - [`compiled_opt.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/tools/hlo_opt/compiled_opt_lib.cc): passes common to CPU, GPU, and XPU.

    Don't forget to add the build dependency.

    Include the pass registration as part of your PR
    ([example](https://github.com/openxla/xla/pull/22968/files#diff-e37a0ea999dfc5764d624240cd2edebb8b7ee4e6d91686be89c632dd7203b823))
    so that the pass is available to all `hlo-opt` users.

1.  Rebuild the `hlo-opt` tool, validate successful pass registration using the
    `--list-passes` option, and then use the `--passes` option to run the pass:

    ```
    $ hlo-opt --passes=foo-pass input.hlo
    ```

1.  To write unit tests for the pass, refer to
    https://openxla.org/xla/test_hlo_passes.
260
+ ### Pass Runtime Measurement
261
+
262
+ For large models, full compilation runs can take upto few minutes, making it
263
+ challenging to detect subtle performance regressions. In contrast, individual
264
+ pass runs using `hlo-opt` allow for precise
265
+ performance measurement and the easy detection of even small increases in
266
+ execution time caused by new code changes.
267
+
268
+ ```
269
+ $ time hlo-opt --passes=reduce-window-rewriter,scatter_simplifier
270
+ --xla_reduce_window_rewrite_base_length=128 input.hlo
271
+ ```
272
+
## [`hlo-opt`] Convert HLO Module Formats

Use the light-weight version of the `hlo-opt` tool:

```
$ bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
```

#### Convert `HLO Text` -> `HLO Proto`

```
$ hlo-opt --emit-proto input.hlo
```
#### Convert `HLO Proto` or `HLO Proto Binary` -> `HLO Text`

```
$ hlo-opt input.pbtxt
$ hlo-opt input.pb
```