Commit dc08851

abhigunj authored and Google-ML-Automation committed
XLA Tooling: Improve and update the documentation to include the new features introduced in the tools.

1. Changed the heading format to [`tool name`] <Usage>.
2. Added the missing how-to-build-the-binary instructions for each tool.
3. Added documentation for the new `hlo-opt` features.
4. Updated the documentation for existing `hlo-opt` features to clarify that compilation is deviceless; corrected paths and links.

PiperOrigin-RevId: 734744627
1 parent 29f814a commit dc08851

File tree

1 file changed: +157 −28 lines changed

docs/tools.md

@@ -1,4 +1,4 @@
-# Using XLA tooling
+# XLA Tooling

The XLA development workflow is usually centered around
[HLO](./operation_semantics) IR, which represents isolated functional
@@ -19,7 +19,11 @@ $ XLA_FLAGS=--xla_dump_to=/tmp/myfolder ./myprogram-entry-point
which stores all before-optimization HLO files in the folder specified, along
with many other useful artifacts.

-## Running HLO snippets: `run_hlo_module`
+## [`run_hlo_module`] Run HLO Modules
+
+```
+$ bazel run //xla/tools:run_hlo_module -- [flags] <filename>
+```

The tool `run_hlo_module` operates on pre-optimization HLO, and by default
bundles compilation, running and comparison with the reference interpreter
@@ -30,30 +34,44 @@ implementation. For example, the usual invocation to run an input file
$ run_hlo_module --platform=CUDA --reference_platform=Interpreter computation.hlo
```

-As with all the tools, `--help` can be used to obtain the full list of options.
+### Run Multiple HLO Modules
+Invocation with multiple HLO modules is supported for `run_hlo_module`. To run
+all HLO modules from a directory:
+
+```
+$ bazel run //xla/tools:run_hlo_module -- [flags] /dump/*before_optimizations*
+```
+
+## [`multihost_hlo_runner`] Run HLO Modules With SPMD Support

-## Running HLO snippets with SPMD support: `multihost_hlo_runner`
+```
+Note: Binary name is `hlo_runner_main`.
+$ bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] <filename>
+```

Multihost HLO runner is a very similar tool, with the caveat that it supports
SPMD, including cross host communication. See
[Multi-Host HLO Runner](./tools_multihost_hlo_runner) for details.

-## Multi-HLO replay
+### Run Multiple HLO Modules With SPMD Support

-Invocation with multiple modules is supported for both `run_hlo_module` and
-`hlo_runner_main`, which is often convenient to replay all modules in a dump
-directory:
+Similar to `run_hlo_module`, `multihost_hlo_runner` also supports invocation
+with multiple modules.

-```shell
-$ hlo_runner_main /dump/*before_optimizations*
+```
+$ bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] /dump/*before_optimizations*
```

-## Running passes/stages of HLO compilation: `hlo-opt`
+## [`hlo-opt`] Compile HLO Module
+
+```
+$ bazel run //xla/tools:hlo-opt -- --platform=[gpu|cpu|...] [more flags] <filename>
+```

When debugging or understanding the workings of the compiler, it is often useful
to get the expansion for a particular hardware at a particular point in the
-pipeline (be it HLO, optimized HLO, TritonIR or LLVM), for a given (Stable) HLO
-input.
+pipeline (be it HLO, optimized HLO, TritonIR or LLVM), for a given HLO or
+StableHLO input.

`hlo-opt` supports multiple output stages: be it PTX, HLO after optimizations,
LLVM IR before optimizations, or TritonIR. The exact set of stages supported
@@ -62,35 +80,38 @@ the --list-stages command:

```
$ hlo-opt --platform=CUDA --list-stages
+buffer-assignment
hlo
+hlo-backend
+html
llvm
+llvm-after-optimizations
+llvm-before-optimizations
ptx
```

After selecting a stage, the user can write the result of the conversion for a
given platform to a given stream:

```
-$ hlo-opt myinput.hlo --platform=CUDA --stage=llvm
+$ hlo-opt --platform=cpu --stage=hlo input.hlo
```

which would print the dump to stdout (or to a given file if `-o` was specified).
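For example, to write the stage output to a file instead of stdout (an illustrative use of `-o`; the output path is arbitrary):

```
$ hlo-opt --platform=cpu --stage=hlo -o /tmp/out.hlo input.hlo
```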

-### Deviceless Usage
+### Deviceless Compilation for GPU
+
+Deviceless compilation does not need access to a GPU. Deviceless compilation
+provides a way to specify a GPU spec on the command line
+(`--xla_gpu_target_config_filename`) for stages where access to a GPU is
+required, eliminating the need for a GPU device.

-Access to a GPU is not needed for most of the compilation, and by specifying a
-GPU spec on the command line we can get e.g. PTX output without access to an
-accelerator:
+Example: PTX output without access to a GPU device:

```
-$ hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=(pwd)/tools/data/gpu_specs/a100_pcie_80.txtpb input.hlo
+$ hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=/xla/tools/hlo_opt/gpu_specs/a100_pcie_80.txtpb input.hlo
```

-Note: For the above invocation to work, the user would usually either need to
-disable autotuning with `--xla_gpu_autotune_level=0` or load a pre-existing
-autotuning results with `--xla_gpu_load_autotune_results_from=<filename>`
-(obtained with `--xla_gpu_dump_autotune_results_to=<filename>`).
-
Specs for popular GPUs are shipped with the compiler, and the provided file is
a string serialization of `device_description.proto`:
@@ -117,6 +138,16 @@ gpu_device_info {
}
platform_name: "CUDA"
```
+More GPU specs are located at `/xla/tools/hlo_opt/gpu_specs`.
+
+Note: **Autotuning**\
+Sometimes compilation may involve autotuning, depending on the compilation `--stage`.
+For deviceless compilation to work, the user either needs to\
+**disable** autotuning with `--xla_gpu_autotune_level=0`\
+or\
+**load pre-existing
+autotuning results** with `--xla_gpu_load_autotune_results_from=<filename>`
+(obtained with `--xla_gpu_dump_autotune_results_to=<filename>`).
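For instance, a deviceless compilation with autotuning disabled might look like this (an illustrative combination of the flags documented above):

```
$ hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=/xla/tools/hlo_opt/gpu_specs/a100_pcie_80.txtpb --xla_gpu_autotune_level=0 input.hlo
```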

Deviceless compilation might run into issues if autotuning is required. Luckily,
we can also provide those on the command line:
@@ -152,11 +183,109 @@ results {
The autotuning database can be serialized using
`XLA_FLAGS=--xla_gpu_dump_autotune_results_to=<myfile.pbtxt>`
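The dumped results can then be loaded in a subsequent deviceless run (an illustrative command; `myfile.pbtxt` is the file produced above):

```
$ hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=/xla/tools/hlo_opt/gpu_specs/a100_pcie_80.txtpb --xla_gpu_load_autotune_results_from=myfile.pbtxt input.hlo
```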

-### Running a Single Compiler Pass
+## [`hlo-opt`] HLO Pass Development And Debugging
+
+```
+If you are working with hardware-independent passes from the
+`xla/hlo/transforms/` directory, prefer the lightweight version
+of the `hlo-opt` tool with fewer dependencies:
+
+$ bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
+
+Otherwise, for hardware-independent as well as CPU and GPU passes, use
+the same binary from the "Compile HLO Module" section above:
+
+$ bazel run //xla/tools:hlo-opt -- [flags] <filename>
+```
+
+The `hlo-opt` tool allows execution of individual passes,
+independent of the given platform's compilation stages. This isolation helps to
+quickly run passes on an input HLO module and pinpoint the root cause of failures.
+
+```
+$ hlo-opt --passes=schedule-aware-collective-cse input.hlo
+```
+
+Note: The `--platform` option is not required.
+
+The `hlo-opt` tool also supports [`DebugOptions XLA_FLAGS`](https://github.com/openxla/xla/blob/5bf1e6420d250dce5eb840889096bdf8aad6f432/xla/xla.proto#L40-L1197).
+
+```
+$ hlo-opt --passes=schedule-aware-collective-cse \
+--xla_gpu_experimental_collective_cse_distance_threshold=20 input.hlo
+```
+
+Use the `--list-passes` option to get the pass name strings.
+
+```
+$ hlo-opt --list-passes
+```
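To find a particular pass in the listing, standard shell filtering works (an illustrative example):

```
$ hlo-opt --list-passes | grep cse
```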
+
+Users can create their own custom pipeline by specifying more than one pass
+to the `--passes` option.
+
+```
+$ hlo-opt --passes=pass1,pass2,pass3 input.hlo
+```
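For instance, the two passes from the runtime-measurement example below can be chained into a single pipeline:

```
$ hlo-opt --passes=reduce-window-rewriter,scatter_simplifier input.hlo
```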
+
+### Assist New HLO Pass Development
+
+1. First, write your pass.
+1. Register the new pass in the `hlo-opt` tool pass registry.
+
+```
+RegisterPass<FooPass>(FooPassInputOptions)
+```
+
+Based on the pass type, choose one of the following locations for
+registration:\
+[`opt_lib.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/hlo/tools/hlo_opt/opt_lib.cc) Hardware-independent passes.\
+[`cpu_opt.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/tools/hlo_opt/cpu_opt.cc) CPU-specific passes.\
+[`gpu_opt.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/tools/hlo_opt/gpu_opt.cc) GPU-specific passes.\
+[`compiled_opt.cc`](https://github.com/openxla/xla/blob/5d015a2ddfcf4f40934a33891dc63471704f221d/xla/tools/hlo_opt/compiled_opt_lib.cc) Passes common to CPU, GPU, and XPU.\
+Don't forget to add the build dependency.
+
+Include the pass registration as part of your PR ([example](https://github.com/openxla/xla/pull/22968/files#diff-e37a0ea999dfc5764d624240cd2edebb8b7ee4e6d91686be89c632dd7203b823)) so that the pass will be
+available to all `hlo-opt` users.
+
+1. Rebuild the `hlo-opt` tool, validate successful pass registration using the
+`--list-passes` option, and then use the `--passes` option to run the pass
+(see the consolidated sketch after this list).
+
+```
+$ hlo-opt --passes=foo-pass input.hlo
+```
+
+1. Writing unit tests for the pass? Refer to https://openxla.org/xla/test_hlo_passes for more details.
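Putting the rebuild-and-verify loop together (an illustrative sketch; `foo-pass` is the hypothetical pass name from the steps above):

```
$ bazel build //xla/tools:hlo-opt
$ bazel run //xla/tools:hlo-opt -- --list-passes | grep foo-pass
$ bazel run //xla/tools:hlo-opt -- --passes=foo-pass input.hlo
```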
+
+### Pass Runtime Measurement
+
+For large models, full compilation runs can take up to a few minutes, making it
+challenging to detect subtle performance regressions. In contrast, individual
+pass runs using `hlo-opt` allow for precise
+performance measurement and the easy detection of even small increases in
+execution time caused by new code changes.
+
+```
+$ time hlo-opt --passes=reduce-window-rewriter,scatter_simplifier \
+--xla_reduce_window_rewrite_base_length=128 input.hlo
+```
+
+## [`hlo-opt`] Convert HLO Module Formats
+
+```
+Use the lightweight version of the `hlo-opt` tool.
+
+$ bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
+```
+
+#### Convert `HLO Text` -> `HLO Proto`
+
+```
+$ hlo-opt --emit-proto input.hlo
+```

-The flags from `XLA_FLAGS` are also supported, so the tool can be used to test
-running a single pass:
+#### Convert `HLO Proto` or `HLO Proto Binary` -> `HLO Text`

```
-$ hlo-opt --platform=CUDA --stage=hlo --passes=algebraic_simplifer input.hlo
+$ hlo-opt input.pbtxt
+$ hlo-opt input.pb
```
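A possible round trip between the two formats (a sketch; assumes the `-o` flag from the stage examples above also applies to `--emit-proto` output):

```
$ hlo-opt --emit-proto input.hlo -o /tmp/input.pbtxt
$ hlo-opt /tmp/input.pbtxt
```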
