Developer guide
Jingyue Wu edited this page Mar 18, 2025
# Optionally, fetch a remote branch
$ git fetch origin <BRANCH_NAME>
$ git switch <BRANCH_NAME>
# Only needed initially or when submodules are updated.
$ git submodule update --init --recursive
$ pip install -e .
See https://github.com/NVIDIA/Fuser/wiki/Building-fuser-project for a more detailed guide.
To run the tests:
$ ./manual_ci.sh
To run the benchmarks:
$ bin/nvfuser_bench [--benchmark_filter=<FILTER_REGEX>]
To run only the nvFuser-based benchmarks:
$ bin/nvfuser_bench --benchmark_filter=NvFuserScheduler
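The filter is a regular expression matched against benchmark names, so a plain substring such as NvFuserScheduler selects every benchmark whose name contains it. The following is a minimal Python sketch of that selection, not the benchmark binary's actual code: the helper select_benchmarks and the Baseline_ name are hypothetical, and Python's re module only approximates the regex dialect the binary uses.

```python
import re


def select_benchmarks(names, filter_regex):
    """Keep the names in which filter_regex matches anywhere (partial match)."""
    pattern = re.compile(filter_regex)
    return [name for name in names if pattern.search(name)]


names = [
    "NvFuserScheduler_LayerNorm_fp16/16/2097152",
    "NvFuserScheduler_Reduction_Outer_fp32/1024/8",
    "Baseline_LayerNorm_fp16/16/2097152",  # hypothetical non-nvFuser benchmark
]

# A plain substring keeps the two NvFuserScheduler benchmarks.
print(select_benchmarks(names, "NvFuserScheduler"))
# A narrower pattern keeps the two fp16 LayerNorm benchmarks.
print(select_benchmarks(names, r"LayerNorm.*fp16"))
```

Narrowing the filter this way keeps iteration fast when only one scheduler or data type is affected by a change.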
To measure the performance impact of a change, compare a baseline against a contender:
$ python tools/compare_benchmark.py <baseline_branch_or_commit> <contender_branch_or_commit> <out_dir> -- <args to bin/nvfuser_bench, e.g., --benchmark_filter=NvFuserScheduler>
This script builds and runs both the baseline and the contender, and compares the two results. It also skips the expensive benchmarking when <out_dir> already contains <baseline_branch_or_commit>.json or <contender_branch_or_commit>.json. Below is an example output.
Top 5 improvements:
Benchmark NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/16/2097152/manual_time changed from 235.70649740466797 seconds 173.04190461790853 (0.73x)
Benchmark NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/8/2097152/manual_time changed from 138.39395947220675 seconds 102.51789488799382 (0.74x)
Benchmark NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/8/2097152/manual_time changed from 66.96629544563652 seconds 50.55804438076262 (0.75x)
Benchmark NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/16/2097152/manual_time changed from 115.90680760401582 seconds 91.29062974831477 (0.79x)
Benchmark NvFuserScheduler_LayerNorm_LargeHiddenSize_fp32___GRAPH/NvFuserScheduler_LayerNorm_LargeHiddenSize_fp32/8192/34816/manual_time changed from 1788.1167894834052 seconds 1567.6858208396222 (0.88x)
Top 5 regressions:
Benchmark NvFuserScheduler_Reduction_Outer_fp32___GRAPH/NvFuserScheduler_Reduction_Outer_fp32/1024/8/manual_time changed from 9.868966872637673 seconds 10.735423786849232 (1.09x)
Benchmark NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/1/320/manual_time changed from 5.658855892095212 seconds 6.134677312527026 (1.08x)
Benchmark NvFuserScheduler_Reduction_Inner_fp16___GRAPH/NvFuserScheduler_Reduction_Inner_fp16/8/320/manual_time changed from 6.313002350175524 seconds 6.8305892441798015 (1.08x)
Benchmark NvFuserScheduler_Reduction_Outer_fp16___GRAPH/NvFuserScheduler_Reduction_Outer_fp16/2/4096/manual_time changed from 6.264241068695716 seconds 6.77447848660655 (1.08x)
Benchmark NvFuserScheduler_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp16/2/8/2/manual_time changed from 7.216345496261045 seconds 7.765639299028032 (1.08x)
Saved the histogram of time changes to /home/me/workspace/benchmark_test/histogram.png.
$ open /home/me/workspace/benchmark_test/histogram.png
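The two behaviors described above, reusing an existing result file and ranking benchmarks by their time ratio, can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in tools/compare_benchmark.py: the helper names are hypothetical, the JSON layout is assumed to be the one Google Benchmark writes (a "benchmarks" list with "name" and "real_time" fields), and the ratio is assumed to be contender time divided by baseline time, which agrees with the example output (173.04 / 235.71 is about 0.73).

```python
import json
from pathlib import Path


def needs_benchmark_run(out_dir, ref):
    """Return True when <out_dir>/<ref>.json is missing, so the run cannot be skipped."""
    return not (Path(out_dir) / f"{ref}.json").exists()


def load_times(path):
    """Map benchmark name -> time (assumes Google Benchmark's JSON layout)."""
    with open(path) as f:
        report = json.load(f)
    return {b["name"]: b["real_time"] for b in report["benchmarks"]}


def rank_changes(baseline, contender, top=5):
    """Return (improvements, regressions) as lists of (name, ratio).

    ratio = contender time / baseline time, so a value below 1 means faster.
    """
    ratios = sorted(
        (contender[name] / baseline[name], name)
        for name in baseline.keys() & contender.keys()
    )
    improvements = [(n, r) for r, n in ratios[:top] if r < 1.0]
    regressions = [(n, r) for r, n in reversed(ratios[-top:]) if r > 1.0]
    return improvements, regressions


# Times taken from two rows of the example output above.
baseline = {
    "NvFuserScheduler_LayerNorm_fp16/16/2097152": 235.70649740466797,
    "NvFuserScheduler_Reduction_Outer_fp32/1024/8": 9.868966872637673,
}
contender = {
    "NvFuserScheduler_LayerNorm_fp16/16/2097152": 173.04190461790853,
    "NvFuserScheduler_Reduction_Outer_fp32/1024/8": 10.735423786849232,
}

improvements, regressions = rank_changes(baseline, contender)
for name, ratio in improvements + regressions:
    print(f"{name}: {ratio:.2f}x")  # prints 0.73x, then 1.09x
```

Keeping the per-reference result files separate is what makes the skip possible: a baseline measured once can be compared against several contenders without rerunning it.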