Developer guide

Jingyue Wu edited this page Mar 18, 2025 · 24 revisions

Build nvFuser

# Optionally, fetch a remote branch
$ git fetch origin <BRANCH_NAME>
$ git switch <BRANCH_NAME>

# Only needed initially or when submodules are updated. 
$ git submodule update --init --recursive

$ pip install -e .

See https://github.com/NVIDIA/Fuser/wiki/Building-fuser-project for a more detailed guide.

Test nvFuser

$ ./manual_ci.sh

Benchmark nvFuser

$ bin/nvfuser_bench [--benchmark_filter=<FILTER_REGEX>]

To run only the nvFuser-based benchmarks:

$ bin/nvfuser_bench --benchmark_filter=NvFuserScheduler
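As a rough illustration of how a filter regex narrows the benchmark list, the Python sketch below assumes the filter is applied as an unanchored regular-expression search over benchmark names (the usual Google Benchmark behavior; this page does not spell it out). `select_benchmarks` is a hypothetical helper, not part of nvFuser, and the `Baseline_...` name is made up for the example. Python's `re` syntax may also differ in details from the regex engine the binary uses.

```python
import re

# Benchmark names in the style of the example output on this page.
# The last entry is a hypothetical non-nvFuser benchmark, for contrast.
NAMES = [
    "NvFuserScheduler_LayerNorm_fp16/16/2097152/manual_time",
    "NvFuserScheduler_Reduction_Outer_fp32/1024/8/manual_time",
    "Baseline_LayerNorm_fp16/16/2097152/manual_time",
]


def select_benchmarks(names, filter_regex):
    """Keep the names in which the regex matches anywhere (assumed semantics)."""
    pattern = re.compile(filter_regex)
    return [name for name in names if pattern.search(name)]


# Selects only the names that contain "NvFuserScheduler".
for name in select_benchmarks(NAMES, "NvFuserScheduler"):
    print(name)
```

Under this assumption, a filter does not need to cover the whole name: `LayerNorm_fp16` would select both LayerNorm entries above, while `^LayerNorm` would select none because no name starts with it.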

To measure the performance impact of a change, compare a baseline and a contender branch or commit:

$ python tools/compare_benchmark.py <baseline_branch_or_commit> <contender_branch_or_commit> <out_dir> -- <args to bin/nvfuser_bench, e.g., --benchmark_filter=NvFuserScheduler>

This script builds and runs both the baseline and the contender, and compares the two results. It also skips the expensive benchmarking when <out_dir> already contains <baseline_branch_or_commit>.json or <contender_branch_or_commit>.json. Below is an example output.

Top 5 improvements:
  Benchmark NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/16/2097152/manual_time changed from 235.70649740466797 seconds 173.04190461790853 (0.73x)
  Benchmark NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/8/2097152/manual_time changed from 138.39395947220675 seconds 102.51789488799382 (0.74x)
  Benchmark NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/8/2097152/manual_time changed from 66.96629544563652 seconds 50.55804438076262 (0.75x)
  Benchmark NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/16/2097152/manual_time changed from 115.90680760401582 seconds 91.29062974831477 (0.79x)
  Benchmark NvFuserScheduler_LayerNorm_LargeHiddenSize_fp32___GRAPH/NvFuserScheduler_LayerNorm_LargeHiddenSize_fp32/8192/34816/manual_time changed from 1788.1167894834052 seconds 1567.6858208396222 (0.88x)

Top 5 regressions:
  Benchmark NvFuserScheduler_Reduction_Outer_fp32___GRAPH/NvFuserScheduler_Reduction_Outer_fp32/1024/8/manual_time changed from 9.868966872637673 seconds 10.735423786849232 (1.09x)
  Benchmark NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/1/320/manual_time changed from 5.658855892095212 seconds 6.134677312527026 (1.08x)
  Benchmark NvFuserScheduler_Reduction_Inner_fp16___GRAPH/NvFuserScheduler_Reduction_Inner_fp16/8/320/manual_time changed from 6.313002350175524 seconds 6.8305892441798015 (1.08x)
  Benchmark NvFuserScheduler_Reduction_Outer_fp16___GRAPH/NvFuserScheduler_Reduction_Outer_fp16/2/4096/manual_time changed from 6.264241068695716 seconds 6.77447848660655 (1.08x)
  Benchmark NvFuserScheduler_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp16/2/8/2/manual_time changed from 7.216345496261045 seconds 7.765639299028032 (1.08x)

Saved the histogram of time changes to /home/me/workspace/benchmark_test/histogram.png. 

$ open /home/me/workspace/benchmark_test/histogram.png

histogram.png
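The two behaviors described above, reusing cached results and ranking time changes, can be sketched in Python as follows. This is a minimal illustration and not the code of `tools/compare_benchmark.py`: the helper names `needs_benchmark`, `time_ratio`, and `rank_changes` are hypothetical, and the reading of the factor in parentheses as contender time divided by baseline time is inferred from the numbers in the example output (for instance, 173.04 / 235.71 is about 0.73).

```python
from pathlib import Path


def needs_benchmark(out_dir, ref):
    """Return True when <out_dir>/<ref>.json is missing (assumed cache layout)."""
    return not (Path(out_dir) / f"{ref}.json").exists()


def time_ratio(baseline_time, contender_time):
    """Contender time over baseline time; below 1.0 means the contender is faster."""
    return contender_time / baseline_time


def rank_changes(baseline, contender, top=5):
    """Return the `top` largest improvements and regressions as (name, ratio) pairs."""
    ratios = {
        name: time_ratio(baseline[name], contender[name])
        for name in baseline.keys() & contender.keys()
    }
    by_ratio = sorted(ratios.items(), key=lambda item: item[1])
    improvements = [item for item in by_ratio if item[1] < 1.0][:top]
    regressions = [item for item in reversed(by_ratio) if item[1] > 1.0][:top]
    return improvements, regressions


# Two entries copied from the example output above.
LAYER_NORM = (
    "NvFuserScheduler_LayerNorm_fp16___GRAPH/"
    "NvFuserScheduler_LayerNorm_fp16/16/2097152/manual_time"
)
REDUCTION = (
    "NvFuserScheduler_Reduction_Outer_fp32___GRAPH/"
    "NvFuserScheduler_Reduction_Outer_fp32/1024/8/manual_time"
)
BASELINE = {LAYER_NORM: 235.70649740466797, REDUCTION: 9.868966872637673}
CONTENDER = {LAYER_NORM: 173.04190461790853, REDUCTION: 10.735423786849232}

improvements, regressions = rank_changes(BASELINE, CONTENDER)
for name, ratio in improvements + regressions:
    print(f"{name}: {ratio:.2f}x")
```

With this reading, entries below 1x are listed as improvements and entries above 1x as regressions, which matches how the example output groups them.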
