PyBuda monitors many environment variables to modify default behavior. These can be used to debug or analyze problems.
- PYBUDA_BUILD_DIR: Override the build directory for the compiler.
- LOGURU_LEVEL: set Python logger level - default is DEBUG, valid values are INFO, DEBUG, TRACE, NONE
- LOGGER_LEVEL: set C++ logger level - values are the same as Python logger
- PYBUDA_DEVMODE: set to make Golden/Sequential default run mode if one isn't specified explicitly
- PYBUDA_PROFILE: enable Python profiler
- PYBUDA_ASSERT_UNSUPPORTED_HW_OP: assert if an unsupported op is found
- PYBUDA_BALANCER_PLACER_DATA: prints balancer/placer visual info, prints chip op group info
- PYBUDA_BALANCER_POLICY_TYPE: override balancer policy
- PYBUDA_SCHEDULER_POLICY: override scheduler policy
- PYBUDA_BALANCER_ONE_ROW: limit placement to one row
- PYBUDA_ENABLE_T_STREAMING: enable t-streaming (i.e. streaming ops with small output buffers)
- PYBUDA_ENABLE_TVM_CACHE: Cache tvm graphs instead of re-compiling
- PYBUDA_FORCE_FULL_COMPILE_DEPTH: Force each test to run to compile depth "FULL"
- PYBUDA_RELOAD_GENERATED_MODULES: Reload previously generated modules instead of recompiling through tvm.
- PYBUDA_SKIP_L1_USAGE_VALIDATION: allows ops to use more L1 than available
- PYBUDA_ENFORCE_SAME_UBLOCK_OPERANDS: ???
- PYBUDA_VERIFY_NET2PIPE: verify produced netlist using net2pipe
- PYBUDA_CI_DIR: ???
- PYTEST_CURRENT_TEST: ???
- PYBUDA_CI_CAPTURE_TENSORS: save tensors used in the test so they can be used in stand-alone back-end tests
- PYBUDA_FORCE_SEQUENTIAL: override test/script to run everything in sequential mode
- PYBUDA_TRACE_SHUTDOWN: show stack trace on shutdown due to error
- PYBUDA_OVERRIDE_NUM_CHIPS: force the number of chips to use, instead of the auto-detected number
- PYBUDA_DISABLE_DYNAMIC_DRAM: disable dynamic allocation of e2e queues in inference
- PYBUDA_DISABLE_AUTOMATIC_DRAM_LOGIC: disable automatic logic for static/dynamic allocation of queues.
- PYBUDA_DISABLE_FORK_JOIN_BUF: disable fork-join buffering
- PYBUDA_DISABLE_FORK_JOIN_NOPS: don't insert nops if there's not enough buffering. Just add what's available in L1. This should only be used for debug.
- PYBUDA_FORK_JOIN_DEBUG_INFO: print debug logs related to fork-join buffering
- PYBUDA_FORK_JOIN_DEBUG_FORK_NAME: filter debug logs (generated by PYBUDA_FORK_JOIN_DEBUG_INFO) by fork node name
- PYBUDA_FORK_JOIN_DEBUG_JOIN_NAME: filter debug logs (generated by PYBUDA_FORK_JOIN_DEBUG_INFO) by join node name
- PYBUDA_FORK_JOIN_SKIP_EXPANDING_BUFFERS: don't expand buffers in L1 - this will cause algorithm to add buffering nops/queues any time a fork-join needs to be buffered.
- PYBUDA_FORK_JOIN_EXPAND_OUTPUT_BUFFERS: expand only output buffers (instead of input buffers) for fork-join buffering
- SHOW_ALL_FAILS: don't assert on the first data mismatch, but show all fails before failing the test
- PYBUDA_VERIFY_RESULTS_OFF_BY_DEFAULT: disable result verification (tensor comparison of processed and golden module done via forward pass)
- PYBUDA_ENABLE_STABLE_SOFTMAX: enable stable Softmax (disabled by default)
- EVAL_DEBUG: prints inputs/outputs during module evaluation
- TT_BACKEND_GOLDEN_QUANTIZE: ???
- PYBUDA_RESET_DEV_BEFORE_TEST: resets device between tests (pytest must be called with --forked in order to work)
- PYBUDA_PERF_SIMULATOR: run performance simulator to estimate performance of the model
- PYBUDA_PERF_SIMULATOR_LOG: dump log of all events in perf simulator (will slow down the run)
- PYBUDA_PERF_SIMULATOR_TRACE: create trace file to be loaded into routeagui
- PYBUDA_OP_PERF: dump op_perf.csv file with op grid choices and estimated cycle counts
- PYBUDA_BENCHMARK_NO_RESET_ON_ERROR: according to the code comments, this flag may not work; it is a candidate for removal
- PYBUDA_SKIP_BACKEND_COMPILE: configure backend device to run in DeviceMode.RunOnly, picking up build binaries from previous run
- PYBUDA_PLACER_BWD_GROUPS: use bwd groups when placing so that fwd and bwd ops are placed together
- PYBUDA_TRIPLET_PLACEMENT: try to place bwd groups in "triplet" placement strategy
- PYBUDA_EXP_APPROX: force exp and exponent in gelu_derivative to run in approximate mode (i.e. faster, but less accurate)
- PYBUDA_AMP_LEVEL: configure the AMP (Automatic Mixed Precision) optimization level.
- PYBUDA_NO_FUSE_MATMUL_BIAS: disable fusing of matmul+add into matmul
- PYBUDA_ENABLE_OUTPUT_QUEUES_ON_HOST: configures whether output queues are placed on HOST (default: true)
- PYBUDA_FORCE_VERIFY_ALL: ensure that verification is run after each compile stage, overrides VerifyConfig.disabled()
- PYBUDA_VERIFY_POST_AUTOGRAD_PASSES: verify graph after post autograd passes, unless the verify config is VerifyConfig.disabled()
- PYBUDA_VERIFY_POST_PLACER: verify graph after post placer pass, unless the verify config is VerifyConfig.disabled()
- PYBUDA_GALAXY_LINEAR_ROUTE: place graphs sequentially in a snake route around the Galaxy modules
- PYBUDA_NEBULA_GALAXY_PLACER: only place output nop on mmio chip for untilizing
- PYBUDA_ENABLE_AUTO_TRANSPOSE: configures whether auto-transpose is enabled during op placement (default: false)
- PYBUDA_MINIMIZE_REMOTE_DRAM_QUEUES: configures behaviour for data forking to remote chips - create single e2e queue on producer or e2e queue per consumer chip (default)
- PYBUDA_SPARSE_MM_ENCODING_ESTIMATES_OFF: when on, turns off estimation logic for in0/in2 for sparse mm, but gets slower
- PYBUDA_REBLOCK_INPUT_ACT: when enabled, we reblock input activations to the smallest grid across all users instead of forcing 1x1. (default: disabled)
- PYBUDA_DUMP_MIXED_PRECISION: when on, dump json with a per-op info about fidelity, data-formats (default: off). Default directory: reportify dump directory.
- PYBUDA_PRESTRIDE_DISABLE: disables prestriding transform for convs
- PYBUDA_LEGALIZER_DETAILED_DEBUGGING: when on, provides detailed debugging information and statistics about the legalizer OpModel selection process, including GraphSolver. Works only in DEBUG. (default: off)
- PYBUDA_LEGALIZER_DEBUG_NODE_NAME: used together with legalizer detailed debugging to narrow down debugging info to a single node. Works only in DEBUG. (default: off)
- PYBUDA_GRAPHSOLVER_SELF_CUT_TYPE: Override for graph_solver_self_cut_type in BalancerConfig. Valid values: None, ConsumerOperandDataEdgesFirst, ProducerUserDataEdgesFirst, FastCut. When switched on (not None), graphsolver will cut edges for which it cannot produce valid paths. (default: None)
- PYBUDA_MAX_GRAPH_CUT_RETRY: Override for default_resolve_retry_count_self_cutting in GraphSolver::resolve. This sets the max retry step if GraphSolver self cut is turned on.
- PYBUDA_REPLACE_INF_IN_TVM_PARAMS: Replace -inf and inf values from TVM parameters during PyBuda code generation.
- PYBUDA_DISABLE_FUSE_TAGS: Specify a list of ops (comma delimited) by original_op_type/op_type that will be exempt from fusion (e.g. PYBUDA_DISABLE_FUSE_TAGS="reciprocal,softmax").
- PYBUDA_BISECT_FUSING: bool, false by default. When it is set to true, we bisect fusing by defining PYBUDA_FUSE_OP_FIRST_IND and PYBUDA_FUSE_OP_LAST_IND.
- PYBUDA_FUSE_OP_FIRST_IND: First index in topologically sorted graph of ops to fuse.
- PYBUDA_FUSE_OP_LAST_IND: Last index in topologically sorted graph of ops to fuse.
- PYBUDA_SINGLE_OP_EPOCHS: Place every single op on a new epoch.
- PYBUDA_FORK_JOIN_BUF_QUEUES: Turn on adding buffering queues instead of nops in fork joins that need a lot of buffering (have one path much larger than the other).
- PYBUDA_RESNET_BUFF_QUEUE_OVERRIDE: Turn off adding buffering queues in graph solver cut. Temporary fix for ResNet perf.
- PYBUDA_OVERRIDE_DEVICE_YAML: Override the soc device descriptor to compile against different device configurations.
- PYBUDA_DISABLE_INTERACTIVE_PLACER: Override balancer policy not to use Interactive placer and to fallback to legacy placer instead. (default: 0/False)
- PYBUDA_DISABLE_INTERACTIVE_FJ_BUFFERING: Override balancer policy not to use inlined fork-join buffering. (default: 0/False)
- PYBUDA_DISABLE_PADDING_PASS: Disable running of padding pass.
- PYBUDA_PADDING_PASS_ELEMENT_WISE: In padding pass pad elementwise ops.
- PYBUDA_PADDING_PASS_MATMUL: In padding pass pad matmul ops.
- PYBUDA_PADDING_PASS_SPARSE_MATMUL: In padding pass pad sparse matmul ops. Needs to have matmul ops enabled for padding too in order to enable this.
- PYBUDA_PADDING_PASS_BUFFER_QUEUE: Enable padding pass, insert buffer queue
- PYBUDA_ENABLE_STOCHASTIC_ROUNDING: Enable stochastic rounding for all supported ops.
- PYBUDA_PADDING_PASS_CONCAT: Enable padding pass, for concatenate operation
- PYBUDA_FORCE_CONV_MULTI_OP_FRACTURE: Forces all convs to be fractured (during decompose pass) according to heuristic defined in `pybuda/pybuda/op/eval/pybuda/convolution.py`.
- PYBUDA_COLLECT_CONSTRAINT_INFO: Enables constraint info collection on every graphsolver resolve.
- PYBUDA_GRAPHSOLVER_FAST: Enables partial re-resolve on cut and buffer, much faster at cost of not enabling all possible valid OpModels.
- NUM_EXEC_LOOP_ITERATIONS: For single temporal epoch tests, you can specify a # here that will rerun the epoch the specified # of times. Each rerun is initiated by FW rather than requiring host interaction, to improve performance.
- PYBUDA_PADDING_PASS_DISABLE_BUDA_OP: Disable padding logic that uses buda implementation for pad and unpad.
- PYBUDA_ENABLE_ETH_SERIALIZATION: Enable the ethernet stream reduction pass, using the ethernet datacopy op to implement the stream reduction
- PYBUDA_ENABLE_ETH_DATACOPY_SERIALIZATION: Enable the ethernet stream reduction pass, using the tensix datacopy/nop op to implement the stream reduction. Will only insert datacopy ops if there are free tensix cores
- PYBUDA_SUPRESS_T_FACTOR_MM: Enables a condition in calculate_op_model in legalizer that limits the t factor of sparse/dense matmul ops to be less than the flag's value. Valid values: any positive int value (eg. 16)
- PYBUDA_AMP_LIGHT: Enable a "light" version of mixed precision to minimize accuracy impact (default: 0/False; 1: bfp8/hifi2, 2: bfp4/hifi2, 3: bfp4/LoFi)
- PYBUDA_GRAPH_NAME_SUFFIX: Suffix to add to the graph name (helps to generate unique netlist names)
- PYBUDA_DISABLE_L1_ACCUMULATE: Flag for disabling and debugging the L1 accumulation feature.
- PYBUDA_OVERRIDE_VETO: Used to Add/Remove/Update general and env var based compiler configurations.
- PYBUDA_DISABLE_REPORTIFY_DUMP: Disable generating reportify graph.
- PYBUDA_DISABLE_CAP_SPARSE_MM_FIDELITY: Disables an optimization to cap the fidelity phases of sparse matmul to at most HiFi2.
- PYBUDA_DISABLE_EXPLICIT_DRAM_IO: Disables the FE from programming netlist attribute `input_dram_io_buf_size_tiles`. Instead the FE will leave this attribute as `0`, which implicitly means that the backend will handle the allocation of this buffer.
- PYBUDA_CONCAT_ON_HOST: Lower concatenate ops on output nodes into runtime transforms so that they're done on host.
- PYBUDA_OP_MODEL_COMPARE_VERSION: Version of op model comparison function. Can be used to compare effect of different comparison logic on performance.
- PYBUDA_RIBBON1_PREPASS_ENABLED: Whether to use the suboptimal opmodel invalidation prepass. Default value is False.
- PYBUDA_RIBBON2_CONSERVATIVE_OPTIMIZATION_ITERATIONS: Number of optimization iterations in Ribbon2 balancing policy per epoch. Default value is 10.
- PYBUDA_RIBBON2_DISABLE_CLEANUP_BUF_NOPS: Disable cleanup of unneeded buffering nops in Ribbon2. (default: 0/False)
- PYBUDA_RIBBON2_CALCULATE_TARGET_CYCLES: Calculate target cycles for every epoch within Ribbon2 balancing policy. (default: 0/False)
- PYBUDA_RIBBON2_CALCULATE_TARGET_CYCLES_APPLY_FILTERING: Apply filtering on GS search space while calculating dynamic cycles per epoch within Ribbon2 balancing policy. (default: 0/False)
- PYBUDA_RIBBON_LEGACY: Use legacy Ribbon balancing policy. (default: 0/False)
- PYBUDA_MAXIMIZE_GRID: Reverse logic of MinimizeGrid policy. Maximize grid size for all ops. (default: 0/False)
- PYBUDA_ENABLE_HOST_INPUT_NOP_BUFFERING: Enable nop buffering of input host read. (default: 0/False)
- PYBUDA_AUTO_RECOMPILE: Triggers handling of backend compile error and recompiles the model. (default: 1/True)
- PYBUDA_AUTO_RECOMPILE_TARGET_CYCLES: Enables adjustment of target cycles during recompile if no errors from backend have been previously handled. Requires PYBUDA_AUTO_RECOMPILE to be enabled to work. (default: 0/False)
- PYBUDA_AUTO_RECOMPILE_RETRY_LIMIT: Limits number of attempts to recompile. (default: 10)
- PYBUDA_TARGET_CYCLES_OFFSET: Sets the desired amount by which to offset the target cycles for balancer. Default value is 0.
- PYBUDA_ENABLE_VERSIM_DEVICE: Versim is a silicon simulation device that PyBUDA supports. Set this variable to 1 to make the PyBUDA pytest environment use the Versim device as the target device instead of the silicon or golden device. This can be useful for testing or experimentation, since it lets us evaluate the behavior of our code on the simulation device and the specified architecture. To run with Versim as the target device, the source code must be built with the UMD_VERSIM_STUB=0 environment variable.
- PYBUDA_VERSIM_DEVICE_ARCH: This env variable represents the architecture of the Versim device used in the pytest.
- PYBUDA_ENABLE_EMULATION_DEVICE: This device is a specific silicon emulation device that PyBUDA supports. The variable is used to enable emulation device in PyBUDA pytest environment. By setting this variable to 1, we are instructing PyBUDA to use the emulation device as the target device instead of the silicon or golden device. Enabling the emulation device can be useful for testing or experimentation purposes, allowing us to evaluate the behaviour of our code on this emulation device. In order to run emulation device as a targeted device, the source code must be built with EMULATION_DEVICE_EN=1 environment variable.
- PYBUDA_EMULATION_DEVICE_ARCH: This env variable represents the architecture of the emulation device used in the pytest.
- PYBUDA_DISABLE_DEPTHWISE_CONV2D_DECOMP: If set to 1, depthwise conv2d ops will not be decomposed using the depthwise op and instead use a matmul.
- PYBUDA_DISABLE_SINGLE_REDUCE_COMMUTE: If set to 1, reshapes will not be able to commute through single reduce operations, yet may still be allowed to commute through any back-to-back reduce ops. Enabling this may result in additional TMs remaining in the model.
- GOLDEN_WORMHOLE_B0: run Golden with Wormhole_B0 as target device instead of Grayskull (default)
- PYBUDA_GOLDEN_BLACKHOLE: run Golden with Blackhole as target device instead of Grayskull (default)
- PYBUDA_TEMP_ENABLE_NEW_SPARSE_ESTIMATES: Apply new formula to estimate the cycle count of sparse matmul ops (currently supports only LoFi and HiFi2 fidelities)
- PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS: Scale counts of non-zero tiles, ublocks and strips to reflect the numbers that would end up on a single core, since BBE estimates always assume grid_size [1,1].
- PYBUDA_TEMP_SPARSE_ESTIMATE_ARGS_PER_CORE: Instead of uniformly scaling sparse args (as happens in PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS), calculate them per core. To use this, PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS must also be set to 1.
- PYBUDA_TEMP_ELT_UNARY_ESTIMATES_LEGACY: Force legacy path of calculating execution cycles for eltwise unary ops - instead of calling into BBE, use hand-crafted FE-side logic
- PYBUDA_TEMP_ENABLE_NEW_FUSED_ESTIMATES: Apply new formula to estimate the cycle count of fused ops. The formula calls BBE to estimate each subop and sums up the results.
- PYBUDA_LEGACY_KERNEL_BROADCAST: Use legacy kernel broadcast detection path. Will detect fewer kernel broadcasts, and will oftentimes use more tiles (longer KBs).
- PYBUDA_TEMP_BALANCER_MODEL_PCIE_BW: Estimate PCIe bandwidth in limiter cycles. (default: 1/True)
- PYBUDA_TEMP_BALANCER_DISABLE_TARGET_PROXIMITY: Disable target proximity in balancer. (default: 0/False)
- PYBUDA_TEMP_DISABLE_FJ_NOP_SCHEDULE_FIX: This flag disables a fix that forces FJ buffering nops to be scheduled last.
- PYBUDA_TEMP_FIX_2351: Controls the fix for bug #2351 - fork-join can end up adding buffering nops and queues on same path, this control flag fixes it.
- PYBUDA_TEMP_RIBBON2_LEGACY_UTIL_EVAL: Use legacy util evaluation in Ribbon2 balancing policy. (default: 0/False)
- PYBUDA_TEMP_DISABLE_MODEL_KB_PROLOGUE_BW: Disables bandwidth modelling for kernel broadcasted and prologued inputs.
- PYBUDA_TEMP_ENABLE_SPARSE_MM_SERIALIZATION_FACTOR: If enabled, accounts for serialization in sparse matmuls due to sparse tensors starting at different `m_k`s across cores.
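
The variables listed above are ordinary process environment variables, so they can be set in the shell or from Python before PyBuda runs. The following is a minimal sketch of a helper that sets a few of them temporarily and restores the previous values afterwards. The helper name `pybuda_env` is hypothetical and not part of PyBuda. This document does not say when PyBuda reads each variable, so the sketch assumes the safest ordering: set the variables first, then import and run PyBuda inside the `with` block.

```python
import os
from contextlib import contextmanager


@contextmanager
def pybuda_env(**overrides):
    """Temporarily set environment variables (hypothetical helper, not a PyBuda API).

    Values are converted to strings, because os.environ only accepts strings.
    On exit, every touched variable is restored to its previous value,
    or removed if it was not set before.
    """
    saved = {name: os.environ.get(name) for name in overrides}
    try:
        for name, value in overrides.items():
            os.environ[name] = str(value)
        yield
    finally:
        for name, old_value in saved.items():
            if old_value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old_value


# Example: lower Python log verbosity and enable fork-join debug logs
# for the duration of one run.
with pybuda_env(LOGURU_LEVEL="INFO", PYBUDA_FORK_JOIN_DEBUG_INFO=1):
    # Import PyBuda and compile/run the model here (omitted in this sketch).
    print(os.environ["LOGURU_LEVEL"])  # prints "INFO"
```

Scoping the overrides to a `with` block keeps debug flags from leaking into later tests in the same process. If a variable turns out to be read only once at import or startup, changing it after that point would have no effect, which is why the sketch sets everything before PyBuda is touched.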