PyBuda monitors many environment variables to modify default behavior. These can be used to debug or analyze problems.
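
A minimal sketch of applying overrides from a Python script. It assumes (this is not stated in this document) that PyBuda reads these variables at import or compile time, so they are set beforehand. The helper `pybuda_overrides` and the build path are hypothetical; only the variable names come from the list below.

```python
import os
from contextlib import contextmanager

@contextmanager
def pybuda_overrides(**overrides):
    """Temporarily set environment variables, restoring prior values on exit."""
    # Remember what was there before so the caller's environment is not polluted.
    saved = {name: os.environ.get(name) for name in overrides}
    try:
        for name, value in overrides.items():
            os.environ[name] = str(value)  # environment values must be strings
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old

# "/tmp/pybuda_build" is a placeholder path chosen for illustration.
with pybuda_overrides(LOGURU_LEVEL="INFO", PYBUDA_BUILD_DIR="/tmp/pybuda_build"):
    # A real script would import pybuda and compile/run the model here.
    active_level = os.environ["LOGURU_LEVEL"]
```

Scoping the overrides with a context manager keeps one debug run from leaking settings into the next test in the same process.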

Overrides

PYBUDA_BUILD_DIR: Override the build directory for the compiler.
LOGURU_LEVEL: set Python logger level - default is DEBUG, valid values are INFO, DEBUG, TRACE, NONE
LOGGER_LEVEL: set C++ logger level - values are the same as Python logger
PYBUDA_DEVMODE: set to make Golden/Sequential default run mode if one isn't specified explicitly
PYBUDA_PROFILE: enable Python profiler
PYBUDA_ASSERT_UNSUPPORTED_HW_OP: assert if an unsupported op is found
PYBUDA_BALANCER_PLACER_DATA: prints balancer/placer visual info, prints chip op group info
PYBUDA_BALANCER_POLICY_TYPE: override balancer policy
PYBUDA_SCHEDULER_POLICY: override scheduler policy
PYBUDA_BALANCER_ONE_ROW: limit placement to one row
PYBUDA_ENABLE_T_STREAMING: enable t-streaming (i.e. streaming ops with small output buffers)
PYBUDA_ENABLE_TVM_CACHE: Cache tvm graphs instead of re-compiling
PYBUDA_FORCE_FULL_COMPILE_DEPTH: Force each test to run to compile depth "FULL"
PYBUDA_RELOAD_GENERATED_MODULES: Reload previously generated modules instead of recompiling through tvm.
PYBUDA_SKIP_L1_USAGE_VALIDATION: allows ops to use more L1 than available
PYBUDA_ENFORCE_SAME_UBLOCK_OPERANDS: ???
PYBUDA_VERIFY_NET2PIPE: verify produced netlist using net2pipe
PYBUDA_CI_DIR: ???
PYTEST_CURRENT_TEST: ???
PYBUDA_CI_CAPTURE_TENSORS: save tensors used in the test so they can be used in stand-alone back-end tests
PYBUDA_FORCE_SEQUENTIAL: override test/script to run everything in sequential mode
PYBUDA_TRACE_SHUTDOWN: show stack trace on shutdown due to error
PYBUDA_OVERRIDE_NUM_CHIPS: force the number of chips to use, instead of the auto-detected number
PYBUDA_DISABLE_DYNAMIC_DRAM: disable dynamic allocation of e2e queues in inference
PYBUDA_DISABLE_AUTOMATIC_DRAM_LOGIC: disable automatic logic for static/dynamic allocation of queues.
PYBUDA_DISABLE_FORK_JOIN_BUF: disable fork-join buffering
PYBUDA_DISABLE_FORK_JOIN_NOPS: don't insert nops if there's not enough buffering. Just add what's available in L1. This should only be used for debug.
PYBUDA_FORK_JOIN_DEBUG_INFO: print debug logs related to fork-join buffering
PYBUDA_FORK_JOIN_DEBUG_FORK_NAME: filter debug logs (generated by PYBUDA_FORK_JOIN_DEBUG_INFO) by fork node name
PYBUDA_FORK_JOIN_DEBUG_JOIN_NAME: filter debug logs (generated by PYBUDA_FORK_JOIN_DEBUG_INFO) by join node name
PYBUDA_FORK_JOIN_SKIP_EXPANDING_BUFFERS: don't expand buffers in L1 - this will cause algorithm to add buffering nops/queues any time a fork-join needs to be buffered.
PYBUDA_FORK_JOIN_EXPAND_OUTPUT_BUFFERS: expand only output buffers (instead of input buffers) for fork-join buffering
SHOW_ALL_FAILS: don't assert on the first data mismatch, but show all fails before failing the test
PYBUDA_EXP_APPROX: run exp in approximate mode
PYBUDA_VERIFY_RESULTS_OFF_BY_DEFAULT: disable result verification (tensor comparison of processed and golden module done via forward pass)
PYBUDA_ENABLE_STABLE_SOFTMAX: enable stable Softmax (disabled by default)
EVAL_DEBUG: prints inputs/outputs during module evaluation
TT_BACKEND_GOLDEN_QUANTIZE: ???
PYBUDA_RESET_DEV_BEFORE_TEST: resets device between tests (pytest must be called with --forked in order to work)
PYBUDA_PERF_SIMULATOR: run performance simulator to estimate performance of the model
PYBUDA_PERF_SIMULATOR_LOG: dump log of all events in perf simulator (will slow down the run)
PYBUDA_PERF_SIMULATOR_TRACE: create trace file to be loaded into routeagui
PYBUDA_OP_PERF: dump op_perf.csv file with op grid choices and estimated cycle counts
PYBUDA_BENCHMARK_NO_RESET_ON_ERROR: according to the code comments this does not appear to work; should it be removed?
PYBUDA_SKIP_BACKEND_COMPILE: configure backend device to run in DeviceMode.RunOnly, picking up build binaries from previous run
PYBUDA_PLACER_BWD_GROUPS: use bwd groups when placing so that fwd and bwd ops are placed together
PYBUDA_TRIPLET_PLACEMENT: try to place bwd groups in "triplet" placement strategy
PYBUDA_EXP_APPROX: force exp and exponent in gelu_derivative to run in approximate mode (i.e. faster, but less accurate)
PYBUDA_AMP_LEVEL: configure the AMP (Automatic Mixed Precision) optimization level.
PYBUDA_NO_FUSE_MATMUL_BIAS: disable fusing of matmul+add into matmul
PYBUDA_ENABLE_OUTPUT_QUEUES_ON_HOST: configures whether output queues are placed on HOST (default: true)
PYBUDA_FORCE_VERIFY_ALL: ensure that verification is run after each compile stage, overrides VerifyConfig.disabled()
PYBUDA_VERIFY_POST_AUTOGRAD_PASSES: verify graph after post autograd passes, unless the verify config is VerifyConfig.disabled()
PYBUDA_VERIFY_POST_PLACER: verify graph after post placer pass, unless the verify config is VerifyConfig.disabled()
PYBUDA_GALAXY_LINEAR_ROUTE: place graphs sequentially in a snake route around the Galaxy modules
PYBUDA_NEBULA_GALAXY_PLACER: only place output nop on mmio chip for untilizing
PYBUDA_ENABLE_AUTO_TRANSPOSE: configures whether auto-transpose is enabled while op placement (default: false)
PYBUDA_MINIMIZE_REMOTE_DRAM_QUEUES: configures behaviour for data forking to remote chips - create single e2e queue on producer or e2e queue per consumer chip (default)
PYBUDA_SPARSE_MM_ENCODING_ESTIMATES_OFF: when on, turns off estimation logic for in0/in2 for sparse mm, but gets slower
PYBUDA_REBLOCK_INPUT_ACT: when enabled, we reblock input activations to the smallest grid across all users instead of forcing 1x1. (default: disabled)
PYBUDA_DUMP_MIXED_PRECISION: when on, dump json with a per-op info about fidelity, data-formats (default: off). Default directory: reportify dump directory.
PYBUDA_PRESTRIDE_DISABLE: disables prestriding transform for convs
PYBUDA_LEGALIZER_DETAILED_DEBUGGING: when on, provides detailed debugging information and statistics about the legalizer OpModel selection process, including GraphSolver. Works only in DEBUG (default: off).
PYBUDA_LEGALIZER_DEBUG_NODE_NAME: used together with legalizer detailed debugging to narrow down debugging info to a single node. Works only in DEBUG (default: off).
PYBUDA_GRAPHSOLVER_SELF_CUT_TYPE: Override for graph_solver_self_cut_type in BalancerConfig. Valid values: None, ConsumerOperandDataEdgesFirst, ProducerUserDataEdgesFirst, FastCut. When switched on (not None), graphsolver will cut edges for which it cannot produce valid paths. (default: None)
PYBUDA_MAX_GRAPH_CUT_RETRY: Override for default_resolve_retry_count_self_cutting in GraphSolver::resolve. This sets the max retry step if GraphSolver self cut is turned on.
PYBUDA_REPLACE_INF_IN_TVM_PARAMS: Replace -inf and inf values from TVM parameters during PyBuda code generation.
PYBUDA_DISABLE_FUSE_TAGS: Specify a list of ops (comma delimited) by original_op_type/op_type that will be exempt from fusion (e.g. PYBUDA_DISABLE_FUSE_TAGS="reciprocal,softmax").
PYBUDA_BISECT_FUSING: bool, false by default. When it is set to true, we bisect fusing by defining PYBUDA_FUSE_OP_FIRST_IND and PYBUDA_FUSE_OP_LAST_IND.
PYBUDA_FUSE_OP_FIRST_IND: First index in topologically sorted graph of ops to fuse.
PYBUDA_FUSE_OP_LAST_IND: Last index in topologically sorted graph of ops to fuse.
PYBUDA_SINGLE_OP_EPOCHS: Place every single op on a new epoch.
PYBUDA_FORK_JOIN_BUF_QUEUES: Turn on adding buffering queues instead of nops in fork joins that need a lot of buffering (have one path much larger than the other).
PYBUDA_RESNET_BUFF_QUEUE_OVERRIDE: Turn off adding buffering queues in graph solver cut. Temporary fix for ResNet perf.
PYBUDA_OVERRIDE_DEVICE_YAML: Override the soc device descriptor to compile against different device configurations.
PYBUDA_DISABLE_INTERACTIVE_PLACER: Override balancer policy not to use Interactive placer and to fallback to legacy placer instead. (default: 0/False)
PYBUDA_DISABLE_INTERACTIVE_FJ_BUFFERING: Override balancer policy not to use inlined fork-join buffering. (default: 0/False)
PYBUDA_DISABLE_PADDING_PASS: Disable running of padding pass.
PYBUDA_PADDING_PASS_ELEMENT_WISE: In padding pass pad elementwise ops.
PYBUDA_PADDING_PASS_MATMUL: In padding pass pad matmul ops.
PYBUDA_PADDING_PASS_SPARSE_MATMUL: In padding pass pad sparse matmul ops. Needs to have matmul ops enabled for padding too in order to enable this.
PYBUDA_PADDING_PASS_BUFFER_QUEUE: Enable padding pass, insert buffer queue
PYBUDA_ENABLE_STOCHASTIC_ROUNDING: Enable stochastic rounding for all supported ops.
PYBUDA_PADDING_PASS_CONCAT: Enable padding pass, for concatenate operation
PYBUDA_FORCE_CONV_MULTI_OP_FRACTURE: Forces all convs to be fractured (during decompose pass) according to heuristic defined in pybuda/pybuda/op/eval/pybuda/convolution.py.
PYBUDA_COLLECT_CONSTRAINT_INFO: Enables constraint info collection on every graphsolver resolve.
PYBUDA_GRAPHSOLVER_FAST: Enables partial re-resolve on cut and buffer, much faster at cost of not enabling all possible valid OpModels.
NUM_EXEC_LOOP_ITERATIONS: For single temporal epoch tests, you can specify a # here that will rerun the epoch the specified # of times. Each rerun is initiated by FW rather than requiring host interaction, to improve performance.
PYBUDA_PADDING_PASS_DISABLE_BUDA_OP: Disable padding logic that uses buda implementation for pad and unpad.
PYBUDA_ENABLE_ETH_SERIALIZATION: Enable the ethernet stream reduction pass, using the ethernet datacopy op to implement the stream reduction
PYBUDA_ENABLE_ETH_DATACOPY_SERIALIZATION: Enable the ethernet stream reduction pass, using the tensix datacopy/nop op to implement the stream reduction. Will only insert datacopy ops if there are free tensix cores
PYBUDA_SUPRESS_T_FACTOR_MM: Enables a condition in calculate_op_model in legalizer that limits the t factor of sparse/dense matmul ops to be less than the flag's value. Valid values: any positive int value (e.g. 16)
PYBUDA_AMP_LIGHT: Enable a "light" version of mixed precision to minimize accuracy impact (default: 0/False; 1: bfp8/hifi2, 2: bfp4/hifi2, 3: bfp4/LoFi)
PYBUDA_GRAPH_NAME_SUFFIX: Suffix to add to the graph name (helps to generate unique netlist names)
PYBUDA_DISABLE_L1_ACCUMULATE: Flag for disabling and debugging the L1 accumulation feature.
PYBUDA_OVERRIDE_VETO: Used to Add/Remove/Update general and env var based compiler configurations.
PYBUDA_DISABLE_REPORTIFY_DUMP: Disable generating reportify graph.
PYBUDA_DISABLE_CAP_SPARSE_MM_FIDELITY: Disables an optimization to cap the fidelity phases of sparse matmul to at most HiFi2.
PYBUDA_DISABLE_EXPLICIT_DRAM_IO: Disables the FE from programming netlist attribute input_dram_io_buf_size_tiles. Instead the FE will leave this attribute as 0 which implicitly means that the backend will handle the allocation of this buffer.
PYBUDA_CONCAT_ON_HOST: Lower concatenate ops on output nodes into runtime transforms so that they're done on host.
PYBUDA_OP_MODEL_COMPARE_VERSION: Version of op model comparison function. Can be used to compare effect of different comparison logic on performance.
PYBUDA_RIBBON1_PREPASS_ENABLED: Whether to use the suboptimal opmodel invalidation prepass. Default value is False.
PYBUDA_RIBBON2_CONSERVATIVE_OPTIMIZATION_ITERATIONS: Number of optimization iterations in Ribbon2 balancing policy per epoch. Default value is 10.
PYBUDA_RIBBON2_DISABLE_CLEANUP_BUF_NOPS: Disable cleanup of unneeded buffering nops in Ribbon2. (default: 0/False)
PYBUDA_RIBBON2_CALCULATE_TARGET_CYCLES: Calculate target cycles for every epoch within Ribbon2 balancing policy. (default: 0/False)
PYBUDA_RIBBON2_CALCULATE_TARGET_CYCLES_APPLY_FILTERING: Apply filtering on GS search space while calculating dynamic cycles per epoch within Ribbon2 balancing policy. (default: 0/False)
PYBUDA_RIBBON_LEGACY: Use legacy Ribbon balancing policy. (default: 0/False)
PYBUDA_MAXIMIZE_GRID: Reverse logic of MinimizeGrid policy. Maximize grid size for all ops. (default: 0/False)
PYBUDA_ENABLE_HOST_INPUT_NOP_BUFFERING: Enable nop buffering of input host read. (default: 0/False)
PYBUDA_AUTO_RECOMPILE: Triggers handling of backend compile error and recompiles the model. (default: 1/True)
PYBUDA_AUTO_RECOMPILE_TARGET_CYCLES: Enables adjustment of target cycles during recompile if no errors from backend have been previously handled. Requires PYBUDA_AUTO_RECOMPILE to be enabled to work. (default: 0/False)
PYBUDA_AUTO_RECOMPILE_RETRY_LIMIT: Limits number of attempts to recompile. (default: 10)
PYBUDA_TARGET_CYCLES_OFFSET: Sets the desired amount by which to offset the target cycles for balancer. Default value is 0.
PYBUDA_ENABLE_VERSIM_DEVICE: The Versim device is a specific silicon simulation device that PyBUDA supports. The variable is used to enable the Versim device in the PyBUDA pytest environment. By setting this variable to 1, we are instructing PyBUDA to use the Versim device as the target device instead of the silicon or golden device. Enabling the Versim device can be useful for testing or experimentation purposes, allowing us to evaluate the behavior of our code on this simulation device and specified architecture. In order to run Versim device as a targeted device, the source code must be built with UMD_VERSIM_STUB=0 environment variable.
PYBUDA_VERSIM_DEVICE_ARCH: This env variable represents the architecture of the Versim device used in the pytest.
PYBUDA_ENABLE_EMULATION_DEVICE: This device is a specific silicon emulation device that PyBUDA supports. The variable is used to enable emulation device in PyBUDA pytest environment. By setting this variable to 1, we are instructing PyBUDA to use the emulation device as the target device instead of the silicon or golden device. Enabling the emulation device can be useful for testing or experimentation purposes, allowing us to evaluate the behaviour of our code on this emulation device. In order to run emulation device as a targeted device, the source code must be built with EMULATION_DEVICE_EN=1 environment variable.
PYBUDA_EMULATION_DEVICE_ARCH: This env variable represents the architecture of the emulation device used in the pytest.
PYBUDA_DISABLE_DEPTHWISE_CONV2D_DECOMP: If set to 1, depthwise conv2d ops will not be decomposed using the depthwise op and instead use a matmul.
PYBUDA_DISABLE_SINGLE_REDUCE_COMMUTE: If set to 1, reshapes will not be able to commute through single reduce operations, yet may still be allowed to commute through any back-to-back reduce ops. Enabling this may result in additional TMs remaining in the model.
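
The three fuse-bisection variables listed above (PYBUDA_BISECT_FUSING, PYBUDA_FUSE_OP_FIRST_IND, PYBUDA_FUSE_OP_LAST_IND) lend themselves to a scripted search. The sketch below is one possible driver, not tooling that ships with PyBuda. It assumes that the string "1" counts as true for PYBUDA_BISECT_FUSING, that the index range is inclusive, that the failure is triggered by fusing one particular op, and that `run_passes(env)` is a hypothetical callback that launches your test with the given environment and returns True when it passes.

```python
import os

def fuse_env(first, last, base=None):
    """Build an environment that restricts fusing to ops first..last."""
    env = dict(os.environ if base is None else base)
    env["PYBUDA_BISECT_FUSING"] = "1"  # assumption: "1" is read as true
    env["PYBUDA_FUSE_OP_FIRST_IND"] = str(first)
    env["PYBUDA_FUSE_OP_LAST_IND"] = str(last)
    return env

def find_first_bad_op(num_ops, run_passes):
    """Binary search for the smallest last index whose fuse range fails.

    Assumes fusing ops 0..num_ops-1 fails, and that a range fails exactly
    when it includes the offending op.
    """
    lo, hi = 0, num_ops - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if run_passes(fuse_env(0, mid)):
            lo = mid + 1   # ops 0..mid are fine, culprit is later
        else:
            hi = mid       # culprit is at or before mid
    return lo

# Stand-in runner for illustration: pretends that fusing op 11 breaks the test.
def fake_run(env):
    return int(env["PYBUDA_FUSE_OP_LAST_IND"]) < 11

culprit = find_first_bad_op(32, fake_run)
```

In practice `run_passes` would typically start the test in a subprocess with `env` so that each attempt gets a fresh compile.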

Golden overrides

GOLDEN_WORMHOLE_B0: run Golden with Wormhole_B0 as target device instead of Grayskull (default)
PYBUDA_GOLDEN_BLACKHOLE: run Golden with Blackhole as target device instead of Grayskull (default)
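
As a small illustration of the two Golden variables above, the sketch below picks a Golden target by name. The helper `select_golden_target`, the lowercase target names, and the value "1" are assumptions made for this example; this document only states which variable selects which device and that Grayskull is the default.

```python
import os

# Variable names are taken from the list above; the dictionary keys are
# labels invented for this example.
GOLDEN_TARGET_VARS = {
    "wormhole_b0": "GOLDEN_WORMHOLE_B0",
    "blackhole": "PYBUDA_GOLDEN_BLACKHOLE",
}

def select_golden_target(target, environ=os.environ):
    """Set at most one Golden override; 'grayskull' leaves both unset."""
    # Clear both first so two targets are never requested at once.
    for var in GOLDEN_TARGET_VARS.values():
        environ.pop(var, None)
    if target == "grayskull":
        return None
    var = GOLDEN_TARGET_VARS[target]  # raises KeyError for unknown targets
    environ[var] = "1"  # assumption: "1" enables the override
    return var
```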

Temp overrides

PYBUDA_TEMP_ENABLE_NEW_SPARSE_ESTIMATES: Apply new formula to estimate the cycle count of sparse matmul ops (currently only support LoFi and HiFi2 fidelities)
PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS: Scale counts of non-zero tiles, ublocks and strips to reflect the numbers that would end up on a single core, since BBE estimates always assume grid_size [1,1].
PYBUDA_TEMP_SPARSE_ESTIMATE_ARGS_PER_CORE: Instead of uniformly scaling sparse args (as happens in PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS), calculate them per core. To use this, PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS must be set to 1 as well.
PYBUDA_TEMP_ELT_UNARY_ESTIMATES_LEGACY: Force legacy path of calculating execution cycles for eltwise unary ops - instead of calling into BBE, use hand-crafted FE-side logic
PYBUDA_TEMP_ENABLE_NEW_FUSED_ESTIMATES: Apply new formula to estimate the cycle count of fused ops. The formula calls BBE to estimate each subop and sums up the results.
PYBUDA_LEGACY_KERNEL_BROADCAST: Use legacy kernel broadcast detection path. Will detect fewer kernel broadcasts, and will oftentimes use more tiles (longer KBs).
PYBUDA_TEMP_BALANCER_MODEL_PCIE_BW: Estimate PCIe bandwidth in limiter cycles. (default: 1/True)
PYBUDA_TEMP_BALANCER_DISABLE_TARGET_PROXIMITY: Disable target proximity in balancer. (default: 0/False)
PYBUDA_TEMP_DISABLE_FJ_NOP_SCHEDULE_FIX: This flag disables a fix that forces FJ buffering nops to be scheduled last.
PYBUDA_TEMP_FIX_2351: Controls the fix for bug #2351 - fork-join can end up adding buffering nops and queues on same path, this control flag fixes it.
PYBUDA_TEMP_RIBBON2_LEGACY_UTIL_EVAL: Use legacy util evaluation in Ribbon2 balancing policy. (default: 0/False)
PYBUDA_TEMP_DISABLE_MODEL_KB_PROLOGUE_BW: Disables bandwidth modelling for kernel broadcasted and prologued inputs.
PYBUDA_TEMP_ENABLE_SPARSE_MM_SERIALIZATION_FACTOR: If enabled, accounts for serialization in sparse matmuls due to sparse tensors starting at different m_ks across cores.
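
Several entries in this document only take effect when a companion variable is also set. The sketch below checks for those pairings before a run. The helper names and the rule that an unset, empty, or "0" value means "off" are assumptions for illustration; the pairings themselves are taken from the descriptions in this document.

```python
import os

# dependent variable -> variable it needs, per the descriptions in this document
REQUIRES = {
    "PYBUDA_TEMP_SPARSE_ESTIMATE_ARGS_PER_CORE": "PYBUDA_TEMP_SCALE_SPARSE_ESTIMATE_ARGS",
    "PYBUDA_PADDING_PASS_SPARSE_MATMUL": "PYBUDA_PADDING_PASS_MATMUL",
    "PYBUDA_FORK_JOIN_DEBUG_FORK_NAME": "PYBUDA_FORK_JOIN_DEBUG_INFO",
    "PYBUDA_FORK_JOIN_DEBUG_JOIN_NAME": "PYBUDA_FORK_JOIN_DEBUG_INFO",
    "PYBUDA_LEGALIZER_DEBUG_NODE_NAME": "PYBUDA_LEGALIZER_DETAILED_DEBUGGING",
}

def is_set(environ, name):
    """Assumption: unset, empty, or "0" all mean the flag is off."""
    return environ.get(name, "") not in ("", "0")

def missing_prerequisites(environ=os.environ):
    """Return (dependent, required) pairs where the required flag is off."""
    return [
        (dependent, required)
        for dependent, required in REQUIRES.items()
        if is_set(environ, dependent) and not is_set(environ, required)
    ]
```

Calling `missing_prerequisites()` at the top of a debug script and printing the result is a cheap way to catch a filter or sub-option that would otherwise be silently ignored.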
