[None][infra] Cherry-pick Perf Test Scripts #12057

chenfeiz0326 wants to merge 1 commit into NVIDIA:release/1.2 from
Conversation
📝 Walkthrough

Introduces a comprehensive performance sanity testing infrastructure refactor: local and disaggregated SLURM-based test launching, OpenSearch integration for historical baseline retrieval, HTML/SVG report generation for pre- and post-merge analysis, enhanced test configuration management with split-group support, and new error-detection utilities across Jenkins, Python scripts, Bash workflows, and YAML configurations.

Changes
Sequence Diagram

sequenceDiagram
participant Jenkins as Jenkins Pipeline
participant LocalSubmit as Local Submit.py
participant SLURM as SLURM Cluster
participant InstallScript as Install Script
participant TestRunner as Test Runner
participant OpenSearch as OpenSearch DB
participant Reporter as Report Generator
Jenkins->>LocalSubmit: Parse test config & hardware
LocalSubmit->>LocalSubmit: Generate SLURM launch script
LocalSubmit->>SLURM: sbatch slurm_launch.sh
SLURM->>InstallScript: Execute on compute nodes
InstallScript->>InstallScript: Build wheel (coordinated)
InstallScript->>InstallScript: Install dependencies
SLURM->>TestRunner: Execute pytest commands
TestRunner->>TestRunner: Run performance benchmarks
TestRunner->>OpenSearch: Post new perf data
Reporter->>OpenSearch: Query history & baseline data
OpenSearch-->>Reporter: Return perf data
Reporter->>Reporter: Generate HTML/SVG reports
Reporter-->>Jenkins: Return report artifacts
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: 70+ affected files spanning heterogeneous change types (Jenkins config, Python modules with dense logic, Bash scripts, YAML configurations, test infrastructure). Key files like perf_utils.py (1620 lines), submit.py (724 lines), and test_perf_sanity.py (395 lines) contain substantial new logic for OpenSearch integration, report generation, and test orchestration. Structural changes to test infrastructure and new multi-component workflows require careful reasoning. Configuration files (YAML) reduce effort per file but add breadth. Not purely refactoring: introduces new functionality across multiple subsystems.

Possibly related PRs
Suggested reviewers
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/scripts/perf-sanity/aggregated/config_database_b200_nvl.yaml (2)
623-648: ⚠️ Potential issue | 🟠 Major

Use `CUTLASS` for this FP4 v2 8k1k ADP/MTP1 path.

This block is otherwise the DEP/MTP1 profile (`enable_attention_dp: true` and `num_nextn_predict_layers: 1`), but `moe_config.backend` is still `TRTLLM`. The sibling FP4 v2 DEP entries use `CUTLASS`, so this one will measure a different backend.

Suggested fix:

      moe_config:
    -   backend: TRTLLM
    +   backend: CUTLASS
      attention_dp_config:
        batching_wait_iters: 0
        enable_balance: true

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/aggregated/config_database_b200_nvl.yaml` around lines 623 - 648, The moe backend for the FP4 v2 ADP/MTP1 entry (name: nvidia_DeepSeek_R1_0528_FP4_v2_8192_1024_conc512_gpu8, model_name: deepseek_r1_0528_fp4_v2) is set to TRTLLM but should use CUTLASS to match sibling FP4 v2 DEP profiles; update the moe_config.backend value from TRTLLM to CUTLASS in this YAML block so the test measures the correct backend.
383-407: ⚠️ Potential issue | 🟠 Major

Add the missing MTP block to this FP4 v2 ADP profile.

This is the only touched FP4 v2 CUTLASS/attention-DP scenario here without `speculative_config`. The sibling DEP-style variants all pin `decoding_type: MTP` with one predictor layer, so this entry will benchmark a different decode path than the rest.

Suggested fix:

      attention_dp_config:
        batching_wait_iters: 0
        enable_balance: true
        timeout_iters: 60
    + speculative_config:
    +   decoding_type: MTP
    +   num_nextn_predict_layers: 1
      tensor_parallel_size: 8
      moe_expert_parallel_size: 8

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/aggregated/config_database_b200_nvl.yaml` around lines 383 - 407, The FP4 v2 ADP profile block named "nvidia_DeepSeek_R1_0528_FP4_v2_1024_1024_conc2048_gpu8" is missing the MTP speculative config; add a speculative_config section to that profile with decoding_type set to "MTP" and predictor_layers set to 1 (matching sibling DEP-style variants) so this CUTLASS/attention-DP scenario uses the same MTP decode path as the others; place the speculative_config under the profile’s top-level keys (alongside kv_cache_config, moe_config, attention_dp_config, etc.) using the key names speculative_config, decoding_type, and predictor_layers to locate and update the block.

tests/scripts/perf-sanity/aggregated/config_database_h200_sxm.yaml (1)
820-831: ⚠️ Potential issue | 🟠 Major

Add the missing `attention_dp_config` to this GPT-OSS ADP case.

This is the only touched GPT-OSS entry here with `enable_attention_dp: true` but no accompanying `attention_dp_config`. The other ADP profiles explicitly set `enable_balance: true`, so this scenario will run with different scheduling behavior.

Suggested fix:

      enable_attention_dp: true
    + attention_dp_config:
    +   enable_balance: true
      print_iter_log: true

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/aggregated/config_database_h200_sxm.yaml` around lines 820 - 831, The GPT-OSS ADP profile named openai_gpt_oss_120b_8192_1024_conc16_gpu2 (model_name: gpt_oss_120b_fp4) sets enable_attention_dp: true but is missing an attention_dp_config; add an attention_dp_config block mirroring the other ADP entries (including enable_balance: true) so scheduling behavior matches the other profiles—locate the YAML entry by the name or model_name and insert attention_dp_config with enable_balance: true and the same subfields used by the other GPT-OSS ADP cases.

jenkins/scripts/perf/disaggregated/submit.py (1)
331-357: ⚠️ Potential issue | 🟠 Major

This change removes the advertised local-mode path.

`main()` now always resolves `config_yaml` from `--test-list` and then unconditionally opens `args.script_prefix` / `args.srun_args`. That makes `--config-yaml` and `--run-ci=false` unusable and breaks local invocations with the current CLI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/disaggregated/submit.py` around lines 331 - 357, main() now unconditionally resolves config_yaml via parse_test_case_name(args.test_list, ...) and always opens args.script_prefix and args.srun_args, which breaks the --config-yaml and --run-ci local mode; restore conditional handling so local invocations work. Fix by reverting to the prior branching: if args.run_ci (or when --test-list is provided) call parse_test_case_name and open script_prefix/srun_args, otherwise use args.config_yaml and skip reading args.script_prefix/args.srun_args for local mode; update the code paths around parse_test_case_name, config_yaml assignment, and the blocks that open args.script_prefix and args.srun_args to only run when CI/test-list mode is active.
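A minimal sketch of the branching this comment asks to restore. The flag names mirror those mentioned above (`--run-ci`, `--test-list`, `--config-yaml`), but the parser and the `resolve_config` helper are hypothetical illustrations, not the repo's actual CLI:

```python
import argparse

def build_parser():
    # Hypothetical flags mirroring the ones named in the comment above.
    p = argparse.ArgumentParser()
    p.add_argument("--run-ci", default="true", choices=["true", "false"])
    p.add_argument("--test-list", default=None)
    p.add_argument("--config-yaml", default=None)
    return p

def resolve_config(args):
    # CI mode: derive everything from --test-list; local mode: take
    # --config-yaml verbatim and never touch the CI-only inputs.
    if args.run_ci == "true":
        if args.test_list is None:
            raise SystemExit("--test-list is required when --run-ci=true")
        return ("ci", args.test_list)
    if args.config_yaml is None:
        raise SystemExit("--config-yaml is required when --run-ci=false")
    return ("local", args.config_yaml)
```

The key design point is that each mode validates only its own inputs, so a local invocation never needs the CI-only files to exist.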
🟠 Major comments (15)
tests/scripts/perf-sanity/aggregated/deepseek_r1_fp8_blackwell.yaml-140-166 (1)
140-166: ⚠️ Potential issue | 🟠 Major

Add the missing MTP1 config to the 6k1k case.

This entry is labeled as MTP1, but unlike the 1k1k and 8k1k variants it never sets `speculative_config`. As written, the 6k1k run will exercise a different decode path than the scenario name/comment implies, which makes the perf result hard to compare against the other MTP cases.

Proposed fix:

    - # 6k1k configs - TP8 with TRTLLM, MTP1
    - - name: "r1_fp8_tp8_6k1k"
    + # 6k1k configs - TP8 with TRTLLM, MTP1
    + - name: "r1_fp8_tp8_mtp1_6k1k"
        model_name: "deepseek_r1_0528_fp8"
        tensor_parallel_size: 8
        moe_expert_parallel_size: 1
        pipeline_parallel_size: 1
        max_batch_size: 512
        max_num_tokens: 8192
        attn_backend: "TRTLLM"
        enable_attention_dp: false
        moe_config:
          backend: 'TRTLLM'
        cuda_graph_config:
          enable_padding: true
          max_batch_size: 64
        kv_cache_config:
          dtype: 'fp8'
          enable_block_reuse: false
          free_gpu_memory_fraction: 0.8
    +   speculative_config:
    +     decoding_type: 'MTP'
    +     num_nextn_predict_layers: 1
        client_configs:
          - name: "con64_iter10_6k1k"
            concurrency: 64
            iterations: 10
            isl: 6144
            osl: 1024
            backend: "openai"
            random_range_ratio: 0.2

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/aggregated/deepseek_r1_fp8_blackwell.yaml` around lines 140 - 166, The 6k1k config "r1_fp8_tp8_6k1k" is missing the MTP1 speculative_config, causing it to use a different decode path; add a speculative_config block to the "r1_fp8_tp8_6k1k" entry (same location as the other MTP1 variants) with the MTP1 settings (e.g., mode: "MTP1" and the same speculative parameters used by the 1k1k/8k1k MTP1 cases) so the client_configs (e.g., "con64_iter10_6k1k") exercises the intended MTP1 decode path.

tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml-71-97 (1)
71-97: ⚠️ Potential issue | 🟠 Major

Remove the YAML alias to avoid shared mutable state between generator and context worker configs.

Lines 71–97 currently share a single `speculative_config` mapping via a YAML anchor/alias. When loaded, `yaml.safe_load()` materializes this as one shared Python dict object assigned to both `gen` and `ctx`. The `validate_speculative_config()` method in `TorchLlmArgs` (line 3057) mutates this object in place for `MTPDecodingConfig`:

    self.speculative_config.max_draft_len = self.speculative_config.num_nextn_predict_layers

In disaggregated systems, the two worker roles are independent processes initialized separately. Cross-role mutation of a shared config object is a correctness hazard. Duplicate the small mapping instead.

Proposed fix:

    - speculative_config: &id001
    + speculative_config:
        decoding_type: MTP
        num_nextn_predict_layers: 3
      @@
    - speculative_config: *id001
    + speculative_config:
    +   decoding_type: MTP
    +   num_nextn_predict_layers: 3

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml` around lines 71 - 97, The YAML uses a shared alias (&id001 / *id001) for speculative_config causing a single Python dict to be shared between the generator and context worker; change the ctx.speculative_config alias into an explicit duplicate mapping (copy the keys: decoding_type: MTP and num_nextn_predict_layers: 3) instead of referencing *id001 so yaml.safe_load produces two independent dicts and TorchLlmArgs.validate_speculative_config (which mutates self.speculative_config) won't mutate the other role's config.

tests/scripts/perf-sanity/disaggregated/gb300_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml-11-12 (1)
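The shared-alias hazard flagged for the gen/ctx `speculative_config` above can be reproduced in isolation. This sketch assumes PyYAML (`yaml.safe_load`); the document keys are abbreviated for illustration:

```python
import yaml  # PyYAML

doc = """
gen:
  speculative_config: &spec
    decoding_type: MTP
    num_nextn_predict_layers: 3
ctx:
  speculative_config: *spec
"""

cfg = yaml.safe_load(doc)
# The alias materializes as ONE dict object referenced from both roles.
assert cfg["gen"]["speculative_config"] is cfg["ctx"]["speculative_config"]

# An in-place mutation (as validate_speculative_config does) leaks across roles.
cfg["gen"]["speculative_config"]["max_draft_len"] = 3
assert cfg["ctx"]["speculative_config"]["max_draft_len"] == 3
```

Duplicating the mapping in the YAML (or deep-copying after load) breaks the shared reference.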
11-12: ⚠️ Potential issue | 🟠 Major

Confirm these template placeholders are resolved before this config is launched.

If this YAML is consumed as-is, `<partition>`, `<account>`, `<container_mount>`, `<container_image>`, `<model_path>`, and `<full_path_to_work_dir>` will break job submission or container startup immediately. The disaggregated submission pipeline (disaggregated/submit.py) loads this config with `yaml.safe_load()` and directly exports these values to the environment without substitution. Please verify how these placeholders are replaced in your CI or local submission workflow before this config is selected by tests.

Also applies to: 32-37
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/disaggregated/gb300_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml` around lines 11 - 12, The YAML contains unresolved template placeholders (<partition>, <account>, <container_mount>, <container_image>, <model_path>, <full_path_to_work_dir>) that will break job submission because disaggregated/submit.py loads the file with yaml.safe_load() and exports values to the environment; update this config (gb300_deepseek-..._ccb-UCX.yaml) to use concrete values or CI/template variables that your submission workflow replaces, or implement a runtime check in disaggregated/submit.py to validate and fail fast if any placeholder pattern like "<...>" remains (so you can surface missing substitutions before container startup).

tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con256_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml-86-103 (1)
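The fail-fast placeholder check suggested for the gb300 config above could look like the following sketch; the `find_unresolved` helper and the `<lowercase_name>` pattern are assumptions for illustration, not existing code:

```python
import re

PLACEHOLDER = re.compile(r"<[a-z_]+>")

def find_unresolved(obj, path="root"):
    # Walk a loaded YAML tree and report any string still containing a
    # template placeholder like <partition> or <model_path>.
    hits = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            hits += find_unresolved(v, f"{path}.{k}")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            hits += find_unresolved(v, f"{path}[{i}]")
    elif isinstance(obj, str) and PLACEHOLDER.search(obj):
        hits.append(path)
    return hits

cfg = {"slurm": {"partition": "<partition>", "account": "batch"},
       "mounts": ["<container_mount>", "/data:/data"]}
assert find_unresolved(cfg) == ["root.slurm.partition", "root.mounts[0]"]
```

Running such a check right after `yaml.safe_load()` surfaces missing substitutions before any sbatch or container command is issued.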
86-103: ⚠️ Potential issue | 🟠 Major

Align the ctx transceiver buffer with the configured max token budget.

Line 86 allows 32,784 ctx tokens, but Line 102 caps the UCX buffer at 32,768. A max-sized prompt can overflow this limit on the boundary case.

Suggested fix:

      ctx:
        print_iter_log: true
        max_batch_size: 1
        max_num_tokens: 32784
      @@
        cache_transceiver_config:
    -     max_tokens_in_buffer: 32768
    +     max_tokens_in_buffer: 32784
          backend: UCX

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con256_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml` around lines 86 - 103, The configured max_num_tokens (32784) can exceed cache_transceiver_config.max_tokens_in_buffer (32768) causing boundary overflows; update the cache_transceiver_config.max_tokens_in_buffer to be at least equal to max_num_tokens or reduce max_num_tokens to match—modify the YAML so the fields max_num_tokens and cache_transceiver_config.max_tokens_in_buffer are aligned (e.g., set cache_transceiver_config.max_tokens_in_buffer: 32784 or lower max_num_tokens) to prevent overflow.

jenkins/scripts/perf/local/slurm_run.sh-5-24 (1)
5-24: ⚠️ Potential issue | 🟠 Major

`ERR` trap still aborts before the pytest exit code is captured.

`set +e` on line 19 disables `errexit`, but the ERR trap from line 5 remains active. When `eval $pytestCommand` fails on line 23, it triggers the trap, which explicitly calls `exit $rc`, aborting the script before line 24 can capture the exit code. This defeats the intended behavior stated in the comment.

Use a conditional instead to avoid triggering the trap:

    -# Turn off "exit on error" so the following lines always run
    -set +e
    -
     pytest_exit_code=0
    -
    -eval $pytestCommand
    -pytest_exit_code=$?
    +if eval "$pytestCommand"; then
    +  pytest_exit_code=0
    +else
    +  pytest_exit_code=$?
    +fi
     echo "Rank${SLURM_PROCID} Pytest finished execution with exit code $pytest_exit_code"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/slurm_run.sh` around lines 5 - 24, The ERR trap installed by the trap command still forces an exit when pytest fails, so capture pytest_exit_code by temporarily disabling or restoring the ERR trap around the eval $pytestCommand call: save the current ERR trap (from the trap '...' ERR), disable it (trap - ERR), run eval $pytestCommand and assign pytest_exit_code=$?, then restore the saved ERR trap; reference the existing trap, set +e, eval $pytestCommand and pytest_exit_code to locate where to add the save/disable/restore logic.

jenkins/scripts/perf/local/slurm_install.sh-23-35 (1)
23-35: ⚠️ Potential issue | 🟠 Major

Propagate leader failures to the waiting ranks.

Both follower branches wait forever for a lock file that is only written on success. When `retry_command` exhausts its retries on the leader, the other ranks spin until the SLURM job times out.
retry_commandexhausts on the leader, the other ranks spin until the SLURM job times out.Also applies to: 47-77
tests/test_common/http_utils.py-26-36 (1)
26-36: ⚠️ Potential issue | 🟠 Major

`check_files` won't run for most timeouts.

Line 27 only checks files every 300 loop iterations. Because the loop sleeps one second, callers with `timeout < 300` never execute this branch, and the default `timeout=300` hits it only once near expiry. That means log-detected startup failures still wait out the full timeout unless `server_proc` is also provided.

Suggested fix:

    - if check_files and iteration % 300 == 0:
    + if check_files and (iteration == 1 or iteration % 10 == 0):

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_common/http_utils.py` around lines 26 - 36, The check_files block only runs every 300 iterations so callers with timeout < 300 never get log-based failures; change the check frequency to use a dynamic interval based on timeout (e.g. compute check_interval = min(300, timeout) or similar) and replace the hard-coded 300 in the condition so check_error(check_file) runs at appropriate intervals (and still periodically for long timeouts); update the logic around iteration and the condition that currently reads "if check_files and iteration % 300 == 0" to use the new check_interval variable to ensure early detection for small timeouts while keeping periodic checks for long runs.

tests/test_common/error_utils.py-52-58 (1)
52-58: ⚠️ Potential issue | 🟠 Major

`report_error()` never scans the file contents for keywords.

Line 53 reads the whole file into `all_lines`, then Line 54 iterates the already-exhausted handle `f`. `error_lines` stays empty, so keyword matches are never reported and callers only get the generic tail output.

Suggested fix:

     try:
         with open(log_file, "r", errors="replace") as f:
             all_lines = f.readlines()
    -        for line_idx, line in enumerate(f, start=1):
    -            for keyword in ERROR_KEYWORDS:
    -                if keyword in line:
    -                    error_lines.append((line_idx, line.strip()))
    -                    break
    +        for line_idx, line in enumerate(all_lines, start=1):
    +            for keyword in ERROR_KEYWORDS:
    +                if keyword in line:
    +                    error_lines.append((line_idx, line.strip()))
    +                    break

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_common/error_utils.py` around lines 52 - 58, The loop in report_error reads the file into all_lines then iterates the exhausted file handle f, so error_lines never gets any matches; change the iteration to enumerate(all_lines, start=1) (or seek the file and iterate f again) and keep the keyword check against ERROR_KEYWORDS so report_error properly appends matching (line_idx, line.strip()) entries to error_lines for the given log_file.

jenkins/scripts/perf/local/run_agg_local.sh-7-14 (1)
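The exhausted-handle bug called out for `report_error()` above is easy to demonstrate with an in-memory stream: once `readlines()` consumes it, iterating the handle again yields nothing, while iterating the captured list finds the matches:

```python
import io

f = io.StringIO("ERROR: boom\nok line\n")
all_lines = f.readlines()   # consumes the stream...
leftover = list(f)          # ...so iterating the handle again yields nothing
assert all_lines == ["ERROR: boom\n", "ok line\n"]
assert leftover == []

# Iterating the captured list (or calling f.seek(0) first) restores the matches:
hits = [(i, line.strip()) for i, line in enumerate(all_lines, start=1)
        if "ERROR" in line]
assert hits == [(1, "ERROR: boom")]
```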
7-14: ⚠️ Potential issue | 🟠 Major

Don't commit developer-specific cluster settings.
These absolute repo/model paths, mounts, image tag, and account are tied to one user and one cluster environment, so the script is not reusable for other developers or automation. Please read them from arguments or environment variables instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/run_agg_local.sh` around lines 7 - 14, Replace hard-coded, developer-specific constants (REPO_ROOT, LOCAL_DIR, IMAGE, MOUNTS, LLM_MODELS_ROOT, PARTITION, ACCOUNT, JOB_NAME) with values read from environment variables or command-line arguments and provide sensible defaults; update the script entrypoint to parse args (e.g., via getopts) and fall back to ${ENV_VAR:-default} for each symbol so CI and other users can override paths, mounts, image tag, partition/account, and job name without committing local settings.

jenkins/scripts/perf/local/run_e2e_local.sh-7-14 (1)
7-14: ⚠️ Potential issue | 🟠 Major

Don't commit developer-specific cluster settings.
These absolute repo/model paths, mounts, image tag, and account are tied to one user and one cluster environment, so the script is not reusable for other developers or automation. Please read them from arguments or environment variables instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/run_e2e_local.sh` around lines 7 - 14, The script hardcodes developer- and cluster-specific settings (REPO_ROOT, LOCAL_DIR, IMAGE, MOUNTS, LLM_MODELS_ROOT, PARTITION, ACCOUNT, JOB_NAME); change these to be read from environment variables or script arguments with sensible defaults so others can override them. Replace the literal assignments for REPO_ROOT, LOCAL_DIR, IMAGE, MOUNTS, LLM_MODELS_ROOT, PARTITION, ACCOUNT and JOB_NAME with parameter parsing (e.g., using getopts or positional args) and fallbacks to ENV vars (e.g., ${REPO_ROOT:-/default/path}) and validate required values early, emitting a clear error and usage help if missing. Ensure any user-specific defaults are removed and documented in the script header so CI and other developers can supply their own values.

jenkins/scripts/perf/local/run_gen_only_local.sh-7-14 (1)
7-14: ⚠️ Potential issue | 🟠 Major

Don't commit developer-specific cluster settings.
These absolute repo/model paths, mounts, image tag, and account are tied to one user and one cluster environment, so the script is not reusable for other developers or automation. Please read them from arguments or environment variables instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/run_gen_only_local.sh` around lines 7 - 14, The script hardcodes developer/cluster-specific variables (REPO_ROOT, LOCAL_DIR, IMAGE, MOUNTS, LLM_MODELS_ROOT, PARTITION, ACCOUNT, JOB_NAME); change these to be read from environment variables or positional/flagged script arguments with sensible defaults and a clear usage message, e.g., respect $REPO_ROOT, $IMAGE, $MOUNTS, $LLM_MODELS_ROOT, $PARTITION, $ACCOUNT, $JOB_NAME or accept --repo-root/--image/--mounts etc., validate required values early and exit with an error if missing, and set LOCAL_DIR relative to REPO_ROOT when not explicitly provided so the script is reusable across users and CI.

jenkins/scripts/perf/local/submit.py-1-7 (1)
1-7: ⚠️ Potential issue | 🟠 Major

Add the required NVIDIA Apache 2.0 file header.
This new Python source file is missing the standard TensorRT-LLM copyright/license block.
As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification. The header should be an Apache 2.0 license block as specified".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/submit.py` around lines 1 - 7, Add the required NVIDIA Apache 2.0 file header to the top of jenkins/scripts/perf/local/submit.py: insert the standard TensorRT-LLM copyright/license block (Apache License, Version 2.0) including the correct year of latest meaningful modification and "NVIDIA CORPORATION" as the copyright holder, placed above the existing shebang (#!/usr/bin/env python3) so the file begins with the full header comment followed by the shebang and imports.

jenkins/scripts/perf/local/submit.py-326-340 (1)
326-340: ⚠️ Potential issue | 🟠 Major

Don't introduce a second parametrized test-ID format.

The local runner scripts in this PR pass `aggr_upload-*` / `disagg_upload-*` names, but this helper rewrites them to `aggr-*` / `disagg-*`. Since `pytest --test-list` filters by parametrized id, that drift can select the wrong case or no case at all.

Suggested fix:

     if runtime_mode == "disaggregated":
         # disagg_upload-{e2e|gen_only}-{config_base}
         test_list_content = (
    -        f"perf/test_perf_sanity.py::test_e2e[disagg-{benchmark_mode}-{config_file_base_name}]"
    +        f"perf/test_perf_sanity.py::test_e2e[disagg_upload-{benchmark_mode}-{config_file_base_name}]"
         )
     elif benchmark_mode == "ctx_only":
         # aggr_upload-ctx_only-{config_base}
         test_list_content = (
    -        f"perf/test_perf_sanity.py::test_e2e[aggr-ctx_only-{config_file_base_name}]"
    +        f"perf/test_perf_sanity.py::test_e2e[aggr_upload-ctx_only-{config_file_base_name}]"
         )
     else:
         # Normal aggr: aggr-{config}-{select_pattern}
         test_list_content = (
    -        f"perf/test_perf_sanity.py::test_e2e[aggr-{config_file_base_name}-{select_pattern}]"
    +        f"perf/test_perf_sanity.py::test_e2e[aggr_upload-{config_file_base_name}-{select_pattern}]"
         )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/submit.py` around lines 326 - 340, The code rewrites incoming parametrized test IDs (e.g., converts aggr_upload-/disagg_upload- to aggr-/disagg-), which breaks pytest --test-list filtering; instead use the original benchmark_mode (and any upload_* prefixes) when building test_list_content. Update the branches that set test_list_content to include benchmark_mode verbatim (alongside runtime_mode/config_file_base_name/select_pattern as appropriate) rather than substituting fixed strings like "disagg-" or "aggr-", so the variable test_list_content matches the exact parametrized IDs pytest expects.

jenkins/scripts/perf/local/submit.py-261-272 (1)
261-272: ⚠️ Potential issue | 🟠 Major

Keep the YAML benchmark mode in `bm_config`.

`benchmark_mode` only carries the CLI/test-list mode (`e2e`, `gen_only`, `ctx_only`). Returning it here drops `gen_only_no_context`, so the later env setup can never export `TRTLLM_DISAGG_BENCHMARK_GEN_ONLY=1` for no-context runs.

Suggested fix:

     def get_benchmark_config(config, benchmark_mode):
         """Get benchmark config based on mode."""
         if benchmark_mode is None:
             return {}
         benchmark = config.get("benchmark", {})
    +    benchmark_runtime_mode = benchmark.get("mode", benchmark_mode)
         concurrency_str = benchmark.get("concurrency_list", "1")
         concurrency = int(concurrency_str) if isinstance(concurrency_str, str) else concurrency_str
         return {
    -        "mode": benchmark_mode,
    +        "mode": benchmark_runtime_mode,
             "concurrency": concurrency,
         }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/local/submit.py` around lines 261 - 272, The function get_benchmark_config currently returns the CLI/test-list mode (benchmark_mode) and drops the YAML-specified benchmark mode; update get_benchmark_config to preserve the YAML benchmark mode by using the value from config["benchmark"].get("mode") (falling back to benchmark_mode if missing) so bm_config retains the YAML mode (e.g., "gen_only_no_context"); keep the existing concurrency handling and return the resulting dict with "mode" set to that YAML-aware value.

jenkins/scripts/perf/disaggregated/submit.py-15-17 (1)
15-17: ⚠️ Potential issue | 🟠 Major

Use the YAML benchmark mode as the source of truth for no-context runs.

`parse_test_case_name()` only returns `e2e` or `gen_only`, so this condition can never see `gen_only_no_context`. Those configs will still keep their ctx-server count. Please derive the runtime mode from `config["benchmark"]["mode"]` once and reuse it here and in the later gen-only env branch.

Suggested fix:

    -num_ctx_servers = (
    -    0 if "gen_only_no_context" in benchmark_mode else hardware.get("num_ctx_servers")
    -)
    +benchmark_runtime_mode = config.get("benchmark", {}).get("mode", benchmark_mode)
    +num_ctx_servers = (
    +    0 if benchmark_runtime_mode == "gen_only_no_context"
    +    else hardware.get("num_ctx_servers")
    +)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/disaggregated/submit.py` around lines 15 - 17, The check for "gen_only_no_context" is wrong because parse_test_case_name() only returns "e2e" or "gen_only"; derive the runtime mode once from config["benchmark"]["mode"] (e.g., runtime_mode = config["benchmark"]["mode"]) and use that variable instead of benchmark_mode when computing num_ctx_servers (num_ctx_servers = 0 if runtime_mode == "gen_only_no_context" else hardware.get("num_ctx_servers")), and also reuse runtime_mode in the later gen-only env branch so both places use the YAML benchmark mode as the single source of truth.
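The "derive the runtime mode once" suggestion from the two comments above can be sketched as a small helper; `resolve_runtime_mode` is a hypothetical name, and the config shape follows the `config["benchmark"]["mode"]` description in the comment:

```python
def resolve_runtime_mode(config, cli_mode):
    # Prefer the YAML benchmark.mode (which may carry the richer
    # "gen_only_no_context"); fall back to the CLI-provided mode.
    return config.get("benchmark", {}).get("mode", cli_mode)

cfg = {"benchmark": {"mode": "gen_only_no_context"}}
mode = resolve_runtime_mode(cfg, "gen_only")
assert mode == "gen_only_no_context"
assert resolve_runtime_mode({}, "e2e") == "e2e"

# Derived once, reused everywhere the mode matters:
num_ctx_servers = 0 if mode == "gen_only_no_context" else 1
assert num_ctx_servers == 0
```

Centralizing the lookup means the ctx-server count and the later gen-only env branch can never disagree about which mode is active.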
🟡 Minor comments (5)
jenkins/scripts/open_search_db.py-328-356 (1)
328-356: ⚠️ Potential issue | 🟡 Minor

Narrow this exception handling so real parsing/schema bugs do not look like "no perf data."

`except Exception` here will also hide unexpected payload-shape regressions and local code bugs by returning `None`. Catch only the request/JSON/payload errors you actually expect: `requests.RequestException`, `ValueError`, and `AttributeError`.

Proposed fix:

    -        except Exception as e:
    +        except (requests.RequestException, ValueError, AttributeError) as e:
                 OpenSearchDB.logger.warning(
                     f"Failed to query from {project_name}, returned error: {e}")
                 return None

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/open_search_db.py` around lines 328 - 356, The broad except Exception in the OpenSearchDB.queryFromOpenSearchDB call hides real parsing/schema bugs; narrow it to only the expected request/JSON/payload errors by catching requests.RequestException, ValueError, and AttributeError (and import requests if not present) around the block that calls OpenSearchDB.queryFromOpenSearchDB and processes res.json(), keep the existing logging behavior for those specific exceptions, and remove or avoid a blanket Exception catch so unexpected errors bubble up instead of returning None.

tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb416_mtp3_ccb-UCX.yaml-74-78 (1)
74-78: ⚠️ Potential issue | 🟡 Minor

Filename says `mtp3` but config uses Eagle decoding.

The filename contains `mtp3`, which typically indicates Multi-Token Prediction with 3 layers, but the speculative config uses Eagle decoding with `max_draft_len: 3`. This naming inconsistency may cause confusion when filtering or selecting configs by filename patterns.

Consider updating the filename to reflect Eagle decoding (e.g., `eagle3`) or verify this is intentional for your naming convention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb416_mtp3_ccb-UCX.yaml` around lines 74 - 78, The filename indicates "mtp3" but the speculative_config block uses Eagle decoding (decoding_type: Eagle) with max_draft_len: 3 and eagle3_one_model: true (speculative_model: Kimi-K2-Thinking-NVFP4-Eagle3); rename the file to use "eagle3" (or another Eagle-based token) to match the config, or if "mtp3" is intentional, change decoding_type/max_draft_len/eagle3_one_model to reflect MTP3 settings so the filename and the speculative_config (specifically the decoding_type, max_draft_len, eagle3_one_model, and speculative_model) are consistent.

tests/scripts/perf-sanity/README.md-154-158 (1)
154-158: ⚠️ Potential issue | 🟡 Minor

Fix the SBSA Jenkins config descriptions.

These labels do not match `jenkins/L0_Test.groovy`: `SBSASlurmTestConfigs` is the SBSA single-node SLURM bucket, and `multiNodesSBSAConfigs` contains both aggregated and disaggregated multi-node stages. The current table will send people to the wrong map when they add or debug coverage.

Suggested wording:

    -| `SBSASlurmTestConfigs` | Multi-node aggregated tests |
    -| `multiNodesSBSAConfigs` | Multi-node disaggregated tests |
    +| `SBSASlurmTestConfigs` | SBSA single-node SLURM tests |
    +| `multiNodesSBSAConfigs` | SBSA multi-node SLURM tests (aggregated + disaggregated) |

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/README.md` around lines 154 - 158, Update the README table entries so they match jenkins/L0_Test.groovy: change the description for SBSASlurmTestConfigs to "SBSA single-node SLURM bucket" (it is not multi-node), and change the description for multiNodesSBSAConfigs to "Multi-node SBSA stages (contains both aggregated and disaggregated multi-node stages)"; locate and edit the table rows referencing the symbols SBSASlurmTestConfigs and multiNodesSBSAConfigs to use these corrected descriptions.

tests/integration/defs/perf/test_perf_sanity.py-940-952 (1)
940-952: ⚠️ Potential issue | 🟡 Minor

Chain the exception with `from` for proper traceback.

When re-raising a different exception inside an except block, use `from` to preserve the exception chain.

Proposed fix
 except (subprocess.CalledProcessError, FileNotFoundError, IndexError):
-    raise RuntimeError("Failed to get GPU type")
+    raise RuntimeError("Failed to get GPU type") from None

Or to preserve the original exception:

 except (subprocess.CalledProcessError, FileNotFoundError, IndexError) as e:
-    raise RuntimeError("Failed to get GPU type")
+    raise RuntimeError("Failed to get GPU type") from e

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 940 - 952, The except block in get_gpu_type catches (subprocess.CalledProcessError, FileNotFoundError, IndexError) but re-raises a RuntimeError without chaining the original exception; modify the except to capture the original exception (e.g., except (... ) as e:) and re-raise using "raise RuntimeError('Failed to get GPU type') from e" so the original traceback from subprocess.check_output (and related errors) is preserved.

jenkins/scripts/perf/perf_utils.py-536-547 (1)
536-547: ⚠️ Potential issue | 🟡 Minor

Add `strict=True` to `zip()` to catch length mismatches.

The `zip(x_vals, decline_vals)` call on line 540 could silently truncate data if the lists have different lengths, which would produce incorrect regression calculations.

Proposed fix

- ss_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_vals, decline_vals))
+ ss_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_vals, decline_vals, strict=True))

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/perf_utils.py` around lines 536 - 547, In the regression block using x_vals and decline_vals, change the zip call to use strict=True (zip(x_vals, decline_vals, strict=True)) so length mismatches raise immediately; wrap the pairing and subsequent calculations (ss_xy, ss_xx, ss_yy, slope, r_squared) in a try/except ValueError to handle/report mismatched lengths (e.g., log and skip regression) so you don't silently truncate data when computing slope and r_squared for the gradual_decline check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 28b1230b-8e55-4410-854d-3e592e8080b7
📒 Files selected for processing (76)
jenkins/L0_Test.groovy
jenkins/scripts/open_search_db.py
jenkins/scripts/perf/README.md
jenkins/scripts/perf/aggregated/slurm_launch_draft.sh
jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh
jenkins/scripts/perf/disaggregated/submit.py
jenkins/scripts/perf/get_pre_merge_html.py
jenkins/scripts/perf/local/README.md
jenkins/scripts/perf/local/run_agg_local.sh
jenkins/scripts/perf/local/run_e2e_local.sh
jenkins/scripts/perf/local/run_gen_only_local.sh
jenkins/scripts/perf/local/slurm_install.sh
jenkins/scripts/perf/local/slurm_run.sh
jenkins/scripts/perf/local/submit.py
jenkins/scripts/perf/perf_sanity_triage.py
jenkins/scripts/perf/perf_utils.py
tests/integration/defs/perf/open_search_db_utils.py
tests/integration/defs/perf/test_perf_sanity.py
tests/scripts/perf-sanity/README.md
tests/scripts/perf-sanity/aggregated/config_database_b200_nvl.yaml
tests/scripts/perf-sanity/aggregated/config_database_h200_sxm.yaml
tests/scripts/perf-sanity/aggregated/deepseek_r1_fp4_v2_2_nodes_grace_blackwell.yaml
tests/scripts/perf-sanity/aggregated/deepseek_r1_fp4_v2_blackwell.yaml
tests/scripts/perf-sanity/aggregated/deepseek_r1_fp4_v2_grace_blackwell.yaml
tests/scripts/perf-sanity/aggregated/deepseek_r1_fp8_blackwell.yaml
tests/scripts/perf-sanity/aggregated/deepseek_v32_fp4_blackwell.yaml
tests/scripts/perf-sanity/aggregated/deepseek_v32_fp4_grace_blackwell.yaml
tests/scripts/perf-sanity/aggregated/gb300_deepseek_r1_fp4_v2_2_nodes_grace_blackwell.yaml
tests/scripts/perf-sanity/aggregated/gpt_oss_120b_fp4_blackwell.yaml
tests/scripts/perf-sanity/aggregated/gpt_oss_120b_fp4_grace_blackwell.yaml
tests/scripts/perf-sanity/aggregated/k2_thinking_fp4_2_nodes_grace_blackwell.yaml
tests/scripts/perf-sanity/aggregated/k2_thinking_fp4_blackwell.yaml
tests/scripts/perf-sanity/aggregated/k2_thinking_fp4_grace_blackwell.yaml
tests/scripts/perf-sanity/deepseek_r1_fp4_v2_2_nodes_grace_blackwell.yaml
tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_1k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_1k1k_con2048_ctx1_dep4_gen1_dep8_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_1k1k_con256_ctx1_dep4_gen1_dep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con1536_ctx1_dep4_gen1_dep8_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con256_ctx1_dep4_gen1_dep8_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_128k8k_con128_ctx1_pp8_gen1_dep16_eplb0_mtp2_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_128k8k_con1_ctx1_pp8_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_128k8k_con64_ctx1_pp8_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_8k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb256_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_1k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con2048_ctx1_dep4_gen1_dep32_eplb288_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con256_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb256_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_8k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_8k1k_con4096_ctx1_dep4_gen1_dep32_eplb256_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_1k1k_con64_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_8k1k_con128_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_8k1k_con4_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_1k1k_con2048_ctx1_dep4_gen1_dep32_eplb384_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_1k1k_con4_ctx1_dep4_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb416_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_8k1k_con1024_ctx1_tp1_gen1_dep8_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb300_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml
tests/test_common/error_utils.py
tests/test_common/http_utils.py
💤 Files with no reviewable changes (1)
- tests/scripts/perf-sanity/deepseek_r1_fp4_v2_2_nodes_grace_blackwell.yaml
def disaggMode = stageName.contains("Disagg-PerfSanity")
The disaggregated path never activates.
Line 904 looks for "Disagg-PerfSanity", but the stage keys in multiNodesSBSAConfigs are named "...PerfSanity-Disagg...". That leaves disaggMode false for the GB200 disaggregated stages, so they fall back to the aggregated sbatch launcher instead of the split-group disaggregated flow.
Suggested fix
- def disaggMode = stageName.contains("Disagg-PerfSanity")
+ def disaggMode = stageName.contains("PerfSanity-Disagg")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- def disaggMode = stageName.contains("Disagg-PerfSanity")
+ def disaggMode = stageName.contains("PerfSanity-Disagg")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jenkins/L0_Test.groovy` around lines 904 - 905, The disaggregated path never
activates because disaggMode is set by checking
stageName.contains("Disagg-PerfSanity"), but the keys use the pattern
"PerfSanity-Disagg"; update the check that sets disaggMode (the line defining
disaggMode using stageName) to match the actual naming (e.g.,
stageName.contains("PerfSanity-Disagg") or more robustly
stageName.contains("Disagg") || stageName.contains("PerfSanity-Disagg")) so
GB200 disaggregated stages trigger the split-group disaggregated flow instead of
the aggregated sbatch path.
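The substring pitfall is easy to see in isolation. A quick Python sketch (the stage key below is hypothetical, shaped after the "...PerfSanity-Disagg..." naming pattern the finding describes, not taken from the repo):

```python
# Hypothetical stage key following the "...PerfSanity-Disagg..." pattern.
stage_name = "GB200-PerfSanity-Disagg-Post-Merge-1"

# The original check never matches keys of this shape,
# so disaggMode stays false:
assert "Disagg-PerfSanity" not in stage_name

# The corrected substring does match:
assert "PerfSanity-Disagg" in stage_name
print("corrected check activates the disaggregated flow")
```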
} catch (InterruptedException e) {
    stageIsInterrupted = true
    throw e
} finally {
-   uploadResults(pipeline, cluster, jobUID, stageName)
+   uploadResults(pipeline, cluster, jobUID, stageName, stageIsInterrupted)
    stage("Clean Up Slurm Resource") {
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the helper definition and all call sites.
rg -n 'def uploadResults\s*\(' jenkins/L0_Test.groovy
rg -n 'uploadResults\s*\(' jenkins/L0_Test.groovy
Repository: NVIDIA/TensorRT-LLM
Length of output: 335
Fix uploadResults call with incorrect argument count.
Line 1384 passes five arguments (pipeline, cluster, jobUID, stageName, stageIsInterrupted), but uploadResults at line 111 is defined with four parameters (pipeline, cluster, nodeName, stageName). This will raise MissingMethodException from the finally block at runtime, masking the original error and skipping result upload.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jenkins/L0_Test.groovy` around lines 1380 - 1385, The call to uploadResults
currently passes five args (pipeline, cluster, jobUID, stageName,
stageIsInterrupted) but the defined signature is uploadResults(pipeline,
cluster, nodeName, stageName); fix the call to match the signature by passing
pipeline, cluster, nodeName, stageName (replace jobUID with the nodeName
variable and remove stageIsInterrupted). If the intent was to propagate the
interrupted flag, instead update the uploadResults method signature
(uploadResults(pipeline, cluster, nodeName, stageName, stageIsInterrupted)) and
update its callers/implementation accordingly.
max_batch_size: 4
max_num_tokens: 12288
🧩 Analysis chain
🌐 Web query:
In TensorRT-LLM serving/perf-sanity configs, is `max_num_tokens` enforced per request or as a batch-wide scheduler cap, and what value is required to actually admit 8k input / 1k output requests when `max_batch_size` is 4 or 32?
💡 Result:
max_num_tokens is not per-request; it’s a batch-/iteration-wide scheduler budget. In each scheduled iteration it counts (a) the sum of all prompt (context) tokens of the context-phase requests plus (b) ~1 token per request that is in generation (not the full remaining output length). [1], [2]
So, to actually admit requests with ~8k input (≈8192 tokens) while allowing up to max_batch_size = N such prefills to be scheduled together, you need roughly:
- max_batch_size = 4: max_num_tokens >= 4 * 8192 = 32768
- max_batch_size = 32: max_num_tokens >= 32 * 8192 = 262144 [2]
(Your 1k output doesn’t require adding 1k to max_num_tokens up front; decode contributes ~1 token/request per iteration to this cap. [2] Also ensure your per-request limits like max_input_len/max_seq_len allow 8k+1k.) [2]
Fix max_num_tokens to allow actual batching of 8k input requests.
max_num_tokens is a batch-wide scheduler cap that sums all prompt tokens in a scheduled iteration. With isl: 8192 inputs, the required minimums are:
- Lines 14-15 (`max_batch_size: 4`): `max_num_tokens` must be ≥ 32768 (currently 12288)
- Lines 45-46 (`max_batch_size: 32`): `max_num_tokens` must be ≥ 262144 (currently 12288)
Current values prevent the batch sizes from being exercised and invalidate the perf baselines.
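The budget arithmetic above can be captured in a tiny helper (an illustrative sketch only; per the query result, the real scheduler additionally budgets roughly one token per in-flight generation request):

```python
def min_max_num_tokens(max_batch_size: int, isl: int) -> int:
    """Smallest batch-wide token budget that admits `max_batch_size`
    concurrent prefills of `isl` prompt tokens each."""
    return max_batch_size * isl


assert min_max_num_tokens(4, 8192) == 32768    # needs >= 32768, not 12288
assert min_max_num_tokens(32, 8192) == 262144  # needs >= 262144, not 12288
```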
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/scripts/perf-sanity/aggregated/deepseek_v32_fp4_blackwell.yaml` around
lines 14 - 15, Update the batch-wide scheduler cap values so the scheduler can
actually batch 8k-token inputs: for the block where max_batch_size: 4, change
max_num_tokens from 12288 to at least 32768; for the block where max_batch_size:
32, change max_num_tokens from 12288 to at least 262144. Ensure you update the
respective max_num_tokens keys in the same YAML sections as the existing
max_batch_size entries so the perf baselines exercise the intended batch sizes.
client_configs:
  - name: "con2_iter10_8k1k"
    concurrency: 2
    iterations: 10
    isl: 8192
    osl: 1024
    backend: "openai"
    trust_remote_code: true
    dataset_file: datasets/perf-ci/k2_thinking-8k1k-20480-ratio-1_for_serve.json
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
target='datasets/perf-ci/k2_thinking-8k1k-20480-ratio-1_for_serve.json'
echo "Checking for ${target}..."
if [ -f "${target}" ]; then
echo "FOUND: ${target}"
else
echo "MISSING: ${target}"
echo
echo "Nearby perf-ci datasets:"
fd . datasets/perf-ci 2>/dev/null || true
exit 1
fi
Repository: NVIDIA/TensorRT-LLM
Length of output: 239
The referenced perf dataset is missing and must be cherry-picked with this config.
The file datasets/perf-ci/k2_thinking-8k1k-20480-ratio-1_for_serve.json does not exist in the repository. Both scenarios at lines 28-36 and 62-70 depend on it and will fail immediately at runtime without it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/scripts/perf-sanity/aggregated/k2_thinking_fp4_grace_blackwell.yaml`
around lines 28 - 36, The referenced perf dataset file
datasets/perf-ci/k2_thinking-8k1k-20480-ratio-1_for_serve.json is missing and
both client_configs entries (e.g., name "con2_iter10_8k1k") reference it; fix by
adding/ cherry-picking the commit that introduced that JSON into
datasets/perf-ci so the path exists, or replace the dataset_file value in the
client_configs (the occurrences with dataset_file:
datasets/perf-ci/k2_thinking-8k1k-20480-ratio-1_for_serve.json) with an existing
dataset path; ensure both scenarios that reference that filename are updated
consistently.
partition: <partition>
account: <account>
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Find all placeholder values in YAML configuration files
rg -n '<[^>]+>' tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
Repository: NVIDIA/TensorRT-LLM
Length of output: 263
All placeholder values must be replaced before execution.
This SLURM configuration file contains multiple placeholder values that must be replaced with actual values:
- Lines 11-12: `<partition>` and `<account>`
- Lines 32-34: `<container_mount>`, `<container_image>`, and `<model_path>`
- Line 37: `<full_path_to_work_dir>`
These placeholders will cause the job submission to fail at runtime.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml`
around lines 11 - 12, Replace all SLURM/YAML placeholder tokens with real
values: set the partition and account fields (replace <partition> and
<account>), provide a real container mount path, container image and model path
(replace <container_mount>, <container_image>, <model_path>), and set the
working directory to the full absolute path (replace <full_path_to_work_dir>);
verify these concrete values are valid on the cluster and in the container
runtime so the job submission using the keys partition, account,
container_mount, container_image, model_path and work_dir will not fail at
runtime.
account: <account>
job_time: 02:00:00
job_name: unified-benchmark
extra_args: --gres=gpu:8
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Inspect the new config =="
sed -n '1,220p' tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con256_ctx1_dep4_gen1_dep8_eplb0_mtp1_ccb-UCX.yaml
echo
echo "== Find code that computes SLURM/node/GPU allocation for disaggregated runs =="
rg -n -C4 '\b(num_ctx_servers|num_gen_servers|gpus_per_node|tensor_parallel_size|pipeline_parallel_size|context_parallel_size|extra_args|gres)\b'
echo
echo "== Inspect the referenced SLURM script =="
fd 'disaggr_torch\.slurm$' -x sed -n '1,260p' {}
Repository: NVIDIA/TensorRT-LLM
Length of output: 50377
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Find the disaggr_torch.slurm script ==="
find . -name "disaggr_torch.slurm" -type f
echo
echo "=== Find launcher scripts in perf-sanity ==="
fd -e py . tests/scripts/perf-sanity/ | head -20
echo
echo "=== Look at run_benchmark_serve.py to see how it processes hardware config ==="
wc -l tests/scripts/perf-sanity/run_benchmark_serve.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 448
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for how num_ctx_servers and num_gen_servers are used ==="
rg -n 'num_ctx_servers|num_gen_servers' tests/scripts/perf-sanity/ -A 3 -B 1 | head -100
echo
echo "=== Search for SLURM node/GPU allocation computation ==="
rg -n 'gres|--nodes|num_nodes|total.*gpu' tests/scripts/perf-sanity/ -B 2 -A 2 | head -80
Repository: NVIDIA/TensorRT-LLM
Length of output: 22235
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Examine the disaggr_torch.slurm script ==="
head -80 ./examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
echo
echo "=== Look for resource/node allocation logic in run_benchmark_serve.py ==="
sed -n '1,150p' tests/scripts/perf-sanity/run_benchmark_serve.py | head -100
Repository: NVIDIA/TensorRT-LLM
Length of output: 7199
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for where YAML config is loaded and parsed ==="
rg -n 'hardware.*num_ctx_servers|num_ctx_servers.*hardware|load.*yaml|yaml.*load' tests/scripts/perf-sanity/ -B 2 -A 5 | head -150
echo
echo "=== Look for node allocation computation logic ==="
rg -n 'num_ctx_servers|num_gen_servers' examples/disaggregated/ -B 3 -A 3 | head -100
Repository: NVIDIA/TensorRT-LLM
Length of output: 8901
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Look at submit.py to see how total_nodes is calculated ==="
sed -n '180,260p' examples/disaggregated/slurm/benchmark/submit.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 3893
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Find the calculate_nodes function ==="
rg -n 'def calculate_nodes' examples/disaggregated/slurm/benchmark/ -A 10
echo
echo "=== Verify for b200 config what nodes would be calculated ==="
python3 << 'EOF'
# Simulate the calculation for b200_deepseek-r1-fp4_8k1k
gpus_per_node = 8
num_ctx_servers = 1
num_gen_servers = 1
ctx_tp = 4
gen_tp = 8
# Assuming calculate_nodes is: ceil(world_size / gpus_per_node) * num_servers
import math
def calculate_nodes(world_size, num_servers, gpus_per_node):
# Typical pattern: each server needs ceil(world_size / gpus_per_node) nodes
gpus_per_server = gpus_per_node
return math.ceil(world_size / gpus_per_server) * num_servers
ctx_nodes = calculate_nodes(ctx_tp, num_ctx_servers, gpus_per_node)
gen_nodes = calculate_nodes(gen_tp, num_gen_servers, gpus_per_node)
total_nodes = ctx_nodes + gen_nodes
total_gpus = total_nodes * gpus_per_node
print(f"ctx_world_size={ctx_tp}, num_ctx_servers={num_ctx_servers}, gpus_per_node={gpus_per_node}")
print(f" ctx_nodes = ceil({ctx_tp}/{gpus_per_node}) * {num_ctx_servers} = {ctx_nodes}")
print(f"gen_world_size={gen_tp}, num_gen_servers={num_gen_servers}, gpus_per_node={gpus_per_node}")
print(f" gen_nodes = ceil({gen_tp}/{gpus_per_node}) * {num_gen_servers} = {gen_nodes}")
print(f"total_nodes = {ctx_nodes} + {gen_nodes} = {total_nodes}")
print(f"total_gpus = {total_nodes} * {gpus_per_node} = {total_gpus}")
print(f"Config specifies: --gres=gpu:8")
print(f"MISMATCH: {total_gpus} GPUs needed but only 8 requested!" if total_gpus != 8 else "MATCH!")
EOF
Repository: NVIDIA/TensorRT-LLM
Length of output: 1376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Verify the calculation for the other config lines mentioned ==="
python3 << 'EOF'
import math
def calculate_nodes(world_size, num_servers, gpus_per_node):
return math.ceil(world_size * num_servers / gpus_per_node)
configs = [
# Lines 15, 28-30, 52-55, 80-83 (primary file)
("b200_deepseek-r1-fp4_8k1k", 8, 1, 1, 4, 8),
# Lines 28-30, 52-55, 80-83 patterns from gb200 files
("gb200_deepseek-r1-fp4_1k1k", 4, 1, 1, 8, 8),
]
for name, gpus_per_node, num_ctx, num_gen, ctx_tp, gen_tp in configs:
ctx_nodes = calculate_nodes(ctx_tp, num_ctx, gpus_per_node)
gen_nodes = calculate_nodes(gen_tp, num_gen, gpus_per_node)
total_nodes = ctx_nodes + gen_nodes
total_gpus = total_nodes * gpus_per_node
print(f"\n{name}:")
print(f" gpus_per_node={gpus_per_node}, num_ctx_servers={num_ctx}, num_gen_servers={num_gen}")
print(f" ctx_tp={ctx_tp}, gen_tp={gen_tp}")
print(f" → ctx_nodes={ctx_nodes}, gen_nodes={gen_nodes}, total_nodes={total_nodes}")
print(f" → requires {total_gpus} GPUs total")
print(f" → but config specifies --gres=gpu:{gpus_per_node}")
if total_gpus > gpus_per_node:
print(f" ⚠️ UNDER-PROVISIONED: needs {total_gpus} but requests {gpus_per_node}!")
EOF
Repository: NVIDIA/TensorRT-LLM
Length of output: 659
Fix SLURM GPU allocation to match disaggregated topology requirements.
The launcher's calculate_nodes() function correctly determines that each disaggregated server needs multiple full nodes based on tensor parallelism, but the YAML configs hardcode insufficient --gres=gpu:* values.
For example, the b200_deepseek-r1-fp4_8k1k config with gen TP=8 and ctx TP=4 requires:
- ctx_nodes = ceil(4 × 1 / 8) = 1 node
- gen_nodes = ceil(8 × 1 / 8) = 1 node
- Total: 2 nodes × 8 GPUs/node = 16 GPUs
But extra_args: --gres=gpu:8 allocates only 8 GPUs on a single node, causing 2x under-provisioning.
Change extra_args to reflect total GPU requirements: --gres=gpu:16 (or scale it based on the computed total_nodes = ctx_nodes + gen_nodes multiplied by gpus_per_node).
Also applies to lines 28-30, 52-55, 80-83 with the same resource-mismatch pattern across all disaggregated benchmark configurations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@tests/scripts/perf-sanity/disaggregated/b200_deepseek-r1-fp4_8k1k_con256_ctx1_dep4_gen1_dep8_eplb0_mtp1_ccb-UCX.yaml`
at line 15, Update the SLURM GPU allocation in the YAML configs by replacing the
hardcoded extra_args: --gres=gpu:* with a value computed from the launcher's
node calculation: compute ctx_nodes and gen_nodes the same way as
calculate_nodes(), set total_nodes = ctx_nodes + gen_nodes, and set extra_args
to --gres=gpu:{total_nodes * gpus_per_node} (e.g., for ctx TP=4 and gen TP=8
with 8 GPUs/node use --gres=gpu:16); apply this change to the occurrences at the
shown blocks (lines around the current extra_args and the repeated blocks at
lines ~28-30, ~52-55, ~80-83) so the YAML reflects total_nodes × gpus_per_node
rather than a single-node GPU count.
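The node math described above can be sketched as a small helper (illustrative only; the launcher's real `calculate_nodes()` may differ in detail):

```python
import math


def required_gres(ctx_world_size: int, num_ctx_servers: int,
                  gen_world_size: int, num_gen_servers: int,
                  gpus_per_node: int) -> str:
    """Derive a --gres value from total topology needs rather than
    a single node's GPU count."""
    ctx_nodes = math.ceil(ctx_world_size * num_ctx_servers / gpus_per_node)
    gen_nodes = math.ceil(gen_world_size * num_gen_servers / gpus_per_node)
    total_nodes = ctx_nodes + gen_nodes
    return f"--gres=gpu:{total_nodes * gpus_per_node}"


# b200_deepseek-r1-fp4_8k1k: ctx TP=4 (1 server), gen TP=8 (1 server), 8 GPUs/node
assert required_gres(4, 1, 8, 1, 8) == "--gres=gpu:16"
```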
| partition: <partition> | ||
| account: <account> |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Placeholder occurrences in perf-sanity configs =="
rg -n --glob '*.yaml' '<(partition|account|container_mount|container_image|model_path|full_path_to_work_dir)>' tests/scripts/perf-sanity
echo
echo "== Candidate substitution/materialization logic =="
rg -n 'envsubst|jinja|template|render|substitute|<partition>|<container_image>|<model_path>|<full_path_to_work_dir>' .
Repository: NVIDIA/TensorRT-LLM
Length of output: 50377
🏁 Script executed:
# Search for where these disaggregated configs are referenced/loaded
rg -l "disaggregated.*\.yaml|gb200.*fp4.*con" tests/ --type py --type sh
# Check for any loader/materialization scripts
find . -type f \( -name "*disaggregated*" -not -path "./\.git/*" \) | head -20
# Look for scripts that might handle template rendering for these configs
rg -n "partition|account|container_image|model_path" tests/scripts/perf-sanity/ --type py -A 3 -B 3 | head -100
Repository: NVIDIA/TensorRT-LLM
Length of output: 6204
🏁 Script executed:
# Search for YAML loading of disaggregated configs more directly
rg -n "load.*yaml|yaml.*load" tests/ --type py -B 2 -A 2 | grep -i disaggregated -A 5 -B 5
# Look for explicit loading of the config files in the disaggregated directory
rg -n "disaggregated/.*\.yaml" . --type py
# Search for references to the specific file under review
rg -n "gb200_deepseek-v32-fp4_32k4k_con256" .
# Check if there's a template or config rendering mechanism in tests
find tests/scripts/perf-sanity -type f \( -name "*.py" -o -name "*.sh" \) -exec basename {} \; | sort | uniq
Repository: NVIDIA/TensorRT-LLM
Length of output: 4051
🏁 Script executed:
# Search for references to tests/scripts/perf-sanity/disaggregated specifically
rg -n "tests/scripts/perf-sanity/disaggregated" .
# Look for any script in perf-sanity that might load these YAMLs
find tests/scripts/perf-sanity -type f \( -name "*.py" -o -name "*.sh" \) | xargs ls -la
# Check if there's a benchmark script that reads from disaggregated dir
rg -n "disaggregated" tests/scripts/perf-sanity/ --type py
# Look for config loading in perf-sanity scripts
rg -n "yaml.load|yaml.safe_load" tests/scripts/perf-sanity/ --type py
Repository: NVIDIA/TensorRT-LLM
Length of output: 1289
🏁 Script executed:
# Check how run_benchmark_serve.py loads and processes the YAML configs
sed -n '450,500p' tests/scripts/perf-sanity/run_benchmark_serve.py
# Also check the jenkins submission script to see if it handles template rendering
cat jenkins/scripts/perf/disaggregated/submit.py
# Check if there's any environment variable substitution in the loading process
rg -n "environ|getenv|substitute|envsubst|template" tests/scripts/perf-sanity/run_benchmark_serve.py -B 2 -A 2
Repository: NVIDIA/TensorRT-LLM
Length of output: 20470
🏁 Script executed:
# Read the README to understand the intent of these configs
cat tests/scripts/perf-sanity/README.md | head -250
# Also check if there's any documentation about template requirements
rg -n "template|placeholder|<partition>|<account>" tests/scripts/perf-sanity/README.md
# Check one of the actual YAML files to see the full context
head -50 tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con256_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml
Repository: NVIDIA/TensorRT-LLM
Length of output: 11182
🏁 Script executed:
# Check the SLURM launch draft template to see if substitution happens there
head -100 jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh
# Also check if there's any shell variable expansion of these placeholders
rg -n "container_image|<container_image>|envsubst" jenkins/scripts/perf/disaggregated/ --type sh
# Look for where env_config values are actually used in submit.py
sed -n '80,250p' jenkins/scripts/perf/disaggregated/submit.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 8817
🏁 Script executed:
# Check if there's any documentation comment in the YAML files themselves
head -10 tests/scripts/perf-sanity/disaggregated/*.yaml | grep -A 5 -B 5 "partition\|container_image" | head -50
# Check the Jenkinsfile to see if it handles template substitution
grep -A 20 -B 5 "disaggregated/submit.py\|DISAGG_CONFIG" jenkins/L0_Test.groovy | head -100
# Look for any shell script that might do envsubst or sed replacement
find . -name "*.sh" -type f | xargs grep -l "envsubst\|sed.*<\|placeholder" 2>/dev/null | head -10
Repository: NVIDIA/TensorRT-LLM
Length of output: 1772
Verify that a template rendering step exists before these configs are submitted to SLURM.
These YAML files contain unrendered placeholders (<partition>, <account>, <container_image>, <model_path>, <full_path_to_work_dir>) across lines 11–12 and 32–37. The config is loaded directly via yaml.safe_load() in submit.py without any substitution step, and will pass literal <...> strings to SLURM and container commands, causing immediate failures.
Confirm either:
- A preprocessing step exists outside this repo that renders these templates before `submit.py` is invoked, or
- These files require explicit concrete values and should not contain angle-bracket placeholders.
This pattern is consistent across all disaggregated config files and applies to lines 32–37 as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_32k4k_con256_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml`
around lines 11 - 12, The YAML configs contain unrendered placeholders like
<partition>, <account>, <container_image>, <model_path>, and
<full_path_to_work_dir> and are loaded directly via yaml.safe_load() in
submit.py; update the workflow so these placeholders are rendered before
submission by either (A) adding a preprocessing/render step that substitutes
template variables into the files (e.g., call the existing templating utility or
a simple Jinja/str.replace pass) before submit.py calls yaml.safe_load(), or (B)
ensure the config files contain concrete values and remove angle-bracket
placeholders; additionally, add validation inside submit.py (after
yaml.safe_load()) to detect any remaining angle-bracket tokens and fail fast
with a clear error referencing the offending fields to prevent passing literal
"<...>" values to SLURM/container commands.
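The fail-fast validation suggested above could look roughly like this (a stdlib-only sketch; the regex and function name are assumptions, not code from submit.py):

```python
import re

# Matches unrendered angle-bracket tokens such as <partition> or <account>.
ANGLE_TOKEN = re.compile(r"<[A-Za-z_]+>")


def find_unrendered_placeholders(config_text: str) -> list[str]:
    """Return any angle-bracket tokens still present in a config,
    so submission can abort with a clear error instead of passing
    literal "<...>" strings to SLURM."""
    return sorted(set(ANGLE_TOKEN.findall(config_text)))


sample = "partition: <partition>\naccount: <account>\njob_time: 02:00:00\n"
assert find_unrendered_placeholders(sample) == ["<account>", "<partition>"]
assert find_unrendered_placeholders("partition: batch\n") == []
```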
script_file: disaggr_torch.slurm
benchmark_type: 8k1k
slurm:
  script_file: disaggr_torch.slurm
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check that the referenced launcher and dataset exist on this branch and are already used by perf-sanity.
fd 'disaggr_torch\.slurm$'
fd 'k2_thinking-8k1k-20480-ratio-1_for_serve\.json$'
rg -n -C2 'disaggr_torch\.slurm|k2_thinking-8k1k-20480-ratio-1_for_serve\.json'
Length of output: 50376
Missing dataset file will cause job failure on release/1.2.
The referenced launcher disaggr_torch.slurm exists at examples/disaggregated/slurm/benchmark/disaggr_torch.slurm, but the dataset file k2_thinking-8k1k-20480-ratio-1_for_serve.json is not present in the repository despite being referenced in the config at line 26 and in multiple other perf-sanity configs. The job will fail at runtime without this artifact.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml` around lines 7–10: the config references a missing dataset file, k2_thinking-8k1k-20480-ratio-1_for_serve.json (alongside the disaggr_torch.slurm launcher), which will cause a runtime failure. Either add the dataset JSON to the repo (e.g., alongside examples/disaggregated/slurm/benchmark/disaggr_torch.slurm or a centralized datasets dir) or update gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml and the other perf-sanity configs to point to an existing dataset filename/path, then confirm the launcher name disaggr_torch.slurm and dataset path are consistently referenced across all perf-sanity configs on the release/1.2 branch.
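A cheap guard against this class of failure is a pre-flight existence check before `sbatch` is invoked. The sketch below is an assumption, not part of the PR: `missing_datasets` is a hypothetical helper, and the temp file merely simulates a dataset that is present.

```python
from pathlib import Path
import os
import tempfile

# Hypothetical pre-flight check (not part of this PR): verify that every
# dataset file referenced by a perf-sanity config exists before the job is
# submitted, so a missing artifact like the k2_thinking JSON fails at
# submit time rather than mid-job on the SLURM cluster.
def missing_datasets(dataset_paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in dataset_paths if not Path(p).is_file()]

# Usage: one real file plus one deliberately missing path.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
    present = f.name
missing = missing_datasets([present, "/nonexistent/dataset.json"])
os.unlink(present)
print(missing)  # ['/nonexistent/dataset.json']
```

Running this check in `submit.py` right after config loading would surface the missing-dataset problem at submission, which is far cheaper than burning a SLURM allocation to discover it.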
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.