chore: log host config on first boot order failure before ForceRestart by martinraumann · Pull Request #421 · NVIDIA/bare-metal-manager-core

martinraumann · 2026-03-02T22:56:47Z

Description

Calls log_host_config on the first set_boot_order_dpu_first failure before triggering ForceRestart. Previously this diagnostic logging (boot option UEFI device paths + PCIe inventory) was only captured on subsequent retries via trigger_reboot_if_needed_with_location, leaving a gap at the most useful moment.
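The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual handler.rs code: `Action`, `Trace`, `handle_first_attempt`, the `ok` flag, and the simulated error string are hypothetical stand-ins, and only the names `log_host_config`, `set_boot_order_dpu_first`, and `ForceRestart` come from this PR.

```rust
// Hypothetical stand-ins; none of these are the real handler.rs definitions.
#[derive(Debug, PartialEq)]
enum Action {
    ForceRestart,
    Continue,
}

/// Records which steps ran so the ordering can be inspected.
#[derive(Default)]
struct Trace {
    events: Vec<&'static str>,
}

// Stand-in for log_host_config: it never returns an error to the caller.
fn log_host_config(trace: &mut Trace) {
    trace.events.push("log_host_config");
}

// Stand-in for set_boot_order_dpu_first; `ok` simulates the Redfish outcome.
fn set_boot_order_dpu_first(ok: bool) -> Result<(), String> {
    if ok {
        Ok(())
    } else {
        // Simulated error text, invented for this sketch.
        Err("simulated boot order failure".to_string())
    }
}

// First attempt: on failure, capture diagnostics before requesting the restart.
fn handle_first_attempt(ok: bool, trace: &mut Trace) -> Action {
    match set_boot_order_dpu_first(ok) {
        Ok(()) => Action::Continue,
        Err(_err) => {
            trace.events.push("set_boot_order_failed");
            log_host_config(trace);
            trace.events.push("force_restart");
            Action::ForceRestart
        }
    }
}
```

The point of the sketch is only the ordering on the failure path: the diagnostic call runs after the failure is observed and before the restart is requested.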

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues

Relates to the GB200 missing primary DPU HTTP boot option bug. See also PR #350 which added log_host_config for adjacent states.

Breaking Changes

This PR contains breaking changes

Testing

No testing required (docs, internal refactor, etc.)

Additional Notes

log_host_config is infallible (swallows errors internally) and the additional Redfish calls happen immediately before a reboot, so there is no impact on control flow or timing.
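As an illustration of what "infallible" means here, the following sketch shows a logging helper that handles its own errors and therefore cannot alter the caller's control flow. It assumes a simplified shape: `fetch_boot_options`, the `reachable` flag, the boot option names, and the returned `Vec<String>` (used instead of a real logger so the behavior is observable) are all hypothetical and not taken from the repository.

```rust
// Hypothetical stand-in for a Redfish query that can fail.
fn fetch_boot_options(reachable: bool) -> Result<Vec<String>, String> {
    if reachable {
        Ok(vec!["Boot0001".to_string(), "Boot0002".to_string()])
    } else {
        Err("BMC unreachable".to_string())
    }
}

// Infallible by construction: the signature has no Result, so callers
// cannot be diverted by a diagnostics failure. Returns the lines it
// would have logged.
fn log_host_config(reachable: bool) -> Vec<String> {
    match fetch_boot_options(reachable) {
        Ok(opts) => opts
            .into_iter()
            .map(|o| format!("boot option: {}", o))
            .collect(),
        // The error is recorded as a log line and swallowed here.
        Err(e) => vec![format!("failed to collect host config: {}", e)],
    }
}
```

Because the helper returns no `Result`, a caller that is about to reboot the host proceeds identically whether or not the diagnostic queries succeed.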

crates/api/src/state_controller/machine/handler.rs

On the first call to set_boot_order_dpu_first, if it fails the code takes a direct ForceRestart path that bypasses the log_host_config call that trigger_reboot_if_needed_with_location makes on subsequent retries. This means we lose boot option details and PCIe inventory at the most useful moment.

Call log_host_config before the first ForceRestart so diagnostic context is always captured on failure regardless of which reboot path is taken.

See PR NVIDIA#350 for the original log_host_config additions and the GB200 missing DPU HTTP boot option bug for context.

Signed-off-by: Martin Raumann <mraumann@nvidia.com>

spydaNVIDIA · 2026-03-03T22:26:53Z

crates/api/src/state_controller/machine/handler.rs

+                        // triggered so we capture full diagnostic context (UEFI device paths +
+                        // PCIe inventory) before state resets. Skipped when waiting on an
+                        // already-in-progress reboot to avoid redundant Redfish calls.
+                        if reboot_status.increase_retry_count {
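The gating in this hunk can be sketched as below. This assumes, as the diff comment suggests, that `increase_retry_count` is true only when a new reboot was actually triggered. `RebootStatus` here is a hypothetical one-field struct (only the field name comes from the diff), and `Diagnostics` and `on_boot_order_failure` are invented for illustration.

```rust
// Hypothetical shape; only the `increase_retry_count` field name is from the diff.
struct RebootStatus {
    increase_retry_count: bool,
}

// Counts diagnostic invocations so the gating is observable.
#[derive(Default)]
struct Diagnostics {
    redfish_calls: u32,
}

impl Diagnostics {
    // Stand-in for log_host_config; each call represents extra Redfish traffic.
    fn log_host_config(&mut self) {
        self.redfish_calls += 1;
    }
}

// Log only when a new reboot was triggered; skip while waiting on an
// already-in-progress reboot to avoid redundant calls.
fn on_boot_order_failure(status: &RebootStatus, diag: &mut Diagnostics) {
    if status.increase_retry_count {
        diag.log_host_config();
    }
}
```

Under this assumption, repeated handler iterations that are merely waiting for a reboot to finish add no further diagnostic calls.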


Is the goal of this MR to find the root cause (RC) of https://nvbugspro.nvidia.com/bug/5867167?

If so, given that the bug happens on instance provisioning, I'm not sure it will be of much help because we don't call set_host_boot_order going from Ready -> Assigned/Ready today.

Good point, but the end goal wasn't specifically to find the root cause; it was to ensure we have diagnostic context when this failure occurs. Based on the bug report (I could be reading it wrong), the failure is happening during machine bring-up (HostInit/SetBootOrder) on newly scaled-up nodes, not during Ready -> Assigned. set_host_boot_order is called on that path (and in HostPlatformConfig), so the logging should be relevant. Could the failure be occurring somewhere else in the flow?

HostInit/SetBootOrder happens on the initial ingestion of machines; in YTL this happened a while back. I think the failure happened when tenants were unable to provision instances (Ready -> Assigned/Ready). The servers were able to boot Scout earlier.

Fair. There's ambiguity in the bug report around whether these were truly new machines going through HostInit or existing machines that had been in the pool. If it's the latter, do you know where in the Ready -> Assigned flow the failure would be happening? That would help us figure out where the logging actually needs to go. Worth keeping the PR even if it doesn't directly help diagnose this specific bug? I think it's still a logging improvement regardless.

Agreed, it's a logging improvement. I realized earlier today that the logs I added in a previous MR won't help with that bug and figured the same applied here. TBH I'm not too sure why the server wouldn't PXE boot given that it PXE booted into Scout earlier, but I guess we could log on the transition to see what's going on.

martinraumann requested a review from a team as a code owner March 2, 2026 22:56

martinraumann requested a review from spydaNVIDIA March 2, 2026 22:57

martinraumann self-assigned this Mar 2, 2026

spydaNVIDIA reviewed Mar 3, 2026

crates/api/src/state_controller/machine/handler.rs (outdated)

martinraumann force-pushed the fix/log-host-config-on-first-boot-order-failure branch from 3a80884 to 9333bd0 Compare March 3, 2026 17:22

Merge branch 'main' into fix/log-host-config-on-first-boot-order-failure

b858460

kensimon approved these changes Mar 3, 2026


spydaNVIDIA reviewed Mar 3, 2026


martinraumann wants to merge 2 commits into NVIDIA:main from martinraumann:fix/log-host-config-on-first-boot-order-failure

Comments, in order: spydaNVIDIA (Mar 3, 2026, edited), martinraumann (Mar 3, 2026), spydaNVIDIA (Mar 3, 2026), martinraumann (Mar 4, 2026), spydaNVIDIA (Mar 4, 2026, edited)


3 participants
