Skip to content

chore: log host config on first boot order failure before ForceRestart#421

Open
martinraumann wants to merge 2 commits intoNVIDIA:mainfrom
martinraumann:fix/log-host-config-on-first-boot-order-failure
Open

chore: log host config on first boot order failure before ForceRestart#421
martinraumann wants to merge 2 commits intoNVIDIA:mainfrom
martinraumann:fix/log-host-config-on-first-boot-order-failure

Conversation

@martinraumann
Copy link
Contributor

Description

Calls log_host_config on the first set_boot_order_dpu_first failure before triggering ForceRestart. Previously this diagnostic logging (boot option UEFI device paths + PCIe inventory) was only captured on subsequent retries via trigger_reboot_if_needed_with_location, leaving a gap at the most useful moment.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues

Relates to the GB200 missing primary DPU HTTP boot option bug. See also PR #350 which added log_host_config for adjacent states.

Breaking Changes

  • This PR contains breaking changes

Testing

  • No testing required (docs, internal refactor, etc.)

Additional Notes

log_host_config is infallible (swallows errors internally) and the additional Redfish calls happen immediately before a reboot, so there is no impact on control flow or timing.

@martinraumann martinraumann requested a review from a team as a code owner March 2, 2026 22:56
@martinraumann martinraumann self-assigned this Mar 2, 2026
On the first call to set_boot_order_dpu_first, if it fails the code takes
a direct ForceRestart path that bypasses the log_host_config call that
trigger_reboot_if_needed_with_location makes on subsequent retries. This
means we lose boot option details and PCIe inventory at the most useful
moment.

Call log_host_config before the first ForceRestart so diagnostic context
is always captured on failure regardless of which reboot path is taken.

See PR NVIDIA#350 for the original log_host_config additions and the GB200
missing DPU HTTP boot option bug for context.

Signed-off-by: Martin Raumann <mraumann@nvidia.com>
@martinraumann martinraumann force-pushed the fix/log-host-config-on-first-boot-order-failure branch from 3a80884 to 9333bd0 Compare March 3, 2026 17:22
// triggered so we capture full diagnostic context (UEFI device paths +
// PCIe inventory) before state resets. Skipped when waiting on an
// already-in-progress reboot to avoid redundant Redfish calls.
if reboot_status.increase_retry_count {
Copy link
Contributor

@spydaNVIDIA spydaNVIDIA Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the goal of this MR to find the RC for https://nvbugspro.nvidia.com/bug/5867167 ?

If so, given that the bug happens on instance provisioning, im not sure it will be of much help b/c we dont call set_host_boot_order going from Ready -> Assigned/Ready today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, but the end goal wasn't specifically to find the RC but to ensure we have diagnostic context when this failure occurs. Based on the bug report (I could be reading it wrong), the failure is happening during the machine bring-up (HostInit/SetBootOrder) on newly scaled-up nodes, not during Ready->Assigned. set_host_boot_order is called on that path (and in HostPlatformConfig), so the logging should be relevant. could the failure be occurring somewhere else in the flow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HostInit/SetBootOrder happens on the initial ingestion of machines; in YTL this happened a while back. I think the failure happened when tenants were unable to provision instances (Ready -> Assigned/Ready). The servers were able to boot Scout earlier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. there's ambiguity in the bug report around whether these were truly new machines going through HostInit or existing machines that had been in the pool. If it's the latter, do you know where in the Ready→Assigned flow the failure would be happening? That would help us figure out where the logging actually needs to go. worth keeping the PR even if it doesn't dierectly help diagnose this specific bug, I think it's still a logging improvement regardless?

Copy link
Contributor

@spydaNVIDIA spydaNVIDIA Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, its a logging improvement. I realized earlier today that the logs I added in a previous MR wont help with that bug and figured the same applied here. TBH Im not too sure why the server wouldnt pxe boot given that it pxe booted into scout earlier, but I guess we could log on the transition to see whats going on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants