chore: log host config on first boot order failure before ForceRestart#421
chore: log host config on first boot order failure before ForceRestart#421martinraumann wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
On the first call to set_boot_order_dpu_first, if it fails the code takes a direct ForceRestart path that bypasses the log_host_config call that trigger_reboot_if_needed_with_location makes on subsequent retries. This means we lose boot option details and PCIe inventory at the most useful moment. Call log_host_config before the first ForceRestart so diagnostic context is always captured on failure regardless of which reboot path is taken. See PR NVIDIA#350 for the original log_host_config additions and the GB200 missing DPU HTTP boot option bug for context. Signed-off-by: Martin Raumann <mraumann@nvidia.com>
3a80884 to
9333bd0
Compare
| // triggered so we capture full diagnostic context (UEFI device paths + | ||
| // PCIe inventory) before state resets. Skipped when waiting on an | ||
| // already-in-progress reboot to avoid redundant Redfish calls. | ||
| if reboot_status.increase_retry_count { |
There was a problem hiding this comment.
Is the goal of this MR to find the RC for https://nvbugspro.nvidia.com/bug/5867167 ?
If so, given that the bug happens on instance provisioning, im not sure it will be of much help b/c we dont call set_host_boot_order going from Ready -> Assigned/Ready today.
There was a problem hiding this comment.
good point, but the end goal wasn't specifically to find the RC but to ensure we have diagnostic context when this failure occurs. Based on the bug report (I could be reading it wrong), the failure is happening during the machine bring-up (HostInit/SetBootOrder) on newly scaled-up nodes, not during Ready->Assigned. set_host_boot_order is called on that path (and in HostPlatformConfig), so the logging should be relevant. could the failure be occurring somewhere else in the flow?
There was a problem hiding this comment.
HostInit/SetBootOrder happens on the initial ingestion of machines; in YTL this happened a while back. I think the failure happened when tenants were unable to provision instances (Ready -> Assigned/Ready). The servers were able to boot Scout earlier.
There was a problem hiding this comment.
Fair. there's ambiguity in the bug report around whether these were truly new machines going through HostInit or existing machines that had been in the pool. If it's the latter, do you know where in the Ready→Assigned flow the failure would be happening? That would help us figure out where the logging actually needs to go. worth keeping the PR even if it doesn't dierectly help diagnose this specific bug, I think it's still a logging improvement regardless?
There was a problem hiding this comment.
Agreed, its a logging improvement. I realized earlier today that the logs I added in a previous MR wont help with that bug and figured the same applied here. TBH Im not too sure why the server wouldnt pxe boot given that it pxe booted into scout earlier, but I guess we could log on the transition to see whats going on.
Description
Calls
log_host_configon the firstset_boot_order_dpu_firstfailure before triggeringForceRestart. Previously this diagnostic logging (boot option UEFI device paths + PCIe inventory) was only captured on subsequent retries viatrigger_reboot_if_needed_with_location, leaving a gap at the most useful moment.Type of Change
Related Issues
Relates to the GB200 missing primary DPU HTTP boot option bug. See also PR #350 which added
log_host_configfor adjacent states.Breaking Changes
Testing
Additional Notes
log_host_configis infallible (swallows errors internally) and the additional Redfish calls happen immediately before a reboot, so there is no impact on control flow or timing.