Diagnostic features for update failures #1867

cbiffle · 2024-09-12T18:44:23Z

We had a Sidecar SP fail an update at a customer site today in a rather ambiguous manner. This issue is intended to collect ideas for diagnostic tools that would have helped today, so that we can hopefully build them before this happens much more.

One possibility is that this is simply an MGS timeout that has drifted out of sync with how long Sidecar takes to boot in practice. We know Sidecar boot is nondeterministic (https://github.com/oxidecomputer/hardware-sidecar/issues/741), so if the timeout is marginal, the failure could show up only rarely and only on certain units.

Potential root causes I've floated, and tools that might help distinguish them, include:

1. Update was written incorrectly due to corruption, off-by-one, or other bug.
   - The ability to read out, or at least hash, the idle bank would let us check for this.
2. SP rebooted into new image which was written correctly, but something failed during initialization that prevented the network from coming up.
   - It would be really great to be able to read a dump from a previous boot out of the dump area, to see if anything panicked last boot.
3. Sidecar may have taken too long to start up for the timeout in MGS, and this might all be an illusion.
   - MGS may want to revise up that timeout (I would also argue for making it configurable, for the next time this happens).
   - We should take a pass over Sidecar startup and check for any optimizations we could make there.
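
To make the first idea concrete: if the idle bank could be read out, checking it against the image we intended to write is a byte-wise comparison, and the offset of the first difference would help tell corruption apart from an off-by-one. The following is only a sketch. `BankCheck` and `check_bank` are hypothetical names, not existing Hubris or MGS APIs, and it assumes the caller has already trimmed the read-back to the region the image is supposed to occupy.

```rust
/// Result of comparing a read-back of the idle bank against the image
/// we meant to write. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum BankCheck {
    /// Every byte matches and the lengths agree.
    Match,
    /// The contents first disagree at this byte offset.
    Differs { offset: usize },
    /// One side is a strict prefix of the other.
    LengthMismatch { expected: usize, actual: usize },
}

/// Compare the expected image with what was read back from the bank.
/// Assumption: `readback` has been trimmed to the image region.
pub fn check_bank(expected: &[u8], readback: &[u8]) -> BankCheck {
    // Look for the first differing byte within the common prefix.
    if let Some(offset) = expected
        .iter()
        .zip(readback)
        .position(|(a, b)| a != b)
    {
        return BankCheck::Differs { offset };
    }
    // No differing byte in the common prefix; check for a short or long read.
    if expected.len() != readback.len() {
        return BankCheck::LengthMismatch {
            expected: expected.len(),
            actual: readback.len(),
        };
    }
    BankCheck::Match
}
```

If only a hash of the idle bank is available, the same question reduces to comparing two digests, at the cost of not learning where the first difference is.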

Please add more ideas.

labbott · 2024-09-12T18:45:39Z

We need a way for the RoT to tell us that it triggered a bank swap, other than the ringbuf.

cbiffle · 2024-09-12T18:46:13Z

Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. oxidecomputer/management-gateway-service#284
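
The arithmetic behind "marginal" can be spelled out with a small sketch. It assumes, pessimistically, that a boot could be slower than the 26-second observation by the full 5 seconds of variability. `timeout_headroom` is a hypothetical helper, not something in MGS.

```rust
use std::time::Duration;

/// Slack left between the timeout and the slowest boot we should plan for.
/// Returns `None` when the worst case is longer than the timeout.
/// Assumption: the worst case is `observed_boot + variability`.
pub fn timeout_headroom(
    timeout: Duration,
    observed_boot: Duration,
    variability: Duration,
) -> Option<Duration> {
    // `checked_sub` yields `None` rather than panicking on underflow.
    timeout.checked_sub(observed_boot + variability)
}
```

With the numbers above, `timeout_headroom(30 s, 26 s, 5 s)` returns `None`: the worst case of 31 seconds is past the 30-second timeout, so a slow unit could time out even when the update itself succeeded.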

labbott · 2024-09-12T18:49:05Z

Read out/measure the auxflash (thanks @lzrd for the idea)

jgallagher · 2024-09-12T18:50:05Z

Ask the SP what its current time is (this is exclusively upstack work; the MgsRequest::CurrentTime message already exists and is supported on the SP: oxidecomputer/management-gateway-service#283)
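
One way upstack code might use this is sketched below. It assumes the SP's reported time is a monotonic counter that restarts from zero at boot. That is an assumption made for this example, not something stated in this thread. `RebootEvidence` and `classify` are hypothetical names.

```rust
use std::time::Duration;

/// What two successive SP time readings suggest. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum RebootEvidence {
    /// The counter went backwards, so the SP restarted between readings
    /// and has been up for roughly `up_for` since then.
    Rebooted { up_for: Duration },
    /// The counter did not go backwards; no reboot is evident.
    NoReboot,
}

/// Compare a reading taken before the update with one taken after.
/// Assumption: SP time counts up from zero at each boot.
pub fn classify(before: Duration, after: Duration) -> RebootEvidence {
    if after < before {
        RebootEvidence::Rebooted { up_for: after }
    } else {
        RebootEvidence::NoReboot
    }
}
```

A reading that goes backwards would show that the SP did reset, even if MGS gave up waiting before the network came back. Note that this sketch cannot detect a reboot if the SP has already been up for longer than the earlier reading by the time it is sampled again.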

cbiffle · 2024-09-12T19:02:03Z

In case anyone's investigating cores from this, we also get a crash of both the sequencer and power tasks. These crashes are both deliberate in the code and don't appear to be related, except possibly in causing some of the boot time nondeterminism.

https://github.com/oxidecomputer/hubris/blob/master/task/power/src/bsp/sidecar_bcd.rs#L42C1-L42C33

https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-seq-server/src/main.rs#L919

cbiffle · 2024-09-16T20:09:13Z

> Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. oxidecomputer/management-gateway-service#284

@rmustacc pointed out that at least part of the variability will be coming from our accidental hardware random number generator: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/116

One such debugging session, with logs and stuff, here: https://github.com/oxidecomputer/hardware-sidecar/issues/830

cbiffle added the "service processor" (Related to the service processor.) and "robustness" (Fixing this would improve robustness of deployed firmware) labels on Sep 12, 2024
