sled-diagnostics: Capture nvmeadm health logpage by wfchandler · Pull Request #10031 · oxidecomputer/omicron

wfchandler · 2026-03-11T17:45:44Z

During recent customer installs we have found that the health logpage exposed by nvmeadm(8) was useful in identifying failing drives.

Add this output to support bundles.

rmustacc

Is this the first thing that we're adding that relies on the disk working correctly to actually finish generating a support bundle? Assume that we have a device that is hanging or cannot complete commands. How do we ensure that we don't hang the entire support bundle generation process?

rmustacc · 2026-03-11T23:45:18Z

sled-diagnostics/src/queries.rs

+pub fn nvmeadm_logpage_health(nvme_num: u32) -> Command {
+    let mut cmd = std::process::Command::new(PFEXEC);
+    cmd.env_clear()
+        .arg(NVMEADM)
+        .arg("-v")
+        .arg("get-logpage")
+        .arg(&format!("nvme{nvme_num}"))
+        .arg("health");
+    cmd
+}


If we're going to invoke this, please just use the -O option to get-logpage to send this entirely to a binary file that can be interpreted more efficiently with tools.
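A minimal sketch of what the suggested binary capture could look like as a sibling of `nvmeadm_logpage_health` above. Assumptions, all hypothetical: the function name `nvmeadm_logpage_health_binary`, the `PFEXEC`/`NVMEADM` paths (the real constants are not shown in the diff), and the placement of `-O <file>` before the controller and log page arguments. Check nvmeadm(8) on the target illumos build before relying on the argument order.

```rust
use std::path::Path;
use std::process::Command;

// ASSUMPTION: typical illumos install paths; the real PFEXEC/NVMEADM
// constants used by sled-diagnostics are not shown in the diff.
const PFEXEC: &str = "/usr/bin/pfexec";
const NVMEADM: &str = "/usr/sbin/nvmeadm";

/// Hypothetical variant of `nvmeadm_logpage_health` that asks nvmeadm to
/// write the raw health log page to `out` instead of printing text.
/// ASSUMPTION: `-O <file>` is accepted before the controller and log page.
pub fn nvmeadm_logpage_health_binary(nvme_num: u32, out: &Path) -> Command {
    let mut cmd = Command::new(PFEXEC);
    // Start from an empty environment, as the text variant does.
    cmd.env_clear()
        .arg(NVMEADM)
        .arg("get-logpage")
        .arg("-O")
        .arg(out)
        .arg(format!("nvme{nvme_num}"))
        .arg("health");
    cmd
}
```

The output path is a parameter so the caller decides where in the bundle the raw page lands. The `-v` flag from the text variant is simply omitted here.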

wfchandler · 2026-03-16

@rmustacc I think this is a "yes and" situation. When we're specifically looking for a problem with a disk, the binary files are superior. In scenarios where we're just performing a quick health check against the bundle, it's more convenient to have text output.

Text files can be analyzed on non-illumos hosts, and are trivial to read without extracting the files, e.g., `bundle-cat bundle.zip --path '*logpage*' | grep -A 4 "Critical Warnings"`.

Happy to make a follow-on PR for the binary health log page, and any others you want.
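The `grep -A 4` pipeline above prints each matching line plus the four lines after it. For tooling written in Rust, the same extraction over captured text is only a few lines; `grep_after_context` is a hypothetical helper, and the sample text in the check below is synthetic, not real `nvmeadm` output. Since the text format is not a stable interface, this only suits ad hoc inspection.

```rust
/// Return each line containing `pattern` together with up to `after`
/// following lines, roughly what `grep -A <after> <pattern>` prints.
/// Unlike grep, overlapping matches are not merged and no `--`
/// separators are produced.
pub fn grep_after_context<'a>(text: &'a str, pattern: &str, after: usize) -> Vec<Vec<&'a str>> {
    let lines: Vec<&str> = text.lines().collect();
    lines
        .iter()
        .enumerate()
        .filter(|(_, line)| line.contains(pattern))
        .map(|(i, _)| {
            // Clamp the context window at the end of the input.
            let end = (i + 1 + after).min(lines.len());
            lines[i..end].to_vec()
        })
        .collect()
}
```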

rmustacc · 2026-03-16

The text output format is not a stable interface and is going to change. So if we're going to build tooling on top of this, I think it's critical that we use something that will continue to work and not silently break.

wfchandler · 2026-03-17

It looks like the `print-logpage` CL you have in flight for illumos will cover both of our needs, or maybe `get-logpage -p`.

Perhaps I should just close this PR and wait for those commands to be available.

rmustacc · 2026-03-17

The same features for `print-logpage` work for `get-logpage`. However, if I were doing the support bundle, I would again just gather the thing we want once and then do whatever we want after the fact. Note that the changes going in there don't touch the extent logs today, but will in the future.
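As a sketch of "gather once, interpret after the fact", here is a decoder for a few fields of the NVMe SMART / Health Information log page (log identifier 02h), using byte offsets from the NVMe base specification. The key ASSUMPTION is that the file written by `nvmeadm get-logpage -O` is exactly the raw 512-byte page with no extra header; that should be verified against nvmeadm(8). `HealthLog`, `parse_health_log`, and `critical_warnings` are hypothetical names. If the assumption holds, the decoder depends on the spec layout rather than on nvmeadm's text formatter, and it runs on any host.

```rust
/// Size of the SMART / Health Information log page in the NVMe spec.
const HEALTH_LOG_LEN: usize = 512;

/// A few fields of the health log page; offsets per the NVMe base spec.
#[derive(Debug, PartialEq)]
pub struct HealthLog {
    pub critical_warning: u8,              // byte 0, bitfield
    pub composite_temp_kelvin: u16,        // bytes 1..=2, little-endian
    pub available_spare_pct: u8,           // byte 3
    pub available_spare_threshold_pct: u8, // byte 4
    pub percentage_used: u8,               // byte 5
    pub power_on_hours: u128,              // bytes 128..=143
    pub media_errors: u128,                // bytes 160..=175, media and data integrity errors
}

/// Read a little-endian 128-bit counter starting at `off`.
fn le_u128(buf: &[u8], off: usize) -> u128 {
    let mut b = [0u8; 16];
    b.copy_from_slice(&buf[off..off + 16]);
    u128::from_le_bytes(b)
}

/// Decode a raw health log page.
/// ASSUMPTION: `buf` is the bare 512-byte page, as described above.
pub fn parse_health_log(buf: &[u8]) -> Result<HealthLog, String> {
    if buf.len() != HEALTH_LOG_LEN {
        return Err(format!("expected {} bytes, got {}", HEALTH_LOG_LEN, buf.len()));
    }
    Ok(HealthLog {
        critical_warning: buf[0],
        composite_temp_kelvin: u16::from_le_bytes([buf[1], buf[2]]),
        available_spare_pct: buf[3],
        available_spare_threshold_pct: buf[4],
        percentage_used: buf[5],
        power_on_hours: le_u128(buf, 128),
        media_errors: le_u128(buf, 160),
    })
}

/// Names of the Critical Warning (byte 0) bits that are set.
pub fn critical_warnings(cw: u8) -> Vec<&'static str> {
    const BITS: [(u8, &str); 6] = [
        (0, "available spare below threshold"),
        (1, "temperature outside threshold"),
        (2, "NVM subsystem reliability degraded"),
        (3, "media placed in read-only mode"),
        (4, "volatile memory backup device failed"),
        (5, "persistent memory region read-only"), // NVMe 1.4 and later
    ];
    BITS.iter()
        .filter(|&&(bit, _)| cw & (1u8 << bit) != 0)
        .map(|&(_, name)| name)
        .collect()
}
```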


wfchandler · 2026-03-16T15:52:03Z

> Is this the first thing that we're adding that relies on the disk working correctly to actually finish generating a support bundle? Assume that we have a device that is hanging or cannot complete commands. How do we ensure that we don't hang the entire support bundle generation process?

No, these commands (and all others in sled-diagnostics) are executed with a 10-second timeout.
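To illustrate the general technique behind a per-command timeout, here is a std-only sketch: spawn the child, poll `try_wait` until a deadline, then kill it. This is not the sled-diagnostics implementation (which is not shown in this thread and may well be async); `run_with_timeout` is a hypothetical name.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Run `cmd`, waiting at most `timeout` for it to exit. On timeout the
/// child is killed (SIGKILL on Unix) and `ErrorKind::TimedOut` is returned.
pub fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<ExitStatus> {
    let mut child: Child = cmd.spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check for exit.
        if let Some(status) = child.try_wait()? {
            return Ok(status);
        }
        if Instant::now() >= deadline {
            // Best-effort kill. We deliberately avoid the blocking `wait()`:
            // a process stuck in the kernel on a hung device may not exit
            // promptly even after SIGKILL, and blocking would defeat the
            // timeout. One non-blocking `try_wait()` reaps it if it is gone.
            let _ = child.kill();
            let _ = child.try_wait();
            return Err(io::Error::new(io::ErrorKind::TimedOut, "command timed out"));
        }
        thread::sleep(Duration::from_millis(50));
    }
}
```

For the command in this PR, one would pass the `Command` returned by `nvmeadm_logpage_health` together with `Duration::from_secs(10)`. Stdio is inherited in this sketch. Capturing output through piped stdout would also require draining the pipe concurrently, so a chatty child cannot block on a full pipe before the deadline.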

wfchandler force-pushed the wc/sb-nvme-logpage-health branch 2 times, most recently from b1ba6d8 to e15ebe2, March 11, 2026 20:52

rmustacc reviewed Mar 11, 2026


Commit cb97c8d: sled-diagnostics: Capture nvmeadm health logpage


wfchandler force-pushed the wc/sb-nvme-logpage-health branch from e15ebe2 to cb97c8d, March 12, 2026 00:59


wfchandler wants to merge 1 commit into main from wc/sb-nvme-logpage-health
