sled-diagnostics: Capture nvmeadm health logpage#10031

Draft
wfchandler wants to merge 1 commit into main from wc/sb-nvme-logpage-health

@wfchandler
Contributor

During recent customer installs we have found that the health logpage exposed by nvmeadm(8) was useful in identifying failing drives.

Add this output to support bundles.

@wfchandler wfchandler force-pushed the wc/sb-nvme-logpage-health branch 2 times, most recently from b1ba6d8 to e15ebe2 Compare March 11, 2026 20:52

@rmustacc rmustacc left a comment

Is this the first thing we're adding that relies on the disk working correctly to finish generating a support bundle? Assume we have a device that is hanging or cannot complete commands: how do we ensure that we don't hang the entire support bundle generation process?

Comment on lines +251 to +260
pub fn nvmeadm_logpage_health(nvme_num: u32) -> Command {
    let mut cmd = std::process::Command::new(PFEXEC);
    cmd.env_clear()
        .arg(NVMEADM)
        .arg("-v")
        .arg("get-logpage")
        .arg(&format!("nvme{nvme_num}"))
        .arg("health");
    cmd
}

If we're going to invoke this, please just use the -O option to get-logpage to send this entirely to a binary file that can be interpreted more efficiently with tools.
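A minimal sketch of what that suggestion could look like, modeled on the function under review. The function name, the `out` parameter, and the two path constants are assumptions for illustration (the real `PFEXEC` and `NVMEADM` constants are defined elsewhere in sled-diagnostics); only the `-O` option to `get-logpage` comes from the comment above.

```rust
use std::path::Path;
use std::process::Command;

// Assumed values for illustration; the real constants live in sled-diagnostics.
const PFEXEC: &str = "/usr/bin/pfexec";
const NVMEADM: &str = "/usr/sbin/nvmeadm";

/// Hypothetical variant of `nvmeadm_logpage_health` that asks nvmeadm to
/// write the raw health log page to `out` via `-O` instead of printing text.
pub fn nvmeadm_logpage_health_raw(nvme_num: u32, out: &Path) -> Command {
    let mut cmd = Command::new(PFEXEC);
    cmd.env_clear()
        .arg(NVMEADM)
        .arg("get-logpage")
        .arg("-O")
        .arg(out)
        .arg(format!("nvme{nvme_num}"))
        .arg("health");
    cmd
}
```

The caller would still need to pick an output path per controller and add the resulting file to the bundle; that part is not shown.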

Contributor Author

@rmustacc I think this is a "yes and" situation. When we're specifically looking for a problem with a disk, the binary files are superior. In scenarios where we're just performing a quick health check against the bundle, it's more convenient to have text output.

Text files can be analyzed on non-illumos hosts, and are trivial to read without extracting the files, e.g., bundle-cat bundle.zip --path '*logpage*' | grep -A 4 "Critical Warnings".

Happy to make a follow-on PR for the binary health log page, and any others you want.
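For readers who want the same quick text check without a shell, here is a rough sketch of the `grep -A` step as a plain function. The function name is hypothetical, and it makes no assumption about the layout of the nvmeadm text output beyond a line containing the search string.

```rust
/// Returns each line containing `needle` plus up to `after` following lines,
/// roughly what `grep -A <after> <needle>` prints (without `--` separators).
pub fn lines_after_match<'a>(text: &'a str, needle: &str, after: usize) -> Vec<&'a str> {
    let mut out = Vec::new();
    // Exclusive line index up to which lines are still being kept.
    let mut keep_until = 0;
    for (i, line) in text.lines().enumerate() {
        if line.contains(needle) {
            keep_until = keep_until.max(i + after + 1);
        }
        if i < keep_until {
            out.push(line);
        }
    }
    out
}
```

Calling `lines_after_match(&contents, "Critical Warnings", 4)` on the contents of a captured logpage file would mirror the shell pipeline quoted above.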

The text output format is not a stable interface and is going to change. So if we're going to build tooling on top of this, I think it's critical that we rely on something that will continue to work and not silently break.

Contributor Author

@wfchandler wfchandler Mar 17, 2026

It looks like the print-logpage CL you have in flight for illumos will cover both of our needs, or maybe get-logpage -p.

Perhaps I should just close this PR and wait for those commands to be available.

The same features for print-logpage work for get-logpage. However, if I were doing the support bundle, I would again just gather the thing we want once and then do whatever we want after the fact. Note, the changes going in there don't touch the extent logs today, but will in the future.

@wfchandler wfchandler force-pushed the wc/sb-nvme-logpage-health branch from e15ebe2 to cb97c8d Compare March 12, 2026 00:59
@wfchandler
Contributor Author

> Is this the first thing that we're adding that relies on the disk working correctly to actually finish generating a support bundle? Assume that we have a device that is hanging or cannot complete commands, how do we ensure that we don't hang the entire support bundle generation process?

No, these commands (and all others in sled-diagnostics) are executed with a 10-second timeout.
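To illustrate the idea of bounding each command with a timeout, here is a standard-library-only sketch. It is not the sled-diagnostics implementation, which may be structured quite differently; the function name and the 50 ms polling interval are assumptions, and output capture is omitted for brevity.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `cmd` and returns `Ok(Some(status))` if it exits within `timeout`,
/// or `Ok(None)` if it had to be killed once the deadline passed.
pub fn run_with_timeout(
    cmd: &mut Command,
    timeout: Duration,
) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd.stdin(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check for completion.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Kill and reap the child so it does not linger as a zombie.
            child.kill()?;
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(50));
    }
}
```

With this shape, a caller could pass the `Command` built for one controller together with `Duration::from_secs(10)` and treat `Ok(None)` as "device did not respond in time", moving on to the next command.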
