Skip to content

Conversation

@lesscodingmorehappiness
Copy link
Collaborator

Improve stability and resilience of verify_cpu_offline_channel_add test

This change refactors the verify_cpu_offline_channel_add test and related helper methods in cpusuite.py to improve robustness, reduce flakiness, and ensure correctness across various CPU and NIC configurations.

Tested on multiple different images:
Verified on Canonical Ubuntu 24.04 (x86_64 and Arm64) — passed.
Verified on RHEL 8.1 — skipped due to limited NIC capability.
SUSE, Oracle, Mariner -- passed

self, node: Node, log: Optional[Logger]
) -> Optional[int]:
try:
r = node.execute("ethtool -l eth0", sudo=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ethtool tool

@LiliDeng LiliDeng requested a review from Copilot November 6, 2025 06:34
try:
info = node.tools[Ethtool].get_device_channels_info("eth0", True)
candidates = []
for name in ("max_combined", "max_channels", "max_current"):
Copy link
Collaborator

@LiliDeng LiliDeng Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put them into ethtool tool

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please refer get_device_channels_info in ethtool, wrap these into ethtool

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the verify_cpu_offline_channel_add test case to improve robustness and handle edge cases when testing network channel changes with CPU hotplug. The refactoring introduces helper methods to safely read and set network channel configurations while respecting device capabilities and CPU constraints.

Key changes:

  • Adds comprehensive helper methods for reading device channel capabilities with fallback strategies
  • Implements safe channel setting with retry logic when exceeding device maximums
  • Replaces fragile random channel selection with constraint-aware logic
  • Improves error handling and cleanup with proper exception catching

Comment on lines 221 to 229
m = re.search(r"(?im)^\s*Combined\s*:\s*(\d+)\s*$", block.group(1))
if m:
return self._clamp_channels(int(m.group(1)))

ms = re.findall(r"(?im)^\s*Combined\s*:\s*(\d+)\s*$", raw)
if ms:
return self._clamp_channels(max(int(x) for x in ms))

ms2 = re.findall(r"(?im)(?:Maximum|Combined maximum)\s*:\s*(\d+)", raw)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These patterns should be added to ethtool

try:
chans = node.tools[Lsvmbus].get_device_channels(force_run=True)
for ch in chans:
if ch.class_id == "f8615163-df3e-46c5-913f-f2d2f965ed0e": # hv_netvsc
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set it as a constant in lsvmbus tool

Comment on lines 346 to 347
if ch.class_id == "f8615163-df3e-46c5-913f-f2d2f965ed0e":
assert_that(ch.channel_vp_map).is_length(expect_len)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set as "f8615163-df3e-46c5-913f-f2d2f965ed0e" as constant in lsvmbus tool

node.tools[Ethtool].change_device_channels_info("eth0", safe_origin)
log.debug("Restored channels to origin value: %d", safe_origin)
except Exception as e:
log.warning(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont use warning

try:
set_cpu_state_serial(log, node, idle, CPUState.ONLINE)
except Exception as e:
log.warning("Failed to bring CPUs online during cleanup: %s", e)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont use warning

log.debug(f"idle cpus: {idle_cpus}")
if len(idle_cpus) == 0:
idle = get_idle_cpus(node)
log.debug("Idle CPUs (after shrink to 1): %s", idle)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of log use this format log.debug(f"Idle CPUs (after shrink to 1): {idle}")

# reset max and current channel count into original ones
# by reloading hv_netvsc driver if hv_netvsc can be reload
# otherwise reboot vm
if node.tools[KernelConfig].is_built_as_module("CONFIG_HYPERV_NET"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this logic, is it unnecessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditional logic was removed to improve test reliability. Module reload could cause network stack issues after CPU hotplug, so we switched to a simpler "always reboot" approach that's slower but guaranteed to work in all environments.


# ---- CPUSuite helpers ----
def _clamp_channels(self, val: int) -> int:
return max(1, min(int(val), 64))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is it used?

log.debug(f"Reduced current channels at phase2: {cur2}")

tgt2 = self._pick_target_not_eq_current(cur2, upper2)
new2 = self._set_channels_with_retry(log, node, tgt2, cur2, upper2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using numeric suffixes; please use meaningful names instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment regarding usage of "tgt", does it mean target?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants