-
Notifications
You must be signed in to change notification settings - Fork 223
fix verify_cpu_offline_channel_add #4092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| self, node: Node, log: Optional[Logger] | ||
| ) -> Optional[int]: | ||
| try: | ||
| r = node.execute("ethtool -l eth0", sudo=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ethtool tool
| try: | ||
| info = node.tools[Ethtool].get_device_channels_info("eth0", True) | ||
| candidates = [] | ||
| for name in ("max_combined", "max_channels", "max_current"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put them into ethtool tool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please refer get_device_channels_info in ethtool, wrap these into ethtool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the verify_cpu_offline_channel_add test case to improve robustness and handle edge cases when testing network channel changes with CPU hotplug. The refactoring introduces helper methods to safely read and set network channel configurations while respecting device capabilities and CPU constraints.
Key changes:
- Adds comprehensive helper methods for reading device channel capabilities with fallback strategies
- Implements safe channel setting with retry logic when exceeding device maximums
- Replaces fragile random channel selection with constraint-aware logic
- Improves error handling and cleanup with proper exception catching
| m = re.search(r"(?im)^\s*Combined\s*:\s*(\d+)\s*$", block.group(1)) | ||
| if m: | ||
| return self._clamp_channels(int(m.group(1))) | ||
|
|
||
| ms = re.findall(r"(?im)^\s*Combined\s*:\s*(\d+)\s*$", raw) | ||
| if ms: | ||
| return self._clamp_channels(max(int(x) for x in ms)) | ||
|
|
||
| ms2 = re.findall(r"(?im)(?:Maximum|Combined maximum)\s*:\s*(\d+)", raw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These patterns should be added to ethtool
| try: | ||
| chans = node.tools[Lsvmbus].get_device_channels(force_run=True) | ||
| for ch in chans: | ||
| if ch.class_id == "f8615163-df3e-46c5-913f-f2d2f965ed0e": # hv_netvsc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set it as a constant in lsvmbus tool
| if ch.class_id == "f8615163-df3e-46c5-913f-f2d2f965ed0e": | ||
| assert_that(ch.channel_vp_map).is_length(expect_len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set as "f8615163-df3e-46c5-913f-f2d2f965ed0e" as constant in lsvmbus tool
| node.tools[Ethtool].change_device_channels_info("eth0", safe_origin) | ||
| log.debug("Restored channels to origin value: %d", safe_origin) | ||
| except Exception as e: | ||
| log.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use warning
| try: | ||
| set_cpu_state_serial(log, node, idle, CPUState.ONLINE) | ||
| except Exception as e: | ||
| log.warning("Failed to bring CPUs online during cleanup: %s", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use warning
ba5eaf7 to
68cae96
Compare
| log.debug(f"idle cpus: {idle_cpus}") | ||
| if len(idle_cpus) == 0: | ||
| idle = get_idle_cpus(node) | ||
| log.debug("Idle CPUs (after shrink to 1): %s", idle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of log use this format log.debug(f"Idle CPUs (after shrink to 1): {idle}")
| # reset max and current channel count into original ones | ||
| # by reloading hv_netvsc driver if hv_netvsc can be reload | ||
| # otherwise reboot vm | ||
| if node.tools[KernelConfig].is_built_as_module("CONFIG_HYPERV_NET"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this logic, is it unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conditional logic was removed to improve test reliability. Module reload could cause network stack issues after CPU hotplug, so we switched to a simpler "always reboot" approach that's slower but guaranteed to work in all environments.
|
|
||
| # ---- CPUSuite helpers ---- | ||
| def _clamp_channels(self, val: int) -> int: | ||
| return max(1, min(int(val), 64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is it used?
| log.debug(f"Reduced current channels at phase2: {cur2}") | ||
|
|
||
| tgt2 = self._pick_target_not_eq_current(cur2, upper2) | ||
| new2 = self._set_channels_with_retry(log, node, tgt2, cur2, upper2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using numeric suffixes; please use meaningful names instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment regarding usage of "tgt", does it mean target?
Improve stability and resilience of verify_cpu_offline_channel_add test
This change refactors the verify_cpu_offline_channel_add test and related helper methods in cpusuite.py to improve robustness, reduce flakiness, and ensure correctness across various CPU and NIC configurations.
Tested on multiple different images:
Verified on Canonical Ubuntu 24.04 (x86_64 and Arm64) — passed.
Verified on RHEL 8.1 — skipped due to limited NIC capability.
SUSE, Oracle, Mariner -- passed