fix: lme error handling and timeouts #1142
base: main
Unit tests (Go)
1 files ±0, 21 suites +1, 13s ⏱️ +11s
For more details on these failures, see this check.
Results for commit 34d2adf. ± Comparison against base commit 3a1aca6.
This pull request removes 633 and adds 1 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
33656f0 to 86f0391
sudhir-intc left a comment
Could you please open this PR against the 2.x.x branch?
Let's make it available in 2.x.x first and then bring it to main.
86f0391 to 86a07d1
- Fixes error handling in LME execution and adds timeouts to various operations to prevent indefinite hangs
- Retry operation restructured based on device status
- Added centralized timeout constants
- Enhanced HECI driver with thread-safe device access (Issue #6)
- Improved retry logic with exponential backoff (Issue #5)
- Fixed timer goroutine leaks on reconnection (Issue #7)
- Added device stabilization delays after reinitialization (Issue #1)
- Prevented deadlock with non-blocking error channel sends (Issue #3)
- Added AMT response timeout to prevent indefinite hangs
- Persistent LME connection with proper lifecycle management

Signed-off-by: Nabendu Maiti <nabendu.bikash.maiti@intel.com>
Implement comprehensive resource management to prevent MEI device conflicts and race conditions during local AMT activation.

Changes:
- Add Close() method to AMTCommand, GoWSMANMessages, and LocalTransport interfaces to properly release MEI device resources
- Close AMT command before creating WSMAN clients in activateCCM/ACM to avoid EBUSY errors from concurrent MEI access
- Add defer statements to ensure WSMAN clients are properly closed after activation completes
- Add Close() to WSMANer interface for consistent cleanup
- Fix LocalTransport error handling to prevent nil pointer panics when LME connection fails without response data
- Add defer amtCommand.Close() in base.go AfterApply to release resources after getting control mode

This fixes the race condition where multiple components tried to access /dev/mei0 simultaneously, causing "device or resource busy" errors during local activation.

Signed-off-by: Nabendu Maiti <nabendu.bikash.maiti@intel.com>
86a07d1 to 34d2adf
Signed-off-by: Nabendu Maiti nabendu.bikash.maiti@intel.com
Retry-aware flow (device open → RPS data → channel lifecycle)
sequenceDiagram
    participant RPS as RPS
    participant Exec as Executor
    participant LME as LMEConnection
    participant MEI as "/dev/mei"
    participant AMT as "AMT FW"

    RPS->>Exec: Activation payload arrives
    Note over Exec: MakeItSo loop blocks until HandleDataFromRPS completes
    Exec->>LME: Connect()
    loop Up to 4 attempts
        alt send fails with device error
            Note over LME,MEI: Device error detected
            LME->>MEI: Close()
            LME->>MEI: Open(true)
            Note over LME,MEI: Device handle reopened
            LME->>AMT: APF_PROTOCOL_VERSION exchange
            loop execute retry
                alt AMT empty response or no device
                    Note over LME,AMT: AMT Unavailable, break to retry Connect
                else AMT busy or timeout
                    Note over LME,AMT: Backoff in MEI driver, retry execute
                else AMT responds
                    Note over LME,AMT: Protocol exchange complete
                end
            end
        else send succeeds
            LME->>AMT: APF_CHANNEL_OPEN
            LME-->>Exec: WaitGroup.Add(1)
            activate LME
            Note over LME: Listen goroutine starts
            Note over LME: 2s idle timer armed
            AMT-->>LME: APF_CHANNEL_OPEN_CONFIRMATION
            LME-->>Exec: WaitGroup.Done
            deactivate LME
        end
    end
    alt Channel opened successfully
        Exec->>LME: Send payload to AMT
        AMT-->>LME: Response or none
        LME-->>Exec: Data forwarded back
        Note over Exec: Wait for WaitGroup
        Exec->>RPS: Send response to server
        Note over Exec,LME: NO explicit disconnect for LME path
        Note over Exec,LME: (LMS path has defer close, LME does not)
        Note over LME: 2s idle timer may send APF_CHANNEL_CLOSE later (async)
        Note over LME: MEI device handle stays open for reuse
        Note over Exec: HandleDataFromRPS returns false
        Note over Exec: Loop continues to next RPS payload (serialized)
        Note over RPS,Exec: Next payload may arrive before timer fires
    else All 4 attempts failed or timeout
        Note over Exec: HandleDataFromRPS returns true
        Note over Exec: Close data and error channels
        Note over Exec: MakeItSo exits, defer closes MEI device
    end