fix: lme error handling and timeouts #1142

nbmaiti · 2026-02-04T19:23:36Z

Fixes error handling in LME execution and adds timeouts to various operations to prevent indefinite hangs.
Retry operation restructured based on device status.

Signed-off-by: Nabendu Maiti nabendu.bikash.maiti@intel.com

Retry-aware flow (device open → RPS data → channel lifecycle)

sequenceDiagram
    participant RPS as RPS
    participant Exec as Executor
    participant LME as LMEConnection
    participant MEI as "/dev/mei"
    participant AMT as "AMT FW"

    RPS->>Exec: Activation payload arrives
    Note over Exec: MakeItSo loop blocks until HandleDataFromRPS completes
    Exec->>LME: Connect()
    
    loop Up to 4 attempts
        alt send fails with device error
            Note over LME,MEI: Device error detected
            LME->>MEI: Close()
            LME->>MEI: Open(true)
            Note over LME,MEI: Device handle reopened
            LME->>AMT: APF_PROTOCOL_VERSION exchange
            loop execute retry
                alt AMT empty response or no device
                    Note over LME,AMT: AMT Unavailable, break to retry Connect
                else AMT busy or timeout
                    Note over LME,AMT: Backoff in MEI driver, retry execute
                else AMT responds
                    Note over LME,AMT: Protocol exchange complete
                end
            end
        else send succeeds
            LME->>AMT: APF_CHANNEL_OPEN
            LME-->>Exec: WaitGroup.Add(1)
            activate LME
            Note over LME: Listen goroutine starts
            Note over LME: 2s idle timer armed
            AMT-->>LME: APF_CHANNEL_OPEN_CONFIRMATION
            LME-->>Exec: WaitGroup.Done
            deactivate LME
        end
    end
    
    alt Channel opened successfully
        Exec->>LME: Send payload to AMT
        AMT-->>LME: Response or none
        LME-->>Exec: Data forwarded back
        Note over Exec: Wait for WaitGroup
        Exec->>RPS: Send response to server
        Note over Exec,LME: NO explicit disconnect for LME path
        Note over Exec,LME: (LMS path has defer close, LME does not)
        Note over LME: 2s idle timer may send APF_CHANNEL_CLOSE later (async)
        Note over LME: MEI device handle stays open for reuse
        Note over Exec: HandleDataFromRPS returns false
        Note over Exec: Loop continues to next RPS payload (serialized)
        Note over RPS,Exec: Next payload may arrive before timer fires
    else All 4 attempts failed or timeout
        Note over Exec: HandleDataFromRPS returns true
        Note over Exec: Close data and error channels
        Note over Exec: MakeItSo exits, defer closes MEI device
    end

github-actions · 2026-02-04T19:24:29Z

Unit tests (Go)

1 files ± 0 21 suites +1 13s ⏱️ +11s
182 tests - 632 180 ✅ - 634 0 💤 ±0 2 ❌ +2
181 runs - 633 175 ✅ - 639 0 💤 ±0 6 ❌ +6

For more details on these failures, see this check.

Results for commit 34d2adf. ± Comparison against base commit 3a1aca6.

This pull request removes 633 and adds 1 tests. Note that renamed tests count towards both.

github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#0
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#1
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#10
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#11
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#12
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#13
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#14
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#15
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#16
…

TestMain

♻️ This comment has been updated with latest results.

sudhir-intc

Could you please have this PR against 2.x.x branch.
Lets make if available in 2.x.x first and then have it on main

- Fixes error handling in LME execution and adds timeouts to various operations to prevent indefinite hangs - Retry operation restructured based on device status - Added centralized timeout constants - Enhanced HECI driver with thread-safe device access (Issue #6) - Improved retry logic with exponential backoff (Issue #5) - Fixed timer goroutine leaks on reconnection (Issue #7) - Added device stabilization delays after reinitialization (Issue #1) - Prevented deadlock with non-blocking error channel sends (Issue #3) - Added AMT response timeout to prevent indefinite hangs - Persistent LME connection with proper lifecycle management Signed-off-by: Nabendu Maiti <nabendu.bikash.maiti@intel.com>

Implement comprehensive resource management to prevent MEI device conflicts and race conditions during local AMT activation. Changes: - Add Close() method to AMTCommand, GoWSMANMessages, and LocalTransport interfaces to properly release MEI device resources - Close AMT command before creating WSMAN clients in activateCCM/ACM to avoid EBUSY errors from concurrent MEI access - Add defer statements to ensure WSMAN clients are properly closed after activation completes - Add Close() to WSMANer interface for consistent cleanup - Fix LocalTransport error handling to prevent nil pointer panics when LME connection fails without response data - Add defer amtCommand.Close() in base.go AfterApply to release resources after getting control mode This fixes the race condition where multiple components tried to access /dev/mei0 simultaneously, causing "device or resource busy" errors during local activation. Signed-off-by: Nabendu Maiti <nabendu.bikash.maiti@intel.com>

nbmaiti requested review from rsdmike and sudhir-intc February 4, 2026 19:23

nbmaiti force-pushed the lme_timeout_fix branch 2 times, most recently from 33656f0 to 86f0391 Compare February 4, 2026 19:51

nbmaiti changed the title ~~Fix: lme error handling and timeouts~~ fix: lme error handling and timeouts Feb 4, 2026

sudhir-intc reviewed Feb 5, 2026

View reviewed changes

nbmaiti force-pushed the lme_timeout_fix branch from 86f0391 to 86a07d1 Compare February 8, 2026 18:16

nbmaiti added 2 commits February 8, 2026 18:17

nbmaiti force-pushed the lme_timeout_fix branch from 86a07d1 to 34d2adf Compare February 8, 2026 18:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: lme error handling and timeouts #1142

fix: lme error handling and timeouts #1142

Uh oh!

nbmaiti commented Feb 4, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 4, 2026 •

edited

Loading

Uh oh!

sudhir-intc left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: lme error handling and timeouts #1142

Are you sure you want to change the base?

fix: lme error handling and timeouts #1142

Uh oh!

Conversation

nbmaiti commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Retry-aware flow (device open → RPS data → channel lifecycle)

Uh oh!

github-actions bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Unit tests (Go)

Uh oh!

sudhir-intc left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nbmaiti commented Feb 4, 2026 •

edited

Loading

github-actions bot commented Feb 4, 2026 •

edited

Loading