Skip to content

Conversation

@nbmaiti
Copy link

@nbmaiti nbmaiti commented Feb 4, 2026

  • Fixes error handling in LME execution and adds timeouts to various operations to prevent indefinite hangs.
  • Retry operation restructured based on device status.

Signed-off-by: Nabendu Maiti nabendu.bikash.maiti@intel.com

Retry-aware flow (device open → RPS data → channel lifecycle)

sequenceDiagram
    participant RPS as RPS
    participant Exec as Executor
    participant LME as LMEConnection
    participant MEI as "/dev/mei"
    participant AMT as "AMT FW"

    RPS->>Exec: Activation payload arrives
    Note over Exec: MakeItSo loop blocks until HandleDataFromRPS completes
    Exec->>LME: Connect()
    
    loop Up to 4 attempts
        alt send fails with device error
            Note over LME,MEI: Device error detected
            LME->>MEI: Close()
            LME->>MEI: Open(true)
            Note over LME,MEI: Device handle reopened
            LME->>AMT: APF_PROTOCOL_VERSION exchange
            loop execute retry
                alt AMT empty response or no device
                    Note over LME,AMT: AMT Unavailable, break to retry Connect
                else AMT busy or timeout
                    Note over LME,AMT: Backoff in MEI driver, retry execute
                else AMT responds
                    Note over LME,AMT: Protocol exchange complete
                end
            end
        else send succeeds
            LME->>AMT: APF_CHANNEL_OPEN
            LME-->>Exec: WaitGroup.Add(1)
            activate LME
            Note over LME: Listen goroutine starts
            Note over LME: 2s idle timer armed
            AMT-->>LME: APF_CHANNEL_OPEN_CONFIRMATION
            LME-->>Exec: WaitGroup.Done
            deactivate LME
        end
    end
    
    alt Channel opened successfully
        Exec->>LME: Send payload to AMT
        AMT-->>LME: Response or none
        LME-->>Exec: Data forwarded back
        Note over Exec: Wait for WaitGroup
        Exec->>RPS: Send response to server
        Note over Exec,LME: NO explicit disconnect for LME path
        Note over Exec,LME: (LMS path has defer close, LME does not)
        Note over LME: 2s idle timer may send APF_CHANNEL_CLOSE later (async)
        Note over LME: MEI device handle stays open for reuse
        Note over Exec: HandleDataFromRPS returns false
        Note over Exec: Loop continues to next RPS payload (serialized)
        Note over RPS,Exec: Next payload may arrive before timer fires
    else All 4 attempts failed or timeout
        Note over Exec: HandleDataFromRPS returns true
        Note over Exec: Close data and error channels
        Note over Exec: MakeItSo exits, defer closes MEI device
    end
Loading

@github-actions
Copy link

github-actions bot commented Feb 4, 2026

Unit tests (Go)

  1 files  ±  0   21 suites  +1   13s ⏱️ +11s
182 tests  - 632  180 ✅  - 634  0 💤 ±0  2 ❌ +2 
181 runs   - 633  175 ✅  - 639  0 💤 ±0  6 ❌ +6 

For more details on these failures, see this check.

Results for commit 34d2adf. ± Comparison against base commit 3a1aca6.

This pull request removes 633 and adds 1 tests. Note that renamed tests count towards both.
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#0
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#1
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#10
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#11
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#12
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#13
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#14
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#15
github.com/device-management-toolkit/rpc-go/v2/internal/cli ‑ FuzzDeactivate/seed#16
…
TestMain

♻️ This comment has been updated with latest results.

@nbmaiti nbmaiti force-pushed the lme_timeout_fix branch 2 times, most recently from 33656f0 to 86f0391 Compare February 4, 2026 19:51
@nbmaiti nbmaiti changed the title Fix: lme error handling and timeouts fix: lme error handling and timeouts Feb 4, 2026
Copy link

@sudhir-intc sudhir-intc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please have this PR against 2.x.x branch.
Lets make if available in 2.x.x first and then have it on main

- Fixes error handling in LME execution and adds timeouts to various
  operations to prevent indefinite hangs
- Retry operation restructured based on device status
- Added centralized timeout constants
- Enhanced HECI driver with thread-safe device access (Issue #6)
- Improved retry logic with exponential backoff (Issue #5)
- Fixed timer goroutine leaks on reconnection (Issue #7)
- Added device stabilization delays after reinitialization (Issue #1)
- Prevented deadlock with non-blocking error channel sends (Issue #3)
- Added AMT response timeout to prevent indefinite hangs
- Persistent LME connection with proper lifecycle management

Signed-off-by: Nabendu Maiti <nabendu.bikash.maiti@intel.com>
Implement comprehensive resource management to prevent MEI device
conflicts and race conditions during local AMT activation.

Changes:
- Add Close() method to AMTCommand, GoWSMANMessages, and LocalTransport
  interfaces to properly release MEI device resources
- Close AMT command before creating WSMAN clients in activateCCM/ACM
  to avoid EBUSY errors from concurrent MEI access
- Add defer statements to ensure WSMAN clients are properly closed
  after activation completes
- Add Close() to WSMANer interface for consistent cleanup
- Fix LocalTransport error handling to prevent nil pointer panics
  when LME connection fails without response data
- Add defer amtCommand.Close() in base.go AfterApply to release
  resources after getting control mode

This fixes the race condition where multiple components tried to
access /dev/mei0 simultaneously, causing "device or resource busy"
errors during local activation.

Signed-off-by: Nabendu Maiti <nabendu.bikash.maiti@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants