@wilson-seok wilson-seok commented Dec 4, 2025

Description of the issue (symptom, root cause, how it was resolved)

  • The NMS cpu_impl produces a wrong output tensor, which causes an accuracy drop in validation on BMG (MSCOCO mAP 39.3% -> 30.4%).
  • The issue disappears when get_stream().finish() is called before the NMS cpu_impl executes.
  • From a memory dump of the NMS inputs, I found that input0 (boxes) is a concat that was optimized out by the prepare buffer fusing pass.
  • The current execute_impl() of ocl_impl checks can_be_optimized() first and returns aggregate_events(events). But aggregate_events() doesn't wait for the dependent events when QueueTypes=in_order and SyncMethods=none, which is the dGPU use case.
  • The concat's user is a cpu_impl (NMS), so needs_completion_event() is true for the concat and all of its events must be completed before it returns. I added handling for this case with wait_for_events().
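The control flow described in the bullets above can be sketched as follows. This is a minimal, self-contained model and not the plugin's code: `Event`, `Stream`, `Instance`, and the free function `execute_impl` are hypothetical stand-ins, and the only behavior taken from the description is that `aggregate_events()` does not wait when the queue is in-order with no sync method, while `wait_for_events()` does.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a GPU event; `done` models completion.
struct Event {
    bool done = false;
};
using EventPtr = std::shared_ptr<Event>;

// Hypothetical stand-in for the stream.
struct Stream {
    bool in_order = true;   // models QueueTypes=in_order
    bool sync_none = true;  // models SyncMethods=none

    // Models a blocking wait: every event is complete when this returns.
    void wait_for_events(const std::vector<EventPtr>& events) {
        for (const auto& e : events) e->done = true;
    }

    // With an in-order queue and no sync method nothing is waited on or
    // tracked, so the dependencies may still be in flight afterwards.
    EventPtr aggregate_events(const std::vector<EventPtr>& events) {
        if (in_order && sync_none) return nullptr;
        auto group = std::make_shared<Event>();
        group->done = true;
        for (const auto& e : events) group->done = group->done && e->done;
        return group;
    }
};

// Hypothetical stand-in for a primitive instance.
struct Instance {
    bool optimized_out;
    bool has_cpu_user;
    bool can_be_optimized() const { return optimized_out; }
    bool needs_completion_event() const { return has_cpu_user; }
};

// Sketch of the fixed path: an optimized-out instance with a CPU consumer
// waits for its dependencies before handing the buffer over.
EventPtr execute_impl(Stream& stream, const Instance& instance,
                      const std::vector<EventPtr>& events) {
    if (instance.can_be_optimized()) {
        if (instance.needs_completion_event()) {
            stream.wait_for_events(events);
        }
        return stream.aggregate_events(events);
    }
    // Kernel enqueue for non-optimized instances is omitted in this sketch.
    return stream.aggregate_events(events);
}
```

In this model, an optimized-out concat with a CPU user leaves `execute_impl` with all dependency events complete, while one with only GPU users returns without waiting, as before the change.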

The code and line that caused this issue (if it is not changed directly)

Problematic graph

  • Execution graph around NMS: input0 is the optimized-out concat.
[image: execution graph]

Checklist

  • Is it a proper fix? (not a workaround)
  • Did you include a test case for this fix, if necessary?
  • Did you review existing tests that can be extended to cover this scenario? Which tests did you review?

Tickets:

  • 176971

@wilson-seok wilson-seok requested review from a team as code owners December 4, 2025 09:09
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Dec 4, 2025
@wilson-seok wilson-seok requested a review from Copilot December 5, 2025 01:37
Copilot AI left a comment

Pull request overview

This PR fixes a critical synchronization issue in the GPU plugin where optimized-out operations (like concat) were not properly completing their dependent events before CPU implementations (like NMS) accessed their outputs, causing accuracy degradation in MSCOCO validation (39.3% → 30.4% mAP).

Key Changes:

  • Added event completion check for optimized-out instances that have CPU consumers requiring synchronization
  • Ensures wait_for_events() is called when an optimized instance needs completion events before returning aggregated events

@hyunback hyunback left a comment

LGTM

Comment on lines 247 to 249
    if (instance.needs_completion_event()) {
        stream.wait_for_events(events);
    }

I think we can return here instead of going on to call aggregate_events() on the next line.

@e-ddykim I added return here. Thanks!

    if (instance.can_be_optimized()) {
        if (instance.needs_completion_event()) {
            stream.wait_for_events(events);
        }

Did you review the existing places that handle needs_completion_event? If I remember correctly, we already handle this case, so it is surprising that it is not working as expected.
I found this comment in primitive_inst.cpp and this seems to be relevant.

        // Prepare dependencies events in case of OOO queue, CPU implementation,
        // or optimized_out impl which has CPU users (needs_completion_event() && !is_output() condition)
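Read literally, the condition in that comment is a three-way OR. The predicate below is only a sketch of that reading: `QueueType`, `InstState`, and `needs_dependency_events` are hypothetical names, not identifiers from primitive_inst.cpp.

```cpp
#include <cassert>

// Hypothetical queue kinds; "OOO" in the comment is taken to mean out-of-order.
enum class QueueType { in_order, out_of_order };

// Hypothetical snapshot of the flags the comment refers to.
struct InstState {
    bool is_cpu_impl;
    bool can_be_optimized;
    bool needs_completion_event;
    bool is_output;
};

// True when dependency events should be prepared for this instance:
// an out-of-order queue, a CPU implementation, or an optimized-out
// instance that has CPU users and is not a network output.
bool needs_dependency_events(QueueType queue, const InstState& s) {
    return queue == QueueType::out_of_order
        || s.is_cpu_impl
        || (s.can_be_optimized && s.needs_completion_event && !s.is_output);
}
```

Under this reading, the optimized-out concat feeding NMS satisfies the third clause even on an in-order queue, so its dependency events are collected; whether anything later waits on them is a separate question.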

@isanghao Yes, dep_events is created in prepare_primitive() because the concat is optimized out, needs_completion_event() is true, and it is not an output (!is_output()). But dep_events is not waited on in execute() because BMG runs with SyncMethods=none. That's why the accuracy drop happens.
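To illustrate why an un-waited dependency shows up as wrong data rather than an error, here is a deterministic toy model. It is an assumption-laden sketch, not plugin code: `InOrderQueue`, `cpu_read_sum`, and `run` are hypothetical, and the queue simply defers work until `finish()` is called, whereas on real hardware the outcome is a race and may vary from run to run.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical in-order queue: enqueued work runs only when finish() is called.
struct InOrderQueue {
    std::queue<std::function<void()>> pending;

    void enqueue(std::function<void()> task) { pending.push(std::move(task)); }

    // Runs all pending work in submission order.
    void finish() {
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
    }
};

// Stands in for a CPU implementation (such as NMS) reading its input on the host.
float cpu_read_sum(const std::vector<float>& boxes) {
    return std::accumulate(boxes.begin(), boxes.end(), 0.0f);
}

// Returns what the CPU consumer observes in the shared buffer.
float run(bool wait_before_cpu) {
    InOrderQueue queue;
    std::vector<float> boxes(4, 0.0f);  // buffer shared by producer and consumer

    // Producer kernel that fills the buffer; it is only enqueued here.
    queue.enqueue([&boxes] { for (auto& b : boxes) b = 1.0f; });

    // The optimized-out concat has no kernel of its own. It either waits
    // for its dependencies or passes the buffer straight through.
    if (wait_before_cpu) queue.finish();

    float seen = cpu_read_sum(boxes);  // host read by the CPU consumer
    queue.finish();                    // drain whatever is still pending
    return seen;
}
```

In this model `run(false)` reads the buffer before the producer has written it, while `run(true)` sees the filled buffer, which mirrors the observation that calling get_stream().finish() before NMS makes the issue disappear.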
