[GPU] Add event completion for optimized out instance #33122

wilson-seok · 2025-12-04T09:09:21Z

Description of the issue (symptom, root cause, how it was resolved)

The NMS cpu_impl produces a wrong output tensor, which causes an accuracy drop in validation on BMG (MSCOCO mAP 39.3% -> 30.4%).
The issue disappears when get_stream().finish() is called before the NMS cpu_impl executes.
From a memory dump of the NMS inputs, I found that input0 (boxes) is a concat that was optimized out by the prepare buffer fusing pass.
The current execute_impl() of ocl_impl checks can_be_optimized() first and returns aggregate_events(events). However, aggregate_events() does not wait for the dependent events when QueueTypes=in_order and SyncMethods=none, which is the configuration used on dGPU.
Because the concat's user is a cpu_impl (NMS), needs_completion_event() is true for the concat, so all of its dependency events must be complete before it returns. I added this check using wait_for_events().
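To make the race concrete, here is a toy model of the symptom, assuming an in-order queue whose work is only guaranteed complete after the host explicitly waits (the SyncMethods=none case). ToyInOrderStream, Event and cpu_consumer_reads() are hypothetical stand-ins, not the cldnn API; only the roles of aggregate_events(), wait_for_events() and finish() are taken from the description above.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy model, not the cldnn API: an in-order queue whose work
// only runs when the host explicitly waits (SyncMethods=none, no implicit sync).
struct Event {
    bool complete = false;
};
using EventPtr = std::shared_ptr<Event>;

struct ToyInOrderStream {
    std::vector<std::pair<std::function<void()>, EventPtr>> pending;

    // Enqueue a "kernel"; nothing executes yet.
    EventPtr enqueue(std::function<void()> kernel) {
        auto ev = std::make_shared<Event>();
        pending.emplace_back(std::move(kernel), ev);
        return ev;
    }

    // Models aggregate_events() for in_order + none: hands back a marker, never blocks.
    EventPtr aggregate_events(const std::vector<EventPtr>& /*events*/) {
        return std::make_shared<Event>();
    }

    // Models wait_for_events(): on an in-order queue, waiting for the given events
    // implies everything enqueued before them has run (simplified to a full drain).
    void wait_for_events(const std::vector<EventPtr>& /*events*/) { finish(); }

    // Models get_stream().finish().
    void finish() {
        for (auto& [kernel, ev] : pending) {
            kernel();
            ev->complete = true;
        }
        pending.clear();
    }
};

// Returns what a CPU consumer (stand-in for the NMS cpu_impl) reads from the
// buffer shared with an optimized-out concat.
float cpu_consumer_reads(bool concat_waits_for_deps) {
    ToyInOrderStream stream;
    float boxes = 0.0f;  // buffer the concat aliases; 0 means "not yet written"
    EventPtr producer = stream.enqueue([&boxes] { boxes = 42.0f; });
    std::vector<EventPtr> deps{producer};

    // Optimized-out concat: no kernel of its own, only event handling.
    if (concat_waits_for_deps) {
        stream.wait_for_events(deps);   // behavior after the fix
    } else {
        stream.aggregate_events(deps);  // behavior before the fix
    }
    return boxes;  // the CPU reads immediately, with no further sync
}
```

In this model cpu_consumer_reads(false) returns the stale 0.0f and cpu_consumer_reads(true) returns 42.0f, which mirrors the observation that an explicit finish() before NMS hides the problem.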

The code and line that caused this issue (if it is not changed directly)

openvino/src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp

Line 247 in fb89d8b

return stream.aggregate_events(events, false, instance.is_output());

Problematic graph

Execution graph around NMS: input0 is the optimized-out concat.

Checklist

Is it a proper fix? (not a workaround)
Did you include test case for this fix, if necessary?
Did you review existing tests that can be extended to cover this scenario? Which tests did you review?

Tickets:

176971

Copilot

Pull request overview

This PR fixes a critical synchronization issue in the GPU plugin where optimized-out operations (like concat) were not properly completing their dependent events before CPU implementations (like NMS) accessed their outputs, causing accuracy degradation in MSCOCO validation (39.3% → 30.4% mAP).

Key Changes:

Added event completion check for optimized-out instances that have CPU consumers requiring synchronization
Ensures wait_for_events() is called when an optimized instance needs completion events before returning aggregated events


src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp

hyunback

LGTM

e-ddykim · 2025-12-05T03:41:09Z

src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp

+            if (instance.needs_completion_event()) {
+                stream.wait_for_events(events);
+            }


I think we can return here instead of calling aggregate_events() on the next line.

@e-ddykim I added return here. Thanks!
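The early return discussed in this thread can be sketched as follows. Stream, Instance and Event are hypothetical stand-ins for the cldnn types, and create_completed_event() is a hypothetical helper: the PR text does not show which event the real code returns after the host-side wait, so treat that line as an assumption.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins; the real cldnn types and signatures differ.
struct Event { bool complete = false; };
using EventPtr = std::shared_ptr<Event>;

struct Stream {
    int waits = 0;
    int aggregations = 0;

    // Pretend the host blocks until every dependency has finished.
    void wait_for_events(const std::vector<EventPtr>& events) {
        ++waits;
        for (const auto& e : events) e->complete = true;
    }
    // Non-blocking aggregation, as with in_order + SyncMethods=none.
    EventPtr aggregate_events(const std::vector<EventPtr>&, bool /*group*/, bool /*is_output*/) {
        ++aggregations;
        return std::make_shared<Event>();
    }
    // Hypothetical helper: an already-signaled event to hand back after a host wait.
    EventPtr create_completed_event() {
        auto e = std::make_shared<Event>();
        e->complete = true;
        return e;
    }
};

struct Instance {
    bool optimized_out;
    bool needs_completion;
    bool output;
    bool can_be_optimized() const { return optimized_out; }
    bool needs_completion_event() const { return needs_completion; }
    bool is_output() const { return output; }
};

EventPtr execute_impl(Stream& stream, const std::vector<EventPtr>& events, const Instance& instance) {
    if (instance.can_be_optimized()) {
        if (instance.needs_completion_event()) {
            // A CPU user reads the aliased buffer right after this returns, so block here.
            stream.wait_for_events(events);
            return stream.create_completed_event();  // early return; aggregation is skipped
        }
        return stream.aggregate_events(events, false, instance.is_output());
    }
    return nullptr;  // kernel-enqueue path omitted from this sketch
}
```

The design point from the review is that once the host has already waited, aggregating the same events again adds nothing, so the function can return directly.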

isanghao · 2025-12-05T07:17:13Z

src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp

        if (instance.can_be_optimized()) {
+            if (instance.needs_completion_event()) {
+                stream.wait_for_events(events);
+            }


Did you review the existing places that handle needs_completion_event? If I remember correctly, we already handle this case, so it is surprising that it does not work as expected.
I found this comment in primitive_inst.cpp and this seems to be relevant.

// Prepare dependencies events in case of OOO queue, CPU implementation, // or optimized_out impl which has CPU users (needs_completion_event() && !is_output() condition)

@isanghao Yes, dep_events is created in prepare_primitive() because the concat is optimized out, needs_completion_event() is true, and !is_output() holds. However, dep_events is not waited on in execute() because BMG uses SyncMethods=none. That is why this accuracy drop happens.
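The gap described in this exchange can be expressed as two predicates: one for when dependency events are prepared (following the primitive_inst.cpp comment quoted above) and one for when execute() actually blocks on them. The enums and structs below are hypothetical stand-ins that only borrow the names used in the thread. Modeling needs_completion_event() as "has CPU users", and execute() waiting for any sync method other than none, are assumptions; the thread only states that SyncMethods=none skips the wait.

```cpp
// Hypothetical enums that borrow the names used in the discussion.
enum class QueueTypes { in_order, out_of_order };
enum class SyncMethods { events, barriers, none };

struct Config {
    QueueTypes queue;
    SyncMethods sync;
};

struct Node {
    bool optimized_out;
    bool has_cpu_users;  // assumption: stands in for needs_completion_event()
    bool is_output;
    bool is_cpu_impl;
};

// Mirrors the quoted comment: dep events are prepared for an OOO queue, a CPU impl,
// or an optimized-out impl with CPU users (needs_completion_event() && !is_output()).
bool prepares_dep_events(const Config& cfg, const Node& n) {
    const bool needs_completion_event = n.has_cpu_users;
    return cfg.queue == QueueTypes::out_of_order || n.is_cpu_impl ||
           (n.optimized_out && needs_completion_event && !n.is_output);
}

// Assumption: execute() waits on dep_events unless the sync method is none.
bool execute_waits_on_dep_events(const Config& cfg) {
    return cfg.sync != SyncMethods::none;
}

// True when dep events exist but nothing blocks on them before the CPU user runs.
bool has_sync_gap(const Config& cfg, const Node& n, bool execute_impl_waits) {
    return prepares_dep_events(cfg, n) && !execute_waits_on_dep_events(cfg) && !execute_impl_waits;
}
```

For an optimized-out concat feeding a CPU NMS on an in_order/none configuration, has_sync_gap() is true until execute_impl() itself waits, which is the case the PR closes.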

wilson-seok requested review from a team as code owners December 4, 2025 09:09

github-actions bot added the category: GPU OpenVINO GPU plugin label Dec 4, 2025

add event completion for optimized out instance

72dc188

wilson-seok requested a review from Copilot December 5, 2025 01:37

Copilot AI reviewed Dec 5, 2025


src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp (resolved)

hyunback approved these changes Dec 5, 2025


e-ddykim reviewed Dec 5, 2025


isanghao reviewed Dec 5, 2025


add return when need_completion_events is true

282b10c
