[GSoC 2026] Project #1 Build a GUI Agent with local LLM/VLM and OpenVINO #34765
Hi @openvino-dev-samples and @zhuo-yoyowz, I am finalizing my proposal for the "Build a GUI Agent" project. Given the systems-level complexity of my approach, which integrates DXGI capture, Win32 UIA grounding, and NPU offloading, my current draft is quite detailed so that all technical risks are addressed. Would you prefer a comprehensive proposal that covers these architectural specifics and tiered milestones, or a more compact version focused strictly on the core deliverables?
Hi Ethan (@openvino-dev-samples) and Zhuo (@zhuo-yoyowz),
My name is Harsh Dinodia. I've recently been working in the openvino.genai repository on the Whisper C-API prerequisite (PR #3513). I really enjoyed the challenge of implementing the explicit API design for word-level timestamps and the quick iteration on the shared-pointer logic.
Now that I’ve got a solid handle on the core C++ bindings, I’d like to open a discussion regarding Project #1 (GUI Agent). I've been analyzing the reference architectures (UI-TARS/MobileAgent) and wanted to share my initial implementation strategy for feedback before I finalize my formal proposal.
Proposed Technical Direction:
Unified Memory & Hardware Strategy:
While I am developing locally on an 11th Gen i5 (8 GB RAM), my primary target for the agent is the AIPC Cloud (32 GB RAM / 18 GB unified memory). To optimize for the Lunar Lake architecture, I intend to use a single multimodal model (Llama-3.2-Vision, INT4) to minimize KV-cache overhead and keep the model entirely resident in NPU-accessible memory.
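As a quick sanity check on that memory budget, here is a rough back-of-envelope sketch of the KV-cache footprint. The hyperparameters below are illustrative placeholders, not the confirmed config of the target model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each store seq_len * num_kv_heads * head_dim values per layer,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative numbers (NOT the actual Llama-3.2-Vision config):
# 40 layers, 8 KV heads (GQA), head_dim 128, 8K context, FP16 cache.
cache = kv_cache_bytes(40, 8, 128, 8192)
print(f"{cache / 2**30:.2f} GiB")  # 1.25 GiB for the text-decoder cache alone
```

Even with grouped-query attention, the cache alone claims a meaningful slice of the 18 GB unified-memory budget once the INT4 weights are loaded, which is what motivates the single-model design.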
Low-Latency Perception Pipeline:
To avoid the typical UI freeze during screen capture, I plan to use a C++ backend built on the Windows Desktop Duplication API (DXGI), which will let me feed frames directly into the VisionPipeline. I am also looking into running the Set-of-Mark (SoM) preprocessing pass directly on the NPU to identify interactable elements before the main VLM reasoning step.
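To illustrate the SoM bookkeeping I have in mind (leaving element detection itself aside), here is a minimal Python sketch; the `Element` type and the mark-numbering scheme are my own placeholders:

```python
from dataclasses import dataclass

@dataclass
class Element:
    left: int
    top: int
    right: int
    bottom: int

def build_mark_map(elements):
    """Assign a numeric mark to each detected element and map each
    mark to a click point (the box center), since the VLM answers
    in terms of marks rather than raw pixels."""
    marks = {}
    for i, el in enumerate(elements, start=1):
        cx = (el.left + el.right) // 2
        cy = (el.top + el.bottom) // 2
        marks[i] = (cx, cy)
    return marks

elements = [Element(10, 10, 110, 40), Element(200, 300, 260, 330)]
marks = build_mark_map(elements)
# If the VLM replies "click mark 2", resolve it back to pixels:
print(marks[2])  # (230, 315)
```

The same mark map would later be rendered as numbered overlays onto the captured frame before it is handed to the VLM.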
Hybrid Grounding (SoM + UIA):
Pure visual grounding can struggle with high-DPI scaling or dynamic UI shifts. I propose a hybrid approach that augments raw pixel perception with Windows UI Automation (UIA) metadata. This should ensure the agent remains robust across complex, professional productivity apps.
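A minimal sketch of the visual-to-UIA fusion I am picturing, pairing visual detections with UIA bounding rects by IoU. The greedy strategy and the 0.5 threshold are assumptions on my part, not a settled design:

```python
def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_visual_to_uia(visual_boxes, uia_rects, threshold=0.5):
    """Greedy 1:1 matching: each visual detection adopts the UIA element
    it overlaps best, so actions can target the UIA rect (robust to
    DPI scaling and small layout shifts)."""
    matches, used = {}, set()
    for vi, vb in enumerate(visual_boxes):
        best, best_iou = None, threshold
        for ui, ur in enumerate(uia_rects):
            if ui in used:
                continue
            score = iou(vb, ur)
            if score > best_iou:
                best, best_iou = ui, score
        if best is not None:
            matches[vi] = best
            used.add(best)
    return matches

# A detection near (5, 5, 105, 105) snaps to that UIA rect:
print(match_visual_to_uia([(0, 0, 100, 100)],
                          [(500, 500, 600, 600), (5, 5, 105, 105)]))  # {0: 1}
```

Unmatched visual detections would fall back to pure pixel grounding, which keeps canvas-style apps (where UIA exposes nothing) working.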
Native Action Bridge:
Building on my recent work with the OpenVINO C bindings, I will implement the execution layer as a native C++ module using the Windows SendInput API. By bypassing Python wrappers like pyautogui, we can achieve sub-millisecond latency between the model's decision and the hardware event.
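For reference, SendInput with MOUSEEVENTF_ABSOLUTE takes coordinates normalized to the 0..65535 range across the desktop rather than raw pixels, so the bridge needs a mapping step. A small helper for just that mapping (the actual Win32 call is omitted here, and the rounding convention is my own choice):

```python
def to_sendinput_coords(x, y, screen_w, screen_h):
    """Map pixel coordinates onto the 0..65535 range used by
    SendInput when MOUSEEVENTF_ABSOLUTE is set."""
    nx = round(x * 65535 / (screen_w - 1))
    ny = round(y * 65535 / (screen_h - 1))
    return nx, ny

# Corners of a 1920x1080 screen map to the range extremes:
print(to_sendinput_coords(0, 0, 1920, 1080))        # (0, 0)
print(to_sendinput_coords(1919, 1079, 1920, 1080))  # (65535, 65535)
```

In the C++ module this arithmetic would feed the MOUSEINPUT struct directly, keeping the Python layer out of the hot path entirely.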
Questions:
For the NPU 4000 target, is there a preferred INT4 quantization profile for the VisionPipeline, or should I explore custom NNCF configurations?
Regarding the 350-hour scope: Does the team prefer a pure visual-first approach, or is the hybrid UIA/SoM integration considered the desired standard for this project?
Is there existing support within the GenAI library for asynchronous tensor inputs from D3D11 surfaces, or should I implement a custom buffer bridge in the C++ layer?