[GSoC 2026] Project #1 Build a GUI Agent with local LLM/VLM and OpenVINO #34765
Hi @openvino-dev-samples and @zhuo-yoyowz, I am finalizing my proposal for the "Build a GUI Agent" project. Given the systems-level complexity of my approach, which integrates DXGI capture, Win32 UIA grounding, and NPU offloading, my current draft is quite detailed so that all technical risks are addressed. Would you prefer a comprehensive proposal that covers these architectural specifics and tiered milestones, or a more compact version focused strictly on the core deliverables?
Hi Ethan (@openvino-dev-samples) and Zhuo (@zhuo-yoyowz),
My name is Harsh Dinodia. I've recently been working in the openvino.genai repository on the Whisper C-API prerequisite (PR #3513). I really enjoyed the challenge of implementing the explicit API design for word-level timestamps and the quick iteration on the shared-pointer logic.
Now that I’ve got a solid handle on the core C++ bindings, I’d like to open a discussion regarding Project #1 (GUI Agent). I've been analyzing the reference architectures (UI-TARS/MobileAgent) and wanted to share my initial implementation strategy for feedback before I finalize my formal proposal.
Proposed Technical Direction:
Unified Memory & Hardware Strategy:
While I am developing locally on an 11th Gen i5 (8 GB RAM), my primary target for the agent is the AIPC Cloud (32 GB RAM / 18 GB unified memory). To optimize for the Lunar Lake architecture, I intend to use a single multimodal model (Llama-3.2-Vision, INT4) to minimize KV-cache overhead and keep the model entirely resident in NPU-accessible memory.
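As a quick sanity check on that memory budget, here is a rough back-of-envelope sketch of the KV-cache footprint. The hyperparameters below are illustrative placeholders, not the confirmed config of the target model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each store seq_len * num_kv_heads * head_dim values per layer,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative numbers (NOT the actual Llama-3.2-Vision config):
# 40 layers, 8 KV heads (GQA), head_dim 128, 8K context, FP16 cache.
cache = kv_cache_bytes(40, 8, 128, 8192)
print(f"{cache / 2**30:.2f} GiB")  # 1.25 GiB for the text-decoder cache alone
```

Even with grouped-query attention, the cache alone claims a meaningful slice of the 18 GB unified-memory budget once the INT4 weights are loaded, which is what motivates the single-model design.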
Low-Latency Perception Pipeline:
To avoid the typical UI freeze during screen capture, I plan to use a C++ backend built on the Windows Desktop Duplication API (DXGI), which will let me feed frames directly into the VisionPipeline. I am also looking into running the Set-of-Mark (SoM) preprocessing pass directly on the NPU to identify interactable elements before the main VLM reasoning step.
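To illustrate the SoM bookkeeping I have in mind (leaving element detection itself aside), here is a minimal Python sketch; the `Element` type and the mark-numbering scheme are my own placeholders:

```python
from dataclasses import dataclass

@dataclass
class Element:
    left: int
    top: int
    right: int
    bottom: int

def build_mark_map(elements):
    """Assign a numeric mark to each detected element and map each
    mark to a click point (the box center), since the VLM answers
    in terms of marks rather than raw pixels."""
    marks = {}
    for i, el in enumerate(elements, start=1):
        cx = (el.left + el.right) // 2
        cy = (el.top + el.bottom) // 2
        marks[i] = (cx, cy)
    return marks

elements = [Element(10, 10, 110, 40), Element(200, 300, 260, 330)]
marks = build_mark_map(elements)
# If the VLM replies "click mark 2", resolve it back to pixels:
print(marks[2])  # (230, 315)
```

The same mark map would later be rendered as numbered overlays onto the captured frame before it is handed to the VLM.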
Hybrid Grounding (SoM + UIA):
Pure visual grounding can struggle with high-DPI scaling or dynamic UI shifts. I propose a hybrid approach that augments raw pixel perception with Windows UI Automation (UIA) metadata. This should ensure the agent remains robust across complex, professional productivity apps.
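A minimal sketch of the visual-to-UIA fusion I am picturing, pairing visual detections with UIA bounding rects by IoU. The greedy strategy and the 0.5 threshold are assumptions on my part, not a settled design:

```python
def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_visual_to_uia(visual_boxes, uia_rects, threshold=0.5):
    """Greedy 1:1 matching: each visual detection adopts the UIA element
    it overlaps best, so actions can target the UIA rect (robust to
    DPI scaling and small layout shifts)."""
    matches, used = {}, set()
    for vi, vb in enumerate(visual_boxes):
        best, best_iou = None, threshold
        for ui, ur in enumerate(uia_rects):
            if ui in used:
                continue
            score = iou(vb, ur)
            if score > best_iou:
                best, best_iou = ui, score
        if best is not None:
            matches[vi] = best
            used.add(best)
    return matches

# A detection near (5, 5, 105, 105) snaps to that UIA rect:
print(match_visual_to_uia([(0, 0, 100, 100)],
                          [(500, 500, 600, 600), (5, 5, 105, 105)]))  # {0: 1}
```

Unmatched visual detections would fall back to pure pixel grounding, which keeps canvas-style apps (where UIA exposes nothing) working.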
Native Action Bridge:
Building on my recent work with the OpenVINO C bindings, I will implement the execution layer as a native C++ module using the Windows SendInput API. By bypassing Python wrappers like pyautogui, we can achieve sub-millisecond latency between the model's decision and the hardware event.
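For reference, SendInput with MOUSEEVENTF_ABSOLUTE takes coordinates normalized to the 0..65535 range across the desktop rather than raw pixels, so the bridge needs a mapping step. A small helper for just that mapping (the actual Win32 call is omitted here, and the rounding convention is my own choice):

```python
def to_sendinput_coords(x, y, screen_w, screen_h):
    """Map pixel coordinates onto the 0..65535 range used by
    SendInput when MOUSEEVENTF_ABSOLUTE is set."""
    nx = round(x * 65535 / (screen_w - 1))
    ny = round(y * 65535 / (screen_h - 1))
    return nx, ny

# Corners of a 1920x1080 screen map to the range extremes:
print(to_sendinput_coords(0, 0, 1920, 1080))        # (0, 0)
print(to_sendinput_coords(1919, 1079, 1920, 1080))  # (65535, 65535)
```

In the C++ module this arithmetic would feed the MOUSEINPUT struct directly, keeping the Python layer out of the hot path entirely.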
Questions:
For the NPU 4000 target, is there a preferred INT4 quantization profile for the VisionPipeline, or should I explore custom NNCF configurations?
Regarding the 350-hour scope: Does the team prefer a pure visual-first approach, or is the hybrid UIA/SoM integration considered the desired standard for this project?
Is there existing support within the GenAI library for asynchronous tensor inputs from D3D11 surfaces, or should I implement a custom buffer bridge in the C++ layer?