UGround

This is the official code repository for the project: Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents.

Updates

  • 2024/11/04: Training of the Qwen2VL-based UGround-v1 has finished, with an even stronger SOTA result on ScreenSpot (85.9% avg.). Qwen2VL-based UGround-v1.1 and all the code are coming soon.

  • 2024/10/07: The preprint is on arXiv, and the demo is live. Code is coming soon.

  • 2024/08/06: The website is live. The initial manuscript and results are available.

Release Plans:

  • Model Weights
  • Code
    • Inference Code of UGround (a tentative usage sketch follows this list)
    • Training Code, Scripts, and Checkpoints
    • Offline Experiments
      • ScreenSpot (along with referring expressions generated by GPT-4/4o)
      • Multimodal-Mind2Web
      • OmniACT
    • Online Experiments
      • Mind2Web-Live
      • AndroidWorld
  • Data
    • Data Examples
    • Data Construction Scripts
    • Guidance on Open-source Data
  • Online Demo (HF Spaces)
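
While the official inference code is still on the way, the following is a minimal sketch of how a Qwen2VL-based grounding model can be queried through Hugging Face transformers. It is illustrative only: the checkpoint ID is a placeholder, and the prompt and output format (a textual (x, y) point for a referring expression) are assumptions based on the paper's description rather than released code.

# Minimal sketch of querying a Qwen2VL-based grounding model via transformers.
# The checkpoint ID is a PLACEHOLDER; prompt and output format are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "osunlp/UGround"  # placeholder -- swap in the released checkpoint ID

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

screenshot = Image.open("screenshot.png")
expression = "the search button in the top navigation bar"  # referring expression

# One-turn chat holding the screenshot and the grounding query.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": f"Where is {expression}? Answer with a coordinate."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

# The model is expected to answer with a textual point on the screenshot.
output_ids = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. "(1372, 144)", subject to the model's coordinate convention

This generic Qwen2VL chat-template flow should carry over once the real checkpoints are published; only the checkpoint ID and the prompt wording would need to change.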


Citation Information

If you find this work useful, please consider starring our repo and citing our papers:

@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243}
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
  url={https://arxiv.org/abs/2401.01614}
}