UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

[📖 Paper] [🤗 Checkpoints] [🤗 Data] [🤗 Daily Paper] [🚀 Github]

🔥 Overview

UI-AGILE enhances GUI agents through improved training with a Continuous Reward function, Simple Thinking reward, and Cropping-based Resampling, and inference with Decomposed Grounding with Selection.

Trained on about only 9k samples for just 2 epochs, UI-AGILE shows superior performance, while also showcasing strong general agent capabilities. Furthermore, our inference method can act as a plug-and-play enhancement for a wide range of existing agents, improving the accuracy of some existing open-source models.

As a baseline, the standard grounding approach applied to UI-AGILE-7B completes the benchmark in 30 minutes. When applying our method, the decomposed grounding stage takes 35 minutes. The subsequent VLM-based selection stage requires additional 4 minutes. The modest increase in overhead is a practical trade-off for the substantial gain of grounding accuracy brought by our method.

"Attempt Num Distribution" shows the distribution of attempts per GRPO training step, where each step processes a batch of two training samples. In the first epoch, we find that only 61.8% of training steps are fully successful on the initial attempt (i.e., both samples in the batch are solved without resampling). This means that without our strategy, a minimum of 19.1% (38.2% ÷ 2) of training samples would have provided no learning signal. Overall attempt numbers decreases in the second epoch, demonstrating that the model learns from the samples salvaged by our method.

Setup

We provide the code for our RFT training and the Decomposed Grounding with Selection method in two separate modules. To avoid potential dependency conflicts, each module is designed to be run in its own conda environment.

Inference

cd eval

To accelerate evaluation, we organize data as parquet and provide evaluation code.

You can easily adapt your models to our pipline.

eval/grounding/eval_grounding_vllm_no_ray.py is for grounding benchmarks (Screenspot-v2 and Screenspot-Pro).

eval/android_control/inference_android_control_refactored.py is for AndroidControl.

Training

cd train/src/scripts
bash train.sh

⭐️ Citation

If you find this project useful, welcome to cite us.

@misc{lian2025uiagileadvancingguiagents,
      title={UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding}, 
      author={Shuquan Lian and Yuhang Wu and Jia Ma and Zihan Song and Bingqi Chen and Xiawu Zheng and Hui Li},
      year={2025},
      eprint={2507.22025},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.22025}, 
}

🤝 Acknowledgements

We sincerely thank projects R1-V, Open-R1, and Open-r1-multimodal, VLM-R1 for providing their open-source resources.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
assets		assets
eval		eval
gui_region		gui_region
train		train
training_data		training_data
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

🔥 Overview

Setup

Inference

Training

⭐️ Citation

🤝 Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

KDEGroup/UI-AGILE

Folders and files

Latest commit

History

Repository files navigation

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

🔥 Overview

Setup

Inference

Training

⭐️ Citation

🤝 Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages