This project introduces Booking Benchmark, a framework for evaluating the agentic capabilities of instruction-based vision–language models (VLMs) on GUI tasks. It provides:
- A single‑step flight‑booking dataset, together with model evaluations on it.
- A web agent capable of interacting with web GUIs and performing actions.
The benchmark dataset was created from scratch. An annotated version—generated with the Omniparser tool—is also included.
All data processing and model evaluation were performed in Google Colab. This repository contains:
- `Omniparser Script`
  - Annotates the raw dataset
  - Loads model weights directly
- `Omniparser_API.py`
  - Integrated into the WebAgent
  - Calls the Omniparser service via its Hugging Face Space API
- `utils-dataset-testing/`
  - Utilities for dataset creation and validation
- `screenshots/`
  - Raw and annotated GUI screenshots
  - Subfolders with outputs from each model evaluation
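As a rough illustration of how the WebAgent might consume Omniparser's parsed elements downstream, the sketch below maps a labeled element to a click point. The element schema (`label` plus a pixel `bbox`) is an assumption for illustration, not the tool's exact output format:

```python
def center_of(bbox):
    """Return the (x, y) click point at the center of a bounding box.

    bbox is (x_min, y_min, x_max, y_max) in pixels -- an assumed schema,
    not necessarily Omniparser's exact format.
    """
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def find_element(elements, query):
    """Pick the first parsed element whose label mentions the query text."""
    q = query.lower()
    for el in elements:
        if q in el["label"].lower():
            return el
    return None


# Example: annotated elements as an agent might receive them
elements = [
    {"label": "Departure city", "bbox": (10, 20, 110, 40)},
    {"label": "Search flights", "bbox": (10, 60, 110, 80)},
]
target = find_element(elements, "search")
click_xy = center_of(target["bbox"])  # -> (60.0, 70.0)
```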
Model evaluation is documented in `booking_benchmark.ipynb`. To reproduce the results:
- Run all cells in order.
- Flush the GPU before switching to a different model.
- Provide your Hugging Face token and/or Gemini API key when evaluating models that require them.
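Flushing the GPU between models can be done along these lines (a minimal sketch of a common Colab pattern; the notebook's own cell may differ):

```python
import gc


def flush_gpu():
    """Release cached GPU memory so the next model fits.

    Delete your references to the previous model/tokenizer first,
    then call this to reclaim the memory they held.
    """
    gc.collect()  # drop unreachable Python-level objects
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # CPU-only environment: nothing to free
```

Typical use before loading the next checkpoint: `del model, tokenizer` followed by `flush_gpu()`.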
`WebAgent.ipynb` demonstrates a production‑style agent built with the LangGraph framework. The agent uses:
- A VLM (Gemini)
- Omniparser for parsing web‑page contents
- A set of defined tools (the actions the agent can invoke)
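Conceptually, these components form a perceive-parse-decide-act loop. The sketch below shows that loop in plain Python with stand-in components; the actual notebook wires this up with LangGraph, Gemini, and Omniparser, so every name here is an illustrative assumption:

```python
def run_agent(prompt, screenshot_fn, parse_fn, decide_fn, tools, max_steps=5):
    """Minimal perceive-parse-decide-act loop (illustrative, not the notebook's code).

    screenshot_fn: captures the current GUI state
    parse_fn:      turns a screenshot into Omniparser-style elements
    decide_fn:     the VLM policy -- picks a tool call given prompt + elements
    tools:         mapping of tool name -> callable
    """
    history = []
    for _ in range(max_steps):
        screenshot = screenshot_fn()                    # perceive
        elements = parse_fn(screenshot)                 # parse
        action = decide_fn(prompt, elements, history)   # decide
        if action["tool"] == "finish":
            break
        tools[action["tool"]](**action["args"])         # act
        history.append(action)
    return history
```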
To reproduce the demo:
- Run all cells up to the Examples section.
- For each example, initialize the browser state (the cell before calling `call_agent`).
- Invoke `call_agent(prompt)` with your desired instruction.