This project introduces Booking Benchmark, a framework for evaluating the agentic capabilities of instruction-based vision–language models (VLMs) on GUI tasks. It provides:
- A single‑step flight‑booking dataset, together with model evaluations on it.
- A web agent capable of interacting with web GUIs and performing actions.
The benchmark dataset was created from scratch. An annotated version—generated with the Omniparser tool—is also included.
All data processing and model evaluation were performed in Google Colab. This repository contains:
- `Omniparser Script`
  - Annotates the raw dataset
  - Loads model weights directly
- `Omniparser_API.py`
  - Integrated into the WebAgent
  - Calls the Omniparser service via its Hugging Face Space API
- `utils-dataset-testing/`
  - Utilities for dataset creation and validation
- `screenshots/`
  - Raw and annotated GUI screenshots
  - Subfolders with outputs from each model evaluation
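As a rough illustration of how the WebAgent might consume Omniparser's parsed elements downstream, the sketch below maps a labeled element to a click point. The element schema (`label` plus a pixel `bbox`) is an assumption for illustration, not the tool's exact output format:

```python
def center_of(bbox):
    """Return the (x, y) click point at the center of a bounding box.

    bbox is (x_min, y_min, x_max, y_max) in pixels -- an assumed schema,
    not necessarily Omniparser's exact format.
    """
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def find_element(elements, query):
    """Pick the first parsed element whose label mentions the query text."""
    q = query.lower()
    for el in elements:
        if q in el["label"].lower():
            return el
    return None


# Example: annotated elements as an agent might receive them
elements = [
    {"label": "Departure city", "bbox": (10, 20, 110, 40)},
    {"label": "Search flights", "bbox": (10, 60, 110, 80)},
]
target = find_element(elements, "search")
click_xy = center_of(target["bbox"])  # -> (60.0, 70.0)
```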
Model evaluation is documented in `booking_benchmark.ipynb`. To reproduce the results:
- Run all cells in order.
- Flush the GPU before switching to a different model.
- Provide your Hugging Face token and/or Gemini API key when evaluating models that require them.
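Flushing the GPU between models can be done along these lines (a minimal sketch of a common Colab pattern; the notebook's own cell may differ):

```python
import gc


def flush_gpu():
    """Release cached GPU memory so the next model fits.

    Delete your references to the previous model/tokenizer first,
    then call this to reclaim the memory they held.
    """
    gc.collect()  # drop unreachable Python-level objects
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # CPU-only environment: nothing to free
```

Typical use before loading the next checkpoint: `del model, tokenizer` followed by `flush_gpu()`.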
`WebAgent.ipynb` demonstrates a production‑style agent built with the LangGraph framework. The agent uses:
- A VLM (Gemini)
- Omniparser for parsing web‑page contents
- A set of defined tools (the actions the agent can invoke)
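Conceptually, these components form a perceive-parse-decide-act loop. The sketch below shows that loop in plain Python with stand-in components; the actual notebook wires this up with LangGraph, Gemini, and Omniparser, so every name here is an illustrative assumption:

```python
def run_agent(prompt, screenshot_fn, parse_fn, decide_fn, tools, max_steps=5):
    """Minimal perceive-parse-decide-act loop (illustrative, not the notebook's code).

    screenshot_fn: captures the current GUI state
    parse_fn:      turns a screenshot into Omniparser-style elements
    decide_fn:     the VLM policy -- picks a tool call given prompt + elements
    tools:         mapping of tool name -> callable
    """
    history = []
    for _ in range(max_steps):
        screenshot = screenshot_fn()                    # perceive
        elements = parse_fn(screenshot)                 # parse
        action = decide_fn(prompt, elements, history)   # decide
        if action["tool"] == "finish":
            break
        tools[action["tool"]](**action["args"])         # act
        history.append(action)
    return history
```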
To reproduce the demo:
- Run all cells up to the Examples section.
- For each example, initialize the browser state (the cell before calling `call_agent`).
- Invoke `call_agent(prompt)` with your desired instruction.