Zero to AI agent testing in minutes. Arbigent's intuitive UI and powerful code interface make it accessible to everyone, while its scenario breakdown feature ensures scalability for even the most complex tasks.
Warning
There is a spam account posing as Arbigent; it is not related to me. The creator's accounts are https://x.com/_takahirom_ and https://x.com/new_runnable.
(Demo video: arbigent-demo.mp4)
Traditional UI testing often relies on brittle methods that are easily disrupted by even minor UI changes. A/B tests, updated tutorials, unexpected dialogs, dynamic advertising, or ever-changing user-generated content can cause tests to fail.
AI agents emerged as a solution, but testing with AI agents also presents challenges. AI agents often don't work as intended; for example, the agents might open other apps or click on the wrong button due to the complexity of the task.
To address these challenges, I created Arbigent, an AI agent testing framework that can break down complex tasks into smaller, dependent scenarios. By decomposing tasks, Arbigent enables more predictable and scalable testing of AI agents in modern applications.
I believe many AI agent testing frameworks will emerge in the future. However, widespread adoption might be delayed by limitations in customization. For instance:
- Limited AI Provider Support: Frameworks might be locked to specific AI providers, excluding those used internally by companies.
- Slow OS Adoption: Support for different operating systems (like iOS and Android) could lag.
- Delayed Form Factor Support: Expanding to form factors beyond phones, such as Android TV, might take considerable time.
To address these issues, I aimed to create a framework that empowers users with extensive customization capabilities. Inspired by OkHttp's interceptor pattern, Arbigent provides interfaces for flexible customization, allowing users to adapt the framework to their specific needs, such as those listed above.
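As a rough illustration of that idea, here is a minimal sketch of an OkHttp-style interceptor. All names below (`DeviceInterceptor`, `AgentAction`, `ActionResult`) are hypothetical and only illustrate the customization pattern; they are not Arbigent's actual API.

```kotlin
// Hypothetical sketch of the OkHttp-style interceptor pattern; these types
// are illustrative, not Arbigent's real interfaces.
data class AgentAction(val description: String)
data class ActionResult(val success: Boolean)

interface DeviceInterceptor {
    fun intercept(chain: Chain): ActionResult

    interface Chain {
        fun action(): AgentAction
        fun proceed(action: AgentAction): ActionResult
    }
}

// Example: log every action the agent performs before letting it proceed.
class LoggingInterceptor : DeviceInterceptor {
    override fun intercept(chain: DeviceInterceptor.Chain): ActionResult {
        println("Performing: ${chain.action().description}")
        return chain.proceed(chain.action())
    }
}
```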
Furthermore, I wanted to make Arbigent accessible to QA engineers by offering a user-friendly UI. This allows for scenario creation within the UI and seamless test execution via the code interface.
I. Core Functionality & Design
- Complex Task Management:
- Scenario Dependencies: Breaks down complex goals into smaller, manageable scenarios that depend on each other (e.g., login -> search; see the sketch after this list).
- Orchestration: Acts as a mediator, managing the execution flow of AI agents across multiple, interconnected scenarios.
- Hybrid Development Workflow:
- UI-Driven Scenario Creation: Allows non-programmers (e.g., QA engineers) to visually design test scenarios through a user-friendly interface.
- Code-Based Execution: Enables software engineers to execute the saved scenarios programmatically (YAML files), allowing for integration with existing testing infrastructure.
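For instance, a project file (the format is detailed in the YAML section below) can chain a search scenario after a login scenario. The ids here are illustrative:

```yaml
scenarios:
  - id: "login"
    goal: "Log in to the app."
  - id: "search"
    goal: "Search for an item and open its detail page."
    dependency: "login"  # runs after the "login" scenario
```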
II. Cross-Platform & Device Support
- Multi-Platform Compatibility:
- Mobile & TV: Supports testing on iOS, Android, Web, and TV interfaces.
- D-Pad Navigation: Handles TV interfaces that rely on D-pad navigation.
III. AI Optimization & Efficiency
- Enhanced AI Understanding:
- UI Tree Optimization: Simplifies and filters the UI tree to improve AI comprehension and performance.
- Accessibility-Independent: Provides annotated screenshots to assist AI in understanding UIs that lack accessibility information.
- Cost Savings:
- Open Source: Free to use, modify, and distribute, eliminating licensing costs.
- Efficient Model Usage: Compatible with cost-effective models like GPT-4o mini, reducing operational expenses.
IV. Robustness & Reliability
- Double Check with AI-Powered Image Assertion: Integrates Roborazzi's feature to verify AI decisions using image-based prompts and allows the AI to re-evaluate if needed.
- Stuck Screen Detection: Identifies and recovers from situations where the AI agent gets stuck on the same screen, prompting it to reconsider its actions.
V. Advanced Features & Customization
- Flexible Code Interface:
- Custom Hooks: Offers a code interface for adding custom initialization and cleanup methods, providing greater control over scenario execution.
VI. Community & Open Source
- Open Source Nature:
- Free & Open: Freely available for use, modification, and distribution.
- Community Driven: Welcomes contributions from the community to enhance and expand the framework.
Arbigent's Strengths and Weaknesses Based on SMURF
I categorized automated testing frameworks into five levels using the SMURF framework. Here's how Arbigent stacks up:
- Speed (1/5): Arbigent's speed is currently limited by the underlying AI technology and the need to interact with the application's UI in real-time. This makes it slower than traditional unit or integration tests.
  - We have introduced some mechanisms to address this:
    - Tests can be parallelized using the `--shard` option to speed up execution (see the CLI section below).
    - AI result caching can be used when the UI tree and goal are identical; this is configurable in the project settings.
- Maintainability (4/5): Arbigent excels in maintainability. The underlying AI model adapts to minor UI changes, so tests rarely need rewriting for small updates. Tests are written in natural language (e.g., "Complete the tutorial"), making them resilient to UI changes, and the task decomposition feature reduces duplication. Thanks to the natural-language interface, maintenance can even be done by non-engineers.
- Utilization (1/5): Arbigent requires both device resources (emulators or physical devices) and AI resources, which can be costly. (AI cost can be around $0.005 per step and $0.02 per task when using GPT-4o.)
- Reliability (3/5): Arbigent has several features to improve reliability. It automatically waits during loading screens, handles unexpected dialogs, and even attempts self-correction. However, external factors like emulator flakiness can still impact reliability.
- Arbigent also has a retry feature that re-executes a scenario from the beginning. Even without retries, however, Arbigent generally runs without failures thanks to the flexibility of AI.
- Fidelity (5/5): Arbigent provides high fidelity by testing on real or emulated devices with the actual application. It can even assess aspects that were previously difficult to test, such as verifying video playback by checking for visual changes on the screen.
I believe that many of its current limitations, such as speed, maintainability, utilization, and reliability, will be addressed as AI technology continues to evolve. The need for extensive prompt engineering will likely diminish as AI models become more capable.
Install the Arbigent UI binary from the Release page.
If you encounter security warnings when opening the app, refer to Apple's guide on opening apps from unidentified developers.
- Connect your device to your PC.
- In the Arbigent UI, select your connected device from the list of available devices. This will establish a connection.
- Enter your AI provider's API key in the designated field within the Arbigent UI.
Use the intuitive UI to define scenarios. Simply specify the desired goal for the AI agent.
Run tests either directly through the UI or programmatically via the code interface or CLI.
You can install the CLI via Homebrew and run a saved YAML file.
```bash
brew tap takahirom/homebrew-repo
brew install takahirom/repo/arbigent
```
```
Usage: arbigent [<options>]

Options for OpenAI API AI:
  --open-ai-endpoint=<text>     Endpoint URL (default: https://api.openai.com/v1/)
  --open-ai-model-name=<text>   Model name (default: gpt-4o-mini)

Options for Gemini API AI:
  --gemini-endpoint=<text>      Endpoint URL (default: https://generativelanguage.googleapis.com/v1beta/openai/)
  --gemini-model-name=<text>    Model name (default: gemini-1.5-flash)

Options for Azure OpenAI:
  --azure-open-ai-endpoint=<text>      Endpoint URL
  --azure-open-ai-api-version=<text>   API version
  --azure-open-ai-model-name=<text>    Model name (default: gpt-4o-mini)

Options:
  --ai-type=(openai|gemini|azureopenai)   Type of AI to use
  --os=(android|ios|web)                  Target operating system
  --project-file=<text>                   Path to the project YAML file
  --log-level=(debug|info|warn|error)     Log level
  --shard=<value>                         Shard specification (e.g., 1/5)
  -h, --help                              Show this message and exit
```
You can run tests separately with the `--shard` option. This allows you to split your test suite and run tests in parallel, reducing overall test execution time.
Example:
```bash
arbigent --shard=1/4
```
This command will run the first quarter of your test suite.
Integrating with GitHub Actions:
Here's an example of how to integrate the `--shard` option with GitHub Actions to run parallel tests on multiple Android emulators:
```yaml
cli-e2e-android:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      shardIndex: [ 1, 2, 3, 4 ]
      shardTotal: [ 4 ]
  steps:
    ...
    - name: CLI E2E test
      uses: reactivecircus/android-emulator-runner@v2
      ...
        script: |
          arbigent --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }} --os=android --project-file=sample-test/src/main/resources/projects/e2e-test-android.yaml --ai-type=gemini --gemini-model-name=gemini-2.0-flash-exp
    ...
    - uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4
      if: ${{ always() }}
      with:
        name: cli-report-android-${{ matrix.shardIndex }}-${{ matrix.shardTotal }}
        path: |
          arbigent-result/*
        retention-days: 90
```
You can use the CLI in GitHub Actions as in this sample: https://github.com/takahirom/arbigent-sample. There are only two files: `.github/workflows/arbigent-test.yaml` and `arbigent-project.yaml`. The sample demonstrates a GitHub Actions workflow together with an `arbigent-project.yaml` file created by the Arbigent UI.
| AI Provider | Supported |
|---|---|
| OpenAI | Yes |
| Gemini | Yes |
| OpenAI-compatible APIs (e.g., Ollama) | Yes |
You can add AI providers by implementing the `ArbigentAi` interface.
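As a rough, hypothetical sketch of that plug-in point (the interface shape below is an assumption for illustration; consult Arbigent's source for the real `ArbigentAi` signatures), a custom in-house provider could look like this:

```kotlin
// Hypothetical sketch only; not Arbigent's actual ArbigentAi interface.
interface AiSketch {
    // Given the goal, the current UI tree, and prior steps, decide the next action.
    fun decideAction(goal: String, uiTree: String, history: List<String>): String
}

class InHouseAi(
    private val endpoint: String,
    private val apiKey: String,
) : AiSketch {
    override fun decideAction(goal: String, uiTree: String, history: List<String>): String {
        // Call your company's internal model here; a canned response keeps
        // this sketch self-contained.
        return "ClickWithText(text=Next)"
    }
}
```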
| OS | Supported | Test Status in the Arbigent repository |
|---|---|---|
| Android | Yes | End-to-end, including an Android emulator and a real AI |
| iOS | Yes | End-to-end, including an iOS simulator and a real AI |
| Web (Chrome) | Yes | Not yet tested |
You can add OSes by implementing the `ArbigentDevice` interface. Thanks to the excellent Maestro library, we are able to support multiple OSes.
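Again as a hypothetical sketch (the interface below is an assumption, not Arbigent's real `ArbigentDevice` signatures), a custom device backend would supply a UI tree and perform actions:

```kotlin
// Hypothetical sketch only; see Arbigent's source for the real ArbigentDevice interface.
interface DeviceSketch {
    fun fetchUiTree(): String      // serialized view hierarchy for the AI
    fun perform(action: String)    // execute the action the AI decided on
}

class DesktopDevice : DeviceSketch {
    override fun fetchUiTree(): String =
        """<root><button text="Play"/></root>"""

    override fun perform(action: String) {
        println("Performing $action on the desktop app")
    }
}
```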
| Form Factor | Supported |
|---|---|
| Phone / Tablet | Yes |
| TV (D-pad) | Yes |
The execution flow involves the UI, Arbigent, ArbigentDevice, and ArbigentAi. The UI sends a project creation request to Arbigent, which fetches the UI tree from ArbigentDevice. ArbigentAi then decides on an action based on the goal and UI tree. The action is performed by ArbigentDevice, and the results are returned to the UI for display.
```mermaid
sequenceDiagram
  participant UI(or Tests)
  participant ArbigentAgent
  participant ArbigentDevice
  participant ArbigentAi
  UI(or Tests)->>ArbigentAgent: Execute
  loop
    ArbigentAgent->>ArbigentDevice: Fetch UI tree
    ArbigentDevice->>ArbigentAgent: Return UI tree
    ArbigentAgent->>ArbigentAi: Decide Action by goal and UI tree and histories
    ArbigentAi->>ArbigentAgent: Return Action
    ArbigentAgent->>ArbigentDevice: Perform actions
    ArbigentDevice->>ArbigentAgent: Return results
  end
  ArbigentAgent->>UI(or Tests): Display results
```
The class diagram illustrates the relationships between ArbigentProject, ArbigentScenario, ArbigentTask, ArbigentAgent, ArbigentScenarioExecutor, ArbigentAi, ArbigentDevice, and ArbigentInterceptor.
```mermaid
classDiagram
  direction TB
  class ArbigentProject {
    +List~ArbigentScenario~ scenarios
    +execute()
  }
  class ArbigentAgentTask {
    +String goal
  }
  class ArbigentAgent {
    +ArbigentAi ai
    +ArbigentDevice device
    +List~ArbigentInterceptor~ interceptors
    +execute(arbigentAgentTask)
  }
  class ArbigentScenarioExecutor {
    +execute(arbigentScenario)
  }
  class ArbigentScenario {
    +List~ArbigentAgentTask~ agentTasks
  }
  ArbigentProject o--"*" ArbigentScenarioExecutor
  ArbigentScenarioExecutor o--"*" ArbigentAgent
  ArbigentScenario o--"*" ArbigentAgentTask
  ArbigentProject o--"*" ArbigentScenario
```
Warning
The YAML format is still under development and may change in the future.
The project file is saved in YAML format and contains scenarios with goals, initialization methods, and cleanup data. Dependencies between scenarios are also defined. You can write a project file in YAML format by hand or create it using the Arbigent UI.
The id is an auto-generated UUID from the Arbigent UI, but you can change it to any string.
```yaml
scenarios:
  - id: "7788d7f4-7276-4cb3-8e98-7d3ad1d1cd47"
    goal: "Open the Now in Android app from the app list. The goal is to view the list\
      \ of topics. Do not interact with the app beyond this."
    initializationMethods:
      - type: "CleanupData"
        packageName: "com.google.samples.apps.nowinandroid"
      - type: "LaunchApp"
        packageName: "com.google.samples.apps.nowinandroid"
  - id: "f0ef0129-c764-443f-897d-fc4408e5952b"
    goal: "In the Now in Android app, select a tech topic and complete the form in\
      \ the \"For you\" tab. The goal is reached when articles are displayed. Do not\
      \ click on any articles. If the browser opens, return to the app."
    dependency: "7788d7f4-7276-4cb3-8e98-7d3ad1d1cd47"
    imageAssertions:
      - assertionPrompt: "Articles are visible on the screen"
  - id: "73c785f7-0f45-4709-97b5-601b6803eb0d"
    goal: "Save an article using the Bookmark button."
    dependency: "f0ef0129-c764-443f-897d-fc4408e5952b"
  - id: "797514d2-fb04-4b92-9c07-09d46cd8f931"
    goal: "Check if a saved article appears in the Saved tab."
    dependency: "73c785f7-0f45-4709-97b5-601b6803eb0d"
    imageAssertions:
      - assertionPrompt: "The screen is showing the Saved tab"
      - assertionPrompt: "There is an article on the screen"
```
Warning
The code interface is still under development and may change in the future.
Arbigent provides a code interface for executing tests programmatically. Here's an example of how to run a test:
Stay tuned for the release of Arbigent on Maven Central.
You can load a project YAML file and execute it using the following code:
```kotlin
class ArbigentTest {
  private val scenarioFile = File(this::class.java.getResource("/projects/nowinandroidsample.yaml").toURI())

  @Test
  fun tests() = runTest(
    timeout = 10.minutes
  ) {
    val arbigentProject = ArbigentProject(
      file = scenarioFile,
      aiFactory = {
        OpenAIAi(
          apiKey = System.getenv("OPENAI_API_KEY")
        )
      },
      deviceFactory = {
        // Discover an Android device over adb and connect to it.
        AvailableDevice.Android(
          dadb = Dadb.discover()!!
        ).connectToDevice()
      }
    )
    arbigentProject.execute()
  }
}
```
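You can also assemble an `AgentConfig` and scenarios directly in code, as in this example that uses fakes (`FakeDevice` and `FakeAi`) in place of a real device and AI: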
```kotlin
val agentConfig = AgentConfig {
  deviceFactory { FakeDevice() }
  ai(FakeAi())
}
val arbigentScenarioExecutor = ArbigentScenarioExecutor {
}
val arbigentScenario = ArbigentScenario(
  id = "id2",
  agentTasks = listOf(
    ArbigentAgentTask("id1", "Log in to the app and see the home tab.", agentConfig),
    ArbigentAgentTask("id2", "Search for an episode and open its detail page.", agentConfig)
  ),
  maxStepCount = 10,
)
arbigentScenarioExecutor.execute(
  arbigentScenario
)
```
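A single agent task can also be executed on its own: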
```kotlin
val agentConfig = AgentConfig {
  deviceFactory { FakeDevice() }
  ai(FakeAi())
}

val task = ArbigentAgentTask("id1", "Log in to the app and see the home tab.", agentConfig)
ArbigentAgent(agentConfig)
  .execute(task)
```