Commits
131 commits
65750c4
Fix homepage url bugs. Add instructions on setting it up
shuyanzhou Aug 27, 2023
68cff40
Merge pull request #28 from web-arena-x/23-instructions-to-setup-env-…
shuyanzhou Aug 27, 2023
526a00e
Fix Mangeto base URL redirect problem
shuyanzhou Aug 28, 2023
9c01d57
Merge pull request #29 from web-arena-x/18-some-links-point-to-an-ext…
shuyanzhou Aug 28, 2023
fd3f05a
add mirror download links
frankxu2004 Sep 8, 2023
f3a7f58
update inaccurate annotations
shuyanzhou Sep 11, 2023
ed93b3a
slow verion of more robust viewport
shuyanzhou Sep 12, 2023
e44972d
remove beartype for efficency purpose
shuyanzhou Sep 13, 2023
669958d
Improve inaccurate locators
shuyanzhou Sep 14, 2023
26a1721
Merge must_include
shuyanzhou Sep 14, 2023
da9d7a3
add support for os agnostic meta/control+a
oootttyyy Sep 15, 2023
676b580
add clear textbox test
oootttyyy Sep 15, 2023
86e8dfc
Use more exact_match if possible
shuyanzhou Sep 15, 2023
d1450f2
update evaluators to match the new config format
shuyanzhou Sep 16, 2023
5af6100
add clear textbox test
oootttyyy Sep 15, 2023
5b94f5f
recover necessary beartype
shuyanzhou Sep 16, 2023
06f5a72
Merge branch '34-os-agnostic-select-all' of https://github.com/web-ar…
oootttyyy Sep 16, 2023
9ccc2dc
fix black formatting
oootttyyy Sep 16, 2023
772a539
fix black formatting
oootttyyy Sep 16, 2023
fe58b55
fix black formatting
oootttyyy Sep 16, 2023
017a735
fix async key press
oootttyyy Sep 16, 2023
1c0f414
Merge pull request #43 from web-arena-x/34-os-agnostic-select-all
shuyanzhou Sep 17, 2023
7630e04
Merge remote-tracking branch 'origin/bug-in-current-viewport-gitlab' …
shuyanzhou Sep 18, 2023
536b5cf
ignore nltk type
shuyanzhou Sep 19, 2023
bb3115b
add nltk install to the workflow
shuyanzhou Sep 19, 2023
551d248
Merge pull request #44 from web-arena-x/25-errorsimperfections-in-eva…
shuyanzhou Sep 19, 2023
c1fc273
Fix evaluation annotation for example 301, 302
shuyanzhou Sep 19, 2023
730b4dd
Merge pull request #47 from web-arena-x/fix_example_301_302
shuyanzhou Sep 20, 2023
1fef526
add huggingface model support
shuyanzhou Sep 22, 2023
507659a
better rendering of typing action
shuyanzhou Sep 22, 2023
2b15f20
multi threading auto login; auto login per example
shuyanzhou Sep 22, 2023
e84910d
better error message for env config
shuyanzhou Sep 22, 2023
493294b
fix statictext bounding box bug
shuyanzhou Sep 22, 2023
57d2067
add support to evaluate by trace
shuyanzhou Sep 22, 2023
16f2592
support generation retry when the parsing of the action failed
shuyanzhou Sep 22, 2023
9f3e4ac
ignore cache
shuyanzhou Sep 22, 2023
5ae6834
fix statictext bounding box bug
shuyanzhou Sep 22, 2023
c0a9ebd
Merge remote-tracking branch 'origin/main' into bug-in-current-viewpo…
shuyanzhou Sep 22, 2023
ba3c07c
Merge pull request #39 from web-arena-x/bug-in-current-viewport-gitlab
shuyanzhou Sep 23, 2023
0e7bcda
add prompts
shuyanzhou Sep 23, 2023
741292e
fix force_prefix missing bug
shuyanzhou Sep 23, 2023
1ee1ea4
fix typo
shuyanzhou Sep 23, 2023
c1ae73c
add script to check inference failures
shuyanzhou Sep 23, 2023
6fdbd92
add parallel running script
shuyanzhou Sep 23, 2023
cd7d593
fix annotation errors based on human trajectories
shuyanzhou Sep 26, 2023
b6e0b22
change reddit vote related posts to absolute urls
shuyanzhou Sep 26, 2023
f8d636a
update URL matching, fix typos
shuyanzhou Sep 26, 2023
b4c917d
update must_include tokenization condition; upate url match
shuyanzhou Sep 26, 2023
a7c475b
remove unused evaluators
shuyanzhou Sep 26, 2023
50e2c43
remove exact from evalutor names
shuyanzhou Sep 26, 2023
db063c7
update test example due to html escape
shuyanzhou Sep 26, 2023
6ab7fd2
update fuzzy match prompt
shuyanzhou Sep 27, 2023
58061ee
reduce coordinate precision; fix template 67 annotations
shuyanzhou Sep 27, 2023
4b86d43
fix locator for product; add prep action; fix url for promo rules
shuyanzhou Oct 20, 2023
df87757
add options to renew cookie for selected sites
shuyanzhou Oct 20, 2023
3d3d837
print unfinished examples
shuyanzhou Oct 20, 2023
7730a85
reduce openai max retry
shuyanzhou Oct 20, 2023
f91eb5b
minor
shuyanzhou Oct 20, 2023
4cec5ac
minor
shuyanzhou Oct 21, 2023
7a1f8d6
Merge remote-tracking branch 'origin/main' into new_eval
shuyanzhou Oct 21, 2023
9f0900f
fix type errors
shuyanzhou Oct 21, 2023
a68aa1a
Merge pull request #54 from web-arena-x/new_eval
shuyanzhou Oct 21, 2023
e32b71e
Update README.md
shuyanzhou Oct 21, 2023
3065fda
remove duplicate "string_match" in "eval_types" for task 301 302
nicholaschenai Oct 22, 2023
f086fe2
Merge pull request #55 from nicholaschenai/nicholaschenai-patch-1
shuyanzhou Oct 24, 2023
00cc5db
Update README.md
shuyanzhou Oct 25, 2023
8a664cb
add gitlab url change fix
frankxu2004 Nov 2, 2023
1b4f8ce
add v2 execution trajectories
shuyanzhou Nov 3, 2023
bb16bd1
add AMI instructions
shuyanzhou Nov 3, 2023
6da5f53
Merge pull request #61 from web-arena-x/v2_traj
shuyanzhou Nov 3, 2023
ec1e8c4
Add Zeno support
shuyanzhou Nov 3, 2023
b294709
Merge pull request #62 from web-arena-x/zeno
shuyanzhou Nov 3, 2023
e28e6b0
Update README.md
shuyanzhou Nov 3, 2023
c74c4f0
minor
shuyanzhou Nov 3, 2023
137fc11
minor
shuyanzhou Nov 3, 2023
73d8dce
Merge pull request #63 from web-arena-x/html2json-patch
shuyanzhou Nov 4, 2023
8210cd1
Update README.md
shuyanzhou Nov 19, 2023
e989873
fix the regex in cleaning axtree
shuyanzhou Dec 6, 2023
d5c9dbd
add openai and transformers lib version
shuyanzhou Dec 6, 2023
ae96e67
Merge pull request #71 from web-arena-x/70-issue-with-accessibility-t…
shuyanzhou Dec 6, 2023
f4abead
update zeno project url
shuyanzhou Dec 10, 2023
039f934
Merge pull request #74 from web-arena-x/72-zeno-results-seem-unauthor…
shuyanzhou Dec 10, 2023
ab4f2ad
add human trajectories
shuyanzhou Dec 21, 2023
b0a2424
Merge pull request #78 from web-arena-x/73-human-annotation-traces
shuyanzhou Dec 21, 2023
3e3685d
support unachievable task eval when not explicit instruction is given
oootttyyy Dec 22, 2023
eb68ab6
fix pre-commit
oootttyyy Dec 22, 2023
0cec70e
Revert "fix pre-commit"
oootttyyy Dec 22, 2023
cf51999
support unachievable task eval when no explicit instruction is given
oootttyyy Dec 22, 2023
2a737cc
add missing ua reason
oootttyyy Dec 22, 2023
7d01c33
should account for uahint too
oootttyyy Dec 22, 2023
3c4dcab
use fuzzy_match for UA tasks and update ua eval prompt
oootttyyy Dec 22, 2023
b9d4f0e
retain support for n/a
oootttyyy Dec 22, 2023
73d9de7
add comment
oootttyyy Dec 22, 2023
c6475f0
Merge pull request #81 from web-arena-x/ua_eval
shuyanzhou Dec 22, 2023
ac84657
Update README.md
anamhira47 Dec 23, 2023
14f91d9
Update README.md
eltociear Jan 9, 2024
6fd6887
update env readme
frankxu2004 Feb 13, 2024
2e690f9
Update README.md
optimass Mar 5, 2024
bb6e4c6
fix typo in intent
lwaekfjlk Mar 14, 2024
0056d34
Merge pull request #111 from lwaekfjlk/feature/fix-typo-in-intent
shuyanzhou Mar 25, 2024
19c5fea
add leaderboard link
shuyanzhou Apr 11, 2024
abd8269
Update README.md
frankxu2004 Apr 15, 2024
a643543
Merge pull request #108 from optimass/patch-2
frankxu2004 Apr 15, 2024
eab9b05
Merge pull request #86 from eltociear/patch-1
frankxu2004 Apr 15, 2024
955eec8
Merge pull request #83 from anamhira47/anamhira47-patch-1
frankxu2004 Apr 15, 2024
ce732b2
Update README.md
frankxu2004 Apr 15, 2024
dc56401
Update helper_functions.py
frankxu2004 Apr 15, 2024
b476478
Update helper_functions.py
frankxu2004 Apr 15, 2024
aeb9e82
Merge pull request #126 from web-arena-x/frankxu2004-patch-1
frankxu2004 Apr 15, 2024
de524be
notes on setup and reset environment
shuyanzhou Apr 29, 2024
4c741b4
Update README.md
shuyanzhou May 29, 2024
cf388a2
Update README.md
shuyanzhou Jul 22, 2024
1469b7c
Merge pull request #162 from web-arena-x/shuyanzhou-patch-2
shuyanzhou Jul 22, 2024
41b2aaf
altera agent
Jul 31, 2024
b2a8b0b
agent working, log messages
Jul 31, 2024
23da5a1
cool
dryingpaint Jul 31, 2024
8b6cccf
edits
dryingpaint Aug 1, 2024
f517ca5
updated requirements.txt
Aug 2, 2024
22b9e98
edits
Aug 2, 2024
912da27
prompt
Aug 2, 2024
1437d3e
benchmarking
Aug 5, 2024
370666d
benchmark
Aug 6, 2024
05e3bae
bench
Aug 6, 2024
d691780
better benching
Aug 6, 2024
5639676
fixes
Aug 7, 2024
62056b4
fix
Aug 7, 2024
48eda7a
fix
Aug 9, 2024
0641151
fix
Aug 10, 2024
3a7ee45
bench
Sep 20, 2024
21e8e1e
working
Sep 20, 2024
9c1b133
update
Sep 21, 2024
3 changes: 2 additions & 1 deletion .github/workflows/tests.yml
@@ -24,6 +24,7 @@ jobs:
run: |
pip install -r requirements.txt
playwright install
python -m nltk.downloader punkt stopwords
pip install -e .[dev]
- name: Type-checking package with mypy
run: |
@@ -33,7 +34,7 @@ jobs:
mypy --version
# Run this mypy instance against our main package.
mypy --install-types --non-interactive .
mypy --strict .
mypy --strict . --exclude scripts
- name: Enviroment prepare
run: |
bash prepare.sh
32 changes: 20 additions & 12 deletions .gitignore
@@ -141,18 +141,26 @@ run.sh

# trajectory visualization
render_cache/*
cache/*

# TMP IGNORE
agent/prompts/jsons/*
# agent/prompts/jsons/*
log_files/
config_files/*0.json
config_files/*1.json
config_files/*2.json
config_files/*3.json
config_files/*4.json
config_files/*5.json
config_files/*6.json
config_files/*7.json
config_files/*8.json
config_files/*9.json
config_files/test.json
config_files*/*0.json
config_files*/*1.json
config_files*/*2.json
config_files*/*3.json
config_files*/*4.json
config_files*/*5.json
config_files*/*6.json
config_files*/*7.json
config_files*/*8.json
config_files*/*9.json
config_files*/test.json
node_modules/
/test-results/
/playwright-report/
/blob-report/
/playwright/.cache/
/run_outputs/*
/traces/*
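
The `.gitignore` hunk above widens `config_files/*N.json` to `config_files*/*N.json`. The following is a minimal sketch of what that change covers, assuming gitignore's rule that `*` does not cross a `/` in a pattern containing a slash. It approximates that rule with `fnmatch` applied per path segment and is not Git's actual matcher; the directory name `config_files_altera` is a hypothetical example, not a path taken from the PR.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, path: str) -> bool:
    """Approximate a slash-containing gitignore pattern.

    Pattern and path are compared segment by segment, so '*' never
    matches across a '/'. This is a simplification of Git's rules.
    """
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts)
    )


old_pattern = "config_files/*0.json"
new_pattern = "config_files*/*0.json"

# 'config_files_altera' is a hypothetical sibling directory used for illustration.
for path in ("config_files/10.json", "config_files_altera/10.json"):
    print(path, matches(old_pattern, path), matches(new_pattern, path))
```

Under this approximation the old pattern only covers the `config_files/` directory, while the new one also covers any sibling directory whose name starts with `config_files`.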
18 changes: 13 additions & 5 deletions README.md
@@ -16,16 +16,20 @@

<p align="center">
<a href="https://webarena.dev/">Website</a> •
<a href="https://arxiv.org/2307.13854">Paper</a>
<a href="https://arxiv.org/abs/2307.13854">Paper</a> •
<a href="https://docs.google.com/spreadsheets/d/1M801lEpBbKSNwP-vDBkC_pF7LdyGU1f_ufZb_NWNBZQ/edit?usp=sharing">Leaderboard</a>
</p>

![Overview](media/overview.png)

## Roadmap
- [ ] In-house end-to-end evaluation. We are working on an API that accepts predicted actions from any interface and then returns the subsequent observation.
- [ ] Support more agents with different prompting mechanisms such as [ASH](https://arxiv.org/pdf/2305.14257.pdf).

## News
* [12/21/2023] We release the recording of trajectories performed by human annotators on ~170 tasks. Check out the [resource page](./resources/README.md#12212023-human-trajectories) for more details.
* [11/3/2023] Multiple features!
* Uploaded newest [execution trajectories](./resources/README.md#1132023-execution-traces-from-our-experiments-v2)
* Added [Amazon Machine Image](./environment_docker/README.md#pre-installed-amazon-machine-image) that pre-installed all websites so that you don't have to!
* [Zeno](https://zenoml.com/) x WebArena which allows you to analyze your agents on WebArena without pain. Check out this [notebook](./scripts/webarena-zeno.ipynb) to upload your own data to Zeno, and [this](https://hub.zenoml.com/project/9db3e1cf-6e28-4cfc-aeec-1670cac01872/WebArena%20Tester/explore?params=eyJtb2RlbCI6ImdwdDM1LWRpcmVjdCIsIm1ldHJpYyI6eyJpZCI6NzQ5MiwibmFtZSI6InN1Y2Nlc3MiLCJ0eXBlIjoibWVhbiIsImNvbHVtbnMiOlsic3VjY2VzcyJdfSwiY29tcGFyaXNvbk1vZGVsIjoiZ3B0NC1jb3QiLCJjb21wYXJpc29uQ29sdW1uIjp7ImlkIjoiYTVlMDFiZDUtZTg0NS00M2I4LTllNDgtYTU4NzRiNDJjNjNhIiwibmFtZSI6ImNvbnRleHQiLCJjb2x1bW5UeXBlIjoiT1VUUFVUIiwiZGF0YVR5cGUiOiJOT01JTkFMIiwibW9kZWwiOiJncHQzNS1kaXJlY3QifSwiY29tcGFyZVNvcnQiOltudWxsLHRydWVdLCJtZXRyaWNSYW5nZSI6WzAsMV0sInNlbGVjdGlvbnMiOnsibWV0YWRhdGEiOnt9LCJzbGljZXMiOltdLCJ0YWdzIjpbXX19) page for browsing our existing results!
* [10/24/2023] We re-examined the whole dataset and fixed the spotted annotation bugs. The current version ([v0.2.0](https://github.com/web-arena-x/webarena/releases/tag/v0.2.0)) is relatively stable and we don't expect major updates on the annotation in the future. The new results with better prompts and the comparison with human performance can be found in our [paper](https://arxiv.org/abs/2307.13854)
* [8/4/2023] Added the instructions and the docker resources to host your own WebArena Environment. Check out [this page](environment_docker/README.md) for details.
* [7/29/2023] Added [a well commented script](minimal_example.py) to walk through the environment setup.
## Install
@@ -66,6 +70,9 @@ action = create_id_based_action(f"click [id]")
obs, _, terminated, _, info = env.step(action)
```
## End-to-end Evaluation
> [!IMPORTANT]
> To ensure the correct evaluation, please setup your own WebArena websites following step 1 and step 2. The demo sites are only for browsing purpose to help you better understand the content. After evaluating the 812 examples, reset the environment to the initial state following the instructions [here](./environment_docker/README.md#environment-reset).

1. Setup the standalone environment.
Please check out [this page](environment_docker/README.md) for details.

@@ -106,8 +113,9 @@ python run.py \
```
This script will run the first example with GPT-3.5 reasoning agent. The trajectory will be saved in `<your_result_dir>/0.html`


## Develop Your Prompt-based Agent
1. Define the prompts. We provide two baseline agents whose correrponding prompts are listed [here](./agent/prompts/raw). Each prompt is a dictionary with the following keys:
1. Define the prompts. We provide two baseline agents whose corresponding prompts are listed [here](./agent/prompts/raw). Each prompt is a dictionary with the following keys:
```python
prompt = {
"intro": <The overall guideline which includes the task description, available action, hint and others>,
3 changes: 2 additions & 1 deletion agent/__init__.py
@@ -2,7 +2,8 @@
Agent,
PromptAgent,
TeacherForcingAgent,
AlteraAgent,
construct_agent,
)

__all__ = ["Agent", "TeacherForcingAgent", "PromptAgent", "construct_agent"]
__all__ = ["Agent", "TeacherForcingAgent", "PromptAgent", "construct_agent", "AlteraAgent"]
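
The `agent/__init__.py` hunk adds `AlteraAgent` to `__all__`. As a rough illustration of why that matters, the sketch below builds a throwaway module named `fake_agent` (a hypothetical stand-in, not the real `agent` package) and shows that `from module import *` only picks up names listed in `__all__`. The placeholder classes are empty and carry none of the real agents' behavior.

```python
import sys
import types

# Hypothetical stand-in for the agent package; the classes are empty placeholders.
fake = types.ModuleType("fake_agent")
exec(
    "class Agent: pass\n"
    "class AlteraAgent(Agent): pass\n"
    "__all__ = ['Agent']\n",
    fake.__dict__,
)
sys.modules["fake_agent"] = fake


def star_import_names(module_name: str) -> set[str]:
    """Return the public names that `from <module> import *` would bind."""
    namespace: dict[str, object] = {}
    exec(f"from {module_name} import *", namespace)
    # Drop dunder entries such as __builtins__ that exec adds to the namespace.
    return {name for name in namespace if not name.startswith("__")}


before = star_import_names("fake_agent")

# Mirror the diff: list the new class in __all__ as well.
fake.__all__ = ["Agent", "AlteraAgent"]
after = star_import_names("fake_agent")

print(sorted(before), sorted(after))
```

Explicit imports such as `from agent import AlteraAgent` work regardless of `__all__`; the list only affects star imports and tools that read it as the package's public surface.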