Merge branch 'main' into rb/no-retry-llm

All-Hands-AI · Feb 3, 2025 · e94c57d
2 parents 6a829b7 + f24fbec

Showing 78 changed files with 1,543 additions and 405 deletions.
diff --git a/.github/workflows/ghcr-build.yml b/.github/workflows/ghcr-build.yml
@@ -219,7 +219,7 @@ jobs:
             exit 1
           fi
 
-  # Run unit tests with the EventStream runtime Docker images as root
+  # Run unit tests with the Docker runtime Docker images as root
   test_runtime_root:
     name: RT Unit Tests (Root)
     needs: [ghcr_build_runtime]
@@ -286,7 +286,7 @@ jobs:
           image_name=ghcr.io/${{ github.repository_owner }}/runtime:${{ env.RELEVANT_SHA }}-${{ matrix.base_image }}
           image_name=$(echo $image_name | tr '[:upper:]' '[:lower:]')
 
-          TEST_RUNTIME=eventstream \
+          TEST_RUNTIME=docker \
           SANDBOX_USER_ID=$(id -u) \
           SANDBOX_RUNTIME_CONTAINER_IMAGE=$image_name \
           TEST_IN_CI=true \
@@ -297,7 +297,7 @@ jobs:
         env:
           CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
 
-  # Run unit tests with the EventStream runtime Docker images as openhands user
+  # Run unit tests with the Docker runtime Docker images as openhands user
   test_runtime_oh:
     name: RT Unit Tests (openhands)
     runs-on: ubuntu-latest
@@ -363,7 +363,7 @@ jobs:
           image_name=ghcr.io/${{ github.repository_owner }}/runtime:${{ env.RELEVANT_SHA }}-${{ matrix.base_image }}
           image_name=$(echo $image_name | tr '[:upper:]' '[:lower:]')
 
-          TEST_RUNTIME=eventstream \
+          TEST_RUNTIME=docker \
           SANDBOX_USER_ID=$(id -u) \
           SANDBOX_RUNTIME_CONTAINER_IMAGE=$image_name \
           TEST_IN_CI=true \

diff --git a/.github/workflows/openhands-resolver.yml b/.github/workflows/openhands-resolver.yml
@@ -20,6 +20,10 @@ on:
         required: false
         type: string
         default: "anthropic/claude-3-5-sonnet-20241022"
+      LLM_API_VERSION:
+        required: false
+        type: string
+        default: ""
       base_container_image:
         required: false
         type: string
@@ -116,6 +120,7 @@ jobs:
           LLM_MODEL: ${{ secrets.LLM_MODEL || inputs.LLM_MODEL }}
           LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
           LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          LLM_API_VERSION: ${{ inputs.LLM_API_VERSION }}
           PAT_TOKEN: ${{ secrets.PAT_TOKEN }}
           PAT_USERNAME: ${{ secrets.PAT_USERNAME }}
           GITHUB_TOKEN: ${{ github.token }}
@@ -230,6 +235,7 @@ jobs:
           LLM_MODEL: ${{ secrets.LLM_MODEL || inputs.LLM_MODEL }}
           LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
           LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          LLM_API_VERSION: ${{ inputs.LLM_API_VERSION }}
           PYTHONPATH: ""
         run: |
           cd /tmp && python -m openhands.resolver.resolve_issue \
@@ -265,6 +271,7 @@ jobs:
           LLM_MODEL: ${{ secrets.LLM_MODEL || inputs.LLM_MODEL }}
           LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
           LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          LLM_API_VERSION: ${{ inputs.LLM_API_VERSION }}
           PYTHONPATH: ""
         run: |
           if [ "${{ steps.check_result.outputs.RESOLUTION_SUCCESS }}" == "true" ]; then
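
Note on the `LLM_API_VERSION` input above: because it defaults to `""`, the environment variable is always set in these steps, possibly to an empty string. A minimal sketch of how a consumer might treat that empty default as "not provided". The helper name `read_api_version` is hypothetical; the resolver's actual handling is not shown in this diff.

```python
import os
from typing import Mapping, Optional


def read_api_version(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return LLM_API_VERSION, or None when it is unset or blank.

    Hypothetical helper (an assumption of this sketch), not resolver code.
    """
    value = env.get('LLM_API_VERSION', '').strip()
    # An empty string is falsy, so the workflow's "" default maps to None.
    return value or None
```

Passing the environment as a mapping keeps the helper easy to exercise without mutating `os.environ`.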

diff --git a/.github/workflows/stale.yml b/.github/workflows/stale.yml
@@ -19,3 +19,4 @@ jobs:
           close-issue-message: 'This issue was closed because it has been stalled for over 30 days with no activity.'
           close-pr-message: 'This PR was closed because it has been stalled for over 30 days with no activity.'
           days-before-close: 7
+          operations-per-run: 150
diff --git a/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/architecture/runtime.md b/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/architecture/runtime.md
@@ -1,8 +1,8 @@
 
 
-# 📦 Runtime EventStream
+# 📦 Runtime Docker
 
-Le Runtime EventStream d'OpenHands est le composant principal qui permet l'exécution sécurisée et flexible des actions des agents d'IA.
+Le Runtime Docker d'OpenHands est le composant principal qui permet l'exécution sécurisée et flexible des actions des agents d'IA.
 Il crée un environnement en bac à sable (sandbox) en utilisant Docker, où du code arbitraire peut être exécuté en toute sécurité sans risquer le système hôte.
 
 ## Pourquoi avons-nous besoin d'un runtime en bac à sable ?

diff --git a/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/configuration-options.md b/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/configuration-options.md
@@ -163,7 +163,7 @@ Les options de configuration de base sont définies dans la section `[core]` du
 
 - `runtime`
   - Type : `str`
-  - Valeur par défaut : `"eventstream"`
+  - Valeur par défaut : `"docker"`
   - Description : Environnement d'exécution
 
 - `default_agent`

diff --git a/...8n/fr/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md b/...8n/fr/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
@@ -114,7 +114,7 @@ Pour créer un workflow d'évaluation pour votre benchmark, suivez ces étapes :
    def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
        config = AppConfig(
            default_agent=metadata.agent_class,
-           runtime='eventstream',
+           runtime='docker',
            max_iterations=metadata.max_iterations,
            sandbox=SandboxConfig(
                base_container_image='your_container_image',

diff --git a/...8n/zh-Hans/docusaurus-plugin-content-docs/current/usage/architecture/runtime.md b/...8n/zh-Hans/docusaurus-plugin-content-docs/current/usage/architecture/runtime.md
@@ -1,8 +1,8 @@
 以下是翻译后的内容:
 
-# 📦 EventStream 运行时
+# 📦 Docker 运行时
 
-OpenHands EventStream 运行时是实现 AI 代理操作安全灵活执行的核心组件。
+OpenHands Docker 运行时是实现 AI 代理操作安全灵活执行的核心组件。
 它使用 Docker 创建一个沙盒环境,可以安全地运行任意代码而不会危及主机系统。
 
 ## 为什么我们需要沙盒运行时?

diff --git a/...n/zh-Hans/docusaurus-plugin-content-docs/current/usage/configuration-options.md b/...n/zh-Hans/docusaurus-plugin-content-docs/current/usage/configuration-options.md
@@ -162,7 +162,7 @@
 
 - `runtime`
   - 类型: `str`
-  - 默认值: `"eventstream"`
+  - 默认值: `"docker"`
   - 描述: 运行时环境
 
 - `default_agent`

diff --git a/...-Hans/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md b/...-Hans/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
@@ -112,7 +112,7 @@ OpenHands 的主要入口点在 `openhands/core/main.py` 中。以下是它的
    def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
        config = AppConfig(
            default_agent=metadata.agent_class,
-           runtime='eventstream',
+           runtime='docker',
            max_iterations=metadata.max_iterations,
            sandbox=SandboxConfig(
                base_container_image='your_container_image',

diff --git a/docs/modules/usage/architecture/runtime.md b/docs/modules/usage/architecture/runtime.md
@@ -1,6 +1,6 @@
-# 📦 EventStream Runtime
+# 📦 Docker Runtime
 
-The OpenHands EventStream Runtime is the core component that enables secure and flexible execution of AI agent's action.
+The OpenHands Docker Runtime is the core component that enables secure and flexible execution of AI agent's action.
 It creates a sandboxed environment using Docker, where arbitrary code can be run safely without risking the host system.
 
 ## Why do we need a sandboxed runtime?

diff --git a/docs/modules/usage/configuration-options.md b/docs/modules/usage/configuration-options.md
@@ -126,7 +126,7 @@ The core configuration options are defined in the `[core]` section of the `confi
 
 - `runtime`
   - Type: `str`
-  - Default: `"eventstream"`
+  - Default: `"docker"`
   - Description: Runtime environment
 
 - `default_agent`

diff --git a/docs/modules/usage/how-to/custom-sandbox-guide.md b/docs/modules/usage/how-to/custom-sandbox-guide.md
@@ -41,8 +41,16 @@ docker build -t custom-image .
 
 This will produce a new image called `custom-image`, which will be available in Docker.
 
-> Note that in the configuration described in this document, OpenHands will run as user "openhands" inside the
-> sandbox and thus all packages installed via the docker file should be available to all users on the system, not just root.
+## Using the Docker Command
+
+When running OpenHands using [the docker command](/modules/usage/installation#start-the-app), replace
+`-e SANDBOX_RUNTIME_CONTAINER_IMAGE=...` with `-e SANDBOX_BASE_CONTAINER_IMAGE=<custom image name>`:
+
+```commandline
+docker run -it --rm --pull=always \
+    -e SANDBOX_BASE_CONTAINER_IMAGE=custom-image \
+    ...
+```
 
 ## Using the Development Workflow
 

diff --git a/docs/modules/usage/how-to/evaluation-harness.md b/docs/modules/usage/how-to/evaluation-harness.md
@@ -112,7 +112,7 @@ To create an evaluation workflow for your benchmark, follow these steps:
    def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
        config = AppConfig(
            default_agent=metadata.agent_class,
-           runtime='eventstream',
+           runtime='docker',
            max_iterations=metadata.max_iterations,
            sandbox=SandboxConfig(
                base_container_image='your_container_image',

diff --git a/evaluation/benchmarks/EDA/run_infer.py b/evaluation/benchmarks/EDA/run_infer.py
@@ -69,6 +69,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=False,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/agent_bench/run_infer.py b/evaluation/benchmarks/agent_bench/run_infer.py
@@ -53,6 +53,7 @@ def get_config(
             remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
             keep_runtime_alive=False,
             remote_runtime_init_timeout=3600,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/aider_bench/run_infer.py b/evaluation/benchmarks/aider_bench/run_infer.py
@@ -61,6 +61,7 @@ def get_config(
             remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
             keep_runtime_alive=False,
             remote_runtime_init_timeout=1800,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/biocoder/run_infer.py b/evaluation/benchmarks/biocoder/run_infer.py
@@ -67,6 +67,7 @@ def get_config(
             base_container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/bird/run_infer.py b/evaluation/benchmarks/bird/run_infer.py
@@ -80,6 +80,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/browsing_delegation/run_infer.py b/evaluation/benchmarks/browsing_delegation/run_infer.py
@@ -45,6 +45,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=False,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         workspace_base=None,
         workspace_mount_path=None,

diff --git a/evaluation/benchmarks/commit0_bench/run_infer.py b/evaluation/benchmarks/commit0_bench/run_infer.py
@@ -135,6 +135,7 @@ def get_config(
             remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
             keep_runtime_alive=False,
             remote_runtime_init_timeout=3600,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/discoverybench/run_infer.py b/evaluation/benchmarks/discoverybench/run_infer.py
@@ -71,6 +71,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/gaia/run_infer.py b/evaluation/benchmarks/gaia/run_infer.py
@@ -56,6 +56,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/gorilla/run_infer.py b/evaluation/benchmarks/gorilla/run_infer.py
@@ -49,6 +49,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/gpqa/run_infer.py b/evaluation/benchmarks/gpqa/run_infer.py
@@ -70,6 +70,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/humanevalfix/run_infer.py b/evaluation/benchmarks/humanevalfix/run_infer.py
@@ -91,6 +91,7 @@ def get_config(
             base_container_image='python:3.12-bookworm',
             enable_auto_lint=True,
             use_host_network=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/logic_reasoning/run_infer.py b/evaluation/benchmarks/logic_reasoning/run_infer.py
@@ -55,6 +55,7 @@ def get_config(
             enable_auto_lint=True,
             use_host_network=False,
             runtime_extra_deps='$OH_INTERPRETER_PATH -m pip install scitools-pyke',
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/miniwob/run_infer.py b/evaluation/benchmarks/miniwob/run_infer.py
@@ -70,6 +70,7 @@ def get_config(
             remote_runtime_init_timeout=1800,
             keep_runtime_alive=False,
             timeout=120,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/mint/run_infer.py b/evaluation/benchmarks/mint/run_infer.py
@@ -113,6 +113,7 @@ def get_config(
             enable_auto_lint=True,
             use_host_network=False,
             runtime_extra_deps=f'$OH_INTERPRETER_PATH -m pip install {" ".join(MINT_DEPENDENCIES)}',
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/scienceagentbench/run_infer.py b/evaluation/benchmarks/scienceagentbench/run_infer.py
@@ -73,6 +73,7 @@ def get_config(
             api_key=os.environ.get('ALLHANDS_API_KEY', None),
             remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
             keep_runtime_alive=False,
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/swe_bench/eval_infer.py b/evaluation/benchmarks/swe_bench/eval_infer.py
@@ -412,6 +412,17 @@ def process_instance(
         with open(metadata_filepath, 'r') as metadata_file:
             data = metadata_file.read()
             metadata = EvalMetadata.model_validate_json(data)
+    else:
+        # Initialize with a dummy metadata when file doesn't exist
+        metadata = EvalMetadata(
+            agent_class="dummy_agent",  # Placeholder agent class
+            llm_config=LLMConfig(model="dummy_model"),  # Minimal LLM config
+            max_iterations=1,  # Minimal iterations
+            eval_output_dir=os.path.dirname(args.input_file),  # Use input file dir as output dir
+            start_time=time.strftime('%Y-%m-%d %H:%M:%S'),  # Current time
+            git_commit=subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('utf-8').strip(),  # Current commit
+            dataset=args.dataset  # Dataset name from args
+        )
 
     # The evaluation harness constrains the signature of `process_instance_func` but we need to
     # pass extra information. Build a new function object to avoid issues with multiprocessing.
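
Note on the `else:` branch above: it builds placeholder metadata when no metadata file exists. The pattern can be sketched in isolation as follows. `Metadata`, `current_commit`, and `load_metadata` are hypothetical stand-ins, not OpenHands APIs, and the `'unknown'` fallback is an addition of this sketch: the diff calls `git rev-parse HEAD` directly, which raises outside a git checkout.

```python
import json
import os
import subprocess
import time
from dataclasses import dataclass


@dataclass
class Metadata:
    """Hypothetical stand-in for EvalMetadata, reduced to a few fields."""

    agent_class: str
    max_iterations: int
    eval_output_dir: str
    start_time: str
    git_commit: str
    dataset: str


def current_commit() -> str:
    """Return the current git commit hash, or 'unknown' if git fails."""
    try:
        out = subprocess.check_output(
            ['git', 'rev-parse', 'HEAD'], stderr=subprocess.DEVNULL
        )
        return out.decode('utf-8').strip()
    except (subprocess.CalledProcessError, OSError):
        # Not a git checkout, or git is not installed.
        return 'unknown'


def load_metadata(metadata_filepath: str, input_file: str, dataset: str) -> Metadata:
    """Load metadata from JSON if the file exists, else build a placeholder."""
    if os.path.exists(metadata_filepath):
        with open(metadata_filepath, 'r') as metadata_file:
            return Metadata(**json.load(metadata_file))
    # Placeholder values mirroring the ones used in the diff.
    return Metadata(
        agent_class='dummy_agent',
        max_iterations=1,
        eval_output_dir=os.path.dirname(input_file),
        start_time=time.strftime('%Y-%m-%d %H:%M:%S'),
        git_commit=current_commit(),
        dataset=dataset,
    )
```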

diff --git a/evaluation/benchmarks/swe_bench/run_infer.py b/evaluation/benchmarks/swe_bench/run_infer.py
@@ -144,6 +144,7 @@ def get_config(
                 dataset_name=metadata.dataset,
                 instance_id=instance['instance_id'],
             ),
+            remote_runtime_enable_retries=True,
         ),
         # do not mount workspace
         workspace_base=None,

diff --git a/evaluation/benchmarks/the_agent_company/README.md b/evaluation/benchmarks/the_agent_company/README.md
@@ -17,27 +17,36 @@ When the `run_infer.sh` script is started, it will automatically pull all task i
 
 ```bash
 ./evaluation/benchmarks/the_agent_company/scripts/run_infer.sh \
-  --agent-llm-config <agent-llm-config>  \
-  --env-llm-config <env-llm-config> \
-  --outputs-path <outputs-path> \
-  --server-hostname <server-hostname> \
-  --version <version>
+  --agent-llm-config <agent-llm-config, default to 'agent'>  \
+  --env-llm-config <env-llm-config, default to 'env'> \
+  --outputs-path <outputs-path, default to outputs> \
+  --server-hostname <server-hostname, default to localhost> \
+  --version <version, default to 1.0.0> \
+  --start-percentile <integer from 0 to 99, default to 0> \
+  --end-percentile <integer from 1 to 100, default to 100>
+
 
 # Example
 ./evaluation/benchmarks/the_agent_company/scripts/run_infer.sh \
   --agent-llm-config claude-3-5-sonnet-20240620 \
   --env-llm-config claude-3-5-sonnet-20240620 \
   --outputs-path outputs \
   --server-hostname localhost \
-  --version 1.0.0
+  --version 1.0.0 \
+  --start-percentile 10 \
+  --end-percentile 20
 ```
 
 - `agent-llm-config`: the config name for the agent LLM. This should match the config name in config.toml. This is the LLM used by the agent (e.g. CodeActAgent).
 - `env-llm-config`: the config name for the environment LLM. This should match the config name in config.toml. This is used by the chat bots (NPCs) and LLM-based evaluators.
 - `outputs-path`: the path to save trajectories and evaluation results.
 - `server-hostname`: the hostname of the server that hosts all the web services. It could be localhost if you are running the evaluation and services on the same machine. If the services are hosted on a remote machine, you must use the hostname of the remote machine rather than IP address.
 - `version`: the version of the task images to use. Currently, the only supported version is 1.0.0.
+- `start-percentile`: the start percentile of the task split; must be an integer from 0 to 99.
+- `end-percentile`: the end percentile of the task split; must be an integer from 1 to 100 and larger than `start-percentile`.
 
-The script is idempotent. If you run it again, it will resume from the last checkpoint. It would usually take a few days to finish evaluation.
+The script is idempotent. If you run it again, it will resume from the last checkpoint. Evaluating the whole task set usually takes 2 days.
+To speed up evaluation, you can use `start-percentile` and `end-percentile` to split the tasks for higher parallelism,
+provided concurrent runs are **targeting different servers**.
 
 Note: the script will automatically skip a task if it encounters an error. This usually happens when the OpenHands runtime dies due to some unexpected errors. This means even if the script finishes, it might not have evaluated all tasks. You can manually resume the evaluation by running the script again.
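
Note on `--start-percentile` and `--end-percentile`: one way such a split could work is a floor-based slice over an ordered task list. This is a sketch under that assumption only; how `run_infer.sh` actually maps percentiles to tasks is not shown in this diff, and `split_by_percentile` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar('T')


def split_by_percentile(
    tasks: Sequence[T], start_percentile: int = 0, end_percentile: int = 100
) -> List[T]:
    """Return the tasks falling in [start_percentile, end_percentile)."""
    if not 0 <= start_percentile <= 99:
        raise ValueError('start_percentile must be an integer from 0 to 99')
    if not 1 <= end_percentile <= 100:
        raise ValueError('end_percentile must be an integer from 1 to 100')
    if end_percentile <= start_percentile:
        raise ValueError('end_percentile must be larger than start_percentile')
    n = len(tasks)
    # Integer (floor) division keeps adjacent ranges from overlapping.
    start = n * start_percentile // 100
    end = n * end_percentile // 100
    return list(tasks[start:end])


# With 50 tasks, percentiles 10 to 20 select indices 5 through 9.
subset = split_by_percentile(list(range(50)), 10, 20)
```

Under this scheme, adjacent ranges such as 0 to 10 and 10 to 20 cover disjoint tasks, so each concurrent run can handle its own slice.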
diff --git a/evaluation/benchmarks/the_agent_company/run_infer.py b/evaluation/benchmarks/the_agent_company/run_infer.py
@@ -50,6 +50,7 @@ def get_config(
             # large enough timeout, since some testcases take very long to run
             timeout=300,
             api_key=os.environ.get('ALLHANDS_API_KEY', None),
+            remote_runtime_enable_retries=True,
         ),
         # we mount trajectories path so that trajectories, generated by OpenHands
         # controller, can be accessible to the evaluator file in the runtime container