Skip to content

Commit

Permalink
Merge branch 'main' into rb/no-retry-llm
Browse files Browse the repository at this point in the history
  • Loading branch information
enyst authored Feb 3, 2025
2 parents 6a829b7 + f24fbec commit e94c57d
Show file tree
Hide file tree
Showing 78 changed files with 1,543 additions and 405 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/ghcr-build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -219,7 +219,7 @@ jobs:
exit 1
fi
# Run unit tests with the EventStream runtime Docker images as root
# Run unit tests with the Docker runtime Docker images as root
test_runtime_root:
name: RT Unit Tests (Root)
needs: [ghcr_build_runtime]
Expand Down Expand Up @@ -286,7 +286,7 @@ jobs:
image_name=ghcr.io/${{ github.repository_owner }}/runtime:${{ env.RELEVANT_SHA }}-${{ matrix.base_image }}
image_name=$(echo $image_name | tr '[:upper:]' '[:lower:]')
TEST_RUNTIME=eventstream \
TEST_RUNTIME=docker \
SANDBOX_USER_ID=$(id -u) \
SANDBOX_RUNTIME_CONTAINER_IMAGE=$image_name \
TEST_IN_CI=true \
Expand All @@ -297,7 +297,7 @@ jobs:
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

# Run unit tests with the EventStream runtime Docker images as openhands user
# Run unit tests with the Docker runtime Docker images as openhands user
test_runtime_oh:
name: RT Unit Tests (openhands)
runs-on: ubuntu-latest
Expand Down Expand Up @@ -363,7 +363,7 @@ jobs:
image_name=ghcr.io/${{ github.repository_owner }}/runtime:${{ env.RELEVANT_SHA }}-${{ matrix.base_image }}
image_name=$(echo $image_name | tr '[:upper:]' '[:lower:]')
TEST_RUNTIME=eventstream \
TEST_RUNTIME=docker \
SANDBOX_USER_ID=$(id -u) \
SANDBOX_RUNTIME_CONTAINER_IMAGE=$image_name \
TEST_IN_CI=true \
Expand Down
7 changes: 7 additions & 0 deletions .github/workflows/openhands-resolver.yml
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,10 @@ on:
required: false
type: string
default: "anthropic/claude-3-5-sonnet-20241022"
LLM_API_VERSION:
required: false
type: string
default: ""
base_container_image:
required: false
type: string
Expand Down Expand Up @@ -116,6 +120,7 @@ jobs:
LLM_MODEL: ${{ secrets.LLM_MODEL || inputs.LLM_MODEL }}
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
LLM_API_VERSION: ${{ inputs.LLM_API_VERSION }}
PAT_TOKEN: ${{ secrets.PAT_TOKEN }}
PAT_USERNAME: ${{ secrets.PAT_USERNAME }}
GITHUB_TOKEN: ${{ github.token }}
Expand Down Expand Up @@ -230,6 +235,7 @@ jobs:
LLM_MODEL: ${{ secrets.LLM_MODEL || inputs.LLM_MODEL }}
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
LLM_API_VERSION: ${{ inputs.LLM_API_VERSION }}
PYTHONPATH: ""
run: |
cd /tmp && python -m openhands.resolver.resolve_issue \
Expand Down Expand Up @@ -265,6 +271,7 @@ jobs:
LLM_MODEL: ${{ secrets.LLM_MODEL || inputs.LLM_MODEL }}
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
LLM_API_VERSION: ${{ inputs.LLM_API_VERSION }}
PYTHONPATH: ""
run: |
if [ "${{ steps.check_result.outputs.RESOLUTION_SUCCESS }}" == "true" ]; then
Expand Down
1 change: 1 addition & 0 deletions .github/workflows/stale.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,3 +19,4 @@ jobs:
close-issue-message: 'This issue was closed because it has been stalled for over 30 days with no activity.'
close-pr-message: 'This PR was closed because it has been stalled for over 30 days with no activity.'
days-before-close: 7
operations-per-run: 150
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@


# 📦 Runtime EventStream
# 📦 Runtime Docker

Le Runtime EventStream d'OpenHands est le composant principal qui permet l'exécution sécurisée et flexible des actions des agents d'IA.
Le Runtime Docker d'OpenHands est le composant principal qui permet l'exécution sécurisée et flexible des actions des agents d'IA.
Il crée un environnement en bac à sable (sandbox) en utilisant Docker, où du code arbitraire peut être exécuté en toute sécurité sans risquer le système hôte.

## Pourquoi avons-nous besoin d'un runtime en bac à sable ?
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -163,7 +163,7 @@ Les options de configuration de base sont définies dans la section `[core]` du

- `runtime`
- Type : `str`
- Valeur par défaut : `"eventstream"`
- Valeur par défaut : `"docker"`
- Description : Environnement d'exécution

- `default_agent`
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ Pour créer un workflow d'évaluation pour votre benchmark, suivez ces étapes :
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
runtime='eventstream',
runtime='docker',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
base_container_image='your_container_image',
Expand Down
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
以下是翻译后的内容:

# 📦 EventStream 运行时
# 📦 Docker 运行时

OpenHands EventStream 运行时是实现 AI 代理操作安全灵活执行的核心组件。
OpenHands Docker 运行时是实现 AI 代理操作安全灵活执行的核心组件。
它使用 Docker 创建一个沙盒环境,可以安全地运行任意代码而不会危及主机系统。

## 为什么我们需要沙盒运行时?
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,7 @@

- `runtime`
- 类型: `str`
- 默认值: `"eventstream"`
- 默认值: `"docker"`
- 描述: 运行时环境

- `default_agent`
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ OpenHands 的主要入口点在 `openhands/core/main.py` 中。以下是它的
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
runtime='eventstream',
runtime='docker',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
base_container_image='your_container_image',
Expand Down
4 changes: 2 additions & 2 deletions docs/modules/usage/architecture/runtime.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# 📦 EventStream Runtime
# 📦 Docker Runtime

The OpenHands EventStream Runtime is the core component that enables secure and flexible execution of AI agent's action.
The OpenHands Docker Runtime is the core component that enables secure and flexible execution of AI agent's action.
It creates a sandboxed environment using Docker, where arbitrary code can be run safely without risking the host system.

## Why do we need a sandboxed runtime?
Expand Down
2 changes: 1 addition & 1 deletion docs/modules/usage/configuration-options.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,7 +126,7 @@ The core configuration options are defined in the `[core]` section of the `confi

- `runtime`
- Type: `str`
- Default: `"eventstream"`
- Default: `"docker"`
- Description: Runtime environment

- `default_agent`
Expand Down
12 changes: 10 additions & 2 deletions docs/modules/usage/how-to/custom-sandbox-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,16 @@ docker build -t custom-image .

This will produce a new image called `custom-image`, which will be available in Docker.

> Note that in the configuration described in this document, OpenHands will run as user "openhands" inside the
> sandbox and thus all packages installed via the docker file should be available to all users on the system, not just root.
## Using the Docker Command

When running OpenHands using [the docker command](/modules/usage/installation#start-the-app), replace
`-e SANDBOX_RUNTIME_CONTAINER_IMAGE=...` with `-e SANDBOX_BASE_CONTAINER_IMAGE=<custom image name>`:

```commandline
docker run -it --rm --pull=always \
-e SANDBOX_BASE_CONTAINER_IMAGE=custom-image \
...
```

## Using the Development Workflow

Expand Down
2 changes: 1 addition & 1 deletion docs/modules/usage/how-to/evaluation-harness.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ To create an evaluation workflow for your benchmark, follow these steps:
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
runtime='eventstream',
runtime='docker',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
base_container_image='your_container_image',
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/EDA/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=False,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/agent_bench/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ def get_config(
remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
keep_runtime_alive=False,
remote_runtime_init_timeout=3600,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/aider_bench/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ def get_config(
remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
keep_runtime_alive=False,
remote_runtime_init_timeout=1800,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/biocoder/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,7 @@ def get_config(
base_container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/bird/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/browsing_delegation/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=False,
use_host_network=False,
remote_runtime_enable_retries=True,
),
workspace_base=None,
workspace_mount_path=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/commit0_bench/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,7 @@ def get_config(
remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
keep_runtime_alive=False,
remote_runtime_init_timeout=3600,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/discoverybench/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/gaia/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/gorilla/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/gpqa/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/humanevalfix/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ def get_config(
base_container_image='python:3.12-bookworm',
enable_auto_lint=True,
use_host_network=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/logic_reasoning/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ def get_config(
enable_auto_lint=True,
use_host_network=False,
runtime_extra_deps='$OH_INTERPRETER_PATH -m pip install scitools-pyke',
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/miniwob/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ def get_config(
remote_runtime_init_timeout=1800,
keep_runtime_alive=False,
timeout=120,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/mint/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ def get_config(
enable_auto_lint=True,
use_host_network=False,
runtime_extra_deps=f'$OH_INTERPRETER_PATH -m pip install {" ".join(MINT_DEPENDENCIES)}',
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/scienceagentbench/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ def get_config(
api_key=os.environ.get('ALLHANDS_API_KEY', None),
remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
keep_runtime_alive=False,
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
11 changes: 11 additions & 0 deletions evaluation/benchmarks/swe_bench/eval_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,17 @@ def process_instance(
with open(metadata_filepath, 'r') as metadata_file:
data = metadata_file.read()
metadata = EvalMetadata.model_validate_json(data)
else:
# Initialize with a dummy metadata when file doesn't exist
metadata = EvalMetadata(
agent_class="dummy_agent", # Placeholder agent class
llm_config=LLMConfig(model="dummy_model"), # Minimal LLM config
max_iterations=1, # Minimal iterations
eval_output_dir=os.path.dirname(args.input_file), # Use input file dir as output dir
start_time=time.strftime('%Y-%m-%d %H:%M:%S'), # Current time
git_commit=subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('utf-8').strip(), # Current commit
dataset=args.dataset # Dataset name from args
)

# The evaluation harness constrains the signature of `process_instance_func` but we need to
# pass extra information. Build a new function object to avoid issues with multiprocessing.
Expand Down
1 change: 1 addition & 0 deletions evaluation/benchmarks/swe_bench/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,7 @@ def get_config(
dataset_name=metadata.dataset,
instance_id=instance['instance_id'],
),
remote_runtime_enable_retries=True,
),
# do not mount workspace
workspace_base=None,
Expand Down
23 changes: 16 additions & 7 deletions evaluation/benchmarks/the_agent_company/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,27 +17,36 @@ When the `run_infer.sh` script is started, it will automatically pull all task i

```bash
./evaluation/benchmarks/the_agent_company/scripts/run_infer.sh \
--agent-llm-config <agent-llm-config> \
--env-llm-config <env-llm-config> \
--outputs-path <outputs-path> \
--server-hostname <server-hostname> \
--version <version>
--agent-llm-config <agent-llm-config, default to 'agent'> \
--env-llm-config <env-llm-config, default to 'env'> \
--outputs-path <outputs-path, default to outputs> \
--server-hostname <server-hostname, default to localhost> \
--version <version, default to 1.0.0> \
--start-percentile <integer from 0 to 99, default to 0> \
--end-percentile <integer from 1 to 100, default to 100>


# Example
./evaluation/benchmarks/the_agent_company/scripts/run_infer.sh \
--agent-llm-config claude-3-5-sonnet-20240620 \
--env-llm-config claude-3-5-sonnet-20240620 \
--outputs-path outputs \
--server-hostname localhost \
--version 1.0.0
--version 1.0.0 \
--start-percentile 10 \
--end-percentile 20
```

- `agent-llm-config`: the config name for the agent LLM. This should match the config name in config.toml. This is the LLM used by the agent (e.g. CodeActAgent).
- `env-llm-config`: the config name for the environment LLM. This should match the config name in config.toml. This is used by the chat bots (NPCs) and LLM-based evaluators.
- `outputs-path`: the path to save trajectories and evaluation results.
- `server-hostname`: the hostname of the server that hosts all the web services. It could be localhost if you are running the evaluation and services on the same machine. If the services are hosted on a remote machine, you must use the hostname of the remote machine rather than IP address.
- `version`: the version of the task images to use. Currently, the only supported version is 1.0.0.
- `start-percentile`: the start percentile of the task split, must be an integer between 0 to 99.
- `end-percentile`: the end percentile of the task split, must be an integer between 1 to 100 and larger than start-percentile.

The script is idempotent. If you run it again, it will resume from the last checkpoint. It would usually take a few days to finish evaluation.
The script is idempotent. If you run it again, it will resume from the last checkpoint. It would usually take 2 days to finish evaluation if you run the whole task set.
To speed up evaluation, you can use `start-percentile` and `end-percentile` to split the tasks for higher parallelism,
provided concurrent runs are **targeting different servers**.

Note: the script will automatically skip a task if it encounters an error. This usually happens when the OpenHands runtime dies due to some unexpected errors. This means even if the script finishes, it might not have evaluated all tasks. You can manually resume the evaluation by running the script again.
1 change: 1 addition & 0 deletions evaluation/benchmarks/the_agent_company/run_infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ def get_config(
# large enough timeout, since some testcases take very long to run
timeout=300,
api_key=os.environ.get('ALLHANDS_API_KEY', None),
remote_runtime_enable_retries=True,
),
# we mount trajectories path so that trajectories, generated by OpenHands
# controller, can be accessible to the evaluator file in the runtime container
Expand Down
Loading

0 comments on commit e94c57d

Please sign in to comment.