XuanCe provides standardized and reproducible benchmark scripts for evaluating deep reinforcement learning (DRL) and multi-agent reinforcement learning (MARL) algorithms. Benchmarks are designed with the following principles:
- Clarity: one script corresponds to one algorithm-task benchmark
- Reproducibility: fixed evaluation protocol and multiple random seeds
- Comparability: consistent directory layout and result format
- Extensibility: easy to add new algorithms, environments, or suites
Contents:
- Directory Structure
- Running a Single Benchmark
- Running a Benchmark Suite
- Evaluation Protocol
- Benchmark Results
- Reproducibility
- How to Add a New Benchmark
Benchmarks are organized by environment -> scenario -> algorithm:
xuance-benchmarks/
├── MuJoCo/
│   └── Ant-v5/
│       ├── a2c/
│       │   ├── a2c_Ant-v5.yaml
│       │   └── run_a2c_Ant-v5.sh
│       ├── ddpg/
│       │   ├── ddpg_Ant-v5.yaml
│       │   └── run_ddpg_Ant-v5.sh
│       ├── ppo/
│       │   ├── ppo_Ant-v5.yaml
│       │   └── run_ppo_Ant-v5.sh
│       └── run_Ant-v5_all.sh
├── ...
└── benchmark.py
- Each algorithm-specific script (run_*.sh) defines an atomic benchmark.
- Suite scripts (e.g. run_Ant-v5_all.sh) run multiple algorithms sequentially on the same task.
Step 1: Create and activate a conda environment (Python >= 3.8 is recommended):
conda create -n xuance_env python=3.12 && conda activate xuance_env
Step 2: Install the XuanCe package:
pip install xuance
Each benchmark script runs the same task with multiple random seeds (default: 5).
Example: run PPO on MuJoCo Ant-v5:
bash MuJoCo/Ant-v5/ppo/run_ppo_Ant-v5.sh
During execution, XuanCe prints algorithm, environment, and evaluation information, while the benchmark script prints clear START / END boundaries for each seed.
To evaluate all supported algorithms on a given task, use the suite script:
bash benchmarks/MuJoCo/Ant-v5/run_Ant-v5_all.sh
This will sequentially run Algorithm_1, Algorithm_2, ..., Algorithm_N on the same environment with identical evaluation settings.
All benchmarks follow a unified evaluation protocol:
- Multiple independent runs with different random seeds
- Periodic evaluation during training (≈ every 1% of total steps)
- Each evaluation consists of multiple test episodes
- Reported performance is the mean episode return
- Final benchmark scores are aggregated across seeds
This design ensures fair comparison and robust performance estimation.
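For concreteness, the following is a minimal, self-contained sketch of this protocol for one seed. It is plain Python with stubbed train/evaluate functions; none of the names (train_step, evaluate_episode, run_one_seed) are XuanCe APIs:

import random
import statistics

def train_step(agent):
    # Stub: one training iteration (placeholder, not a XuanCe call).
    pass

def evaluate_episode(agent):
    # Stub: return of one test episode (random here, just so this runs).
    return random.gauss(100.0, 10.0)

def run_one_seed(seed, total_steps=10_000, n_test_episodes=5):
    random.seed(seed)
    agent = object()  # stand-in for a real agent
    eval_interval = max(1, total_steps // 100)  # evaluate ~every 1% of steps
    curve = []
    for step in range(1, total_steps + 1):
        train_step(agent)
        if step % eval_interval == 0:
            # Reported performance: mean episode return over the test episodes.
            returns = [evaluate_episode(agent) for _ in range(n_test_episodes)]
            curve.append((step, statistics.mean(returns)))
    return curve

# One independent run per seed; final scores aggregate across these curves.
curves = {seed: run_one_seed(seed) for seed in (1, 2, 3, 4, 5)}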
Benchmark results are stored in a structured directory layout:
benchmarks/results/raw/
└── <ENV>/
    └── <ENV_ID>/
        └── <algorithm>/
            ├── seed_1/
            │   └── learning_curve.csv
            ├── seed_2/
            │   └── learning_curve.csv
            └── ...
- Each learning_curve.csv contains the learning curve for one seed
- Aggregated results (mean ± std across seeds) can be generated using analysis scripts (sketched below)
TensorBoard logs are used for visualization and debugging, while CSV files are treated as the official benchmark artifacts.
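As an illustration, such an analysis script can be quite short. The sketch below is not a shipped XuanCe tool: it assumes the seed_<k>/learning_curve.csv layout produced by the example script later in this document, hypothetical column names "step" and "return", and that all seeds were evaluated at the same steps:

import csv
import statistics
from pathlib import Path

# Example path; adjust to the algorithm/task being aggregated.
root = Path("benchmarks/results/raw/MuJoCo/Ant-v5/ppo")

curves = []
for seed_dir in sorted(root.glob("seed_*")):
    with open(seed_dir / "learning_curve.csv", newline="") as f:
        # Assumed columns: "step" and "return" (hypothetical names).
        rows = list(csv.DictReader(f))
    curves.append([(int(row["step"]), float(row["return"])) for row in rows])

# Mean +/- std of the return across seeds at each evaluation step.
print("step,mean_return,std_return")
for points in zip(*curves):
    step = points[0][0]
    returns = [r for _, r in points]
    std = statistics.stdev(returns) if len(returns) > 1 else 0.0
    print(f"{step},{statistics.mean(returns):.2f},{std:.2f}")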
To ensure reproducibility, benchmark scripts explicitly specify:
- Algorithm name
- Environment and scenario ID
- Random seed
- Training and evaluation settings
Benchmark scripts are the source of truth for all reported results.
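The flags passed by the example script later in this document (--algo, --env, --env-id, --seed, --workdir) suggest an entry point along the following lines. This is only a sketch of how benchmark.py might consume these settings, not its actual implementation:

import argparse
import random

import numpy as np

def parse_args():
    # Flags match those passed by the benchmark scripts.
    parser = argparse.ArgumentParser()
    parser.add_argument("--algo", required=True)
    parser.add_argument("--env", required=True)
    parser.add_argument("--env-id", dest="env_id", required=True)
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--workdir", required=True)
    return parser.parse_args()

def set_global_seed(seed):
    # Pin every source of randomness we control.
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional; only relevant for a PyTorch backend
        torch.manual_seed(seed)
    except ImportError:
        pass

if __name__ == "__main__":
    args = parse_args()
    set_global_seed(args.seed)
    # ... build the runner for args.algo / args.env / args.env_id,
    # then train, evaluate, and write results under args.workdir ...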
This section describes how to add a new benchmark to XuanCe. A benchmark in XuanCe is defined by one algorithm, one environment scenario, and multiple random seeds.
Determine the target environment and scenario. For example:
- Environment: Atari
- Scenario: Breakout-v5
Create the corresponding directory if it does not exist:
benchmarks/Atari/Breakout-v5/
Under the scenario directory, create a subdirectory for the algorithm:
benchmarks/Atari/Breakout-v5/<algorithm>/
For example, for PPO:
benchmarks/Atari/Breakout-v5/ppo/
If the algorithm requires a specific configuration file, place it in the algorithm directory:
benchmarks/Atari/Breakout-v5/ppo/ppo_atari.yaml
This configuration file defines hyperparameters and environment-specific settings used by the benchmark.
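For illustration, the entry point might load this file with PyYAML as below; the keys shown are hypothetical, since the actual schema is defined by XuanCe's configuration files:

import yaml  # PyYAML

with open("benchmarks/Atari/Breakout-v5/ppo/ppo_atari.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys, for illustration only.
learning_rate = config.get("learning_rate", 2.5e-4)
running_steps = config.get("running_steps", 10_000_000)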
Create a benchmark script named:
run_<algorithm>_<scenario>.sh
For example:
run_ppo_Breakout-v5.sh
Each benchmark script should:
- Call the shared benchmark.py script
- Run multiple random seeds (default: 5)
- Clearly indicate the start and end of each seed
- Not duplicate algorithm or environment information already printed by XuanCe
Example structure:
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
PYTHON=python
ALGO="ppo"
ENV="Atari"
ENV_ID="Breakout-v5"
OUT_ROOT="${PROJECT_ROOT}/benchmarks/results/raw/${ENV}/${ENV_ID}/${ALGO}"
for SEED in 1 2 3 4 5; do
    WORKDIR="${OUT_ROOT}/seed_${SEED}"
    mkdir -p "${WORKDIR}"

    echo "========== [Benchmark START] seed=${SEED} =========="
    START_TIME=$(date +%s)

    if ${PYTHON} "${PROJECT_ROOT}/benchmark.py" \
        --algo "${ALGO}" \
        --env "${ENV}" \
        --env-id "${ENV_ID}" \
        --seed "${SEED}" \
        --workdir "${WORKDIR}"; then
        STATUS="SUCCESS"
    else
        STATUS="FAILED"
    fi

    END_TIME=$(date +%s)
    ELAPSED=$((END_TIME - START_TIME))
    echo "========== [Benchmark END] seed=${SEED} | status=${STATUS} | time=${ELAPSED}s =========="
done
If you want the new benchmark to be included in a benchmark suite, edit the suite script under the scenario directory:
benchmarks/Atari/Breakout-v5/run_Breakout-v5_all.sh
Add the new benchmark script to the list:
SCRIPTS=(
    "${ROOT_DIR}/dqn/run_dqn_Breakout-v5.sh"
    "${ROOT_DIR}/ppo/run_ppo_Breakout-v5.sh"
    "${ROOT_DIR}/<new_algo>/run_<new_algo>_Breakout-v5.sh"
)
Run the benchmark script:
bash benchmarks/Atari/Breakout-v5/ppo/run_ppo_Breakout-v5.sh
Verify that (a spot-check sketch follows this list):
- All seeds run sequentially
- Each seed prints clear START / END markers
- Results are saved under the correct directory structure
- The benchmark can be reproduced by re-running the script
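To spot-check the last two items, a small script like the following can confirm that every seed produced its CSV. It assumes the output layout used by the example script above; the path is illustrative:

from pathlib import Path

out_root = Path("benchmarks/results/raw/Atari/Breakout-v5/ppo")
for seed in range(1, 6):
    csv_path = out_root / f"seed_{seed}" / "learning_curve.csv"
    print(f"seed_{seed}: {'ok' if csv_path.is_file() else 'MISSING'}")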
When adding a new benchmark, please follow these principles:
- One script = one benchmark
- Benchmark scripts are the source of truth
- Do not hard-code absolute paths
- Do not duplicate logging already handled by XuanCe
- Prefer clarity and reproducibility over convenience