This document describes how to run, interpret, and compare sandbox performance benchmarks for Fence.
```bash
# Install dependencies
brew install hyperfine   # macOS
# apt install hyperfine  # Linux
go install golang.org/x/perf/cmd/benchstat@latest
```
```bash
# Run CLI benchmarks
./scripts/benchmark.sh

# Run Go microbenchmarks
go test -run=^$ -bench=. -benchmem ./internal/sandbox/...
```

Goals:

- Quantify sandbox overhead on each platform (sandboxed / unsandboxed ratio)
- Compare macOS (Seatbelt) vs Linux (bwrap + Landlock) overhead fairly
- Attribute overhead to specific components (proxy startup, bridge setup, wrap generation)
- Track regressions over time
What it measures: Real-world agent cost - full fence invocation including proxy startup, socat bridges (Linux), and sandbox-exec/bwrap setup.
This is the most realistic benchmark for understanding the cost of running agent commands through Fence.
```bash
# Full benchmark suite
./scripts/benchmark.sh

# Quick mode (fewer runs)
./scripts/benchmark.sh -q

# Custom output directory
./scripts/benchmark.sh -o ./my-results

# Include network benchmarks (requires local server)
./scripts/benchmark.sh --network
```

| Option | Description |
|---|---|
| `-b, --binary PATH` | Path to fence binary (default: `./fence`) |
| `-o, --output DIR` | Output directory (default: `./benchmarks`) |
| `-n, --runs N` | Minimum runs per benchmark (default: 30) |
| `-q, --quick` | Quick mode: fewer runs, skip slow benchmarks |
| `--network` | Include network benchmarks |
What it measures: Component-level overhead - isolates Manager initialization, WrapCommand generation, and execution.
```bash
# Run all benchmarks
go test -run=^$ -bench=. -benchmem ./internal/sandbox/...

# Run specific benchmark
go test -run=^$ -bench=BenchmarkWarmSandbox -benchmem ./internal/sandbox/...

# Multiple runs for statistical analysis
go test -run=^$ -bench=. -benchmem -count=10 ./internal/sandbox/... > bench.txt
benchstat bench.txt
```

| Benchmark | Description |
|---|---|
| `BenchmarkBaseline_*` | Unsandboxed command execution |
| `BenchmarkManagerInitialize` | Cold initialization (proxies + bridges) |
| `BenchmarkWrapCommand` | Command string construction only |
| `BenchmarkColdSandbox_*` | Full init + wrap + exec per iteration |
| `BenchmarkWarmSandbox_*` | Pre-initialized manager, just exec |
| `BenchmarkOverhead` | Grouped comparison of baseline vs sandbox |
What it measures: Kernel/system overhead - context switches, syscalls, page faults.
On Linux:

```bash
# Quick syscall cost breakdown
strace -f -c ./fence -- true

# Context switches, page faults
perf stat -- ./fence -- true

# Full profiling (flamegraph-ready)
perf record -F 99 -g -- ./fence -- git status
perf report
```

On macOS:

```bash
# Time Profiler via Instruments
xcrun xctrace record --template 'Time Profiler' --launch -- ./fence -- true

# Quick call-stack snapshot
./fence -- sleep 5 &
sample $! 5 -file sample.txt
```

The key metric is the overhead factor:

```
Overhead Factor = time(sandboxed) / time(unsandboxed)
```
Compare overhead factors across platforms, not absolute times, because hardware differences swamp absolute timings.
| Benchmark | Unsandboxed | Sandboxed | Overhead |
|---|---|---|---|
| `true` | 1.2 ms | 45 ms | 37.5x |
| `git status` | 15 ms | 62 ms | 4.1x |
| `python -c 'pass'` | 25 ms | 73 ms | 2.9x |
| Workload | Linux Overhead | macOS Overhead | Notes |
|---|---|---|---|
| `true` | 180-360x | 8-10x | Dominated by cold start |
| `echo` | 150-300x | 6-8x | Similar to `true` |
| `python3 -c 'pass'` | 10-12x | 2-3x | Interpreter startup dominates |
| `git status` | 50-60x | 4-5x | Real I/O helps amortize |
| `rg` | 40-50x | 3-4x | Search I/O helps amortize |
The overhead factor decreases as the actual workload increases, because sandbox setup is a fixed cost. Linux overhead is significantly higher due to bwrap/socat setup.
- Run benchmarks on each platform independently
- Compare overhead factors, not absolute times
- Use the same fence version and workloads
```bash
# On macOS
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_macos.txt

# On Linux
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_linux.txt

# Compare
benchstat bench_macos.txt bench_linux.txt
```

- macOS uses Seatbelt (`sandbox-exec`), a built-in, lightweight kernel sandbox
- Linux uses bwrap + Landlock and creates socat bridges for network access, which incurs significant setup cost
- Linux cold start is ~10x slower than macOS due to bwrap/socat bridge setup
- Linux warm path is still ~5x slower than macOS because bwrap execution itself has overhead
- For long-running agents, this difference is negligible (a one-time startup cost)
> **Tip:** Running Linux benchmarks inside a VM (Colima, Docker Desktop, etc.) inflates overhead due to virtualization. Use native Linux (bare metal or CI) for fair cross-platform comparison.
Benchmarks can be run in CI via the workflow at `.github/workflows/benchmark.yml`:

```bash
# Trigger manually from GitHub UI: Actions > Benchmarks > Run workflow
# Or via the gh CLI
gh workflow run benchmark.yml
```

Results are uploaded as artifacts and summarized in the workflow summary.
- Run with `--min-runs 50` or higher
- Close other applications
- Pin the CPU frequency if possible (Linux: `cpupower frequency-set --governor performance`)
- Run multiple times and use benchstat for statistical analysis
```bash
# CPU profile
go test -run=^$ -bench=BenchmarkWarmSandbox -cpuprofile=cpu.out ./internal/sandbox/...
go tool pprof -http=:8080 cpu.out

# Memory profile
go test -run=^$ -bench=BenchmarkWarmSandbox -memprofile=mem.out ./internal/sandbox/...
go tool pprof -http=:8080 mem.out
```

To track regressions:

- Run benchmarks before and after changes
- Save results to files
- Compare with benchstat
```bash
# Before
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > before.txt

# Make changes...

# After
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > after.txt

# Compare
benchstat before.txt after.txt
```

| Category | Commands | What It Stresses |
|---|---|---|
| Spawn-only | `true`, `echo` | Process spawn, wrapper overhead |
| Interpreter | `python3 -c`, `node -e` | Runtime startup under sandbox |
| FS-heavy | file creation, `rg` | Landlock/Seatbelt FS rules |
| Network (local) | `curl localhost` | Proxy forwarding overhead |
| Real tools | `git status` | Practical agent workloads |
Results from GitHub Actions CI runners (Linux: AMD EPYC 7763, macOS: Apple M1 Virtual).
| Platform | Manager.Initialize() |
|---|---|
| Linux | 101.9 ms |
| macOS | 27.5 µs |
Linux initialization is ~3,700x slower because it must:
- Start HTTP + SOCKS proxies
- Create Unix socket bridges for socat
- Set up bwrap namespace configuration
macOS only generates a Seatbelt profile string (very cheap).
| Workload | Linux | macOS |
|---|---|---|
| `true` | 215 ms | 22 ms |
| Python | 124 ms | 33 ms |
| Git status | 114 ms | 25 ms |
This is the realistic cost for scripts running `fence -c "command"` repeatedly.
| Workload | Linux | macOS |
|---|---|---|
| `true` | 112 ms | 20 ms |
| Python | 124 ms | 33 ms |
| Git status | 114 ms | 25 ms |
Even with proxies already running, Linux bwrap execution adds ~110ms overhead per command.
| Workload | Linux Overhead | macOS Overhead |
|---|---|---|
| `true` (cold) | ~360x | ~10x |
| `true` (warm) | ~187x | ~8x |
| Python (warm) | ~11x | ~2x |
| Git status (warm) | ~54x | ~4x |
Overhead decreases as the actual workload increases (sandbox setup is a fixed cost).
For agents that run as a child process under fence:
| Phase | Cost |
|---|---|
| Startup (once) | Linux: ~215ms, macOS: ~22ms |
| Per tool call | Negligible (baseline fork+exec only) |
Child processes inherit the sandbox - no re-initialization, no WrapCommand overhead. The per-command cost is just normal process spawning:
| Command | Linux | macOS |
|---|---|---|
| `true` | 0.6 ms | 2.3 ms |
| `git status` | 2.1 ms | 5.9 ms |
| Python script | 11 ms | 15 ms |
Bottom line: For `fence <agent>` usage, sandbox overhead is a one-time startup cost. Tool calls inside the agent run at native speed.
For scripts or CI running fence per command:
| Session | Linux Cost | macOS Cost |
|---|---|---|
| 1 command | 215 ms | 22 ms |
| 10 commands | 2.15 s | 220 ms |
| 50 commands | 10.75 s | 1.1 s |
Consider keeping the manager alive (daemon mode) or batching commands to reduce overhead.
- `Manager.Initialize()` starts the HTTP + SOCKS proxies; on Linux it also creates the socat bridges
- Cold start includes all initialization; the hot path is just `WrapCommand` + exec
- `-m` (monitor mode) spawns additional monitoring processes, so benchmark it separately
- Keep workloads under the repo; avoid `/tmp`, since Linux bwrap does `--tmpfs /tmp`
- `debug` mode changes logging, so always benchmark with debug off