184 changes: 92 additions & 92 deletions BENCHMARKS.md
@@ -3,9 +3,9 @@
Comparison of `zodb-json-codec` (Rust + PyO3) vs CPython's `pickle` module
for ZODB record encoding/decoding.

Measured on: 2026-02-24
Measured on: 2026-02-25
Python: 3.13.9, PyO3: 0.28, 5000 iterations, 100 warmup
Build: `maturin develop --release` (optimized, LTO + codegen-units=1 + PGO)
Build: `maturin develop --release` + PGO (LTO + codegen-units=1)

**Important:** Always benchmark with `maturin develop --release`. Debug builds
are 3-8x slower due to missing optimizations and inlining.
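The measurement setup above (5000 timed iterations after 100 untimed warmup calls) can be sketched with the standard library alone. This is a minimal illustration, not the repository's `benchmarks/bench.py`: the `bench` helper and the sample payload are hypothetical, and only CPython's `pickle` is timed.

```python
import pickle
import time


def bench(func, arg, iterations=5000, warmup=100):
    """Return the mean time per call in microseconds."""
    # Warmup calls run but are not timed, so caches and allocators settle first.
    for _ in range(warmup):
        func(arg)
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func(arg)
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / iterations / 1000.0


# Hypothetical payload standing in for a small flat-dict object state.
state = {"title": "Hello", "count": 42, "ratio": 3.14, "active": True}
record = pickle.dumps(state, protocol=3)

decode_us = bench(pickle.loads, record)
encode_us = bench(lambda obj: pickle.dumps(obj, protocol=3), state)
print(f"decode: {decode_us:.2f} us  encode: {encode_us:.2f} us")
```

Timing the codec would mean passing its decode and encode callables to the same helper in place of the `pickle` functions.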
@@ -20,7 +20,8 @@ The codec does fundamentally more work than `pickle.loads`/`pickle.dumps`:

The codec's value is not raw speed but **JSONB queryability** — enabling SQL
queries on ZODB object attributes in PostgreSQL. Despite the extra work, the
release build beats CPython pickle on most operations.
release build beats CPython pickle on encode and roundtrip across all
categories, and on decode for all but the largest string-dominated payloads.

---

@@ -30,64 +31,66 @@ release build beats CPython pickle on most operations.

| Category | Python | Codec | Ratio |
|---|---|---|---|
| simple_flat_dict (120 B) | 1.9 us | 1.1 us | **1.8x faster** |
| nested_dict (187 B) | 2.9 us | 1.8 us | **1.6x faster** |
| large_flat_dict (2.5 KB) | 22.8 us | 19.7 us | **1.2x faster** |
| bytes_in_state (1 KB) | 1.8 us | 1.9 us | 1.1x slower |
| special_types (314 B) | 6.8 us | 4.7 us | **1.5x faster** |
| btree_small (112 B) | 1.9 us | 1.8 us | 1.1x faster |
| btree_length (44 B) | 1.0 us | 0.5 us | **2.0x faster** |
| scalar_string (72 B) | 1.1 us | 0.5 us | **2.1x faster** |
| wide_dict (27 KB) | 264 us | 279 us | 1.1x slower |
| deep_nesting (379 B) | 7.2 us | 7.3 us | 1.0x |
| simple_flat_dict (120 B) | 1.9 us | 1.0 us | **1.9x faster** |
| nested_dict (187 B) | 2.7 us | 1.6 us | **1.7x faster** |
| large_flat_dict (2.5 KB) | 22.6 us | 18.0 us | **1.3x faster** |
| bytes_in_state (1 KB) | 1.6 us | 1.4 us | **1.1x faster** |
| special_types (314 B) | 6.8 us | 3.8 us | **1.8x faster** |
| btree_small (112 B) | 1.7 us | 1.5 us | **1.2x faster** |
| btree_length (44 B) | 1.0 us | 0.4 us | **2.3x faster** |
| scalar_string (72 B) | 1.1 us | 0.5 us | **2.2x faster** |
| wide_dict (27 KB) | 250 us | 244.5 us | **1.0x faster** |
| deep_nesting (379 B) | 6.9 us | 6.4 us | **1.1x faster** |
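
The Ratio column can be read as the slower time divided by the faster one. The sketch below assumes exactly that; `format_ratio` is a hypothetical helper, not part of the benchmark script. The published ratios were presumably computed from unrounded timings, so recomputing them from the rounded table values can differ by a tenth.

```python
def format_ratio(python_us, codec_us):
    """Format a (Python, codec) timing pair the way the Ratio column reads."""
    if codec_us <= python_us:
        return f"{python_us / codec_us:.1f}x faster"
    return f"{codec_us / python_us:.1f}x slower"


print(format_ratio(6.8, 3.8))  # special_types row
print(format_ratio(1.8, 1.9))  # a pair where the codec is slower
```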

### Decode to JSON string (pickle bytes -> JSON, all in Rust)

The direct path for PG storage — serializes to a JSON string entirely in Rust
with the GIL released. Compared against the dict path + `json.dumps()`.
The direct path for PG storage — writes JSON tokens directly to a `String`
buffer from the PickleValue AST, entirely in Rust with the GIL released.
No intermediate `serde_json::Value` allocations. Compared against the dict
path + `json.dumps()`.

| Category | Dict+dumps | JSON str | Speedup |
|---|---|---|---|
| simple_flat_dict | 2.7 us | 1.3 us | **2.2x faster** |
| nested_dict | 4.3 us | 2.5 us | **1.7x faster** |
| large_flat_dict | 35.4 us | 25.6 us | **1.4x faster** |
| bytes_in_state | 5.7 us | 2.7 us | **2.1x faster** |
| special_types | 7.1 us | 4.7 us | **1.5x faster** |
| btree_small | 3.8 us | 2.1 us | **1.8x faster** |
| btree_length | 1.5 us | 0.8 us | **1.9x faster** |
| scalar_string | 0.9 us | 0.7 us | **1.3x faster** |
| wide_dict | 273.7 us | 307.6 us | 1.1x slower |
| deep_nesting | 13.3 us | 8.6 us | **1.5x faster** |
| simple_flat_dict | 2.7 us | 1.1 us | **2.5x faster** |
| nested_dict | 4.3 us | 1.9 us | **2.3x faster** |
| large_flat_dict | 33.7 us | 17.1 us | **2.0x faster** |
| bytes_in_state | 5.2 us | 1.6 us | **3.3x faster** |
| special_types | 7.5 us | 4.0 us | **1.9x faster** |
| btree_small | 3.6 us | 1.6 us | **2.3x faster** |
| btree_length | 1.4 us | 0.5 us | **2.8x faster** |
| scalar_string | 0.8 us | 0.6 us | **1.3x faster** |
| wide_dict | 290.5 us | 161.6 us | **1.8x faster** |
| deep_nesting | 14.2 us | 5.7 us | **2.5x faster** |
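
The direct-write idea described above, walking a value tree and appending JSON tokens to a single buffer instead of first building a second tree, can be illustrated in Python. This is only a sketch of the technique: the codec's writer is Rust and operates on its own PickleValue AST, while `write_json` and `to_json` here are hypothetical names operating on plain Python values.

```python
import json
import math


def write_json(value, out):
    """Append the JSON tokens for `value` to the list `out`."""
    if value is None:
        out.append("null")
    elif isinstance(value, bool):  # must precede int: bool is an int subclass
        out.append("true" if value else "false")
    elif isinstance(value, int):
        out.append(str(value))
    elif isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinity have no JSON representation")
        out.append(repr(value))
    elif isinstance(value, str):
        out.append(json.dumps(value))  # reuse the stdlib string escaping
    elif isinstance(value, (list, tuple)):
        out.append("[")
        for i, item in enumerate(value):
            if i:
                out.append(",")
            write_json(item, out)
        out.append("]")
    elif isinstance(value, dict):
        out.append("{")
        for i, (key, item) in enumerate(value.items()):
            if i:
                out.append(",")
            out.append(json.dumps(str(key)))
            out.append(":")
            write_json(item, out)
        out.append("}")
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")


def to_json(value):
    """Serialize `value` by streaming tokens into one buffer."""
    out = []
    write_json(value, out)
    return "".join(out)


sample = {"title": "Caf\u00e9", "tags": ["a", "b"], "count": 3,
          "ratio": 0.5, "ok": True, "parent": None}
print(to_json(sample))
```

In Python this will not beat `json.dumps`. The point is the shape of the traversal: one pass, one output buffer, no intermediate value tree.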

### Encode (Python dict -> pickle bytes)

| Category | Python | Codec | Ratio |
|---|---|---|---|
| simple_flat_dict | 1.3 us | 0.2 us | **6.5x faster** |
| nested_dict | 1.5 us | 0.3 us | **4.8x faster** |
| large_flat_dict | 5.3 us | 1.5 us | **3.5x faster** |
| bytes_in_state | 1.2 us | 0.7 us | **1.7x faster** |
| special_types | 4.7 us | 0.5 us | **9.8x faster** |
| btree_small | 1.3 us | 0.2 us | **6.0x faster** |
| btree_length | 1.1 us | 0.1 us | **8.8x faster** |
| scalar_string | 1.2 us | 0.1 us | **8.3x faster** |
| wide_dict | 56.4 us | 13.9 us | **4.0x faster** |
| deep_nesting | 2.8 us | 1.0 us | **2.8x faster** |
| simple_flat_dict | 1.3 us | 0.2 us | **6.7x faster** |
| nested_dict | 1.6 us | 0.3 us | **6.4x faster** |
| large_flat_dict | 5.7 us | 1.6 us | **3.9x faster** |
| bytes_in_state | 1.3 us | 0.8 us | **1.7x faster** |
| special_types | 4.6 us | 0.5 us | **9.2x faster** |
| btree_small | 1.3 us | 0.2 us | **6.6x faster** |
| btree_length | 1.0 us | 0.1 us | **8.0x faster** |
| scalar_string | 1.0 us | 0.1 us | **7.9x faster** |
| wide_dict | 56.9 us | 13.7 us | **4.1x faster** |
| deep_nesting | 2.6 us | 1.0 us | **2.6x faster** |

### Full roundtrip (decode + encode)

| Category | Python | Codec | Ratio |
|---|---|---|---|
| simple_flat_dict | 3.2 us | 1.4 us | **2.4x faster** |
| nested_dict | 4.5 us | 2.1 us | **2.2x faster** |
| large_flat_dict | 29.7 us | 19.1 us | **1.6x faster** |
| bytes_in_state | 3.3 us | 2.4 us | **1.4x faster** |
| special_types | 11.7 us | 4.4 us | **2.7x faster** |
| btree_small | 5.8 us | 1.8 us | **3.3x faster** |
| btree_length | 2.1 us | 0.6 us | **3.6x faster** |
| scalar_string | 2.3 us | 0.6 us | **3.6x faster** |
| wide_dict | 316 us | 260 us | **1.2x faster** |
| deep_nesting | 10.3 us | 7.3 us | **1.4x faster** |
| simple_flat_dict | 3.2 us | 1.3 us | **2.6x faster** |
| nested_dict | 4.4 us | 2.1 us | **2.1x faster** |
| large_flat_dict | 28.7 us | 19.8 us | **1.5x faster** |
| bytes_in_state | 3.1 us | 2.3 us | **1.4x faster** |
| special_types | 11.5 us | 4.9 us | **2.4x faster** |
| btree_small | 3.1 us | 1.8 us | **1.7x faster** |
| btree_length | 2.0 us | 0.6 us | **3.4x faster** |
| scalar_string | 2.1 us | 0.6 us | **3.5x faster** |
| wide_dict | 318 us | 258.8 us | **1.3x faster** |
| deep_nesting | 10.0 us | 7.8 us | **1.3x faster** |

### Output size (pickle bytes vs JSON)

@@ -122,18 +125,18 @@ plus OOBTree containers, group summaries, and edge-case objects.

| Metric | Codec | Python | Speedup |
|---|---|---|---|
| Decode mean | 26.9 us | 22.2 us | 1.2x slower |
| Decode median | 23.2 us | 21.6 us | 1.1x slower |
| Decode P95 | 39.7 us | 31.7 us | 1.3x slower |
| Encode mean | 4.7 us | 18.0 us | **3.8x faster** |
| Encode median | 3.9 us | 19.7 us | **5.1x faster** |
| Encode P95 | 9.6 us | 29.1 us | **3.0x faster** |
| Decode mean | 27.2 us | 22.7 us | 1.2x slower |
| Decode median | 23.6 us | 22.2 us | 1.1x slower |
| Decode P95 | 40.5 us | 33.1 us | 1.2x slower |
| Encode mean | 4.8 us | 18.2 us | **3.8x faster** |
| Encode median | 4.0 us | 19.9 us | **5.0x faster** |
| Encode P95 | 9.9 us | 30.0 us | **3.0x faster** |
| Total pickle | 5.1 MB | — | — |
| Total JSON | 7.2 MB | — | 1.41x |

Decode is slightly slower (1.1x median) due to the two-pass conversion plus
type-aware transformation. The gap narrows on metadata-heavy records.
Encode is consistently **3.0-5.1x faster** because the Rust encoder writes
Encode is consistently **3.0-5.0x faster** because the Rust encoder writes
pickle opcodes directly from Python objects, bypassing intermediate allocations.
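
The mean, median, and P95 figures in the table are summaries of per-record timings. How the benchmark script computes P95 is not stated here; the sketch below assumes the nearest-rank definition, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(samples_us):
    """Return mean, median, and nearest-rank P95 of per-record timings."""
    ordered = sorted(samples_us)
    # Nearest rank: the smallest value with at least 95% of samples at or
    # below it. Integer arithmetic computes ceil(0.95 * n) without float error.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Hypothetical timings in microseconds: 1, 2, ..., 100.
stats = summarize(range(1, 101))
print(stats)
```

A P95 well above the median, as in the table, indicates a tail of larger records that dominates the slowest 5% of calls.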

### Record type distribution
@@ -154,26 +157,27 @@ pickle opcodes directly from Python objects, bypassing intermediate allocations.
The zodb-pgjsonb storage path has two decode functions. The dict path
(`decode_zodb_record_for_pg`) returns a Python dict that must then be
serialized via `json.dumps()`. The JSON string path
(`decode_zodb_record_for_pg_json`) does everything in Rust with the GIL
released. See the synthetic comparison above.
(`decode_zodb_record_for_pg_json`) writes JSON tokens directly from the
PickleValue AST to a `String` buffer, entirely in Rust with the GIL released.

```
Dict path: pickle bytes → Rust AST → Python dict (GIL held) → json.dumps() → PG
JSON path: pickle bytes → Rust AST → serde_json → JSON string (all Rust, GIL released) → PG
JSON path: pickle bytes → Rust AST → JSON string (direct write, GIL released) → PG
```
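
The Dict path can be mimicked with the standard library to show where its two extra steps come from. This is a simplified stand-in, not `decode_zodb_record_for_pg`: the record layout (a class-reference pickle followed by a state pickle), the helper names, and the `{"@cls": ..., "@s": ...}` output shape are assumptions made for illustration.

```python
import io
import json
import pickle


def make_record(module, name, state):
    """Build a simplified record: class-reference pickle, then state pickle."""
    return (pickle.dumps((module, name), protocol=3)
            + pickle.dumps(state, protocol=3))


def dict_path(record):
    """Stand-in for the Dict path: bytes -> Python dict -> JSON string."""
    unpickler = pickle.Unpickler(io.BytesIO(record))
    module, name = unpickler.load()  # first pickle: class reference
    state = unpickler.load()         # second pickle: object state
    # Step 1: a Python dict is materialized (this happens with the GIL held).
    as_dict = {"@cls": [module, name], "@s": state}
    # Step 2: the dict is serialized again, this time to JSON text.
    return json.dumps(as_dict)


record = make_record("myapp.models", "Document", {"title": "x", "n": 1})
print(dict_path(record))
```

The JSON path removes both steps by producing the JSON text directly from the parsed record.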

### 1,692 records

| Metric | Dict+dumps | JSON str | Speedup |
|---|---|---|---|
| Mean | 41.3 us | 31.5 us | **1.3x faster** |
| Median | 35.9 us | 26.9 us | **1.3x faster** |
| P95 | 64.2 us | 47.7 us | **1.3x faster** |
| Mean | 40.4 us | 28.3 us | **1.4x faster** |
| Median | 34.7 us | 24.4 us | **1.4x faster** |
| P95 | 62.0 us | 51.9 us | **1.2x faster** |

The JSON string path is **1.3x faster** across real-world data because
it eliminates the Python dict allocation + `json.dumps()` serialization.
The entire pipeline runs in Rust with the GIL released, improving
multi-threaded throughput in Zope/Plone deployments.
The JSON string path is **1.4x faster** at the mean and median (1.2x at P95)
across real-world data because it eliminates the Python dict allocation, the
`json.dumps()` serialization, and all intermediate `serde_json::Value` heap
allocations. The entire pipeline runs in Rust with the GIL released, improving
multi-threaded throughput in Zope/Plone deployments.

---

@@ -182,9 +186,9 @@ multi-threaded throughput in Zope/Plone deployments.
The sweet spot is typical ZODB objects (5-50 keys, mixed types, datetime
fields, persistent refs):

- **Decode:** 1.5-2.0x faster on synthetic, near parity on real-world data
- **Encode:** 2-10x faster on synthetic, 3-5x faster on real-world data
- **PG path:** 1.3x faster end-to-end with GIL-free throughput
- **Decode:** 1.1-2.3x faster on synthetic, near parity on real-world data
- **Encode:** 1.7-9.2x faster on synthetic, 3-5x faster on real-world data
- **PG path:** 1.3-3.3x faster on synthetic, 1.4x faster on real-world data, with GIL-free throughput

Decode overhead comes from the two-pass conversion plus type transformation.
On string-dominated payloads this matters more; on metadata-rich records with
@@ -215,49 +219,33 @@ mixed types (the typical ZODB case) the codec is competitive or faster.
- Thread-local buffer reuse (retains capacity across encode calls)
- `reserve()` calls before multi-part writes (eliminates mid-write reallocations)
- Direct i64 LONG1 encoding (eliminates BigInt heap allocation)
- Thread-local class pickle cache per (module, name) pair (single memcpy
replaces 7 opcode writes for ~99.6% of records)
- `#[inline]` on `write_u8`, `write_bytes`, `encode_int`

**Both paths:**
- Interned marker strings (`pyo3::intern!` for `@t`, `@cls`, `@s`, etc.)
- Pre-collected PyList (`PyList::new` vs append loop)
- Thin LTO + single codegen unit (free 6-9% improvement)
- Profile-guided optimization (PGO) with real FileStorage + synthetic data
- Direct pickle → JSON string path for PG storage (GIL released)
- Direct PickleValue → JSON string writer (`json_writer.rs`) for PG storage,
eliminating all `serde_json::Value` intermediate allocations (GIL released)
- Thread-local JSON writer buffer reuse (retains capacity across decode calls)

---

## Running benchmarks

All numbers in this document are from PGO builds. Always use PGO for
benchmarking: it runs 5-15% faster than a plain release build and reflects
production performance.

```bash
cd sources/zodb-json-codec

# Build release first (important!)
maturin develop --release

# Synthetic micro-benchmarks
python benchmarks/bench.py synthetic --iterations 1000

# Generate a reproducible benchmark FileStorage (requires ZODB + BTrees)
python benchmarks/bench.py generate

# Scan the generated (or any) FileStorage
python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs

# PG decode path comparison (dict vs JSON string)
python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs

# Both synthetic + filestorage, with JSON export
python benchmarks/bench.py all --filestorage benchmarks/bench_data/Data.fs --output results.json
```
# 0. Decompress benchmark data (once — Data.fs is gitignored, only .gz is tracked)
gunzip -k benchmarks/bench_data/Data.fs.gz

## PGO build (optional, adds 5-15%)

Profile-guided optimization uses real workload data to optimize branch
prediction and code layout. The release CI builds include PGO for
Linux x86_64 wheels.

```bash
# 1. Install LLVM tools
# 1. Install LLVM tools (once)
rustup component add llvm-tools

# 2. Instrumented build
@@ -266,11 +254,23 @@ RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" maturin develop --release
# 3. Generate profiles — use BOTH real data and synthetic for best coverage
python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
python benchmarks/bench.py synthetic --iterations 2000
python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs --iterations 500

# 4. Merge profiles
LLVM_PROFDATA=$(find ~/.rustup -name llvm-profdata | head -1)
$LLVM_PROFDATA merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/*.profraw

# 5. Optimized build
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" maturin develop --release

# 6. Run benchmarks
python benchmarks/bench.py synthetic --iterations 5000
python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs

# Generate a reproducible benchmark FileStorage (requires ZODB + BTrees)
python benchmarks/bench.py generate

# Both synthetic + filestorage, with JSON export
python benchmarks/bench.py all --filestorage benchmarks/bench_data/Data.fs --output results.json
```
1 change: 1 addition & 0 deletions Cargo.toml
@@ -23,3 +23,4 @@ serde_json = "1"
base64 = "0.22"
hex = "0.4"
num-bigint = "0.4"
ryu = "1"