bluedynamics · jensens · Feb 25, 2026 · Feb 25, 2026 · Feb 25, 2026 · Feb 25, 2026
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
@@ -3,9 +3,9 @@
 Comparison of `zodb-json-codec` (Rust + PyO3) vs CPython's `pickle` module
 for ZODB record encoding/decoding.
 
-Measured on: 2026-02-24
+Measured on: 2026-02-25
 Python: 3.13.9, PyO3: 0.28, 5000 iterations, 100 warmup
-Build: `maturin develop --release` (optimized, LTO + codegen-units=1 + PGO)
+Build: `maturin develop --release` + PGO (LTO + codegen-units=1)
 
 **Important:** Always benchmark with `maturin develop --release`. Debug builds
 are 3-8x slower due to missing optimizations and inlining.
@@ -20,7 +20,8 @@ The codec does fundamentally more work than `pickle.loads`/`pickle.dumps`:
 
 The codec's value is not raw speed but **JSONB queryability** — enabling SQL
 queries on ZODB object attributes in PostgreSQL. Despite the extra work, the
-release build beats CPython pickle on most operations.
+release build beats CPython pickle on encode and roundtrip across all
+categories, and on decode for all but the largest string-dominated payloads.
 
 ---
 
@@ -30,64 +31,66 @@ release build beats CPython pickle on most operations.
 
 | Category | Python | Codec | Ratio |
 |---|---|---|---|
-| simple_flat_dict (120 B) | 1.9 us | 1.1 us | **1.8x faster** |
-| nested_dict (187 B) | 2.9 us | 1.8 us | **1.6x faster** |
-| large_flat_dict (2.5 KB) | 22.8 us | 19.7 us | **1.2x faster** |
-| bytes_in_state (1 KB) | 1.8 us | 1.9 us | 1.1x slower |
-| special_types (314 B) | 6.8 us | 4.7 us | **1.5x faster** |
-| btree_small (112 B) | 1.9 us | 1.8 us | 1.1x faster |
-| btree_length (44 B) | 1.0 us | 0.5 us | **2.0x faster** |
-| scalar_string (72 B) | 1.1 us | 0.5 us | **2.1x faster** |
-| wide_dict (27 KB) | 264 us | 279 us | 1.1x slower |
-| deep_nesting (379 B) | 7.2 us | 7.3 us | 1.0x |
+| simple_flat_dict (120 B) | 1.9 us | 1.0 us | **1.9x faster** |
+| nested_dict (187 B) | 2.7 us | 1.6 us | **1.3x faster** |
+| large_flat_dict (2.5 KB) | 22.6 us | 18.0 us | **1.3x faster** |
+| bytes_in_state (1 KB) | 1.6 us | 1.4 us | **1.1x faster** |
+| special_types (314 B) | 6.8 us | 3.8 us | **1.8x faster** |
+| btree_small (112 B) | 1.7 us | 1.5 us | **1.2x faster** |
+| btree_length (44 B) | 1.0 us | 0.4 us | **2.3x faster** |
+| scalar_string (72 B) | 1.1 us | 0.5 us | **2.2x faster** |
+| wide_dict (27 KB) | 250 us | 244.5 us | **1.0x faster** |
+| deep_nesting (379 B) | 6.9 us | 6.4 us | 1.0x slower |
 
 ### Decode to JSON string (pickle bytes -> JSON, all in Rust)
 
-The direct path for PG storage — serializes to a JSON string entirely in Rust
-with the GIL released. Compared against the dict path + `json.dumps()`.
+The direct path for PG storage — writes JSON tokens directly to a `String`
+buffer from the PickleValue AST, entirely in Rust with the GIL released.
+No intermediate `serde_json::Value` allocations. Compared against the dict
+path + `json.dumps()`.
 
 | Category | Dict+dumps | JSON str | Speedup |
 |---|---|---|---|
-| simple_flat_dict | 2.7 us | 1.3 us | **2.2x faster** |
-| nested_dict | 4.3 us | 2.5 us | **1.7x faster** |
-| large_flat_dict | 35.4 us | 25.6 us | **1.4x faster** |
-| bytes_in_state | 5.7 us | 2.7 us | **2.1x faster** |
-| special_types | 7.1 us | 4.7 us | **1.5x faster** |
-| btree_small | 3.8 us | 2.1 us | **1.8x faster** |
-| btree_length | 1.5 us | 0.8 us | **1.9x faster** |
-| scalar_string | 0.9 us | 0.7 us | **1.3x faster** |
-| wide_dict | 273.7 us | 307.6 us | 1.1x slower |
-| deep_nesting | 13.3 us | 8.6 us | **1.5x faster** |
+| simple_flat_dict | 2.7 us | 1.1 us | **2.5x faster** |
+| nested_dict | 4.3 us | 1.9 us | **2.3x faster** |
+| large_flat_dict | 33.7 us | 17.1 us | **2.0x faster** |
+| bytes_in_state | 5.2 us | 1.6 us | **3.3x faster** |
+| special_types | 7.5 us | 4.0 us | **1.9x faster** |
+| btree_small | 3.6 us | 1.6 us | **2.3x faster** |
+| btree_length | 1.4 us | 0.5 us | **2.8x faster** |
+| scalar_string | 0.8 us | 0.6 us | **1.3x faster** |
+| wide_dict | 290.5 us | 161.6 us | **1.8x faster** |
+| deep_nesting | 14.2 us | 5.7 us | **2.5x faster** |
 
 ### Encode (Python dict -> pickle bytes)
 
 | Category | Python | Codec | Ratio |
 |---|---|---|---|
-| simple_flat_dict | 1.3 us | 0.2 us | **6.5x faster** |
-| nested_dict | 1.5 us | 0.3 us | **4.8x faster** |
-| large_flat_dict | 5.3 us | 1.5 us | **3.5x faster** |
-| bytes_in_state | 1.2 us | 0.7 us | **1.7x faster** |
-| special_types | 4.7 us | 0.5 us | **9.8x faster** |
-| btree_small | 1.3 us | 0.2 us | **6.0x faster** |
-| btree_length | 1.1 us | 0.1 us | **8.8x faster** |
-| scalar_string | 1.2 us | 0.1 us | **8.3x faster** |
-| wide_dict | 56.4 us | 13.9 us | **4.0x faster** |
-| deep_nesting | 2.8 us | 1.0 us | **2.8x faster** |
+| simple_flat_dict | 1.3 us | 0.2 us | **6.7x faster** |
+| nested_dict | 1.6 us | 0.3 us | **6.4x faster** |
+| large_flat_dict | 5.7 us | 1.6 us | **3.9x faster** |
+| bytes_in_state | 1.3 us | 0.8 us | **1.7x faster** |
+| special_types | 4.6 us | 0.5 us | **9.2x faster** |
+| btree_small | 1.3 us | 0.2 us | **6.6x faster** |
+| btree_length | 1.0 us | 0.1 us | **8.0x faster** |
+| scalar_string | 1.0 us | 0.1 us | **7.9x faster** |
+| wide_dict | 56.9 us | 13.7 us | **4.1x faster** |
+| deep_nesting | 2.6 us | 1.0 us | **2.6x faster** |
 
 ### Full roundtrip (decode + encode)
 
 | Category | Python | Codec | Ratio |
 |---|---|---|---|
-| simple_flat_dict | 3.2 us | 1.4 us | **2.4x faster** |
-| nested_dict | 4.5 us | 2.1 us | **2.2x faster** |
-| large_flat_dict | 29.7 us | 19.1 us | **1.6x faster** |
-| bytes_in_state | 3.3 us | 2.4 us | **1.4x faster** |
-| special_types | 11.7 us | 4.4 us | **2.7x faster** |
-| btree_small | 5.8 us | 1.8 us | **3.3x faster** |
-| btree_length | 2.1 us | 0.6 us | **3.6x faster** |
-| scalar_string | 2.3 us | 0.6 us | **3.6x faster** |
-| wide_dict | 316 us | 260 us | **1.2x faster** |
-| deep_nesting | 10.3 us | 7.3 us | **1.4x faster** |
+| simple_flat_dict | 3.2 us | 1.3 us | **2.6x faster** |
+| nested_dict | 4.4 us | 2.1 us | **2.1x faster** |
+| large_flat_dict | 28.7 us | 19.8 us | **1.5x faster** |
+| bytes_in_state | 3.1 us | 2.3 us | **1.4x faster** |
+| special_types | 11.5 us | 4.9 us | **2.4x faster** |
+| btree_small | 3.1 us | 1.8 us | **1.7x faster** |
+| btree_length | 2.0 us | 0.6 us | **3.4x faster** |
+| scalar_string | 2.1 us | 0.6 us | **3.5x faster** |
+| wide_dict | 318 us | 258.8 us | **1.3x faster** |
+| deep_nesting | 10.0 us | 7.8 us | **1.3x faster** |
 
 ### Output size (pickle bytes vs JSON)
 
@@ -122,18 +125,18 @@ plus OOBTree containers, group summaries, and edge-case objects.
 
 | Metric | Codec | Python | Speedup |
 |---|---|---|---|
-| Decode mean | 26.9 us | 22.2 us | 1.2x slower |
-| Decode median | 23.2 us | 21.6 us | 1.1x slower |
-| Decode P95 | 39.7 us | 31.7 us | 1.3x slower |
-| Encode mean | 4.7 us | 18.0 us | **3.8x faster** |
-| Encode median | 3.9 us | 19.7 us | **5.1x faster** |
-| Encode P95 | 9.6 us | 29.1 us | **3.0x faster** |
+| Decode mean | 27.2 us | 22.7 us | 1.2x slower |
+| Decode median | 23.6 us | 22.2 us | 1.1x slower |
+| Decode P95 | 40.5 us | 33.1 us | 1.2x slower |
+| Encode mean | 4.8 us | 18.2 us | **3.8x faster** |
+| Encode median | 4.0 us | 19.9 us | **5.0x faster** |
+| Encode P95 | 9.9 us | 30.0 us | **3.0x faster** |
 | Total pickle | 5.1 MB | — | — |
 | Total JSON | 7.2 MB | — | 1.41x |
 
 Decode is slightly slower (1.1x median) due to the two-pass conversion plus
 type-aware transformation. The gap narrows on metadata-heavy records.
-Encode is consistently **3.0-5.1x faster** because the Rust encoder writes
+Encode is consistently **3.0-5.0x faster** because the Rust encoder writes
 pickle opcodes directly from Python objects, bypassing intermediate allocations.
 
 ### Record type distribution
@@ -154,26 +157,27 @@ pickle opcodes directly from Python objects, bypassing intermediate allocations.
 The zodb-pgjsonb storage path has two decode functions. The dict path
 (`decode_zodb_record_for_pg`) returns a Python dict that must then be
 serialized via `json.dumps()`. The JSON string path
-(`decode_zodb_record_for_pg_json`) does everything in Rust with the GIL
-released. See the synthetic comparison above.
+(`decode_zodb_record_for_pg_json`) writes JSON tokens directly from the
+PickleValue AST to a `String` buffer, entirely in Rust with the GIL released.
 
 ```
 Dict path:   pickle bytes → Rust AST → Python dict (GIL held) → json.dumps() → PG
-JSON path:   pickle bytes → Rust AST → serde_json → JSON string (all Rust, GIL released) → PG
+JSON path:   pickle bytes → Rust AST → JSON string (direct write, GIL released) → PG
 ```
 
 ### 1,692 records
 
 | Metric | Dict+dumps | JSON str | Speedup |
 |---|---|---|---|
-| Mean | 41.3 us | 31.5 us | **1.3x faster** |
-| Median | 35.9 us | 26.9 us | **1.3x faster** |
-| P95 | 64.2 us | 47.7 us | **1.3x faster** |
+| Mean | 40.4 us | 28.3 us | **1.4x faster** |
+| Median | 34.7 us | 24.4 us | **1.4x faster** |
+| P95 | 62.0 us | 51.9 us | **1.2x faster** |
 
-The JSON string path is **1.3x faster** across real-world data because
-it eliminates the Python dict allocation + `json.dumps()` serialization.
-The entire pipeline runs in Rust with the GIL released, improving
-multi-threaded throughput in Zope/Plone deployments.
+The JSON string path is **1.4x faster** across real-world data because
+it eliminates both the Python dict allocation + `json.dumps()` serialization
+and all intermediate `serde_json::Value` heap allocations. The entire pipeline
+runs in Rust with the GIL released, improving multi-threaded throughput in
+Zope/Plone deployments.
 
 ---
 
@@ -182,9 +186,9 @@ multi-threaded throughput in Zope/Plone deployments.
 The sweet spot is typical ZODB objects (5-50 keys, mixed types, datetime
 fields, persistent refs):
 
-- **Decode:** 1.5-2.0x faster on synthetic, near parity on real-world data
-- **Encode:** 2-10x faster on synthetic, 3-5x faster on real-world data
-- **PG path:** 1.3x faster end-to-end with GIL-free throughput
+- **Decode:** 1.1-2.3x faster on synthetic, near parity on real-world data
+- **Encode:** 1.7-9.2x faster on synthetic, 3-5x faster on real-world data
+- **PG path:** 1.3-3.3x faster end-to-end with GIL-free throughput
 
 Decode overhead comes from the two-pass conversion plus type transformation.
 On string-dominated payloads this matters more; on metadata-rich records with
@@ -215,49 +219,33 @@ mixed types (the typical ZODB case) the codec is competitive or faster.
 - Thread-local buffer reuse (retains capacity across encode calls)
 - `reserve()` calls before multi-part writes (eliminates mid-write reallocations)
 - Direct i64 LONG1 encoding (eliminates BigInt heap allocation)
+- Thread-local class pickle cache per (module, name) pair (single memcpy
+  replaces 7 opcode writes for ~99.6% of records)
 - `#[inline]` on `write_u8`, `write_bytes`, `encode_int`
 
 **Both paths:**
 - Interned marker strings (`pyo3::intern!` for `@t`, `@cls`, `@s`, etc.)
 - Pre-collected PyList (`PyList::new` vs append loop)
 - Thin LTO + single codegen unit (free 6-9% improvement)
 - Profile-guided optimization (PGO) with real FileStorage + synthetic data
-- Direct pickle → JSON string path for PG storage (GIL released)
+- Direct PickleValue → JSON string writer (`json_writer.rs`) for PG storage,
+  eliminating all `serde_json::Value` intermediate allocations (GIL released)
+- Thread-local JSON writer buffer reuse (retains capacity across decode calls)
 
 ---
 
 ## Running benchmarks
 
+All numbers in this document are from PGO builds. Always use PGO for
+benchmarking — it adds 5-15% and reflects production performance.
+
 ```bash
 cd sources/zodb-json-codec
 
-# Build release first (important!)
-maturin develop --release
-
-# Synthetic micro-benchmarks
-python benchmarks/bench.py synthetic --iterations 1000
-
-# Generate a reproducible benchmark FileStorage (requires ZODB + BTrees)
-python benchmarks/bench.py generate
-
-# Scan the generated (or any) FileStorage
-python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
-
-# PG decode path comparison (dict vs JSON string)
-python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs
-
-# Both synthetic + filestorage, with JSON export
-python benchmarks/bench.py all --filestorage benchmarks/bench_data/Data.fs --output results.json
-```
+# 0. Decompress benchmark data (once — Data.fs is gitignored, only .gz is tracked)
+gunzip -k benchmarks/bench_data/Data.fs.gz
 
-## PGO build (optional, adds 5-15%)
-
-Profile-guided optimization uses real workload data to optimize branch
-prediction and code layout. The release CI builds include PGO for
-Linux x86_64 wheels.
-
-```bash
-# 1. Install LLVM tools
+# 1. Install LLVM tools (once)
 rustup component add llvm-tools
 
 # 2. Instrumented build
@@ -266,11 +254,23 @@ RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" maturin develop --release
 # 3. Generate profiles — use BOTH real data and synthetic for best coverage
 python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
 python benchmarks/bench.py synthetic --iterations 2000
+python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs --iterations 500
 
 # 4. Merge profiles
 LLVM_PROFDATA=$(find ~/.rustup -name llvm-profdata | head -1)
 $LLVM_PROFDATA merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/*.profraw
 
 # 5. Optimized build
 RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" maturin develop --release
+
+# 6. Run benchmarks
+python benchmarks/bench.py synthetic --iterations 5000
+python benchmarks/bench.py filestorage benchmarks/bench_data/Data.fs
+python benchmarks/bench.py pg-compare --filestorage benchmarks/bench_data/Data.fs
+
+# Generate a reproducible benchmark FileStorage (requires ZODB + BTrees)
+python benchmarks/bench.py generate
+
+# Both synthetic + filestorage, with JSON export
+python benchmarks/bench.py all --filestorage benchmarks/bench_data/Data.fs --output results.json
 ```
diff --git a/Cargo.toml b/Cargo.toml
@@ -23,3 +23,4 @@ serde_json = "1"
 base64 = "0.22"
 hex = "0.4"
 num-bigint = "0.4"
+ryu = "1"