[Memory] RecordBatchData: make ownership of Arrow C Data Interface structures explicit and safe (3) #2939

Open · jamesblackburn wants to merge 3 commits into master
Conversation
ArcticDB Code Review Summary — PR #2939: Make ownership of RecordBatchData / Arrow C Data Interface structures explicit and safe (7)

- API & Compatibility
- Memory & Safety
- Correctness
- Code Quality
- Testing
- Build & Dependencies
- Security
- PR Title & Description
jamesblackburn (Collaborator, Author) commented on Feb 28, 2026, on `class ArrowMemoryStability:`:
Added end-to-end (read-write) Python test for arrow memory stability. Leaks before the fix. Stable after.
jamesblackburn (Collaborator, Author):
┌────────────┬────────────────────┬───────────────────────┐
│ Iterations │ With fix (RssAnon) │ Without fix (RssAnon) │
├────────────┼────────────────────┼───────────────────────┤
│ 100 │ 9.3 MB │ 9.0 MB │
├────────────┼────────────────────┼───────────────────────┤
│ 200 │ 8.6 MB │ 12.6 MB │
├────────────┼────────────────────┼───────────────────────┤
│ 300 │ 8.6 MB │ 12.2 MB │
├────────────┼────────────────────┼───────────────────────┤
│ 400 │ 8.9 MB │ 11.2 MB │
├────────────┼────────────────────┼───────────────────────┤
│ 500 │ 9.8 MB (flat) │ 18.3 MB (growing) │
└────────────┴────────────────────┴───────────────────────┘
RecordBatchData wraps Arrow C Data Interface structs (ArrowArray/ArrowSchema) which must not be double-freed. The previous default copy/move semantics allowed accidental double-free. Apply Rule of Five: deleted copy, explicit move ctor/assign that null out the source's release pointers, and a destructor that calls release() if still owned. Also change InputItem variant from vector<RecordBatchData> to vector<shared_ptr<RecordBatchData>> since pybind11 requires copyable types in std::variant and RecordBatchData is now non-copyable.
42/42 Arrow tests pass. BM_arrow_convert_single_record_batch_to_segment shows no regression vs master baseline (0% delta on stable cases).
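The Rule of Five described in the commit message can be sketched as follows. This is a minimal, hypothetical illustration — `FakeArrowArray`, `BatchData`, and `counting_release` are stand-in names, not the real ArcticDB types, and only the `release` callback is modeled:

```cpp
#include <cassert>
#include <cstring>
#include <utility>

// Stand-in for an Arrow C Data Interface struct; only the release
// callback matters for this ownership sketch.
struct FakeArrowArray {
    void (*release)(FakeArrowArray*) = nullptr;
    void* private_data = nullptr;
};

inline int g_release_count = 0;

// Release callback that counts invocations and nulls itself, as the
// Arrow C Data Interface contract requires.
inline void counting_release(FakeArrowArray* a) {
    ++g_release_count;
    a->release = nullptr;
}

// Hypothetical simplified wrapper showing the Rule of Five applied in
// this PR: deleted copy, moves that null the source's release pointer,
// and a destructor that releases if still owned.
class BatchData {
public:
    BatchData() { std::memset(&array_, 0, sizeof(array_)); }  // null release
    explicit BatchData(FakeArrowArray a) : array_(a) {}

    BatchData(const BatchData&) = delete;             // no aliased ownership
    BatchData& operator=(const BatchData&) = delete;

    BatchData(BatchData&& other) noexcept : array_(other.array_) {
        other.array_.release = nullptr;               // source no longer owns
    }
    BatchData& operator=(BatchData&& other) noexcept {
        if (this != &other) {
            release_if_owned();
            array_ = other.array_;
            other.array_.release = nullptr;
        }
        return *this;
    }
    ~BatchData() { release_if_owned(); }

    bool owned() const { return array_.release != nullptr; }

private:
    void release_if_owned() {
        if (array_.release != nullptr)
            array_.release(&array_);
    }
    FakeArrowArray array_;
};
```

With this shape, moving a `BatchData` transfers ownership exactly once: the destructor of the moved-from object sees a null `release` and does nothing, so the callback fires a single time however the objects are shuffled.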
Adds ArrowMemoryStability benchmark that writes and reads a ~1MB PyArrow table 500 times with prune_previous_versions=True, tracking anonymous RSS growth (RssAnon excludes LMDB file-backed mmap pages). Detects memory leaks from Arrow C Data Interface release callbacks not being called.
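The benchmark's RssAnon tracking reads the `RssAnon:` field from `/proc/self/status`. A sketch of how that value could be parsed, assuming the status-file text format; the actual ArcticDB helper may differ:

```cpp
#include <sstream>
#include <string>

// Parse the "RssAnon:" line from /proc/<pid>/status-style text and
// return the value in kB, or -1 if the field is absent. RssAnon counts
// anonymous pages only, excluding file-backed mmap pages (e.g. LMDB's).
long rss_anon_kb(const std::string& status_text) {
    std::istringstream in(status_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("RssAnon:", 0) == 0) {       // prefix match
            std::istringstream fields(line.substr(8));
            long kb = -1;
            fields >> kb;                           // value precedes "kB"
            return kb;
        }
    }
    return -1;
}
```

In the benchmark itself the text would come from reading `/proc/self/status` before and after each write/read iteration and comparing the deltas.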
Summary
Applies the Rule of Five to `RecordBatchData` to make ownership of Arrow C Data Interface structures explicit and safe. `RecordBatchData` wraps `ArrowArray` and `ArrowSchema`. The Arrow C Data Interface contract requires whoever holds a struct with a non-null `release` pointer to call it to free owned buffers — analogous to a `unique_ptr` deleter. Previously the struct had no destructor and no copy/move semantics, making ownership accidental.
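The `unique_ptr`-deleter analogy can be made literal. A sketch under stated assumptions — `ToyArray`, `ReleaseDeleter`, and `make_toy` are hypothetical names, not ArcticDB or Arrow APIs:

```cpp
#include <memory>

// Stand-in for an Arrow C Data Interface struct: whoever holds it with
// a non-null `release` must call it exactly once, and the callback
// marks the struct as released by nulling `release`.
struct ToyArray {
    void (*release)(ToyArray*) = nullptr;
};

inline int g_calls = 0;

// Custom deleter honouring the release contract before freeing the
// struct itself: the same shape as a unique_ptr deleter.
struct ReleaseDeleter {
    void operator()(ToyArray* a) const {
        if (a->release != nullptr)
            a->release(a);   // free the buffers the producer owns
        delete a;            // then free the struct allocation
    }
};

using ToyArrayPtr = std::unique_ptr<ToyArray, ReleaseDeleter>;

inline ToyArrayPtr make_toy() {
    auto* a = new ToyArray;
    a->release = [](ToyArray* self) { ++g_calls; self->release = nullptr; };
    return ToyArrayPtr(a);
}
```

The PR bakes the equivalent logic into `RecordBatchData` itself rather than wrapping it in a smart pointer, which keeps the struct layout compatible with the C interface.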
Changes

- `RecordBatchData` (`arrow_output_frame.hpp`) — Rule of Five:
  - default constructor `memset`s the structs (safe initial state with null `release`)
  - copy constructor/assignment `= delete` — prevents aliased ownership of `release` pointers
  - move constructor/assignment null out `release` on the source
  - destructor calls `release_if_owned()`, a no-op if `release == nullptr` (already transferred to Python)
- `InputItem` (`python_to_tensor_frame.hpp`) — changed from `std::vector<RecordBatchData>` to `std::vector<std::shared_ptr<RecordBatchData>>`. Required because pybind11 needs copyable types in `std::variant`, and `RecordBatchData` now has a deleted copy constructor.

Motivation
The happy path was already correct. The bug was on exception paths: if `extract_record_batches()` partially completed and then threw (e.g., an allocation failure in pybind11 during Python list construction), the `RecordBatchData` objects remaining in the C++ vector would be destroyed trivially — no destructor meant `release` was never called and the Arrow buffers were leaked. The new destructor ensures buffers are freed whenever C++ is the last owner.
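The exception-path fix is ordinary RAII: once the destructor releases owned structs, stack unwinding cleans up a partially filled vector for free. A self-contained sketch with hypothetical names (`Mini`, `Owned`, `partial_extract_then_throw` are illustrative, not the real code):

```cpp
#include <stdexcept>
#include <vector>

struct Mini { void (*release)(Mini*) = nullptr; };

inline int g_freed = 0;
inline void free_cb(Mini* m) { ++g_freed; m->release = nullptr; }

// RAII wrapper in the shape of the fixed RecordBatchData: non-copyable,
// movable, releases in the destructor if still owned.
struct Owned {
    Mini m;
    Owned() { m.release = free_cb; }
    Owned(const Owned&) = delete;
    Owned& operator=(const Owned&) = delete;
    Owned(Owned&& o) noexcept : m(o.m) { o.m.release = nullptr; }
    ~Owned() { if (m.release) m.release(&m); }
};

// Simulates extract_record_batches() throwing after partially filling
// its vector: unwinding destroys the vector, and each already-built
// element releases its struct. Returns how many were freed.
inline int partial_extract_then_throw() {
    g_freed = 0;
    try {
        std::vector<Owned> batches;
        batches.reserve(3);
        batches.emplace_back();
        batches.emplace_back();
        throw std::runtime_error("allocation failure mid-extract");
    } catch (const std::runtime_error&) {
        // vector destroyed during unwinding; nothing leaked
    }
    return g_freed;
}
```

With the old trivially-destructible struct, the same unwinding would destroy the elements without ever invoking `release`, which is exactly the leak described above.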
Memory leak on write-path
Write Path — Genuine leak on every Arrow write (fixed)
The flow is:

1. PyArrow sets release callbacks that, when called, decref the source PyArrow array and free internal export metadata.
2. `RecordBatchData` takes ownership of these pointers (confirmed by the `ArrowArray*, ArrowSchema*` constructor and `init(AA*, AS*)` at line 688, which never calls `release`).
3. Data is read into a `SegmentInMemory` via `arrow_data_to_segment()`.

Old code: the defaulted destructor is a no-op and `release` is never called. PyArrow's export holds an extra refcount on the source array and leaks its own internal `ExportedArray` struct. This is a genuine memory leak on every Arrow write operation.

New code: `~RecordBatchData()` calls `release_if_owned()`, which invokes `array_.release(&array_)` and `schema_.release(&schema_)`, so PyArrow's export cleanup runs. Fixed.
Benchmarks
`BM_arrow_convert_single_record_batch_to_segment` on `linux-release` vs master — no regression. All stable cases show 0% delta; the ±11% variation on 100K-row cases is within system noise (load average ~116 at time of measurement).