Skip to content

[Memory] RecordBatchData:Arrow C Data Interface structures explicit and safe (3)#2939

Open
jamesblackburn wants to merge 3 commits intomasterfrom
jb/3_refactor_recordbatch_move_only
Open

[Memory] RecordBatchData:Arrow C Data Interface structures explicit and safe (3)#2939
jamesblackburn wants to merge 3 commits intomasterfrom
jb/3_refactor_recordbatch_move_only

Conversation

@jamesblackburn
Copy link
Copy Markdown
Collaborator

@jamesblackburn jamesblackburn commented Feb 25, 2026

Summary

Applies the Rule of Five to RecordBatchData to make ownership of Arrow C Data Interface structures explicit and safe.

RecordBatchData wraps ArrowArray and ArrowSchema. The Arrow C Data Interface contract requires whoever holds a
struct with a non-null release pointer to call it to free owned buffers — analogous to a unique_ptr deleter.
Previously the struct had no destructor and no copy/move semantics, making ownership accidental.

Changes

RecordBatchData (arrow_output_frame.hpp) — Rule of Five:

  • Default constructor zeroes both structs via memset (safe initial state with null release)
  • Copy constructor/assignment = delete — prevents aliased ownership of release pointers
  • Move constructor/assignment — transfers ownership and nulls out release on the source
  • Destructor — calls release_if_owned(), a no-op if release == nullptr (already transferred to Python)

InputItem (python_to_tensor_frame.hpp) — changed from std::vector<RecordBatchData> to
std::vector<std::shared_ptr<RecordBatchData>>. Required because pybind11 needs copyable types in
std::variant, and RecordBatchData now has a deleted copy constructor.

Motivation

The happy path was already correct. The bug was on exception paths: if extract_record_batches()
partially completed and then threw (e.g., allocation failure in pybind11 during Python list
construction), RecordBatchData objects remaining in the C++ vector would be destroyed trivially —
no destructor meant release was never called and Arrow buffers were leaked. The new destructor
ensures buffers are freed whenever C++ is the last owner.

Memory leak on write-path

Write Path — Genuine leak on every Arrow write (fixed)

The flow is:

  1. Python (_normalization.py:741-742): RecordBatchData() is created, then pa_record_batch._export_to_c(...) fills it.
    PyArrow sets release callbacks that, when called, decref the source PyArrow array and free internal export metadata.
  2. C++ (python_to_tensor_frame.cpp:347-348): sparrow::record_batch{&rb->array_, &rb->schema_} — sparrow borrows the
    pointers (confirmed by the ArrowArray*, ArrowSchema* constructor and init(AA*, AS*) at line 688 which never calls
    release). Data is read into a SegmentInMemory via arrow_data_to_segment().
  3. Cleanup: After the write completes, Python drops the RecordBatchData objects.

Old code: The defaulted destructor is a no-op. release is never called. PyArrow's export holds an extra refcount on the
source array and leaks its own internal ExportedArray struct. This is a genuine memory leak on every Arrow write
operation.

New code: ~RecordBatchData() calls release_if_owned() → array_.release(&array_) and schema_.release(&schema_) →
PyArrow's export cleanup runs. Fixed.

Benchmarks

BM_arrow_convert_single_record_batch_to_segment on linux-release vs master — no regression.
All stable cases show 0% delta; ±11% variation on 100K-row cases is within system noise (load avg
~116 at time of measurement).

@jamesblackburn jamesblackburn added the minor Feature change, should increase minor version label Feb 25, 2026
@jamesblackburn jamesblackburn changed the title Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe 3. Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe Feb 25, 2026
@jamesblackburn jamesblackburn changed the title 3. Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe 2. Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe Feb 25, 2026
@jamesblackburn jamesblackburn changed the title 2. Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe (3) Feb 25, 2026
@jamesblackburn jamesblackburn force-pushed the jb/3_refactor_recordbatch_move_only branch 2 times, most recently from 7e9a80c to 10132f8 Compare February 27, 2026 11:27
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Feb 27, 2026

ArcticDB Code Review Summary

PR #2939 — Make ownership of RecordBatchData / Arrow C Data Interface structures explicit and safe (7)
Reviewed diff: synchronize — delta diff only (commits since last review)


API & Compatibility

  • ✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
  • ✅ Deprecation protocol not applicable (no removed/renamed parameters)
  • ✅ On-disk format unchanged — pure C++ ownership refactor
  • ✅ Protobuf schema unchanged
  • ✅ Key types and ValueType enum unchanged

Memory & Safety

  • ✅ RAII used: Rule of Five correctly applied to RecordBatchData
  • ✅ Move constructor nulls source's release pointer — no double-free after move
  • ✅ Destructor calls release_if_owned(), which checks for null before invoking — safe no-op after transfer to Python
  • ✅ Default constructor zeroes both C structs so release == nullptr initially
  • release_if_owned() correctly handles nested Arrow buffers (owned by the release callback)
  • ⚠️ Double-extraction riskArrowOutputFrame::extract_record_batches() can still be called twice from Python (see inline comment on arrow_output_frame.hpp line 61)
  • ✅ GIL not affected by this change
  • ✅ No accidental large copies — copy constructor deleted, shared_ptr used where copies are needed

Correctness

  • ✅ Arrow C Data Interface ownership contract followed correctly in move semantics
  • shared_ptr<RecordBatchData> as pybind11 holder type consistent with py::class_<RecordBatchData, std::shared_ptr<RecordBatchData>> registration
  • InputItem variant change to std::vector<std::shared_ptr<RecordBatchData>> required because pybind11's std::variant caster needs copyable alternatives
  • ✅ pybind11 binding in python_bindings.cpp correctly updated to use shared_ptr holder type

Code Quality

  • ✅ Rule of Five implementation is clean and idiomatic
  • ✅ Comments on move constructor/destructor accurately describe the ownership semantics
  • release_if_owned() is private, well-named, and reused in both destructor and move assignment
  • ✅ Benchmark documents added to docs/claude/plans/duckdb/ — no regression confirmed
  • ✅ Unused os import from earlier draft removed; gc import added and used
  • ✅ PR description updated (fixed earlier sync)data_consumed_ flag reference removed; description now accurately reflects delivered changes

Testing

  • ✅ Benchmark doc confirms 42/42 C++ Arrow tests pass on linux-release
  • Five new tests cover the full Rule of Five:
    • DestructorCallsRelease — destructor calls Arrow release callbacks on both ArrowArray and ArrowSchema
    • WritePathReleasesArrowMemory — end-to-end write path; pointer-constructed sparrow::record_batch does not consume Arrow ownership
    • MoveTransfersOwnership — move constructor transfers ownership; moved-from source destructs safely
    • MoveAssignmentReleasesOldAndTransfers ✅ — move assignment releases existing destination resources before taking source ownership
    • SelfMoveAssignmentIsNoOp ✅ — self-assignment guard verified; final destruction calls release exactly once
  • IndentationError in python/benchmarks/arrow.py FIXED (fixed earlier sync)_get_anon_rss_kb now has correct indentation
  • Benchmark portability FIXED (fixed earlier sync)FileNotFoundError is now explicitly caught so the function correctly returns 0 on non-Linux platforms
  • ArrowMemoryStability Python benchmark — 500 write+read cycles; measures anonymous RSS growth (RssAnon from /proc/self/status) to catch release-callback failures end-to-end; correctly excludes LMDB file-backed mmap pages
  • benchmarks.json updated (this sync) — dict comprehension in ArrowMemoryStability.setup() reformatted to multi-line style; version hash and setup_cache_key values updated accordingly (arrow:51→53, arrow:125→127 reflecting +2 line shift in arrow.py); no functional change
  • ✅ No regression in existing test expectations

Build & Dependencies

  • ✅ No new source files — header-only change plus updates to existing .cpp files
  • ✅ No dependency changes
  • #include <cstring> added for std::memset — correct
  • ✅ New #include <arcticdb/arrow/arrow_output_frame.hpp> added to test file — correct

Security

  • ✅ No hardcoded credentials or secrets
  • ✅ No new input validation concerns
  • ✅ No buffer overflow potential introduced

PR Title & Description

  • ✅ Title is clear and descriptive
  • ✅ Description accurately reflects all delivered changes
  • ✅ Motivation section clearly explains the exception-path and write-path leaks being fixed
  • ✅ Benchmarks linked and interpreted
  • bug + minor labels applied

@jamesblackburn jamesblackburn force-pushed the jb/3_refactor_recordbatch_move_only branch from 10132f8 to c1e707a Compare February 27, 2026 17:06
@jamesblackburn jamesblackburn added the bug Something isn't working label Feb 27, 2026
@jamesblackburn jamesblackburn force-pushed the jb/3_refactor_recordbatch_move_only branch from c1e707a to f95fe52 Compare February 28, 2026 11:09
@jamesblackburn jamesblackburn changed the title Make ownership of RecordBatchData:Arrow C Data Interface structures explicit and safe (3) [Memory] RecordBatchData:Arrow C Data Interface structures explicit and safe (3) Feb 28, 2026
)


class ArrowMemoryStability:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added end-to-end (read-write) Python test for arrow memory stability. Leaks before the fix. Stable after.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  ┌────────────┬────────────────────┬───────────────────────┐                                                             
  │ Iterations │ With fix (RssAnon) │ Without fix (RssAnon) │                                                             
  ├────────────┼────────────────────┼───────────────────────┤                                                             
  │    100     │       9.3 MB       │        9.0 MB         │                                 
  ├────────────┼────────────────────┼───────────────────────┤
  │    200     │       8.6 MB       │        12.6 MB        │
  ├────────────┼────────────────────┼───────────────────────┤
  │    300     │       8.6 MB       │        12.2 MB        │
  ├────────────┼────────────────────┼───────────────────────┤
  │    400     │       8.9 MB       │        11.2 MB        │
  ├────────────┼────────────────────┼───────────────────────┤
  │    500     │   9.8 MB (flat)    │   18.3 MB (growing)   │
  └────────────┴────────────────────┴───────────────────────┘

@jamesblackburn jamesblackburn force-pushed the jb/3_refactor_recordbatch_move_only branch from cc4b602 to 0718c1f Compare February 28, 2026 11:20
RecordBatchData wraps Arrow C Data Interface structs (ArrowArray/ArrowSchema)
which must not be double-freed. The previous default copy/move semantics
allowed accidental double-free. Apply Rule of Five: deleted copy, explicit
move ctor/assign that null out the source's release pointers, and a
destructor that calls release() if still owned.

Also change InputItem variant from vector<RecordBatchData> to
vector<shared_ptr<RecordBatchData>> since pybind11 requires copyable types
in std::variant and RecordBatchData is now non-copyable.
42/42 Arrow tests pass. BM_arrow_convert_single_record_batch_to_segment
shows no regression vs master baseline (0% delta on stable cases).
@jamesblackburn jamesblackburn force-pushed the jb/3_refactor_recordbatch_move_only branch from 0718c1f to 3dc3073 Compare February 28, 2026 11:22
Adds ArrowMemoryStability benchmark that writes and reads a ~1MB PyArrow
table 500 times with prune_previous_versions=True, tracking anonymous RSS
growth (RssAnon excludes LMDB file-backed mmap pages). Detects memory
leaks from Arrow C Data Interface release callbacks not being called.
@jamesblackburn jamesblackburn force-pushed the jb/3_refactor_recordbatch_move_only branch from 3dc3073 to af4bc1d Compare February 28, 2026 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working minor Feature change, should increase minor version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant