Skip to content

Fix seg fault when exceptions raise in folly worker#3004

Open
phoebusm wants to merge 4 commits intomasterfrom
fix_seg_fault_on_folly_exception
Open

Fix seg fault when exceptions raise in folly worker#3004
phoebusm wants to merge 4 commits intomasterfrom
fix_seg_fault_on_folly_exception

Conversation

@phoebusm
Copy link
Copy Markdown
Collaborator

@phoebusm phoebusm commented Apr 2, 2026

Reference Issues/PRs

https://man312219.monday.com/boards/7852509418/pulses/11573559183

What does this implement or fix?

folly::collect fails fast when a task raises an exception. If the caller's stack unwinds while other folly tasks are still running, they may access destroyed stack variables, causing a segfault.
In this issue, handler_data has been destroyed along with the stack, cauing seg fault.

Any other comments?

Similar fix #2458

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

auto descriptor = std::make_shared<StreamDescriptor>(frame_and_desc.frame_.descriptor());
pipeline_context->begin()->set_descriptor(std::move(descriptor));
reduce_and_fix_columns(pipeline_context, frame_and_desc.frame_, ReadOptions{}, handler_data).get();
std::shared_ptr<std::any> handler_data_ptr(std::shared_ptr<std::any>{}, &handler_data);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a shared_ptr with a null owning pointer and a raw pointer to handler_data — a null-deleter alias — to adapt a reference into the new shared_ptr<std::any> API. Because .get() is called immediately on the next line, the caller's frame is alive for the full duration and this is safe here.

The identical pattern is used in local_versioned_engine.cpp for batch_read_and_join_internal (also guarded by .get()). Both are fragile: removing .get() or copying the pattern without the blocking call would produce a dangling pointer.

Consider propagating std::shared_ptr<std::any> all the way through batch_read_and_join_internal and read_result_from_single_frame as well (matching the rest of the chain changed in this PR), which would eliminate the aliasing wrapper entirely and remove the implicit lifetime constraint.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 2, 2026

test comment update

phoebusm and others added 2 commits April 2, 2026 13:38
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@phoebusm phoebusm added the patch Small change, should increase patch version label Apr 2, 2026
@phoebusm phoebusm changed the title Fix seg fault on folly exception Fix seg fault when exceptions raises in folly worker Apr 2, 2026
@phoebusm phoebusm changed the title Fix seg fault when exceptions raises in folly worker Fix seg fault when exceptions raise in folly worker Apr 2, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 2, 2026

ArcticDB Code Review Summary

PR: Fix seg fault when exceptions raise in folly worker
Reviewed commit: 1332d96

Summary

This PR fixes a use-after-free segfault caused by folly::collect failing fast on an exception while async workers still hold a reference to a handler_data stack variable that has since been destroyed. The fix converts handler_data from std::any& (reference to caller's stack) to std::shared_ptr<std::any> (shared ownership) across read_frame.cpp, local_versioned_engine.cpp, version_core.cpp, and their callers.

Two call sites (batch_read_and_join_internal in local_versioned_engine.cpp and read_result_from_single_frame in pipeline_utils.hpp) still receive handler_data by reference and adapt it using the aliased-pointer idiom:

std::shared_ptr<std::any> handler_data_ptr(std::shared_ptr<std::any>{}, &handler_data);

This is safe because both functions block synchronously (the future chain terminates with .get() before the reference can expire), but it is an unusual pattern worth an explanatory comment. Inline comments from the previous push have already flagged the pipeline_utils.hpp usage.

The latest commits (delta) contain only line-wrapping/formatting changes — no logic changes.


API & Compatibility

  • ✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
  • ✅ Deprecation protocol followed for removed/renamed parameters (N/A)
  • ✅ On-disk format unchanged
  • ✅ Protobuf schema backwards-compatible (no proto changes)
  • ✅ Key types and ValueType enum unchanged

Memory & Safety

  • ✅ RAII used for all resource management
  • ✅ No use-after-move or use-after-free patterns in the main fix path
  • ✅ Buffer size calculations correct (no overflow potential)
  • ✅ GIL correctly managed at Python-C++ boundary
  • ✅ No accidental copies of large objects
  • ⚠️ batch_read_and_join_internal and pipeline_utils.hpp use the aliased null-owning shared_ptr pattern — safe here because futures are .get()-ed synchronously, but the assumption is implicit and deserves a comment

Correctness

  • ✅ All ValueType/KeyType switch statements exhaustive (no changes)
  • ✅ Edge cases handled (no new edge cases introduced)
  • ✅ Error codes unique and in correct ErrorCategory (no changes)
  • ✅ Storage backend interface fully implemented (no changes)
  • ✅ Query clause composition correct (no changes)

Code Quality

  • ✅ No duplicated logic introduced
  • ✅ C++ and Python implementations consistent — all call sites updated
  • ✅ Tests updated to wrap handler_data in std::make_shared<std::any> at every test site

Testing

  • ✅ Behavioural changes have corresponding test updates (all C++ test files updated)
  • ✅ No regression in existing test expectations
  • ✅ Integration tests for version engine changes covered by updated test suite
  • ❌ No new test specifically exercises the race condition (exception during folly::collect with concurrent workers) — existing tests will catch regressions but do not directly validate the lifetime fix

Build & Dependencies

  • ✅ No new source files (no CMakeLists.txt changes needed)
  • ✅ No new dependencies
  • ✅ Cross-platform compatibility maintained (standard C++17/20 shared_ptr usage)
  • ✅ vcpkg/conda unchanged

Security

  • ✅ No hardcoded credentials or secrets
  • ✅ Input validation at system boundaries unchanged
  • ✅ No buffer overflow potential introduced

PR Title & Description

  • ✅ Title is clear and imperative ("Fix seg fault when exceptions raise in folly worker")
  • ✅ Description explains root cause and references a similar prior fix (Fix flaky segfault in concat testing #2458)
  • ✅ No public API changes to call out
  • ✅ Labelled patch — appropriate for a bug fix
  • ⚠️ No linked GitHub issue (references a Monday.com board entry only — not a blocker)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant