Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add user documentation for the FFI approach #1031

Merged
merged 3 commits into from
Feb 21, 2025

Conversation

timsaucer
Copy link
Contributor

Which issue does this PR close?

Closes #1027

Rationale for this change

User requested.

What changes are included in this PR?

Adds a page to the online documentation.

Are there any user-facing changes?

None

@robtandy
Copy link

Fantastic! No edits from me. I learned more about FFI and the approach reading that too. That is the exact doc I would have needed and wanted when encountering this issue organically

@robtandy
Copy link

Thinking out loud. Do we want to have a FAQ of some sort in the top level Readme.md? It could have a Q and A that would direct someone to this doc. My concern is, as important as I think this information is, it might not be as discoverable otherwise.

Unsure what other things you could touch on with the FAQ but it could be useful I think.

Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Great write up.

@timsaucer
Copy link
Contributor Author

@kevinjqliu Since you've been working with this code closely recently, would you mind doing a review?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for writing this up! This is a wealth of information 🙏
I added a few comments. Overall, this is amazing.

The Primary Issue
-----------------

Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a callout, whats so special about this method is that the abstraction is at the "data provider" ("TableProvider") layer. Most other projects are integrated with the Arrow ecosystem but requires passing arrow table or arrow batches around.
This here allows datafusion to integrate with a data source and produce the necessary data at the root level, which allows further optimizations and custom handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example, currently pyiceberg pass data to other engines by materializing the arrow table first and then registering the table with the engine
https://github.com/apache/iceberg-python/blob/b95e792d86f551d3705c3ea6b7e9985a2a0aaf3b/pyiceberg/table/__init__.py#L1785-L1828

as performant as possible and to utilize the features of DataFusion, you may decide to write
your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.

At first glance, it may appear the best way to do this is to add the ``datafusion-python``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the datafusion-python crate is interesting. what is it used for outside of creating the python bindings as the datafusion python library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we started doing the FFI work, re-exporting the datafusion-python crate was basically they only way you could add extensions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! i see this is called out in the Alternative Approach section. Was just wondering if there was another reason I missed

.. code-block:: rust
let my_provider = MyTableProvider::default();
let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since this is the implementation details section, i think its worth calling out the ability to passing tokio runtime and why we would want to do that
apache/datafusion#13937

@timsaucer timsaucer merged commit e6f6e66 into apache:main Feb 21, 2025
15 checks passed
@timsaucer timsaucer deleted the doc/document-ffi-approach branch February 21, 2025 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add documentation for why FFI is needed
5 participants