Add user documentation for the FFI approach #1031
Conversation
Fantastic! No edits from me. I learned more about FFI and the approach reading that too. That is the exact doc I would have needed and wanted when encountering this issue organically.
Thinking out loud: do we want to have an FAQ of some sort at the top level? Unsure what other things you could touch on with the FAQ, but it could be useful I think.
+1. Great write-up.
@kevinjqliu Since you've been working with this code closely recently, would you mind doing a review?
Thank you so much for writing this up! This is a wealth of information 🙏
I added a few comments. Overall, this is amazing.
The Primary Issue
-----------------

Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
Just as a callout: what's so special about this method is that the abstraction is at the "data provider" (``TableProvider``) layer. Most other projects are integrated with the Arrow ecosystem but require passing Arrow tables or Arrow batches around.
This approach allows DataFusion to integrate with a data source and produce the necessary data at the root level, which allows further optimizations and custom handling.
For example, pyiceberg currently passes data to other engines by materializing the Arrow table first and then registering the table with the engine:
https://github.com/apache/iceberg-python/blob/b95e792d86f551d3705c3ea6b7e9985a2a0aaf3b/pyiceberg/table/__init__.py#L1785-L1828
as performant as possible and to utilize the features of DataFusion, you may decide to write
your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.

At first glance, it may appear the best way to do this is to add the ``datafusion-python``
The ``datafusion-python`` crate is interesting. What is it used for outside of creating the Python bindings for the ``datafusion`` Python library?
Before we started doing the FFI work, re-exporting the ``datafusion-python`` crate was basically the only way you could add extensions.
Thanks! I see this is called out in the "Alternative Approach" section. Was just wondering if there was another reason I missed.
.. code-block:: rust

    let my_provider = MyTableProvider::default();
    let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
nit: since this is the implementation details section, I think it's worth calling out the ability to pass a Tokio runtime and why we would want to do that.
apache/datafusion#13937
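To illustrate the point above, a minimal sketch of what passing a Tokio runtime could look like, assuming the third argument of ``FFI_TableProvider::new`` seen in the quoted snippet accepts an ``Option`` of a ``tokio::runtime::Handle`` (``MyTableProvider`` is the hypothetical provider from the snippet):

```rust
use std::sync::Arc;

use datafusion_ffi::table_provider::FFI_TableProvider;
use tokio::runtime::Runtime;

// Hypothetical custom provider, as in the documentation snippet above.
let my_provider = MyTableProvider::default();

// Create a dedicated Tokio runtime so async work triggered across the FFI
// boundary runs on a runtime the library controls, instead of relying on
// whatever runtime (if any) the foreign caller happens to be running on.
let runtime = Runtime::new().expect("failed to create Tokio runtime");

// Pass a handle to that runtime as the optional third argument, where the
// original snippet passed `None`.
let ffi_provider = FFI_TableProvider::new(
    Arc::new(my_provider),
    false,
    Some(runtime.handle().clone()),
);
```

Note that the runtime must be kept alive for as long as the provider may be used; dropping it would leave the handle pointing at a shut-down runtime.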
Which issue does this PR close?
Closes #1027
Rationale for this change
User requested.
What changes are included in this PR?
Adds a page to the online documentation.
Are there any user-facing changes?
None