feat: `to_dask` (or more generally `to_batches`?) #9891
Comments
I like the idea of a general …

I think starting with …

Hi @jcrist, thank you for creating this. Transferring data directly between the compute backend and another cluster, bypassing the client, is crucial for efficient ML training. We could start from … Please let me know if you need any help from me.

Not seeing exactly what …

It would be great if we could pass the data from some other backends to the training cluster without going through the client. One direct use case for IbisML: we could demo large-scale training using Spark or BigQuery + xgboost/torch.
Opening this mostly for discussion.

Say all your data lives in <big cloud provider db>. After doing some selecting/filtering/transforming, you want to export your data out of the DB and into a different distributed system like dask (or spark or others) to do some operations (ML training, for example) that can't as easily be executed purely in the database backend.

Some of our backends provide efficient means for distributed batch retrieval. By this I mean a way to fetch query results in parallel (perhaps across a distributed system) rather than streaming them back through the client. In these cases, conversion of a result set to a distributed object (like a `dask.dataframe`) could be done fairly efficiently, and in a way that the user can't easily compose using existing API methods.

Systems that support this natively:

- dask
- spark
- bigquery
- snowflake
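For context, here is roughly what that native batch retrieval looks like in one of those backends' own client libraries. This is only a sketch using snowflake-connector-python's `get_result_batches()` (the cursor method and `ResultBatch` objects are that library's API, not ibis; the connection parameters and table are placeholders):

```python
# Sketch: native "distributed batch retrieval" via the Snowflake connector.
# Connection parameters are placeholders; adapt to your account.
import snowflake.connector

con = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="my_wh"
)
cur = con.cursor()
cur.execute("SELECT * FROM my_table WHERE year = 2024")

# Each ResultBatch is a small, pickleable handle that knows how to fetch its
# own chunk of the result set -- it can be shipped to a worker and resolved
# there with .to_pandas() / .to_arrow(), without routing rows through the client.
batches = cur.get_result_batches()
first_df = batches[0].to_pandas()
```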
We could support this as a general method even for systems where it would be inefficient, but I'm not sure we'd want to do that. Better to raise an error than to accidentally pipe data slowly through the client and back out to a cluster (a user can fairly easily write that code themselves anyway).
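For contrast, the "through the client" path a user can already write by hand today looks roughly like this (assuming `expr` is an ibis table expression; `to_pandas` is existing ibis API, and the dask half is plain `dask.dataframe`):

```python
import dask.dataframe as dd

# Existing, client-bound path: the full result is materialized locally first,
# then re-partitioned and shipped back out to the cluster.
df = expr.to_pandas()                    # pulls every row through the client
ddf = dd.from_pandas(df, npartitions=8)  # arbitrary partition count
```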
We could expose this as a `to_dask` method on an expression that does all the fiddly bits and returns a `dask.dataframe` object.
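A hypothetical sketch of how that might look from the user's side (`to_dask` does not exist today; the connection string, table, and column are made up for illustration):

```python
import ibis
from ibis import _

con = ibis.connect("bigquery://my-project/my-dataset")   # placeholder URL
expr = con.table("events").filter(_.year == 2024)

# Proposed: hand back a dask DataFrame whose partitions are read in parallel
# on the cluster, instead of streaming rows through this client process.
ddf = expr.to_dask()
```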
Alternatively (or additionally), we could generalize this to a `to_batches` (or better name) method that returns a list of `Batch` objects, each of which has `to_pandas`/`to_arrow`/`to_polars` methods for fetching the partition as a specified type. These could be pickleable and distributed to any distributed system (dask/spark/ray/...).

Conversion to a dask dataframe would then be something like:
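The code block that followed is missing from this copy of the issue; a minimal sketch of the intended pattern, assuming the proposed `to_batches`/`Batch` API described above, might be:

```python
import dask
import dask.dataframe as dd

# Proposed pattern: each Batch is a small pickleable handle, so the actual
# fetch and conversion happen on the dask workers, not on the client.
batches = expr.to_batches()
parts = [dask.delayed(batch.to_pandas)() for batch in batches]
ddf = dd.from_delayed(parts)  # in practice you'd likely also pass meta=
```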