Tips for improving encoding time for large vecs/hashmaps? #463
-
If you have a single buffer that contains all strings, you can return that one and the positions to then extract the substrings using
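To make the single-buffer idea concrete, here is a minimal sketch in plain Rust with no Rustler types involved. The names `PackedStrings`, `pack_strings` and `extract` are made up for illustration, and the `(start, length)` layout is just one reasonable choice of "positions". In a real NIF the buffer would be returned as one binary and the positions as a list of integer pairs, so only one large term has to be encoded instead of one term per string.

```rust
/// All strings packed back to back in one buffer, plus where each one lives.
/// (Hypothetical type for illustration, not part of Rustler or arrow2.)
pub struct PackedStrings {
    /// Concatenated UTF-8 bytes of every string.
    pub buffer: Vec<u8>,
    /// (start, length) in bytes for each string, in input order.
    pub positions: Vec<(usize, usize)>,
}

/// Pack a slice of strings into a single buffer and a position table.
pub fn pack_strings(strings: &[&str]) -> PackedStrings {
    let total: usize = strings.iter().map(|s| s.len()).sum();
    let mut buffer = Vec::with_capacity(total);
    let mut positions = Vec::with_capacity(strings.len());
    for s in strings {
        // Record where this string starts before appending its bytes.
        positions.push((buffer.len(), s.len()));
        buffer.extend_from_slice(s.as_bytes());
    }
    PackedStrings { buffer, positions }
}

/// Receiver-side view: slice one string back out of the packed buffer.
/// Returns None for an out-of-range index or invalid UTF-8.
pub fn extract(packed: &PackedStrings, index: usize) -> Option<&str> {
    let &(start, len) = packed.positions.get(index)?;
    std::str::from_utf8(packed.buffer.get(start..start + len)?).ok()
}
```

For example, packing `["ab", "", "cde"]` gives the buffer `abcde` and the positions `[(0, 2), (2, 0), (2, 3)]`. The caller slices each substring out of the one buffer rather than receiving three separately encoded strings.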
-
Unfortunately it will be a mix of strings, integers, structs, etc. You can see an example here: https://github.com/joshuataylor/serde_examples/blob/main/native/serde_examples/src/lib.rs#L57
-
Well, in an Arrow frame it won't. The whole point of Arrow is to have per-column homogeneous data, so you can check in the Arrow schema whether a column is a string column and return it in the way that I suggested. The relevant methods are (I'll take the liberty of moving this into a discussion, it's not really an issue with Rustler).
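A rough sketch of that per-column dispatch, under some loud assumptions: `Column` and `Encoded` below are hypothetical stand-ins, not the arrow2 API, and offsets are plain `usize` values here for simplicity. What the sketch relies on is that an Arrow string column already stores its data as one values buffer plus an offsets buffer, so a string column can be handed over as "buffer plus positions" without touching the individual strings.

```rust
/// Hypothetical stand-in for a typed Arrow column; NOT the arrow2 API.
pub enum Column {
    /// String column: `offsets` has one more entry than there are values,
    /// and value i occupies values[offsets[i]..offsets[i + 1]].
    Utf8 { offsets: Vec<usize>, values: Vec<u8> },
    /// Integer column.
    Int64(Vec<i64>),
}

/// Hypothetical shape of what crosses the NIF boundary for one column.
#[derive(Debug, PartialEq)]
pub enum Encoded {
    /// One buffer plus (start, length) pairs, sliced on the receiving side.
    Packed { buffer: Vec<u8>, positions: Vec<(usize, usize)> },
    /// Integers are passed through unchanged.
    Integers(Vec<i64>),
}

/// Choose an encoding per column based on its type.
/// Assumes `offsets` is non-decreasing, as it is in a valid Arrow array.
pub fn encode_column(column: Column) -> Encoded {
    match column {
        Column::Utf8 { offsets, values } => {
            // Turn consecutive offsets into (start, length) pairs.
            let positions = offsets
                .windows(2)
                .map(|w| (w[0], w[1] - w[0]))
                .collect();
            Encoded::Packed { buffer: values, positions }
        }
        Column::Int64(v) => Encoded::Integers(v),
    }
}
```

With offsets `[0, 2, 2, 5]` over the bytes `abcde`, the string column comes out as the same buffer plus positions `[(0, 2), (2, 0), (2, 3)]`, and the values buffer is moved rather than copied.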
-
Awesome! Thanks so much for setting up discussions; I wasn't sure where to post this (as you mentioned, it's not an issue). Regarding the initial thread:
To this:
This is a specific integration with Snowflake; I'm sure we'll also have a generic NIF at some point (or people can just use polars/nx). I also really appreciate your comments. The community here and over in Rust land is fantastic 🙌
Edit: I'm going to do an experiment and return all columns as-is, then do List.zip across the columns in Elixir.
-
Hi!
I'm writing a library for Elixir that deserialises Apache Arrow data, specifically IPC streaming files, using arrow2, and then returns it to Elixir as rows. Using Rustler for this has been an amazing experience, and has taught me a lot about Rust (as a Rust beginner).
This is for a Snowflake adapter for Elixir. Snowflake can return either JSON or Arrow, and in my initial benchmarks it returns results 2-3x faster when it sends Arrow than when it sends JSON.
I seem to have hit a problem when returning a large number of strings back to Elixir, as it needs to encode each one? Maybe there is a more efficient way to return data?
Here is an example repo I have: https://github.com/joshuataylor/serde_examples
My results across three different systems:
1/ My desktop, a 32-core Threadripper 2990WX, designed for multicore rather than single-threaded performance :)
2/ A desktop 6-core 5600X, pretty decent single-core performance.
3/ My laptop, a 2020 M1 MacBook Air: