This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
Writing records row by row #1053
ishitatsuyuki started this conversation in General
-
I'm currently working on a profiler that uses Parquet as the on-disk profile format. We don't do much columnar processing, but we would like to take advantage of Parquet's compression and efficiency benefits.
The records therefore come in row by row, whereas the arrow2 crate's writer only accepts an entire column of a row group at a time. It should be technically possible to serialize more tightly: with delta bit-packed encoding, for example, data can be serialized as soon as a minibatch's worth of values has been buffered. Doing so gives the in-memory buffer a smaller footprint.
As such, it might be a worthwhile addition to the API to support writing records one by one.
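To make the constraint concrete, here is a minimal Rust sketch of the buffering it implies: records accumulate in per-column Vecs and are only handed to the columnar writer once a whole row group's worth is in memory. The `Record` shape, the `ROWS_PER_GROUP` threshold, and `write_row_group` are hypothetical placeholders standing in for the actual arrow2 writer call, not part of any crate's API.

```rust
const ROWS_PER_GROUP: usize = 64 * 1024;

/// One profiler sample (hypothetical shape).
struct Record {
    timestamp: i64,
    duration_ns: u64,
    symbol: String,
}

/// Column buffers for a pending row group.
#[derive(Default)]
struct RowGroupBuffer {
    timestamps: Vec<i64>,
    durations: Vec<u64>,
    symbols: Vec<String>,
}

impl RowGroupBuffer {
    fn push(&mut self, r: Record) {
        self.timestamps.push(r.timestamp);
        self.durations.push(r.duration_ns);
        self.symbols.push(r.symbol);
    }

    fn len(&self) -> usize {
        self.timestamps.len()
    }
}

/// Placeholder for the call that builds arrays and writes a full row group
/// (e.g. via arrow2's parquet writer).
fn write_row_group(buf: RowGroupBuffer) {
    let _ = buf;
}

fn ingest(records: impl Iterator<Item = Record>) {
    let mut buf = RowGroupBuffer::default();
    for r in records {
        buf.push(r);
        // The whole row group sits in memory until this threshold is hit.
        if buf.len() >= ROWS_PER_GROUP {
            write_row_group(std::mem::take(&mut buf));
        }
    }
    // Flush whatever is left at the end of the stream.
    if buf.len() > 0 {
        write_row_group(buf);
    }
}
```

The peak memory here is a full uncompressed row group per flush, which is exactly the footprint the proposal above wants to shrink.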
Replies: 1 comment · 3 replies
-
Hey @ishitatsuyuki, and thanks for the issue. I converted it to a discussion for now since it seems a bit too broad to be actionable (yet). Maybe the primary question I have here: do you need to go through Arrow? It seems to me that you can skip Arrow entirely and work directly with Parquet. I am asking because it looks like you want to write to Parquet in a streaming fashion, and thus having "arrow arrays" does not seem beneficial. Thinking through this angle,
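To sketch what that could look like, here is a rough, hypothetical example of writing in a streaming fashion without building Arrow arrays: each value goes straight into a per-column page encoder, and a page is emitted as soon as a minibatch fills. `PageEncoder`, `MINIBATCH`, and the naive byte serialization in `emit_page` are illustrative stand-ins for real delta bit-packed encoding and compression, not the parquet or parquet2 API.

```rust
const MINIBATCH: usize = 1024;

/// Incremental encoder for one i64 column of a row group.
struct PageEncoder {
    pending: Vec<i64>,
    encoded_pages: Vec<Vec<u8>>,
}

impl PageEncoder {
    fn new() -> Self {
        Self {
            pending: Vec::with_capacity(MINIBATCH),
            encoded_pages: Vec::new(),
        }
    }

    /// Buffer one value; flush a page as soon as a minibatch is full.
    fn push(&mut self, value: i64) {
        self.pending.push(value);
        if self.pending.len() == MINIBATCH {
            self.emit_page();
        }
    }

    /// Placeholder for encoding + compressing the buffered minibatch into a page.
    fn emit_page(&mut self) {
        let encoded: Vec<u8> = self
            .pending
            .drain(..)
            .flat_map(|v| v.to_le_bytes())
            .collect();
        self.encoded_pages.push(encoded);
    }

    /// Finish the column chunk: flush the tail and hand back the encoded pages.
    fn finish(mut self) -> Vec<Vec<u8>> {
        if !self.pending.is_empty() {
            self.emit_page();
        }
        self.encoded_pages
    }
}
```

With this shape, the unencoded buffer never exceeds one minibatch per column; only already-encoded (and typically compressed) pages accumulate until the row group is closed.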