Arrow read support (read directly as `pyarrow.Table` or `polars.DataFrame`) #2776
Conversation
```diff
-if output_format.lower() == OutputFormat.EXPERIMENTAL_POLARS.lower():
-    data = pl.from_arrow(data)
+if output_format.lower() == OutputFormat.POLARS.lower():
+    data = pl.from_arrow(data, rechunk=False)
```
This `rechunk=False` is the only behavior change apart from the enum renames.
It is done to get better performance on read (rechunking requires memory copies and allocations).
In most cases the chunks will be of reasonable size (100k rows), so rechunking would not provide much benefit for Polars operations.
Users can always call `rechunk` on their dataframes if needed; an example is given in the notebook.
Detailed release notes to be written in PR description
```
    i-th entry corresponds to i-th element of `symbols`.
per_symbol_arrow_string_format_per_column: Optional[List[Optional[Dict[str, Union[ArrowOutputStringFormat, "pa.DataType"]]]]], default=None
    Per-symbol, per-column overrides. Takes precedence over all other string format settings.
    i-th entry corresponds to i-th element of `symbols`.
```
It's a bit odd that these are full arguments to `batch_read`, but undocumented kwargs to `read`/`head`/`tail`.
```
Default string column format when using `PYARROW` or `POLARS` output formats on this library.
If `None`, uses the `arrow_string_format_default` from the `Arctic` instance.
Can be overridden per read operation.
See `ArrowOutputStringFormat` documentation for details on available string formats.
```
I think we should have a note that these are not persisted lib config settings, they just apply to the returned library object
```
Does not deduplicate strings, so has better performance for columns with many unique strings.
Slightly faster than `LARGE_STRING` but does not work with very long strings.
Uses 32-bit variable-size encoding.
PyArrow: `pa.string()`, Polars: `pl.String`
```
Does this mean Polars converts everything to large string?
- "batche" typo
- Warning about lmdb already being open in the process should be removed if possible from the final version
- We should use `set_sorted` on timeseries index columns to speed up subsequent query performance when we know this to be the case (probably a separate PR)
- The `tz_data` symbol uses an unnamed index before that behaviour has been introduced; I would just give it a name to avoid confusion when reading the notebook from start to finish
- In the filtering and column selection section, worth mentioning that timeseries indexes are always fetched, even if not specified in the `columns` list
- "PyArrow is useful when you need direct access to Arrow's columnar format": I didn't understand what this means exactly? Does it expose more info on chunks than Polars?
- In the duckdb example, is there an alternative to `to_df()` that returns Polars or PyArrow, to show Pandas isn't necessary in that workflow?
- `print(f"Number of record batches: {arrow_table.num_rows}")` seems wrong
- "When Pandas DataFrames have unnamed indexes, Arrow/Polars use the special column name index": Polars uses "None"
- "Converting to Polars/PyArrow and back preserves Pandas metadata": I thought Polars dropped the Pandas metadata?
- Might be worth a cell on unnamed multiindexes
New Feature: PyArrow and Polars Output Formats
ArcticDB now supports returning data in Apache Arrow-based formats.
Overview
In addition to the existing `PANDAS` output format (which remains the default), you can now specify `PYARROW` or `POLARS` output formats for read operations. Both formats use Apache Arrow's columnar memory layout, enabling interoperability with Arrow-native libraries such as `polars` and `duckdb`.
Usage
For detailed usage see the Arrow notebook.
Output format can be configured at three levels:
1. `Arctic` instance level (default for all libraries)
2. Library level (default for all reads from this library)
3. Per read operation (most granular control)
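A hedged sketch of the three levels; the `output_format` keyword names at each level are an assumption based on this PR's description, not a confirmed API:

```python
import arcticdb as adb
from arcticdb import OutputFormat

# 1. Arctic instance level -- assumed keyword; default for all libraries.
ac = adb.Arctic("lmdb://tmp/arrow_demo", output_format=OutputFormat.PYARROW)

# 2. Library level -- assumed keyword; default for all reads from this library.
lib = ac.get_library("demo", create_if_missing=True, output_format=OutputFormat.POLARS)

# 3. Per read operation -- most granular, overrides the levels above.
table = lib.read("my_symbol", output_format=OutputFormat.PYARROW).data
```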
Output Format Options
- `OutputFormat.PANDAS` (default): Returns `pandas.DataFrame` backed by numpy arrays
- `OutputFormat.PYARROW`: Returns `pyarrow.Table` objects
- `OutputFormat.POLARS`: Returns `polars.DataFrame` objects

String Format Customization
When using Arrow-based output formats, you can customize how string columns are encoded using `ArrowOutputStringFormat`:
- `LARGE_STRING` (default): PyArrow: `pa.large_string()`, Polars: `pl.String`
- `SMALL_STRING`: PyArrow: `pa.string()`, Polars: `pl.String`
- `CATEGORICAL`/`DICTIONARY_ENCODED`: PyArrow: `pa.dictionary(pa.int32(), pa.large_string())`, Polars: `pl.Categorical`

Example: Using String Formats
Supported Operations
All read operations support the new output formats:
- `read()`
- `read_batch()`
- `read_batch_and_join()`
- `head()`
- `tail()`
- Lazy reads (`lazy=True`)

Notes
- Output format defaults to `PANDAS` for backward compatibility
- `PYARROW` and `POLARS` formats share the same underlying Arrow memory layout