Arrow read support (read directly as `pyarrow.Table` or `polars.DataFrame`) #2776
Conversation
```diff
-if output_format.lower() == OutputFormat.EXPERIMENTAL_POLARS.lower():
-    data = pl.from_arrow(data)
+if output_format.lower() == OutputFormat.POLARS.lower():
+    data = pl.from_arrow(data, rechunk=False)
```
This `rechunk=False` is the only behavior change apart from the enum renames.
It is done to get better performance on read (rechunking requires memory copies and allocations).
In most cases the chunks will be of reasonable size (100k rows), so rechunking would not provide much benefit for Polars operations.
Users can always call `rechunk` on their dataframes if needed; an example is given in the notebook.
Detailed release notes to be written in PR description
```
    i-th entry corresponds to i-th element of `symbols`.
per_symbol_arrow_string_format_per_column: Optional[List[Optional[Dict[str, Union[ArrowOutputStringFormat, "pa.DataType"]]]]], default=None
    Per-symbol, per-column overrides. Takes precedence over all other string format settings.
    i-th entry corresponds to i-th element of `symbols`.
```
It's a bit odd that these are full arguments to `batch_read`, but undocumented kwargs to `read`/`head`/`tail`.
```
Default string column format when using `PYARROW` or `POLARS` output formats on this library.
If `None`, uses the `arrow_string_format_default` from the `Arctic` instance.
Can be overridden per read operation.
See `ArrowOutputStringFormat` documentation for details on available string formats.
```
I think we should have a note that these are not persisted lib config settings, they just apply to the returned library object
```
Does not deduplicate strings, so has better performance for columns with many unique strings.
Slightly faster than `LARGE_STRING` but does not work with very long strings.
Uses 32-bit variable-size encoding.
PyArrow: `pa.string()`, Polars: `pl.String`
```
Does this mean Polars converts everything to large string?
- "batche" typo
- Warning about lmdb already being open in the process should be removed if possible from the final version
- We should use `set_sorted` on timeseries index columns to speed up subsequent query performance when we know this to be the case (probably a separate PR)
- The `tz_data` symbol uses an unnamed index before that behaviour has been introduced; I would just give it a name to avoid confusion when reading the notebook from start to finish
- In the filtering and column selection section, worth mentioning that timeseries indexes are always fetched, even if not specified in the `columns` list
- "PyArrow is useful when you need direct access to Arrow's columnar format": I didn't understand what this means exactly? Does it expose more info on chunks than Polars?
- In the duckdb example, is there an alternative to `to_df()` that returns Polars or PyArrow, to show Pandas isn't necessary in that workflow?
- `print(f"Number of record batches: {arrow_table.num_rows}")` seems wrong
- "When Pandas DataFrames have unnamed indexes, Arrow/Polars use the special column name index": Polars uses "None"
- "Converting to Polars/PyArrow and back preserves Pandas metadata": I thought Polars dropped the Pandas metadata?
- Might be worth a cell on unnamed multiindexes
New Feature: PyArrow and Polars Output Formats
ArcticDB now supports returning data in Apache Arrow-based formats.
Overview
In addition to the existing `PANDAS` output format (which remains the default), you can now specify `PYARROW` or `POLARS` output formats for read operations. Both formats use Apache Arrow's columnar memory layout, enabling interoperability with Arrow-native libraries such as `polars` and `duckdb`.
Usage
For detailed usage see the Arrow notebook.
Output format can be configured at three levels:
1. `Arctic` instance level (default for all libraries)
2. Library level (default for all reads from this library)
3. Per read operation (most granular control)
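A hedged sketch of the three levels; the `output_format` keyword names at each level are an assumption based on this PR's description, not a confirmed API:

```python
import arcticdb as adb
from arcticdb import OutputFormat

# 1. Arctic instance level -- assumed keyword; default for all libraries.
ac = adb.Arctic("lmdb://tmp/arrow_demo", output_format=OutputFormat.PYARROW)

# 2. Library level -- assumed keyword; default for all reads from this library.
lib = ac.get_library("demo", create_if_missing=True, output_format=OutputFormat.POLARS)

# 3. Per read operation -- most granular, overrides the levels above.
table = lib.read("my_symbol", output_format=OutputFormat.PYARROW).data
```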
Output Format Options
- `OutputFormat.PANDAS` (default): Returns `pandas.DataFrame` backed by numpy arrays
- `OutputFormat.PYARROW`: Returns `pyarrow.Table` objects
- `OutputFormat.POLARS`: Returns `polars.DataFrame` objects

String Format Customization
When using Arrow-based output formats, you can customize how string columns are encoded using `ArrowOutputStringFormat`:
- `LARGE_STRING` (default): PyArrow: `pa.large_string()`, Polars: `pl.String`
- `SMALL_STRING`: PyArrow: `pa.string()`, Polars: `pl.String`
- `CATEGORICAL`/`DICTIONARY_ENCODED`: PyArrow: `pa.dictionary(pa.int32(), pa.large_string())`, Polars: `pl.Categorical`

Example: Using String Formats
Supported Operations
All read operations support the new output formats:
- `read()`
- `read_batch()`
- `read_batch_and_join()`
- `head()`
- `tail()`
- Lazy reads (`lazy=True`)

Notes
- Output format defaults to `PANDAS` for backward compatibility
- `PYARROW` and `POLARS` formats share the same underlying Arrow memory layout