@IvoDD IvoDD commented Nov 21, 2025

New Feature: PyArrow and Polars Output Formats

ArcticDB now supports returning data in Apache Arrow-based formats.

Overview

In addition to the existing PANDAS output format (which remains the default), you can now specify PYARROW or POLARS output formats for read operations. Both formats use Apache Arrow's columnar memory layout, enabling:

  • Better performance: Especially for dataframes with many string columns
  • Integration with modern systems: Native zero-copy integration with Arrow-based data processing tools such as Polars and DuckDB

Usage

For detailed usage, see the Arrow notebook.

Output format can be configured at three levels:

1. Arctic instance level (default for all libraries):

import arcticdb as adb
from arcticdb import OutputFormat

# Set default for all libraries from this Arctic instance
ac = adb.Arctic(uri, output_format=OutputFormat.PYARROW)

2. Library level (default for all reads from this library):

# Set at library creation
lib = ac.create_library("my_lib", output_format=OutputFormat.POLARS)

# Or when getting an existing library
lib = ac.get_library("my_lib", output_format=OutputFormat.PYARROW)

3. Per read operation (most granular control):

# Override on a per-read basis
result = lib.read("symbol", output_format=OutputFormat.PYARROW)
table = result.data  # Returns pyarrow.Table

result = lib.read("symbol", output_format=OutputFormat.POLARS)
df = result.data  # Returns polars.DataFrame

Output Format Options

  • OutputFormat.PANDAS (default): Returns pandas.DataFrame backed by numpy arrays
  • OutputFormat.PYARROW: Returns pyarrow.Table objects
  • OutputFormat.POLARS: Returns polars.DataFrame objects

String Format Customization

When using Arrow-based output formats, you can customize how string columns are encoded using ArrowOutputStringFormat:

LARGE_STRING (default):

  • 64-bit variable-size encoding
  • Best for general-purpose use and large string data
  • PyArrow: pa.large_string(), Polars: pl.String

SMALL_STRING:

  • 32-bit variable-size encoding
  • More memory efficient for smaller string data
  • PyArrow: pa.string(), Polars: pl.String

CATEGORICAL / DICTIONARY_ENCODED:

  • Dictionary-encoded with int32 indices
  • Best for low cardinality columns (few unique values repeated many times)
  • Deduplicates strings for significant memory savings
  • PyArrow: pa.dictionary(pa.int32(), pa.large_string()), Polars: pl.Categorical

Example: Using String Formats

from arcticdb import ArrowOutputStringFormat

# Set default string format at Arctic level
ac = adb.Arctic(
    uri,
    output_format=OutputFormat.PYARROW,
    arrow_string_format_default=ArrowOutputStringFormat.CATEGORICAL
)

# Or per-column during read
result = lib.read(
    "symbol",
    output_format=OutputFormat.PYARROW,
    arrow_string_format_per_column={
        "country": ArrowOutputStringFormat.CATEGORICAL,  # Low cardinality
        "description": ArrowOutputStringFormat.LARGE_STRING  # High cardinality
    }
)

Supported Operations

All read operations support the new output formats:

  • read()
  • read_batch()
  • read_batch_and_join()
  • head()
  • tail()
  • LazyDataFrame operations with lazy=True

Notes

@IvoDD IvoDD added the minor Feature change, should increase minor version label Nov 21, 2025
@IvoDD IvoDD changed the title Arrow read support (allow reading directly as pyarrow.Table or polars.DataFrame) Arrow read support (read directly as pyarrow.Table or polars.DataFrame) Nov 21, 2025
if output_format.lower() == OutputFormat.EXPERIMENTAL_POLARS.lower():
    data = pl.from_arrow(data)
if output_format.lower() == OutputFormat.POLARS.lower():
    data = pl.from_arrow(data, rechunk=False)
This `rechunk=False` is the only behavior change apart from the enum renames.

It is done to get better performance on read (rechunking requires memory copies and allocations). In most cases the chunks will be of a reasonable size (100k rows), so rechunking would not provide much benefit for Polars operations.

Users can always call `rechunk` on their dataframes if needed; an example is given in the notebook.

i-th entry corresponds to i-th element of `symbols`.
per_symbol_arrow_string_format_per_column: Optional[List[Optional[Dict[str, Union[ArrowOutputStringFormat, "pa.DataType"]]]]], default=None
Per-symbol, per-column overrides. Takes precedence over all other string format settings.
i-th entry corresponds to i-th element of `symbols`.
It's a bit odd that these are full arguments to `read_batch`, but undocumented kwargs to read/head/tail

Default string column format when using `PYARROW` or `POLARS` output formats on this library.
If `None`, uses the `arrow_string_format_default` from the `Arctic` instance.
Can be overridden per read operation.
See `ArrowOutputStringFormat` documentation for details on available string formats.
I think we should have a note that these are not persisted lib config settings, they just apply to the returned library object

Does not deduplicate strings, so has better performance for columns with many unique strings.
Slightly faster than `LARGE_STRING` but does not work with very long strings.
Uses 32-bit variable-size encoding.
PyArrow: `pa.string()`, Polars: `pl.String`
Does this mean Polars converts everything to large string?

  • "batche" typo
  • Warning about lmdb already being open in the process should be removed if possible from the final version
  • We should use set_sorted on timeseries index columns to speed up subsequent query performance when we know this to be the case (probably a separate PR)
  • The tz_data symbol uses an unnamed index before that behaviour has been introduced; I would just give it a name to avoid confusion when reading the notebook from start to finish
  • In the filtering and column selection section, worth mentioning that timeseries indexes are always fetched, even if not specified in the columns list
  • "PyArrow is useful when you need direct access to Arrow's columnar format" I didn't understand what this means exactly? Does it expose more info on chunks than Polars?
  • In the duckdb example, is there an alternative to to_df() that returns Polars or PyArrow, to show Pandas isn't necessary in that workflow?
  • print(f"Number of record batches: {arrow_table.num_rows}") seems wrong
  • "When Pandas DataFrames have unnamed indexes, Arrow/Polars use the special column name index" Polars uses "None"
  • "Converting to Polars/PyArrow and back preserves Pandas metadata" - I thought Polars dropped the Pandas metadata?
  • Might be worth a cell on unnamed multiindexes
