2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "wolfxl"
version = "0.1.2"
version = "0.2.0"
edition = "2021"
license = "MIT"
description = "Fast, openpyxl-compatible Excel I/O backed by Rust"
42 changes: 42 additions & 0 deletions README.md
@@ -125,6 +125,48 @@ Every Rust-backed Python Excel project picks a different slice of the problem.

Upstream [calamine](https://github.com/tafia/calamine) does not parse styles. WolfXL's read engine uses [calamine-styles](https://crates.io/crates/calamine-styles), a fork that adds Font/Fill/Border/Alignment/NumberFormat extraction from OOXML.

## Batch APIs for Maximum Speed

For write-heavy workloads, use `append()` or `write_rows()` instead of cell-by-cell access. These APIs buffer rows as raw Python lists and flush them to Rust in a single call at save time, bypassing per-cell FFI overhead entirely.

```python
from wolfxl import Workbook

wb = Workbook()
ws = wb.active

# append() — fast sequential writes (3.7x faster than openpyxl)
ws.append(["Name", "Amount", "Date"])
for row in data:
    ws.append(row)

# write_rows() — fast writes at arbitrary positions
ws.write_rows(header_grid, start_row=1, start_col=1)
ws.write_rows(data_grid, start_row=5, start_col=1)

wb.save("output.xlsx")
```

For reads, `iter_rows(values_only=True)` uses a fast bulk path that reads all values in a single Rust call (6.7x faster than openpyxl):

```python
from wolfxl import load_workbook

wb = load_workbook("data.xlsx")
ws = wb[wb.sheetnames[0]]
for row in ws.iter_rows(values_only=True):
    process(row)  # row is a tuple of plain Python values
```

| API | vs openpyxl | How |
|-----|-------------|-----|
| `ws.append(row)` | **3.7x** faster write | Buffers rows, single Rust call at save |
| `ws.write_rows(grid)` | **3.7x** faster write | Same mechanism, arbitrary start position |
| `ws.iter_rows(values_only=True)` | **6.7x** faster read | Single Rust call, no Cell objects |
| `ws.cell(r, c, value=v)` | **1.6x** faster write | Per-cell FFI (compatible but slower) |
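
The per-cell path remains available for drop-in openpyxl compatibility; it simply pays more FFI cost than the batch APIs above. A minimal sketch (the values are illustrative):

```python
# openpyxl-compatible per-cell API; works unchanged, but slower than
# append()/write_rows() for large grids
ws.cell(row=1, column=1, value="Name")
ws.cell(row=1, column=2, value="Amount")
```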

## Case Study: SynthGL

[SynthGL](https://synthgl.dev) switched from openpyxl to WolfXL for their GL journal exports (14-column financial data, 1K-50K rows). Results: **4x faster writes**, **9x faster reads** at scale. 50K-row exports dropped from 7.6s to 1.3s. [Read the full case study](docs/case-study-synthgl.md).

## How It Works

WolfXL is a thin Python layer over compiled Rust engines, connected via [PyO3](https://pyo3.rs/). The Python side uses **lazy cell proxies** — opening a 10M-cell file is instant. Values and styles are fetched from Rust only when you access them. On save, dirty cells are flushed in one batch, avoiding per-cell FFI overhead.
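
As a rough illustration of that lazy-proxy idea, here is a simplified sketch; it is not WolfXL's actual internal code, and `reader.get_value()` stands in for the real PyO3-backed fetch:

```python
class LazyCell:
    """Proxy that defers the Rust FFI fetch until .value is first read."""
    __slots__ = ("_reader", "_row", "_col", "_cached", "_loaded")

    def __init__(self, reader, row, col):
        self._reader, self._row, self._col = reader, row, col
        self._cached, self._loaded = None, False

    @property
    def value(self):
        if not self._loaded:
            # The FFI call happens only here, on first access; opening
            # the workbook created no cell data at all.
            self._cached = self._reader.get_value(self._row, self._col)
            self._loaded = True
        return self._cached
```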
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "maturin"

[project]
name = "wolfxl"
version = "0.1.2"
version = "0.2.0"
description = "Fast, openpyxl-compatible Excel I/O backed by Rust"
requires-python = ">=3.9"
license = { text = "MIT" }
155 changes: 140 additions & 15 deletions python/wolfxl/_worksheet.py
@@ -18,7 +18,7 @@ class Worksheet:
    __slots__ = (
        "_workbook", "_title", "_cells", "_dirty", "_dimensions",
        "_max_col_idx", "_next_append_row",
-        "_append_buffer", "_append_buffer_start",
+        "_append_buffer", "_append_buffer_start", "_bulk_writes",
    )

    def __init__(self, workbook: Workbook, title: str) -> None:
@@ -32,6 +32,8 @@ def __init__(self, workbook: Workbook, title: str) -> None:
        # Fast-path append buffer: raw value lists, no Cell objects.
        self._append_buffer: list[list[Any]] = []
        self._append_buffer_start: int = 1
        # Bulk write buffer: list of (grid, start_row, start_col) tuples.
        self._bulk_writes: list[tuple[list[list[Any]], int, int]] = []

    @property
    def title(self) -> str:
@@ -51,6 +53,9 @@ def title(self, value: str) -> None:
        wb._sheet_names[idx] = value  # noqa: SLF001
        wb._sheets[value] = wb._sheets.pop(old)  # noqa: SLF001
        self._title = value
        # Sync the Rust writer so ensure_sheet_exists() sees the new name.
        if wb._rust_writer is not None:  # noqa: SLF001
            wb._rust_writer.rename_sheet(old, value)  # noqa: SLF001

    # ------------------------------------------------------------------
    # Cell access
@@ -131,6 +136,64 @@ def _materialize_append_buffer(self) -> None:
                self.cell(row=r, column=c, value=val)
        # Buffer is already cleared above.

    def write_rows(
        self,
        rows: list[list[Any]],
        start_row: int = 1,
        start_col: int = 1,
    ) -> None:
        """Bulk-write a 2D grid of values starting at (start_row, start_col).

        Unlike ``append()``, this writes to an arbitrary position. Values are
        buffered and flushed via a single ``write_sheet_values()`` Rust call
        at save time, avoiding per-cell FFI overhead.

        ``rows`` is a list of lists. Each inner list is one row of values.
        """
        if not rows:
            return
        # Store a shallow copy so flush can safely mutate without affecting caller.
        copied = [list(row) for row in rows]
        self._bulk_writes.append((copied, start_row, start_col))

    def _materialize_bulk_writes(self) -> None:
        """Convert bulk write buffers into Cell objects.

        Called before the patcher flush path, which has no batch API and
        needs all values as individual dirty cells.
        """
        writes = self._bulk_writes
        if not writes:
            return
        self._bulk_writes = []
        for grid, sr, sc in writes:
            for ri, row in enumerate(grid):
                for ci, val in enumerate(row):
                    if val is not None:
                        self.cell(row=sr + ri, column=sc + ci, value=val)

    @staticmethod
    def _extract_non_batchable(
        grid: list[list[Any]], start_row: int, start_col: int,
    ) -> list[tuple[int, int, Any]]:
        """Extract non-batchable values from grid, replacing them with None.

        Non-batchable: booleans, formulas (str starting with '='), and
        non-primitive types (dates, datetimes, etc.). These require
        per-cell ``write_cell_value()`` calls with type-preserving payloads.
        """
        indiv: list[tuple[int, int, Any]] = []
        for ri, row in enumerate(grid):
            for ci, val in enumerate(row):
                if val is not None and (
                    isinstance(val, bool)
                    or (isinstance(val, str) and val.startswith("="))
                    or not isinstance(val, (int, float, str))
                ):
                    indiv.append((start_row + ri, start_col + ci, val))
                    row[ci] = None
        return indiv

    # ------------------------------------------------------------------
    # Iteration
    # ------------------------------------------------------------------
@@ -144,6 +207,11 @@ def iter_rows(
        values_only: bool = False,
    ) -> Iterator[tuple[Any, ...]]:
        """Iterate over rows in a range. Matches openpyxl's iter_rows API."""
        # Fast bulk path: read-mode + values_only -> single Rust FFI call.
        if values_only and self._workbook._rust_reader is not None:  # noqa: SLF001
            yield from self._iter_rows_bulk(min_row, max_row, min_col, max_col)
            return

        r_min = min_row or 1
        r_max = max_row or self._max_row()
        c_min = min_col or 1
@@ -159,6 +227,61 @@ def iter_rows(
                self._get_or_create_cell(r, c) for c in range(c_min, c_max + 1)
            )

    def _iter_rows_bulk(
        self,
        min_row: int | None,
        max_row: int | None,
        min_col: int | None,
        max_col: int | None,
    ) -> Iterator[tuple[Any, ...]]:
        """Bulk-read values via a single Rust FFI call (values_only fast path).

        Uses ``read_sheet_values_plain()`` when available (returns native
        Python objects), falling back to ``read_sheet_values()`` + per-cell
        ``_payload_to_python()`` conversion otherwise.
        """
        from wolfxl._cell import _payload_to_python

        reader = self._workbook._rust_reader  # noqa: SLF001
        sheet = self._title

        # Build an A1:B2-style range string for Rust.
        r_min = min_row or 1
        r_max = max_row or self._max_row()
        c_min = min_col or 1
        c_max = max_col or self._max_col()
        range_str = f"{rowcol_to_a1(r_min, c_min)}:{rowcol_to_a1(r_max, c_max)}"

        # Prefer plain-value read (no dict overhead) if available.
        use_plain = hasattr(reader, "read_sheet_values_plain")
        if use_plain:
            rows = reader.read_sheet_values_plain(sheet, range_str)
        else:
            rows = reader.read_sheet_values(sheet, range_str)

        if not rows:
            return

        # Rows may come back shorter than the requested range, so pad or
        # trim each one to the expected width before yielding.
        expected_cols = c_max - c_min + 1
        for row in rows:
            if use_plain:
                # Already native Python values; pad/trim to expected width.
                n = len(row)
                if n >= expected_cols:
                    yield tuple(row[:expected_cols])
                else:
                    yield tuple(row) + (None,) * (expected_cols - n)
            else:
                # Dict payloads need conversion.
                vals = [_payload_to_python(cell) for cell in row]
                n = len(vals)
                if n >= expected_cols:
                    yield tuple(vals[:expected_cols])
                else:
                    yield tuple(vals) + (None,) * (expected_cols - n)

    def _read_dimensions(self) -> tuple[int, int]:
        """Discover sheet dimensions from the Rust backend (read mode only)."""
        if self._dimensions is not None:
@@ -220,10 +343,12 @@ def _flush(self) -> None:
        writer = wb._rust_writer  # noqa: SLF001

        if patcher is not None:
-            # Modify mode: materialize append buffer first (patcher has no
-            # batch API), then flush dirty cells individually.
+            # Modify mode: materialize buffers first (patcher has no batch
+            # API), then flush dirty cells individually.
            if self._append_buffer:
                self._materialize_append_buffer()
            if self._bulk_writes:
                self._materialize_bulk_writes()
            self._flush_to_patcher(patcher, python_value_to_payload,
                                    font_to_format_dict, fill_to_format_dict,
                                    alignment_to_format_dict, border_to_rust_dict)
@@ -257,18 +382,7 @@ def _flush_to_writer(
        start_row = self._append_buffer_start
        start_a1 = rowcol_to_a1(start_row, 1)

-        # Scan for non-batchable values and replace with None in the grid.
-        # Non-batchable values get individual write_cell_value calls after.
-        indiv_from_buf: list[tuple[int, int, Any]] = []
-        for ri, row in enumerate(buf):
-            for ci, val in enumerate(row):
-                if val is not None and (
-                    isinstance(val, bool)
-                    or (isinstance(val, str) and val.startswith("="))
-                    or not isinstance(val, (int, float, str))
-                ):
-                    indiv_from_buf.append((start_row + ri, ci + 1, val))
-                    row[ci] = None  # will be skipped by write_sheet_values
+        indiv_from_buf = self._extract_non_batchable(buf, start_row, 1)

        writer.write_sheet_values(self._title, start_a1, buf)

@@ -279,6 +393,17 @@

        self._append_buffer = []

        # -- Flush bulk writes (write_rows) -----------------------------------
        for grid, sr, sc in self._bulk_writes:

> **P1: Flush write_rows queue when saving in modify mode**
>
> `write_rows()` now queues grids into `_bulk_writes`, but that queue is only consumed in the writer-specific flush loop shown here. In `load_workbook(..., modify=True)` flows, `_flush()` uses the patcher path and never drains `_bulk_writes`, so `ws.write_rows(...)` changes are silently dropped on save unless each target cell is also written through the dirty-cell path.

**Collaborator (author):** Fixed in a56f0fc. Added a `_materialize_bulk_writes()` method that converts bulk write buffers into `Cell` objects. The patcher flush path now calls it (alongside the existing `_materialize_append_buffer()`) before processing dirty cells, which ensures `write_rows()` data is flushed in modify mode.
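
A hypothetical repro sketch of the pre-fix failure mode described in this thread (the file name is illustrative; `modify=True` is taken from the comment above):

```python
from wolfxl import load_workbook

# Modify mode routes _flush() through the patcher path.
wb = load_workbook("report.xlsx", modify=True)
ws = wb[wb.sheetnames[0]]
ws.write_rows([["Q1", 100], ["Q2", 200]], start_row=10, start_col=2)
# Before the fix, _bulk_writes was never drained on this path and the
# values were silently dropped; after a56f0fc they are materialized as
# dirty cells and flushed.
wb.save("report.xlsx")
```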

            start_a1 = rowcol_to_a1(sr, sc)
            indiv_from_bulk = self._extract_non_batchable(grid, sr, sc)
            writer.write_sheet_values(self._title, start_a1, grid)
            for r, c, val in indiv_from_bulk:
                coord = rowcol_to_a1(r, c)
                payload = python_value_to_payload(val)
                writer.write_cell_value(self._title, coord, payload)
        self._bulk_writes = []

        # -- Partition dirty cells into batch-eligible values vs individual ----
        #
        # "batchable" = value is int | float | str | None (not bool, not