Skip to content

Conversation

@PSU3D0
Copy link

@PSU3D0 PSU3D0 commented Jul 7, 2025

Overview

This PR introduces optional formula reading capabilities to python-calamine, allowing users to access the underlying formulas in spreadsheet cells rather than just their calculated values.

Motivation

Previously, python-calamine only provided access to cell values (the calculated results of formulas). Users had no way to access the actual formula expressions, which is essential for:

  • Spreadsheet analysis and auditing
  • Formula debugging and validation
  • Converting between different spreadsheet formats while preserving formulas
  • Building tools that need to understand the calculation logic

Implementation

Core Changes

  1. Opt-in Design: Formula reading is disabled by default (read_formulas=False) to maintain backward compatibility and avoid performance overhead for users who don't need formulas.

  2. New API Parameter: Added read_formulas parameter to all workbook creation methods:

    • CalamineWorkbook.from_object(path_or_filelike, read_formulas=False)
    • CalamineWorkbook.from_path(path, read_formulas=False)
    • CalamineWorkbook.from_filelike(filelike, read_formulas=False)
    • load_workbook(path_or_filelike, read_formulas=False)
  3. Formula Iterator: Added CalamineSheet.iter_formulas() method that returns a CalamineFormulaIterator with consistent dimensions matching the data iterator.

  4. Performance Optimization: Implemented on-demand coordinate mapping instead of pre-allocating expanded ranges, ensuring minimal memory overhead.

Technical Details

  • Coordinate Consistency: Formula iterator returns the same dimensions as data iterator, with empty strings for cells without formulas
  • Graceful Error Handling: Clear error message when attempting to access formulas without enabling the feature
  • Memory Efficient: Uses get_value() lookups rather than expanding formula ranges upfront
  • Iterator Properties: Both CalamineCellIterator and CalamineFormulaIterator expose position, start, width, and height properties for coordinate recovery

File Format Support

Supports formula reading across all major spreadsheet formats:

  • Excel (.xlsx, .xls, .xlsb)
  • OpenDocument (.ods)

Formula syntax varies by format (e.g., ODS uses of:=SUM([.A1:.B1]) vs Excel's SUM(A1:B1)).

API Examples

# Default behavior - formulas disabled
wb = load_workbook("spreadsheet.xlsx")
sheet = wb.get_sheet_by_name("Sheet1")
# sheet.iter_formulas()  # Raises ValueError

# Enable formula reading
wb = load_workbook("spreadsheet.xlsx", read_formulas=True)
sheet = wb.get_sheet_by_name("Sheet1")

# Access both data and formulas
data = list(sheet.iter_rows())      # [[10.0, 15.0, 25.0]]
formulas = list(sheet.iter_formulas())  # [["", "", "SUM(A1:B1)"]]

# Iterator properties for coordinate mapping
formula_iter = sheet.iter_formulas()
print(f"Start: {formula_iter.start}, Size: {formula_iter.width}x{formula_iter.height}")

Backward Compatibility

  • Default behavior unchanged (read_formulas=False)
  • No breaking changes to existing APIs
  • Existing code continues to work without modification

Testing

Comprehensive test suite covering:

  • All supported file formats (.xlsx, .xls, .ods)
  • All workbook creation methods
  • Error handling for disabled formulas
  • Iterator property validation
  • Coordinate consistency between data and formula iterators

Performance Impact

  • Zero overhead when formulas are disabled (default)
  • Minimal memory usage when formulas are enabled due to on-demand lookup strategy
  • No upfront range expansion or pre-allocation

@PSU3D0
Copy link
Author

PSU3D0 commented Jul 7, 2025

A human addendum to the AI's diff summary ^:

This lib is much faster than openpyxl but remained unusable for my use-case due to:

  • Lack of formula support
  • Lack of coordinate recovery

This PR should be a big step forward for broader usability and more use-cases. The next thing I'd hit would be a unified CellData iterator that exposes value, coordinate, optional formula, and potentially style information (still need to check Calamine source on this).

Cheers!

@PSU3D0
Copy link
Author

PSU3D0 commented Jul 12, 2025

@dimastbk bumping this

m.add_function(wrap_pyfunction!(load_workbook, m)?)?;
m.add_class::<CalamineWorkbook>()?;
m.add_class::<CalamineSheet>()?;
m.add_class::<CalamineFormulaIterator>()?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add CalamineCellIterator here?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, please

@benemanu
Copy link

benemanu commented Aug 21, 2025

This would be awesome!
Is there any way i can help?

@simx11
Copy link

simx11 commented Aug 21, 2025

This is something we urgently need, let me know if we can help!

Copilot AI review requested due to automatic review settings October 21, 2025 09:22
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants