A DuckDB extension that enables reading Excel, WPS, and OpenDocument spreadsheet files directly within SQL queries. This extension provides seamless integration for analyzing spreadsheet data using DuckDB's powerful SQL engine.
- High Performance: Optimized for large files with pure Rust implementation; e.g., reading a 1M-row XLSX file on MacBook M1 now takes under 13 seconds (down from 140+ seconds in v0.1.x).
- Multiple Format Support: Read Excel files (
.xls,.xlsx,.xlsm,.xlsb,.xla,.xlam), WPS files (.et,.ett), and OpenDocument Spreadsheet files (.ods) - Batch Processing: Analyze or read multiple files and worksheets with wildcard pattern matching
- Flexible Data Types: Support for boolean, integer, double, varchar, datetime, date, and time types
- Excel-Style Data Ranges: Specify data ranges using familiar Excel notation (e.g.,
"A1:C3") - Automatic Column Type Detection: Column types are inferred automatically; override specific columns with the
columnsparameter - Header Row Handling: Automatic detection and parsing of header rows
- Error Handling: Configurable behavior for parsing errors with precise cell location and file name reporting
- Type Safety: Built-in data type validation and conversion
- Advanced Data Filtering: Skip empty rows or stop at first empty row for efficient data processing
- Remote Storage Support: Read spreadsheets from remote URLs including HTTP, HTTPS, S3, Google Cloud Storage, and Hugging Face datasets
DuckDB 1.4.1 or later is required for community extension support.
The easiest way to install rusty_sheet is through DuckDB's community extension platform:
INSTALL rusty_sheet FROM community;
LOAD rusty_sheet;- Python 3
- Python 3-venv
- Make
- Git
- Rust toolchain
- Clone the repository:
git clone https://github.com/redraiment/rusty-sheet.git
cd rusty-sheet- Configure the build environment:
make configure- Build the extension:
make debug # For development
make release # For production- The built extension will be available in
build/debug/extension/orbuild/release/extension/
Start DuckDB with the unsigned flag to load local extensions:
duckdb -unsignedLoad the extension:
LOAD './build/debug/extension/rusty-sheet/rusty-sheet.duckdb_extension';SELECT * FROM read_sheet('data.xlsx');SELECT * FROM read_sheet('workbook.xlsx', sheet='Sheet2');SELECT * FROM read_sheet('data.xlsx',
columns={'id': 'bigint'}
);SELECT * FROM read_sheet('data.xlsx', range='A2:E100');SELECT * FROM read_sheet('data.xlsx',
header=false,
columns={'A': 'varchar', 'B': 'bigint'}
);-- Read from HTTP/HTTPS URL
SELECT * FROM read_sheet('https://example.com/data.xlsx');
-- Read from S3 bucket
SELECT * FROM read_sheet('s3://my-bucket/data.xlsx');
-- Read from Google Cloud Storage
SELECT * FROM read_sheet('gs://my-bucket/data.xlsx');
-- Read from Hugging Face dataset
SELECT * FROM read_sheet('hf://datasets/my-dataset/data.xlsx');
-- With custom HTTP headers
CREATE PERSISTENT SECRET http_auth ( -- Must be a persistent SECRET
TYPE HTTP,
BEARER_TOKEN 'Hello world!'
);
SELECT * FROM read_sheet('https://example.com/data.xlsx');SELECT * FROM analyze_sheet('data.xlsx', analyze_rows=20);-- Read all Excel files in directory
SELECT * FROM read_sheets(['*.xlsx']);
-- Read all WPS files
SELECT * FROM read_sheets(['*.et']);
-- Read multiple file types using list patterns
SELECT * FROM read_sheets(['*.xls', '*.xlsx']);
-- Read multiple file types with different extensions
SELECT * FROM read_sheets(['*.xlsx', '*.ods', '*.et']);
-- Read from remote URLs
SELECT * FROM read_sheets(['https://example.com/data1.xlsx', 'https://example.com/data2.xlsx']);
-- Mix local and remote files
SELECT * FROM read_sheets(['local_data.xlsx', 'https://example.com/remote_data.xlsx']);-- Analyze all worksheets in all Excel files
SELECT * FROM analyze_sheets(['*.xlsx']);
-- Analyze specific worksheets across files
SELECT * FROM analyze_sheets(['*.xlsx'], sheets=['Sheet1', 'Sheet2']);
-- Analyze with wildcard pattern
SELECT * FROM analyze_sheets(['*.xlsx'], sheets=['Sheet*']);
-- Analyze multiple file types
SELECT * FROM analyze_sheets(['*.xlsx', '*.ods']);-- Match specific worksheets only in specific file types
SELECT * FROM read_sheets(['*.xlsx'], sheets=['*.xlsx=Sheet*']);
-- Read multiple file types
SELECT * FROM read_sheets(['*.*']);
-- Use multiple file patterns with improved matching
SELECT * FROM read_sheets(['*.xls', '*.xlsx'], sheets=['Sheet*']);The read_sheets and analyze_sheets functions now feature enhanced matching logic:
- Multiple Pattern Support: Accepts lists of file patterns for flexible multi-format processing
- Smart Worksheet Discovery: When using wildcards, the system finds the first worksheet that matches both the file pattern and worksheet pattern, reducing "no matching worksheets" errors
- File-Specific Matching: Improved handling of file-specific worksheet patterns like
*.xlsx=Sheet*
-- Skip completely empty rows
SELECT * FROM read_sheet('data.xlsx', skip_empty_rows=true);
-- Stop reading at first empty row
SELECT * FROM read_sheet('data.xlsx', end_at_empty_row=true);Analyzes the column structure of a single worksheet in a single file.
Parameters:
- file_path (required): Path to the spreadsheet file (no wildcard support). Supports local files and remote URLs (HTTP, HTTPS, S3, GS, HF)
- sheet (optional, default first sheet): Worksheet name (supports wildcards like
Sheet*) - range (optional): Data range in format
[start_col][start_row]:[end_col][end_row] - header (optional, default
true): Whether the first row contains column headers - analyze_rows (optional, default
10): Number of rows to analyze for type inference - error_as_null (optional, default
false): If true, convert parsing errors to NULL instead of failing
Examples:
-- Analyze default worksheet
SELECT * FROM analyze_sheet('data.xlsx');
-- Analyze specific worksheet
SELECT * FROM analyze_sheet('data.xlsx', sheet='Sheet2');
-- Analyze specific range
SELECT * FROM analyze_sheet('data.xlsx', range='A1:C10');
-- Analyze more rows for better type inference
SELECT * FROM analyze_sheet('data.xlsx', analyze_rows=50);
-- Analyze remote URL
SELECT * FROM analyze_sheet('https://example.com/data.xlsx');
-- Analyze S3 file
SELECT * FROM analyze_sheet('s3://my-bucket/data.xlsx');Analyzes column structures of multiple worksheets across multiple files with wildcard pattern matching.
Parameters:
- file_pattern (required): File path pattern(s) with wildcard support (e.g.,
['*.xlsx'],['*.xls', '*.xlsx']). Also supports remote URLs (HTTP, HTTPS, S3, GS, HF) - sheets (optional): List of worksheet names (supports wildcards and file-specific patterns like
['Sheet*'],['*.xlsx=Sheet*']) - range (optional): Data range in format
[start_col][start_row]:[end_col][end_row] - header (optional, default
true): Whether the first row contains column headers - analyze_rows (optional, default
10): Number of rows to analyze for type inference - error_as_null (optional, default
false): If true, convert parsing errors to NULL instead of failing
Examples:
-- Analyze all worksheets in all Excel files
SELECT * FROM analyze_sheets(['*.xlsx']);
-- Analyze specific worksheets across files
SELECT * FROM analyze_sheets(['*.xlsx'], sheets=['Sheet1', 'Sheet2']);
-- Use wildcard to match worksheets
SELECT * FROM analyze_sheets(['*.xlsx'], sheets=['Sheet*']);
-- File-specific pattern matching
SELECT * FROM analyze_sheets(['*.xlsx'], sheets=['*.xlsx=Sheet*']);
-- Analyze multiple file types
SELECT * FROM analyze_sheets(['*.xlsx', '*.ods']);
-- Analyze remote URLs
SELECT * FROM analyze_sheets(['https://example.com/data1.xlsx', 'https://example.com/data2.xlsx']);
-- Analyze mixed local and remote files
SELECT * FROM analyze_sheets(['local_data.xlsx', 'https://example.com/remote_data.xlsx']);Reads data from a single worksheet in a single file.
Parameters:
- file_path (required): Path to the spreadsheet file (no wildcard support). Supports local files and remote URLs (HTTP, HTTPS, S3, GS, HF)
- sheet (optional, default first sheet): Worksheet name (supports wildcards like
Sheet*) - range (optional): Data range in format
[start_col][start_row]:[end_col][end_row] - header (optional, default
true): Whether the first row contains column headers - analyze_rows (optional, default
10): Number of rows to analyze for type inference - error_as_null (optional, default
false): If true, convert parsing errors to NULL instead of failing - skip_empty_rows (optional, default
false): Skip rows where all columns contain empty values - end_at_empty_row (optional, default
false): Stop reading at the first completely empty row
Examples:
-- Read default worksheet
SELECT * FROM read_sheet('data.xlsx');
-- Read specific worksheet
SELECT * FROM read_sheet('workbook.xlsx', sheet='Sheet2');
-- Read specific data range
SELECT * FROM read_sheet('data.xlsx', range='A2:E100');
-- Skip empty rows
SELECT * FROM read_sheet('data.xlsx', skip_empty_rows=true);
-- Stop at first empty row
SELECT * FROM read_sheet('data.xlsx', end_at_empty_row=true);
-- Handle errors as NULL
SELECT * FROM read_sheet('messy_data.xlsx', error_as_null=true);
-- Read from remote URL
SELECT * FROM read_sheet('https://example.com/data.xlsx');
-- Read from S3 with specific worksheet
SELECT * FROM read_sheet('s3://my-bucket/data.xlsx', sheet='Sheet2');
-- Read from Google Cloud Storage with range
SELECT * FROM read_sheet('gs://my-bucket/data.xlsx', range='A1:C10');Reads data from multiple worksheets across multiple files with wildcard pattern matching.
Important Note: When using wildcard patterns, this function analyzes the column structure and data types from the first matching worksheet only. All subsequent worksheets with matching patterns will use the same column structure, even if their actual structure differs. For worksheets with varying structures, consider using analyze_sheets first to inspect individual worksheet structures.
Parameters:
- file_pattern (required): File path pattern(s) with wildcard support (e.g.,
['*.xlsx'],['*.xls', '*.xlsx']). Also supports remote URLs (HTTP, HTTPS, S3, GS, HF) - sheets (optional): List of worksheet names (supports wildcards and file-specific patterns like
['Sheet*'],['*.xlsx=Sheet*']) - range (optional): Data range in format
[start_col][start_row]:[end_col][end_row] - header (optional, default
true): Whether the first row contains column headers - analyze_rows (optional, default
10): Number of rows to analyze for type inference - error_as_null (optional, default
false): If true, convert parsing errors to NULL instead of failing - skip_empty_rows (optional, default
false): Skip rows where all columns contain empty values - end_at_empty_row (optional, default
false): Stop reading at the first completely empty row - file_name_column (optional): Column name to include file source information in results
- sheet_name_column (optional): Column name to include worksheet source information in results
- union_by_name (optional, default
false): When false, union data by position; when true, union data by column name
Examples:
-- Read all Excel files
SELECT * FROM read_sheets(['*.xlsx']);
-- Read all WPS files
SELECT * FROM read_sheets(['*.et']);
-- Read specific worksheets
SELECT * FROM read_sheets(['*.xlsx'], sheets=['Sheet1', 'Sheet2']);
-- Use wildcard to match worksheets
SELECT * FROM read_sheets(['*.xlsx'], sheets=['Sheet*']);
-- File-specific pattern matching
SELECT * FROM read_sheets(['*.xlsx'], sheets=['*.xlsx=Sheet*']);
-- Skip empty rows in batch processing
SELECT * FROM read_sheets(['*.xlsx'], skip_empty_rows=true);
-- Read multiple file types
SELECT * FROM read_sheets(['*.xls', '*.xlsx']);
-- Track data sources with custom column names
SELECT * FROM read_sheets(['*.xls', '*.xlsx'],
sheets=['Sheet*'],
file_name_column='file',
sheet_name_column='worksheet'
);
-- Union data by column name instead of position
SELECT * FROM read_sheets(['*.xlsx'], union_by_name=true);
-- Union by name with specific worksheets
SELECT * FROM read_sheets(['*.xlsx'],
sheets=['Sheet1', 'Sheet2'],
union_by_name=true
);
-- Read from remote URLs
SELECT * FROM read_sheets(['https://example.com/data1.xlsx', 'https://example.com/data2.xlsx']);
-- Mix local and remote files
SELECT * FROM read_sheets(['local_data.xlsx', 's3://my-bucket/remote_data.xlsx']);
-- Read from multiple cloud storage providers
SELECT * FROM read_sheets(['s3://bucket1/data.xlsx', 'gs://bucket2/data.xlsx']);| Type | DuckDB Type | Description |
|---|---|---|
boolean |
BOOLEAN | True/false values |
bigint |
BIGINT | 64-bit signed integers |
double |
DOUBLE | Double-precision floating point |
varchar |
VARCHAR | Variable-length strings |
timestamp |
TIMESTAMP | Date and time with microsecond precision (supports ISO 8601 format) |
date |
DATE | Date without time component (supports ISO 8601 format) |
time |
TIME | Time without date component (including ISO 8601 durations) |
The range parameter supports flexible Excel-style cell range notation with five optional components:
-
Start Column (optional): Excel column letter (e.g.,
Afor column 1,Bfor column 2, etc.)- If specified, reading starts from this column even if it contains no data
- Does not skip empty columns
-
Start Row (optional): Excel row number (e.g.,
1for row 1,2for row 2, etc.)- If specified, reading starts from this row even if it contains no data
- Does not skip empty rows
-
Colon Separator: Required only when specifying end column or end row
-
End Column (optional): Excel column letter
- If specified, reading stops at this column even if data ends earlier
- Ignores data beyond this column
-
End Row (optional): Excel row number
- If specified, reading stops at this row even if data ends earlier
- Ignores data beyond this row
-- Single cell (B2 only)
SELECT * FROM read_sheet('data.xlsx', range='B2:B2');
-- Cell range (columns B-D, rows 2-5)
SELECT * FROM read_sheet('data.xlsx', range='B2:D5');
-- Column range only (all rows in columns A through C)
SELECT * FROM read_sheet('data.xlsx', range='A:C');
-- Row range only (all columns in rows 5 through 15)
SELECT * FROM read_sheet('data.xlsx', range='5:15');
-- Start column and row only (from B2 to end of sheet)
SELECT * FROM read_sheet('data.xlsx', range='B2');
-- Single column only (column B only)
SELECT * FROM read_sheet('data.xlsx', range='B:B');
-- Single row only (row 5 only)
SELECT * FROM read_sheet('data.xlsx', range='5:5');
-- Start column only (from column B to end of sheet)
SELECT * FROM read_sheet('data.xlsx', range='B');
-- Start row only (from row 5 to end of sheet)
SELECT * FROM read_sheet('data.xlsx', range='5');
-- End column only (from start to column D)
SELECT * FROM read_sheet('data.xlsx', range=':D');
-- End row only (from start to row 10)
SELECT * FROM read_sheet('data.xlsx', range=':10');The extension supports Rust glob patterns for file and worksheet matching.
?- Matches any single character*- Matches any (possibly empty) sequence of characters**- Matches the current directory and arbitrary subdirectories[...]- Matches any character inside the brackets[!...]- Negation of[...], matches characters not in the brackets
-- Single character wildcard
SELECT * FROM read_sheets(['data_2024_??.xlsx']);
-- Multiple character wildcard
SELECT * FROM read_sheets(['report_*.xlsx']);
-- Character set matching
SELECT * FROM read_sheets(['data_[0-9].xlsx']);
SELECT * FROM read_sheets(['file_[abc].xlsx']);
-- Negation pattern
SELECT * FROM read_sheets(['file_[!test]*.xlsx']);
-- Recursive directory matching
SELECT * FROM read_sheets(['**/*.xlsx']);
-- Worksheet pattern matching
SELECT * FROM read_sheets(['*.xlsx'], sheets=['Sheet?']);
SELECT * FROM read_sheets(['*.xlsx'], sheets=['Data*']);
-- File-specific worksheet patterns
SELECT * FROM read_sheets(['*.xlsx'], sheets=['*.xlsx=Sheet*']);**must form a single path component (e.g.,**/*.xlsxis valid,**ais invalid)- Character ranges use Unicode ordering (e.g.,
[0-9]matches digits 0-9) - Metacharacters
?,*,[,]can be matched using brackets (e.g.,[?]) - The
-character in character sets must be at start or end (e.g.,[abc-])
Run the test suite:
# Test debug build
make test_debug
# Test release build
make test_releaseTest with different DuckDB versions:
make clean_all
DUCKDB_TEST_VERSION=v1.4.1 make configure
make debug
make test_debugThis extension is built using the DuckDB Rust extension framework. The main components are:
src/lib.rs: Main extension implementationCargo.toml: Rust dependencies and build configuration
To contribute:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run
make test_debugto verify - Submit a pull request
- On Windows with Python 3.11, you may encounter extension loading issues. Use Python 3.12 or later.
- Very large spreadsheets may require significant memory allocation.
- Complex Excel formulas are not evaluated; only the computed values are read.
Zhang, Zepeng Email: redraiment@gmail.com
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on the DuckDB Rust extension template
- Inspired by DuckDB's commitment to making data analysis more accessible
- Thanks to abelcha for helping implement remote file reading functionality