29 commits
534ae8f
Add new testfiles to test for data with images
L1am0 Nov 20, 2025
d9cff9f
Add eslint config
L1am0 Nov 20, 2025
038ab94
Parse images out of docx, pptx, odt, odp, pdf
L1am0 Nov 20, 2025
31e84e0
npm i
L1am0 Nov 20, 2025
df58f63
Fix pdf file parsing (restore line break logic)
L1am0 Nov 20, 2025
37371d0
Also test files with images, if they extract text correctly
L1am0 Nov 20, 2025
46da9c7
Fix duplicate image problem in pptx parsing
L1am0 Nov 21, 2025
bfc2bfc
Adapt to a block-based logic, which allows to persist the position of…
L1am0 Nov 21, 2025
d62ba1b
eslint --fix
L1am0 Nov 21, 2025
7e12665
Added support for information from charts to be parsed and pulled
bjorndonald Jan 23, 2026
94b3d19
makes sure the extension checking is case insensitive
bjorndonald Jan 23, 2026
198cc46
added support for powerpoint
bjorndonald Jan 23, 2026
1f8fb79
Merge pull request #1 from pylehound/feat/support-for-charts
bjorndonald Jan 23, 2026
d71fd21
Add UPPER/lowercase testcase for file extension
L1am0 Jan 24, 2026
6fb0018
adds extensive typing for all entities in code
bjorndonald Jan 24, 2026
b4e68fa
Merge remote-tracking branch 'upstream/master'
bjorndonald Jan 26, 2026
1f80e37
updates the forked repo to match the version of the main repo
bjorndonald Jan 26, 2026
8698adf
improves the parsers to include blocks with the different elements
bjorndonald Jan 26, 2026
94974a0
resolves some typing issues
bjorndonald Jan 26, 2026
23d9353
adds dist folder back
bjorndonald Jan 26, 2026
9486e2e
improves logging in the parsers
bjorndonald Jan 26, 2026
25894d6
improves logging in the parsers
bjorndonald Jan 26, 2026
63c27d6
improves logging for debugging
bjorndonald Jan 26, 2026
566b5c5
ensures the charts get created regardless of the attachments option
bjorndonald Jan 26, 2026
ab7473b
removes unnecessary logging requests
bjorndonald Jan 26, 2026
e93ee2d
adds type DocImage
bjorndonald Jan 26, 2026
6c7e845
adds DocImage type
bjorndonald Jan 26, 2026
f11c66a
improves the typing of the OfficeParser
bjorndonald Jan 26, 2026
c1f8473
optimizes the extraction of charts and tables from excel and word files
bjorndonald Jan 28, 2026
6 changes: 1 addition & 5 deletions .gitignore
@@ -3,8 +3,4 @@ officeParserTemp/
officeParser.bk.js
*.DS_Store
.vscode/
-/dist/
-*.actual.json
-*.actual.txt
-*.traineddata
-/test/results/
+.idea/
1 change: 1 addition & 0 deletions README.md
@@ -124,6 +124,7 @@ console.log(text);
We still support callbacks, but the data returned is now the AST object.
```js
const officeParser = require('officeparser');
const fs = require('fs');

officeParser.parseOffice("/path/to/officeFile.docx", function(ast, err) {
if (err) {
90 changes: 90 additions & 0 deletions dist/OfficeParser.d.ts
@@ -0,0 +1,90 @@
/**
* Office Parser - Main Entry Point
*
* This module provides the main `OfficeParser` class with a single static method
* that automatically detects file types and routes to the appropriate parser.
*
* **Supported Formats:**
* - DOCX (Word documents)
* - XLSX (Excel spreadsheets)
* - PPTX (PowerPoint presentations)
* - ODT, ODP, ODS (OpenDocument formats)
* - PDF (Portable Document Format)
* - RTF (Rich Text Format)
*
* **Usage:**
* ```typescript
* import * as fs from 'fs';
* import { OfficeParser } from 'officeparser';
*
* // Parse from file path
* const ast = await OfficeParser.parseOffice('document.docx', {
* extractAttachments: true,
* ocr: true
* });
*
* // Parse from Buffer
* const buffer = fs.readFileSync('document.pdf');
* const pdfAst = await OfficeParser.parseOffice(buffer);
*
* // Get plain text
* console.log(ast.toText());
* ```
*
* @module OfficeParser
*/
/// <reference types="node" />
import { OfficeParserAST, OfficeParserConfig } from './types';
/**
* Main parser class providing office document parsing functionality.
*
* This class contains a single static method `parseOffice` that serves as the
* universal entry point for parsing any supported office document format.
*/
export declare class OfficeParser {
/**
* Parses an office document and returns a structured AST.
*
* This method:
* 1. Accepts a file path, Buffer, or ArrayBuffer
* 2. Detects the file type (from extension or content)
* 3. Routes to the appropriate format-specific parser
* 4. Returns a unified AST structure
*
* **File Type Detection:**
* - If a file path is provided, uses the file extension (case-insensitive)
* - If a Buffer or ArrayBuffer is provided, uses magic bytes detection (file-type library)
*
* **Supported Formats and Routes:**
* - `.docx` → WordParser (OOXML)
* - `.xlsx` → ExcelParser (OOXML)
* - `.pptx` → PowerPointParser (OOXML)
* - `.odt`, `.odp`, `.ods` → OpenOfficeParser (ODF)
* - `.pdf` → PdfParser (PDF.js)
* - `.rtf` → RtfParser (custom RTF parser)
*
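* @param file - File path (string), Buffer, or ArrayBuffer containing the document
* @param configOrCallback - Optional configuration object, or a callback `(ast, err) => void` for callback-style use
* @param config - Optional configuration object, used only when the second argument is a callback (defaults applied for all omitted options)
* @returns A promise resolving to the parsed OfficeParserAST
* @throws {Error} If the file doesn't exist, the format is unsupported, or parsing fails; the error is also passed to the callback when one is supplied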
*
* @example
* ```typescript
* // Parse a DOCX file
* const ast = await OfficeParser.parseOffice('report.docx', {
* extractAttachments: true,
* includeRawContent: false
* });
*
* // Parse an ArrayBuffer with OCR enabled (placeholder URL)
* const arrayBuffer = await fetch('https://example.com/document.pdf').then(r => r.arrayBuffer());
* const pdfAst = await OfficeParser.parseOffice(arrayBuffer, {
* ocr: true,
* ocrLanguage: 'eng+fra'
* });
*
* // Extract text
* const text = ast.toText();
* ```
*/
static parseOffice(file: string | Buffer | ArrayBuffer, configOrCallback?: OfficeParserConfig | ((ast: OfficeParserAST, err?: any) => void), config?: OfficeParserConfig): Promise<OfficeParserAST>;
}
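A note on the signature above: the second parameter is either a config object or a callback, and in the callback style the config moves to the third position. The sketch below shows one way that overload can be resolved and merged with defaults. `resolveArgs` and `DEFAULTS` are hypothetical names used only for illustration, and `DEFAULTS` lists just a subset of the options that appear in `dist/OfficeParser.js`.

```javascript
// Hypothetical helper mirroring the argument handling in dist/OfficeParser.js.
// Only a subset of the default options is reproduced here.
const DEFAULTS = {
  ignoreNotes: false,
  newlineDelimiter: '\n',
  extractAttachments: false,
  ocr: false,
  ocrLanguage: 'eng',
};

function resolveArgs(configOrCallback, config) {
  let callback;
  let userConfig;
  if (typeof configOrCallback === 'function') {
    // Callback style: parseOffice(file, callback, config)
    callback = configOrCallback;
    userConfig = config || {};
  } else {
    // Promise style: parseOffice(file, config)
    userConfig = configOrCallback || {};
  }
  // User-supplied options override the defaults.
  return { callback, config: { ...DEFAULTS, ...userConfig } };
}

const promiseStyle = resolveArgs({ ocr: true });
const callbackStyle = resolveArgs(() => {}, { ocrLanguage: 'eng+fra' });
console.log(promiseStyle.config.ocr, typeof promiseStyle.callback);
console.log(callbackStyle.config.ocrLanguage, typeof callbackStyle.callback);
```

Because the declared return type is always `Promise<OfficeParserAST>`, callers using the callback style still receive a promise and may want to attach a `.catch` to it.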
217 changes: 217 additions & 0 deletions dist/OfficeParser.js
@@ -0,0 +1,217 @@
"use strict";
/**
* Office Parser - Main Entry Point
*
* This module provides the main `OfficeParser` class with a single static method
* that automatically detects file types and routes to the appropriate parser.
*
* **Supported Formats:**
* - DOCX (Word documents)
* - XLSX (Excel spreadsheets)
* - PPTX (PowerPoint presentations)
* - ODT, ODP, ODS (OpenDocument formats)
* - PDF (Portable Document Format)
* - RTF (Rich Text Format)
*
* **Usage:**
* ```typescript
* import * as fs from 'fs';
* import { OfficeParser } from 'officeparser';
*
* // Parse from file path
* const ast = await OfficeParser.parseOffice('document.docx', {
* extractAttachments: true,
* ocr: true
* });
*
* // Parse from Buffer
* const buffer = fs.readFileSync('document.pdf');
* const pdfAst = await OfficeParser.parseOffice(buffer);
*
* // Get plain text
* console.log(ast.toText());
* ```
*
* @module OfficeParser
*/
var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) {
if (k2 === undefined) k2 = k;
var desc = Object.getOwnPropertyDescriptor(m, k);
if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) {
desc = { enumerable: true, get: function() { return m[k]; } };
}
Object.defineProperty(o, k2, desc);
}) : (function(o, m, k, k2) {
if (k2 === undefined) k2 = k;
o[k2] = m[k];
}));
var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) {
Object.defineProperty(o, "default", { enumerable: true, value: v });
}) : function(o, v) {
o["default"] = v;
});
var __importStar = (this && this.__importStar) || function (mod) {
if (mod && mod.__esModule) return mod;
var result = {};
if (mod != null) for (var k in mod) if (k !== "default" && Object.prototype.hasOwnProperty.call(mod, k)) __createBinding(result, mod, k);
__setModuleDefault(result, mod);
return result;
};
Object.defineProperty(exports, "__esModule", { value: true });
exports.OfficeParser = void 0;
const fileType = __importStar(require("file-type"));
const fs = __importStar(require("fs"));
const ExcelParser_1 = require("./parsers/ExcelParser");
const OpenOfficeParser_1 = require("./parsers/OpenOfficeParser");
const PdfParser_1 = require("./parsers/PdfParser");
const PowerPointParser_1 = require("./parsers/PowerPointParser");
const RtfParser_1 = require("./parsers/RtfParser");
const WordParser_1 = require("./parsers/WordParser");
const errorUtils_1 = require("./utils/errorUtils");
/**
* Main parser class providing office document parsing functionality.
*
* This class contains a single static method `parseOffice` that serves as the
* universal entry point for parsing any supported office document format.
*/
class OfficeParser {
/**
* Parses an office document and returns a structured AST.
*
* This method:
* 1. Accepts a file path, Buffer, or ArrayBuffer
* 2. Detects the file type (from extension or content)
* 3. Routes to the appropriate format-specific parser
* 4. Returns a unified AST structure
*
* **File Type Detection:**
* - If a file path is provided, uses the file extension (case-insensitive)
* - If a Buffer or ArrayBuffer is provided, uses magic bytes detection (file-type library)
*
* **Supported Formats and Routes:**
* - `.docx` → WordParser (OOXML)
* - `.xlsx` → ExcelParser (OOXML)
* - `.pptx` → PowerPointParser (OOXML)
* - `.odt`, `.odp`, `.ods` → OpenOfficeParser (ODF)
* - `.pdf` → PdfParser (PDF.js)
* - `.rtf` → RtfParser (custom RTF parser)
*
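* @param file - File path (string), Buffer, or ArrayBuffer containing the document
* @param configOrCallback - Optional configuration object, or a callback `(ast, err) => void` for callback-style use
* @param config - Optional configuration object, used only when the second argument is a callback (defaults applied for all omitted options)
* @returns A promise resolving to the parsed OfficeParserAST
* @throws {Error} If the file doesn't exist, the format is unsupported, or parsing fails; the error is also passed to the callback when one is supplied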
*
* @example
* ```typescript
* // Parse a DOCX file
* const ast = await OfficeParser.parseOffice('report.docx', {
* extractAttachments: true,
* includeRawContent: false
* });
*
* // Parse an ArrayBuffer with OCR enabled (placeholder URL)
* const arrayBuffer = await fetch('https://example.com/document.pdf').then(r => r.arrayBuffer());
* const pdfAst = await OfficeParser.parseOffice(arrayBuffer, {
* ocr: true,
* ocrLanguage: 'eng+fra'
* });
*
* // Extract text
* const text = ast.toText();
* ```
*/
static async parseOffice(file, configOrCallback, config) {
let callback;
let actualConfig = {};
if (typeof configOrCallback === 'function') {
callback = configOrCallback;
actualConfig = config || {};
}
else {
actualConfig = configOrCallback || {};
}
const internalConfig = {
ignoreNotes: false,
newlineDelimiter: '\n',
putNotesAtLast: false,
outputErrorToConsole: false,
extractAttachments: false,
ocr: false,
ocrLanguage: 'eng',
includeRawContent: false,
pdfWorkerSrc: '',
...actualConfig
};
let buffer = Buffer.alloc(0);
let ext = '';
let filePath;
try {
if (!file) {
throw (0, errorUtils_1.getOfficeError)(errorUtils_1.OfficeErrorType.IMPROPER_ARGUMENTS, internalConfig);
}
if (file instanceof ArrayBuffer) {
buffer = Buffer.from(file);
}
else if (Buffer.isBuffer(file)) {
buffer = file;
}
else if (typeof file === 'string') {
filePath = file;
if (!fs.existsSync(file)) {
throw (0, errorUtils_1.getOfficeError)(errorUtils_1.OfficeErrorType.FILE_DOES_NOT_EXIST, internalConfig, file);
}
if (fs.lstatSync(file).isDirectory()) {
throw (0, errorUtils_1.getOfficeError)(errorUtils_1.OfficeErrorType.LOCATION_NOT_FOUND, internalConfig, file);
}
buffer = fs.readFileSync(file);
ext = file.split('.').pop()?.toLowerCase() || '';
}
else {
throw (0, errorUtils_1.getOfficeError)(errorUtils_1.OfficeErrorType.INVALID_INPUT, internalConfig);
}
if (!ext) {
const type = await fileType.fromBuffer(buffer);
if (type) {
ext = type.ext.toLowerCase();
}
else {
throw (0, errorUtils_1.getOfficeError)(errorUtils_1.OfficeErrorType.IMPROPER_BUFFERS, internalConfig);
}
}
let result;
switch (ext) {
case 'docx':
result = await (0, WordParser_1.parseWord)(buffer, internalConfig);
break;
case 'pptx':
result = await (0, PowerPointParser_1.parsePowerPoint)(buffer, internalConfig);
break;
case 'xlsx':
result = await (0, ExcelParser_1.parseExcel)(buffer, internalConfig);
break;
case 'odt':
case 'odp':
case 'ods':
result = await (0, OpenOfficeParser_1.parseOpenOffice)(buffer, internalConfig);
break;
case 'pdf':
result = await (0, PdfParser_1.parsePdf)(buffer, internalConfig);
break;
case 'rtf':
result = await (0, RtfParser_1.parseRtf)(buffer, internalConfig);
break;
default:
throw (0, errorUtils_1.getOfficeError)(errorUtils_1.OfficeErrorType.EXTENSION_UNSUPPORTED, internalConfig, ext);
}
if (callback)
callback(result);
return result;
}
catch (error) {
const wrappedError = (0, errorUtils_1.getWrappedError)(error, internalConfig, filePath);
if (callback)
callback(undefined, wrappedError);
throw wrappedError;
}
}
}
exports.OfficeParser = OfficeParser;
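A note on extension detection in the file above: for string inputs the extension comes from `file.split('.').pop()`, so a path without a dot yields the whole path (lowercased) rather than an empty string. As far as the compiled code shown here goes, that means the `fileType.fromBuffer` fallback is only reached for Buffer and ArrayBuffer inputs. The sketch below reproduces that expression next to a `path.extname` variant for comparison. `extFromSplit` and `extFromPath` are hypothetical names, and the `path.extname` variant is an alternative for discussion, not what this PR does.

```javascript
const path = require('path');

// Hypothetical helper reproducing the expression used in parseOffice:
//   file.split('.').pop()?.toLowerCase() || ''
function extFromSplit(file) {
  return file.split('.').pop()?.toLowerCase() || '';
}

// Alternative (not what the PR does): path.extname returns '' when the
// base name has no extension, so a caller could fall back to content
// sniffing in that case.
function extFromPath(file) {
  return path.extname(file).slice(1).toLowerCase();
}

for (const name of ['Report.DOCX', 'slides.v2.pptx', 'README']) {
  console.log(name, JSON.stringify(extFromSplit(name)), JSON.stringify(extFromPath(name)));
}
```

Both helpers agree on ordinary names such as `Report.DOCX`; they differ only for names with no extension, where the split-based version returns the lowercased name itself.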
51 changes: 51 additions & 0 deletions dist/index.d.ts
@@ -0,0 +1,51 @@
#!/usr/bin/env node
/**
* officeparser - Universal Office Document Parser
*
* A comprehensive Node.js library for parsing Microsoft Office and OpenDocument files
* into structured Abstract Syntax Trees (AST) with full formatting information.
*
* **Supported Formats:**
* - Microsoft Office: DOCX, XLSX, PPTX (Office Open XML)
* - OpenDocument: ODT, ODP, ODS (ODF)
* - Legacy: RTF (Rich Text Format)
* - Portable: PDF
*
* **Key Features:**
* - Unified AST output across all formats
* - Rich text formatting (bold, italic, colors, fonts, etc.)
* - Document structure (headings, lists, tables)
* - Image extraction with optional OCR
* - Metadata extraction
* - TypeScript support with full type definitions
*
* **Quick Start:**
* ```typescript
* import { OfficeParser } from 'officeparser';
*
* const ast = await OfficeParser.parseOffice('document.docx', {
* extractAttachments: true,
* ocr: true,
* includeRawContent: false
* });
*
* console.log(ast.toText()); // Plain text output
* console.log(ast.content); // Structured content tree
* console.log(ast.metadata); // Document metadata
* ```
*
* **Main Exports:**
* - `OfficeParser` - Main parser class
* - `OfficeParserConfig` - Configuration interface
* - `OfficeParserAST` - AST result interface
* - `OfficeContentNode` - Content tree node interface
* - All type definitions
*
* @packageDocumentation
* @module officeparser
*/
import { OfficeParser } from './OfficeParser';
import { OfficeParserConfig, OfficeParserAST, OfficeContentNode, OfficeAttachment, OfficeMetadata, TextFormatting, SupportedFileType, OfficeContentNodeType, OfficeMimeType, SlideMetadata, SheetMetadata, HeadingMetadata, ListMetadata, CellMetadata, ImageMetadata, PageMetadata, ContentMetadata, DocImage, Block, TextBlock, ImageBlock, TableBlock, ChartBlock, ChartData, ChartMetadata, CoordinateData, ParagraphMetadata, TextMetadata, NoteMetadata } from './types';
declare const parseOffice: typeof OfficeParser.parseOffice;
export { OfficeParser, parseOffice, OfficeParserConfig, OfficeParserAST, OfficeContentNode, OfficeAttachment, OfficeMetadata, TextFormatting, SupportedFileType, OfficeContentNodeType, OfficeMimeType, SlideMetadata, SheetMetadata, HeadingMetadata, ListMetadata, CellMetadata, ImageMetadata, PageMetadata, ContentMetadata, DocImage, Block, TextBlock, ImageBlock, TableBlock, ChartBlock, ChartData, ChartMetadata, CoordinateData, ParagraphMetadata, TextMetadata, NoteMetadata };
export default OfficeParser;