A pure Rust PDF generation and manipulation library with zero external PDF dependencies. Battle-tested against 9,000+ real-world PDFs with a 99.3% success rate, 6,400+ tests, and validated performance of 3,000-4,000 pages/second for realistic business documents.
- 🚀 Pure Rust Core - No C dependencies for PDF operations (OCR feature requires Tesseract)
- 📄 PDF Generation - Create multi-page documents with text, graphics, and images
- 🔍 PDF Parsing - Read and extract content from existing PDFs (tested on 9,000+ real-world PDFs)
- 🛡️ Corruption Recovery - Robust error recovery for damaged or malformed PDFs (99.3% success rate)
- ✂️ PDF Operations - Split, merge, and rotate PDFs while preserving basic content
- 🖼️ Image Support - Embed JPEG and PNG images with automatic compression
- 🎨 Transparency & Blending - Full alpha channel, SMask, blend modes for watermarking and overlays
- 🌏 CJK Text Support - Chinese, Japanese, and Korean text rendering and extraction with ToUnicode CMap
- 🎨 Rich Graphics - Vector graphics with shapes, paths, colors (RGB/CMYK/Gray)
- 📝 Advanced Text - Custom TTF/OTF fonts, standard fonts, text flow with automatic wrapping, alignment
- 🅰️ Custom Fonts - Load and embed TrueType/OpenType fonts with full Unicode support
- 🔍 OCR Support - Extract text from scanned PDFs using Tesseract OCR
- 🤖 AI/RAG Integration - Document chunking for LLM pipelines with sentence boundaries and metadata
- 📋 Invoice Extraction - Automatic structured data extraction from invoice PDFs with multi-language support
- 🗜️ Compression - FlateDecode, LZWDecode, CCITTFaxDecode, JBIG2Decode, and more
- 🔒 Encryption - RC4, AES-128, AES-256 (R5/R6) with full permission support
- ✍️ Digital Signatures - Detection, PKCS#7 verification, and certificate validation (Mozilla CA roots)
- 📑 PDF/A Validation - 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- 🔒 Type Safe - Leverage Rust's type system for safe PDF manipulation
- 📜 MIT License - Consolidated across all project files
- 📊 9,000+ PDF Corpus - 7-tier test infrastructure (T0-T6) with 99.3% success rate
- 🖼️ JBIG2 Decoder - Full pure Rust implementation (ITU-T T.88, 9 modules, 416 tests)
- ✍️ Digital Signature Verification - PKCS#7 + Mozilla CA root certificates
- 📑 PDF/A Validation - 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- 🗜️ CCITTFaxDecode - Group 3/4 fax compression support
- 🔒 AES-256 R5/R6 Encryption - RustCrypto, Algorithm 2.B, qpdf compatible
- 🧪 6,400+ Tests - Unit, integration, doc tests, and property-based testing
See CHANGELOG.md for previous releases.
- Production-ready performance - 3,000-4,000 pages/second generation, 35.9 PDFs/second parsing
- 5.2 MB binary - 3x smaller than PDFSharp, 40x smaller than IronPDF
- Zero dependencies - No runtime, no Chrome, just a single binary
- Low memory usage - Efficient streaming for large PDFs
- Memory safe - Guaranteed by Rust compiler (no null pointers, no buffer overflows)
- Type safe API - Catch errors at compile time
- 6,400+ tests - Comprehensive test suite with 9,000+ real-world PDFs
- Fewer vulnerability classes - Memory safety eliminates entire classes of vulnerabilities, such as buffer overflows (logic bugs remain possible)
- Modern API - Designed in 2024, not ported from 2005
- True cross-platform - Single binary runs on Linux, macOS, Windows, ARM
- Easy deployment - One file to ship, no dependencies to manage
- Fast compilation - Incremental builds in seconds
Add oxidize-pdf to your Cargo.toml:
[dependencies]
oxidize-pdf = "2.0.0"

# For OCR support (optional), replace the line above with:
oxidize-pdf = { version = "2.0.0", features = ["ocr-tesseract"] }

use oxidize_pdf::{Document, Page, Font, Color, Result};

fn main() -> Result<()> {
    // Create a new document
    let mut doc = Document::new();
    doc.set_title("My First PDF");
    doc.set_author("Rust Developer");

    // Create a page
    let mut page = Page::a4();

    // Add text
    page.text()
        .set_font(Font::Helvetica, 24.0)
        .at(50.0, 700.0)
        .write("Hello, PDF!")?;

    // Add graphics
    page.graphics()
        .set_fill_color(Color::rgb(0.0, 0.5, 1.0))
        .circle(300.0, 400.0, 50.0)
        .fill();

    // Add the page and save
    doc.add_page(page);
    doc.save("hello.pdf")?;

    Ok(())
}

use oxidize_pdf::ai::DocumentChunker;
use oxidize_pdf::parser::{PdfReader, PdfDocument};
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Load and parse PDF
    let reader = PdfReader::open("document.pdf")?;
    let pdf_doc = PdfDocument::new(reader);
    let text_pages = pdf_doc.extract_text()?;

    // Prepare page texts with page numbers
    let page_texts: Vec<(usize, String)> = text_pages
        .iter()
        .enumerate()
        .map(|(idx, page)| (idx + 1, page.text.clone()))
        .collect();

    // Create chunker: 512 tokens per chunk, 50 tokens overlap
    let chunker = DocumentChunker::new(512, 50);
    let chunks = chunker.chunk_text_with_pages(&page_texts)?;

    // Process chunks for RAG pipeline
    for chunk in chunks {
        println!("Chunk {}: {} tokens", chunk.id, chunk.tokens);
        println!(" Pages: {:?}", chunk.page_numbers);
        println!(" Position: chars {}-{}",
            chunk.metadata.position.start_char,
            chunk.metadata.position.end_char);
        println!(" Sentence boundary: {}",
            chunk.metadata.sentence_boundary_respected);

        // Send to embedding API, store in vector DB, etc.
        // let embedding = openai.embed(&chunk.content)?;
        // vector_db.insert(chunk.id, embedding, chunk.content)?;
    }

    Ok(())
}

use oxidize_pdf::Document;
use oxidize_pdf::text::extraction::{TextExtractor, ExtractionOptions};
use oxidize_pdf::text::invoice::{InvoiceExtractor, InvoiceField};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open PDF invoice
    let doc = Document::open("invoice.pdf")?;
    let page = doc.get_page(1)?;

    // Extract text from page
    let text_extractor = TextExtractor::new();
    let extracted = text_extractor.extract_text(&doc, page, &ExtractionOptions::default())?;

    // Extract structured invoice data
    let invoice_extractor = InvoiceExtractor::builder()
        .with_language("es") // Spanish invoices
        .confidence_threshold(0.7) // 70% minimum confidence
        .build();

    let invoice = invoice_extractor.extract(&extracted.fragments)?;

    // Access extracted fields
    println!("Extracted {} fields with {:.0}% overall confidence",
        invoice.field_count(),
        invoice.metadata.extraction_confidence * 100.0
    );

    for field in &invoice.fields {
        match &field.field_type {
            InvoiceField::InvoiceNumber(number) => {
                println!("Invoice: {} ({:.0}% confidence)", number, field.confidence * 100.0);
            }
            InvoiceField::TotalAmount(amount) => {
                println!("Total: €{:.2} ({:.0}% confidence)", amount, field.confidence * 100.0);
            }
            InvoiceField::InvoiceDate(date) => {
                println!("Date: {} ({:.0}% confidence)", date, field.confidence * 100.0);
            }
            _ => {}
        }
    }

    Ok(())
}

Supported Languages: Spanish (ES), English (EN), German (DE), Italian (IT)
Extracted Fields: Invoice number, dates, amounts (total/tax/net), VAT numbers, supplier/customer names, currency, line items
See docs/INVOICE_EXTRACTION_GUIDE.md for complete documentation.
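
To illustrate what a 70% confidence threshold means in practice, here is a minimal standalone sketch that drops low-confidence fields. It does not use the oxidize-pdf API: `ExtractedField` and `filter_by_confidence` are hypothetical names introduced only for this example, and treating the threshold as inclusive is an assumption, not documented library behavior.

```rust
/// Hypothetical stand-in for an extracted invoice field (not the library's type).
#[derive(Debug, Clone, PartialEq)]
struct ExtractedField {
    name: String,
    value: String,
    /// Confidence score in the range 0.0..=1.0.
    confidence: f64,
}

/// Keep only the fields whose confidence is at or above `threshold`.
/// Assumption: the threshold is inclusive.
fn filter_by_confidence(fields: &[ExtractedField], threshold: f64) -> Vec<ExtractedField> {
    fields
        .iter()
        .filter(|f| f.confidence >= threshold)
        .cloned()
        .collect()
}

fn main() {
    // Made-up sample values, for illustration only.
    let fields = vec![
        ExtractedField { name: "invoice_number".into(), value: "2024-0042".into(), confidence: 0.95 },
        ExtractedField { name: "total_amount".into(), value: "1210.00".into(), confidence: 0.82 },
        ExtractedField { name: "vat_number".into(), value: "unclear".into(), confidence: 0.55 },
    ];

    // With a 0.7 threshold, the low-confidence VAT number is dropped.
    for f in filter_by_confidence(&fields, 0.7) {
        println!("{}: {} ({:.0}% confidence)", f.name, f.value, f.confidence * 100.0);
    }
}
```

A higher threshold trades recall for precision: fewer fields are returned, but the ones that remain are less likely to need manual review.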
use oxidize_pdf::{Document, Page, Font, Color, Result};

fn main() -> Result<()> {
    let mut doc = Document::new();
    doc.set_title("Custom Fonts Demo");

    // Load a custom font from file
    doc.add_font("MyFont", "/path/to/font.ttf")?;

    // Or load from bytes
    let font_data = std::fs::read("/path/to/font.otf")?;
    doc.add_font_from_bytes("MyOtherFont", font_data)?;

    let mut page = Page::a4();

    // Use standard font
    page.text()
        .set_font(Font::Helvetica, 14.0)
        .at(50.0, 700.0)
        .write("Standard Font: Helvetica")?;

    // Use custom font
    page.text()
        .set_font(Font::Custom("MyFont".to_string()), 16.0)
        .at(50.0, 650.0)
        .write("Custom Font: This is my custom font!")?;

    // Advanced text formatting with custom font
    page.text()
        .set_font(Font::Custom("MyOtherFont".to_string()), 12.0)
        .set_character_spacing(2.0)
        .set_word_spacing(5.0)
        .at(50.0, 600.0)
        .write("Spaced text with custom font")?;

    doc.add_page(page);
    doc.save("custom_fonts.pdf")?;

    Ok(())
}

use oxidize_pdf::{PdfReader, Result};
fn main() -> Result<()> {
    // Open and parse a PDF
    let mut reader = PdfReader::open("document.pdf")?;

    // Get document info
    println!("PDF Version: {}", reader.version());
    println!("Page Count: {}", reader.page_count()?);

    // Extract text from all pages
    let document = reader.into_document();
    let text = document.extract_text()?;

    for (page_num, page_text) in text.iter().enumerate() {
        println!("Page {}: {}", page_num + 1, page_text.content);
    }

    Ok(())
}

use oxidize_pdf::{Document, Page, Image, Result};
use oxidize_pdf::graphics::TransparencyGroup;

fn main() -> Result<()> {
    let mut doc = Document::new();
    let mut page = Page::a4();

    // Load a JPEG image
    let image = Image::from_jpeg_file("photo.jpg")?;

    // Add image to page
    page.add_image("my_photo", image);

    // Draw the image
    page.draw_image("my_photo", 100.0, 300.0, 400.0, 300.0)?;

    // Add watermark with transparency
    let watermark = TransparencyGroup::new().with_opacity(0.3);
    page.graphics()
        .begin_transparency_group(watermark)
        .set_font(oxidize_pdf::text::Font::HelveticaBold, 48.0)
        .begin_text()
        .show_text("CONFIDENTIAL")
        .end_text()
        .end_transparency_group();

    doc.add_page(page);
    doc.save("image_example.pdf")?;

    Ok(())
}

use oxidize_pdf::{Document, Page, Font, TextAlign, Result};
fn main() -> Result<()> {
    let mut doc = Document::new();
    let mut page = Page::a4();

    // Create text flow with automatic wrapping
    let mut flow = page.text_flow();
    flow.at(50.0, 700.0)
        .set_font(Font::Times, 12.0)
        .set_alignment(TextAlign::Justified)
        .write_wrapped("This is a long paragraph that will automatically wrap \
                        to fit within the page margins. The text is justified, \
                        creating clean edges on both sides.")?;

    page.add_text_flow(&flow);
    doc.add_page(page);
    doc.save("text_flow.pdf")?;

    Ok(())
}

use oxidize_pdf::operations::{PdfSplitter, PdfMerger, PageRange};
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Split a PDF
    let splitter = PdfSplitter::new("input.pdf")?;
    splitter.split_by_pages("page_{}.pdf")?; // page_1.pdf, page_2.pdf, ...

    // Merge PDFs
    let mut merger = PdfMerger::new();
    merger.add_pdf("doc1.pdf", PageRange::All)?;
    merger.add_pdf("doc2.pdf", PageRange::Pages(vec![1, 3, 5]))?;
    merger.save("merged.pdf")?;

    // Rotate pages
    use oxidize_pdf::operations::{PdfRotator, RotationAngle};
    let rotator = PdfRotator::new("input.pdf")?;
    rotator.rotate_all(RotationAngle::Clockwise90, "rotated.pdf")?;

    Ok(())
}

use oxidize_pdf::text::tesseract_provider::{TesseractOcrProvider, TesseractConfig};
use oxidize_pdf::text::ocr::{OcrOptions, OcrProvider};
use oxidize_pdf::operations::page_analysis::PageContentAnalyzer;
use oxidize_pdf::parser::PdfReader;
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Open a scanned PDF
    let document = PdfReader::open_document("scanned.pdf")?;
    let analyzer = PageContentAnalyzer::new(document);

    // Configure OCR provider
    let config = TesseractConfig::for_documents();
    let ocr_provider = TesseractOcrProvider::with_config(config)?;

    // Find and process scanned pages
    let scanned_pages = analyzer.find_scanned_pages()?;

    for page_num in scanned_pages {
        let result = analyzer.extract_text_from_scanned_page(page_num, &ocr_provider)?;
        println!("Page {}: {} (confidence: {:.1}%)",
            page_num, result.text, result.confidence * 100.0);
    }

    Ok(())
}

Before using OCR features, install Tesseract on your system:
macOS:
brew install tesseract
brew install tesseract-lang # For additional languages

Ubuntu/Debian:
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-spa # For Spanish
sudo apt-get install tesseract-ocr-deu # For German

Windows: Download from: https://github.com/UB-Mannheim/tesseract/wiki
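
After installing, it can help to confirm that the `tesseract` binary is reachable on your PATH before enabling the OCR feature. The sketch below uses only the Rust standard library and runs `tesseract --version`. The helper name `tool_version` is hypothetical and not part of oxidize-pdf. Reading stderr as a fallback is a defensive assumption, in case a given build prints its version there.

```rust
use std::process::Command;

/// Runs `<program> --version` and returns the first non-empty line of output.
/// Returns `None` if the program cannot be started or exits with an error.
/// Hypothetical helper for illustration; not part of oxidize-pdf.
fn tool_version(program: &str) -> Option<String> {
    let output = Command::new(program).arg("--version").output().ok()?;
    if !output.status.success() {
        return None;
    }

    // Check stdout first, then stderr, in case the version is printed there.
    let stdout = String::from_utf8_lossy(&output.stdout);
    let stderr = String::from_utf8_lossy(&output.stderr);
    let first_line = stdout
        .lines()
        .chain(stderr.lines())
        .map(str::trim)
        .find(|line| !line.is_empty())
        .map(str::to_owned);
    first_line
}

fn main() {
    match tool_version("tesseract") {
        Some(version) => println!("Found: {}", version),
        None => eprintln!("Tesseract was not found on PATH; install it before using OCR features"),
    }
}
```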
Explore comprehensive examples in the examples/ directory:
- recovery_corrupted_pdf.rs - Handle damaged or malformed PDFs with robust error recovery
- png_transparency_watermark.rs - Create watermarks, blend modes, and transparent overlays
- cjk_text_extraction.rs - Work with Chinese, Japanese, and Korean text
- basic_chunking.rs - Document chunking for AI/RAG pipelines
- rag_pipeline.rs - Complete RAG workflow with embeddings

Run any example:
cargo run --example recovery_corrupted_pdf
cargo run --example png_transparency_watermark
cargo run --example cjk_text_extraction

- ✅ Multi-page documents
- ✅ Vector graphics (rectangles, circles, paths, lines)
- ✅ Text rendering with standard fonts (Helvetica, Times, Courier)
- ✅ JPEG and PNG image embedding with transparency
- ✅ Transparency groups, blend modes, and opacity control
- ✅ RGB, CMYK, and Grayscale colors
- ✅ Graphics transformations (translate, rotate, scale)
- ✅ Text flow with automatic line wrapping
- ✅ FlateDecode compression
- ✅ PDF 1.0 - 1.7 basic structure support
- ✅ Cross-reference table parsing with automatic recovery
- ✅ XRef streams (PDF 1.5+) and object streams
- ✅ Object and stream parsing with corruption tolerance
- ✅ Page tree navigation with circular reference detection
- ✅ Content stream parsing (basic operators)
- ✅ Text extraction with CJK (Chinese, Japanese, Korean) support
- ✅ CMap and ToUnicode parsing for complex encodings
- ✅ Document metadata extraction
- ✅ Filter support (FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode)
- ✅ Lenient parsing with multiple error recovery strategies
- ✅ Split by pages, ranges, or size
- ✅ Merge multiple PDFs
- ✅ Rotate pages (90°, 180°, 270°)
- ✅ Basic content preservation
- ✅ Tesseract OCR integration with feature flag
- ✅ Multi-language support (50+ languages)
- ✅ Page analysis and scanned page detection
- ✅ Configurable preprocessing (denoise, deskew, contrast)
- ✅ Layout preservation with position information
- ✅ Confidence scoring and filtering
- ✅ Multiple page segmentation modes (PSM)
- ✅ Character whitelisting/blacklisting
- ✅ Mock OCR provider for testing
- ✅ Parallel and batch processing
Validated Metrics (based on comprehensive benchmarking):
- PDF Generation: 3,000-4,000 pages/second for realistic business documents
- Complex Content: 670 pages/second for dense analytics dashboards
- PDF Parsing: 35.9 PDFs/second (99.3% success rate on 9,000+ real-world PDFs)
- Memory Efficient: Streaming operations available for large documents
- Pure Rust: No external C dependencies for PDF operations
See PERFORMANCE_HONEST_REPORT.md for detailed benchmarking methodology and results.
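
To put these rates in context, the following sketch converts a throughput figure into an estimated wall-clock time. It assumes a sustained rate and linear scaling, which real workloads may not achieve. The function name `seconds_for` and the 100,000-page job size are illustrative choices, not taken from the benchmark report.

```rust
/// Estimated seconds to process `items` at a sustained rate of `items_per_second`.
/// Assumes linear scaling; actual results depend on hardware and content.
fn seconds_for(items: u64, items_per_second: f64) -> f64 {
    items as f64 / items_per_second
}

fn main() {
    // Hypothetical job size, chosen only for illustration.
    let pages: u64 = 100_000;

    // Rates quoted in the metrics above.
    let generation_rates = [
        ("business documents (low end)", 3_000.0),
        ("business documents (high end)", 4_000.0),
        ("dense analytics dashboards", 670.0),
    ];

    for (label, rate) in generation_rates {
        println!("{}: {:.1} s for {} pages", label, seconds_for(pages, rate), pages);
    }

    // Parsing a 9,000-PDF corpus at 35.9 PDFs/second.
    println!("parsing 9,000 PDFs: {:.1} s", seconds_for(9_000, 35.9));
}
```

Under these assumptions, 100,000 business-document pages take roughly 25 to 33 seconds to generate, and parsing a 9,000-PDF corpus takes a little over four minutes.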
Check out the examples directory for more usage patterns:
- hello_world.rs - Basic PDF creation
- graphics_demo.rs - Vector graphics showcase
- text_formatting.rs - Advanced text features
- custom_fonts.rs - TTF/OTF font loading and embedding
- jpeg_image.rs - Image embedding
- parse_pdf.rs - PDF parsing and text extraction
- comprehensive_demo.rs - All features demonstration
- tesseract_ocr_demo.rs - OCR text extraction (requires --features ocr-tesseract)
- scanned_pdf_analysis.rs - Analyze PDFs for scanned content
- extract_images.rs - Extract embedded images from PDFs
- create_pdf_with_images.rs - Advanced image embedding examples

Run examples with:
cargo run --example hello_world
# For OCR examples
cargo run --example tesseract_ocr_demo --features ocr-tesseract

This project is licensed under the MIT License - see the LICENSE file for details.
We prioritize transparency about what works and what doesn't.
- ✅ Compression: FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode, CCITTFaxDecode, JBIG2Decode
- ✅ Color Spaces: DeviceRGB, DeviceCMYK, DeviceGray
- ✅ Fonts: Standard 14 fonts + TTF/OTF custom font loading and embedding
- ✅ Images: JPEG embedding, raw RGB/Gray data, PNG with transparency
- ✅ Operations: Split, merge, rotate, page extraction, text extraction
- ✅ Graphics: Vector operations, clipping paths, transparency (CA/ca)
- ✅ Encryption: RC4 40/128-bit, AES-128/256, AES-256 R5/R6
- ✅ Forms: Basic text fields, checkboxes, radio buttons, combo boxes, list boxes
- ✅ Digital Signatures: Detection + PKCS#7 verification + certificate validation (signing not yet supported)
- ✅ PDF/A Validation: 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- ✅ JBIG2 Decoding: Full pure Rust decoder (ITU-T T.88)
- 🚧 Form Interactions: Forms can be created but not edited interactively
- 🚧 Tagged PDFs: Structure tree API (partial — no marked content operators)
- ❌ Rendering: No PDF to image conversion
- ❌ JPXDecode: JPEG 2000 compression not supported
- ❌ Advanced Graphics: Complex patterns, shadings, gradients
- ❌ Digital Signing: Signature creation (verification works, signing does not)
- ❌ Advanced Color: ICC profiles, spot colors, Lab color space
- ❌ JavaScript: No form calculations or validation scripts
- ❌ Multimedia: No sound, video, or 3D content support
- Parsing success doesn't mean full feature support
- Many PDFs will parse but advanced features will be ignored
oxidize-pdf/
├── oxidize-pdf-core/ # Core PDF library (MIT)
├── oxidize-pdf-api/ # REST API server
├── oxidize-pdf-cli/ # CLI interface
├── test-corpus/ # 9,000+ PDFs across 7 tiers (T0-T6)
├── docs/ # Documentation
├── dev-tools/ # Development utilities
├── benches/ # Benchmarks
└── lints/ # Custom Clippy lints
See REPOSITORY_ARCHITECTURE.md for detailed information.
oxidize-pdf uses a 7-tier corpus (T0-T6) with 9,000+ PDFs and 6,400+ tests:
| Tier | Description | PDFs | Purpose |
|---|---|---|---|
| T0 | Synthetic | Generated | Unit tests, CI/CD |
| T1 | Reference | ~1,300 | pdf.js, pdfium, poppler suites |
| T2 | Real-world | ~7,000 | GovDocs, academic, corporate |
| T3 | Stress | ~200 | Malformed, edge cases |
| T4 | Performance | ~100 | Benchmarking targets |
| T5 | Quality | ~300 | Text extraction accuracy |
| T6 | Adversarial | ~100 | Security, fuzzing |
# Run standard test suite (T0 — synthetic PDFs, runs in CI)
cargo test --workspace
# Run corpus tests (requires downloaded corpus)
cargo test --test t1_spec # T1: reference suite
cargo test --test t2_realworld # T2: real-world PDFs
# Run OCR tests (requires Tesseract installation)
cargo test tesseract_ocr_tests --features ocr-tesseract -- --ignored

The T0 tier runs in CI with zero external dependencies. T1-T6 tiers require downloading the corpus (~15 GB); see test-corpus/ for setup instructions.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
oxidize-pdf is under active development. Our focus areas include:
- Parsing & Compatibility: Improving support for diverse PDF structures
- Core Operations: Enhancing split, merge, and manipulation capabilities
- Performance: Optimizing memory usage and processing speed
- Stability: Addressing edge cases and error handling
- Extended Format Support: Additional image formats and encodings
- Advanced Text Processing: Improved text extraction and layout analysis
- Enterprise Features: Features designed for production use at scale
- Developer Experience: Better APIs, documentation, and tooling
- Comprehensive PDF standard compliance for common use cases
- Production-ready reliability and performance
- Rich ecosystem of tools and integrations
- Sustainable open source development model
We prioritize features based on community feedback and real-world usage. Have a specific need? Open an issue to discuss!
Built with ❤️ using Rust. Special thanks to the Rust community and all contributors.