The PDF Processing tool provides fast and efficient text and image extraction from PDF files, converting them to structured Markdown with embedded image references.
Note: This tool is disabled by default. To enable it, set the `ENABLE_ADDITIONAL_TOOLS` environment variable to include `pdf`.
It is a lightweight, fast alternative to the full Document Processing tool, optimised specifically for PDF files. It suits cases where you need quick text extraction without the overhead of advanced analysis features.
- Fast Extraction: Optimised for speed and efficiency
- Text and Images: Extract both text content and embedded images
- Page Ranges: Process specific pages or page ranges
- Markdown Output: Clean, structured markdown format
- Image Links: Properly linked images in markdown
- No Dependencies: Self-contained, no external requirements
- Cross-Platform: Works on macOS and Linux
The PDF Processing tool includes comprehensive security hardening to prevent resource exhaustion and ensure safe operation:
File size limits:
- Default limit: 200MB maximum file size
- Configurable: Set custom limits via the `PDF_MAX_FILE_SIZE` environment variable
- Prevention: Blocks processing of excessively large files that could consume system resources
- Error handling: Clear error messages with current and maximum allowed sizes

Memory limits:
- Default limit: 5GB maximum memory usage
- Configurable: Set custom limits via the `PDF_MAX_MEMORY_LIMIT` environment variable
- Protection: Prevents memory exhaustion during PDF processing operations
- Validation: Strict PDF validation to prevent malformed files from consuming excessive resources

```bash
# Set custom file size limit (bytes)
export PDF_MAX_FILE_SIZE=104857600 # 100MB
# Set custom memory limit (bytes)
export PDF_MAX_MEMORY_LIMIT=2147483648 # 2GB
```

Benefits of these limits:
- Resource protection: Prevents processing of maliciously large files
- System stability: Avoids memory exhaustion scenarios
- Predictable performance: Consistent processing times within defined limits
- Error transparency: Clear feedback when limits are exceeded
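
To make the file size limit concrete, here is a minimal Python sketch of how such a check could work. It is not the tool's implementation: the function names, the fallback to the default when the variable is unset or invalid, and the error wording are all assumptions. Only the `PDF_MAX_FILE_SIZE` variable and the 200MB default come from the text above.

```python
import os

# 200MB default, as documented above (binary megabytes, matching the 100MB example).
DEFAULT_MAX_FILE_SIZE = 200 * 1024 * 1024


def max_file_size() -> int:
    """Read PDF_MAX_FILE_SIZE in bytes; fall back to the default if unset or invalid.

    The fallback behaviour is an assumption for this sketch.
    """
    raw = os.environ.get("PDF_MAX_FILE_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_FILE_SIZE
    return value if value > 0 else DEFAULT_MAX_FILE_SIZE


def check_file_size(size_bytes: int) -> None:
    """Raise ValueError naming both the current and the maximum size when over the limit."""
    limit = max_file_size()
    if size_bytes > limit:
        raise ValueError(
            f"file size {size_bytes} bytes exceeds maximum allowed {limit} bytes"
        )
```

A caller would typically pass `os.path.getsize(path)` to `check_file_size` before reading the file, so that oversized inputs are rejected before any parsing begins.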
Use the PDF Processing tool when:
- ✅ Working with PDF files only
- ✅ Need fast extraction speed
- ✅ Want simple, lightweight processing
- ✅ Don't need OCR or diagram analysis
- ✅ Processing digital (non-scanned) PDFs

Use the Document Processing tool instead when:
- 📄 Working with multiple document formats (DOCX, XLSX, etc.)
- 🔍 Need OCR for scanned documents
- 🎨 Want diagram analysis and Mermaid generation
- ⚙️ Need advanced processing profiles
- 🧠 Require AI-powered content analysis
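
The checklist above can be summarised as a small decision helper. This is only an illustrative sketch: `choose_tool` and its parameters are hypothetical, and `"document processing"` is a placeholder label, not a confirmed tool name. Only `"pdf"` appears as a tool name in this document.

```python
PDF_TOOL = "pdf"
DOCUMENT_TOOL = "document processing"  # placeholder label, not a confirmed tool name


def choose_tool(
    file_extension: str,
    needs_ocr: bool = False,
    needs_diagram_analysis: bool = False,
) -> str:
    """Pick the lightweight PDF tool unless the job needs features it lacks."""
    ext = file_extension.lower().lstrip(".")
    if ext != "pdf" or needs_ocr or needs_diagram_analysis:
        return DOCUMENT_TOOL
    return PDF_TOOL
```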
First, enable the tool by setting the environment variable:

```bash
ENABLE_ADDITIONAL_TOOLS="pdf"
```

The tool is intended to be activated via a prompt to an agent, but some example JSON tool calls are shown below.

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/absolute/path/to/document.pdf"
  }
}
```

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/path/to/document.pdf",
    "extract_images": true,
    "output_dir": "/path/to/output"
  }
}
```

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/path/to/large-document.pdf",
    "pages": "1-5",
    "extract_images": true
  }
}
```

```json
// First 10 pages
{"pages": "1-10"}
// Specific pages
{"pages": "1,3,5,10"}
// Mixed ranges and pages
{"pages": "1-3,7,10-12"}
// All pages (default)
{"pages": "all"}
```

| Parameter | Description | Example |
|---|---|---|
| `file_path` | Absolute path to PDF file | `"/Users/john/documents/report.pdf"` |

| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_dir` | string | Same as PDF | Output directory for markdown and images |
| `extract_images` | boolean | `true` | Whether to extract embedded images |
| `pages` | string | `"all"` | Page range to process |
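
As a rough illustration of how the parameters fit together, the following Python sketch assembles a tool call with the documented defaults. `build_pdf_call` is a hypothetical helper, not part of the tool. Rejecting relative paths on the client side is an assumption based on the "absolute path" requirement, and `output_dir` is simply omitted when not given so that the tool applies its own default.

```python
from pathlib import PurePosixPath
from typing import Optional


def build_pdf_call(
    file_path: str,
    output_dir: Optional[str] = None,
    extract_images: bool = True,
    pages: str = "all",
) -> dict:
    """Build the payload for a `pdf` tool call using the documented defaults."""
    # POSIX-style check, since the tool is documented for macOS and Linux.
    if not PurePosixPath(file_path).is_absolute():
        raise ValueError(f"file_path must be absolute: {file_path!r}")
    arguments = {
        "file_path": file_path,
        "extract_images": extract_images,
        "pages": pages,
    }
    if output_dir is not None:
        arguments["output_dir"] = output_dir
    return {"name": "pdf", "arguments": arguments}
```

Passing the result through `json.dumps` yields a payload of the same shape as the example tool calls in this document.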

The `pages` parameter accepts the following formats:
- All pages: `"all"` (default)
- Range: `"1-5"` (pages 1 through 5)
- Specific pages: `"1,3,5"` (pages 1, 3, and 5)
- Mixed: `"1-3,7,10-12"` (pages 1-3, 7, and 10-12)
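
The page range formats above can be expanded into explicit page numbers with a short parser. The sketch below is one possible interpretation, not the tool's own parser: `parse_pages` is a hypothetical name, and raising an error for reversed or out-of-bounds ranges (rather than clamping them) is an assumption.

```python
from typing import List


def parse_pages(spec: str, page_count: int) -> List[int]:
    """Expand a pages string into a sorted list of unique 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))

    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "10-12" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start or end > page_count:
            raise ValueError(
                f"invalid page range {part!r} for a {page_count}-page document"
            )
        selected.update(range(start, end + 1))
    return sorted(selected)
```

For a 12-page document, `parse_pages("1-3,7,10-12", 12)` returns `[1, 2, 3, 7, 10, 11, 12]`.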

```
/path/to/document.pdf
├── document.md          # Extracted markdown content
└── images/              # Extracted images directory
    ├── image_001.png
    ├── image_002.jpg
    └── ...
```

```markdown
# Document Title
## Page 1
Document content from first page...

More content...
## Page 2
Content from second page...
```

```json
{
  "success": true,
  "message": "PDF processed successfully",
  "markdown_file": "/path/to/document.md",
  "images_extracted": 5,
  "images_directory": "/path/to/document/images",
  "pages_processed": 10,
  "processing_time": 2.3,
  "file_size": 1024000
}
```

```json
{
  "success": false,
  "error": "File not found: /invalid/path/document.pdf",
  "file_path": "/invalid/path/document.pdf"
}
```

Extract text for quick review or analysis:

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/downloads/research-paper.pdf",
    "extract_images": false
  }
}
```

Convert PDF documentation to markdown:

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/docs/api-reference.pdf",
    "extract_images": true,
    "output_dir": "/project/docs"
  }
}
```

Process specific sections of large documents:

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/reports/annual-report-2024.pdf",
    "pages": "5-15",
    "extract_images": true
  }
}
```

Extract specific pages for further processing:

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/contracts/agreement.pdf",
    "pages": "1,5,10-12",
    "extract_images": false
  }
}
```

Typical processing times:
- Small PDFs (< 10 pages): 1-3 seconds
- Medium PDFs (10-50 pages): 3-15 seconds
- Large PDFs (50+ pages): 15+ seconds

Factors affecting processing time:
- Page count: Processing time grows roughly linearly with the number of pages
- Image content: Images slow down processing
- Text complexity: Tables and complex layouts take longer
- File size: Large embedded images impact speed

Memory usage:
- Text processing: Low memory usage
- Image extraction: Moderate memory for large images
- Large documents: Memory usage scales with content
| Feature | PDF Processing | Document Processing |
|---|---|---|
| Speed | ⚡ Fast (1-15 seconds) | 🐌 Slower (10-60+ seconds) |
| File Types | PDF only | PDF, DOCX, XLSX, PPTX, HTML, CSV, images |
| Dependencies | None | Python 3.10+, Docling |
| OCR Support | ❌ No | ✅ Yes |
| Diagram Analysis | ❌ No | ✅ Yes with AI models |
| Setup Complexity | ✅ Simple | ⚙️ Complex |
| Resource Usage | 🟢 Low | 🟡 Moderate to High |

```bash
# 1. Quick PDF extraction
pdf_extract="/path/to/research.pdf"
# 2. Process with think tool
think "I've extracted the research paper content. Let me analyse the key findings and methodology before proceeding with implementation."
# 3. Store key insights
memory create_entities --data '{"entities": [{"name": "Research_Paper_2024", "type": "document", "observations": ["Novel approach to distributed consensus", "Improves performance by 40%"]}]}'
```

```bash
# 1. Extract PDF documentation
pdf_extract="/technical-specs/api-guide.pdf" --pages="1-20"
# 2. Search for additional information
internet_search "REST API best practices 2024"
# 3. Combine insights for implementation
think "The PDF provides specific implementation details, while the search results show current best practices. I'll combine both for the recommended approach."
```

```bash
# 1. Extract content from multiple PDFs
pdf_extract="/reports/q1-report.pdf" --pages="1-5"
pdf_extract="/reports/q2-report.pdf" --pages="1-5"
# 2. Store extracted insights
memory create_entities --namespace="quarterly_reports" --data='{...}'
# 3. Analyse trends
think "Comparing Q1 and Q2 reports, I can see a clear trend in customer acquisition costs and revenue growth patterns."
```

Common errors:
- File not found: Invalid file path
- Permission denied: Insufficient file access rights
- Corrupted PDF: Damaged or invalid PDF file
- Unsupported PDF: Encrypted or password-protected PDFs
- Disk space: Insufficient space for output files

Troubleshooting tips:
- Use absolute paths: Avoid relative path issues
- Check file existence: Verify file exists before processing
- Verify permissions: Ensure read access to PDF and write access to output directory
- Test with small files: Validate setup with simple PDFs first
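
The path, existence, and permission checks listed above can be run before calling the tool. The following Python sketch shows one way to do so. `preflight` is a hypothetical helper, and reading the "Same as PDF" default as "the directory containing the PDF" is an assumption made for this example.

```python
import os
from typing import List, Optional


def preflight(file_path: str, output_dir: Optional[str] = None) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not os.path.isabs(file_path):
        problems.append(f"path is not absolute: {file_path}")
    if not os.path.isfile(file_path):
        problems.append(f"file not found: {file_path}")
    elif not os.access(file_path, os.R_OK):
        problems.append(f"no read permission: {file_path}")
    # Assumed default: output goes next to the PDF when output_dir is not given.
    target_dir = output_dir or os.path.dirname(os.path.abspath(file_path))
    if os.path.isdir(target_dir) and not os.access(target_dir, os.W_OK):
        problems.append(f"no write permission: {target_dir}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, which is convenient when validating a setup for the first time.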

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/source/document.pdf",
    "output_dir": "/organised/content/document-name",
    "extract_images": true,
    "pages": "1-10"
  }
}
```

```json
// Process table of contents and summary
{
  "name": "pdf",
  "arguments": {
    "file_path": "/reports/annual-report.pdf",
    "pages": "1-3,50-55",
    "extract_images": false
  }
}
```

```json
{
  "name": "pdf",
  "arguments": {
    "file_path": "/presentations/slides.pdf",
    "extract_images": true,
    "pages": "10-20"
  }
}
```

For technical implementation details, see the PDF Processing source documentation.