PDF to JSON Converter

This Python script extracts text and structure from PDF files and converts them into a structured JSON format. It is designed to handle various PDF structures, filter out noise, and provide a clean, hierarchical JSON output suitable for further processing or analysis.

Reason

The primary motivation for building this PDF-to-JSON converter is to enhance the usability of structured data for AI systems and advanced file-parsing techniques. Unlike traditional flat formats, a hierarchical JSON structure preserves the document’s logical organization, making it easier to process, analyze, and extract meaningful insights. By converting PDFs into structured JSON, this tool enables seamless integration with machine learning models, natural language processing applications, and other automated systems that rely on well-organized data.

Features

Extracts metadata such as title, author, and creation date.
Processes nested outlines to capture the document's hierarchical structure.
Filters out common noise like page numbers, headers, and footers.
Detects images and includes placeholders in the JSON output.
Standardizes metadata fields for consistency.
Installation

To get started, follow these steps:

Clone the repository: git clone https://github.com/Druspect/PDF2Json
Navigate to the project directory: cd pdf-to-json-converter
Install the required Python packages: pip install -r requirements.txt

Usage

Run the script from the command line using the following syntax:

python pdf_to_json.py <input_pdf_path> <output_json_path>
<input_pdf_path>: Path to the input PDF file.
<output_json_path>: Path to the output JSON file where the result will be saved.

Example

python pdf_to_json.py sample.pdf output.json

Output Format

The output JSON file is organized into two main sections:

metadata: Contains document-level information such as title, author, and creation date.
chapters: An array of chapter objects (or content if no outlines are present), each including:
- title: The chapter title.
- level: The nesting level (0 for top-level chapters).
- start_page: The starting page number.
- pages: An array of text content extracted from each page.
- images: An array of image placeholders (e.g., "[Image]").
- Sample JSON Snippet

{ "metadata": { "Title": "Sample Document", "Author": "John Doe", "CreationDate": "2023-01-01T12:00:00Z" }, "chapters": [ { "title": "Introduction", "level": 0, "start_page": 1, "pages": ["This is the introduction..."], "images": ["[Image]"] }, { "title": "Chapter 1", "level": 0, "start_page": 2, "pages": ["Chapter 1 content..."], "images": [] } ] }

Block 8: Error Handling

Error Handling

The script includes robust error handling for scenarios such as:

Missing or invalid PDF files.
Corrupt PDF documents.
Extraction failures, with a fallback to flat text extraction if structured parsing fails.

Errors are logged using Python's logging module to assist with debugging.

Contributing

We welcome contributions! To contribute, please follow these steps:

Fork the repository.
Create a new branch: git checkout -b feature-branch
Make your changes and commit them: git commit -m 'Add new feature'
Push to the branch: git push origin feature-branch
Open a pull request on GitHub.

Acknowledgments

PyPDF2: For PDF parsing and extraction capabilities.
pdfplumber: For advanced PDF layout analysis (optional dependency).

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
LICENSE		LICENSE
README.md		README.md
pdf_to_json.py		pdf_to_json.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF to JSON Converter

Reason

Features

Installation

Usage

Example

Output Format

Sample JSON Snippet

Block 8: Error Handling

Error Handling

Contributing

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

Druspect/PDF2Json

Folders and files

Latest commit

History

Repository files navigation

PDF to JSON Converter

Reason

Features

Installation

Usage

Example

Output Format

Sample JSON Snippet

Block 8: Error Handling

Error Handling

Contributing

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages