An AI-powered system that transforms book PDFs into visual story illustrations

ananya-kkk/Book2Vision

PDF to Visual Story Generator - Complete Implementation

🎯 Project Overview

A complete end-to-end system that transforms PDF documents into visual stories using AI. The pipeline extracts text from PDFs, processes it into scenes, and generates high-quality images using OpenAI's DALL·E API.

Complete Pipeline:

PDF Document → Text Extraction → Scene Analysis → Image Generation → Visual Story

📊 Project Status

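| Phase   | Feature                            | Status      | Version |
|---------|------------------------------------|-------------|---------|
| Phase 1 | PDF Upload & Text Extraction       | ✅ Complete | 1.0.0   |
| Phase 2 | Text Processing & Scene Extraction | ✅ Complete | 2.0.0   |
| Phase 3 | Image Generation with DALL·E       | ✅ Complete | 3.0.0   |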

🏗️ Architecture Overview

System Components

┌──────────────────────────────────────────────────────────────────┐
│                         Frontend (Optional)                      │
│                    React UI for visualization                    │
└────────────────────────────┬─────────────────────────────────────┘
                             │ HTTP/REST
┌────────────────────────────▼─────────────────────────────────────┐
│                      FastAPI Backend                             │
│  ┌────────────────┬─────────────────┬─────────────────────┐      │
│  │   Phase 1      │    Phase 2      │      Phase 3        │      │
│  │ PDF → Text     │  Text → Scenes  │  Scenes → Images    │      │
│  └────────────────┴─────────────────┴─────────────────────┘      │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                      Service Layer                               │
│  ┌──────────────┬──────────────┬──────────────────────────┐      │
│  │ PDF Extract  │ Text Cleaner │  Prompt Generator        │      │
│  │              │ Summarizer   │  Image Generator         │      │
│  │              │ Scene Extract│  Image Storage           │      │
│  └──────────────┴──────────────┴──────────────────────────┘      │
└──────────────────────────────────────────────────────────────────┘

✨ Features

Phase 1: PDF Upload & Text Extraction

  • ✅ Upload PDF files via REST API
  • ✅ Page-by-page text extraction
  • ✅ Validates file type and content
  • ✅ Handles errors (scanned PDFs, corrupted files)
  • ✅ Returns structured JSON with page content
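
The validation and structured-output steps listed above could look roughly like the sketch below. It is not the project's code: the function names and the `page_number`, `text`, and `total_pages` fields are hypothetical, and only the top-level `pages` key is taken from the pipeline example in this README. The per-page strings are assumed to come from whatever extractor the backend uses.

```python
from typing import Dict, List

PDF_MAGIC = b"%PDF-"  # a conforming PDF file begins with this header


def looks_like_pdf(data: bytes) -> bool:
    """Cheap file-type check on the leading bytes of an upload."""
    return data.startswith(PDF_MAGIC)


def build_pages_payload(page_texts: List[str]) -> Dict:
    """Turn per-page text into a JSON-serializable structure.

    Raises ValueError when no page contains any text, which is the
    usual symptom of a scanned (image-only) PDF.
    """
    pages = [
        {"page_number": number, "text": text.strip()}  # hypothetical field names
        for number, text in enumerate(page_texts, start=1)
    ]
    if not any(page["text"] for page in pages):
        raise ValueError("No extractable text found; the PDF may be scanned.")
    return {"total_pages": len(pages), "pages": pages}
```

Checking the header bytes rather than the file extension rejects renamed non-PDF uploads early, before any parsing is attempted.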

Phase 2: Text Processing & Scene Extraction

  • ✅ Cleans and normalizes extracted text
  • ✅ Removes PDF artifacts (headers, footers, page numbers)
  • ✅ Fixes broken sentences from page breaks
  • ✅ Generates visual-focused summaries
  • ✅ Extracts scenes with subjects, settings, and moods
  • ✅ Identifies visual elements for image generation
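
As an illustration of the cleaning steps above, here is a minimal sketch using only the standard library. The function names and regular expressions are assumptions made for this example and may differ from what `text_cleaner.py` actually does.

```python
import re
from typing import Iterable

# Lines that contain only a page number, e.g. "12" or "Page 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def clean_page_text(raw: str) -> str:
    """Drop page-number lines, repair hyphenated line breaks, unwrap lines."""
    lines = [line.strip() for line in raw.splitlines()]
    lines = [line for line in lines if line and not PAGE_NUMBER.match(line)]
    text = "\n".join(lines)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split by a hyphen
    text = re.sub(r"\s*\n\s*", " ", text)         # remaining line breaks become spaces
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def join_across_pages(pages: Iterable[str]) -> str:
    """Merge pages, continuing a sentence that a page break cut off."""
    merged = ""
    for page in map(clean_page_text, pages):
        if not page:
            continue
        if not merged:
            merged = page
        elif merged[-1] in ".!?\"'":
            merged += "\n\n" + page   # previous page ended a sentence
        else:
            merged += " " + page      # sentence continues on the next page
    return merged
```

Both heuristics have known trade-offs: a body line consisting only of digits is treated as a page number, and a genuine compound such as "well-known" loses its hyphen when it is split across lines.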

Phase 3: Image Generation with DALL·E

  • ✅ DALL·E-optimized prompt engineering
  • ✅ Generates high-quality 1024×1024 images
  • ✅ Organized file system storage
  • ✅ URL-based image access
  • ✅ Automatic retry and error handling
  • ✅ Cost tracking and management
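
The retry and cost-tracking items above can be sketched generically as follows. `with_retry` and `CostTracker` are hypothetical names, and the per-image price is supplied by the caller because this README does not state one.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x the base delay
            attempt += 1


class CostTracker:
    """Accumulates an estimated spend from a caller-supplied price per image."""

    def __init__(self, price_per_image: float) -> None:
        self.price_per_image = price_per_image
        self.images = 0

    def record(self, count: int = 1) -> None:
        self.images += count

    @property
    def total(self) -> float:
        return self.images * self.price_per_image
```

A real integration would probably retry only on transient failures (timeouts, rate limits) rather than on every exception; the broad `except` here just keeps the sketch short.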

🚀 Quick Start

Prerequisites

# Required
Python 3.8+
pip (Python package manager)
OpenAI API key (for Phase 3)

# Recommended
Virtual environment (venv or conda)
Git (for version control)

Installation

# 1. Clone or download the project
cd pdf-story-generator/backend   # main.py, requirements.txt and .env.example live in backend/

# 2. Create virtual environment
python -m venv venv

# Activate (Linux/Mac)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# 5. Start the server
python main.py

Access Points


🔄 Complete Pipeline Example

# Step 1: Upload PDF (Phase 1)
curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@story.pdf" > phase1_output.json

# Step 2: Process Text (Phase 2)
cat phase1_output.json | jq '.pages' | \
curl -X POST "http://localhost:8000/process-text" \
  -H "Content-Type: application/json" \
  -d @- > phase2_output.json

# Step 3: Generate Images (Phase 3)
curl -X POST "http://localhost:8000/generate-images" \
  -H "Content-Type: application/json" \
  -d @phase2_output.json > phase3_output.json

# Step 4: View results
cat phase3_output.json | python -m json.tool

# Step 5: Access generated images
# Images are available at: http://localhost:8000/images/page_X/scene_Y.png

📁 Project Structure

pdf-story-generator/
├── backend/
│   ├── main.py                      # FastAPI application
│   ├── requirements.txt             # Python dependencies
│   ├── .env                         # Environment configuration
│   ├── .env.example                 # Configuration template
│   │
│   ├── services/                    # Business logic layer
│   │   ├── __init__.py
│   │   │
│   │   # Phase 2 Services
│   │   ├── text_cleaner.py         # Text cleaning & normalization
│   │   ├── summarizer.py           # Visual-focused summarization
│   │   ├── scene_extractor.py      # Scene detection & extraction
│   │   │
│   │   # Phase 3 Services
│   │   ├── prompt_generator.py     # DALL·E prompt engineering
│   │   ├── image_generator.py      # OpenAI API integration
│   │   └── image_storage.py        # File system storage
│   │
│   ├── generated_images/            # Generated image storage
│   │   ├── .gitkeep
│   │   ├── page_1/
│   │   │   ├── scene_1.png
│   │   │   └── scene_2.png
│   │   └── page_2/
│   │       └── scene_1.png
│   │
│   ├── test_phase2.py              # Phase 2 tests
│   └── test_phase3.py              # Phase 3 tests
│
├── frontend/                        # React UI (optional)
│   ├── src/
│   ├── public/
│   └── package.json
│
├── docs/
│   ├── PHASE1_README.md
│   ├── PHASE2_README.md
│   ├── PHASE3_README.md
│   ├── INTEGRATION_GUIDE.md
│   └── API_DOCUMENTATION.md
│
├── README.md                        # This file
└── .gitignore
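
The `generated_images/page_X/scene_Y.png` layout shown in the tree, together with the `/images/page_X/scene_Y.png` URLs from the pipeline example, suggests a simple mapping from page and scene numbers to paths. The helpers below are a sketch of that mapping; their names are hypothetical and are not taken from `image_storage.py`.

```python
from pathlib import Path


def scene_image_path(root: Path, page: int, scene: int) -> Path:
    """Return <root>/page_<page>/scene_<scene>.png (numbers are 1-based)."""
    if page < 1 or scene < 1:
        raise ValueError("page and scene numbers start at 1")
    return root / f"page_{page}" / f"scene_{scene}.png"


def scene_image_url(base_url: str, page: int, scene: int) -> str:
    """Return the URL under /images/ that mirrors the on-disk layout."""
    return f"{base_url.rstrip('/')}/images/page_{page}/scene_{scene}.png"


def save_image(root: Path, page: int, scene: int, data: bytes) -> Path:
    """Write PNG bytes to the scene's path, creating the page folder if needed."""
    path = scene_image_path(root, page, scene)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

Deriving both the path and the URL from the same two integers keeps the file system and the public URLs in step without a lookup table.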

🔧 Configuration

Environment Variables

Create a .env file in the backend directory:

# OpenAI Configuration (Required for Phase 3)
OPENAI_API_KEY=sk-your-openai-api-key-here

# Server Configuration
PORT=8000
HOST=0.0.0.0
ENVIRONMENT=development

# DALL·E Settings (Optional)
DALLE_MODEL=dall-e-3
DALLE_SIZE=1024x1024
DALLE_QUALITY=standard
DALLE_STYLE=vivid
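
One way the backend might read these variables is sketched below. The `Settings` class and `load_settings` function are hypothetical; the variable names and defaults are the ones listed above. The sketch reads the process environment only, so it assumes the `.env` file has already been loaded into it (for example by a dotenv loader).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    host: str
    port: int
    environment: str
    dalle_model: str
    dalle_size: str
    dalle_quality: str
    dalle_style: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, applying the README defaults."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is required for Phase 3")
    return Settings(
        openai_api_key=key,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        environment=env.get("ENVIRONMENT", "development"),
        dalle_model=env.get("DALLE_MODEL", "dall-e-3"),
        dalle_size=env.get("DALLE_SIZE", "1024x1024"),
        dalle_quality=env.get("DALLE_QUALITY", "standard"),
        dalle_style=env.get("DALLE_STYLE", "vivid"),
    )
```

Passing the environment as a parameter makes the loader easy to test with a plain dictionary instead of mutating `os.environ`.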

🎨 How It Works

Phase 1: Text Extraction

# Upload PDF → Extract text page-by-page
# Uses pdfplumber for high-quality extraction
# Validates file type and handles errors

Input: PDF file
Output: Structured text by page


Phase 2: Scene Processing

# Clean text → Summarize → Extract scenes
# Removes PDF artifacts
# Identifies visual elements (subjects, settings, moods)

Input: Raw extracted text
Output: Structured scenes with descriptions


Phase 3: Image Generation

# Generate prompt → Call DALL·E → Save image
# Optimized prompts for DALL·E
# Mood-to-lighting mapping
# Organized file storage

Input: Scene descriptions
Output: High-quality PNG images
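
The mood-to-lighting mapping mentioned above could be as simple as a dictionary lookup feeding a prompt template. The sketch below is illustrative only: the mood names, lighting phrases, default style, and the `build_prompt` function are all assumptions, not the contents of `prompt_generator.py`.

```python
from typing import Dict, List

# Hypothetical mapping from a scene's mood to a lighting description.
MOOD_LIGHTING: Dict[str, str] = {
    "joyful": "warm golden sunlight",
    "mysterious": "dim moonlight with deep shadows",
    "tense": "harsh high-contrast lighting",
    "calm": "soft diffused daylight",
}
DEFAULT_LIGHTING = "natural balanced lighting"


def build_prompt(subjects: List[str], setting: str, mood: str,
                 style: str = "digital illustration") -> str:
    """Compose an image prompt from a scene's subjects, setting, and mood."""
    mood_key = mood.strip().lower()
    lighting = MOOD_LIGHTING.get(mood_key, DEFAULT_LIGHTING)
    who = " and ".join(subjects) if subjects else "an empty scene"
    return f"{who} in {setting}, {mood_key} mood, {lighting}, {style}"
```

Unknown moods fall back to a neutral lighting phrase, so a scene with an unexpected mood label still yields a usable prompt.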



🚀 Deployment

Development

# Start server with auto-reload
python main.py

Production

# Use production ASGI server
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker

# Or with uvicorn
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

🎉 Summary

The PDF-to-Visual Story Generator is a complete, production-ready system that:

✅ Extracts text from PDF documents
✅ Processes text into visual scenes
✅ Generates high-quality images with DALL·E
✅ Provides a clean REST API
✅ Includes comprehensive documentation
✅ Has thorough test coverage
✅ Follows best practices

All three phases are complete and integrated!


Project Version: 3.0.0
Last Updated: January 2026
Status: ✅ Production Ready

Complete Pipeline: PDF → Text → Scenes → Images ✨
