A complete end-to-end system that transforms PDF documents into visual stories using AI. The pipeline extracts text from PDFs, processes it into scenes, and generates high-quality images using OpenAI's DALL·E API.
Complete Pipeline:
PDF Document → Text Extraction → Scene Analysis → Image Generation → Visual Story
| Phase | Feature | Status | Version |
|---|---|---|---|
| Phase 1 | PDF Upload & Text Extraction | ✅ Complete | 1.0.0 |
| Phase 2 | Text Processing & Scene Extraction | ✅ Complete | 2.0.0 |
| Phase 3 | Image Generation with DALL·E | ✅ Complete | 3.0.0 |
┌─────────────────────────────────────────────────────────────────┐
│                       Frontend (Optional)                       │
│                   React UI for visualization                    │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTP/REST
┌────────────────────────────▼────────────────────────────────────┐
│                         FastAPI Backend                         │
│  ┌────────────────┬─────────────────┬─────────────────────┐     │
│  │    Phase 1     │     Phase 2     │       Phase 3       │     │
│  │   PDF → Text   │  Text → Scenes  │   Scenes → Images   │     │
│  └────────────────┴─────────────────┴─────────────────────┘     │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                          Service Layer                          │
│  ┌──────────────┬──────────────┬──────────────────────────┐     │
│  │ PDF Extract  │ Text Cleaner │ Prompt Generator         │     │
│  │              │ Summarizer   │ Image Generator          │     │
│  │              │ Scene Extract│ Image Storage            │     │
│  └──────────────┴──────────────┴──────────────────────────┘     │
└─────────────────────────────────────────────────────────────────┘
- ✅ Upload PDF files via REST API
- ✅ Page-by-page text extraction
- ✅ Validates file type and content
- ✅ Handles errors (scanned PDFs, corrupted files)
- ✅ Returns structured JSON with page content
- ✅ Cleans and normalizes extracted text
- ✅ Removes PDF artifacts (headers, footers, page numbers)
- ✅ Fixes broken sentences from page breaks
- ✅ Generates visual-focused summaries
- ✅ Extracts scenes with subjects, settings, and moods
- ✅ Identifies visual elements for image generation
- ✅ DALL·E-optimized prompt engineering
- ✅ Generates high-quality 1024×1024 images
- ✅ Organized file system storage
- ✅ URL-based image access
- ✅ Automatic retry and error handling
- ✅ Cost tracking and management
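
The "automatic retry and error handling" item can be implemented in many ways. The following is a minimal sketch of one common approach, exponential backoff, and is not taken from the project's code. The names `retry_with_backoff`, `max_attempts`, and `base_delay` are hypothetical, and the `sleep` parameter exists only so the helper can be exercised without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is reached.

    Hypothetical helper: waits base_delay * 2**n seconds after the n-th
    failed attempt (n starting at 0) and re-raises the last error.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # A real service would likely catch only transient errors
            # (timeouts, rate limits) instead of every exception.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

An image-generation call could be wrapped as `retry_with_backoff(lambda: generate(scene))`, where `generate` stands in for whatever function performs the request.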
# Required
Python 3.8+
pip (Python package manager)
OpenAI API key (for Phase 3)
# Recommended
Virtual environment (venv or conda)
Git (for version control)

# 1. Clone or download the project
cd pdf-story-generator/backend
# 2. Create virtual environment
python -m venv venv
# Activate (Linux/Mac)
source venv/bin/activate
# Activate (Windows)
venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# 5. Start the server
python main.py

- API Server: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
# Step 1: Upload PDF (Phase 1)
curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@story.pdf" > phase1_output.json
# Step 2: Process Text (Phase 2)
cat phase1_output.json | jq '.pages' | \
  curl -X POST "http://localhost:8000/process-text" \
    -H "Content-Type: application/json" \
    -d @- > phase2_output.json
# Step 3: Generate Images (Phase 3)
curl -X POST "http://localhost:8000/generate-images" \
  -H "Content-Type: application/json" \
  -d @phase2_output.json > phase3_output.json
# Step 4: View results
cat phase3_output.json | python -m json.tool
# Step 5: Access generated images
# Images are available at: http://localhost:8000/images/page_X/scene_Y.png

pdf-story-generator/
├── backend/
│   ├── main.py                    # FastAPI application
│   ├── requirements.txt           # Python dependencies
│   ├── .env                       # Environment configuration
│   ├── .env.example               # Configuration template
│   │
│   ├── services/                  # Business logic layer
│   │   ├── __init__.py
│   │   │
│   │   # Phase 2 Services
│   │   ├── text_cleaner.py        # Text cleaning & normalization
│   │   ├── summarizer.py          # Visual-focused summarization
│   │   ├── scene_extractor.py     # Scene detection & extraction
│   │   │
│   │   # Phase 3 Services
│   │   ├── prompt_generator.py    # DALL·E prompt engineering
│   │   ├── image_generator.py     # OpenAI API integration
│   │   └── image_storage.py       # File system storage
│   │
│   ├── generated_images/          # Generated image storage
│   │   ├── .gitkeep
│   │   ├── page_1/
│   │   │   ├── scene_1.png
│   │   │   └── scene_2.png
│   │   └── page_2/
│   │       └── scene_1.png
│   │
│   ├── test_phase2.py             # Phase 2 tests
│   └── test_phase3.py             # Phase 3 tests
│
├── frontend/                      # React UI (optional)
│   ├── src/
│   ├── public/
│   └── package.json
│
├── docs/
│   ├── PHASE1_README.md
│   ├── PHASE2_README.md
│   ├── PHASE3_README.md
│   ├── INTEGRATION_GUIDE.md
│   └── API_DOCUMENTATION.md
│
├── README.md                      # This file
└── .gitignore
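
The `generated_images/page_X/scene_Y.png` layout in the tree lines up with the `/images/page_X/scene_Y.png` URLs shown in the workflow. The sketch below illustrates that mapping, assuming the server exposes `generated_images/` under `/images`. The function names and the default base URL are hypothetical and are not taken from `image_storage.py`.

```python
from pathlib import Path

IMAGE_ROOT = Path("generated_images")  # directory name from the project tree
URL_PREFIX = "/images"                 # assumed mount point, inferred from the example URL


def scene_relative_path(page: int, scene: int) -> str:
    """Relative path shared by the on-disk layout and the URL."""
    if page < 1 or scene < 1:
        raise ValueError("page and scene numbers start at 1")
    return f"page_{page}/scene_{scene}.png"


def scene_file_path(page: int, scene: int, root: Path = IMAGE_ROOT) -> Path:
    """Location of the PNG on disk."""
    return root / scene_relative_path(page, scene)


def scene_url(page: int, scene: int, base_url: str = "http://localhost:8000") -> str:
    """URL under which the same image would be served."""
    return f"{base_url.rstrip('/')}{URL_PREFIX}/{scene_relative_path(page, scene)}"
```

With FastAPI, one way to expose such a directory is `app.mount("/images", StaticFiles(directory="generated_images"), name="images")`, using `StaticFiles` from `fastapi.staticfiles`. Whether the project does it this way is not stated here.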
Create a .env file in the backend directory:
# OpenAI Configuration (Required for Phase 3)
OPENAI_API_KEY=sk-your-openai-api-key-here
# Server Configuration
PORT=8000
HOST=0.0.0.0
ENVIRONMENT=development
# DALL·E Settings (Optional)
DALLE_MODEL=dall-e-3
DALLE_SIZE=1024x1024
DALLE_QUALITY=standard
DALLE_STYLE=vivid

# Upload PDF → Extract text page-by-page
# Uses pdfplumber for high-quality extraction
# Validates file type and handles errors

Input: PDF file
Output: Structured text by page
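
As a rough illustration of the Phase 1 steps (validate, extract page by page, return structured output), here is a minimal sketch. It assumes the response carries a top-level `pages` key, as the `.pages` filter in the curl workflow suggests. The other field names (`page_number`, `text`, `total_pages`) and all function names are hypothetical. `pdfplumber` is a third-party package and is imported lazily, so the helper functions work without it.

```python
from typing import Dict, List, Optional


def looks_like_pdf(data: bytes) -> bool:
    """Cheap file-type check: PDF files normally begin with the bytes %PDF-."""
    return data.startswith(b"%PDF-")


def build_pages_payload(page_texts: List[Optional[str]]) -> Dict[str, object]:
    """Turn per-page text into a JSON-ready dict (field names are assumptions)."""
    pages = []
    for number, text in enumerate(page_texts, start=1):
        pages.append({"page_number": number, "text": (text or "").strip()})
    if pages and all(not page["text"] for page in pages):
        # No text on any page usually means a scanned (image-only) PDF.
        raise ValueError("no extractable text; the PDF may be scanned")
    return {"total_pages": len(pages), "pages": pages}


def extract_pdf(path: str) -> Dict[str, object]:
    """Read a PDF from disk with pdfplumber and build the payload."""
    import pdfplumber  # third-party; imported here so the helpers above need no extras

    with open(path, "rb") as handle:
        if not looks_like_pdf(handle.read(5)):
            raise ValueError("not a PDF file")
    with pdfplumber.open(path) as pdf:
        # extract_text() may yield None or an empty string for pages without text.
        return build_pages_payload([page.extract_text() for page in pdf.pages])
```

Keeping the payload construction separate from the file reading makes the scanned-PDF check easy to test without a real document.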
# Clean text → Summarize → Extract scenes
# Removes PDF artifacts
# Identifies visual elements (subjects, settings, moods)

Input: Raw extracted text
Output: Structured scenes with descriptions
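
The cleaning step can be sketched with a few regular-expression heuristics. This is an illustration under stated assumptions, not the contents of `text_cleaner.py`: the function names and patterns are hypothetical, and the heuristics are deliberately simple (a line holding only a number is treated as a page number, and a hyphen at a line end is treated as a split word).

```python
import re
from typing import List

# Lines such as "12", "Page 3", or "page 3 of 10" are treated as page numbers.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)
SENTENCE_END = re.compile(r'[.!?][")]?$')
FIRST_SENTENCE = re.compile(r'^(.*?[.!?][")]?)(?:\s+|$)', re.DOTALL)


def clean_page_text(raw: str) -> str:
    """Drop page-number lines, rejoin hyphenated words, and unwrap lines."""
    lines = [line.strip() for line in raw.splitlines()]
    lines = [line for line in lines if line and not PAGE_NUMBER.match(line)]
    text = "\n".join(lines)
    # "adven-\nture" -> "adventure" (heuristic: also joins real compounds).
    text = re.sub(r"(\w)-\n([a-z])", r"\1\2", text)
    # Remaining line breaks are treated as soft wraps inside a paragraph.
    text = text.replace("\n", " ")
    return re.sub(r"\s{2,}", " ", text).strip()


def join_across_pages(pages: List[str]) -> List[str]:
    """If a page ends mid-sentence, pull the sentence's tail from the next page."""
    fixed = list(pages)
    for i in range(len(fixed) - 1):
        if fixed[i] and not SENTENCE_END.search(fixed[i]):
            match = FIRST_SENTENCE.match(fixed[i + 1])
            if match:
                fixed[i] = f"{fixed[i]} {match.group(1)}"
                fixed[i + 1] = fixed[i + 1][match.end():]
    return fixed
```

Running `clean_page_text` on each page first and `join_across_pages` on the results keeps the two concerns separate: per-page artifacts, then sentences split by page breaks.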
# Generate prompt → Call DALL·E → Save image
# Optimized prompts for DALL·E
# Mood-to-lighting mapping
# Organized file storage

Input: Scene descriptions
Output: High-quality PNG images
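
To make the mood-to-lighting idea concrete, here is a minimal sketch of a prompt builder and of request parameters derived from the `DALLE_*` settings. The mapping table, the prompt template, and the function names are hypothetical, since the project's actual prompts are not shown. The defaults mirror the values in the `.env` example.

```python
import os
from typing import Dict, List, Mapping

# Hypothetical mood-to-lighting table; the project's real mapping may differ.
MOOD_LIGHTING = {
    "peaceful": "soft golden-hour light",
    "tense": "harsh shadows and high contrast",
    "mysterious": "dim, foggy moonlight",
    "joyful": "bright, warm daylight",
}
DEFAULT_LIGHTING = "natural, balanced lighting"


def build_prompt(subjects: List[str], setting: str, mood: str) -> str:
    """Compose one image prompt from a scene's subjects, setting, and mood."""
    lighting = MOOD_LIGHTING.get(mood.lower(), DEFAULT_LIGHTING)
    who = ", ".join(subjects) if subjects else "an empty scene"
    return f"{who} in {setting}, {mood.lower()} mood, {lighting}, detailed illustration"


def build_image_request(prompt: str, env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Collect request parameters, letting the DALLE_* settings override defaults."""
    return {
        "model": env.get("DALLE_MODEL", "dall-e-3"),
        "prompt": prompt,
        "size": env.get("DALLE_SIZE", "1024x1024"),
        "quality": env.get("DALLE_QUALITY", "standard"),
        "style": env.get("DALLE_STYLE", "vivid"),
        "n": 1,  # DALL·E 3 generates one image per request
    }
```

With the official `openai` Python package (1.x), a dict like this could be passed as `OpenAI().images.generate(**request)`. How `image_generator.py` actually issues the call is not shown in this document.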
# Start server with auto-reload
python main.py

# Use production ASGI server
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
# Or with uvicorn
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

The PDF-to-Visual Story Generator is a complete, production-ready system that:
- ✅ Extracts text from PDF documents
- ✅ Processes text into visual scenes
- ✅ Generates high-quality images with DALL·E
- ✅ Provides a clean REST API
- ✅ Includes comprehensive documentation
- ✅ Has thorough test coverage
- ✅ Follows best practices

All three phases are complete and integrated!

Project Version: 3.0.0
Last Updated: January 2026
Status: ✅ Production Ready
Complete Pipeline: PDF → Text → Scenes → Images ✨