A high-performance Python utility for cleaning PDF libraries by removing OceanofPDFs.com watermarks and normalizing filenames. Designed to handle large collections (10,000+ files) efficiently with smart processing modes and cloud-sync awareness.
- Text Removal: Detects and removes all variations of "OceanofPDFs.com" watermarks
- Handles spaced variants: `O c e a n o f P D F s . c o m`
- Case-insensitive pattern matching
- White-out redaction preserves document structure
- Link Removal: Deletes hyperlink annotations pointing to OceanofPDFs.com
- Removes blue underline artifacts
- Cleans clickable watermarks
- Two-Pass Optimization:
- Fast link removal on all pages
- Text redaction only when watermarks detected
- Malformed PDF Handling: Processes broken PDFs safely with comprehensive error handling
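The two-pass approach above can be illustrated with a short PyMuPDF sketch. This is a simplified illustration rather than the script's exact implementation; the `clean_pdf` helper and the `WATERMARKS` list are assumptions made for the example.

```python
import fitz  # PyMuPDF

# Illustrative variant list; the actual script matches more spellings.
WATERMARKS = ["OceanofPDFs.com", "O c e a n o f P D F s . c o m"]

def clean_pdf(src: str, dst: str) -> int:
    """Two-pass cleanup: drop watermark links, then redact watermark text."""
    hits = 0
    doc = fitz.open(src)
    for page in doc:
        # Pass 1: remove hyperlink annotations pointing at the watermark domain.
        for link in page.get_links():
            if "oceanofpdf" in (link.get("uri") or "").lower():
                page.delete_link(link)
        # Pass 2: redact visible watermark text only on pages that contain it,
        # filling the redacted area with white to preserve the page layout.
        rects = [r for text in WATERMARKS for r in page.search_for(text)]
        for rect in rects:
            page.add_redact_annot(rect, fill=(1, 1, 1))
        if rects:
            page.apply_redactions()
            hits += len(rects)
    doc.save(dst)
    doc.close()
    return hits
```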
Automatically renames PDFs following consistent, human-readable patterns:
Rule 1: Prefix cleanup and author-title reordering
`_OceanofPDFs.com_The_Great_Gatsby_-_F._Scott_Fitzgerald.pdf` → `F. Scott Fitzgerald - The Great Gatsby.pdf`
Rule 2: Z-Library suffix removal
`The_Great_Gatsby_ (Z-Library).pdf` → `The Great Gatsby.pdf`
Rule 3: Underscore normalization
`Book___Title___With____Underscores.pdf` → `Book Title With Underscores.pdf`
- ✅ Invalid Windows characters removed (`\ / : * ? " < > |`)
- ✅ Automatic collision detection with incremental naming: `Book.pdf` → `Book (1).pdf` → `Book (2).pdf`
- ✅ Whitespace normalization
- ✅ Optional: disable renaming with the `--no-rename` flag
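The renaming rules above can be approximated with a few regular expressions. The sketch below is illustrative only; `normalize_name` and `unique_path` are hypothetical helpers, and the shipped script may apply additional rules.

```python
import re
from pathlib import Path

def normalize_name(stem: str) -> str:
    """Apply the three renaming rules to a filename stem (extension excluded)."""
    # Rule 1: drop the OceanofPDFs.com prefix, then reorder "Title_-_Author".
    stem = re.sub(r"^_?OceanofPDFs\.com_", "", stem, flags=re.IGNORECASE)
    match = re.match(r"(?P<title>.+?)_-_(?P<author>.+)$", stem)
    if match:
        stem = f"{match.group('author')} - {match.group('title')}"
    # Rule 2: remove the "(Z-Library)" suffix.
    stem = re.sub(r"_?\s*\(Z-Library\)\s*$", "", stem, flags=re.IGNORECASE)
    # Rule 3: collapse underscores into single spaces.
    stem = re.sub(r"_+", " ", stem)
    # Strip characters invalid in Windows filenames, then normalize whitespace.
    stem = re.sub(r'[\\/:*?"<>|]', "", stem)
    return re.sub(r"\s+", " ", stem).strip()

def unique_path(directory: Path, stem: str) -> Path:
    # Collision handling: Book.pdf -> Book (1).pdf -> Book (2).pdf ...
    candidate = directory / f"{stem}.pdf"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem} ({counter}).pdf"
        counter += 1
    return candidate
```

For example, `normalize_name("_OceanofPDFs.com_The_Great_Gatsby_-_F._Scott_Fitzgerald")` yields `"F. Scott Fitzgerald - The Great Gatsby"`.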
- Python 3.10 or higher
- Operating System: Windows, macOS, or Linux
```
pip install pymupdf tqdm
```

Optional (Windows only), for full creation date preservation:

```
pip install pywin32
```

Clone this repository:

```
git clone https://github.com/yourusername/oceanofpdfs-remover.git
cd oceanofpdfs-remover
```

Or download `oceanofpdfs_remover_+_renamer.py` directly.
Process a single PDF:
```
python oceanofpdfs_remover_+_renamer.py "C:\Books\example.pdf"
```

Process an entire directory recursively:

```
python oceanofpdfs_remover_+_renamer.py "C:\Books"
```

Process multiple drives/directories:

```
python oceanofpdfs_remover_+_renamer.py "C:\Books" "D:\Library" "E:\Ebooks"
```

| Flag | Description |
|---|---|
| `--dry-run` | Preview changes without modifying files. Shows what would be cleaned/renamed. |
| `--links-only` | Remove only hyperlinks (fastest mode). Skips text redaction. |
| `--no-rename` | Disable all filename changes. Only clean PDF content. |
| `--no-progress` | Disable the progress bar and enable streaming mode. Process files as they are found. |
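For reference, the flags in the table map naturally onto a standard `argparse` interface. The snippet below is a minimal sketch of how such a CLI could be wired up, not the script's actual parser.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Remove OceanofPDFs.com watermarks and normalize PDF filenames."
    )
    parser.add_argument("paths", nargs="+", help="PDF files and/or directories to process")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview changes without modifying files")
    parser.add_argument("--links-only", action="store_true",
                        help="remove hyperlinks only; skip text redaction")
    parser.add_argument("--no-rename", action="store_true",
                        help="clean PDF content but keep filenames unchanged")
    parser.add_argument("--no-progress", action="store_true",
                        help="disable the progress bar and stream files as found")
    return parser
```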
Dry run to preview changes:
```
python oceanofpdfs_remover_+_renamer.py "C:\Books" --dry-run
```

Fast mode (links only, no renaming):

```
python oceanofpdfs_remover_+_renamer.py "C:\Books" --links-only --no-rename
```

Streaming mode for immediate processing:

```
python oceanofpdfs_remover_+_renamer.py "C:\Books" --no-progress
```

Process multiple drives without a progress bar:

```
python oceanofpdfs_remover_+_renamer.py C:\ D:\ E:\ --no-progress
```

- Scans all directories first to count total PDFs
- Shows real-time scanning progress with current folder path
- Displays accurate progress bar during processing
- Processes local files first, cloud-synced files last
- Best for: Large libraries where you want to see total progress
Example Output:
```
Scanning for PDFs...
Scanning: C:\Users\YourName\Documents\Books | PDFs: 1,234
✅ Scan complete: 1,234 PDFs found
ℹ️ 45 cloud-synced files will be processed last
Processing PDFs: 100%|████████████████████| 1234/1234 [00:15<00:00, 82.27file/s]
DONE: 1234 processed | 156 cleaned | 203 renamed | 2 failed
```
- Processes PDFs immediately as they're discovered
- No initial scan delay
- Shows folder changes in real-time
- Best for: Quick processing, CI/CD pipelines, or when total count doesn't matter
Example Output:
```
Starting streaming processing...
Processing folder: C:\Users\YourName\Documents\Books
♻️ Cleaned: The_Great_Gatsby.pdf (hits=12) & Renamed -> F. Scott Fitzgerald - The Great Gatsby.pdf
ℹ️ Renamed: 1984_ (Z-Library).pdf -> 1984.pdf
Processing folder: C:\Users\YourName\Documents\Books\Classics
♻️ Cleaned: Pride_and_Prejudice.pdf (hits=8)
DONE: 1234 processed | 156 cleaned | 203 renamed | 2 failed
```
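The difference between the two modes comes down to when files are discovered versus processed. A rough sketch of the idea, with a placeholder `process()` standing in for the per-file cleaner:

```python
from pathlib import Path
from tqdm import tqdm

def process(pdf: Path) -> None:
    """Placeholder for the per-file clean/rename step."""
    print(f"processing {pdf}")

def find_pdfs(roots):
    # Yield PDFs in discovery order, one directory tree at a time.
    for root in roots:
        yield from Path(root).rglob("*.pdf")

def run(roots, show_progress: bool = True) -> None:
    if show_progress:
        # Progress mode (default): scan everything up front so the bar has a true total.
        # The real script also defers cloud-synced files to the end of the list.
        pdfs = list(find_pdfs(roots))
        for pdf in tqdm(pdfs, desc="Processing PDFs", unit="file"):
            process(pdf)
    else:
        # Streaming mode (--no-progress): handle each file as soon as it is found.
        for pdf in find_pdfs(roots):
            process(pdf)
```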
- ✅ Atomic Replacement: Original files only replaced after successful processing
- ✅ Automatic Cleanup: Temporary files deleted automatically on failure
- ✅ Timestamp Preservation: Maintains original access, modification, and creation dates
- ✅ No Partial Overwrites: Failed operations never corrupt original files
- ✅ Collision Prevention: Automatic unique naming prevents file overwrites
- ✅ Cloud Detection: Identifies OneDrive, Dropbox, Google Drive, iCloud, etc.
- ✅ Deferred Processing: Cloud files processed last to avoid blocking
- ✅ Retry Logic: 3 automatic retries for cloud timeout errors
- ✅ Graceful Degradation: Non-cloud files complete even if cloud files fail
- ✅ No Network Access: 100% offline operation
- ✅ No Telemetry: Zero data collection or phone-home behavior
- ✅ No Metadata Scraping: Only reads text/links, never tracks reading habits
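The cloud handling above comes down to two small pieces: a path-based heuristic for spotting sync folders and a bounded retry loop for time-out errors. A hedged sketch (the folder markers, helper names, and retry counts here are illustrative):

```python
import time
from pathlib import Path

# Illustrative folder markers; the script's detection list may be longer.
CLOUD_MARKERS = ("onedrive", "dropbox", "google drive", "icloud")

def looks_cloud_synced(path: Path) -> bool:
    # Heuristic: a file counts as cloud-synced if a known sync folder appears in its path.
    lowered = str(path).lower()
    return any(marker in lowered for marker in CLOUD_MARKERS)

def with_retries(func, *args, attempts: int = 3, delay: float = 2.0):
    # Retry transient cloud/time-out errors a few times before recording a failure.
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```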
At completion, the script provides:
- Total PDFs processed
- Number cleaned (watermarks removed)
- Number renamed
- Number failed with error grouping
- Grouped Errors: Similar failures grouped for easy diagnosis
- Detailed Messages: Full error context for debugging
- Non-Fatal: Individual failures don't stop batch processing
- Common Errors: Broken PDFs, corrupted object streams, invalid colorspaces
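Grouping failures by their error message keeps the final report short even when many files fail for the same reason. A minimal sketch of that bookkeeping (names here are illustrative, not the script's internals):

```python
from collections import defaultdict
from pathlib import Path

failures: dict[str, list[str]] = defaultdict(list)  # error message -> affected files

def record_failure(path: Path, exc: Exception) -> None:
    failures[str(exc)].append(path.name)

def print_failure_summary() -> None:
    # One heading per distinct error, with every affected file listed under it.
    if not failures:
        return
    print("Failure summary:")
    for message, names in failures.items():
        print(f"  {message}")
        for name in names:
            print(f"    - {name}")
```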
Example Error Report:
```
DONE: 1234 processed | 156 cleaned | 203 renamed | 3 failed

Failure summary:
  Failed to open file 'corrupted.pdf'.
    - corrupted.pdf
    - broken_structure.pdf
  [WinError 426] The cloud operation was not completed before the time-out period expired
    - onedrive_syncing.pdf
```
- Pattern Pre-filtering: Fast text search before expensive redaction
- Page-level Processing: Skip clean pages entirely
- Smart Defaults: Balance speed vs. thoroughness
- Batch Operations: Process multiple files without reloading libraries
- Large Libraries: Tested on 10,000+ PDF collections
- Speed: ~50-100 files/second in `--links-only` mode
- Thoroughness: ~10-30 files/second in full cleaning mode
- Memory: Minimal footprint, processes one file at a time
- Use `--links-only` if you only need link removal
- Use `--no-progress` for slightly faster processing
- Process local drives before network/cloud drives
- Exclude temporary or download folders if not needed
- PyMuPDF (fitz): PDF parsing and manipulation
- tqdm: Progress bar visualization (optional)
- pywin32: Windows creation date preservation (optional, Windows only)
The script preserves three timestamp types:
| Timestamp | Windows | macOS/Linux | Preserved |
|---|---|---|---|
| Access Time (atime) | ✅ | ✅ | Always |
| Modification Time (mtime) | ✅ | ✅ | Always |
| Creation Time (ctime) | ✅ | ❌* | With pywin32 |
*On Unix systems, ctime is metadata change time, not creation time.
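Access and modification times can be restored portably with `os.utime`; creation time has no portable setter, which is why the Windows-only pywin32 path exists. A minimal sketch of the portable part, assuming a hypothetical `copy_times` helper:

```python
import os

def copy_times(src_stat: os.stat_result, dst_path: str) -> None:
    # Restore access and modification times captured from the original file.
    os.utime(dst_path, (src_stat.st_atime, src_stat.st_mtime))
    # Creation time cannot be set via os.utime; on Windows the script can fall
    # back to pywin32 (win32file.SetFileTime) when it is installed.
```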
- Create a temp file with the same name plus a `.tmp` extension
- Apply all modifications to the temp file
- Atomic move from temp → original on success
- Clean up the temp file on any failure
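A sketch of that temp-file sequence, assuming a hypothetical `write_cleaned()` callable that produces the cleaned PDF:

```python
import os

def replace_atomically(original: str, write_cleaned) -> None:
    tmp_path = original + ".tmp"
    try:
        write_cleaned(original, tmp_path)  # write the cleaned copy to the temp file
        os.replace(tmp_path, original)     # atomic swap on the same filesystem
    except Exception:
        # Never leave a half-written temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```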
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
```
git clone https://github.com/yourusername/oceanofpdfs-remover.git
cd oceanofpdfs-remover
pip install -r requirements.txt

# Dry run on test directory
python oceanofpdfs_remover_+_renamer.py "test_pdfs/" --dry-run
```

This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for personal library management only. Users are responsible for ensuring they have the right to modify their PDF files. Always maintain backups of important documents.
- Built with PyMuPDF for robust PDF processing
- Progress bars powered by tqdm
- Inspired by the need for clean, organized digital libraries
If you encounter issues or have questions:
- Check the Issues page
- Create a new issue with:
- Python version (`python --version`)
- Operating system
- Error message (if applicable)
- Command used
Made with ❤️ for book lovers who value clean libraries