Releases: albion2000/tools-jpeg2pdf
Addition of a few little tools including scandirpdf2noocr
new tool scandirpdf2noocr
Recursively removes all OCR text from the PDFs in a file tree. This can be needed if your OCR software appends its generated text to the text already present. This tool should only be used on PDF files that are the result of scans and were processed through OCR. Running it on a PDF that is a printout of a .doc file will remove the text entirely!
Use this tool with caution, on a copy of your PDFs, and verify for each type of PDF (categorized by the way it was produced) that the result remains visually unaltered. This check is recommended because some OCR tools embed fonts and use them to replace the bitmaps. In that case you are stuck: you may never be able to remove the OCR text properly.
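The "work on a copy" advice above can be sketched as follows. This is a minimal illustration and not part of the released tools: the function name make_working_copy and the directory names are hypothetical.

```python
import shutil
from pathlib import Path


def make_working_copy(source_dir, copy_dir):
    """Copy a whole PDF tree so that the originals are never modified."""
    source = Path(source_dir)
    target = Path(copy_dir)
    if target.exists():
        # Refuse to overwrite: an older copy may already have been processed.
        raise FileExistsError(f"{target} already exists")
    shutil.copytree(source, target)
    # Return the copied PDFs so they can be reviewed visually afterwards.
    return sorted(
        p for p in target.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Hypothetical directory names, for illustration only.
    for pdf in make_working_copy("scans", "scans_noocr"):
        print(pdf)
```

The OCR removal would then be run on the copy only, and the originals kept until one PDF of each type has been compared visually with its processed version.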
Addition of a pdf checker
This release includes six tools :
naming_conventions.py & naming_conventions_do_rename.py to enforce some strict rules over the directory names in a file tree. naming_conventions.py is a preview with no effective renaming
check_jpegs for a fast sanity check of a jpegs file tree & check_jpegs_full for a deeper and slower sanity check
scandir2pdf for massive conversion from jpegs to pdfs.
scandirpdf2txt for massive conversion from OCRed pdfs to txt files, for the purpose of fast full-text search with dedicated tools (Google or others).
new: check_pdfs for a sanity check of a pdf file tree. It can detect corrupted files, even some that can still be opened with Acrobat Reader. There are probably several causes of "false" positives: less standard formats, and Acrobat Reader's tolerance of corrupted files.
new : naming_conventions_files.py & naming_conventions_do_rename_files.py to enforce some strict rules over the file names in a file tree. naming_conventions_files.py is a preview with no effective renaming.
validated on 27K+ jpeg files, 3K+ pdfs.
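To illustrate what a sanity check of a pdf file tree can look like, here is a minimal sketch using only the Python standard library. It is an assumption and not the method used by check_pdfs: it only tests for the "%PDF-" header at the start of the file and an "%%EOF" marker near the end. The names looks_like_pdf and find_suspect_pdfs are hypothetical.

```python
from pathlib import Path


def looks_like_pdf(path, tail_size=1024):
    """Cheap structural check: '%PDF-' header and '%%EOF' marker near the end."""
    data = Path(path).read_bytes()
    if not data.startswith(b"%PDF-"):
        return False
    # A truncated download or copy usually loses the end-of-file marker.
    return b"%%EOF" in data[-tail_size:]


def find_suspect_pdfs(root):
    """Return the PDFs under root that fail the cheap structural check."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf" and not looks_like_pdf(p)
    )


if __name__ == "__main__":
    for pdf in find_suspect_pdfs("."):
        print("suspect:", pdf)
```

A check this strict will flag some files that a tolerant reader still opens, which is one way the "false" positives mentioned above can arise. It also cannot detect damage in the middle of a file.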
Addition of the tool scandirpdf2txt
This release includes four tools :
naming_conventions.py & naming_conventions_do_rename.py to enforce some strict rules over the directory names in a file tree
check_jpegs for a fast sanity check of a jpegs file tree & check_jpegs_full for a deeper and slower sanity check
scandir2pdf for massive conversion from jpegs to pdfs.
new : scandirpdf2txt for massive conversion from OCRed pdfs to txt files, for the purpose of fast full-text search with dedicated tools (Google or others).
validated on 27K+ jpeg files, 1.2K pdfs.
Bug Fix release
New bugfix release for check_jpegs & scandir2pdf
validated on 14K+ files
This release includes three tools :
naming_conventions.py & naming_conventions_do_rename.py to enforce some strict rules over the directory names in a file tree
check_jpegs for a fast sanity check of a jpegs file tree
check_jpegs_full for a deeper and slower sanity check
scandir2pdf for massive conversion from jpegs to pdfs.
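As an illustration of what a fast sanity check can mean, the sketch below only reads the first and last two bytes of each file and compares them with the JPEG start-of-image (FF D8) and end-of-image (FF D9) markers. This is an assumption about the kind of test involved, not the actual code of check_jpegs, and the function names are hypothetical. A deeper check would have to decode the image data, which is much slower.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def has_jpeg_markers(path):
    """Fast check: file starts with SOI (FF D8) and ends with EOI (FF D9)."""
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, 2)  # jump to the end to learn the file size
        if f.tell() < 4:
            return False  # too small to hold both markers
        f.seek(-2, 2)  # last two bytes of the file
        tail = f.read(2)
    return head == b"\xff\xd8" and tail == b"\xff\xd9"


def find_suspect_jpegs(root):
    """Return the JPEG files under root that fail the marker check."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file()
        and p.suffix.lower() in JPEG_SUFFIXES
        and not has_jpeg_markers(p)
    )


if __name__ == "__main__":
    for jpeg in find_suspect_jpegs("."):
        print("suspect:", jpeg)
```

Some devices append extra bytes after the end-of-image marker, so a file reported here is only a candidate for closer inspection.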
A next release will include
scandir2txt for massive conversion from OCRed pdfs to txt files
This update is simpler to install and use than the previous ones.
Release, with one more tool and easier to use
New Release with installation instructions and updated user instructions.
This release includes three tools (one new tool) :
naming_conventions.py & naming_conventions_do_rename.py to enforce some strict rules over the directory names in a file tree
check_jpegs for a fast sanity check of a jpegs file tree
scandir2pdf for massive conversion from jpegs to pdfs.
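The split between naming_conventions.py and naming_conventions_do_rename.py follows a preview-then-apply pattern, which can be sketched as below. The rule used here (runs of whitespace become one underscore) is purely hypothetical, since the actual rules enforced by the tools are not described in these notes. The function names are also assumptions.

```python
import os
import re


def normalize(name):
    """Hypothetical rule: runs of whitespace become a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


def plan_directory_renames(root):
    """Preview step: list (old_path, new_path) pairs without touching anything."""
    plan = []
    # topdown=False visits the deepest directories first, so a child is
    # renamed before its parent and every old_path stays valid.
    for parent, dirnames, _ in os.walk(root, topdown=False):
        for name in dirnames:
            new_name = normalize(name)
            if new_name != name:
                plan.append(
                    (os.path.join(parent, name), os.path.join(parent, new_name))
                )
    return plan


def apply_renames(plan):
    """Apply step: perform the renames listed by the preview."""
    for old_path, new_path in plan:
        if os.path.exists(new_path):
            raise FileExistsError(f"cannot rename {old_path}: {new_path} exists")
        os.rename(old_path, new_path)


if __name__ == "__main__":
    for old_path, new_path in plan_directory_renames("."):
        print(old_path, "->", new_path)  # preview only, nothing is renamed
```

Keeping the preview and the renaming in separate steps lets the planned changes be reviewed before anything is modified on disk.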
A next release will include
scandir2txt for massive conversion from OCRed pdfs to txt files
these tools
Initial Release
Initial Release with installation instructions and user manual.
This release includes two tools :
check_jpegs for a fast sanity check of a jpegs file tree
scandir2pdf for massive conversion from jpegs to pdfs.
A next release will include
scandir2txt for massive conversion from OCRed pdfs to txt files