Releases: albion2000/tools-jpeg2pdf

Addition of a few little tools including scandirpdf2noocr

22 Oct 18:33

New tool: scandirpdf2noocr

Recursively removes all OCR text from the PDFs in a file tree. This can be needed if your OCR software appends its generated text to the text already present. Use this tool only on PDF files that were produced by scanning and then processed through OCR. Running it on a PDF that is a printout of a .doc file will remove the text entirely!

Use this tool with caution, on a copy of your PDFs, and verify for each type of PDF (categorized by the way it was produced) that the result remains visually unaltered. This check is recommended because some OCR tools embed fonts and use them to replace the bitmaps. In that case you are out of luck: you may never be able to remove the OCR text properly.
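
The advice to work on a copy can be followed with a few lines of Python. This is only a minimal sketch: the function name make_working_copy and the directory names are hypothetical and are not part of these tools. It refuses to overwrite an existing copy so that an earlier experiment is not silently mixed with a new one.

```python
import shutil
from pathlib import Path


def make_working_copy(source: Path, work: Path) -> Path:
    """Copy a whole PDF tree so a tool can be tried on the copy first.

    Hypothetical helper, not part of the released tools.
    """
    if work.exists():
        # Never mix a fresh copy with the leftovers of an earlier run.
        raise FileExistsError(f"working copy already exists: {work}")
    shutil.copytree(source, work)
    return work
```

Run scandirpdf2noocr on the directory returned by make_working_copy, compare a sample of each type of PDF against the originals, and only then decide whether to process the real tree.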

Addition of a pdf checker

29 Aug 19:15

This release includes six tools:

naming_conventions.py & naming_conventions_do_rename.py to enforce strict rules on the directory names in a file tree. naming_conventions.py is a preview that renames nothing.

check_jpegs for a fast sanity check of a JPEG file tree & check_jpegs_full for a deeper, slower sanity check.

scandir2pdf for bulk conversion from JPEGs to PDFs.

scandirpdf2txt for bulk conversion from OCRed PDFs to txt files, for fast full-text search with dedicated tools (Google or others).

new: check_pdfs for a sanity check of a PDF file tree. It can detect corrupted files, even some that can still be opened with Acrobat Reader. There are probably several causes of "false" positives: less standard formats, and Acrobat Reader's tolerance of corrupted files.

new: naming_conventions_files.py & naming_conventions_do_rename_files.py to enforce strict rules on the file names in a file tree. naming_conventions_files.py is a preview that renames nothing.
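
The split between a preview script and a "do_rename" script can be illustrated as follows. This is a sketch under assumptions, not the code of these tools: the rule in normalize (lowercase, whitespace replaced by underscores) is invented for the example, and the names plan_renames and apply_renames are hypothetical. The actual rules enforced by the scripts are not described in these notes.

```python
import os
import re
from pathlib import Path
from typing import List, Tuple


def normalize(stem: str) -> str:
    """Hypothetical rule: lowercase, runs of whitespace become '_'."""
    return re.sub(r"\s+", "_", stem.strip()).lower()


def plan_renames(root: Path) -> List[Tuple[Path, Path]]:
    """Preview step: list (old, new) pairs without touching any file."""
    plan = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            stem, ext = os.path.splitext(filename)
            new_name = normalize(stem) + ext.lower()
            if new_name != filename:
                plan.append((Path(dirpath) / filename, Path(dirpath) / new_name))
    return plan


def apply_renames(plan: List[Tuple[Path, Path]]) -> int:
    """Rename step: apply the plan, skipping targets that already exist."""
    done = 0
    for old, new in plan:
        if new.exists():
            continue  # do not overwrite another file
        old.rename(new)
        done += 1
    return done
```

Printing the result of plan_renames plays the role of the preview; apply_renames is called only once the plan looks right.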

validated on 27K+ jpeg files, 3K+ pdfs.
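
How check_pdfs decides that a file is corrupted is not described here. As an illustration only, one cheap structural test looks for the "%PDF-" header near the start of the file and the "%%EOF" marker near the end. The function name quick_pdf_check is hypothetical. Lenient readers can often open files that fail such a test, which is one way a strict checker ends up reporting files that still display.

```python
from pathlib import Path
from typing import List


def quick_pdf_check(path: Path, window: int = 1024) -> List[str]:
    """Return the problems found; an empty list means none were detected.

    Illustrative only: passing this test does not prove the PDF is valid.
    """
    data = path.read_bytes()  # simple, but reads the whole file into memory
    problems = []
    if b"%PDF-" not in data[:window]:
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-window:]:
        problems.append("missing %%EOF marker near end of file")
    return problems
```

A truncated download or an interrupted copy typically fails the second test, because the end of the file is what gets lost.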

Addition of the tool scandirpdf2txt

18 Feb 14:08

This release includes four tools:

naming_conventions.py & naming_conventions_do_rename.py to enforce strict rules on the directory names in a file tree.

check_jpegs for a fast sanity check of a JPEG file tree & check_jpegs_full for a deeper, slower sanity check.

scandir2pdf for bulk conversion from JPEGs to PDFs.

new: scandirpdf2txt for bulk conversion from OCRed PDFs to txt files, for fast full-text search with dedicated tools (Google or others).

validated on 27K+ jpeg files, 1.2K pdfs.
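
The difference between a fast and a deep JPEG check can be made concrete. A fast check might only look at the two-byte markers that open and close a JPEG stream, while a deep check would decode the whole image. The sketch below shows the fast variant only. It is an assumption about what such a check could look like, not the method used by check_jpegs, and the function names are hypothetical. Files with extra bytes after the end marker would be reported even though they may display fine.

```python
import os
from pathlib import Path
from typing import List

SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker


def quick_jpeg_check(path: Path) -> bool:
    """True if the file starts with SOI and ends with EOI."""
    if path.stat().st_size < 4:
        return False  # too short to hold both markers
    with path.open("rb") as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == SOI and tail == EOI


def find_suspect_jpegs(root: Path) -> List[Path]:
    """Walk a tree and list the .jpg/.jpeg files that fail the quick check."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith((".jpg", ".jpeg")):
                candidate = Path(dirpath) / name
                if not quick_jpeg_check(candidate):
                    suspects.append(candidate)
    return suspects
```

Only four bytes are read per file, which is why a check of this kind stays fast on tens of thousands of files.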

Bug Fix release

14 Jan 11:48

New bugfix release for check_jpegs & scandir2pdf.

Validated on 14K+ files.

This release includes three tools:

naming_conventions.py & naming_conventions_do_rename.py to enforce strict rules on the directory names in a file tree.

check_jpegs for a fast sanity check of a JPEG file tree.

check_jpegs_full for a deeper, slower sanity check.

scandir2pdf for bulk conversion from JPEGs to PDFs.

A later release will include:

scandir2txt for bulk conversion from OCRed PDFs to txt files.

This update is simpler to install and use than the previous ones.

Release, with one more tool and easier to use

13 Jan 21:14

New release with installation instructions and updated user instructions.

This release includes three tools (one of them new):

naming_conventions.py & naming_conventions_do_rename.py to enforce strict rules on the directory names in a file tree.

check_jpegs for a fast sanity check of a JPEG file tree.

scandir2pdf for bulk conversion from JPEGs to PDFs.

A later release will include:

scandir2txt for bulk conversion from OCRed PDFs to txt files.

these tools

Initial Release

01 Jan 20:35

Initial release with installation instructions and user manual.

This release includes two tools:

check_jpegs for a fast sanity check of a JPEG file tree.

scandir2pdf for bulk conversion from JPEGs to PDFs.

A later release will include:

scandir2txt for bulk conversion from OCRed PDFs to txt files.