Brief description - I can elaborate later as I have to go shortly:
(Something here messes with the formatting - I will return to fix for the sake of our eyes.)
I had gone through the installation steps without reported problems.
The tests seemed to print the help instructions twice. Once those had finished, the remaining tests appeared to run, and test 7 appeared to fail with the message:
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Image file tests/imgs/README.md cannot be read!
Error during processing.
ERROR: 'tesseract' failed. ret_code = 1
System: Ubuntu 20 LTS.
Output example:
nsa@treatdell:~/Downloads/pdfsplit$ ls
dati1.jpg dati2.jpg dati3.jpg file_list.txt
nsa@treatdell:~/Downloads/pdfsplit$ rm file_list.txt
nsa@treatdell:~/Downloads/pdfsplit$ cd ..
nsa@treatdell:~/Downloads$ pdf2searchablepdf pdfsplit
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.
Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1st parmeter is a directory, so we are assuming it contains a bunch of images
you'd like converted to a PDF. PLEASE ENSURE THIS DIRECTORY CONTAINS ONLY IMAGES!
Converting all files (images) inside directory "pdfsplit" into a searchable PDF.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tesseract OCR on all files (better be images) in the directory you provided.
This could take some time.
Searchable PDF will be generated at "pdfsplit_searchable.pdf".
Page 0 : pdfsplit/dati1.jpg
Estimating resolution as 530
Page 1 : pdfsplit/dati2.jpg
Estimating resolution as 806
Page 2 : pdfsplit/dati3.jpg
Estimating resolution as 204
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Image file pdfsplit/._searchable.pdf cannot be read!
Error during processing.
ERROR: 'tesseract' failed. ret_code = 1
Total script run-time: 5 sec (0.083 min).
real 0m4.446s
user 0m9.147s
sys 0m0.145s
Example output from tests:
nsa@treatdell:~/PDF2SearchablePDF$ ls
branch_bak.txt
install.sh
LICENSE
pdf2searchablepdf.sh
pdf2searchablepdf_temp_20191110-231200.594322352
pdf2searchablepdf_temp_20220809-111036.689043492
'pdf2searchablepdf - what to work on next - Gabriel.odt'
README.md
research
run_tests.sh
tests
TODO.md
nsa@treatdell:~/PDF2SearchablePDF$ ./run_tests.sh
============================ START OF TEST 1 ===============================
=== Running 'pdf2searchablepdf -h' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Purpose: convert "input.pdf" to a searchable PDF named "input_searchable.pdf"
by using tesseract to perform OCR (Optical Character Recognition) on the PDF.
Usage:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
If the 1st positional argument (after options) is to an input pdf, then convert
input.pdf to input_searchable.pdf using language "lang" for OCR. Otherwise, if the 1st
argument is a path to a directory containing a bunch of images, convert the whole
directory of images into a single PDF, using language "lang" for OCR!
pdf2searchablepdf
print help menu, then exit
Options:
[-h|-?|--help]
print help menu, then exit
[-v|--version]
print author & version, then exit
[-d|--debug]
Turn debug prints on while running the script
[-upw <password>]
Specify the user password to open and read the PDF file. This option is passed directly
through to the 'pdftoppm' cmd used internally to convert the PDF to images for OCR.
[--run_tests]
Run unit tests for this program.
Examples:
pdf2searchablepdf mypdf.pdf deu
Convert mypdf.pdf to a searchable PDF, using German text OCR, or
pdf2searchablepdf mypdf.pdf
Convert mypdf.pdf to a searchable PDF, using English text OCR (the default).
pdf2searchablepdf mypdf.pdf --debug
Same as above, except also print out the debug prints.
pdf2searchablepdf dir_of_imgs
Convert all images in this directory, "dir_of_imgs", to a single, searchable PDF.
pdf2searchablepdf .
Convert all images in the present directory, indicated by '.', to a single, searchable
PDF.
pdf2searchablepdf -upw 1234 mypdf.pdf
Convert mypdf.pdf to a searchable PDF, using English text OCR, while using the user
password "1234" to open up and read the PDF.
pdf2searchablepdf mypdf.pdf -upw 1234
Same as above.
Option Details:
[lang]
The optional [lang] argument allows you to perform OCR in your language of choice. This
parameter will be passed on to tesseract. You must use ISO 639-2 3-letter language codes.
Ex: "deu" for German, "dan" for Danish, "eng" for English, etc. See the "LANGUAGES"
section of the tesseract man pages ('man tesseract') for a complete list. If the [lang]
parameter is not given, English will be used by default. If you don't have a desired language
installed, it may be obtained from one of the following 3 repos (see tesseract man pages for
details):
- https://github.com/tesseract-ocr/tessdata_fast
- https://github.com/tesseract-ocr/tessdata_best
- https://github.com/tesseract-ocr/tessdata
To install a new language, simply download the respective "*.traineddata" file from one of the
3 repos above and copy it to your tesseract installation's "tessdata" directory.
See "Post-Install Instructions" here:
https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md#post-install-instructions
Tesseract Wiki:
https://github.com/tesseract-ocr/tesseract/wiki.
Source code:
https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
=== END TEST 1 ===
============================ START OF TEST 2 ===============================
=== Running 'pdf2searchablepdf -?' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Purpose: convert "input.pdf" to a searchable PDF named "input_searchable.pdf"
by using tesseract to perform OCR (Optical Character Recognition) on the PDF.
Usage:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
If the 1st positional argument (after options) is to an input pdf, then convert
input.pdf to input_searchable.pdf using language "lang" for OCR. Otherwise, if the 1st
argument is a path to a directory containing a bunch of images, convert the whole
directory of images into a single PDF, using language "lang" for OCR!
pdf2searchablepdf
print help menu, then exit
Options:
[-h|-?|--help]
print help menu, then exit
[-v|--version]
print author & version, then exit
[-d|--debug]
Turn debug prints on while running the script
[-upw <password>]
Specify the user password to open and read the PDF file. This option is passed directly
through to the 'pdftoppm' cmd used internally to convert the PDF to images for OCR.
[--run_tests]
Run unit tests for this program.
Examples:
pdf2searchablepdf mypdf.pdf deu
Convert mypdf.pdf to a searchable PDF, using German text OCR, or
pdf2searchablepdf mypdf.pdf
Convert mypdf.pdf to a searchable PDF, using English text OCR (the default).
pdf2searchablepdf mypdf.pdf --debug
Same as above, except also print out the debug prints.
pdf2searchablepdf dir_of_imgs
Convert all images in this directory, "dir_of_imgs", to a single, searchable PDF.
pdf2searchablepdf .
Convert all images in the present directory, indicated by '.', to a single, searchable
PDF.
pdf2searchablepdf -upw 1234 mypdf.pdf
Convert mypdf.pdf to a searchable PDF, using English text OCR, while using the user
password "1234" to open up and read the PDF.
pdf2searchablepdf mypdf.pdf -upw 1234
Same as above.
Option Details:
[lang]
The optional [lang] argument allows you to perform OCR in your language of choice. This
parameter will be passed on to tesseract. You must use ISO 639-2 3-letter language codes.
Ex: "deu" for German, "dan" for Danish, "eng" for English, etc. See the "LANGUAGES"
section of the tesseract man pages ('man tesseract') for a complete list. If the [lang]
parameter is not given, English will be used by default. If you don't have a desired language
installed, it may be obtained from one of the following 3 repos (see tesseract man pages for
details):
- https://github.com/tesseract-ocr/tessdata_fast
- https://github.com/tesseract-ocr/tessdata_best
- https://github.com/tesseract-ocr/tessdata
To install a new language, simply download the respective "*.traineddata" file from one of the
3 repos above and copy it to your tesseract installation's "tessdata" directory.
See "Post-Install Instructions" here:
https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md#post-install-instructions
Tesseract Wiki:
https://github.com/tesseract-ocr/tesseract/wiki.
Source code:
https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
=== END TEST 2 ===
============================ START OF TEST 3 ===============================
=== Running 'pdf2searchablepdf -v' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.
=== END TEST 3 ===
============================ START OF TEST 4 ===============================
=== Running 'pdf2searchablepdf tests/pdfs/test1.pdf' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.
Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting input PDF (tests/pdfs/test1.pdf) into a searchable PDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating temporary working directory: "pdf2searchablepdf_temp_20220809-185335.170253381"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "tests/pdfs/test1_searchable.pdf".
Page 0 : pdf2searchablepdf_temp_20220809-185335.170253381/pg-1.tif
Page 1 : pdf2searchablepdf_temp_20220809-185335.170253381/pg-2.tif
Page 2 : pdf2searchablepdf_temp_20220809-185335.170253381/pg-3.tif
Done! Searchable PDF generated at "tests/pdfs/test1_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20220809-185335.170253381".
Done!
Total script run-time: 12 sec (0.200 min).
real 0m12.246s
user 0m16.218s
sys 0m0.278s
END OF pdf2searchablepdf.
=== END TEST 4 ===
============================ START OF TEST 5 ===============================
=== Running 'pdf2searchablepdf tests/pdfs/test1_edited_w_foxit.pdf' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.
Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting input PDF (tests/pdfs/test1_edited_w_foxit.pdf) into a searchable PDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating temporary working directory: "pdf2searchablepdf_temp_20220809-185347.414661491"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "tests/pdfs/test1_edited_w_foxit_searchable.pdf".
Page 0 : pdf2searchablepdf_temp_20220809-185347.414661491/pg-1.tif
Page 1 : pdf2searchablepdf_temp_20220809-185347.414661491/pg-2.tif
Page 2 : pdf2searchablepdf_temp_20220809-185347.414661491/pg-3.tif
Done! Searchable PDF generated at "tests/pdfs/test1_edited_w_foxit_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20220809-185347.414661491".
Done!
Total script run-time: 5 sec (0.083 min).
real 0m4.984s
user 0m9.069s
sys 0m0.211s
END OF pdf2searchablepdf.
=== END TEST 5 ===
============================ START OF TEST 6 ===============================
=== Running 'pdf2searchablepdf tests/pdfs/Wikipedia_pdf_screenshot.pdf' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.
Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting input PDF (tests/pdfs/Wikipedia_pdf_screenshot.pdf) into a searchable PDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating temporary working directory: "pdf2searchablepdf_temp_20220809-185352.400599190"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "tests/pdfs/Wikipedia_pdf_screenshot_searchable.pdf".
Page 0 : pdf2searchablepdf_temp_20220809-185352.400599190/pg-1.tif
Image too small to scale!! (2x36 vs min width of 3)
Line cannot be recognized!!
Done! Searchable PDF generated at "tests/pdfs/Wikipedia_pdf_screenshot_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20220809-185352.400599190".
Done!
Total script run-time: 2 sec (0.033 min).
real 0m1.998s
user 0m4.059s
sys 0m0.146s
END OF pdf2searchablepdf.
=== END TEST 6 ===
============================ START OF TEST 7 ===============================
=== Running 'pdf2searchablepdf tests/imgs' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.
Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1st parmeter is a directory, so we are assuming it contains a bunch of images
you'd like converted to a PDF. PLEASE ENSURE THIS DIRECTORY CONTAINS ONLY IMAGES!
Converting all files (images) inside directory "tests/imgs" into a searchable PDF.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tesseract OCR on all files (better be images) in the directory you provided.
This could take some time.
Searchable PDF will be generated at "tests/imgs_searchable.pdf".
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Image file tests/imgs/README.md cannot be read!
Error during processing.
ERROR: 'tesseract' failed. ret_code = 1
Total script run-time: 0 sec (0.000 min).
real 0m0.059s
user 0m0.060s
sys 0m0.000s
=== END TEST 7 ===
real 0m28.817s
user 0m29.450s
sys 0m0.658s
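Both failures above seem to come from a non-image file (`README.md` in `tests/imgs`, and a `.pdf` in my own directory) being handed to tesseract along with the images. As a possible workaround, here is a minimal sketch of filtering a directory down to image files before OCR. This is not the script's actual implementation: the function name `list_images` is hypothetical, and the set of extensions is my own assumption about what counts as an image.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of pdf2searchablepdf): print the paths of
# image files directly inside a directory, one per line, sorted by name.
# The extension list below is an assumption; adjust it as needed.
list_images() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f \
        \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \
           -o -iname '*.tif' -o -iname '*.tiff' -o -iname '*.bmp' \) \
        | sort
}

# Example usage (commented out so that sourcing this file has no side effects):
#   list_images tests/imgs > file_list.txt
```

With a filter like this, files such as `README.md` or a previously generated PDF would be skipped rather than passed on for OCR.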