error in pixReadStream: Pdf reading is not supported #24

Open
nsandersen opened this issue Aug 9, 2022 · 1 comment

nsandersen commented Aug 9, 2022

Brief description - I can elaborate later as I have to go shortly:

(Something here messes with the formatting - I will return to fix for the sake of our eyes.)

I went through the installation steps without any reported problems.
The tests seemed to print the instructions twice. Once I had exited from those, it looks like the tests ran; test 7 appeared to fail with the message:

Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Image file tests/imgs/README.md cannot be read!
Error during processing.
ERROR: 'tesseract' failed. ret_code = 1

System: Ubuntu 20 LTS.

Output example:

nsa@treatdell:~/Downloads/pdfsplit$ ls
dati1.jpg  dati2.jpg  dati3.jpg  file_list.txt
nsa@treatdell:~/Downloads/pdfsplit$ rm file_list.txt 
nsa@treatdell:~/Downloads/pdfsplit$ cd ..
nsa@treatdell:~/Downloads$ pdf2searchablepdf pdfsplit
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.

Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1st parmeter is a directory, so we are assuming it contains a bunch of images
you'd like converted to a PDF. PLEASE ENSURE THIS DIRECTORY CONTAINS ONLY IMAGES!
Converting all files (images) inside directory "pdfsplit"
into a searchable PDF.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tesseract OCR on all files (better be images) in the directory you provided.
This could take some time.
Searchable PDF will be generated at "pdfsplit_searchable.pdf".
Page 0 : pdfsplit/dati1.jpg
Estimating resolution as 530
Page 1 : pdfsplit/dati2.jpg
Estimating resolution as 806
Page 2 : pdfsplit/dati3.jpg
Estimating resolution as 204
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Image file pdfsplit/._searchable.pdf cannot be read!
Error during processing.
ERROR: 'tesseract' failed. ret_code = 1

Total script run-time: 5 sec (0.083 min).

real	0m4.446s
user	0m9.147s
sys	0m0.145s

Example output from tests:

nsa@treatdell:~/PDF2SearchablePDF$ ls
 branch_bak.txt
 install.sh
 LICENSE
 pdf2searchablepdf.sh
 pdf2searchablepdf_temp_20191110-231200.594322352
 pdf2searchablepdf_temp_20220809-111036.689043492
'pdf2searchablepdf - what to work on next - Gabriel.odt'
 README.md
 research
 run_tests.sh
 tests
 TODO.md
nsa@treatdell:~/PDF2SearchablePDF$ ./run_tests.sh
============================ START OF TEST 1 ===============================
=== Running 'pdf2searchablepdf -h' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0

Purpose: convert "input.pdf" to a searchable PDF named "input_searchable.pdf"
by using tesseract to perform OCR (Optical Character Recognition) on the PDF.

Usage:

    pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
            If the 1st positional argument (after options) is to an input pdf, then convert
            input.pdf to input_searchable.pdf using language "lang" for OCR. Otherwise, if the 1st
            argument is a path to a directory containing a bunch of images, convert the whole
            directory of images into a single PDF, using language "lang" for OCR!
    pdf2searchablepdf
            print help menu, then exit

  Options:

    [-h|-?|--help]
            print help menu, then exit
    [-v|--version]
            print author & version, then exit
    [-d|--debug]
            Turn debug prints on while running the script
    [-upw <password>]
            Specify the user password to open and read the PDF file. This option is passed directly
            through to the 'pdftoppm' cmd used internally to convert the PDF to images for OCR.
    [--run_tests]
            Run unit tests for this program.

Examples:

    pdf2searchablepdf mypdf.pdf deu
            Convert mypdf.pdf to a searchable PDF, using German text OCR, or
    pdf2searchablepdf mypdf.pdf
            Convert mypdf.pdf to a searchable PDF, using English text OCR (the default).
    pdf2searchablepdf mypdf.pdf --debug
            Same as above, except also print out the debug prints.
    pdf2searchablepdf dir_of_imgs
            Convert all images in this directory, "dir_of_imgs", to a single, searchable PDF.
    pdf2searchablepdf .
            Convert all images in the present directory, indicated by '.', to a single, searchable
            PDF.
    pdf2searchablepdf -upw 1234 mypdf.pdf
            Convert mypdf.pdf to a searchable PDF, using English text OCR, while using the user
            password "1234" to open up and read the PDF.
    pdf2searchablepdf mypdf.pdf -upw 1234
            Same as above.

Option Details:

    [lang]
        The optional [lang] argument allows you to perform OCR in your language of choice. This
        parameter will be passed on to tesseract. You must use ISO 639-2 3-letter language codes.
        Ex: "deu" for German, "dan" for Danish, "eng" for English, etc. See the "LANGUAGES"
        section of the tesseract man pages ('man tesseract') for a complete list. If the [lang]
        parameter is not given, English will be used by default. If you don't have a desired
        language installed, it may be obtained from one of the following 3 repos (see tesseract man
        pages for details):
          - https://github.com/tesseract-ocr/tessdata_fast
          - https://github.com/tesseract-ocr/tessdata_best
          - https://github.com/tesseract-ocr/tessdata
        To install a new language, simply download the respective "*.traineddata" file from one of
        the 3 repos above and copy it to your tesseract installation's "tessdata" directory.
        See "Post-Install Instructions" here:
        https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md#post-install-instructions

Tesseract Wiki:
https://github.com/tesseract-ocr/tesseract/wiki.

Source code:
https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF

=== END TEST 1 ===

============================ START OF TEST 2 ===============================
=== Running 'pdf2searchablepdf -?' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0

Purpose: convert "input.pdf" to a searchable PDF named "input_searchable.pdf"
by using tesseract to perform OCR (Optical Character Recognition) on the PDF.

Usage:

    pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
            If the 1st positional argument (after options) is to an input pdf, then convert
            input.pdf to input_searchable.pdf using language "lang" for OCR. Otherwise, if the 1st
            argument is a path to a directory containing a bunch of images, convert the whole
            directory of images into a single PDF, using language "lang" for OCR!
    pdf2searchablepdf
            print help menu, then exit

  Options:

    [-h|-?|--help]
            print help menu, then exit
    [-v|--version]
            print author & version, then exit
    [-d|--debug]
            Turn debug prints on while running the script
    [-upw <password>]
            Specify the user password to open and read the PDF file. This option is passed directly
            through to the 'pdftoppm' cmd used internally to convert the PDF to images for OCR.
    [--run_tests]
            Run unit tests for this program.

Examples:

    pdf2searchablepdf mypdf.pdf deu
            Convert mypdf.pdf to a searchable PDF, using German text OCR, or
    pdf2searchablepdf mypdf.pdf
            Convert mypdf.pdf to a searchable PDF, using English text OCR (the default).
    pdf2searchablepdf mypdf.pdf --debug
            Same as above, except also print out the debug prints.
    pdf2searchablepdf dir_of_imgs
            Convert all images in this directory, "dir_of_imgs", to a single, searchable PDF.
    pdf2searchablepdf .
            Convert all images in the present directory, indicated by '.', to a single, searchable
            PDF.
    pdf2searchablepdf -upw 1234 mypdf.pdf
            Convert mypdf.pdf to a searchable PDF, using English text OCR, while using the user
            password "1234" to open up and read the PDF.
    pdf2searchablepdf mypdf.pdf -upw 1234
            Same as above.

Option Details:

    [lang]
        The optional [lang] argument allows you to perform OCR in your language of choice. This
        parameter will be passed on to tesseract. You must use ISO 639-2 3-letter language codes.
        Ex: "deu" for German, "dan" for Danish, "eng" for English, etc. See the "LANGUAGES"
        section of the tesseract man pages ('man tesseract') for a complete list. If the [lang]
        parameter is not given, English will be used by default. If you don't have a desired
        language installed, it may be obtained from one of the following 3 repos (see tesseract man
        pages for details):
          - https://github.com/tesseract-ocr/tessdata_fast
          - https://github.com/tesseract-ocr/tessdata_best
          - https://github.com/tesseract-ocr/tessdata
        To install a new language, simply download the respective "*.traineddata" file from one of
        the 3 repos above and copy it to your tesseract installation's "tessdata" directory.
        See "Post-Install Instructions" here:
        https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md#post-install-instructions

Tesseract Wiki:
https://github.com/tesseract-ocr/tesseract/wiki.

Source code:
https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF

=== END TEST 2 ===

============================ START OF TEST 3 ===============================
=== Running 'pdf2searchablepdf -v' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.

=== END TEST 3 ===

============================ START OF TEST 4 ===============================
=== Running 'pdf2searchablepdf tests/pdfs/test1.pdf' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.

Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting input PDF (tests/pdfs/test1.pdf) into a searchable PDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating temporary working directory: "pdf2searchablepdf_temp_20220809-185335.170253381"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
  working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
  operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "tests/pdfs/test1_searchable.pdf".
Page 0 : pdf2searchablepdf_temp_20220809-185335.170253381/pg-1.tif
Page 1 : pdf2searchablepdf_temp_20220809-185335.170253381/pg-2.tif
Page 2 : pdf2searchablepdf_temp_20220809-185335.170253381/pg-3.tif
Done! Searchable PDF generated at "tests/pdfs/test1_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20220809-185335.170253381".
Done!

Total script run-time: 12 sec (0.200 min).

real	0m12.246s
user	0m16.218s
sys	0m0.278s
END OF pdf2searchablepdf.
=== END TEST 4 ===

============================ START OF TEST 5 ===============================
=== Running 'pdf2searchablepdf tests/pdfs/test1_edited_w_foxit.pdf' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.

Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting input PDF (tests/pdfs/test1_edited_w_foxit.pdf) into a searchable PDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating temporary working directory: "pdf2searchablepdf_temp_20220809-185347.414661491"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
  working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
  operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "tests/pdfs/test1_edited_w_foxit_searchable.pdf".
Page 0 : pdf2searchablepdf_temp_20220809-185347.414661491/pg-1.tif
Page 1 : pdf2searchablepdf_temp_20220809-185347.414661491/pg-2.tif
Page 2 : pdf2searchablepdf_temp_20220809-185347.414661491/pg-3.tif
Done! Searchable PDF generated at "tests/pdfs/test1_edited_w_foxit_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20220809-185347.414661491".
Done!

Total script run-time: 5 sec (0.083 min).

real	0m4.984s
user	0m9.069s
sys	0m0.211s
END OF pdf2searchablepdf.
=== END TEST 5 ===

============================ START OF TEST 6 ===============================
=== Running 'pdf2searchablepdf tests/pdfs/Wikipedia_pdf_screenshot.pdf' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.

Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting input PDF (tests/pdfs/Wikipedia_pdf_screenshot.pdf) into a searchable PDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating temporary working directory: "pdf2searchablepdf_temp_20220809-185352.400599190"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
  working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
  operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "tests/pdfs/Wikipedia_pdf_screenshot_searchable.pdf".
Page 0 : pdf2searchablepdf_temp_20220809-185352.400599190/pg-1.tif
Image too small to scale!! (2x36 vs min width of 3)
Line cannot be recognized!!
Done! Searchable PDF generated at "tests/pdfs/Wikipedia_pdf_screenshot_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20220809-185352.400599190".
Done!

Total script run-time: 2 sec (0.033 min).

real	0m1.998s
user	0m4.059s
sys	0m0.146s
END OF pdf2searchablepdf.
=== END TEST 6 ===

============================ START OF TEST 7 ===============================
=== Running 'pdf2searchablepdf tests/imgs' ===
----------------------------------------------------------------------------
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Author = Gabriel Staples
See 'pdf2searchablepdf -h' for more info.

Language = eng
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1st parmeter is a directory, so we are assuming it contains a bunch of images
you'd like converted to a PDF. PLEASE ENSURE THIS DIRECTORY CONTAINS ONLY IMAGES!
Converting all files (images) inside directory "tests/imgs"
into a searchable PDF.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tesseract OCR on all files (better be images) in the directory you provided.
This could take some time.
Searchable PDF will be generated at "tests/imgs_searchable.pdf".
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Image file tests/imgs/README.md cannot be read!
Error during processing.
ERROR: 'tesseract' failed. ret_code = 1

Total script run-time: 0 sec (0.000 min).

real	0m0.059s
user	0m0.060s
sys	0m0.000s
=== END TEST 7 ===


real	0m28.817s
user	0m29.450s
sys	0m0.658s

ElectricRCAircraftGuy commented Aug 9, 2022

Thanks. I'll take a look when able. Meanwhile, I just fixed the formatting.

Note to self: the key part I need to look at:

Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Image file pdfsplit/._searchable.pdf cannot be read!
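
Both failures in this report share a pattern: tesseract was handed a non-image file from the input directory (`README.md` in test 7, and the hidden `._searchable.pdf` in the `pdfsplit` run, which a plain `ls` does not list). The following is a minimal sketch of one possible mitigation, assuming the list of inputs for tesseract can be built from a filtered directory listing. The function name `list_image_files` and the set of extensions are hypothetical and are not taken from `pdf2searchablepdf.sh`; `-maxdepth` and `-iname` assume GNU or BSD `find`.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of pdf2searchablepdf): print the image files
# located directly inside a directory, one path per line, sorted by name.
# Hidden files (names starting with '.') and files without a known image
# extension (e.g. README.md, *.pdf, file_list.txt) are skipped.
list_image_files() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f ! -name '.*' \
        \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \
           -o -iname '*.tif' -o -iname '*.tiff' -o -iname '*.bmp' \) \
        | LC_ALL=C sort
}
```

As a usage example, `list_image_files pdfsplit > file_list.txt` followed by `tesseract file_list.txt pdfsplit_searchable -l eng pdf` would run OCR only on the listed images, since tesseract accepts a text file with one image path per line as its input. Whether the script builds its input list this way is an assumption, and sorting by name may not match the intended page order for every naming scheme.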
