Docker Image with latest Tesseract OCR Version 5.x.x built from sources.
The sources are pulled from the latest main branch and the latest releases of the Tesseract OCR project.
Docker Hub: https://hub.docker.com/r/franky1/tesseract
Pull the docker image from Docker Hub:
docker pull franky1/tesseract
Mount your image data to the /tmp directory and run the Tesseract OCR container with the required command line options. For example, with the test image:
docker run -it -v ${PWD}/testdata:/tmp --rm franky1/tesseract \
tesseract english.png output --oem 1 -l eng
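To process several images in one go, the single-image command above can be wrapped in a small shell function. The following is a minimal sketch and not part of this repository: the function name `ocr_image` and the `DOCKER` override variable are hypothetical. The `-it` flags are left out because `docker run -t` refuses to start when no terminal is attached, for example in cron jobs or CI.

```shell
#!/bin/sh
# Hypothetical helper: OCR one image from ./testdata with the published image.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the command instead of running it.
ocr_image() {
    image_file="$1"              # file name relative to ./testdata, e.g. english.png
    lang="${2:-eng}"             # Tesseract language code, defaults to English
    out_base="${image_file%.*}"  # output base name; Tesseract appends .txt

    "${DOCKER:-docker}" run --rm \
        -v "${PWD}/testdata:/tmp" \
        franky1/tesseract \
        tesseract "$image_file" "$out_base" --oem 1 -l "$lang"
}

# Usage: OCR every PNG in ./testdata; results land next to the images.
#   for f in testdata/*.png; do ocr_image "$(basename "$f")" eng; done
```

Deriving the output name from the image name keeps results from different images from overwriting each other, which would happen if every run used the same `output` base name.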
For the Tesseract command line options, please refer to the Tesseract Manual: https://tesseract-ocr.github.io/tessdoc/
Test whether the languages mounted from your local subfolder /tessdata are available in the Docker container.
Be aware that the mounted local languages override the languages installed in the Docker image. Example with the French language:
docker run -it -v ${PWD}/testdata:/tmp \
-v ${PWD}/tessdata:/usr/local/share/tessdata/ \
--rm franky1/tesseract
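To make the language check explicit, Tesseract's `--list-langs` option can be passed as the container command. This is a minimal sketch, assuming the image accepts a command override as in the other examples here; the function name `list_langs` and the `DOCKER` variable are hypothetical.

```shell
# Hypothetical helper: print the languages Tesseract sees inside the container.
# With an argument, that local folder is mounted over the image's tessdata directory.
list_langs() {
    if [ -n "$1" ]; then
        "${DOCKER:-docker}" run --rm \
            -v "$1:/usr/local/share/tessdata/" \
            franky1/tesseract tesseract --list-langs
    else
        "${DOCKER:-docker}" run --rm franky1/tesseract tesseract --list-langs
    fi
}

# Usage:
#   list_langs                    # languages baked into the image
#   list_langs "${PWD}/tessdata"  # languages from the mounted local folder
```

Comparing the two listings shows directly which languages the mount hides and which it adds.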
Test the mounted languages in the Docker container with a sample image. Example with the French language:
docker run -it -v ${PWD}/testdata:/tmp \
-v ${PWD}/tessdata:/usr/local/share/tessdata/ \
--rm franky1/tesseract \
tesseract french.jpg output --oem 1 -l fra
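The mounted tessdata folder has to contain the matching `.traineddata` file before Tesseract can find the language. One way to fetch it is sketched below. The helper names `traineddata_url` and `fetch_lang` are hypothetical, and the download path is an assumption about how the tessdata_best repository serves files from its main branch, so verify it before relying on it.

```shell
# Hypothetical helpers: fetch a trained data file into ./tessdata for mounting.
# Assumption: tessdata_best serves <lang>.traineddata from its main branch at this path.
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata_best/raw/main/%s.traineddata\n' "$1"
}

fetch_lang() {
    lang="$1"   # Tesseract language code, e.g. fra
    mkdir -p tessdata
    # -f: fail on HTTP errors, -L: follow redirects to the raw file
    curl -fL -o "tessdata/${lang}.traineddata" "$(traineddata_url "$lang")"
}

# Usage: fetch_lang fra
```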
Alternatively, you can build a new Docker image if you want other languages; see the next section.
For details, have a look at the Dockerfile.
- Git clone this repo.
- Add your required languages to the languages.txt file.
- (a) Build the docker image from scratch, if you want the latest sources from the main branch.
docker build --tag tesseract .
- (b) Build the docker image from scratch, if you want a specific release version.
docker build --tag tesseract --build-arg TESSERACT_VERSION=5.0.0 .
- Run Tesseract OCR container with test image:
docker run -it --name tesseract -v ${PWD}/testdata:/tmp --rm \
tesseract tesseract english.png output --oem 1 -l eng
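The two build variants above can be combined into one helper. This is a minimal sketch: the function name `build_image` and the `DOCKER` variable are hypothetical, and tagging a pinned build as `tesseract:<version>` is a choice made here for illustration, not something the repository prescribes.

```shell
# Hypothetical helper: build the image, optionally pinned to a Tesseract release.
build_image() {
    version="$1"   # e.g. 5.0.0; empty means "build from the main branch"
    if [ -n "$version" ]; then
        "${DOCKER:-docker}" build --tag "tesseract:${version}" \
            --build-arg "TESSERACT_VERSION=${version}" .
    else
        "${DOCKER:-docker}" build --tag tesseract .
    fi
}

# Usage (run from the repository root, next to the Dockerfile):
#   build_image         # latest sources from main
#   build_image 5.0.0   # specific release
```

Giving pinned builds their own tag lets a main-branch build and a release build coexist locally.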
- Only supported target for this docker image currently is linux/amd64.
- Working directory for OCR images is /tmp inside the container. See example above.
- Directory for trained data is /usr/local/share/tessdata/ inside the container. See example above.
- This image was built without the Tesseract training tools.
- This image currently includes only the following languages:
  - English: tessdata_best > eng.traineddata
  - German: tessdata_best > deu.traineddata
- If you need other languages, you have to build your own image or mount trained data to the /usr/local/share/tessdata/ directory. See example above.
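Because the image targets linux/amd64 only, it can help to request that platform explicitly on hosts with a different architecture. The sketch below uses Docker's `--platform` option and assumes the host can run amd64 images natively or through emulation. The function name `run_amd64` and the `DOCKER` variable are hypothetical.

```shell
# Hypothetical helper: run the published image with the platform stated explicitly.
# Assumption: the host can execute linux/amd64 images (natively or via emulation).
run_amd64() {
    "${DOCKER:-docker}" run --rm --platform linux/amd64 \
        -v "${PWD}/testdata:/tmp" \
        franky1/tesseract "$@"
}

# Usage: run_amd64 tesseract english.png output --oem 1 -l eng
```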
- Overview of supported languages https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
- Trained models with support for legacy and LSTM OCR engine https://github.com/tesseract-ocr/tessdata
- Fast integer versions of trained LSTM models https://github.com/tesseract-ocr/tessdata_fast
- Best (most accurate) trained LSTM models https://github.com/tesseract-ocr/tessdata_best
- Docker Hub: https://hub.docker.com/repository/docker/franky1/tesseract
- Original Tesseract Github Repository: https://github.com/tesseract-ocr/tesseract
- Original Tesseract Documentation: https://tesseract-ocr.github.io/
- Original Tesseract Manual: https://tesseract-ocr.github.io/tessdoc/
- More tessdata_best languages: https://github.com/tesseract-ocr/tessdata_best
- Update README.md to latest Dockerfile and Usage
- Add dependabot on GitHub
- Add vulnerability scanning in Github Actions with Snyk
- Add GitHub Action to check container efficiency with Dive https://github.com/MartinHeinz/dive-action
- Add documentation for GitHub Actions Workflow
- Add more inline comments in GitHub Actions related files
- Build image for more targets
- Building Tesseract with TensorFlow?
- Building Tesseract with Training tools?
- Change build in Dockerfile according to instructions in Compiling-GitInstallation.md
If you find bugs or have requests regarding this Docker image, please open an issue in this GitHub repository.
27.07.2022: The Docker image is ready for use. Some slight improvements are still possible, and build issues occur occasionally.