Extract OCR text as .txt files from images or PDFs, processing multiple files from a directory using pytesseract,
with support for rotating pages that are wrongly oriented.
Currently in beta.
Follow: Demo run
Note:
Make sure you have an OCR tool such as tesseract installed,
together with the language data it uses for recognition, e.g. tesseract-data-eng.
Pillow and Wand, used for image loading and conversion, are fetched during pip install.
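A quick way to confirm these prerequisites before a run is sketched below. It is only an illustration: the helper name `check_prerequisites` is hypothetical, and it assumes the Tesseract executable is on PATH under the name `tesseract` and that Pillow and Wand are importable as `PIL` and `wand`.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which prerequisites appear to be available (hypothetical helper)."""
    return {
        # Assumes the OCR binary is on PATH under the name "tesseract".
        "tesseract": shutil.which("tesseract") is not None,
        # Pillow is imported as "PIL"; Wand is imported as "wand".
        "Pillow": importlib.util.find_spec("PIL") is not None,
        "Wand": importlib.util.find_spec("wand") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

This check does not verify that the language data (e.g. tesseract-data-eng) is installed, only that the tools themselves can be located.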
To use it from Python, refer to the py-module branch.
Install using PIP:
$ pip install saram
$ saram <dirname>
Alternatively, clone the source locally:
$ git clone https://github.com/aryaminus/saram
$ cd saram
$ git checkout py-module
$ python main.py <dirname>
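A directory run like the one above can be pictured roughly as follows. This is a sketch under stated assumptions, not saram's actual code: the helper names, the set of accepted extensions, and the choice to write each result next to its source as a .txt file are illustrative. It relies on `pytesseract.image_to_string` and `PIL.Image.open`, and handles images only (a PDF would first need converting to images).

```python
from pathlib import Path

# Assumed set of image extensions; the real tool may accept others.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def collect_images(dirname):
    """Return image files directly inside dirname, sorted by path."""
    return sorted(
        p for p in Path(dirname).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def ocr_directory(dirname):
    """OCR every image in dirname and write the text beside it as <name>.txt."""
    # Imported lazily so collect_images works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    written = []
    for path in collect_images(dirname):
        with Image.open(path) as image:
            text = pytesseract.image_to_string(image)
        out_path = path.with_suffix(".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `ocr_directory("scans")` on a hypothetical `scans` folder would return the list of .txt paths it wrote. Subdirectories are not traversed in this sketch.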
- Add support for PDF by PDF -> Image -> Txt with converted image deletion after processing
- Double check for orientation in case of image and PDF
- Make a PIP package
- Add NLP to process the most frequently repeated characters to filter content
- Add Cloud Vision support for effective character recognition
- Support for GUI using tkinter
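The orientation check mentioned in the list above could be approached with Tesseract's orientation and script detection (OSD). The sketch below shows one possible approach, not the project's implementation: `parse_osd_rotation` and `fix_orientation` are hypothetical helpers, and the code assumes the `Rotate:` value in the text returned by `pytesseract.image_to_osd` is the clockwise correction in degrees.

```python
import re


def parse_osd_rotation(osd_text):
    """Extract the 'Rotate' angle (degrees) from Tesseract OSD output, or 0."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def fix_orientation(image):
    """Return a copy of a PIL image rotated upright according to OSD."""
    # Imported lazily so the parser can be used without pytesseract installed.
    import pytesseract

    angle = parse_osd_rotation(pytesseract.image_to_osd(image))
    if angle == 0:
        return image.copy()
    # PIL rotates counter-clockwise, so negate the (assumed clockwise) angle.
    return image.rotate(-angle, expand=True)
```

OSD may fail on pages with very little text, so a caller would likely want to catch that error and fall back to the unrotated image.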
- Fork it (https://github.com/aryaminus/saram/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request
Enjoy!