GitHub - databricks-industry-solutions/digitization-documents: Using Apache Tika and Tesseract to extract text from any document

Digitization of documents with Tika on Databricks

The volume of available data is growing by the second. About 64 zettabytes were created or copied last year, according to IDC, a technology market research firm. By 2025, this number will grow to an estimated 175 zettabytes, and that data is becoming increasingly granular and difficult to codify, unify, and centralize. Although more financial services institutions (FSIs) are talking about big data and using technology to capture more data than ever, Forrester reports that 70% of all data within an enterprise still goes unused for analytics.

The open source nature of Lakehouse for Financial Services makes it possible for bank compliance officers, insurance underwriting agents, or claims adjusters to combine the latest technologies in optical character recognition (OCR) and natural language processing (NLP) to transform any financial document, in any format, into valuable data assets. The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Combined with Tesseract, the most commonly used OCR technology, there is virtually no limit to the files we can ingest, store, and exploit for analytics or operational purposes. In this solution, we use our newly released Spark input format, tika-ocr, to extract text from PDF reports available online.
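To make the idea of handling "any format" more concrete, here is a minimal sketch of content-based type detection followed by a routing decision between direct text parsing and OCR. It is not Tika's or tika-ocr's implementation: the signature table is a tiny illustrative subset, and the names `MAGIC_SIGNATURES`, `detect_mime_type`, `choose_extraction_route` and the route labels are hypothetical, introduced only for this example.

```python
# Illustrative sketch only: a simplified stand-in for the kind of
# content-based type detection a toolkit such as Apache Tika performs.
# All names and route labels below are hypothetical.

# Leading "magic" bytes for a few common formats (a tiny subset).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"PK\x03\x04", "application/zip"),  # container used by DOCX, XLSX, PPTX
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),  # legacy DOC, XLS, PPT
]


def detect_mime_type(content: bytes) -> str:
    """Guess a MIME type from the first bytes of a file, ignoring its extension."""
    for signature, mime in MAGIC_SIGNATURES:
        if content.startswith(signature):
            return mime
    return "application/octet-stream"


def choose_extraction_route(content: bytes) -> str:
    """Decide how text could be extracted from a file of the detected type."""
    mime = detect_mime_type(content)
    if mime.startswith("image/"):
        # Images carry no text layer, so OCR is the only option.
        return "ocr"
    if mime == "application/pdf":
        # PDFs may have a text layer; scanned pages would still need OCR.
        return "parse_then_ocr"
    if mime == "application/octet-stream":
        return "unknown"
    # Office-style documents usually expose their text directly.
    return "parse"


if __name__ == "__main__":
    print(choose_extraction_route(b"%PDF-1.7\n"))
    print(choose_extraction_route(b"\xff\xd8\xff\xe0"))
```

Detecting the type from content rather than from the file extension matters for documents collected from the web, where extensions are often missing or wrong. In the actual solution this work is delegated to the tika-ocr library rather than hand-written rules like these.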

© 2022 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.

library	description	license	source
unidecode	Text processing	GNU	https://github.com/avian2/unidecode
pdf2image	PDF parser	MIT	https://github.com/Belval/pdf2image
beautifulsoup4	Web scraper	MIT	https://www.crummy.com/software/BeautifulSoup/
PyPDF2	PDF parser	BSD	https://pypi.org/project/PyPDF2
tika-ocr	Spark input format	Databricks	https://github.com/databrickslabs/tika-ocr
tesseract-ocr	OCR library	Apache2	https://github.com/tesseract-ocr
poppler-utils	Image transformation	MIT	https://github.com/skmetaly/poppler-utils

Folders and files:
.github/workflows
config
images
.gitignore
00_digitization_context.py
01_digitization_download.py
02_digitization_extract.py
LICENSE.md
NOTICE
README.md
SECURITY.md
requirements.txt
