
Summarization of text from PDF files and URLs; text and image extraction from web links/URLs and PDF files


deepak-mandal/DueDash-Germany


DueDashAssignment

Web and PDF file Data Extraction

1. Summarization of Text:-

Generates a summary from the source text, then draws a word cloud.
(a). From any web link - generates a summary sized at x percent (e.g. 50%) of the original web source text, then creates a word cloud.
(b). From any PDF file - generates a summary of the .pdf file and its word cloud.

Colab: https://colab.research.google.com/drive/1uWLS3FeO1U9jUCQCjtlQxJMstc28GupJ?usp=sharing
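
The README does not show how the notebook builds its summaries, so the following is only a minimal sketch of one common approach: frequency-based extractive summarization that keeps a given fraction of the sentences. It uses only the Python standard library instead of SpaCy or NLTK, and the names `STOP_WORDS`, `split_sentences`, `word_frequencies`, and `summarize` are hypothetical, not taken from the project.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK and SpaCy ship fuller ones).
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "it",
    "for", "on", "with", "as", "by", "that", "this", "was", "be",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_frequencies(text):
    """Count lowercase words, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def summarize(text, ratio=0.5):
    """Keep roughly `ratio` of the sentences, chosen by summed word frequency."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    freqs = word_frequencies(text)

    def score(sentence):
        # Stop words are absent from the counter, so they contribute 0.
        return sum(freqs[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)
```

Calling `summarize(text, ratio=0.5)` returns about half of the sentences in their original order. The counter returned by `word_frequencies` is a word-to-count mapping, which is the kind of input the WordCloud library's `generate_from_frequencies` method accepts, so the same counts could feed the word cloud step.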

2. Text and Image Extraction from any Web Link or URL:-

(a). Generates a formatted HTML file from the page source code.
(b). Extracts all images from the web link and downloads them into a folder automatically.
(c). Extracts text data such as paragraph tags, anchor tags, and header tags, then saves all of it to a file.
Also extracts text and image data from PDF files.

Colab: https://colab.research.google.com/drive/1lyBAsNTcgpi0-7yh7Mycbx2qvCEhQKAD?usp=sharing
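
As a rough illustration of steps (b) and (c), the sketch below collects image URLs and the text of paragraph, anchor, and header tags from an HTML string. It is not the notebook's code: the project lists BeautifulSoup, while this example swaps in the standard library's `html.parser` so that it runs without extra packages. The names `PageExtractor`, `extract`, and `TEXT_TAGS` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose text is collected (paragraph, anchor, and header tags).
TEXT_TAGS = {"p", "a", "h1", "h2", "h3", "h4", "h5", "h6"}


class PageExtractor(HTMLParser):
    """Collects absolute image URLs and the text inside selected tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []
        self.text_by_tag = {}   # tag name -> list of text strings
        self._open = []         # stack of (tag, text chunks) still being read

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))
        elif tag in TEXT_TAGS:
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every enclosing tag that is still open.
        for _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        # Close the most recently opened tag with this name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, chunks = self._open.pop(i)
                text = " ".join("".join(chunks).split())
                if text:
                    self.text_by_tag.setdefault(tag, []).append(text)
                break


def extract(html, base_url):
    """Return (text_by_tag, image_urls) for an HTML string."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.text_by_tag, parser.image_urls
```

In a full pipeline, the HTML string would typically come from `requests.get(url).text`, and each collected image URL could then be fetched and written into a download folder.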
Technologies used: Python 3, BeautifulSoup (bs4), PyPDF2, spaCy, NLTK, WordCloud, NumPy, shutil, os, parse, requests
