DueDashAssignment

Web and PDF file Data Extraction

1. Summarization of Text:-

Generated a summary from the source text, then drew a word cloud from it.
(a). From any web link: generated a summary that is x percent (e.g. 50%) of the original web page text, then created a word cloud.
(b). From any PDF file: generated a summary of the .pdf file and its word cloud.
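
The summarization step above can be sketched as a simple frequency-based extractive summarizer. This is a minimal illustration, not the notebook's actual code: the names `summarize`, `word_frequencies`, and `STOP_WORDS` are hypothetical, the stop-word list is a small stand-in for the NLTK or spaCy lists, and only the standard library is used. The `ratio` parameter plays the role of the "x percent" mentioned above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption: the notebook likely uses
# the NLTK or spaCy stop-word list instead).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in",
              "it", "on", "for", "with", "that", "this"}


def word_frequencies(text):
    """Count non-stop-words; the same counts could feed a word cloud."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def summarize(text, ratio=0.5):
    """Keep roughly `ratio` of the sentences, chosen by word-frequency score."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    freqs = word_frequencies(text)

    def score(sentence):
        # Stop words are absent from the Counter, so they contribute 0.
        return sum(freqs[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    keep = max(1, round(len(sentences) * ratio))
    # Pick the highest-scoring sentences, then restore original order.
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```

The counts returned by `word_frequencies` are in the dictionary form that the WordCloud library's `generate_from_frequencies` method accepts, so the same pass over the text can serve both the summary and the word cloud.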

Colab: https://colab.research.google.com/drive/1uWLS3FeO1U9jUCQCjtlQxJMstc28GupJ?usp=sharing

2. Text and Image Extraction from any Web Link or URL:-

(a). Generated a formatted HTML file from the page source code.
(b). Extracted all images from the web link and downloaded them into a folder automatically.
(c). Extracted various text data, such as paragraph, anchor, and header tags, and saved it all to a file.
Also extracted text and image data from PDF files.
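
The tag and image extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs, and the names `PageExtractor` and `extract` are hypothetical. Fetching the page and downloading the images (for example with `requests`) is left out; the sketch only collects the text and the absolute image URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose text we collect: paragraphs, anchors, and headers.
TEXT_TAGS = {"p", "a", "h1", "h2", "h3", "h4", "h5", "h6"}


class PageExtractor(HTMLParser):
    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.images = []   # absolute image URLs
        self.text = []     # (tag, text) pairs, in closing order
        self._open = []    # currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.images.append(urljoin(self.base_url, src))
        elif tag in TEXT_TAGS:
            self._open.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every open tag of interest (e.g. <a> inside <p>).
        for entry in self._open:
            entry[1].append(data)

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                name, parts = self._open.pop(i)
                text = " ".join("".join(parts).split())
                if text:
                    self.text.append((name, text))
                break


def extract(html, base_url=""):
    """Return ([(tag, text), ...], [image_url, ...]) for an HTML string."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.text, parser.images
```

With BeautifulSoup the same result is usually obtained through `soup.find_all(...)` on the tag names; the parser class here only makes the traversal explicit.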

Colab: https://colab.research.google.com/drive/1lyBAsNTcgpi0-7yh7Mycbx2qvCEhQKAD?usp=sharing

.ipynb_checkpoints
PdfImg_TextFolder
WebImg_TextFolder
LICENSE
PDF_text_data.txt
README.md
TextSummarization.ipynb
Text_Image_Extraction.ipynb
URLTextData.txt
cnt.pdf
ml1.pdf
temp.html
wc_Any_URL.png
wc_PDF_text_data_pdf.png
wc_Raw_text_data_link.png


Technologies used: Python3, BeautifulSoup/bs4, PyPDF2, SpaCy, NLTK, WordCloud, NumPy, Shutil, OS, parse, requests


deepak-mandal/DueDash-Germany
