
Web scraping to identify and download the latest PDF documents, and classify them into pre-defined categories.

  • This repository helps you scrape data from multiple websites. It downloads the latest PDF files published on a website into a folder of your choice, which can be used to automate various operations involved in market research (see the sketch after this list).
  • Once the PDFs are downloaded, they are classified into oil/no_oil/foreign_language categories based on a string-based rule (see the sketch under Reference).
  • You can customize these classification rules as needed.
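A minimal sketch of the scraping and download step, assuming a hypothetical source page URL and output folder; the actual logic lives in radar_automation.py and may differ in detail. It uses urllib and beautifulsoup4, as listed under Reference.

```python
# Sketch only: TARGET_URL and DOWNLOAD_DIR are hypothetical placeholders.
import os
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

TARGET_URL = "https://example.com/reports"   # hypothetical source page
DOWNLOAD_DIR = "downloads"                   # hypothetical output folder

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

# Parse the page and collect every link that points to a PDF file
html = urlopen(TARGET_URL).read()
soup = BeautifulSoup(html, "html.parser")
pdf_links = [
    urljoin(TARGET_URL, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".pdf")
]

# Download each PDF into the chosen folder
for url in pdf_links:
    filename = os.path.join(DOWNLOAD_DIR, url.rsplit("/", 1)[-1])
    urlretrieve(url, filename)
    print(f"Downloaded {filename}")
```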

Instructions

  • pip install -r requirements
  • Run radar_automation.py

Reference

I devised the solution from the following documentation:

  • [urllib] - a package that collects several modules for working with URLs
  • [beautifulsoup4] - to scrape information from web pages
  • [PDFMiner] - a text extraction tool for PDF documents
  • [NLTK] - for natural language processing
  • Keyword-based search in the extracted text for rule-based classification (see the sketch below)
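A minimal sketch of the classification step, assuming pdfminer.six for text extraction; the keyword lists and the language check shown here are hypothetical stand-ins for the rules shipped in the repository.

```python
# Sketch only: keyword sets are hypothetical, not the project's actual rules.
from pdfminer.high_level import extract_text

OIL_KEYWORDS = {"oil", "petroleum", "crude"}     # hypothetical keyword rule
ENGLISH_MARKERS = {"the", "and", "of", "to"}     # crude language heuristic


def classify_pdf(path: str) -> str:
    """Return 'oil', 'no_oil', or 'foreign_language' for one PDF file."""
    text = extract_text(path).lower()
    tokens = set(text.split())   # NLTK tokenization could be used here instead

    # Documents with almost no common English words are flagged as foreign
    if not tokens & ENGLISH_MARKERS:
        return "foreign_language"
    # Any keyword hit puts the document in the oil category
    return "oil" if tokens & OIL_KEYWORDS else "no_oil"


print(classify_pdf("downloads/report.pdf"))
```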