Web scraping to identify and download the latest PDF documents, then classify them into predefined categories.
- This repository helps you scrape data from multiple websites. It downloads the latest PDF files published on a website into a specific folder, according to the user's requirements. This can be used to automate various operations involved in market research.
- Once the PDFs are downloaded, they are classified into the oil/no_oil/foreign_language categories by a string-based rule.
- You can customize these classification rules to suit your needs.
- Install the dependencies: pip install -r requirements
- Run radar_automation.py
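
The string-based classification described above can be sketched roughly as follows. This is a minimal illustration, not the code in radar_automation.py: the function name `classify_text`, the keyword list, the stop-word list, and the `min_english_ratio` threshold are all assumptions made for the example, and it expects the PDF text to have been extracted already.

```python
import re

# Hypothetical keyword list; the repository's actual rules may differ.
OIL_KEYWORDS = ("oil", "crude", "petroleum")

# Rough heuristic (an assumption): text with almost no common English
# function words is treated as being in a foreign language.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "for"}


def classify_text(text, min_english_ratio=0.05):
    """Return 'oil', 'no_oil', or 'foreign_language' for extracted PDF text."""
    # Split into lowercase alphabetic words (Unicode letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())

    # Share of words that are common English function words.
    english_hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    ratio = english_hits / len(words) if words else 0.0

    # Assumption: empty or non-English text falls into 'foreign_language'.
    if ratio < min_english_ratio:
        return "foreign_language"

    # String-based rule: any oil keyword appearing as a whole word.
    if any(w in OIL_KEYWORDS for w in words):
        return "oil"
    return "no_oil"


if __name__ == "__main__":
    print(classify_text("The price of crude oil is rising in the market"))
```

To customize the rules in a sketch like this, you would edit the keyword tuple or the threshold; matching on whole words rather than substrings avoids false hits such as "boil" or "soil".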
I devised the solution from the following pages of the documentation: