A Python-based crawler that mines PDFs from research-related websites and extracts useful features such as author names and email addresses.
Requirements (Python 3):
- pypdf2
- pymysql
- nltk (tokenizer)
- pyenchant (imported as enchant)
- BeautifulSoup
- urllib
- requests
- pdftohtml (a command-line tool, not a Python package)
Make a database on localhost named "authors_db". In "authors_db" create a table named "nameemail" having the following fields:
- email: varchar, length 500 (make it unique to avoid duplicates)
- name: varchar, length 500
- info: varchar, length 500
- website: varchar, length 500
Default credentials:
user: root
password: admin123
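The table setup described above can also be scripted. The following is a minimal sketch, not part of the project itself: the names `CREATE_TABLE_SQL`, `DB_CONFIG`, `create_table`, and `setup_mysql` are hypothetical, and it assumes the `authors_db` database already exists and that the default credentials listed above are in use.

```python
# Column layout taken from this README; the email column is UNIQUE
# so the same address is never stored twice.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS nameemail (
    email   VARCHAR(500) UNIQUE,
    name    VARCHAR(500),
    info    VARCHAR(500),
    website VARCHAR(500)
)
"""

# Default credentials as listed in this README (assumed unchanged).
DB_CONFIG = {
    "host": "localhost",
    "user": "root",
    "password": "admin123",
    "database": "authors_db",
}


def create_table(connection):
    """Create the nameemail table on any DB-API 2.0 connection."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
    finally:
        cursor.close()
    connection.commit()


def setup_mysql():
    """Hypothetical helper: connect with the defaults above and create the table."""
    # Imported lazily so this module loads even where PyMySQL is not installed.
    import pymysql

    connection = pymysql.connect(**DB_CONFIG)
    try:
        create_table(connection)
    finally:
        connection.close()
```

Calling `setup_mysql()` once after creating `authors_db` should be enough, since `IF NOT EXISTS` makes repeated calls harmless. If the credentials differ from the defaults, adjust `DB_CONFIG` accordingly.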
Place the list of domains in finalDomain.txt, then start automate.py:
python automate.py