pdf-miner

A Python-based crawler that mines PDFs from research-related websites and extracts useful features such as author names and email addresses.
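
As a rough illustration of the email-extraction step, here is a minimal sketch that pulls addresses out of text already extracted from a PDF. The function name `extract_emails` and the regular expression are assumptions made for this example; they are not taken from the project's code, and the pattern is deliberately simple rather than fully standards-compliant.

```python
import re

# Simple pattern chosen for this sketch (an assumption, not the project's own).
# It will miss some unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email addresses found in text, in order of appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()  # treat addresses case-insensitively
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


if __name__ == "__main__":
    sample = "Jane Doe (jane.doe@example.edu), John Roe <j.roe@example.org>"
    print(extract_emails(sample))
```

Lowercasing before de-duplication means the same address written with different capitalization is only reported once.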

Requirements:

Python 3

Create a database named "authors_db" on localhost. In "authors_db", create a table named "nameemail" with the following fields:

email: varchar, length 500 (make it unique to avoid duplication)
name: varchar, length 500
info: varchar, length 500
website: varchar, length 500
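
The table layout above could be sketched as follows. The README does not say which database engine or Python driver the project uses, so this example uses the standard-library `sqlite3` module with an in-memory database purely as a stand-in; the helper names `create_table` and `add_author` are hypothetical. Other DB-API drivers may need a different placeholder style than `?`.

```python
import sqlite3

# Column names and lengths come from the README; the exact SQL spelling
# is an assumption of this sketch.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS nameemail (
    email   VARCHAR(500) UNIQUE,
    name    VARCHAR(500),
    info    VARCHAR(500),
    website VARCHAR(500)
)
"""


def create_table(conn):
    """Create the nameemail table if it does not exist yet."""
    conn.execute(CREATE_TABLE_SQL)
    conn.commit()


def add_author(conn, email, name, info, website):
    """Insert one row; return False if the email is already stored."""
    try:
        conn.execute(
            "INSERT INTO nameemail (email, name, info, website) VALUES (?, ?, ?, ?)",
            (email, name, info, website),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint on email rejected a duplicate.
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    create_table(conn)
    print(add_author(conn, "jane.doe@example.edu", "Jane Doe", "", "example.edu"))
    print(add_author(conn, "jane.doe@example.edu", "J. Doe", "", "example.edu"))
```

The unique constraint on `email` is what prevents the same author from being stored twice when the crawler sees the same address on several pages.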

Default credentials:
user: root
password: admin123

Place the list of domains in finalDomains.txt, then start automate.py:

python automate.py
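
For reference, a domain list could be read along these lines. This is only a sketch: it assumes the file holds one domain per line, which the README does not state, and the helper names `parse_domains` and `load_domains` are hypothetical rather than taken from automate.py.

```python
def parse_domains(lines):
    """Return unique, non-empty domains from an iterable of lines, in order."""
    domains = []
    for line in lines:
        domain = line.strip()
        if not domain:
            continue  # skip blank lines
        if domain not in domains:
            domains.append(domain)
    return domains


def load_domains(path):
    """Read the domain list file at path (assumed: one domain per line)."""
    with open(path, encoding="utf-8") as handle:
        return parse_domains(handle)
```

Keeping the parsing separate from the file access makes it easy to check the list handling without touching the disk.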
