Simple Python module to crawl a website and extract URLs.
Install using pip:
pip3 install sitecrawl
sitecrawl --help
Or build from source:
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl
# Installation
pip3 install .
Command-line example:
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
Output:
* Found 4 internal URLs
https://www.yahoo.com
https://www.yahoo.com/entertainment
https://www.yahoo.com/lifestyle
https://www.yahoo.com/plus
* Found 5 external URLs
https://mail.yahoo.com/
https://news.yahoo.com/
https://finance.yahoo.com/
https://sports.yahoo.com/
https://shopping.yahoo.com/
* Skipped 0 URLs
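In the sample output above, URLs on www.yahoo.com are reported as internal, while subdomains such as mail.yahoo.com are reported as external. The sketch below shows one way such a split could be made, assuming the rule is a plain hostname comparison against the base URL. It uses only the standard library; `classify_links` is a hypothetical helper written for illustration, not part of sitecrawl, and the library's actual logic may differ.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, links):
    """Split links into (internal, external) lists.

    Assumption: a link is internal only if its hostname equals the
    hostname of base_url. This is a guess based on the sample output,
    not sitecrawl's documented behavior.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for link in links:
        # Resolve relative links such as '/lifestyle' against the base URL
        absolute = urljoin(base_url, link)
        host = urlparse(absolute).netloc.lower()
        if host == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = classify_links(
    'https://www.yahoo.com',
    ['/lifestyle', 'https://www.yahoo.com/plus', 'https://mail.yahoo.com/'],
)
print('Internal:', internal)
print('External:', external)
```

With these inputs, the two www.yahoo.com links end up in the internal list and the mail.yahoo.com link in the external list.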
Basic example:
from sitecrawl import crawl
# Set the site to crawl
crawl.base_url = 'https://www.yahoo.com'
# Crawl the site with a depth of 2
crawl.deep_crawl(depth=2)
# Print the URLs collected during the crawl
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
A more detailed example is available in example.py.
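To give an intuition for what a depth limit and a maximum URL count do during a crawl, here is a minimal sketch of a depth-limited, breadth-first traversal over a small in-memory link graph. It makes no network requests and does not use sitecrawl. The `LINKS` data, the `crawl_graph` function, and its `max_urls` parameter are all hypothetical. It assumes that depth counts how many levels of links are followed from the start page and that the maximum caps the total number of collected URLs; sitecrawl's exact semantics for `--depth` and `--max` may differ.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the pages it links to.
LINKS = {
    '/': ['/news', '/sports'],
    '/news': ['/news/world', '/'],
    '/sports': ['/sports/nba'],
    '/news/world': ['/news/world/europe'],
}


def crawl_graph(start, depth, max_urls=None):
    """Collect pages reachable from `start` by following at most `depth` levels of links."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in LINKS.get(page, []):
            if max_urls is not None and len(seen) >= max_urls:
                return seen  # stop as soon as the cap is reached
            if link not in seen:
                seen.append(link)
                queue.append((link, level + 1))
    return seen


print(crawl_graph('/', depth=1))
print(crawl_graph('/', depth=2, max_urls=4))
```

With a depth of 1, only the start page and the pages it links to directly are collected. Raising the depth to 2 also reaches pages linked from those, until the cap of 4 URLs stops the traversal.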