
Web Crawler

This is a simple web crawler implemented in Python using requests and BeautifulSoup. It recursively follows links from a given starting URL and prints out all the unique HTTP links it encounters.

Features

  • Recursively crawls websites to discover links.
  • Avoids revisiting URLs to prevent infinite loops.
  • Handles HTTP request errors gracefully.
  • Supports a configurable maximum depth to limit how far the crawl recurses (see the sketch below).
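
For reference, here is a minimal sketch of this approach. It is not the repository's exact code: the function name crawl, the max_depth parameter, and the starting URL are illustrative assumptions.

import requests
from bs4 import BeautifulSoup

def crawl(url, max_depth=2, visited=None):
    # Illustrative sketch, not the repository's code. It tracks visited
    # URLs to avoid revisits, stops at max_depth, and handles errors.
    if visited is None:
        visited = set()
    if url in visited or max_depth < 0:
        return
    visited.add(url)
    print(url)  # each unique URL (including the start URL) prints once
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Handle request errors gracefully instead of aborting the crawl.
        print(f"Failed to fetch {url}: {exc}")
        return
    soup = BeautifulSoup(response.text, "html.parser")
    for anchor in soup.find_all("a", href=True):
        link = anchor["href"]
        # Recurse only into absolute HTTP(S) links.
        if link.startswith("http"):
            crawl(link, max_depth - 1, visited)

if __name__ == "__main__":
    crawl("https://example.com", max_depth=2)  # hypothetical start URL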

Requirements

  • Python 3.x
  • requests
  • beautifulsoup4

Install the dependencies using:

pip install requests beautifulsoup4

Usage

Clone the repository:

git clone https://github.com/ls-saurabh/webcrawl.git
cd webcrawl

Run the script:

python crawl.py
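
The script is invoked without command-line arguments, so the starting URL and maximum crawl depth are presumably set inside crawl.py itself; check the top of the script to change them.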
