This Python-based web crawler is designed to crawl news websites, scrape articles, and store them in a MongoDB database. The crawler starts from a set of initial URLs, extracts the links on each page it visits, and works through them with a crawl queue. Before each request it checks the site's `robots.txt` file to comply with the site's crawl policies, and for websites with dynamically loaded content it uses Selenium to automate a browser and render the page before scraping.
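As a rough illustration of the dynamic-content path, here is a minimal sketch that renders a page in headless Chrome before handing the HTML to BeautifulSoup; the `render_page` helper, browser choice, and options are assumptions for illustration, not necessarily what `crawler.py` does:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def render_page(url):
    """Load a JavaScript-heavy page in headless Chrome and return parsed HTML."""
    options = Options()
    options.add_argument("--headless=new")  # render without opening a browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return BeautifulSoup(driver.page_source, "lxml")
    finally:
        driver.quit()  # always release the browser, even on errors
```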
The crawler was run for a few hours, during which it successfully crawled around 2200 unique articles from the Yahoo Finance news website. The articles were stored in a MongoDB database, with all relevant metadata for easy retrieval and querying.
- Starts with a list of initial URLs.
- Extracts all URLs on each page and adds them to a crawl queue.
- Navigates through the queued URLs to discover and scrape new content (see the sketch after this list).
- Scrapes news articles from each webpage visited.
- Handles dynamically loaded content by automating a browser using Selenium.
- Stores the scraped articles in a MongoDB database for easy retrieval and querying.
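As referenced in the feature list, a queue-driven crawl loop can be sketched in a few lines with `requests` and BeautifulSoup. The function name, page limit, and error handling below are illustrative assumptions, not the exact implementation in `crawler.py`:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Yield (url, soup) for each page visited, breadth-first."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or failing pages
        visited += 1
        soup = BeautifulSoup(response.text, "lxml")
        # Extract every link on the page and enqueue the ones not seen yet
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
        yield url, soup  # hand the parsed page to the article scraper
```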
- Checks each website's `robots.txt` file before making a request to respect crawl policies.
- Avoids URLs that are explicitly disallowed by `robots.txt`.
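This kind of check can be done with Python's standard-library `urllib.robotparser`. The sketch below, including the `allowed` helper and per-site cache, is illustrative rather than the exact approach in `crawler.py`:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_parsers = {}  # cache one parsed robots.txt per site

def allowed(url, user_agent="*"):
    """Return True if robots.txt permits fetching this URL."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = _parsers.get(origin)
    if parser is None:
        parser = RobotFileParser(origin + "/robots.txt")
        parser.read()  # fetch and parse the site's robots.txt
        _parsers[origin] = parser
    return parser.can_fetch(user_agent, url)
```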
- The scraped articles are stored in MongoDB for efficient storage and retrieval.
- Each article is saved with metadata (e.g., id, title, URL, crawl date) to enable structured queries.
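For illustration, an article document might be written with pymongo as below; the database and collection names and the exact field set are assumptions, not necessarily the schema `crawler.py` uses:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()  # defaults to mongodb://localhost:27017
articles = client.news_crawler.articles  # hypothetical database/collection names

def save_article(article_id, title, url):
    """Store one article along with the metadata fields mentioned above."""
    articles.insert_one({
        "id": article_id,
        "title": title,
        "url": url,
        "crawl_date": datetime.now(timezone.utc),
    })
```

Stored this way, articles can be queried by any field, e.g. `articles.find({"title": {"$regex": "earnings", "$options": "i"}})` for a case-insensitive title search.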
- Python 3.x
- pymongo
- Selenium
- beautifulsoup4
- requests
- lxml
- A web browser (e.g., Chrome, Firefox) along with the respective Selenium driver (e.g., ChromeDriver)
1. Clone this repository:

   ```bash
   git clone https://github.com/amndzdzdz/Web-Crawler.git
   ```

2. Install the required Python libraries:

   ```bash
   pip install -r requirements.txt
   ```

3. Download and configure the web driver for Selenium (e.g., ChromeDriver or GeckoDriver) and ensure it's accessible from your PATH.

4. Configure the MongoDB connection in the crawler script:

   ```python
   from pymongo import MongoClient

   client = MongoClient()
   your_database_name = client.your_database_name
   your_collections_name = your_database_name.your_collections_name
   ```

5. Run the crawler:

   ```bash
   python crawler.py
   ```
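After a run, you can sanity-check the stored data from a Python shell. This snippet assumes the database and collection names from the configuration step and a `crawl_date` field on each document:

```python
from pymongo import MongoClient

client = MongoClient()
collection = client.your_database_name.your_collections_name

print(collection.count_documents({}))                  # number of stored articles
print(collection.find_one(sort=[("crawl_date", -1)]))  # most recently crawled article
```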
Feel free to submit pull requests or open issues for improvements, bug fixes, or feature requests!