Data Crawling Mini Project

A lightweight and efficient web crawling project using Python. This mini-project leverages BeautifulSoup, Requests, and Selenium to extract and process web data. Ideal for beginners and developers looking to automate data collection from websites.

Introduction

This project focuses on collecting data from various websites using Python. The main goal is to build an efficient, scalable, and easy-to-use data crawling tool for research or real-world applications. The project is designed to handle different types of web structures, enabling users to extract relevant data in a structured format. It offers configurable crawling parameters, support for dynamic (JavaScript-rendered) content, and several data storage options.

Features

  • Flexible Configuration: Easily define rules for extracting data from various sources.
  • Scalability: Supports parallel crawling to optimize data collection speed.
  • Handling Dynamic Content: Uses Selenium to drive a headless browser and extract JavaScript-rendered data (a minimal sketch follows this list).
  • Data Storage Options: Store crawled data in JSON, CSV, or database systems like PostgreSQL and MongoDB.
  • Error Handling & Logging: Implements robust exception handling and logs crawling progress.
  • Scheduler Support: Enables periodic crawling using schedulers such as cron or Celery Beat.
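
Below is a minimal sketch of the requests-first, Selenium-fallback approach the dynamic-content feature describes. It is not the project's crawler: the URL, selector, and fetch_html function are hypothetical placeholders, and it assumes Chrome plus a matching chromedriver are available.

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def fetch_html(url, use_browser=False):
        """Return page HTML, using Selenium only when JavaScript rendering is required."""
        if not use_browser:
            resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
            resp.raise_for_status()
            return resp.text
        options = Options()
        options.add_argument("--headless")  # drive Chrome without a visible window
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()

    html = fetch_html("https://example.com", use_browser=False)
    soup = BeautifulSoup(html, "html.parser")
    # Extract all level-2 headings; swap the selector for one matching your target site.
    print([h.get_text(strip=True) for h in soup.select("h2")])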

Project Structure

  • src/: Directory containing the project's main source code.
  • docs/: Directory containing related documentation.
  • tutorials/: Directory containing guides and example use cases.
  • configs/: Configuration files for defining crawling rules and settings.
  • logs/: Logs generated during the crawling process.

System Requirements

  • Python 3.x
  • Required Python libraries are listed in the requirements.txt file.

Installation

  1. Clone the repository:

    git clone https://github.com/NhanPhamThanh-IT/Data-Crawling-Mini-Project.git
    cd Data-Crawling-Mini-Project
  2. Create and activate a virtual environment (recommended):

    python -m venv env
    source env/bin/activate  # On Windows: env\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Usage

  1. Define crawling rules: Modify the configuration files in the configs/ directory to specify target websites, request headers, and extraction rules (a hypothetical example of what such a file might contain follows this list).

  2. Run the crawler:

    python src/crawler.py --config configs/sample_config.json
  3. View logs: Check the logs/ directory for status updates and debugging information.

  4. Process and analyze data: Export the collected data for further processing or integrate it with analytics tools (a brief loading sketch appears after the tutorials note below).
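
As mentioned in step 1, here is a hedged illustration of what a configuration file could contain. The keys (url, headers, selectors, output) and the values are assumptions made for this example only; consult the files already present in configs/ for the schema the crawler actually expects.

    import json

    # Hypothetical crawling rules: a target URL, request headers, and CSS
    # selectors describing which elements to extract.
    sample_config = {
        "url": "https://example.com/articles",
        "headers": {"User-Agent": "Mozilla/5.0"},
        "selectors": {
            "title": "h2.entry-title",
            "link": "h2.entry-title a",
        },
        "output": "output/articles.json",
    }

    # Write the rules to the configs/ directory so they can be passed via --config.
    with open("configs/sample_config.json", "w", encoding="utf-8") as f:
        json.dump(sample_config, f, indent=2)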

Detailed instructions on using the data crawling tool can be found in the tutorials/ directory. These guides provide specific examples of how to configure and run the tool to collect data from various sources.
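
For step 4, the collected records can be loaded back for quick analysis. The sketch below assumes the crawler wrote its results to a hypothetical output/articles.json and that pandas is installed (it may not be listed in requirements.txt):

    import json
    import pandas as pd

    # Load the crawled records (the path is an illustrative example).
    with open("output/articles.json", encoding="utf-8") as f:
        records = json.load(f)

    # Convert to a DataFrame for filtering or aggregation, then export to CSV.
    df = pd.DataFrame(records)
    df.to_csv("output/articles.csv", index=False)
    print(df.head())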

Contribution

We welcome contributions from the community. If you would like to contribute, please fork the project, create a new branch for your feature or bug fix, and submit a pull request. Make sure to thoroughly test your code and follow the project's contribution guidelines.

How to Contribute

  1. Fork the repository and clone it locally.
  2. Create a new branch:
    git checkout -b feature-branch-name
  3. Make your changes and test them.
  4. Commit your changes:
    git commit -m "Describe your changes here"
  5. Push to your fork:
    git push origin feature-branch-name
  6. Submit a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact

For questions, feature requests, or bug reports, please open an issue on GitHub or contact the project maintainer via email at contact@example.com.
