Dynamic URL Crawler

Project Description

Dynamic URL Crawler is a Python-based asynchronous web scraping tool built on Playwright. It scrapes product-related links dynamically from a given list of URLs, handles infinite-scrolling pages, and extracts URLs matching specific patterns. The scraped data is stored in a structured JSON format for further use.

Features

  • Asynchronous Crawling: Utilizes asyncio and Playwright for high-performance, non-blocking web scraping.
  • Dynamic Scrolling: Automatically scrolls to the bottom of pages to ensure complete data extraction from infinite scrolling websites.
  • Customizable URL Patterns: Scrapes links matching specific product-related patterns such as /product/, /dp/, and /shop/ (see the filter sketch after this list).
  • JSON Storage: Saves extracted product links in a product_urls.json file.
  • Scalable Architecture: Handles multiple URLs concurrently for efficient scraping.
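
As a minimal illustration of the pattern matching, a filter along these lines could decide which hrefs count as product links. The pattern list here is an assumption pieced together from the feature description and the sample output below, not necessarily the script's actual list:

# Hypothetical pattern list; urlExtractor.py's actual patterns may differ.
PRODUCT_PATTERNS = ("/product/", "/dp/", "/shop/", "/item/", "/p/")

def is_product_link(href: str) -> bool:
    """Return True if the href matches any known product URL pattern."""
    return any(pattern in href for pattern in PRODUCT_PATTERNS)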

Installation

Prerequisites

  1. Python: Ensure Python 3 is installed.

Steps to Install

  1. Clone the repository:
    git clone https://github.com/RaKAsHASH/urlExtractor.git
    cd urlExtractor
  2. Set up a virtual environment:
    python3 -m venv <your-venv-name>
  3. Activate the virtual environment:
    source <your-venv-name>/bin/activate
  4. Install dependencies. Install and set up Playwright with the following commands (a quick smoke test follows these steps):
    pip install playwright
    playwright install
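
To verify the setup, a quick smoke test such as the following should launch a headless Chromium and print a page title (example.com is used here only as a neutral target):

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())  # should print "Example Domain"
        await browser.close()

asyncio.run(main())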

Usage

  1. Add the target URLs to the url list in the script (the presumed wiring is sketched after these steps):
    url = ["https://www.amazon.in/s?k=i+phone+15+pro", "https://www.flipkart.com/", ...]
  2. Run the script:
    python urlExtractor.py
  3. View the results in the product_urls.json file.
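
For reference, the wiring between the url list and the crawler presumably looks something like the sketch below. This assumes the DynamicUrlCrawler interface described under Code Structure; the actual entry point in urlExtractor.py may differ:

import asyncio

# Target URLs to crawl (extend this list as needed).
url = [
    "https://www.amazon.in/s?k=i+phone+15+pro",
    "https://www.flipkart.com/",
]

crawler = DynamicUrlCrawler(url)    # assumed constructor signature
asyncio.run(crawler.start_crawl())  # crawl all URLs, then write JSON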

Code Structure

  • DynamicUrlCrawler class: Manages the crawling process and data extraction (see the skeleton below).
  • start_crawl method: Initiates the browser, distributes tasks, and manages concurrent URL processing.
  • scrape_page method: Handles infinite scrolling and extracts product links.
  • save_results method: Saves extracted links to a JSON file.
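
A minimal skeleton of this structure, assuming Chromium, a hard-coded pattern list, and scroll-until-the-height-stops-growing logic; the real implementation may differ in its details:

import asyncio
import json
from playwright.async_api import async_playwright

class DynamicUrlCrawler:
    def __init__(self, urls, patterns=("/product/", "/dp/", "/shop/", "/item/", "/p/")):
        self.urls = urls
        self.patterns = patterns
        self.results = {}

    async def start_crawl(self):
        # Launch one browser and scrape every URL concurrently.
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            await asyncio.gather(*(self.scrape_page(browser, u) for u in self.urls))
            await browser.close()
        self.save_results()

    async def scrape_page(self, browser, url):
        page = await browser.new_page()
        await page.goto(url)
        # Keep scrolling until the page height stops growing (infinite scroll).
        prev_height = -1
        while True:
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(2)  # static wait; see Limitations
            height = await page.evaluate("document.body.scrollHeight")
            if height == prev_height:
                break
            prev_height = height
        # Collect raw href attributes and keep only product-like links.
        hrefs = await page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.getAttribute('href'))"
        )
        self.results[url] = [
            h for h in hrefs if h and any(pat in h for pat in self.patterns)
        ]
        await page.close()

    def save_results(self):
        with open("product_urls.json", "w") as f:
            json.dump(self.results, f, indent=2)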

Example Output

An example product_urls.json file (note that the extracted links are relative paths; a snippet for resolving them follows the example):

{
  "https://www.amazon.in/s?k=i+phone+15+pro": [
    "/product/iphone-15-pro",
    "/dp/B0C7XYZ"
  ],
  "https://www.flipkart.com/": [
    "/item/iphone-case",
    "/p/smartphone"
  ]
}
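
Because the extracted links are relative paths, they can be resolved against the page they were scraped from using the standard library, for example:

import json
from urllib.parse import urljoin

with open("product_urls.json") as f:
    results = json.load(f)

for page_url, links in results.items():
    for link in links:
        # Resolve each relative link against its source page.
        print(urljoin(page_url, link))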

Dependencies

  • Python 3 (asyncio and json ship with the standard library)
  • Playwright (pip install playwright, plus playwright install to download the browser binaries)

Limitations

  • Limited to scraping product-related links that match predefined patterns.
  • Cannot follow pagination to collect product links.
  • Uses a static wait time of 2 seconds for page loading (see the sketch below for one possible alternative).
  • Requires a stable internet connection; rate limiting must be handled by the user.
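
The static wait could plausibly be replaced with Playwright's load-state waits. A sketch of one alternative, offered as an untested suggestion rather than part of the current script:

from playwright.async_api import Page, TimeoutError as PlaywrightTimeoutError

async def scroll_and_settle(page: Page) -> None:
    # Scroll to the bottom, then wait for network activity to settle
    # instead of sleeping a fixed 2 seconds.
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    try:
        await page.wait_for_load_state("networkidle", timeout=5000)
    except PlaywrightTimeoutError:
        pass  # page never went fully idle; proceed with what has loaded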

Developed with 💻 and 🧠 by Harjeet
