ProductDigest

Objective

This project provides a tool to automatically extract and compile webpage details into a well-formatted PDF document. It is particularly tailored to handle Amazon product pages, capturing prices and other key details, but also works with general URLs to gather metadata and generate page previews.

What It Does

Reads a list of URLs from a file (urls.txt).
Fetches the title, timestamp, and thumbnail of each webpage.
For Amazon URLs, also retrieves product details such as price and description.
Compiles these details into a structured PDF file (webpage_details.pdf), making it convenient for users to view key webpage information offline.
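
The URL-reading step and the Amazon check in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in ProductDigest.py: the helper names `read_urls` and `is_amazon_url` are hypothetical, and treating any hostname with an `amazon` label as an Amazon page is an assumed heuristic.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_urls(path="urls.txt"):
    """Return the non-empty, whitespace-stripped lines of the URL list file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_amazon_url(url):
    """Assumed heuristic: the hostname has 'amazon' as one of its dot-separated labels."""
    host = (urlparse(url).hostname or "").lower()
    return "amazon" in host.split(".")
```

A caller could then branch on `is_amazon_url(url)` to decide whether to run the Amazon-specific extraction or the general one.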

Sample Output

The image above shows a sample page with product information from Amazon India, generated during a trial run.

How It Works

URL Processing: The script reads URLs from a text file (urls.txt), handling each URL line-by-line.
Data Extraction:
- Uses Selenium for automated browsing and scraping.
- For Amazon pages, specialized routines extract product pricing and details.
- General URLs are parsed for titles and preview images.
PDF Generation: Combines the extracted data, arranging each entry with a title, thumbnail, and timestamp, and generates a PDF using PyMuPDF.
Error Handling: Incorporates retry mechanisms for failed URL loads to improve reliability.
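
The retry idea mentioned under Error Handling might look something like the sketch below. The README does not say how retries are implemented, so the function name `load_with_retry`, the default of three attempts, and the fixed delay are all assumptions made for illustration.

```python
import time


def load_with_retry(load, url, attempts=3, delay=2.0):
    """Call load(url), retrying on any exception up to `attempts` times in total."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return load(url)
        except Exception as exc:  # deliberately broad: this is a sketch, not production code
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # fixed pause before the next try (assumed policy)
    raise RuntimeError(f"Failed to load {url} after {attempts} attempts") from last_error
```

With Selenium, `load` could be the driver's `get` method, for example `load_with_retry(driver.get, url)`.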

Required Packages

To run this script, the following Python packages are required:

PyMuPDF (fitz) for PDF creation.
selenium for web scraping and page automation.
webdriver_manager to manage the Edge WebDriver.
Pillow (PIL) for image processing.
requests for HTTP requests.
beautifulsoup4 for HTML parsing.

Additionally, ensure that:

Microsoft Edge is installed on your system.
An internet connection is available.

Installation

Clone the repository:

git clone https://github.com/venkatarangan/ProductsDigest.git 
cd ProductsDigest

Install the required Python packages:

pip install PyMuPDF selenium webdriver_manager Pillow requests beautifulsoup4

Ensure Microsoft Edge is installed and up-to-date for compatibility with Selenium.
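
To confirm that the packages installed correctly, a quick check along these lines may help. It is only a sketch: the helper `missing_modules` is hypothetical, and it tests for importable module names, which differ from some pip package names (PyMuPDF imports as `fitz`, Pillow as `PIL`, beautifulsoup4 as `bs4`).

```python
from importlib.util import find_spec

# Import names, not pip names: PyMuPDF -> fitz, Pillow -> PIL, beautifulsoup4 -> bs4.
REQUIRED_MODULES = ["fitz", "selenium", "webdriver_manager", "PIL", "requests", "bs4"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```

This only checks the Python packages; it does not verify that Microsoft Edge itself is installed.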

Usage

Create a text file named urls.txt in the project directory, listing the URLs to process, with one URL per line.
Run the script:
```
python ProductDigest.py
```
The output PDF, webpage_details.pdf, will be generated in the project directory.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgement

The basic code was generated from several prompts using GPT-4o and Claude 3.5 Sonnet in Abacus.AI, with further adjustments made to improve accuracy and customize functionality.

Disclaimer

All product information, price details, and images are the property of their respective owners, including Amazon India. This project uses such information solely for educational and personal purposes, with no commercial intent.
