Web-Scraping-Test

Web-Scraping-Test is a collection of Python scripts written as a self-study project for an internship assessment. The goal is to extract product data for the Silikomart brand from several e-commerce platforms, normalize it into a common structure, and export it to Excel for analysis.

📂 Data Collection Workflow

The scripts follow a modular extraction pipeline designed for reliability and accuracy:

Request & Bypass: The scripts use requests or cloudscraper to send browser-like headers (including the User-Agent), which gets past basic anti-bot protections such as Cloudflare checks that would otherwise return 403 Forbidden.
HTML Parsing: BeautifulSoup navigates the DOM tree.
Data Extraction: Specific strategies (CSS Selectors, JSON-LD parsing, or Regex) apply depending on the site structure.
Normalization: The code cleans data (whitespace removal, currency formatting) and maps it to a strict schema of 18 columns.
Export: The final dataset is saved automatically to a results/ folder as an .xlsx file.
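
The normalization step could look roughly like the sketch below. The column names are hypothetical placeholders (the actual 18-column schema is not listed in this README), and `clean_text`, `clean_price`, and `normalize_record` are illustrative helpers rather than the project's own functions.

```python
import re

# Hypothetical subset of the schema; the real project maps to 18 columns.
SCHEMA = ["name", "sku", "price", "currency", "availability", "url"]


def clean_text(value):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    if value is None:
        return ""
    return " ".join(str(value).split())


def clean_price(raw):
    """Return the price as a float, or None if no number can be parsed."""
    text = re.sub(r"[^\d.,]", "", clean_text(raw))
    if not text:
        return None
    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_comma > last_dot:
        # Comma comes last: treat it as the decimal separator only when
        # it is followed by one or two digits (e.g. "12,90").
        decimals = len(text) - last_comma - 1
        if decimals in (1, 2):
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
    else:
        # Dot comes last (or no comma at all): commas are thousands separators.
        text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


def normalize_record(raw):
    """Map a raw scraped dict onto the fixed schema, dropping unknown keys."""
    record = {column: "" for column in SCHEMA}
    for column in SCHEMA:
        if column in raw:
            record[column] = clean_text(raw[column])
    record["price"] = clean_price(raw.get("price"))
    return record
```

Keeping the schema in one list means every site-specific script can emit whatever keys it finds, while the export always receives the same columns in the same order.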

🌐 Target Websites & Strategies

1. Southern Hospitality

Method: cloudscraper + BeautifulSoup.
Structure: Standard e-commerce grid.
Key Feature: Handles dynamic folder creation for results and cleans inconsistent whitespace in price fields.
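
A minimal sketch of those two behaviours, assuming the output folder is named `results` as described under Getting Started. The helper names `ensure_results_dir` and `tidy_price` are made up for this example.

```python
from pathlib import Path


def ensure_results_dir(base_dir="."):
    """Create base_dir/results if it does not exist yet and return its path."""
    results = Path(base_dir) / "results"
    # exist_ok=True makes repeated runs safe; parents=True creates base_dir too.
    results.mkdir(parents=True, exist_ok=True)
    return results


def tidy_price(raw):
    """Remove line breaks and padding that surround a scraped price string."""
    return " ".join(raw.split())
```

Calling `ensure_results_dir()` at the start of a run means the script never fails on a fresh clone where the folder is missing.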

2. Bakedeco

Method: requests + BeautifulSoup.
Structure: Hybrid (Tables <td> and Divs <div>).
Key Feature: The script includes a dual-strategy selector. It first checks for a table-based layout; if that finds nothing, it falls back to grid-based div extraction, so products are still captured when the page layout varies.
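
The fallback idea can be illustrated as follows. This is a sketch only: the CSS selectors `td a[href]` and `div.product a[href]` are assumptions chosen for the example and are not taken from the actual Bakedeco markup, and `extract_product_links` is a hypothetical name.

```python
from bs4 import BeautifulSoup


def extract_product_links(html):
    """Return product links, trying a table layout first and a div grid second."""
    soup = BeautifulSoup(html, "html.parser")

    # Strategy 1: products listed inside table cells (assumed selector).
    links = [a["href"] for a in soup.select("td a[href]")]

    if not links:
        # Strategy 2: products listed in a div-based grid (assumed selector).
        links = [a["href"] for a in soup.select("div.product a[href]")]

    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(links))
```

Running the second strategy only when the first returns nothing keeps the common case fast and avoids counting the same product twice on pages that contain both structures.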

3. Silikomart (Official Site)

Method: requests + Regex (Regular Expressions).
Structure: Complex Magento 2 with hidden data.
Key Feature: Standard HTML parsing fails because data is embedded in JavaScript variables (dataLayer, dlObjects). The script uses Regex pattern matching to hunt for raw strings like "sku":"..." and "availability":"..." directly in the source code, bypassing the need for complex JS rendering.
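
As an illustration of that pattern matching, the sketch below pulls the two fields out of raw page text. The exact patterns used by the script are not shown in this README, so the expressions and the function name `extract_embedded_fields` are assumptions.

```python
import re

# Match "sku":"value" and "availability":"value", allowing optional spaces
# around the colon. Values containing escaped quotes are not handled.
SKU_RE = re.compile(r'"sku"\s*:\s*"([^"]*)"')
AVAILABILITY_RE = re.compile(r'"availability"\s*:\s*"([^"]*)"')


def extract_embedded_fields(html):
    """Return the first sku and availability found in the raw page source."""
    sku = SKU_RE.search(html)
    availability = AVAILABILITY_RE.search(html)
    return {
        "sku": sku.group(1) if sku else None,
        "availability": availability.group(1) if availability else None,
    }
```

Because the search runs on the raw response text, no JavaScript engine or headless browser is needed. The trade-off is that the patterns break if the site changes how it serializes these variables.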

4. Meilleur du Chef

Method: cloudscraper + JSON-LD.
Structure: Structured Data.
Key Feature: This site makes heavy use of Schema.org structured data (JSON-LD). The script parses the embedded JSON blocks to get the most accurate Price, Stock, and Breadcrumb data, falling back to HTML parsing only if the JSON is missing. It also includes pagination logic to traverse all listing pages.
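
A rough sketch of the JSON-LD step is shown below. To stay dependency-free it locates the `application/ld+json` script blocks with a regular expression instead of BeautifulSoup, which is a substitution made for this example. The function names and the returned keys are hypothetical.

```python
import json
import re

LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def iter_json_ld(html):
    """Yield every JSON-LD object in the page, skipping blocks that fail to parse."""
    for block in LD_JSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                yield item


def product_from_json_ld(html):
    """Return name, price and availability from the first Product block, or None."""
    for item in iter_json_ld(html):
        if item.get("@type") == "Product":
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if not isinstance(offers, dict):
                offers = {}
            return {
                "name": item.get("name"),
                "price": offers.get("price"),
                "availability": offers.get("availability"),
            }
    # No usable Product block: the caller can fall back to HTML parsing.
    return None
```

Returning `None` instead of raising makes the HTML fallback a simple `if` in the calling code.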

⚠️ Issues Encountered & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| 403 Forbidden | Southern Hospitality and Meilleur du Chef block standard `requests`. | Switch to the cloudscraper library to negotiate TLS handshakes and mimic browser headers. |
| Hidden Stock | Silikomart stores stock status inside JavaScript `script` tags rather than visible HTML. | Use Regex to extract the specific JSON string from the raw HTML text. |
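
One way to express the 403 workaround is to try a plain `requests` session first and move on to cloudscraper only when the response is not 200. This is an illustrative sketch, not the scripts' actual control flow: `fetch_html` and `default_sessions` are hypothetical names, and the README only states that the affected scripts use cloudscraper.

```python
def fetch_html(url, sessions, timeout=30):
    """Try each session in order and return the body of the first 200 response."""
    last_status = None
    for session in sessions:
        response = session.get(url, timeout=timeout)
        if response.status_code == 200:
            return response.text
        last_status = response.status_code
    raise RuntimeError(f"All sessions failed for {url} (last status: {last_status})")


def default_sessions():
    """Build the session list: plain requests first, cloudscraper as the fallback."""
    # Imported lazily so fetch_html can be used and tested without these packages.
    import cloudscraper
    import requests

    return [requests.Session(), cloudscraper.create_scraper()]
```

A typical call would be `fetch_html(url, default_sessions())`. Since `cloudscraper.create_scraper()` returns an object with the same `get` interface as a `requests` session, the rest of the scraper does not need to know which one succeeded.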

🚀 Getting Started

Prerequisites

Python 3.x.x
Required libraries:

pip install -r requirements.txt

Installation & Usage

Clone the repository or download the scripts.
Run the specific script for the desired website (e.g., meilleurduchef.py).
Check the console for progress logs.
Find the output file in the newly created results/ folder.

# Example Run
> py meilleurduchef.py

# Expected Output
Starting link collection...
Scanning page: https://www.meilleurduchef.com/en/shop/brands/silikomart.html
   -> Found 410 new products.
Total unique products found: 410

Starting product extraction...
[1/410] Scraping: https://www.meilleurduchef.com/en/shop/baking-supplies/cake-mould/yule-log-moulds/sil-silicone-mould-signature-yule-log.html
[2/410] Scraping: https://www.meilleurduchef.com/en/shop/baking-supplies/cake-mould/savarin-mould/mfe-silicone-mould-18-savarin.html
[3/410] Scraping: https://www.meilleurduchef.com/en/shop/baking-supplies/cake-mould/shaped-moulds/sil-silicone-mould-8-flowers-kiku.html
[4/410] Scraping: https://www.meilleurduchef.com/en/shop/baking-supplies/cake-mould/loaf-tins/sil-paris-travel-cake-mould-23-x-5-x-ht-5-cm.html
...
[410/410] Scraping: https://www.meilleurduchef.com/en/shop/baking-supplies/cake-mould/cake-decorating-moulds/sil-silicone-mould-6-pleated-round.html

Success! Saved to: results\MeilleurDuChef_Silikomart_Test_5.xlsx

