🧠 README — Truckerbörse Web Scraper

📋 Project Overview

This project is a Python-based web scraper built to extract structured data from the German logistics job platform Truckerbörse.
It automatically navigates through multiple job listing pages, extracts detailed company and job information, and stores the results into a well-structured CSV file.

The scraper handles pagination, dynamic user-agent rotation, random delays, and error handling to simulate natural browsing behavior and reduce the risk of being blocked.


🚀 Features

  • ✅ Scrapes job listings and associated company details.
  • ✅ Extracts:
    • Job Title
    • Company Name
    • Address
    • Website
    • Person of Contact
    • Email
    • Telephone
    • Fax
    • Job Link
  • ✅ Handles inconsistent data fields and HTML layouts.
  • ✅ Avoids 508 Loop Detected and resource limit errors via retry logic and random delays.
  • ✅ Logs runtime duration automatically.
  • ✅ Uses randomized headers and fake_useragent for stealth scraping.
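The randomized-header feature can be sketched as follows. This is a minimal stand-in: the project itself draws User-Agent strings from `fake_useragent`, whereas the small static pool below is only illustrative.

```python
import random
import urllib.request

# Illustrative stand-in pool; the project draws these from fake_useragent instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def make_request(url: str) -> urllib.request.Request:
    """Build a Request with a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)

req = make_request("https://www.truckerboerse.net/")
print(req.get_header("User-agent"))
```

Each call picks a fresh header, so successive requests do not all present the same browser fingerprint.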

⚙️ Tech Stack

  • Language: Python 3.13
  • Libraries Used:
    • BeautifulSoup4 – for HTML parsing
    • urllib.request – for HTTP requests
    • fake_useragent – for random user-agent rotation
    • csv – for structured data output
    • re – for regex-based data extraction
    • time & random – for dynamic delay and runtime tracking

🧩 How It Works

  1. The scraper starts from the main URL: https://www.truckerboerse.net/
  2. It extracts all job listings from the current page.
  3. For each listing, it follows the detail link to extract additional company information.
  4. It writes the scraped data into a CSV file named: trucker_data.csv
  5. It automatically finds and follows the “Next Page” link to continue scraping until no more pages are available.

🛡️ Anti-Ban & Reliability Features

  • Randomized User-Agent strings per request.
  • Random delays between 2–5 seconds between detail page requests.
  • “Coffee breaks” (30-second pauses) after every 10 pages.
  • Tracking of visited URLs to prevent infinite pagination loops.
  • Graceful handling of HTTP 508 (loop detected) and resource limit errors.
  • Retry logic with exponential backoff for unstable connections.
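The retry-with-exponential-backoff behavior can be sketched like this. The exact delays, retry count, and set of retryable status codes are assumptions, not the project's actual values.

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ~8s ..."""
    return base * (2 ** attempt) + random.uniform(0, 1)

def fetch_with_retries(url: str, max_retries: int = 4) -> bytes:
    """Fetch a URL, retrying transient errors (508/503/429) with backoff."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code not in (508, 503, 429):
                raise  # non-transient HTTP error: give up immediately
            time.sleep(backoff_delay(attempt))
        except urllib.error.URLError:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```

The jitter term keeps repeated failures from retrying on a fixed, detectable rhythm.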

📊 Runtime Summary

  • Total Runtime: 275 minutes and 56 seconds
  • Equivalent: ~4 hours, 36 minutes
  • Output File: trucker_data.csv
  • Format: UTF-8 CSV with 9 columns

📁 Example CSV Output

| Job Title | Company Name | Address | Website | Person of Contact | Email | Telephone | Fax | Job Link |
|---|---|---|---|---|---|---|---|---|
| Kraftfahrer CE | Logistik GmbH | Musterstraße 45, 12345 Berlin, Germany | www.logistik.de | Max Mustermann | info@logistik.de | +49 30 123456 | +49 30 654321 | https://www.truckerboerse.net/job123 |
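Writing rows in this 9-column layout maps naturally onto `csv.DictWriter`. A minimal sketch, using the example row above:

```python
import csv

FIELDS = ["Job Title", "Company Name", "Address", "Website",
          "Person of Contact", "Email", "Telephone", "Fax", "Job Link"]

# newline="" and encoding="utf-8" keep the output a clean UTF-8 CSV on all platforms.
with open("trucker_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({
        "Job Title": "Kraftfahrer CE",
        "Company Name": "Logistik GmbH",
        "Address": "Musterstraße 45, 12345 Berlin, Germany",
        "Website": "www.logistik.de",
        "Person of Contact": "Max Mustermann",
        "Email": "info@logistik.de",
        "Telephone": "+49 30 123456",
        "Fax": "+49 30 654321",
        "Job Link": "https://www.truckerboerse.net/job123",
    })
```

Using a dict per row means a detail page with a missing field (e.g. no fax number) can simply omit the key, and `DictWriter` leaves that cell empty rather than shifting columns.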

⚠️ Notes

  • The site occasionally throttles or returns HTTP 508 errors due to resource limits — handled automatically by the scraper.
  • Some entries may lack contact information due to inconsistent HTML structures on individual job pages.
  • Always respect the site’s robots.txt and use the data responsibly.

🕒 Runtime Tracker Snippet

Add this at the top of your script (before the scraping starts):

```python
import time
start_time = time.time()
```

And add this at the very end, right after `csv_file.close()`:

```python
end_time = time.time()
elapsed_time = end_time - start_time

hours = int(elapsed_time // 3600)
minutes = int((elapsed_time % 3600) // 60)
seconds = int(elapsed_time % 60)

print("\n✅ Scraping completed successfully!")
print(f"🕒 Total runtime: {hours}h {minutes}m {seconds}s")
print("💾 Data saved to: trucker_data.csv\n")
```

How to run

  1. Make sure Python is installed, then clone the repo:

```shell
git clone https://github.com/Akinfiresoye-Victor/German-Structured-Web-Scraper.git
cd German-Structured-Web-Scraper
```

  2. Set up a virtual environment and install the dependencies:

```shell
python -m venv env
env\Scripts\activate   # Windows (use `source env/bin/activate` on macOS/Linux)
pip install -r requirements.txt
```

  3. Run the scraper:

```shell
python <path to the scraper script>
```

About

A web scraping project built to collect job listing data from a complex German website, covering 258 pages and over 5,000 records.
