🧠 README — Truckerbörse Web Scraper

📋 Project Overview

This project is a Python-based web scraper built to extract structured data from the German logistics job platform Truckerbörse.
It automatically navigates through multiple job listing pages, extracts detailed company and job information, and stores the results into a well-structured CSV file.

The scraper handles pagination, dynamic user-agent rotation, random delays, and error handling to simulate natural browsing behavior and reduce the risk of being blocked.


🚀 Features

  • ✅ Scrapes job listings and associated company details.
  • ✅ Extracts:
    • Job Title
    • Company Name
    • Address
    • Website
    • Person of Contact
    • Email
    • Telephone
    • Fax
    • Job Link
  • ✅ Handles inconsistent data fields and HTML layouts.
  • ✅ Avoids 508 Loop Detected and resource limit errors via retry logic and random delays.
  • ✅ Logs runtime duration automatically.
  • ✅ Uses randomized headers and fake_useragent for stealth scraping.
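The randomized-header feature can be sketched as follows. This is a minimal stand-in: the project itself draws User-Agent strings from `fake_useragent`, whereas the small static pool below is only illustrative.

```python
import random
import urllib.request

# Illustrative stand-in pool; the project draws these from fake_useragent instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def make_request(url: str) -> urllib.request.Request:
    """Build a Request with a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)

req = make_request("https://www.truckerboerse.net/")
print(req.get_header("User-agent"))
```

Each call picks a fresh header, so successive requests do not all present the same browser fingerprint.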

⚙️ Tech Stack

  • Language: Python 3.13
  • Libraries Used:
    • BeautifulSoup4 – for HTML parsing
    • urllib.request – for HTTP requests
    • fake_useragent – for random user-agent rotation
    • csv – for structured data output
    • re – for regex-based data extraction
    • time & random – for dynamic delay and runtime tracking

🧩 How It Works

  1. The scraper starts from the main URL: https://www.truckerboerse.net/
  2. It extracts all job listings from the current page.
  3. For each listing, it follows the detail link to extract additional company information.
  4. It writes the scraped data into a CSV file named: trucker_data.csv
  5. It automatically finds and follows the “Next Page” link to continue scraping until no more pages are available.

🛡️ Anti-Ban & Reliability Features

  • Randomized User-Agent strings per request.
  • Random delays between 2–5 seconds between detail page requests.
  • “Coffee breaks” (30-second pauses) after every 10 pages.
  • Tracking of visited URLs to prevent infinite pagination loops.
  • Graceful handling of HTTP 508 (loop detected) and resource limit errors.
  • Retry logic with exponential backoff for unstable connections.
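The retry-with-exponential-backoff behavior can be sketched like this. The exact delays, retry count, and set of retryable status codes are assumptions, not the project's actual values.

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ~8s ..."""
    return base * (2 ** attempt) + random.uniform(0, 1)

def fetch_with_retries(url: str, max_retries: int = 4) -> bytes:
    """Fetch a URL, retrying transient errors (508/503/429) with backoff."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code not in (508, 503, 429):
                raise  # non-transient HTTP error: give up immediately
            time.sleep(backoff_delay(attempt))
        except urllib.error.URLError:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```

The jitter term keeps repeated failures from retrying on a fixed, detectable rhythm.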

📊 Runtime Summary

  • Total Runtime: 275 minutes and 56 seconds
  • Equivalent: ~4 hours, 36 minutes
  • Output File: trucker_data.csv
  • Format: UTF-8 CSV with 9 columns

📁 Example CSV Output

| Job Title | Company Name | Address | Website | Person of Contact | Email | Telephone | Fax | Job Link |
|---|---|---|---|---|---|---|---|---|
| Kraftfahrer CE | Logistik GmbH | Musterstraße 45, 12345 Berlin, Germany | www.logistik.de | Max Mustermann | info@logistik.de | +49 30 123456 | +49 30 654321 | https://www.truckerboerse.net/job123 |
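Writing rows in this 9-column layout maps naturally onto `csv.DictWriter`. A minimal sketch, using the example row above:

```python
import csv

FIELDS = ["Job Title", "Company Name", "Address", "Website",
          "Person of Contact", "Email", "Telephone", "Fax", "Job Link"]

# newline="" and encoding="utf-8" keep the output a clean UTF-8 CSV on all platforms.
with open("trucker_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({
        "Job Title": "Kraftfahrer CE",
        "Company Name": "Logistik GmbH",
        "Address": "Musterstraße 45, 12345 Berlin, Germany",
        "Website": "www.logistik.de",
        "Person of Contact": "Max Mustermann",
        "Email": "info@logistik.de",
        "Telephone": "+49 30 123456",
        "Fax": "+49 30 654321",
        "Job Link": "https://www.truckerboerse.net/job123",
    })
```

Using a dict per row means a detail page with a missing field (e.g. no fax number) can simply omit the key, and `DictWriter` leaves that cell empty rather than shifting columns.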

⚠️ Notes

  • The site occasionally throttles or returns HTTP 508 errors due to resource limits — handled automatically by the scraper.
  • Some entries may lack contact information due to inconsistent HTML structures on individual job pages.
  • Always respect the site’s robots.txt and use the data responsibly.

🕒 Runtime Tracker Snippet

Add this at the top of your script (before the scraping starts):

```python
import time
start_time = time.time()
```

And add this at the very end, right after `csv_file.close()`:

```python
end_time = time.time()
elapsed_time = end_time - start_time

hours = int(elapsed_time // 3600)
minutes = int((elapsed_time % 3600) // 60)
seconds = int(elapsed_time % 60)

print("\n✅ Scraping completed successfully!")
print(f"🕒 Total runtime: {hours}h {minutes}m {seconds}s")
print("💾 Data saved to: trucker_data.csv\n")
```

How to run

  1. Make sure Python is installed, then clone the repo:

```shell
git clone https://github.com/Akinfiresoye-Victor/German-Structured-Web-Scraper.git
cd German-Structured-Web-Scraper
```

  2. Set up a virtual environment and install the dependencies:

```shell
python -m venv env
env\Scripts\activate   # Windows (use `source env/bin/activate` on macOS/Linux)
pip install -r requirements.txt
```

  3. Run the scraper:

```shell
python <path to the scraper script>
```

About

A web scraping project built to collect job listing data from a complex German website, covering 258 pages and over 5,000 records.
