This project is a Python-based web scraper built to extract structured data from the German logistics job platform Truckerbörse.
It automatically navigates through multiple job listing pages, extracts detailed company and job information, and stores the results into a well-structured CSV file.
The scraper handles pagination, dynamic user-agent rotation, random delays, and error handling to simulate natural browsing behavior and reduce the risk of being blocked.
- ✅ Scrapes job listings and associated company details.
- ✅ Extracts:
- Job Title
- Company Name
- Address
- Website
- Person of Contact
- Email
- Telephone
- Fax
- Job Link
- ✅ Handles inconsistent data fields and HTML layouts.
- ✅ Avoids 508 Loop Detected and resource limit errors via retry logic and random delays.
- ✅ Logs runtime duration automatically.
- ✅ Uses randomized headers and `fake_useragent` for stealth scraping.
- Language: Python 3.13
- Libraries Used:
  - `BeautifulSoup4` – for HTML parsing
  - `urllib.request` – for HTTP requests
  - `fake_useragent` – for random user-agent rotation
  - `csv` – for structured data output
  - `re` – for regex-based data extraction
  - `time` & `random` – for dynamic delays and runtime tracking
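As a minimal sketch, a request helper combining `urllib.request` and `fake_useragent` might look like this (the `fetch` and `random_agent` names are illustrative, not the project's actual API, and the fallback user-agent string is an assumption):

```python
import urllib.request

try:  # fake_useragent is optional here; fall back to a fixed UA if unavailable
    from fake_useragent import UserAgent
    _ua = UserAgent()

    def random_agent() -> str:
        return _ua.random
except Exception:
    def random_agent() -> str:
        # Hypothetical fallback UA string, used only when fake_useragent fails
        return "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

def fetch(url: str, timeout: int = 30) -> str:
    """Download a page with a randomized User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": random_agent()})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```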
- The scraper starts from the main URL: https://www.truckerboerse.net/
- It extracts all job listings from the current page.
- For each listing, it follows the detail link to extract additional company information.
- It writes the scraped data into a CSV file named: trucker_data.csv
- It automatically finds and follows the “Next Page” link to continue scraping until no more pages are available.
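The pagination flow above can be sketched as a visited-set loop. The `get_page` callback stands in for the fetch-and-parse step and is an illustrative placeholder, not the project's actual code:

```python
def crawl(start_url, get_page):
    """Follow 'Next Page' links until none remain or a loop is detected.

    get_page(url) -> (listings, next_url_or_None) is a placeholder for
    the real fetch/parse step (urllib + BeautifulSoup in this project).
    """
    visited, results = set(), []
    url = start_url
    while url and url not in visited:
        visited.add(url)           # remember the page to avoid revisiting it
        listings, url = get_page(url)
        results.extend(listings)   # collect job rows from this page
    return results
```

Because every URL is recorded in `visited` before it is processed, a "Next Page" link that points back to an earlier page ends the loop instead of cycling forever.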
- Randomized User-Agent strings per request.
- Random delays between 2–5 seconds between detail page requests.
- “Coffee breaks” (30-second pauses) after every 10 pages.
- Tracking of visited URLs to prevent infinite pagination loops.
- Graceful handling of HTTP 508 (loop detected) and resource limit errors.
- Retry logic with exponential backoff for unstable connections.
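The retry-with-backoff behaviour can be sketched as follows. The retried status codes, delay constants, and function names are assumptions for illustration, not the project's exact values:

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 2.0, jitter: float = 1.0) -> float:
    """Exponential backoff with random jitter: base * 2**attempt + U(0, jitter)."""
    return base * 2 ** attempt + random.uniform(0, jitter)

def fetch_with_retry(url: str, max_attempts: int = 4) -> bytes:
    """Retry transient errors (e.g. HTTP 508 Loop Detected) before giving up."""
    for attempt in range(max_attempts):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only retry codes that signal throttling/resource limits
            if err.code not in (429, 503, 508) or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```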
- Total Runtime: 275 minutes and 56 seconds
- Equivalent: ~4 hours, 36 minutes
- Output File: `trucker_data.csv`
- Format: UTF-8 CSV with 9 columns
| Job Title | Company Name | Address | Website | Person of Contact | Email | Telephone | Fax | Job Link |
|---|---|---|---|---|---|---|---|---|
| Kraftfahrer CE | Logistik GmbH | Musterstraße 45, 12345 Berlin, Germany | www.logistik.de | Max Mustermann | info@logistik.de | +49 30 123456 | +49 30 654321 | https://www.truckerboerse.net/job123 |
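A minimal sketch of writing rows in this 9-column format with the stdlib `csv` module (the `write_csv` helper name is illustrative; the actual script may build rows differently):

```python
import csv

COLUMNS = ["Job Title", "Company Name", "Address", "Website",
           "Person of Contact", "Email", "Telephone", "Fax", "Job Link"]

def write_csv(path: str, rows: list[dict]) -> None:
    """Write scraped records to a UTF-8 CSV; missing fields become ''."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            # Fields absent on inconsistent job pages default to empty strings
            writer.writerow({col: row.get(col, "") for col in COLUMNS})
```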
- The site occasionally throttles or returns HTTP 508 errors due to resource limits; the scraper handles these automatically.
- Some entries may lack contact information due to inconsistent HTML structures on individual job pages.
- Always respect the site's `robots.txt` and use the data responsibly.
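A `robots.txt` check can be done with the stdlib `urllib.robotparser`; this is a sketch using invented example rules, not the site's actual file (for the live site you would call `rp.set_url(...)` and `rp.read()` instead of `rp.parse(...)`):

```python
from urllib import robotparser

# Illustrative rules only; NOT Truckerbörse's actual robots.txt
EXAMPLE_RULES = """\
User-agent: *
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(EXAMPLE_RULES.splitlines())

def allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)
```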
Add this at the top of your script (before the scraping starts):

```python
import time

start_time = time.time()
```

And add this at the very end, right after `csv_file.close()`:

```python
end_time = time.time()
elapsed_time = end_time - start_time
hours = int(elapsed_time // 3600)
minutes = int((elapsed_time % 3600) // 60)
seconds = int(elapsed_time % 60)

print("\n✅ Scraping completed successfully!")
print(f"🕒 Total runtime: {hours}h {minutes}m {seconds}s")
print("💾 Data saved to: trucker_data.csv\n")
```

- Clone the repo
```shell
git clone https://github.com/Akinfiresoye-Victor/German-Structured-Web-Scraper.git
```

Also make sure you have Python installed.

- Set up a virtual environment

```shell
python -m venv env
env\Scripts\activate       # Windows
# source env/bin/activate  # macOS/Linux
```

- Install dependencies

```shell
pip install -r requirements.txt
```

- Run the scraper

```shell
python <path to the scraper script>
```