SamiiShabuse/WebScraper

🕵️‍♂️ High School Teacher Web Scraper

This Python-based web scraper collects publicly available teacher data from high school websites near Drexel University.
It searches using Google (via SerpAPI), scans potential staff directories, and extracts:

  • 👤 Teacher names
  • 📧 Emails
  • 📚 Subjects (if listed)
  • 🌐 School source URLs

All results are exported to a single Excel file for analysis, outreach, or research.


📦 Features

  • 🔍 Searches Google for high schools near a target location (via SerpAPI)
  • 🌐 Follows multiple link types (staff, about, directory, contact, etc.)
  • 🧠 Smart content detection:
    • Recognizes pages with actual teacher info
    • Handles table-based layouts & plain text
  • 📈 Scalable architecture:
    • External keywords.txt for custom logic
    • Modular functions
    • Retry-safe requests
  • 📊 Export to Excel with timestamped filenames
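As a rough illustration of the "retry-safe requests" and "timestamped filenames" features, the sketch below shows one way they could be written. The function names, the exponential backoff policy, and the filename pattern are assumptions made for this example and are not taken from the repository.

```python
import time
from datetime import datetime


def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    `fetch` is any callable that returns page content or raises on failure
    (hypothetical interface; the real scraper may call its HTTP client directly).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: error types vary by client
            last_error = exc
            if attempt < retries - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(backoff * (2 ** attempt))
    raise last_error


def timestamped_filename(prefix="teachers", now=None):
    """Build a name such as teachers_20240131_142530.xlsx (pattern is assumed)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.xlsx"
```

Passing the fetch function and the sleep function as arguments keeps the helper independent of any particular HTTP library and makes it easy to exercise without network access.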

🚀 How It Works

  1. Loads keywords.txt to identify potential staff/directory/contact links.
  2. Uses SerpAPI to find nearby school websites.
  3. Follows each link and scans for useful teacher data (tables, emails, titles).
  4. Stops when valid info is found or all options are exhausted.
  5. Exports final results to an Excel spreadsheet.
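Steps 1 and 3 above might look something like the following minimal sketch, which uses only the standard library. It assumes keywords.txt holds one keyword per line and that a link is a candidate when a keyword appears in its URL or anchor text. The helper names and the email pattern are illustrative and not taken from the repository.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Simple email pattern; good enough for a sketch, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def load_keywords(text):
    """Parse keywords.txt content: one keyword per line, blank lines ignored."""
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_links(html, base_url, keywords):
    """Return absolute URLs whose href or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href, text in parser.links:
        haystack = f"{href} {text}".lower()
        if any(k in haystack for k in keywords):
            url = urljoin(base_url, href)
            if url not in found:  # keep first-seen order, drop duplicates
                found.append(url)
    return found


def extract_emails(html):
    """Return the unique email addresses on a page, lowercased and sorted."""
    return sorted({m.lower() for m in EMAIL_RE.findall(html)})
```

A caller would fetch a school's home page, pass its HTML to `candidate_links`, then run `extract_emails` on each candidate page until one yields results.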

🧪 Example Output

| Name | Email | Position | School Website |
| --- | --- | --- | --- |
| John Smith | jsmith@school.org | Math Teacher | https://examplehigh.org |
| Amanda Lee | alee@school.org | Principal | https://anotherhigh.org |
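A spreadsheet with these four columns could be produced along the lines of the sketch below. The README does not say which library writes the Excel file, so the use of openpyxl here is an assumption. The helper names and the rule of dropping repeated emails are also hypothetical.

```python
COLUMNS = ["Name", "Email", "Position", "School Website"]


def to_rows(records):
    """Turn record dicts into a header row plus data rows.

    Records whose email was already seen are skipped (assumed behaviour);
    records without an email are always kept.
    """
    rows = [list(COLUMNS)]
    seen = set()
    for rec in records:
        email = rec.get("Email", "").lower()
        if email and email in seen:
            continue
        if email:
            seen.add(email)
        rows.append([rec.get(col, "") for col in COLUMNS])
    return rows


def write_excel(records, path):
    """Write the rows to an .xlsx file using openpyxl (third-party, assumed)."""
    from openpyxl import Workbook  # imported lazily so to_rows works without it

    workbook = Workbook()
    sheet = workbook.active
    for row in to_rows(records):
        sheet.append(row)
    workbook.save(path)
```

Keeping the row shaping in `to_rows` separate from the file writing means the column layout can be checked without creating a file.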

⚙️ Setup Instructions

1. Clone the Repo

git clone https://github.com/yourusername/highschool-teacher-scraper.git
cd highschool-teacher-scraper

About

Web Scraper Building for Work Study
