🕵️‍♂️ High School Teacher Web Scraper

This Python-based web scraper collects publicly available teacher data from the websites of high schools near Drexel University.
It finds school websites through Google (via SerpAPI), scans likely staff directory pages, and extracts:

  • 👀 Teacher names
  • 📧 Emails
  • 📚 Subjects (if listed)
  • 🌐 School source URLs
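
As an illustration of the email part of that extraction, here is a minimal sketch using only the standard library. The pattern and the function name `extract_emails` are assumptions made for this example, not the project's actual code, and the regex is deliberately simple rather than a full address validator.

```python
import re

# Simplified pattern (an assumption for illustration): it catches common
# addresses such as jsmith@school.org but is not a complete validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails
```

Lowercasing before deduplication keeps `JSmith@school.org` and `jsmith@school.org` from appearing as two separate contacts.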

Everything is exported to an Excel file, ready for analysis, outreach, or research.


📦 Features

  • 🔍 Searches Google for high schools near a target location (via SerpAPI)
  • 🌐 Follows multiple link types (staff, about, directory, contact, etc.)
  • 🧠 Smart content detection:
    • Recognizes pages with actual teacher info
    • Handles table-based layouts & plain text
  • 📈 Scalable architecture:
    • External keywords.txt for custom logic
    • Modular functions
    • Retry-safe requests
  • 📊 Export to Excel with timestamped filenames
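
To make "retry-safe requests" concrete, the sketch below retries a failed fetch with exponential backoff. It uses the standard library's `urllib` rather than whatever HTTP client the project actually uses, and the name `fetch_with_retries`, its defaults, and the injectable `opener` and `sleep` parameters are all assumptions for illustration.

```python
import time
import urllib.request


def fetch_with_retries(url, attempts=3, backoff=1.0, timeout=10,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL as text, retrying on network errors.

    Waits backoff * 2**attempt seconds between tries and returns None
    if every attempt fails. `opener` and `sleep` are injectable so the
    retry logic can be exercised without real network calls.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except OSError:
            # urllib.error.URLError, HTTPError, and socket timeouts are
            # all OSError subclasses, so this covers typical failures.
            if attempt < attempts - 1:
                sleep(backoff * 2 ** attempt)
    return None
```

Returning `None` instead of raising lets a caller skip an unreachable school site and move on to the next one.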

🚀 How It Works

  1. Loads keywords.txt to identify potential staff/directory/contact links.
  2. Uses SerpAPI to find nearby school websites.
  3. Follows each link and scans for useful teacher data (tables, emails, titles).
  4. Stops at the first page with valid teacher info, or after all candidate links are exhausted.
  5. Exports final results to an Excel spreadsheet.
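
Steps 1, 3, and 4 above could be sketched roughly as follows, using only the standard library. All names here (`LinkCollector`, `load_keywords`, `find_candidate_links`, `scan_until_found`) are hypothetical, and substring matching against the link's href and anchor text is an assumed strategy, not necessarily how the project matches keywords.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def load_keywords(lines):
    """Normalize keyword lines: strip whitespace, lowercase, skip blanks."""
    return [line.strip().lower() for line in lines if line.strip()]


def find_candidate_links(html, base_url, keywords):
    """Return absolute URLs whose href or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    candidates = []
    for href, text in parser.links:
        haystack = f"{href} {text}".lower()
        if any(kw in haystack for kw in keywords):
            url = urljoin(base_url, href)
            if url not in candidates:
                candidates.append(url)
    return candidates


def scan_until_found(urls, fetch, extract):
    """Visit URLs in order and stop at the first page that yields data.

    Returns (url, data) on success, or (None, []) once every URL has
    been tried without finding anything.
    """
    for url in urls:
        html = fetch(url)
        if not html:
            continue
        data = extract(html)
        if data:
            return url, data
    return None, []
```

In use, the keywords might be loaded with `with open("keywords.txt") as f: keywords = load_keywords(f)`, and `fetch` and `extract` would be whatever download and parsing functions the scraper provides.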

🧪 Example Output

| Name | Email | Position | School Website |
|------|-------|----------|----------------|
| John Smith | jsmith@school.org | Math Teacher | https://examplehigh.org |
| Amanda Lee | alee@school.org | Principal | https://anotherhigh.org |
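
One way to model rows like these and to build a timestamped output filename is sketched below. The `TeacherRecord` class, the `teachers` prefix, and the `YYYYMMDD_HHMMSS` timestamp format are assumptions for illustration; the project's real naming scheme may differ. The sketch stops at preparing rows and leaves the actual Excel writing to whichever spreadsheet library the scraper uses.

```python
from dataclasses import astuple, dataclass
from datetime import datetime

# Column order taken from the example output above.
COLUMNS = ("Name", "Email", "Position", "School Website")


@dataclass(frozen=True)
class TeacherRecord:
    name: str
    email: str
    position: str
    school_website: str


def build_output_filename(prefix="teachers", now=None):
    """Return a name like teachers_20240131_093000.xlsx (hypothetical format)."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.xlsx"


def to_rows(records):
    """Header row followed by one row per record, ready for a spreadsheet writer."""
    return [list(COLUMNS)] + [list(astuple(record)) for record in records]
```

Putting the timestamp in the filename means repeated runs never overwrite an earlier export.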

⚙️ Setup Instructions

1. Clone the Repo

git clone https://github.com/yourusername/highschool-teacher-scraper.git
cd highschool-teacher-scraper