An Apache Airflow pipeline that scrapes football stadium data from Wikipedia, processes it with pandas, stores it in PostgreSQL, and saves query results to CSV.

Undisputed-jay/wikipedia_stadium_data_pipeline_with_apache_airflow

wikipedia_stadium_data_pipeline_with_apache_airflow

This project implements an Apache Airflow DAG that scrapes data on the world's largest football stadiums from a Wikipedia page, cleans it, stores it in a Postgres database, and runs SQL queries on the stored data for further analysis. Key features:

  • Scrapes football stadium data using BeautifulSoup and requests.
  • Cleans and transforms the data using pandas.
  • Stores the data in a Postgres database with automatic table creation.
  • Executes SQL queries for advanced analysis and saves the results to CSV.
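
The clean, store, and query steps listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the column names, the sample rows, the `stadiums` table name, and the helper functions `clean_stadiums` and `store_and_query` are all hypothetical. To keep the sketch runnable without a database server, an in-memory SQLite connection stands in for Postgres, and pandas `to_sql` is assumed as the mechanism for automatic table creation.

```python
import sqlite3

import pandas as pd

# Hypothetical rows shaped like scraped table cells. The names and values
# are placeholders, not data from the project.
RAW_ROWS = [
    {"stadium": "Example Arena [1]", "capacity": "100,000", "country": " Exampleland "},
    {"stadium": "Sample Park", "capacity": "85,500[note 2]", "country": "Sampleland"},
    {"stadium": "Sample Park", "capacity": "85,500[note 2]", "country": "Sampleland"},
]


def clean_stadiums(rows):
    """Strip footnote markers and whitespace, parse capacity, drop duplicates."""
    df = pd.DataFrame(rows)
    for col in ["stadium", "capacity", "country"]:
        # Remove bracketed footnotes such as "[1]" or "[note 2]".
        df[col] = df[col].str.replace(r"\[.*?\]", "", regex=True).str.strip()
    # Remove thousands separators before converting to integers.
    df["capacity"] = df["capacity"].str.replace(",", "", regex=False).astype(int)
    return df.drop_duplicates().reset_index(drop=True)


def store_and_query(df, conn):
    """Write the frame to a table and read it back ordered by capacity."""
    # to_sql creates the table if it does not exist; if_exists="replace"
    # recreates it on reruns. SQLite is a stand-in for Postgres here.
    df.to_sql("stadiums", conn, if_exists="replace", index=False)
    query = "SELECT stadium, capacity FROM stadiums ORDER BY capacity DESC"
    return pd.read_sql_query(query, conn)


cleaned = clean_stadiums(RAW_ROWS)

conn = sqlite3.connect(":memory:")
try:
    result = store_and_query(cleaned, conn)
finally:
    conn.close()

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = result.to_csv(index=False)
print(csv_text)
```

In an Airflow DAG, each of these functions would typically run as its own task, with the database connection supplied by an Airflow connection rather than created inline.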
