This project collects, processes and analyzes data on the Dutch real estate market, based on property listings available on the Funda website. The ETL pipeline uses Apache Airflow to orchestrate the Python scripts that run its main tasks, and Power BI for the data analysis. However, it's also possible to run the scripts directly if you don't want to use Apache Airflow: just skip steps 3, 4 and 5 (the Airflow steps) below.
- Project Workflow
  - Automated Extraction and Cleaning Pipeline using Apache Airflow (DAG) running 3 tasks:
    - Extract: `webscraping_task` runs `webscraping_script.py`, which collects data via web scraping using the Funda Scraper module.
    - Transform: `processing_task` runs `data_processing_script.py`, which does the cleaning/pre-processing, executed in bite-size chunks, using:
      - NumPy/Pandas for cleaning
      - Beautiful Soup for dataset enrichment, scraping some additional data directly from Funda
      - Geopy also for enrichment, obtaining the addresses' geolocation (see the geocoding sketch after this list)
    - Concatenate: `concat_task` runs `chunks_concat.py`, which concatenates the chunks and province files into a single dataset file.
    - Task dependencies: `webscraping_task >> processing_task >> concat_task`
  - Power BI part:
    - Load (+ additional transformation): done in Power BI for Exploratory Data Analysis and for generating visualizations.
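To give an idea of the Geopy enrichment mentioned above, here is a minimal geocoding sketch. It assumes each listing carries a full address string; the column name, function name and geocoder choice are illustrative, not taken from the actual `data_processing_script.py`.

```python
# Minimal geocoding sketch (illustrative, not the actual processing script code).
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="funda-analysis")             # free OpenStreetMap geocoder
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # throttle requests to respect the usage policy

def add_geolocation(df: pd.DataFrame, address_col: str = "address") -> pd.DataFrame:
    """Append latitude/longitude columns for each listing's address (assumed column name)."""
    locations = df[address_col].apply(geocode)
    df["latitude"] = locations.apply(lambda loc: loc.latitude if loc else None)
    df["longitude"] = locations.apply(lambda loc: loc.longitude if loc else None)
    return df
```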
In this repository, you'll find the Python scripts along with the Power BI project. For more details, see the full documentation.
- `data/`: Used to store raw, processed and external data.
- `src/`: Python scripts used for web scraping, data cleaning/pre-processing and Airflow settings.
- `docs/`: GitHub page containing the complete project documentation, results, reports and figures.
- `README.md`: This file.
- `requirements.txt`: Project dependencies list.
To run the Python scripts, you'll need Python 3, the dependencies listed in `requirements.txt`, and Apache Airflow.
For loading the Power BI project (`.pbix` file), you'll need the Power BI Desktop application. The process, from start to end, can be executed as follows:
- Clone this repository:

```bash
# Go to your project directory
cd project_local_path/projects

# Clone this repository
git clone https://github.com/nathmota/funda-analysis.git

# Go into the repository
cd funda-analysis
```
- Install the dependencies using pip (directly or from within your preferred virtual environment):

```bash
pip install -r requirements.txt
```
- Install Apache Airflow by running the instructions below via bash script or command line:

```bash
export AIRFLOW_HOME=~/airflow

AIRFLOW_VERSION=2.9.2
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example, this would install 2.9.2 with Python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.9.2/constraints-3.8.txt

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

# Init the airflow DB
airflow db init
```
You can also check the instructions on the official page.
In your project folder, there is a folder called `dags`. Inside it, the DAG file defines the workflow: the tasks and the dependencies between them.
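For orientation, here is a minimal sketch of what a DAG of this shape looks like. The task and script names follow the workflow described above, but the actual file in `dags/` may differ in its details (schedule, paths, operator choice), and the script paths below are placeholders.

```python
# Minimal Airflow DAG sketch mirroring the three-task workflow (illustrative only).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="funda_tasks_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # triggered manually from the UI
    catchup=False,
) as dag:
    webscraping_task = BashOperator(
        task_id="webscraping_task",
        bash_command="python /path/to/funda-analysis/src/webscraping_script.py",
    )
    processing_task = BashOperator(
        task_id="processing_task",
        bash_command="python /path/to/funda-analysis/src/data_processing_script.py",
    )
    concat_task = BashOperator(
        task_id="concat_task",
        bash_command="python /path/to/funda-analysis/src/chunks_concat.py",
    )

    # Dependencies: extract -> transform -> concatenate
    webscraping_task >> processing_task >> concat_task
```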
- Start the Airflow web server and scheduler services:

```bash
airflow webserver -D
airflow scheduler -D
```
The application runs on port 8080. To access Airflow's interface, open your web browser and navigate to http://127.0.0.1:8080/dags.
- In the Airflow DAGs page, click on 'funda_tasks_dag'. In the upper right-hand corner there is a play button; click on it and then on 'Trigger DAG'.

Once the three tasks show a dark green circle, they have finished successfully. Red circles indicate failure; in that case, click on the failed task and then on the 'Logs' tab to check the error messages.
- (optional) Edit the parameters used for web scraping as needed (in the `src/webscraping_script.py` file). The default parameters used for this project were the following:

```python
raw_folder_path = "project_path/funda-analysis/data/raw/"
provincies = ["provincie-drenthe", "provincie-groningen", "provincie-flevoland", "provincie-friesland",
              "provincie-zeeland", "provincie-limburg", "provincie-utrecht", "provincie-overijssel",
              "provincie-gelderland", "provincie-noord-brabant", "provincie-noord-holland", "provincie-zuid-holland"]
want_to = "buy"
find_past = False
n_pages = 50  # records every 50 pages (750 entries) to avoid loss in case of execution failure
final_page = 2000
entries_per_page = 15
raw_data = True
```
As those parameters are fed directly into the Funda Scraper module, you can consult its documentation and/or the Funda website to better understand how to adapt them to your use case.
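For reference, this is roughly how such parameters map onto a Funda Scraper call. It is a sketch based on the `funda_scraper` package's documented usage (`FundaScraper` with `area`, `want_to`, `find_past`, `page_start`, `n_pages`, and `run()`); the output file name is a placeholder, and `webscraping_script.py` may invoke the module differently.

```python
# Sketch of a single-province Funda Scraper run (not the actual webscraping_script.py code).
from funda_scraper import FundaScraper

scraper = FundaScraper(
    area="provincie-drenthe",  # one of the provinces listed above
    want_to="buy",
    find_past=False,
    page_start=1,
    n_pages=50,
)
df = scraper.run(raw_data=True, save=True, filepath="project_path/funda-analysis/data/raw/drenthe.csv")
print(df.shape)
```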
- (optional) Edit the data processing script (`src/data_processing_script.py`) to set different parameters for the processing step, including the file paths (where files are read from and written to).
- (optional) Edit the chunks concatenator script (`src/chunks_concat.py`), setting different parameters, including the file paths (again, where files are read from and written to).
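To illustrate what the concatenation step amounts to, here is a minimal sketch that merges chunk files into a single CSV with pandas. The folder layout and file-name pattern are assumptions for illustration, not necessarily those used by `chunks_concat.py`; only the final `data/preprocessed/processed_data.csv` path comes from this README.

```python
# Minimal chunk-concatenation sketch (illustrative; file pattern and chunk folder are assumed).
from pathlib import Path
import pandas as pd

chunks_dir = Path("data/preprocessed")           # folder holding the per-chunk/per-province CSVs (assumed)
output_path = chunks_dir / "processed_data.csv"  # single-file dataset later loaded into Power BI

chunk_files = sorted(chunks_dir.glob("*_chunk_*.csv"))  # assumed naming pattern
frames = [pd.read_csv(path) for path in chunk_files]

combined = pd.concat(frames, ignore_index=True)
combined.to_csv(output_path, index=False)
print(f"Wrote {len(combined)} rows from {len(chunk_files)} chunk files to {output_path}")
```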
- Run the web scraping script to collect data:

```bash
python src/webscraping_script.py
```
- Run the data processing script:

```bash
python src/data_processing_script.py
```
- Run the chunks concatenator script:

```bash
python src/chunks_concat.py
```
- Open the Power BI project (`docs/reports/funda_report.pbix`) in the Power BI Desktop application, loading the processed data for analysis and visualization. If you haven't customized the paths, the processed data to be loaded into Power BI Desktop is stored in its default location, `data/preprocessed/processed_data.csv`.
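If you want to sanity-check the processed dataset before loading it into Power BI, a quick look with pandas is enough (a sketch; it only assumes the default output path mentioned above).

```python
# Quick sanity check of the processed dataset before loading it into Power BI.
import pandas as pd

df = pd.read_csv("data/preprocessed/processed_data.csv")
print(df.shape)    # number of listings and columns
print(df.dtypes)   # column types as inferred by pandas
print(df.head())   # first few rows
```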
For an offline snapshot of the visualizations and the results I've produced in my analysis, see the full documentation.