This project explores the patterns behind these disruptions using the Airline On-Time Performance dataset from Kaggle to uncover the when, where, and why of flight delays and cancellations.
The analysis was conducted in a virtual environment using Oracle VirtualBox. The workflow involves:
- HDFS: Used to upload and manage large datasets across the cluster
- Jupyter Notebook: Used as the main Python IDE for scripting, analysis, and visualization. The .ipynb file includes instructions on how to connect to Hive.
The analysis was conducted to answer the following questions:
- What times of day (morning/afternoon/evening) have the lowest average delays?
- Which days of the week show better on-time performance?
- During which months or seasons are flights most likely to be on time?
- Identify and rank the top 3–5 causes of flight delays based on dataset categories.
- Quantify the impact of each factor in minutes and as a percentage of total delays.
- Identify the main reasons for flight cancellations (Carrier, Weather, NAS, Security).
- Analyze whether cancellations correlate with specific airlines, airports, or time periods.
- Identify routes (origin-destination pairs), carriers, or flight numbers that consistently underperform.
- Analyze the reasons these flights are frequently delayed or cancelled.
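
As a rough illustration of how two of these questions (average delay by time of day, and the share of delay minutes per cause) could be computed, here is a minimal sketch in plain Python. It is not the notebook's actual code. The column names (`CRSDepTime`, `DepDelay`, `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`) and the hour boundaries for each bucket are assumptions and may need adjusting to the real dataset.

```python
from collections import defaultdict

# Assumed names of the per-cause delay columns (values in minutes).
CAUSE_COLUMNS = [
    "CarrierDelay",
    "WeatherDelay",
    "NASDelay",
    "SecurityDelay",
    "LateAircraftDelay",
]


def time_of_day(crs_dep_time):
    """Map a scheduled departure time in hhmm form (e.g. 1435) to a bucket.

    The boundaries are illustrative: 05:00-11:59 is morning,
    12:00-16:59 is afternoon, everything else (including night) is evening.
    """
    hour = (crs_dep_time // 100) % 24
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    return "evening"


def average_delay_by_bucket(flights):
    """Return {bucket: mean departure delay in minutes} for a list of dict rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for flight in flights:
        bucket = time_of_day(flight["CRSDepTime"])
        totals[bucket] += flight["DepDelay"]
        counts[bucket] += 1
    return {bucket: totals[bucket] / counts[bucket] for bucket in totals}


def delay_cause_shares(flights):
    """Return [(cause, total_minutes, percent_of_total)] ranked by minutes."""
    minutes = {cause: 0.0 for cause in CAUSE_COLUMNS}
    for flight in flights:
        for cause in CAUSE_COLUMNS:
            # Missing or null cause values are treated as zero minutes.
            minutes[cause] += flight.get(cause) or 0.0
    total = sum(minutes.values())
    ranked = sorted(minutes.items(), key=lambda item: item[1], reverse=True)
    return [
        (cause, value, 100.0 * value / total if total else 0.0)
        for cause, value in ranked
    ]
```

Each row is expected to be a dict keyed by column name, for example as produced by `csv.DictReader` after converting the numeric fields. Slicing the result of `delay_cause_shares` gives the top 3–5 causes.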
To efficiently handle and analyze this large dataset:
- Data was loaded into Hive tables via HDFS
- SQL queries were used to extract relevant subsets
- Python (via Jupyter Notebook) was used for deeper analysis, data wrangling, and visualizations
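
The query step in this workflow might look roughly like the sketch below: a SQL query that aggregates inside Hive, plus a small helper that runs it over any Python DB-API 2.0 connection. The table name `flights` and the columns `DayOfWeek`, `DepDelay`, and `Cancelled` are hypothetical. The notebook's actual connection method is not shown here; a client such as PyHive is one possibility, mentioned only as an assumption.

```python
# Hypothetical table and column names; adjust to the actual Hive schema.
AVG_DELAY_BY_DAY_SQL = """
SELECT DayOfWeek, AVG(DepDelay) AS avg_dep_delay
FROM flights
WHERE Cancelled = 0
GROUP BY DayOfWeek
ORDER BY avg_dep_delay
"""


def fetch_rows(conn, sql):
    """Run `sql` on a DB-API 2.0 connection and return (column_names, rows).

    `conn` can be any DB-API connection. For Hive this could be, for example,
    a PyHive connection (an assumption, not necessarily what the notebook uses).
    """
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # cursor.description holds one entry per result column; item 0 is its name.
        columns = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return columns, rows
```

Aggregating in Hive first keeps the result set small, so only the summarized rows need to be pulled into the notebook for further wrangling and plotting.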