Skip to content

Latest commit

 

History

History
113 lines (89 loc) · 6.5 KB

p8105_final_project.md

File metadata and controls

113 lines (89 loc) · 6.5 KB

p8105_final_project

2023-11-06

Group Members:

  • Fangyi Chen, fc2718
  • Jiying Wang, jw4489
  • Yumeng Qi, yq2378
  • Si Chen, sc5126
  • Yuxin Yin, yy3439

Tentative Project Title

Weather-Induced Flight Delays: An Analysis of Patterns and Predictive Modeling

Motivation for this project

The airline industry is highly dependent on timely flight operations, and flight delays are a common concern for both passengers and airlines. Weather conditions often play a big role in causing delays. In this project, we aim to investigate the relationship between weather and flight delays using historical flight data from 2013. Specifically, we would assess which pattern might cause the longest delay time. By understanding the data patterns, we would build a prediction model to predict flight delays based on variables including but not limited to carriers, airports, weather, etc. We would also assess if the insights gained from 2013 data can be applied to 2023, which is ten years later.

Intended final products

  • A website comprising the following pages:
  1. Homepage: Provides a brief introduction to the project.
  2. Exploratory Data Analysis Page: Offers in-depth exploration of the project’s data.
  3. Flight Delay Prediction Model Page: Presents details about the developed flight delay prediction model.
  4. Conclusions Page: Summarizes key findings and conclusions of the project.
  5. Interactive Dashboard Page: Utilizing R-Shiny, this page hosts an interactive dashboard for dynamic data exploration.
  • A comprehensive report: A well-organized document ensuring clear and concise presentation of our project, covering all aspects from introduction to conclusions.
  • A brief screencast: A concise video presentation showcasing the project’s key features and highlights.

Anticipated data sources

The study intends to use two data sets. The first one on-time data for all flights that departed NYC in 2013. The second one is hourly meteorological data for LGA, JFK and EWR.

Sources link

Planned analyses / visualizations / coding challenges

Planned analyses

We plan to perform preliminary descriptive analysis of the data, which entails the summarization of some basic characteristics (size, attributes), data distributions (mean, range, standard deviation) and number of missing values. The outcomes of the preliminary analysis will provide guidance for the future steps of data preprocessing and cleaning. Another analysis will be focused on the correlation between weather conditions and flight delays in minutes. We are also interested to investigate if there exists any disparity of flight delays among diverse airlines given the identical weather conditions. Additional analysis will be centered around the predictive model outcomes, where evaluation metrics include root mean square error (RMSE) and R squared. Performance variability across different temporal ranges (from earliest timepoint to the most recent data) will also be examined.

Planned visualizations

For data exploration, we plan to plot the distribution of important attributes such as dep_delay, arr_delay, temp, dewep, humid, wind_speed, pressure, visib using box plot and histogram. Then we will visualize the relationship between flight delays and weather-related variables using scatter plot and fitted lines. We may also visualize the relationship among weather variables as well as the delay disparity among airlines. Finally, we will visualize our model and predicted outcomes.

Planned coding challenges

Given the original dataset flights we will employ is a huge dataset with 336,776 entries, one of the biggest coding challenges we will encounter is how to, without bias, get the dataset down to a size that is suitable for use but still representative. In addition, since we would join two datasets that share similar variables, we might have to deal with the many-to-many relationship when joining the datasets.

Planned timeline

Milestone Description Expected Date
Project Logistics Setup 1. Establish the general objective and scope of the project. 2. Pinpoint the required data sources 11/10/2023
Project Review and Plan Outline 1. Meet with the teaching team to discuss the project proposal. 2. Anticipate difficulties and develop a contingency plan. 3. Identify major tasks and assign tasks to group members. 11/14/2023
Plan Execution The key components of the projects are anticipated to be completed: data cleaning and preprocessing, visualizations, model training and deployment, outcome evaluation, and any statistical or follow-up analysis 11/27/2023
Report Drafting and Compiling 1. Collect and finalize written parts of the project including but not limited to the following: objective, motivation, EDA, analysis, and discussion. 2. Revise and edit to ensure coherence and clarity 12/8/2023
Webpage and Screencast 1. Set up and publish the web page. 2. Video demo 12/9/2023
Peer Assessment 1. Brief assessment of team members. 2. Contribution summary 12/9/2023
In-class Discussion Learning about projects presented by other groups 12/14/2023