This project analyzes the NCEI Global Surface Summary of the Day (GSOD) dataset to extract insights from daily weather records. The data is processed with PySpark in Jupyter Notebook, using data engineering and analytics techniques that provide hands-on experience with big data. Key tasks include handling and analyzing weather data from Cincinnati and Florida for the years 2015-2024.

Anaconda and PySpark Installation
- Download and install Anaconda and Apache Spark 3.5.3.
- Set up environment variables and install the PySpark library:
    pip install pyspark
- Launch Jupyter Notebook from Anaconda.

Dataset Download
- Download weather data for Cincinnati (72429793812) and Florida (99495199999) from the NCEI database for 2015-2024.
- Organize datasets by year and location (e.g., 2015/72429793812.csv for Cincinnati 2015).
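
The year/station layout described above can be sketched in plain Python. This is only an illustration of the naming scheme: the `expected_paths` helper and the `root` parameter are hypothetical names, not part of the submitted notebook.

```python
# Station IDs as given in the task description.
STATIONS = {"Cincinnati": "72429793812", "Florida": "99495199999"}
YEARS = range(2015, 2025)  # 2015 through 2024 inclusive


def expected_paths(root=""):
    """Return the <year>/<station>.csv paths implied by the layout (hypothetical helper)."""
    prefix = f"{root}/" if root else ""
    return [
        f"{prefix}{year}/{station}.csv"
        for year in YEARS
        for station in STATIONS.values()
    ]


paths = expected_paths()
print(len(paths))  # 20
print(paths[0])    # 2015/72429793812.csv
```

Two stations over ten years gives 20 files, which is a quick sanity check before loading anything into Spark.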

Setup Verification
- Provide screenshots showing Anaconda and PySpark installation and Jupyter Notebook launch.

Data Loading
- Load each CSV file and display the record count for verification (20 datasets).

Hottest Day Analysis
- Find the hottest day (MAX temperature) for each year, providing station code, name, date, and temperature.
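
The per-year maximum can be illustrated without Spark. The sketch below assumes the GSOD column names STATION, NAME, DATE, and MAX, a DATE format of YYYY-MM-DD, and 9999.9 as the missing-value marker for MAX; all three should be checked against the dataset README. The sample rows are made up.

```python
MISSING_TEMP = 9999.0  # assumption: GSOD marks missing MAX as 9999.9


def hottest_per_year(rows):
    """Return {year: row} holding the row with the highest valid MAX in each year."""
    best = {}
    for row in rows:
        if row["MAX"] >= MISSING_TEMP:
            continue  # skip missing readings
        year = row["DATE"][:4]
        if year not in best or row["MAX"] > best[year]["MAX"]:
            best[year] = row
    return best


# Hypothetical sample rows, not real observations.
rows = [
    {"STATION": "72429793812", "NAME": "EXAMPLE STATION", "DATE": "2015-07-18", "MAX": 93.0},
    {"STATION": "72429793812", "NAME": "EXAMPLE STATION", "DATE": "2015-07-19", "MAX": 9999.9},
    {"STATION": "72429793812", "NAME": "EXAMPLE STATION", "DATE": "2016-01-05", "MAX": 30.0},
    {"STATION": "72429793812", "NAME": "EXAMPLE STATION", "DATE": "2016-08-11", "MAX": 91.9},
]

hottest = hottest_per_year(rows)
for year, row in sorted(hottest.items()):
    print(year, row["STATION"], row["NAME"], row["DATE"], row["MAX"])
```

In the notebook the same idea would typically be written as a Spark aggregation over a year column; the plain-Python version only shows the filtering and comparison logic.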

Coldest Day in March
- Identify the coldest day in March across all years with station code, name, date, and temperature.

Annual Precipitation Comparison
- Determine the year with the most precipitation for Cincinnati and Florida, including station code, name, year, and mean precipitation.

Missing Value Percentage
- Calculate the percentage of missing values in the wind gust (GUST) column for Cincinnati and Florida in 2024.
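
A minimal sketch of the percentage calculation, assuming that GSOD encodes a missing GUST as 999.9 (to be confirmed against the dataset README). The function name and sample values are hypothetical.

```python
GUST_MISSING = 999.9  # assumption: GSOD marker for a missing GUST value


def missing_percentage(values, sentinel=GUST_MISSING):
    """Percentage of entries that are None or equal to the missing-value marker."""
    values = list(values)
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == sentinel)
    return 100.0 * missing / len(values)


gusts = [999.9, 21.0, 999.9, 17.1]  # made-up readings
print(missing_percentage(gusts))  # 50.0
```

This count has to run on the raw data: if rows with missing readings are dropped first, the percentage will always come out as zero.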

Temperature Statistics (2020)
- Compute mean, median, mode, and standard deviation of the temperature (TEMP) for each month in Cincinnati for 2020.
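
The four statistics can be cross-checked on a small sample with Python's standard `statistics` module. This sketch assumes DATE strings in YYYY-MM-DD form and uses the sample standard deviation; the `monthly_stats` helper and the readings are hypothetical.

```python
import statistics
from collections import defaultdict


def monthly_stats(readings):
    """readings: iterable of (date 'YYYY-MM-DD', temp). Returns {month: stats dict}."""
    by_month = defaultdict(list)
    for date, temp in readings:
        by_month[date[5:7]].append(temp)

    result = {}
    for month, temps in sorted(by_month.items()):
        result[month] = {
            "mean": statistics.mean(temps),
            "median": statistics.median(temps),
            "mode": statistics.mode(temps),  # first mode if several values tie
            # Sample standard deviation needs at least two values.
            "stddev": statistics.stdev(temps) if len(temps) > 1 else 0.0,
        }
    return result


# Made-up readings for illustration.
readings = [
    ("2020-01-01", 30.0),
    ("2020-01-02", 32.0),
    ("2020-01-03", 32.0),
    ("2020-02-01", 40.0),
]
stats = monthly_stats(readings)
print(stats["01"]["median"], stats["01"]["mode"])  # 32.0 32.0
```

Spark's `stddev` aggregate is also the sample standard deviation, so the two should agree on the same input; a population standard deviation would give slightly smaller values.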

Lowest Wind Chill Days (2017)
- Calculate Wind Chill for days in 2017 where temperature (TEMP) was below 50°F and wind speed (WDSP) above 3 mph. Display the top 10 lowest Wind Chill days.
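
The task does not name a formula, so the sketch below assumes the US National Weather Service wind chill formula (temperature in °F, wind speed in mph). The helper names and sample rows are hypothetical. The GSOD documentation describes WDSP in knots, so a conversion to mph may be needed before applying the threshold; check the dataset README.

```python
def wind_chill(temp_f, wind_mph):
    """NWS wind chill index (assumed formula): temperature in deg F, wind in mph."""
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


def lowest_wind_chill(rows, n=10):
    """rows: iterable of (date, temp_f, wind_mph). Returns the n lowest wind chill days."""
    eligible = [
        (date, wind_chill(temp, wind))
        for date, temp, wind in rows
        if temp < 50 and wind > 3  # thresholds from the task description
    ]
    return sorted(eligible, key=lambda item: item[1])[:n]


# Made-up rows for illustration.
rows = [
    ("2017-01-01", 30.0, 10.0),
    ("2017-01-02", 55.0, 10.0),  # excluded: too warm
    ("2017-01-03", 20.0, 2.0),   # excluded: too calm
    ("2017-01-04", 10.0, 15.0),
]
for date, chill in lowest_wind_chill(rows):
    print(date, round(chill, 1))
```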

Extreme Weather Analysis for Florida
- Analyze the FRSHTT column to count days with extreme weather events (fog, rain, snow, etc.) in Florida.
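
A sketch of decoding FRSHTT, assuming it is a six-digit indicator string whose positions stand for fog, rain or drizzle, snow or ice pellets, hail, thunder, and tornado or funnel cloud (1 means the event was reported). That ordering should be verified against the dataset README. When the column is read as an integer the leading zeros are lost, so the value is padded back to six digits. The helper names and sample values are hypothetical.

```python
# Assumed meaning of each FRSHTT position, left to right.
FLAGS = (
    "fog",
    "rain_or_drizzle",
    "snow_or_ice_pellets",
    "hail",
    "thunder",
    "tornado_or_funnel_cloud",
)


def decode_frshtt(value):
    """Turn an FRSHTT value (string or number) into a {flag: bool} dict."""
    digits = f"{int(value):06d}"  # restore leading zeros dropped by numeric parsing
    return {name: digit == "1" for name, digit in zip(FLAGS, digits)}


def count_events(values):
    """Count the days on which each event type was reported."""
    counts = dict.fromkeys(FLAGS, 0)
    for value in values:
        for name, present in decode_frshtt(value).items():
            counts[name] += int(present)
    return counts


sample = ["010000", 10000, "000000", "110010"]  # made-up values
counts = count_events(sample)
print(counts["fog"], counts["rain_or_drizzle"], counts["thunder"])  # 1 3 1
```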

Temperature Prediction for Cincinnati (Nov-Dec 2024)
- Predict maximum temperatures for November and December 2024 based on the previous two years' data, including a brief discussion of the model and possible improvements.
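
One very simple baseline, shown here only as an illustration and not necessarily the model used in the notebook, is to average the MAX reading for the same calendar day across the two previous years. The sketch assumes DATE strings in YYYY-MM-DD form; the helper name and the temperatures are made up.

```python
from collections import defaultdict
from statistics import mean


def predict_daily_max(history):
    """history: {'YYYY-MM-DD': MAX}. Returns {'MM-DD': mean MAX} for November and December."""
    by_day = defaultdict(list)
    for date, max_temp in history.items():
        if date[5:7] in ("11", "12"):
            by_day[date[5:]].append(max_temp)
    return {day: mean(temps) for day, temps in sorted(by_day.items())}


# Made-up history for illustration.
history = {
    "2022-11-01": 60.0,
    "2023-11-01": 64.0,
    "2022-12-25": 20.0,
    "2023-12-25": 50.0,
    "2023-07-04": 90.0,  # ignored: outside November and December
}
predictions = predict_daily_max(history)
print(predictions["11-01"])  # 62.0
print(predictions["12-25"])  # 35.0
```

A baseline like this ignores year-to-year trends and day-to-day correlation, which is the kind of limitation the discussion of possible improvements can address, for example by fitting a regression on day of year or adding more years of history.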

Submit two files:
- Results File: My_Results.txt
- Spark Code: My_Spark.ipynb

- Remove missing or invalid readings from the dataset as per the README.
- Organize and analyze data by leveraging Spark's capabilities for handling large datasets efficiently.
- Ensure results for all tasks are clear and well-documented in the code and output.
The following files are submitted for the completion of Project 4:
- Results File: A text file containing the results of the data analysis. Go to My_Results.txt
- PySpark Code: The Jupyter notebook containing the Spark code used for data analysis. Go to My_Spark.ipynb