This project used PySpark and Spark SQL to analyze home sales data. After setting up the environment and importing the necessary libraries, the home sales data was loaded into a Spark DataFrame and registered as a temporary view named HomeSales for SQL queries. Queries calculated metrics such as the average price of four-bedroom homes sold per year and the average price of homes with three bedrooms, three bathrooms, two floors, and more than 2,000 square feet. Another query computed the average price of homes by "view" rating, filtered to average prices above $350,000, and its runtime was recorded.
To improve query performance, the HomeSales view was cached and its cache status verified. The same view-rating query was then run against the cached data, and its runtime was compared with the uncached run. The data was also partitioned by the date_built field and saved as Parquet files, a new temporary view was created over the Parquet data, and the query was run again to compare performance. Finally, the HomeSales view was uncached and its removal confirmed.
The dataset was generated by edX Boot Camps LLC and is intended for training purposes.