This project analyzes a dataset of home sales using PySpark and SparkSQL. The analysis explores key metrics such as average home prices based on various criteria, caching performance, and data partitioning.
The objective is to analyze the home sales data to extract insights using PySpark, perform efficient queries using caching, and optimize storage and querying by partitioning the data.
The dataset used in this project is `home_sales_revised.csv`, which contains information about home sales, including the following fields:

- `id`: Unique identifier for the home sale
- `date`: Sale date
- `price`: Sale price of the home
- `bedrooms`: Number of bedrooms
- `bathrooms`: Number of bathrooms
- `sqft_living`: Square footage of the living space
- `sqft_lot`: Lot size in square feet
- `floors`: Number of floors
- `waterfront`: Waterfront property indicator
- `view`: View rating of the home
- `date_built`: Year the home was built
- PySpark is used to read the CSV dataset into a Spark DataFrame.
- A temporary table `home_sales` is created for SQL-based querying.
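
A minimal sketch of this setup, assuming a local SparkSession and that `home_sales_revised.csv` sits in the working directory (the file path and app name are illustrative):

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Start a local Spark session for the analysis.
spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Read the CSV into a Spark DataFrame, inferring column types from the data.
home_df = spark.read.csv("home_sales_revised.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SparkSQL.
home_df.createOrReplaceTempView("home_sales")
```
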
- A SparkSQL query calculates the average price of four-bedroom houses sold for each year, rounded to two decimal places.
- Homes with exactly three bedrooms and three bathrooms are filtered, and their average price is calculated for each year built.
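
These two queries might look like the following sketch; the column names (`date`, `price`, `bedrooms`, `bathrooms`, `date_built`) are assumptions taken from the field list above:

```python
# Average price of four-bedroom homes, grouped by the year they were sold.
spark.sql("""
    SELECT YEAR(date) AS year_sold,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 4
    GROUP BY YEAR(date)
    ORDER BY year_sold
""").show()

# Average price of 3-bedroom, 3-bathroom homes, grouped by the year they were built.
spark.sql("""
    SELECT date_built,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 3 AND bathrooms = 3
    GROUP BY date_built
    ORDER BY date_built
""").show()
```
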
- Filters are applied to homes with:
- 3 bedrooms
- 3 bathrooms
- 2 floors
- Square footage ≥ 2000
- The average price is calculated for each year built.
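
A sketch of this query, reusing the same assumed column names (`floors` and `sqft_living` for the floor count and living-area square footage):

```python
# Average price per year built for 3-bed, 3-bath, two-floor homes
# with at least 2,000 square feet of living space.
spark.sql("""
    SELECT date_built,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 3
      AND bathrooms = 3
      AND floors = 2
      AND sqft_living >= 2000
    GROUP BY date_built
    ORDER BY date_built
""").show()
```
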
- The query calculates the average price of homes per "view" rating where the average price is ≥ $350,000.
- Runtime for the query is recorded.
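
One way to express this query and record its runtime, as a sketch under the same column-name assumptions (the `HAVING` clause applies the $350,000 threshold to the per-rating average):

```python
import time

start_time = time.time()

# Average price per "view" rating, keeping only ratings whose average is at least $350,000.
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print(f"Uncached runtime: {time.time() - start_time:.4f} seconds")
```
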
- The `home_sales` table is cached using PySpark.
- The query for average price by view rating is rerun on the cached table.
- Performance is compared to the uncached query runtime.
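
A sketch of the caching step, assuming the temporary view `home_sales` and the view-rating query shown earlier:

```python
import time

# Cache the temporary table in memory and confirm it is cached.
spark.catalog.cacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # Expected: True

# Re-run the view-rating query against the cached table and time it.
start_time = time.time()
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
print(f"Cached runtime: {time.time() - start_time:.4f} seconds")
```
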
- The dataset is partitioned by `date_built` and stored in Parquet format.
- The partitioned data is read back into a DataFrame.
- A temporary table is created from the partitioned Parquet data.
- Queries are executed on the Parquet table, and runtime is compared to previous executions.
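
A sketch of the partitioning workflow, assuming the DataFrame `home_df` from the loading step; the output path `home_sales_partitioned` and the view name `home_sales_parquet` are illustrative:

```python
import time

# Write the data partitioned by the date_built field in Parquet format.
home_df.write.partitionBy("date_built").mode("overwrite").parquet("home_sales_partitioned")

# Read the partitioned Parquet data back and register it as a temporary view.
parquet_df = spark.read.parquet("home_sales_partitioned")
parquet_df.createOrReplaceTempView("home_sales_parquet")

# Run the same view-rating query against the Parquet-backed table and time it.
start_time = time.time()
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_parquet
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
print(f"Parquet runtime: {time.time() - start_time:.4f} seconds")
```
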
- The cached table `home_sales` is uncached, and its status is verified.
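
A minimal sketch of the uncaching step:

```python
# Remove home_sales from the cache and verify that it is no longer cached.
spark.catalog.uncacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # Expected: False
```
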
The project meets the following criteria:
- A Spark DataFrame is created from the dataset.
- Temporary tables are created for both CSV and Parquet data.
- Queries for specific metrics are implemented and optimized.
- Caching and uncaching functionality is verified.
- The dataset is partitioned and stored in Parquet format.
- The runtime for queries using uncached data, cached data, and partitioned data is recorded for performance comparison.
- Install the required libraries and dependencies:
- PySpark
- findspark
- Download the dataset `home_sales_revised.csv` from the provided link.
- Execute the notebook to replicate the analysis.
The analysis demonstrates how PySpark can be used to handle large datasets efficiently, optimize queries with caching, and organize data with partitioning.
- `Home_Sales.ipynb`: Jupyter Notebook containing the full analysis and implementation.
- `home_sales_revised.csv`: Dataset used for the analysis.