This project analyzes a dataset of home sales using PySpark and SparkSQL. The analysis explores key metrics such as average home prices based on various criteria, caching performance, and data partitioning.
The objective is to analyze the home sales data to extract insights using PySpark, perform efficient queries using caching, and optimize storage and querying by partitioning the data.
The dataset used in this project is `home_sales_revised.csv`, which contains information about home sales, including the following fields:

- `id`: Unique identifier for the home sale
- `date`: Sale date
- `price`: Sale price of the home
- `bedrooms`: Number of bedrooms
- `bathrooms`: Number of bathrooms
- `sqft_living`: Square footage of the living space
- `sqft_lot`: Lot size in square feet
- `floors`: Number of floors
- `waterfront`: Waterfront property indicator
- `view`: View rating of the home
- `date_built`: Year the home was built
- PySpark is used to read the CSV dataset into a Spark DataFrame.
- A temporary table `home_sales` is created for SQL-based querying.
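
A minimal sketch of this setup, assuming a local SparkSession and that `home_sales_revised.csv` sits in the working directory (the file path and app name are illustrative):

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Start a local Spark session for the analysis.
spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Read the CSV into a Spark DataFrame, inferring column types from the data.
home_df = spark.read.csv("home_sales_revised.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SparkSQL.
home_df.createOrReplaceTempView("home_sales")
```
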
- A SparkSQL query calculates the average price of four-bedroom houses sold for each year, rounded to two decimal places.
- Homes with exactly three bedrooms and three bathrooms are filtered, and their average price is calculated for each year built.
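
These two queries might look like the following sketch; the column names (`date`, `price`, `bedrooms`, `bathrooms`, `date_built`) are assumptions taken from the field list above:

```python
# Average price of four-bedroom homes, grouped by the year they were sold.
spark.sql("""
    SELECT YEAR(date) AS year_sold,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 4
    GROUP BY YEAR(date)
    ORDER BY year_sold
""").show()

# Average price of 3-bedroom, 3-bathroom homes, grouped by the year they were built.
spark.sql("""
    SELECT date_built,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 3 AND bathrooms = 3
    GROUP BY date_built
    ORDER BY date_built
""").show()
```
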
- Filters are applied to homes with:
- 3 bedrooms
- 3 bathrooms
- 2 floors
- Square footage ≥ 2000
- The average price is calculated for each year built.
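
A sketch of this query, reusing the same assumed column names (`floors` and `sqft_living` for the floor count and living-area square footage):

```python
# Average price per year built for 3-bed, 3-bath, two-floor homes
# with at least 2,000 square feet of living space.
spark.sql("""
    SELECT date_built,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 3
      AND bathrooms = 3
      AND floors = 2
      AND sqft_living >= 2000
    GROUP BY date_built
    ORDER BY date_built
""").show()
```
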
- The query calculates the average price of homes per "view" rating where the average price is ≥ $350,000.
- Runtime for the query is recorded.
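
One way to express this query and record its runtime, as a sketch under the same column-name assumptions (the `HAVING` clause applies the $350,000 threshold to the per-rating average):

```python
import time

start_time = time.time()

# Average price per "view" rating, keeping only ratings whose average is at least $350,000.
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print(f"Uncached runtime: {time.time() - start_time:.4f} seconds")
```
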
- The `home_sales` table is cached using PySpark.
- The query for average price by view rating is rerun on the cached table.
- Performance is compared to the uncached query runtime.
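
A sketch of the caching step, assuming the temporary view `home_sales` and the view-rating query shown earlier:

```python
import time

# Cache the temporary table in memory and confirm it is cached.
spark.catalog.cacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # Expected: True

# Re-run the view-rating query against the cached table and time it.
start_time = time.time()
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
print(f"Cached runtime: {time.time() - start_time:.4f} seconds")
```
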
- The dataset is partitioned by `date_built` and stored in Parquet format.
- The partitioned data is read back into a DataFrame.
- A temporary table is created from the partitioned Parquet data.
- Queries are executed on the Parquet table, and runtime is compared to previous executions.
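
A sketch of the partitioning workflow, assuming the DataFrame `home_df` from the loading step; the output path `home_sales_partitioned` and the view name `home_sales_parquet` are illustrative:

```python
import time

# Write the data partitioned by the date_built field in Parquet format.
home_df.write.partitionBy("date_built").mode("overwrite").parquet("home_sales_partitioned")

# Read the partitioned Parquet data back and register it as a temporary view.
parquet_df = spark.read.parquet("home_sales_partitioned")
parquet_df.createOrReplaceTempView("home_sales_parquet")

# Run the same view-rating query against the Parquet-backed table and time it.
start_time = time.time()
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_parquet
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
print(f"Parquet runtime: {time.time() - start_time:.4f} seconds")
```
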
- The cached table `home_sales` is uncached, and its status is verified.
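
A minimal sketch of the uncaching step:

```python
# Remove home_sales from the cache and verify that it is no longer cached.
spark.catalog.uncacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # Expected: False
```
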
The project meets the following criteria:
- A Spark DataFrame is created from the dataset.
- Temporary tables are created for both CSV and Parquet data.
- Queries for specific metrics are implemented and optimized.
- Caching and uncaching functionality is verified.
- The dataset is partitioned and stored in Parquet format.
- The runtime for queries using uncached data, cached data, and partitioned data is recorded for performance comparison.
- Install the required libraries and dependencies:
- PySpark
- findspark
- Download the dataset `home_sales_revised.csv` from the provided link.
- Execute the notebook to replicate the analysis.
The analysis demonstrates how PySpark can be used to handle large datasets efficiently, optimize queries with caching, and organize data with partitioning.
- `Home_Sales.ipynb`: Jupyter Notebook containing the full analysis and implementation.
- `home_sales_revised.csv`: Dataset used for the analysis.