Key Features:

Data Ingestion:
- The dataset is ingested from a CSV file hosted on an AWS S3 bucket.
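A minimal ingestion sketch, assuming the file is fetched over HTTPS with SparkFiles; the bucket URL and file name below are placeholders, not the project's actual location:

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for the analysis.
spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Placeholder S3 URL -- replace with the dataset's actual location.
url = "https://<bucket-name>.s3.amazonaws.com/home_sales_revised.csv"
spark.sparkContext.addFile(url)

# Read the downloaded CSV into a DataFrame, inferring column types from the header row.
home_sales_df = spark.read.csv(
    SparkFiles.get("home_sales_revised.csv"), sep=",", header=True, inferSchema=True
)
home_sales_df.show(5)
```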
Data Transformation:
- Data is transformed and queried using SQL within PySpark.
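A sketch of how the SQL layer is typically set up, assuming the `spark` session and `home_sales_df` DataFrame from the ingestion sketch above:

```python
# Expose the DataFrame to Spark SQL as a temporary view named home_sales.
home_sales_df.createOrReplaceTempView("home_sales")

# Any SQL statement can now be run against the view.
spark.sql("SELECT COUNT(*) AS row_count FROM home_sales").show()
```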
Caching:
- The home_sales temporary table is cached to improve query performance.
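Caching the view is a single call to the catalog API; a sketch, using the same view name as above:

```python
# Keep the home_sales view in memory so repeated queries avoid re-reading the source data.
spark.catalog.cacheTable("home_sales")

# Verify the cache took effect.
print(spark.catalog.isCached("home_sales"))  # True
```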
Performance Comparison:
- Runtime comparison between cached and uncached data queries.
- Data is partitioned by date_built and stored in Parquet format for optimized querying.
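A simple way to compare runtimes is to wrap the same query in a wall-clock timer before and after caching. The sketch below assumes the `home_sales` view from above; the `view` and `price` column names are assumptions based on the dataset description:

```python
import time

def timed_query(sql):
    """Run a Spark SQL query and return its wall-clock runtime in seconds."""
    start = time.time()
    spark.sql(sql).show()
    return time.time() - start

query = "SELECT view, ROUND(AVG(price), 2) AS avg_price FROM home_sales GROUP BY view"

# Start from an uncached baseline, time the query, then cache and time it again.
if spark.catalog.isCached("home_sales"):
    spark.catalog.uncacheTable("home_sales")
uncached_runtime = timed_query(query)

# Note: the first run after cacheTable also pays the cost of materializing the cache.
spark.catalog.cacheTable("home_sales")
cached_runtime = timed_query(query)

print(f"Uncached: {uncached_runtime:.2f} s, cached: {cached_runtime:.2f} s")
```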
Queries:
- Average price of homes based on different criteria, such as:
  - Number of bedrooms and bathrooms
  - Number of floors and square footage
  - View rating with a price threshold
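For example, an average-price query filtered on bedrooms and bathrooms might look like the sketch below; the column names (`bedrooms`, `bathrooms`, `price`, `date`) are assumptions based on the dataset description:

```python
# Average price of 3-bedroom, 3-bathroom homes, grouped by the year sold.
spark.sql("""
    SELECT YEAR(date) AS year_sold,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 3 AND bathrooms = 3
    GROUP BY YEAR(date)
    ORDER BY year_sold
""").show()
```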
Steps to Run:

1. Setup Spark Environment: Ensure Spark and the necessary dependencies are installed and configured.
2. Run Analysis: Execute the provided PySpark scripts or Jupyter notebook to perform the analysis.
3. Caching & Uncaching: Observe the impact of caching on query performance, then uncache tables when done to free up resources.
4. Parquet Operations: Partition the data by date_built and save it as Parquet files, then run queries on the Parquet data and compare performance, as sketched below.
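A sketch of the Parquet and uncache steps (3 and 4), continuing from the DataFrame and view above; the output path is a placeholder:

```python
# Write the data partitioned by date_built (placeholder output path).
home_sales_df.write.partitionBy("date_built").mode("overwrite").parquet("home_sales_partitioned")

# Read the partitioned Parquet data back and expose it to SQL.
parquet_df = spark.read.parquet("home_sales_partitioned")
parquet_df.createOrReplaceTempView("home_sales_parquet")

# Re-run a query against the Parquet-backed view to compare performance.
spark.sql("SELECT COUNT(*) AS row_count FROM home_sales_parquet").show()

# Release memory when finished.
spark.catalog.uncacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # False
```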
Requirements:
- Python 3.x
- PySpark
- Java 11
Usage:
1. Clone the repository.
2. Set up the Spark environment (see the sketch below).
3. Execute the scripts in a PySpark environment or Jupyter notebook.
4. Modify the queries as needed to explore different aspects of the dataset.
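As a quick check that the environment is wired up (step 2), a session can be created and its version printed; this is a minimal local-mode sketch, not the project's required configuration:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session and confirm the installed version.
spark = SparkSession.builder.appName("HomeSalesSetupCheck").getOrCreate()
print(spark.version)
```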