Key Features:

Data Ingestion:
- The dataset is ingested from a CSV file hosted on an AWS S3 bucket.
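A minimal ingestion sketch, assuming the file is fetched over HTTPS with SparkFiles; the bucket URL and file name below are placeholders, not the project's actual location:

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for the analysis.
spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Placeholder S3 URL -- replace with the dataset's actual location.
url = "https://<bucket-name>.s3.amazonaws.com/home_sales_revised.csv"
spark.sparkContext.addFile(url)

# Read the downloaded CSV into a DataFrame, inferring column types from the header row.
home_sales_df = spark.read.csv(
    SparkFiles.get("home_sales_revised.csv"), sep=",", header=True, inferSchema=True
)
home_sales_df.show(5)
```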
Data Transformation:
- Data is transformed and queried using SQL within PySpark.
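A sketch of how the SQL layer is typically set up, assuming the `spark` session and `home_sales_df` DataFrame from the ingestion sketch above:

```python
# Expose the DataFrame to Spark SQL as a temporary view named home_sales.
home_sales_df.createOrReplaceTempView("home_sales")

# Any SQL statement can now be run against the view.
spark.sql("SELECT COUNT(*) AS row_count FROM home_sales").show()
```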
Caching:
- The home_sales temporary table is cached to improve query performance.
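Caching the view is a single call to the catalog API; a sketch, using the same view name as above:

```python
# Keep the home_sales view in memory so repeated queries avoid re-reading the source data.
spark.catalog.cacheTable("home_sales")

# Verify the cache took effect.
print(spark.catalog.isCached("home_sales"))  # True
```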
Performance Comparison:
- Runtime comparison between cached and uncached data queries.
- Data is partitioned by date_built and stored in Parquet format for optimized querying.
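A simple way to compare runtimes is to wrap the same query in a wall-clock timer before and after caching. The sketch below assumes the `home_sales` view from above; the `view` and `price` column names are assumptions based on the dataset description:

```python
import time

def timed_query(sql):
    """Run a Spark SQL query and return its wall-clock runtime in seconds."""
    start = time.time()
    spark.sql(sql).show()
    return time.time() - start

query = "SELECT view, ROUND(AVG(price), 2) AS avg_price FROM home_sales GROUP BY view"

# Start from an uncached baseline, time the query, then cache and time it again.
if spark.catalog.isCached("home_sales"):
    spark.catalog.uncacheTable("home_sales")
uncached_runtime = timed_query(query)

# Note: the first run after cacheTable also pays the cost of materializing the cache.
spark.catalog.cacheTable("home_sales")
cached_runtime = timed_query(query)

print(f"Uncached: {uncached_runtime:.2f} s, cached: {cached_runtime:.2f} s")
```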
Queries:
- Average price of homes based on different criteria, such as:
  - Number of bedrooms and bathrooms
  - Number of floors and square footage
  - View rating with a price threshold
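For example, an average-price query filtered on bedrooms and bathrooms might look like the sketch below; the column names (`bedrooms`, `bathrooms`, `price`, `date`) are assumptions based on the dataset description:

```python
# Average price of 3-bedroom, 3-bathroom homes, grouped by the year sold.
spark.sql("""
    SELECT YEAR(date) AS year_sold,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 3 AND bathrooms = 3
    GROUP BY YEAR(date)
    ORDER BY year_sold
""").show()
```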
Steps to Run:

1. Setup Spark Environment: Ensure Spark and the necessary dependencies are installed and configured.
2. Run Analysis: Execute the provided PySpark scripts or Jupyter notebook to perform the analysis.
3. Caching & Uncaching: Observe the impact of caching on query performance, then uncache tables when done to free up resources.
4. Parquet Operations: Partition the data by date_built and save it as Parquet files, then run queries on the Parquet data and compare performance, as sketched below.
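A sketch of the Parquet and uncache steps (3 and 4), continuing from the DataFrame and view above; the output path is a placeholder:

```python
# Write the data partitioned by date_built (placeholder output path).
home_sales_df.write.partitionBy("date_built").mode("overwrite").parquet("home_sales_partitioned")

# Read the partitioned Parquet data back and expose it to SQL.
parquet_df = spark.read.parquet("home_sales_partitioned")
parquet_df.createOrReplaceTempView("home_sales_parquet")

# Re-run a query against the Parquet-backed view to compare performance.
spark.sql("SELECT COUNT(*) AS row_count FROM home_sales_parquet").show()

# Release memory when finished.
spark.catalog.uncacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # False
```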
Requirements:
- Python 3.x
- PySpark
- Java 11
Usage:
1. Clone the repository.
2. Set up the Spark environment (see the sketch below).
3. Execute the scripts in a PySpark environment or Jupyter notebook.
4. Modify the queries as needed to explore different aspects of the dataset.
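As a quick check that the environment is wired up (step 2), a session can be created and its version printed; this is a minimal local-mode sketch, not the project's required configuration:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session and confirm the installed version.
spark = SparkSession.builder.appName("HomeSalesSetupCheck").getOrCreate()
print(spark.version)
```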