A structured 18-day PySpark learning roadmap covering DataFrames, SQL, Joins, Window Functions, Performance Tuning, and MLlib with daily datasets and coding tasks.
This repository documents my complete journey of learning PySpark through a daily coding roadmap.
Each day focuses on a new PySpark concept with a dataset and implementation example. By the end, the roadmap covers an end-to-end mini project and MLlib model training.
- Beginner to advanced PySpark concepts in 18 days.
- Hands-on coding tasks with datasets for each topic.
- Covers DataFrames, SQL, Aggregations, Joins, Window Functions, and Performance Tuning (see the window-function sketch after this list).
- Mini Project (Day 15): Data cleaning, joining, aggregations, and saving results.
- MLlib tasks (Day 16–18): Feature engineering, decision tree, and ML pipeline (a pipeline sketch follows the list).
- Code is written for Google Colab / Jupyter Notebook for easy execution.
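To give a flavor of the daily tasks, here is a minimal sketch of a window-function exercise. The dataset and column names (`store`, `date`, `revenue`) are illustrative assumptions, not the actual roadmap data:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("roadmap-demo").getOrCreate()

# Hypothetical sales data; the real roadmap uses its own daily datasets.
sales = spark.createDataFrame(
    [("A", "2024-01-01", 100), ("A", "2024-01-02", 150), ("B", "2024-01-01", 80)],
    ["store", "date", "revenue"],
)

# Rank each day's revenue within its store and keep a running total.
w = Window.partitionBy("store").orderBy("date")
result = (
    sales
    .withColumn("day_rank", F.row_number().over(w))
    .withColumn("running_revenue", F.sum("revenue").over(w))
)
result.show()
```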
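And a minimal sketch of the Day 16–18 style MLlib workflow: index a label, assemble features, and train a decision tree inside a `Pipeline`. The tiny dataset and columns (`age`, `spend`, `churn`) are assumptions for illustration only:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical churn data; swap in the roadmap's Day 16-18 dataset.
df = spark.createDataFrame(
    [(34.0, 1200.0, "yes"), (23.0, 300.0, "no"), (45.0, 2500.0, "yes")],
    ["age", "spend", "churn"],
)

# Index the string label, assemble numeric columns into a feature vector,
# then fit a decision tree, all as one Pipeline.
label_indexer = StringIndexer(inputCol="churn", outputCol="label")
assembler = VectorAssembler(inputCols=["age", "spend"], outputCol="features")
tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[label_indexer, assembler, tree])
model = pipeline.fit(df)
model.transform(df).select("age", "spend", "prediction").show()
```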