This project analyzes Indian Premier League (IPL) cricket data with Apache Spark, an open-source unified analytics engine for large-scale data processing. The goal is to surface trends and insights in IPL match and ball-by-ball data.

The work covers four areas:
- Data Ingestion and Cleaning: Efficiently load and preprocess raw IPL data.
- Exploratory Data Analysis (EDA): Generate descriptive statistics and visualizations to understand the underlying patterns in the data.
- Advanced Analytics: Implement advanced analytical techniques to derive meaningful insights from the data.
- Visualization: Create interactive and static visualizations to present the findings effectively.
The datasets used in this project can be found at the following link:
- IPL Data Till 2017: Includes match and ball-by-ball data up to the year 2017.

Tools and libraries used:
- Apache Spark (PySpark)
- Databricks
- SparkSQL
- Pandas
- Matplotlib
Below is the architecture diagram that illustrates the data flow and components used in this IPL data analysis project: