ETL Movies Data Warehouse & Analysis Project 🎬📊

Project Overview

Welcome to the ETL Movies Data Warehouse and Analysis project, where raw data from the cinematic world is transformed into meaningful insights! This project demonstrates the full ETL (Extract, Transform, Load) workflow using IMDb datasets and Talend Open Studio. The processed data is stored in a MySQL database, enabling interactive analysis through Power BI dashboards.

Key Features

✨ ETL Pipeline:

Extract: Data is sourced from Kaggle and IMDb datasets.
Transform: Data cleaning, deduplication, sorting, and aggregation using Talend Open Studio components.
Load: Processed data is loaded into a MySQL database for efficient querying.

📦 Data Warehouse:

A structured data warehouse built on a star schema with optimized fact and dimension tables for analytics.

📊 Interactive Dashboards:

Power BI visualizations provide insights into movies, actors, and trends.
KPIs include top movies, actor contributions, and trends over time.

Technologies Used 🚀

This project integrates the following tools and technologies to implement the ETL pipeline and visualization:

ETL Tool: Talend Open Studio
Database: MySQL
Visualization Tool: Power BI
Data Source:
- IMDb datasets:
  - name.basics.tsv.gz
  - title.principals.tsv.gz
  - title.basics.tsv.gz
- Additional movie datasets from Kaggle

Repository Structure 📁

Here is the structure of the project repository:

ETL_Movies_Project/
├── .git/                     # Git repository files  
├── DataSets/                 # Links of the IMDb and Kaggle datasets  
├── IMDBDWH/                  # Talend jobs and workflows  
├── MySQL_Database/           # SQL scripts for creating tables and loading data  
├── PowerBi_Report/           # Power BI report files  
├── screenshots/              # Images of dashboards and Talend workflows  
├── .gitattributes            # Git attributes file  
├── LICENSE                   # License file  
└── README.md                 # Project documentation (this file)

Talend Components Used 🛠️

During the ETL process, the following Talend Open Studio components were used to clean, filter, and aggregate data:

tFileInputDelimited: To read raw IMDb TSV files (e.g., name.basics.tsv, title.basics.tsv).
tUniqRow: To remove duplicate rows and ensure data integrity.
tJoin: To join movie and actor datasets on common keys like tconst and nconst.
tAggregate: To calculate metrics such as movie counts or aggregated KPIs.
tSortRow: To sort rows based on specific fields, ensuring organized data.

Getting Started ⚙️

Prerequisites 🔧

Ensure you have the following installed to set up the project locally:

Talend Open Studio: Download and install from Talend's website.
MySQL: Install and set up a MySQL server to host the data warehouse.
Power BI Desktop: Install from Power BI's website.

Steps to Run the Project 📋

ETL Process (Talend):
- Open Talend Open Studio and import jobs from the IMDBDWH folder.
- Run the jobs to extract, clean, and load the datasets into your MySQL database.
Database Setup:
- Execute the SQL scripts in the MySQL_Database folder to create the required tables.
Power BI Dashboard:
- Open the Power BI file from PowerBi_Report and connect it to your MySQL database.
- Refresh the data to generate insights and visualizations.

Data Sources 📝

This project uses publicly available movie datasets:

IMDb Datasets (Source):
- name.basics.tsv.gz – Contains actor and crew information.
- title.principals.tsv.gz – Links actors/crew to movies.
- title.basics.tsv.gz – Contains movie titles, years, and genres.
Kaggle Dataset (Source): – Provides additional raw data (400k movies).

Power BI Visualizations 🎨

Explore insightful visualizations that provide:

Top Movies and Actors: Analysis of top-performing actors and movies.
Genre Analysis: Trends and popularity of genres over time.
Actor Contributions: Movie counts and associations per actor.

Screenshots 📸

Examples of the dashboards and workflows:

Power BI Dashboard
Talend Workflow

Contributing 🤝

Contributions are welcome! If you'd like to improve this project, feel free to:

Fork the repository.
Create a pull request with your changes.
Open issues for bugs or suggestions.

License 📝

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments 🙌

IMDb for providing comprehensive movie datasets.
Talend Open Studio for ETL processing.
Power BI for enabling powerful visual analytics.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ETL Movies Data Warehouse & Analysis Project 🎬📊

Project Overview

Key Features

Technologies Used 🚀

Repository Structure 📁

Talend Components Used 🛠️

Getting Started ⚙️

Prerequisites 🔧

Steps to Run the Project 📋

Data Sources 📝

Power BI Visualizations 🎨

Screenshots 📸

Contributing 🤝

License 📝

Acknowledgments 🙌

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
DataSets		DataSets
IMDBDWH		IMDBDWH
MySQL_Database		MySQL_Database
screenshots		screenshots
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md

License

MeDjb10/ETL-Movies-Data-Warehouse--Analysis-Project

Folders and files

Latest commit

History

Repository files navigation

ETL Movies Data Warehouse & Analysis Project 🎬📊

Project Overview

Key Features

Technologies Used 🚀

Repository Structure 📁

Talend Components Used 🛠️

Getting Started ⚙️

Prerequisites 🔧

Steps to Run the Project 📋

Data Sources 📝

Power BI Visualizations 🎨

Screenshots 📸

Contributing 🤝

License 📝

Acknowledgments 🙌

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages