This project designs data pipelines that extract data on famous musical albums, transform it with the MapReduce technique into meaningful partitions, and load it into text-based databases. The final output includes a report on the pipeline design.
- Data Extraction: Connects to MongoDB using pymongo to extract data on famous musical albums.
- Data Transformation: Uses the mrjob framework to run MapReduce jobs that partition the data into meaningful datasets, such as annual top sales and all-time best sellers.
- Data Storage: Loads the transformed data into text-based databases for further analysis.
- Pipeline Report: Generates a report detailing the pipeline designs, key features, and potential improvements.
- Python: Core language for data extraction, transformation, and processing.
- MongoDB: NoSQL database for storing and retrieving album data.
- pymongo: Python library for connecting to MongoDB and extracting data.
- mrjob: Framework for running MapReduce jobs in Python.
- json: Python standard-library module for JSON parsing and processing.
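The transformation step above follows the classic map → shuffle → reduce pattern that mrjob implements on top of Hadoop streaming. As a minimal stdlib-only sketch of that pattern (the `title`, `year`, and `sales` fields are hypothetical, since the actual schema of Song.json is not shown here), the "annual top sales" partition could be computed like this:

```python
from collections import defaultdict

# Hypothetical album records; the real data comes from MongoDB / Song.json.
albums = [
    {"title": "Thriller", "year": 1982, "sales": 66_000_000},
    {"title": "Back in Black", "year": 1980, "sales": 50_000_000},
    {"title": "The Wall", "year": 1979, "sales": 30_000_000},
    {"title": "Off the Wall", "year": 1979, "sales": 20_000_000},
]

def map_phase(records):
    # Map: emit (key, value) pairs, keyed by release year.
    for rec in records:
        yield rec["year"], rec

def shuffle(pairs):
    # Shuffle: group every emitted value under its key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: keep only the best-selling album per year.
    return {year: max(recs, key=lambda r: r["sales"])["title"]
            for year, recs in groups.items()}

annual_top = reduce_phase(shuffle(map_phase(albums)))
```

An mrjob job expresses the same logic as `mapper` and `reducer` methods on an `MRJob` subclass, with the framework handling the shuffle between them.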
- Python 3.x installed
- Jupyter Notebook or a Python IDE (VS Code, PyCharm, etc.)
- Virtual environment (optional but recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/TheVinh-Ha-1710/Big-Data-Pipeline-Design.git
  cd Big-Data-Pipeline-Design
  ```
- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Run the data pipeline scripts:

  ```bash
  chmod +x run_pipelines.sh
  ./run_pipelines.sh
  ```
📂 Big-Data-Pipeline-Design
├── 📂 databases # Output datasets
├── 📂 pipelines # Pipeline scripts
├── 📜 README.md # Project document
├── 📜 Report.pdf # PDF Report
├── 📜 Song.json # The original dataset
├── 📜 requirements.txt # Python dependencies
└── 📜 run_pipelines.sh # Shell script to run the pipeline