This project designs data pipelines that extract data on famous musical albums, transform it with the MapReduce technique into meaningful partitions, and load it into text-based databases. The final output includes a report on the pipeline design.
- Data Extraction: Connects to MongoDB using pymongo to extract data on famous musical albums.
- Data Transformation: Uses the mrjob framework to run MapReduce jobs that partition the data into meaningful datasets, such as annual top sales and all-time best sellers.
- Data Storage: Loads the transformed data into text-based databases for further analysis.
- Pipeline Report: Generates a report detailing the pipeline designs, key features, and potential improvements.
- Python: Core language for data extraction, transformation, and processing.
- MongoDB: NoSQL database for storing and retrieving album data.
- pymongo: Python library for connecting to MongoDB and extracting data.
- mrjob: Framework for running MapReduce jobs in Python.
- json: Python standard-library module for JSON parsing and processing.
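The transformation step above follows the classic map → shuffle → reduce pattern that mrjob implements on top of Hadoop streaming. As a minimal stdlib-only sketch of that pattern (the `title`, `year`, and `sales` fields are hypothetical, since the actual schema of Song.json is not shown here), the "annual top sales" partition could be computed like this:

```python
from collections import defaultdict

# Hypothetical album records; the real data comes from MongoDB / Song.json.
albums = [
    {"title": "Thriller", "year": 1982, "sales": 66_000_000},
    {"title": "Back in Black", "year": 1980, "sales": 50_000_000},
    {"title": "The Wall", "year": 1979, "sales": 30_000_000},
    {"title": "Off the Wall", "year": 1979, "sales": 20_000_000},
]

def map_phase(records):
    # Map: emit (key, value) pairs, keyed by release year.
    for rec in records:
        yield rec["year"], rec

def shuffle(pairs):
    # Shuffle: group every emitted value under its key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: keep only the best-selling album per year.
    return {year: max(recs, key=lambda r: r["sales"])["title"]
            for year, recs in groups.items()}

annual_top = reduce_phase(shuffle(map_phase(albums)))
```

An mrjob job expresses the same logic as `mapper` and `reducer` methods on an `MRJob` subclass, with the framework handling the shuffle between them.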
- Python 3.x installed
- Jupyter Notebook or a Python IDE (VS Code, PyCharm, etc.)
- Virtual environment (optional but recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/TheVinh-Ha-1710/Big-Data-Pipeline-Design.git
  cd Big-Data-Pipeline-Design
  ```
- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Run the data pipeline scripts:

  ```bash
  chmod +x run_pipelines.sh
  ./run_pipelines.sh
  ```
📂 Big-Data-Pipeline-Design
├── 📂 databases # Output datasets
├── 📂 pipelines # Pipeline scripts
├── 📜 README.md # Project document
├── 📜 Report.pdf # PDF Report
├── 📜 Song.json # The original dataset
├── 📜 requirements.txt # Python dependencies
└── 📜 run_pipelines.sh # Shell script to run the pipeline