This project is a comprehensive analysis of song data spanning several decades. It aims to uncover insights and trends in the music industry over time. The data analyzed includes various parameters such as the artist’s name, the year of song release, the singer’s gender, and the country of origin.
- Infrastructure: Terraform & Docker
- Orchestration: MAGEai
- Database Storage: Local
- Data Processing: Apache Spark
- ETL Scripts: Python
- Serving Layer: Google Sheets & Looker
The pipeline starts by ingesting raw data from CSV files and collecting additional data through the Wikidata API. The ETL (Extract, Transform, Load) workflow is orchestrated by MAGEai, which then saves the processed data to a Google Sheets file in the cloud. Finally, the insights derived from the processed data are visualized in Looker Studio.
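As a rough illustration of the Wikidata enrichment step, the sketch below builds a SPARQL query for an artist and flattens the JSON response into plain dictionaries. The function names, the lookup by English label, and the choice of properties (P21 for sex or gender, P27 for country of citizenship) are assumptions made for this example; the project's own ETL scripts may query Wikidata differently.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"


def build_artist_query(artist_name):
    """Return a SPARQL query for an artist's gender and country (assumed properties)."""
    # Escape backslashes and double quotes so the name is a valid string literal.
    safe_name = artist_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'''
SELECT ?artist ?genderLabel ?countryLabel WHERE {{
  ?artist rdfs:label "{safe_name}"@en .
  OPTIONAL {{ ?artist wdt:P21 ?gender . }}
  OPTIONAL {{ ?artist wdt:P27 ?country . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT 1
'''.strip()


def parse_bindings(response):
    """Flatten a SPARQL JSON response into a list of {variable: value} dicts."""
    rows = []
    for binding in response.get("results", {}).get("bindings", []):
        rows.append({key: cell.get("value") for key, cell in binding.items()})
    return rows


def fetch_artist_info(artist_name):
    """Query Wikidata over HTTP and return the parsed rows (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": build_artist_query(artist_name), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_SPARQL_URL}?{params}",
        # Hypothetical User-Agent string; replace with one identifying your project.
        headers={"User-Agent": "song-analysis-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return parse_bindings(json.load(resp))
```

Calling `fetch_artist_info("Adele")` would return at most one row with whichever of the `genderLabel` and `countryLabel` fields Wikidata has for that label. Keeping query building and response parsing separate from the network call makes both easy to check offline.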
This section will guide you through getting the project up and running on your local machine for development and testing purposes.
- Docker
- Docker Compose
- Terraform
- Download the Repository
- Open Google Cloud
- Create a Service Account in the project
- Generate a key and save it in the project path as Serviceaccounts.json
- Replace the existing file with this new key
- Copy this Google Sheet to your account, keeping the same name
- Copy the client_email value from the Serviceaccounts.json file and grant it Editor access to the Google Sheet
- Open Looker Studio and copy this report
- Set the Google Sheet file as the report's data source
- Run this command to build the infrastructure:
terraform apply
- Enter 'yes' when prompted
- Run this command to trigger the pipeline:
curl -X POST http://localhost:6789/api/pipeline_schedules/1/pipeline_runs/5266e37a5e6545bb8d96531bf70471d5
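If you would rather trigger the pipeline from a Python script than from curl, the same POST request can be sketched with the standard library. The base URL, schedule ID, and token below are copied from the curl command above; the function names are hypothetical, and nothing is assumed about the format of the server's response beyond it being text.

```python
import urllib.request

# Values taken from the curl command; adjust them if your trigger differs.
MAGE_BASE_URL = "http://localhost:6789"
SCHEDULE_ID = 1
TRIGGER_TOKEN = "5266e37a5e6545bb8d96531bf70471d5"


def build_trigger_request(base_url=MAGE_BASE_URL, schedule_id=SCHEDULE_ID,
                          token=TRIGGER_TOKEN):
    """Build the POST request for the pipeline trigger URL without sending it."""
    url = f"{base_url}/api/pipeline_schedules/{schedule_id}/pipeline_runs/{token}"
    # No request body is sent, mirroring `curl -X POST <url>`.
    return urllib.request.Request(url, method="POST")


def trigger_pipeline(timeout=30):
    """Send the request and return (HTTP status, response body as text)."""
    with urllib.request.urlopen(build_trigger_request(), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Call `trigger_pipeline()` once the containers are up; a connection error usually means the server on port 6789 is not yet reachable.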
If the pipeline doesn't start automatically, open localhost:6789 in your browser, click on MillionSongsanalysis, then select 'run once'.
If you get a module-not-found error, install the packages listed in requirements.txt.
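To see which packages are missing before reinstalling everything, a small check along these lines may help. It is a sketch that assumes requirements.txt uses simple `name==version` style lines; the helper names are hypothetical, and it should be run in the same Python environment that executes the pipeline.

```python
import re
from importlib import metadata


def parse_requirements(text):
    """Extract package names from the contents of a requirements.txt file."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and options such as -r
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that have no installed distribution in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

For example, `missing_packages(parse_requirements(open("requirements.txt").read()))` lists the packages still to be installed, after which `python -m pip install -r requirements.txt` installs them.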
After completing the above steps, the setup should be functional.