This tool filters and optimizes CSV files (artists and tracks) to generate claims more easily and saves records after filtering to a SQL database.
Important: Ensure you have the ~/.kaggle/kaggle.json configuration file set up before running the tool. You can obtain this configuration from Kaggle's Public API. Additionally, Python must be installed on your system to run the CLI and successfully extract CSVs.
- Python
- Node.js
- PostgreSQL
Ensure Python is installed on your system. You can download it from the official Python website.
Ensure Node.js is installed on your system. You can download it from the official Node.js website.
Ensure PostgreSQL is installed and running on your system. You can download it from the official PostgreSQL website.
Install the necessary dependencies using your preferred package manager:
yarn install
# or
npm install
There are also alternatives like pnpm.
Create a .env file in the root directory of your project and insert the required environment variables as specified in the .env.example file. These variables include database connection details, AWS S3 bucket information, and any other necessary configurations.
First, build the project:
yarn build
This command compiles the TypeScript source files to JavaScript. After running this command, there should be a dist folder containing the distribution JavaScript files.
To download CSV files associated with the artist and track, run:
yarn bake
This command uses Python to download the required CSV files from Kaggle. It utilizes the kaggle CLI to fetch the datasets. Refer to the installKaggle.mjs script for details on how it works.
Transform the data and upload it to S3:
yarn start -t
This command reads the downloaded CSV files, filters and transforms the data according to specified criteria, and then uploads the processed data to an AWS S3 bucket. The filtering criteria include:
- Ignoring tracks with no name.
- Ignoring tracks shorter than 1 minute.
- Loading only artists that have tracks after filtering. The data transformation involves exploding the track release date into separate columns (year, month, day) and transforming the track danceability into string values (Low, Medium, High) based on float.
If you want to download the transformed data files from S3 bucket which you can configure inside source entry file, remember to rebuild afterwards, run:
yarn start -f
This command downloads the processed data files from your specified S3 bucket. It requires a network connection.
To create new records in PostgreSQL from the local CSV files (after transformation), run:
yarn start -c
This command will:
- Create the artists and tracks tables in your PostgreSQL database.
- Insert the data into the respective tables.
To create the SQL views (track_info, most_energizing_tracks, tracks_with_artist_followers), run:
yarn start -v
This command sets up SQL views that perform the following tasks:
- track_info: Selects track information including id, name, popularity, energy, danceability, and the number of artist followers.
- tracks_with_artist_followers: Filters tracks to only include those where artists have followers.
- most_energizing_tracks: Picks the most energizing track for each release year.
To drop records and tables, run:
yarn start -d
This command will delete all records from the artists and tracks tables and drop the tables from your PostgreSQL database. Use this command with caution as it will remove all data.