Inspired by Spotify: a concrete example of how graph databases work, using Neo4j. The challenge is to build a music recommendation algorithm on top of a very large database of songs (the Million Song Dataset), with a graphical interface (Symfony).
Take the time to read through this document so you understand what each step does.
The dataset comes in its own format, so we have to convert it into something Neo4j can ingest.
First, download the song dataset, then extract it.
$> wget http://static.echonest.com/millionsongsubset_full.tar.gz
$> tar -xvzf millionsongsubset_full.tar.gz
This will create a ./MillionSongSubset directory.
This resource gives us the list of song titles, along with many (many) other pieces of data, such as each track's bit rate or the artists related to the one who created it.
Each song's data is stored in its own .h5 file. This file format (HDF5) organizes data as a tree of folders and files.
Roughly speaking, these files contain batches of data in tables (as in Excel).
You can use the"HDFView" software to see what these files actually contain.
The dataset downloaded previously stores its .h5 files under multiple directories. We copy everything into a single directory so the following scripts are simpler to run.
Run this from inside the ./MillionSongSubset/ directory (note the relative destination path):
$> find . -name "*.h5" -exec cp {} ../tools/DATASET_PROCESS/H5_FILES/ \;
The dataset provides the data of exactly 10,000 songs.
To make sure everything is there, run:
$> ls ./tools/DATASET_PROCESS/H5_FILES/ | wc -l
A Python script extracts the data we want from the dataset (title, similar artists, duration, etc.).
Execute h5_to_ascii.sh to run the script that translates the .h5 files into human-readable ASCII files.
$> sh ./tools/h5_to_ascii.sh
Everything will be stored under ./tools/DATASET_PROCESS/ASCII_FILES/.
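For reference, here is a rough sketch of the kind of extraction the script performs on a single file. The HDF5 paths and field names below are assumptions based on the publicly documented Million Song Dataset layout; the repository's script remains the reference.

import h5py

# Pull a few of the fields we care about out of one song file.
# Paths (metadata/songs, analysis/songs, metadata/similar_artists)
# are assumptions based on the public Million Song Dataset layout.
with h5py.File("TRAAAAW128F429D538.h5", "r") as f:
    meta = f["metadata/songs"][0]
    print("title:   ", meta["title"].decode())
    print("artist:  ", meta["artist_name"].decode())
    print("duration:", f["analysis/songs"][0]["duration"], "seconds")
    # Similar artists are stored as a flat list of artist IDs.
    print("similar: ", [a.decode() for a in f["metadata/similar_artists"][:5]])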
Neo4j can import CSV files, but since the first script only outputs ASCII text files, we have to convert them to JSON and then to CSV.
$> sh ./tools/ascii_to_json.sh
Everything will be stored under ./tools/DATASET_PROCESS/JSON_FILES/.
This script also concatenates the JSON files into a single file (./tools/DATASET_PROCESS/JSON_FILES/ALL_DATA_JSON.json) so we can easily convert it to CSV.
To convert the JSON we've produced to CSV, we use an excellent website:
https://codebeautify.org/json-to-csv
Click the "Browse" button, select ./ALL_DATA_JSON.json
and click "Download".
Don't forget to upload the resulting file to your server under the name ALL_DATA_CSV.csv.
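If you'd rather not upload your data to a third-party website, the same conversion can be done locally with a few lines of Python. This sketch assumes ALL_DATA_JSON.json contains a flat list of song objects that all share the same keys; adjust it if your structure differs.

import csv
import json

# Flatten ALL_DATA_JSON.json into ALL_DATA_CSV.csv locally.
with open("ALL_DATA_JSON.json") as src:
    songs = json.load(src)  # assumed: a list of flat dicts with identical keys

with open("ALL_DATA_CSV.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=list(songs[0].keys()))
    writer.writeheader()
    writer.writerows(songs)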
Grab the files we've already compiled for you:
./data/processed/artists_ids.csv
./data/processed/genres.csv
We ran into several problems while importing the CSV data with our first approach. To get a cleaner and leaner import, we create one node per artist ID first, so that each track can later be linked to its similar artists. The same goes for the genres.
Here are the steps to get the artist IDs:
Inside the downloaded song list directory (./MillionSongSubset/ by default), there is a file named ./MillionSongSubset/subset_artist_term.db.
This file is a SQLite database.
We simply opened this database in a SQLite browser and used its "Export" function, selecting only the artist_id column.
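The same export can also be scripted. The sketch below assumes the database contains an artists table with an artist_id column; verify the names with SELECT name FROM sqlite_master; and adjust if yours differ.

import csv
import sqlite3

# Script equivalent of the manual "Export" from the SQLite browser.
# Table and column names are assumptions; verify them first with:
#   SELECT name FROM sqlite_master;
conn = sqlite3.connect("MillionSongSubset/subset_artist_term.db")
rows = conn.execute("SELECT DISTINCT artist_id FROM artists").fetchall()
conn.close()

with open("data/processed/artists_ids.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["artist_id"])  # header expected by the LOAD CSV query below
    writer.writerows(rows)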
You can follow this official tutorial to install Neo4j on your Debian machine.
Inside /etc/neo4j/neo4j.conf:

# Uncomment:
dbms.security.auth_enabled=false
dbms.security.allow_csv_import_from_file_urls=true

# Comment:
#dbms.directories.import=/var/lib/neo4j/import
Restart Neo4j:
$> service neo4j restart
All the queries below are written in Cypher. Cypher is to Neo4j what SQL is to MySQL.
Access the Neo4j browser at the following URL, replacing localhost with your server's IP address if necessary.
http://localhost:7474/browser
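The browser is not the only way to run Cypher. As an aside, here is a minimal sketch using the official neo4j Python driver (pip install neo4j), assuming the default Bolt port 7687 and authentication disabled as configured above.

from neo4j import GraphDatabase

# auth=None works here because we disabled dbms.security.auth_enabled
# in neo4j.conf; otherwise pass auth=("neo4j", "password").
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    record = session.run("MATCH (n) RETURN count(n) AS nodes").single()
    print("nodes in the graph:", record["nodes"])
driver.close()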
Before continuing, we have to raise Neo4j's default display limit of 300 nodes by running this command in the Neo4j console:
:config initialNodeDisplay: 1000
Replace 1000 with whatever number you want, but be careful: a high value may make your browser crash.
Replace /home/user with the absolute path where you've cloned this git repository.
LOAD CSV WITH HEADERS FROM "file:/home/user/Neo4j-Example-Spotifylike/data/processed/artists_ids.csv" AS csvLine
CREATE (a:Artist { artist_id: csvLine.artist_id })
Replace /home/user with the absolute path where you've cloned this git repository.
LOAD CSV WITH HEADERS FROM "file:/home/user/Neo4j-Example-Spotifylike/data/processed/genres.csv" AS csvLine
CREATE (g:Genre { name: csvLine.mbtag })
For this, we will use the ./data/processed/artist_genre.csv file.
Replace /home/user with the absolute path where you've cloned this git repository.
LOAD CSV WITH HEADERS FROM "file:/home/user/Neo4j-Example-Spotifylike/data/processed/artist_genre.csv" AS csvLine
MATCH (a:Artist {artist_id: csvLine.artist_id}), (g:Genre {name: csvLine.mbtag})
MERGE (a)-[:HAS_GENRE]->(g)
It is not just about importing the song list: each track has a "similar_artists" property, which is really heavy and would overload our server for no reason.
To avoid this, we reuse the Artist nodes and add a relation between each artist and their music: (artist)-[:OWNS]->(music).
LOAD CSV WITH HEADERS FROM "file:/ALL_DATA_CSV.csv" AS csvLine
// Create the music node.
MERGE (m:Music {title: csvLine.title, duration: csvLine.duration})
WITH m, csvLine
MATCH (a:Artist {artist_id: csvLine.artist_id})
MERGE (a)-[:OWNS]->(m)
SET a.name = csvLine.artist_name
MERGE (y:Year {year: csvLine.year})
MERGE (m)-[:RELEASED_IN]->(y)
MERGE (al:Album {name: csvLine.album})
MERGE (m)-[:IN]->(al)
MERGE (a)-[:CREATED]->(al)
WITH a, m, csvLine
UNWIND split(csvLine.all_terms, ',') AS genre_instance
MATCH (g:Genre {name: genre_instance})
MERGE (m)-[:HAS_GENRE]->(g)
WITH a, m, csvLine
UNWIND split(csvLine.similar_artists, ',') AS similar_id
// "sim" rather than "as": AS is a reserved keyword in Cypher.
MATCH (sim:Artist {artist_id: similar_id})
MERGE (a)-[:SIMILAR_TO]->(sim)
RETURN count(*)
// Append LIMIT 5 to the RETURN if your computer is not very powerful.
ℹ️ You might experience problems while importing a large quantity of data.
Prepend the following clause to the previous query to make it work. It commits the data every 50 entities processed.
USING PERIODIC COMMIT 50
ℹ️ You might run out of memory while importing the data. In /etc/neo4j/neo4j.conf, uncomment and modify the following line.
dbms.memory.heap.max_size=1024m