commit 645238c
Showing 14 changed files with 597 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
/target | ||
/*.iml | ||
/.* | ||
!/.gitignore |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Change Log | ||
|
||
## 0.1.0 - 2018-03-16 | ||
### Added | ||
- Conversion of MSD files into H2 | ||
- README with usage instructions |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
Copyright (c) 2018 Nicolas Duchenne, Belove Ltd, London, UK | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy of this software | ||
and associated documentation files (the "Software"), to deal in the Software without restriction, | ||
including without limitation the rights to use, copy, modify, merge, publish, distribute, | ||
sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all copies or | ||
substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING | ||
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | ||
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, | ||
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
# MSD Lyrics SQL database | ||
|
||
A command line tool to load the lyrics subset of the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/) | ||
into an [H2 SQL database](http://www.h2database.com/html/main.html). | ||
|
||
A SQL database makes it easy to inspect, clean, aggregate, filter and slice the dataset, via a GUI or programmatically. | ||
|
||
|
||
## Installation | ||
|
||
|
||
Install [Java](http://java.com/en/download/), download the jar file from the release page of this repository and follow the instructions below. | ||
|
||
Alternatively, you can clone this repository and run the code with [Leiningen](http://leiningen.org/), the build automation tool | ||
for [Clojure](http://clojure.org). Start by editing the `msd.edn` file in the project root (see further below) and then execute | ||
the code with `lein run`. | ||
|
||
You don't need to install H2 on your machine to run the program. The database engine is embedded in the program. | ||
|
||
However, you'll need an H2-compatible tool to view the data. One option is to [install H2](http://www.h2database.com/html/download.html) | ||
and use the included [console application](http://www.h2database.com/html/quickstart.html). | ||
Another is to use an H2-compatible front-end such as [DataGrip](https://www.jetbrains.com/datagrip/). | ||
|
||
|
||
|
||
|
||
## Usage | ||
|
||
Gather the following files from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/) and place them in | ||
the same directory (while you're browsing the websites, check the licensing/citing terms for the various subsets): | ||
|
||
- from https://labrosa.ee.columbia.edu/millionsong/musixmatch: | ||
- mxm_779k_matches.txt | ||
- mxm_dataset_test.txt | ||
- mxm_dataset_train.txt | ||
- mxm_reverse_mapping.txt | ||
- from https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset: | ||
- tracks_per_year.txt | ||
- from http://www.ifs.tuwien.ac.at/mir/msd/download.html#groundtruth: | ||
- msd-MASD-styleAssignment.cls | ||
|
||
Place the jar file into the same directory and run: | ||
|
||
$ java -jar msd-to-h2-0.1.0-standalone.jar | ||
|
||
Give it a few minutes to create the output files: | ||
|
||
Creating the csv files... | ||
Creating the database. Just a bit of patience. | ||
Creating primary tables... | ||
Creating indexes... | ||
Creating derived tables... | ||
All done! | ||
The 3 artists with the largest vocabulary in the Million Song Dataset are Aesop Rock with 2555 words, Eminem with 2526 words, Cypress Hill with 2476 words | ||
|
||
|
||
|
||
|
||
## Outputs | ||
|
||
The program runs in two stages. | ||
|
||
Firstly, the program converts the original MSD files into CSV files. Words and tracks are given new unique integer ids, | ||
and files that relate to each other are consolidated (e.g. tracks + track years + track genres). | ||
|
||
Secondly, the program uploads the resulting CSV files into a new H2 database. Tables are created for tracks, words and the track/word matrix. | ||
The tool also creates a table of artists (based on the musiXmatch artist names in the dataset) with aggregate track count, vocabulary count, | ||
and year range for each artist. This list of artists is preliminary and is meant to help prioritize data cleaning, rather than being used as is. | ||
|
||
|
||
|
||
|
||
## Options | ||
|
||
Rather than having all input and output files in the same directory, it is possible to specify different locations for input files, CSV output files | ||
and the database. To do this, create an [edn file](https://learnxinyminutes.com/docs/edn/) called `msd.edn` with the following keys: | ||
|
||
- `:in` directory of the MSD input files | ||
- `:csv` directory of the csv output files | ||
- `:db` output database file according to | ||
[H2's URL format for an embedded database](http://www.h2database.com/html/features.html#database_url), without the `jdbc:h2:` prefix. | ||
The code was tested for a location relative to the `msd.edn` file (`./`) | ||
or to the user's home directory (`~/`). | ||
|
||
For example: | ||
|
||
{:in "/data/msd/source/" | ||
:csv "/data/msd/csv/" | ||
:db "~/h2data/msd"} | ||
|
||
You can then run `java -jar msd-to-h2-0.1.0-standalone.jar` at the location of the `msd.edn` file. | ||
|
||
If there is no `msd.edn` file (as in the Usage section above), the program defaults to the following parameters: | ||
|
||
{:in "./" | ||
:csv "./" | ||
:db "./msd"} | ||
|
||
|
||
|
||
|
||
## Other SQL Engines | ||
|
||
Amending the code to accommodate other SQL engines should be straightforward. | ||
You'll have to: | ||
|
||
- [ ] change the database driver dependency in `project.clj` | ||
- [ ] adapt the database spec in `src/intoh2/core.clj` (look for `db-spec` and `create-database`) | ||
- [ ] adapt the scripts in `resources/sql` to the SQL dialect of the new database | ||
|
||
|
||
|
||
|
||
## License | ||
|
||
Copyright © 2018 Nicolas Duchenne, Belove Ltd, London, UK | ||
|
||
Released under the [MIT License](https://opensource.org/licenses/MIT). |
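
The first stage described in the README's Outputs section (giving tracks and words integer ids and flattening the track/word counts) might look roughly like the Python sketch below. It assumes the published layout of the musiXmatch `mxm_dataset_*.txt` files: comment lines starting with `#`, one `%`-prefixed line listing the words, then one `track_id,mxm_id,index:count,...` line per track with 1-based word indexes. The function name `parse_mxm`, the toy rows and the id-numbering scheme are hypothetical and are not taken from the tool's Clojure source, which this commit view does not show.

```python
from typing import Dict, Iterable, List, Tuple


def parse_mxm(lines: Iterable[str]) -> Tuple[List[str], Dict[str, int], List[Tuple[int, int, int]]]:
    """Parse musiXmatch bag-of-words lines into words, track ids and matrix rows.

    Assumed layout (not confirmed by this repository):
      '#...'                      comment
      '%word1,word2,...'          vocabulary, word index is 1-based
      'TRACK,MXMID,idx:cnt,...'   one line per track
    """
    words: List[str] = []
    track_ids: Dict[str, int] = {}           # original track id -> new integer id
    matrix: List[Tuple[int, int, int]] = []  # (trackid, wordid, count)

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("%"):
            words = line[1:].split(",")
            continue
        track, _mxm_id, *pairs = line.split(",")
        # Hypothetical numbering: ids are handed out in order of first appearance.
        trackid = track_ids.setdefault(track, len(track_ids) + 1)
        for pair in pairs:
            wordid, count = pair.split(":")
            matrix.append((trackid, int(wordid), int(count)))
    return words, track_ids, matrix


sample = [
    "# toy data, not real dataset content",
    "%i,the,you",
    "TRACK_A,111,1:6,3:2",
    "TRACK_B,222,2:5",
]
words, track_ids, matrix = parse_mxm(sample)
print(words)      # ['i', 'the', 'you']
print(track_ids)  # {'TRACK_A': 1, 'TRACK_B': 2}
print(matrix)     # [(1, 1, 6), (1, 3, 2), (2, 2, 5)]
```

Each tuple in `matrix` corresponds to one row of the `matrix` table (`trackid`, `wordid`, `count`) that the second stage loads into H2.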
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{:in "/data/millionsong/dataset/extract" | ||
:csv "/data/millionsong/dataset/csv" | ||
:db "~/data/millionsong/dataset/h2/msd;AUTO_SERVER=TRUE;AUTO_SERVER_PORT=9091"} |
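
The `:db` value above is an H2 database URL with the `jdbc:h2:` prefix left out, followed by `;KEY=VALUE` settings. The Python sketch below shows how such a string breaks down and how the full URL would be rebuilt. The helper names `jdbc_url` and `split_db` are hypothetical; the tool itself does this in Clojure, and its exact handling is not shown in this commit view.

```python
from typing import Dict, Tuple

# Default :db value quoted from the README, used when msd.edn is absent.
DEFAULT_DB = "./msd"


def jdbc_url(db: str) -> str:
    """Restore the prefix that the :db value leaves out."""
    return "jdbc:h2:" + db


def split_db(db: str) -> Tuple[str, Dict[str, str]]:
    """Split 'path;KEY=VALUE;...' into the file path and its settings."""
    path, *settings = db.split(";")
    return path, dict(s.split("=", 1) for s in settings)


db = "~/data/millionsong/dataset/h2/msd;AUTO_SERVER=TRUE;AUTO_SERVER_PORT=9091"
print(jdbc_url(DEFAULT_DB))  # jdbc:h2:./msd
print(jdbc_url(db))
path, settings = split_db(db)
print(path)      # ~/data/millionsong/dataset/h2/msd
print(settings)  # {'AUTO_SERVER': 'TRUE', 'AUTO_SERVER_PORT': '9091'}
```

`AUTO_SERVER=TRUE` is H2's automatic mixed mode, which should let a second process (for example a database console) connect while the file is already open.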
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
(defproject msd-to-h2 "0.1.0" | ||
:description "Converts lyrics files in the Million Song Dataset into an H2 database." | ||
:url "https://github.com/belovehq/msd-to-h2" | ||
:license {:name "MIT License" | ||
:url "https://opensource.org/licenses/MIT"} | ||
:dependencies [[org.clojure/clojure "1.9.0"] | ||
[com.layerware/hugsql "0.4.8"] | ||
[com.h2database/h2 "1.4.195"]] | ||
|
||
:source-paths ["src"] | ||
:target-path "target/%s" | ||
:resource-paths ["resources"] | ||
|
||
:main ^:skip-aot intoh2.core | ||
|
||
:profiles {:uberjar {:aot :all}}) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
-- Create and populate artists table | ||
|
||
CREATE TABLE msdartists ( | ||
artistid INT PRIMARY KEY AUTO_INCREMENT, | ||
mxmartistname VARCHAR(55), | ||
trackcount INT, | ||
vocabulary INT, | ||
fromyear INT, | ||
toyear INT | ||
); | ||
|
||
|
||
INSERT INTO msdartists (mxmartistname, trackcount, vocabulary, fromyear, toyear) | ||
SELECT | ||
t.mxmartistname AS mxmartistname, | ||
COUNT(DISTINCT t.trackid) AS trackcount, | ||
COUNT(DISTINCT m.wordid) AS vocabulary, | ||
MIN(t.trackyear) AS fromyear, | ||
MAX(t.trackyear) AS toyear | ||
FROM msdtracks t | ||
INNER JOIN matrix m ON m.trackid = t.trackid | ||
GROUP BY mxmartistname | ||
ORDER BY mxmartistname; | ||
|
||
CREATE INDEX IX_msdartists_artistname | ||
ON msdartists (mxmartistname); | ||
|
||
-- Create and populate genres table | ||
|
||
CREATE TABLE msdgenres ( | ||
genreid INT AUTO_INCREMENT PRIMARY KEY, | ||
masdgenre VARCHAR(20) | ||
); | ||
|
||
INSERT INTO msdgenres (masdgenre) | ||
SELECT DISTINCT masdgenre | ||
FROM msdtracks | ||
WHERE masdgenre IS NOT NULL | ||
ORDER BY masdgenre; |
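
The `COUNT(DISTINCT ...)` calls in the artists query matter because joining `msdtracks` to `matrix` repeats every track once per word it contains. The sketch below replays the same aggregation on invented toy rows. It uses Python's built-in SQLite rather than H2, purely so that it runs without extra dependencies, and the tables are reduced to the columns the query touches.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE msdtracks (trackid INT, mxmartistname TEXT, trackyear INT);
CREATE TABLE matrix (trackid INT, wordid INT, count INT);
-- Toy rows, invented for illustration.
INSERT INTO msdtracks VALUES (1, 'Artist A', 1999), (2, 'Artist A', 2003), (3, 'Artist B', 2010);
INSERT INTO matrix VALUES (1, 10, 2), (1, 11, 1), (2, 10, 4), (2, 12, 1), (3, 10, 1);
""")

rows = con.execute("""
SELECT t.mxmartistname,
       COUNT(t.trackid)          AS joined_rows,  -- inflated by the join
       COUNT(DISTINCT t.trackid) AS trackcount,
       COUNT(DISTINCT m.wordid)  AS vocabulary,
       MIN(t.trackyear)          AS fromyear,
       MAX(t.trackyear)          AS toyear
FROM msdtracks t
INNER JOIN matrix m ON m.trackid = t.trackid
GROUP BY t.mxmartistname
ORDER BY t.mxmartistname
""").fetchall()

for row in rows:
    print(row)
# ('Artist A', 4, 2, 3, 1999, 2003)
# ('Artist B', 1, 1, 1, 2010, 2010)
```

For 'Artist A' the join yields four rows, yet there are only two distinct tracks and three distinct words, which is what `trackcount` and `vocabulary` are meant to record.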
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
-- :name hello :query :one | ||
SELECT CONCAT('The 3 artists with the largest vocabulary in the Million Song Dataset are ', | ||
GROUP_CONCAT(s SEPARATOR ', ')) out | ||
FROM | ||
(SELECT TOP 3 CONCAT(mxmartistname, ' with ', vocabulary, ' words') s | ||
FROM msdartists | ||
ORDER BY vocabulary DESC); |
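
The `hello` query builds the closing line that the README shows in its sample run. As a plain-Python illustration of the same logic (sort by vocabulary, keep the top three, join the formatted parts), here is a minimal sketch. The first three rows repeat the figures from the README's sample output; the fourth row and the function signature are made up for the example.

```python
from typing import List, Tuple


def hello(artists: List[Tuple[str, int]], top: int = 3) -> str:
    """Format the `top` artists by vocabulary size, largest first."""
    ranked = sorted(artists, key=lambda a: a[1], reverse=True)[:top]
    parts = [f"{name} with {vocabulary} words" for name, vocabulary in ranked]
    return (f"The {top} artists with the largest vocabulary in the "
            "Million Song Dataset are " + ", ".join(parts))


artists = [
    ("Eminem", 2526),
    ("Hypothetical Artist", 100),  # invented row, dropped by the top-3 cut
    ("Aesop Rock", 2555),
    ("Cypress Hill", 2476),
]
message = hello(artists)
print(message)
```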
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
-- indexes on msdtracks | ||
CREATE PRIMARY KEY ON msdtracks (trackid); | ||
CREATE UNIQUE INDEX ix_msdtracks_entrackid ON msdtracks (entrackid); | ||
CREATE INDEX ix_msdtracks_masdgenre ON msdtracks (masdgenre); | ||
CREATE INDEX ix_msdtracks_mxmartistname ON msdtracks (mxmartistname); | ||
|
||
-- index on msdwords | ||
CREATE PRIMARY KEY ON msdwords (wordid); | ||
|
||
-- indexes on matrix | ||
CREATE PRIMARY KEY ON matrix (trackid, wordid); | ||
CREATE INDEX ix_matrix_wordid_trackid ON matrix (wordid, trackid); |
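
The `matrix` table gets two access paths: the primary key on `(trackid, wordid)` serves "which words does this track use?" and the extra index on `(wordid, trackid)` serves "which tracks use this word?". The sketch below illustrates the idea with Python's built-in SQLite and its `EXPLAIN QUERY PLAN` statement. H2's planner and plan output differ, so treat this only as a demonstration of why the reversed index is useful, not as a description of H2's behaviour.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE matrix (
  trackid INT NOT NULL,
  wordid  INT NOT NULL,
  count   INT NOT NULL,
  PRIMARY KEY (trackid, wordid)
);
CREATE INDEX ix_matrix_wordid_trackid ON matrix (wordid, trackid);
""")


def plan(sql: str) -> str:
    """Return SQLite's query-plan text for a statement."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(str(row[-1]) for row in rows)


by_track = plan("SELECT wordid FROM matrix WHERE trackid = 1")
by_word = plan("SELECT trackid FROM matrix WHERE wordid = 1")
print(by_track)  # searches via the primary-key index
print(by_word)   # searches via ix_matrix_wordid_trackid
```

Without the second index, the lookup by `wordid` would have to scan the table or the whole primary-key index, since `wordid` is not the leading key column.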
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
-- Command definitions for HugSQL. | ||
|
||
-- :name create-tracks-table :execute :raw | ||
CREATE TABLE msdtracks ( | ||
trackid INT NOT NULL, | ||
entrackid VARCHAR(18), | ||
mxmtrackid INT, | ||
istest INT, | ||
entrackttitle VARCHAR(250), | ||
mxmtracktitle VARCHAR(180), | ||
enartistname VARCHAR(100), | ||
mxmartistname VARCHAR(55), | ||
trackyear INT, | ||
masdgenre VARCHAR(20) | ||
) AS | ||
SELECT * | ||
FROM CSVREAD(:sql:file); | ||
|
||
-- :name create-words-table :execute :raw | ||
CREATE TABLE msdwords ( | ||
wordid INT NOT NULL, | ||
stem VARCHAR(15), | ||
word VARCHAR(15) | ||
) AS | ||
SELECT * | ||
FROM CSVREAD(:sql:file); | ||
|
||
-- :name create-matrix-table :execute :raw | ||
CREATE TABLE matrix ( | ||
trackid INT NOT NULL, | ||
wordid INT NOT NULL, | ||
count INT NOT NULL | ||
) AS | ||
SELECT * | ||
FROM CSVREAD(:sql:file); | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
-- re-enable database logs | ||
SET LOG 1; | ||
SET UNDO_LOG 1; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
-- momentarily disable database logs | ||
SET LOG 0; | ||
SET UNDO_LOG 0; |