v0.1.0
duchenne committed Mar 16, 2018
0 parents commit 645238c
Showing 14 changed files with 597 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
/target
/*.iml
/.*
!/.gitignore
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,6 @@
# Change Log

## 0.1.0 - 2018-03-16
### Added
- Conversion of MSD files into H2
- README with usage instructions
16 changes: 16 additions & 0 deletions LICENSE
@@ -0,0 +1,16 @@
Copyright (c) 2018 Nicolas Duchenne, Belove Ltd, London, UK

Permission is hereby granted, free of charge, to any person obtaining a copy of this software
and associated documentation files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge, publish, distribute,
sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
119 changes: 119 additions & 0 deletions README.md
@@ -0,0 +1,119 @@
# MSD Lyrics SQL database

A command line tool to load the lyrics subset of the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/)
into an [H2 SQL database](http://www.h2database.com/html/main.html).

A SQL database makes it easy to inspect, clean, aggregate, filter and slice the dataset, via a GUI or programmatically.


## Installation


Install [Java](http://java.com/en/download/), download the jar file from the release page of this repository and follow the instructions below.

Alternatively, you can clone this repository and run the code with [Leiningen](http://leiningen.org/), the build automation tool
for [Clojure](http://clojure.org). Start by editing the `msd.edn` file in the project root (see further below) and then execute
the code with `lein run`.

You don't need to install H2 on your machine to run the program. The database engine is embedded in the program.

However, you'll need an H2-compatible tool to view the data. One option is to [install H2](http://www.h2database.com/html/download.html)
and use the included [console application](http://www.h2database.com/html/quickstart.html).
Another is to use an H2-compatible front-end such as [DataGrip](https://www.jetbrains.com/datagrip/).




## Usage

Gather the following files from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/) and place them in
the same directory (while you're browsing the websites, check the licensing and citation terms for the various subsets):

- from https://labrosa.ee.columbia.edu/millionsong/musixmatch:
    - mxm_779k_matches.txt
    - mxm_dataset_test.txt
    - mxm_dataset_train.txt
    - mxm_reverse_mapping.txt
- from https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset:
    - tracks_per_year.txt
- from http://www.ifs.tuwien.ac.at/mir/msd/download.html#groundtruth:
    - msd-MASD-styleAssignment.cls

Place the jar file into the same directory and run:

    $ java -jar msd-to-h2-0.1.0-standalone.jar

Give it a few minutes to create the output files:

    Creating the csv files...
    Creating the database. Just a bit of patience.
    Creating primary tables...
    Creating indexes...
    Creating derived tables...
    All done!
    The 3 artists with the largest vocabulary in the Million Song Dataset are Aesop Rock with 2555 words, Eminem with 2526 words, Cypress Hill with 2476 words




## Outputs

The program runs in two stages.

Firstly, the program converts the original MSD files into CSV files. Words and tracks are given new unique integer ids,
and files that relate to each other are consolidated (e.g. tracks + track years + track genres).
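
The id assignment in this first stage can be sketched roughly as follows. This is a minimal Python illustration, not the tool's Clojure code: the function name `to_matrix_rows` is hypothetical, and the input layout (one `%` line listing the words, then `track_id,mxm_track_id,word_index:count,...` lines with 1-based word indexes) is an assumption about the musiXmatch files that should be checked against the dataset's own documentation.

```python
def to_matrix_rows(lines):
    """Turn musiXmatch-style lines into (trackid, wordid, count) rows.

    Assumed layout (not taken from the tool's source): '#' lines are
    comments, one '%' line lists the words, and every other line reads
    'track_id,mxm_track_id,word_index:count,...' with 1-based indexes.
    """
    words = []      # position in this list + 1 is the integer word id
    track_ids = {}  # original track id -> new integer track id
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("%"):
            words = line[1:].split(",")
            continue
        track, _mxm_id, *counts = line.split(",")
        # The first time a track is seen it gets the next free integer id.
        trackid = track_ids.setdefault(track, len(track_ids) + 1)
        for item in counts:
            wordid, count = item.split(":")
            rows.append((trackid, int(wordid), int(count)))
    return words, track_ids, rows


# Toy input, not real dataset content.
sample = [
    "# comment line",
    "%i,the,you",
    "TRACK_A,1001,1:6,3:2",
    "TRACK_B,1002,2:4",
]
words, track_ids, rows = to_matrix_rows(sample)
print(rows)
```

The resulting triples have the same shape as the rows of the `matrix` table (`trackid`, `wordid`, `count`).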

Secondly, the program loads the resulting CSV files into a new H2 database. Tables are created for tracks, words and the track/word matrix.
The tool also creates a table of artists (based on the musiXmatch artist names in the dataset) with aggregate track count, vocabulary count,
and year range for each artist. This list of artists is preliminary and is meant to help prioritize data cleaning, rather than to be used as is.
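
To make the per-artist aggregates concrete, here is a small sketch on toy data. It is an illustration under assumptions, not the tool's own pipeline: it uses Python's built-in `sqlite3` module in place of H2 so that it runs without extra dependencies, the rows are invented, and only the columns needed for the aggregation are created. The `SELECT` mirrors the one in `resources/sql/derivedtables.sql`.

```python
import sqlite3

# In-memory SQLite database standing in for H2 (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE msdtracks (trackid INT, mxmartistname TEXT, trackyear INT);
    CREATE TABLE matrix (trackid INT, wordid INT, "count" INT);

    -- Invented rows: two tracks by 'Artist A', one by 'Artist B'.
    INSERT INTO msdtracks VALUES
        (1, 'Artist A', 1999), (2, 'Artist A', 2004), (3, 'Artist B', 2010);
    INSERT INTO matrix VALUES
        (1, 10, 2), (1, 11, 1), (2, 11, 5), (2, 12, 1), (3, 10, 3);
""")

# One row per artist: number of tracks, number of distinct words used
# across all of the artist's tracks, and first/last track year.
artists = conn.execute("""
    SELECT t.mxmartistname,
           COUNT(DISTINCT t.trackid) AS trackcount,
           COUNT(DISTINCT m.wordid)  AS vocabulary,
           MIN(t.trackyear)          AS fromyear,
           MAX(t.trackyear)          AS toyear
    FROM msdtracks t
    INNER JOIN matrix m ON m.trackid = t.trackid
    GROUP BY t.mxmartistname
    ORDER BY vocabulary DESC
""").fetchall()

for row in artists:
    print(row)
```

Note that `vocabulary` counts distinct word ids, so a word that appears in several tracks by the same artist is counted once.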




## Options

Rather than having all input and output files in the same directory, it is possible to specify different locations for the input files, the CSV output files
and the database. To do this, create an [edn file](https://learnxinyminutes.com/docs/edn/) called `msd.edn` with the following keys:

- `:in` directory of the MSD input files
- `:csv` directory of the csv output files
- `:db` output database file according to
  [H2's URL format for an embedded database](http://www.h2database.com/html/features.html#database_url), without the `jdbc:h2:` prefix.
  The code was tested for a location relative to the `msd.edn` file (`./`)
  or to the user's home directory (`~/`).

For example:

    {:in "/data/msd/source/"
     :csv "/data/msd/csv/"
     :db "~/h2data/msd"}

You can then run `java -jar msd-to-h2-0.1.0-standalone.jar` at the location of the `msd.edn` file.

If there is no `msd.edn` file (as in the Usage section above), the program defaults to the following parameters:

    {:in "./"
     :csv "./"
     :db "./msd"}
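
As a rough illustration of how these settings fit together, the Python sketch below applies the defaults and prepends the `jdbc:h2:` prefix to the `:db` value. It is not the tool's Clojure code: the function name `resolve_settings` and the `jdbc_url` key are hypothetical, and falling back to the default for a key that is missing from `msd.edn` is an assumption, since only the "no file at all" case is documented here.

```python
# Defaults documented above, with the EDN keywords written as plain strings.
DEFAULTS = {"in": "./", "csv": "./", "db": "./msd"}


def resolve_settings(overrides=None):
    """Return the settings to use, plus the JDBC URL derived from 'db'."""
    # Assumption: keys missing from the overrides fall back to the defaults.
    settings = {**DEFAULTS, **(overrides or {})}
    # The 'db' value is an H2 database URL without its 'jdbc:h2:' prefix.
    settings["jdbc_url"] = "jdbc:h2:" + settings["db"]
    return settings


print(resolve_settings()["jdbc_url"])
print(resolve_settings({"db": "~/h2data/msd"})["jdbc_url"])
```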




## Other SQL Engines

Amending the code to accommodate other SQL engines should be straightforward.
You'll have to:

- [ ] change the database driver dependency in `project.clj`
- [ ] adapt the database spec in `src/intoh2/core.clj` (look for `db-spec` and `create-database`)
- [ ] adapt the scripts in `resources/sql` to the SQL dialect of the new database




## License

Copyright © 2018 Nicolas Duchenne, Belove Ltd, London, UK

Released under the [MIT License](https://opensource.org/licenses/MIT).
3 changes: 3 additions & 0 deletions msd.edn
@@ -0,0 +1,3 @@
{:in "/data/millionsong/dataset/extract"
 :csv "/data/millionsong/dataset/csv"
 :db "~/data/millionsong/dataset/h2/msd;AUTO_SERVER=TRUE;AUTO_SERVER_PORT=9091"}
18 changes: 18 additions & 0 deletions project.clj
@@ -0,0 +1,18 @@
(defproject msd-to-h2 "0.1.0"
  :description "Converts lyrics files in the Million Song Dataset into an H2 database."
  :url "https://github.com/belovehq/msd-to-h2"
  :license {:name "MIT License"
            :url "https://opensource.org/licenses/MIT"}
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [com.layerware/hugsql "0.4.8"]
                 [com.h2database/h2 "1.4.195"]]

  :source-paths ["src"]
  :target-path "target/%s"
  :resource-paths ["resources"]

  :main ^:skip-aot intoh2.core

  :profiles {:uberjar {:aot :all}})


39 changes: 39 additions & 0 deletions resources/sql/derivedtables.sql
@@ -0,0 +1,39 @@
-- Create and populate artists table

CREATE TABLE msdartists (
  artistid      INT PRIMARY KEY AUTO_INCREMENT,
  mxmartistname VARCHAR(55),
  trackcount    INT,
  vocabulary    INT,
  fromyear      INT,
  toyear        INT
);


INSERT INTO msdartists (mxmartistname, trackcount, vocabulary, fromyear, toyear)
  SELECT
    t.mxmartistname           AS mxmartistname,
    COUNT(DISTINCT t.trackid) AS trackcount,
    COUNT(DISTINCT m.wordid)  AS vocabulary,
    MIN(t.trackyear)          AS fromyear,
    MAX(t.trackyear)          AS toyear
  FROM msdtracks t
    INNER JOIN matrix m ON m.trackid = t.trackid
  GROUP BY mxmartistname
  ORDER BY mxmartistname;

CREATE INDEX IX_msdartists_artistname
  ON msdartists (mxmartistname);

-- Create and populate genres table

CREATE TABLE msdgenres (
  genreid   INT AUTO_INCREMENT PRIMARY KEY,
  masdgenre VARCHAR(20)
);

INSERT INTO msdgenres (masdgenre)
  SELECT DISTINCT masdgenre
  FROM msdtracks
  WHERE masdgenre IS NOT NULL
  ORDER BY masdgenre;
7 changes: 7 additions & 0 deletions resources/sql/hello.sql
@@ -0,0 +1,7 @@
-- :name hello :query :one
SELECT CONCAT('The 3 artists with the largest vocabulary in the Million Song Dataset are ',
              GROUP_CONCAT(s SEPARATOR ', ')) out
FROM
  (SELECT TOP 3 CONCAT(mxmartistname, ' with ', vocabulary, ' words') s
   FROM msdartists
   ORDER BY vocabulary DESC);
12 changes: 12 additions & 0 deletions resources/sql/indexes.sql
@@ -0,0 +1,12 @@
-- indexes on msdtracks
CREATE PRIMARY KEY ON msdtracks (trackid);
CREATE UNIQUE INDEX ix_msdtracks_entrackid ON msdtracks (entrackid);
CREATE INDEX ix_msdtracks_masdgenre ON msdtracks (masdgenre);
CREATE INDEX ix_msdtracks_mxmartistname ON msdtracks (mxmartistname);

-- index on msdwords
CREATE PRIMARY KEY ON msdwords (wordid);

-- indexes on matrix
CREATE PRIMARY KEY ON matrix (trackid, wordid);
CREATE INDEX ix_matrix_wordid_trackid ON matrix (wordid, trackid);
37 changes: 37 additions & 0 deletions resources/sql/primarytables.sql
@@ -0,0 +1,37 @@
-- Command definitions for HugSQL.

-- :name create-tracks-table :execute :raw
CREATE TABLE msdtracks (
  trackid       INT NOT NULL,
  entrackid     VARCHAR(18),
  mxmtrackid    INT,
  istest        INT,
  entrackttitle VARCHAR(250),
  mxmtracktitle VARCHAR(180),
  enartistname  VARCHAR(100),
  mxmartistname VARCHAR(55),
  trackyear     INT,
  masdgenre     VARCHAR(20)
) AS
  SELECT *
  FROM CSVREAD(:sql:file);

-- :name create-words-table :execute :raw
CREATE TABLE msdwords (
  wordid INT NOT NULL,
  stem   VARCHAR(15),
  word   VARCHAR(15)
) AS
  SELECT *
  FROM CSVREAD(:sql:file);

-- :name create-matrix-table :execute :raw
CREATE TABLE matrix (
  trackid INT NOT NULL,
  wordid  INT NOT NULL,
  count   INT NOT NULL
) AS
  SELECT *
  FROM CSVREAD(:sql:file);


3 changes: 3 additions & 0 deletions resources/sql/safety.sql
@@ -0,0 +1,3 @@
-- re-enable database logs
SET LOG 1;
SET UNDO_LOG 1;
3 changes: 3 additions & 0 deletions resources/sql/speed.sql
@@ -0,0 +1,3 @@
-- momentarily disable database logs
SET LOG 0;
SET UNDO_LOG 0;