Example code for chapter six, Clojure for Data Science.
This chapter makes use of the Reuters-21578 text categorization test collection. See here for more information.
The dataset can be downloaded directly from here.
Run the following command-line script to download the data to the project's data directory:
# Downloads and unzips the data files into this project's data directory.
script/download-data.sh
- Download the .tar.gz file linked above.
- Expand the contents of the file to a directory called data/reuters-sgml inside this project's directory
After following these steps there ought to be many .sgm files inside the data/reuters-sgml directory.
Examples can be run with:
# Replace 6.1 with the example you want to run:
lein run -e 6.1
or open an interactive REPL with:
lein repl
The output of some examples are a prerequisite for subsequent examples. These are aliased as the following, which must be run in order:
lein extract-reuters
lein create-sequencefile
lein create-vectors
The output of lein create-vectors
is a file that can be used for clustering in Mahout.
Copyright © 2015 Henry Garner
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.