A collection of C/C++ code implementing the algorithms from the paper "Dimensionality Reduction for Documents with Nearest Neighbor Queries" by Ingram and Munzner, 2014.
This includes the All-Pairs Query, or APQ, nearest-neighbor algorithm and the Q-SNE, dimensionality reduction algorithm.
The code requires a C/C++ compiler. I have successfully used both clang and gcc.
You will first need to have GLib installed on your system.
For Mac (with homebrew installed):
brew install glib
For Ubuntu Linux:
sudo apt-get install libglib2.0-dev
For other Linux build from source.
For building the APQ algorithm just type
make
which will make the executable file testapq.
For building BH-SNE with arbitrary input (use input from apq for Q-SNE) type
make tsne
As a fun bonus, there are some other executables you can build if you dig around in the makefile.
APQ processes document term-vectors into nearest-neighbor files (both in vec
format).
QSNE processes nearest-neighbor files into coordinate files (in csv
file).
Term vectors represent documents as a vector of term counts in a term vectors space. Each dimension represents a different term and the value represents the TFIDF weight of that term. Because most terms don't appear in a given document, APQ expects sparse vectors as input.
APQ computes the approximate k most similar documents for each document in the input as well as the approximate cosine distance between the specified document pairs.
Both term term vectors and nearest neighbors are stored in vec
format. Each line describes a document.
(index1,value1) (index2,value2)
For term-vectors, indices represent term numbers and values. For nearest neighbors, indices represent document numbers.
(one day, maybe)
for help running the APQ algorithm
testapq -?
for help running BH-SNE (and Q-SNE)
tsne -?
(one day, maybe)
(one day, maybe)