AtSNE is a solution to the high-dimensional data visualization problem. It projects large-scale, high-dimensional vectors into a low-dimensional space while preserving pairwise similarities among points. AtSNE is efficient and scalable: it can visualize 20M points in less than 5 hours using a GPU, and the spatial structure of its results is robust to random initialization. It implements the algorithm of our KDD'19 paper - AtSNE: Efficient and Robust Visualization on GPU through Hierarchical Optimization
Dataset | Dimensions | Number of Points | Number of Categories | Data | Label |
---|---|---|---|---|---|
CIFAR10 | 1024 | 60,000 | 10 | .txt .fvecs | .txt .fvecs |
CIFAR100 | 1024 | 60,000 | 100 | .txt .fvecs | .txt .fvecs |
MNIST | 784 | 70,000 | 10 | .txt .fvecs | .txt .fvecs |
Fashion-MNIST | 784 | 70,000 | 10 | .txt .fvecs | .txt .fvecs |
AG’s News | 100 | 120,000 | 4 | .txt .fvecs | .txt .fvecs |
DBPedia | 100 | 560,000 | 14 | .txt .fvecs | .txt .fvecs |
ImageNet | 128 | 1,281,167 | 1000 | .txt .fvecs | .txt .fvecs |
Yahoo | 100 | 1,400,000 | 10 | .txt .fvecs | .txt .fvecs |
Crawl | 300 | 2,000,000 | 10 | .txt .fvecs | .txt .fvecs |
Amazon3M | 100 | 3,000,000 | 5 | .txt .fvecs | .txt .fvecs |
Amazon20M | 96 | 19,531,329 | 5 | .txt .fvecs | .txt .fvecs |
- Details of dataset pre-processing are provided in our paper
Compared Algorithms:
Dataset | Method | 10-NN accuracy | Time | Memory (GB) | Speedup |
---|---|---|---|---|---|
CIFAR10 | BH-t-SNE | 0.966 | 5m12s | 2.61 | 1.6 |
 | LargeVis | 0.965 | 8m23s | 7.90 | 1.0 |
 | tsne-cuda | 0.963 | 27.7s | 2.17 | 18.1 |
 | AtSNE | 0.957 | 19.6s | 0.93 | 25.7 |
CIFAR100 | BH-t-SNE | 0.636 | 9m51s | 2.62 | 0.9 |
 | LargeVis | 0.607 | 8m50s | 7.90 | 1.0 |
 | tsne-cuda | 0.646 | 28.3s | 2.33 | 18.7 |
 | AtSNE | 0.600 | 19s | 0.93 | 27.9 |
MNIST | BH-t-SNE | 0.970 | 5m20s | 2.35 | 1.7 |
 | LargeVis | 0.966 | 8m59s | 7.15 | 1.0 |
 | tsne-cuda | 0.968 | 31.3s | 2.33 | 14.7 |
 | AtSNE | 0.967 | 19.6s | 0.93 | 27.5 |
Fashion-MNIST | BH-t-SNE | 0.821 | 3m46s | 2.28 | 2.3 |
 | LargeVis | 0.797 | 8m30s | 7.18 | 1.0 |
 | tsne-cuda | 0.827 | 31.1s | 2.17 | 16.4 |
 | AtSNE | 0.822 | 19.9s | 0.93 | 25.6 |
AG’s News | BH-t-SNE | 0.993 | 5m30s | 0.95 | 1.9 |
 | LargeVis | 0.994 | 10m37s | 2.65 | 1.0 |
 | tsne-cuda | 0.993 | 39.3s | 2.17 | 16.2 |
 | AtSNE | 0.995 | 23s | 0.88 | 27.7 |
DBPedia | BH-t-SNE | 0.993 | 36m8s | 4.22 | 0.93 |
 | LargeVis | 0.999 | 33m43s | 12.71 | 1.0 |
 | tsne-cuda | - | - | - | - |
 | AtSNE | 0.999 | 3m | 2.03 | 11.2 |
ImageNet | BH-t-SNE | 0.412 | 4h7m53s | 10.8 | 0.3 |
 | LargeVis | 0.608 | 1h18m45s | 53.09 | 1.0 |
 | tsne-cuda | - | - | - | - |
 | AtSNE | 0.649 | 11m53s | 4.01 | 6.6 |
Yahoo | BH-t-SNE | 0.537 | 2h17m17s | 10.47 | 0.62 |
 | LargeVis | 0.775 | 1h25m17s | 49.99 | 1.0 |
 | tsne-cuda | - | - | - | - |
 | AtSNE | 0.780 | 12m52s | 4.27 | 6.6 |
Crawl | BH-t-SNE | - | >24h | - | - |
 | LargeVis | 0.688 | 2h34m14s | 139.05 | 1.0 |
 | tsne-cuda | - | - | - | - |
 | AtSNE | 0.692 | 30m1s | 7.19 | 5.1 |
Amazon3M | BH-t-SNE | - | >24h | - | - |
 | LargeVis | 0.606 | 2h53m25s | 104 | 1.0 |
 | tsne-cuda | - | - | - | - |
 | AtSNE | 0.603 | 34m4s | 7.98 | 5.1 |
Amazon20M | BH-t-SNE | - | - | - | - |
 | LargeVis | - | - | - | - |
 | tsne-cuda | - | - | - | - |
 | AtSNE | 0.755 | 4h54m | 19.70 | - |
- Tested on an i9-7980XE (18 cores, 36 threads) with 128 GB memory
- AtSNE and TSNE-CUDA use one GTX 1080Ti GPU
- BH-t-SNE and LargeVis use 32 threads in the table above
- `-` means the method crashed during testing, mostly because of memory issues
- Tested versions of LargeVis, BH-t-SNE and TSNE-CUDA are feb8121, 62dedde and efa2098, respectively
- For the Amazon20M dataset, which is too large to fit in memory, we use Product Quantization to build the KNN graph (see the sketch after this list). AtSNE uses the extra parameters `-k 50 --ivfpq 1 --subQuantizers 24 --bitsPerCode 8`
- AtSNE uses the default parameters in the tests above, except `--n_negative 400`. The exact parameters behind the results above are provided below in case you need them:
  `--lr 0.05 --vis_iter 2000 --save_interval 0 -k 100 --clusters 1000 --n_negative 400 --center_number 5 --nprobe 50 --knn_negative_rate 0 -p 50 --early_pull_rate 20 --center_pull_iter 500 --early_pull_iter 1000 --scale 10 --center_perplexity 70 --center_grad_coeff 1`
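The `--ivfpq` flags above select faiss's IVFPQ index inside AtSNE. As a reference for what PQ-based KNN-graph construction looks like, here is a minimal sketch using the faiss Python API; the parameter values mirror the flags above, but the function and the random data are illustrative, not AtSNE's actual pipeline:

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss


def build_knn_graph_ivfpq(base, k=50, nlist=1000,
                          sub_quantizers=24, bits_per_code=8, nprobe=50):
    """Approximate KNN graph via an IVFPQ index (compressed codes fit in memory)."""
    d = base.shape[1]                    # d must be divisible by sub_quantizers
    quantizer = faiss.IndexFlatL2(d)     # coarse quantizer over nlist cells
    index = faiss.IndexIVFPQ(quantizer, d, nlist, sub_quantizers, bits_per_code)
    index.train(base)                    # learn coarse centroids and PQ codebooks
    index.add(base)                      # store only compressed codes
    index.nprobe = nprobe                # number of inverted lists scanned per query
    _, neighbors = index.search(base, k + 1)  # +1: each point usually finds itself
    return neighbors[:, 1:]              # drop the (usual) self-match column


# Example with random 96-dimensional data standing in for the Amazon20M vectors:
base = np.random.rand(100_000, 96).astype('float32')
knn = build_knn_graph_ivfpq(base, k=50)
```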
- CUDA (version 8 or later), with nvcc and cuBLAS included
- gcc
- faiss
- Clone this project
- Initialize the submodules (cmdline and faiss):
  - enter the project root directory
  - run `git submodule init; git submodule update`
- Compile faiss: enter the faiss directory (`vendor/faiss`), follow its Step 1 and Step 3, and confirm that `vendor/faiss/libfaiss.a` and `vendor/faiss/gpu/libgpufaiss.a` are generated. Simplified instructions are shown below:
  - install a required BLAS library (MKL, OpenBLAS): `sudo apt install libopenblas-dev`
  - `cd vendor/faiss`
  - build the faiss CPU library: `./configure && make -j8`
  - build the faiss GPU library: `cd gpu; make -j`
- Enter the project root directory and run `make -j`
```
./qvis_gpu -b mnist_vec784D_data.txt.fvecs -o mnist_result.txt
```

We chose good default parameters for you, and there are many other parameters you can change. If you want to reproduce the tests in our KDD paper, please add `--n_negative 400`:

```
./qvis_gpu -b mnist_vec784D_data.txt.fvecs --n_negative 400 -o mnist_result.txt
```
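For a quick look at a result without the bundled tools, here is a minimal plotting sketch. It assumes the result file holds one whitespace-separated 2D point per line and that a label file with one integer per line is available; both the format assumption and the file names are illustrative (`tools/view.py`, described below, is the supported viewer):

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: one "x y" pair per line in the result file; adjust if the
# actual output carries a header line or extra columns.
points = np.loadtxt('mnist_result.txt')
labels = np.loadtxt('mnist_label.txt', dtype=int)  # hypothetical label file

plt.figure(figsize=(8, 8))
plt.scatter(points[:, 0], points[:, 1], c=labels, s=1, cmap='tab10')
plt.axis('off')
plt.savefig('mnist_result.png', dpi=200)
```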
The ivecs/fvecs vector file formats are defined here.
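For reference, a minimal sketch of reading and writing these formats in Python: each record is an int32 dimension count d followed by d float32 values (.ivecs is identical except the payload values are int32):

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file into an (n, d) float32 array."""
    raw = np.fromfile(path, dtype=np.float32)
    d = raw.view(np.int32)[0]                 # first field of each record is its dimension
    return raw.reshape(-1, d + 1)[:, 1:].copy()

def write_fvecs(path, vectors):
    """Write an (n, d) float32 array as .fvecs."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    n, d = vectors.shape
    dims = np.full((n, 1), d, dtype=np.int32)  # per-record dimension prefix
    np.hstack([dims, vectors.view(np.int32)]).tofile(path)
```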
There are some supplementary tools we used during development, debugging, and experimentation:

- `tools/view.py` — draw the result in 2D space and save images for you
  - the label file is optional
  - uses multiple processes to draw images for results sharing the same filename prefix
- `tools/txt_to_fvecs.py` — convert a txt file, such as a LargeVis result or a label file, to ivecs/fvecs
- `tools/largevis_convert.py` — convert a fvecs/ivecs dataset to the LargeVis input format
- `tools/imagenet_infer.py` — generate 128D feature vectors from the ImageNet dataset
- `tools/box_filter.py` — given a bounding box, print the points and corresponding labels; used for the case study in our paper
- `test_knn_accuracy` — (build required) test the KNN classifier accuracy (label file needed) of a visualization result
- `test_top1_error` — (build required) test the top-1 error of a visualization result. The top-1 error is the ratio of points whose nearest neighbor in the low-dimensional space is not their nearest neighbor in the high-dimensional space
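As a reference for what these two metrics measure, here is a minimal brute-force sketch in numpy. It illustrates the definitions only; it is not the code behind `test_knn_accuracy`/`test_top1_error`, and the exact evaluation protocol (sampling, tie handling) may differ:

```python
import numpy as np

def nearest_neighbors(points, k):
    """Indices of the k nearest neighbors of every point (brute force, excludes self)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def knn_accuracy(low, labels, k=10):
    """Fraction of points whose majority label among k low-dim neighbors matches their own."""
    nn = nearest_neighbors(low, k)
    pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
    return float((pred == labels).mean())

def top1_error(high, low):
    """Fraction of points whose low-dim nearest neighbor differs from the high-dim one."""
    nn_high = nearest_neighbors(high, 1)[:, 0]
    nn_low = nearest_neighbors(low, 1)[:, 0]
    return float((nn_high != nn_low).mean())

# Example on small random data (brute force is O(n^2); sample points for large n):
high = np.random.rand(500, 64).astype(np.float32)
low = np.random.rand(500, 2).astype(np.float32)
labels = np.random.randint(0, 10, size=500)
print(knn_accuracy(low, labels), top1_error(high, low))
```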