- Model for learning document embeddings (i-vectors) along with their uncertainties.
- Gaussian linear classifier exploiting the uncertainties in document embeddings.
- See paper http://arxiv.org/abs/1908.07599
S. Kesiraju, O. Plchot, L. Burget and S. V. Gangashetty, "Learning Document Embeddings Along With Their Uncertainties," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2319-2332, 2020, doi: 10.1109/TASLP.2020.3012062.
- Python >= 3.7
- PyTorch >= 1.1, <= 1.4
- scipy >= 1.3
- numpy >= 1.16.4
- scikit-learn >= 0.21.2
- h5py >= 2.9.0
- See INSTALL.md for detailed instructions.
python src/create_sample_data.py sample_data/
- For help:
    python src/run_baysmm.py --help
- To train on GPU, set CUDA_VISIBLE_DEVICES=$GPU_ID, where $GPU_ID is the index of a free GPU.
- The following command trains the model for 1000 VB iterations and saves it in an automatically created sub-directory, exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/:
    python src/run_baysmm.py train \
        sample_data/train.mtx \
        sample_data/vocab \
        exp/ \
        -K 50 \
        -trn 1000 \
        -lw 1e+01 \
        -var_p 1e+01 \
        -lt 1e-03
- The ELBO (evidence lower bound) and KLD (Kullback-Leibler divergence) for every iteration, the log file, etc. are saved in this sub-directory.
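When training is launched from a Python script rather than a shell, the GPU selection described above can be passed through the child process environment. The following is a minimal sketch under that assumption; the helper name run_on_gpu is hypothetical and not part of this repository.

```python
import os
import subprocess


def run_on_gpu(cmd, gpu_id):
    """Run `cmd` (a list of arguments) with CUDA_VISIBLE_DEVICES set to `gpu_id`.

    Hypothetical helper for illustration; it only copies the current
    environment and overrides one variable for the child process.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    # check=True raises CalledProcessError if the command exits with an error.
    return subprocess.run(cmd, env=env, check=True)
```

For example, run_on_gpu(["python", "src/run_baysmm.py", "--help"], 0) would run the help command with only GPU 0 visible to it.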
- Extract embeddings [mean, log.std.dev] for 1000 iterations for each of the stats files listed in the sample_data/mtx.flist file list.
- Using the -nth 100 argument, embeddings for every 100th iteration are also saved.
    python src/run_baysmm.py extract \
        sample_data/mtx.flist \
        exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/model_T1000.h5 \
        -xtr 1000 \
        -nth 100
- Extracted embedding posterior distributions are saved in the exp/*/ivecs/ sub-directory with appropriate names.
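The internal layout of the extracted HDF5 files is not described here, so it can help to list what a file contains before loading anything. The following is a minimal sketch using h5py (already a requirement); the helper name summarize_h5 is hypothetical, and no dataset names are assumed.

```python
import h5py


def summarize_h5(path):
    """Return {dataset_name: (shape, dtype)} for every dataset in an HDF5 file.

    Hypothetical helper for illustration; it walks all groups in the file
    and records only datasets, without assuming any particular names.
    """
    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset in the file.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as h5_file:
        h5_file.visititems(visit)
    return summary
```

For example, summarize_h5("exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5") would list the datasets stored in the training embeddings file.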
- Three classifiers can be trained on these embeddings.
- Use the --final option to train and test the classifier on embeddings from the final iteration.
- Gaussian linear classifier (uses only the mean parameter):
    python src/train_and_clf_cv.py exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5 sample_data/train.labels glc
- Multi-class logistic regression (uses only the mean parameter):
    python src/train_and_clf_cv.py exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5 sample_data/train.labels lr
- Gaussian linear classifier with uncertainty (uses the full posterior distribution):
    python src/train_and_clf_cv.py exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5 sample_data/train.labels glcu
- All the results and predicted classes are saved in exp/*/results/
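To illustrate the difference between using only the mean and using the full posterior, the following is a simplified sketch and not the implementation in src/train_and_clf_cv.py. It assumes equal class priors, a covariance shared across classes that is fitted from the embedding means only, and a diagonal posterior covariance exp(2 * log.std.dev) that is added to the shared covariance at test time. All function names are hypothetical.

```python
import numpy as np


def fit_glc(means, labels):
    """Fit class means and a shared covariance from embedding means (assumption:
    training ignores the embedding uncertainties in this sketch)."""
    classes = np.unique(labels)
    class_means = np.stack([means[labels == c].mean(axis=0) for c in classes])
    # Center every embedding on the mean of its own class.
    centered = means - class_means[np.searchsorted(classes, labels)]
    shared_cov = centered.T @ centered / len(means)
    shared_cov += 1e-6 * np.eye(means.shape[1])  # small ridge for stability
    return classes, class_means, shared_cov


def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian N(x; mean, cov)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (logdet + maha + len(x) * np.log(2.0 * np.pi))


def predict(mean, log_std, classes, class_means, shared_cov, use_uncertainty):
    """Pick the class with the highest log-likelihood (equal priors assumed)."""
    cov = shared_cov
    if use_uncertainty:
        # Add the diagonal posterior covariance of this embedding.
        cov = shared_cov + np.diag(np.exp(2.0 * log_std))
    scores = [log_gauss(mean, m, cov) for m in class_means]
    return classes[int(np.argmax(scores))]
```

With use_uncertainty=False the decision depends only on the embedding mean, as in the glc option; with use_uncertainty=True, directions in which the embedding is uncertain contribute less to the score, which is the intuition behind the glcu option.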
@ARTICLE{Kesiraju:2020:BaySMM,
author={Kesiraju, Santosh and Plchot, Oldřich and Burget, Lukáš and Gangashetty, Suryakanth V.},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Learning Document Embeddings Along With Their Uncertainties},
year={2020},
volume={28},
number={},
pages={2319-2332},
doi={10.1109/TASLP.2020.3012062}}