A systematically integrated toolkit for sentence embedding representation, built from scratch.
Four kinds of models are supported:
- Embedders (GloVe, Word2Vec)
- TextCNN
- RNNs (RNN, LSTM, GRU)
- BERTs (BERT, ALBERT, RoBERTa)
- Git clone this repository, or download and unzip Sentence-representation.zip:

  ```bash
  git clone https://github.com/brian-zZZ/Sentence-representation.git
  ```
- Set up the environment dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download the pretrained embedding vectors for GloVe and Word2Vec, then place them in `sentence-representation/glove.6B/` and `sentence-representation/word2vec/` respectively. For Chinese users, both are also available via the Baidu NetDisk link, pass-code: pd9m.
- Optionally, download the pretrained weights of BERTs from Huggingface. The pretrained weights are also available at the Baidu NetDisk link above; place them in `sentence-representation/huggingface_pretrained/`. If you don't download the pretrained weights, remember to pass `--local_or_online=online` when you run the `main_BERTs.py` script, so that the online weights will be downloaded and cached properly.
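  For instance, a fallback to the online weights might look like this (illustrative sketch; `bert` is one of the `bert_type` choices listed below, and the remaining args are left at their defaults):

  ```bash
  # Fetch and cache the weights from the Huggingface hub instead of loading a local copy
  python main_BERTs.py --bert_type=bert --local_or_online=online
  ```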
Preprocess the corpus at either word or subword level:

```bash
# Word level
python preprocessor.py --word_type=word --min_freq=1
# Subword level
python preprocessor.py --word_type=subword --min_freq=2
```
Then train and evaluate each kind of model:

- Embedders

  Mainly determining args:
  - `embedder_type`: str, choices=[glove, word2vec]
  - `siamese`: bool
  - `pooling_strategy`: str, choices=[mean, max]
  - `word_type`: str, choices=[word, subword]

  ```bash
  python main_Embedders.py --embedder_type=glove --siamese --pooling_strategy=mean --word_type=subword
  ```
- TextCNN

  Mainly determining args: `siamese`, `embedder_type`, `word_type`

  ```bash
  python main_CNN.py "your specific args config"
  ```
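  For instance, a plausible invocation (illustrative; flag values are drawn from the choices documented above):

  ```bash
  # Siamese TextCNN over GloVe vectors at word level
  python main_CNN.py --siamese --embedder_type=glove --word_type=word
  ```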
- RNNs

  Mainly determining args: `rnn_type`, `bidirectional`, `siamese`, `embedder_type`, where the choices of `rnn_type` are `rnn`, `lstm`, `gru`.

  ```bash
  python main_RNNs.py "your specific args config"
  ```
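  For instance, a siamese bidirectional LSTM might be launched as follows (illustrative; this assumes `bidirectional`, like `siamese`, is a boolean switch):

  ```bash
  # Siamese bidirectional LSTM over Word2Vec embeddings
  python main_RNNs.py --rnn_type=lstm --bidirectional --siamese --embedder_type=word2vec
  ```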
- BERTs

  Mainly determining args: `bert_type`, `pooling_strategy`, `siamese`, `word_type`, where the choices of `bert_type` are `bert`, `bert_nli`, `bert_simcse`, `albert`, `roberta_nli`. The `_nli` variants follow Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [1]; the `_simcse` variant follows SimCSE: Simple Contrastive Learning of Sentence Embeddings [2].

  ```bash
  python main_BERTs.py "your specific args config"
  ```
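  For instance, a SimCSE-style run with mean pooling might look like this (illustrative; flag values are drawn from the choices documented above):

  ```bash
  # Siamese SimCSE variant with mean pooling over subword tokens
  python main_BERTs.py --bert_type=bert_simcse --pooling_strategy=mean --siamese --word_type=subword
  ```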
For more args of each model, please check the corresponding `main_xxx.py` file and customize your training schedule.
BERTs outperform the other methods by a clear margin, although they cost more time and hardware resources; Embedders and RNNs perform reasonably well and close to each other, while TextCNN performs poorly, indicating that it is not well suited to the sentence semantic similarity prediction task.
Author: Brian Zhang, College of AI, UCAS.
This is the programming-assignment implementation for the Text Data Mining course, Spring semester 2022, at UCAS.
This integrated sentence embedding toolkit is free to use; if you find it helpful, please cite this repository.