Semantic textual similarity deals with determining how similar two pieces of text are. This can take the form of assigning a score from 1 to 5. Related tasks are paraphrase or duplicate identification.
SentEval is a toolkit for evaluating sentence representations. It covers 17 downstream tasks, among them the common semantic textual similarity tasks. The semantic textual similarity (STS) benchmark tasks from 2012-2016 (STS12, STS13, STS14, STS15, STS16, STS-B) measure the relatedness of two sentences based on the cosine similarity of their two representations. The evaluation criterion is the Pearson correlation between these similarities and the human-annotated scores.
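The STS evaluation described above can be sketched in a few lines. This is a minimal illustration and not SentEval's own code: the two-dimensional vectors, the gold scores, and the helper names `cosine_similarity` and `pearson` are made up for the example, and a real run would use embeddings produced by the sentence encoder under test.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sentence-pair embeddings (toy values, not real model output).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [1.0, 1.0]),  # partially related
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
# Hypothetical human relatedness judgements for the same pairs.
gold_scores = [5.0, 3.0, 1.0]

predicted = [cosine_similarity(u, v) for u, v in pairs]
correlation = pearson(predicted, gold_scores)
print(f"Pearson correlation: {correlation:.3f}")
```

Because Pearson correlation is invariant to linear rescaling, the cosine similarities do not need to be mapped onto the annotation scale before they are compared with the gold scores.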
The SICK relatedness (SICK-R) task trains a linear model to output a score from 1 to 5 indicating the relatedness of two sentences. The same dataset can also be treated as a three-class classification problem (SICK-E) using the entailment labels (classes are 'entailment', 'contradiction', and 'neutral'). The evaluation metric is Pearson correlation for SICK-R and classification accuracy for SICK-E.
The Microsoft Research Paraphrase Corpus (MRPC) is a paraphrase identification dataset in which systems must decide whether two sentences are paraphrases of each other. The evaluation metrics are classification accuracy and F1.
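For reference, the two paraphrase-identification metrics can be computed as in the sketch below. The labels are invented for illustration (1 = paraphrase, 0 = not a paraphrase), and the function names `accuracy` and `f1_score` are hypothetical helpers, not part of any particular toolkit.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_score(gold, pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy gold labels and system predictions (assumed values).
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 1, 0, 0, 1, 1]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"F1       = {f1_score(gold, pred):.3f}")
```

Reporting F1 alongside accuracy helps when the two classes are imbalanced, since a system that always predicts the majority class can still reach a high accuracy.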
The data can be downloaded from here.
Model | MRPC | SICK-R | SICK-E | STS | Paper / Source | Code |
---|---|---|---|---|---|---|
XLNet-Large (ensemble) (Yang et al., 2019) | 93.0/90.7 | - | - | 91.6/91.1* | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Official |
MT-DNN-ensemble (Liu et al., 2019) | 92.7/90.3 | - | - | 91.1/90.7* | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding | Official |
Snorkel MeTaL (ensemble) (Ratner et al., 2018) | 91.5/88.5 | - | - | 90.1/89.7* | Training Complex Models with Multi-Task Weak Supervision | Official |
GenSen (Subramanian et al., 2018) | 78.6/84.4 | 0.888 | 87.8 | 78.9/78.6 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | Official |
InferSent (Conneau et al., 2017) | 76.2/83.1 | 0.884 | 86.3 | 75.8/75.5 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | Official |
TF-KLD (Ji and Eisenstein, 2013) | 80.4/85.9 | - | - | - | Discriminative Improvements to Distributional Sentence Similarity | |
* only evaluated on STS-B
The Quora Question Pairs dataset consists of over 400,000 pairs of questions on Quora. Systems must identify whether one question is a duplicate of the other. Models are evaluated on accuracy; some entries also report F1.
Model | F1 | Accuracy | Paper / Source | Code |
---|---|---|---|---|
XLNet-Large (ensemble) (Yang et al., 2019) | 74.2 | 90.3 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Official |
MT-DNN-ensemble (Liu et al., 2019) | 73.7 | 89.9 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding | Official |
Snorkel MeTaL (ensemble) (Ratner et al., 2018) | 73.1 | 89.9 | Training Complex Models with Multi-Task Weak Supervision | Official |
MwAN (Tan et al., 2018) | - | 89.12 | Multiway Attention Networks for Modeling Sentence Pairs | |
DIIN (Gong et al., 2018) | - | 89.06 | Natural Language Inference Over Interaction Space | Official |
pt-DecAtt (Char) (Tomar et al., 2017) | - | 88.40 | Neural Paraphrase Identification of Questions with Noisy Pretraining | |
BiMPM (Wang et al., 2017) | - | 88.17 | Bilateral Multi-Perspective Matching for Natural Language Sentences | Official |
GenSen (Subramanian et al., 2018) | - | 87.01 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | Official |