Turkish Sentence Encoder with Quick-Thought Vectors

We are a team of graduate students working voluntarily at Inzva Hacker Space on building a Generic Sentence Encoder model for the Turkish language. Generic Sentence Encoders let a computer capture the meaning of a user's sentences; they are typically used in conversational assistants to understand user utterances. Several recent works have proposed Generic Sentence Encoders for English, as exemplified below.

Universal Sentence Encoder (by Google AI) https://ai.google/research/pubs/pub46808

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning (by Microsoft Research) https://openreview.net/forum?id=B18WgG-CZ

These works leverage Natural Language Inference (NLI) datasets both to train their models and to evaluate them as a downstream task. Many NLI datasets have been crowdsourced for English. Examples from an NLI dataset are shown below.

An example of entailment (similarity).

  • Sentence 1: A senior is waiting at the window of a restaurant that serves sandwiches.
  • Sentence 2: A person waits to be served his food.

An example of contradiction (dissimilarity).

  • Sentence 1: A senior is waiting at the window of a restaurant that serves sandwiches.
  • Sentence 2: A man is waiting in line for the bus.

Examples are taken from The Stanford Natural Language Inference (SNLI) Corpus (https://nlp.stanford.edu/projects/snli).
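Concretely, the SNLI corpus is distributed as JSONL, where each line carries a premise (`sentence1`), a hypothesis (`sentence2`), and a `gold_label` in {entailment, contradiction, neutral}. A minimal loader sketch, assuming those standard field names:

```python
import json

# The three NLI classes; SNLI marks examples without annotator
# consensus with the label "-", which are conventionally skipped.
LABELS = {"entailment", "contradiction", "neutral"}

def load_snli(lines):
    """Yield (premise, hypothesis, label) triples from SNLI-style JSONL lines,
    skipping rows without a gold label."""
    for line in lines:
        example = json.loads(line)
        if example.get("gold_label") in LABELS:
            yield example["sentence1"], example["sentence2"], example["gold_label"]
```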

We want to build a Generic Sentence Encoder model for Turkish. However, no NLI datasets exist for Turkish, owing to the high cost of crowdsourcing. As a cost-effective alternative, we want to leverage the Amazon Translate service to translate the English NLI datasets into Turkish, and to publish the translated datasets publicly to boost Turkish NLP work.
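A minimal sketch of what this translation step could look like with the Amazon Translate API via boto3. The field names follow the SNLI JSONL format; the client wiring shown in the comment is illustrative, not the project's actual script:

```python
def translate_pair(translate_client, example):
    """Translate the premise and hypothesis of one NLI example from
    English (en) to Turkish (tr), keeping the gold label unchanged."""
    translated = dict(example)
    for key in ("sentence1", "sentence2"):
        response = translate_client.translate_text(
            Text=example[key],
            SourceLanguageCode="en",
            TargetLanguageCode="tr",
        )
        translated[key] = response["TranslatedText"]
    return translated

# Real usage (requires AWS credentials):
#   import boto3
#   client = boto3.client("translate")
#   tr_example = translate_pair(client, en_example)
```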

Experiments

Experiment #1

The following results were obtained.

| Experiment Name | Language | Dataset | Preprocessing | Model | Metric | Value |
| --- | --- | --- | --- | --- | --- | --- |
| model1_task1 | English | SICK | Lowercase | Model1 | Pearson | 0.8595461496671714 |
| model2_task1 | English | SICK | Lowercase + sentencepiece | Model2 | Pearson | 0.8470309759444442 |
| model1_task1 | English | SICK | Lowercase | Model1 | Spearman | 0.7906599787429348 |
| model2_task1 | English | SICK | Lowercase + sentencepiece | Model2 | Spearman | 0.7824858836725014 |
| model1_task1 | English | SICK | Lowercase | Model1 | MSE | 0.2669741153404767 |
| model2_task1 | English | SICK | Lowercase + sentencepiece | Model2 | MSE | 0.28983657549983965 |
| model1_task1 | English | SNLI | Lowercase | Model1 | Accuracy | 71.32% |
| model2_task1 | English | SNLI | Lowercase + sentencepiece | Model2 | Accuracy | 69.63% |
| model1_task1 | English | MultiNLI (matched) | Lowercase | Model1 | Accuracy | 59.69% |
| model1_task1 | English | MultiNLI (mismatched) | Lowercase | Model1 | Accuracy | 60.84% |
| model2_task1 | English | MultiNLI (matched) | Lowercase + sentencepiece | Model2 | Accuracy | 59.04% |
| model2_task1 | English | MultiNLI (mismatched) | Lowercase + sentencepiece | Model2 | Accuracy | 60.27% |
| model1_task1 | English | XNLI | Lowercase | Model1 | Accuracy | 60.36% |
| model2_task1 | English | XNLI | Lowercase + sentencepiece | Model2 | Accuracy | 59.12% |
| model1_task2 | Turkish | SICK-MT-TR | Lowercase | Model1 | Pearson | 0.7767414617451377 |
| model2_task2 | Turkish | SICK-MT-TR | Lowercase + sentencepiece | Model2 | Pearson | 0.8076206267469718 |
| model1_task2 | Turkish | SICK-MT-TR | Lowercase | Model1 | Spearman | 0.7042856789726142 |
| model2_task2 | Turkish | SICK-MT-TR | Lowercase + sentencepiece | Model2 | Spearman | 0.7348411904626335 |
| model1_task2 | Turkish | SICK-MT-TR | Lowercase | Model1 | MSE | 0.40369925427270614 |
| model2_task2 | Turkish | SICK-MT-TR | Lowercase + sentencepiece | Model2 | MSE | 0.3561044127205771 |
| model1_task2 | Turkish | SNLI-MT-TR | Lowercase | Model1 | Accuracy | 62.41% |
| model2_task2 | Turkish | SNLI-MT-TR | Lowercase + sentencepiece | Model2 | Accuracy | 64.38% |
| model1_task2 | Turkish | MultiNLI (matched) | Lowercase | Model2 | Accuracy | 50.84% |
| model1_task2 | Turkish | MultiNLI (mismatched) | Lowercase | Model2 | Accuracy | 51.61% |
| model2_task2 | Turkish | MultiNLI (matched) | Lowercase + sentencepiece | Model2 | Accuracy | 59.04% |
| model2_task2 | Turkish | MultiNLI (mismatched) | Lowercase + sentencepiece | Model2 | Accuracy | 53.36% |
| model1_task2 | English | XNLI | Lowercase | Model1 | Accuracy | 53.35% |
| model2_task2 | Turkish | XNLI | Lowercase + sentencepiece | Model2 | Accuracy | 54.33% |
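For reference, the SICK relatedness metrics reported above (Pearson r, Spearman rho, MSE) can be computed as follows. This is a minimal pure-Python sketch; in practice `scipy.stats.pearsonr` and `spearmanr` are typically used, and the ranking step below assumes no ties for brevity:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank(x):
    """1-based ranks of the values in x (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    for position, i in enumerate(order):
        ranks[i] = float(position + 1)
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

def mse(x, y):
    """Mean squared error between predicted and gold scores."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
```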

Note: We are working on presenting the results in a more readable, understandable, and interpretable way. Stay tuned :)

Announcements

🐣 (2019-10-18) Our research has been granted AWS Research Credits to keep translating the NLI datasets into Turkish and to evaluate the resulting NLI models.

🎯 (2019-09-30) We have applied for AWS Research Credits again to translate some well-known NLI and Textual Entailment datasets into Turkish.

🎯 (2019-03-31) We have applied for AWS Research Credits again to translate all available NLI and Textual Entailment datasets into Turkish.

🐣 (2019-01-26) Our research has been granted AWS Research Credits to translate the most common NLI datasets into Turkish and to evaluate the resulting NLI models.

🐣 (2018-10-26) Our research has been awarded free GPU resources by TRUBA, the Turkish Science e-Infrastructure provided by The Scientific and Technological Research Council of Turkey (TUBITAK).


References

This repository is based on the reference implementation of the following paper.

Lajanugen Logeswaran, Honglak Lee, An efficient framework for learning sentence representations. In ICLR, 2018.
