TurkishGloVe

Turkish GloVe - Repository for Turkish GloVe Word Embeddings

Training

We used the official GloVe repository both to create the word embeddings and to run the evaluation. GloVe GitHub Repository
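For reference, the standard training pipeline in the official repository looks like the following command sketch, mirroring its `demo.sh`. The corpus filename and the hyperparameter values here are illustrative, not the exact settings used for this release.

```shell
# Build the GloVe tools, then run the four-stage pipeline.
make

# 1. Count word frequencies to build the vocabulary.
./build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt

# 2. Build the word-word co-occurrence matrix.
./build/cooccur -memory 4.0 -vocab-file vocab.txt -window-size 15 -verbose 2 \
    < corpus.txt > cooccurrence.bin

# 3. Shuffle the co-occurrence records for stochastic optimization.
./build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin

# 4. Train 300-dimensional vectors (-binary 2 writes both text and binary output).
./build/glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
    -vector-size 300 -iter 15 -x-max 10 -threads 8 -binary 2 -save-file vectors
```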

Download pre-trained word vectors

  1. 570K vocab, cased, 300d vectors; 1.6 GB text, 2.6 GB binary (link)

  2. 253K vocab, uncased, 300d vectors; 720 MB text, 1.2 GB binary (link)
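The text releases use GloVe's plain format: one word per line followed by its vector components. A minimal sketch of parsing that format, assuming only NumPy; the tiny two-dimensional demo vectors below are made up (the released files are 300-dimensional):

```python
# Minimal sketch of reading GloVe's text format ("word v1 v2 ...").
# The demo words and vectors here are hypothetical stand-ins for a
# downloaded release file.
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe text lines into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny in-memory stand-in for a release file opened with open(path).
demo = io.StringIO("kedi 0.1 0.9\nköpek 0.2 0.8\nparis 0.9 0.1\n")
vecs = load_glove(demo)
print(cosine(vecs["kedi"], vecs["köpek"]))  # close to 1.0
```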

Corpus

The corpus was collected from the January to December 2018 Common Crawl dumps and contains 2,736B tokens. Corpus size: 5.4 GB. Corpus Link
Paper Link

Intrinsic Evaluation

This benchmark dataset is used for intrinsic evaluation on the analogy task; we used its synonym, capital, and antonym subsets. Benchmark Dataset Link
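The analogy task answers "a is to b as c is to ?" by finding the word whose vector is closest to b - a + c (the 3CosAdd scheme). A sketch of the mechanics on the capitals subset; the four toy vectors are made up purely for illustration:

```python
# 3CosAdd analogy: answer "a : b :: c : ?" with the word whose vector is
# closest (by cosine) to b - a + c, excluding the query words themselves.
# The toy vectors below are fabricated to illustrate the mechanics.
import numpy as np

def analogy(vectors, a, b, c):
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -2.0
    for word, v in vectors.items():
        if word in (a, b, c):  # never return a query word
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

toy = {
    "almanya": np.array([1.0, 0.0, 0.0]),
    "berlin":  np.array([1.0, 1.0, 0.0]),
    "fransa":  np.array([0.0, 0.0, 1.0]),
    "paris":   np.array([0.0, 1.0, 1.0]),
}
print(analogy(toy, "almanya", "berlin", "fransa"))  # paris
```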

Results

| Semantic Evaluation | Antonyms Analogy Task | Capitals Analogy Task | Synonyms Analogy Task | Total Accuracy |
|---|---|---|---|---|
| GloVe Uncased | 21.70 | 47.74 | 19.48 | 27.88 |

Extrinsic Evaluation

This dataset is used for extrinsic evaluation on text categorization; it has seven classes.

Accuracy

| | SVC | Logistic Regression |
|---|---|---|
| GloVe Cased | 0.89306 | 0.89959 |
| GloVe Uncased | 0.89956 | 0.90530 |

Precision

| | SVC | Logistic Regression |
|---|---|---|
| GloVe Cased | 0.89388 | 0.89864 |
| GloVe Uncased | 0.90015 | 0.90619 |

Recall

| | SVC | Logistic Regression |
|---|---|---|
| GloVe Cased | 0.89306 | 0.89796 |
| GloVe Uncased | 0.89959 | 0.90531 |

We trained both classifiers with scikit-learn's default hyperparameters.
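A common way to set up this kind of evaluation is to represent each document as the mean of its word vectors, then fit a default-hyperparameter scikit-learn classifier. A minimal sketch under that assumption; the word vectors and the two-class toy corpus are made up:

```python
# Sketch of an extrinsic setup: a document becomes the mean of its word
# vectors, then a default LogisticRegression is fit. The vectors and the
# tiny two-class corpus here are fabricated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

vecs = {
    "kedi":   np.array([1.0, 0.1]), "köpek": np.array([0.9, 0.2]),
    "paris":  np.array([0.1, 1.0]), "berlin": np.array([0.2, 0.9]),
}

def doc_vector(text):
    """Mean of the vectors of known words (zero vector if none match)."""
    words = [vecs[w] for w in text.split() if w in vecs]
    return np.mean(words, axis=0) if words else np.zeros(2)

docs = ["kedi köpek", "köpek kedi kedi", "paris berlin", "berlin paris paris"]
labels = [0, 0, 1, 1]  # 0 = animals, 1 = cities

X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.score(X, labels))  # 1.0 on this trivially separable toy data
```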

Text Categorization Dataset Link

Examples

```python
model.most_similar(positive=['fransa', 'berlin'], negative=['almanya'])
# top results are cities

model.most_similar(positive=['geliyor', 'gitmek'], negative=['gelmek'])
# top results are verbs

model.most_similar("kedi")
# top results are animals ("kedi" = "cat")
```

References

- https://cs224d.stanford.edu/lecture_notes/notes2.pdf
- https://nlp.stanford.edu/pubs/glove.pdf