Add Word2Vec Embeddings to Cluster Summarizers [resolves #150] #162

dafajon · 2020-11-18T15:57:06Z

This feature branch is based on the feature/word2vec branch.
Add word2vec embeddings to Cluster Summarizers.
The `embedding_type` argument of the summarizer selects between bert and word2vec.
Include Word2Vec embeddings with both the simple and bert tokenizers during summarizer evaluation.
Add results to summarize/README.md.
Add tests for the cluster summarizer covering the new embedding type.

- `GCorpus` is an iterator created from the selected sadedegel corpus.
- Gensim requires sentences in tokenized `List[List[str]]` or `Iter[List[str]]` format.
- Every sentence in the corpus is yielded as `List[str]` by the iterator.
- Punctuation is stripped and text is lowercased for gensim vocab building. This may become customizable in the future.
- Gensim vocab building and training consume the iterator object instead of a list, so each training epoch requires a reset iterator.
- The CLI can train and re-train on existing SadedeGel corpora.
- The dumped model is trained for 15 epochs on the standard corpus of 98 documents.
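The restartable-iterator requirement described above can be illustrated with a minimal sketch. This is not the `GCorpus` class from this PR: the class name `RestartableCorpus`, its constructor, and the whitespace tokenization are assumptions made for illustration only.

```python
import string
from typing import Iterable, Iterator, List


class RestartableCorpus:
    """Hypothetical stand-in for a corpus iterator (not the PR's GCorpus).

    Each call to iter() starts a fresh pass, so a consumer that walks
    the corpus once per epoch always sees every sentence.
    """

    def __init__(self, sentences: Iterable[str]):
        self._sentences = list(sentences)
        # Translation table that deletes ASCII punctuation characters.
        self._strip = str.maketrans("", "", string.punctuation)

    def __iter__(self) -> Iterator[List[str]]:
        for sentence in self._sentences:
            # Strip punctuation, lowercase, then split on whitespace.
            # Whitespace splitting is an assumption; the PR uses the
            # sadedegel word tokenizers instead.
            tokens = sentence.translate(self._strip).lower().split()
            if tokens:
                yield tokens


corpus = RestartableCorpus(["Merhaba dünya!", "Bugün hava güzel."])
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be exhausted here
```

Because `__iter__` is a generator function, every pass starts from the first sentence again. A bare generator object would be exhausted after the vocabulary-building pass, which is why Gensim expects an iterable of this kind.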

- Different word tokenizers create different vocabularies and models.
- Access the model and keyedvectors based on the configured tokenizer when the user accesses the `.word2vec_embeddings` property.
- Load and save models based on the configured tokenizer.
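One way to keep a separate model per tokenizer is to derive the model file name from the tokenizer name. The sketch below assumes a hypothetical file layout (`w2v_<tokenizer>.model`) and a hypothetical helper `model_path`; neither comes from this PR.

```python
from pathlib import Path

# Tokenizer names mentioned in this PR description.
SUPPORTED_TOKENIZERS = {"bert", "simple"}


def model_path(tokenizer_name: str, model_dir: Path) -> Path:
    """Return the (hypothetical) model file for the given tokenizer."""
    key = tokenizer_name.lower()
    if key not in SUPPORTED_TOKENIZERS:
        raise ValueError(f"No Word2Vec model for tokenizer {tokenizer_name!r}")
    # The file naming scheme is an assumption made for illustration.
    return model_dir / f"w2v_{key}.model"
```

Rejecting unknown tokenizer names early avoids silently loading vectors that were trained on a different vocabulary.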

… corpus.

- Load the Word2Vec model based on the default sadedegel tokenizer.
- Lowercase each token, as the Gensim model is trained with a lowercase vocabulary.
- Collect the OOV tokens of a sentence as an instance attribute.
- Add an instance attribute `._has_w2v` to the sentence, so that a sentence whose tokens are all OOV is skipped when this attribute is used as a filter.
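The OOV bookkeeping described above can be sketched without Gensim by using a plain dictionary as the vocabulary. The function name `lookup_sentence` and its return shape are assumptions for illustration; the PR stores these values as attributes on the sentence object instead.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def lookup_sentence(
    tokens: List[str], vectors: Dict[str, Vector]
) -> Tuple[List[Vector], List[str], bool]:
    """Split tokens into found vectors and out-of-vocabulary tokens.

    `vectors` is a plain dict standing in for the trained keyed vectors.
    """
    found: List[Vector] = []
    oov: List[str] = []
    for token in tokens:
        # The vocabulary is lowercase, so lowercase before the lookup.
        key = token.lower()
        if key in vectors:
            found.append(vectors[key])
        else:
            oov.append(token)
    # False when every token is OOV, mirroring the role of `_has_w2v`.
    has_w2v = len(found) > 0
    return found, oov, has_w2v
```

A caller can then filter on the boolean, for example keeping only sentences where it is `True` before clustering.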

… attribute.

husnusensoy · 2021-03-23T19:48:40Z

@dafajon can you reimplement this on version 0.19?

dafajon · 2021-04-19T14:56:59Z

Flagged as highpriority. First on my TODO list.

dafajon added 8 commits October 20, 2020 19:03

Fix .gitattributes for directory structure.

fafb8e0

Add Gensim model trained with bert tokenized vocabulary over extended…

214aab3

… corpus.

model .npy files

23b0b82

Add Word2Vec Embeddings to Cluster Summarizers [resolves #150]

b6f5c3f

Add variable vector size training and handle other sizes in embedding…

409d0a5

… attribute.

husnusensoy added the pending (Pending Merge Requests) and question (Further information is requested) labels Apr 2, 2021

dafajon added the highpriority label Apr 19, 2021
