Add Word2Vec Embeddings to Cluster Summarizers [resolves #150] #162

Open
wants to merge 8 commits into
base: develop

Conversation

dafajon
Contributor

@dafajon dafajon commented Nov 18, 2020

  • This feature branch is based on the feature/word2vec branch.
  • Add word2vec embeddings to Cluster Summarizers.
  • The summarizer's `embedding_type` argument selects between bert and word2vec.
  • Include Word2Vec embedding with both simple and bert tokenizers during summarizer evaluation.
  • Add results to summarize/README.md
  • Add tests for the cluster summarizers covering the new embedding type.

- `GCorpus` is an iterator created from the selected sadedegel corpus.
- Gensim requires a list of sentences in tokenized `List[List[str]]` or `Iter[List[str]]` format.
- Every sentence in the corpus is yielded as a `List[str]` by the iterator.
- Punctuation is stripped and text is lowercased for gensim vocab building. Customizable in the future.
- Gensim vocab building and training consume the iterator object instead of a list, so each training epoch requires the iterator to be reset.
- CLI is operable for training and re-training on existing SadedeGel corpora.
- The dumped model is trained for 15 epochs on the standard corpus of 98 documents.
- Different word tokenizers create different vocabularies and models.
- Access model and keyedvectors based on configured tokenizer when user tries to access `.word2vec_embeddings` property.
- Load and save models based on configured tokenizer.
- Load Word2Vec model based on default sadedegel tokenizer.
- Lowercase each token, as the Gensim model is trained with a lowercase vocabulary.
- Collect the OOV tokens of a sentence as an instance attribute.
- Add an instance attribute `._has_w2v` to the sentence, such that a sentence whose tokens are all OOV is excluded when this attribute is used as a filter.
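
The commit notes above describe two mechanisms: a corpus iterator that can be restarted for every pass, and per-sentence tracking of out-of-vocabulary tokens. The following is a minimal, self-contained sketch of both ideas, not the code in this PR. `RestartableCorpus` and `SentenceView` are hypothetical names standing in for `GCorpus` and the sadedegel sentence object, the whitespace split is an assumed placeholder for the configured word tokenizer, and a plain `set` stands in for the trained model's vocabulary.

```python
import string
from typing import Iterable, Iterator, List, Set

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


class RestartableCorpus:
    """Hypothetical stand-in for GCorpus: yields one List[str] per sentence."""

    def __init__(self, documents: Iterable[List[str]]):
        # Assumption: each document is a list of raw sentence strings.
        self.documents = list(documents)

    def __iter__(self) -> Iterator[List[str]]:
        # A fresh generator is created on every call, so each pass
        # (vocabulary building, then each training epoch) starts from the top.
        for doc in self.documents:
            for sentence in doc:
                # Whitespace split is a placeholder for the real tokenizer.
                tokens = [
                    tok.translate(_PUNCT_TABLE).lower()
                    for tok in sentence.split()
                ]
                tokens = [tok for tok in tokens if tok]
                if tokens:
                    yield tokens


class SentenceView:
    """Hypothetical sentence wrapper that tracks out-of-vocabulary tokens."""

    def __init__(self, tokens: List[str], vocab: Set[str]):
        # Lowercase to match the lowercase training vocabulary.
        self.tokens = [t.lower() for t in tokens]
        self.oov_tokens = [t for t in self.tokens if t not in vocab]
        # False when every token is OOV (or the sentence is empty).
        self._has_w2v = len(self.oov_tokens) < len(self.tokens)


corpus = RestartableCorpus([["Merhaba dünya!", "Bu bir test."], ["Dünya büyük."]])
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again restarts from the first sentence

# A plain set stands in for the trained model's vocabulary.
vocab = {tok for sent in corpus for tok in sent}
known = SentenceView(["Merhaba", "Mars"], vocab)
unknown = SentenceView(["Mars", "Venus"], vocab)
usable = [s for s in (known, unknown) if s._has_w2v]
```

Implementing `__iter__` on a class, rather than passing a one-shot generator, is what makes repeated passes possible. Gensim's `Word2Vec` accepts a restartable iterable of this kind through its `sentences` argument.
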
@husnusensoy
Contributor

@dafajon can you reimplement this on version 0.19

@husnusensoy added the `pending` (Pending Merge Requests) and `question` (Further information is requested) labels on Apr 2, 2021
@dafajon
Contributor Author

dafajon commented Apr 19, 2021

Flagged as highpriority. First on my TODO list.

Labels
`highpriority`, `pending` (Pending Merge Requests), `question` (Further information is requested)
2 participants