
Commit eb6d086

v0.3 (#32)
* Use candidate words instead of extracting those from the documents
* Spacy, Gensim, USE, and Custom Backends were added
* Improved imports
* Fix encoding error when locally installing KeyBERT #30
* Improved documentation (ReadMe & MKDocs)
* Add the main tutorial as a shield
* Typos #31, #35
1 parent 2a982bd commit eb6d086

File tree

16 files changed: +747 -191 lines changed

README.md

Lines changed: 37 additions & 20 deletions
@@ -2,6 +2,7 @@
 [![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
 [![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
 [![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)
 
 <img src="images/logo.png" width="35%" height="35%" align="right" />
 
@@ -65,10 +66,19 @@ Installation can be done using [pypi](https://pypi.org/project/keybert/):
 pip install keybert
 ```
 
-To use Flair embeddings, install KeyBERT as follows:
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
 
 ```
 pip install keybert[flair]
+pip install keybert[gensim]
+pip install keybert[spacy]
+pip install keybert[use]
+```
+
+To install all backends:
+
+```
+pip install keybert[all]
 ```
 
 <a name="usage"/></a>
@@ -90,14 +100,14 @@ doc = """
 the learning algorithm to generalize from the training data to unseen situations in a
 'reasonable' way (see inductive bias).
 """
-model = KeyBERT('distilbert-base-nli-mean-tokens')
-keywords = model.extract_keywords(doc)
+kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
+keywords = kw_model.extract_keywords(doc)
 ```
 
 You can set `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:
 
 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
 [('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
@@ -109,7 +119,7 @@ To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
 of words you would like in the resulting keyphrases:
 
 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
 [('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
@@ -125,13 +135,13 @@ have shown great performance in semantic similarity and paraphrase identification
 <a name="maxsum"/></a>
 ### 2.3. Max Sum Similarity
 
-To diversity the results, we take the 2 x top_n most similar words/phrases to the document.
+To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
 Then, we take all top_n combinations from the 2 x top_n words and extract the combinations
 that are the least similar to each other by cosine similarity.
 
 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
-                           use_maxsum=True, nr_candidates=20, top_n=5)
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+                              use_maxsum=True, nr_candidates=20, top_n=5)
 [('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
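
For readers who want to see what this selection step actually does, here is a minimal sketch of Max Sum Similarity. It assumes the document and candidate word embeddings have already been computed; the function name and signature are illustrative, not KeyBERT's internal API:

```python
import itertools

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def max_sum_similarity(doc_embedding, word_embeddings, words, top_n, nr_candidates):
    # Keep the nr_candidates words most similar to the document
    doc_word_sim = cosine_similarity(doc_embedding, word_embeddings)
    word_word_sim = cosine_similarity(word_embeddings)
    candidate_idx = list(doc_word_sim.argsort()[0][-nr_candidates:])

    # Among all top_n-sized combinations of those candidates, pick the one
    # whose members are least similar to each other
    best_combination, min_sim = None, np.inf
    for combination in itertools.combinations(candidate_idx, top_n):
        sim = sum(word_word_sim[i][j] for i in combination for j in combination if i != j)
        if sim < min_sim:
            best_combination, min_sim = combination, sim
    return [words[idx] for idx in best_combination]
```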
@@ -148,8 +158,8 @@ keywords / keyphrases which is also based on cosine similarity. The results
 with **high diversity**:
 
 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
-                           use_mmr=True, diversity=0.7)
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+                              use_mmr=True, diversity=0.7)
 [('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185),
@@ -160,8 +170,8 @@ with **high diversity**:
 The results with **low diversity**:
 
 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
-                           use_mmr=True, diversity=0.2)
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+                              use_mmr=True, diversity=0.2)
 [('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577),
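
The MMR variant shown above trades off relevance to the document against similarity to keywords already selected, with `diversity` controlling the trade-off. A minimal sketch, again assuming precomputed embeddings and illustrative names rather than KeyBERT internals:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def mmr(doc_embedding, word_embeddings, words, top_n, diversity):
    # Similarity of every candidate to the document and to every other candidate
    word_doc_sim = cosine_similarity(word_embeddings, doc_embedding)
    word_word_sim = cosine_similarity(word_embeddings)

    # Start with the single candidate closest to the document
    keywords_idx = [int(np.argmax(word_doc_sim))]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        candidate_sims = word_doc_sim[candidates_idx, 0]
        # For each remaining candidate: similarity to its closest selected keyword
        redundancy = np.max(word_word_sim[np.ix_(candidates_idx, keywords_idx)], axis=1)

        # Higher diversity penalizes redundancy more heavily
        mmr_scores = (1 - diversity) * candidate_sims - diversity * redundancy
        best = candidates_idx[int(np.argmax(mmr_scores))]

        keywords_idx.append(best)
        candidates_idx.remove(best)

    return [words[idx] for idx in keywords_idx]
```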
@@ -172,16 +182,23 @@ The results with **low diversity**:
 
 <a name="embeddings"/></a>
 ### 2.5. Embedding Models
-The parameter `model` takes in a string pointing to a sentence-transformers model,
-a SentenceTransformer, or a Flair DocumentEmbedding model.
+KeyBERT supports many embedding models that can be used to embed the documents and words:
+
+* Sentence-Transformers
+* Flair
+* Spacy
+* Gensim
+* USE
+
+Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.
 
 **Sentence-Transformers**
 You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
 and pass it through KeyBERT with `model`:
 
 ```python
 from keybert import KeyBERT
-model = KeyBERT(model='distilbert-base-nli-mean-tokens')
+kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
 ```
 
 Or select a SentenceTransformer model with your own parameters:
@@ -191,7 +208,7 @@ from keybert import KeyBERT
 from sentence_transformers import SentenceTransformer
 
 sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
-model = KeyBERT(model=sentence_model)
+kw_model = KeyBERT(model=sentence_model)
 ```
 
 **Flair**
@@ -203,7 +220,7 @@ from keybert import KeyBERT
 from flair.embeddings import TransformerDocumentEmbeddings
 
 roberta = TransformerDocumentEmbeddings('roberta-base')
-model = KeyBERT(model=roberta)
+kw_model = KeyBERT(model=roberta)
 ```
 
 You can select any 🤗 transformers model [here](https://huggingface.co/models).
@@ -218,7 +235,7 @@ To cite KeyBERT in your work, please use the following bibtex reference:
     title = {KeyBERT: Minimal keyword extraction with BERT.},
     year = 2020,
     publisher = {Zenodo},
-    version = {v0.1.3},
+    version = {v0.3.0},
     doi = {10.5281/zenodo.4461265},
     url = {https://doi.org/10.5281/zenodo.4461265}
 }
@@ -238,10 +255,10 @@ but most importantly, these are amazing resources for creating impressive keyword extraction models:
 * https://github.com/swisscom/ai-research-keyphrase-extraction
 
 **MMR**:
-The selection of keywords/keyphrases was modelled after:
+The selection of keywords/keyphrases was modeled after:
 * https://github.com/swisscom/ai-research-keyphrase-extraction
 
 **NOTE**: If you find a paper or github repo that has an easy-to-use implementation
 of BERT-embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
-add it a reference to this repo.
+add a reference to this repo.
 
docs/changelog.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+## **Version 0.3.0**
+*Release date: 10 May, 2021*
+
+The two main features are **candidate keywords**
+and several **backends** to use instead of Flair and SentenceTransformers!
+
+**Highlights**:
+
+* Use candidate words instead of extracting those from the documents ([#25](https://github.com/MaartenGr/KeyBERT/issues/25))
+    * ```KeyBERT().extract_keywords(doc, candidates)```
+* Spacy, Gensim, USE, and Custom Backends were added (see documentation [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html))
+
+**Fixes**:
+
+* Improved imports
+* Fix encoding error when locally installing KeyBERT ([#30](https://github.com/MaartenGr/KeyBERT/issues/30))
+
+**Miscellaneous**:
+
+* Improved documentation (ReadMe & MKDocs)
+* Add the main tutorial as a shield
+* Typos ([#31](https://github.com/MaartenGr/KeyBERT/pull/31), [#35](https://github.com/MaartenGr/KeyBERT/pull/35))
+
+
+## **Version 0.2.0**
+*Release date: 9 Feb, 2021*
+
+**Highlights**:
+
+* Add similarity scores to the output
+* Add Flair as a possible back-end
+* Update documentation + improved testing
+
+## **Version 0.1.2**
+*Release date: 28 Oct, 2020*
+
+Added Max Sum Similarity as an option to diversify your results.
+
+
+## **Version 0.1.0**
+*Release date: 27 Oct, 2020*
+
+This first release includes keyword/keyphrase extraction using BERT and simple cosine similarity.
+There is also an option to use Maximal Marginal Relevance to select the candidate keywords/keyphrases.
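
Since candidate keywords are the headline feature of v0.3, a quick illustration may help. A minimal sketch, assuming v0.3 and using an invented document and candidate list:

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output."

# User-supplied candidates replace the n-grams KeyBERT would otherwise
# extract from the document itself; this list is purely illustrative
candidates = ["supervised learning", "machine learning", "labeled data", "neural networks"]

kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc, candidates=candidates)
```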

docs/guides/embeddings.md

Lines changed: 113 additions & 12 deletions
@@ -1,36 +1,137 @@
-## **Embedding Models**
-The parameter `model` takes in a string pointing to a sentence-transformers model,
-a SentenceTransformer, or a Flair DocumentEmbedding model.
+# Embedding Models
+In this tutorial we will be going through the embedding models that can be used in KeyBERT.
+Having the option to choose embedding models allows you to leverage pre-trained embeddings that suit your use-case.
 
-### **Sentence-Transformers**
-You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
+### **Sentence Transformers**
+You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
 and pass it through KeyBERT with `model`:
 
 ```python
 from keybert import KeyBERT
-model = KeyBERT(model='distilbert-base-nli-mean-tokens')
+kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
 ```
 
 Or select a SentenceTransformer model with your own parameters:
 
 ```python
-from keybert import KeyBERT
 from sentence_transformers import SentenceTransformer
 
-sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
-model = KeyBERT(model=sentence_model)
+sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
+kw_model = KeyBERT(model=sentence_model)
 ```
 
-### **Flair**
+### **Flair**
 [Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
 is publicly available. Flair can be used as follows:
 
 ```python
-from keybert import KeyBERT
 from flair.embeddings import TransformerDocumentEmbeddings
 
 roberta = TransformerDocumentEmbeddings('roberta-base')
-model = KeyBERT(model=roberta)
+kw_model = KeyBERT(model=roberta)
 ```
 
 You can select any 🤗 transformers model [here](https://huggingface.co/models).
+
+Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
+Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
+pass it to KeyBERT in order to use those word embeddings as document embeddings:
+
+```python
+from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
+
+glove_embedding = WordEmbeddings('crawl')
+document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
+
+kw_model = KeyBERT(model=document_glove_embeddings)
+```
+
+### **Spacy**
+[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
+many models available across many languages for modeling text.
+
+To use Spacy's non-transformer models in KeyBERT:
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
+
+kw_model = KeyBERT(model=nlp)
+```
+
+Using spacy-transformer models:
+
+```python
+import spacy
+
+spacy.prefer_gpu()
+nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
+
+kw_model = KeyBERT(model=nlp)
+```
+
+If you run into memory issues with spacy-transformer models, try:
+
+```python
+import spacy
+from thinc.api import set_gpu_allocator, require_gpu
+
+nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
+set_gpu_allocator("pytorch")
+require_gpu(0)
+
+kw_model = KeyBERT(model=nlp)
+```
+
+### **Universal Sentence Encoder (USE)**
+The Universal Sentence Encoder encodes text into high dimensional vectors that are used here
+for embedding the documents. The model is trained and optimized for greater-than-word length text,
+such as sentences, phrases or short paragraphs.
+
+Using USE in KeyBERT is rather straightforward:
+
+```python
+import tensorflow_hub
+embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
+kw_model = KeyBERT(model=embedding_model)
+```
+
+### **Gensim**
+For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model
+to be used in KeyBERT. Note that Gensim is primarily used for Word Embedding models. This typically works
+best for short documents since the word embeddings are pooled.
+
+```python
+import gensim.downloader as api
+ft = api.load('fasttext-wiki-news-subwords-300')
+kw_model = KeyBERT(model=ft)
+```
+
+### **Custom Backend**
+If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
+create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:
+
+```python
+from keybert.backend import BaseEmbedder
+from sentence_transformers import SentenceTransformer
+
+class CustomEmbedder(BaseEmbedder):
+    def __init__(self, embedding_model):
+        super().__init__()
+        self.embedding_model = embedding_model
+
+    def embed(self, documents, verbose=False):
+        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
+        return embeddings
+
+# Create custom backend
+distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
+custom_embedder = CustomEmbedder(embedding_model=distilbert)
+
+# Pass custom backend to keybert
+kw_model = KeyBERT(model=custom_embedder)
+```
