[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)

<img src="images/logo.png" width="35%" height="35%" align="right" />
Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

You may want to install additional dependencies depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```
<a name =" usage " /></a >

The most minimal example for keyword extraction can be seen below:

```python
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985)]
```

I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
whose members are least similar to each other by cosine similarity.

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050)]
```
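
To make the idea concrete, here is a minimal sketch of the Max Sum Similarity step, assuming you already have a document embedding, candidate embeddings, and the candidate words from your embedding backend. It is an illustration of the technique, not KeyBERT's internal code, and the `max_sum_similarity` function is hypothetical:

```python
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_similarity(doc_embedding, candidate_embeddings, candidates, top_n, nr_candidates):
    # Hypothetical sketch: inputs are assumed to be numpy arrays of embeddings
    # Step 1: keep the nr_candidates words/phrases most similar to the document
    distances = cosine_similarity(doc_embedding, candidate_embeddings)
    ids = list(distances.argsort()[0][-nr_candidates:])
    candidate_sims = cosine_similarity(candidate_embeddings[ids])

    # Step 2: among all top_n-sized combinations of those candidates, keep the
    # combination whose members are least similar to each other
    best_combination, min_sim = None, np.inf
    for combination in itertools.combinations(range(len(ids)), top_n):
        sim = sum(candidate_sims[i][j] for i, j in itertools.combinations(combination, 2))
        if sim < min_sim:
            best_combination, min_sim = combination, sim
    return [candidates[ids[i]] for i in best_combination]
```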

<a name="maximal"/></a>
### 2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Margin Relevance (MMR) to create
keywords / keyphrases which is also based on cosine similarity. The results
with **high diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185)]
```

The results with **low diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577)]
```
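
As with Max Sum Similarity, the selection step can be sketched in a few lines. This is a simplified illustration of MMR under the same assumptions (precomputed document and candidate embeddings), not KeyBERT's exact implementation, and the `mmr` function is hypothetical:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, candidate_embeddings, candidates, top_n, diversity):
    # Similarity of every candidate to the document and to every other candidate
    doc_sim = cosine_similarity(candidate_embeddings, doc_embedding).reshape(-1)
    candidate_sim = cosine_similarity(candidate_embeddings)

    # Start with the candidate most similar to the document
    keywords_idx = [int(np.argmax(doc_sim))]
    remaining_idx = [i for i in range(len(candidates)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        # Trade off relevance to the document against redundancy with
        # the keywords that were already selected
        redundancy = np.max(candidate_sim[remaining_idx][:, keywords_idx], axis=1)
        mmr_scores = (1 - diversity) * doc_sim[remaining_idx] - diversity * redundancy
        next_idx = remaining_idx[int(np.argmax(mmr_scores))]
        keywords_idx.append(next_idx)
        remaining_idx.remove(next_idx)

    return [candidates[idx] for idx in keywords_idx]
```

A higher `diversity` puts more weight on the redundancy penalty, which is why the high-diversity results above share fewer words with each other.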

<a name="embeddings"/></a>
### 2.5. Embedding Models

KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* Spacy
* Gensim
* USE

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.

**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
kw_model = KeyBERT(model=sentence_model)
```

**Flair**
Flair allows you to choose almost any embedding model that is publicly available.
It can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).
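
**Spacy**
As a sketch of one of the other supported backends, a spaCy pipeline can be passed to `model` in the same way. The `en_core_web_md` model below is only an example; any spaCy model that ships with word vectors should work, and you would need to download it first (see the embeddings guide linked above for the full details):

```python
import spacy
from keybert import KeyBERT

# Load a spaCy model with word vectors; pipeline components that are not
# needed for embedding the text are excluded to keep things fast
nlp = spacy.load("en_core_web_md",
                 exclude=["tagger", "parser", "ner", "attribute_ruler", "lemmatizer"])
kw_model = KeyBERT(model=nlp)
```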

To cite KeyBERT in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020keybert,
  author       = {Maarten Grootendorst},
  title        = {KeyBERT: Minimal keyword extraction with BERT.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.3.0},
  doi          = {10.5281/zenodo.4461265},
  url          = {https://doi.org/10.5281/zenodo.4461265}
}
```

Most importantly, these are amazing resources for creating impressive keyword extraction models:

* https://github.com/swisscom/ai-research-keyphrase-extraction

**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**NOTE**: If you find a paper or github repo that has an easy-to-use implementation
of BERT-embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
add a reference to this repo.