
Commit 25dab3a

v0.4 (#43)

* Use paraphrase-MiniLM-L6-v2 as the default embedding model
* Highlight a document's keywords
* Added FAQ

1 parent eb6d086 commit 25dab3a

18 files changed: +242 −83 lines

README.md

Lines changed: 14 additions & 7 deletions
@@ -90,8 +90,8 @@ from keybert import KeyBERT

 doc = """
 Supervised learning is the machine learning task of learning a function that
-maps an input to an output based on example input-output pairs.[1] It infers a
-function from labeled training data consisting of a set of training examples.[2]
+maps an input to an output based on example input-output pairs. It infers a
+function from labeled training data consisting of a set of training examples.
 In supervised learning, each example is a pair consisting of an input object
 (typically a vector) and a desired output value (also called the supervisory signal).
 A supervised learning algorithm analyzes the training data and produces an inferred function,
@@ -100,7 +100,7 @@ doc = """
 the learning algorithm to generalize from the training data to unseen situations in a
 'reasonable' way (see inductive bias).
 """
-kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
+kw_model = KeyBERT()
 keywords = kw_model.extract_keywords(doc)
 ```

@@ -127,10 +127,17 @@ of words you would like in the resulting keyphrases:
 ('learning function', 0.5850)]
 ```

+We can highlight the keywords in the document by simply setting `highlight`:

+```python
+keywords = kw_model.extract_keywords(doc, highlight=True)
+```
+<img src="images/highlight.png" width="75%" height="75%" />
+
+
 **NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
-I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
-have shown great performance in semantic similarity and paraphrase identification respectively.
+I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
+for multi-lingual documents or any other language.

 <a name="maxsum"/></a>
 ### 2.3. Max Sum Similarity
@@ -198,7 +205,7 @@ and pass it through KeyBERT with `model`:

 ```python
 from keybert import KeyBERT
-kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
+kw_model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
 ```

 Or select a SentenceTransformer model with your own parameters:
@@ -207,7 +214,7 @@ Or select a SentenceTransformer model with your own parameters:
 from keybert import KeyBERT
 from sentence_transformers import SentenceTransformer

-sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
+sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
 kw_model = KeyBERT(model=sentence_model)
 ```
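The diff above now recommends a multilingual model for anything other than English. A minimal sketch of what that looks like in practice, using the `KeyBERT(model=...)` pattern from the hunks above; the Spanish document is a hypothetical example, not taken from the repository:

```python
from keybert import KeyBERT

# Hypothetical non-English document, for illustration only
doc = "El aprendizaje supervisado es la tarea de aprender una función a partir de ejemplos etiquetados."

# The multilingual model recommended in the updated README
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
```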

docs/changelog.md

Lines changed: 14 additions & 0 deletions
@@ -1,3 +1,17 @@
+## **Version 0.4.0**
+*Release date: 23 June, 2021*
+
+**Highlights**:
+
+* Highlight a document's keywords with:
+    * ```keywords = kw_model.extract_keywords(doc, highlight=True)```
+* Use `paraphrase-MiniLM-L6-v2` as the default embedder which gives great results!
+
+**Miscellaneous**:
+
+* Update Flair dependencies
+* Added FAQ
+
 ## **Version 0.3.0**
 *Release date: 10 May, 2021*

docs/faq.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+## **Which embedding model works best for which language?**
+Unfortunately, there is no definitive list of the best models for each language; this depends highly
+on your data, the model, and your specific use-case. However, the default model in KeyBERT
+(`"paraphrase-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual**
+documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.
+
+If you want to use a model that provides higher quality but takes more compute time, then I would advise using `paraphrase-mpnet-base-v2` and `paraphrase-multilingual-mpnet-base-v2` instead.
+
+
+## **Should I preprocess the data?**
+No. By using document embeddings there is typically no need to preprocess the data, as all parts of a document
+are important in understanding its general topic. Although this holds true in 99% of cases, if you
+have data that contains a lot of noise, for example HTML tags, then it would be best to remove them. HTML tags
+typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
+topic modeling to HTML code in order to extract topics from code, then it becomes important.
+
+
+## **Can I use the GPU to speed up the model?**
+Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
+Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.
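To make the GPU answer concrete, a minimal sketch that pins the embedding backend to CUDA; note that `device` is a standard `sentence-transformers` argument (it also appears in the old `embeddings.md` snippet further down), not a KeyBERT parameter, and this assumes a CUDA-capable machine:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

doc = "Supervised learning infers a function from labeled training data."

# Assumes a CUDA-capable GPU; omit device to let
# sentence-transformers pick a device automatically
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2", device="cuda")
kw_model = KeyBERT(model=sentence_model)
keywords = kw_model.extract_keywords(doc)
```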

docs/guides/embeddings.md

Lines changed: 4 additions & 4 deletions
@@ -8,15 +8,15 @@ and pass it through KeyBERT with `model`:

 ```python
 from keybert import KeyBERT
-kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
+kw_model = KeyBERT(model="paraphrase-MiniLM-L6-v2")
 ```

 Or select a SentenceTransformer model with your own parameters:

 ```python
 from sentence_transformers import SentenceTransformer

-sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
+sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
 kw_model = KeyBERT(model=sentence_model)
 ```

@@ -60,7 +60,7 @@ import spacy

 nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

-kw_model = KeyBERT(model=document_glove_embeddings)nlp
+kw_model = KeyBERT(model=nlp)
 ```

 Using spacy-transformer models:
@@ -129,7 +129,7 @@ class CustomEmbedder(BaseEmbedder):
         return embeddings

 # Create custom backend
-distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
+distilbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")
 custom_embedder = CustomEmbedder(embedding_model=distilbert)

 # Pass custom backend to keybert
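The hunk ends at the "# Pass custom backend to keybert" comment; a sketch of what that final step presumably looks like, following the `KeyBERT(model=...)` pattern used throughout these docs (the `custom_embedder` name comes from the snippet above):

```python
from keybert import KeyBERT

# custom_embedder is the CustomEmbedder instance created above
kw_model = KeyBERT(model=custom_embedder)
keywords = kw_model.extract_keywords("Supervised learning infers a function from labeled training data.")
```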

docs/guides/quickstart.md

Lines changed: 9 additions & 3 deletions
@@ -38,7 +38,7 @@ doc = """
 the learning algorithm to generalize from the training data to unseen situations in a
 'reasonable' way (see inductive bias).
 """
-kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
+kw_model = KeyBERT()
 keywords = kw_model.extract_keywords(doc)
 ```

@@ -65,9 +65,15 @@ of words you would like in the resulting keyphrases:
 ('learning function', 0.5850)]
 ```

+We can highlight the keywords in the document by simply setting `highlight`:
+
+```python
+keywords = kw_model.extract_keywords(doc, highlight=True)
+```
+
 **NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
-I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
-have shown great performance in semantic similarity and paraphrase identification respectively.
+I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
+for multi-lingual documents or any other language.

 ### Max Sum Similarity

docs/index.md

Lines changed: 21 additions & 14 deletions
@@ -7,7 +7,7 @@ create keywords and keyphrases that are most similar to a document.

 ## About the Project

-Although that are already many methods available for keyword generation
+Although there are already many methods available for keyword generation
 (e.g.,
 [Rake](https://github.com/aneesha/RAKE),
 [YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.)
@@ -30,11 +30,6 @@ papers and solutions out there that use BERT-embeddings
 ), I could not find a BERT-based solution that did not have to be trained from scratch and
 could be used for beginners (**correct me if I'm wrong!**).
 Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.
-
-**NOTE**: If you use MMR to select the candidates instead of simple cosine similarity,
-this repo is essentially a simplified implementation of
-[EmbedRank](https://github.com/swisscom/ai-research-keyphrase-extraction)
-with BERT-embeddings.

 ## Installation
 Installation can be done using [pypi](https://pypi.org/project/keybert/):
@@ -43,22 +38,33 @@ Installation can be done using [pypi](https://pypi.org/project/keybert/):
 pip install keybert
 ```

-To use Flair embeddings, install KeyBERT as follows:
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:

 ```
 pip install keybert[flair]
+pip install keybert[gensim]
+pip install keybert[spacy]
+pip install keybert[use]
 ```

+To install all backends:
+
+```
+pip install keybert[all]
+```
+
+
 ## Usage

+
 The most minimal example can be seen below for the extraction of keywords:
 ```python
 from keybert import KeyBERT

 doc = """
 Supervised learning is the machine learning task of learning a function that
-maps an input to an output based on example input-output pairs.[1] It infers a
-function from labeled training data consisting of a set of training examples.[2]
+maps an input to an output based on example input-output pairs. It infers a
+function from labeled training data consisting of a set of training examples.
 In supervised learning, each example is a pair consisting of an input object
 (typically a vector) and a desired output value (also called the supervisory signal).
 A supervised learning algorithm analyzes the training data and produces an inferred function,
@@ -67,13 +73,14 @@ doc = """
 the learning algorithm to generalize from the training data to unseen situations in a
 'reasonable' way (see inductive bias).
 """
-model = KeyBERT('distilbert-base-nli-mean-tokens')
+kw_model = KeyBERT()
+keywords = kw_model.extract_keywords(doc)
 ```

-You can set `keyphrase_length` to set the length of the resulting keyphras:
+You can set `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:

 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
 [('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
@@ -85,10 +92,10 @@ To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
 of words you would like in the resulting keyphrases:

 ```python
->>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
 [('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]
-```
+```

images/highlight.png

21.1 KB

keybert/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
-from keybert.model import KeyBERT
+from keybert._model import KeyBERT

-__version__ = "0.3.0"
+__version__ = "0.4.0"

keybert/_highlight.py

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+import re
+from rich.console import Console
+from rich.highlighter import RegexHighlighter
+from typing import Tuple, List
+
+
+class NullHighlighter(RegexHighlighter):
+    """A null highlighter that applies no styling of its own."""
+
+    base_style = ""
+    highlights = [r""]
+
+
+def highlight_document(doc: str,
+                       keywords: List[Tuple[str, float]]):
+    """ Highlight keywords in a document
+
+    Arguments:
+        doc: The document for which to extract keywords/keyphrases
+        keywords: the top n keywords for a document with their respective distances
+                  to the input document
+
+    Returns:
+        highlighted_text: The document with additional tags to highlight keywords
+                          according to the rich package
+    """
+    keywords_only = [keyword for keyword, _ in keywords]
+    max_len = max([len(token.split(" ")) for token in keywords_only])
+
+    if max_len == 1:
+        highlighted_text = _highlight_one_gram(doc, keywords_only)
+    else:
+        highlighted_text = _highlight_n_gram(doc, keywords_only)
+
+    console = Console(highlighter=NullHighlighter())
+    console.print(highlighted_text)
+
+
+def _highlight_one_gram(doc: str,
+                        keywords: List[str]) -> str:
+    """ Highlight 1-gram keywords in a document
+
+    Arguments:
+        doc: The document for which to extract keywords/keyphrases
+        keywords: the top n keywords for a document
+
+    Returns:
+        highlighted_text: The document with additional tags to highlight keywords
+                          according to the rich package
+    """
+    tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).split(" ")
+
+    highlighted_text = " ".join([f"[black on #FFFF00]{token}[/]"
+                                 if token.lower() in keywords
+                                 else f"{token}"
+                                 for token in tokens]).strip()
+    return highlighted_text
+
+
+def _highlight_n_gram(doc: str,
+                      keywords: List[str]) -> str:
+    """ Highlight n-gram keywords in a document
+
+    Arguments:
+        doc: The document for which to extract keywords/keyphrases
+        keywords: the top n keywords for a document
+
+    Returns:
+        highlighted_text: The document with additional tags to highlight keywords
+                          according to the rich package
+    """
+    max_len = max([len(token.split(" ")) for token in keywords])
+    tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).strip().split(" ")
+    n_gram_tokens = [[" ".join(tokens[i: i + max_len][0: j + 1]) for j in range(max_len)] for i, _ in enumerate(tokens)]
+    highlighted_text = []
+    skip = False
+
+    for n_grams in n_gram_tokens:
+        candidate = False
+
+        if not skip:
+            for index, n_gram in enumerate(n_grams):
+
+                if n_gram.lower() in keywords:
+                    candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
+                    skip = index + 1
+
+            if not candidate:
+                candidate = n_grams[0]
+
+            highlighted_text.append(candidate)
+
+        else:
+            skip = skip - 1
+    highlighted_text = " ".join(highlighted_text)
+    return highlighted_text
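For orientation, a sketch of how this new module is exercised. The public entry point is `extract_keywords(doc, highlight=True)` as shown in the README hunks; calling the private `keybert._highlight` module directly, as below, is shown only to illustrate the `(keyword, score)` tuple format that `highlight_document` expects:

```python
from keybert import KeyBERT
from keybert._highlight import highlight_document

doc = "Supervised learning is the machine learning task of learning a function from labeled training data."
kw_model = KeyBERT()

# extract_keywords returns a list of (keyword, score) tuples
keywords = kw_model.extract_keywords(doc)

# Prints the document with keywords on a yellow background via rich
highlight_document(doc, keywords)
```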
File renamed without changes.
File renamed without changes.
