
Commit 38f036d

v0.8 (#180)

* Added `KeyLLM` to extract keywords from text with LLMs across five use cases:
    1. Create Keywords with `KeyLLM`
    2. Extract Keywords with `KeyLLM`
    3. Fine-tune Candidate Keywords
    4. Efficient `KeyLLM`
    5. Efficient `KeyLLM` + `KeyBERT`
* Integrated different LLM backends (OpenAI, Cohere, HF, LangChain, LiteLLM)

1 parent 4f79073 commit 38f036d

28 files changed: +1770 −9 lines changed

README.md

Lines changed: 50 additions & 0 deletions
@@ -23,6 +23,7 @@ Corresponding medium post can be found [here](https://towardsdatascience.com/key
2.3. [Max Sum Distance](#maxsum)
2.4. [Maximal Marginal Relevance](#maximal)
2.5. [Embedding Models](#embeddings)
3. [Large Language Models](#llms)
<!--te-->
@@ -226,6 +227,55 @@ kw_model = KeyBERT(model=roberta)

You can select any 🤗 transformers model [here](https://huggingface.co/models).

<a name="llms"/></a>
## 3. Large Language Models
[Back to ToC](#toc)

With `KeyLLM` you can now perform keyword extraction with Large Language Models (LLMs). You can find the full documentation [here](https://maartengr.github.io/KeyBERT/guides/keyllm.html), but two examples are common with this new method. Make sure to install the OpenAI package through `pip install openai` before you start.

First, we can ask OpenAI directly to extract keywords:

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)
```

This will query any ChatGPT model and ask it to extract keywords from text.

Second, we can find documents that are likely to have the same keywords and only extract keywords for those. This is much more efficient than asking for the keywords of every single document, since some documents are likely to have the exact same keywords. Doing so is straightforward:

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(MY_DOCUMENTS, convert_to_tensor=True)

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(MY_DOCUMENTS, embeddings=embeddings, threshold=.75)
```

You can use the `threshold` parameter to decide how similar documents need to be in order to receive the same keywords.
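To build intuition for what such a threshold does, here is a minimal sketch (not KeyBERT's internal implementation; `group_by_threshold` is a made-up name for illustration) that greedily groups documents whose embedding cosine similarity exceeds the threshold:

```python
import numpy as np

def group_by_threshold(embeddings, threshold=0.75):
    """Greedily assign each document the representative of the first
    earlier document whose cosine similarity exceeds the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities
    reps = []  # index of the representative document for each document
    for i in range(len(embeddings)):
        rep = i
        for j in range(i):
            if sims[i, j] >= threshold:
                rep = reps[j]
                break
        reps.append(rep)
    return reps

# Two near-identical embeddings and one unrelated embedding
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(group_by_threshold(emb, threshold=0.75))  # [0, 0, 2]
```

With a higher threshold, fewer documents end up sharing a representative, and more documents are sent to the LLM individually.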

## Citation
To cite KeyBERT in your work, please use the following bibtex reference:

docs/api/cohere.md

Lines changed: 3 additions & 0 deletions

# `Cohere`

::: keybert.llm._cohere.Cohere

docs/api/keyllm.md

Lines changed: 3 additions & 0 deletions

# `KeyLLM`

::: keybert._llm.KeyLLM

docs/api/langchain.md

Lines changed: 3 additions & 0 deletions

# `LangChain`

::: keybert.llm._langchain.LangChain

docs/api/litellm.md

Lines changed: 3 additions & 0 deletions

# `LiteLLM`

::: keybert.llm._litellm.LiteLLM

docs/api/openai.md

Lines changed: 3 additions & 0 deletions

# `OpenAI`

::: keybert.llm._openai.OpenAI

docs/api/textgeneration.md

Lines changed: 3 additions & 0 deletions

# `TextGeneration`

::: keybert.llm._textgeneration.TextGeneration

docs/changelog.md

Lines changed: 28 additions & 0 deletions
@@ -3,6 +3,34 @@ hide:
  - navigation
---

## **Version 0.8.0**
*Release date: 27 September, 2023*

**Highlights**:

* Use `KeyLLM` to leverage LLMs for extracting keywords
* Use it either with or without candidate keywords generated through `KeyBERT`
* Multiple LLMs are integrated: OpenAI, Cohere, LangChain, HF, and LiteLLM

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)
```

See [here](https://maartengr.github.io/KeyBERT/guides/keyllm.html) for full documentation on use cases of `KeyLLM` and [here](https://maartengr.github.io/KeyBERT/guides/llms.html) for the implemented Large Language Models.

**Fixes**:

* Enable Guided KeyBERT for seed keywords differing among docs by [@shengbo-ma](https://github.com/shengbo-ma) in [#152](https://github.com/MaartenGr/KeyBERT/pull/152)

## **Version 0.7.0**
*Release date: 3 November, 2022*

docs/guides/keyllm.md

Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
A minimal method for keyword extraction with Large Language Models (LLMs). There are a number of implementations that allow you to mix and match `KeyBERT` with `KeyLLM`. You could also choose to use `KeyLLM` without `KeyBERT`.

<div class="excalidraw">
--8<-- "docs/images/keyllm.svg"
</div>

We start with an example of some data:

```python
documents = [
"The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
"I received my package!",
"Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]
```

This data was chosen to show the different use cases and techniques. As you might have noticed, documents 1 and 2 are quite similar whereas document 3 is about an entirely different subject. This similarity will be taken into account when using `KeyBERT` together with `KeyLLM`.

Let's start with `KeyLLM` only.

# Use Cases

If you want the full performance and easiest method, you can skip the use cases below and go straight to number 5, where you will combine `KeyBERT` with `KeyLLM`.

!!! Tip
    If you want to use KeyLLM without any of the HuggingFace packages, you can install it as follows:
    `pip install keybert --no-deps`
    `pip install scikit-learn numpy rich tqdm`
    This will make the installation much smaller and the import much quicker.
31+
## 1. **Create** Keywords with `KeyLLM`

We start by creating keywords for each document. This creation process is simply asking the LLM to come up with a set of keywords for each document. The focus here is on **creating** keywords, which refers to the idea that the keywords do not necessarily need to appear in the input documents.

Install the relevant LLM first:

```bash
pip install openai
```

Then we can use any OpenAI model, such as ChatGPT, as follows:

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents)
```

This creates the following keywords:

```python
[['Website',
  'Delivery',
  'Mention',
  'Timeframe',
  'Not received',
  'Order fulfillment'],
 ['Package', 'Received', 'Delivery', 'Order fulfillment'],
 ['Powerful LLMs',
  'Limited APIs',
  'Meta',
  'Model weights',
  'Research community',
  '']]
```

## 2. **Extract** Keywords with `KeyLLM`

Instead of creating keywords out of thin air, we ask the LLM to check whether they actually appear in the text and limit the keywords to those that are found in the documents. We do this by using a custom prompt together with `check_vocab=True`:

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."

prompt = """
I have the following document:
[DOCUMENT]

Based on the information above, extract the keywords that best describe the topic of the text.
Make sure to only extract keywords that appear in the text.
Use the following format separated by commas:
<keywords>
"""
llm = OpenAI(prompt=prompt)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, check_vocab=True); keywords
```

This creates the following keywords:

```python
[['website', 'couple of days', 'deliver', 'received'],
 ['package', 'received'],
 ['LLMs',
  'APIs',
  'Meta',
  'LLaMA',
  'model weights',
  'research community',
  'noncommercial license']]
```
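Conceptually, the vocabulary check boils down to something like this toy sketch (the function `keep_in_vocab` is a made-up name for illustration and not part of KeyBERT's API):

```python
# Toy illustration of a vocabulary check: keep only the keywords that
# literally appear in the source document (case-insensitive).
def keep_in_vocab(document, keywords):
    doc = document.lower()
    return [kw for kw in keywords if kw.lower() in doc]

doc = "I received my package!"
print(keep_in_vocab(doc, ["package", "received", "delivery"]))  # ['package', 'received']
```

Keywords the LLM invented but that never occur in the text ("delivery" above) are filtered out.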
## 3. **Fine-tune** Candidate Keywords

If you already have a list of keywords, you could fine-tune them by asking the LLM to come up with nicer tags or names that better describe the documents. We can use the `[CANDIDATES]` tag in the prompt to indicate where they should be inserted:

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."

prompt = """
I have the following document:
[DOCUMENT]

With the following candidate keywords:
[CANDIDATES]

Based on the information above, improve the candidate keywords to best describe the topic of the document.

Use the following format separated by commas:
<keywords>
"""
llm = OpenAI(model="gpt-3.5-turbo", prompt=prompt, chat=True)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
candidate_keywords = [['website', 'couple of days', 'deliver', 'received'],
 ['received', 'package'],
 ['most powerful LLMs',
  'limited APIs',
  'Meta',
  "LLaMA's model weights",
  'research community',
  'noncommercial license']]
keywords = kw_model.extract_keywords(documents, candidate_keywords=candidate_keywords); keywords
```

This creates the following keywords:

```python
[['delivery timeframe', 'discrepancy', 'website', 'order status'],
 ['received package'],
 ['most powerful language models',
  'API limitations',
  "Meta's release",
  "LLaMA's model weights",
  'research community access',
  'noncommercial licensing']]
```
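Under the hood, `[DOCUMENT]` and `[CANDIDATES]` are placeholders that get filled in per document before the prompt is sent to the LLM. A rough sketch of that substitution (not KeyBERT's actual code; `fill_prompt` is a hypothetical helper) looks like:

```python
# Rough sketch of filling the [DOCUMENT] and [CANDIDATES] placeholders
# in a prompt template before sending it to the LLM.
template = """I have the following document:
[DOCUMENT]

With the following candidate keywords:
[CANDIDATES]"""

def fill_prompt(template, document, candidates):
    return (template
            .replace("[DOCUMENT]", document)
            .replace("[CANDIDATES]", ", ".join(candidates)))

filled = fill_prompt(template, "I received my package!", ["received", "package"])
print(filled)
```

Each document therefore gets its own fully rendered prompt containing both the text and its candidate keywords.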
## 4. **Efficient** `KeyLLM`

If you have embeddings of your documents, you could use those to find documents that are most similar to one another. Those documents could then all receive the same keywords, and only one of these documents will need to be passed to the LLM. This can make computation much faster as only a subset of documents will need to receive keywords.

<div class="excalidraw">
--8<-- "docs/images/efficient.svg"
</div>

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents, convert_to_tensor=True)

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.75)
```

This creates the following keywords:

```python
[['Website',
  'Delivery',
  'Mention',
  'Timeframe',
  'Not received',
  'Waiting',
  'Order fulfillment'],
 ['Received', 'Package', 'Delivery', 'Order fulfillment'],
 ['Powerful LLMs', 'Limited APIs', 'Meta', 'LLaMA', 'Model weights']]
```
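The saving comes from prompting the LLM only once per group of similar documents and copying that result to the other group members. Schematically (with a stand-in `fake_llm` instead of a real API call, and made-up names throughout):

```python
# Schematic of the efficiency trick: query the LLM once per group of
# similar documents, then propagate the result to every group member.
def extract_with_groups(documents, reps, llm):
    cache = {}  # representative index -> keywords
    results = []
    for doc, rep in zip(documents, reps):
        if rep not in cache:
            cache[rep] = llm(documents[rep])  # one LLM call per group
        results.append(cache[rep])
    return results

# Stand-in for a real LLM call: "extract" the first word as a keyword
fake_llm = lambda doc: [doc.split()[0].lower()]

docs = ["A b c", "A b d", "X y z"]
reps = [0, 0, 2]  # docs 0 and 1 share representative 0
print(extract_with_groups(docs, reps, fake_llm))  # [['a'], ['a'], ['x']]
```

Here only two LLM calls are made for three documents; with many near-duplicate documents the reduction can be substantial.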
## 5. **Efficient** `KeyLLM` + `KeyBERT`

This is the best of both worlds. We use `KeyBERT` to generate a first pass of keywords and embeddings and give those to `KeyLLM` for a final pass. Again, the most similar documents will be clustered and they will all receive the same keywords. You can change this behavior with the `threshold` parameter: a higher value will reduce the number of documents that are clustered and a lower value will increase it.

<div class="excalidraw">
--8<-- "docs/images/keybert_keyllm.svg"
</div>

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM, KeyBERT

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyBERT
kw_model = KeyBERT(llm=llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents); keywords
```

This creates the following keywords:

```python
[['Website',
  'Delivery',
  'Timeframe',
  'Mention',
  'Order fulfillment',
  'Not received',
  'Waiting'],
 ['Package', 'Received', 'Confirmation', 'Delivery', 'Order fulfillment'],
 ['LLMs', 'Limited APIs', 'Meta', 'LLaMA', 'Model weights', '']]
```
