Commit 36f0d7c

liukidar committed Dec 4, 2024
2 parents: 9645f52 + b48d8f5
Showing 4 changed files with 65 additions and 36 deletions.
README.md (7 changes: 3 additions & 4 deletions)
@@ -3,7 +3,7 @@
</h1>
<h4 align="center">
<a href="https://github.com/circlemind-ai/fast-graphrag/blob/main/LICENSE">
- <img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="fast-graphrag is released under the MIT license." />
+ <img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="fast-graphrag is released under the MIT license." alt="Fast GraphRAG by Circlemind"/>
</a>
<a href="https://github.com/circlemind-ai/fast-graphrag/blob/main/CONTRIBUTING.md">
<img src="https://img.shields.io/badge/PRs-Welcome-brightgreen" alt="PRs welcome!" />
@@ -26,13 +26,12 @@
</h4>

> [!NOTE]
- > Using *The Wizard of Oz*, `fast-graphrag` costs $0.08 vs. `graphrag` $0.48 — **a 6x costs saving** that further improves with data size and number of insertions. Stay tuned for the official benchmarks, and join us as a contributor!
+ > Using *The Wizard of Oz*, `fast-graphrag` costs $0.08 vs. `graphrag` $0.48 — **a 6x cost saving** that further improves with data size and number of insertions.
- ## News
+ ## News (and Coming Soon)
+ - [ ] Support for IDF weighting of entities
+ - [ ] Support for generic entities and concepts
- [x] [2024.12.02] Benchmarks comparing Fast GraphRAG to LightRAG, GraphRAG and VectorDBs released [here](https://github.com/circlemind-ai/fast-graphrag/blob/main/benchmarks/README.md)
- [x] [2024.12.02] We finally have a news section :P

## Features

benchmarks/README.md (66 changes: 48 additions & 18 deletions)
@@ -1,12 +1,57 @@
## Benchmarks
We validate the benchmark results reported in [HippoRAG](https://arxiv.org/abs/2405.14831) and compare against other methods:
- - NaiveRAG using the embedder `text-embedding-3-small`
+ - NaiveRAG (vector DBs) using the OpenAI embedder `text-embedding-3-small`
- [LightRAG](https://github.com/HKUDS/LightRAG)
- [GraphRAG](https://github.com/gusye1234/nano-graphrag) (we use the implementation provided by `nano-graphrag`, based on the original [Microsoft GraphRAG](https://github.com/microsoft/graphrag))

- The scripts in this directory will generate and evaluate the 2wikimultihopqa datasets on a subsets of 51 and 101 queries with the same methodology as in the HippoRAG paper. In particular, we evaluate the retrieval capabilities of each method, mesauring the percentage of queries for which all the required evidence was retrieved. We preloaded the results so it is enough to run `evaluate.xx` to get the numbers. You can also run `create_dbs.xx` to regenerate the databases for the different methods (you will need to set an OPENAI_API_KEY, LightRAG and GraphRAG could take a while (hours) to process).
### Results
All numbers below are the fraction of queries with perfect retrieval, i.e. queries for which all the required evidence was retrieved.

**2wikimultihopQA**
| # Queries | Method | All queries | Multihop only |
|----------:|:--------:|--------------:|----------------:|
| 51||||
| | VectorDB| 0.49| 0.32|
| | LightRAG| 0.47| 0.32|
| | GraphRAG| 0.75| 0.68|
| |**Circlemind**| **0.96**| **0.95**|
| 101||||
| | VectorDB| 0.42| 0.23|
| | LightRAG| 0.45| 0.28|
| | GraphRAG| 0.73| 0.64|
| |**Circlemind**| **0.93**| **0.90**|

- The output should looks similar at follow (the exact numbers could vary based on your graph configuration)
**Circlemind is up to 4x more accurate than VectorDB RAG** (0.90 vs. 0.23 on the 101-query multihop subset).

**HotpotQA**
| # Queries | Method | All queries |
|----------:|:--------:|--------------:|
| 101|||
| | VectorDB| 0.78|
| | LightRAG| 0.55|
| | GraphRAG| -*|
| |**Circlemind**| **0.84**|

*: crashed after half an hour of processing

Below are the insertion times for the 2wikimultihopqa benchmark (~800 chunks):
| Method | Time (minutes) |
|:--------:|-----------------:|
| VectorDB| ~0.3|
| LightRAG| ~25|
| GraphRAG| ~40|
|**Circlemind**| ~1.5|

**Circlemind is 27x faster than GraphRAG (~1.5 vs. ~40 minutes) while also being over 40% more accurate in retrieval (0.90 vs. 0.64 on multihop).**

### Run it yourself
The scripts in this directory generate and evaluate the 2wikimultihopqa dataset on subsets of 51 and 101 queries, following the same methodology as the HippoRAG paper. In particular, we evaluate the retrieval capabilities of each method, measuring the percentage of queries for which all the required evidence was retrieved. We preloaded the results, so it is enough to run `evaluate_dbs.xx` to get the numbers. You can also run `create_dbs.xx` to regenerate the databases for the different methods.
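
To make the metric concrete, here is a minimal sketch of the perfect-retrieval computation (illustrative only; the variable names are hypothetical and the actual benchmark scripts may differ):

```
# Minimal sketch of the "perfect retrieval" metric: a query counts as a hit
# only if every required (gold) evidence passage appears among the retrieved ones.
def perfect_retrieval_rate(per_query_results):
    """per_query_results: list of (retrieved_ids, gold_evidence_ids) pairs."""
    hits = sum(1 for retrieved, gold in per_query_results if set(gold) <= set(retrieved))
    return hits / len(per_query_results)

# Two of the three example queries retrieve all of their gold evidence -> 0.67
example = [
    ({"e1", "e2", "e3"}, {"e1", "e2"}),  # hit: all gold evidence retrieved
    ({"e1"}, {"e1"}),                    # hit
    ({"e2", "e4"}, {"e1", "e2"}),        # miss: "e1" was not retrieved
]
print(perfect_retrieval_rate(example))  # 0.666...
```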

A couple of notes:
- you will need to set an OPENAI_API_KEY (see the sketch after this list);
- LightRAG and GraphRAG can take over an hour to process and can be expensive;
- when pip-installing LightRAG, not all dependencies are installed; to run it, we simply deleted the imports of each missing dependency (since we use OpenAI, they are not needed);
- we also benchmarked on the HotpotQA dataset (we will soon release the code for that as well).
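
For reference, a sketch of one way to set the key and drive the scripts from Python, assuming shell (`.sh`) variants of `create_dbs.xx`/`evaluate_dbs.xx` exist on your platform (hypothetical file names; adjust as needed):

```
import os
import subprocess

# Placeholder key: all methods call the OpenAI API.
os.environ["OPENAI_API_KEY"] = "sk-..."

# Optional: regenerate the databases (slow, and it costs OpenAI credits).
subprocess.run(["bash", "create_dbs.sh"], check=True)

# Score retrieval on the preloaded (or regenerated) databases.
subprocess.run(["bash", "evaluate_dbs.sh"], check=True)
```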

The output will look similar to the following (exact numbers may vary depending on your graph configuration):
```
Evaluation of the performance of different RAG methods on 2wikimultihopqa (51 queries)
@@ -53,18 +98,3 @@
Loading dataset...
[all questions] Percentage of queries with perfect retrieval: 0.9306930693069307
[multihop only] Percentage of queries with perfect retrieval: 0.8985507246376812
```

- We also benchmarked on the HotpotQA dataset (we will soon release the code for that as well). Here's a preview of the results (101 queries):
-
- ```
- VectorDB: 0.78
- LightRAG [local mode]: 0.55
- GraphRAG [local mode]: - (crashed after half an hour of processing)
- Circlemind: 0.84
- ```
-
- We also briefly report the insertion times for the 2wikimultihopqa benchmark (101 queries, which corresponds to ~800 chunks):
- - VectorDB: ~20 seconds
- - LightRAG: ~25 minutes
- - GraphRAG: ~40 minutes
- - Circlemind: ~100 seconds
poetry.lock (24 changes: 12 additions & 12 deletions)

Some generated files are not rendered by default.

pyproject.toml (4 changes: 2 additions & 2 deletions)
@@ -7,15 +7,15 @@
packages = [{include = "fast_graphrag" }]
readme = "README.md"

[tool.poetry.dependencies]
- python = "^3.10.1"
+ python = "^3.10"
igraph = "^0.11.6"
xxhash = "^3.5.0"
pydantic = "^2.9.2"
scipy = "^1.14.1"
scikit-learn = "^1.5.2"
tenacity = "^9.0.0"
openai = "^1.52.1"
- scipy-stubs = "^1.14.1.3"
+ scipy-stubs = "^1.14.1.5"
hnswlib = "^0.8.0"
instructor = "^1.6.3"
requests = "^2.32.3"
