A knowledge graph-based RAG (Retrieval-Augmented Generation) system for Web3 and cryptocurrency domain knowledge, featuring automatic web content retrieval and PDF document processing.
- PDF document processing and conversion to text
- Automatic text cleaning and formatting
- Knowledge graph-based information storage
- Support for multiple document formats and sources
- Version control through backup/restore system
- Global and local search modes
- Context-aware response generation
- Knowledge gap detection
- Multi-hop reasoning through graph relationships
- Automatic query refinement
- Automatic online search capability
- Smart keyword extraction from queries
- Web content scraping and cleaning
- Relevance assessment of scraped content
- Seamless integration of new knowledge
- PDF whitepaper handling
- Web content extraction and cleaning
- Content relevance validation
- Structured data organization
- Duplicate content detection
- Comprehensive logging system
- Debug mode for detailed tracking
- Database backup and restoration
- Configurable model parameters
- Error handling and recovery
- Local model execution through Ollama
- Extended context window (32k tokens)
- Optimized model parameters
- Embedding generation for semantic search
- Response caching for efficiency
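For illustration, embedding generation against a locally running Ollama server can be as simple as the sketch below. The endpoint and default port are standard Ollama; the project's own wrapper code may differ.

```python
import requests

# Request an embedding from the local Ollama server (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "What is the 0x protocol?"},
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # a list of floats
print(len(embedding))
```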
```mermaid
graph TD
subgraph Input
A[PDF Documents] --> C[Document Processing]
B[Web Content] --> C
end
subgraph Knowledge Base
C --> D[Text Extraction & Cleaning]
D --> E[Knowledge Graph Construction]
E --> F[Graph Storage]
end
subgraph Query Processing
G[User Query] --> H[Query Analysis]
H --> I{Knowledge Sufficient?}
I -->|Yes| J[Generate Response]
I -->|No| K[Knowledge Enhancement]
end
subgraph Knowledge Enhancement
K --> L[Extract Keywords]
L --> M[Web Search]
M --> N[Content Scraping]
N --> O[Content Validation]
O --> P[Knowledge Integration]
P --> E
end
subgraph Output
J --> Q[Final Response]
P --> I
end
subgraph System Management
R[Logging System] --> S[Debug Logs]
T[Backup System] --> U[Version Control]
end
```
The workflow demonstrates how the system:
- Processes input from multiple sources
- Constructs and maintains a knowledge graph
- Handles queries with insufficient information
- Automatically enhances its knowledge base
- Provides comprehensive logging and backup
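In code terms, this query path amounts to a retry loop. The sketch below is purely illustrative: every `kg.*` method is a placeholder for whatever the implementation actually exposes, not the project's real API.

```python
def answer(query: str, kg, max_rounds: int = 3) -> str:
    """Illustrative query loop: answer, detect gaps, enhance, retry."""
    response = kg.query(query)                  # placeholder: GraphRAG query
    for _ in range(max_rounds):
        if not kg.has_knowledge_gap(response):  # placeholder: gap detection
            break
        keywords = kg.extract_keywords(query)   # placeholder: keyword extraction
        pages = kg.search_and_scrape(keywords)  # placeholder: web search + scraping
        kg.insert(pages)                        # grow the knowledge graph
        response = kg.query(query)              # re-query with the new knowledge
    return response
```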
- Python 3.8+
- Ollama installed and running locally
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```
- Install required packages:
  ```bash
  pip install -r requirements.txt
  ```
The required NLTK data (punkt and stopwords) will be downloaded automatically when the program runs.
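If the automatic download is blocked (for example by a firewall), the same data can be fetched manually:

```python
import nltk

# Manual equivalent of the automatic first-run download
nltk.download("punkt")
nltk.download("stopwords")
```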
- Install Ollama following the instructions at Ollama's official website.
- Set up the environment:
  ```bash
  # For Unix/macOS
  source set_env.sh
  # For Windows
  set_env.bat
  ```
The environment setup scripts define the Google Custom Search API credentials the system uses; see the Google Custom Search API section below for how to configure them.
The system requires two models to be set up in Ollama:
- Mistral Model (for the LLM):
  ```bash
  # Pull the base model
  ollama pull mistral
  # Create the custom model using our Modelfile
  ollama create mistral:ctx32k -f Mistral32k
  ```
- Nomic Embed Model (for embeddings):
  ```bash
  # Pull the embedding model
  ollama pull nomic-embed-text
  ```
Verify the models are working:
```bash
# Test Mistral model
ollama run mistral:ctx32k "Hello, how are you?"
# Test Nomic Embed model (embedding models cannot be "run" interactively,
# so request an embedding through the local API instead)
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Test embedding generation"}'
```
The custom Modelfile includes optimized parameters for our use case, including an extended context window and appropriate temperature settings.
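For reference, a Modelfile along these lines produces a 32k-context Mistral variant; the shipped `Mistral32k` file may set additional or different parameters:

```
FROM mistral
PARAMETER num_ctx 32768
PARAMETER temperature 0.7  # illustrative value; the actual setting may differ
```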
The system uses Google Custom Search API for retrieving relevant web content. To set up your API credentials:
- Create a Google Cloud Project:
  - Go to the Google Cloud Console
  - Create a new project or select an existing one
  - Enable the Custom Search API for your project
  - Create credentials (API Key)
- Set Up Custom Search Engine:
  - Go to Programmable Search Engine
  - Create a new search engine
  - Configure your search settings (recommended: search the entire web)
  - Get your Search Engine ID (cx)
- Configure Environment Variables:
  - Open either `set_env.sh` (Unix/macOS) or `set_env.bat` (Windows)
  - Replace `YOUR_GOOGLE_API_KEY_HERE` with your API key
  - Replace `YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID_HERE` with your Search Engine ID
  - Run the appropriate script:
    ```bash
    # For Unix/macOS
    source set_env.sh
    # For Windows
    set_env.bat
    ```
Note: Keep your API credentials secure and never commit them to version control. The environment setup files are already configured to be ignored by Git.
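Once configured, a Custom Search request is a single HTTP call. The environment variable names below are placeholders; use whatever names `set_env.sh`/`set_env.bat` actually export.

```python
import os
import requests

# Placeholder variable names; substitute the names exported by the env scripts.
params = {
    "key": os.environ["GOOGLE_API_KEY"],
    "cx": os.environ["GOOGLE_CSE_ID"],
    "q": "0x protocol signature authentication",
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["title"], "->", item["link"])
```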
```
.
├── logs/                  # Log files directory
├── web3_corpus/           # Working directory for GraphRAG
├── cryptoKGTutorial/
│   ├── rawWhitePapers/    # PDF whitepaper storage
│   ├── txtWhitePapers/    # Converted text files
│   └── webContent/        # Scraped web content
└── backups/               # Database backups
```
```bash
# Convert PDF whitepapers to text
python web3_graphrag_demo.py --convert
```
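Under the hood, the conversion step amounts to something like the following sketch (shown with `pypdf`; the project may use a different PDF library). Note that converted files keep the `<name>.pdf.txt` naming used by `--exclude` below.

```python
from pathlib import Path
from pypdf import PdfReader  # assumption: any PDF text-extraction library works here

# Convert every whitepaper PDF into a <name>.pdf.txt file.
src = Path("cryptoKGTutorial/rawWhitePapers")
dst = Path("cryptoKGTutorial/txtWhitePapers")
dst.mkdir(parents=True, exist_ok=True)
for pdf in src.glob("*.pdf"):
    reader = PdfReader(pdf)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    (dst / f"{pdf.name}.txt").write_text(text, encoding="utf-8")
```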
```bash
# Insert all documents
python web3_graphrag_demo.py --insert

# Insert all documents except a specific file
python web3_graphrag_demo.py --insert --exclude "Ethereum_Whitepaper_-_Buterin_2014.pdf.txt"
```
```bash
python web3_graphrag_demo.py --query "Your query here" --mode global
```
Available modes:
- `global`: Searches across all documents
- `local`: Focuses on the most relevant documents
```bash
# Create backup
python web3_graphrag_demo.py --backup [backup_name]

# List backups
python web3_graphrag_demo.py --list-backups

# Restore from backup
python web3_graphrag_demo.py --restore backup_name
```
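Conceptually, a backup is a timestamped snapshot of the GraphRAG working directory. A minimal sketch, assuming backups are plain copies of `web3_corpus` into `backups/`:

```python
import shutil
from datetime import datetime

# Assumption: a backup is a plain copy of the working directory.
name = datetime.now().strftime("backup_%Y%m%d_%H%M%S")
shutil.copytree("web3_corpus", f"backups/{name}")
```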
This example demonstrates how the system dynamically enhances its knowledge base when encountering queries it cannot initially answer.
1. First, let's create a knowledge base without the 0x protocol whitepaper:
   ```bash
   # Option 1: Insert documents excluding 0x whitepaper
   python web3_graphrag_demo.py --insert --exclude "0x_white_paper.pdf.txt"
   # Option 2: Restore from pre-made backup
   python web3_graphrag_demo.py --restore no_0x_protocol_pdf
   ```
2. Query about the 0x protocol's authentication:
   ```bash
   python web3_graphrag_demo.py --query "Tell me the details of signature authentication process of the 0x protocol. If you are not sure or you do not know the details, you can simply tell you are not sure."
   ```
   Initially, the system might respond that it lacks sufficient information to answer the query.
3. When prompted, choose to search online for more information:
   ```
   Current knowledge base might not have sufficient information. Would you like to search online for more information? (y/n): y
   ```
4. The system will extract the main keyword ("0x protocol") and present relevant URLs:
   ```
   Found the following relevant URLs:
   1. https://0x.org/
   2. https://link.0x.org/reddit
   3. https://www.0xprotocol.org/
   4. https://link.0x.org/linkedin
   5. https://docs.0xprotocol.org/en/latest/basics/orders.html
   Enter the numbers of the URLs you want to use (comma-separated) or 'all':
   ```
5. The system will then:
   - Scrape content from the selected URLs
   - Save the content for future reference
   - Update the knowledge graph
   - Automatically re-query about the 0x protocol authentication
6. If the answer is still insufficient, you can repeat the process to gather more information from additional sources.
This iterative process demonstrates the system's ability to:
- Recognize knowledge gaps
- Extract relevant search keywords
- Autonomously seek new information
- Integrate web content into its knowledge base
- Provide increasingly comprehensive answers
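The scrape-and-clean step can be pictured with a short sketch (using `requests` and BeautifulSoup here; the project's scraper and its relevance validation may work differently):

```python
import requests
from bs4 import BeautifulSoup

def scrape_clean(url: str) -> str:
    """Fetch a page and return whitespace-normalized visible text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop non-content markup before extracting text
    return " ".join(soup.get_text(separator=" ").split())

print(scrape_clean("https://www.0xprotocol.org/")[:200])
```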
Enable debug mode for detailed logging:
```bash
export GRAPHRAG_DEBUG=true
```
Logs are stored in the `logs` directory with timestamps. Debug mode provides more detailed logging information:
- Main application logs: `logs/graphrag_[timestamp].log`
- Web scraping debug logs: `logs/debug/scraping_debug_[timestamp].log`
- Scraped content: `logs/debug/content_[timestamp]_[url_hash].txt`
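A logger configuration consistent with these paths might look like the sketch below; the application's actual setup may differ.

```python
import logging
import os
from datetime import datetime

# Timestamped log file; verbosity follows the GRAPHRAG_DEBUG flag above.
debug = os.environ.get("GRAPHRAG_DEBUG", "").lower() == "true"
logging.basicConfig(
    filename=f"logs/graphrag_{datetime.now():%Y%m%d_%H%M%S}.log",
    level=logging.DEBUG if debug else logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
```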
- LLM: Mistral (32k context window)
- Embedding: Nomic Embed Text
- Both models are run locally through Ollama
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.