A knowledge graph-based RAG (Retrieval-Augmented Generation) system for Web3 and cryptocurrency domain knowledge, featuring automatic web content retrieval and PDF document processing.
- PDF document processing and conversion to text
- Automatic text cleaning and formatting
- Knowledge graph-based information storage
- Support for multiple document formats and sources
- Version control through backup/restore system
- Global and local search modes
- Context-aware response generation
- Knowledge gap detection
- Multi-hop reasoning through graph relationships
- Automatic query refinement
- Automatic online search capability
- Smart keyword extraction from queries
- Web content scraping and cleaning
- Relevance assessment of scraped content
- Seamless integration of new knowledge
- PDF whitepaper handling
- Web content extraction and cleaning
- Content relevance validation
- Structured data organization
- Duplicate content detection
- Comprehensive logging system
- Debug mode for detailed tracking
- Database backup and restoration
- Configurable model parameters
- Error handling and recovery
- Local model execution through Ollama
- Extended context window (32k tokens)
- Optimized model parameters
- Embedding generation for semantic search
- Response caching for efficiency
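For illustration, embedding generation against a locally running Ollama server can be as simple as the sketch below. The endpoint and default port are standard Ollama; the project's own wrapper code may differ.

```python
import requests

# Request an embedding from the local Ollama server (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "What is the 0x protocol?"},
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # a list of floats
print(len(embedding))
```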
```mermaid
graph TD
subgraph Input
A[PDF Documents] --> C[Document Processing]
B[Web Content] --> C
end
subgraph Knowledge Base
C --> D[Text Extraction & Cleaning]
D --> E[Knowledge Graph Construction]
E --> F[Graph Storage]
end
subgraph Query Processing
G[User Query] --> H[Query Analysis]
H --> I{Knowledge Sufficient?}
I -->|Yes| J[Generate Response]
I -->|No| K[Knowledge Enhancement]
end
subgraph Knowledge Enhancement
K --> L[Extract Keywords]
L --> M[Web Search]
M --> N[Content Scraping]
N --> O[Content Validation]
O --> P[Knowledge Integration]
P --> E
end
subgraph Output
J --> Q[Final Response]
P --> I
end
subgraph System Management
R[Logging System] --> S[Debug Logs]
T[Backup System] --> U[Version Control]
end
```
The workflow demonstrates how the system:
- Processes input from multiple sources
- Constructs and maintains a knowledge graph
- Handles queries with insufficient information
- Automatically enhances its knowledge base
- Provides comprehensive logging and backup
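In code terms, this query path amounts to a retry loop. The sketch below is purely illustrative: every `kg.*` method is a placeholder for whatever the implementation actually exposes, not the project's real API.

```python
def answer(query: str, kg, max_rounds: int = 3) -> str:
    """Illustrative query loop: answer, detect gaps, enhance, retry."""
    response = kg.query(query)                  # placeholder: GraphRAG query
    for _ in range(max_rounds):
        if not kg.has_knowledge_gap(response):  # placeholder: gap detection
            break
        keywords = kg.extract_keywords(query)   # placeholder: keyword extraction
        pages = kg.search_and_scrape(keywords)  # placeholder: web search + scraping
        kg.insert(pages)                        # grow the knowledge graph
        response = kg.query(query)              # re-query with the new knowledge
    return response
```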
- Python 3.8+
- Ollama installed and running locally
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```
- Install required packages:
  ```bash
  pip install -r requirements.txt
  ```
The required NLTK data (punkt and stopwords) will be downloaded automatically when the program runs.
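If the automatic download is blocked (for example by a firewall), the same data can be fetched manually:

```python
import nltk

# Manual equivalent of the automatic first-run download
nltk.download("punkt")
nltk.download("stopwords")
```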
- Install Ollama following the instructions at Ollama's official website.
- Set up the environment:
  ```bash
  # For Unix/macOS
  source set_env.sh
  # For Windows
  set_env.bat
  ```
The environment setup scripts define the Google Custom Search API credentials the system uses; see the Google Custom Search API section below for how to configure them.
The system requires two models to be set up in Ollama:
- Mistral Model (for the LLM):
  ```bash
  # Pull the base model
  ollama pull mistral
  # Create the custom model using our Modelfile
  ollama create mistral:ctx32k -f Mistral32k
  ```
- Nomic Embed Model (for embeddings):
  ```bash
  # Pull the embedding model
  ollama pull nomic-embed-text
  ```
Verify the models are working:
```bash
# Test Mistral model
ollama run mistral:ctx32k "Hello, how are you?"
# Test Nomic Embed model (embedding models cannot be "run" interactively,
# so request an embedding through the local API instead)
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Test embedding generation"}'
```
The custom Modelfile includes optimized parameters for our use case, including an extended context window and appropriate temperature settings.
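For reference, a Modelfile along these lines produces a 32k-context Mistral variant; the shipped `Mistral32k` file may set additional or different parameters:

```
FROM mistral
PARAMETER num_ctx 32768
PARAMETER temperature 0.7  # illustrative value; the actual setting may differ
```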
The system uses Google Custom Search API for retrieving relevant web content. To set up your API credentials:
- Create a Google Cloud Project:
  - Go to the Google Cloud Console
  - Create a new project or select an existing one
  - Enable the Custom Search API for your project
  - Create credentials (API Key)
- Set Up Custom Search Engine:
  - Go to Programmable Search Engine
  - Create a new search engine
  - Configure your search settings (recommended: search the entire web)
  - Get your Search Engine ID (cx)
- Configure Environment Variables:
  - Open either `set_env.sh` (Unix/macOS) or `set_env.bat` (Windows)
  - Replace `YOUR_GOOGLE_API_KEY_HERE` with your API key
  - Replace `YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID_HERE` with your Search Engine ID
  - Run the appropriate script:
    ```bash
    # For Unix/macOS
    source set_env.sh
    # For Windows
    set_env.bat
    ```
Note: Keep your API credentials secure and never commit them to version control. The environment setup files are already configured to be ignored by Git.
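Once configured, a Custom Search request is a single HTTP call. The environment variable names below are placeholders; use whatever names `set_env.sh`/`set_env.bat` actually export.

```python
import os
import requests

# Placeholder variable names; substitute the names exported by the env scripts.
params = {
    "key": os.environ["GOOGLE_API_KEY"],
    "cx": os.environ["GOOGLE_CSE_ID"],
    "q": "0x protocol signature authentication",
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["title"], "->", item["link"])
```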
```
.
├── logs/                  # Log files directory
├── web3_corpus/           # Working directory for GraphRAG
├── cryptoKGTutorial/
│   ├── rawWhitePapers/    # PDF whitepaper storage
│   ├── txtWhitePapers/    # Converted text files
│   └── webContent/        # Scraped web content
└── backups/               # Database backups
```
```bash
# Convert PDF whitepapers to text
python web3_graphrag_demo.py --convert
```
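Under the hood, the conversion step amounts to something like the following sketch (shown with `pypdf`; the project may use a different PDF library). Note that converted files keep the `<name>.pdf.txt` naming used by `--exclude` below.

```python
from pathlib import Path
from pypdf import PdfReader  # assumption: any PDF text-extraction library works here

# Convert every whitepaper PDF into a <name>.pdf.txt file.
src = Path("cryptoKGTutorial/rawWhitePapers")
dst = Path("cryptoKGTutorial/txtWhitePapers")
dst.mkdir(parents=True, exist_ok=True)
for pdf in src.glob("*.pdf"):
    reader = PdfReader(pdf)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    (dst / f"{pdf.name}.txt").write_text(text, encoding="utf-8")
```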
```bash
# Insert all documents
python web3_graphrag_demo.py --insert

# Insert all documents except a specific file
python web3_graphrag_demo.py --insert --exclude "Ethereum_Whitepaper_-_Buterin_2014.pdf.txt"
```
```bash
python web3_graphrag_demo.py --query "Your query here" --mode global
```
Available modes:
- `global`: Searches across all documents
- `local`: Focuses on the most relevant documents
```bash
# Create backup
python web3_graphrag_demo.py --backup [backup_name]

# List backups
python web3_graphrag_demo.py --list-backups

# Restore from backup
python web3_graphrag_demo.py --restore backup_name
```
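Conceptually, a backup is a timestamped snapshot of the GraphRAG working directory. A minimal sketch, assuming backups are plain copies of `web3_corpus` into `backups/`:

```python
import shutil
from datetime import datetime

# Assumption: a backup is a plain copy of the working directory.
name = datetime.now().strftime("backup_%Y%m%d_%H%M%S")
shutil.copytree("web3_corpus", f"backups/{name}")
```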
This example demonstrates how the system dynamically enhances its knowledge base when encountering queries it cannot initially answer.
1. First, let's create a knowledge base without the 0x protocol whitepaper:
   ```bash
   # Option 1: Insert documents excluding 0x whitepaper
   python web3_graphrag_demo.py --insert --exclude "0x_white_paper.pdf.txt"
   # Option 2: Restore from pre-made backup
   python web3_graphrag_demo.py --restore no_0x_protocol_pdf
   ```
2. Query about the 0x protocol's authentication:
   ```bash
   python web3_graphrag_demo.py --query "Tell me the details of signature authentication process of the 0x protocol. If you are not sure or you do not know the details, you can simply tell you are not sure."
   ```
   Initially, the system might respond that it lacks sufficient information to answer the query.
3. When prompted, choose to search online for more information:
   ```
   Current knowledge base might not have sufficient information. Would you like to search online for more information? (y/n): y
   ```
4. The system will extract the main keyword ("0x protocol") and present relevant URLs:
   ```
   Found the following relevant URLs:
   1. https://0x.org/
   2. https://link.0x.org/reddit
   3. https://www.0xprotocol.org/
   4. https://link.0x.org/linkedin
   5. https://docs.0xprotocol.org/en/latest/basics/orders.html
   Enter the numbers of the URLs you want to use (comma-separated) or 'all':
   ```
5. The system will then:
   - Scrape content from the selected URLs
   - Save the content for future reference
   - Update the knowledge graph
   - Automatically re-query about the 0x protocol authentication
6. If the answer is still insufficient, you can repeat the process to gather more information from additional sources.
This iterative process demonstrates the system's ability to:
- Recognize knowledge gaps
- Extract relevant search keywords
- Autonomously seek new information
- Integrate web content into its knowledge base
- Provide increasingly comprehensive answers
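The scrape-and-clean step can be pictured with a short sketch (using `requests` and BeautifulSoup here; the project's scraper and its relevance validation may work differently):

```python
import requests
from bs4 import BeautifulSoup

def scrape_clean(url: str) -> str:
    """Fetch a page and return whitespace-normalized visible text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop non-content markup before extracting text
    return " ".join(soup.get_text(separator=" ").split())

print(scrape_clean("https://www.0xprotocol.org/")[:200])
```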
Enable debug mode for detailed logging:
```bash
export GRAPHRAG_DEBUG=true
```
Logs are stored in the `logs` directory with timestamps. Debug mode provides more detailed logging information:
- Main application logs: `logs/graphrag_[timestamp].log`
- Web scraping debug logs: `logs/debug/scraping_debug_[timestamp].log`
- Scraped content: `logs/debug/content_[timestamp]_[url_hash].txt`
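A logger configuration consistent with these paths might look like the sketch below; the application's actual setup may differ.

```python
import logging
import os
from datetime import datetime

# Timestamped log file; verbosity follows the GRAPHRAG_DEBUG flag above.
debug = os.environ.get("GRAPHRAG_DEBUG", "").lower() == "true"
logging.basicConfig(
    filename=f"logs/graphrag_{datetime.now():%Y%m%d_%H%M%S}.log",
    level=logging.DEBUG if debug else logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
```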
- LLM: Mistral (32k context window)
- Embedding: Nomic Embed Text
- Both models are run locally through Ollama
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.