Web3 Knowledge Graph RAG System

A knowledge graph-based RAG (Retrieval-Augmented Generation) system for Web3 and cryptocurrency domain knowledge, featuring automatic web content retrieval and PDF document processing.

Key Features

1. Knowledge Base Management

  • PDF document processing and conversion to text
  • Automatic text cleaning and formatting
  • Knowledge graph-based information storage
  • Support for multiple document formats and sources
  • Version control through backup/restore system

2. Intelligent Query Processing

  • Global and local search modes
  • Context-aware response generation
  • Knowledge gap detection
  • Multi-hop reasoning through graph relationships
  • Automatic query refinement

3. Dynamic Knowledge Enhancement

  • Automatic online search capability
  • Smart keyword extraction from queries
  • Web content scraping and cleaning
  • Relevance assessment of scraped content
  • Seamless integration of new knowledge

4. Robust Content Processing

  • PDF whitepaper handling
  • Web content extraction and cleaning
  • Content relevance validation
  • Structured data organization
  • Duplicate content detection

5. System Management

  • Comprehensive logging system
  • Debug mode for detailed tracking
  • Database backup and restoration
  • Configurable model parameters
  • Error handling and recovery

6. Advanced LLM Integration

  • Local model execution through Ollama
  • Extended context window (32k tokens)
  • Optimized model parameters
  • Embedding generation for semantic search
  • Response caching for efficiency

System Workflow Pipeline

The pipeline is summarized in the following Mermaid diagram:

graph TD
    subgraph Input
        A[PDF Documents] --> C[Document Processing]
        B[Web Content] --> C
    end

    subgraph Knowledge Base
        C --> D[Text Extraction & Cleaning]
        D --> E[Knowledge Graph Construction]
        E --> F[Graph Storage]
    end

    subgraph Query Processing
        G[User Query] --> H[Query Analysis]
        H --> I{Knowledge Sufficient?}
        I -->|Yes| J[Generate Response]
        I -->|No| K[Knowledge Enhancement]
    end

    subgraph Knowledge Enhancement
        K --> L[Extract Keywords]
        L --> M[Web Search]
        M --> N[Content Scraping]
        N --> O[Content Validation]
        O --> P[Knowledge Integration]
        P --> E
    end

    subgraph Output
        J --> Q[Final Response]
        P --> I
    end

    subgraph System Management
        R[Logging System] --> S[Debug Logs]
        T[Backup System] --> U[Version Control]
    end

The workflow demonstrates how the system:

  1. Processes input from multiple sources
  2. Constructs and maintains a knowledge graph
  3. Handles queries with insufficient information
  4. Automatically enhances its knowledge base
  5. Provides comprehensive logging and backup
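Read as code, the enhancement loop in the diagram is a simple retry cycle: answer, check sufficiency, fetch and integrate new content, then answer again. The sketch below is purely illustrative; every helper it takes as a parameter (answer_query, is_sufficient, find_new_documents, insert_documents) is a hypothetical stand-in for the corresponding pipeline stage, not the repository's actual API.

from typing import Callable, Iterable

def answer_with_enhancement(
    query: str,
    answer_query: Callable[[str], str],
    is_sufficient: Callable[[str], bool],
    find_new_documents: Callable[[str], Iterable[str]],
    insert_documents: Callable[[Iterable[str]], None],
    max_rounds: int = 3,
) -> str:
    """Run the query -> check -> enhance -> re-query loop from the diagram."""
    answer = answer_query(query)                # Query Analysis -> Generate Response
    for _ in range(max_rounds):
        if is_sufficient(answer):               # "Knowledge Sufficient?" branch
            break
        new_docs = find_new_documents(query)    # keyword extraction, web search, scraping
        insert_documents(new_docs)              # merge validated content into the graph
        answer = answer_query(query)            # ask again against the enhanced graph
    return answer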

Prerequisites

  • Python 3.8+
  • Ollama installed and running locally

Installation

  1. Clone the repository:
git clone <repository-url>
cd <repository-name>
  2. Install required packages:
pip install -r requirements.txt

The required NLTK data (punkt and stopwords) is downloaded automatically the first time the program runs; a sketch of that step is shown at the end of this section.

  3. Install Ollama following the instructions at Ollama's official website.

  4. Set up the environment:

    # For Unix/macOS
    source set_env.sh
    
    # For Windows
    set_env.bat

    The environment setup scripts define the Google Custom Search API credentials; see the Google Custom Search Configuration section below for how to obtain and fill in your own.
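For reference, the automatic NLTK download mentioned above amounts to something like the snippet below at startup (a minimal sketch; the exact call site in the code may differ):

import nltk

# Fetch the sentence tokenizer and stopword corpora on first run; later runs reuse the cache.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)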

Setting Up Ollama Models

The system requires two models to be set up in Ollama:

  1. Mistral Model (for LLM):

    # Pull the base model
    ollama pull mistral
    
    # Create the custom model using our Modelfile
    ollama create mistral:ctx32k -f Mistral32k
  2. Nomic Embed Model (for Embeddings):

    # Pull the embedding model
    ollama pull nomic-embed-text

Verify the models are working:

# Test Mistral model
ollama run mistral:ctx32k "Hello, how are you?"

# Test Nomic Embed model (embedding-only models cannot be used with ollama run; call the embeddings API instead)
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Test embedding generation"}'

The custom Modelfile includes optimized parameters for our use case, including an extended context window and appropriate temperature settings.
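For orientation, a Modelfile with that intent can be as small as the example below. The values shown are illustrative assumptions; the authoritative settings are in the repository's Mistral32k file.

FROM mistral
# Extend the context window to 32k tokens
PARAMETER num_ctx 32768
# Illustrative temperature; the repository's file defines the value actually used
PARAMETER temperature 0.2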

Google Custom Search Configuration

The system uses Google Custom Search API for retrieving relevant web content. To set up your API credentials:

  1. Create a Google Cloud Project:

    • Go to Google Cloud Console
    • Create a new project or select an existing one
    • Enable the Custom Search API for your project
    • Create credentials (API Key)
  2. Set Up Custom Search Engine:

    • Go to Programmable Search Engine
    • Create a new search engine
    • Configure your search settings (recommended: search the entire web)
    • Get your Search Engine ID (cx)
  3. Configure Environment Variables:

    • Open either set_env.sh (Unix/macOS) or set_env.bat (Windows)
    • Replace YOUR_GOOGLE_API_KEY_HERE with your API key
    • Replace YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID_HERE with your Search Engine ID
    • Run the appropriate script:
      # For Unix/macOS
      source set_env.sh
      
      # For Windows
      set_env.bat

Note: Keep your API credentials secure and never commit them to version control. The environment setup files are already configured to be ignored by Git.
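For reference, querying the Custom Search JSON API from Python reduces to a single GET request. This is only a sketch: the environment variable names GOOGLE_API_KEY and GOOGLE_CSE_ID are placeholders for whatever set_env.sh / set_env.bat actually export.

import os
from typing import List

import requests

API_KEY = os.environ["GOOGLE_API_KEY"]   # placeholder name; use your script's variable
CSE_ID = os.environ["GOOGLE_CSE_ID"]     # placeholder name; use your script's variable

def google_search(query: str, num: int = 5) -> List[str]:
    """Return result URLs from the Google Custom Search JSON API."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CSE_ID, "q": query, "num": num},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(google_search("0x protocol signature authentication"))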

Directory Structure

.
├── logs/                 # Log files directory
├── web3_corpus/          # Working directory for GraphRAG
├── cryptoKGTutorial/
│   ├── rawWhitePapers/   # PDF whitepaper storage
│   ├── txtWhitePapers/   # Converted text files
│   └── webContent/       # Scraped web content
└── backups/              # Database backups

Usage

Converting PDF Documents

python web3_graphrag_demo.py --convert

Inserting Documents into Knowledge Base

# Insert all documents
python web3_graphrag_demo.py --insert

# Insert all documents except a specific file
python web3_graphrag_demo.py --insert --exclude "Ethereum_Whitepaper_-_Buterin_2014.pdf.txt"

Querying the Knowledge Base

python web3_graphrag_demo.py --query "Your query here" --mode global

Available modes:

  • global: Searches across all documents
  • local: Focuses on most relevant documents

Managing Backups

# Create backup
python web3_graphrag_demo.py --backup [backup_name]

# List backups
python web3_graphrag_demo.py --list-backups

# Restore from backup
python web3_graphrag_demo.py --restore backup_name

Case Study: Dynamic Knowledge Enhancement

This example demonstrates how the system dynamically enhances its knowledge base when encountering queries it cannot initially answer.

Scenario: Learning About 0x Protocol Authentication

  1. First, let's create a knowledge base without the 0x protocol whitepaper:

    # Option 1: Insert documents excluding 0x whitepaper
    python web3_graphrag_demo.py --insert --exclude "0x_white_paper.pdf.txt"
    
    # Option 2: Restore from pre-made backup
    python web3_graphrag_demo.py --restore no_0x_protocol_pdf
  2. Query about the 0x protocol's authentication:

    python web3_graphrag_demo.py --query "Tell me the details of the signature authentication process of the 0x protocol. If you are not sure or you do not know the details, you can simply say that you are not sure."

    Initially, the system might respond that it lacks sufficient information to answer the query.

  3. When prompted, choose to search online for more information:

    Current knowledge base might not have sufficient information.
    Would you like to search online for more information? (y/n): y
    
  4. The system will extract the main keyword ("0x protocol") and present relevant URLs:

    Found the following relevant URLs:
    1. https://0x.org/
    2. https://link.0x.org/reddit
    3. https://www.0xprotocol.org/
    4. https://link.0x.org/linkedin
    5. https://docs.0xprotocol.org/en/latest/basics/orders.html
    
    Enter the numbers of the URLs you want to use (comma-separated) or 'all':
    
  5. The system will:

    • Scrape content from selected URLs
    • Save the content for future reference
    • Update the knowledge graph
    • Automatically re-query about the 0x protocol authentication
  6. If the answer is still insufficient, you can repeat the process to gather more information from additional sources.

This iterative process demonstrates the system's ability to:

  • Recognize knowledge gaps
  • Extract relevant search keywords
  • Autonomously seek new information
  • Integrate web content into its knowledge base
  • Provide increasingly comprehensive answers
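To make the scrape-and-clean step concrete, here is a minimal, self-contained sketch using requests and BeautifulSoup. It is illustrative only and is not the repository's exact cleaning pipeline.

import requests
from bs4 import BeautifulSoup

def scrape_and_clean(url: str) -> str:
    """Fetch a page and return its visible text with boilerplate markup removed."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                      # drop non-content elements
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)   # collapse blank lines

print(scrape_and_clean("https://docs.0xprotocol.org/en/latest/basics/orders.html")[:500])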

Debug Mode

Enable debug mode for detailed logging:

export GRAPHRAG_DEBUG=true
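Inside the application, the flag might be consumed roughly as follows (an assumption about the implementation, shown only to indicate what the variable controls):

import logging
import os

# Hypothetical sketch: switch to verbose logging when GRAPHRAG_DEBUG is set to "true".
debug = os.environ.get("GRAPHRAG_DEBUG", "").lower() == "true"
logging.basicConfig(level=logging.DEBUG if debug else logging.INFO)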

Logging

Logs are stored in the logs directory with timestamps. Debug mode provides more detailed logging information:

  • Main application logs: logs/graphrag_[timestamp].log
  • Web scraping debug logs: logs/debug/scraping_debug_[timestamp].log
  • Scraped content: logs/debug/content_[timestamp]_[url_hash].txt

Models Used

  • LLM: Mistral (32k context window)
  • Embedding: Nomic Embed Text
  • Both models are run locally through Ollama
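Ollama serves both models over a local REST API (port 11434 by default). The sketch below shows how an application could call them; the helper names are illustrative, not the project's actual code.

from typing import List

import requests

OLLAMA_URL = "http://localhost:11434"

def generate(prompt: str) -> str:
    """Request a completion from the custom mistral:ctx32k model."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "mistral:ctx32k", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def embed(text: str) -> List[float]:
    """Request a semantic-search embedding from nomic-embed-text."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

print(generate("In one sentence, what problem does the 0x protocol solve?"))
print(len(embed("decentralized exchange order signatures")))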

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
