goranjovic55/AIEmbedder

AIEmbedder

AIEmbedder embeds documents into a vector database for semantic search. Its GUI lets you process, analyze, and search document collections.

Features

  • Process multiple document formats (TXT, HTML, PDF, DOCX)
  • Clean and chunk text with customizable parameters
  • Remove duplicates and near-duplicates based on similarity threshold
  • Generate embeddings with a configurable embedding model
  • Store and query vector embeddings with ChromaDB
  • User-friendly GUI with progress tracking and detailed logs
  • Create optimized chunks with rich metadata for GPT4All embeddings
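
The near-duplicate removal listed above can be pictured with a small sketch. This is not AIEmbedder's own code: it assumes character-level similarity from Python's standard `difflib` as a stand-in for whatever similarity measure the tool actually uses, and the function name `remove_near_duplicates` and the default threshold of 0.9 are hypothetical.

```python
from difflib import SequenceMatcher


def remove_near_duplicates(chunks, threshold=0.9):
    """Keep the first occurrence of each chunk and drop later chunks whose
    similarity to any kept chunk is at or above `threshold`.

    Assumption: similarity is difflib's character-level ratio in [0, 1].
    """
    kept = []
    for chunk in chunks:
        is_duplicate = any(
            SequenceMatcher(None, chunk, seen).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(chunk)
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog!",
        "An unrelated sentence about vector databases.",
    ]
    for chunk in remove_near_duplicates(sample):
        print(chunk)
```

Lowering the threshold removes more aggressively; a value of 1.0 drops only exact duplicates. The pairwise comparison is quadratic in the number of chunks, which is acceptable for a sketch but may be slow on large collections.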

GPT4All Integration

AIEmbedder is designed to work with GPT4All's local document embeddings. Each chunk carries metadata intended to improve semantic search precision.

Enhanced Metadata

Each document chunk includes:

  • Document context (filename, type, path)
  • Position information (beginning, middle, end)
  • Chunk statistics (index, total chunks, content length)
  • Processing parameters (cleaning level, chunk size, overlap)
  • Timestamps and file information
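
As a rough illustration of the fields listed above, the following sketch builds one metadata record per chunk. The key names, the `build_metadata` and `position_label` helpers, and the rule for deciding beginning, middle, or end are all assumptions made for this example; the actual field names in AIEmbedder may differ.

```python
from datetime import datetime, timezone
from pathlib import Path


def position_label(index, total):
    """Classify a chunk as the beginning, middle, or end of its document.

    Assumed rule: first chunk is "beginning", last chunk is "end",
    everything in between is "middle".
    """
    if index == 0:
        return "beginning"
    if index == total - 1:
        return "end"
    return "middle"


def build_metadata(path, index, total, text, cleaning_level, chunk_size, overlap):
    """Collect document context, position, statistics, and processing
    parameters for one chunk into a plain dictionary (hypothetical keys)."""
    path = Path(path)
    return {
        "filename": path.name,
        "file_type": path.suffix.lstrip(".").lower(),
        "path": str(path),
        "position": position_label(index, total),
        "chunk_index": index,
        "total_chunks": total,
        "content_length": len(text),
        "cleaning_level": cleaning_level,
        "chunk_size": chunk_size,
        "overlap": overlap,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
```

A call such as `build_metadata("docs/report.pdf", 0, 12, chunk, "medium", 400, 50)` would describe the first of twelve chunks; the cleaning level name and the numeric settings here are placeholder values.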

This metadata helps GPT4All better understand the document context and improves the semantic matching between queries and chunks. When using AIEmbedder with GPT4All:

  1. Process your documents with AIEmbedder
  2. Point GPT4All to your chunks directory
  3. Query your documents in GPT4All for more precise and contextually relevant responses

Chunks Directory

The processed chunks are stored in the configured chunks directory with a clear hierarchical structure:

  • Each document gets its own subdirectory
  • Chunks are numbered sequentially for easy navigation
  • Metadata is included as comments at the top of each file
  • Content is clearly separated from metadata with a divider
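
The layout described above could be produced along these lines. This is a sketch under stated assumptions: the `#` comment prefix, the dashed divider, and the `chunk_0001.txt` naming pattern are hypothetical, since the exact on-disk format is not spelled out here.

```python
from pathlib import Path

DIVIDER = "-" * 40  # assumed divider; the real marker may differ


def format_chunk(metadata, content):
    """Render metadata as comment lines, then a divider, then the content."""
    header = "\n".join(f"# {key}: {value}" for key, value in metadata.items())
    return f"{header}\n{DIVIDER}\n{content}\n"


def write_chunks(chunks_dir, document_name, chunks):
    """Write (metadata, content) pairs to <chunks_dir>/<document_name>/,
    one sequentially numbered file per chunk. Returns the written paths."""
    doc_dir = Path(chunks_dir) / document_name
    doc_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, (metadata, content) in enumerate(chunks, start=1):
        path = doc_dir / f"chunk_{number:04d}.txt"
        path.write_text(format_chunk(metadata, content), encoding="utf-8")
        paths.append(path)
    return paths
```

Zero-padded numbers keep the files in reading order when a directory listing is sorted by name.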

Installation

Prerequisites

  • Python 3.8 or higher
  • Tkinter (usually comes with Python)
  • A GPU (optional, for faster processing)

Setup

  1. Clone the repository:

    git clone https://github.com/goranjovic55/AIEmbedder.git
    cd AIEmbedder
    
  2. Install the required packages:

    pip install -r requirements.txt
    
  3. Run the application:

    python -m aiembedder
    

Usage

Processing Documents

  1. Launch the application
  2. Click "Add File" or "Add Directory" to select documents for processing
  3. Configure processing options if needed
  4. Click "Process Files" to start processing

Searching

  1. Click "Search" in the menu
  2. Enter your search query
  3. Use metadata filters if needed
  4. View and explore results
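
Conceptually, a search with metadata filters first narrows the candidates by metadata and then ranks the remainder by vector similarity. In AIEmbedder that work is delegated to ChromaDB; the pure-Python sketch below only illustrates the idea, and its record layout, `search` function, and exact-match filter semantics are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, records, filters=None, top_k=3):
    """Return up to `top_k` records, most similar first.

    Each record is a dict with "embedding", "metadata", and "text" keys
    (hypothetical layout). `filters` maps metadata keys to required values.
    """
    filters = filters or {}
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda record: cosine_similarity(query_vector, record["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

For instance, `search(query, records, filters={"file_type": "pdf"})` would rank only chunks that came from PDF files.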

Configuration

AIEmbedder can be configured through the Settings dialog:

  • Processing: Configure cleaning level, chunk size, and overlap
  • Database: Set collection name and persistence directory
  • Interface: Customize appearance and logging
  • Advanced: Configure embedding models and log locations
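
To show how the chunk size and overlap settings interact, here is a minimal sketch of sliding-window chunking. The `ProcessingSettings` class, its default values, and the choice of characters as the unit are assumptions for illustration, not AIEmbedder's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass
class ProcessingSettings:
    chunk_size: int = 400  # characters per chunk (assumed unit and default)
    overlap: int = 50      # characters shared by neighbouring chunks (assumed)

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.overlap < self.chunk_size:
            raise ValueError("overlap must be >= 0 and smaller than chunk_size")


def chunk_text(text, settings):
    """Split `text` into windows of `chunk_size` characters, where each
    window starts `chunk_size - overlap` characters after the previous one."""
    step = settings.chunk_size - settings.overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + settings.chunk_size])
        # Stop once a window has reached the end of the text.
        if start + settings.chunk_size >= len(text):
            break
    return chunks
```

The overlap must stay smaller than the chunk size, otherwise the window would never advance; the validation in `__post_init__` guards against that.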

Development

Project Structure

  • aiembedder/: Main package
    • gui/: GUI components
    • processing/: Text processing modules
    • vector/: Vector database components
    • processors/: Document type processors
    • utils/: Utility functions and helpers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
