AIEmbedder

AIEmbedder is a tool for embedding documents into a vector database for semantic search. It provides a GUI to process, analyze, and search document collections using embedding models.

Features

Process multiple document formats (TXT, HTML, PDF, DOCX)
Clean and chunk text with customizable parameters
Remove duplicates and near-duplicates based on similarity threshold
Generate embeddings using state-of-the-art models
Store and query vector embeddings with ChromaDB
User-friendly GUI with progress tracking and detailed logs
Create optimized chunks with rich metadata for GPT4All embeddings
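
The near-duplicate removal listed above can be illustrated with a minimal sketch. It assumes Jaccard similarity over word 3-grams and a threshold of 0.85; the function names, the similarity measure, and the default threshold are all illustrative assumptions, not AIEmbedder's actual implementation.

```python
import re


def word_shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(chunks, threshold=0.85):
    """Keep a chunk only if it stays below the threshold against every chunk kept so far."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        shingles = word_shingles(chunk)
        if any(jaccard(shingles, seen) >= threshold for seen in kept_shingles):
            continue  # too similar to an earlier chunk, so drop it
        kept.append(chunk)
        kept_shingles.append(shingles)
    return kept
```

Exact duplicates score 1.0 and are always dropped. Lowering the threshold removes looser paraphrases at the risk of discarding distinct content.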

GPT4All Integration

AIEmbedder is designed for use with GPT4All's local document embeddings. Each chunk carries metadata intended to improve semantic search precision:

Enhanced Metadata

Each document chunk includes:

Document context (filename, type, path)
Position information (beginning, middle, end)
Chunk statistics (index, total chunks, content length)
Processing parameters (cleaning level, chunk size, overlap)
Timestamps and file information

This metadata helps GPT4All understand the document context and improves the semantic matching between queries and chunks. To use AIEmbedder with GPT4All:

Process your documents with AIEmbedder
Point GPT4All to your chunks directory
Enjoy more precise and contextually relevant responses

Chunks Directory

The processed chunks are stored in the configured chunks directory with a clear hierarchical structure:

Each document gets its own subdirectory
Chunks are numbered sequentially for easy navigation
Metadata is included as comments at the top of each file
Content is clearly separated from metadata with a divider
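
As a rough illustration of the layout described above, the following sketch renders one chunk with a commented metadata header followed by a divider, and writes the chunks into a per-document subdirectory. The `#` comment prefix, the divider string, the field names, and the `chunk_0001.txt` naming scheme are assumptions made for this example; the real files may differ.

```python
from datetime import datetime, timezone
from pathlib import Path

DIVIDER = "-" * 40  # hypothetical divider between metadata and content


def position_label(index, total):
    """Label a chunk as beginning, middle, or end of its document."""
    if index == 0:
        return "beginning"
    if index == total - 1:
        return "end"
    return "middle"


def render_chunk(text, source, index, total, chunk_size, overlap):
    """Build the file content: metadata as comment lines, a divider, then the text."""
    metadata = {
        "filename": source.name,
        "type": source.suffix.lstrip(".").lower(),
        "path": str(source),
        "position": position_label(index, total),
        "chunk_index": index,
        "total_chunks": total,
        "content_length": len(text),
        "chunk_size": chunk_size,
        "overlap": overlap,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    header = "\n".join(f"# {key}: {value}" for key, value in metadata.items())
    return f"{header}\n{DIVIDER}\n{text}\n"


def write_chunks(chunks, source, out_dir, chunk_size, overlap):
    """Write sequentially numbered chunk files into a subdirectory named after the document."""
    doc_dir = Path(out_dir) / source.stem
    doc_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(chunks):
        path = doc_dir / f"chunk_{i + 1:04d}.txt"
        content = render_chunk(text, source, i, len(chunks), chunk_size, overlap)
        path.write_text(content, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_chunks(chunks, Path("docs/report.pdf"), Path("chunks"), 200, 20)` would create `chunks/report/chunk_0001.txt` and so on.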

Installation

Prerequisites

Python 3.8 or higher
Tkinter (usually comes with Python)
GPU support (optional, for faster processing)

Setup

Clone the repository:

```
git clone https://github.com/yourusername/aiembedder.git
cd aiembedder
```

Install the required packages:
```
pip install -r requirements.txt
```
Run the application:
```
python -m aiembedder
```

Usage

Processing Documents

Launch the application
Click "Add File" or "Add Directory" to select documents for processing
Configure processing options if needed
Click "Process Files" to start processing
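
What "Add Directory" needs to do can be sketched as a recursive scan for the supported formats. This is only an illustration: the extension set is taken from the Features list, while the function names and the case-insensitive matching are assumptions.

```python
from pathlib import Path

# Extensions for the formats named in the Features list.
SUPPORTED_EXTENSIONS = {".txt", ".html", ".pdf", ".docx"}


def is_supported(path):
    """Return True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def collect_documents(root):
    """Return supported files under root (or root itself if it is a supported file)."""
    root = Path(root)
    if root.is_file():
        return [root] if is_supported(root) else []
    return sorted(p for p in root.rglob("*") if p.is_file() and is_supported(p))
```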

Searching

Click "Search" in the menu
Enter your search query
Use metadata filters if needed
View and explore results
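
A search with metadata filters could look roughly like the sketch below when issued directly against ChromaDB. It relies on ChromaDB's `collection.query(query_texts=..., n_results=..., where=...)` call; the helper names and the metadata keys are assumptions for illustration and do not describe AIEmbedder's internal code.

```python
def build_where(filters):
    """Translate simple equality filters into a ChromaDB `where` clause."""
    conditions = [{key: {"$eq": value}} for key, value in filters.items()]
    if not conditions:
        return None  # no filtering
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}  # multiple conditions must be combined explicitly


def search(collection, query, filters=None, n_results=5):
    """Query a collection and pair each document with its metadata and distance."""
    result = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=build_where(filters or {}),
    )
    # query() returns one result list per query text; only one query was sent.
    return list(zip(result["documents"][0], result["metadatas"][0], result["distances"][0]))


# Usage sketch (the path and collection name are placeholders):
# import chromadb
# client = chromadb.PersistentClient(path="./chroma_db")
# collection = client.get_or_create_collection("aiembedder")
# hits = search(collection, "installation steps", {"type": "pdf"})
```

Smaller distances indicate closer matches, so the first result is the most relevant one.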

Configuration

AIEmbedder can be configured through the Settings dialog:

Processing: Configure cleaning level, chunk size, and overlap
Database: Set collection name and persistence directory
Interface: Customize appearance and logging
Advanced: Configure embedding models and log locations
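
To show how the chunk size and overlap settings interact, here is a minimal sketch of word-based chunking. It assumes sizes are measured in words; the `ProcessingSettings` class, its field names, and its default values are hypothetical and are not read from AIEmbedder's configuration.

```python
from dataclasses import dataclass


@dataclass
class ProcessingSettings:
    cleaning_level: str = "medium"  # hypothetical value
    chunk_size: int = 200           # words per chunk (assumed unit)
    chunk_overlap: int = 20         # words shared between neighbouring chunks

    def validate(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")


def chunk_words(text, settings):
    """Split text into word windows of chunk_size that overlap by chunk_overlap words."""
    settings.validate()
    words = text.split()
    step = settings.chunk_size - settings.chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + settings.chunk_size]))
        if start + settings.chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

A larger overlap preserves more context across chunk boundaries but produces more chunks for the same document.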

Development

Project Structure

aiembedder/: Main package
- gui/: GUI components
- processing/: Text processing modules
- vector/: Vector database components
- processors/: Document type processors
- utils/: Utility functions and helpers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

goranjovic55/AIEmbedder
