
Text Document Embedding Generator


A Python tool for generating embeddings from text documents using multiple providers, including OpenAI, Mistral AI, Voyage AI, and Cohere. It splits documents into configurable-sized chunks and generates embeddings for each chunk.

New Feature: The script now adds a contextual description, generated by a large language model (LLM) such as GPT-4, for each chunk. This description situates the chunk within the overall context of the document, improving the quality of the embeddings.

Features

  • Support for Multiple Embedding Providers:
    • OpenAI
    • Mistral AI
    • Voyage AI
    • Cohere
  • Contextual Embeddings:
    • Generation of contextual descriptions for each chunk using an LLM
    • Combination of the chunk and its description to form a new chunk for embedding
  • Processing of Multiple Text Files
  • Configurable Chunk Sizing
  • Document Header Management
  • Multiple Output Formats (CSV, JSON, NPY)
  • Error Handling and Retries
  • YAML-based Configuration

Prerequisites

pip install openai tiktoken numpy pandas tqdm pyyaml requests

Configuration

Create a config.yaml file with the following structure:

api:
  provider:
    name: "openai"  # Options: "openai", "mistral", "voyage", "cohere"
    key: "your-api-key"
    model: "text-embedding-ada-002"  # The model varies by provider
  llm_model: "gpt-4"  # or "gpt-3.5-turbo" if you don't have access to GPT-4
  llm_max_input_tokens: 8192
  llm_max_output_tokens: 256
  max_retries: 3
  retry_delay: 2

paths:
  input_folder: "path/to/text/files"
  output_base: "output"

processing:
  chunk_sizes: [400, 800, 1200]
  header_lines: 2

output:
  formats:
    - csv
    - json
    - npy

Configuration Parameters

API Provider Settings

  • provider.name: Name of the embedding provider to use. Options:
    • "openai"
    • "mistral"
    • "voyage"
    • "cohere"
  • provider.key: Your API key for the selected provider.
  • provider.model: The embedding model to use. Model names vary by provider.
  • llm_model: The LLM model used to generate contextual descriptions. Examples:
    • "gpt-4"
    • "gpt-3.5-turbo"
  • llm_max_input_tokens: Maximum number of tokens for the LLM input (prompt).
  • llm_max_output_tokens: Maximum number of tokens for the LLM output (response).
  • max_retries: Maximum number of retry attempts for failed API calls.
  • retry_delay: Delay between retry attempts, in seconds.

Provider API Keys

  • OpenAI
    • Environment variable: OPENAI_API_KEY
    • Embedding model: "text-embedding-ada-002"
  • Mistral AI
    • Environment variable: MISTRAL_API_KEY
    • Embedding model: "mistral-embed"
  • Voyage AI
    • Environment variable: VOYAGE_API_KEY
    • Embedding model: "voyage-large-2"
  • Cohere
    • Environment variable: CO_API_KEY
    • Embedding model: "embed-english-v3.0"

Other Parameters

  • paths.input_folder: The folder containing the text files to process.
  • paths.output_base: The folder where results will be saved.
  • processing.chunk_sizes: List of chunk sizes in tokens.
  • processing.header_lines: Number of header lines to include in each chunk.
  • output.formats: Desired output formats (csv, json, npy).
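
Once the file is in place, reading it is straightforward with PyYAML. A minimal sketch of loading and sanity-checking the configuration (load_config is an illustrative helper):

import yaml

def load_config(path: str = "config.yaml") -> dict:
    """Load the YAML configuration and check for required sections."""
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    for section in ("api", "paths", "processing", "output"):
        if section not in config:
            raise KeyError(f"Missing required section: {section}")
    return config

config = load_config()
print(config["api"]["provider"]["name"])    # e.g. "openai"
print(config["processing"]["chunk_sizes"])  # e.g. [400, 800, 1200]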

Usage

  1. Install the required packages:

    pip install openai tiktoken numpy pandas tqdm pyyaml requests
  2. Set up your API keys for the chosen provider(s):

    # For OpenAI
    export OPENAI_API_KEY='your-api-key'
    # For Mistral AI
    export MISTRAL_API_KEY='your-api-key'
    # For Voyage AI
    export VOYAGE_API_KEY='your-api-key'
    # For Cohere
    export CO_API_KEY='your-api-key'
  3. Configure your provider and the LLM in config.yaml.

  4. Prepare your text files in the input directory specified in config.yaml.

  5. Run the script:

    python embedding_generator.py

Provider-Specific Features

OpenAI

  • High-quality embeddings
  • Extensive model options
  • Reliable API performance

Mistral AI

  • Competitive pricing
  • Good performance for multiple languages
  • Modern embedding models

Voyage AI

  • Specialized for specific use cases
  • Competitive pricing
  • Good documentation

Cohere

  • Multiple embedding types
  • Classification-specific embeddings
  • Extensive language support

Output Structure

For each configured chunk size, the script generates:

CSV (embeddings_results_{size}tok.csv)

  • filename: Source file name
  • chunk_id: Chunk identifier
  • text: Chunk content combined with its contextual description
  • embedding: Embedding vector

JSON (chunks.json)

[
  {
    "text": "Chunk content combined with its description",
    "embedding": [embedding vector],
    "metadata": {
      "filename": "file name",
      "chunk_id": "chunk identifier"
    }
  },
  ...
]

NPY (embeddings.npy)

NumPy array containing all embedding vectors.
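
A sketch of reading the three formats back into Python. The paths assume output_base: "output" and a 400-token chunk size; parsing the CSV embedding column with ast.literal_eval assumes the vector was serialized as a Python-style list string, which may differ in practice:

import ast
import json
import numpy as np
import pandas as pd

# NPY: a 2-D array, one row per chunk.
embeddings = np.load("output/embeddings.npy")
print(embeddings.shape)  # (num_chunks, embedding_dim)

# JSON: list of chunks with text, embedding, and metadata.
with open("output/chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)

# CSV: the embedding column is stored as text and must be parsed.
df = pd.read_csv("output/embeddings_results_400tok.csv")
df["embedding"] = df["embedding"].apply(ast.literal_eval)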

Error Handling

  • Provider-specific error handling
  • Automatic retry on API failure
  • Exponential backoff between attempts (sketched below)
  • Error and warning logging
  • Continues processing if a provider fails
  • Separate handling of errors related to the LLM
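
The retry behavior above can be sketched as a small helper driven by the max_retries and retry_delay settings from config.yaml (with_retries is an illustrative name, not the tool's actual API):

import logging
import time

def with_retries(call, max_retries: int = 3, retry_delay: float = 2.0):
    """Retry call() with exponential backoff, logging each failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return call()
        except Exception as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                raise
            time.sleep(retry_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...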

Methods Description

EmbeddingGenerator Class

clean_text(text: str) -> str

Cleans and normalizes text by removing extra whitespace and line breaks.
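
A plausible one-liner for this kind of normalization (a sketch; the tool's actual implementation may differ):

import re

def clean_text(text: str) -> str:
    """Collapse runs of whitespace and line breaks into single spaces."""
    return re.sub(r"\s+", " ", text).strip()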

split_into_chunks(text: str, max_tokens: int) -> List[str]

Splits text into chunks while preserving headers and respecting token limits.
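
One way such a splitter can work, sketched with tiktoken. The real implementation may choose different boundaries, but the idea is the same: reserve token budget for the header (per processing.header_lines), then slice the body:

from typing import List

import tiktoken

def split_into_chunks(text: str, max_tokens: int, header_lines: int = 2) -> List[str]:
    """Split text into token-bounded chunks, prepending the header lines to each."""
    enc = tiktoken.get_encoding("cl100k_base")
    lines = text.splitlines()
    header = "\n".join(lines[:header_lines])
    body_tokens = enc.encode("\n".join(lines[header_lines:]))
    budget = max(max_tokens - len(enc.encode(header)), 1)  # room left after the header
    chunks = []
    for i in range(0, len(body_tokens), budget):
        chunk_body = enc.decode(body_tokens[i:i + budget])
        chunks.append(f"{header}\n{chunk_body}" if header else chunk_body)
    return chunks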

get_chunk_context_description(chunk_text: str, full_text: str) -> str

Generates a brief description of the chunk's role in the text using an LLM.
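
Sketched for an OpenAI LLM with the openai>=1.0 client; the prompt wording is illustrative, and in practice full_text would be truncated to fit llm_max_input_tokens:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_chunk_context_description(chunk_text: str, full_text: str,
                                  llm_model: str = "gpt-4",
                                  max_output_tokens: int = 256) -> str:
    """Ask the LLM to situate the chunk within the full document."""
    prompt = (
        f"Here is a document:\n{full_text}\n\n"
        "Briefly describe the role of the following excerpt within the document:\n"
        f"{chunk_text}"
    )
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_output_tokens,
    )
    return response.choices[0].message.content.strip()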

get_embedding(text: str) -> Optional[List[float]]

Obtains embeddings from the selected provider with error handling and retries.
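
Sketched for the OpenAI provider (the other providers go through their own endpoints); retry logic is omitted here for brevity, as it follows the pattern shown under Error Handling:

from typing import List, Optional

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> Optional[List[float]]:
    """Fetch an embedding, returning None when the request ultimately fails."""
    try:
        response = client.embeddings.create(model=model, input=text)
        return response.data[0].embedding
    except Exception:
        # Real code would retry per max_retries/retry_delay before giving up.
        return None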

process_file(file_path: str, chunk_size: int) -> List[Dict[str, Any]]

Processes a single file by generating chunks, contextual descriptions, and embeddings.

save_results(results: List[Dict[str, Any]], chunk_size: int) -> None

Saves the results in the configured output formats.
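
A sketch of how such a writer could produce the three formats described under Output Structure, assuming each result dict carries filename, chunk_id, text, and embedding keys:

import json
import os
from typing import Any, Dict, List

import numpy as np
import pandas as pd

def save_results(results: List[Dict[str, Any]], chunk_size: int,
                 output_base: str = "output") -> None:
    """Write results as CSV, JSON, and NPY, mirroring the formats above."""
    os.makedirs(output_base, exist_ok=True)

    # CSV: one row per chunk.
    pd.DataFrame(results).to_csv(
        os.path.join(output_base, f"embeddings_results_{chunk_size}tok.csv"),
        index=False,
    )

    # JSON: nest filename and chunk_id under metadata, as documented above.
    chunks = [
        {
            "text": r["text"],
            "embedding": r["embedding"],
            "metadata": {"filename": r["filename"], "chunk_id": r["chunk_id"]},
        }
        for r in results
    ]
    with open(os.path.join(output_base, "chunks.json"), "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)

    # NPY: stack all vectors into one array.
    np.save(os.path.join(output_base, "embeddings.npy"),
            np.array([r["embedding"] for r in results]))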

Limitations

  • Requires valid API keys for the embedding provider and the LLM (if different)
  • Different rate limits per provider
  • Varying embedding dimensions between providers
  • Provider-specific model limitations
  • Processes .txt files only
  • Using the LLM may increase processing time and costs

Best Practices

  1. File Preparation

    • Ensure text files are properly encoded (UTF-8)
    • Remove any binary or non-text content
  2. Configuration

    • Adjust chunk sizes based on your needs
    • Configure appropriate retry settings
    • Set a reasonable number of header lines
    • Choose an appropriate LLM model for your use case and budget
  3. Resource Management

    • Monitor API usage
    • Consider rate limiting for large datasets
    • Regularly back up output files
    • Be aware of LLM usage to manage costs
  4. Provider and LLM Selection

    • Choose the embedding provider based on your needs:
      • OpenAI for general purpose
      • Mistral AI for multilingual support
      • Voyage AI for specialized cases
      • Cohere for classification tasks
    • Select an LLM model based on accessibility and cost:
      • Use gpt-4 for better quality if accessible
      • Use gpt-3.5-turbo for lower cost and wider availability
  5. API Management

    • Monitor usage across all APIs used
    • Consider provider-specific rate limits
    • Keep API keys secure
    • Plan for quota limits, especially when using LLMs

Provider Comparison

Provider     Strengths                  Use Cases
OpenAI       High quality, reliable     General purpose
Mistral AI   Good multilingual support  International content
Voyage AI    Specialized features       Domain-specific
Cohere       Classification focus       Text classification

Troubleshooting

Common issues and solutions:

  1. API Errors

    • Verify API keys
    • Check API rate limits and quotas
    • Ensure network connectivity
    • For LLM-related errors, check if the input exceeds token limits
  2. File Processing Issues

    • Check file encoding
    • Verify file permissions
    • Ensure valid file content
  3. Output Errors

    • Check disk space
    • Verify write permissions
    • Validate output directory structure
  4. LLM Usage Issues

    • Monitor the number of tokens used in prompts and responses
    • Adjust llm_max_input_tokens and llm_max_output_tokens if necessary
    • Ensure the combined size of the chunk and context fits within the LLM's token limits (see the sketch below)
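
To check this before sending a request, the prompt can be measured with tiktoken (a sketch; cl100k_base is the encoding used by the GPT-3.5/GPT-4 model families):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_llm_budget(chunk_text: str, context_prompt: str,
                    llm_max_input_tokens: int = 8192) -> bool:
    """Check whether chunk plus prompt stays under the LLM input limit."""
    total = len(enc.encode(context_prompt)) + len(enc.encode(chunk_text))
    return total <= llm_max_input_tokens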

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License
