A Python tool for generating embeddings from text documents using multiple providers, including OpenAI, Mistral AI, Voyage AI, and Cohere. It splits documents into configurable-sized chunks and generates embeddings for each chunk.
New Feature: The script now adds an LLM-generated contextual description (e.g., from GPT-4) to each chunk. This description situates the chunk within the overall text, improving the quality of the resulting embeddings.
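To make the idea concrete, here is a minimal sketch of the contextualization step, assuming the modern OpenAI Python client; the prompt wording and the function name are illustrative, not the script's exact code:

```python
# Illustrative sketch (not the script's exact code): ask an LLM to situate
# a chunk within the full document, then prepend that description so the
# combined text is what gets embedded.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def contextualize_chunk(document: str, chunk: str, llm_model: str = "gpt-4") -> str:
    prompt = (
        "Here is a document:\n"
        f"{document}\n\n"
        "Here is a chunk from that document:\n"
        f"{chunk}\n\n"
        "Write a short description situating this chunk within the document."
    )
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # mirrors llm_max_output_tokens in config.yaml
    )
    description = response.choices[0].message.content.strip()
    # The description and the original chunk together form the text to embed.
    return f"{description}\n\n{chunk}"
```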
- Support for Multiple Embedding Providers:
  - OpenAI
  - Mistral AI
  - Voyage AI
  - Cohere
- Contextual Embeddings:
  - Generation of contextual descriptions for each chunk using an LLM
  - Combination of the chunk and its description to form a new chunk for embedding
- Processing of Multiple Text Files
- Configurable Chunk Sizing
- Document Header Management
- Multiple Output Formats (CSV, JSON, NPY)
- Error Handling and Retries
- YAML-based Configuration
```bash
pip install openai tiktoken numpy pandas tqdm pyyaml requests
```

Create a `config.yaml` file with the following structure:
```yaml
api:
  provider:
    name: "openai"  # Options: "openai", "mistral", "voyage", "cohere"
    key: "your-api-key"
    model: "text-embedding-ada-002"  # The model varies by provider
  llm_model: "gpt-4"  # or "gpt-3.5-turbo" if you don't have access to GPT-4
  llm_max_input_tokens: 8192
  llm_max_output_tokens: 256
  max_retries: 3
  retry_delay: 2

paths:
  input_folder: "path/to/text/files"
  output_base: "output"

processing:
  chunk_sizes: [400, 800, 1200]
  header_lines: 2

output:
  formats:
    - csv
    - json
    - npy
```
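A minimal sketch of reading this file with PyYAML (the key names match the structure above; variable names are illustrative):

```python
# Load config.yaml and pull out the settings the script relies on.
import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

provider_name = config["api"]["provider"]["name"]  # e.g. "openai"
chunk_sizes = config["processing"]["chunk_sizes"]  # e.g. [400, 800, 1200]
output_formats = config["output"]["formats"]       # e.g. ["csv", "json", "npy"]
```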
- `provider.name`: The embedding provider to use. Options: `"openai"`, `"mistral"`, `"voyage"`, `"cohere"`.
- `provider.key`: Your API key for the selected provider.
- `provider.model`: The embedding model to use. Model names vary by provider.
- `llm_model`: The LLM used to generate contextual descriptions, e.g. `"gpt-4"` or `"gpt-3.5-turbo"`.
- `llm_max_input_tokens`: Maximum number of tokens for the LLM input (prompt).
- `llm_max_output_tokens`: Maximum number of tokens for the LLM output (response).
- `max_retries`: Maximum number of attempts in case of API call failures.
- `retry_delay`: Delay between attempts (in seconds).
- OpenAI
  - API key: `OPENAI_API_KEY`
  - Embedding model: `"text-embedding-ada-002"`
- Mistral AI
  - API key: `MISTRAL_API_KEY`
  - Embedding model: `"mistral-embed"`
- Voyage AI
  - API key: `VOYAGE_API_KEY`
  - Embedding model: `"voyage-large-2"`
- Cohere
  - API key: `CO_API_KEY`
  - Embedding model: `"embed-english-v3.0"`
- `paths.input_folder`: The folder containing the text files to process.
- `paths.output_base`: The folder where results will be saved.
- `processing.chunk_sizes`: List of chunk sizes in tokens.
- `processing.header_lines`: Number of header lines to include in each chunk.
- `output.formats`: Desired output formats (`csv`, `json`, `npy`).
1. Install the required packages:

   ```bash
   pip install openai tiktoken numpy pandas tqdm pyyaml requests
   ```

2. Set up the API key(s) for your chosen provider(s):

   ```bash
   # For OpenAI
   export OPENAI_API_KEY='your-api-key'

   # For Mistral AI
   export MISTRAL_API_KEY='your-api-key'

   # For Voyage AI
   export VOYAGE_API_KEY='your-api-key'

   # For Cohere
   export CO_API_KEY='your-api-key'
   ```

3. Configure your provider and the LLM in `config.yaml`.

4. Prepare your text files in the input directory specified in `config.yaml`.

5. Run the script:

   ```bash
   python embedding_generator.py
   ```
- OpenAI
  - High-quality embeddings
  - Extensive model options
  - Reliable API performance
- Mistral AI
  - Competitive pricing
  - Good performance for multiple languages
  - Modern embedding models
- Voyage AI
  - Specialized for specific use cases
  - Competitive pricing
  - Good documentation
- Cohere
  - Multiple embedding types
  - Classification-specific embeddings
  - Extensive language support
For each configured chunk size, the script generates:

CSV: one row per chunk with the columns:

- `filename`: Source file name
- `chunk_id`: Chunk identifier
- `text`: Chunk content combined with its contextual description
- `embedding`: Embedding vector

JSON:

```json
[
  {
    "text": "Chunk content combined with its description",
    "embedding": [embedding vector],
    "metadata": {
      "filename": "file name",
      "chunk_id": "chunk identifier"
    }
  },
  ...
]
```

NPY: a NumPy array containing all embedding vectors.
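A hedged example of reading the three formats back; the exact file names are illustrative and depend on your `output_base` and chunk size:

```python
# Load each output format; file names below are assumptions, not fixed.
import json

import numpy as np
import pandas as pd

# CSV: note the embedding column is stored as text and may need parsing.
df = pd.read_csv("output/embeddings_400.csv")

# JSON: a list of {text, embedding, metadata} records.
with open("output/embeddings_400.json", encoding="utf-8") as f:
    records = json.load(f)

# NPY: shape (num_chunks, embedding_dim).
vectors = np.load("output/embeddings_400.npy")
```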
- Provider-specific error handling
- Automatic retry on API failure
- Exponential backoff between attempts
- Error and warning logging
- Continues processing if a provider fails
- Separate handling of LLM-related errors
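A generic retry-with-exponential-backoff sketch matching the behavior described above; `max_retries` and `retry_delay` mirror the config keys, and the function name is illustrative:

```python
# Retry an API call with exponential backoff between attempts.
import time


def call_with_retries(fn, max_retries: int = 3, retry_delay: float = 2.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # in practice, catch provider-specific errors
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            wait = retry_delay * (2 ** attempt)  # exponential backoff
            print(f"Warning: attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
```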
The script's main processing steps:

- Text cleaning: cleans and normalizes text by removing extra whitespace and line breaks.
- Chunking: splits text into chunks while preserving headers and respecting token limits (a simplified sketch follows this list).
- Contextualization: generates a brief description of the chunk's role in the text using an LLM.
- Embedding: obtains embeddings from the selected provider with error handling and retries.
- File processing: processes a single file by generating chunks, contextual descriptions, and embeddings.
- Saving: saves the results in the configured output formats.
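A simplified sketch of token-based chunking with tiktoken; the function name is illustrative, and the real script also limits itself to the configured number of header lines:

```python
# Split text into token-budgeted chunks, repeating the header in each.
import tiktoken


def split_into_chunks(text: str, chunk_size: int, header: str = "") -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    header_tokens = enc.encode(header) if header else []
    # Leave room for the header inside each chunk's token budget.
    budget = max(chunk_size - len(header_tokens), 1)
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), budget):
        body = enc.decode(tokens[start:start + budget])
        chunks.append(header + body if header else body)
    return chunks
```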
- Requires valid API keys for the embedding provider and the LLM (if different)
- Different rate limits per provider
- Varying embedding dimensions between providers
- Provider-specific model limitations
- Processes `.txt` files only
- Using the LLM may increase processing time and costs
- File Preparation
  - Ensure text files are properly encoded (UTF-8)
  - Remove any binary or non-text content
- Configuration
  - Adjust chunk sizes based on your needs
  - Configure appropriate retry settings
  - Set a reasonable number of header lines
  - Choose an appropriate LLM model for your use case and budget
- Resource Management
  - Monitor API usage
  - Consider rate limiting for large datasets
  - Regularly back up output files
  - Be aware of LLM usage to manage costs
- Provider and LLM Selection
  - Choose the embedding provider based on your needs:
    - OpenAI for general purpose
    - Mistral AI for multilingual support
    - Voyage AI for specialized cases
    - Cohere for classification tasks
  - Select an LLM model based on accessibility and cost:
    - Use `gpt-4` for better quality if accessible
    - Use `gpt-3.5-turbo` for lower cost and wider availability
- API Management
  - Monitor usage across all APIs used
  - Consider provider-specific rate limits (see the throttle sketch after this list)
  - Keep API keys secure
  - Plan for quota limits, especially when using LLMs
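One simple way to respect provider rate limits is a sleep-based throttle; this is an illustrative sketch, and real limits vary by provider and plan:

```python
# Minimal request throttle: enforce a minimum interval between API calls.
import time


class RateLimiter:
    def __init__(self, requests_per_minute: int):
        self.min_interval = 60.0 / requests_per_minute
        self.last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()


limiter = RateLimiter(requests_per_minute=60)
# Call limiter.wait() before each embedding or LLM API request.
```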
| Provider | Strengths | Use Cases |
|----------|-----------|-----------|
| OpenAI | High quality, reliable | General purpose |
| Mistral | Good multilingual support | International content |
| Voyage | Specialized features | Domain-specific |
| Cohere | Classification focus | Text classification |
Common issues and solutions:
- API Errors
  - Verify API keys
  - Check API rate limits and quotas
  - Ensure network connectivity
  - For LLM-related errors, check whether the input exceeds token limits
- File Processing Issues
  - Check file encoding
  - Verify file permissions
  - Ensure valid file content
- Output Errors
  - Check disk space
  - Verify write permissions
  - Validate output directory structure
- LLM Usage Issues
  - Monitor the number of tokens used in prompts and responses (see the token-count sketch below)
  - Adjust `llm_max_input_tokens` and `llm_max_output_tokens` if necessary
  - Ensure the combined size of the chunk and context fits within the LLM's token limits
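A hedged sketch for checking prompt size against `llm_max_input_tokens` before calling the LLM; `cl100k_base` is the encoding used by GPT-3.5/GPT-4 models:

```python
# Count prompt tokens to verify it fits the configured LLM input limit.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def fits_token_limit(prompt: str, llm_max_input_tokens: int = 8192) -> bool:
    return len(enc.encode(prompt)) <= llm_max_input_tokens
```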
Contributions are welcome! Please feel free to submit a Pull Request.