
Text Document Embedding Generator


A Python tool for generating embeddings from text documents using multiple providers, including OpenAI, Mistral AI, Voyage AI, and Cohere. It splits documents into configurable-sized chunks and generates embeddings for each chunk.

New Feature: The script now adds a contextual description, generated by a large language model (LLM) such as GPT-4, for each chunk. This description situates the chunk within the overall context of the document, improving the quality of the embeddings.

Features

  • Support for Multiple Embedding Providers:
    • OpenAI
    • Mistral AI
    • Voyage AI
    • Cohere
  • Contextual Embeddings:
    • Generation of contextual descriptions for each chunk using an LLM
    • Combination of the chunk and its description to form a new chunk for embedding
  • Processing of Multiple Text Files
  • Configurable Chunk Sizing
  • Document Header Management
  • Multiple Output Formats (CSV, JSON, NPY)
  • Error Handling and Retries
  • YAML-based Configuration

Prerequisites

pip install openai tiktoken numpy pandas tqdm pyyaml requests

Configuration

Create a config.yaml file with the following structure:

api:
  provider:
    name: "openai"  # Options: "openai", "mistral", "voyage", "cohere"
    key: "your-api-key"
    model: "text-embedding-ada-002"  # The model varies by provider
  llm_model: "gpt-4"  # or "gpt-3.5-turbo" if you don't have access to GPT-4
  llm_max_input_tokens: 8192
  llm_max_output_tokens: 256
  max_retries: 3
  retry_delay: 2

paths:
  input_folder: "path/to/text/files"
  output_base: "output"

processing:
  chunk_sizes: [400, 800, 1200]
  header_lines: 2

output:
  formats:
    - csv
    - json
    - npy

Configuration Parameters

API Provider Settings

  • provider.name: Name of the embedding provider to use. Options:
    • "openai"
    • "mistral"
    • "voyage"
    • "cohere"
  • provider.key: Your API key for the selected provider.
  • provider.model: The embedding model to use. Model names vary by provider.
  • llm_model: The LLM model used to generate contextual descriptions. Examples:
    • "gpt-4"
    • "gpt-3.5-turbo"
  • llm_max_input_tokens: Maximum number of tokens for the LLM input (prompt).
  • llm_max_output_tokens: Maximum number of tokens for the LLM output (response).
  • max_retries: Maximum number of retry attempts for failed API calls.
  • retry_delay: Delay between retry attempts, in seconds.

Provider API Keys

  • OpenAI
    • Environment variable: OPENAI_API_KEY
    • Embedding model: "text-embedding-ada-002"
  • Mistral AI
    • Environment variable: MISTRAL_API_KEY
    • Embedding model: "mistral-embed"
  • Voyage AI
    • Environment variable: VOYAGE_API_KEY
    • Embedding model: "voyage-large-2"
  • Cohere
    • Environment variable: CO_API_KEY
    • Embedding model: "embed-english-v3.0"

Other Parameters

  • paths.input_folder: The folder containing the text files to process.
  • paths.output_base: The folder where results will be saved.
  • processing.chunk_sizes: List of chunk sizes in tokens.
  • processing.header_lines: Number of header lines to include in each chunk.
  • output.formats: Desired output formats (csv, json, npy).
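
Once the file is in place, reading it is straightforward with PyYAML. A minimal sketch of loading and sanity-checking the configuration (load_config is an illustrative helper):

import yaml

def load_config(path: str = "config.yaml") -> dict:
    """Load the YAML configuration and check for required sections."""
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    for section in ("api", "paths", "processing", "output"):
        if section not in config:
            raise KeyError(f"Missing required section: {section}")
    return config

config = load_config()
print(config["api"]["provider"]["name"])    # e.g. "openai"
print(config["processing"]["chunk_sizes"])  # e.g. [400, 800, 1200]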

Usage

  1. Install the required packages:

    pip install openai tiktoken numpy pandas tqdm pyyaml requests
  2. Set up your API keys for the chosen provider(s):

    # For OpenAI
    export OPENAI_API_KEY='your-api-key'
    # For Mistral AI
    export MISTRAL_API_KEY='your-api-key'
    # For Voyage AI
    export VOYAGE_API_KEY='your-api-key'
    # For Cohere
    export CO_API_KEY='your-api-key'
  3. Configure your provider and the LLM in config.yaml.

  4. Prepare your text files in the input directory specified in config.yaml.

  5. Run the script:

    python embedding_generator.py

Provider-Specific Features

OpenAI

  • High-quality embeddings
  • Extensive model options
  • Reliable API performance

Mistral AI

  • Competitive pricing
  • Good performance for multiple languages
  • Modern embedding models

Voyage AI

  • Specialized for specific use cases
  • Competitive pricing
  • Good documentation

Cohere

  • Multiple embedding types
  • Classification-specific embeddings
  • Extensive language support

Output Structure

For each configured chunk size, the script generates:

CSV (embeddings_results_{size}tok.csv)

  • filename: Source file name
  • chunk_id: Chunk identifier
  • text: Chunk content combined with its contextual description
  • embedding: Embedding vector

JSON (chunks.json)

[
  {
    "text": "Chunk content combined with its description",
    "embedding": [embedding vector],
    "metadata": {
      "filename": "file name",
      "chunk_id": "chunk identifier"
    }
  },
  ...
]

NPY (embeddings.npy)

NumPy array containing all embedding vectors.
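
A sketch of reading the three formats back into Python. The paths assume output_base: "output" and a 400-token chunk size; parsing the CSV embedding column with ast.literal_eval assumes the vector was serialized as a Python-style list string, which may differ in practice:

import ast
import json
import numpy as np
import pandas as pd

# NPY: a 2-D array, one row per chunk.
embeddings = np.load("output/embeddings.npy")
print(embeddings.shape)  # (num_chunks, embedding_dim)

# JSON: list of chunks with text, embedding, and metadata.
with open("output/chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)

# CSV: the embedding column is stored as text and must be parsed.
df = pd.read_csv("output/embeddings_results_400tok.csv")
df["embedding"] = df["embedding"].apply(ast.literal_eval)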

Error Handling

  • Provider-specific error handling
  • Automatic retry on API failure
  • Exponential backoff between attempts (sketched below)
  • Error and warning logging
  • Continues processing if a provider fails
  • Separate handling of errors related to the LLM
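
The retry behavior above can be sketched as a small helper driven by the max_retries and retry_delay settings from config.yaml (with_retries is an illustrative name, not the tool's actual API):

import logging
import time

def with_retries(call, max_retries: int = 3, retry_delay: float = 2.0):
    """Retry call() with exponential backoff, logging each failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return call()
        except Exception as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                raise
            time.sleep(retry_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...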

Methods Description

EmbeddingGenerator Class

clean_text(text: str) -> str

Cleans and normalizes text by removing extra whitespace and line breaks.
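
A plausible one-liner for this kind of normalization (a sketch; the tool's actual implementation may differ):

import re

def clean_text(text: str) -> str:
    """Collapse runs of whitespace and line breaks into single spaces."""
    return re.sub(r"\s+", " ", text).strip()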

split_into_chunks(text: str, max_tokens: int) -> List[str]

Splits text into chunks while preserving headers and respecting token limits.
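
One way such a splitter can work, sketched with tiktoken. The real implementation may choose different boundaries, but the idea is the same: reserve token budget for the header (per processing.header_lines), then slice the body:

from typing import List

import tiktoken

def split_into_chunks(text: str, max_tokens: int, header_lines: int = 2) -> List[str]:
    """Split text into token-bounded chunks, prepending the header lines to each."""
    enc = tiktoken.get_encoding("cl100k_base")
    lines = text.splitlines()
    header = "\n".join(lines[:header_lines])
    body_tokens = enc.encode("\n".join(lines[header_lines:]))
    budget = max(max_tokens - len(enc.encode(header)), 1)  # room left after the header
    chunks = []
    for i in range(0, len(body_tokens), budget):
        chunk_body = enc.decode(body_tokens[i:i + budget])
        chunks.append(f"{header}\n{chunk_body}" if header else chunk_body)
    return chunks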

get_chunk_context_description(chunk_text: str, full_text: str) -> str

Generates a brief description of the chunk's role in the text using an LLM.
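
Sketched for an OpenAI LLM with the openai>=1.0 client; the prompt wording is illustrative, and in practice full_text would be truncated to fit llm_max_input_tokens:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_chunk_context_description(chunk_text: str, full_text: str,
                                  llm_model: str = "gpt-4",
                                  max_output_tokens: int = 256) -> str:
    """Ask the LLM to situate the chunk within the full document."""
    prompt = (
        f"Here is a document:\n{full_text}\n\n"
        "Briefly describe the role of the following excerpt within the document:\n"
        f"{chunk_text}"
    )
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_output_tokens,
    )
    return response.choices[0].message.content.strip()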

get_embedding(text: str) -> Optional[List[float]]

Obtains embeddings from the selected provider with error handling and retries.
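
Sketched for the OpenAI provider (the other providers go through their own endpoints); retry logic is omitted here for brevity, as it follows the pattern shown under Error Handling:

from typing import List, Optional

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> Optional[List[float]]:
    """Fetch an embedding, returning None when the request ultimately fails."""
    try:
        response = client.embeddings.create(model=model, input=text)
        return response.data[0].embedding
    except Exception:
        # Real code would retry per max_retries/retry_delay before giving up.
        return None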

process_file(file_path: str, chunk_size: int) -> List[Dict[str, Any]]

Processes a single file by generating chunks, contextual descriptions, and embeddings.

save_results(results: List[Dict[str, Any]], chunk_size: int) -> None

Saves the results in the configured output formats.
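
A sketch of how such a writer could produce the three formats described under Output Structure, assuming each result dict carries filename, chunk_id, text, and embedding keys:

import json
import os
from typing import Any, Dict, List

import numpy as np
import pandas as pd

def save_results(results: List[Dict[str, Any]], chunk_size: int,
                 output_base: str = "output") -> None:
    """Write results as CSV, JSON, and NPY, mirroring the formats above."""
    os.makedirs(output_base, exist_ok=True)

    # CSV: one row per chunk.
    pd.DataFrame(results).to_csv(
        os.path.join(output_base, f"embeddings_results_{chunk_size}tok.csv"),
        index=False,
    )

    # JSON: nest filename and chunk_id under metadata, as documented above.
    chunks = [
        {
            "text": r["text"],
            "embedding": r["embedding"],
            "metadata": {"filename": r["filename"], "chunk_id": r["chunk_id"]},
        }
        for r in results
    ]
    with open(os.path.join(output_base, "chunks.json"), "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)

    # NPY: stack all vectors into one array.
    np.save(os.path.join(output_base, "embeddings.npy"),
            np.array([r["embedding"] for r in results]))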

Limitations

  • Requires valid API keys for the embedding provider and the LLM (if different)
  • Different rate limits per provider
  • Varying embedding dimensions between providers
  • Provider-specific model limitations
  • Processes .txt files only
  • Using the LLM may increase processing time and costs

Best Practices

  1. File Preparation

    • Ensure text files are properly encoded (UTF-8)
    • Remove any binary or non-text content
  2. Configuration

    • Adjust chunk sizes based on your needs
    • Configure appropriate retry settings
    • Set a reasonable number of header lines
    • Choose an appropriate LLM model for your use case and budget
  3. Resource Management

    • Monitor API usage
    • Consider rate limiting for large datasets
    • Regularly back up output files
    • Be aware of LLM usage to manage costs
  4. Provider and LLM Selection

    • Choose the embedding provider based on your needs:
      • OpenAI for general purpose
      • Mistral AI for multilingual support
      • Voyage AI for specialized cases
      • Cohere for classification tasks
    • Select an LLM model based on accessibility and cost:
      • Use gpt-4 for better quality if accessible
      • Use gpt-3.5-turbo for lower cost and wider availability
  5. API Management

    • Monitor usage across all APIs used
    • Consider provider-specific rate limits
    • Keep API keys secure
    • Plan for quota limits, especially when using LLMs

Provider Comparison

Provider     Strengths                  Use Cases
OpenAI       High quality, reliable     General purpose
Mistral AI   Good multilingual support  International content
Voyage AI    Specialized features       Domain-specific
Cohere       Classification focus       Text classification

Troubleshooting

Common issues and solutions:

  1. API Errors

    • Verify API keys
    • Check API rate limits and quotas
    • Ensure network connectivity
    • For LLM-related errors, check if the input exceeds token limits
  2. File Processing Issues

    • Check file encoding
    • Verify file permissions
    • Ensure valid file content
  3. Output Errors

    • Check disk space
    • Verify write permissions
    • Validate output directory structure
  4. LLM Usage Issues

    • Monitor the number of tokens used in prompts and responses
    • Adjust llm_max_input_tokens and llm_max_output_tokens if necessary
    • Ensure the combined size of the chunk and context fits within the LLM's token limits (see the sketch below)
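
To check this before sending a request, the prompt can be measured with tiktoken (a sketch; cl100k_base is the encoding used by the GPT-3.5/GPT-4 model families):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_llm_budget(chunk_text: str, context_prompt: str,
                    llm_max_input_tokens: int = 8192) -> bool:
    """Check whether chunk plus prompt stays under the LLM input limit."""
    total = len(enc.encode(context_prompt)) + len(enc.encode(chunk_text))
    return total <= llm_max_input_tokens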

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License
