chameeradesilva/ai-assistant-tri

Sri Lankan Tea Industry AI Assistant

License Python 3.12 Conda

An AI assistant for processing and analyzing Sri Lankan tea industry documents, with multi-language support and semantic search.

Project Structure

tea-ai-assistant/
├── config/                 # Configuration files
│   ├── logging.yaml       # Logging configuration
│   └── processing.yaml    # Document processing parameters
├── data/                   # Sample data and test documents
├── docs/                   # Documentation and specifications
├── src/                    # Source code
│   ├── scrapers/          # Web scraping components
│   ├── pdf_processor/     # PDF extraction and processing
│   ├── vector_db/         # Pinecone integration
│   └── utils/             # Helper functions and utilities
├── environment.yml         # Conda environment specification
├── LICENSE
└── README.md

Conda Environment Setup

  1. Create and activate the conda environment:
conda env create -f environment.yml
conda activate scraper_env
  2. Verify the Tesseract installation:
tesseract --version      # Should show version 5.3.4
tesseract --list-langs   # Should include sin (Sinhala) and tam (Tamil)
  3. Configure environment variables:
cp .env.example .env
# Update .env with your Pinecone credentials and Tesseract path
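
Once the values from .env are available in the process environment, the application can fail fast when one is missing. The following is a minimal sketch, not the project's actual loader: the variable names PINECONE_API_KEY and TESSERACT_CMD and the Settings class are assumptions for illustration, so match them to the keys in .env.example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Credentials and paths the pipeline needs at startup."""
    pinecone_api_key: str
    tesseract_cmd: str


def load_settings(env=None) -> Settings:
    """Read required settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Hypothetical variable names; align them with .env.example.
    required = ("PINECONE_API_KEY", "TESSERACT_CMD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        pinecone_api_key=env["PINECONE_API_KEY"],
        tesseract_cmd=env["TESSERACT_CMD"],
    )
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise in tests without touching real credentials.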

Processing Pipeline

graph TD
    A[Document Scraping] --> B[PDF Extraction]
    B --> C[Language Detection]
    C --> D[OCR Processing]
    D --> E[Text Chunking]
    E --> F[Embedding Generation]
    F --> G[Pinecone Storage]
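
As an illustration of the Language Detection stage, the sketch below picks a Tesseract language code by counting characters in the Sinhala and Tamil Unicode blocks. It assumes some text is already extractable from the page; the function name detect_script and the fallback to "eng" are assumptions, and the repository may use a different detection method.

```python
from collections import Counter

# Unicode blocks: Sinhala U+0D80-U+0DFF, Tamil U+0B80-U+0BFF.
# Keys are the Tesseract language codes for each script.
SCRIPT_RANGES = {
    "sin": (0x0D80, 0x0DFF),
    "tam": (0x0B80, 0x0BFF),
}


def detect_script(text: str, default: str = "eng") -> str:
    """Return the language code of the dominant script in text."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[code] += 1
                break
        else:
            # Not Sinhala or Tamil: count ASCII letters as English.
            if ch.isascii() and ch.isalpha():
                counts["eng"] += 1
    if not counts:
        return default  # digits, punctuation, or empty input
    return counts.most_common(1)[0][0]
```

The returned code can then be passed to the OCR stage as its language setting. Documents that mix scripts on one page would need per-block detection, which this sketch does not attempt.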

Configuration Management

  • Update config/processing.yaml for:
    • Chunking parameters
    • OCR confidence thresholds
    • Language-specific processing rules
  • Keep sensitive credentials in environment variables, not in the YAML files
  • Keep non-sensitive processing parameters in the YAML configuration files
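
To show how chunking parameters might be consumed, here is a minimal sketch of fixed-size chunking with overlap. The parameter names chunk_size and overlap are hypothetical and are not taken from config/processing.yaml; the project's real chunker may split on sentences or tokens instead of characters.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Slicing by character can separate a Sinhala or Tamil base letter from its combining vowel sign, so a production chunker would likely split on word or sentence boundaries.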

Best Practices:

  • Verify that the embedding dimension matches the dimension configured for the Pinecone index
  • Track OCR success rates by language
  • Log chunking efficiency metrics
  • Implement circuit breakers for API calls
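
The circuit breaker recommendation can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the repository: the class names, the default thresholds, and the injectable clock are all hypothetical. After max_failures consecutive errors it rejects calls for reset_after seconds, then allows a single trial call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure reopens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0           # any success resets the failure count
        return result
```

A call to an external API would then be wrapped as breaker.call(some_function, arguments), with CircuitOpenError handled by skipping or queueing the work until the service recovers.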

Version Compatibility

| Component       | Version | Notes                      |
|-----------------|---------|----------------------------|
| PyTorch         | 2.5.1   | CPU-only optimized         |
| SentenceBERT    | 3.4.0   | Multi-lingual variant      |
| Pinecone Client | 5.0.1   | Optimized batch operations |
