SmartOps is an AI-powered RAG pipeline that scrapes messy documents from the web and parses them into structured JSON using OCR and NLP. Built with LangChain, ChromaDB, and Streamlit, it supports PDF/HTML parsing and natural language querying over the data.

CraftyEngineer/SmartOps

SmartOps: AI-Powered Document Parsing & Query Engine

SmartOps is an AI-powered data ingestion and RAG pipeline that automates the parsing of messy PDF manuals, HTML pages, and scanned documents. It extracts useful information (including tables), embeds it in a vector database, and allows natural language querying through a chat-like interface.

Built using LangChain, ChromaDB, HuggingFace, Streamlit, and OCR tools, it is ideal for knowledge-intensive tasks such as technical support, financial analysis, and enterprise automation.

🔍 Features

  • Web Scraping + Smart Downloading: Automatically searches Google, downloads PDF/HTML pages, and handles embedded links and scanned images.

  • Multi-Stage Parsing Pipeline: Uses OCR (Tesseract + OpenCV), pdfplumber, Camelot, and BeautifulSoup to extract both text and tables.

  • Vector Store with Chunked Embeddings: Splits content into semantic chunks and stores it in ChromaDB with MiniLM embeddings.

  • Chat Over Documents: Query your docs using an intuitive Streamlit UI powered by local or cloud-hosted LLMs (e.g., Groq, OpenAI, Claude).

  • Auto Context File Creation: All document data is appended to a context file for downstream use and reproducibility.
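
The chunking step described above can be illustrated with a minimal, standard-library sketch. It is not the project's actual splitter: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical, and fixed-size character windows are used here as a simple stand-in for the semantic chunking the pipeline performs before embedding.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; the real pipeline may split on
    sentences, headings, or tokens instead of raw characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice. Each returned chunk would then be embedded and written to the vector store.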

💻 How to Run

  1. Clone the repository:
git clone https://github.com/CraftyEngineer/SmartOps.git
cd SmartOps
  2. Create a virtual environment (optional but recommended):
python -m venv .venv
source .venv/bin/activate   # For Linux/Mac
.venv\Scripts\activate      # For Windows
  3. Install required packages:
pip install -r requirements.txt
  4. Set up your API keys:
export GOOGLE_API_KEY=your_key
export GOOGLE_CX=your_cx
export GROQ_API_KEY=your_groq_key
export HF_TOKEN=your_hugging_face_token
  5. Run the app:
streamlit run app.py
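
Before launching, it can help to confirm that the keys from the setup step are actually visible to Python. The following is a small sketch, not part of the repository: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and treating all four variables as required is an assumption based on the setup step above.

```python
import os
from typing import Mapping

# Variable names taken from the setup step; whether the app needs
# every one of them at startup is an assumption of this sketch.
REQUIRED_VARS = ("GOOGLE_API_KEY", "GOOGLE_CX", "GROQ_API_KEY", "HF_TOKEN")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` in the same shell session that will run Streamlit returns an empty list when everything is set, or the names that still need to be exported.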

📦 Tech Stack

  • Core:
    Python · Streamlit · LangChain · ChromaDB
  • AI/ML:
    HuggingFace Transformers · OCR (Tesseract + OpenCV)
  • Processing Tools:
    pdfplumber · Camelot · BeautifulSoup · aiohttp · asyncio

📘 Use Cases

  • Parsing scanned PDF manuals into structured data
  • Searching across unstructured financial or technical docs
  • Building internal AI assistants for documentation or SOPs
  • Rapid prototyping of domain-specific RAG pipelines
