This project is a lightweight, modular pipeline for extracting and processing data from various sources like PDFs, text files, directories, and web pages using LangChain and Groq's LLMs.
- ✅ PDF data extraction with `PyPDFLoader`
- ✅ Directory-wise PDF processing using `DirectoryLoader`
- ✅ Raw `.txt` file summarization
- ✅ Web scraping + LLM-based question answering
- ✅ Uses `ChatGroq` with DeepSeek or LLaMA-3 models
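All of the loaders funnel into the same `ChatGroq` call. A minimal sketch of that wiring is below; the `build_prompt`/`ask_groq` helper names and the model name are assumptions for illustration, not the project's actual code:

```python
def build_prompt(question: str, context: str) -> str:
    """Hypothetical helper: pair extracted text with a user question."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def ask_groq(question: str, context: str, model: str = "llama-3.1-8b-instant") -> str:
    """Send the prompt to a Groq-hosted model via langchain-groq.

    Requires GROQ_API_KEY in the environment; the model name is an assumption.
    """
    from langchain_groq import ChatGroq  # imported lazily so build_prompt stays testable

    llm = ChatGroq(model=model, temperature=0)
    return llm.invoke(build_prompt(question, context)).content
```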
```
├── dataloader/
│   ├── directory_loader.py   # Load multiple PDFs from a folder
│   ├── pypdf_loader.py       # Load and query a single PDF
│   ├── text_loader.py        # Summarize .txt files
│   ├── webbase_loader.py     # Extract info from websites
│   ├── extra.py              # (Optional utility file)
│   ├── text.txt              # Sample text file
│   ├── data.pdf              # Sample PDF
│   └── .env                  # Stores your GROQ_API_KEY
```
Install dependencies via:

```shell
pip install -r requirements.txt
```

`requirements.txt` contains:

```
langchain-groq
groq
python-dotenv
langchain_community
pypdf
bs4
```
- Create a `.env` file:

  ```
  GROQ_API_KEY=your_groq_api_key_here
  ```
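Each script can then pick up the key via `python-dotenv`. A minimal sketch, assuming the `.env` file sits in the working directory (`require_groq_key` is a hypothetical helper name; the scripts likely just call `load_dotenv()` directly):

```python
import os


def require_groq_key() -> str:
    """Load .env if python-dotenv is available, then return GROQ_API_KEY."""
    try:
        from dotenv import load_dotenv
        load_dotenv()  # reads .env from the current working directory
    except ImportError:
        pass  # fall back to the plain process environment
    key = os.getenv("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
    return key
```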
- Run any of the scripts as needed:

  ```shell
  python pypdf_loader.py
  python webbase_loader.py
  python text_loader.py
  python directory_loader.py
  ```
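For reference, `pypdf_loader.py` presumably does something along these lines with `PyPDFLoader`; the helper names and the character budget are assumptions, not the script's actual contents:

```python
def load_pdf_text(path: str) -> str:
    """Load a PDF page-by-page with PyPDFLoader and join the text."""
    from langchain_community.document_loaders import PyPDFLoader  # needs langchain_community + pypdf

    pages = PyPDFLoader(path).load()  # returns one Document per page
    return "\n".join(doc.page_content for doc in pages)


def truncate_for_prompt(text: str, limit: int = 4000) -> str:
    """Hypothetical helper: keep the extracted text within a rough context budget."""
    return text[:limit]
```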
Example prompts used by the scripts:

- PDF: *Tell me all the education institute names of the person*
- Web: *Name of the darkest coffee*
- Text: *Summarize the following text*
This project is licensed under the MIT License.
Author: Nitesh Kumar Singh
Built with ❤️ using LangChain, Groq, and Python