The LangChain Web Scraping and Question Answering Repository contains Python scripts that scrape text from a webpage and answer questions about the extracted content, using LangChain's tools and OpenAI's language models. The scripts demonstrate how to:
- Extract text data from a webpage using Trafilatura.
- Split the extracted text data into smaller chunks for processing.
- Utilize LangChain's question answering chains to answer user queries based on the extracted content.
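The chunking step in the list above can be illustrated with a minimal sketch. It uses only the Python standard library and is not LangChain's own text splitter; the function name `split_into_chunks` and the `chunk_size` and `overlap` defaults are assumptions chosen for illustration.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Smaller chunks keep each model request short, while the overlap reduces the chance that an answer is split across two chunks.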
To set up the repository and run the provided scripts, follow these steps:
- Clone the repository to your local machine:
git clone https://github.com/anirudhjain26/fyllo-chatbase.git
cd fyllo-chatbase
- Create a virtual environment (recommended) and activate it:
python3 -m venv venv
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Create a .env file in the root directory of the repository and set your OpenAI API key:
OPENAI_API_KEY="sk-your_key_here"
Replace your_key_here with your actual OpenAI API key.
- Run the websiteLoader.py script to perform web scraping and question answering:
python websiteLoader.py
This script will extract information from the specified webpage and answer a predefined question using LangChain's models.
- Run the textLoader.py script to perform question answering:
python textLoader.py
This script will use text from scrapedText.txt and answer a predefined question using LangChain's models.
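Both scripts depend on the OPENAI_API_KEY value stored in the .env file during setup. The sketch below shows one way such a file could be read using only the standard library. It is an illustration, not the repository's code: the scripts may well rely on a helper package such as python-dotenv instead, and the function names `parse_env` and `load_openai_key` are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_openai_key(path: str = ".env") -> str:
    """Return the API key from the environment, falling back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY", "")
    env_file = Path(path)
    if not key and env_file.exists():
        key = parse_env(env_file.read_text(encoding="utf-8")).get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file")
    return key
```

Checking the process environment first lets a key exported in the shell take precedence over the file, which is a design choice of this sketch rather than something the repository specifies.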
The repository includes the following files:
- textLoader.py: Demonstrates how to load text data from a file and perform question answering using LangChain's tools. Requires a valid OpenAI API key and a .env file.
- websiteLoader.py: Illustrates web scraping and question answering using LangChain's models. Scrapes live website data using the trafilatura module. Requires a valid OpenAI API key and a .env file.
- scrapedText.txt: Contains the text data extracted from the website https://www.fyllo.in/.
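As a rough sketch of the scraping that websiteLoader.py is described as doing, the function below downloads a page and extracts its main text with Trafilatura's `fetch_url` and `extract` functions. The function name `scrape_text` and the injectable `fetch` and `extract` parameters are assumptions added so the logic can be exercised without network access; the actual script may be organized differently.

```python
from typing import Callable, Optional


def scrape_text(
    url: str,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Download a page and return its main text content.

    By default this uses trafilatura.fetch_url and trafilatura.extract;
    the two hooks exist only so that substitutes can be passed in.
    """
    if fetch is None or extract is None:
        import trafilatura  # third-party: pip install trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    downloaded = fetch(url)  # None when the download fails
    if downloaded is None:
        raise RuntimeError(f"Could not download {url}")

    text = extract(downloaded)  # None when no main text is found
    if not text:
        raise RuntimeError(f"No main text could be extracted from {url}")
    return text


# Example (requires network access and trafilatura):
# text = scrape_text("https://www.fyllo.in/")
```

Text obtained this way could then be saved to a file such as scrapedText.txt and reused without scraping the site again.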