Automatic Web Scraper

This project is an automatic web scraper that uses the Llama 3.1 model, run locally through Ollama, to parse the body content of a web page. The application uses Streamlit for the user interface and Python libraries such as Selenium, BeautifulSoup, and LangChain for scraping and parsing.

Features

Scrape the body content of a web page.
Clean the scraped content by removing scripts and styles.
Split the content into manageable chunks.
Parse the content with Llama 3.1 (via Ollama) based on user-provided descriptions.
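
The cleaning and chunking steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, and the names clean_body_content, split_content, and the 6000-character default are assumptions made for the example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_body_content(html):
    """Return the text of the page, one non-empty stripped line per row."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "\n".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def split_content(text, max_length=6000):
    """Split text into consecutive chunks of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]


if __name__ == "__main__":
    page = "<body><h1>Title</h1>\n<script>alert('x')</script><p>Hello</p></body>"
    cleaned = clean_body_content(page)
    print(cleaned)
    print(split_content(cleaned, max_length=5))
```

Splitting by a fixed character count is the simplest choice and can cut a sentence in half; a chunk size that fits the model's context window is the main constraint.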

Demo

Installation

Prerequisites

Python 3.7 or higher

Create a virtual environment:

python -m venv ai

Activate the virtual environment:

On macOS and Linux:

source ai/bin/activate

On Windows:

.\ai\Scripts\activate

Install the dependencies:

pip install -r requirements.txt

Running the Application

Activate the virtual environment (if not already activated):

On macOS and Linux:

source ai/bin/activate

On Windows:

.\ai\Scripts\activate

Run the Streamlit application:

streamlit run main.py

Usage

Enter the URL of the website you want to scrape in the input field.
Click the "Scrape" button to scrape the website.
View the DOM content in the expander section.
Describe what you want to parse in the text area.
Click the "Parse Content" button to parse the content based on your description.
View the parsed results in the Streamlit app.
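
Behind the "Parse Content" step, each chunk is presumably sent to the model together with the user's description. The sketch below shows one way that loop could look. The prompt wording, the function name parse_chunks, and the ask_model parameter are all hypothetical; ask_model stands in for whatever callable sends a prompt to Llama 3.1 through Ollama and returns its text reply.

```python
# Hypothetical prompt wording, not taken from the project.
PROMPT_TEMPLATE = (
    "Extract only the information that matches this description: {description}\n"
    "Return an empty string if nothing matches.\n\n"
    "Content:\n{chunk}"
)


def parse_chunks(chunks, description, ask_model):
    """Ask the model about each chunk and join the non-empty answers.

    ask_model is any callable that takes a prompt string and returns the
    model's reply as a string (an assumption made for this sketch).
    """
    results = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(description=description, chunk=chunk)
        answer = ask_model(prompt).strip()
        if answer:  # skip chunks where the model found nothing
            results.append(answer)
    return "\n".join(results)


if __name__ == "__main__":
    # Stand-in model for demonstration: "finds" something only in chunks
    # that mention a price.
    def fake_model(prompt):
        return "Price found" if "$" in prompt else ""

    print(parse_chunks(["Widget costs $5", "About us"], "all prices", fake_model))
```

Passing the model call in as a parameter keeps the loop testable without a running Ollama server.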

Dependencies

streamlit
langchain
langchain_ollama
selenium
beautifulsoup4
lxml
html5lib
python-dotenv

License

This project is licensed under the MIT License. See the LICENSE file for more details.
