This project is an automatic web scraper that uses the Llama 3.1 LLM, served locally via Ollama, to parse the body content of a web page. The application is built with Streamlit for the user interface and various Python libraries for web scraping and parsing.
## Features

- Scrape the body content of a web page.
- Clean the scraped content by removing scripts and styles.
- Split the cleaned content into manageable chunks.
- Parse the content with Llama 3.1 (via Ollama) based on a user-provided description.
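The clean-and-chunk steps above can be sketched with the standard library alone (the project itself uses BeautifulSoup; `BodyTextExtractor` and `split_into_chunks` are hypothetical names for illustration):

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect page text while skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a script/style element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def split_into_chunks(text, max_len=6000):
    """Split cleaned text into chunks of at most max_len characters,
    so each chunk fits comfortably in the model's context window."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

A character-based split is the simplest strategy; a real implementation might instead split on paragraph boundaries to avoid cutting sentences in half.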
## Prerequisites

- Python 3.7 or higher
## Installation

- Create a virtual environment:

  ```bash
  python -m venv ai
  ```

- Activate the virtual environment:

  - On macOS and Linux:

    ```bash
    source ai/bin/activate
    ```

  - On Windows:

    ```bash
    .\ai\Scripts\activate
    ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

- Activate the virtual environment (if not already activated):

  - On macOS and Linux:

    ```bash
    source ai/bin/activate
    ```

  - On Windows:

    ```bash
    .\ai\Scripts\activate
    ```

- Run the Streamlit application:

  ```bash
  streamlit run main.py
  ```
- Enter the URL of the website you want to scrape in the input field.
- Click the "Scrape" button to scrape the website.
- View the DOM content in the expander section.
- Describe what you want to parse in the text area.
- Click the "Parse Content" button to parse the content based on your description.
- View the parsed results in the Streamlit app.
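The parsing step sends each DOM chunk together with the user's description to the model. A minimal sketch of how such a prompt could be assembled (the template text and `build_parse_prompt` are hypothetical; the project wires prompts through `langchain_ollama`):

```python
# Hypothetical prompt template; one prompt is built and sent per chunk.
PARSE_TEMPLATE = (
    "You are extracting information from the text content of a web page.\n"
    "Text content:\n{dom_chunk}\n\n"
    "Instruction: extract only the information that matches this "
    "description: {description}\n"
    "If nothing matches, return an empty string."
)


def build_parse_prompt(dom_chunk: str, description: str) -> str:
    """Fill the template for a single chunk of cleaned page text."""
    return PARSE_TEMPLATE.format(dom_chunk=dom_chunk, description=description)
```

With `langchain_ollama`, each prompt would then be passed to a `ChatOllama(model="llama3.1")` instance via `invoke`, and the per-chunk responses concatenated into the final result.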
## Requirements

- streamlit
- langchain
- langchain_ollama
- selenium
- beautifulsoup4
- lxml
- html5lib
- python-dotenv
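The list above corresponds to a `requirements.txt` at the project root, for example (versions unpinned here; pin them as needed):

```text
streamlit
langchain
langchain_ollama
selenium
beautifulsoup4
lxml
html5lib
python-dotenv
```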
## License

This project is licensed under the MIT License. See the LICENSE file for more details.