This project is an automatic web scraper that uses the Llama 3.1 LLM, served locally via Ollama, to parse the body content of a web page. The application is built with Streamlit for the user interface and various Python libraries for web scraping and parsing.
## Features

- Scrape the body content of a web page.
- Clean the scraped content by removing scripts and styles.
- Split the cleaned content into manageable chunks.
- Parse the content with Llama 3.1 (via Ollama) based on a user-provided description.
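The clean-and-chunk steps above can be sketched with the standard library alone (the project itself uses BeautifulSoup; `BodyTextExtractor` and `split_into_chunks` are hypothetical names for illustration):

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect page text while skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a script/style element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def split_into_chunks(text, max_len=6000):
    """Split cleaned text into chunks of at most max_len characters,
    so each chunk fits comfortably in the model's context window."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

A character-based split is the simplest strategy; a real implementation might instead split on paragraph boundaries to avoid cutting sentences in half.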
## Prerequisites

- Python 3.7 or higher
## Installation

- Create a virtual environment:

  ```bash
  python -m venv ai
  ```

- Activate the virtual environment:

  - On macOS and Linux:

    ```bash
    source ai/bin/activate
    ```

  - On Windows:

    ```bash
    .\ai\Scripts\activate
    ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

- Activate the virtual environment (if not already activated):

  - On macOS and Linux:

    ```bash
    source ai/bin/activate
    ```

  - On Windows:

    ```bash
    .\ai\Scripts\activate
    ```

- Run the Streamlit application:

  ```bash
  streamlit run main.py
  ```
- Enter the URL of the website you want to scrape in the input field.
- Click the "Scrape" button to scrape the website.
- View the DOM content in the expander section.
- Describe what you want to parse in the text area.
- Click the "Parse Content" button to parse the content based on your description.
- View the parsed results in the Streamlit app.
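The parsing step sends each DOM chunk together with the user's description to the model. A minimal sketch of how such a prompt could be assembled (the template text and `build_parse_prompt` are hypothetical; the project wires prompts through `langchain_ollama`):

```python
# Hypothetical prompt template; one prompt is built and sent per chunk.
PARSE_TEMPLATE = (
    "You are extracting information from the text content of a web page.\n"
    "Text content:\n{dom_chunk}\n\n"
    "Instruction: extract only the information that matches this "
    "description: {description}\n"
    "If nothing matches, return an empty string."
)


def build_parse_prompt(dom_chunk: str, description: str) -> str:
    """Fill the template for a single chunk of cleaned page text."""
    return PARSE_TEMPLATE.format(dom_chunk=dom_chunk, description=description)
```

With `langchain_ollama`, each prompt would then be passed to a `ChatOllama(model="llama3.1")` instance via `invoke`, and the per-chunk responses concatenated into the final result.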
## Requirements

- streamlit
- langchain
- langchain_ollama
- selenium
- beautifulsoup4
- lxml
- html5lib
- python-dotenv
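The list above corresponds to a `requirements.txt` at the project root, for example (versions unpinned here; pin them as needed):

```text
streamlit
langchain
langchain_ollama
selenium
beautifulsoup4
lxml
html5lib
python-dotenv
```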
## License

This project is licensed under the MIT License. See the LICENSE file for more details.