The AI Web Scraper is a powerful tool for scraping and parsing information from web pages, even those protected by CAPTCHAs. It uses Bright Data's Scraping Browser and Selenium for web scraping and BeautifulSoup for parsing HTML content. It also leverages the langchain framework to process and extract specific information from the scraped content using a large language model (LLM).
- Web Scraping with CAPTCHA Bypass: Automatically handles CAPTCHA challenges using Bright Data's Scraping Browser and Selenium.
- HTML Content Extraction: Parses and extracts specific parts of the web page, such as the `<body>` content, using BeautifulSoup.
- Content Cleaning: Cleans the extracted content by removing unnecessary scripts, styles, and whitespace.
- Custom Information Parsing: Uses an LLM (the `llama3` model) to extract specific information from the cleaned content based on user-provided descriptions.
- Streamlit Integration: Provides a user-friendly web interface for scraping and parsing web pages interactively.
- Clone this repository:

  ```bash
  git clone https://github.com/mHaines9219/ai-web-scraper.git
  cd ai-web-scraper
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
- Set Up Bright Data and Selenium: Ensure you have the necessary credentials for Bright Data's Scraping Browser and that your Selenium environment is set up correctly (a connection sketch follows these steps).
- Run the Streamlit App: Start the Streamlit web application by running:

  ```bash
  streamlit run app.py
  ```
- Enter a URL: Input the URL of the web page you want to scrape in the provided text input field.
- Scrape the Website: Click the "Scrape" button to start the web scraping process. The app connects to the Scraping Browser, solves any CAPTCHAs, and retrieves the page content.
- View and Parse Content: After scraping, the content is displayed in a text area. Describe the information you want to parse and click "Parse Content" to extract specific data using the LLM (`llama3`); see the Streamlit flow sketch after these steps.
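For the "Set Up Bright Data and Selenium" step, the connection is typically made through Selenium's remote WebDriver pointed at Bright Data's Scraping Browser endpoint. The following is a minimal sketch rather than this project's exact code: it assumes the endpoint is supplied through an `SBR_WEBDRIVER` environment variable (the variable name is illustrative; use the connection string from your Bright Data dashboard).

```python
# Hedged sketch of connecting Selenium to Bright Data's Scraping Browser.
# SBR_WEBDRIVER is assumed to hold the WebDriver endpoint from your
# Bright Data dashboard, e.g. "https://<user>:<pass>@brd.superproxy.io:9515".
import os

from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection

SBR_WEBDRIVER = os.environ["SBR_WEBDRIVER"]

def scrape_website(website: str) -> str:
    """Open the page through the Scraping Browser and return its HTML."""
    # "goog"/"chrome" are the vendor prefix and browser name for Chrome.
    connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")
    with Remote(connection, options=ChromeOptions()) as driver:
        driver.get(website)  # CAPTCHA solving happens on Bright Data's side
        return driver.page_source
```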
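The usage steps above correspond roughly to the following Streamlit flow. This is a hedged sketch of how `app.py` might wire the pieces together, not its exact code: the widget labels, session-state key, chunk size, and the module names in the imports are illustrative assumptions, while the helper functions are the ones documented below.

```python
# Hedged sketch of the Streamlit flow: URL input -> Scrape -> Parse Content.
# Module names in the imports are assumptions; the helper functions are
# those documented in the function list below.
import streamlit as st

from scrape import scrape_website, extract_body, clean_body, split_dom_content
from parse import parse_with_llama3

st.title("AI Web Scraper")

url = st.text_input("Enter a website URL")

if st.button("Scrape") and url:
    html = scrape_website(url)             # fetch via Bright Data's Scraping Browser
    body = clean_body(extract_body(html))  # keep and clean the <body> content
    st.session_state["dom_content"] = body

if "dom_content" in st.session_state:
    st.text_area("Scraped content", st.session_state["dom_content"], height=300)
    parse_description = st.text_area("Describe the information you want to parse")
    if st.button("Parse Content") and parse_description:
        chunks = split_dom_content(st.session_state["dom_content"], max_length=6000)
        st.write(parse_with_llama3(chunks, parse_description))
```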
- `scrape_website(website)`: Connects to the Scraping Browser using Selenium and Bright Data, navigates to the specified website, and returns the HTML content.
- `extract_body(html_content)`: Extracts the `<body>` content from the HTML.
- `clean_body(body)`: Cleans the extracted body content by removing scripts, styles, and unnecessary whitespace.
- `split_dom_content(dom_content, max_length)`: Splits the cleaned content into chunks of a specified maximum length for processing.
- `parse_with_llama3(dom_chunks, parse_description)`: Parses the cleaned and chunked content using the `llama3` model to extract specific information based on the user-provided description.
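A hedged sketch of how the extraction, cleaning, chunking, and parsing helpers above might be implemented with BeautifulSoup and langchain. It assumes the `langchain-ollama` integration package and a local Ollama install serving the `llama3` model; the prompt wording and default chunk size are illustrative, not the project's exact values.

```python
# Hedged sketch of the helpers listed above, using BeautifulSoup for the
# HTML work and langchain + a local Ollama llama3 model for parsing.
# The prompt text and default chunk size are illustrative.
from bs4 import BeautifulSoup
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM

def extract_body(html_content: str) -> str:
    """Return only the <body> markup from the page."""
    soup = BeautifulSoup(html_content, "html.parser")
    return str(soup.body) if soup.body else ""

def clean_body(body: str) -> str:
    """Drop scripts/styles and collapse whitespace to plain text."""
    soup = BeautifulSoup(body, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text(separator="\n")
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

def split_dom_content(dom_content: str, max_length: int = 6000) -> list[str]:
    """Split the cleaned text into chunks of at most max_length characters."""
    return [dom_content[i:i + max_length] for i in range(0, len(dom_content), max_length)]

def parse_with_llama3(dom_chunks: list[str], parse_description: str) -> str:
    """Ask llama3 to extract only the requested information from each chunk."""
    prompt = ChatPromptTemplate.from_template(
        "Extract only the information matching this description: {parse_description}\n"
        "Return nothing else.\n\nContent:\n{dom_content}"
    )
    chain = prompt | OllamaLLM(model="llama3")
    results = [
        chain.invoke({"dom_content": chunk, "parse_description": parse_description})
        for chunk in dom_chunks
    ]
    return "\n".join(results)
```

Chunking keeps each request within the model's context window, which is why `split_dom_content` takes a maximum length and the parser is run once per chunk.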
- Python 3.x
- Selenium
- BeautifulSoup4
- Streamlit
- langchain
- Bright Data Scraping Browser credentials
- CAPTCHA Handling: The scraper uses Bright Data's automated CAPTCHA-solving service. Ensure your account is properly configured for this functionality.
- Selenium Setup: Make sure you have the correct version of ChromeDriver or another WebDriver compatible with your browser (a quick smoke test is sketched after these notes).
- The UI is currently rough and could use polish.
- The LLM runs very slowly; scaling this project will require cloud computing to handle the load effectively.
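As a quick way to verify the local Selenium setup mentioned in the notes above, a short smoke test like the following can help. With Selenium 4.6+, the bundled Selenium Manager resolves a matching ChromeDriver automatically, so an explicit driver path is usually unnecessary.

```python
# Quick smoke test for a local Selenium setup. Selenium 4.6+ ships with
# Selenium Manager, which downloads a ChromeDriver matching your browser.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
print(driver.title)  # should print "Example Domain"
driver.quit()
```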
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit a pull request or open an issue if you have any ideas or improvements.
This project was inspired by, and builds on, the work of Tech With Tim on YouTube.
For any questions or feedback, please reach out to mhaines9219@gmail.com.