Merge pull request #2 from ScrapeGraphAI/pre/beta
feat: added markdownify and localscraper tools
Showing 16 changed files with 800 additions and 55 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,140 @@ | ||
# langchain-scrapegraph | ||
# 🕷️🦜 langchain-scrapegraph | ||
|
||
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) | ||
[![Python Support](https://img.shields.io/pypi/pyversions/langchain-scrapegraph.svg)](https://pypi.org/project/langchain-scrapegraph/) | ||
[![Documentation](https://img.shields.io/badge/Documentation-Latest-green)](https://scrapegraphai.com/docs) | ||
|
||
Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between [LangChain](https://github.com/langchain-ai/langchain) and [ScrapeGraph AI](https://scrapegraphai.com), enabling your agents to extract structured data from websites using natural language. | ||
|
||
## 📦 Installation | ||
|
||
```bash | ||
pip install langchain-scrapegraph | ||
``` | ||
|
||
## 🛠️ Available Tools | ||
|
||
### 📝 MarkdownifyTool | ||
Convert any webpage into clean, formatted markdown. | ||
|
||
```python | ||
from langchain_scrapegraph.tools import MarkdownifyTool | ||
|
||
tool = MarkdownifyTool() | ||
markdown = tool.invoke({"website_url": "https://example.com"}) | ||
|
||
print(markdown) | ||
``` | ||
|
||
### 🔍 SmartScraperTool | ||
Extract structured data from any webpage using natural language prompts. | ||
|
||
```python | ||
from langchain_scrapegraph.tools import SmartScraperTool | ||
|
||
# Initialize the tool (uses SGAI_API_KEY from environment) | ||
tool = SmartScraperTool() | ||
|
||
# Extract information using natural language | ||
result = tool.invoke({ | ||
    "website_url": "https://www.example.com", | ||
    "user_prompt": "Extract the main heading and first paragraph" | ||
}) | ||
|
||
print(result) | ||
``` | ||
|
||
### 💻 LocalScraperTool | ||
Extract information from HTML content using AI. | ||
|
||
```python | ||
from langchain_scrapegraph.tools import LocalScraperTool | ||
|
||
tool = LocalScraperTool() | ||
result = tool.invoke({ | ||
    "user_prompt": "Extract all contact information", | ||
    "website_html": "<html>...</html>" | ||
}) | ||
|
||
print(result) | ||
``` | ||
|
||
## 🌟 Key Features | ||
|
||
- 🐦 **LangChain Integration**: Seamlessly works with LangChain agents and chains | ||
- 🔍 **AI-Powered Extraction**: Use natural language to describe what data to extract | ||
- 📊 **Structured Output**: Get clean, structured data ready for your agents | ||
- 🔄 **Flexible Tools**: Choose from multiple specialized scraping tools | ||
- ⚡ **Async Support**: Built-in support for async operations | ||
|
||
## 💡 Use Cases | ||
|
||
- 📖 **Research Agents**: Create agents that gather and analyze web data | ||
- 📊 **Data Collection**: Automate structured data extraction from websites | ||
- 📝 **Content Processing**: Convert web content into markdown for further processing | ||
- 🔍 **Information Extraction**: Extract specific data points using natural language | ||
|
||
## 🤖 Example Agent | ||
|
||
```python | ||
from langchain.agents import initialize_agent, AgentType | ||
from langchain_scrapegraph.tools import SmartScraperTool | ||
from langchain_openai import ChatOpenAI | ||
|
||
# Initialize tools | ||
tools = [ | ||
    SmartScraperTool(), | ||
] | ||
|
||
# Create an agent | ||
agent = initialize_agent( | ||
    tools=tools, | ||
    llm=ChatOpenAI(temperature=0), | ||
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, | ||
    verbose=True | ||
) | ||
|
||
# Use the agent | ||
response = agent.run(""" | ||
Visit example.com, make a summary of the content and extract the main heading and first paragraph | ||
""") | ||
``` | ||
|
||
## ⚙️ Configuration | ||
|
||
Set your ScrapeGraph API key in your environment: | ||
```bash | ||
export SGAI_API_KEY="your-api-key-here" | ||
``` | ||
|
||
Or set it programmatically: | ||
```python | ||
import os | ||
os.environ["SGAI_API_KEY"] = "your-api-key-here" | ||
``` | ||
|
||
## 📚 Documentation | ||
|
||
- [API Documentation](https://scrapegraphai.com/docs) | ||
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction.html) | ||
- [Examples](examples/) | ||
|
||
## 💬 Support & Feedback | ||
|
||
- 📧 Email: support@scrapegraphai.com | ||
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues) | ||
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues/new) | ||
|
||
## 📄 License | ||
|
||
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. | ||
|
||
## 🙏 Acknowledgments | ||
|
||
This project is built on top of: | ||
- [LangChain](https://github.com/langchain-ai/langchain) | ||
- [ScrapeGraph AI](https://scrapegraphai.com) | ||
|
||
--- | ||
|
||
Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com) |
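
Review note on the README's Configuration section: every tool reads `SGAI_API_KEY` from the environment, so a missing key only surfaces when a tool is constructed or called. A minimal sketch of a fail-fast check follows. `require_api_key` is a hypothetical helper written for illustration and is not part of langchain-scrapegraph; only the variable name `SGAI_API_KEY` comes from the README.

```python
import os


def require_api_key(env=None, name="SGAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of langchain-scrapegraph.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it or assign os.environ[{name!r}] first"
        )
    return key
```

Calling `require_api_key()` once at startup, before building the tool list, would turn a late authentication failure into an immediate and readable error.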
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
""" | ||
Remember to install the additional dependencies for this example to work: | ||
pip install langchain-openai langchain python-dotenv | ||
""" | ||
|
||
from dotenv import load_dotenv | ||
from langchain.agents import AgentExecutor, create_openai_functions_agent | ||
from langchain_core.messages import SystemMessage | ||
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder | ||
from langchain_openai import ChatOpenAI | ||
|
||
from langchain_scrapegraph.tools import ( | ||
    GetCreditsTool, | ||
    LocalScraperTool, | ||
    SmartScraperTool, | ||
) | ||
|
||
load_dotenv() | ||
|
||
# Initialize the tools | ||
tools = [ | ||
    SmartScraperTool(), | ||
    LocalScraperTool(), | ||
    GetCreditsTool(), | ||
] | ||
|
||
# Create the prompt template | ||
prompt = ChatPromptTemplate.from_messages( | ||
    [ | ||
        SystemMessage( | ||
            content=( | ||
                "You are a helpful AI assistant that can analyze websites and extract information. " | ||
                "You have access to tools that can help you scrape and process web content. " | ||
                "Always explain what you're doing before using a tool." | ||
            ) | ||
        ), | ||
        MessagesPlaceholder(variable_name="chat_history", optional=True), | ||
        ("user", "{input}"), | ||
        MessagesPlaceholder(variable_name="agent_scratchpad"), | ||
    ] | ||
) | ||
|
||
# Initialize the LLM | ||
llm = ChatOpenAI(temperature=0) | ||
|
||
# Create the agent | ||
agent = create_openai_functions_agent(llm, tools, prompt) | ||
|
||
# Create the executor | ||
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True) | ||
|
||
# Example usage | ||
query = """Extract the main products from https://www.scrapegraphai.com/""" | ||
|
||
print("\nQuery:", query, "\n") | ||
response = agent_executor.invoke({"input": query}) | ||
print("\nFinal Response:", response["output"]) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,13 @@ | ||
from scrapegraph_py.logger import sgai_logger | ||
|
||
from langchain_scrapegraph.tools import GetCreditsTool | ||
|
||
# Will automatically get SGAI_API_KEY from environment, or set it manually | ||
sgai_logger.set_logging(level="INFO") | ||
|
||
# Will automatically get SGAI_API_KEY from environment | ||
tool = GetCreditsTool() | ||
credits = tool.run() | ||
|
||
print("\nCredits Information:") | ||
print(f"Remaining Credits: {credits['remaining_credits']}") | ||
print(f"Total Credits Used: {credits['total_credits_used']}") | ||
# Use the tool | ||
credits = tool.invoke({}) | ||
|
||
print(credits) |
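
Review note: the removed lines indexed the result as a dict (`credits['remaining_credits']`, `credits['total_credits_used']`), while the updated example prints whatever `invoke` returns. If formatted output is still wanted, a defensive formatter along these lines might help. This is a sketch only: `format_credits` is a hypothetical helper, the key names are taken from the removed lines, and whether the tool still returns them is an assumption.

```python
def format_credits(result):
    """Format a credits result without assuming its exact shape.

    Hypothetical helper; key names come from the lines removed in this hunk.
    """
    if not isinstance(result, dict):
        # Fall back to the raw value, as the updated example does.
        return str(result)
    remaining = result.get("remaining_credits", "unknown")
    used = result.get("total_credits_used", "unknown")
    return f"Remaining Credits: {remaining}\nTotal Credits Used: {used}"
```

Using `.get` with a fallback avoids a `KeyError` if the response shape changes between releases.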
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
from scrapegraph_py.logger import sgai_logger | ||
|
||
from langchain_scrapegraph.tools import LocalScraperTool | ||
|
||
sgai_logger.set_logging(level="INFO") | ||
|
||
# Will automatically get SGAI_API_KEY from environment | ||
tool = LocalScraperTool() | ||
|
||
# Example website and prompt | ||
html_content = """ | ||
<html> | ||
<body> | ||
<h1>Company Name</h1> | ||
<p>We are a technology company focused on AI solutions.</p> | ||
<div class="contact"> | ||
<p>Email: contact@example.com</p> | ||
<p>Phone: (555) 123-4567</p> | ||
</div> | ||
</body> | ||
</html> | ||
""" | ||
user_prompt = "Make a summary of the webpage and extract the email and phone number" | ||
|
||
# Use the tool | ||
result = tool.invoke({"website_html": html_content, "user_prompt": user_prompt}) | ||
|
||
print(result) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
from scrapegraph_py.logger import sgai_logger | ||
|
||
from langchain_scrapegraph.tools import MarkdownifyTool | ||
|
||
sgai_logger.set_logging(level="INFO") | ||
|
||
# Will automatically get SGAI_API_KEY from environment | ||
tool = MarkdownifyTool() | ||
|
||
# Example website and prompt | ||
website_url = "https://www.example.com" | ||
|
||
# Use the tool | ||
result = tool.invoke({"website_url": website_url}) | ||
|
||
print(result) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,17 @@ | ||
from langchain_scrapegraph.tools import SmartscraperTool | ||
from scrapegraph_py.logger import sgai_logger | ||
|
||
# Will automatically get SGAI_API_KEY from environment, or set it manually | ||
tool = SmartscraperTool() | ||
from langchain_scrapegraph.tools import SmartScraperTool | ||
|
||
sgai_logger.set_logging(level="INFO") | ||
|
||
# Will automatically get SGAI_API_KEY from environment | ||
tool = SmartScraperTool() | ||
|
||
# Example website and prompt | ||
website_url = "https://www.example.com" | ||
user_prompt = "Extract the main heading and first paragraph from this webpage" | ||
|
||
# Use the tool synchronously | ||
result = tool.run({"user_prompt": user_prompt, "website_url": website_url}) | ||
# Use the tool | ||
result = tool.invoke({"website_url": website_url, "user_prompt": user_prompt}) | ||
|
||
print("\nExtraction Results:") | ||
print(f"Main Heading: {result['main_heading']}") | ||
print(f"First Paragraph: {result['first_paragraph']}") | ||
print(result) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,6 @@ | ||
from .credits import GetCreditsTool | ||
from .smartscraper import SmartscraperTool | ||
from .localscraper import LocalScraperTool | ||
from .markdownify import MarkdownifyTool | ||
from .smartscraper import SmartScraperTool | ||
|
||
__all__ = ["SmartscraperTool", "GetCreditsTool"] | ||
__all__ = ["SmartScraperTool", "GetCreditsTool", "MarkdownifyTool", "LocalScraperTool"] |
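
Review note: renaming `SmartscraperTool` to `SmartScraperTool` in `__all__` breaks any caller that still imports the old name. One possible way to soften that, not part of this commit, is a module-level `__getattr__` (PEP 562) that resolves the old name with a `DeprecationWarning`. The sketch below demonstrates the mechanism on a throwaway module with a stand-in class; `make_tools_module`, `fake_tools`, and `_RENAMED` are hypothetical names, and in a real package the hook would simply be defined at the top level of the tools `__init__`.

```python
import types
import warnings


class SmartScraperTool:
    """Stand-in for the real tool class, used only for this demonstration."""


# Old name -> new name, following the rename in this hunk.
_RENAMED = {"SmartscraperTool": "SmartScraperTool"}


def make_tools_module():
    """Build a throwaway module that mimics a package with a PEP 562 hook."""
    module = types.ModuleType("fake_tools")
    module.SmartScraperTool = SmartScraperTool

    def __getattr__(name):
        # Called only when normal attribute lookup on the module fails.
        if name in _RENAMED:
            new_name = _RENAMED[name]
            warnings.warn(
                f"{name} is deprecated; use {new_name}",
                DeprecationWarning,
                stacklevel=2,
            )
            return getattr(module, new_name)
        raise AttributeError(f"module {module.__name__!r} has no attribute {name!r}")

    module.__getattr__ = __getattr__
    return module


fake_tools = make_tools_module()
```

With this in place, `fake_tools.SmartscraperTool` resolves to the new class and emits a warning, while unknown names still raise `AttributeError`.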