Merge pull request #2 from ScrapeGraphAI/pre/beta
feat: added markdownify and localscraper tools
PeriniM authored Dec 5, 2024
2 parents b92efb3 + 5ca7fe2 commit 9c37e01
Showing 16 changed files with 800 additions and 55 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
## 1.0.0 (2024-12-05)

## 1.0.0-beta.1 (2024-12-05)


### Features

* added markdownify and localscraper tools ([03e49dc](https://github.com/ScrapeGraphAI/langchain-scrapegraph/commit/03e49dce84ef5a1b7a59b6dfd046eb563c14d283))
* tools integration ([dc7e9a8](https://github.com/ScrapeGraphAI/langchain-scrapegraph/commit/dc7e9a8fbf4e88bb79e11a9253428b2f61fa1293))


141 changes: 140 additions & 1 deletion README.md
@@ -1 +1,140 @@
# langchain-scrapegraph
# 🕷️🦜 langchain-scrapegraph

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Support](https://img.shields.io/pypi/pyversions/langchain-scrapegraph.svg)](https://pypi.org/project/langchain-scrapegraph/)
[![Documentation](https://img.shields.io/badge/Documentation-Latest-green)](https://scrapegraphai.com/docs)

Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between [LangChain](https://github.com/langchain-ai/langchain) and [ScrapeGraph AI](https://scrapegraphai.com), enabling your agents to extract structured data from websites using natural language.

## 📦 Installation

```bash
pip install langchain-scrapegraph
```

## 🛠️ Available Tools

### 📝 MarkdownifyTool
Convert any webpage into clean, formatted markdown.

```python
from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})

print(markdown)
```

### 🔍 SmartScraperTool
Extract structured data from any webpage using natural language prompts.

```python
from langchain_scrapegraph.tools import SmartScraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SmartScraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph"
})

print(result)
```

### 💻 LocalScraperTool
Extract information from HTML content using AI.

```python
from langchain_scrapegraph.tools import LocalScraperTool

tool = LocalScraperTool()
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": "<html>...</html>"
})

print(result)
```

## 🌟 Key Features

- 🐦 **LangChain Integration**: Seamlessly works with LangChain agents and chains
- 🔍 **AI-Powered Extraction**: Use natural language to describe what data to extract
- 📊 **Structured Output**: Get clean, structured data ready for your agents
- 🔄 **Flexible Tools**: Choose from multiple specialized scraping tools
- ⚡ **Async Support**: Built-in support for async operations

## 💡 Use Cases

- 📖 **Research Agents**: Create agents that gather and analyze web data
- 📊 **Data Collection**: Automate structured data extraction from websites
- 📝 **Content Processing**: Convert web content into markdown for further processing
- 🔍 **Information Extraction**: Extract specific data points using natural language

## 🤖 Example Agent

```python
from langchain.agents import initialize_agent, AgentType
from langchain_scrapegraph.tools import SmartScraperTool
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [
    SmartScraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
response = agent.run("""
Visit example.com, make a summary of the content and extract the main heading and first paragraph
""")
```

## ⚙️ Configuration

Set your ScrapeGraph API key in your environment:
```bash
export SGAI_API_KEY="your-api-key-here"
```

Or set it programmatically:
```python
import os
os.environ["SGAI_API_KEY"] = "your-api-key-here"
```
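In both cases the tools resolve the key the same way: an explicit `api_key` argument takes precedence, then the `SGAI_API_KEY` environment variable (this mirrors the `get_from_dict_or_env` lookup in `credits.py`). A minimal stdlib sketch of that precedence, with `resolve_api_key` as a hypothetical helper for illustration:

```python
import os


def resolve_api_key(explicit=None):
    """Prefer an explicit key, then fall back to the SGAI_API_KEY env var."""
    key = explicit or os.environ.get("SGAI_API_KEY")
    if not key:
        raise ValueError("Set SGAI_API_KEY or pass api_key explicitly")
    return key


os.environ["SGAI_API_KEY"] = "your-api-key-here"
print(resolve_api_key())  # prints "your-api-key-here"
```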

## 📚 Documentation

- [API Documentation](https://scrapegraphai.com/docs)
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction.html)
- [Examples](examples/)

## 💬 Support & Feedback

- 📧 Email: support@scrapegraphai.com
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues/new)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

This project is built on top of:
- [LangChain](https://github.com/langchain-ai/langchain)
- [ScrapeGraph AI](https://scrapegraphai.com)

---

Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)
57 changes: 57 additions & 0 deletions examples/agent_example.py
@@ -0,0 +1,57 @@
"""
Remember to install the additional dependencies for this example to work:
pip install langchain-openai langchain
"""

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    LocalScraperTool,
    SmartScraperTool,
)

load_dotenv()

# Initialize the tools
tools = [
    SmartScraperTool(),
    LocalScraperTool(),
    GetCreditsTool(),
]

# Create the prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "You are a helpful AI assistant that can analyze websites and extract information. "
                "You have access to tools that can help you scrape and process web content. "
                "Always explain what you're doing before using a tool."
            )
        ),
        MessagesPlaceholder(variable_name="chat_history", optional=True),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

# Initialize the LLM
llm = ChatOpenAI(temperature=0)

# Create the agent
agent = create_openai_functions_agent(llm, tools, prompt)

# Create the executor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Example usage
query = """Extract the main products from https://www.scrapegraphai.com/"""

print("\nQuery:", query, "\n")
response = agent_executor.invoke({"input": query})
print("\nFinal Response:", response["output"])
14 changes: 9 additions & 5 deletions examples/get_credits_tool.py
@@ -1,9 +1,13 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import GetCreditsTool

# Will automatically get SGAI_API_KEY from environment, or set it manually
sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = GetCreditsTool()
credits = tool.run()

print("\nCredits Information:")
print(f"Remaining Credits: {credits['remaining_credits']}")
print(f"Total Credits Used: {credits['total_credits_used']}")
# Use the tool
credits = tool.invoke({})

print(credits)
28 changes: 28 additions & 0 deletions examples/localscraper_tool.py
@@ -0,0 +1,28 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import LocalScraperTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = LocalScraperTool()

# Example HTML content and prompt
html_content = """
<html>
<body>
<h1>Company Name</h1>
<p>We are a technology company focused on AI solutions.</p>
<div class="contact">
<p>Email: contact@example.com</p>
<p>Phone: (555) 123-4567</p>
</div>
</body>
</html>
"""
user_prompt = "Make a summary of the webpage and extract the email and phone number"

# Use the tool
result = tool.invoke({"website_html": html_content, "user_prompt": user_prompt})

print(result)
16 changes: 16 additions & 0 deletions examples/markdownify_tool.py
@@ -0,0 +1,16 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import MarkdownifyTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = MarkdownifyTool()

# Example website
website_url = "https://www.example.com"

# Use the tool
result = tool.invoke({"website_url": website_url})

print(result)
18 changes: 10 additions & 8 deletions examples/smartscraper_tool.py
@@ -1,15 +1,17 @@
from langchain_scrapegraph.tools import SmartscraperTool
from scrapegraph_py.logger import sgai_logger

# Will automatically get SGAI_API_KEY from environment, or set it manually
tool = SmartscraperTool()
from langchain_scrapegraph.tools import SmartScraperTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = SmartScraperTool()

# Example website and prompt
website_url = "https://www.example.com"
user_prompt = "Extract the main heading and first paragraph from this webpage"

# Use the tool synchronously
result = tool.run({"user_prompt": user_prompt, "website_url": website_url})
# Use the tool
result = tool.invoke({"website_url": website_url, "user_prompt": user_prompt})

print("\nExtraction Results:")
print(f"Main Heading: {result['main_heading']}")
print(f"First Paragraph: {result['first_paragraph']}")
print(result)
6 changes: 4 additions & 2 deletions langchain_scrapegraph/tools/__init__.py
@@ -1,4 +1,6 @@
from .credits import GetCreditsTool
from .smartscraper import SmartscraperTool
from .localscraper import LocalScraperTool
from .markdownify import MarkdownifyTool
from .smartscraper import SmartScraperTool

__all__ = ["SmartscraperTool", "GetCreditsTool"]
__all__ = ["SmartScraperTool", "GetCreditsTool", "MarkdownifyTool", "LocalScraperTool"]
55 changes: 51 additions & 4 deletions langchain_scrapegraph/tools/credits.py
@@ -7,25 +7,72 @@
from langchain_core.tools import BaseTool
from langchain_core.utils import get_from_dict_or_env
from pydantic import model_validator
from scrapegraph_py import SyncClient
from scrapegraph_py import Client


class GetCreditsTool(BaseTool):
    """Tool for checking remaining credits on your ScrapeGraph AI account.

    Setup:
        Install ``langchain-scrapegraph`` python package:

        .. code-block:: bash

            pip install langchain-scrapegraph

        Get your API key from ScrapeGraph AI (https://scrapegraphai.com)
        and set it as an environment variable:

        .. code-block:: bash

            export SGAI_API_KEY="your-api-key"

    Key init args:
        api_key: Your ScrapeGraph AI API key. If not provided, will look for SGAI_API_KEY env var.
        client: Optional pre-configured ScrapeGraph client instance.

    Instantiate:
        .. code-block:: python

            from langchain_scrapegraph.tools import GetCreditsTool

            # Will automatically get SGAI_API_KEY from environment
            tool = GetCreditsTool()

            # Or provide API key directly
            tool = GetCreditsTool(api_key="your-api-key")

    Use the tool:
        .. code-block:: python

            result = tool.invoke({})
            print(result)
            # {
            #     "remaining_credits": 100,
            #     "total_credits_used": 50
            # }

    Async usage:
        .. code-block:: python

            result = await tool.ainvoke({})
    """

    name: str = "GetCredits"
    description: str = (
        "Get the current credits available in your ScrapeGraph AI account"
    )
    return_direct: bool = True
    client: Optional[SyncClient] = None
    client: Optional[Client] = None
    api_key: str
    testing: bool = False

    @model_validator(mode="before")
    @classmethod
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that api key exists in environment."""
        values["api_key"] = get_from_dict_or_env(values, "api_key", "SGAI_API_KEY")
        values["client"] = SyncClient(api_key=values["api_key"])
        values["client"] = Client(api_key=values["api_key"])
        return values

    def __init__(self, **data: Any):