🕷️🦜 langchain-scrapegraph


Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between LangChain and ScrapeGraph AI, enabling your agents to extract structured data from websites using natural language.

🔗 ScrapeGraph API & SDKs

If you are looking for a quick solution to integrate ScrapeGraph into your system, check out our powerful API!


We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:

  • Python SDK (Python): scrapegraph-py
  • Node.js SDK (Node.js): scrapegraph-js
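
If you prefer to call the API without LangChain, here is a minimal sketch of the Python SDK, assuming the Client interface and smartscraper method documented in the scrapegraph-py repository:

from scrapegraph_py import Client

# Assumption: Client accepts the api_key argument shown here;
# see the SDK README for environment-based setup.
client = Client(api_key="your-api-key-here")

response = client.smartscraper(
    website_url="https://www.example.com",
    user_prompt="Extract the main heading"
)
print(response)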

📦 Installation

pip install langchain-scrapegraph
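
The agent example further down also uses langchain-openai for the LLM, so if you plan to run it end to end you may want to install both packages:

pip install langchain-scrapegraph langchain-openai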

🛠️ Available Tools

📝 MarkdownifyTool

Convert any webpage into clean, formatted markdown.

from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})

print(markdown)
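
The returned markdown plugs straight into the rest of the LangChain ecosystem. As an illustrative follow-up step (not part of this library), it could be chunked for embedding or summarization with langchain-text-splitters:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the scraped markdown into overlapping chunks for downstream processing.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.create_documents([markdown])
print(f"Created {len(docs)} document chunks")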

🔍 SmartScraperTool

Extract structured data from any webpage using natural language prompts.

from langchain_scrapegraph.tools import SmartScraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SmartScraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph"
})

print(result)

🔍 Using Output Schemas with SmartScraperTool

You can define the structure of the output using Pydantic models:

from typing import List
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SmartScraperTool

class WebsiteInfo(BaseModel):
    title: str = Field(description="The main title of the webpage")
    description: str = Field(description="The main description or first paragraph")
    urls: List[str] = Field(description="The URLs inside the webpage")

# Initialize with schema
tool = SmartScraperTool(llm_output_schema=WebsiteInfo)

# The output will conform to the WebsiteInfo schema
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the website information"
})

print(result)
# {
#     "title": "Example Domain",
#     "description": "This domain is for use in illustrative examples...",
#     "urls": ["https://www.iana.org/domains/example"]
# }
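
Nested schemas follow the same pattern. Here is a sketch along those lines, assuming nested Pydantic models are handled like the flat schema above (worth confirming against the API documentation):

from typing import List
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SmartScraperTool

class Link(BaseModel):
    text: str = Field(description="The anchor text of the link")
    url: str = Field(description="The URL the link points to")

class PageLinks(BaseModel):
    title: str = Field(description="The main title of the webpage")
    links: List[Link] = Field(description="All links found on the page")

tool = SmartScraperTool(llm_output_schema=PageLinks)

result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the page title and all links"
})
print(result)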

💻 LocalScraperTool

Extract information from HTML content using AI.

from langchain_scrapegraph.tools import LocalScraperTool

tool = LocalScraperTool()
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": "<html>...</html>"
})

print(result)

🔍 Using Output Schemas with LocalScraperTool

You can define the structure of the output using Pydantic models:

from typing import Optional
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import LocalScraperTool

class CompanyInfo(BaseModel):
    name: str = Field(description="The company name")
    description: str = Field(description="The company description")
    email: Optional[str] = Field(description="Contact email if available")
    phone: Optional[str] = Field(description="Contact phone if available")

# Initialize with schema
tool = LocalScraperTool(llm_output_schema=CompanyInfo)

html_content = """
<html>
    <body>
        <h1>TechCorp Solutions</h1>
        <p>We are a leading AI technology company.</p>
        <div class="contact">
            <p>Email: contact@techcorp.com</p>
            <p>Phone: (555) 123-4567</p>
        </div>
    </body>
</html>
"""

# The output will conform to the CompanyInfo schema
result = tool.invoke({
    "website_html": html_content,
    "user_prompt": "Extract the company information"
})

print(result)
# {
#     "name": "TechCorp Solutions",
#     "description": "We are a leading AI technology company.",
#     "email": "contact@techcorp.com",
#     "phone": "(555) 123-4567"
# }

🌟 Key Features

  • 🐦 LangChain Integration: Seamlessly works with LangChain agents and chains
  • 🔍 AI-Powered Extraction: Use natural language to describe what data to extract
  • 📊 Structured Output: Get clean, structured data ready for your agents
  • 🔄 Flexible Tools: Choose from multiple specialized scraping tools
  • ⚡ Async Support: Built-in support for async operations (see the sketch after this list)

💡 Use Cases

  • 📖 Research Agents: Create agents that gather and analyze web data
  • 📊 Data Collection: Automate structured data extraction from websites
  • 📝 Content Processing: Convert web content into markdown for further processing
  • 🔍 Information Extraction: Extract specific data points using natural language

🤖 Example Agent

from langchain.agents import initialize_agent, AgentType
from langchain_scrapegraph.tools import SmartScraperTool
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [
    SmartScraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
response = agent.run("""
    Visit example.com, make a summary of the content and extract the main heading and first paragraph
""")

⚙️ Configuration

Set your ScrapeGraph API key in your environment:

export SGAI_API_KEY="your-api-key-here"

Or set it programmatically:

import os
os.environ["SGAI_API_KEY"] = "your-api-key-here"
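
If you keep credentials in a .env file, a minimal sketch using python-dotenv (an optional dependency, not installed by this package) also works:

from dotenv import load_dotenv
from langchain_scrapegraph.tools import SmartScraperTool

load_dotenv()  # loads SGAI_API_KEY from a local .env file into the environment
tool = SmartScraperTool()  # the tool then picks the key up from the environment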

📚 Documentation

💬 Support & Feedback

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This project is built on top of:

  • LangChain
  • ScrapeGraph AI

Made with ❤️ by ScrapeGraph AI