CrawlX is an API that extracts structured data from HTML documents or URLs using CSS selectors. It is particularly useful for frontend developers who need to pull data from websites for purposes such as data mining, content aggregation, and quick prototyping.
A Demo Project: "The Guardian Scraper"
- Extract data from HTML documents or URLs using CSS selectors.
- Support for nested data extraction using nested selectors.
- Accepts both HTML content and URLs as input.
- Frontend Prototyping: Frontend developers can use CrawlX for quick prototyping without setting up a backend server, making it easy to integrate external data into frontend applications and deploy them on serverless platforms like Netlify (see the sketch below).
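For illustration, a frontend prototype could call the /extract endpoint directly with fetch. This is a minimal sketch, not part of the CrawlX codebase: the base URL matches the local example further down, while the target page and selector names are placeholders.

// Minimal sketch: calling CrawlX from a frontend prototype.
// Assumes the API is reachable at http://127.0.0.1:8000; the URL and
// selector names below are placeholders, not real endpoints or pages.
async function fetchPageData(): Promise<Record<string, string>> {
  const response = await fetch("http://127.0.0.1:8000/extract", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://example.com",            // page to scrape (placeholder)
      selectors: {
        title: { css: "h1", type: "text" },  // payload shape documented below
      },
    }),
  });
  const payload = await response.json();
  return payload.data;                       // extracted key/value pairs
}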
Send a POST request to /extract with the following JSON payload:
{
  "html": "<html><body>...</body></html>",
  "selectors": {
    "key1": {
      "css": "selector1",
      "type": "text"
    },
    "key2": {
      "css": "selector2",
      "type": "text"
    },
    ...
  }
}
- html (optional): The HTML content to extract data from. Use either html or url.
- url (optional): The URL of the website to scrape. Use either html or url.
- selectors: A dictionary of CSS selectors used to extract data from the HTML content.
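Expressed as TypeScript types, the request body described above looks roughly like this. It is a sketch derived from the fields documented in this section; the names SelectorSpec and ExtractRequest are illustrative, not part of the API.

// Shape of the /extract request body, based on the fields documented above.
interface SelectorSpec {
  css: string;   // a CSS selector, e.g. "h1"
  type: string;  // extraction type; "text" in the examples in this section
}

interface ExtractRequest {
  html?: string;                            // raw HTML to parse, or
  url?: string;                             // a URL to fetch and scrape
  selectors: Record<string, SelectorSpec>;  // output key -> selector spec
}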
The API will return a JSON response containing the extracted data:
{
  "data": {
    "key1": "value1",
    "key2": "value2",
    ...
  }
}
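Assuming "text" selectors yield plain strings, the response can be modelled the same way (ExtractResponse is again an illustrative name, not part of the API):

// Shape of the /extract response, based on the format shown above.
interface ExtractResponse {
  data: Record<string, string>;  // one extracted value per selector key
}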
curl -X POST "http://127.0.0.1:8000/extract" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<div><h1>Hello, World!</h1><p>This is a sample HTML content</p></div>",
    "selectors": {
      "heading": {
        "css": "h1",
        "type": "text"
      },
      "paragraph": {
        "css": "p",
        "type": "text"
      }
    }
  }'
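Assuming CrawlX is running locally on port 8000, this request should return a response along these lines:

{
  "data": {
    "heading": "Hello, World!",
    "paragraph": "This is a sample HTML content"
  }
}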
CrawlX simplifies the process of web scraping and data extraction, making it easier to integrate external data into frontend applications.