url-crawl

Overview

url-crawl is a single-service Python application that uses Playwright to drive Chromium for crawling dynamic web pages. It extracts the article content, converts the HTML into Markdown and a plain-text snapshot, and stores the result in MongoDB. The service exposes an HTTP API via FastAPI and is suited to lightweight content crawling and storage.

Tech Stack

  • Runtime: Python 3.11
  • Web framework: FastAPI
  • Browser automation: Playwright (Chromium)
  • Database: MongoDB (PyMongo)
  • HTML parsing: BeautifulSoup4 + markdownify
  • Serving: Uvicorn, Docker / Docker Compose

Directory Layout (key files)

  • app/: application source code, containing main.py, crawler.py, parser.py, mongo.py, config.py, selectors.py, editorjs.py, logger.py
  • Dockerfile: multi-stage build (Playwright browser pre-download + Python 3.11.14 runtime image)
  • requirements.txt: Python dependencies
  • docker-compose.yml: compose configuration for local use and testing (includes the api and mongo services)
  • logs/: log directory (mounted into the container)

Running Locally (Conda)

  1. Create and activate a Conda environment (Python 3.11 recommended)
conda create -n url-crawl python=3.11 -y
conda activate url-crawl
  2. Install the Python dependencies and prepare the Playwright browser
python -m pip install -U pip setuptools wheel
pip install -r requirements.txt
# Install system dependencies (Linux only) and download Chromium
python -m playwright install-deps chromium
python -m playwright install chromium
  3. Set environment variables (example)
export MONGO_URI=mongodb://localhost:27017
export MONGO_DB=webdata
export PORT=3000
export CRAWLER_ID=my-crawler-v1
  4. Start the service
uvicorn app.main:app --host 0.0.0.0 --port $PORT --reload

Build and Publish (Docker)

The multi-stage Dockerfile provided in the repository is recommended: the first stage uses the Playwright image to download the browser, and the second stage uses python:3.11.14 as the runtime and copies in the browser binaries.

Build and run locally:

docker compose build --no-cache
docker compose up -d
docker compose logs -f api

Build and push to a private registry:

docker build -t registry.example.com/your-repo/url-crawl:1.0.0 .
docker push registry.example.com/your-repo/url-crawl:1.0.0

Environment Variables

  • MONGO_URI: Mongo connection string, e.g. mongodb://mongo:27017 (inside containers, use the service name mongo)
  • MONGO_DB: database name, default webdata
  • PORT: API port, default 3000
  • CRAWLER_ID: crawler identifier, stored in crawlInfo.crawler
  • CONTENT_ROOT_SELECTORS: comma-separated priority list of root selectors
  • CONTENT_DOMAIN_SELECTORS: JSON string mapping domain → CSS selector
  • MIDDLEWARE_FIRST: whether the service acts as middleware only; true: data is returned without being persisted; false: data is persisted, and MongoDB is initialized at startup
  • USE_LLM_SELECTOR: whether to use an LLM to identify the container holding the article content
  • LLM_API_KEY: API key for the third-party LLM service
  • LLM_BASE_URL: base URL of the third-party LLM service
  • LLM_MODEL: model to call, e.g. deepseek-ai/DeepSeek-V3.2-Exp
  • LLM_PATH: API path to call, e.g. /chat/completions

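Since all of these arrive as strings, they have to be decoded at startup. A minimal sketch of how the two structured variables might be parsed (the function name, defaults, and key names are illustrative, not the actual app/config.py):

```python
import json
import os

def load_config(env=os.environ):
    """Parse the crawler's environment variables into a config dict.

    Hypothetical sketch -- the real app/config.py may differ; variable
    names and defaults follow the list above.
    """
    return {
        "mongo_uri": env.get("MONGO_URI", "mongodb://localhost:27017"),
        "mongo_db": env.get("MONGO_DB", "webdata"),
        "port": int(env.get("PORT", "3000")),
        "crawler_id": env.get("CRAWLER_ID", ""),
        # Comma-separated priority list -> ordered list of selectors.
        "root_selectors": [
            s.strip()
            for s in env.get("CONTENT_ROOT_SELECTORS", "").split(",")
            if s.strip()
        ],
        # JSON string -> dict mapping domain to CSS selector.
        "domain_selectors": json.loads(env.get("CONTENT_DOMAIN_SELECTORS", "{}")),
        # Booleans arrive as strings; treat only "true" as truthy.
        "middleware_first": env.get("MIDDLEWARE_FIRST", "false").lower() == "true",
    }

cfg = load_config({
    "CONTENT_ROOT_SELECTORS": "div#app,#app,main#app",
    "CONTENT_DOMAIN_SELECTORS": '{"juejin.cn": "main"}',
    "MIDDLEWARE_FIRST": "true",
})
```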
Example (environment in docker-compose.yml):

environment:
	- MONGO_URI=mongodb://mongo:27017
	- MONGO_DB=webdata
	- PORT=3000
	- CRAWLER_ID=my-crawler-v1
	- CONTENT_ROOT_SELECTORS=div#app,#app,main#app,div#root,#root
	- CONTENT_DOMAIN_SELECTORS={"juejin.cn":"main","medium.com":"article"}

API Reference


Base URL: http://<host>:<PORT> (default http://localhost:3000)

POST /crawl

  • Description: crawl the given URL, parse it, and write the result to MongoDB
  • When MIDDLEWARE_FIRST is true, the crawled data is returned directly in the response; otherwise it is persisted to MongoDB.

Request body (JSON):

{
	"url": "https://example.com/article/1",
	"contentSelector": "main.article",   # 可选,覆盖选择器
	"ignoreTags": "nav,footer"           # 可选,逗号分隔
}
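
With both the request-level contentSelector and the two selector environment variables in play, some precedence must apply. The sketch below assumes one plausible ordering (request override first, then domain mapping, then root list); the actual logic in app/selectors.py is not shown in this README:

```python
from urllib.parse import urlparse

def resolve_selector(url, content_selector=None,
                     domain_selectors=None, root_selectors=None):
    """Pick the CSS selector to use for a crawl request.

    Assumed precedence (illustrative, not taken from the source):
    explicit contentSelector in the request wins, then a
    CONTENT_DOMAIN_SELECTORS match for the URL's domain, then the
    first entry of CONTENT_ROOT_SELECTORS.
    """
    if content_selector:
        return content_selector
    domain = urlparse(url).netloc
    if domain_selectors and domain in domain_selectors:
        return domain_selectors[domain]
    if root_selectors:
        return root_selectors[0]
    return "body"  # fall back to the whole page

sel = resolve_selector(
    "https://juejin.cn/post/1",
    domain_selectors={"juejin.cn": "main"},
    root_selectors=["div#app", "#root"],
)
```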

Successful response:

{ "id": "650a1...", "url": "https://example.com/article/1" }

POST /parse

  • Description: parse the supplied HTML directly (kept for backward compatibility; nothing is persisted)
  • Request body: raw HTML, or JSON containing html and optional url and title

Response example (partial):

{ "title": "示例", "markdown": "# 标题\n内容...", "length": 123 }

GET /content/{id}

  • Description: fetch content by document ID; the query parameter type=origin|md selects the returned format
  • Not available when MIDDLEWARE_FIRST is true

Successful response example:

{
    "id": "650a1...",
    "originalHtml": "<div>...</div>",
    "markdown": "# 标题...",
    "snapshot": {
        "text": "纯文本...",
        "length": 120
    }
}

GET /pages?page=1&limit=50

  • Description: paginated listing of crawled entries, returning url, domain, title, and crawledAt
  • Not available when MIDDLEWARE_FIRST is true

Response example:

{
    "total": 123,
    "page": 1,
    "limit": 50,
    "items": [
        {
            "id": "...",
            "url": "...",
            "title": "...",
            "crawledAt": "2025-11-17T00:00:00Z"
        }
    ]
}
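
The total/page/limit fields above imply the usual offset arithmetic. A small sketch (the helper itself is hypothetical; only the parameter names mirror the endpoint):

```python
import math

def page_window(total, page=1, limit=50):
    """Compute the skip/limit window and page count for GET /pages."""
    page = max(page, 1)                # clamp out-of-range page numbers
    skip = (page - 1) * limit          # documents to skip in the query
    pages = math.ceil(total / limit)   # total number of pages available
    return {"skip": skip, "limit": limit, "pages": pages}

w = page_window(total=123, page=3, limit=50)
```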

GET /editorjs/{id}

  • Description: convert the stored HTML to Editor.js block JSON
  • Not available when MIDDLEWARE_FIRST is true

Response example (partial):

{
    "time": 1700000000000,
    "blocks": [
        {
            "id": "blk...",
            "type": "paragraph",
            "data": {
                "text": "..."
            }
        },
        {
            "type": "image",
            "data": {
                "file": {
                    "url": "..."
                }
            }
        }
    ]
}

POST /html_to_editorjs

  • Description: convert the submitted HTML (originHtml) to Editor.js block JSON

Request body (JSON):

{
	"originHtml": "<div>xxxx</div>"
}

Response example (partial):

{
    "time": 1700000000000,
    "blocks": [
        {
            "id": "blk...",
            "type": "paragraph",
            "data": {
                "text": "..."
            }
        },
        {
            "type": "image",
            "data": {
                "file": {
                    "url": "..."
                }
            }
        }
    ]
}
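
The conversion behind both Editor.js endpoints can be sketched with the standard library. This deliberately minimal version covers only flat <p> and <img> tags; the real app/editorjs.py handles many more block types and nesting:

```python
import time
import uuid
from html.parser import HTMLParser

class BlockBuilder(HTMLParser):
    """Turn flat <p> and <img> tags into Editor.js blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = None  # buffer for the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._text = []
        elif tag == "img":
            # Image tags map straight to image blocks.
            src = dict(attrs).get("src", "")
            self.blocks.append(
                {"id": uuid.uuid4().hex[:10], "type": "image",
                 "data": {"file": {"url": src}}})

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._text is not None:
            self.blocks.append(
                {"id": uuid.uuid4().hex[:10], "type": "paragraph",
                 "data": {"text": "".join(self._text).strip()}})
            self._text = None

def html_to_editorjs(html):
    """Produce the {"time": ..., "blocks": [...]} shape shown above."""
    b = BlockBuilder()
    b.feed(html)
    return {"time": int(time.time() * 1000), "blocks": b.blocks}

doc = html_to_editorjs('<div><p>Hello</p><img src="https://e.com/a.png"></div>')
```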

Document Schema Example (collection web_pages)

{
	"url": "https://example.com/article/123",
	"domain": "example.com",
	"title": "示例标题",
	"originalHtml": "<div>...</div>",
	"markdown": "## 标题\n正文...",
	"snapshot": { "text": "纯文本内容...", "length": 1234 },
	"crawlInfo": {
		"crawler": "my-crawler-v1",
		"crawledAt": "2025-11-17T00:00:00Z",
		"fetchHash": "sha256(...)",
		"responseTimeMs": 320,
		"retryCount": 0
	},
	"meta": { "description": "...", "keywords": [] },
	"createdAt": "2025-11-17T00:00:00Z",
	"updatedAt": "2025-11-17T00:00:00Z"
}
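
The crawlInfo.fetchHash field suggests a content hash used for change detection. A sketch under the assumption that it is a SHA-256 over the fetched body (the exact input the service hashes, raw bytes vs. normalized text, is not specified here):

```python
import hashlib

def fetch_hash(html: str) -> str:
    """SHA-256 hex digest of the fetched HTML, as stored in crawlInfo.fetchHash.

    Assumption: the service hashes the UTF-8 encoded body as-is.
    """
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

h = fetch_hash("<div>...</div>")
```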

Screenshots
