url-crawl is a monolithic Python service that uses Playwright to drive Chromium for crawling dynamic web pages. It extracts article content, converts the HTML into Markdown and a plain-text snapshot, and stores the result in MongoDB. The service exposes an HTTP API built on FastAPI and is suited to lightweight content crawling and storage.
- Runtime: Python 3.11
- Web framework: FastAPI
- Browser driver: Playwright (Chromium)
- Database: MongoDB (PyMongo)
- HTML parsing: BeautifulSoup4 + markdownify
- Serving: Uvicorn, Docker / Docker Compose
- `app/`: application source code, including `main.py`, `crawler.py`, `parser.py`, `mongo.py`, `config.py`, `selectors.py`, `editorjs.py`, `logger.py`, etc.
- `Dockerfile`: multi-stage build (Playwright browser pre-download + Python 3.11.14 runtime image)
- `requirements.txt`: Python dependencies
- `docker-compose.yml`: compose configuration for local use and testing (includes the `api` and `mongo` services)
- `logs/`: log directory (mounted into the container)
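For orientation, the end-to-end flow (render the page with Chromium, extract the content root, convert it to Markdown and a plain-text snapshot) can be pictured roughly as below. This is a condensed sketch built on the stated stack, not the actual code in `crawler.py`/`parser.py`; the `crawl` function and its default selector are illustrative only.

```python
# Illustrative sketch only -- the real logic lives in app/crawler.py and app/parser.py.
import asyncio
from bs4 import BeautifulSoup
from markdownify import markdownify
from playwright.async_api import async_playwright

async def crawl(url: str, content_selector: str = "article") -> dict:
    # Render the page with Chromium so JS-generated content is present.
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()

    # Extract the article container and derive Markdown + a plain-text snapshot.
    soup = BeautifulSoup(html, "html.parser")
    root = soup.select_one(content_selector) or soup.body
    original_html = str(root)
    text = root.get_text(" ", strip=True)
    return {
        "originalHtml": original_html,
        "markdown": markdownify(original_html),
        "snapshot": {"text": text, "length": len(text)},
    }

if __name__ == "__main__":
    print(asyncio.run(crawl("https://example.com"))["snapshot"]["length"])
```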
- Create and activate a Conda environment (Python 3.11 recommended):

```bash
conda create -n url-crawl python=3.11 -y
conda activate url-crawl
```

- Install the Python dependencies and prepare the Playwright browser:
```bash
python -m pip install -U pip setuptools wheel
pip install -r requirements.txt
# Install system dependencies (Linux only) and download Chromium
python -m playwright install-deps chromium
python -m playwright install chromium
```

- Set environment variables (example):
```bash
export MONGO_URI=mongodb://localhost:27017
export MONGO_DB=webdata
export PORT=3000
export CRAWLER_ID=my-crawler-v1
```

- Start the service:
```bash
uvicorn app.main:app --host 0.0.0.0 --port $PORT --reload
```

We recommend the multi-stage Dockerfile already provided in the repository: the first stage uses the Playwright image to download the browser, and the second stage uses python:3.11.14 as the runtime and copies in the browser binaries.
Build and run locally:
```bash
docker compose build --no-cache
docker compose up -d
docker compose logs -f api
```

Build and push to a private registry:
```bash
docker build -t registry.example.com/your-repo/url-crawl:1.0.0 .
docker push registry.example.com/your-repo/url-crawl:1.0.0
```

- `MONGO_URI`: MongoDB connection string, e.g. `mongodb://mongo:27017` (use the service name `mongo` inside containers)
- `MONGO_DB`: database name, default `webdata`
- `PORT`: API port, default `3000`
- `CRAWLER_ID`: crawler identifier, written to `crawlInfo.crawler`
- `CONTENT_ROOT_SELECTORS`: comma-separated priority list of root selectors
- `CONTENT_DOMAIN_SELECTORS`: JSON string mapping domains to CSS selectors (see the resolution sketch after the compose example below)
- `MIDDLEWARE_FIRST`: whether the service runs as middleware; `true`: data is not persisted; `false`: data is persisted and MongoDB is initialized at startup
- `USE_LLM_SELECTOR`: whether to use an LLM to identify the container holding the article content
- `LLM_API_KEY`: API key for the third-party LLM service
- `LLM_BASE_URL`: base URL of the third-party LLM service
- `LLM_MODEL`: model to call, e.g. `deepseek-ai/DeepSeek-V3.2-Exp`
- `LLM_PATH`: API path to call, e.g. `/chat/completions`
Example (`environment` section in docker-compose.yml):
```yaml
environment:
  - MONGO_URI=mongodb://mongo:27017
  - MONGO_DB=webdata
  - PORT=3000
  - CRAWLER_ID=my-crawler-v1
  - CONTENT_ROOT_SELECTORS=div#app,#app,main#app,div#root,#root
  - CONTENT_DOMAIN_SELECTORS={"juejin.cn":"main","medium.com":"article"}
```
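As a rough illustration of how the two selector variables might interact (the actual logic lives in `config.py`/`selectors.py`; the helper below is hypothetical, and the assumption that domain-specific entries take precedence over the root-selector list is ours):

```python
# Hypothetical helper -- illustrates one plausible precedence of the two variables.
import json
import os
from urllib.parse import urlparse

def resolve_selectors(url: str) -> list[str]:
    domain_map = json.loads(os.environ.get("CONTENT_DOMAIN_SELECTORS", "{}"))
    root_list = [
        s.strip()
        for s in os.environ.get("CONTENT_ROOT_SELECTORS", "").split(",")
        if s.strip()
    ]
    domain = urlparse(url).hostname or ""
    # A domain-specific selector wins; otherwise try the priority list in order.
    if domain in domain_map:
        return [domain_map[domain]]
    return root_list

# resolve_selectors("https://juejin.cn/post/1")  ->  ["main"]
```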
Base URL: `http://<host>:<PORT>` (default `http://localhost:3000`)

- Description: crawl the given URL, parse it, and write the result to MongoDB. When MIDDLEWARE_FIRST is `true`, the crawled data is returned directly; otherwise it is persisted.
Request body (JSON):
```json
{
  "url": "https://example.com/article/1",
  "contentSelector": "main.article",  # optional, overrides the selector
  "ignoreTags": "nav,footer"          # optional, comma-separated
}
```

Successful response:
```json
{ "id": "650a1...", "url": "https://example.com/article/1" }
```
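For a quick smoke test of this endpoint, a minimal client call using only the standard library might look as follows. The `/crawl` route is a placeholder, since the concrete path is defined in `app/main.py`:

```python
# Hypothetical client call -- the "/crawl" route is a placeholder, adjust as needed.
import json
import urllib.request

payload = {
    "url": "https://example.com/article/1",
    "contentSelector": "main.article",  # optional override
    "ignoreTags": "nav,footer",         # optional
}
req = urllib.request.Request(
    "http://localhost:3000/crawl",  # placeholder route
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # e.g. {"id": "650a1...", "url": "..."}
```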
- Description: parse the supplied HTML directly (kept for compatibility; results are not persisted)
- Request body: raw HTML, or JSON containing `html` with optional `url` and `title`
Response example (partial):
```json
{ "title": "Example", "markdown": "# Heading\nContent...", "length": 123 }
```

- Description: fetch content by document ID; the query parameter `type=origin|md` controls the returned format. Not available when MIDDLEWARE_FIRST is `true`.
Successful response example:
```json
{
  "id": "650a1...",
  "originalHtml": "<div>...</div>",
  "markdown": "# Heading...",
  "snapshot": {
    "text": "plain text...",
    "length": 120
  }
}
```

- Description: paginated query of crawled entries; returns `url`, `domain`, `title`, and `crawledAt`. Not available when MIDDLEWARE_FIRST is `true`.
Response example:
```json
{
  "total": 123,
  "page": 1,
  "limit": 50,
  "items": [
    {
      "id": "...",
      "url": "...",
      "title": "...",
      "crawledAt": "2025-11-17T00:00:00Z"
    }
  ]
}
```

- Description: convert the stored HTML to Editor.js block JSON. Not available when MIDDLEWARE_FIRST is `true`.
Response example (partial):
```json
{
  "time": 1700000000000,
  "blocks": [
    {
      "id": "blk...",
      "type": "paragraph",
      "data": {
        "text": "..."
      }
    },
    {
      "type": "image",
      "data": {
        "file": {
          "url": "..."
        }
      }
    }
  ]
}
```
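As a minimal sketch of the kind of transformation `editorjs.py` performs, assuming BeautifulSoup and covering only the paragraph and image block shapes shown above (the real implementation handles more element types):

```python
# Minimal illustration of HTML -> Editor.js blocks; app/editorjs.py is the real mapping.
import time
import uuid
from bs4 import BeautifulSoup

def html_to_editorjs(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(["p", "img"]):
        if el.name == "p":
            blocks.append({
                "id": uuid.uuid4().hex[:10],
                "type": "paragraph",
                "data": {"text": el.get_text(strip=True)},
            })
        else:  # <img>
            blocks.append({
                "type": "image",
                "data": {"file": {"url": el.get("src", "")}},
            })
    return {"time": int(time.time() * 1000), "blocks": blocks}
```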
- Description: convert the supplied HTML (`originHtml`) to Editor.js block JSON

Request body (JSON):
```json
{
  "originHtml": "<div>xxxx</div>"
}
```

Response example (partial):
```json
{
  "time": 1700000000000,
  "blocks": [
    {
      "id": "blk...",
      "type": "paragraph",
      "data": {
        "text": "..."
      }
    },
    {
      "type": "image",
      "data": {
        "file": {
          "url": "..."
        }
      }
    }
  ]
}
```

Example of a stored MongoDB document:

```json
{
  "url": "https://example.com/article/123",
  "domain": "example.com",
  "title": "Example Title",
  "originalHtml": "<div>...</div>",
  "markdown": "## Heading\nBody text...",
  "snapshot": { "text": "plain text content...", "length": 1234 },
  "crawlInfo": {
    "crawler": "my-crawler-v1",
    "crawledAt": "2025-11-17T00:00:00Z",
    "fetchHash": "sha256(...)",
    "responseTimeMs": 320,
    "retryCount": 0
  },
  "meta": { "description": "...", "keywords": [] },
  "createdAt": "2025-11-17T00:00:00Z",
  "updatedAt": "2025-11-17T00:00:00Z"
}
```
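For illustration, a document in this shape might be assembled and upserted with PyMongo roughly as below. The collection name and the helper itself are hypothetical, and only part of `crawlInfo` is filled in:

```python
# Hypothetical persistence helper -- field names mirror the stored-document example.
import hashlib
import os
from datetime import datetime, timezone
from urllib.parse import urlparse
from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGO_URI", "mongodb://localhost:27017"))
col = client[os.environ.get("MONGO_DB", "webdata")]["pages"]  # collection name assumed

def save_page(url: str, original_html: str, markdown: str, text: str) -> str:
    now = datetime.now(timezone.utc)
    doc = {
        "url": url,
        "domain": urlparse(url).hostname,
        "originalHtml": original_html,
        "markdown": markdown,
        "snapshot": {"text": text, "length": len(text)},
        "crawlInfo": {
            "crawler": os.environ.get("CRAWLER_ID", ""),
            "crawledAt": now,
            "fetchHash": hashlib.sha256(original_html.encode("utf-8")).hexdigest(),
        },
        "updatedAt": now,
    }
    # Upsert keyed on URL; createdAt is only written on the first insert.
    result = col.update_one(
        {"url": url},
        {"$set": doc, "$setOnInsert": {"createdAt": now}},
        upsert=True,
    )
    return str(result.upserted_id or col.find_one({"url": url})["_id"])
```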
