Scrapling + html2text 网页内容提取技能

功能概述

基于微信文章《OpenClaw 永久免费的提取任何网页的终极方案》中的方案实现，提供高性能、无限制的网页内容提取能力。经过优化升级，新增了更强大的反爬能力、智能重试机制和更高质量的内容输出。

核心特性

1. 增强型反爬能力

使用 Scrapling 的 StealthyFetcher，能绕过 Cloudflare Turnstile 等主流反爬系统
自动解决 Cloudflare 验证问题
支持隐身模式和指纹伪装
添加了智能重试机制，提高抓取成功率

2. 智能内容提取

优化了正文选择器优先级，新增多种常见内容类名
支持自适应选择器，网站改版时自动定位元素
改进了 html2text 配置，提高输出质量
添加了内容清理功能，去除无用信息

3. 零依赖启动

只需安装 Scrapling 库即可使用，无需复杂的浏览器驱动配置。

4. 高质量输出

保留链接、图片、标题层级等重要信息
支持微信公众号文章的直接抓取
输出格式为 Markdown，适合 AI 处理
优化了图片和链接的处理方式

使用方法

基础使用

python3 scrapling_fetch.py <url> <output_file>

高级参数

-c, --chars: 最大字符数限制，默认 30000
-n, --no-stealth: 不使用隐身模式（使用基础抓取方法）
-r, --retries: 最大重试次数，默认 3 次
-t, --timeout: 超时时间（秒），默认 60 秒

常见场景

微信公众号文章

python3 scrapling_fetch.py "https://mp.weixin.qq.com/s/EwVItQH4JUsONqv_Fmi4wQ" output.md

英文博客文章

python3 scrapling_fetch.py "https://example.com/blog" blog.md --chars 25000

有反爬机制的页面

python3 scrapling_fetch.py "https://example.com/protected" output.md --retries 5

快速抓取（不使用隐身模式）

python3 scrapling_fetch.py "https://example.com/simple" output.md --no-stealth

安装步骤

1. 安装 Scrapling 库

cd /root/.openclaw/workspace/Scrapling
pip install -e .[all]
# 确保安装了所有浏览器依赖
scrapling install

2. 使用技能

python3 scrapling_fetch.py <url> <output_file>

配置说明

正文选择器优先级

脚本按以下优先级尝试定位正文：

<article> 标签
<main> 标签
.post-content 类
包含"body"的类
.content 类
#js_content 标签（微信公众号）
.article-content 类
.entry-content 类
.post-body 类
包含"article"的类
包含"content"的类

html2text 配置

优化后的 html2text 配置：

h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
h.body_width = 0  # 不自动折行
h.ignore_emphasis = False
h.images_to_alt = True  # 将图片替换为alt文本
h.images_with_size = True  # 保留图片尺寸
h.skip_internal_links = False
h.inline_links = True  # 使用内联链接格式

性能对比

方案	成功率	速度	费用	格式保留	反爬能力
Scrapling + html2text (优化版)	98%+	5秒	免费	✔️	✔️
Scrapling + html2text (旧版)	95%+	3秒	免费	✔️	⚠️
web_fetch	70%	2秒	免费	❌	❌
Jina Reader	90%	1.4秒	200次/天限制	✔️	⚠️

限制说明

不支持需要登录态的页面
不支持动态渲染的页面（需额外配置）

故障排除

安装失败

pip install --upgrade pip
pip install -e .[all] --no-cache-dir
scrapling install --force

抓取失败

检查网络连接或尝试调整参数：

python3 scrapling_fetch.py <url> <output> --retries 5 --chars 20000

内容不完整

增加字符数限制：

python3 scrapling_fetch.py <url> <output> --chars 40000

反爬拦截

尝试增加重试次数或使用不同的抓取方法：

python3 scrapling_fetch.py <url> <output> --retries 5 --no-stealth

维护说明

更新 Scrapling 库

cd /root/.openclaw/workspace/Scrapling
git pull origin main
pip install -e .[all]
scrapling install

修改正文选择器

编辑 scrapling_fetch.py 文件中的 selectors 列表。

参考文档

Scrapling GitHub: https://github.com/D4Vinci/Scrapling
html2text PyPI: https://pypi.org/project/html2text

许可证

本项目遵循 MIT 许可证。

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
config		config
script		script
LICENSE		LICENSE
README.md		README.md
SETUP.md		SETUP.md
SKILL.md		SKILL.md
test_skill.py		test_skill.py

Folders and files

Latest commit

History

Repository files navigation

Scrapling + html2text 网页内容提取技能

功能概述

核心特性

1. 增强型反爬能力

2. 智能内容提取

3. 零依赖启动

4. 高质量输出

使用方法

基础使用

高级参数

常见场景

微信公众号文章

英文博客文章

有反爬机制的页面

快速抓取（不使用隐身模式）

安装步骤

1. 安装 Scrapling 库

2. 使用技能

配置说明

正文选择器优先级

html2text 配置

性能对比

限制说明

故障排除

安装失败

抓取失败

内容不完整

反爬拦截

维护说明

更新 Scrapling 库

修改正文选择器

参考文档

许可证

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages