spider_star

A personal collection of other people's web crawler projects, gathered for study.

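1. WeChat official account crawler

A crawler interface for WeChat official accounts built on Sogou WeChat Search; it can be extended into a general crawler based on Sogou Search. Results are returned as a list in which each item is a dictionary holding the details of one official account.

GitHub: https://github.com/Chyroc/WechatSogou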

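2. Douban Books crawler

Crawls every book under a Douban Books tag and stores the results in Excel, ranked by rating. This makes filtering easy, for example selecting highly rated books with more than 1,000 ratings. Different topics can be stored in separate sheets of the Excel file. The crawler sets a User-Agent header to pose as a browser and adds random delays to better mimic browser behavior and avoid being blocked.

GitHub: https://github.com/lanbing510/DouBanSpider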
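
The User-Agent spoofing, random delay, and rating-based filtering described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the names `USER_AGENTS`, `build_request`, `polite_sleep`, and `rank_books`, the User-Agent strings, and the `rating`/`votes` keys are all assumptions made for the example.

```python
import random
import time
import urllib.request

# Hypothetical pool of browser User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_request(url):
    """Return a Request whose User-Agent header is picked at random from the pool."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def polite_sleep(min_s=1.0, max_s=5.0, sleep=time.sleep):
    """Pause for a random duration between min_s and max_s; return the delay used."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def rank_books(books, min_votes=0):
    """Keep books with more than min_votes ratings, highest rating first."""
    kept = [b for b in books if b["votes"] > min_votes]
    return sorted(kept, key=lambda b: b["rating"], reverse=True)
```

A crawl loop would call `polite_sleep()` between requests built with `build_request(url)`, then pass the parsed books to `rank_books(books, min_votes=1000)` before writing them out.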

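3. Zhihu crawler

This project crawls Zhihu user information and the topology of relationships between users. It uses the Scrapy framework for crawling and MongoDB for data storage.

GitHub: https://github.com/LiuRoy/zhihu_spider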

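4. Bilibili user crawler

Total records: 20,119,918. Fields crawled: user ID, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, signature, and others. After crawling, a Bilibili user data report is generated.

GitHub: https://github.com/airingursb/bilibili-user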

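5. CNKI (China National Knowledge Infrastructure) crawler

After setting the search conditions, run src/CnkiSpider.py to crawl the data. The results are stored in the /data directory, and the first line of each data file contains the field names.

GitHub: https://github.com/yanzhou/CnkiSpider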
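
Because each data file starts with a line of field names, it can be loaded into dictionaries keyed by those names. The sketch below shows one way to do that with the standard `csv` module. The README does not state the delimiter or the actual field names, so the tab delimiter, the helper `read_records`, and the sample columns are assumptions for illustration only.

```python
import csv
import io


def read_records(lines, delimiter="\t"):
    """Parse rows into dicts keyed by the field names found on the first line."""
    return list(csv.DictReader(lines, delimiter=delimiter))


# Hypothetical file content; real field names depend on the crawler's output.
sample = io.StringIO(
    "title\tauthor\tyear\n"
    "Paper A\tZhang\t2015\n"
    "Paper B\tLi\t2016\n"
)
records = read_records(sample)
```

For a file on disk, the same function accepts an open file object, for example one opened with `open(path, newline="", encoding="utf-8")`.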

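6. QQ group crawler

Crawls QQ group information in bulk, including group name, group number, member count, group owner, and group description, and produces an XLS(X) / CSV result file.

GitHub: https://github.com/caspartse/QQ-Groups-Spider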
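
As a rough idea of what the CSV output step could look like, the sketch below writes group dictionaries to CSV with the standard `csv` module. It is not taken from the project: the column names in `FIELDS`, the helper `write_groups_csv`, and the sample record are hypothetical and only mirror the fields listed above.

```python
import csv
import io

# Hypothetical column names mirroring the fields listed above.
FIELDS = ["group_name", "group_number", "member_count", "owner", "description"]


def write_groups_csv(groups, out):
    """Write one header row followed by one row per group dictionary."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(groups)


# Illustrative record; an in-memory buffer stands in for a real file.
buf = io.StringIO()
write_groups_csv(
    [
        {
            "group_name": "Python learners",
            "group_number": "123456",
            "member_count": 500,
            "owner": "alice",
            "description": "Study group",
        }
    ],
    buf,
)
```

When writing to a real file, open it with `newline=""` so the `csv` module controls line endings itself.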

7. Flight ticket crawler

A Scrapy-based flight ticket crawler that currently integrates two major Chinese ticketing sites (Qunar + Ctrip).

GitHub: https://github.com/fankcoder/findtrip

8. Qzone (QQ空间) crawler

Covers blog posts, status updates, personal information, and more; it can crawl 4 million records per day.

GitHub: https://github.com/LiuXingMing/QQSpider

9. Baidu cloud drive (百度云盘) crawler

GitHub: https://github.com/k1995/BaiduyunSpider

10. NetEase Cloud Music crawler

GitHub: https://github.com/RitterHou/music-163

11. CSDN blog crawler

GitHub: https://github.com/Kevinsss/csdn-spider

12. imooc (慕课网) video crawler

GitHub: https://github.com/qiyeboy/spider_smooc
