Introduction

Introduction
Structure
Usage
Interfaces

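Introduction

This project is a subproject of Project Nichijou: a crawler framework built on top of Scrapy and implemented according to the project's internal specification.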

Structure

common
├── cache
│   ├── cache_maker.py              # cache creation
│   └── cache_response.py           # wrapped cache Response
├── config
│   ├── settings.py                 # configuration file
│   └── settings_template.py        # configuration template
├── cookies
│   ├── cookies.json                # cookies
│   ├── cookies.json.backup         # cookies backup
│   ├── cookies_io.py               # I/O wrapper for cookies
│   └── cookies_template.json       # cookies template
├── database
│   ├── database.py                 # database
│   └── database_command.py         # table-creation commands wrapped according to the [specification]
├── items                           # Items wrapped according to the [specification]
│   ├── anime_item.py
│   ├── anime_name_item.py
│   ├── cache_item.py
│   ├── common_item.py              # custom Item base class
│   ├── episode_item.py
│   ├── episode_name_item.py
│   ├── fail_request_item.py
│   └── log_item.py
├── middlewares
│   ├── cache_middleware.py         # request cache middleware
│   └── cookie_middleware.py        # cookies persistence middleware
├── pipelines
│   └── storing_pipeline.py         # storage pipeline
├── spiders
│   └── common_spider.py            # custom Spider base class
└── utils
    ├── ac.py                       # Aho-Corasick automaton wrapper
    ├── checker.py                  # data validity checks
    ├── datetime.py                 # date/time format helpers
    ├── formatter.py                # formatting helpers
    ├── hash.py                     # hashing helpers
    └── logger.py                   # logging wrapper

Usage

Configure the project from the templates below (copy each one into the same directory and rename it):
- common/cookies/cookies_template.json
- common/config/settings_template.py
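
The copy-and-rename step can be scripted. This is a minimal sketch, assuming the target names are cookies.json and settings.py (the names that appear in the structure listing). The function name `copy_templates` and its `root` argument are hypothetical and not part of this project.

```python
import shutil
from pathlib import Path

# Template -> target pairs. Target names are assumed from the structure listing.
TEMPLATES = {
    "common/cookies/cookies_template.json": "common/cookies/cookies.json",
    "common/config/settings_template.py": "common/config/settings.py",
}


def copy_templates(root="."):
    """Copy each template next to itself under its working name.

    Existing targets are left untouched so local edits are never overwritten.
    Returns the list of files that were created.
    """
    created = []
    for template, target in TEMPLATES.items():
        src = Path(root) / template
        dst = Path(root) / target
        if src.exists() and not dst.exists():
            shutil.copyfile(src, dst)
            created.append(dst)
    return created


if __name__ == "__main__":
    for path in copy_templates():
        print(f"created {path}")
```

Run it from the directory that contains `common`, then edit the two copies by hand.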

Then configure Scrapy's settings file in the main project. The key fields are:

DOWNLOADER_MIDDLEWARES = {
	'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
	'common.middlewares.cookie_middleware.CommonCookiesMiddleware': 920,
	'common.middlewares.cache_middleware.CommonCacheMiddleware': 930,
}
ITEM_PIPELINES = {
	'common.pipelines.storing_pipeline.CommonStoringPipeline': 300,
}

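Note: the configuration above is not mandatory.

The framework can be used in the following ways:

- Subclass the parent class, customize certain fields, and override their values.
- Pass values as arguments. Note: on an Item, only names that start with _ can be modified directly as attributes; everything else has to be set dict-style.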

Interfaces

`common.spiders.common_spider.CommonSpider`

parent: scrapy.Spider

`use_cookies`

type: boolean
desc: If True, the cookies component is enabled for this spider; if False, it is not. Note: this only takes effect if the middlewares are configured in settings.
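
Put differently, two conditions must hold: the flag on the spider and the middleware entry in settings. The check below is a sketch of that logic, not the project's code. `cookies_active` and `DemoSpider` are hypothetical, and the middleware path is the one from the Usage section.

```python
COOKIE_MIDDLEWARE = "common.middlewares.cookie_middleware.CommonCookiesMiddleware"


def cookies_active(spider, downloader_middlewares):
    """Return True only if the spider opts in AND the middleware is enabled.

    In Scrapy settings, a middleware mapped to None is disabled.
    """
    configured = downloader_middlewares.get(COOKIE_MIDDLEWARE) is not None
    return bool(getattr(spider, "use_cookies", False)) and configured


class DemoSpider:
    # Stand-in object; a real spider would subclass CommonSpider.
    use_cookies = True


settings = {COOKIE_MIDDLEWARE: 920}
print(cookies_active(DemoSpider(), settings))  # True
print(cookies_active(DemoSpider(), {}))        # False: middleware not configured
```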

`initialize`

type: function
desc: Initializes the spider.

`init_normal_datasource`

type: function
desc: Initializes the data source for a normal run.

`init_fail_datasource`

type: function
desc: Initializes the data source for a retry-after-failure run.

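`common.items.common_item.CommonItem`

parent: scrapy.Item

`table`

type: str
desc: The database table this Item is saved to.

`primary_keys`

type: list
desc: The primary keys used when storing to the table, used to update existing data. If omitted, the data is overwritten directly.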
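
One way to read the `primary_keys` behaviour: with keys, only the supplied columns of the matching row are updated; without keys, the row is written wholesale. The sketch below illustrates that reading with the standard-library `sqlite3` module. The project's real storage backend and SQL are not shown in this README, and `store_item` and the `anime` table are hypothetical.

```python
import sqlite3


def store_item(conn, table, values, primary_keys=None):
    """Store a dict of column values; identifiers are assumed to be trusted."""
    columns = list(values)
    keys = list(primary_keys or [])
    others = [c for c in columns if c not in keys]
    if keys and others:
        # Update only the supplied non-key columns of the matching row.
        assignments = ", ".join(f"{c} = ?" for c in others)
        condition = " AND ".join(f"{k} = ?" for k in keys)
        cur = conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {condition}",
            [values[c] for c in others] + [values[k] for k in keys],
        )
        if cur.rowcount:
            return "updated"
    # No keys given, or no existing row: write the whole row.
    names = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT OR REPLACE INTO {table} ({names}) VALUES ({marks})",
        [values[c] for c in columns],
    )
    return "written"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anime (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
store_item(conn, "anime", {"id": 1, "name": "A", "score": 8.0}, ["id"])
store_item(conn, "anime", {"id": 1, "name": "B"}, ["id"])  # score is kept
row_after_update = conn.execute("SELECT * FROM anime").fetchone()
store_item(conn, "anime", {"id": 1, "name": "C"})  # whole row is replaced
row_after_overwrite = conn.execute("SELECT * FROM anime").fetchone()
```

In this sketch the keyed call preserves `score`, while the call without keys replaces the row and leaves `score` empty.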

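`_url`

type: str
desc: The URL of the request that produced this Item, used to delete the fail record.

`use_fail`

type: boolean
desc: Whether this Item takes part in retrying, or whether its failure record needs to be deleted on retry.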

About `cache`

CommonCacheMiddleware only handles reading from the cache; writing to the cache has to be implemented in the Spider. The wrapped functions in common/cache/cache_maker.py can be used for this.
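
The division of labour, with the middleware reading and the spider writing, can be sketched as follows. Everything here is a hypothetical stand-in: this README does not list the function names in cache_maker.py, so `read_cache`, `write_cache`, and the in-memory `_store` only illustrate the flow.

```python
import hashlib

_store = {}  # stand-in for the real cache storage


def _key(url):
    # Hash the URL so every cache key has a fixed length.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def read_cache(url):
    """Return the cached body for a URL, or None on a miss."""
    return _store.get(_key(url))


def write_cache(url, body):
    """Save a downloaded body under the URL's key."""
    _store[_key(url)] = body


def middleware_process_request(url):
    # Middleware side: read only. None means "go on and download".
    return read_cache(url)


def spider_parse(url, body):
    # Spider side: the parse callback is responsible for writing the cache.
    write_cache(url, body)
    return len(body)


url = "https://example.com/anime/1"
first = middleware_process_request(url)   # miss: nothing cached yet
spider_parse(url, "<html>page</html>")    # spider stores the response body
second = middleware_process_request(url)  # hit: served from the cache
```

If the spider never writes, every request in this sketch stays a cache miss, which is why the write step must not be left out.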
