ptt-crawler

A crawler for the web version of PTT.

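Description

Crawling starts from the index stored in ptt_crawler.cfg and continues up to the newest post on the board. The default index is [1, 1], which means crawling starts at the 1st post on board_name/index1.html and runs through the last post.

The index of the post to start from next time is also recorded, so the next run resumes from that post and crawls up to the newest post.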
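
The resume behaviour described above could be sketched roughly as follows. This is only an illustration: the actual layout of ptt_crawler.cfg is not documented here, so the sketch assumes a hypothetical INI file with one section per board and an `index` key holding a JSON list; `load_index` and `save_index` are made-up helper names.

```python
import configparser
import json

# Default start index from the README: [page number, post number on that page].
DEFAULT_INDEX = [1, 1]


def load_index(cfg_path, board):
    """Return the saved [page, post] start index for `board`, or the default.

    Assumed (hypothetical) cfg layout:
        [BOARD_NAME]
        index = [page, post]
    """
    parser = configparser.ConfigParser()
    parser.read(cfg_path)  # a missing file is silently ignored
    if parser.has_option(board, "index"):
        return json.loads(parser.get(board, "index"))
    return list(DEFAULT_INDEX)


def save_index(cfg_path, board, index):
    """Record the index the next run should start from."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)  # keep the entries of other boards
    if not parser.has_section(board):
        parser.add_section(board)
    parser.set(board, "index", json.dumps(index))
    with open(cfg_path, "w") as f:
        parser.write(f)
```

With helpers like these, a run would call `load_index` before crawling and `save_index` once the newest post has been reached.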

Posts are stored in the following two ways:

JSON files (posts/board_name/[M|G].[unsigned_integer].A.[HEX{3}].json)
MySQL (enabled by default; pass --no-database to disable it)
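
To make the file naming pattern concrete, here is a minimal sketch that validates a post ID and builds the corresponding JSON path. The helper name `post_json_path` and the regular expression are illustrative and not taken from ptt_crawler.py; the sketch also assumes the three hex digits are uppercase.

```python
import os
import re

# [M|G].[unsigned_integer].A.[HEX{3}], e.g. "M.1428379172.A.E4F"
# Assumption: the hex digits are uppercase.
POST_ID_RE = re.compile(r"[MG]\.\d+\.A\.[0-9A-F]{3}")


def post_json_path(board, post_id, root="posts"):
    """Return posts/<board>/<post_id>.json, rejecting malformed post IDs."""
    if not POST_ID_RE.fullmatch(post_id):
        raise ValueError("unexpected post id: %r" % post_id)
    return os.path.join(root, board, post_id + ".json")
```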

Output format

JSON

{
    "board": board name,
    "url": [M|G].[unsigned_integer].A.[HEX{3}],
    "author": author,
    "title": post title,
    "datetime": post time,
    "ip": poster's IP,
    "content": post content,
    "pushes": [
        {
            "status": 推/噓/→,
            "userid": commenter's ID,
            "content": comment content,
            "datetime": comment time
        }
    ]
}
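
As a usage illustration, a stored post can be read back with the standard `json` module and its comments tallied by status. The sample record below is invented for this sketch and carries only the `pushes` field; `count_pushes` is a hypothetical helper, and the field values are assumed to be strings.

```python
import json
from collections import Counter


def count_pushes(post):
    """Count comments per status (推 / 噓 / →) in one post record."""
    return Counter(push["status"] for push in post.get("pushes", []))


# Invented sample data that follows the layout above (not real PTT content).
sample = json.loads("""
{
    "board": "Tainan",
    "pushes": [
        {"status": "推", "userid": "user_a", "content": "nice", "datetime": "04/07 12:00"},
        {"status": "噓", "userid": "user_b", "content": "meh", "datetime": "04/07 12:01"},
        {"status": "推", "userid": "user_c", "content": "+1", "datetime": "04/07 12:02"}
    ]
}
""")

counts = count_pushes(sample)
```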

For the MySQL table schema, see ptt.sql.

Notes

To use MySQL, create a my.cnf file. The default path is ~/.my.cnf; change it as needed.

# in my.cnf
[client]
host = localhost
port = 3306
database = dbname
user = username
password = password
default-character-set = utf8
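
For reference, the `[client]` section above can be read with Python's standard `configparser`. This is a sketch only: how ptt_crawler.py actually loads the file is not stated here, and `read_client_options` is a hypothetical helper.

```python
import configparser
import os


def read_client_options(path="~/.my.cnf"):
    """Return the [client] options of a my.cnf-style file as a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf may contain bare flags without "= value"
        interpolation=None,   # keep "%" in passwords literal
    )
    with open(os.path.expanduser(path)) as f:
        parser.read_file(f)
    options = dict(parser["client"])
    if options.get("port"):
        options["port"] = int(options["port"])  # values are read as strings
    return options
```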

Environment

Python 3.4.3

Pre-Install

Usage

$ python3 ptt_crawler.py (-b | --board) BOARD_NAME [--no-database]
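
A command line of this shape could be declared with `argparse` roughly as follows. This is a sketch based only on the synopsis above, not the parser from ptt_crawler.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser matching: (-b | --board) BOARD_NAME [--no-database]."""
    parser = argparse.ArgumentParser(
        prog="ptt_crawler.py",
        description="A crawler for the web version of PTT",
    )
    parser.add_argument(
        "-b", "--board", required=True, metavar="BOARD_NAME",
        help="name of the board to crawl",
    )
    parser.add_argument(
        "--no-database", action="store_true",
        help="store posts as JSON files only and skip MySQL",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["-b", "Tainan", "--no-database"])
```

Here `args.board` holds the board name, and `args.no_database` is true only when the flag is given.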

Examples

$ python3 ptt_crawler.py -b gossiping # store to both file and database
or
$ python3 ptt_crawler.py -b tainan --no-database # store to file only

Inspired by PTTcrawler.
