dblpPaperListCrawler

A web crawler for collecting paper lists from dblp.org

Introduction

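A script that crawls paper lists from dblp.org. You can specify which journals/conferences to crawl; the result is a list of papers, each with its title and authors.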

If you need the complete publication data, you can download the official XML file instead (updated daily, about 5 GB uncompressed).

Existing data

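The repository already contains crawl results for venues in the China Computer Federation (CCF) Recommended International Academic Conferences and Journals list, 2022 renamed edition (this file was published in June 2024). The results cover three areas: Network and Information Security; Computer Networks; and Software Engineering / System Software / Programming Languages. Across rank A, rank B, rank C, and other venues, that is 194 journals/conferences in total. Last updated: 2026-01-10.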

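The list of journals/conferences already crawled in this repository is in output/full_name_mapping.json, and the paper list for each journal/conference is under output/paper_lists/.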
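To check which venues are already covered before crawling, the mapping file can be inspected with a few lines of Python. This is only a minimal sketch: it assumes output/full_name_mapping.json is a JSON object at the top level (the README does not document its exact layout), and load_mapping is a hypothetical helper, not part of this repository.

```python
import json
from pathlib import Path


def load_mapping(path):
    """Load a JSON file and return it as a dict.

    Assumption: the file holds a JSON object at the top level.
    The real layout of full_name_mapping.json is not documented here.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object in {path}")
    return data


if __name__ == "__main__":
    mapping_path = Path("output/full_name_mapping.json")
    if mapping_path.exists():
        mapping = load_mapping(mapping_path)
        print(f"{len(mapping)} top-level entries")
        # Show a few entries to see how the file is organized.
        for key in list(mapping)[:5]:
            print(key, "->", mapping[key])
```

If the file turns out to be organized differently, only the isinstance check and the loop need adjusting.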

Usage

Install the dependencies:

pip install requests lxml

To crawl a custom set of journals/conferences, modify indices, using main.py as a reference.

from crawler import scrape_paper_lists

indices = {
    "journals": ["pami"],
    "conf": ["cvpr", "iccv", "eccv"],
}

if __name__ == "__main__":
    scrape_paper_lists(indices, output_dir="./path/to/output")

The values in indices must match the URL paths of the corresponding journals/conferences on dblp.org. See ccf2022.pdf for the specific paths.
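As an illustration of how indices relates to dblp URLs, the sketch below builds the index-page URL for each entry, following the https://dblp.org/db/<category>/<name>/ layout used on the dblp website. The function name index_urls is hypothetical and is not part of crawler.py; the crawler itself may construct its URLs differently.

```python
DBLP_DB_ROOT = "https://dblp.org/db/"


def index_urls(indices):
    """Return the dblp index-page URL for every venue in `indices`.

    `indices` maps a category ("journals" or "conf") to a list of
    venue keys, in the same shape as the dict passed to the crawler.
    """
    urls = []
    for category, names in indices.items():
        for name in names:
            urls.append(f"{DBLP_DB_ROOT}{category}/{name}/")
    return urls


indices = {
    "journals": ["pami"],
    "conf": ["cvpr", "iccv", "eccv"],
}

for url in index_urls(indices):
    print(url)
```

Opening one of the printed URLs in a browser is a quick way to confirm that a key is spelled the way dblp expects before starting a crawl.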

