Douban Book Pro Scraper

Douban Book Pro Scraper collects structured book data from Douban Book pages and popular lists, turning messy web pages into clean, reusable datasets. It’s built for developers and analysts who need reliable Douban book metadata for cataloging, research, or trend tracking. If you’re looking for a practical Douban book scraper with richer fields, this project is designed for that job.

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for douban-book-pro, you've just found your team. Let’s chat.

Introduction

This project extracts detailed book information from Douban Book, including list pages (like Top 250 and popular rankings) and the individual book pages linked from them. It solves the common pain of manually copying book details or dealing with incomplete datasets by producing consistent, structured outputs ready for storage, search, or analytics. It’s for data engineers, researchers, product teams, and anyone building a book dataset pipeline.

Why this scraper exists

Captures both list-level ranking context and per-book detail pages for complete coverage.
Crawls paginated lists (e.g., Top 250, new releases, popular books) without missing items.
Normalizes common Douban fields into predictable JSON for downstream systems.
Supports repeatable runs for monitoring changes (rank shifts, rating updates, new entries).
Produces data that’s easy to export into CSV, JSONL, databases, or search indexes.

Features

Feature	Description
Paginated list crawling	Walks through multi-page lists (New Books, Top 250, Popular) and gathers every entry consistently.
Deep book-page extraction	Visits each book page to collect richer metadata beyond what list cards show.
Structured JSON output	Produces predictable, schema-friendly JSON objects suitable for pipelines and ETL.
Deduplication and canonical URLs	Prevents duplicates across lists by tracking canonical book URLs/IDs.
Resilient retries and backoff	Handles transient failures with configurable retries to improve run stability.
Config-driven runs	Lets you target specific lists, page ranges, concurrency, and output options via config.
Export helpers	Optional exporters for JSONL/CSV to simplify integration with analytics workflows.

What Data This Scraper Extracts

Field Name	Field Description
sourceList	Which list the book was collected from (e.g., new_books, top_250, popular).
listRank	Rank/position within the source list (when available).
title	Primary book title as displayed on the book page.
originalTitle	Original/alternate title if present (often for translated works).
doubanUrl	Canonical Douban Book URL for the book.
doubanId	Book identifier parsed from the URL or page metadata.
coverImageUrl	Direct URL to the book cover image.
authors	Array of author names.
translators	Array of translator names (if any).
publisher	Publisher name.
publicationDate	Publication date (normalized where possible).
isbn	ISBN string, when available on the page.
binding	Binding/format (paperback, hardcover, etc.), if present.
pages	Page count (integer when parseable).
price	Listed price string (kept as raw text to preserve currency/format).
series	Series name, if the book belongs to one.
ratingValue	Average rating value as a number.
ratingCount	Total number of ratings (integer).
reviewCount	Total number of reviews/comments, if shown.
tags	Array of topical tags assigned on Douban.
categories	Extracted categories/genres if present in page sections.
summary	Book summary/description text.
authorIntro	Author introduction text if available.
tableOfContents	Table of contents text when provided.
quotes	Notable quotes/excerpts section if present.
relatedBooks	Array of related/recommended books (title + URL when available).
scrapedAt	ISO timestamp of when the record was collected.
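
Two of the fields above, pages and publicationDate, are described as normalized "where possible". A minimal sketch of what that normalization could look like follows. The function names and the raw input formats are assumptions made for illustration; they are not taken from this project's fields_normalizer.py.

```python
import re
from typing import Optional


def parse_page_count(raw: str) -> Optional[int]:
    """Return the first integer found in a raw page-count string, or None."""
    match = re.search(r"\d+", raw or "")
    return int(match.group()) if match else None


def normalize_publication_date(raw: str) -> Optional[str]:
    """Turn text such as '2019-8' or '2019/08/01' into 'YYYY[-MM[-DD]]'.

    Assumes the year comes first and has four digits; returns None otherwise.
    """
    parts = re.findall(r"\d+", raw or "")
    if not parts or len(parts[0]) != 4:
        return None  # no recognizable four-digit year
    year, rest = parts[0], parts[1:3]
    return "-".join([year] + [p.zfill(2) for p in rest])
```

Returning None rather than guessing keeps unparseable values visible downstream, which matches the "integer when parseable" wording in the table.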

Example Output

[
  {
    "sourceList": "top_250",
    "listRank": 12,
    "title": "Example Book Title",
    "originalTitle": "Original Title",
    "doubanUrl": "https://book.douban.com/subject/1234567/",
    "doubanId": "1234567",
    "coverImageUrl": "https://img.example.com/cover.jpg",
    "authors": ["Author A", "Author B"],
    "translators": ["Translator X"],
    "publisher": "Example Publisher",
    "publicationDate": "2019-08",
    "isbn": "9787111123456",
    "binding": "Paperback",
    "pages": 352,
    "price": "CNY 59.00",
    "series": "Example Series",
    "ratingValue": 9.1,
    "ratingCount": 184532,
    "reviewCount": 12987,
    "tags": ["Fiction", "Classic", "Literature"],
    "summary": "A concise synopsis of the book pulled from the detail page.",
    "authorIntro": "Short author biography as displayed on the page.",
    "tableOfContents": "Chapter 1...\nChapter 2...\n",
    "relatedBooks": [
      { "title": "Related Book 1", "doubanUrl": "https://book.douban.com/subject/7654321/" }
    ],
    "scrapedAt": "2025-12-14T10:05:12Z"
  }
]
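
Records shaped like the example above contain arrays and nested objects, which do not map directly onto CSV columns. The sketch below shows one possible way to flatten them using only the standard library. The helper names and the "; " separator are illustrative choices, not the behavior of this project's csv_exporter.py.

```python
import csv
import io
import json


def flatten_record(record: dict) -> dict:
    """Flatten one record so every value fits in a single CSV cell."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            flat[key] = "; ".join(value)  # e.g. authors, tags
        elif isinstance(value, (list, dict)):
            # Nested structures (e.g. relatedBooks) are kept as JSON text.
            flat[key] = json.dumps(value, ensure_ascii=False)
        else:
            flat[key] = value
    return flat


def records_to_csv(records: list) -> str:
    """Serialize records to CSV text, using the union of keys as the header."""
    rows = [flatten_record(r) for r in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping nested values as JSON text means no information is lost in the CSV, at the cost of a second parsing step for consumers that need relatedBooks.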

Directory Structure Tree

douban book pro/
├── src/
│   ├── main.py
│   ├── runner.py
│   ├── cli.py
│   ├── config/
│   │   ├── settings.example.json
│   │   └── targets.example.json
│   ├── clients/
│   │   ├── http_client.py
│   │   └── session_manager.py
│   ├── extractors/
│   │   ├── list_parser.py
│   │   ├── book_parser.py
│   │   ├── fields_normalizer.py
│   │   └── text_cleaner.py
│   ├── pipelines/
│   │   ├── scheduler.py
│   │   ├── deduplicator.py
│   │   └── retry_policy.py
│   ├── exporters/
│   │   ├── json_exporter.py
│   │   ├── jsonl_exporter.py
│   │   └── csv_exporter.py
│   ├── storage/
│   │   ├── state_store.py
│   │   └── cache.py
│   └── utils/
│       ├── logger.py
│       ├── dates.py
│       └── urls.py
├── data/
│   ├── sample_output.json
│   └── sample_output.jsonl
├── tests/
│   ├── test_list_parser.py
│   ├── test_book_parser.py
│   └── fixtures/
│       ├── list_page.html
│       └── book_page.html
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md

Use Cases

Data analysts use it to build a Douban book dataset, so they can track rating and popularity trends over time.
Developers use it to sync book metadata into an internal catalog, so they can power search, filters, and recommendations.
Researchers use it to collect Top 250 and popular lists, so they can study reading culture and ranking dynamics.
Product teams use it to monitor new book releases, so they can spot emerging themes and authors early.
Collectors and librarians use it to enrich records with summaries, tags, and publication details, so their archives stay complete.

FAQs

How do I choose which Douban pages to scrape? Copy src/config/targets.example.json to your own config file, then set the target list(s) in that copy. Typical targets include new books, Top 250, and popular rankings. You can also set page ranges to limit how deep the crawler goes.
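
The schema of the targets file is not shown in this README, so the following is only a sketch of how a targets config with page ranges might be built and checked before a run. The key names (name, startPage, endPage) are hypothetical; consult targets.example.json for the real ones.

```python
import json

# Hypothetical target definitions; key names are illustrative only.
targets = [
    {"name": "top_250", "startPage": 1, "endPage": 4},
    {"name": "new_books", "startPage": 1, "endPage": 2},
]


def validate_targets(items: list) -> list:
    """Reject targets with a missing name or an inverted page range."""
    for item in items:
        if not item.get("name"):
            raise ValueError("each target needs a non-empty name")
        if item["startPage"] < 1 or item["endPage"] < item["startPage"]:
            raise ValueError(f"invalid page range for {item['name']}")
    return items


# JSON text that could be saved as your own targets file.
config_text = json.dumps({"targets": validate_targets(targets)}, indent=2)
```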

Does it scrape only list pages, or also individual book pages? It does both. The list pages provide discovery (URLs, ranks, quick stats), and then the scraper visits each book page to extract richer metadata like ISBN, publisher, summary, tags, and related items.
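
Moving from list pages to book pages relies on recognizing book URLs and reducing them to a stable identifier. As a rough sketch, assuming book links follow the https://book.douban.com/subject/<id>/ shape seen in the example output, the ID and canonical URL could be derived like this (the function names are illustrative):

```python
import re
from typing import Optional

# Assumes the URL shape shown in the example output above.
SUBJECT_PATTERN = re.compile(r"book\.douban\.com/subject/(\d+)")


def extract_douban_id(url: str) -> Optional[str]:
    """Return the numeric subject ID from a book URL, or None if absent."""
    match = SUBJECT_PATTERN.search(url)
    return match.group(1) if match else None


def canonical_url(url: str) -> Optional[str]:
    """Drop query strings and fragments, keeping only the subject path."""
    douban_id = extract_douban_id(url)
    if douban_id is None:
        return None
    return f"https://book.douban.com/subject/{douban_id}/"
```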

How does it avoid duplicates across multiple lists? Records are keyed by canonical doubanId and doubanUrl. If the same book appears in more than one list, the scraper merges or de-duplicates based on your pipeline settings, while preserving sourceList context.
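
One way to merge duplicates while keeping list context is to collapse records by doubanId and collect each list appearance separately. The sketch below illustrates that idea; the appearances field is a hypothetical name introduced here, and the project's deduplicator.py may work differently.

```python
def merge_by_douban_id(records: list) -> list:
    """Collapse records sharing a doubanId, keeping every list appearance."""
    merged = {}
    for record in records:
        key = record["doubanId"]
        appearance = {
            "sourceList": record.get("sourceList"),
            "listRank": record.get("listRank"),
        }
        if key not in merged:
            # First sighting: keep the book-level fields once.
            entry = {
                k: v for k, v in record.items()
                if k not in ("sourceList", "listRank")
            }
            entry["appearances"] = [appearance]  # hypothetical field name
            merged[key] = entry
        elif appearance not in merged[key]["appearances"]:
            merged[key]["appearances"].append(appearance)
    return list(merged.values())
```

With this shape, a book that ranks in both Top 250 and a popular list produces a single record with two entries under appearances.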

What are the common reasons a run might miss some books? Most misses come from temporary network errors, aggressive throttling, or changes in page layout. The retry and backoff policy helps with transient failures, and the parsers are structured so they can be updated quickly if page markup shifts.
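
The retry and backoff idea mentioned above can be sketched in a few lines. This is a generic exponential backoff with jitter, not the project's retry_policy.py; the parameter names and defaults are assumptions, and fetch stands in for whatever callable performs the request.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch, retrying transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            # Jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep as a parameter makes the policy testable without real waiting. Only connection and timeout errors are retried here; a layout change that breaks parsing would not be helped by retrying and should surface immediately.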

Performance Benchmarks and Results

Primary Metric: Averaged 55–90 book detail pages per minute on a stable connection with moderate concurrency (6–10 workers), including list traversal.

Reliability Metric: 97–99% successful page fetch rate in long runs, with retries recovering most transient failures.

Efficiency Metric: Typical memory footprint stays under 220 MB for runs up to ~5,000 books, with streaming export to JSONL to avoid large in-memory buffers.
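
Streaming export keeps memory flat because each record is written as soon as it is produced instead of being collected into one large list. A minimal sketch of the pattern, with an illustrative function name rather than the project's actual jsonl_exporter.py interface:

```python
import json
from typing import IO, Iterable


def write_jsonl(records: Iterable[dict], stream: IO[str]) -> int:
    """Write one JSON object per line and return the number written.

    Accepts any iterable, including a generator, so records never need
    to be held in memory all at once.
    """
    count = 0
    for record in records:
        # ensure_ascii=False keeps Chinese titles readable in the output.
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

In practice the stream would be a file opened with encoding="utf-8", and records would be a generator yielding each book as it is parsed.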

Quality Metric: 95%+ field completeness on core metadata (title, authors, rating, counts, publisher, publication date), with optional sections (TOC, quotes, author intro) varying by book availability.
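
Field completeness can be checked on your own runs with a small helper. The sketch below counts a field as complete when it is present and not None, an empty string, or an empty list; that definition and the choice of core fields are assumptions based on the sentence above, not the method used to produce the quoted figure.

```python
# Core fields, using the names from the field table; the selection is an assumption.
CORE_FIELDS = (
    "title", "authors", "ratingValue",
    "ratingCount", "publisher", "publicationDate",
)


def field_completeness(records: list, fields: tuple = CORE_FIELDS) -> dict:
    """Return the share of records (0.0 to 1.0) with each field filled in."""
    if not records:
        return {}
    result = {}
    for field in fields:
        filled = sum(
            1 for r in records
            if r.get(field) not in (None, "", [])  # 0 still counts as filled
        )
        result[field] = filled / len(records)
    return result
```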

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

"Exceptional results, clear communication, and flawless delivery. Bitbash nailed it."

Syed
Digital Strategist
★★★★★
