Douban Book Pro Scraper collects structured book data from Douban Book pages and popular lists, turning messy web pages into clean, reusable datasets. It’s built for developers and analysts who need reliable Douban book metadata for cataloging, research, or trend tracking. If you’re looking for a practical Douban book scraper with richer fields, this project is designed for that job.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
This project extracts detailed book information from Douban Book, including list pages (like Top 250 and popular rankings) and the individual book pages linked from them. It solves the common pain of manually copying book details or dealing with incomplete datasets by producing consistent, structured outputs ready for storage, search, or analytics. It’s for data engineers, researchers, product teams, and anyone building a book dataset pipeline.
- Captures both list-level ranking context and per-book detail pages for complete coverage.
- Crawls paginated lists (e.g., Top 250, new releases, popular books) page by page so entries are not skipped.
- Normalizes common Douban fields into predictable JSON for downstream systems.
- Supports repeatable runs for monitoring changes (rank shifts, rating updates, new entries).
- Produces data that’s easy to export into CSV, JSONL, databases, or search indexes.
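As an illustration of the monitoring use case, the sketch below compares two runs of the same list and reports rank shifts. It assumes each record carries the `doubanId` and `listRank` fields described in this README; the function name `rank_shifts` and the sample data are hypothetical and not part of the project.

```python
from typing import Dict, List, Optional


def rank_shifts(previous: List[dict], current: List[dict]) -> Dict[str, Optional[int]]:
    """Map doubanId -> rank change between two runs of the same list.

    A positive number means the book moved up, a negative number means it
    moved down, and None marks a book that was not in the previous run.
    """
    old_ranks = {rec["doubanId"]: rec["listRank"] for rec in previous}
    shifts: Dict[str, Optional[int]] = {}
    for rec in current:
        old = old_ranks.get(rec["doubanId"])
        shifts[rec["doubanId"]] = None if old is None else old - rec["listRank"]
    return shifts


# Hypothetical snapshots from two runs.
previous_run = [
    {"doubanId": "1234567", "listRank": 12},
    {"doubanId": "7654321", "listRank": 3},
]
current_run = [
    {"doubanId": "1234567", "listRank": 10},
    {"doubanId": "7654321", "listRank": 4},
    {"doubanId": "1111111", "listRank": 25},
]
shifts = rank_shifts(previous_run, current_run)
print(shifts)
```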
| Feature | Description |
|---|---|
| Paginated list crawling | Walks through multi-page lists (New Books, Top 250, Popular) and gathers every entry consistently. |
| Deep book-page extraction | Visits each book page to collect richer metadata beyond what list cards show. |
| Structured JSON output | Produces predictable, schema-friendly JSON objects suitable for pipelines and ETL. |
| Deduplication and canonical URLs | Prevents duplicates across lists by tracking canonical book URLs/IDs. |
| Resilient retries and backoff | Handles transient failures with configurable retries to improve run stability. |
| Config-driven runs | Lets you target specific lists, page ranges, concurrency, and output options via config. |
| Export helpers | Optional exporters for JSONL/CSV to simplify integration with analytics workflows. |
| Field Name | Field Description |
|---|---|
| sourceList | Which list the book was collected from (e.g., new_books, top_250, popular). |
| listRank | Rank/position within the source list (when available). |
| title | Primary book title as displayed on the book page. |
| originalTitle | Original/alternate title if present (often for translated works). |
| doubanUrl | Canonical Douban Book URL for the book. |
| doubanId | Book identifier parsed from the URL or page metadata. |
| coverImageUrl | Direct URL to the book cover image. |
| authors | Array of author names. |
| translators | Array of translator names (if any). |
| publisher | Publisher name. |
| publicationDate | Publication date (normalized where possible). |
| isbn | ISBN string, when available on the page. |
| binding | Binding/format (paperback, hardcover, etc.), if present. |
| pages | Page count (integer when parseable). |
| price | Listed price string (kept as raw text to preserve currency/format). |
| series | Series name, if the book belongs to one. |
| ratingValue | Average rating value as a number. |
| ratingCount | Total number of ratings (integer). |
| reviewCount | Total number of reviews/comments, if shown. |
| tags | Array of topical tags assigned on Douban. |
| categories | Extracted categories/genres if present in page sections. |
| summary | Book summary/description text. |
| authorIntro | Author introduction text if available. |
| tableOfContents | Table of contents text when provided. |
| quotes | Notable quotes/excerpts section if present. |
| relatedBooks | Array of related/recommended books (title + URL when available). |
| scrapedAt | ISO timestamp of when the record was collected. |
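A few of the fields above are described as parsed or normalized (`doubanId`, `pages`, `publicationDate`). The sketch below shows one way such normalization could work. It is not the project's `fields_normalizer.py`; the helper names and the raw input formats are assumptions made for illustration.

```python
import re
from typing import Optional


def parse_douban_id(url: str) -> Optional[str]:
    """Extract the numeric ID from a URL shaped like .../subject/<digits>/."""
    match = re.search(r"/subject/(\d+)", url)
    return match.group(1) if match else None


def parse_pages(raw: str) -> Optional[int]:
    """Return the first integer found in the raw page-count text, else None."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


def normalize_publication_date(raw: str) -> str:
    """Turn year[-month[-day]] text into zero-padded ISO-like form.

    Falls back to the stripped raw text when the input does not start
    with a four-digit year (assumed behavior, "normalized where possible").
    """
    parts = re.findall(r"\d+", raw)
    if not parts or len(parts[0]) != 4 or len(parts) > 3:
        return raw.strip()
    return "-".join([parts[0]] + [p.zfill(2) for p in parts[1:]])


print(parse_douban_id("https://book.douban.com/subject/1234567/"))
print(parse_pages("352 pages"))
print(normalize_publication_date("2019-8"))
```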
[
  {
    "sourceList": "top_250",
    "listRank": 12,
    "title": "Example Book Title",
    "originalTitle": "Original Title",
    "doubanUrl": "https://book.douban.com/subject/1234567/",
    "doubanId": "1234567",
    "coverImageUrl": "https://img.example.com/cover.jpg",
    "authors": ["Author A", "Author B"],
    "translators": ["Translator X"],
    "publisher": "Example Publisher",
    "publicationDate": "2019-08",
    "isbn": "9787111123456",
    "binding": "Paperback",
    "pages": 352,
    "price": "CNY 59.00",
    "series": "Example Series",
    "ratingValue": 9.1,
    "ratingCount": 184532,
    "reviewCount": 12987,
    "tags": ["Fiction", "Classic", "Literature"],
    "summary": "A concise synopsis of the book pulled from the detail page.",
    "authorIntro": "Short author biography as displayed on the page.",
    "tableOfContents": "Chapter 1...\nChapter 2...\n",
    "relatedBooks": [
      { "title": "Related Book 1", "doubanUrl": "https://book.douban.com/subject/7654321/" }
    ],
    "scrapedAt": "2025-12-14T10:05:12Z"
  }
]
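Records shaped like the sample above contain arrays and nested objects, which CSV cannot hold directly. The following is a minimal sketch of one possible flattening strategy, not the project's `csv_exporter.py`: string arrays are joined with a separator, and anything more complex is stored as a JSON string. The helper name `to_csv_row` and the separator are assumptions.

```python
import csv
import io
import json

# A trimmed-down record using field names from this README.
record = {
    "doubanId": "1234567",
    "title": "Example Book Title",
    "authors": ["Author A", "Author B"],
    "tags": ["Fiction", "Classic"],
    "relatedBooks": [
        {"title": "Related Book 1", "doubanUrl": "https://book.douban.com/subject/7654321/"}
    ],
    "ratingValue": 9.1,
}


def to_csv_row(rec: dict, list_sep: str = "; ") -> dict:
    """Flatten one record so that every value fits in a single CSV cell."""
    row = {}
    for key, value in rec.items():
        if isinstance(value, list):
            if all(isinstance(item, str) for item in value):
                row[key] = list_sep.join(value)  # e.g. authors, tags
            else:
                row[key] = json.dumps(value, ensure_ascii=False)  # e.g. relatedBooks
        else:
            row[key] = value
    return row


row = to_csv_row(record)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping nested values as JSON strings loses nothing, at the cost of requiring a second parse step in whatever reads the CSV.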
douban book pro/
├── src/
│   ├── main.py
│   ├── runner.py
│   ├── cli.py
│   ├── config/
│   │   ├── settings.example.json
│   │   └── targets.example.json
│   ├── clients/
│   │   ├── http_client.py
│   │   └── session_manager.py
│   ├── extractors/
│   │   ├── list_parser.py
│   │   ├── book_parser.py
│   │   ├── fields_normalizer.py
│   │   └── text_cleaner.py
│   ├── pipelines/
│   │   ├── scheduler.py
│   │   ├── deduplicator.py
│   │   └── retry_policy.py
│   ├── exporters/
│   │   ├── json_exporter.py
│   │   ├── jsonl_exporter.py
│   │   └── csv_exporter.py
│   ├── storage/
│   │   ├── state_store.py
│   │   └── cache.py
│   └── utils/
│       ├── logger.py
│       ├── dates.py
│       └── urls.py
├── data/
│   ├── sample_output.json
│   └── sample_output.jsonl
├── tests/
│   ├── test_list_parser.py
│   ├── test_book_parser.py
│   └── fixtures/
│       ├── list_page.html
│       └── book_page.html
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
- Data analysts use it to build a Douban book dataset, so they can track rating and popularity trends over time.
- Developers use it to sync book metadata into an internal catalog, so they can power search, filters, and recommendations.
- Researchers use it to collect Top 250 and popular lists, so they can study reading culture and ranking dynamics.
- Product teams use it to monitor new book releases, so they can spot emerging themes and authors early.
- Collectors and librarians use it to enrich records with summaries, tags, and publication details, so their archives stay complete.
How do I choose which Douban pages to scrape?
Copy src/config/targets.example.json to your own config file, then set the target list(s) there. Typical targets include new books, Top 250, and popular rankings. You can also set page ranges to limit how deep the crawler goes.
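The real schema of the targets file is not documented in this README, so the following is only a sketch of what a target entry and page-range expansion could look like. Every key name (`name`, `url`, `firstPage`, `lastPage`, `pageSize`) and the offset-based `start` query parameter are assumptions; check `targets.example.json` for the actual format.

```python
import json
from typing import Iterator

# Hypothetical config content; the project's real keys may differ.
TARGETS_JSON = """
{
  "targets": [
    {
      "name": "top_250",
      "url": "https://book.douban.com/top250",
      "firstPage": 1,
      "lastPage": 3,
      "pageSize": 25
    }
  ]
}
"""


def page_urls(target: dict) -> Iterator[str]:
    """Yield one URL per page in the configured (inclusive) page range."""
    for page in range(target["firstPage"], target["lastPage"] + 1):
        offset = (page - 1) * target["pageSize"]  # assumed offset-style pagination
        yield f"{target['url']}?start={offset}"


config = json.loads(TARGETS_JSON)
urls = [url for target in config["targets"] for url in page_urls(target)]
for url in urls:
    print(url)
```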
Does it scrape only list pages, or also individual book pages?
It does both. The list pages provide discovery (URLs, ranks, quick stats), and then the scraper visits each book page to extract richer metadata like ISBN, publisher, summary, tags, and related items.
How does it avoid duplicates across multiple lists?
Records are keyed by canonical doubanId and doubanUrl. If the same book appears in more than one list, the scraper merges or de-duplicates based on your pipeline settings, while preserving sourceList context.
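One way to merge duplicates while keeping list context is sketched below. It is not the project's `deduplicator.py`: the function name and the `appearances` field, which collects every `sourceList`/`listRank` pair for a book, are hypothetical choices made for this example.

```python
from typing import Dict, List


def merge_by_douban_id(records: List[dict]) -> List[dict]:
    """Collapse records that share a doubanId into one record per book.

    List-specific fields move into an "appearances" array so no ranking
    context is lost; all other fields come from the first record seen.
    """
    merged: Dict[str, dict] = {}
    for rec in records:
        appearance = {"sourceList": rec["sourceList"], "listRank": rec.get("listRank")}
        key = rec["doubanId"]
        if key not in merged:
            base = {k: v for k, v in rec.items() if k not in ("sourceList", "listRank")}
            base["appearances"] = [appearance]
            merged[key] = base
        else:
            merged[key]["appearances"].append(appearance)
    return list(merged.values())


records = [
    {"doubanId": "1", "title": "A", "sourceList": "top_250", "listRank": 12},
    {"doubanId": "1", "title": "A", "sourceList": "popular", "listRank": 3},
    {"doubanId": "2", "title": "B", "sourceList": "new_books"},
]
merged = merge_by_douban_id(records)
print(merged)
```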
What are the common reasons a run might miss some books?
Most misses come from temporary network errors, aggressive throttling, or changes in page layout. The retry and backoff policy helps with transient failures, and the parsers are structured so they can be updated quickly if page markup shifts.
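For readers unfamiliar with retry and backoff, here is a generic sketch of exponential backoff with jitter. It does not reproduce the project's `retry_policy.py`; the function name, defaults, and the choice of which exceptions count as transient are assumptions. The `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient errors with exponential backoff.

    The wait doubles after each failure (1s, 2s, 4s, ...), is capped at
    max_delay, and gets up to 50% random jitter so that parallel workers
    do not retry in lockstep. The last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):  # assumed "transient" errors
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Errors that signal a layout change, such as a parser failing to find an expected element, should not be retried this way, since repeating the request will not change the markup.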
Primary Metric: Averaged 55–90 book detail pages per minute on a stable connection with moderate concurrency (6–10 workers), including list traversal.
Reliability Metric: 97–99% successful page fetch rate in long runs, with retries recovering most transient failures.
Efficiency Metric: Typical memory footprint stays under 220 MB for runs up to ~5,000 books, with streaming export to JSONL to avoid large in-memory buffers.
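The memory figure above depends on streaming export: each record is written as soon as it is produced instead of being collected into a list first. A minimal sketch of that idea follows, assuming records arrive from a generator; `write_jsonl` and `fake_records` are illustrative names, not the project's `jsonl_exporter.py`.

```python
import io
import json
from typing import IO, Iterable, Iterator


def write_jsonl(records: Iterable[dict], out: IO[str]) -> int:
    """Write one JSON object per line and return the number written.

    Only one record is held in memory at a time, so memory use does not
    grow with the size of the run.
    """
    count = 0
    for rec in records:
        out.write(json.dumps(rec, ensure_ascii=False) + "\n")
        count += 1
    return count


def fake_records(n: int) -> Iterator[dict]:
    """Stand-in for the scraper: yields records lazily."""
    for i in range(n):
        yield {"doubanId": str(i), "title": f"Book {i}"}


buffer = io.StringIO()  # in a real run this would be an open file
written = write_jsonl(fake_records(3), buffer)
print(buffer.getvalue())
```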
Quality Metric: 95%+ field completeness on core metadata (title, authors, rating, counts, publisher, publication date), with optional sections (TOC, quotes, author intro) varying by book availability.
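Field completeness can be checked on your own output with a few lines of code. The sketch below counts a field as complete when it is present and not empty; that definition, the function name, and the mapping of "rating, counts" to `ratingValue` and `ratingCount` are assumptions, since the README does not say how the figure was computed.

```python
from typing import Dict, List, Sequence

# Assumed mapping of the "core metadata" named above to output field names.
CORE_FIELDS = ("title", "authors", "ratingValue", "ratingCount", "publisher", "publicationDate")


def completeness(records: List[dict], fields: Sequence[str] = CORE_FIELDS) -> Dict[str, float]:
    """Return, per field, the fraction of records with a non-empty value."""
    total = len(records)
    result: Dict[str, float] = {}
    for field in fields:
        filled = sum(1 for rec in records if rec.get(field) not in (None, "", []))
        result[field] = filled / total if total else 0.0
    return result


sample = [
    {"title": "A", "authors": ["X"], "ratingValue": 9.1, "ratingCount": 10,
     "publisher": "P", "publicationDate": "2019-08"},
    {"title": "B", "authors": [], "ratingValue": 8.0, "ratingCount": 5,
     "publisher": "", "publicationDate": "2020"},
]
report = completeness(sample)
print(report)
```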
