Redis-based components for Scrapy
=================================
This is an initial work on Scrapy-Redis integration and is not production-tested.
Use it at your own risk!
Features:
* Distributed crawling/scraping
* Distributed post-processing
Requirements:
* Scrapy 0.17
* redis-py (tested on 2.4.9)
* redis server (tested on 2.2-2.4)
Available Scrapy components:
* Scheduler
* Duplication Filter
* Item Pipeline
Usage
-----
In your settings.py:
# enables scheduling storing requests queue in redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# don't cleanup redis queues, allows to pause/resume crawls
SCHEDULER_PERSIST = True
# store scraped items in redis for post-processing
ITEM_PIPELINES = [
'scrapy_redis.pipelines.RedisPipeline',
]
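With these settings, the requests queue and the scraped items are stored in the Redis server rather than inside the crawler process, which is what makes pausing, resuming, and running several crawlers at once possible. If you want to confirm what ends up in Redis, a quick check with redis-py works; this is just a sketch, and the default localhost:6379 server is an assumption:

    import redis

    r = redis.Redis()       # assumes redis server on localhost:6379
    print(r.keys('*'))      # list the keys created by the scheduler and pipeline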
Running the example project
---------------------------
You can test the functionality by following these steps:
1. Set up the scrapy_redis package in your PYTHONPATH
2. Run the crawler for the first time, then stop it
$ cd example-project
$ scrapy crawl dmoz
... [dmoz] ...
^C
3. Run the crawler again to resume the stopped crawl
$ scrapy crawl dmoz
... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)
4. Start one or more additional scrapy crawlers
$ scrapy crawl dmoz
... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)
5. Start one or more post-processing workers
$ python process_items.py
Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
...
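The process_items.py script in the example project is one such worker. As a rough sketch of the idea, a worker can block on the Redis list that the RedisPipeline fills and decode each entry as it arrives; the key name 'dmoz:items', the JSON serialization, and the 'name'/'url' item fields below are illustrative assumptions, not necessarily what the example project uses:

    import json
    import redis

    r = redis.Redis()  # assumes redis server on localhost:6379

    while True:
        # block until the pipeline pushes the next serialized item
        key, data = r.blpop('dmoz:items')
        item = json.loads(data)
        print("Processing: %s (%s)" % (item.get('name'), item.get('url')))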
That's it.