Scrapy-Rotated-Proxy is a Scrapy downloader middleware that dynamically attaches a proxy to each Request, cycling repeatedly through the rotated proxies supplied by your configuration. It can temporarily block an unavailable proxy IP and bring it back into rotation once the proxy becomes available again, and it can permanently remove an invalid proxy IP via a Scrapy signal. The proxy IP list can be supplied through Spider Settings, a local file, or MongoDB.
- Python 2.7 or Python 3.3+
- Works on Linux, Windows, Mac OSX, BSD
The quick way:

```
pip install scrapy-rotated-proxy
```

Or copy this middleware into your Scrapy project.
Enable the scrapy-rotated-proxy middleware and supply the proxy IP list through Spider Settings:
```python
# -----------------------------------------------------------------------------
# ROTATED PROXY SETTINGS (Spider Settings Backend)
# -----------------------------------------------------------------------------
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_rotated_proxy.downloadmiddlewares.proxy.RotatedProxyMiddleware': 750,
})
ROTATED_PROXY_ENABLED = True
PROXY_STORAGE = 'scrapy_rotated_proxy.extensions.file_storage.FileProxyStorage'
# When PROXY_FILE_PATH is set to '', scrapy-rotated-proxy
# uses the proxies from Spider Settings by default.
PROXY_FILE_PATH = ''
HTTP_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888',
]
HTTPS_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888',
]
```
Enable the scrapy-rotated-proxy middleware and supply the proxy IP list through a local file:
```python
# -----------------------------------------------------------------------------
# ROTATED PROXY SETTINGS (Local File Backend)
# -----------------------------------------------------------------------------
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_rotated_proxy.downloadmiddlewares.proxy.RotatedProxyMiddleware': 750,
})
ROTATED_PROXY_ENABLED = True
PROXY_STORAGE = 'scrapy_rotated_proxy.extensions.file_storage.FileProxyStorage'
PROXY_FILE_PATH = 'file_path/proxy.txt'
```
The local file stores the proxy list in JSON style:
```python
# Proxy file content. The lists must conform to JSON format;
# otherwise a JSON load error will occur.
HTTP_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888'
]
HTTPS_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888'
]
```
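If you generate the file from code, serializing the lists with `json.dumps` keeps them JSON-valid. Below is a minimal sketch, assuming the file consists only of the two assignments shown above; the path and proxy values are placeholders:

```python
import json

# Placeholder proxy endpoints; replace with your own.
proxies = ['http://proxy0:8888', 'http://user:pass@proxy1:8888']

# Write the two assignments in the layout shown above, with the lists
# serialized as JSON so they load without errors.
with open('file_path/proxy.txt', 'w') as f:
    f.write('HTTP_PROXIES = %s\n' % json.dumps(proxies))
    f.write('HTTPS_PROXIES = %s\n' % json.dumps(proxies))
```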
Enable the scrapy-rotated-proxy middleware and supply the proxy IP list through MongoDB:
```python
# -----------------------------------------------------------------------------
# ROTATED PROXY SETTINGS (MongoDB Backend)
# -----------------------------------------------------------------------------
# Required MongoDB document fields: scheme, ip, port, username, password
# Document example: {'scheme': 'http', 'ip': '10.0.0.1', 'port': 8080,
#                    'username': 'user', 'password': 'password'}
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_rotated_proxy.downloadmiddlewares.proxy.RotatedProxyMiddleware': 750,
})
ROTATED_PROXY_ENABLED = True
PROXY_STORAGE = 'scrapy_rotated_proxy.extensions.mongodb_storage.MongoDBProxyStorage'
PROXY_MONGODB_HOST = HOST_OR_IP
PROXY_MONGODB_PORT = 27017
PROXY_MONGODB_USERNAME = USERNAME_OR_NONE
PROXY_MONGODB_PASSWORD = PASSWORD_OR_NONE
PROXY_MONGODB_AUTH_DB = 'admin'
PROXY_MONGODB_DB = 'vps_management'
PROXY_MONGODB_COLL = 'service'
```
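To seed the collection, insert documents with the fields listed above. A minimal sketch using pymongo; the connection URI and proxy values are placeholders that should match your `PROXY_MONGODB_*` settings:

```python
from pymongo import MongoClient

# Placeholder connection URI; match the PROXY_MONGODB_* settings above.
client = MongoClient('mongodb://user:password@127.0.0.1:27017/admin')
coll = client['vps_management']['service']

# Each document carries the required fields: scheme, ip, port,
# username, password (username/password may be None).
coll.insert_many([
    {'scheme': 'http', 'ip': '10.0.0.1', 'port': 8080,
     'username': 'user', 'password': 'password'},
    {'scheme': 'https', 'ip': '10.0.0.2', 'port': 8080,
     'username': None, 'password': None},
])
```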
By default, the spider closes once all proxies are used up. You can instead configure the spider to wait until blocked proxies (blocked via the signal described below) become valid again:
```python
# -----------------------------------------------------------------------------
# OTHER SETTINGS (Optional)
# -----------------------------------------------------------------------------
PROXY_SLEEP_INTERVAL = 60*60*24  # Default: 24 hours
PROXY_SPIDER_CLOSE_WHEN_NO_PROXY = False  # Default: True
```
- To remove a proxy so it is never used again in the spider, send a signal to `scrapy_rotated_proxy.signals.proxy_remove`. The signal must carry the arguments `spider`, `request`, and `exception`.
- To block a proxy so it can be used again after the sleep interval is reached, send a signal to `scrapy_rotated_proxy.signals.proxy_block`. The signal must carry the arguments `spider`, `response`, and `exception` (see the sketch below).
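For example, a custom downloader middleware can emit both signals when it detects a ban or a connection failure. This is an illustrative sketch: the class name and ban-detection rules (HTTP 403/429, refused connections) are assumptions, while the signal names and their required arguments follow the description above.

```python
from twisted.internet.error import ConnectionRefusedError
from scrapy_rotated_proxy import signals as proxy_signals

class ProxyBanMiddleware:
    def process_response(self, request, response, spider):
        if response.status in (403, 429):
            # Temporarily block the proxy behind this response; it is
            # retried once PROXY_SLEEP_INTERVAL has elapsed.
            spider.crawler.signals.send_catch_log(
                signal=proxy_signals.proxy_block,
                spider=spider, response=response, exception=None)
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, ConnectionRefusedError):
            # Permanently remove a proxy that refuses connections.
            spider.crawler.signals.send_catch_log(
                signal=proxy_signals.proxy_remove,
                spider=spider, request=request, exception=exception)
```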
| Setting | Description | Default |
|---------|-------------|---------|
| ROTATED_PROXY_ENABLED | Whether to enable Scrapy-Rotated-Proxy | True |
| PROXY_STORAGE | Class that implements the proxy storage backend | FileProxyStorage |
| PROXY_MONGODB_HOST | MongoDB host for the MongoDB backend | '127.0.0.1' |
| PROXY_MONGODB_PORT | MongoDB port for the MongoDB backend | 27017 |
| PROXY_MONGODB_USERNAME | MongoDB username for the MongoDB backend | None |
| PROXY_MONGODB_PASSWORD | MongoDB password for the MongoDB backend | None |
| PROXY_MONGODB_DB | MongoDB database name for the MongoDB backend | proxy_management |
| PROXY_MONGODB_COLL | MongoDB collection name for the MongoDB backend | proxy |
| PROXY_MONGODB_OPTIONS_* | MongoDB URI options for the MongoDB backend | |
| PROXY_FILE_PATH | Path of the file that stores proxies; None means proxies are read from Spider Settings | None |
| HTTP_PROXIES | List of HTTP proxies for the file backend or Spider Settings | |
| HTTPS_PROXIES | List of HTTPS proxies for the file backend or Spider Settings | |
| PROXY_SLEEP_INTERVAL | Seconds a blocked proxy sleeps before it becomes available again | 60*60*24 |
| PROXY_SPIDER_CLOSE_WHEN_NO_PROXY | Whether to close the spider when all proxies are used up | True |
| PROXY_RELOAD_ENABLED | Whether to reload proxies from storage once all have been used, before cycling through them again | False |