This is a distributed web crawler built with Scrapy, Redis, and Selenium. It handles several kinds of sites, including static, AJAX, and dynamic pages. Because the setup is distributed via Docker Compose, the system can be deployed across multiple machines to increase crawling throughput.
Note:
- ddroom -> AJAX
- housefun -> dynamic (with Selenium)
- rakuya -> static
The architecture features a central queue managed by Redis, which distributes tasks to multiple Scrapy crawlers. The crawlers process the tasks and store the collected data in MongoDB.
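This queue-sharing pattern is what scrapy-redis provides. As a rough sketch only (not necessarily this repo's exact settings, and the Redis hostname is an assumption), the wiring in settings.py typically looks like this:

```python
# settings.py — minimal scrapy-redis sketch, not this repo's exact configuration.

# Schedule requests through a shared Redis queue so multiple
# crawler instances can pull from the same task list.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all instances with a shared
# fingerprint set stored in Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis when a spider closes, so other
# instances (or a restart) can keep draining it.
SCHEDULER_PERSIST = True

# Assumed address: "redis" is a typical service name in docker-compose;
# use "redis://localhost:6379" for a local Redis instead.
REDIS_URL = "redis://redis:6379"
```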
There is no MongoDB container in the docker-compose file, so you need to provide MongoDB yourself:
- Set up MongoDB locally, or modify the docker-compose file to add it (a sketch of the latter appears at the end of the Docker Compose section).
- Adjust the environment variables so the project can find your MongoDB database.
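For the local case, this usually means exporting a connection string before running the crawler. The variable names below are assumptions, not confirmed from this repo; check the project's settings.py or docker-compose file for the names it actually reads:

```sh
# Hypothetical variable names — confirm against the project's settings.
export MONGO_URI="mongodb://localhost:27017"
export MONGO_DATABASE="rent_crawler"
```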
There are two ways to set up the project.
- Local setup
  - Install the dependencies:
    pip install -r requirements.txt
  - Push a start URL to Redis (see the sketch after this list).
  - Run one of the spiders:
    scrapy crawl [ddroom/housefun/rakuya]
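By default, scrapy-redis spiders read their start URLs from a Redis list named `<spider>:start_urls`. Assuming this project keeps that default (and using a placeholder URL), pushing a task looks like this:

```sh
# Placeholder URL — replace with a real listing page.
# The key name assumes scrapy-redis's default "<spider>:start_urls" convention.
redis-cli lpush rakuya:start_urls "https://example.com/rent/listing"
```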
- Docker Compose setup
  - Build the Docker image:
    docker build -t scrapy_rent_crawler .
  - Start the services (drop the -d flag if you want to watch the logs for debugging):
    docker compose up -d
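If you go the compose route instead of a local MongoDB, one option is to add a mongo service and point the crawler at it. A minimal sketch, assuming the crawler reads a MONGO_URI variable (a hypothetical name — match it to whatever the project actually reads):

```yaml
# Sketch of a docker-compose addition — service and variable names are assumptions.
services:
  mongo:
    image: mongo:6
    ports:
      - "27017:27017"
  crawler:
    image: scrapy_rent_crawler
    environment:
      # Hypothetical variable; "mongo" resolves to the service defined above.
      - MONGO_URI=mongodb://mongo:27017
    depends_on:
      - mongo
```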