Codebase to crawl data from major Vietnamese websites
News websites:
- Thanhnien
- vnexpress
Forums:
- VnZ
- Otofun
- Kenhsinhvien
- Hocmai
Install the Python dependencies:
pip install -r requirements.txt
Required services:
- ElasticSearch + Kibana
- MongoDB
Install Docker and Docker Compose. Note: edit the following volume mapping in docker-compose.yml so Elasticsearch data is persisted to a local directory (adjust the host path to your machine):
volumes:
- ./esdata:/home/lap15363/elasticsearch/data
To start the ElasticSearch + Kibana services:
docker-compose up -d
Wait about a minute for the services to start (default host: localhost). The services are exposed on the following ports:
- ElasticSearch: 9200
- Kibana: 5601
Spiders contain the implementation of how we crawl each site.
You can use a SitemapSpider to parse links from a site's sitemap when one is available; see the example in crawler/spiders/thanhnien.py.
Other spider types are described in the Scrapy documentation.
For a custom spider that builds its own requests, see crawler/spiders/tv4u.py.
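For orientation, here is a minimal SitemapSpider sketch; the domain, sitemap URL, CSS selectors, and field names are illustrative assumptions and are not taken from crawler/spiders/thanhnien.py.

import scrapy
from scrapy.spiders import SitemapSpider

class ExampleNewsSpider(SitemapSpider):
    name = "example_news"
    allowed_domains = ["example.com"]                    # hypothetical domain
    sitemap_urls = ["https://example.com/sitemap.xml"]   # hypothetical sitemap
    # Only follow article URLs and send them to parse_article
    sitemap_rules = [(r"/article/", "parse_article")]

    def parse_article(self, response):
        # Extract a few basic fields with CSS selectors (selectors are assumptions)
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "content": " ".join(response.css("div.article-body p::text").getall()),
        }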
Items define the schema of the data we collect on each crawl.
Refer to the Items implemented in crawler/items.py.
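A minimal Item sketch; the field names here are assumptions, the real fields are defined in crawler/items.py.

import scrapy

class ArticleItem(scrapy.Item):
    # Hypothetical fields for a crawled article
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    published_at = scrapy.Field()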
Exporters are where we implement how items are exported: connecting to the database and writing crawled items into it.
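A minimal Elasticsearch exporter sketch using the official elasticsearch Python client; the class name, method names, and index name are assumptions, not the project's actual exporter.

from elasticsearch import Elasticsearch

class ESExporter:
    def __init__(self, hosts, index="articles"):
        # `hosts` follows the ELASTIC_HOSTS format shown in settings.py,
        # e.g. [{'host': 'localhost', 'port': 9200, 'scheme': 'http'}]
        self.es = Elasticsearch(hosts=hosts)
        self.index = index

    def export_item(self, item):
        # Index one crawled item as a document
        self.es.index(index=self.index, document=dict(item))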
Middlewares sit between the spiders and the target site. Here we can add middleware such as:
- Random User-Agent: randomly select a user-agent for each request sent to the site.
- Proxy Middleware: randomly select a proxy for each request.
- Retry Middleware: control how requests are retried after a failed connection.
Instead of writing these middlewares yourself, you can also use existing libraries (e.g. rotating_free_proxies, referenced in the settings below).
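A minimal random User-Agent downloader middleware sketch; the class name and user-agent list are assumptions, the project's real middlewares live in crawler/middlewares.py (e.g. CrawlerAgentMiddleware in the settings below).

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # continue processing the request normally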
Pipelines define how items are processed after they are crawled and what happens to them next. Here we combine Exporters and Items to build the Pipelines.
Refer to ESPipeline in crawler/pipelines.py
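A minimal Elasticsearch pipeline sketch in the spirit of ESPipeline; it reads the ELASTIC_HOSTS setting described below, but the index name and internals are assumptions rather than the project's actual implementation.

from elasticsearch import Elasticsearch

class ExampleESPipeline:
    def __init__(self, hosts):
        self.hosts = hosts

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the Elasticsearch hosts from settings.py (ELASTIC_HOSTS)
        return cls(hosts=crawler.settings.get("ELASTIC_HOSTS"))

    def open_spider(self, spider):
        self.es = Elasticsearch(hosts=self.hosts)

    def process_item(self, item, spider):
        # Index every crawled item; using the spider name as index is a placeholder
        self.es.index(index=spider.name, document=dict(item))
        return item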
The settings.py file is where we configure all of the components above.
Some important configurations:
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Maximum time allowed for a request (seconds)
DOWNLOAD_TIMEOUT = 10
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False
# Number of retry times if a request failed
RETRY_TIMES = 10
Enable middlewares: assign each middleware an order number; setting it to None disables that middleware.
DOWNLOADER_MIDDLEWARES = {
"crawler.middlewares.CrawlerAgentMiddleware": 100,
# 'rotating_free_proxies.middlewares.RotatingProxyMiddleware': 200,
# 'rotating_free_proxies.middlewares.BanDetectionMiddleware': 300,
# "crawler.middlewares.CrawlerProxyMiddleware": 200,
# "crawler.middlewares.CrawlerRetryMiddleware": 300,
# 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
Pipelines:
ITEM_PIPELINES = {
"crawler.pipelines.CrawlerPipeline": 100,
"crawler.pipelines.ESPipeline": 200,
}
Logging:
LOG_LEVEL = 'INFO' # DEBUG for debug mode, ERROR to show only errors
LOG_FORMAT = '%(levelname)s: %(message)s'
LOG_FILE = 'crawl.log' # Logging filename
ElasticSearch config (used in the ES Exporters):
ELASTIC_HOSTS = [
{'host': 'localhost', 'port': 9200, "scheme": "http"},
]
To start crawling, we use:
cd crawler/crawler
scrapy crawl <spider name> --set JOBDIR=<job name>
Ex:
scrapy crawl thanhnien --set JOBDIR=thanhnien
To pause crawling, press Ctrl + C once; to resume, run the same command again. Scrapy automatically saves the crawl state in the JOBDIR folder, so URLs that have not yet been crawled are requeued on restart.