weibo-spider

Introduction

This is a crawler for the mobile site of Sina Weibo, one of the most popular social media platforms in mainland China. The crawled data is cleaned and organized so that word-cloud figures can be generated from it.

Code Structure

scrapy startproject [yourproject] creates a new scrapy project.

scrapy.cfg is the configuration file for the project.

settings.py sets the request parameters, the proxy configuration, and how the crawled data is saved.
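As a rough illustration of the kind of values such a settings file holds, here is a minimal sketch. The project name `weibo_spider`, the class paths, and the `PROXIES` and `MONGO_*` keys are assumptions made for this example and are not taken from this repository; the other names are standard scrapy settings.

```python
# Illustrative settings sketch; values and module paths are assumptions.

BOT_NAME = "weibo_spider"  # hypothetical project name

# Request parameters
DOWNLOAD_DELAY = 2        # seconds to wait between consecutive requests
CONCURRENT_REQUESTS = 8   # upper bound on parallel requests
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Custom key (not a built-in scrapy setting): proxies a middleware could rotate
PROXIES = ["http://127.0.0.1:8080"]  # placeholder address

# Enable a custom downloader middleware (hypothetical class path)
DOWNLOADER_MIDDLEWARES = {
    "weibo_spider.middlewares.RotatingIdentityMiddleware": 400,
}

# Route extracted items through a pipeline (hypothetical class path)
ITEM_PIPELINES = {
    "weibo_spider.pipelines.MongoPipeline": 300,
}

# Custom keys (not built-in scrapy settings) that a pipeline could read
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "weibo"
```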

/spider/sinaSpider.py is the main code of the crawler.

middlewares.py holds the middleware that processes scrapy's requests. It mainly rotates User-Agent strings, cookies, and proxies.
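The rotation described above could look roughly like the following downloader middleware. This is a sketch under assumptions, not the repository's code: the class name, the User-Agent strings, and the proxy address are all hypothetical. It relies only on scrapy's documented `process_request(request, spider)` hook and on the `proxy` key in `request.meta`, which scrapy's built-in HttpProxyMiddleware reads.

```python
import random

# Hypothetical pools; a real project would load longer lists.
USER_AGENTS = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36",
]
PROXIES = ["http://127.0.0.1:8080"]  # placeholder address


class RotatingIdentityMiddleware:
    """Downloader middleware that varies the identity of each request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Pick a fresh User-Agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            # The built-in HttpProxyMiddleware routes the request via this key.
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None tells scrapy to keep processing the request normally.
        return None
```

Cookies could be rotated in the same hook by assigning a randomly chosen cookie dict to `request.cookies`.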

items.py defines the data structures to be extracted.
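To make the idea of an item definition concrete, here is a small sketch. It uses a plain dataclass, which recent scrapy versions accept as an item type, rather than `scrapy.Item`, which this project may well use instead. The class name and every field name are assumptions chosen for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class WeiboPostItem:
    """One crawled post; all field names here are hypothetical."""

    post_id: str
    user_id: str
    content: str = ""
    created_at: Optional[str] = None  # raw timestamp string as shown on the page
    like_count: int = 0
    repost_count: int = 0
    comment_count: int = 0


# Usage: build an item and turn it into a plain dict for storage.
example = WeiboPostItem(post_id="1", user_id="42", content="hello")
example_record = asdict(example)
```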

pipelines.py further processes the extracted items; the connection to MongoDB is also set up here.
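A pipeline of this kind might be sketched as follows. The class name, the `MONGO_URI` and `MONGO_DATABASE` setting keys, the collection name, and the `content` field are assumptions for illustration. The `open_spider`, `close_spider`, `process_item`, and `from_crawler` hooks are the ones scrapy documents for item pipelines, and `pymongo` is imported lazily so the class can be loaded without it.

```python
from dataclasses import asdict, is_dataclass


class MongoPipeline:
    """Cleans each item and stores it in a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 mongo_db="weibo", collection_name="posts"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the project settings (hypothetical keys).
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "weibo"),
        )

    def open_spider(self, spider):
        import pymongo  # lazy import: only needed once a crawl actually starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Accept dataclass items as well as dict-like items.
        record = asdict(item) if is_dataclass(item) else dict(item)
        if isinstance(record.get("content"), str):
            record["content"] = record["content"].strip()  # basic cleaning step
        self.collection.insert_one(record)
        return item  # pass the item on to any later pipeline
```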

Libraries

scrapy is an application framework for crawling website data and extracting structured data. It is a very powerful and easy-to-use crawler framework that not only provides some basic components out of the box, but also provides powerful customization capabilities.

selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as real users do. We use selenium mainly to simulate a user logging in to Weibo and to collect the resulting cookies.
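The cookie hand-off mentioned above can be sketched like this. The helper name is hypothetical, and the login steps themselves are left out because they depend on the page's form elements; the sketch only assumes a WebDriver instance that has already logged in. `get_cookies()` is the standard WebDriver method and returns a list of dicts with `name` and `value` keys.

```python
def cookies_from_driver(driver):
    """Convert Selenium's cookie list into a plain {name: value} dict.

    `driver` is assumed to be a Selenium WebDriver that has already
    completed the login flow in the browser.
    """
    return {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}
```

The resulting dict can then be passed to a request, for example through the `cookies` argument of `scrapy.Request`.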

PhantomJS is a headless, scriptable WebKit browser. It natively supports several web standards: DOM manipulation, CSS selectors, JSON, Canvas, etc.

Reference

web_scraping_with_python