keyword_based_Sina_weibo_crawler

A web crawler for Sina Weibo that searches for and retrieves microblogs containing certain keywords; a simple Python crawler exercise.

中文 (Chinese version)


This program was originally written to collect data for my master's thesis: 【Spatial-temporal Analysis of International Connections Based on Textual Social Media Data】

Information

  • Based on Python 2
  • Add your own email settings in email_info.py (see the sketch after this list)
  • Change the search keyword list at the beginning of sina_crawler.py
  • Data formatting and other helper functions are contained in function.py
  • The JSON data processing part will be explained here
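The email notifications read their settings from email_info.py. Below is only a minimal sketch of how such a status mail could be sent with Python's standard library; the server, account, and variable names are assumptions for illustration, not the project's actual configuration.

# Sketch only: send a plain-text status mail when a mission starts,
# ends, or meets a request failure. All values here are assumptions;
# the real settings live in email_info.py.
import smtplib
from email.mime.text import MIMEText

SMTP_SERVER = "smtp.example.com"   # assumption: your provider's SMTP host
SENDER = "me@example.com"          # assumption
PASSWORD = "app-password"          # assumption
RECEIVER = "me@example.com"        # assumption

def send_status_mail(subject, body):
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = subject
    msg["From"] = SENDER
    msg["To"] = RECEIVER
    server = smtplib.SMTP_SSL(SMTP_SERVER, 465)
    server.login(SENDER, PASSWORD)
    server.sendmail(SENDER, [RECEIVER], msg.as_string())
    server.quit()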

Introduction

  • No UI; it looks shabby, but this is my first crawler and I am happy it all works well
  • Could and should run 24/7 to get better results (better data quality)
  • You will receive an email when a mission starts, ends, or encounters a request failure
  • Writes the JSON data into txt files, organized by structured file and folder names
  • Check 用python处理微博JSON数据范例 (an example of processing Weibo JSON data with Python) for later handling of the txt files

File and folder names are based on the date: WBTestdata > 04-12.

Every page of JSON data contains 10 records (in most cases). Each JSON page is written into a txt file named with "keyword" + "date" + "page number".

You could use a JSON editor to check the data.
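For illustration, a minimal sketch of this storage scheme (the helper name save_page and the exact separators in the file name are my assumptions, not the project's function.py):

# Sketch only: write one page of JSON text under WBTestdata/<MM-DD>/
# with a "keyword + date + page number" file name.
import os
import time

def save_page(json_text, keyword, page):
    date_str = time.strftime("%m-%d")                # e.g. "04-12"
    folder = os.path.join("WBTestdata", date_str)    # WBTestdata > 04-12
    if not os.path.exists(folder):
        os.makedirs(folder)
    file_name = "%s_%s_page%d.txt" % (keyword, date_str, page)
    with open(os.path.join(folder, file_name), "w") as f:
        f.write(json_text)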


Background

Sina Weibo provides a normal search function on all its client platforms.

Theoretically, the page source contains all the content we see on a page; after downloading the HTML source we could parse it and extract the useful data.

However, most well-developed dynamic websites nowadays use AJAX techniques, which are not easy to "crawl".

The good news is that Sina still keeps the m.weibo.cn site for smartphone browser users.

The mobile site only implements a small part of the functionality of the PC web version, but the general search function is kept.

Much like the web version, searching on m.weibo.cn returns weibos from the most recent to older posts, 10 weibos per page; when the user scrolls down, new pages are loaded as JSON, which can be accessed directly by HTTP request.

For instance, when searching for Germany, the page looks like this:

Open the developer tools, check Network --> XHR, and scroll down until new feeds show up. Then you can see this link:

Click for a preview:

Yes, this is our data.

Check the format of this link:

https://m.weibo.cn/api/container/getIndex?type=all&queryVal=%E5%BE%B7%E5%9B%BD&featurecode=20000320&luicode=10000011&lfid=106003type%3D1&title=%E5%BE%B7%E5%9B%BD&containerid=100103type%3D1%26q%3D%E5%BE%B7%E5%9B%BD&page=2

Decode the URL; it is actually the same as:

https://m.weibo.cn/api/container/getIndex?type=all&queryVal=德国&featurecode=20000320&luicode=10000011&lfid=106003type%3D1&title=德国&containerid=100103type%3D1%26q%3D德国&page=1

So the key information is obvious: queryVal=德国 and page=1. Based on this pattern we can construct our own URLs to retrieve the data.
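For illustration, a minimal sketch of this idea (not the project's sina_crawler.py; urllib.parse is Python 3, while the project itself targets Python 2, which would use urllib.quote instead):

# Sketch only: build the m.weibo.cn search URL for a keyword and page,
# then fetch one page of JSON (about 10 weibo records) with requests.
import requests
from urllib.parse import quote

BASE = "https://m.weibo.cn/api/container/getIndex"

def build_search_url(keyword, page):
    q = quote(keyword)          # e.g. 德国 -> %E5%BE%B7%E5%9B%BD
    return (BASE + "?type=all&queryVal=" + q
            + "&featurecode=20000320&luicode=10000011&lfid=106003type%3D1"
            + "&title=" + q
            + "&containerid=100103type%3D1%26q%3D" + q
            + "&page=" + str(page))

headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(build_search_url("德国", 1), headers=headers)
data = resp.json()              # parsed JSON for this page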


Code structure

  • Import the requests library
import requests
  • Define the request header
# add header for the crawler
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
  • Add your own keyword list
# Add in your search list!
search_list = ["Germany", "Austria", "China",]
......
  • Encode Chinese keywords into URL format (if necessary)
# Create a URL-encoded search list based on the word list you have given
urlencoded_search_list = url_encoding(search_list)
urls = create_url_list(urlencoded_search_list)
  • Write a while 1: loop to perform the daily loop.

The "mission" for each day as I defined is: Start from page1 and continue to page+1, on the same time, get the creat time of last record in this page,if this record was created within two days, we continue to next page, if not, we stop, and move to next word.

The reason is that Sina does not seem to return all records strictly by time. Two requests made at the same time may return 10 records that differ by one or two weibos. And the older the posts, the wider the time span: page 1 may contain 10 posts from within 10 minutes, while page 100 may contain 10 posts spread over 3 hours.

  • Inside the while loop
    1. Send an email to myself when the program starts
    2. Create a folder according to the date
    3. Starting from page 1, request data until the last record on a page was posted more than two days ago
    4. After finishing the search for all keywords, send another email
    5. Print some information about the mission, and write the report into a log.txt
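Putting these steps together, here is a minimal sketch of the daily loop, not the project's actual sina_crawler.py. The helpers send_status_mail, build_search_url, and save_page are the assumed sketches shown earlier; last_record_time is a hypothetical parser, since the exact field path in Sina's JSON is not shown here; search_list and headers come from the snippets above.

# Sketch only: one "mission" per day, with the two-day cutoff per keyword.
import time
import datetime
import requests

def last_record_time(page_json):
    # Placeholder (assumption): extract the creation time of the last
    # weibo record on this page from Sina's JSON structure.
    raise NotImplementedError

def run_daily_mission(keywords, headers):
    send_status_mail("Mission start", time.ctime())
    cutoff = datetime.datetime.now() - datetime.timedelta(days=2)
    for word in keywords:
        page = 1
        while True:
            resp = requests.get(build_search_url(word, page), headers=headers)
            save_page(resp.text, word, page)
            if last_record_time(resp.json()) < cutoff:
                break            # last record older than two days: next keyword
            page += 1
    send_status_mail("Mission finished", time.ctime())

while 1:                         # the daily loop
    run_daily_mission(search_list, headers)
    time.sleep(24 * 60 * 60)     # sleep roughly one day before the next mission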

The comments in the code are quite detailed; please check them if you have any further questions.