A project for extracting and analyzing users' comments on cars.
This repo includes the following six parts:
- Web crawler
- Text classification
- Key-phrase extraction
- Sentiment analysis
- Web application integrating the three functions above
- Word-vector remapping to sentiment-aware embeddings
The following sections introduce each part in turn; each section includes its own requirements, usage, and notes.
Web crawler for different car-comment websites. Everything lives in `crawler/crawler.py`; for detailed comments, see the code itself.
Requirements: `requests`, `bs4`, `MySQLdb`
Simply run:

```bash
python crawler.py
```
It first collects all URLs corresponding to car series and stores them in `url.json`, then requests them one by one. The crawled data is stored in MySQL.
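For orientation, here is a minimal sketch of that workflow, assuming the `requests`, `bs4`, and `MySQLdb` packages listed above. The index URL, CSS selectors, and table schema below are hypothetical placeholders; the actual logic is in `crawler/crawler.py`.

```python
import json

import requests
import MySQLdb
from bs4 import BeautifulSoup

INDEX_URL = "http://example.com/car-series"  # hypothetical index page

# Step 1: collect the URL of every car series and persist them to url.json.
resp = requests.get(INDEX_URL)
soup = BeautifulSoup(resp.text, "html.parser")
series_urls = [a["href"] for a in soup.select("a.series-link")]  # hypothetical selector
with open("url.json", "w") as f:
    json.dump(series_urls, f)

# Step 2: request each series page and store the comments into MySQL.
db = MySQLdb.connect(host="localhost", user="root", passwd="",
                     db="car", charset="utf8")
cur = db.cursor()
for url in series_urls:
    page = BeautifulSoup(requests.get(url).text, "html.parser")
    for node in page.select("div.comment"):  # hypothetical selector
        cur.execute("INSERT INTO comments (url, text) VALUES (%s, %s)",
                    (url, node.get_text(strip=True)))  # hypothetical schema
db.commit()
db.close()
```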
Car structure: brand (e.g., 宝马, BMW) -> series (车系) -> spec. Websites are structured differently and some do not distinguish between specs, so only the series level is handled currently.
I do not use Scrapy or another well-structured framework, so the crawler is currently NOT resumable: visited URLs are not recorded, and a run may fail on unexpected problems.
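If resumability is ever needed, one simple option (a sketch only, not part of the current code) is to persist the set of visited URLs after each page. The progress-file name and the `fetch_and_store` step below are assumptions:

```python
import json
import os

import requests

VISITED_FILE = "visited.json"  # hypothetical progress file, not used by crawler.py

def load_visited():
    """Return the set of URLs already crawled in previous runs."""
    if os.path.exists(VISITED_FILE):
        with open(VISITED_FILE) as f:
            return set(json.load(f))
    return set()

def fetch_and_store(url):
    requests.get(url)  # placeholder: real code would parse and write to MySQL

def crawl_all(urls):
    visited = load_visited()
    for url in urls:
        if url in visited:
            continue  # skip work done in a previous run
        fetch_and_store(url)
        visited.add(url)
        with open(VISITED_FILE, "w") as f:
            json.dump(sorted(visited), f)  # persist progress after every page
```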
Sentence-level text classification is done in two ways: traditional machine-learning techniques (in `text_clf/`) and a CNN (in `sent-conv-torch/`).
It is a 12-class classification problem. Ten classes are meaningful aspects (like space and appearance); a class named neutral means the sentence does not relate to any actual aspect of a car comment; the last class, other, means the sentence says something about the car but does not fit our classification taxonomy (like noise information).
Currently, the 12-class classification reaches 87% F1 using the CNN.
We also try another task: given a whole passage, split it into sentences of different classes. This is more difficult because some sentences are not indicative enough on their own and require context to determine their class.
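As a rough illustration of the traditional-ML route and the passage task, here is a minimal sketch assuming jieba for segmentation and a TF-IDF + linear-SVM pipeline. The real models and data handling live in `text_clf/`; the pipeline and sentence splitter below are illustrative, not the exact code used:

```python
import re

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train(sentences, labels):
    """sentences: list of str; labels: one of the 12 class names each."""
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=jieba.lcut)),  # jieba segments Chinese text
        ("svm", LinearSVC()),
    ])
    clf.fit(sentences, labels)
    return clf

def predict_passage(clf, passage):
    """Split a passage on sentence-ending punctuation and classify each
    sentence independently. Context is ignored, which is one reason the
    passage task is harder than single-sentence classification."""
    sents = [s for s in re.split("[。！？!?]", passage) if s.strip()]
    return list(zip(sents, clf.predict(sents)))
```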
Requirements:
- Traditional machine learning (Python): xlrd, xlwt, cPickle, h5py, MySQLdb, jieba, numpy, scipy, sklearn, nltk, matplotlib
- CNN (Lua): torch7, hdf5
- `text_clf/car_preprocessing_tool.py`: utilities for manipulating the crawled data, plus the basic training models and testing tools.
- `text_clf/car_train.py`: automated training over multiple experiments, using matplotlib to plot the results.
- `text_clf/evalute.py`: ways to evaluate the model; the passage-prediction function is also included.
- `text_clf/test_passage2.py`: test set for passage prediction; data collected from xcar.
- `sent-conv-torch/`: CNN code in Lua, copied from another repo. The folder in this repo is broken :(
It collects car comments and extracts key-phrases from sentences (like 空间大 "spacious" and 加速给力 "great acceleration"). We use rules over syntactic patterns: e.g., if we find a POS-tag pattern of "n + adj", we extract that span as a key-phrase.
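Here is a minimal sketch of that "n + adj" rule using jieba's POS tagger; the rule variants actually used are in `Main.py`, so this only illustrates the basic pattern match:

```python
import jieba.posseg as pseg

def extract_keyphrases(sentence):
    """Extract noun+adjective bigrams such as 空间大 ("spacious")."""
    words = list(pseg.cut(sentence))  # yields (word, POS flag) pairs
    phrases = []
    for cur, nxt in zip(words, words[1:]):
        # jieba flags nouns with "n..." and adjectives with "a..."
        if cur.flag.startswith("n") and nxt.flag.startswith("a"):
            phrases.append(cur.word + nxt.word)
    return phrases

print(extract_keyphrases("这款车空间大，加速给力"))  # expected to include 空间大
```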
This approach reaches only 65% F1.
We also tried several other rule variants (four in total) to improve on this; see `Main.py` for details.
Requirements: `xlrd`, `xlwt`, `jieba`
Usage: TODO
Requirements: Node.js
To run in the foreground:

```bash
node carinfo.js   # start
```

To run it in the background, we need the Node.js package `forever`:

```bash
forever start carinfo.js   # start
forever stop carinfo.js    # stop
```