- Crawl article URLs with the NAVER Open API
- Crawl article titles and content with bs4 + selenium
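A minimal sketch of this crawling step, assuming a registered NAVER Open API application (the client ID/secret are placeholders) and a CSS selector for the article body that may need adjusting per page layout; selenium can be swapped in for pages that render their content with JavaScript:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder credentials: register an application at https://developers.naver.com
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"

def search_news_urls(query, display=10):
    """Collect article URLs from the NAVER Open API news search endpoint."""
    resp = requests.get(
        "https://openapi.naver.com/v1/search/news.json",
        params={"query": query, "display": display, "sort": "date"},
        headers={"X-Naver-Client-Id": CLIENT_ID,
                 "X-Naver-Client-Secret": CLIENT_SECRET},
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json()["items"]]

def crawl_article(url):
    """Fetch one article page and extract title/content with bs4.
    The body selector below is an assumption and depends on the page layout."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = soup.select_one("#dic_area")  # assumed selector for the article body
    content = body.get_text(" ", strip=True) if body else ""
    return title, content

for url in search_news_urls("인공지능", display=5):
    print(crawl_article(url)[0])
```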
Python 3.6
$conda create -n text_analysis python=3.6 anaconda
Install KoNLPy
$pip install konlpy
Install MeCab (optional)
$bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
Install Gensim for topic modeling
$pip install gensim
$conda install -c conda-forge gensim
KoNLPy Webpage: https://konlpy.org/en/latest/install/#id1
- Compare Kkma, Komoran, Hannanum, Mecab, and Okt (Twitter morphological analyzer)
-> "Mecab" does a good job analyzing poorly spaced sentences. On the other hand, "Hannanum" barely handles them.
-> Misspellings make a big difference in stemming performance.
-> If the text contains many typos, "Komoran" is a better choice.
-> In the speed comparison, Kkma is the slowest and Mecab is significantly faster; Komoran and Hannanum are about the same speed.
-> Speed (slowest to fastest): Kkma < Komoran < Okt < Hannanum < Mecab
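A small sketch of how the analyzers can be compared side by side with KoNLPy (the sample sentence is only an illustration; Mecab is included only if the optional install above succeeded):

```python
from konlpy.tag import Hannanum, Kkma, Komoran, Okt

# A deliberately poorly spaced sentence ("Father walks into the room" / "Father enters the bag")
sample = "아버지가방에들어가신다"

taggers = {
    "Hannanum": Hannanum(),
    "Kkma": Kkma(),
    "Komoran": Komoran(),
    "Okt": Okt(),
}

try:
    from konlpy.tag import Mecab  # only available if the optional MeCab install was done
    taggers["Mecab"] = Mecab()
except Exception:
    pass

for name, tagger in taggers.items():
    print(name, tagger.pos(sample))  # (morpheme, POS tag) pairs
```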
- The process of finding topics (keywords) in an article
- The process of finding a "set of k words" among the combinations of words that make up a document (sentence).
- It is a Bayesian probability model, and the result of topic modeling is the probability that each word belongs to each topic.
- LSI (Latent Semantic Indexing)
  - Applies SVD to the document-word matrix.
  - The low-dimensional vectors produced by SVD represent the semantics.
- pLSA (Probabilistic Latent Semantic Analysis)
  - Keeps the document-word matrix in place.
  - Represents documents by the probability of word occurrence rather than by raw frequency per document.
- LDA (Latent Dirichlet Allocation)
  - Utilizes the Dirichlet distribution.
- In this project, LDA is used for topic modeling.
- Model assumptions in LDA
  - Each document can contain multiple topics.
  - Each topic can contain multiple words.
  - Every word that appears in a document is necessarily contained in some topic.
  - The process of human writing is defined as a generative model.
- LDA model process
  1. Select the topics to be used in the document (K topics).
  2. Pick one of the K topics.
  3. Pick one of the words in that topic.
  4. Add the word to the document (write it).
  5. Repeat from step 2.
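A minimal Gensim LDA sketch following the process described above, assuming the crawled article texts are available as plain strings (the example documents and K below are placeholders, and noun-only tokenization with Okt is just one preprocessing choice):

```python
from konlpy.tag import Okt
from gensim import corpora, models

okt = Okt()
docs = ["첫 번째 기사 본문 ...", "두 번째 기사 본문 ..."]  # placeholder: crawled article texts

# Tokenize each document into nouns (one simple preprocessing choice)
texts = [okt.nouns(doc) for doc in docs]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA with K topics
K = 5
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
                      passes=10, random_state=42)

# Each topic is a probability distribution over words
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```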
- url : https://github.com/e9t/nsmc
- size : 19MB
- Data source: Naver
- Each review (rating text) is no more than 140 characters long
- Total 200,000 reviews (sampled from 640,000 collected)
- ratings_train.txt: 150,000 reviews, ratings_test.txt: 50,000 reviews
- Positive and negative reviews are sampled equally (i.e., a random guess yields 50% accuracy)
- Does not include neutral reviews
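A small loading sketch, assuming the standard NSMC file layout (tab-separated columns id, document, label):

```python
import pandas as pd

# NSMC files are tab-separated: id, document (review text), label (0 = negative, 1 = positive)
train = pd.read_csv("ratings_train.txt", sep="\t", keep_default_na=False)
test = pd.read_csv("ratings_test.txt", sep="\t", keep_default_na=False)

print(train.shape, test.shape)        # expected: (150000, 3) (50000, 3)
print(train["label"].value_counts())  # classes are sampled equally
```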
- Reference: https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html
- Natural language processing, text analysis, computational linguistics, and biometrics are used to identify the author's intention or information hidden in the text.
- It is also called opinion mining, sentiment mining, or subjectivity analysis.
- Early methods mostly tried to determine the polarity of a text; a typical example is classifying it as positive or negative.
- Sentiment analysis is largely divided into knowledge-based and machine-learning-based approaches.
  - Knowledge-based approaches rely on resources already evaluated by human experts, such as known phrases, endings, and idiomatic expressions.
  - Machine-learning-based approaches include supervised and unsupervised methods. Recently, as pretrained language models have developed by leaps and bounds, the performance of unsupervised methods has improved, but supervised methods are still superior in terms of performance.
- Experimental results (test accuracy)

| Model | Linear classifier | SVM classifier |
|---|---|---|
| Nouns only | 0.51 | 0.53 |
| No preprocessing | 0.67 | 0.72 |
| Preprocessing | 0.71 | 0.76 |
| TfidfVectorizer | 0.71 | 0.81 |
| All features | 0.77 | 0.82 |

- Conclusion
  - Extracting only nouns does not provide enough information for the NSMC classification problem.
  - There is a difference in performance with and without preprocessing. Since NSMC data contains many special characters, removing them makes the tokens more canonical (text normalization).
  - Using all feature dimensions performs better than truncating to the top 500 highest-frequency features.
  - CountVectorizer vs. TfidfVectorizer: the accuracy of the CountVectorizer was slightly higher than that of the TfidfVectorizer.
  - Tuning hyperparameters could yield better performance.
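A hedged sketch of the bag-of-words baselines compared above, assuming Okt morpheme tokenization and using LogisticRegression as the linear classifier and LinearSVC as the SVM classifier; the exact preprocessing and hyperparameters behind the reported numbers are not shown here:

```python
import pandas as pd
from konlpy.tag import Okt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

train = pd.read_csv("ratings_train.txt", sep="\t", keep_default_na=False)
test = pd.read_csv("ratings_test.txt", sep="\t", keep_default_na=False)

okt = Okt()

def tokenize(text):
    return okt.morphs(text)  # morpheme tokens; okt.nouns(text) would be the weaker "nouns only" variant

for Vec in (CountVectorizer, TfidfVectorizer):
    vec = Vec(tokenizer=tokenize)  # no max_features cap, i.e. the "all features" setting
    X_train = vec.fit_transform(train["document"])
    X_test = vec.transform(test["document"])
    for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
        clf.fit(X_train, train["label"])
        pred = clf.predict(X_test)
        print(Vec.__name__, type(clf).__name__, accuracy_score(test["label"], pred))
```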
- LSTM: Long Short-Term Memory, a type of recurrent neural network (RNN) commonly used in natural language processing and other sequential-data tasks.
- Hyperparameters
- max_words = 35000
- max_len = 30
- batch_size = 128
- EPOCHS = 100
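A minimal Keras sketch of an LSTM classifier using the hyperparameters above; the whitespace tokenization, embedding size, LSTM width, and early stopping are assumptions, not the original model configuration:

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

max_words, max_len, batch_size, EPOCHS = 35000, 30, 128, 100

# NSMC training data: tab-separated id / document / label
train = pd.read_csv("ratings_train.txt", sep="\t", keep_default_na=False)
texts, labels = train["document"].astype(str), np.array(train["label"])

# Integer-encode the top max_words tokens and pad/truncate each review to max_len
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

model = Sequential([
    Embedding(max_words, 100),       # embedding size 100 is an assumption
    LSTM(128),                       # hidden size 128 is an assumption
    Dense(1, activation="sigmoid"),  # binary positive/negative output
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(X, labels, batch_size=batch_size, epochs=EPOCHS, validation_split=0.1,
          callbacks=[EarlyStopping(monitor="val_loss", patience=3)])

# model.evaluate(X_test, y_test) returns the test score (loss) and test accuracy
```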
- Experimental results (test accuracy)

| Model | LSTM |
|---|---|
| Test score | 1.062 |
| Test accuracy | 0.7995 |
- Conclusion
  - The final result of the deep learning model (LSTM) is similar in accuracy to the machine learning models.