- Crawl article URLs with the NAVER Open API
- Crawl article titles and content with bs4 + selenium
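A minimal sketch of this crawling step, assuming a registered NAVER Open API application (the client ID/secret are placeholders) and a CSS selector for the article body that may need adjusting per page layout; selenium can be swapped in for pages that render their content with JavaScript:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder credentials: register an application at https://developers.naver.com
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"

def search_news_urls(query, display=10):
    """Collect article URLs from the NAVER Open API news search endpoint."""
    resp = requests.get(
        "https://openapi.naver.com/v1/search/news.json",
        params={"query": query, "display": display, "sort": "date"},
        headers={"X-Naver-Client-Id": CLIENT_ID,
                 "X-Naver-Client-Secret": CLIENT_SECRET},
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json()["items"]]

def crawl_article(url):
    """Fetch one article page and extract title/content with bs4.
    The body selector below is an assumption and depends on the page layout."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = soup.select_one("#dic_area")  # assumed selector for the article body
    content = body.get_text(" ", strip=True) if body else ""
    return title, content

for url in search_news_urls("인공지능", display=5):
    print(crawl_article(url)[0])
```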
Python 3.6
$conda create -n text_analysis python=3.6 anaconda
Install KoNLPy
$pip install konlpy
Install MeCab (optional)
$bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
Install Gensim for topic modeling
$pip install gensim
$conda install -c conda-forge gensim
KoNLPy Webpage: https://konlpy.org/en/latest/install/#id1
- Compare Kkma, Komoran, Hannanum, Mecab, and Okt (Twitter morphological analyzer)
-> "Mecab" does a good job analyzing poorly spaced sentences. On the other hand, "Hannanum" barely handles them.
-> Misspellings make a big difference in stemming performance.
-> If the text contains many typos, "Komoran" is a better choice.
-> In the speed comparison, Kkma is the slowest and Mecab is significantly faster; Komoran and Hannanum are about the same speed.
-> Speed (slowest to fastest): Kkma < Komoran < Okt < Hannanum < Mecab
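A small sketch of how the analyzers can be compared side by side with KoNLPy (the sample sentence is only an illustration; Mecab is included only if the optional install above succeeded):

```python
from konlpy.tag import Hannanum, Kkma, Komoran, Okt

# A deliberately poorly spaced sentence ("Father walks into the room" / "Father enters the bag")
sample = "아버지가방에들어가신다"

taggers = {
    "Hannanum": Hannanum(),
    "Kkma": Kkma(),
    "Komoran": Komoran(),
    "Okt": Okt(),
}

try:
    from konlpy.tag import Mecab  # only available if the optional MeCab install was done
    taggers["Mecab"] = Mecab()
except Exception:
    pass

for name, tagger in taggers.items():
    print(name, tagger.pos(sample))  # (morpheme, POS tag) pairs
```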
- The process of finding topics (keywords) in an article
- The process of finding a "set of k words" among the combinations of words that make up a document (sentence).
- It is a Bayesian probability model, and the result of topic modeling is the probability that each word belongs to each topic.
- LSI (Latent Semantic Indexing)
  - Applies SVD to the document-word matrix.
  - The low-dimensional vectors produced by SVD represent the semantics.
- pLSA (Probabilistic Latent Semantic Analysis)
  - Keeps the document-word matrix in place.
  - Represents documents by the probability of word occurrence rather than by raw frequency per document.
- LDA (Latent Dirichlet Allocation)
  - Utilizes the Dirichlet distribution.
- In this project, LDA is used for topic modeling.
- Model assumptions in LDA
  - Each document can contain multiple topics.
  - Each topic can contain multiple words.
  - Every word that appears in a document is necessarily contained in some topic.
  - The process of human writing is defined as a generative model.
- LDA model process
  1. Select the topics to be used in the document (K topics).
  2. Pick one of the K topics.
  3. Pick one of the words in that topic.
  4. Add the word to the document (write it).
  5. Repeat from step 2.
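A minimal Gensim LDA sketch following the process described above, assuming the crawled article texts are available as plain strings (the example documents and K below are placeholders, and noun-only tokenization with Okt is just one preprocessing choice):

```python
from konlpy.tag import Okt
from gensim import corpora, models

okt = Okt()
docs = ["첫 번째 기사 본문 ...", "두 번째 기사 본문 ..."]  # placeholder: crawled article texts

# Tokenize each document into nouns (one simple preprocessing choice)
texts = [okt.nouns(doc) for doc in docs]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA with K topics
K = 5
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
                      passes=10, random_state=42)

# Each topic is a probability distribution over words
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```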
- url : https://github.com/e9t/nsmc
- size : 19MB
- Data source: Naver
- Each review (rating text) is no more than 140 characters long
- Total 200,000 reviews (sampled from 640,000 collected)
- ratings_train.txt: 150,000 reviews, ratings_test.txt: 50,000 reviews
- Positive and negative reviews are sampled equally (i.e., a random guess yields 50% accuracy)
- Does not include neutral reviews
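A small loading sketch, assuming the standard NSMC file layout (tab-separated columns id, document, label):

```python
import pandas as pd

# NSMC files are tab-separated: id, document (review text), label (0 = negative, 1 = positive)
train = pd.read_csv("ratings_train.txt", sep="\t", keep_default_na=False)
test = pd.read_csv("ratings_test.txt", sep="\t", keep_default_na=False)

print(train.shape, test.shape)        # expected: (150000, 3) (50000, 3)
print(train["label"].value_counts())  # classes are sampled equally
```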
- Reference: https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html
- Natural language processing, text analysis, computational linguistics, and biometrics are used to identify the author's intention or information hidden in the text.
- It is also called opinion mining, sentiment mining, or subjectivity analysis.
- Early methods mostly tried to determine the polarity of a text; a typical example is classifying it as positive or negative.
- Sentiment analysis is largely divided into knowledge-based and machine-learning-based approaches.
  - Knowledge-based approaches rely on resources already evaluated by human experts, such as known phrases, endings, and idiomatic expressions.
  - Machine-learning-based approaches include supervised and unsupervised methods. Recently, as pretrained language models have developed by leaps and bounds, the performance of unsupervised methods has improved, but supervised methods are still superior in terms of performance.
- Experimental results (test accuracy)

| Model | Linear classifier | SVM classifier |
|---|---|---|
| Nouns only | 0.51 | 0.53 |
| No preprocessing | 0.67 | 0.72 |
| Preprocessing | 0.71 | 0.76 |
| TfidfVectorizer | 0.71 | 0.81 |
| All features | 0.77 | 0.82 |

- Conclusion
  - Extracting only nouns does not provide enough information for the NSMC classification problem.
  - There is a difference in performance with and without preprocessing. Since NSMC data contains many special characters, removing them makes the tokens more canonical (text normalization).
  - Using all feature dimensions performs better than truncating to the top 500 highest-frequency features.
  - CountVectorizer vs. TfidfVectorizer: the accuracy of the CountVectorizer was slightly higher than that of the TfidfVectorizer.
  - Tuning hyperparameters could yield better performance.
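A hedged sketch of the bag-of-words baselines compared above, assuming Okt morpheme tokenization and using LogisticRegression as the linear classifier and LinearSVC as the SVM classifier; the exact preprocessing and hyperparameters behind the reported numbers are not shown here:

```python
import pandas as pd
from konlpy.tag import Okt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

train = pd.read_csv("ratings_train.txt", sep="\t", keep_default_na=False)
test = pd.read_csv("ratings_test.txt", sep="\t", keep_default_na=False)

okt = Okt()

def tokenize(text):
    return okt.morphs(text)  # morpheme tokens; okt.nouns(text) would be the weaker "nouns only" variant

for Vec in (CountVectorizer, TfidfVectorizer):
    vec = Vec(tokenizer=tokenize)  # no max_features cap, i.e. the "all features" setting
    X_train = vec.fit_transform(train["document"])
    X_test = vec.transform(test["document"])
    for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
        clf.fit(X_train, train["label"])
        pred = clf.predict(X_test)
        print(Vec.__name__, type(clf).__name__, accuracy_score(test["label"], pred))
```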
- LSTM: Long Short-Term Memory, a type of recurrent neural network (RNN) commonly used in natural language processing and other sequential-data tasks.
- Hyperparameters
- max_words = 35000
- max_len = 30
- batch_size = 128
- EPOCHS = 100
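A minimal Keras sketch of an LSTM classifier using the hyperparameters above; the whitespace tokenization, embedding size, LSTM width, and early stopping are assumptions, not the original model configuration:

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

max_words, max_len, batch_size, EPOCHS = 35000, 30, 128, 100

# NSMC training data: tab-separated id / document / label
train = pd.read_csv("ratings_train.txt", sep="\t", keep_default_na=False)
texts, labels = train["document"].astype(str), np.array(train["label"])

# Integer-encode the top max_words tokens and pad/truncate each review to max_len
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

model = Sequential([
    Embedding(max_words, 100),       # embedding size 100 is an assumption
    LSTM(128),                       # hidden size 128 is an assumption
    Dense(1, activation="sigmoid"),  # binary positive/negative output
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(X, labels, batch_size=batch_size, epochs=EPOCHS, validation_split=0.1,
          callbacks=[EarlyStopping(monitor="val_loss", patience=3)])

# model.evaluate(X_test, y_test) returns the test score (loss) and test accuracy
```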
- Experimental results (test accuracy)

| Model | LSTM |
|---|---|
| Test score | 1.062 |
| Test accuracy | 0.7995 |
- Conclusion
  - The final result of the deep learning model (LSTM) is similar in accuracy to the machine learning models.