
chaeyeon2367/dl-python-SentimentAnalysis


Sentiment Analysis with NSMC and LSTM 🫥


1. Crawling - get NAVER News data

  • Crawl article URLs with the NAVER Open API
  • Crawl article titles and content with bs4 + selenium
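A minimal sketch of the URL-collection step, assuming the NAVER news search endpoint and its client ID/secret headers. The function names, the placeholder credentials, and the sample response are illustrative and are not taken from this repository. The request is only built, not sent, so the sketch runs offline.

```python
import json
import urllib.parse
import urllib.request

# NAVER news search endpoint; the credentials used below are placeholders.
NAVER_NEWS_URL = "https://openapi.naver.com/v1/search/news.json"


def build_news_request(query, client_id, client_secret, display=10, start=1):
    """Build (but do not send) a NAVER news search request."""
    params = urllib.parse.urlencode(
        {"query": query, "display": display, "start": start}
    )
    request = urllib.request.Request(NAVER_NEWS_URL + "?" + params)
    request.add_header("X-Naver-Client-Id", client_id)
    request.add_header("X-Naver-Client-Secret", client_secret)
    return request


def extract_article_urls(response_body):
    """Collect the article links from a JSON search response."""
    payload = json.loads(response_body)
    return [item["link"] for item in payload.get("items", [])]


# Hypothetical response body, trimmed to the fields used here.
sample_body = json.dumps(
    {"items": [{"title": "Example", "link": "https://news.naver.com/example"}]}
)
request = build_news_request("sentiment", "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(request.full_url)
print(extract_article_urls(sample_body))
```

In a real crawl, the request would be sent with `urllib.request.urlopen(request)`, and each collected URL would then be fetched and parsed for the title and body.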

2. KoNLPy Stemmer

2-1. Installing the requirements

Python 3.6

$ conda create -n text_analysis python=3.6 anaconda

Install KoNLPy

$ pip install konlpy

Install MeCab (optional)

$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)

Install Gensim for topic modeling (use either pip or conda)

$ pip install gensim
$ conda install -c conda-forge gensim

KoNLPy installation guide: https://konlpy.org/en/latest/install/#id1


2-2. KoNLPy Stemmer Performance Comparison

Compared analyzers: Kkma, Komoran, Hannanum, Mecab, and Okt (formerly the Twitter morphological analyzer)

-> Mecab does a good job analyzing poorly spaced sentences. Hannanum, on the other hand, barely segments them.

-> Misspellings make a big difference in stemming performance.

-> If the text contains many typos, Komoran is the better choice.

-> In the speed comparison, Kkma is the slowest and Mecab is by far the fastest. Komoran and Hannanum run at about the same speed.
Speed, slowest to fastest: Kkma < Komoran < Okt < Hannanum < Mecab
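A speed comparison like the one above can be sketched with a small timing harness. The `benchmark` helper and the stand-in tokenizers are assumptions for illustration. They let the sketch run without KoNLPy or Java and do not reproduce the measurements reported here.

```python
import time


def benchmark(analyzers, sentences, repeats=3):
    """Return the best wall-clock time in seconds for each analyzer."""
    timings = {}
    for name, tokenize in analyzers.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for sentence in sentences:
                tokenize(sentence)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Stand-in tokenizers so the sketch runs without KoNLPy.
# With KoNLPy installed, one could pass for example
# {"Okt": Okt().morphs, "Kkma": Kkma().morphs} instead.
analyzers = {
    "whitespace": str.split,
    "characters": list,
}
sentences = ["이 영화 정말 재미있어요", "배우들의 연기가 아쉬웠다"] * 100

timings = benchmark(analyzers, sentences)
# Print slowest first, matching the ordering used in the ranking above.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print("{}: {:.6f}s".format(name, seconds))
```

Taking the best of several repeats reduces noise from one-off delays such as analyzer start-up.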


3. Topic Modeling

3-1. Definition

  • The process of finding topics (keywords) in an article
  • The process of finding a "set of k words" from the combination of words that make up a document (sentence)
  • It is a Bayesian probabilistic model; its output is the probability that each word belongs to each topic.

3-2. History

  • LSI (Latent Semantic Indexing)

    • Applies SVD to the document-word matrix.
    • The low-dimensional vectors produced by SVD represent the semantics.
  • pLSA (Probabilistic Latent Semantic Analysis)

    • Keeps the document-word matrix as is.
    • Represents documents by probability of occurrence rather than by frequency per document.
  • LDA (Latent Dirichlet Allocation)

    • Uses the Dirichlet distribution.

    • This project uses LDA for topic modeling.

    • Model Assumptions in LDA

      • Each document can contain multiple topics.

      • Each topic can contain multiple words.

      • Every word that exists in a document is necessarily contained in some topic.

      • The human writing process is modeled as a generative process.

    • LDA Model Process

      1. Select the topics to be used in the document (K topics).

      2. Select one of those topics.

      3. Select one of the words in that topic.

      4. Add the word to the document (write it).

      5. Repeat from step 2.
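The generative steps above can be sketched in plain Python. The toy vocabulary, the two topic-word distributions, and the `alpha` value are assumptions chosen for illustration. Fitting LDA to real documents inverts this process, and that inference step is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocabulary, length, alpha, rng):
    """Generate one document following the LDA writing process."""
    num_topics = len(topic_word_probs)
    # Step 1: decide how strongly each of the K topics appears in this document.
    topic_mix = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # Step 2: pick a topic according to the document's topic mixture.
        topic = rng.choices(range(num_topics), weights=topic_mix)[0]
        # Step 3: pick a word according to that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        # Step 4: write the word; the loop itself is step 5.
        words.append(word)
    return topic_mix, words


vocabulary = ["goal", "match", "team", "stock", "market", "bank"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # hypothetical "finance" topic
]
rng = random.Random(0)
topic_mix, words = generate_document(topic_word_probs, vocabulary, 8, 0.5, rng)
print([round(p, 3) for p in topic_mix])
print(" ".join(words))
```

A small `alpha` tends to concentrate each document on a few topics, while a larger value spreads it across more of them.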


4. NSMC Sentiment Analysis

4-1. NSMC (Naver Sentiment Movie Corpus v1.0)

  • url : https://github.com/e9t/nsmc
  • size : 19MB
  • Data source: Naver
  • No more than 100 reviews per movie, each at most 140 characters long
  • Total 200,000 reviews (sampled from 640,000 collected)
    • ratings_train.txt: 150,000; ratings_test.txt: 50,000
    • Positive and negative reviews are sampled in equal proportion (i.e. a random guess yields 50% accuracy)
    • Does not include neutral reviews
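A minimal sketch of reading the corpus, assuming the tab-separated layout of the NSMC files (a header row followed by id, document, and label columns, with label 1 for positive and 0 for negative). The `read_nsmc` helper and the sample rows are illustrative and are not taken from this repository.

```python
import csv
import io
from collections import Counter


def read_nsmc(handle):
    """Parse NSMC rows into (document, label) pairs, skipping the header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # header row: id, document, label
    pairs = []
    for row in reader:
        if len(row) != 3 or not row[1].strip():
            continue  # skip malformed rows and empty reviews
        pairs.append((row[1], int(row[2])))
    return pairs


# Hypothetical rows in the same layout as ratings_train.txt.
sample = io.StringIO(
    "id\tdocument\tlabel\n"
    "1\t정말 재미있는 영화\t1\n"
    "2\t시간이 아까웠다\t0\n"
    "3\t\t1\n"
)
reviews = read_nsmc(sample)
print(Counter(label for _, label in reviews))
```

For the real files, the same function can be given `open("ratings_train.txt", encoding="utf-8")` in place of the in-memory sample.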

4-2. Sentiment analysis definition




  • Reference : https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html

  • Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to find the author's intention or information hidden in the text.

  • It is also called Opinion Mining, Sentiment Mining, and Subjectivity Analysis.

  • Early methods mostly tried to determine the polarity of a text. A typical example is classifying a text as positive or negative.


  • Sentiment analysis is largely divided into a knowledge-based approach and a machine learning-based approach.

    The knowledge-based approach relies on resources that human experts have already evaluated, such as known phrases, word endings, and idiomatic expressions.

    The ML-based approach includes supervised and unsupervised methods. Pretrained language models have recently improved unsupervised methods by leaps and bounds, but supervised methods still perform better.


4-3. NSMC Sentiment Analysis with Machine learning

  • Experimental results (Test accuracy)

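    | Setting          | Linear classifier | SVM classifier |
    |------------------|-------------------|----------------|
    | Nouns only       | 0.51              | 0.53           |
    | No preprocessing | 0.67              | 0.72           |
    | Preprocessing    | 0.71              | 0.76           |
    | TfidfVectorizer  | 0.71              | 0.81           |
    | All features     | 0.77              | 0.82           |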
  • Conclusion

    1. Extracting only nouns does not provide enough information for classifying NSMC reviews (accuracy stays near chance, at 0.51 to 0.53).

    2. Preprocessing makes a difference in performance. NSMC data contains many special characters, and removing them makes the tokens more canonical (text normalization).

    3. Using all features performs better than keeping only the 500 most frequent ones.

    4. CountVectorizer vs. TfidfVectorizer: the accuracy of CountVectorizer was slightly higher than that of TfidfVectorizer.

    5. Hyperparameter tuning could improve performance further.
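The special-character removal and the top-500 truncation discussed in the conclusions can be sketched as follows. This is a plain-Python stand-in with whitespace tokenization, not the scikit-learn vectorizers named above. The regular expression, helper names, and sample reviews are assumptions for illustration.

```python
import re
from collections import Counter

# Keep Hangul, Latin letters, digits and whitespace; drop everything else.
NON_TEXT = re.compile(r"[^0-9A-Za-z가-힣ㄱ-ㅎㅏ-ㅣ\s]")


def normalize(text):
    """Replace special characters with spaces and collapse whitespace."""
    return " ".join(NON_TEXT.sub(" ", text).split())


def build_vocabulary(documents, max_features=None):
    """Map the most frequent tokens to column indices (all tokens if None)."""
    counts = Counter(token for doc in documents for token in doc.split())
    ranked = counts.most_common(max_features)
    return {token: i for i, (token, _) in enumerate(ranked)}


def count_vector(document, vocabulary):
    """Bag-of-words count vector for one document."""
    vector = [0] * len(vocabulary)
    for token in document.split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


raw = ["재밌어요!!! ㅋㅋ", "별로... 재미 없음 ;;"]  # hypothetical reviews
cleaned = [normalize(text) for text in raw]
full_vocab = build_vocabulary(cleaned)                  # all features
small_vocab = build_vocabulary(cleaned, max_features=2)  # truncated vocabulary
print(cleaned)
print(len(full_vocab), len(small_vocab))
print(count_vector(cleaned[1], full_vocab))
```

Tokens outside a truncated vocabulary are simply dropped from the vector, which is one way truncation can discard useful signal.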


4-4. NSMC Sentiment Analysis with LSTM

  • LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) commonly used in natural language processing and other sequential-data tasks.

  • Hyperparameters

    • max_words = 35000
    • max_len = 30
    • batch_size = 128
    • EPOCHS = 100
  • Experimental results (Test accuracy)

    | Model         | LSTM   |
    |---------------|--------|
    | Test Score    | 1.062  |
    | Test Accuracy | 0.7995 |
  • Conclusion

    • The final test accuracy of the deep learning model (LSTM, 0.7995) is similar to that of the machine learning models (0.77 to 0.82 with all features).
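To illustrate what `max_words` and `max_len` control, here is a plain-Python sketch of turning tokenized reviews into fixed-length id sequences. The helper names, left-padding with 0, and dropping out-of-vocabulary tokens are assumptions modeled on common Keras-style preprocessing. The source does not show the actual preprocessing code.

```python
from collections import Counter

MAX_WORDS = 35000  # vocabulary size cap, as listed in the hyperparameters
MAX_LEN = 30       # tokens kept per review, as listed in the hyperparameters


def build_index(tokenized_docs, max_words):
    """Assign ids 1..max_words-1 to the most frequent tokens; 0 is padding."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    ranked = counts.most_common(max_words - 1)
    return {token: i for i, (token, _) in enumerate(ranked, start=1)}


def encode(tokens, index, max_len):
    """Map tokens to ids, drop unknown tokens, then truncate or left-pad."""
    ids = [index[t] for t in tokens if t in index]
    ids = ids[-max_len:]  # keep the last max_len tokens of long reviews
    return [0] * (max_len - len(ids)) + ids


# Hypothetical, already tokenized reviews.
docs = [["정말", "재미있는", "영화"], ["영화", "별로"]]
index = build_index(docs, MAX_WORDS)
encoded = [encode(doc, index, MAX_LEN) for doc in docs]
print(encoded[1])
```

Every review becomes a list of exactly `MAX_LEN` integers, so a batch of `batch_size` reviews forms a rectangular array that an embedding layer followed by an LSTM can consume.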
