Skip to content

FISA-DL/sentiment-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

39 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ’ฌ Sentiment Analysis & AI Advice Dashboard

๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ฐ์ • ๋ถ„์„ ํ”„๋กœ์ ํŠธ

๐Ÿ“Œ Overview

Kaggle์˜ Twitter US Airline Sentiment dataset์„ ํ™œ์šฉํ•˜์—ฌ ํŠธ์œ— ๋‚ด์šฉ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ์ •์„ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ณ , ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ธฐ์—…์—๊ฒŒ ๋งž์ถคํ˜• ์†”๋ฃจ์…˜์„ ์ œ๊ณตํ•˜๋Š” ์„œ๋น„์Šค๋กœ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ž…๋ ฅ๋œ ์˜๊ฒฌ์— ๋‹ด๊ธด ๊ฐ์ •์„ ๋ถ„๋ฅ˜ํ•œ ํ›„, GPT๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ•ด๋‹น ๊ฐ์ • ๋ฐ ์˜๊ฒฌ์— ๋งž๋Š” ์กฐ์–ธ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๐ŸŽฏ Features

โœ… ํ…์ŠคํŠธ ๊ฐ์ • ๋ถ„์„ (positive, neutral, negative)
โœ… GPT ๊ธฐ๋ฐ˜ ์กฐ์–ธ ์ œ๊ณต (์˜ˆ: ๋ถ€์ •์ ์ธ ๊ฐ์ •์˜ ๊ฒฝ์šฐ ํ•ด๊ฒฐ์ฑ… ์ œ์•ˆ)
โœ… Streamlit UI๋ฅผ ํ†ตํ•œ ๋ฐฐํฌ โŒ SHAP ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•œ ๊ฐ์ • ๋ถ„๋ฅ˜ ๊ทผ๊ฑฐ ์ œ๊ณต

๐Ÿ“Š Dataset

  • ์ถœ์ฒ˜: Twitter US Airline Sentiment (Kaggle)
  • ๊ตฌ์„ฑ:
    • 2015๋…„ 2์›” ์—ฌํ–‰๊ฐ๋“ค์˜ ํŠธ์œ— ๋ฐ์ดํ„ฐ
    • ๊ฐ์ •(label): positive, neutral, negative
    • ์ด 14,640๊ฐœ์˜ ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ

๐Ÿ— Model Architecture

๐Ÿ”น ์ž์—ฐ์–ด ์ „์ฒ˜๋ฆฌ

  • HTML ํƒœ๊ทธ, URL, ๋ฉ˜์…˜, ํ•ด์‹œํƒœ๊ทธ ์ œ๊ฑฐ
  • ๋ถˆํ•„์š”ํ•œ ํŠน์ˆ˜๋ฌธ์ž ๋ฐ ์ˆซ์ž ์ œ๊ฑฐ
  • thx/thanks โ†’ thank์™€ ๊ฐ™์ด ๋‹จ์–ด ์ •๊ทœํ™”
  • ๋ถˆ์šฉ์–ด ๋ฐ ํ‘œ์ œ์–ด ์ œ๊ฑฐ : NLTK (Natural Language Toolkit), Word Cloud ์‹œ๊ฐํ™” ์‚ฌ์šฉ

๐Ÿ”น ํ† ํ”ฝ ๋ชจ๋ธ๋ง

  • ํ† ํฐํ™” : nltk ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ํ™œ์šฉ
  • ํ† ํ”ฝ ์ˆ˜ ์„ ์ • : coherence ์ ์ˆ˜๋ฅผ ์‚ฐ์ถœํ•œ ํ›„, ๊ฐ€์žฅ ๋†’์€ ์ ์ˆ˜๋ฅผ ๋ณด์ธ 18๊ฐœ์˜ ํ† ํ”ฝ์œผ๋กœ ์ตœ์ข… ๊ฒฐ์ •
  • LDA ํ† ํ”ฝ ๋ชจ๋ธ๋ง

๐Ÿ”น ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ฐ์„ฑ ๋ถ„์„ ๋ชจ๋ธ

๐Ÿ“Œ ์‚ฌ์šฉ๋œ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ

1๏ธโƒฃ BERT (Bidirectional Encoder Representations from Transformers)

  • Google์—์„œ ๊ฐœ๋ฐœํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP) ๋ชจ๋ธ
  • ์–‘๋ฐฉํ–ฅ(Bidirectional) ๋ฌธ๋งฅ ์ดํ•ด๋ฅผ ํ†ตํ•ด ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ํŒŒ์•…
  • **์‚ฌ์ „ ํ•™์Šต(Pre-training)**๋œ ๋ชจ๋ธ๋กœ, ๊ฐ์„ฑ ๋ถ„์„์„ ์œ„ํ•ด ํŒŒ์ธํŠœ๋‹(Fine-tuning) ์ง„ํ–‰

2๏ธโƒฃ ์‚ฌ์šฉํ•œ Hugging Face Transformer ๋ชจ๋ธ

๋ชจ๋ธ ์„ค๋ช…
bert-base-uncased ๊ธฐ๋ณธ BERT ๋ชจ๋ธ
bert-large-uncased BERT-Base๋ณด๋‹ค ๋” ๊นŠ์€ ๋ชจ๋ธ (๋ ˆ์ด์–ด ์ˆ˜ ์ฆ๊ฐ€)
roberta-base BERT๋ณด๋‹ค 10๋ฐฐ ๋งŽ์€ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ (๋™์  Masking ์ ์šฉ)
roberta-large RoBERTa-Base๋ณด๋‹ค ๋” ํฌ๊ณ  ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ
deberta-v3-large Microsoft์—์„œ ๊ฐœ๋ฐœํ•œ 15์–ต ๊ฐœ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ
twitter-roberta-base-sentiment Twitter ๊ฐ์ • ๋ถ„์„์— ํŠนํ™”๋œ ๋ชจ๋ธ

โœ” ์ด ์ค‘ ๊ฐ€์žฅ ์ ํ•ฉํ•œ ๋ชจ๋ธ์„ ์„ ์ •ํ•˜์—ฌ ๊ฐ์„ฑ ๋ถ„์„์„ ์ง„ํ–‰ํ•จ. ๐Ÿš€


๐Ÿ“Œ ๋ชจ๋ธ ์„ฑ๋Šฅ ํ‰๊ฐ€ ์ง€ํ‘œ

  • Accuracy (์ •ํ™•๋„) : ์˜ˆ์ธกํ•œ ๊ฐ์„ฑ์ด ์‹ค์ œ ๊ฐ์„ฑ๊ณผ ์–ผ๋งˆ๋‚˜ ์ผ์น˜ํ•˜๋Š”์ง€ ์ธก์ •
  • Loss (์†์‹ค ํ•จ์ˆ˜ ๊ฐ’) : ๋ชจ๋ธ์˜ ์˜ˆ์ธก ์˜ค๋ฅ˜ ์ •๋„๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ๋‚ฎ์„์ˆ˜๋ก ์„ฑ๋Šฅ์ด ์ข‹์Œ

๐Ÿ”น ์ข…๋ฅ˜๋ณ„ BERT ๋ชจ๋ธ ์„ฑ๋Šฅ์ง€ํ‘œ

Model Train Accuracy Validation Accuracy Test Accuracy Train Loss Validation Loss Test Loss
BERT-Base 0.8730 0.8300 0.8220 0.3400 0.4150 0.4050
BERT-Large 0.8950 0.8500 0.8600 0.2850 0.4100 0.3980
RoBERTa-Base 0.8861 0.8402 0.8324 0.3241 0.4013 0.3961
RoBERTa-Base (Dropout 0.2) 0.8659 0.8443 0.8369 0.3505 0.4184 0.4112
RoBERTa-Large 0.8928 0.8624 0.8651 0.2695 0.4023 0.3997
Twitter-RoBERTa-Base-Sentiment 0.8786 0.8497 0.8465 0.3118 0.4037 0.3875
DeBERTa-V3-Large 0.8816 0.8370 0.8493 0.3201 0.4351 0.4081

๐Ÿ“Œ ๋ชจ๋ธ ์„ ์ • ๊ธฐ์ค€

๐Ÿ’ก ๋ชจ๋ธ ์„ ํƒ ์‹œ ๊ณผ์ ํ•ฉ(Overfitting)๊ณผ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๊ณ ๋ คํ•˜์—ฌ ์•„๋ž˜ ์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋Š” ๋ชจ๋ธ์„ ์„ ์ •ํ•จ.

1๏ธโƒฃ Train Accuracy vs Test Accuracy ์ฐจ์ด โ‰ค 5%

  • Train(ํ›ˆ๋ จ) ๋ฐ์ดํ„ฐ์™€ Test(ํ…Œ์ŠคํŠธ) ๋ฐ์ดํ„ฐ์—์„œ ์ •ํ™•๋„์˜ ์ฐจ์ด๊ฐ€ 5% ์ด์ƒ์ด๋ฉด ๊ณผ์ ํ•ฉ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ.
  • ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์ด ์ข‹์€ ๋ชจ๋ธ์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด ์ด ๊ธฐ์ค€์„ ์ ์šฉ.

2๏ธโƒฃ Test Loss โ‰ค 0.4

  • Test ๋ฐ์ดํ„ฐ์—์„œ Loss๊ฐ€ 0.4 ์ดํ•˜์ธ ๋ชจ๋ธ์„ ์šฐ์„ ์ ์œผ๋กœ ๊ณ ๋ คํ•˜์—ฌ ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ ํ™•๋ณด.

๊ฒฐ๊ณผ์ ์œผ๋กœ RoBERTa-Large๋ชจ๋ธ ์‚ฌ์šฉ


๐Ÿ“Œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

๐Ÿ’ก ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์—ฌ๋Ÿฌ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‹คํ—˜ํ•˜๋ฉฐ ์กฐ์ •ํ•จ.

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • ๊ฐ’
Dropout (hidden, attention) 0.1 ~ 0.2 (๊ณผ์ ํ•ฉ ๋ฐฉ์ง€)
Epochs 2 ~ 4 (์ ์ ˆํ•œ ํ•™์Šต ํšŸ์ˆ˜)
Batch Size (Train, Eval) 16 ~ 32 (ํ•™์Šต ์•ˆ์ •์„ฑ์„ ๊ณ ๋ ค)
Learning Rate 1e-5 ~ 2e-5 (AdamW ์˜ตํ‹ฐ๋งˆ์ด์ € ์‚ฌ์šฉ)
Warmup Steps ํ›ˆ๋ จ ์ดˆ๋ฐ˜ 500~1000 ์Šคํ… ๋™์•ˆ ์ž‘์€ ํ•™์Šต๋ฅ  ์œ ์ง€
Weight Decay 0.001 ~ 0.01 (๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๋ฐ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ๊ฐœ์„ )

๐Ÿ”ง Trouble Shooting

๐Ÿ“Œ Twitter US Airline Sentiment Dataset ๊ตฌ์„ฑ

์ด ๋ฐ์ดํ„ฐ์…‹์€ ๋ฏธ๊ตญ ํ•ญ๊ณต์‚ฌ์— ๋Œ€ํ•œ ํŠธ์œ—์„ ๊ฐ์„ฑ ๋ถ„์„ํ•œ ๊ฒƒ์œผ๋กœ, ๋ถ€์ •์ ์ธ ๊ฐ์„ฑ์ด ์••๋„์ ์œผ๋กœ ๋งŽ์•„ ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜•(imbalanced data)์ด ๋ฐœ์ƒํ•˜๋Š” ํŠน์ง•์ด ์žˆ์Œ.

Sentiment ๋น„์œจ (%)
Negative (๋ถ€์ •์ ) 62%
Neutral (์ค‘๋ฆฝ์ ) 21%
Positive (๊ธ์ •์ ) 16%

๐Ÿ’ก ์ด๋Ÿฌํ•œ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์ง€ ์•Š์œผ๋ฉด, ๋ชจ๋ธ์ด negative ํด๋ž˜์Šค์— ํŽธํ–ฅ๋˜์–ด ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ.


โœ… ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•: ์†์‹ค ํ•จ์ˆ˜ ๋ณ€๊ฒฝ (CrossEntropyLoss โ†’ Focal Loss)

๐Ÿ”น ๊ธฐ์กด Loss Function: CrossEntropyLoss ๋ฌธ์ œ์ 

  • ๋ชจ๋“  ํด๋ž˜์Šค(positive/neutral/negative)๋ฅผ ๋™์ผํ•œ ์ค‘์š”๋„๋กœ ํ•™์Šต
  • ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜•์ด ์‹ฌํ•œ ๊ฒฝ์šฐ, ๋‹ค์ˆ˜ ํด๋ž˜์Šค(negative) ์ค‘์‹ฌ์œผ๋กœ ํ•™์Šต ์ง„ํ–‰
  • ๊ฒฐ๊ณผ์ ์œผ๋กœ ์†Œ์ˆ˜ ํด๋ž˜์Šค(positive, neutral)์˜ ์˜ˆ์ธก ์ •ํ™•๋„๊ฐ€ ๋‚ฎ์•„์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ํผ

๐Ÿ”น ๋Œ€์•ˆ: Focal Loss ์ ์šฉ

Focal Loss๋Š” ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ์‰ฌ์šด ์ƒ˜ํ”Œ(negative)์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜๋ฅผ ๋‚ฎ์ถ”๊ณ , ์–ด๋ ค์šด ์ƒ˜ํ”Œ(positive, neutral)์— ๋” ์ง‘์ค‘ํ•˜๋„๋ก ์œ ๋„ํ•˜๋Š” ์†์‹ค ํ•จ์ˆ˜

Loss Function ํŠน์ง• ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ํ•ด๊ฒฐ
CrossEntropyLoss ๋ชจ๋“  ํด๋ž˜์Šค ๋™์ผ ๊ฐ€์ค‘์น˜ โŒ ๋‹ค์ˆ˜ ํด๋ž˜์Šค(negative)์— ํŽธํ–ฅ๋  ๊ฐ€๋Šฅ์„ฑ
Focal Loss ์–ด๋ ค์šด ์ƒ˜ํ”Œ์— ๋” ์ง‘์ค‘ โœ… ์†Œ์ˆ˜ ํด๋ž˜์Šค(positive, neutral)๋„ ํ•™์Šต ๊ฐ€๋Šฅ

โœ” ์ ์šฉ ํšจ๊ณผ

  • ์†Œ์ˆ˜ ํด๋ž˜์Šค(positive, neutral)์˜ ์˜ˆ์ธก ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๋ฐ ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ๊ฐœ์„ 

๐Ÿ† ์ตœ์ข… ๋ชจ๋ธ ์„ฑ๋Šฅ

ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์€ Train / Validation / Test ๋ฐ์ดํ„ฐ์—์„œ์˜ ์ •ํ™•๋„(Accuracy)๋กœ ํ‰๊ฐ€

๋ฐ์ดํ„ฐ์…‹ Accuracy (%)
Train (ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ) 90.39%
Validation (๊ฒ€์ฆ ๋ฐ์ดํ„ฐ) 85.70%
Test (ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ) 86.89%

โœ” Train๊ณผ Test ์„ฑ๋Šฅ ์ฐจ์ด๊ฐ€ 5% ์ด๋‚ด๋กœ ์œ ์ง€๋˜์–ด, ๊ณผ์ ํ•ฉ ์—†์ด ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ
โœ” Validation๊ณผ Test ์ •ํ™•๋„๊ฐ€ ๋น„์Šทํ•˜์—ฌ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜


๐Ÿš€ ์ตœ์ข… ๊ฒฐ๋ก 

Focal Loss ์ ์šฉ ๋ฐ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ํ†ตํ•ด ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€์œผ๋ฉฐ, ๋ชจ๋ธ์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์— ๊ณผ์ ํ•ฉํ•˜์ง€ ์•Š๊ณ  ์ƒˆ๋กœ์šด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์—์„œ๋„ ๋ถ„๋ฅ˜๋ฅผ ์ž˜ ํ•ด๋ƒ„์„ ํ™•์ธ ๐Ÿš€๐Ÿ”ฅ


๐Ÿš€ Try it!

๐Ÿ”— [https://advicegenerator.streamlit.app/]

์‚ฌ์šฉ๋ฐฉ๋ฒ•

  • advice page

    1. ์‚ฌ์šฉ์ž์˜ ๊ธฐ์—… ๋„๋ฉ”์ธ์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.
    2. ์กฐ์–ธ์„ ์–ป๊ณ  ์‹ถ์€ ๊ณ ๊ฐ์˜ ์˜๊ฒฌ์„ ์ž…๋ ฅ๋ž€์— ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.
    3. ์˜๊ฒฌ์— ๋‹ด๊ธด ๊ฐ์ •์„ ํ™•์ธํ•˜๊ณ  ๊ทธ์— ๋งž๋Š” advice๋ฅผ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.
  • airline monitoring page

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages