This document provides a comprehensive explanation of the ViSoBERT model, introduces the UIT-VSMEC dataset, and illustrates the model's performance in emotion recognition using this dataset.
ViSoBERT, short for "Vietnamese Social BERT," is a specialized pre-trained language model explicitly designed for processing the nuances of Vietnamese social media text. It addresses the limitations of traditional language models trained on formal text and effectively handles:
- Informal language
- Emojis
- Slang
- Variations in diacritic usage

- Built upon XLM-R Architecture: ViSoBERT leverages the robust multilingual capabilities of the XLM-R (Cross-lingual Language Model - RoBERTa) architecture, inheriting its powerful transformer-based design and masked language model pre-training approach.
- Custom Tokenizer Tailored to Social Media: A crucial aspect of ViSoBERT's effectiveness is its use of a custom tokenizer built with SentencePiece. This tokenizer is specifically trained on a large corpus of Vietnamese social media text, enabling it to accurately handle emojis, teencode, and variations in diacritic usage.
- Training on a Massive Social Media Dataset: ViSoBERT is trained exclusively on a vast dataset of Vietnamese social media posts and comments collected from platforms like Facebook, TikTok, and YouTube. This targeted training makes ViSoBERT highly adept at understanding vocabulary, language patterns, and context in social media.
ViSoBERT has demonstrated superior performance in various Vietnamese social media processing tasks, including:
- Emotion recognition
- Hate speech detection
- Sentiment analysis
- Spam review detection
- Hate speech span detection
It consistently outperforms strong baseline models, including monolingual and multilingual language models.
UIT-VSMEC (Vietnamese Social Media Emotion Corpus) is a standardized dataset developed by researchers at the University of Information Technology in Vietnam. It serves as a valuable resource for training and evaluating emotion recognition models specifically designed for Vietnamese social media text.
- Emotion Labels: The dataset consists of 6,927 Vietnamese sentences annotated with one of seven emotion labels: `enjoyment`, `sadness`, `anger`, `surprise`, `fear`, `disgust`, and `other` (for neutral or ambiguous emotions).
- Source of Data: Sentences were collected from Facebook, ensuring the dataset reflects real-world social media communication.
- Annotation Agreement: To ensure quality and reliability, the annotation process involved multiple annotators and an agreement measure (Am) to assess consensus. The Am agreement for UIT-VSMEC was over 82%.
Given its specialization in Vietnamese social media text, ViSoBERT is well-suited for emotion recognition on the UIT-VSMEC dataset.
- Corpus Preparation
  - The UIT-VSMEC corpus was divided into training, validation, and test sets using stratified sampling, so that each split preserves the overall distribution of emotion labels.
- Fine-tuning ViSoBERT
  - ViSoBERT was fine-tuned on the UIT-VSMEC training set using the `simpletransformers` library. Standard fine-tuning procedures were followed.
- Evaluation Metrics
  - Performance was evaluated using accuracy, weighted F1-score, and macro F1-score.
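The preparation and fine-tuning steps above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure used for the reported results: the 80/10/10 split ratios, the seed, the epoch count, the `model_path` argument, and the `"xlmroberta"` model type (chosen because ViSoBERT builds on XLM-R) are all assumptions. The `fine_tune` helper needs `pandas` and `simpletransformers` installed, while the split helper uses only the standard library.

```python
import random
from collections import defaultdict

# The seven UIT-VSMEC emotion labels, mapped to integer ids for the classifier.
LABELS = ["enjoyment", "sadness", "anger", "surprise", "fear", "disgust", "other"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}


def stratified_split(rows, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (text, label) rows into train/validation/test, label by label.

    The ratios and seed are illustrative assumptions, not the corpus settings.
    """
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_valid = int(len(group) * ratios[1])
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


def fine_tune(train_rows, valid_rows, model_path, epochs=3):
    """Fine-tune a checkpoint with simpletransformers (hypothetical settings)."""
    # Imported lazily so the split helper works without these packages.
    import pandas as pd
    from simpletransformers.classification import (
        ClassificationArgs,
        ClassificationModel,
    )

    def to_frame(rows):
        # simpletransformers expects "text" and "labels" columns.
        return pd.DataFrame(
            [(text, LABEL2ID[label]) for text, label in rows],
            columns=["text", "labels"],
        )

    args = ClassificationArgs(num_train_epochs=epochs, overwrite_output_dir=True)
    model = ClassificationModel(
        "xlmroberta",  # assumed model type, since ViSoBERT builds on XLM-R
        model_path,  # local weights directory or a hub identifier
        num_labels=len(LABELS),
        args=args,
        use_cuda=False,  # set to True when a GPU is available
    )
    model.train_model(to_frame(train_rows), eval_df=to_frame(valid_rows))
    return model
```

Splitting within each label group keeps rare emotions represented in every split, which matters when some labels have far fewer examples than others.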
ViSoBERT achieved the following results on the UIT-VSMEC emotion recognition task:
| Metric | Score |
|---|---|
| Accuracy | 66% |
| Weighted F1 | 66% |
| Macro F1 | 64% |
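For readers unfamiliar with how the three metrics differ, here is a small pure-Python sketch of their definitions. It is illustrative only and is not the evaluation code behind the scores above. The function names are hypothetical, and labels are taken from the gold annotations only, which may differ from a library's default handling of labels that appear only in predictions.

```python
from collections import Counter


def f1_per_label(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true."""
    scores = {}
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def evaluate(y_true, y_pred):
    """Compute accuracy, macro F1, and support-weighted F1."""
    scores = f1_per_label(y_true, y_pred)
    support = Counter(y_true)  # number of gold examples per label
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro F1: every label counts equally, however rare it is.
    macro_f1 = sum(scores.values()) / len(scores)
    # Weighted F1: each label's F1 is weighted by its share of gold examples.
    weighted_f1 = sum(scores[l] * support[l] for l in scores) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}
```

Because macro F1 gives rare labels the same weight as frequent ones, it tends to fall below weighted F1 when a model does worse on the less common emotions, which is consistent with the gap between the two scores in the table.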
- State-of-the-Art Performance: ViSoBERT demonstrates its effectiveness in capturing the nuances of emotion expression in Vietnamese social media text.
- Value of Domain-Specific Training: ViSoBERT's superior performance compared to general-purpose models like PhoBERT and multilingual models like TwHIN-BERT underscores the importance of training on domain-specific data.
ViSoBERT, with its specialized tokenizer and training on a large-scale Vietnamese social media corpus, proves to be a powerful tool for emotion recognition in this domain. Its performance on the UIT-VSMEC dataset showcases its ability to accurately classify emotions expressed in social media text, setting a new benchmark for this task.
- Download Models: Download the model weights from Google Drive and move them to your project directory: Model Weights
- Clone the Repository: Clone this repository and navigate to the project directory: `git clone https://github.com/your-repository-name/Classification-for-Vietnamese-Text.git`, then `cd Classification-for-Vietnamese-Text`
- Run the Application: `streamlit run main.py`