LLMs know everything, but don't understand anything.
- Omar Yasser
This project focuses on the sentiment analysis of company reviews in various dialects of Arabic.
- Data Cleansing: Removal of nulls and duplicates to ensure a clean dataset.
- Text Normalization: Stripping away punctuation, digits, and special characters to focus on the linguistic essence.
- Diacritic Handling: Removing diacritics and normalizing Arabic characters to address the variability in text input.
- Language Homogenization: Translating the few non-Arabic words into Arabic to maintain linguistic consistency.
- Emoji Mapping: Emojis, which often carry strong sentiment, were mapped to their textual meanings.
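The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names, the character-normalization table, and the three-entry emoji map are all assumptions made for the example. The translation step is omitted, so the sketch assumes non-Arabic words were already translated; anything left in another script is simply dropped by the final filter.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652), superscript alef (U+0670),
# and tatweel/kashida (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

# Hypothetical normalization table: one common convention, assumed here.
CHAR_MAP = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",  # alef variants -> bare alef
    "ى": "ي",                      # alef maqsura -> ya
    "ة": "ه",                      # ta marbuta -> ha
})

# Hypothetical emoji-to-text map; a real one would be far larger.
EMOJI_MAP = {"😍": "حب", "😡": "غضب", "👍": "جيد"}


def drop_nulls_and_duplicates(reviews):
    """Drop None/blank entries and exact duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for review in reviews:
        if review is None or not review.strip():
            continue
        if review not in seen:
            seen.add(review)
            kept.append(review)
    return kept


def normalize(text):
    """Map emojis to words, strip diacritics, unify letters, drop the rest."""
    # Emojis are mapped first, because the final filter would delete them.
    for emoji, word in EMOJI_MAP.items():
        text = text.replace(emoji, f" {word} ")
    text = DIACRITICS.sub("", text)
    text = text.translate(CHAR_MAP)
    # Keep only Arabic letters and whitespace; punctuation, digits,
    # and special characters become spaces.
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("ممتاز 👍 100%"))  # ممتاز جيد
```

Mapping emojis before the character filter is the one ordering constraint in this sketch; the other steps could be rearranged without changing the result.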
Four models were implemented:
- Fine-tuned AraBERT: The pretrained AraBERT model, fine-tuned on our dataset.
- Transformer from Scratch: A Transformer model built from the ground up to better understand its architecture.
- LSTM
- Bidirectional LSTM: An LSTM that reads each sequence in both the forward and backward directions.
Our team won a university-wide Arabic sentiment analysis competition on Kaggle, out of more than 100 teams. Our model achieved 87.5% accuracy, outperforming the second-best team by a margin of 2%.