This project focuses on sentiment analysis for Urdu text using machine learning and deep learning models. Although Urdu has over 300 million speakers worldwide, sentiment analysis for Urdu remains underexplored because of a lack of resources and datasets. This research aims to bridge that gap by applying a range of NLP techniques and classification models to analyze sentiment in Urdu movie reviews.
Traditional sentiment analysis models primarily focus on English and other Western languages. Due to the complexity of the Urdu language, existing NLP methods fail to deliver optimal results. The key challenges include:
- Right-to-left script processing
- Lack of annotated datasets
- Complex grammar and morphological structure
- Tokenization and word segmentation issues
- Dataset Preparation: Used a dataset of 50,000 Urdu movie reviews with positive and negative labels.
- Preprocessing: Implemented stop-word removal, lemmatization, and tokenization.
- Feature Extraction: Applied TF-IDF, Bag of Words (BoW), and Word2Vec.
- Classification Models: Used Support Vector Machine (SVM), Decision Tree, Logistic Regression, and Long Short-Term Memory (LSTM).
- Evaluation Metrics: Compared models using accuracy, precision, recall, F1-score, confusion matrix, and ROC curve.
- Optimization: Fine-tuned hyperparameters and performed cross-validation to improve accuracy.
- Programming Language: Python
- Libraries & Frameworks:
  - TensorFlow & Keras (Deep Learning)
  - Scikit-learn (ML Models)
  - NLTK & UrduHack (NLP Processing)
  - Gensim (Word2Vec)
  - Matplotlib & Seaborn (Visualization)
- Dataset Source: GitHub, Kaggle
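To make the preprocessing and feature-extraction steps listed above more concrete, here is a minimal sketch of stop-word removal followed by TF-IDF weighting, written with only the Python standard library. It is an illustration, not the project's code: the stop-word list, the function names `tokenize` and `tfidf`, and the toy reviews are all made up, and whitespace splitting is a simplification that ignores the Urdu word-segmentation issues mentioned earlier. In practice a library such as scikit-learn would normally handle this step, and its weighting formula may differ from the plain `tf * log(N / df)` used here.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"ہے", "کی", "کا", "کے", "اور", "یہ", "میں"}


def tokenize(text):
    """Split on whitespace and drop stop words (a simplification for Urdu)."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up toy reviews, not taken from the project's dataset.
reviews = [
    "یہ فلم بہت اچھی ہے",
    "یہ فلم بہت بری ہے",
    "کہانی اچھی ہے",
]
vectors = tfidf(reviews)

# Print the highest-weighted term of each review.
for review_vector in vectors:
    print(max(review_vector, key=review_vector.get))
```

Terms that appear in only one review receive the largest weights, while terms shared across reviews are down-weighted. That is the property that makes TF-IDF useful as input to classifiers such as SVM or Logistic Regression.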
Ensure you have Python 3.x installed. Install the required dependencies using:
pip install -r requirements.txt
- Clone the repository:
git clone https://github.com/yourusername/urdu-sentiment-analysis.git
cd urdu-sentiment-analysis
- Prepare the dataset and place it in the data/ directory.
- Run the preprocessing script:
python preprocessing.py
- Train and evaluate machine learning models:
python train_models.py
- Train and evaluate deep learning models:
python train_lstm.py
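The evaluation metrics listed earlier (accuracy, precision, recall, F1-score) are all derived from the confusion matrix counts. The following is a minimal sketch of that calculation for binary labels. It does not reproduce the project's evaluation code, which is not shown here: the function name `binary_metrics` and the label lists are hypothetical, and the numbers are unrelated to the reported results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    # Confusion matrix counts.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when the two classes are unevenly represented, because accuracy alone can hide poor performance on the smaller class.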
- LSTM with Word2Vec outperformed other models with 87.94% accuracy.
- SVM with TF-IDF achieved the second-best accuracy of 80.92%.
- Deep learning models required higher computational resources but provided better results.
- Feature extraction techniques significantly impacted model performance.
- This project ensures compliance with data privacy regulations.
- The dataset was sourced from publicly available repositories.
- No personally identifiable information (PII) is included in the dataset.
Contributions are welcome! Feel free to fork this repository, create a branch, and submit a pull request.
For any inquiries, reach out via email: Emaazsiddiq@gmail.com
- Supervisor: Dr. Na Helian
- University: University of Hertfordshire
- Open-source contributors for NLP libraries
This project is open-source and licensed under the MIT License.