Classification of Urdu News Articles
This project explores supervised machine learning techniques for classifying Urdu-language news articles into five predefined categories: entertainment, business, sports, science-technology, and international. A dataset of 2,750 articles was scraped from prominent Urdu news websites and preprocessed with text normalization, lemmatization, and tokenization.
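The normalization and tokenization steps mentioned above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the function names `normalize` and `tokenize`, the choice of diacritics to strip, and the character mapping are all assumptions. Lemmatization is omitted because it needs language-specific resources.

```python
import re
import unicodedata

# Assumption: strip Arabic-script diacritics (U+064B..U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: map Arabic yeh (U+064A) and kaf (U+0643) to the forms
# commonly used in Urdu text, farsi yeh (U+06CC) and keheh (U+06A9).
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Anything that is neither a word character nor whitespace is treated as
# punctuation; this covers the Urdu full stop (U+06D4) and Arabic comma (U+060C).
NON_WORD = re.compile(r"[^\w\s]")


def normalize(text):
    """Return text with unified characters, no diacritics, and no punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)
    text = text.translate(CHAR_MAP)
    text = NON_WORD.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text on whitespace."""
    return normalize(text).split()


# Two words written with Arabic kaf/yeh, a diacritic, and punctuation.
sample = "\u0643\u0650\u062A\u0627\u0628\u060C \u064A\u06C1\u06D4"
tokens = tokenize(sample)
```

Whitespace tokenization is a simplification here; Urdu word boundaries are not always marked by spaces, so a real pipeline may need a dedicated segmenter.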
Three models were implemented and evaluated: Multinomial Naive Bayes (MNB), Logistic Regression, and a Neural Network. MNB served as a simple but effective baseline with an accuracy of 96.55%, and Logistic Regression reached 95.27%. The Neural Network, a sequential model with dropout layers to reduce overfitting, outperformed both with an accuracy of 97.45%.
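To make the MNB baseline concrete, here is a small from-scratch sketch of Multinomial Naive Bayes with Laplace smoothing. It is an illustration under stated assumptions, not the project's implementation: the class name `TinyMultinomialNB`, the smoothing value `alpha=1.0`, and the English toy documents are all hypothetical stand-ins for tokenized Urdu articles.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes over token lists."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # tokens never seen in training are ignored
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy data standing in for tokenized articles.
train_docs = [
    ["match", "team", "goal"],
    ["team", "win", "cup"],
    ["market", "bank", "profit"],
    ["bank", "stock", "market"],
]
train_labels = ["sports", "sports", "business", "business"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["team", "goal"])
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which matters for article-length documents.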
Performance was assessed using accuracy, precision, recall, and F1 score, with confusion matrices showing which categories were confused with one another. The project highlights the potential of machine learning for natural language processing in underrepresented languages like Urdu. It also identifies limitations, such as reliance on traditional models and a lack of contextual semantic understanding.
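The evaluation metrics named above can be computed from a confusion matrix as in the sketch below. The helper names (`confusion_matrix`, `per_class_metrics`, `accuracy`) and the toy labels are hypothetical and chosen only for illustration; they do not reproduce the project's reported results.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """Return precision, recall, and F1 for each class."""
    cm = confusion_matrix(y_true, y_pred, labels)
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fp = sum(row[i] for row in cm) - tp  # predicted as label, but wrong
        fn = sum(cm[i]) - tp                 # true label, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy labels: one sports article is misclassified as business.
labels = ["business", "sports"]
y_true = ["sports", "sports", "business", "business", "sports"]
y_pred = ["sports", "business", "business", "business", "sports"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
report = per_class_metrics(y_true, y_pred, labels)
```

In this toy case the off-diagonal entry in the sports row shows the single sports article predicted as business, which is the kind of misclassification a confusion matrix makes visible.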