- Data Analysis
- Algorithm Design
- Evaluation
Machine Learning Technologies (COMP32222)
This individual project explored ways of automatically classifying news-related Twitter content as real or fake. For this coursework, I designed a machine learning algorithm to classify Twitter posts from the MediaEval 2015 "Verifying Multimedia Use" challenge dataset.
The project was built in Python and Jupyter Notebook, using the scikit-learn, numpy, pandas and deep-translator libraries. The dataset contained 14,277 training entries and 3,755 testing entries, and each entry had the following set of features:
tweetId / tweetText / userId / imageId / username / timestamp / label
Here are some of the graphs produced throughout the data analysis.
The algorithm design part started with preprocessing. This consisted of cleaning the data by removing punctuation from tweets, lowercasing the text, removing stop words and emojis, and translating the tweets. Once the preprocessing was completed, the tweetText feature was vectorized and transformed into a term frequency-inverse document frequency (TF-IDF) matrix.
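The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `clean_tweet` helper, the tiny stop-word set and the emoji pattern are all assumptions made for the example, and the translation step is left out because it depends on an external service.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the project's real list is not given.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "this"}

# Rough matcher for common Unicode emoji blocks (assumption, not exhaustive).
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+"
)

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Hypothetical helper: strip emojis, lowercase, drop punctuation and stop words."""
    text = EMOJI_PATTERN.sub(" ", text)   # emoji removal
    text = text.lower()                   # lowercasing
    text = text.translate(PUNCT_TABLE)    # punctuation removal
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_tweet("This is a SHARK in the street!!! 😱 #Sandy")
print(cleaned)  # shark street sandy
```

The cleaned strings could then be passed to a vectorizer such as scikit-learn's `TfidfVectorizer` to obtain the TF-IDF matrix.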
Considering the constraints and the characteristics of the data, three starting classifiers were chosen: MultinomialNB / LinearSVC / SGDClassifier.
The classifiers were evaluated, and the strongest learner (MultinomialNB in this project) was selected for hyperparameter tuning through grid search. Additionally, other features such as imageId and username were tried in an iterative process.
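As a rough sketch of this comparison and tuning step, the snippet below cross-validates the three classifiers on TF-IDF features and then grid-searches the smoothing parameter of MultinomialNB. The toy tweets, the 3-fold cross-validation, the `alpha` grid and the use of `GridSearchCV` are assumptions for illustration; the project's actual data splits and parameter grid are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (assumption): not taken from the MediaEval dataset.
texts = [
    "shark swimming on flooded highway",
    "shark in the subway station photo",
    "soldiers guarding tomb during hurricane",
    "statue of liberty under giant wave",
    "flooded highway shark photo unbelievable",
    "giant wave hits statue of liberty",
    "flooding reported in lower manhattan",
    "power outage reported across the city",
    "subway service suspended due to flooding",
    "officials confirm power outage in manhattan",
    "city officials report subway flooding",
    "service suspended officials confirm",
]
labels = ["fake"] * 6 + ["real"] * 6


def make_text_pipeline(clf):
    """Hypothetical helper: TF-IDF vectorizer followed by the given classifier."""
    return Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])


candidates = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "SGDClassifier": SGDClassifier(random_state=0),
}

# Mean cross-validated accuracy for each starting classifier.
scores = {
    name: cross_val_score(make_text_pipeline(clf), texts, labels, cv=3).mean()
    for name, clf in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")

# Tune the Naive Bayes smoothing parameter; the grid values are illustrative.
search = GridSearchCV(
    make_text_pipeline(MultinomialNB()),
    param_grid={"clf__alpha": [0.01, 0.1, 1.0]},
    cv=3,
)
search.fit(texts, labels)
print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means the TF-IDF vocabulary is refitted on each training fold, so the held-out fold does not leak into the features.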
The best performance was achieved by using the tweetText and username features, which resulted in an accuracy score of 89.26%.
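One possible way to combine the two features is sketched below: TF-IDF on tweetText and one-hot encoding on username, joined with a `ColumnTransformer`. How the features were actually combined in the project is not described, so this encoding choice is an assumption, and the rows and usernames are made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows (assumption); column names follow the dataset's feature list.
train = pd.DataFrame({
    "tweetText": [
        "shark swimming on flooded highway",
        "statue of liberty under giant wave",
        "shark in the subway station",
        "giant wave hits the city",
        "flooding reported in lower manhattan",
        "power outage reported across the city",
        "subway service suspended due to flooding",
        "officials confirm power outage",
    ],
    "username": ["user_a", "user_a", "user_b", "user_b",
                 "user_c", "user_c", "user_d", "user_d"],
    "label": ["fake", "fake", "fake", "fake", "real", "real", "real", "real"],
})
test = pd.DataFrame({
    "tweetText": ["shark photo on the highway", "officials report flooding"],
    "username": ["user_a", "user_e"],  # user_e is unseen during training
    "label": ["fake", "real"],
})

features = ColumnTransformer([
    # A plain string selects a 1-D column, which is what the text vectorizer expects.
    ("text", TfidfVectorizer(), "tweetText"),
    # A list selects a 2-D frame; unseen usernames are encoded as all zeros.
    ("user", OneHotEncoder(handle_unknown="ignore"), ["username"]),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(train[["tweetText", "username"]], train["label"])

preds = model.predict(test[["tweetText", "username"]])
accuracy = accuracy_score(test["label"], preds)
print(f"accuracy on the toy test rows: {accuracy:.2f}")
```

Both encoders produce non-negative values, which keeps the combined matrix compatible with MultinomialNB.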
Once the project was submitted, I received detailed feedback from the module leader highlighting the strengths of this work, mainly the data analysis and code quality, as well as areas for improvement such as additional feature selection. This project was awarded a 1st class mark of 70%.