Text Classification

Overview

This repository contains projects that classify text data using a variety of machine learning and deep learning models. The projects showcase real-world use cases of Natural Language Processing. The tasks covered are disaster tweet classification, spam message detection, and fake news recognition.

Data

Disaster Tweets Classification: The dataset for the classification of disaster tweets was obtained from Kaggle. It contains five columns: id (a unique identifier for each tweet), text (the text of the tweet), location (the location the tweet was sent from, which may be blank if the tweet was sent without location information), keyword (a particular keyword from the tweet, which may also be blank), and target (1 for disaster tweets and 0 for non-disaster tweets). Each model is scored on the F1 score of its predictions.
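
A minimal sketch of loading and inspecting the disaster tweets data; the train.csv file name is an assumption based on the usual Kaggle competition layout, and the column names follow the description above:

```python
import pandas as pd

# Load the Kaggle disaster tweets training file (file name assumed).
tweets = pd.read_csv("train.csv")

# Columns: id, keyword, location, text, target (1 = disaster, 0 = non-disaster).
print(tweets[["keyword", "location"]].isna().sum())  # keyword and location may be blank
print(tweets["target"].value_counts())               # class balance of the labels
```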

Fake News Recognition: The dataset for the recognition of fake news texts was obtained from Kaggle. It comes split across two files: one containing 23,502 fake news articles and the other containing 21,417 true news articles. Each file contains four columns: Title (title of the news article), Text (body text of the article), Subject (subject of the article), and Date (publish date of the article). Labels are created for each category of news: true/real news texts take a label of 1 while fake news texts take a label of 0. The model(s) used then predict the validity label of news texts.
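
A sketch of how the two files could be combined into a single labeled frame; the Fake.csv and True.csv file names are assumptions based on the common Kaggle release of this dataset:

```python
import pandas as pd

# File names assumed; the dataset ships as separate fake and true article files.
fake = pd.read_csv("Fake.csv")   # columns: title, text, subject, date
true = pd.read_csv("True.csv")

# Label true/real articles as 1 and fake articles as 0, as described above.
fake["label"] = 0
true["label"] = 1

news = pd.concat([fake, true], ignore_index=True)
print(news["label"].value_counts())
```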

Spam Message Detection: The dataset for the detection of spam messages was obtained from [Kaggle](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification). It contains two columns: Category and Message. There are two categories of messages: 'Spam' indicates that the email is illegitimate, while 'Ham' denotes that it is legitimate. The Message column contains the actual content of the email messages. Labels are created for each category: spam messages take a label of 1 while non-spam (ham) messages take a label of 0. The model(s) used then predict whether a message is spam or ham.
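
A sketch of the label encoding described above for the spam data; the spam.csv file name is an assumption:

```python
import pandas as pd

# File name assumed; the dataset has Category ('spam'/'ham') and Message columns.
messages = pd.read_csv("spam.csv")

# Spam messages take a label of 1, ham (legitimate) messages take a label of 0.
messages["label"] = (messages["Category"].str.lower() == "spam").astype(int)
print(messages["label"].value_counts())
```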

Chronology of Analysis

  • Import the necessary libraries and datasets.
  • Check the data for missing values and inappropriate datatypes.
  • Preprocess the dataset and engineer features using text-processing methods.
  • Split the data into features and target.
  • Split the data into training and test datasets (not necessary for Kaggle competition datasets that already come pre-split).
  • Train machine learning models on the training dataset and check training accuracy.
  • Tune the hyperparameters of the model (if necessary).
  • Use the trained model to classify the texts in the test dataset based on their features.
  • Save the predictions to the desired file format (see the sketch after this list).
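
The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not the repository's exact notebooks: it assumes the spam data loaded from spam.csv (file name assumed), TF-IDF preprocessing, and the cross-validated logistic regression reported in the results below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

# Load a labeled text dataset (spam data used for illustration; file name assumed).
messages = pd.read_csv("spam.csv")
messages["label"] = (messages["Category"].str.lower() == "spam").astype(int)

# Split into features (raw text) and target, then into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    messages["Message"], messages["label"], test_size=0.2, random_state=42
)

# Preprocess the text into TF-IDF features and train a cross-validated logistic regression.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("clf", LogisticRegressionCV(cv=5, max_iter=1000)),
])
model.fit(X_train, y_train)

# Score the trained model on the held-out test set with the F1 metric.
predictions = model.predict(X_test)
print(f"F1 score: {f1_score(y_test, predictions):.4f}")
```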

Results

Disaster Tweets: Supervised machine learning models were applied to the disaster tweets and the F1 scores of their predictions were recorded. The Logistic Regression Cross Validation model scored an F1 of 0.7925 on the test dataset, while the Complement Naive Bayes model scored an F1 of 0.7842.
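
A hedged sketch of how the two models could be compared on shared TF-IDF features; the column names assume the Kaggle train.csv layout described above, and the reported scores come from the repository's notebooks, not this snippet:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import f1_score

tweets = pd.read_csv("train.csv")  # file name assumed
X_train, X_test, y_train, y_test = train_test_split(
    tweets["text"], tweets["target"], test_size=0.2, random_state=42
)

# Shared TF-IDF representation for both classifiers.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Compare the cross-validated logistic regression and Complement Naive Bayes models.
for name, clf in [("LogisticRegressionCV", LogisticRegressionCV(cv=5, max_iter=1000)),
                  ("ComplementNB", ComplementNB())]:
    clf.fit(X_train_vec, y_train)
    score = f1_score(y_test, clf.predict(X_test_vec))
    print(f"{name}: F1 = {score:.4f}")
```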

Fake News: The Logistic Regression Cross Validation model was used to classify the validity of news texts and its F1 score was recorded. The model scored an F1 of 0.9951 on the test dataset.

Spam Detection: The Logistic Regression Cross Validation model was used to classify email messages as either spam or ham and its F1 score was recorded. The model scored an F1 of 0.9437 on the test dataset.
