This project focuses on the detection of hate speech on Twitter using various natural language processing (NLP) and machine learning techniques. The project explores different feature extraction methods, including TF-IDF, sentiment analysis, and Doc2Vec, and evaluates multiple machine learning models to determine the most effective approach.
- Installation
- Dataset
- Data Preprocessing
- Feature Extraction
- Model Training and Evaluation
- Results
- Visualization
- Confusion Matrix
- Conclusion
To run this project, you need to have Python installed along with the necessary libraries. You can install the required libraries using the provided command.
The dataset used in this project is `HateSpeechData.csv`, which contains tweets labeled as hate speech, offensive speech, or neither. The dataset is loaded and a new column, `text length`, is added to represent the length of each tweet.
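A minimal sketch of the loading step with pandas follows. The column names `tweet` and `class` are assumptions, since the CSV schema is not described here, and a small in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd


def load_tweets(source, text_column="tweet"):
    """Read the CSV and add a 'text length' column holding each tweet's character count.

    `text_column` is an assumed name; adjust it to match the real file.
    """
    df = pd.read_csv(source)
    df["text length"] = df[text_column].str.len()
    return df


# In-memory stand-in for the real file (hypothetical rows, not project data).
sample_csv = io.StringIO(
    "class,tweet\n"
    "2,good morning everyone\n"
    "1,this is a sample tweet\n"
)
df = load_tweets(sample_csv)
print(df[["tweet", "text length"]])
```

For the real data, the same function could be called as `load_tweets("HateSpeechData.csv")`, assuming the file sits in the working directory.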
The preprocessing steps include:
- Removal of punctuation and capitalization
- Tokenizing
- Removal of stopwords
- Stemming
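The four steps above can be sketched as a single function. This is an illustration rather than the project's exact code: it assumes NLTK's `PorterStemmer` for stemming and uses a small hard-coded stopword set in place of a full stopword list.

```python
import re

from nltk.stem import PorterStemmer

# Illustrative subset only; a full stopword list would normally be used.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}
stemmer = PorterStemmer()


def preprocess(tweet):
    """Lowercase, strip punctuation, tokenize, drop stopwords, then stem."""
    text = tweet.lower()                       # remove capitalization
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    tokens = text.split()                      # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]   # reduce words to their stems


tokens = preprocess("The users are RUNNING wild, posting tweets!")
print(tokens)
```

Whitespace splitting is the simplest tokenizer; tweets with mentions, hashtags, or URLs may call for a tweet-aware tokenizer instead.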
After preprocessing, the most commonly used words in the dataset are visualized as a word cloud.
TF-IDF feature extraction transforms the text data into numerical features.
VADER sentiment analysis is used to extract sentiment-related features from the tweets.
A Doc2Vec model is trained, and its document vectors are extracted to represent the tweets in vector space.
Four classifiers are trained and evaluated for hate speech detection:
- Logistic regression
- Random forest
- Naive Bayes
- Support Vector Machine (SVM)
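The sketch below trains the four model families on TF-IDF features and records test accuracy for each. It makes several assumptions: `MultinomialNB` and `LinearSVC` are picked as the Naive Bayes and SVM variants, hyperparameters are scikit-learn defaults apart from the ones shown, and a toy corpus with placeholder classes 0, 1, and 2 stands in for the real tweets.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus: three placeholder classes standing in for the real labels.
texts = [
    "red apple sweet fruit", "green apple sour fruit",
    "ripe banana sweet fruit", "fresh fruit salad bowl",
    "fast car loud engine", "old car broken engine",
    "new car quiet engine", "race car fast track",
    "rainy day cold wind", "sunny day warm wind",
    "cloudy day light rain", "stormy day strong wind",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit the vectorizer on training data only to avoid leaking test vocabulary.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

print(accuracies)
```

Accuracy on a corpus this small says nothing about real performance; the loop only shows the shared fit, predict, and score pattern.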
The accuracy of the different models is visualized to compare their performance.
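A comparison chart of this kind could be drawn with matplotlib as sketched here. The helper name `plot_accuracies` is hypothetical, and the scores passed in are dummy placeholders, not results from this project.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt


def plot_accuracies(accuracies):
    """Draw one bar per model from a {model name: accuracy} mapping."""
    names = list(accuracies)
    values = [accuracies[name] for name in names]
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.bar(names, values)
    ax.set_ylim(0, 1)
    ax.set_ylabel("Accuracy")
    ax.set_title("Model accuracy comparison")
    fig.tight_layout()
    return fig


# Dummy placeholder scores purely to exercise the function.
dummy_scores = {
    "Logistic Regression": 0.5,
    "Random Forest": 0.5,
    "Naive Bayes": 0.5,
    "SVM": 0.5,
}
fig = plot_accuracies(dummy_scores)
```

Calling `fig.savefig("model_accuracy.png")` afterwards writes the chart to disk; the file name is only an example.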
The accuracy and performance of each model are presented in a comparison chart. The logistic regression and support vector machine models outperformed the random forest and Naive Bayes classifiers.
Word clouds are generated for the entire dataset and, separately, for hate and offensive speech.
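A word cloud is driven by word frequencies, so a hedged sketch can compute them with the standard library and hand them to the `wordcloud` package for rendering. The label strings and sample tokens below are assumptions, and `render_cloud` is a hypothetical helper that is defined but not called here, since it needs `wordcloud` installed.

```python
from collections import Counter


def word_frequencies(token_lists):
    """Count how often each token occurs across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


def render_cloud(frequencies, path):
    """Render a frequency mapping as a word cloud image (requires `wordcloud`)."""
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Hypothetical (label, tokens) rows; labels and tokens are placeholders.
rows = [
    ("neither", ["sunny", "day", "today"]),
    ("offensive", ["term_a", "term_b", "term_a"]),
    ("hate", ["term_c", "term_a"]),
]

all_freqs = word_frequencies(tokens for _, tokens in rows)
subset_freqs = word_frequencies(
    tokens for label, tokens in rows if label in {"hate", "offensive"}
)
print(all_freqs.most_common(3))
```

With `wordcloud` available, `render_cloud(all_freqs, "all_tweets.png")` and `render_cloud(subset_freqs, "hate_offensive.png")` would produce the two images; both file names are examples.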
The confusion matrix helps to understand the misclassifications made by the model. It provides insights into the performance of the model by showing the true and predicted values for each class.
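To make this concrete, the snippet below builds a confusion matrix with scikit-learn from toy true and predicted labels. The labels 0, 1, and 2 are placeholders for the three classes, and the values are invented for the example rather than taken from the project's models.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the matrix are true classes, columns are predicted classes.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class recall: correct predictions on the diagonal divided by row totals.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

Off-diagonal entries show which classes get confused: in this toy case, one class-0 sample and one class-2 sample are both predicted as class 1.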
This project demonstrates the effectiveness of various NLP techniques and machine learning models in detecting hate speech on Twitter. The results highlight the importance of feature extraction methods and model selection in achieving high accuracy and reliable performance in hate speech detection tasks.