This repository contains our course project for CSE-343 (Monsoon 2022).
Accepted at ICLR 2023 for the Tiny Papers Track
Hope speech is any message or content that is positive, encouraging, reassuring, inclusive, and supportive, and that inspires and engenders optimism in the minds of people.
We define two tasks:
Task 1: Multiclass Hope Speech Detection. In this task, we categorize the tweets into three classes: Hope, Non-Hope, and Non-English.
Task 2: Two-Class Classification. In this task, we categorize the tweets as Hope or Non-Hope speech and drop the "Non-English" class.
We match SOTA in Task 1 with simple ML models, while we beat SOTA in Task 2, and by a large margin, using DL models.
We use the dataset made available by Chakravarthi et al. 2022. It has a heavily skewed distribution, with around 20k samples in favour of one class (`Non_Hope_Speech`) and only about 2k samples for the other class(es). The task was originally formulated as a three-way classification task, where we need to predict one of the labels {`Non_Hope_Speech`, `Hope_Speech`, `not-English`}.
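As a minimal sketch of the two task formulations, the snippet below uses a few hypothetical sample rows (the texts and row counts are invented for illustration; the real dataset is the one from Chakravarthi et al. 2022) to show how Task 2 is derived from Task 1 by dropping the `not-English` class:

```python
from collections import Counter

# Hypothetical (text, label) rows; the real dataset has ~20k
# Non_Hope_Speech samples and only ~2k for the other class(es).
rows = [
    ("you can do this, stay strong", "Hope_Speech"),
    ("this is terrible news", "Non_Hope_Speech"),
    ("c'est magnifique", "not-English"),
    ("nothing will ever improve", "Non_Hope_Speech"),
]

# Task 1: three-way classification over all rows.
task1 = rows
print(Counter(label for _, label in task1))

# Task 2: drop the "not-English" class, leaving a binary problem.
task2 = [(text, label) for text, label in rows if label != "not-English"]
print(Counter(label for _, label in task2))
```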
- For Task 1, we tried out different word embedding techniques (`GloVe`, `FastText`, `word2vec`, `TF-IDF`, and `Sentence-BERT`) and also tried various combinations with them, either performing PCA on the embeddings or leaving them as is, to see whether we could compress the dimensions while retaining most of the information; these variants are covered in our final results. We also experimented with custom `word2vec` embeddings, which we trained from scratch.
- We dumped the final embeddings for future use, and each of us then took up different classifier models from `sklearn` and performed the stated task using all the embeddings thus generated. Towards the end, we also tried DL models like `BERT`, `BERTweet`, and other similar pre-trained Transformer-based classifiers.
- We report the Weighted F1 score for Task 1, as that is the metric used in the original paper. For Task 2, we take the top 5 ML models from Task 1, additionally run LSTM, RNN, and some pre-trained models for the two-way classification, and report the Macro F1 score.
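The embeddings-then-classifier workflow above can be sketched end to end. The snippet below is a toy illustration, not our actual training code: the texts are invented, TF-IDF stands in for the various embedding types, and PCA plus Linear Discriminant Analysis stand in for the dimensionality-reduction and `sklearn` classifier choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Hypothetical toy tweets; the real project uses the full dataset and
# several embedding types (GloVe, FastText, word2vec, TF-IDF, Sentence-BERT).
texts = [
    "stay strong you will get through this",
    "we believe in you keep going",
    "everything will be okay soon",
    "you are not alone we support you",
    "this is the worst day ever",
    "nothing ever works out for me",
    "i hate how this turned out",
    "what a miserable situation",
]
labels = ["Hope_Speech"] * 4 + ["Non_Hope_Speech"] * 4

# Embed with TF-IDF, densify (PCA needs dense input), compress with PCA,
# then classify with LDA: embeddings -> optional PCA -> sklearn classifier.
model = make_pipeline(
    TfidfVectorizer(),
    FunctionTransformer(lambda X: X.toarray(), accept_sparse=True),
    PCA(n_components=4),
    LinearDiscriminantAnalysis(),
)
model.fit(texts, labels)
preds = model.predict(texts)

# Task 1 reports the Weighted F1 score (here computed on training data
# purely for illustration).
wf1 = f1_score(labels, preds, average="weighted")
```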
Surprisingly enough, we managed to beat the SOTA results reported for both tasks.
- For Task 1, we were able to do so using classical ML methods like Linear Discriminant Analysis.
- For Task 2, we beat SOTA as well, and this time by a large margin (about 20 Macro F1 points) using DL methods.
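The choice of Macro vs. Weighted F1 matters a great deal on data this skewed. The toy example below (invented labels, roughly mimicking the class imbalance) shows how a degenerate majority-class predictor still scores well on Weighted F1 but poorly on Macro F1:

```python
from sklearn.metrics import f1_score

# Hypothetical skewed labels and a degenerate classifier that always
# predicts the majority class.
y_true = ["Non_Hope_Speech"] * 9 + ["Hope_Speech"]
y_pred = ["Non_Hope_Speech"] * 10

weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Weighted F1 is dominated by the majority class (~0.85 here), while
# Macro F1 averages the per-class scores equally (~0.47), exposing the
# total failure on the minority class.
```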
Task 1 mimics the First Shared Task, while Task 2 mimics the Second Shared Task.
- `Data Preprocessing/`: Contains the preprocessing files.
- `Data/`: Contains the data.
- `DataAugmentation/`: Contains data augmentation utils.
- `Word_Embeddings/`: Contains the code to generate word embeddings and the folder to store dump files. The word embeddings are hosted on Google Drive due to space constraints.
- `Documents/`: Our reports, presentations, and proposal.
- `Explainability/`: Files to evaluate models through an explainability lens.
- `Exploratory Data Analysis/`: Our visualizations and analysis.
- `Models/`: Contains the code for our models and the folder to store the saves. The saves for all our ML models are hosted on Google Drive due to space constraints; our DL model checkpoints are too large to move and hence are not on Google Drive, but you can generate them by simply running the notebooks.
To cite our work, kindly use the following BibTeX:
@inproceedings{DBLP:conf/iclr/YadavKSS23,
author = {Neemesh Yadav and
Mohammad Aflah Khan and
Diksha Sethi and
Raghav Sahni},
editor = {Krystal Maughan and
Rosanne Liu and
Thomas F. Burns},
title = {Beyond Negativity: Re-Analysis and Follow-Up Experiments on Hope Speech
Detection},
booktitle = {The First Tiny Papers Track at {ICLR} 2023, Tiny Papers @ {ICLR} 2023,
Kigali, Rwanda, May 5, 2023},
publisher = {OpenReview.net},
year = {2023},
url = {https://openreview.net/pdf?id=eaKoBpxCPe},
timestamp = {Wed, 19 Jul 2023 17:21:16 +0200},
biburl = {https://dblp.org/rec/conf/iclr/YadavKSS23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}