This repository contains our course project for CSE-343 (Monsoon 2022).
Accepted at ICLR 2023 for the Tiny Papers Track
Hope speech is any message or content that is positive, encouraging, reassuring, inclusive, and supportive, and that inspires and engenders optimism in the minds of people.
We define two tasks:
Task 1: Multiclass Hope Speech Detection. In this task, we categorize the tweets into three classes: Hope, Non-Hope, and Non-English.
Task 2: Two-Class Classification. In this task, we categorize the tweets as Hope or Non-Hope speech and drop the "Non-English" class.
We match SOTA in Task 1 with simple ML models, while we beat SOTA in Task 2, and by a large margin, using DL models.
We use the dataset made available by Chakravarthi et al. 2022. It has a heavily skewed distribution, with around 20k samples in favour of one class (`Non_Hope_Speech`) and only about 2k samples for the other class(es). The task was originally formulated as a three-way classification task, where we need to predict one of the labels {`Non_Hope_Speech`, `Hope_Speech`, `not-English`}.
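As a minimal sketch of the two task formulations, the snippet below uses a few hypothetical sample rows (the texts and row counts are invented for illustration; the real dataset is the one from Chakravarthi et al. 2022) to show how Task 2 is derived from Task 1 by dropping the `not-English` class:

```python
from collections import Counter

# Hypothetical (text, label) rows; the real dataset has ~20k
# Non_Hope_Speech samples and only ~2k for the other class(es).
rows = [
    ("you can do this, stay strong", "Hope_Speech"),
    ("this is terrible news", "Non_Hope_Speech"),
    ("c'est magnifique", "not-English"),
    ("nothing will ever improve", "Non_Hope_Speech"),
]

# Task 1: three-way classification over all rows.
task1 = rows
print(Counter(label for _, label in task1))

# Task 2: drop the "not-English" class, leaving a binary problem.
task2 = [(text, label) for text, label in rows if label != "not-English"]
print(Counter(label for _, label in task2))
```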
- For Task 1, we tried out different word embedding techniques (`GloVe`, `FastText`, `word2vec`, `TF-IDF`, and `Sentence-BERT`) and also tried various combinations with them, either performing PCA on the embeddings or leaving them as is, to see whether we could compress the dimensions while retaining most of the information; these variants are covered in our final results. We also experimented with custom `word2vec` embeddings, which we trained from scratch.
- We dumped the final embeddings for future use, and each of us then took up different classifier models from `sklearn` and performed the stated task using all the embeddings thus generated. Towards the end, we also tried DL models like `BERT`, `BERTweet`, and other similar pre-trained Transformer-based classifiers.
- We report the Weighted F1 score for Task 1, as that is the metric used in the original paper. For Task 2, we take the top 5 ML models from Task 1, additionally run LSTM, RNN, and some pre-trained models for the two-way classification, and report the Macro F1 score.
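The embeddings-then-classifier workflow above can be sketched end to end. The snippet below is a toy illustration, not our actual training code: the texts are invented, TF-IDF stands in for the various embedding types, and PCA plus Linear Discriminant Analysis stand in for the dimensionality-reduction and `sklearn` classifier choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Hypothetical toy tweets; the real project uses the full dataset and
# several embedding types (GloVe, FastText, word2vec, TF-IDF, Sentence-BERT).
texts = [
    "stay strong you will get through this",
    "we believe in you keep going",
    "everything will be okay soon",
    "you are not alone we support you",
    "this is the worst day ever",
    "nothing ever works out for me",
    "i hate how this turned out",
    "what a miserable situation",
]
labels = ["Hope_Speech"] * 4 + ["Non_Hope_Speech"] * 4

# Embed with TF-IDF, densify (PCA needs dense input), compress with PCA,
# then classify with LDA: embeddings -> optional PCA -> sklearn classifier.
model = make_pipeline(
    TfidfVectorizer(),
    FunctionTransformer(lambda X: X.toarray(), accept_sparse=True),
    PCA(n_components=4),
    LinearDiscriminantAnalysis(),
)
model.fit(texts, labels)
preds = model.predict(texts)

# Task 1 reports the Weighted F1 score (here computed on training data
# purely for illustration).
wf1 = f1_score(labels, preds, average="weighted")
```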
Surprisingly enough, we managed to beat the SOTA results reported for both tasks.
- For Task 1, we were able to do so using classical ML methods like Linear Discriminant Analysis.
- For Task 2, we beat SOTA as well, and this time by a large margin (about 20 Macro F1 points) using DL methods.
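The choice of Macro vs. Weighted F1 matters a great deal on data this skewed. The toy example below (invented labels, roughly mimicking the class imbalance) shows how a degenerate majority-class predictor still scores well on Weighted F1 but poorly on Macro F1:

```python
from sklearn.metrics import f1_score

# Hypothetical skewed labels and a degenerate classifier that always
# predicts the majority class.
y_true = ["Non_Hope_Speech"] * 9 + ["Hope_Speech"]
y_pred = ["Non_Hope_Speech"] * 10

weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Weighted F1 is dominated by the majority class (~0.85 here), while
# Macro F1 averages the per-class scores equally (~0.47), exposing the
# total failure on the minority class.
```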
Task 1 mimics the First Shared Task, while Task 2 mimics the Second Shared Task.
- `Data Preprocessing/`: Contains the preprocessing files.
- `Data/`: Contains the data.
- `DataAugmentation/`: Contains data augmentation utils.
- `Word_Embeddings/`: Contains the code to generate word embeddings and the folder to store dump files. The word embeddings are hosted on Google Drive due to space constraints.
- `Documents/`: Our reports, presentations, and proposal.
- `Explainability/`: Files to evaluate models through an explainability lens.
- `Exploratory Data Analysis/`: Our visualizations and analysis.
- `Models/`: Contains the code for our models and the folder to store the saves. The saves for all our ML models are hosted on Google Drive due to space constraints; our DL model checkpoints are too large to move and hence are not on Google Drive, but you can generate them by simply running the notebooks.
To cite our work, kindly use the following BibTeX:
@inproceedings{DBLP:conf/iclr/YadavKSS23,
author = {Neemesh Yadav and
Mohammad Aflah Khan and
Diksha Sethi and
Raghav Sahni},
editor = {Krystal Maughan and
Rosanne Liu and
Thomas F. Burns},
title = {Beyond Negativity: Re-Analysis and Follow-Up Experiments on Hope Speech
Detection},
booktitle = {The First Tiny Papers Track at {ICLR} 2023, Tiny Papers @ {ICLR} 2023,
Kigali, Rwanda, May 5, 2023},
publisher = {OpenReview.net},
year = {2023},
url = {https://openreview.net/pdf?id=eaKoBpxCPe},
timestamp = {Wed, 19 Jul 2023 17:21:16 +0200},
biburl = {https://dblp.org/rec/conf/iclr/YadavKSS23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}