A Review and Experimental Evaluation of the State-of-the-Art in Text Classification
Manon Reusens, Alexander Stevens, Wouter Verbeke, Johannes De Smedt, Bart Baesens [2023]
This repository is organized as follows:
|- config/
|  |- hyperparameter configurations for Weights & Biases (YAML files)
|- notebooks/
|  |- example notebooks for experimentation
|- preprocessing/
|  |- fasttext_embeddings.py # Loads FastText embeddings and generates sentence embeddings for a corpus
|  |- preprocessor.py # Preprocesses raw text fields
|- util/
|  |- dataloader.py # Collects datasets in raw format and converts columns into pandas dataframes
|  |- datasplitter.py # Splits datasets into train, validation, and test components
|  |- data_collection.py # Downloads datasets from the web and stores them as CSV files
Experiments are conducted with Python 3.9.
$ conda create --name TextBenchmark python=3.9
$ conda activate TextBenchmark
$ pip install -r requirements.txt
Instructions for dataset collection:
- Run 'data_collection.py' (in util/). This downloads all datasets and stores them as CSV files.
Instructions for running py-files:
- Run the py-file from the command line, including a random seed. For our experiments we used random seeds [33:42].
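The seed instruction above can be illustrated with a minimal sketch of a script that takes a seed on the command line. This is not the repository's actual interface: the positional `seed` argument and the helper names `parse_seed` and `seeded_draws` are assumptions made for illustration, and only the Python standard library is used.

```python
import argparse
import random


def parse_seed(argv=None):
    """Read one integer seed from the command line (hypothetical interface)."""
    parser = argparse.ArgumentParser(description="Run one experiment with a fixed seed.")
    parser.add_argument("seed", type=int, help="random seed for this run")
    return parser.parse_args(argv).seed


def seeded_draws(seed, n=3):
    """Seed a private generator and return n draws; the same seed gives the same draws."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# An explicit argument list stands in for `python some_script.py 33`.
seed = parse_seed(["33"])
print(f"seed={seed} draws={seeded_draws(seed)}")
```

A real experiment script would typically also seed any other libraries it uses (for example NumPy or a deep learning framework) with the same value, so that repeated runs with one seed are comparable.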
Dataset | Task | Classes | Size | Split |
---|---|---|---|---|
FakeNewsNet - GossipCop | Fake News | 2 | 22140 | None (80% Train - 20% Test in paper) |
CoAID | Fake News | 2 | 2162 | None (75% Train - 25% Test in paper) |
LIAR | Fake News | 6 | 12836 | Train-Val-Test |
20News | Topic | 20 | 18846 | Train-Test |
AGNews | Topic | 4 | 127600 | Train-Test |
Web of Science Dataset | Topic | 7 | 11967 | None (80% Train - 20% Test in paper) |
TweetEval Emotion | Emotion | 4 | 5052 | Train-Val-Test |
CARER | Emotion | 8 | 20000 | Train-Val-Test |
DailyDialog Act - Silicone | Emotion | 7 | 102979 | Train-Val-Test |
IMDb | Polarity | 2 | 50000 | Train-Test |
Stanford Sentiment Treebank | Polarity | 2 | 68221 | Train-Val-Test |
Movie Review | Polarity | 2 | 10662 | None (80% Train - 20% Test in paper) |
SemEval Task 3 | Sarcasm | 2(4) | 4601 | Train-Test |
iSarcasm - English | Sarcasm | 2 | 4868 | Train-Test |
Sarcasm News Headlines | Sarcasm | 2 | 55328 | Train-Test |
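For the datasets listed with no predefined split, the 80% train / 20% test partition mentioned in the table can be sketched as a seeded shuffle followed by a cut. This is an illustrative assumption, not the logic of `datasplitter.py`: the function name `train_test_split_rows` is hypothetical, the rows are placeholders, and the split is a plain random one (not stratified by class).

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=33):
    """Shuffle a copy of rows with a fixed seed and hold out test_fraction of them."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder (text, label) rows, sized like the Movie Review dataset in the table.
rows = [(f"text {i}", i % 2) for i in range(10662)]
train, test = train_test_split_rows(rows, test_fraction=0.2, seed=33)
print(len(train), len(test))
```

Passing a different `seed` yields a different partition of the same size, which is one way the per-seed runs could differ for datasets without a fixed split.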
The datasets can be retrieved with the following links.
- FakeNewsNet
- CoAID
- LIAR
- 20News
- AGNews
- Web of Science Dataset
- Tweet Eval : Emotion detection
- CARER Emotion
- Daily Dialog Act Corpus (silicone)
- Stanford Sentiment Tree Bank
- Movie Review
- SemEval 2018 Task 3
- SemEval 2022 iSarcasm
- Sarcasm News Headlines
Please cite our paper and/or code as follows:
@inproceedings{reusens2023review,
title={A review and experimental evaluation of the state-of-the-art in text classification},
author={Reusens, Manon and Stevens, Alexander and Tonglet, Jonathan and De Smedt, Johannes and Verbeke, Wouter and others},
booktitle={37th Annual Conference of the Belgian Operational Research Society, ORBEL 37, Location: Liège},
year={2023}
}