VerbekeLab/text-classification-benchmark

A Review and Experimental Evaluation of the State-of-the-Art in Text Classification
Manon Reusens, Alexander Stevens, Wouter Verbeke, Johannes De Smedt, Bart Baesens [2023]

Repository structure

This repository is organized as follows:

|- config/
    |- hyperparameter configurations for Weights & Biases (YAML files)
|- notebooks/
    |- example notebooks for experimentation
|- preprocessing/
    |- fasttext_embeddings.py   # Loads FastText embeddings and generates sentence embeddings for a corpus
    |- preprocessor.py          # Preprocesses raw text fields
|- util/
    |- dataloader.py            # Collects datasets in raw format and converts columns into pandas dataframes
    |- datasplitter.py          # Splits datasets into train, validation, and test components
|- data_collection.py           # Downloads datasets from the web and stores them as CSV files

Installing

Experiments are conducted with Python 3.9.

$ conda create --name TextBenchmark python=3.9
$ conda activate TextBenchmark
$ pip install -r requirements.txt

How to use

Instructions for dataset collection:

  1. Run 'data_collection.py'. This downloads all datasets and stores them as CSV files.

Instructions for running the py-files:

  1. Run the py-file from the command line and pass a random seed. For our experiments we used random seeds [33:42].
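
The following is a minimal sketch of how one run per seed could be scripted. It makes two assumptions that this README does not confirm: that "[33:42]" means the ten seeds 33 through 42 inclusive, and that the seed is passed as a single positional argument. The script name `experiment.py` and the helper `build_commands` are hypothetical placeholders.

```python
import sys

# Assumption: "[33:42]" denotes the seeds 33 through 42, inclusive.
SEEDS = range(33, 43)


def build_commands(script, seeds=SEEDS):
    """Return one argv list per seed, suitable for subprocess.run.

    Assumption: the script takes the seed as its only positional argument.
    """
    return [[sys.executable, script, str(seed)] for seed in seeds]


if __name__ == "__main__":
    # "experiment.py" is a hypothetical name; substitute the py-file to run.
    for cmd in build_commands("experiment.py"):
        print(" ".join(cmd))
        # To launch the run instead of printing it:
        # import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check the seed range before starting any long-running experiment.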

Datasets

Structure of the datasets

| Dataset | Task | Classes | Size | Split |
|---|---|---|---|---|
| FakeNewsNet - GossipCop | Fake News | 2 | 22140 | None (80% Train - 20% Test in paper) |
| CoAID | Fake News | 2 | 2162 | None (75% Train - 25% Test in paper) |
| LIAR | Fake News | 6 | 12836 | Train-Val-Test |
| 20News | Topic | 20 | 18846 | Train-Test |
| AGNews | Topic | 4 | 127600 | Train-Test |
| Web of Science Dataset | Topic | 7 | 11967 | None (80% Train - 20% Test in paper) |
| TweetEval Emotion | Emotion | 4 | 5052 | Train-Val-Test |
| CARER | Emotion | 8 | 20000 | Train-Val-Test |
| DailyDialog Act - Silicone | Emotion | 7 | 102979 | Train-Val-Test |
| IMDb | Polarity | 2 | 50000 | Train-Test |
| Stanford Sentiment Treebank | Polarity | 2 | 68221 | Train-Val-Test |
| Movie Review | Polarity | 2 | 10662 | None (80% Train - 20% Test in paper) |
| SemEval Task 3 | Sarcasm | 2(4) | 4601 | Train-Test |
| iSarcasm - English | Sarcasm | 2 | 4868 | Train-Test |
| Sarcasm News Headlines | Sarcasm | 2 | 55328 | Train-Test |
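
For the datasets that ship without a predefined split, a seeded random split along the following lines reproduces the 80%/20% proportions listed above. This is only an illustrative sketch using the standard library; it is not the code in `util/datasplitter.py`, the function name `train_test_split_rows` is hypothetical, and it does not stratify by class.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=33):
    """Shuffle row indices with a fixed seed and hold out a test fraction.

    Hypothetical helper: a plain (non-stratified) random split.
    """
    rng = random.Random(seed)          # local RNG, so global state is untouched
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Toy corpus of (text, label) pairs standing in for a real dataset.
rows = [(f"text {i}", i % 2) for i in range(100)]
train, test = train_test_split_rows(rows, test_fraction=0.2, seed=33)
print(len(train), len(test))  # 80 20
```

Passing `test_fraction=0.25` gives the 75%/25% proportions listed for CoAID. Using the same seed always yields the same partition, which keeps runs comparable across models.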

Links

The datasets can be retrieved with the following links.

Citing

Please cite our paper and/or code as follows:

@inproceedings{reusens2023review,
  title={A review and experimental evaluation of the state-of-the-art in text classification},
  author={Reusens, Manon and Stevens, Alexander and Tonglet, Jonathan and De Smedt, Johannes and Verbeke, Wouter and others},
  booktitle={37th Annual Conference of the Belgian Operational Research Society, ORBEL 37, Location: Liège},
  year={2023}
}
