A Review and Experimental Evaluation of the State-of-the-Art in Text Classification
Manon Reusens, Alexander Stevens, Wouter Verbeke, Johannes Desmedt, Bart Baesens [2023]

Repository structure

This repository is organized as follows:

|- config/
    |- hyperparameter configurations for Weights & Biases (YAML files)
|- notebooks/
    |- example notebooks for experimentation
|- preprocessing/
    |- fasttext_embeddings.py   # Loads FastText embeddings and generates sentence embeddings for a corpus
    |- preprocessor.py          # Preprocesses raw text fields
|- util/
    |- dataloader.py            # Collects datasets in raw format and converts columns into pandas dataframes
    |- datasplitter.py          # Splits datasets into train, validation, and test components
|- data_collection.py           # Downloads datasets from the web and stores them as CSV files

Installing

Experiments are conducted with Python 3.9.

$ conda create --name TextBenchmark python=3.9
$ conda activate TextBenchmark
$ pip install -r requirements.txt

How to use

Instructions for dataset collection:

Run 'data_collection.py'. This downloads all datasets and stores them as CSV files.

Instructions for running py-files:

Run the py-file from the command line and pass a random seed. For our experiments, we used random seeds [33:42].
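
As a rough illustration of what passing a random seed on the command line could look like, the sketch below parses a seed and uses it to make a random draw reproducible. The `--seed` flag and the function names `parse_args` and `seeded_sample` are assumptions made for this example. The scripts in this repository may accept the seed differently, for instance as a positional argument.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse a hypothetical --seed flag (an assumption, not the repository's actual CLI)."""
    parser = argparse.ArgumentParser(description="Seeded experiment sketch")
    parser.add_argument("--seed", type=int, required=True, help="random seed for this run")
    return parser.parse_args(argv)


def seeded_sample(seed, population, k):
    """Draw k items reproducibly: the same seed always yields the same sample."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(population, k)


# Example: simulate a command line of the form `--seed 33`.
args = parse_args(["--seed", "33"])
first = seeded_sample(args.seed, list(range(100)), 5)
second = seeded_sample(args.seed, list(range(100)), 5)
print(first == second)  # True: identical seeds give identical draws
```

Using a dedicated `random.Random(seed)` instance rather than the module-level `random.seed` keeps each run's randomness isolated, which helps when the same process repeats an experiment over several seeds.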

Datasets

Structure of the datasets

Dataset	Task	Classes	Size	Split
FakeNewsNet - GossipCop	Fake News	2	22140	None (80% Train - 20% Test in paper)
CoAID	Fake News	2	2162	None (75% Train - 25% Test in paper)
LIAR	Fake News	6	12836	Train-Val-Test
20News	Topic	20	18846	Train-Test
AGNews	Topic	4	127600	Train-Test
Web of Science Dataset	Topic	7	11967	None (80% Train - 20% Test in paper)
TweetEval Emotion	Emotion	4	5052	Train-Val-Test
CARER	Emotion	8	20000	Train-Val-Test
DailyDialog Act - Silicone	Emotion	7	102979	Train-Val-Test
IMDb	Polarity	2	50000	Train-Test
Stanford Sentiment Treebank	Polarity	2	68221	Train-Val-Test
Movie Review	Polarity	2	10662	None (80% Train - 20% Test in paper)
SemEval Task 3	Sarcasm	2(4)	4601	Train-Test
iSarcasm - English	Sarcasm	2	4868	Train-Test
Sarcasm News Headlines	Sarcasm	2	55328	Train-Test
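
For the datasets whose split is listed as "None", the table reports the train-test proportions used in the paper (80%-20% or 75%-25%). A minimal sketch of such a seeded split follows. It uses only the standard library, the function name `train_test_split_seeded` is hypothetical, and it is not the code in `util/datasplitter.py`, which may differ (for example, by stratifying on the label).

```python
import random


def train_test_split_seeded(rows, test_fraction=0.2, seed=33):
    """Shuffle rows with a fixed seed and use the first test_fraction of them as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 100 (text, label) pairs standing in for a dataset without a predefined split.
rows = [(f"text {i}", i % 2) for i in range(100)]
train, test = train_test_split_seeded(rows, test_fraction=0.2, seed=33)
print(len(train), len(test))  # 80 20
```

Calling the function with `test_fraction=0.25` would give the 75%-25% proportion listed for CoAID. Repeating the call with the same seed returns the same partition.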

Links

The datasets can be retrieved with the following links.

Citing

Please cite our paper and/or code as follows:

@inproceedings{reusens2023review,
  title={A review and experimental evaluation of the state-of-the-art in text classification},
  author={Reusens, Manon and Stevens, Alexander and Tonglet, Jonathan and De Smedt, Johannes and Verbeke, Wouter and others},
  booktitle={37th Annual Conference of the Belgian Operational Research Society, ORBEL 37, Location: Liège},
  year={2023}
}
