This project assesses the robustness of several Large Language Models (LLMs) on the task of Sentiment Analysis. The evaluation covers GPT-J-6B, FLAN-T5-small, FLAN-T5-base, and FLAN-T5-xl, each tested on the IMDB dataset and its contrast set available at allenai/contrast-sets.
In the context of LLMs, robustness refers to a model's ability to consistently maintain its performance and exhibit reliable, coherent behavior even in the presence of challenges, variations, or perturbations introduced to its input.
This encompasses the capacity of LLMs to effectively handle and resist adversarial inputs or unexpected shifts in the input distribution.
A robust LLM should demonstrate resilience against noise, uncertainties, and potential attacks from malicious users, ensuring that its outputs remain trustworthy and meaningful across a range of conditions.
Assessing the robustness of LLMs involves evaluating the stability and reliability of their outputs under various adverse and difficult circumstances.
Unlike in the image domain, where perturbations are typically pixel-level noise, a perturbation in Natural Language Processing (NLP) refers to any form of alteration or manipulation introduced to textual data, such as adding noise or changing words or syntax. In the text domain, universal perturbations are commonly categorized into three levels (a small sketch follows the list):
- character-level,
- word-level,
- sentence-level.
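A minimal sketch of the three levels, assuming simple hand-rolled transformations (the helper names and the synonym table are illustrative; libraries such as TextAttack implement more systematic variants):

```python
import random

def char_perturb(text: str) -> str:
    """Character-level: swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_perturb(text: str, synonyms: dict) -> str:
    """Word-level: substitute words using a small synonym table."""
    return " ".join(synonyms.get(w, w) for w in text.split())

def sentence_perturb(text: str, distractor: str) -> str:
    """Sentence-level: append an irrelevant sentence to the review."""
    return f"{text} {distractor}"

review = "A great movie with a touching story."
print(char_perturb(review))                             # e.g. "A graet movie ..."
print(word_perturb(review, {"great": "fantastic"}))     # word substitution
print(sentence_perturb(review, "I watched it twice."))  # added sentence
```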
In this evaluation, the original (clean) set is the IMDB reviews dataset, and the perturbed set is its corresponding contrast set available at allenai/contrast-sets.
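A minimal sketch of the data setup, assuming the Hugging Face `datasets` library for IMDB; the contrast examples are distributed as files in the allenai/contrast-sets GitHub repository, so `contrast_imdb.json` below is a hypothetical local export:

```python
from datasets import load_dataset

# Original (clean) set: the standard IMDB reviews test split.
imdb = load_dataset("imdb", split="test")

# Perturbed set: contrast examples exported from the allenai/contrast-sets
# repository. "contrast_imdb.json" is an assumed local file name.
contrast = load_dataset("json", data_files="contrast_imdb.json", split="train")

print(imdb[0]["text"][:80], imdb[0]["label"])  # label: 0 = negative, 1 = positive
```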
- `FLAN-T5_eval.ipynb`: Colab notebook with scripts for evaluating the FLAN-T5 models.
- `GPT-J-6B_eval.ipynb`: Colab notebook with scripts for evaluating the GPT-J-6B model.
- There is a folder associated with each evaluation:
  - `Dataset`: contains the original and contrast samples for the evaluation.
  - `Plots`: contains the comparison plots generated during the evaluation process.
The models are evaluated under zero-shot, 1-shot, 3-shot, and 5-shot settings.
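The k-shot setting is controlled by how many labeled demonstrations are prepended to the prompt. A minimal sketch, assuming a simple Review/Sentiment template (the exact template used in the notebooks may differ); an empty demonstration list reduces it to the zero-shot case:

```python
def build_prompt(review: str, demos: list) -> str:
    """demos is a list of (review, label) pairs; pass [] for zero-shot."""
    parts = [f"Review: {r}\nSentiment: {s}\n" for r, s in demos]
    parts.append(f"Review: {review}\nSentiment:")
    return "\n".join(parts)

# 1-shot example; extend demos to 3 or 5 pairs for the other settings.
demos = [("An unforgettable, beautifully acted film.", "positive")]
print(build_prompt("The plot was dull and predictable.", demos))
```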
The evaluation scripts assess each model's performance on the IMDB dataset and its contrast set to gauge robustness in sentiment analysis.
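A minimal sketch of that comparison, assuming a hypothetical `predict` callable that wraps a model call (e.g., a FLAN-T5 generate step) and returns a label string:

```python
def accuracy(predict, examples) -> float:
    """examples is a list of (text, gold_label) pairs."""
    correct = sum(predict(text) == gold for text, gold in examples)
    return correct / len(examples)

def robustness_gap(predict, clean_set, contrast_set) -> float:
    """Accuracy drop from clean to contrast data; larger = less robust."""
    return accuracy(predict, clean_set) - accuracy(predict, contrast_set)
```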
- Clone the repository to your local machine.
- Navigate to the evaluation notebooks.
- Execute the evaluation scripts.
- View the evaluation results in the notebook.
- Explore the comparison plots in the `Plots` folder to analyze the performance of different models (a plotting sketch follows).
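A minimal sketch of how such a comparison plot can be produced with matplotlib; the shot settings are the ones above, while the accuracy values are hypothetical placeholders, not actual results:

```python
import os
import matplotlib.pyplot as plt

shots = [0, 1, 3, 5]
clean_acc = [0.82, 0.85, 0.88, 0.89]     # hypothetical placeholder values
contrast_acc = [0.70, 0.74, 0.78, 0.80]  # hypothetical placeholder values

plt.plot(shots, clean_acc, marker="o", label="IMDB (clean)")
plt.plot(shots, contrast_acc, marker="o", label="Contrast set")
plt.xlabel("Number of shots")
plt.ylabel("Accuracy")
plt.legend()

os.makedirs("Plots", exist_ok=True)
plt.savefig("Plots/accuracy_comparison.png")
```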