
Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers


CDAC-lab/egv_explainability_driven_adversarial_attack_detection


Text classifiers are Artificial Intelligence (AI) models that classify new documents or text vectors into pre-defined classes. They are typically built using supervised learning algorithms and labelled datasets. Because text classifiers emit a pre-defined class as output, they are susceptible to adversarial attacks; even high-accuracy classifiers trained with complex deep learning algorithms can be fooled by adversarial examples whose differences from benign inputs are indiscernible to human experts. Recent work in this space has focused mostly on improving adversarial robustness and on adversarial example detection, rather than on detecting adversarial attacks. In this work, we propose a novel approach, Explainable AI with Integrated Gradients (IG), for the detection of adversarial attacks on text classifiers. The approach uses IG to unpack model behaviour and identify the terms that positively and negatively influence the target prediction. Instead of randomly substituting words in the input, we use the attribution scores obtained from IG to select the top k words with the greatest positive and negative influence as substitution candidates, and generate k transformed inputs by replacing those words with synonyms. The approach requires no changes to the model architecture or the training algorithm. It was empirically evaluated on three benchmark datasets: IMDB, SST-2 and AG News. Our approach outperforms baseline models on word substitution rate, detection accuracy and F1 score while maintaining equivalent detection performance against adversarial attacks.
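The pipeline the abstract describes (compute IG attributions, select the top-k most influential words, substitute synonyms, and observe whether the prediction changes) can be sketched on a toy linear bag-of-words classifier. Everything below — the vocabulary weights, the synonym table, and the function names — is a hypothetical illustration, not the repository's actual code. For a linear model the IG path integral is exact and collapses to gradient × (input − baseline); deep models would instead average gradients at interpolated points along the baseline-to-input path.

```python
# Hypothetical sketch of IG-guided synonym substitution for detecting
# adversarial inputs. VOCAB_WEIGHTS, SYNONYMS, and all function names
# are invented for illustration only.
VOCAB_WEIGHTS = {"great": 2.0, "superb": 1.5, "boring": -1.8,
                 "dull": -1.6, "awful": -2.5, "film": 0.1}
SYNONYMS = {"great": "superb", "superb": "great",
            "boring": "dull", "dull": "boring"}


def predict(tokens):
    """Toy linear bag-of-words classifier: 1 = positive, 0 = negative."""
    score = sum(VOCAB_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1 if score > 0 else 0


def integrated_gradients(tokens):
    """Per-word IG attribution against an all-zeros (empty-text) baseline.

    The score is linear in the word counts, so the path integral is exact:
    attr_i = gradient_i * (count_i - baseline_i), with baseline_i = 0.
    """
    return {t: VOCAB_WEIGHTS.get(t, 0.0) * tokens.count(t)
            for t in set(tokens)}


def detect(tokens, k=2):
    """Substitute synonyms for the top-k attributed words and count how
    many substitutions flip the prediction. A high flip rate suggests the
    input sits unusually close to the decision boundary, as crafted
    adversarial examples tend to."""
    attrs = integrated_gradients(tokens)
    top_k = sorted(attrs, key=lambda t: abs(attrs[t]), reverse=True)[:k]
    original = predict(tokens)
    flips = samples = 0
    for word in top_k:
        if word not in SYNONYMS:
            continue  # no substitution candidate for this word
        transformed = [SYNONYMS[word] if t == word else t for t in tokens]
        samples += 1
        flips += predict(transformed) != original
    return flips, samples
```

On a clearly benign input such as `["great", "film"]` no substitution flips the label, while a borderline input such as `["boring", "great"]` flips under one of its two candidate substitutions, illustrating the fragility signal the detector relies on.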
