
Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers


CDAC-lab/egv_explainability_driven_adversarial_attack_detection


Text classifiers are Artificial Intelligence (AI) models that classify new documents or text vectors into pre-defined classes. They are typically built using supervised learning algorithms and labelled datasets. Because text classifiers emit a pre-defined class as output, they are susceptible to adversarial attacks; even high-accuracy classifiers trained with complex deep learning algorithms can be fooled by adversarial examples whose differences from benign inputs are indiscernible to human experts. Recent work in this space has focused mostly on improving adversarial robustness and on adversarial example detection, rather than on detecting adversarial attacks. In this work, we propose a novel approach, Explainable AI with Integrated Gradients (IG), for the detection of adversarial attacks on text classifiers. The approach uses IG to unpack model behaviour and identify the terms that positively and negatively influence the target prediction. Instead of randomly substituting words in the input, we use the attribution scores obtained from IG to select the top k words with the greatest positive and negative influence as substitution candidates, and generate k transformed inputs by replacing those words with synonyms. The approach requires no changes to the model architecture or the training algorithm. It was empirically evaluated on three benchmark datasets: IMDB, SST-2 and AG News. Our approach outperforms baseline models on word substitution rate, detection accuracy and F1 score while maintaining equivalent detection performance against adversarial attacks.
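The pipeline the abstract describes (compute IG attributions, select the top-k most influential words, substitute synonyms, and observe whether the prediction changes) can be sketched on a toy linear bag-of-words classifier. Everything below — the vocabulary weights, the synonym table, and the function names — is a hypothetical illustration, not the repository's actual code. For a linear model the IG path integral is exact and collapses to gradient × (input − baseline); deep models would instead average gradients at interpolated points along the baseline-to-input path.

```python
# Hypothetical sketch of IG-guided synonym substitution for detecting
# adversarial inputs. VOCAB_WEIGHTS, SYNONYMS, and all function names
# are invented for illustration only.
VOCAB_WEIGHTS = {"great": 2.0, "superb": 1.5, "boring": -1.8,
                 "dull": -1.6, "awful": -2.5, "film": 0.1}
SYNONYMS = {"great": "superb", "superb": "great",
            "boring": "dull", "dull": "boring"}


def predict(tokens):
    """Toy linear bag-of-words classifier: 1 = positive, 0 = negative."""
    score = sum(VOCAB_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1 if score > 0 else 0


def integrated_gradients(tokens):
    """Per-word IG attribution against an all-zeros (empty-text) baseline.

    The score is linear in the word counts, so the path integral is exact:
    attr_i = gradient_i * (count_i - baseline_i), with baseline_i = 0.
    """
    return {t: VOCAB_WEIGHTS.get(t, 0.0) * tokens.count(t)
            for t in set(tokens)}


def detect(tokens, k=2):
    """Substitute synonyms for the top-k attributed words and count how
    many substitutions flip the prediction. A high flip rate suggests the
    input sits unusually close to the decision boundary, as crafted
    adversarial examples tend to."""
    attrs = integrated_gradients(tokens)
    top_k = sorted(attrs, key=lambda t: abs(attrs[t]), reverse=True)[:k]
    original = predict(tokens)
    flips = samples = 0
    for word in top_k:
        if word not in SYNONYMS:
            continue  # no substitution candidate for this word
        transformed = [SYNONYMS[word] if t == word else t for t in tokens]
        samples += 1
        flips += predict(transformed) != original
    return flips, samples
```

On a clearly benign input such as `["great", "film"]` no substitution flips the label, while a borderline input such as `["boring", "great"]` flips under one of its two candidate substitutions, illustrating the fragility signal the detector relies on.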
