An extension of TextAttack for multi-label adversarial example generation, with a focus on toxicity classification. It generates adversarial examples that flip multiple labels simultaneously while preserving semantic meaning and grammatical correctness.
- 🎯 Multi-label Attacks: Attack multiple labels simultaneously (maximize/minimize different label sets)
- 🏗️ Modular Architecture: Support for multiple models (Detoxify, custom HuggingFace models)
- 🔬 Multiple Attack Recipes: Composite transformations and single-method attacks
- 📊 Configuration-Driven: YAML configuration for flexible attack parameters
- 🧪 Comprehensive Testing: 78+ test functions with 45% code coverage
- 📈 Built-in Analysis: Attack success metrics, query statistics, and result visualization
- 🚀 Easy Installation: Pip-installable with automatic dependency management
- 🎓 Complete Examples: End-to-end demos with built-in data
# Install from source
git clone https://github.com/QData/TextAttack-Multilabel
cd TextAttack-Multilabel
pip install -e .
python install_env.py
# Install with development dependencies (testing, linting, type checking)
pip install -e ".[dev]"
# Verify installation
python -c "from textattack_multilabel import MultilabelACL23; print('✓ Installation successful')"
- Python 3.8+
- PyTorch 1.9+
- TextAttack 0.3.0+
- Transformers 4.10+
- See pyproject.toml for complete dependencies
Run a complete workflow with built-in sample data (no download needed):
# Quick demo (5 samples, ~2 minutes)
python example_toxic_adv_examples/run_end_to_end_demo.py --quick
# Full demo with analysis
python example_toxic_adv_examples/run_end_to_end_demo.py
What this does:
- Creates sample benign/toxic texts
- Loads Detoxify toxicity model
- Runs multilabel adversarial attacks
- Analyzes attack success rates
- Shows example perturbations
- Saves detailed results
from textattack_multilabel import (
    MultilabelModelWrapper,
    MultilabelACL23_recipe,
    MultilabelACL23Transform
)
import transformers
# Load your model and tokenizer
model = transformers.AutoModelForSequenceClassification.from_pretrained("your-model")
tokenizer = transformers.AutoTokenizer.from_pretrained("your-model")
# Wrap for multilabel attacks
model_wrapper = MultilabelModelWrapper(
    model,
    tokenizer,
    multilabel=True,
    device='cuda'  # Auto-detects if None
)
# Build attack: maximize toxic labels (make benign text toxic)
mattack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[0, 1, 2, 3, 4, 5],  # All 6 toxic labels
    labels_to_minimize=[],
    wir_method="gradient"  # Options: unk, delete, gradient, weighted-saliency
)
# Run attack
import textattack
dataset = textattack.datasets.Dataset([("Sample text", [0.1, 0.2, 0.3, 0.1, 0.2, 0.1])])
attacker = textattack.Attacker(mattack, dataset)
results = attacker.attack_dataset()
# Run attacks with configuration file
python example_toxic_adv_examples/run_multilabel_tae_main.py \
    --config example_toxic_adv_examples/config/attack_config.yaml \
    --attack toxic
textattack_multilabel/
├── __init__.py # Public API exports
├── multilabel_model_wrapper.py # Model wrapper with gradient support
├── goal_function.py # Multi-label goal functions
├── attack_components.py # Search methods and components
├── multilabel_target_attack_recipe.py # Composite attack recipe
└── multilabel_transform_attack_recipe.py # Single-method attack recipe
Wraps HuggingFace models for multilabel classification with:
- Automatic device detection (CUDA/CPU)
- Gradient computation for gradient-based attacks
- Sigmoid activation for multilabel outputs
- Support for standard transformer models
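The sigmoid step matters because multilabel scores are independent per label, whereas softmax scores compete and sum to 1. The following is a minimal sketch in plain Python, not the wrapper's actual code; the logit values are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Map logits to probabilities that compete and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three labels; two of them are strongly active.
logits = [2.0, 2.0, -3.0]

multilabel_scores = [sigmoid(v) for v in logits]
multiclass_scores = softmax(logits)

# With sigmoid, several labels can exceed 0.5 at the same time.
active = [i for i, s in enumerate(multilabel_scores) if s > 0.5]
```

Under softmax the same two logits split the probability mass, so neither label crosses 0.5, which is why a per-label sigmoid is the natural choice when several toxicity labels may apply to one text.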
Goal function that can:
- Maximize specific labels (make text toxic)
- Minimize specific labels (make text benign)
- Combine both objectives simultaneously
- Validate multi-label success criteria
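The success criterion described above can be sketched as a small predicate. This is an illustrative reading, assuming an attack succeeds when every label to maximize is above its threshold and every label to minimize is below its threshold. The function name `multilabel_goal_reached` is hypothetical and is not part of the package; the parameter names mirror the recipe arguments used elsewhere in this README.

```python
from typing import Sequence


def multilabel_goal_reached(
    scores: Sequence[float],
    labels_to_maximize: Sequence[int],
    labels_to_minimize: Sequence[int],
    maximize_target_score: float = 0.5,
    minimize_target_score: float = 0.5,
) -> bool:
    """Return True when every targeted label is on the desired side of its threshold."""
    raised = all(scores[i] > maximize_target_score for i in labels_to_maximize)
    lowered = all(scores[i] < minimize_target_score for i in labels_to_minimize)
    return raised and lowered


# Hypothetical per-label scores for a six-label toxicity model.
scores = [0.9, 0.8, 0.1, 0.2, 0.6, 0.3]

mixed_ok = multilabel_goal_reached(scores, [0, 1], [2, 3])
all_max_ok = multilabel_goal_reached(scores, [0, 1, 2, 3, 4, 5], [])
```

Here the mixed objective is satisfied, while maximizing all six labels is not, because labels 2, 3, and 5 are still below 0.5.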
MultilabelACL23_recipe (Recommended):
- Composite transformations (word swaps, character edits, homoglyphs)
- Multiple WIR methods: unk, delete, weighted-saliency, gradient
- Flexible constraint configuration
MultilabelACL23Transform:
- Single transformation methods: glove, mlm, wordnet
- Simpler, more interpretable perturbations
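To give a feel for what a word importance ranking (WIR) method does, here is a rough sketch of the deletion idea: remove each word in turn and measure how much a score drops. It is a simplified illustration under stated assumptions, not the package's implementation. `toy_score` is a hypothetical stand-in for a real model score, and `rank_words_by_deletion` is a name invented for this example.

```python
from typing import Callable, List, Tuple


def rank_words_by_deletion(
    text: str, score_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Rank words by how much the score drops when each one is removed."""
    words = text.split()
    base = score_fn(text)
    ranked = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        ranked.append((word, base - score_fn(reduced)))
    # Largest drop first: these words contribute most to the score.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def toy_score(text: str) -> float:
    """Stand-in for a model: fraction of words found in a tiny flagged list."""
    flagged = {"stupid", "idiot"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / len(words) if words else 0.0


ranking = rank_words_by_deletion("you are a stupid person", toy_score)
top_word, top_drop = ranking[0]
```

A search method would then try transformations on the highest-ranked words first, which is where most of the query savings come from compared with perturbing words in arbitrary order.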
The test suite contains 78+ test functions:
# Run all tests
python test/run_tests.py
# Run with coverage report
python test/run_tests.py --coverage
# Run specific test file
python test/run_tests.py --file test_goal_function_core.py
# Run in parallel (4 workers)
python test/run_tests.py --parallel 4
# List all test files
python test/run_tests.py --list
Test Suite Overview:
- test_goal_function_core.py: 43 tests for goal function logic
- test_model_wrapper_advanced.py: 29 tests for model wrapper and gradients
- test_model_wrapper.py: Basic wrapper tests
- test_multilabel_attack_recipes.py: Recipe building tests
Coverage: ~45% (goal function: 95%, model wrapper: 85%)
See example_toxic_adv_examples/README.md for detailed documentation.
# 1. End-to-end demo (no data needed)
python example_toxic_adv_examples/run_end_to_end_demo.py --quick
# 2. Custom parameters
python example_toxic_adv_examples/run_end_to_end_demo.py \
    --num-samples 20 \
    --wir-method gradient \
    --recipe-type transform
# 3. Attack only benign samples
python example_toxic_adv_examples/run_end_to_end_demo.py --no-attack-toxic
Make Benign Text Toxic (Maximize):
attack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[0, 1, 2, 3, 4, 5],  # Maximize all toxic labels
    labels_to_minimize=[],
    maximize_target_score=0.5  # Target: all labels > 0.5
)
Make Toxic Text Benign (Minimize):
attack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[],
    labels_to_minimize=[0, 1, 2, 3, 4, 5],  # Minimize all toxic labels
    minimize_target_score=0.5  # Target: all labels < 0.5
)
Mixed Objectives:
attack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[0, 1],  # Maximize toxic and severe_toxic
    labels_to_minimize=[2, 3],  # Minimize obscene and threat
)
Edit example_toxic_adv_examples/config/attack_config.yaml:
defaults:
  model:
    type: "detoxify"  # or "custom"
    variant: "original"
  dataset:
    name: "jigsaw_toxic_comments"
    sample_size: 500
attack:
  wir_method: "gradient"  # unk, delete, weighted-saliency, gradient
  labels_to_maximize: []  # Empty = all labels
  labels_to_minimize: []
  maximize_target_score: 0.5
  minimize_target_score: 0.5
  constraints:
    pos_constraint: true  # Maintain part-of-speech
    sbert_constraint: false  # Semantic similarity
Attack results include:
- Success Rate: Percentage of successful attacks
- Query Efficiency: Average queries per attack
- Perturbation Quality: Words changed, character edits
- Label Changes: Before/after predictions for all labels
- Example Outputs: Actual perturbed texts
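As a rough illustration of how the first three metrics relate to per-example outcomes, the sketch below aggregates a handful of made-up records. The `AttackRecord` class and its field names are hypothetical and do not reflect the package's actual result schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AttackRecord:
    """Hypothetical per-example record; field names are illustrative only."""
    succeeded: bool
    num_queries: int
    words_changed: int


def summarize(records: List[AttackRecord]) -> Dict[str, float]:
    """Aggregate success rate (%), mean queries, and mean words changed on successes."""
    total = len(records)
    if total == 0:
        return {"success_rate": 0.0, "avg_queries": 0.0, "avg_words_changed": 0.0}
    wins = [r for r in records if r.succeeded]
    avg_changed = sum(r.words_changed for r in wins) / len(wins) if wins else 0.0
    return {
        "success_rate": 100.0 * len(wins) / total,
        "avg_queries": sum(r.num_queries for r in records) / total,
        "avg_words_changed": avg_changed,
    }


# Made-up outcomes for four attacked examples.
records = [
    AttackRecord(True, 40, 2),
    AttackRecord(False, 120, 0),
    AttackRecord(True, 20, 1),
    AttackRecord(True, 60, 3),
]
summary = summarize(records)
```

Averaging words changed over successful attacks only is a design choice in this sketch, since failed attacks leave the text unperturbed and would otherwise pull the figure toward zero.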
Results saved as:
- *.parquet - Complete results with predictions
- *.summary.txt - Statistics and metrics
- htmlcov/ - Test coverage reports (when using --coverage)
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from textattack_multilabel import MultilabelModelWrapper
# Load your custom model
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/your/model",
    num_labels=6
)
tokenizer = AutoTokenizer.from_pretrained("path/to/your/tokenizer")
# Wrap it
wrapper = MultilabelModelWrapper(model, tokenizer, multilabel=True)
from textattack_multilabel import MultilabelACL23Transform
# Use WordNet transformations with beam search
attack = MultilabelACL23Transform.build(
    model_wrapper=wrapper,
    labels_to_maximize=[0, 1, 2],
    transform_method="wordnet",  # glove, mlm, or wordnet
    wir_method="beam",
    pos_constraint=True,
    sbert_constraint=True  # Add semantic similarity constraint
)
# Most effective but slowest
attack = MultilabelACL23_recipe.build(
    model_wrapper=wrapper,
    labels_to_maximize=[0, 1, 2, 3, 4, 5],
    wir_method="gradient",  # Gradient-guided word importance
    pos_constraint=True
)
CUDA Out of Memory:
# Force CPU mode
CUDA_VISIBLE_DEVICES="" python example_toxic_adv_examples/run_end_to_end_demo.py
Import Errors:
# Verify installation
pip install -e .
# Check imports
python -c "from textattack_multilabel import MultilabelACL23; print('OK')"
Slow Attacks:
# Use faster WIR method
python example_toxic_adv_examples/run_end_to_end_demo.py --wir-method unk
# All tests with coverage
python test/run_tests.py --coverage
# Quality checks (black, isort, mypy)
python test/run_tests.py --quality
# Specific test class
python test/run_tests.py --test test/test_goal_function_core.py::TestGetScore
# Format code
black textattack_multilabel/ test/
# Sort imports
isort textattack_multilabel/ test/
# Type checking
mypy textattack_multilabel/ --ignore-missing-imports
- Package API: See docstrings in textattack_multilabel/
- Examples: example_toxic_adv_examples/README.md
- Tests: test/ directory with comprehensive examples
- TextAttack Docs: https://textattack.readthedocs.io/
If you use this package in your research, please cite:
@inproceedings{textattack-multilabel-2023,
title={Multi-label Adversarial Attacks for Text Classification},
author={QData Lab},
booktitle={ACL},
year={2023}
}
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Run python test/run_tests.py --quality before submitting
- Submit a pull request
Apache 2.0 License - see LICENSE file.
Built on top of TextAttack by QData Lab.
- Issues: https://github.com/QData/TextAttack-Multilabel/issues
- Documentation: See example_toxic_adv_examples/README.md
- Tests: Run python test/run_tests.py --help
@inproceedings{bespalov-etal-2023-towards,
title = "Towards Building a Robust Toxicity Predictor",
author = "Bespalov, Dmitriy and
Bhabesh, Sourav and
Xiang, Yi and
Zhou, Liutong and
Qi, Yanjun",
editor = "Sitaram, Sunayana and
Beigman Klebanov, Beata and
Williams, Jason D",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-industry.56/",
doi = "10.18653/v1/2023.acl-industry.56",
pages = "581--598",
abstract = "Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, {\textbackslash}texttt{\{}ToxicTrap{\}}, introducing small word-level perturbations to fool SOTA text classifiers to predict toxic text samples as benign. {\textbackslash}texttt{\{}ToxicTrap{\}} exploits greedy based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow {\textbackslash}texttt{\{}ToxicTrap{\}} to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, attaining over 98{\textbackslash}{\%} attack success rates in multilabel cases. We also show how a vanilla adversarial training and its improved version can help increase robustness of a toxicity detector even against unseen attacks."
}
@article{zhu2024taebench,
title={TaeBench: Improving Quality of Toxic Adversarial Examples},
author={Zhu, Xuan and Bespalov, Dmitriy and You, Liwen and Kulkarni, Ninad and Qi, Yanjun},
journal={Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) Industry Track},
year={2025}
}