TextAttack-Multilabel

An extension of TextAttack for multi-label adversarial example generation, with a focus on toxicity classification. It generates adversarial examples that flip multiple labels simultaneously while aiming to preserve semantic meaning and grammatical correctness.

✨ Features

🎯 Multi-label Attacks: Attack multiple labels simultaneously (maximize/minimize different label sets)
🏗️ Modular Architecture: Support for multiple models (Detoxify, custom HuggingFace models)
🔬 Multiple Attack Recipes: Composite transformations and single-method attacks
📊 Configuration-Driven: YAML configuration for flexible attack parameters
🧪 Test Suite: 78+ test functions with ~45% overall code coverage
📈 Built-in Analysis: Attack success metrics, query statistics, and result visualization
🚀 Easy Installation: Pip-installable with automatic dependency management
🎓 Complete Examples: End-to-end demos with built-in data

📦 Installation

Quick Install (Recommended)

# Install from source
git clone https://github.com/QData/TextAttack-Multilabel
cd TextAttack-Multilabel
pip install -e .

Environment Setup and Verification

python install_env.py

Development Installation

# Install with development dependencies (testing, linting, type checking)
pip install -e ".[dev]"

# Verify installation
python -c "from textattack_multilabel import MultilabelACL23_recipe; print('✓ Installation successful')"

Requirements

Python 3.8+
PyTorch 1.9+
TextAttack 0.3.0+
Transformers 4.10+
See pyproject.toml for complete dependencies

🚀 Quick Start

Option 1: End-to-End Demo (Fastest)

Run a complete workflow with built-in sample data (no download needed):

# Quick demo (5 samples, ~2 minutes)
python example_toxic_adv_examples/run_end_to_end_demo.py --quick

# Full demo with analysis
python example_toxic_adv_examples/run_end_to_end_demo.py

What this does:

Creates sample benign/toxic texts
Loads Detoxify toxicity model
Runs multilabel adversarial attacks
Analyzes attack success rates
Shows example perturbations
Saves detailed results

Option 2: Python API

import textattack
import transformers

from textattack_multilabel import (
    MultilabelModelWrapper,
    MultilabelACL23_recipe,
    MultilabelACL23Transform
)

# Load your model and tokenizer
model = transformers.AutoModelForSequenceClassification.from_pretrained("your-model")
tokenizer = transformers.AutoTokenizer.from_pretrained("your-model")

# Wrap for multilabel attacks
model_wrapper = MultilabelModelWrapper(
    model,
    tokenizer,
    multilabel=True,
    device='cuda'  # Auto-detects if None
)

# Build attack: maximize toxic labels (make benign text toxic)
mattack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[0, 1, 2, 3, 4, 5],  # All 6 toxic labels
    labels_to_minimize=[],
    wir_method="gradient"  # Options: unk, delete, gradient, weighted-saliency
)

# Run attack on a one-example dataset: (text, per-label scores)
dataset = textattack.datasets.Dataset([("Sample text", [0.1, 0.2, 0.3, 0.1, 0.2, 0.1])])
attacker = textattack.Attacker(mattack, dataset)
results = attacker.attack_dataset()

Option 3: Configuration-Based

# Run attacks with configuration file
python example_toxic_adv_examples/run_multilabel_tae_main.py \
  --config example_toxic_adv_examples/config/attack_config.yaml \
  --attack toxic

📖 Package Structure

textattack_multilabel/
├── __init__.py                              # Public API exports
├── multilabel_model_wrapper.py              # Model wrapper with gradient support
├── goal_function.py                         # Multi-label goal functions
├── attack_components.py                     # Search methods and components
├── multilabel_target_attack_recipe.py       # Composite attack recipe
└── multilabel_transform_attack_recipe.py    # Single-method attack recipe

Core Components

MultilabelModelWrapper

Wraps HuggingFace models for multilabel classification with:

Automatic device detection (CUDA/CPU)
Gradient computation for gradient-based attacks
Sigmoid activation for multilabel outputs
Support for standard transformer models
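
To illustrate why a multilabel wrapper applies a sigmoid rather than a softmax, here is a minimal standard-library sketch. It is not the wrapper's actual code; the logits are made-up values, and the helper names `sigmoid` and `multilabel_probs` are hypothetical. Each label gets an independent probability, so several labels can exceed 0.5 at once.

```python
import math


def sigmoid(logit: float) -> float:
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def multilabel_probs(logits):
    """Apply the sigmoid per label; the outputs need not sum to 1."""
    return [sigmoid(z) for z in logits]


# Hypothetical logits for six toxicity labels (illustrative values only).
logits = [2.0, -1.0, 0.0, -3.0, 1.5, -0.5]
probs = multilabel_probs(logits)

# A label counts as "on" when its own probability exceeds 0.5.
predicted = [i for i, p in enumerate(probs) if p > 0.5]
print(predicted)
```

With a softmax, raising one label's score would necessarily lower the others, which does not match a setting where a text can be both obscene and insulting.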

MultilabelClassificationGoalFunction

Goal function that can:

Maximize specific labels (make text toxic)
Minimize specific labels (make text benign)
Combine both objectives simultaneously
Validate multi-label success criteria
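
The success criterion described above can be sketched in plain Python. This is an assumption-labeled illustration, not the package's goal function: the function name `is_attack_successful` is hypothetical, the parameter names are borrowed from the recipe arguments used elsewhere in this README, and how the package treats empty label lists is not modeled here.

```python
def is_attack_successful(
    scores,
    labels_to_maximize,
    labels_to_minimize,
    maximize_target_score=0.5,
    minimize_target_score=0.5,
):
    """Check a combined multi-label objective on per-label scores.

    Every label to maximize must end above its target, and every label
    to minimize must end below its target. In this sketch an empty list
    simply imposes no condition.
    """
    raised = all(scores[i] > maximize_target_score for i in labels_to_maximize)
    lowered = all(scores[i] < minimize_target_score for i in labels_to_minimize)
    return raised and lowered


# Hypothetical per-label scores for a perturbed text.
scores = [0.9, 0.8, 0.1, 0.2, 0.6, 0.4]

mixed = is_attack_successful(scores, [0, 1], [2, 3])
all_up = is_attack_successful(scores, [0, 1, 2, 3, 4, 5], [])
print(mixed, all_up)
```

The second call fails because labels 2, 3, and 5 stay at or below 0.5, which shows why attacking all labels at once is a stricter objective than attacking a subset.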

Attack Recipes

MultilabelACL23_recipe (Recommended):

Composite transformations (word swaps, character edits, homoglyphs)
Multiple WIR methods: unk, delete, weighted-saliency, gradient
Flexible constraint configuration
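
As a rough illustration of what a word importance ranking (WIR) method does, the sketch below implements the general idea behind the delete method: remove each word in turn and see how much the score drops. The scoring function is a toy stand-in for a model, and `rank_words_by_deletion` and `toy_score` are hypothetical names; the package's own implementation may differ.

```python
def rank_words_by_deletion(words, score_fn):
    """Return word indices ordered from most to least important.

    Importance of word i = score(full text) - score(text without word i).
    """
    base = score_fn(words)
    drops = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        drops.append((base - score_fn(reduced), i))
    # Largest drop first; ties are broken by position in the text.
    drops.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in drops]


# Toy scorer (assumption): sums fixed weights of flagged words.
TOY_WEIGHTS = {"stupid": 0.5, "idiot": 0.25}


def toy_score(words):
    return sum(TOY_WEIGHTS.get(w.lower(), 0.0) for w in words)


words = ["you", "are", "a", "stupid", "idiot"]
ranking = rank_words_by_deletion(words, toy_score)
print([words[i] for i in ranking])
```

A greedy attack would then try to perturb words in this order, so the number of model queries depends on how quickly the ranking surfaces the influential words.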

MultilabelACL23Transform:

Single transformation methods: glove, mlm, wordnet
Simpler, more interpretable perturbations
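
A single-method transformation can be pictured as "swap exactly one word for a substitute from one source". The sketch below uses a tiny hand-written synonym table in place of GloVe, a masked language model, or WordNet; the table and the function name `single_swap_candidates` are assumptions made for illustration only.

```python
def single_swap_candidates(words, synonyms):
    """Generate every text that differs from `words` in exactly one position."""
    candidates = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            candidates.append(words[:i] + [alternative] + words[i + 1:])
    return candidates


# Hypothetical substitute table standing in for a real synonym source.
toy_synonyms = {"stupid": ["silly", "foolish"], "idea": ["notion"]}

words = ["what", "a", "stupid", "idea"]
candidates = single_swap_candidates(words, toy_synonyms)
for candidate in candidates:
    print(" ".join(candidate))
```

Because each candidate changes only one word from one source, it is easy to trace which substitution caused a label to flip.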

🧪 Testing

The test suite contains 78+ test functions:

# Run all tests
python test/run_tests.py

# Run with coverage report
python test/run_tests.py --coverage

# Run specific test file
python test/run_tests.py --file test_goal_function_core.py

# Run in parallel (4 workers)
python test/run_tests.py --parallel 4

# List all test files
python test/run_tests.py --list

Test Suite Overview:

test_goal_function_core.py: 43 tests for goal function logic
test_model_wrapper_advanced.py: 29 tests for model wrapper and gradients
test_model_wrapper.py: Basic wrapper tests
test_multilabel_attack_recipes.py: Recipe building tests

Coverage: ~45% (goal function: 95%, model wrapper: 85%)

📊 Examples

Complete Examples Directory

See example_toxic_adv_examples/README.md for detailed documentation.

Quick Examples

# 1. End-to-end demo (no data needed)
python example_toxic_adv_examples/run_end_to_end_demo.py --quick

# 2. Custom parameters
python example_toxic_adv_examples/run_end_to_end_demo.py \
  --num-samples 20 \
  --wir-method gradient \
  --recipe-type transform

# 3. Attack only benign samples
python example_toxic_adv_examples/run_end_to_end_demo.py --no-attack-toxic

Attack Direction Examples

Make Benign Text Toxic (Maximize):

attack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[0, 1, 2, 3, 4, 5],  # Maximize all toxic labels
    labels_to_minimize=[],
    maximize_target_score=0.5  # Target: all labels > 0.5
)

Make Toxic Text Benign (Minimize):

attack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[],
    labels_to_minimize=[0, 1, 2, 3, 4, 5],  # Minimize all toxic labels
    minimize_target_score=0.5  # Target: all labels < 0.5
)

Mixed Objectives:

attack = MultilabelACL23_recipe.build(
    model_wrapper=model_wrapper,
    labels_to_maximize=[0, 1],  # Maximize toxic and severe_toxic
    labels_to_minimize=[2, 3],  # Minimize obscene and threat
)

⚙️ Configuration

YAML Configuration Files

Edit example_toxic_adv_examples/config/attack_config.yaml:

defaults:
  model:
    type: "detoxify"  # or "custom"
    variant: "original"

  dataset:
    name: "jigsaw_toxic_comments"
    sample_size: 500

  attack:
    wir_method: "gradient"  # unk, delete, weighted-saliency, gradient
    labels_to_maximize: []  # Empty = all labels
    labels_to_minimize: []
    maximize_target_score: 0.5
    minimize_target_score: 0.5

    constraints:
      pos_constraint: true      # Maintain part-of-speech
      sbert_constraint: false   # Semantic similarity
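
Before launching a long run, it can help to sanity-check the attack section of the configuration. The sketch below works on a plain dictionary that mirrors the YAML above; it is not part of the package. The allowed WIR values come from the comment in the YAML, while the [0, 1] range for target scores, the overlap check, and the name `validate_attack_config` are assumptions.

```python
ALLOWED_WIR_METHODS = {"unk", "delete", "weighted-saliency", "gradient"}


def validate_attack_config(attack):
    """Return a list of problems; an empty list means the section looks fine."""
    problems = []
    if attack.get("wir_method") not in ALLOWED_WIR_METHODS:
        problems.append(f"unknown wir_method: {attack.get('wir_method')!r}")
    for key in ("maximize_target_score", "minimize_target_score"):
        value = attack.get(key)
        # Assumption: targets are probabilities, so they must lie in [0, 1].
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be a number in [0, 1], got {value!r}")
    overlap = set(attack.get("labels_to_maximize", [])) & set(
        attack.get("labels_to_minimize", [])
    )
    if overlap:
        problems.append(f"labels in both lists: {sorted(overlap)}")
    return problems


# Mirrors the `attack` section of the YAML above.
attack_cfg = {
    "wir_method": "gradient",
    "labels_to_maximize": [],
    "labels_to_minimize": [],
    "maximize_target_score": 0.5,
    "minimize_target_score": 0.5,
}
bad_cfg = dict(attack_cfg, wir_method="random", maximize_target_score=1.5)

print(validate_attack_config(attack_cfg))
print(validate_attack_config(bad_cfg))
```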

📈 Results and Analysis

Attack results include:

Success Rate: Percentage of successful attacks
Query Efficiency: Average queries per attack
Perturbation Quality: Words changed, character edits
Label Changes: Before/after predictions for all labels
Example Outputs: Actual perturbed texts
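
The first two metrics are simple aggregates over per-example results. A minimal sketch follows, assuming each result is a dictionary with hypothetical `success` and `num_queries` fields; the package's actual result schema and its handling of skipped examples may differ.

```python
def summarize(results):
    """Compute success rate (percent) and average queries per attack."""
    total = len(results)
    if total == 0:
        return {"success_rate": 0.0, "avg_queries": 0.0}
    successes = sum(1 for r in results if r["success"])
    queries = sum(r["num_queries"] for r in results)
    return {
        "success_rate": 100.0 * successes / total,
        "avg_queries": queries / total,
    }


# Hypothetical per-example attack records.
results = [
    {"success": True, "num_queries": 40},
    {"success": True, "num_queries": 60},
    {"success": False, "num_queries": 120},
    {"success": True, "num_queries": 20},
]
summary = summarize(results)
print(summary)
```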

Output Files

Results saved as:

*.parquet - Complete results with predictions
*.summary.txt - Statistics and metrics
htmlcov/ - Test coverage reports (when using --coverage)

🔧 Advanced Usage

Custom Models

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from textattack_multilabel import MultilabelModelWrapper

# Load your custom model
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/your/model",
    num_labels=6
)
tokenizer = AutoTokenizer.from_pretrained("path/to/your/tokenizer")

# Wrap it
wrapper = MultilabelModelWrapper(model, tokenizer, multilabel=True)

Custom Attack Parameters

from textattack_multilabel import MultilabelACL23Transform

# Use WordNet transformations with beam search
attack = MultilabelACL23Transform.build(
    model_wrapper=wrapper,
    labels_to_maximize=[0, 1, 2],
    transform_method="wordnet",  # glove, mlm, or wordnet
    wir_method="beam",
    pos_constraint=True,
    sbert_constraint=True  # Add semantic similarity constraint
)

Gradient-Based Attacks

# Most effective but slowest
attack = MultilabelACL23_recipe.build(
    model_wrapper=wrapper,
    labels_to_maximize=[0, 1, 2, 3, 4, 5],
    wir_method="gradient",  # Gradient-guided word importance
    pos_constraint=True
)

🐛 Troubleshooting

Common Issues

CUDA Out of Memory:

# Force CPU mode
CUDA_VISIBLE_DEVICES="" python example_toxic_adv_examples/run_end_to_end_demo.py

Import Errors:

# Verify installation
pip install -e .

# Check imports
python -c "from textattack_multilabel import MultilabelACL23_recipe; print('OK')"

Slow Attacks:

# Use faster WIR method
python example_toxic_adv_examples/run_end_to_end_demo.py --wir-method unk

🧑‍💻 Development

Running Tests

# All tests with coverage
python test/run_tests.py --coverage

# Quality checks (black, isort, mypy)
python test/run_tests.py --quality

# Specific test class
python test/run_tests.py --test test/test_goal_function_core.py::TestGetScore

Code Quality

# Format code
black textattack_multilabel/ test/

# Sort imports
isort textattack_multilabel/ test/

# Type checking
mypy textattack_multilabel/ --ignore-missing-imports

📚 Documentation

Package API: See docstrings in textattack_multilabel/
Examples: example_toxic_adv_examples/README.md
Tests: test/ directory with comprehensive examples
TextAttack Docs: https://textattack.readthedocs.io/

🔬 Research

If you use this package in your research, please cite:

@inproceedings{textattack-multilabel-2023,
  title={Multi-label Adversarial Attacks for Text Classification},
  author={QData Lab},
  booktitle={ACL},
  year={2023}
}

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Run python test/run_tests.py --quality before submitting
Submit a pull request

📄 License

Apache 2.0 License - see LICENSE file.

🙏 Acknowledgments

Built on top of TextAttack by QData Lab.

📞 Support

Issues: https://github.com/QData/TextAttack-Multilabel/issues
Documentation: See example_toxic_adv_examples/README.md
Tests: Run python test/run_tests.py --help

Please Cite

@inproceedings{bespalov-etal-2023-towards,
    title = "Towards Building a Robust Toxicity Predictor",
    author = "Bespalov, Dmitriy  and
      Bhabesh, Sourav  and
      Xiang, Yi  and
      Zhou, Liutong  and
      Qi, Yanjun",
    editor = "Sitaram, Sunayana  and
      Beigman Klebanov, Beata  and
      Williams, Jason D",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-industry.56/",
    doi = "10.18653/v1/2023.acl-industry.56",
    pages = "581--598",
    abstract = "Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, \texttt{ToxicTrap}, introducing small word-level perturbations to fool SOTA text classifiers to predict toxic text samples as benign. \texttt{ToxicTrap} exploits greedy based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow \texttt{ToxicTrap} to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, attaining over 98\% attack success rates in multilabel cases. We also show how a vanilla adversarial training and its improved version can help increase robustness of a toxicity detector even against unseen attacks."
}

@article{zhu2024taebench,
  title={TaeBench: Improving Quality of Toxic Adversarial Examples},
  author={Zhu, Xuan and Bespalov, Dmitriy and You, Liwen and Kulkarni, Ninad and Qi, Yanjun},
  journal={Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) Industry Track},
  year={2025}
}
