HEARTS introduces explainable, low-carbon models fine-tuned on the Expanded Multi-Grain Stereotype Dataset (EMGSD) to tackle challenges in stereotype detection. This repository includes scripts for training, evaluation, and explainability analysis for sentence-level stereotype classification. For details, refer to the HEARTS research paper.
- Exploratory Data Analysis (EDA): Analyze group distributions, text length, and sentiment/regard trends in EMGSD.
- Model Training & Evaluation: Train and test models (e.g., BERT, ALBERT-V2, logistic regression) on EMGSD with ablation studies.
- Explainability: Generate SHAP and LIME explanations to interpret predictions.
- LLM Bias Evaluation: Classify and evaluate bias in LLM outputs using neutral prompts derived from EMGSD.
- Clone this repository:
  git clone https://github.com/username/HEARTS-Text-Stereotype-Detection.git
  cd HEARTS-Text-Stereotype-Detection
- Install dependencies:
  pip install -r requirements.txt
- Explore the modules (see details below).
Scripts for basic analysis of EMGSD. The dataset is available on Hugging Face.
- Initial_EDA: Analyze target group distribution, stereotype group distribution, text length, and frequency.
- Sentiment_Regard_Analysis: Classify sentiment (RoBERTa Sentiment Model) and regard (Regard v3) for dataset entries.
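As a rough illustration of the distribution and text-length checks listed above, here is a minimal standard-library sketch. The rows, the field names (`text`, `target_group`, `stereotype_type`), and the helper functions are hypothetical stand-ins. They are not the EMGSD schema or the code in Initial_EDA.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical rows standing in for dataset entries; the field names are
# assumptions for illustration, not the actual EMGSD columns.
rows = [
    {"text": "Placeholder sentence number one here.", "target_group": "engineers", "stereotype_type": "profession"},
    {"text": "Second placeholder sentence.", "target_group": "nurses", "stereotype_type": "profession"},
    {"text": "A third placeholder sentence follows.", "target_group": "teenagers", "stereotype_type": "age"},
    {"text": "Fourth placeholder entry.", "target_group": "engineers", "stereotype_type": "profession"},
]


def group_distribution(rows, key):
    """Return the share of rows for each distinct value of `key`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def text_length_stats(rows):
    """Return summary statistics of whitespace-split word counts."""
    lengths = [len(row["text"].split()) for row in rows]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


target_groups = group_distribution(rows, "target_group")
stereotype_types = group_distribution(rows, "stereotype_type")
length_stats = text_length_stats(rows)

print(target_groups)
print(stereotype_types)
print(length_stats)
```

The same two helpers can be pointed at any categorical column, so they would work unchanged once the real column names are substituted.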
Train and evaluate various models on EMGSD, with ablation studies on its three core datasets (MGSD, Augmented WinoQueer, and Augmented SeeGULL).
- BERT_Models_Fine_Tuning: Fine-tune and evaluate ALBERT-V2, DistilBERT, and BERT models.
- Logistic_Regression: Train logistic regression models using:
  - TF-IDF vectorization
  - Pre-trained embeddings (spaCy embeddings)
- DistilRoBERTaBias: Evaluate an open-source bias detection model (DistilRoBERTa Bias).
- GPT4_Models: Evaluate GPT-4o and GPT-4o-mini using API prompting (API credentials required).
Interpret model predictions using SHAP and LIME. Weights for the fine-tuned ALBERT-V2 model are available at Hugging Face.
- SHAP_LIME_Analysis: Generate SHAP and LIME explanations for selected model predictions and compare their similarity using metrics such as:
  - Cosine similarity
  - Pearson correlation
  - Jensen-Shannon divergence
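To make the three similarity metrics above concrete, the following sketch compares two token-level attribution vectors using only the standard library. The example scores are invented, and converting attributions to probability distributions by normalizing absolute values is an assumption made here for the Jensen-Shannon step. The repository's notebook may preprocess the vectors differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two non-constant sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def to_distribution(scores):
    """Normalize absolute attribution scores so they sum to 1 (an assumption)."""
    magnitudes = [abs(s) for s in scores]
    total = sum(magnitudes)
    return [m / total for m in magnitudes]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented per-token attribution scores for the same sentence.
shap_scores = [0.40, -0.10, 0.05, 0.30]
lime_scores = [0.35, -0.05, 0.10, 0.25]

cos = cosine_similarity(shap_scores, lime_scores)
pearson = pearson_correlation(shap_scores, lime_scores)
jsd = js_divergence(to_distribution(shap_scores), to_distribution(lime_scores))

print(f"cosine={cos:.3f} pearson={pearson:.3f} js_divergence={jsd:.3f}")
```

Cosine similarity and Pearson correlation are higher when the two explanations agree, while Jensen-Shannon divergence is lower, so the three values should be read in opposite directions.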
Classify and evaluate bias in LLM responses using neutral prompts derived from EMGSD.
- LLM_Prompt_Verification: Verify neutrality of prompts using the fine-tuned ALBERT-V2 model.
- LLM_Bias_Evaluation: Classify LLM outputs to compute aggregate bias scores, representing stereotype prevalence.
- SHAP_LIME_Analysis_LLM_Outputs: Apply SHAP and LIME to interpret predictions on LLM outputs.
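One plausible reading of an aggregate bias score as stereotype prevalence is the fraction of a model's outputs that the classifier labels as stereotypes. The sketch below assumes that definition. The model names, label strings, and the `aggregate_bias_scores` helper are hypothetical and may not match how LLM_Bias_Evaluation computes its scores.

```python
from collections import defaultdict

# Hypothetical classifier results: one record per LLM output.
predictions = [
    {"model": "model_a", "label": "stereotype"},
    {"model": "model_a", "label": "non-stereotype"},
    {"model": "model_a", "label": "non-stereotype"},
    {"model": "model_a", "label": "non-stereotype"},
    {"model": "model_b", "label": "stereotype"},
    {"model": "model_b", "label": "stereotype"},
    {"model": "model_b", "label": "non-stereotype"},
    {"model": "model_b", "label": "non-stereotype"},
]


def aggregate_bias_scores(predictions, positive_label="stereotype"):
    """Return, per model, the share of outputs classified as `positive_label`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in predictions:
        totals[record["model"]] += 1
        if record["label"] == positive_label:
            positives[record["model"]] += 1
    return {model: positives[model] / totals[model] for model in totals}


scores = aggregate_bias_scores(predictions)
print(scores)
```

Under this assumed definition a score of 0 means no output was flagged and 1 means every output was flagged, which makes scores comparable across models that received the same prompts.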
Key findings and performance benchmarks from the paper are outlined here.
If you use this work, please cite the following paper:
@article{hearts2024,
title={HEARTS: Enhancing Stereotype Detection with Explainable, Low-Carbon Models},
author={Author Names},
journal={arXiv preprint arXiv:2409.11579},
year={2024}
}
This repository is licensed under the MIT License.
For questions or collaborations, contact Holistic AI.