EMGSD Hermes

This project explores bias mitigation in GPT2-EMGSD. It uses correlation analysis to deduce which features relate to stereotypes, then manipulates the corresponding activations, which points to a possible alternative to traditional fine-tuning. It also demonstrates that bias can be induced in vanilla GPT2 through activation engineering.
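
As a rough illustration of what activation manipulation means here, the sketch below shifts a hidden activation vector along a feature direction. It works on plain Python lists and is not the code in this repository: `steer`, `direction`, and `strength` are hypothetical names, and a real run would apply the same kind of shift to activations inside GPT2.

```python
import math


def steer(activation, direction, strength):
    """Return activation + strength * (direction / ||direction||).

    A positive strength pushes the activation toward the feature,
    a negative strength pushes it away. All names are illustrative.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a + strength * d / norm for a, d in zip(activation, direction)]


if __name__ == "__main__":
    hidden = [1.0, 2.0]             # toy hidden state
    feature_direction = [3.0, 4.0]  # toy feature direction (assumed)
    print(steer(hidden, feature_direction, 5.0))   # amplify the feature
    print(steer(hidden, feature_direction, -5.0))  # suppress the feature
```

Normalizing the direction keeps `strength` comparable across features with different norms; whether the project normalizes this way is an assumption.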

Fast Demo

# Install Python 3.10, which is required by SAE-Lens
git clone https://github.com/seonglae/emgsd-hermes && cd emgsd-hermes
pip install torch colorama sae-lens transformers
python compare.py

Main Pipeline

TBA

1. Fine-tuning SAE with EMGSD dataset

python empsd.py

2. Extract features using correlation

python search_category.py
python search_stereo.py
# replace emgsd/*.json files
python draw_corr.py
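
To make the correlation step more concrete, here is a minimal sketch of ranking features by the Pearson correlation between their activations and a binary stereotype label. It is an assumption about the general idea, not the contents of `search_stereo.py`; `pearson`, `rank_features`, and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(activations, labels, top_k=3):
    """Sort features by correlation with the labels, highest first.

    activations: dict mapping feature id -> per-example activation list
    labels: per-example 0/1 flags (1 = stereotyped text), assumed format
    """
    scores = {fid: pearson(acts, labels) for fid, acts in activations.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    toy_activations = {
        0: [0.0, 0.0, 1.0, 1.0],  # fires on stereotyped examples
        1: [1.0, 1.0, 0.0, 0.0],  # fires on non-stereotyped examples
        2: [0.5, 0.5, 0.5, 0.5],  # constant, carries no signal
    }
    toy_labels = [0, 0, 1, 1]
    print(rank_features(toy_activations, toy_labels))
```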

Alternatively, to compute mutual information instead of correlation:

python mi_stereo.py
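
For reference, mutual information between a binarized feature and a binary label can be sketched as below. This is a plain-Python illustration under assumptions, not the logic of `mi_stereo.py`; the `threshold` used to binarize activations and the function names are hypothetical.

```python
import math
from collections import Counter


def binarize(activations, threshold=0.0):
    """Mark a feature as fired (1) when its activation exceeds the threshold."""
    return [int(a > threshold) for a in activations]


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


if __name__ == "__main__":
    fired = binarize([0.0, 0.0, 0.9, 1.2])  # toy feature activations
    labels = [0, 0, 1, 1]                   # 1 = stereotyped text (assumed)
    print(mutual_information(fired, labels))
```

Unlike Pearson correlation, mutual information also picks up non-linear dependence, at the cost of having to discretize the activations first.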

3. Compute ratio of stereotyped text in generation

python compare_all.py
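
The ratio in this step can be thought of as the fraction of generated texts that a classifier flags as stereotyped. The sketch below shows that computation only; `stereotype_ratio` and `toy_classifier` are hypothetical stand-ins, and the actual classifier and prompts used by `compare_all.py` are not shown here.

```python
def stereotype_ratio(texts, is_stereotyped):
    """Fraction of texts flagged by the is_stereotyped callable (0.0 for no texts)."""
    if not texts:
        return 0.0
    flagged = sum(1 for text in texts if is_stereotyped(text))
    return flagged / len(texts)


def toy_classifier(text):
    """Placeholder rule for illustration only; a real run would use a trained classifier."""
    return "always" in text.lower()


if __name__ == "__main__":
    baseline = [
        "they are always late",
        "the weather is nice",
        "people always say that",
        "a cat sat down",
    ]
    steered = [
        "the weather is nice",
        "a cat sat down",
        "they are always late",
        "it rained",
    ]
    print("baseline:", stereotype_ratio(baseline, toy_classifier))
    print("steered: ", stereotype_ratio(steered, toy_classifier))
```

Comparing the ratio for the baseline model against the ratio after activation manipulation gives a single number for how much the intervention changed the generations.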

Loss Graph of fine-tuning SAE
