This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content Moderation (42nd IEEE Symposium on Security and Privacy 2021).
This project aims at Euphemism Detection and Euphemism Identification.
The code is based on Python 3.7. Please install the dependencies as below:
pip install -r requirements.txt
Due to licensing restrictions, we do not distribute the datasets ourselves; instead, we direct readers to their respective sources.
Drug:
- Raw Text Corpus: Please request the raw text corpus (reddit.csv) from Wanzheng Zhu (wz6@illinois.edu) or Professor Nicolas Christin.
- Ground Truth: we summarize the drug euphemism ground truth list (provided by the DEA Intelligence Report -- Slang Terms and Code Words: A Reference for Law Enforcement Personnel) in data/euphemism_answer_drug.txt and data/target_keywords_drug.txt.
Weapon:
- Raw Text Corpus: Please request the dataset from "What is Gab: A Bastion of Free Speech or an Alt-Right Echo Chamber" (Zannettou et al. 2018), "Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation" (Durrett et al. 2017), "Tools for Automated Analysis of Cybercriminal Markets" (Portnoff et al. 2017), and the examples on Slangpedia.
- Ground Truth: Please refer to The Online Slang Dictionary, Slangpedia, and The Urban Thesaurus.
Sexuality:
- Raw Text Corpus: We use 2,894,869 processed Gab posts, collected from Jan 2018 to Oct 2018 by PushShift.
- Ground Truth: Please refer to The Online Slang Dictionary.
Sample:
- Raw Text Corpus: we provide a sample dataset, data/sample.txt, for the readers to run the code.
- Ground Truth: same as the Drug dataset (see data/euphemism_answer_drug.txt and data/target_keywords_drug.txt).
- This Sample dataset is only for you to play with the code and it does not represent any reliable results.
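Before running the code on the sample data, it can help to confirm that the files named above are in place. The following is a minimal sketch and is not part of the repository: the helper name missing_data_files and its repo_root parameter are assumptions made for illustration, while the three paths are the ones listed above.

```python
from pathlib import Path

# Paths taken from the dataset description above.
SAMPLE_FILES = (
    "data/sample.txt",
    "data/euphemism_answer_drug.txt",
    "data/target_keywords_drug.txt",
)


def missing_data_files(repo_root=".", required=SAMPLE_FILES):
    """Return the required files that are absent under repo_root.

    Hypothetical helper for illustration; it only checks that the files
    exist and makes no assumption about their contents.
    """
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All sample data files found.")
```

Run it from the repository root; an empty result means the sample run has the inputs described in this section.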
Please refer to this link from Hugging Face to fine-tune a BERT on a raw text corpus.
You may download our pre-trained BERT model on the reddit text corpus (from the Drug dataset) here. Please unzip it and put it under data/.
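If you prefer to unpack the archive from Python rather than with a desktop tool, the step above can be sketched as follows. This is only an illustration using the standard zipfile module: the function name extract_model is hypothetical, and the archive file name in the usage comment is a placeholder, since the actual name depends on the file you download.

```python
import zipfile
from pathlib import Path


def extract_model(archive_path, dest_dir="data"):
    """Extract a downloaded .zip archive into dest_dir.

    Hypothetical helper; returns the list of member names in the archive
    so you can see what was unpacked.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example usage (placeholder archive name; substitute your downloaded file):
# extract_model("downloaded_model.zip", "data")
```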
python ./Main.py --dataset sample --target drug
You may find other tunable arguments (c1, c2 and coarse) to specify different classifiers for euphemism identification. Please go to Main.py to find out their meanings.
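For scripted or repeated runs, the command above can be assembled programmatically. The sketch below is an assumption-laden illustration, not part of the repository: build_command and run_main are hypothetical helpers, and passing the extra arguments as --c1, --c2 and --coarse is assumed from the argument names above. Their accepted values are not listed here, so check Main.py before supplying any.

```python
import subprocess
import sys


def build_command(dataset, target, extra_args=None):
    """Build the argument list for Main.py (hypothetical helper).

    extra_args maps argument names (e.g. "c1") to values; each entry is
    appended as "--name value". The "--name" spelling is an assumption.
    """
    cmd = [sys.executable, "./Main.py", "--dataset", dataset, "--target", target]
    for name, value in (extra_args or {}).items():
        cmd += [f"--{name}", str(value)]
    return cmd


def run_main(dataset, target, extra_args=None):
    """Run Main.py from the repository root and raise if it exits non-zero."""
    return subprocess.run(build_command(dataset, target, extra_args), check=True)


if __name__ == "__main__":
    # Equivalent to: python ./Main.py --dataset sample --target drug
    print(" ".join(build_command("sample", "drug")))
```

Keeping the command as a list rather than a single string avoids shell quoting issues when it is passed to subprocess.run.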
Please refer to baselines/README.md.
We use the code here for text classification in PyTorch.
@inproceedings{zhu2021selfsupervised,
title = {Self-Supervised Euphemism Detection and Identification for Content Moderation},
author = {Zhu, Wanzheng and Gong, Hongyu and Bansal, Rohan and Weinberg, Zachary and Christin, Nicolas and Fanti, Giulia and Bhat, Suma},
booktitle = {42nd IEEE Symposium on Security and Privacy},
year = {2021}
}