Self-Supervised Euphemism Detection and Identification for Content Moderation, IEEE S&P (Oakland) 2021


Python 3.7 License: MIT

Self-Supervised Euphemism Detection and Identification for Content Moderation

This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content Moderation (42nd IEEE Symposium on Security and Privacy 2021).

Introduction

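This project addresses two tasks: euphemism detection (finding words or phrases that are used as euphemisms for a given list of target keywords) and euphemism identification (determining which target keyword each euphemism refers to).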

Requirements

The code targets Python 3.7. Install the dependencies as follows:

pip install -r requirements.txt

Data

Because of licensing restrictions, we do not distribute the datasets ourselves; instead, we direct readers to their respective sources.

Drug:

Weapon:

Sexuality:

Sample:

  • Raw Text Corpus: we provide a sample dataset data/sample.txt for the readers to run the code.
  • Ground Truth: same as the Drug dataset (see data/euphemism_answer_drug.txt and data/target_keywords_drug.txt).
  • The sample dataset is only for trying out the code; results obtained on it are not reliable.
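
Before running the full pipeline, it can help to check that a raw text corpus actually mentions the target keywords. The sketch below is not part of this repo: the function name count_keyword_mentions is hypothetical, and it assumes the corpus is plain UTF-8 text with one sentence or post per line, which may not match every dataset listed above.

```python
from collections import Counter
from pathlib import Path


def count_keyword_mentions(corpus_path, keywords):
    """Count how many lines of a plain-text corpus mention each keyword.

    Assumption (not taken from the repo): one sentence or post per line.
    Matching is case-insensitive on whitespace-separated tokens, which is
    a deliberate simplification and will miss punctuation-attached words.
    """
    wanted = {keyword.lower() for keyword in keywords}
    counts = Counter()
    with Path(corpus_path).open(encoding="utf-8") as handle:
        for line in handle:
            tokens = set(line.lower().split())
            # Each keyword is counted at most once per line.
            for keyword in wanted & tokens:
                counts[keyword] += 1
    return counts
```

For example, count_keyword_mentions("data/sample.txt", ["marijuana", "heroin"]) returns a Counter of line counts per keyword; the two keywords here are illustrative only and are not taken from data/target_keywords_drug.txt.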

Code

1. Fine-tune the BERT model.

Please refer to this link from Hugging Face to fine-tune a BERT model on a raw text corpus.

You may download our pre-trained BERT model, trained on the Reddit text corpus (from the Drug dataset), here. Please unzip it and put it under data/.

2. Euphemism Detection and Euphemism Identification

python ./Main.py --dataset sample --target drug  

Other tunable arguments (c1, c2, and coarse) select different classifiers for euphemism identification. See Main.py for their meanings.
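
If you want to launch runs from a script rather than the shell, a small wrapper along these lines may be convenient. This is a minimal sketch, not code from this repo: build_command and run are hypothetical helper names, and only the --dataset and --target flags shown in the command above are used.

```python
import subprocess
import sys


def build_command(dataset, target, script="./Main.py"):
    """Return the argument list for one Main.py run.

    sys.executable is used so the run happens in the same Python
    environment that has the requirements installed.
    """
    return [sys.executable, script, "--dataset", dataset, "--target", target]


def run(dataset, target):
    """Run Main.py once; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(dataset, target), check=True)
```

Calling run("sample", "drug") from the repository root is then equivalent to the shell command above.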

Baselines:

Please refer to baselines/README.md.

Acknowledgement

We use the code here for text classification in PyTorch.

Citation

@inproceedings{zhu2021selfsupervised,
    title = {Self-Supervised Euphemism Detection and Identification for Content Moderation},
    author = {Zhu, Wanzheng and Gong, Hongyu and Bansal, Rohan and Weinberg, Zachary and Christin, Nicolas and Fanti, Giulia and Bhat, Suma},
    booktitle = {42nd IEEE Symposium on Security and Privacy},
    year = {2021}
}
