This project detects hate speech: any kind of communication that disparages a person or a group of people because of characteristics such as religion, ethnicity, sexual orientation, or gender.

farazulhoda/hate-speech-detection

Automatic hate speech detection

Setup

  1. Install requirements
  • python3
  • Python packages: pandas, sklearn, fasttext, sqlalchemy, ...
  2. Configure collector
  • Edit hiit_collector.py.example and save it as hiit_collector.py
  3. Configure PostgreSQL
  • Edit postgre_keys.py.example and save it as postgre_keys.py
  4. Get the data

Usage:

Collect new data

usage:

`collector.py [-h] [--user USER] [--password PASSWORD] [--hostname HOSTNAME] [--outdir OUTDIR] [--startdate STARTDATE] [--enddate ENDDATE]`

optional arguments:

  • `-h, --help`: show this help message and exit
  • `--user USER`: Username
  • `--password PASSWORD`: Password
  • `--hostname HOSTNAME`: Hostname
  • `--outdir OUTDIR`: Directory to store data
  • `--startdate STARTDATE`: Startdate as YYYY-MM-DD
  • `--enddate ENDDATE`: Enddate as YYYY-MM-DD
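The help text above looks like standard `argparse` output. The following is a minimal sketch of a parser that would accept the same options, not the actual contents of `collector.py`. The `iso_date` helper, the `build_parser` function, and the check that the start date is not after the end date are assumptions added for illustration; the real script may keep the dates as plain strings.

```python
#!/usr/bin/env python3
"""Hypothetical parser mirroring the collector.py usage text."""
import argparse
from datetime import date, datetime


def iso_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; report bad input as an argparse error."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="collector.py")
    parser.add_argument("--user", help="Username")
    parser.add_argument("--password", help="Password")
    parser.add_argument("--hostname", help="Hostname")
    parser.add_argument("--outdir", help="Directory to store data")
    parser.add_argument("--startdate", type=iso_date, help="Startdate as YYYY-MM-DD")
    parser.add_argument("--enddate", type=iso_date, help="Enddate as YYYY-MM-DD")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Assumed sanity check: an inverted date range is rejected early.
    if args.startdate and args.enddate and args.startdate > args.enddate:
        raise SystemExit("--startdate must not be after --enddate")
    print(args)
```

Parsing the dates at the command-line boundary means a malformed value such as `2017-3-1x` fails with a usage error before any data is collected.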

Example:

./collector.py --startdate 2017-03-01 --enddate 2017-03-15

Train predictor

Example:

./predict.py --inputdir data/incoming --outdir data/output/ --featurename bow --featurefile data/models/feature_extractor_bow.pkl --predictor data/models/fasttext_svm.pkl

Predict hate speech

Example:

./predict.py --inputdir data/incoming --outdir data/output/ --featurename bow --featurefile data/models/feature_extractor_bow.pkl --predictor data/models/bow_svm.pkl
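The file names in the command above suggest a bag-of-words (BoW) feature extractor followed by an SVM classifier. As a rough illustration of that combination, here is a minimal scikit-learn sketch. It is not the code in `predict.py`: the toy sentences and labels are made up, and chaining `CountVectorizer` and `LinearSVC` in a single pipeline is an assumption, since the repository appears to store the feature extractor and the predictor as separate pickle files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy examples for illustration only (1 = hateful, 0 = neutral).
texts = [
    "I hate those people",
    "those people are disgusting",
    "get rid of those people",
    "what a lovely morning",
    "the weather is lovely today",
    "I enjoyed the morning walk",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each text into word counts (the BoW features);
# LinearSVC fits a linear SVM on those counts.
model = make_pipeline(CountVectorizer(lowercase=True), LinearSVC())
model.fit(texts, labels)

# Predict labels for unseen texts; each prediction is 0 or 1.
predictions = model.predict(["those people are awful", "lovely walk today"])
print(predictions)
```

A real model would need a labelled corpus far larger than this, and the reported quality would depend on the data and the SVM settings.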

Sync data

Example:

./sync.py --inputdir data/output/

TODO

  1. CNN on embedding matrix (cf. Willi)
  2. Stemming and stop words for BoW
  3. Study SVM factors (with BoW)
  4. Mezadona ? To Models
  5. Plot t-SNE manifolds for the Wikipedia model and the Twitter model
  • Highlight hate words

DONE:

  1. Try Naive Bayes classifier with BoW
  • Gaussian Naive Bayes performed comparably to RF, but worse than SVM
  • With FastText, it performed poorly

LINK
