Indonesian Twitter Hate Speech Classification

Dataset

This project uses the Dataset for Hate Speech Detection in Indonesian (https://github.com/ialfina/id-hatespeech-detection).

Data Format

The dataset has two columns: label and tweet. It contains 713 tweets in Indonesian, labeled as follows:

  • Non_HS for "non-hate-speech" tweet (453).
  • HS for "hate-speech" tweet (260).
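As a minimal sketch of reading a label/tweet file and counting the labels with pandas: the tab separator, the column names `Label` and `Tweet`, and the three sample rows below are assumptions for illustration only, not the actual file contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for the dataset file (assumed layout:
# tab-separated, one header row). The real file has 713 tweets.
sample = io.StringIO(
    "Label\tTweet\n"
    "Non_HS\tcontoh tweet biasa\n"
    "HS\tcontoh tweet kasar\n"
    "Non_HS\tcontoh tweet lain\n"
)

df = pd.read_csv(sample, sep="\t")

# Count how many tweets carry each label.
counts = df["Label"].value_counts()
print(counts)
```

With the real file, replace `sample` with the path to the downloaded dataset and adjust the separator and column names to match it.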

Requirements

  • Python 3.7 or above
  • Modules:
    • pandas
    • numpy
    • seaborn
    • matplotlib
    • re (standard library)
    • TextBlob
    • nltk (stopwords)
    • Sastrawi (StemmerFactory)
    • sklearn (train_test_split)

Exploratory Data Analysis

The label distribution:
image

Histogram of the tweet text length:
image

Data Preprocessing

1. Oversampling

Because the dataset is imbalanced (453 Non_HS versus 260 HS), we oversample to create a balanced dataset. The resulting label distribution:
image
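One common way to balance the classes is random oversampling of the minority class with replacement. The sketch below assumes that approach and uses `sklearn.utils.resample`; the README does not say which oversampling method was actually used, and the toy DataFrame and column names are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 5 Non_HS rows and 3 HS rows.
df = pd.DataFrame({
    "Label": ["Non_HS"] * 5 + ["HS"] * 3,
    "Tweet": [f"tweet {i}" for i in range(8)],
})

majority = df[df["Label"] == "Non_HS"]
minority = df[df["Label"] == "HS"]

# Draw minority rows with replacement until both classes have equal size.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

# Combine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
balanced_counts = balanced["Label"].value_counts()
print(balanced_counts)
```

Applied to the full dataset under the same assumption, the 260 HS tweets would be resampled up to 453, giving 906 rows in total.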

2. Tokenizing, Filtering and Stemming

  • Tokenizing: split each tweet into a list of words and remove punctuation.
  • Filtering: remove stopwords and words that contain unusual symbols.
  • Stemming: reduce each remaining word to its Indonesian root form.
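The three steps above can be sketched as a single function. This is an illustration under assumptions, not the project's actual code: the `preprocess` helper and the tiny `STOPWORDS` set are hypothetical (in practice the list could come from `nltk.corpus.stopwords`), and the Sastrawi stemmer is used only if the package is installed, otherwise words pass through unchanged.

```python
import re

try:
    # Sastrawi provides an Indonesian stemmer.
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

    stem_word = StemmerFactory().create_stemmer().stem
except ImportError:
    # Fallback so the sketch still runs without Sastrawi: no stemming.
    def stem_word(word):
        return word

# Hypothetical, deliberately tiny stopword set for illustration.
STOPWORDS = {"dan", "ke", "di", "yang"}


def preprocess(text, stopwords=STOPWORDS, stem=stem_word):
    """Tokenize, filter, and stem one tweet; returns a list of tokens."""
    # Tokenizing: lowercase, turn punctuation into spaces, split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    # Filtering: keep purely alphabetic tokens that are not stopwords.
    tokens = [t for t in tokens if t.isalpha() and t not in stopwords]
    # Stemming: map each token to its root form.
    return [stem(t) for t in tokens]


print(preprocess("Dia dan mereka pergi ke pasar!!!"))
```

Passing `stem` as a parameter keeps the function easy to test and makes it simple to swap in a different stemmer.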

Split Data

For splitting data, we use train_test_split:

  • Train data = 80%
  • Test data = 20%
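A minimal sketch of the 80/20 split with `train_test_split`. The toy texts and labels are hypothetical, and `random_state` and `stratify` are assumptions added for reproducibility and to keep the class ratio the same in both parts; the README does not state whether they were used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the preprocessed tweets.
texts = [f"tweet {i}" for i in range(10)]
labels = ["HS", "Non_HS"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # 20% test, 80% train
    random_state=42,    # assumption: fixed seed for reproducibility
    stratify=labels,    # assumption: preserve the label ratio
)

print(len(X_train), len(X_test))
```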

Classification Model

We use three classifiers:

  • Random Forest
  • Multinomial Naive Bayes
  • K Nearest Neighbors
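The three classifiers can be set up side by side in scikit-learn as sketched below. The bag-of-words features (`CountVectorizer`), the hyperparameters, and the toy training data are assumptions for illustration; the README does not specify the feature extraction or settings that were used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data.
train_texts = [
    "kamu bodoh sekali", "dasar bodoh", "benci kamu",
    "selamat pagi semua", "terima kasih banyak", "pagi yang cerah",
]
train_labels = ["HS", "HS", "HS", "Non_HS", "Non_HS", "Non_HS"]

# Each pipeline turns raw text into word counts, then classifies.
models = {
    "Random Forest": make_pipeline(
        CountVectorizer(), RandomForestClassifier(random_state=42)
    ),
    "Multinomial NB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "KNN": make_pipeline(
        CountVectorizer(), KNeighborsClassifier(n_neighbors=3)
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["selamat pagi"])[0]

print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned from the training data only.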

The Results

Classifier        Macro F1   Accuracy   Recall
Random Forest     0.93       0.93       0.94
Multinomial NB    0.85       0.85       0.86
KNN               0.81       0.81       0.83
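The three metrics in the table can be computed with `sklearn.metrics`, as in the sketch below. The labels and predictions are hypothetical, and macro averaging for recall is an assumption, since the README does not say how recall was averaged.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical true labels and predictions for five test tweets.
y_true = ["HS", "HS", "Non_HS", "Non_HS", "Non_HS"]
y_pred = ["HS", "Non_HS", "Non_HS", "Non_HS", "Non_HS"]

# Macro averaging gives each class equal weight regardless of its size.
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"Macro F1: {macro_f1:.2f}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {macro_recall:.2f}")
```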

References

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata, "Hate Speech Detection in Indonesian Language: A Dataset and Preliminary Study", in Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems (ICACSIS 2017), 2017.