Note
This is an extension of the speech classifier program developed by Thomas Davidson et al. As Thomas Davidson's repository is no longer maintained, we decided to create our own, add modifications, and test it with different datasets, including speech data from Berkeley.
There are 7 files in the "speech_classifier" folder:
- generate_group_csv.py
- count_groups.py
- generate_trained_model.ipynb
- speech_classifier.py
- generate_cv_data.py
- run_cross_validation.py
- run_all_scenarios.py
- The "data" folder referenced below is available in data.zip v1.0.0
- Note that the files in the input folder are aligned with TDavidson's original analysis dataset (i.e., labeled_data.csv). In addition, the data entries in these files are labeled with target groups.
All .py files can be executed from the speech_classifier folder with a simple command like this:
python program_name.py
The first program: generate_group_csv.py reads the speech data available in the "data" folder (provided by Berkeley) and selects the desired number of entries for each targeted group to be analyzed. Different scenarios can be created for testing by altering the number of entries per targeted group and the "data_name" variable specified in the script. For simplicity in our description, we will use the "balanced" data name as an example; a sketch of the selection step follows the input/output list below.
- Input: ../data/berkeley_speech_dataset.csv
- Output: ../data/balanced_dataset.csv
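For illustration, here is a minimal sketch of what the selection step could look like, assuming one boolean column per target group in the Berkeley dataset; the column names (e.g. `target_black`), the example counts, and the sampling call are assumptions, not the script's exact code:

```python
# Hypothetical sketch of the selection step in generate_group_csv.py.
# Column names like "target_black" are assumptions about the Berkeley data.
import pandas as pd

data_name = "balanced"
# Desired number of entries per targeted group for this scenario.
group_counts = {"black": 3300, "women": 3300, "trans": 2800, "gay": 100, "lesbian": 500}

df = pd.read_csv("../data/berkeley_speech_dataset.csv")

frames = []
for group, n in group_counts.items():
    subset = df[df[f"target_{group}"] == 1]
    # Sample at most n entries for this group (fewer if the group is small).
    frames.append(subset.sample(n=min(n, len(subset)), random_state=42))

pd.concat(frames).drop_duplicates().to_csv(f"../data/{data_name}_dataset.csv", index=False)
```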
The second program: count_groups.py is a simple script that prints the actual count of each targeted group resulting from the first program.
Because entries can belong to more than one targeted group, it is useful to obtain the percentage that each targeted group represents in the generated file; a sketch follows the input/output list below.
- Input: ../data/balanced_dataset.csv
- Output: ../output/balanced_groups_counts.txt
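A sketch of the counting step, under the same assumed column layout as above (the actual script may compute the counts differently):

```python
# Hypothetical sketch of count_groups.py; column names are assumptions.
import pandas as pd

df = pd.read_csv("../data/balanced_dataset.csv")
groups = ["black", "women", "trans", "gay", "lesbian"]

with open("../output/balanced_groups_counts.txt", "w") as out:
    for group in groups:
        count = int(df[f"target_{group}"].sum())
        # Percentages can sum to more than 100% because a single entry
        # may belong to several targeted groups at once.
        out.write(f"{group}: {count} ({100 * count / len(df):.1f}%)\n")
```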
The third program: generate_trained_model.ipynb is a Jupyter notebook used to generate the trained model that needs to be passed into the speech classifier program.
This notebook produces 5 files with a .pkl extension, which need to be passed to speech_classifier.py. A simplified sketch of the training and pickling step follows the output list below.
- Data folder: should be placed in the top-level (root) directory
- Input: ../data/balanced_dataset.csv
- Output:
- ../data/balanced_model.pkl
- ../data/balanced_tfidf.pkl
- ../data/balanced_idf.pkl
- ../data/balanced_pos.pkl
- ../data/balanced_oth.pkl
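As a rough sketch of the training and pickling step: the pipeline shown here (TF-IDF plus logistic regression, in the spirit of Davidson's approach) is a simplified stand-in, and the "text"/"label" column names as well as the contents of the POS and "oth" pickles are assumptions:

```python
# Simplified sketch of what generate_trained_model.ipynb produces; the real
# notebook also builds the POS-tag and "other feature" pickles (omitted here).
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

data_name = "balanced"
df = pd.read_csv(f"../data/{data_name}_dataset.csv")

# Fit a TF-IDF vectorizer and a classifier; column names are assumptions.
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(df["text"])
model = LogisticRegression(max_iter=1000).fit(X, df["label"])

# speech_classifier.py expects five .pkl files per data name.
for suffix, obj in {"model": model, "tfidf": tfidf, "idf": tfidf.idf_}.items():
    with open(f"../data/{data_name}_{suffix}.pkl", "wb") as f:
        pickle.dump(obj, f)
```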
The file: speech_classifier.py is the actual speech classifier program. It analyzes all the speech files in the input folder and reports the number of hate speech instances detected in the input files. It is currently also set to analyze a pre-labeled data file named "labeled_data.csv."
This file was created by TDavidson to test the program's performance. We use this same file to determine the accuracy, precision, recall, and F1 score of our program.
Trained model files:
- ../data/balanced_model.pkl
- ../data/balanced_tfidf.pkl
- ../data/balanced_idf.pkl
- ../data/balanced_pos.pkl
- ../data/balanced_oth.pkl
- Input CSV files to be analyzed: all files located in ../input
- Output: The output is located in ../output. The program will produce one output file for every file it finds in the input directory, listing the predicted class for each tweet/text within each file.
- It will also generate two PDF files and two TXT files. The PDF files are named "original_hate_vs_balanced_hate.pdf" and "original_hate+offensive_vs_balanced_hate.pdf" and contain confusion matrices based on the analysis of the original analysis data set "labeled_data.csv." The TXT files are named "original_hate_vs_balanced_hate.txt" and "original_hate+offensive_vs_balanced_hate.txt" and contain quality scores of the classifier program based on the analysis of the input CSV files.
We concentrate on just two classes, "hate" and "not hate", in our program (and designed our training dataset accordingly). Therefore, we produce the file "original_hate_vs_balanced_hate.pdf" by treating all "Offensive" class instances in "labeled_data.csv" as "not hate", so any of them classified as hate counts as a misclassification. The second file, "original_hate+offensive_vs_balanced_hate.pdf", treats all "Hate" and "Offensive" class instances alike as "Hate", resulting in higher accuracy. The sketch below illustrates the two mappings.
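To make the two mappings concrete, here is a small sketch; the three-class encoding (0 = hate, 1 = offensive, 2 = neither) follows Davidson's labeled_data.csv, and the example label values are made up:

```python
# Sketch of the two evaluation mappings used for the confusion matrices.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 2, 0, 2]  # original three-class labels (example values)
y_pred = [1, 1, 0, 0, 1, 0]  # our classifier's binary output: 1 = hate

def to_binary(labels, offensive_is_hate):
    # Strict mode ("original_hate_vs_balanced_hate"): only class 0 is hate,
    # so any "offensive" tweet predicted as hate counts as a misclassification.
    # Lenient mode ("original_hate+offensive_vs_balanced_hate"): classes 0
    # and 1 are merged into "hate", which yields the higher accuracy.
    hate_classes = {0, 1} if offensive_is_hate else {0}
    return [1 if y in hate_classes else 0 for y in labels]

cm_strict = confusion_matrix(to_binary(y_true, False), y_pred)
cm_lenient = confusion_matrix(to_binary(y_true, True), y_pred)
```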
The file run_all_scenarios.py automates the execution of speech_classifier.py for the 4 pre-configured scenarios: black, women, lgbt, and balanced. This program outputs a table in CSV format containing the quality scores of these scenarios; a minimal sketch follows the input/output list below.
- Input: speech_classifier.py
- Output: ../output/full_table.csv
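A minimal sketch of the automation; how the scenario name is actually passed to speech_classifier.py is an assumption (shown here as an environment variable):

```python
# Hypothetical sketch of run_all_scenarios.py: run the classifier once per
# pre-configured scenario; the real program then compiles the quality
# scores into ../output/full_table.csv.
import os
import subprocess

for scenario in ["black", "women", "lgbt", "balanced"]:
    subprocess.run(
        ["python", "speech_classifier.py"],
        env={**os.environ, "DATA_NAME": scenario},  # selection mechanism assumed
        check=True,
    )
```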
Cross-validation: There are two files, generate_cv_data.py and run_cross_validation.py.
- These two files are used to perform k-fold cross-validation.
- Currently, k is set to 5, but users can change it to run the cross-validation with a different k.
The steps are:
- python generate_group_csv.py
- python generate_cv_data.py
- python run_cross_validation.py
The first program: generate_group_csv.py is used to produce training data with the desired number of entries for each targeted group. This is the same program as before.
- Input: ../data/berkeley_speech_dataset.csv
- Output: ../data/balanced_dataset.csv
The second program: generate_cv_data.py is used to split the file into k pieces in preparation for k-fold cross-validation: in each fold, one piece serves as the analysis dataset and the remaining pieces are combined into the training dataset. This is repeated k times so that each piece gets its turn as the analysis dataset. The program also generates the .pkl files needed for each training dataset; a sketch of the split follows the output list below.
- Input: ../data/balanced_dataset.csv
- Output:
- Analysis sets: ../cv_data/balanced_cvanalysis_fold1.csv, ../cv_data/balanced_cvanalysis_fold2.csv, ..., ../cv_data/balanced_cvanalysis_foldk.csv
- Training sets: balanced_cvtrain_fold1.csv, balanced_cvtrain_fold2.csv, ..., balanced_cvtrain_foldk.csv
- Pkl files: balanced_cvtrain_fold1_idf.pkl, balanced_cvtrain_fold1_model.pkl, balanced_cvtrain_fold1_oth.pkl, balanced_cvtrain_fold1_pos.pkl, balanced_cvtrain_fold1_tfidf.pkl, ..., balanced_cvtrain_foldk_tfidf.pkl
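A minimal sketch of the split, using scikit-learn's KFold as a stand-in for the actual splitting logic (the real script also re-trains and pickles the model files for each fold, which is omitted here):

```python
# Sketch of the k-fold split in generate_cv_data.py.
import pandas as pd
from sklearn.model_selection import KFold

k = 5  # change this to perform cross-validation with a different k
df = pd.read_csv("../data/balanced_dataset.csv")

kf = KFold(n_splits=k, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(df), start=1):
    # One piece becomes the analysis set; the rest form the training set.
    df.iloc[test_idx].to_csv(f"../cv_data/balanced_cvanalysis_fold{fold}.csv", index=False)
    df.iloc[train_idx].to_csv(f"../cv_data/balanced_cvtrain_fold{fold}.csv", index=False)
```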
Program for running the actual cross-validation:
run_cross_validation.py will run through all the analysis sets and training sets for each fold and generate quality scores for each fold. These quality scores include accuracy, precision, recall, and F1 score. A sketch of the scoring step follows the input/output list below.
- Input: "Analysis sets" and "Training sets" generated by generate_cv_data.py
- Output: ../cv_output/balanced_cvresults_fold1.txt, ../cv_output/balanced_cvresults_fold2.txt, ..., ../cv_output/balanced_cvresults_foldk.txt
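The per-fold scoring could look like the following sketch (assuming binary labels with 1 = hate; the actual result-file format may differ):

```python
# Sketch of the per-fold scoring in run_cross_validation.py.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def write_fold_results(fold, y_true, y_pred):
    # Write one quality-score file per fold, matching the output paths above.
    with open(f"../cv_output/balanced_cvresults_fold{fold}.txt", "w") as f:
        f.write(f"accuracy:  {accuracy_score(y_true, y_pred):.3f}\n")
        f.write(f"precision: {precision_score(y_true, y_pred):.3f}\n")
        f.write(f"recall:    {recall_score(y_true, y_pred):.3f}\n")
        f.write(f"f1:        {f1_score(y_true, y_pred):.3f}\n")
```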
Important
Data Used
- Our program uses the newest Python version (3.11 at the time of our testing), an update from version 2.7 used in the original TDavidson run. We obtained the trained data files (the .pkl files) from TDavidson's repository and repickled them so they can be used in our program. These files are prefixed with "original_" and located in the data folder. Unfortunately, these files are trained models, and the CSV files used to generate them aren't available, so it is not possible to modify them.
TDavidson also uses an analysis set named "labeled_data.csv," which is a set of tweets with manually labeled classes ("Hate," "Offensive," or "Neither").
Important
In order to test with different data, we downloaded a new set of data from Berkeley researchers (link), named it "berkeley_speech_dataset.csv", and placed it in the data folder as well.
We use the "berkeley_speech_dataset.csv" to create three different scenarios to observe the effectiveness of this speech classifier program.
- Scenario 1: CSV files with tweets targeting mostly Black people
- Scenario 2: CSV files with tweets targeting mostly women
- Scenario 3: CSV files with tweets targeting mostly LGBT people
- Scenario 4: CSV files with tweets targeting a balanced mix of groups (Black, women, and LGBT)
Note
Different scenarios can be created by setting different numbers when running the program "generate_group_csv.py". Here are our configurations:
Scenario | Configuration |
---|---|
Scenario 1 (black) | 9000 black, 500 women, 200 trans, 150 gay, 150 lesbian |
Scenario 2 (women) | 9000 women, 500 black, 200 trans, 150 gay, 150 lesbian |
Scenario 3 (lgbt) | 15000 LGBT, 3300 black, 3300 women |
Scenario 4 (balanced) | 3300 black, 3300 women, 2800 trans, 100 gay, 500 lesbian |
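The same configurations, expressed as the kind of per-group counts generate_group_csv.py works with (the dictionary form and variable name are illustrative, not the script's exact interface):

```python
# The four configurations from the table above.
scenario_configs = {
    "black":    {"black": 9000, "women": 500, "trans": 200, "gay": 150, "lesbian": 150},
    "women":    {"women": 9000, "black": 500, "trans": 200, "gay": 150, "lesbian": 150},
    "lgbt":     {"lgbt": 15000, "black": 3300, "women": 3300},
    "balanced": {"black": 3300, "women": 3300, "trans": 2800, "gay": 100, "lesbian": 500},
}
```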
Results
The results table below summarizes the performance metrics obtained after running cross-validation on different target groups. Execute run_all_scenarios.py to get this table.
- Input: four sets of scenario training datasets, followed by the analysis datasets
- Output: the number of hate speech instances detected, among other things, plus a compiled table in CSV format for the four scenarios
Scenario | Target Group | Accuracy | Precision (Hate) | Recall (Hate) | F1 Score (Hate) |
---|---|---|---|---|---|
Black | Black | 67% | 91% | 65% | 76% |
Black | Women | 74% | 94% | 71% | 81% |
Black | LGBT | 84% | 95% | 86% | 90% |
Women | Black | 62% | 96% | 55% | 70% |
Women | Women | 75% | 96% | 71% | 81% |
Women | LGBT | 70% | 96% | 68% | 76% |
LGBT | Black | 70% | 87% | 72% | 79% |
LGBT | Women | 77% | 87% | 83% | 85% |
LGBT | LGBT | 87% | 92% | 93% | 93% |
Balanced | Black | 68% | 94% | 64% | 76% |
Balanced | Women | 83% | 94% | 84% | 89% |
Balanced | LGBT | 85% | 95% | 88% | 91% |