MultiLingual_Toxic_Comment_Classification

title	Toxic Comment Classifier
emoji	🌍
colorFrom	blue
colorTo	yellow
sdk	gradio
app_file	app.py
pinned	false
license	apache-2.0

MultiLingual_Toxic_Comment_Classification

Problem Statement and Description

It only takes one toxic comment to sour an online discussion. Identifing the toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion, in an automated way using machine learning, at very early stage, is one way to protect voices in online conversations.

The MultiLingual Toxic Comment Classification project is a powerful tool for content platforms operating in various domains, including social media, news websites, forums, and on the internet. By leveraging the latest Natural Language Processing techniques, it automatically detects and filters toxic comments in multiple languages, addressing the pressing issues of cyberbullying, hate speech and harassment.

Dataset Description

The dataset has been taken from Kaggle competition organized by the Jigsaw/Conversation AI team. The dataset contains several files:

jigsaw-toxic-comment-train.csv - data from our first competition. The dataset is made up of English comments from Wikipedia’s talk page edits.
jigsaw-unintended-bias-train.csv - data from our second competition. This is an expanded version of the Civil Comments dataset with a range of additional labels.
sample_submission.csv - a sample submission file in the correct format.
test.csv - comments from Wikipedia talk pages in different non-English languages (Spanish, Portuguese, Italian, Turkish, French, Russian).
validation.csv - comments from Wikipedia talk pages in different non-English languages.

Columns

id - identifier within each file.
comment_text - the text of the comment to be classified.
lang - the language of the comment.
toxic - whether or not the comment is classified as toxic. (Does not exist in test.csv.)

Here, we're predicting the probability that a comment is toxic. A toxic comment would receive a 1.0. A benign, non-toxic comment would receive a0.0. In the test set, all comments are classified as either a 1.0 or 0.0.

Repository Structure

The project has the following structure:

input: This repository would contain the all the csv files mentioned above which can be downloaded from here and put them in it.
demos/multilingual_toxic_comment_files: This repository would contain all the necessary files to build our Gradio application to deploy on HuggingFace Spaces. All the files required to deploy on HuggingFace Spaces are here. Note: Ignore git related files and Multlingual_toxic_comment_classifier/ repository contains our model binary files.
notebooks: This repository contains jupyter notebooks for all the experiments we ran for model building. Note: To train the models we've used Kaggle TPUs and Kaggle Kernels. The final/chosen model is in Multilingual_Toxic_Comment_Classification_Final.ipynb notebook and this notebook maynot render many times so may have to download locally this repository to view it.
output/working - This repository contains the saved files while training the model on Kaggle kernel.

Deployment

The application is deployed on HuggingFace Spaces using Gradio at Here.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
.github/workflows		.github/workflows
docs		docs
eda		eda
experiment_notebooks		experiment_notebooks
img		img
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.py		app.py
app2.py		app2.py
demo.py		demo.py
mkdocs.yml		mkdocs.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MultiLingual_Toxic_Comment_Classification

Problem Statement and Description

Dataset Description

Repository Structure

Deployment

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MultiLingual_Toxic_Comment_Classification

Problem Statement and Description

Dataset Description

Repository Structure

Deployment

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages