This repository contains the code for our experiments with the Llama-Guard-3 model on multilingual toxicity detection. In our experiments, Llama-Guard-3 underperformed compared to the base models given a simple toxicity detection prompt. The repository further analyzes the results on 500 samples each from the toxic splits of popular English and multilingual toxicity datasets.
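For illustration, the sketch below shows one way 500 examples could be drawn from the toxic split of one of the datasets listed at the end of this README (here ToxicChat); the dataset config and column names are assumptions and may not match the repository's actual loading code.

```python
# Hypothetical sketch: draw 500 samples from a toxic split.
# The dataset id, config, and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")
toxic = ds.filter(lambda row: row["toxicity"] == 1)   # keep only the toxic examples
sample = toxic.shuffle(seed=42).select(range(500))    # fixed seed for reproducibility
print(sample[0]["user_input"])
```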
This repository has only three dependencies. To install them, first create a virtual environment (we recommend Python >= 3.10) and activate it, then run the following command:
pip install -r requirements.txt
This repository uses Together AI to query the Llama models, so please make sure your TOGETHER_API_KEY is set:
export TOGETHER_API_KEY="your_key_here"
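As a minimal sketch (not the exact code used by this repository), a Llama model on Together AI can be queried with a simple toxicity detection prompt roughly as follows; the model id and prompt wording here are illustrative assumptions:

```python
# Sketch: query a Llama model on Together AI with a simple toxicity prompt.
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def is_toxic(text: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # assumed model id
        messages=[
            {"role": "user",
             "content": f"Is the following text toxic? Answer 'toxic' or 'non-toxic'.\n\n{text}"},
        ],
        temperature=0.0,  # deterministic classification-style output
    )
    return response.choices[0].message.content.strip()

print(is_toxic("You are a wonderful person."))
```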
To run all experiments, please execute the following command:
bash run.sh
This script will create an outputs/ directory containing all of the output CSV results. The script also prints the accuracy for each dataset as soon as that dataset's run finishes.
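If you want to recompute the numbers yourself, a small sketch like the one below works; the file name and column names (label, prediction) are assumptions about the CSV layout and should be adjusted to match the actual files in outputs/:

```python
# Sketch: recompute accuracy from an output CSV (file and column names assumed).
import pandas as pd

df = pd.read_csv("outputs/toxic_chat.csv")
accuracy = (df["prediction"] == df["label"]).mean()
print(f"Accuracy: {accuracy:.3f}")
```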
Feel free to change the run.sh parameters. You can use --help for more information on the available options.
We observed that when Llama-Guard-3 and Llama-3.1-8B are used together, most of the heavy lifting for producing safe outputs is done by the base Llama-3.1-8B. Based on our results, we conclude that adding Llama-Guard-3 to the pipeline may be redundant.
- Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., ... & Yang, Y. (2024). BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36.
- Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., ... & Shao, J. (2024). SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. https://aclanthology.org/2024.findings-acl.235.
- Lin, Z., Wang, Z., Tong, Y., Wang, Y., Guo, Y., Wang, Y., & Shang, J. (2023). ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. https://aclanthology.org/2023.findings-emnlp.311/.
- Kluge, N. (2022). Nkluge-correa/Aira-EXPERT: release v.01. Zenodo.
- cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum, Will Cukierski. (2017). Toxic Comment Classification Challenge. Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
- Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., & Hovy, D. (2023). XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.
- Tonneau, M., Liu, D., Fraiberger, S., Schroeder, R., Hale, S. A., & Röttger, P. (2024). From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets. arXiv preprint arXiv:2404.17874.
- Sirihattasak, S., Komachi, M., & Ishikawa, H. (2018, May). Annotation and classification of toxicity for Thai Twitter. In TA-COS 2018: 2nd Workshop on Text Analytics for Cybersecurity and Online Safety (p. 1).
- Çöltekin, Ç. (2020, May). A corpus of Turkish offensive language on social media. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 6174-6184).
- Mayda, İ., Demir, Y. E., Dalyan, T., & Diri, B. (2021). Hate speech dataset from Turkish tweets. In 2021 Innovations in Intelligent Systems and Applications Conference (ASYU), Elazig, Turkey (pp. 1-6). doi:10.1109/ASYU52992.2021.9599042.
- Ozler, K. B. (2020). 5k Turkish tweets with incivil content. Kaggle. https://www.kaggle.com/datasets/kbulutozler/5k-turkish-tweets-with-incivil-content
- Overfit-GM/turkish-toxic-language. (n.d.). Datasets at Hugging Face. https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language
- Dementieva, D., Khylenko, V., Babakov, N., & Groh, G. (2024). Toxicity classification in Ukrainian. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024) (pp. 244-255). Mexico City, Mexico: Association for Computational Linguistics.