Llama Guard Toxicity Analysis

This repository contains the code for our experiments with the Llama-Guard-3 model on the multilingual toxicity detection task. In these experiments, we found that Llama-Guard-3 underperformed compared to the base Llama models given a simple toxicity detection prompt. The repository also analyzes results on 500 samples each from the toxic splits of popular English and multilingual toxicity datasets (see the References below).

Setting up your environment

This repository has only three dependencies. To install them, first create and activate a virtual environment (we recommend Python >= 3.10), then run:

pip install -r requirements.txt

This repository uses Together AI to query the Llama models, so make sure to set your TOGETHER_API_KEY environment variable:

export TOGETHER_API_KEY="your_key_here"
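
For reference, a single moderation query through the Together Python SDK looks roughly like the sketch below. The model identifier and prompt format here are assumptions for illustration and may differ from what this repository's scripts actually use:

```python
# Minimal sketch of one Llama-Guard-3 moderation call via the Together SDK.
# The model id below is an assumed Together identifier, not taken from this
# repository's code.
from together import Together

client = Together()  # picks up TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-Guard-3-8B",  # assumed model id
    messages=[{"role": "user", "content": "Text to classify for toxicity"}],
)
# Llama-Guard-3 answers with a verdict such as "safe", or "unsafe" followed
# by a violated-category code.
print(response.choices[0].message.content)
```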

Executing the code

To run all experiments, please execute the following command:

bash run.sh

This script will create an outputs/ directory containing all the output CSV results. It will also print the accuracy for each dataset as soon as that dataset finishes.
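
If you want to inspect an output file yourself, a minimal pandas sketch is shown below; the file and column names are hypothetical, so check the actual CSV headers in outputs/ first:

```python
# Recompute accuracy from one output CSV. "label" (gold) and "prediction"
# (model verdict) are hypothetical column names; adjust them to match the
# headers actually written by the scripts.
import pandas as pd

df = pd.read_csv("outputs/toxicchat.csv")  # hypothetical file name
accuracy = (df["label"] == df["prediction"]).mean()
print(f"accuracy: {accuracy:.3f}")
```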

Feel free to change the run.sh parameters. Use --help for more information on the available options.

Results

We observed that when Llama-Guard-3 and Llama-3.1-8B are used together, most of the heavy lifting for producing safe outputs is done by the base Llama-3.1-8B. From these results, we conclude that adding Llama-Guard-3 to the pipeline may be redundant.
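
For concreteness, the guarded pipeline referred to above behaves roughly like the sketch below, where Llama-Guard-3 screens the user input and the base model only answers inputs judged safe. This is a schematic reconstruction under assumed Together model identifiers, not the repository's exact code:

```python
# Schematic guard-then-generate pipeline (assumed structure and model ids).
from together import Together

client = Together()

def guarded_generate(user_input: str) -> str:
    # Step 1: screen the input with Llama-Guard-3. (The guard can also be
    # applied to the base model's response; omitted here for brevity.)
    verdict = client.chat.completions.create(
        model="meta-llama/Meta-Llama-Guard-3-8B",  # assumed model id
        messages=[{"role": "user", "content": user_input}],
    ).choices[0].message.content
    if verdict.strip().lower().startswith("unsafe"):
        return "Request refused by the guard model."
    # Step 2: only inputs judged safe reach the base model.
    return client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # assumed id
        messages=[{"role": "user", "content": user_input}],
    ).choices[0].message.content
```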

References

  1. Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., ... & Yang, Y. (2024). BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36.
  2. Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., ... & Shao, J. (2024). SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. Findings of ACL 2024. https://aclanthology.org/2024.findings-acl.235/
  3. Lin, Z., Wang, Z., Tong, Y., Wang, Y., Guo, Y., Wang, Y., & Shang, J. (2023). ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. Findings of EMNLP 2023. https://aclanthology.org/2023.findings-emnlp.311/
  4. Kluge, N. (2022). Nkluge-correa/Aira-EXPERT: release v.01. Zenodo.
  5. cjadams, Sorensen, J., Elliott, J., Dixon, L., McDonald, M., nithum, & Cukierski, W. (2017). Toxic Comment Classification Challenge. Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
  6. Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., & Hovy, D. (2023). XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.
  7. Tonneau, M., Liu, D., Fraiberger, S., Schroeder, R., Hale, S. A., & Röttger, P. (2024). From languages to geographies: Towards evaluating cultural bias in hate speech datasets. arXiv preprint arXiv:2404.17874.
  8. Sirihattasak, S., Komachi, M., & Ishikawa, H. (2018). Annotation and classification of toxicity for Thai Twitter. In TA-COS 2018: 2nd Workshop on Text Analytics for Cybersecurity and Online Safety.
  9. Çöltekin, Ç. (2020). A corpus of Turkish offensive language on social media. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 6174–6184).
  10. Mayda, İ., Demir, Y. E., Dalyan, T., & Diri, B. (2021). Hate speech dataset from Turkish tweets. In Innovations in Intelligent Systems and Applications Conference (ASYU) (pp. 1–6). doi:10.1109/ASYU52992.2021.9599042.
  11. Ozler, K. B. (2020). 5k Turkish tweets with incivil content. Kaggle. https://www.kaggle.com/datasets/kbulutozler/5k-turkish-tweets-with-incivil-content
  12. Overfit-GM/turkish-toxic-language. Datasets at Hugging Face. https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language
  13. Dementieva, D., Khylenko, V., Babakov, N., & Groh, G. (2024). Toxicity classification in Ukrainian. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024) (pp. 244–255). Association for Computational Linguistics.
