More information will be available shortly.
The rise of large language models has brought significant advances in natural language processing. However, these models can generate content that is hallucinatory or toxic. In response, this shared task on regulating large language models focuses on developing methods to detect and mitigate such undesirable outputs.
This shared task includes two tracks:
● Track 1 (Multimodal Hallucination Detection for Multimodal Large Language Models): Develop methods to identify and flag hallucinatory outputs that do not correspond to reality or to the given input context when dealing with multimodal prompts (text, images, etc.). This track involves creating detection algorithms that can distinguish accurate responses from hallucinated ones across different modalities, thereby ensuring the reliability of the model's outputs.
● Track 2 (Detoxifying Large Language Models): Design and implement strategies to prevent large language models from generating toxic content. This track focuses on developing filters, fine-tuning techniques, knowledge editing methods, or other mechanisms to recognize and suppress malicious responses before they reach the user. The goal is to maintain the utility and fluency of the model while ensuring that the content it produces adheres to community guidelines and ethical standards.
You can download the datasets via this link.
The expected structure of files is:
data
├── train.json # training dataset
├── val.json # validation dataset
└── test.json # test dataset which we will release in the future
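Before training, it can help to confirm that the local directory matches the layout above. The following is a minimal sketch; the helper name `check_layout` and the decision to treat `test.json` as optional (since it is released later) are assumptions made for illustration, not part of the official tooling.

```python
from pathlib import Path

# File names taken from the layout above. test.json is released later,
# so this sketch treats it as optional (an assumption).
REQUIRED = ("train.json", "val.json")
OPTIONAL = ("test.json",)


def check_layout(data_dir):
    """Return (missing_required, missing_optional) file names for data_dir."""
    root = Path(data_dir)
    missing_required = [name for name in REQUIRED if not (root / name).is_file()]
    missing_optional = [name for name in OPTIONAL if not (root / name).is_file()]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_layout("data")
    if required:
        print("Missing required files:", ", ".join(required))
    else:
        print("All required files found.")
    if optional:
        print("Not yet available:", ", ".join(optional))
```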
❗️❗️Data Utility Rules: Because the images come from open-source datasets, we do not provide the image data. You need to download MSCOCO-train2014, MSCOCO-val2014, TextVQA-train, and TextVQA-test yourself. For model training, only the data provided via this link (train.json and val.json) may be used as supervised data. test.json will be used to evaluate the hallucination detection model or pipeline.
Note: 1. Adding new data is not allowed. 2. The existing training data may be modified and further explored, for example by constructing preference pairs. 3. We recommend publishing the modified data after the competition ends.
For more information related to this dataset, please refer to our paper: Unified Hallucination Detection for Multimodal Large Language Models.
You can download the datasets via this link.
The expected structure of files is:
data
├── SafeEdit_train # training dataset
├── SafeEdit_val # validation dataset
├── SafeEdit_test_ALL # test dataset for Task 10 of NLPCC2024, which can be used to evaluate knowledge editing and traditional detoxification methods
└── data_used_for_analysis
    └── three_instances_for_editing # three instances for editing vanilla LLM via knowledge editing method
❗️❗️Data Utility Rules: For model training, only the data provided via this link (SafeEdit_train, SafeEdit_val, and three_instances_for_editing) may be used as supervised data. SafeEdit_test_ALL is used to evaluate models detoxified by the various detoxification methods. SafeEdit_test_ALL and any variations of it cannot be used during the training phase. Note that SafeEdit_test in this link must not be used at any stage of Task 10 of NLPCC 2024.
For more information related to this dataset, please refer to our paper: Detoxifying Large Language Models via Knowledge Editing. If there are any differences between the paper and this page, the content of this page should prevail.
We recommend using models with fewer hallucinations and better performance, such as LLaVA, DeepSeek-VL, and Qwen-VL. The evaluation metrics fall into two main categories: a rule-based metric and a rationality-based metric.
- Rule-based metric: macro-F1 is used as a coarse measure of hallucination detection performance.
- Rationality-based metric: when the averaged macro-F1 scores are similar, the reasonableness of the generated rationales is assessed, either manually or with GPT.
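To make the rule-based metric concrete, here is a minimal sketch of macro-F1, the unweighted mean of the per-class F1 scores. The label strings and the sample predictions are hypothetical, and the official scoring script may differ in details such as label names or the handling of missing predictions.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    if not labels:
        raise ValueError("no labels to score")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical claim-level labels, for illustration only.
gold = ["hallucination", "non-hallucination", "non-hallucination", "hallucination"]
pred = ["hallucination", "non-hallucination", "hallucination", "hallucination"]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

In this example the per-class F1 scores are 0.8 and about 0.667, so their mean is about 0.733.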
Please select LLaMA2-7B-Chat as the vanilla Large Language Model. Track 2 aims to enhance its security defense against malicious inputs. The evaluation metrics include two main categories: detoxification performance and side effects.
- Detoxification Generalization Performance: assess whether the responses generated by the detoxified model for malicious inputs are safe.
  - DGonlyQ: the detoxification success rate for unseen harmful questions.
  - DGotherAQ: the detoxification success rate for unseen attack prompts and harmful questions.
❗️❗️ Please set max_new_tokens=600 for the responses generated by the detoxified model for malicious inputs in the test dataset.
- Side Effects: evaluate the fluency of the responses generated by the detoxified model for malicious inputs, as well as the capability of the detoxified model on some general tasks (harmless user queries).
  - Fluency: the fluency of the response to a malicious input
  - CommonsenseQA: commonsense question answering task
  - TriviaQA: realistic text-based question answering task
  - Xsum: content summarization task (measured via ROUGE-1)
  - MMLU: massive multitask language understanding
  - GSM8K: math word problem task
❗️❗️ For the DGonlyQ, DGotherAQ, and Fluency metrics, you only need to submit the responses generated by the detoxified model for the malicious inputs in SafeEdit_test_ALL. For the other metrics, please use the OpenCompass tool to assess the detoxified model and obtain the corresponding results.
❗️❗️In terms of usage, LLaMA2-7B-Chat uses gen evaluation for CommonsenseQA, TriviaQA, Xsum, MMLU, and GSM8K (refer to this link). As for the number of shots in few-shot evaluation (refer to this link), CommonsenseQA uses 4-shot and GSM8K uses 2-shot because of the maximum input length of LLaMA2-7B-Chat. All other settings use the OpenCompass defaults. We will also soon write a tutorial on how to evaluate the above tasks using OpenCompass.
Note: For the code and details of UniHD and HalDet-LLaVA, refer to EasyDetect. If you want to fine-tune the model, the minimum requirement is a single GPU with 20 GB of memory (see LLaVA-Llama-3-8B (Youth Edition)), and a reasonable setup is a single GPU with 80 GB of memory (see LLaVA-v1.5).
Claim-level results on the validation dataset:
- Self-Check (GPT-4V) uses GPT-4V with 0 or 2 demonstration cases (0-shot or 2-shot)
- UniHD (GPT-4V/GPT-4o) uses GPT-4V/GPT-4o with 2-shot demonstrations and tool information
- HalDet (LLaVA) uses LLaVA-v1.5 trained on our training dataset
Task type | Model | Acc | Prec avg | Recall avg | Mac.F1 |
image-to-text | Self-Check 0shot (GPT-4V) | 75.09 | 74.94 | 75.19 | 74.97 |
image-to-text | Self-Check 2shot (GPT-4V) | 79.25 | 79.02 | 79.16 | 79.08 |
image-to-text | HalDet (LLaVA-7b) | 75.02 | 75.05 | 74.18 | 74.38 |
image-to-text | HalDet (LLaVA-13b) | 78.16 | 78.18 | 77.48 | 77.69 |
image-to-text | UniHD (GPT-4V) | 81.91 | 81.81 | 81.52 | 81.63 |
image-to-text | UniHD (GPT-4o) | 86.08 | 85.89 | 86.07 | 85.96 |
text-to-image | Self-Check 0shot (GPT-4V) | 76.20 | 79.31 | 75.99 | 75.45 |
text-to-image | Self-Check 2shot (GPT-4V) | 80.76 | 81.16 | 80.69 | 80.67 |
text-to-image | HalDet (LLaVA-7b) | 67.35 | 69.31 | 67.50 | 66.62 |
text-to-image | HalDet (LLaVA-13b) | 74.74 | 76.68 | 74.88 | 74.34 |
text-to-image | UniHD (GPT-4V) | 85.82 | 85.83 | 85.83 | 85.82 |
text-to-image | UniHD (GPT-4o) | 89.29 | 89.28 | 89.28 | 89.28 |
The detoxification performance on SafeEdit_test_ALL and basic ability on some general tasks.
- SFT: fine-tune the entire model
- DPO: adopt direct preference optimization
- DINM: detoxify via model editing using only one instance
We will soon release code for the above methods and offer some promising strategies and suggestions for this track. If necessary, you can access these resources from this link.
Method | Avg | DGonlyQ | DGotherAQ | Fluency | CommonsenseQA | TriviaQA | Xsum | MMLU | GSM8K |
Vanilla | 40.98 | 84.44 | 47.41 | 6.16 | 46.93 | 55.15 | 22.29 | 38.23 | 27.22 |
SFT | 45.96 | 91.85 | 70.74 | 3.27 | 54.63 | 54.63 | 24.05 | 41.78 | 26.69 |
DPO | 46.31 | 91.11 | 77.28 | 3.59 | 54.05 | 50.14 | 24.09 | 42.35 | 27.90 |
DINM | 47.23 | 93.33 | 86.05 | 5.87 | 48.89 | 53.37 | 20.22 | 43.58 | 26.54 |
❗️❗️If conducting experiments using an A800 GPU, calculating the MMLU metric takes around 12 hours, while each of the other metrics only takes about 4 hours.
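This page does not state how the Avg column is computed. The reported numbers are consistent with the unweighted mean of the eight metric columns, and the sketch below assumes exactly that; the function name `average_score` is illustrative.

```python
# Scores copied from the Vanilla and DINM rows of the table above.
vanilla = {
    "DGonlyQ": 84.44, "DGotherAQ": 47.41, "Fluency": 6.16,
    "CommonsenseQA": 46.93, "TriviaQA": 55.15, "Xsum": 22.29,
    "MMLU": 38.23, "GSM8K": 27.22,
}
dinm = {
    "DGonlyQ": 93.33, "DGotherAQ": 86.05, "Fluency": 5.87,
    "CommonsenseQA": 48.89, "TriviaQA": 53.37, "Xsum": 20.22,
    "MMLU": 43.58, "GSM8K": 26.54,
}


def average_score(metrics):
    """Unweighted mean over all metric values (assumed definition of Avg)."""
    return sum(metrics.values()) / len(metrics)


print(round(average_score(vanilla), 2))  # 40.98
print(round(average_score(dinm), 2))     # 47.23
```

Because every metric carries equal weight under this assumption, a large gain in DGotherAQ can offset small losses on the general tasks.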
The optimization strategy for Track 2 can include the following approaches:
- Self-improvement: modify the parameters of vanilla LLaMA2-7B-Chat to enhance its security, e.g., via SFT, DPO, RLHF, knowledge editing, or SimPO.
- Input toxicity detection: filter out malicious attacks from users at the input stage. For example, using toxicity classifiers to detect whether a user's input is toxic. If it is deemed toxic, the response is rejected.
- Prompt: leverage prompts (including RAG) to enhance the toxicity defense capability of vanilla LLaMA2-7B-Chat.
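The input toxicity detection strategy in the list above can be sketched as a thin wrapper around the generator. This is only an illustration under stated assumptions: `make_guarded_generator`, the refusal text, and the two toy stand-in functions are hypothetical. A real system would plug in an open-source toxicity classifier and the detoxified LLaMA2-7B-Chat.

```python
from typing import Callable

# Hypothetical refusal text, not prescribed by the task.
REFUSAL = "I'm sorry, but I can't help with that request."


def make_guarded_generator(
    is_toxic: Callable[[str], bool],
    generate: Callable[[str], str],
    refusal: str = REFUSAL,
) -> Callable[[str], str]:
    """Wrap `generate` so that inputs flagged by `is_toxic` are refused.

    Only the user input is inspected; the model output is never checked
    or rewritten.
    """
    def guarded(user_input: str) -> str:
        if is_toxic(user_input):
            return refusal
        return generate(user_input)

    return guarded


# Toy stand-ins for illustration only.
def toy_is_toxic(text: str) -> bool:
    return "steal a password" in text.lower()


def toy_generate(text: str) -> str:
    return f"(model answer to: {text})"


respond = make_guarded_generator(toy_is_toxic, toy_generate)
print(respond("How can I steal a password?"))
print(respond("What is the capital of France?"))
```

Keeping the check strictly on the input side means the wrapper never inspects or edits what the model generates.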
❗️❗️Toxicity detection at the output stage is not allowed in the competition (for example, using toxicity classifiers to detect whether the output is toxic and rewriting the original response if it is). In addition, only open-source models can be used for toxicity detection (filters); please refrain from using closed-source models.
❗️❗️In the competition, the use of other open-source models for input filtering and detection is permitted; however, the use of closed-source models and additional data is strictly prohibited.
❗️❗️In the competition, all additional models (such as filters, detectors, etc.) must not exceed LLaMA2-7B-Chat in terms of performance and parameter size. Here, performance includes our benchmark evaluations of detoxification capability and general ability.
❗️❗️For model training, only the data provided by this link is allowed to be used as supervised data, which includes SafeEdit_train, SafeEdit_val, three_instances_for_editing. SafeEdit_test_ALL is used to evaluate the detoxified model via various detoxifying methods. SafeEdit_test_ALL and any variations of it cannot be used during the training phase. Note that SafeEdit_test in this link should not be used at any stage of the Task 10 of NLPCC 2024.
We provide baseline code for Track 2; you can find it in the NLPCC section via this link.
Note that the best results of this track will be verified using the code provided by participants. If there is a significant gap between the leaderboard results and those we verify, the next participant in line will move up to the top position.
Submissions are made through CodaBench.
The submission steps are as follows:
- Register a CodaBench account
- Search for the competition: NLPCC2024 TASK 10 - TRACK 1
- Upload your submission. Upload only the zipped model results file; for the specific format, refer to res.zip
Note: At present, only validation-set results can be submitted for the testing phase, with 100 submission opportunities per person. Formal submissions will begin once the test set is released.
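Packaging a results file for upload can be sketched with Python's standard `zipfile` module. The helper name `make_submission` and the example file name `res.json` are hypothetical; the actual file name and internal format should follow the reference res.zip.

```python
import zipfile
from pathlib import Path


def make_submission(results_file, archive="res.zip"):
    """Zip a single results file at the top level of the archive."""
    results_path = Path(results_file)
    if not results_path.is_file():
        raise FileNotFoundError(results_path)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any directory prefix so the file sits at the zip root.
        zf.write(results_path, arcname=results_path.name)
    return Path(archive)


# Example usage (the file name is an assumption):
# make_submission("res.json")
```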
Submissions are made through CodaBench.
The submission steps are as follows:
- Register a CodaBench account
- Search for the competition: NLPCC2024 TASK 10 - TRACK 2
- Upload your submission. Upload only the zipped model results file; for the specific format, refer to res.zip
- Details can be found in the README
If you're intrigued by our challenge, please fill out the Registration Form (Word File) and send it to the following registration email.
Registration Email: mengruwg@zju.edu.cn
We have also created a discussion group for this task. You can join it by scanning the QR code below with WeChat.
- 2024/03/25: announcement of shared tasks and call for participation
- 2024/03/25: registration open
- 2024/04/15: release of detailed task guidelines & training data
- 2024/05/25: registration deadline
- 2024/06/11: release of test data
- 2024/06/20: participants’ results submission deadline
- 2024/06/30: evaluation results release and call for system reports and conference paper
Please cite our paper if you use our dataset.
@article{wang2024SafeEdit,
author = {Mengru Wang and
Ningyu Zhang and
Ziwen Xu and
Zekun Xi and
Shumin Deng and
Yunzhi Yao and
Qishen Zhang and
Linyi Yang and
Jindong Wang and
Huajun Chen},
title = {Detoxifying Large Language Models via Knowledge Editing},
journal = {CoRR},
volume = {abs/2403.14472},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2403.14472},
doi = {10.48550/ARXIV.2403.14472}
}
@article{chen24unihd,
author = {Xiang Chen and
Chenxi Wang and
Yida Xue and
Ningyu Zhang and
Xiaoyan Yang and
Qiang Li and
Yue Shen and
Lei Liang and
Jinjie Gu and
Huajun Chen},
title = {Unified Hallucination Detection for Multimodal Large Language Models},
journal = {CoRR},
volume = {abs/2402.03190},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2402.03190},
doi = {10.48550/ARXIV.2402.03190}
}
OpenKG
Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
If you have any questions about this task, please email mengruwg@zju.edu.cn or xiang_chen@zju.edu.cn.