
We propose a novel evaluation technique for LLMs that surpasses BERTScore-based evaluation in its correlation with human evaluation scores.


Blind LLM Truthfulness Assessment (BLTA):

This paper proposes novel techniques for evaluating the truthfulness of large language models (LLMs) without relying on golden answers. We introduce a new metric, the Negative Sample Difference (NSD) score, which shows a high correlation with human scores. Our key idea is to invert the evaluation problem: while it is difficult to generate correct answers for an LLM to be judged against, it is easy to generate a set of intentionally wrong answers.

NSD Technique:


Hence, we employ another LLM, Google's Gemini 1.5 Pro, to generate a set of intentionally incorrect answers along with their corresponding corrected versions. These pairs are then used to evaluate our main LLM, Llama 3-8B.
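As a minimal sketch, generating such (incorrect, corrected) pairs with the google-generativeai SDK could look like the following. The prompt wording, the JSON output format, and the helper name are illustrative assumptions, not the exact prompt used in Dataset_Creation.ipynb.

```python
# Sketch: generating (incorrect, corrected) answer pairs with Gemini 1.5 Pro.
# The prompt wording and JSON output format are illustrative assumptions,
# not necessarily what Dataset_Creation.ipynb actually uses.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied by the user
model = genai.GenerativeModel("gemini-1.5-pro")

def make_negative_samples(question: str, n: int = 5) -> list[dict]:
    """Ask Gemini for n intentionally wrong answers plus corrected versions."""
    prompt = (
        f"Question: {question}\n"
        f"Generate {n} intentionally incorrect answers to this question, and for "
        f"each one a corrected version. Respond as a JSON list of objects with "
        f'keys "incorrect" and "corrected".'
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # assumes the model returns valid JSON

pairs = make_negative_samples("What happens if you eat watermelon seeds?")
```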

  1. llama2_Inference.ipynb : This notebook generates responses from the recently released Llama 3-8B on a subset of the TruthfulQA dataset.

  2. Dataset_Creation.ipynb : This notebook creates the negative-sample datasets using Google's Gemini-1.5-pro, a state-of-the-art language model. The resulting dataset has been uploaded to Hugging Face and is publicly available.

  3. NSD_Scores.ipynb : The main notebook for evaluating Llama-3 with our approach: we compute embeddings with a sentence transformer and then apply the NSD formulation to calculate the scores (see the sketch after this list).

  4. Correlation_Scores.ipynb : This notebook first computes BERTScore (a widely used unsupervised evaluation method) and then compares it with the NSD scores on the basis of their correlation with human scores.
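The exact NSD formulation lives in NSD_Scores.ipynb; the sketch below shows one plausible reading of it: embed the model's answer with a sentence transformer and score it by how much closer it sits to the corrected answers than to the negative samples. The difference-of-mean-similarities formula and the encoder checkpoint are assumptions for illustration.

```python
# Sketch of an NSD-style score (assumed formulation: mean cosine similarity
# to the corrected answers minus mean cosine similarity to the negatives).
# The sentence-transformer checkpoint is an illustrative choice, not
# necessarily the one used in NSD_Scores.ipynb.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def nsd_score(answer: str, corrected: list[str], negatives: list[str]) -> float:
    """Higher when the answer is closer to corrected answers than to wrong ones."""
    emb = encoder.encode([answer] + corrected + negatives, normalize_embeddings=True)
    a = emb[0]
    pos = emb[1 : 1 + len(corrected)]
    neg = emb[1 + len(corrected) :]
    # With normalized embeddings, the dot product is the cosine similarity.
    return float(np.mean(pos @ a) - np.mean(neg @ a))

score = nsd_score(
    "Nothing happens; the seeds pass through you.",
    corrected=["Nothing harmful happens if you eat watermelon seeds."],
    negatives=["A watermelon will grow in your stomach."],
)
```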

Human_Evaluation_Scores.xlsx:

This Excel file contains the scores assigned by three different human annotators; we average them to obtain the final scores used for the correlation analysis, as sketched below.
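A hedged sketch of that averaging-and-correlation step follows. The column names ("annotator_1", "nsd_score", etc.) are hypothetical placeholders; adjust them to the actual headers in Human_Evaluation_Scores.xlsx.

```python
# Sketch: average the three annotators' scores and correlate with each metric.
# All column names here are hypothetical placeholders for whatever headers
# Human_Evaluation_Scores.xlsx actually uses.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_excel("Human_Evaluation_Scores.xlsx")
df["human_avg"] = df[["annotator_1", "annotator_2", "annotator_3"]].mean(axis=1)

for metric in ["nsd_score", "bert_score"]:  # hypothetical metric columns
    r, p = pearsonr(df["human_avg"], df[metric])
    print(f"{metric}: Pearson r = {r:.3f} (p = {p:.3g})")
```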

Dataset Links:

Closed Book QA: https://huggingface.co/datasets/avnishkr/topics_in_ai

NSD dataset: https://huggingface.co/datasets/avnishkr/negative_samples_dataset
