Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
One click to open multiple AI sites and view their results side by side
Code scanner to check for issues in prompts and LLM calls
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
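The framework's core metric reduces to a one-line computation. A minimal sketch, assuming cost-of-pass is defined as the expected inference cost per attempt divided by the pass rate; the function name and example figures are illustrative, not taken from the repository:

```python
def cost_of_pass(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected dollar cost to obtain one correct answer.

    Assumes the metric is expected inference cost per attempt divided
    by the probability that an attempt succeeds (illustrative sketch).
    """
    if pass_rate <= 0:
        return float("inf")  # model never solves the task
    return cost_per_attempt / pass_rate

# Illustrative figures only: $0.002 per attempt at a 40% pass rate
print(cost_of_pass(0.002, 0.40))  # 0.005 dollars per correct answer
```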
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
JudgeGPT, a research project on (fake) news evaluation
LLM-as-a-judge for Extractive QA datasets
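For context, a minimal sketch of the LLM-as-a-judge pattern applied to extractive QA, using the OpenAI Python client; the prompt wording and model name are assumptions, not the repository's actual configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, gold_span: str, predicted: str) -> bool:
    """Ask an LLM judge whether a predicted answer matches the gold span.

    Prompt text and model choice are illustrative assumptions.
    """
    prompt = (
        f"Question: {question}\n"
        f"Gold answer span: {gold_span}\n"
        f"Predicted answer: {predicted}\n"
        "Does the predicted answer convey the same information as the gold "
        "span? Reply with exactly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```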
RJafroc quick start for those already familiar with the Windows version of JAFROC
CLI tool to evaluate LLM factuality on MMLU benchmark.
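A minimal sketch of the kind of loop such a tool might run, scoring multiple-choice accuracy on one MMLU subject; `ask_model` is a hypothetical callable, and the dataset schema assumed here is the Hugging Face `cais/mmlu` release (question, choices, integer answer index):

```python
from datasets import load_dataset

def evaluate_mmlu(ask_model, subject: str = "anatomy") -> float:
    """Score multiple-choice accuracy on one MMLU subject.

    `ask_model(prompt) -> str` is a hypothetical function returning a
    single letter A-D; this is a sketch, not the CLI's implementation.
    """
    data = load_dataset("cais/mmlu", subject, split="test")
    letters = "ABCD"
    correct = 0
    for row in data:
        options = "\n".join(
            f"{letters[i]}. {c}" for i, c in enumerate(row["choices"])
        )
        prompt = (
            f"{row['question']}\n{options}\n"
            "Answer with a single letter (A, B, C, or D)."
        )
        if ask_model(prompt).strip().upper()[:1] == letters[row["answer"]]:
            correct += 1
    return correct / len(data)
```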
Repository for the LWDA'24 presentation on 'Psychometric Profiling of GPT Models for Bias Exploration', featuring conference materials including the poster, paper, slides, and references.
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
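To illustrate the general idea behind black-box UQ for hallucination detection (sample several responses and measure their agreement), here is a minimal sketch; `generate` is a hypothetical sampler, and this is not UQLM's actual API:

```python
from collections import Counter

def consistency_score(generate, prompt: str, n: int = 5) -> float:
    """Sampling-based uncertainty estimate for hallucination detection.

    `generate(prompt) -> str` is a hypothetical sampler run with nonzero
    temperature. The score is the fraction of samples agreeing with the
    majority answer; low agreement signals a likely confabulation. This
    sketches the black-box UQ idea in general, not UQLM's interface.
    """
    samples = [generate(prompt).strip().lower() for _ in range(n)]
    _, majority_count = Counter(samples).most_common(1)[0]
    return majority_count / n

# A sample set like {"paris", "paris", "paris", "lyon", "paris"} scores
# 0.8; wide disagreement across samples would score near 1/n.
```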
Automatic multi-metric evaluation of human-bot dialogues using LLMs (Claude, GPT-4o) across different datasets and settings. Built for the Artificial Intelligence course at the University of Salerno.