[ICCV 2025] AdsQA: Towards Advertisement Video Understanding Arxiv: https://arxiv.org/abs/2509.08621
-
Updated
Oct 30, 2025 - Python
[ICCV 2025] AdsQA: Towards Advertisement Video Understanding Arxiv: https://arxiv.org/abs/2509.08621
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
Testing how well LLMs can solve jigsaw puzzles
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Benchmark LLMs Spatial Reasoning with Head-to-Head Bananagrams
Yes, LLM's just regurgitate the same jokes on the internet over and over again. But some are slightly funnier than others.
GateBench is a challenging benchmark for Vision Language Models (VLMs) that tests visual reasoning by requiring models to extract boolean algebra expressions from logic gate circuit diagrams.
Compare how vision models reason about images — not just their accuracy scores
自动收集 Bilibili 硬核会员答题数据并生成 LLM 评估数据集
UrduReason-Eval: A comprehensive evaluation dataset with 800 Urdu reasoning problems across 6 categories (arithmetic, logical deduction, temporal, comparative, and causal reasoning) for assessing reasoning capabilities in Urdu language models.
Gemma3 RAG benchmark system for Japanese river/dam/erosion control technical standards.
is it better to run a Tiny Model (2B-4B) at High Precision (FP16/INT8), or a Large Model (8B+) at Low Precision (INT4)?" This benchmark framework allows developers to scientifically choose the best model for resource-constrained environments (consumer GPUs, laptops, edge devices) by measuring the trade-off between Speed and Intelligence
🧠 Benchmark Haiku 4.5 and MiniMax M2.1 on agentic tasks, revealing strengths in design thinking and operational skills for multi-turn workflows.
Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic coding tasks. Includes full audit trails, LLM-as-judge evaluation, and path divergence analysis.
🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty with the ERR-EVAL benchmark for reliable epistemic reasoning.
Add a description, image, and links to the llm-benchmark topic page so that developers can more easily learn about it.
To associate your repository with the llm-benchmark topic, visit your repo's landing page and select "manage topics."