llm-judge

Star

Here are 16 public repositories matching this topic...

haizelabs / verdict

Star

Inference-time scaling for LLMs-as-a-judge.

reward-shaping llm llm-as-a-judge test-time-compute inference-time-compute llm-judge test-time-scaling

Updated Nov 5, 2025
Jupyter Notebook

vtdinh13 / habit-builder-ai-agent

Star

An end-to-end AI agent project that transcribes audio files, embeds user queries, and searches in Qdrant and web browser via the Brave API. A Streamlit interface powered by OpenAI GPT models delivers actionable health insights from both the archive and the latest research.

ai-agents qdrant pydantic-ai llm-judge ai-agent-evaluation

Updated Dec 12, 2025
Python

ShaheerKhawaja / ProductionOS

Star

ProductionOS v1.0 — Claude Code plugin with 76 agents, 39 commands, and 12 hooks. Deploys specialized agents that review, score, and improve your entire codebase. Smart routing, recursive convergence, self-evaluation.

security-audit multi-agent code-review prompt-engineering self-evaluation deep-research llm-judge claude-code agentic-development claude-code-plugin recursive-improvement auto-swarm max-research production-upgrade convergence-engine worktree-isolation

Updated Apr 3, 2026
TypeScript

black-yt / structai

Star

StructAI offers a robust toolkit for LLM interaction—such as structured outputs, context management, and parallel execution.

Updated Mar 2, 2026
Python

youdotcom-oss / web-search-agent-evals

Star

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

benchmark mcp gemini headless-testing droid codex ai-agents web-search coding-agents model-context-protocol llm-judge claude-code agent-evaluation evaluation-suite

Updated Feb 27, 2026
TypeScript

syed-waleed-ahmed / LLM-as-Judge

Star

A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.

python code-evaluation a-b-testing text-evaluation groq streamlit model-benchmarking ai-automation ai-evaluation llm prompt-evaluation llama3 llm-judge output-evaluation scoring-framework

Updated Nov 24, 2025
Python

mennamohammedkh / Simple-Chatbot-Llama-3-8B-via-HuggingFace-API-TrustGuard-with-LLM-Judge

Star

🤖 A conversational chatbot powered by Meta-Llama-3-8B via HuggingFace API, with TrustGuard safety validation using an LLM-as-Judge.

python chatbot ai-safety uv huggingface llm llama3 llm-judge trustguard

Updated Feb 27, 2026
Python

Anmolian / Prompt_Eval_LLM_Judge

Star

Prompt Design & LLM Judge

prompt-engineering llms few-shot-prompting one-shot-prompting zero-shot-prompting contrastive-cot-prompting cot-prompting llm-judge trec-rag-2024 self-consistency-prompting role-playing-prompting

Updated Feb 10, 2025
Python

PabloCabaleiro / pondera

Star

Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.

python ai agents model-agnostic ai-evaluation llms llm-evaluation llm-evaluation-framework llm-judge agent-evaluation ai-evaluation-framework rubric-based-evaluation yaml-first

Updated Oct 23, 2025
Python

Padraigobrien08 / Agentic-GenAI-Capstone-Google-Kaggle

Star

Agent QA Mentor: an agentic QA pipeline that evaluates tool-using AI agent trajectories (scores, issue codes, safety/hallucination detection), rewrites prompts with targeted fixes, and stores long-term memory for continuous improvement—plus a CI-style eval gate and demo notebook.

python qa multi-agent-systems ai-agents gemini-api qa-automation pydantic vector-database llms agentic llm-judge agent-evaluation

Updated Dec 2, 2025
Jupyter Notebook

akshan-main / arxiv-context-feed

Sponsor

Star

arXiv scraper built for Contextual AI capstone project. Cron based ingestion. LLM judge that minimizes ingestion cost.

arxiv llm-judge

Updated Mar 15, 2026
Python

motasemwed / llm-judge

Star

LLM-as-a-Judge system for rubric-based, explainable evaluation of large language model outputs.

python nlp ai-evaluation large-language-models llm prompt-engineering llm-evaluation llm-judge

Updated Jan 29, 2026
Python

mohsinsheikhani / support-fte-evals

Star

Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent routing, guardrails, and systematic quality evaluation.

Updated Feb 6, 2026
Python

beviah / fracture

Star

Red-team framework for discovering alignment failures in frontier language models.

model-evaluation ai-safety jailbreak-detection red-teaming rlhf prompt-injection llm-evaluation llm-safety llm-safety-benchmark llm-judge alignment-testing adversarial-testing alignment-research

Updated Feb 19, 2026
Python

Asyasyarif / openjudges

Star

OpenJudges is an interactive CLI tool that uses LLMs as judges to evaluate AI responses against specific criteria

evaluation rag llm retreival-augmented-generation llm-judge

Updated Feb 24, 2026
Go

mohitsem13 / Verdict

Star

Deliver predictive litigation modeling and outcome simulation with enterprise-grade legal analytics for high-stakes trial intelligence.

android ratings malware it-security magisk-module bollywood ai-agents attack-defense malware-detection system-engineering cyber-resiliency zygisk playintegrityfix test-time-compute llm-judge test-time-scaling

Updated Apr 3, 2026
TypeScript

Improve this page

Add a description, image, and links to the llm-judge topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the llm-judge topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm-judge

Here are 16 public repositories matching this topic...

haizelabs / verdict

vtdinh13 / habit-builder-ai-agent

ShaheerKhawaja / ProductionOS

black-yt / structai

youdotcom-oss / web-search-agent-evals

syed-waleed-ahmed / LLM-as-Judge

mennamohammedkh / Simple-Chatbot-Llama-3-8B-via-HuggingFace-API-TrustGuard-with-LLM-Judge

Anmolian / Prompt_Eval_LLM_Judge

PabloCabaleiro / pondera

Padraigobrien08 / Agentic-GenAI-Capstone-Google-Kaggle

akshan-main / arxiv-context-feed

motasemwed / llm-judge

mohsinsheikhani / support-fte-evals

beviah / fracture

Asyasyarif / openjudges

mohitsem13 / Verdict

Improve this page

Add this topic to your repo