evals

Here are 107 public repositories matching this topic...

mastra-ai / mastra

From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

nodejs javascript typescript ai reactjs mcp nextjs tts chatbots workflows agents llm evals

Updated Jan 7, 2026
TypeScript

Arize-ai / phoenix

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated Jan 7, 2026
Jupyter Notebook

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.

agent ai openai evaluation-metrics mistral cost-estimation autogen groq agentops llm langchain anthropic evals ollama crewai agents-sdk openai-agents

Updated Oct 30, 2025
Python

Kiln-AI / Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Updated Jan 7, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Jan 7, 2026
Python

lmnr-ai / lmnr

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

Updated Jan 7, 2026
TypeScript

GitHamza0206 / simba

Open-source, production-ready customer service with built-in evals and monitoring

knowledge-base customer-service rag llm evals

Updated Jan 7, 2026
TypeScript

mattpocock / evalite

Evaluate your LLM-powered apps with TypeScript

typescript ai evals

Updated Dec 3, 2025
TypeScript

superlinear-ai / raglite

🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL

markdown pdf postgres sqlite postgresql reranking rag vector-search duckdb colbert llm pgvector chainlit retrieval-augmented-generation evals late-interaction late-chunking query-adapter

Updated Jan 7, 2026
Python

laude-institute / harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

rl-environments evals terminal-bench

Updated Jan 7, 2026
Python

keshik6 / HourVideo

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

navigation perception summarization reasoning visual-reasoning egocentric-videos gpt-4 multiple-choice-questions benchmark-dataset video-language-understanding multimodal-large-language-models evals gemini-pro spatial-intelligence neurips-2024 1-hour-video-language-understanding long-form-video-language-understanding long-context-understanding

Updated Jul 12, 2025
Jupyter Notebook

microsoft / promptpex

Test Generation for Prompts

testing evaluations prompt-engineering llms chatgpt evals gpt-4o

Updated Jan 7, 2026
TeX

METR / vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

ai elicitation ai-evaluation evals

Updated Nov 11, 2025
TypeScript

mclenhard / mcp-evals

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

ai mcp evals