
# Evaluation System

The evaluation system provides a flexible and abstract interface for testing and scoring various tasks, from simple string operations to complex LLM-based generation.

## Quick Start

```typescript
// `eval` is a reserved word in JavaScript modules, so import it under an alias.
import { eval as evaluate, Levenshtein } from './lib/eval';

const result = await evaluate("My Eval", {
  // A function that returns an array of test data
  data: async () => {
    return [{ input: "Hello", expected: "Hello World!" }];
  },
  // The task to perform
  task: async (input) => {
    return input + " World!";
  },
  // The scoring methods for the eval
  scorers: [Levenshtein],
});
```

## Core Concepts

### Test Data

Each test case should have at minimum an `input` and an `expected` field:

```typescript
interface TestData<TInput = any, TExpected = any> {
  input: TInput;
  expected: TExpected;
  [key: string]: any; // Allow additional properties
}
```
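For instance, fields beyond `input` and `expected` are permitted by the index signature and travel with each test case; the `category` field below is illustrative, not required by anything in the library:

```typescript
interface TestData<TInput = any, TExpected = any> {
  input: TInput;
  expected: TExpected;
  [key: string]: any; // Allow additional properties
}

// Illustrative test cases: `category` is extra metadata the index
// signature allows, carried alongside the required fields.
const cases: TestData<string, string>[] = [
  { input: "Hello", expected: "Hello World!", category: "greeting" },
  { input: "Bye", expected: "Bye World!", category: "farewell" },
];
```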

### Task Function

The task function receives an input and returns an output to be scored; it may be synchronous or asynchronous:

```typescript
task: (input: TInput) => Promise<TOutput> | TOutput
```

### Scorers

Scorers compare the task output with the expected result and return a numerical score:

```typescript
interface Scorer<TInput = any, TOutput = any, TExpected = any> {
  name: string;
  score: (input: TInput, output: TOutput, expected: TExpected) => Promise<number> | number;
  description?: string;
}
```
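As a concrete illustration of the interface (not a built-in scorer), here is a hypothetical `TokenRecall` scorer that returns the fraction of expected whitespace-separated tokens found in the output:

```typescript
interface Scorer<TInput = any, TOutput = any, TExpected = any> {
  name: string;
  score: (input: TInput, output: TOutput, expected: TExpected) => Promise<number> | number;
  description?: string;
}

// Hypothetical scorer: fraction of expected tokens present in the output.
const TokenRecall: Scorer<string, string, string> = {
  name: "TokenRecall",
  description: "Fraction of expected tokens found in the output",
  score: (_input, output, expected) => {
    const outTokens = new Set(output.toLowerCase().split(/\s+/));
    const expTokens = expected.toLowerCase().split(/\s+/).filter(Boolean);
    if (expTokens.length === 0) return 1; // nothing expected: trivially satisfied
    const hits = expTokens.filter((t) => outTokens.has(t)).length;
    return hits / expTokens.length;
  },
};
```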

## Built-in Scorers

### ExactMatch

Returns 1 for an exact match, 0 otherwise:

```typescript
import { ExactMatch } from './lib/eval';
```
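A minimal sketch of what this scorer amounts to for string outputs (the library's actual implementation may differ, e.g. in how it treats non-string values):

```typescript
// Sketch: strict equality yields 1, anything else 0. Non-primitive
// outputs would need deep comparison, which this sketch does not attempt.
function exactMatchScore(output: string, expected: string): number {
  return output === expected ? 1 : 0;
}
```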

### Levenshtein

Normalized Levenshtein similarity, computed as `1 - distance / max_length`, so identical strings score 1:

```typescript
import { Levenshtein } from './lib/eval';
```
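A self-contained sketch of the underlying computation, assuming the standard edit-distance recurrence (the library's internals may differ):

```typescript
// Classic single-row dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][0]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalize to a similarity in [0, 1]: 1 - distance / max_length.
function normalizedLevenshtein(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // two empty strings match exactly
  return 1 - levenshtein(output, expected) / maxLen;
}
```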

### Contains

Checks whether the output contains the expected string (case-insensitive):

```typescript
import { Contains } from './lib/eval';
```
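The check reduces to a case-insensitive substring test; a sketch (not the library's actual source):

```typescript
// Sketch: 1 if the expected string appears anywhere in the output
// (ignoring case), 0 otherwise.
function containsScore(output: string, expected: string): number {
  return output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
}
```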

## Examples

### Simple String Task

```typescript
import { eval as evaluate, ExactMatch, Levenshtein } from './lib/eval';

await evaluate("String Concatenation", {
  data: () => [
    { input: "Hello", expected: "Hello World!" },
    { input: "Hi", expected: "Hi World!" }
  ],
  task: (input: string) => input + " World!",
  scorers: [ExactMatch, Levenshtein]
});
```

### LLM Evaluation

```typescript
import { generateCommand } from './ai';
import { eval as evaluate, ExactMatch, Contains, Levenshtein } from './lib/eval';

await evaluate("Command Generation", {
  data: () => [
    { input: "list all files", expected: "ls -la" },
    { input: "show directory", expected: "pwd" }
  ],
  task: async (input: string) => {
    return await generateCommand(input);
  },
  scorers: [ExactMatch, Contains, Levenshtein]
});
```

### Custom Scorer

```typescript
import { eval as evaluate, type Scorer } from './lib/eval';

const CustomScorer: Scorer = {
  name: 'CustomScore',
  description: 'Custom scoring logic',
  score: (input, output, expected) => {
    // Your custom scoring logic here
    return Math.random(); // Placeholder; return a meaningful score in practice
  }
};

await evaluate("Custom Evaluation", {
  data: () => [/* test data */],
  task: (input) => {/* task logic */},
  scorers: [CustomScorer]
});
```

## Results

The evaluation returns a summary containing:

- Test case details
- Individual scores for each scorer
- Average scores across all test cases
- Error information for failed tests

```typescript
interface EvalSummary {
  name: string;
  totalTests: number;
  averageScores: Record<string, number>;
  results: EvalResult[];
}
```
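The `averageScores` field can be read as a per-scorer mean over all test cases. A sketch of that aggregation over plain score records (the library computes this internally; the function below is illustrative):

```typescript
// Average each scorer's values across per-test score records.
function averageScores(perTest: Record<string, number>[]): Record<string, number> {
  const sums: Record<string, number> = {};
  for (const scores of perTest) {
    for (const [name, value] of Object.entries(scores)) {
      sums[name] = (sums[name] ?? 0) + value;
    }
  }
  const avg: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    avg[name] = sums[name] / perTest.length;
  }
  return avg;
}
```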

## Migration from Legacy API

The old `evaluateGeneration` function is deprecated. Migrate to the new interface:

Old:

```typescript
const result = await evaluateGeneration({
  originalPrompt: "test",
  generatedOutput: "output",
  referenceOutput: "reference"
});
```

New:

```typescript
import { eval as evaluate, Levenshtein } from './lib/eval';

const result = await evaluate("Test", {
  data: () => [{ input: "test", expected: "reference" }],
  task: () => "output",
  scorers: [Levenshtein]
});
```