diff --git a/docs/src/pages/docs/reference/evals/context-position.mdx b/docs/src/pages/docs/reference/evals/context-position.mdx
index 1545911a5..ac721f7cd 100644
--- a/docs/src/pages/docs/reference/evals/context-position.mdx
+++ b/docs/src/pages/docs/reference/evals/context-position.mdx
@@ -125,9 +125,8 @@ console.log(result.info.reason); // Explanation of the score
 
 The metric evaluates context positioning through:
 
 - Individual assessment of each context piece's relevance
-- Position-based weighting (1/position)
+- Position-based weighting (earlier positions weighted more heavily)
 - Binary relevance verdicts (yes/no) with detailed reasoning
-- Normalization against optimal ordering
 
 ### Scoring Process
diff --git a/docs/src/pages/docs/reference/evals/context-precision.mdx b/docs/src/pages/docs/reference/evals/context-precision.mdx
index 617220537..2f454742a 100644
--- a/docs/src/pages/docs/reference/evals/context-precision.mdx
+++ b/docs/src/pages/docs/reference/evals/context-precision.mdx
@@ -131,9 +131,7 @@ The metric evaluates context precision through:
 
 ### Scoring Process
 
-1. Converts verdicts to binary scores:
-   - Relevant context: 1
-   - Irrelevant context: 0
+1. Converts verdicts to binary scores (1 for relevant, 0 for not)
 
 2. Calculates Mean Average Precision:
    - Computes precision at each position
diff --git a/docs/src/pages/docs/reference/evals/faithfulness.mdx b/docs/src/pages/docs/reference/evals/faithfulness.mdx
index 2a8f2bc1d..f88a89d0e 100644
--- a/docs/src/pages/docs/reference/evals/faithfulness.mdx
+++ b/docs/src/pages/docs/reference/evals/faithfulness.mdx
@@ -126,10 +126,9 @@ console.log(result.info.reason); // "All claims are supported by the context."
 ## Scoring Details
 
 The metric evaluates faithfulness through:
 
-- Claim extraction and verification
-- Context-based validation
-- Verdict assignment
-- Support ratio calculation
+- Extracting all claims from the output (both factual and speculative)
+- Verifying each claim against the provided context
+- Calculating a score based on the proportion of supported claims
 
 ### Scoring Process
@@ -155,10 +154,6 @@ Final score: `supported_claims / total_claims * scale`
 
 - 0.4-0.6: Mixed support with some contradictions
 - 0.1-0.3: Limited support, many contradictions
 - 0.0: No supported claims
-- 0.67: Two-thirds of claims are supported
-- 0.5: Half of the claims are supported
-- 0.33: One-third of claims are supported
-- 0: No claims are supported or the output is empty
 
 ## Advanced Example
diff --git a/docs/src/pages/docs/reference/evals/hallucination.mdx b/docs/src/pages/docs/reference/evals/hallucination.mdx
index 6d84f9d12..b1398003f 100644
--- a/docs/src/pages/docs/reference/evals/hallucination.mdx
+++ b/docs/src/pages/docs/reference/evals/hallucination.mdx
@@ -176,11 +176,11 @@ Final score: `contradicted_statements / total_statements`
 
 ### Score interpretation (0 to scale, default 0-1)
 
-- 0.0: No hallucination - output aligns with all context statements
-- 0.25: Low hallucination - contradicts 25% of context statements
-- 0.5: Moderate hallucination - contradicts half of context statements
-- 0.75: High hallucination - contradicts 75% of context statements
 - 1.0: Complete hallucination - contradicts all context statements
+- 0.75: High hallucination - contradicts 75% of context statements
+- 0.5: Moderate hallucination - contradicts half of context statements
+- 0.25: Low hallucination - contradicts 25% of context statements
+- 0.0: No hallucination - output aligns with all context statements
 
 **Note:** The score represents the degree of hallucination - lower scores indicate better factual alignment with the provided context
 
diff --git a/docs/src/pages/docs/reference/evals/keyword-coverage.mdx b/docs/src/pages/docs/reference/evals/keyword-coverage.mdx
index 2e2bf3aac..112415a1b 100644
--- a/docs/src/pages/docs/reference/evals/keyword-coverage.mdx
+++ b/docs/src/pages/docs/reference/evals/keyword-coverage.mdx
@@ -87,8 +87,7 @@ The metric evaluates keyword coverage by matching keywords with the following fe
 
 - Common word and stop word filtering (e.g., "the", "a", "and")
 - Case-insensitive matching
 - Word form variation handling
-- Special handling of technical terms
-- Compound word processing
+- Special handling of technical terms and compound words
 
 ### Scoring Process
diff --git a/docs/src/pages/docs/reference/evals/summarization.mdx b/docs/src/pages/docs/reference/evals/summarization.mdx
index 9623c17b1..37ecb3b58 100644
--- a/docs/src/pages/docs/reference/evals/summarization.mdx
+++ b/docs/src/pages/docs/reference/evals/summarization.mdx
@@ -133,18 +133,15 @@ console.log(result.info); // Object containing detailed metrics about the summar
 ## Scoring Details
 
 The metric evaluates summaries through two essential components:
-
 1. **Alignment Score**: Measures factual correctness
-   - Extracts and verifies claims
-   - Validates against source text
-   - Uses yes/no/unsure verdicts
-   - Ensures factual accuracy
+   - Extracts claims from the summary
+   - Verifies each claim against the original text
+   - Assigns "yes", "no", or "unsure" verdicts
 
 2. **Coverage Score**: Measures inclusion of key information
    - Generates key questions from the original text
    - Check if the summary answers these questions
-   - Checks information inclusion
-   - Assesses comprehensiveness
+   - Checks information inclusion and assesses comprehensiveness
 
 ### Scoring Process
diff --git a/docs/src/pages/docs/reference/evals/textual-difference.mdx b/docs/src/pages/docs/reference/evals/textual-difference.mdx
index 3c6298b0a..5e3739ea9 100644
--- a/docs/src/pages/docs/reference/evals/textual-difference.mdx
+++ b/docs/src/pages/docs/reference/evals/textual-difference.mdx
@@ -111,8 +111,8 @@ The metric calculates several measures:
 
 ### Scoring Process
 
 1. Analyzes textual differences:
-   - Performs sequence matching
-   - Counts change operations
+   - Performs sequence matching between input and output
+   - Counts the number of change operations required
    - Measures length differences
 
 2. Calculates final metrics:
diff --git a/docs/src/pages/docs/reference/evals/tone-consistency.mdx b/docs/src/pages/docs/reference/evals/tone-consistency.mdx
index d4d9ec5de..5700c6de8 100644
--- a/docs/src/pages/docs/reference/evals/tone-consistency.mdx
+++ b/docs/src/pages/docs/reference/evals/tone-consistency.mdx
@@ -111,16 +111,15 @@ console.log(result2.score); // Tone stability score from 0-1
 
 The metric operates in two modes:
 
 1. **Tone Consistency** (with reference):
+   When both input and output are provided:
   - Compares sentiment between texts
   - Calculates sentiment difference
-  - Evaluates tone matching
-  - Higher score = more consistent tone
+  - Evaluates tone matching: a higher score indicates more consistent tone
 
 2. **Tone Stability** (single input):
-  - Analyzes cross-sentence sentiment
-  - Measures sentiment variance
-  - Assesses tone stability
-  - Higher score = more stable tone
+  - Analyzes sentiment stability across sentences
+  - Calculates variance in sentiment
+  - Assesses tone stability: a higher score indicates more stable tone
 
 ### Scoring Process
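
---

Reviewer note: the scoring rules touched by these hunks (position weighting, Mean Average Precision, and the faithfulness and hallucination ratios) can be sketched as below. This is an illustrative reconstruction from the prose in the diff, not the library's implementation; the function names and signatures are hypothetical.

```typescript
// Context position: relevance weighted by 1/position, normalized
// against the optimal ordering (all relevant pieces first).
function contextPositionScore(verdicts: boolean[]): number {
  const weight = (i: number) => 1 / (i + 1);
  const achieved = verdicts.reduce(
    (sum, relevant, i) => sum + (relevant ? weight(i) : 0),
    0,
  );
  const relevantCount = verdicts.filter(Boolean).length;
  let optimal = 0;
  for (let i = 0; i < relevantCount; i++) optimal += weight(i);
  return optimal === 0 ? 0 : achieved / optimal;
}

// Context precision: Mean Average Precision over binary verdicts
// (precision is computed at each relevant position, then averaged).
function meanAveragePrecision(verdicts: boolean[]): number {
  let relevantSeen = 0;
  let precisionSum = 0;
  verdicts.forEach((relevant, i) => {
    if (relevant) {
      relevantSeen += 1;
      precisionSum += relevantSeen / (i + 1); // precision at this position
    }
  });
  return relevantSeen === 0 ? 0 : precisionSum / relevantSeen;
}

// Faithfulness: supported_claims / total_claims * scale.
function faithfulnessScore(
  verdicts: Array<"yes" | "no" | "unsure">,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const supported = verdicts.filter((v) => v === "yes").length;
  return (supported / verdicts.length) * scale;
}

// Hallucination: contradicted_statements / total_statements (lower is better).
function hallucinationScore(contradicted: number, total: number): number {
  return total === 0 ? 0 : contradicted / total;
}
```

Note that `contextPositionScore` keeps the normalization-against-optimal-ordering step even though the context-position hunk drops that bullet from the prose; without it, the raw weighted sum would not land on a 0-1 scale.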