tyom (Contributor) commented Oct 29, 2025

Originally proposed in #276. Resubmitting against v1 branch as requested.

  • Include ESLint 9 as root dependency
  • Set up ESLint to lint the whole repo
  • Extend the root config and add a few package-specific plugins for Evalite UI
  • Add a consistent typecheck npm script for type checking across the repo

You can now run pnpm lint in the root and in the UI app, and pnpm typecheck anywhere in the repo.
Run pnpm lint --fix to attempt to fix the issues automatically.

Fixed the existing lint errors (mostly by removing unused imports and variables). I left some unused function parameters in place, prefixed with _, as a reminder of each function's signature.
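
The config added in this PR is not shown on this page. As a rough sketch of how underscore-prefixed names can be exempted from the unused-variable check, a flat config entry might look like the following. The `files` globs and the choice of the core `no-unused-vars` rule are assumptions for illustration; a TypeScript repo would also need a TypeScript-aware parser and would more likely configure the equivalent typescript-eslint rule.

```typescript
// Hypothetical flat-config sketch, not the config from this PR.
// Reports unused variables and parameters, but ignores names starting with "_".
const config = [
  {
    // Assumed globs; adjust to the packages being linted.
    files: ["**/*.ts", "**/*.tsx"],
    rules: {
      "no-unused-vars": [
        "error",
        {
          argsIgnorePattern: "^_", // unused function parameters such as `_ctx`
          varsIgnorePattern: "^_", // unused local variables such as `_unused`
        },
      ],
    },
  },
];

export default config;
```

With a pattern like this, a parameter kept only to document a signature can be named `_param` without triggering a lint error.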

Also added an EditorConfig file to help maintain a consistent code style for those whose editors support it.

mattpocock and others added 30 commits October 19, 2025 12:44
- Remove unused DB_LOCATION import from test-utils.ts
- Replace FILES_LOCATION import with local constant in files.test.ts

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
- Add dotenv as a dependency
- Create env-setup-file module that imports dotenv/config
- Export env-setup-file as 'evalite/env-setup-file'
- Automatically prepend env-setup-file to setupFiles array
- Update documentation to reflect automatic .env loading
- Update example config to remove manual dotenv setup

Fixes mattpocock#234

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
… precedence

- Add loadVitestSetupFiles() to load setupFiles from vitest.config.ts
- Merge setupFiles from both configs with evalite.config.ts taking precedence
- Add tests for vitest.config.ts setupFiles support and precedence
- setupFiles execution order: env-setup-file -> vitest -> evalite

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
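
The ordering described in the commit above can be sketched as a simple concatenation. This is a minimal illustration, not Evalite's implementation: `mergeSetupFiles` is a hypothetical helper, and the only name taken from the commit messages is the `evalite/env-setup-file` export. It assumes that "taking precedence" means the evalite.config.ts files run last, so anything they set up overrides earlier files.

```typescript
// Hypothetical helper illustrating the documented execution order:
// env-setup-file -> vitest.config.ts setupFiles -> evalite.config.ts setupFiles.
const ENV_SETUP_FILE = "evalite/env-setup-file";

function mergeSetupFiles(
  vitestSetupFiles: string[],
  evaliteSetupFiles: string[],
): string[] {
  // Later entries run later, so evalite.config.ts entries get the final say
  // (an assumption about what "precedence" means here).
  return [ENV_SETUP_FILE, ...vitestSetupFiles, ...evaliteSetupFiles];
}

const merged = mergeSetupFiles(["./vitest.setup.ts"], ["./evalite.setup.ts"]);
console.log(merged);
```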
  - Export new `evalite/scorers` module with factory functions
  - Add `createLLMBasedScorer` for model-dependent scorers
  - Add `createEmbeddingBasedScorer` for embedding-dependent scorers
  - Introduce `EvaluationSample` type with query, contexts, and reference fields

Part of mattpocock#250
- Added a new `faithfulness` scorer to evaluate model responses against retrieved contexts.
- Introduced utility functions for scoring and context handling.
- Updated `package.json` to include `zod` version 4.1.12 as a dependency.
- Updated `pnpm-lock.yaml` to reflect changes in dependencies and versions.

Part of mattpocock#250
- Introduced a new `answerSimilarity` scorer to assess the semantic similarity between a ground truth answer and a generated answer.
- The scorer utilizes embedding models to compute cosine similarity and includes an optional threshold for binary output.
- Updated the `scorers` module to export the new `answerSimilarity` scorer.

Part of mattpocock#250
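
For readers unfamiliar with the technique, here is a minimal sketch of cosine similarity with an optional threshold for binary output. It is not the `answerSimilarity` source: the function names are hypothetical, and it operates on plain number arrays rather than calling an embedding model.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat similarity as 0 (an assumption).
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical wrapper: returns the raw similarity, or 1/0 when a threshold is given.
function similarityScore(a: number[], b: number[], threshold?: number): number {
  const similarity = cosineSimilarity(a, b);
  if (threshold === undefined) return similarity;
  return similarity >= threshold ? 1 : 0;
}
```

Passing a threshold turns the continuous score into a pass/fail value, which can be easier to aggregate across many eval cases.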
- Introduced a new `contextRecall` scorer to evaluate how much of a generated answer can be attributed to retrieved contexts.
- Updated the `scorers` module to export the new `contextRecall` scorer.

Part of mattpocock#250
…g based scorers, and context recall and faithfulness classifications
mattpocock and others added 23 commits October 23, 2025 10:32
feat: Swap from React Markdown to Streamdown
…splay issues

Fixes mattpocock#265 - durations were displaying with full floating point precision like '3.090249999999997ms' instead of rounding to '3ms'. Updated formatTime() to use Math.round() for millisecond values.

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
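
The rounding fix can be illustrated with a small sketch. `formatTimeSketch` is a hypothetical stand-in for the real `formatTime()`, whose source is not shown here; in particular, the seconds branch is an assumption added only to make the example complete.

```typescript
// Hypothetical stand-in for formatTime(); only the Math.round() on
// millisecond values is taken from the commit message.
function formatTimeSketch(ms: number): string {
  const rounded = Math.round(ms);
  if (rounded < 1000) {
    // Avoids output like "3.090249999999997ms".
    return `${rounded}ms`;
  }
  // Assumed formatting for longer durations.
  return `${(ms / 1000).toFixed(1)}s`;
}

console.log(formatTimeSketch(3.090249999999997)); // prints "3ms"
```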
…251024-1549

fix: round millisecond durations to avoid floating point precision display issues
* refactor: Simplify scorer factory API

- Remove `createBaseScorer`, consolidate to `createLLMScorer`/`createEmbeddingScorer`
- Add generic `TExpected` type for type-safe expected data
- Replace `singleTurn`/`multiTurn` with single `scorer` function
- Rename utils to `isSingleTurnInput`/`isMultiTurnInput`
- Update all scorers (faithfulness, answerSimilarity, contextRecall) to new API
- Fix example.eval.ts: textStream -> text

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor: Move inline scorer types to Evalite.Scorers namespace

Move inline expected data types from answer-similarity, context-recall,
and faithfulness scorers to the Evalite.Scorers namespace in types.ts
for better type organization and discoverability.

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>

* Formatting

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
* refactor: Simplify scorer factory API

- Remove `createBaseScorer`, consolidate to `createLLMScorer`/`createEmbeddingScorer`
- Add generic `TExpected` type for type-safe expected data
- Replace `singleTurn`/`multiTurn` with single `scorer` function
- Rename utils to `isSingleTurnInput`/`isMultiTurnInput`
- Update all scorers (faithfulness, answerSimilarity, contextRecall) to new API
- Fix example.eval.ts: textStream -> text

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor: Move inline scorer types to Evalite.Scorers namespace

Move inline expected data types from answer-similarity, context-recall,
and faithfulness scorers to the Evalite.Scorers namespace in types.ts
for better type organization and discoverability.

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>

* refactor: Update scorer types and utility functions for output handling

- Renamed input types in Evalite.Scorers namespace to reflect output handling: SingleTurnInput to SingleTurnOutput, MultiTurnInput to MultiTurnOutput, and updated related types accordingly.
- Modified scorer implementations in context-recall and faithfulness to use new output types.
- Updated utility functions to check for output types instead of input types, enhancing clarity and consistency in the scoring logic.

* Formatting

* Trigger CI re-check

---------

Co-authored-by: Matt Pocock <mattpocockvoice@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
* feat: update tailwind.css with new dark theme colors

Solves: mattpocock#272

* fix: remove unnecessary class from active sidebar item styling
* feat: integrate search functionality

- Implemented search functionality in the main application layout, allowing users to filter evaluations based on search queries.
- Updated routes to support search parameters using Zod for validation.

Closes: mattpocock#271

* Create wet-clocks-camp.md

---------

Co-authored-by: Matt Pocock <mattpocockvoice@gmail.com>
* refactor: Simplify scorer factory API

- Remove `createBaseScorer`, consolidate to `createLLMScorer`/`createEmbeddingScorer`
- Add generic `TExpected` type for type-safe expected data
- Replace `singleTurn`/`multiTurn` with single `scorer` function
- Rename utils to `isSingleTurnInput`/`isMultiTurnInput`
- Update all scorers (faithfulness, answerSimilarity, contextRecall) to new API
- Fix example.eval.ts: textStream -> text

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor: Move inline scorer types to Evalite.Scorers namespace

Move inline expected data types from answer-similarity, context-recall,
and faithfulness scorers to the Evalite.Scorers namespace in types.ts
for better type organization and discoverability.

Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>

* refactor: Update scorer types and utility functions for output handling

- Renamed input types in Evalite.Scorers namespace to reflect output handling: SingleTurnInput to SingleTurnOutput, MultiTurnInput to MultiTurnOutput, and updated related types accordingly.
- Modified scorer implementations in context-recall and faithfulness to use new output types.
- Updated utility functions to check for output types instead of input types, enhancing clarity and consistency in the scoring logic.

* feat: Implement Tool Call Accuracy scorer

Related mattpocock#250

---------

Co-authored-by: Matt Pocock <mattpocockvoice@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Matt Pocock <mattpocock@users.noreply.github.com>
- Extended loadFixture() to return Vitest instance with Symbol.asyncDispose
- Added triggerWatchModeRerun() helper using vitest.waitForTestRunEnd()
- Added disableServer option to runEvalite() to prevent port conflicts
- runEvalite() now returns Vitest instance
- Fixed ai-sdk-traces.test.ts to use await using

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Include ESLint 9 as root dependency
- Set up ESLint to lint the whole repo
- Extend the root config and add a few package-specific plugins for Evalite UI
- Add a consistent `typecheck` npm script for type checking across the repo

You can now use `pnpm lint` in root and UI app and `pnpm typecheck` anywhere in the repo.
Use `pnpm lint --fix` to attempt to fix the issues.

changeset-bot bot commented Oct 29, 2025

⚠️ No Changeset found

Latest commit: 205a215

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

vercel bot commented Oct 29, 2025

@tyom is attempting to deploy a commit to the Skill Recordings Team on Vercel.

A member of the Team first needs to authorize it.

3 participants