Because we should all have our own set of LLM evals. Blog post
Python stuff:
uv sync
Node stuff:
npm install
just (the command runner used for the commands below):
brew install just
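Before running anything, it can help to confirm that uv, npm and just are all on your PATH. Here is a minimal sketch of such a check; the `missing_tools` helper is a hypothetical name for illustration and is not part of this repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): print the name of each
# given command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# uv, npm and just are the three tools the setup steps rely on.
missing=$(missing_tools uv npm just)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All tools found"
fi
```

If anything is reported missing, install it first, then run the uv sync and npm install steps.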
Run them all:
just eval-all
Run a specific one:
just eval CONFIG
where CONFIG is the name of an eval config, for example "social-media-insults".
To view the dashboard (the same one that is published at https://kschaul.com/llm-evals/):
just dev