
Commit

Update README.md
alopatenko authored Apr 20, 2024
1 parent 67a188a commit cf03bbc
Showing 1 changed file with 2 additions and 1 deletion.
README.md: 3 changes (2 additions & 1 deletion)
@@ -62,7 +62,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
- [Enterprise Scenarios, Patronus ](https://huggingface.co/blog/leaderboard-patronus)
- [Vectara Hallucination Leaderboard ]( https://github.com/vectara/hallucination-leaderboard)
- [Ray/Anyscale's LLM Performance Leaderboard]( https://github.com/ray-project/llmperf-leaderboard) ([explanation:]( https://www.anyscale.com/blog/comparing-llm-performance-introducing-the-open-source-leaderboard-for-llm))
--
+- [Multi-task Language Understanding on MMLU](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu)
---
### Evaluation Software
- [MTEB](https://huggingface.co/spaces/mteb/leaderboard)
@@ -91,6 +91,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
---
### Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 [arxiv](https://arxiv.org/abs/2311.17295)
+- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, [arxiv](https://arxiv.org/abs/2402.01781)
- Are Emergent Abilities of Large Language Models a Mirage? Apr 23 [arxiv](https://arxiv.org/abs/2304.15004)
- Don't Make Your LLM an Evaluation Benchmark Cheater nov 2023 [arxiv](https://arxiv.org/abs/2311.01964)
- Evaluating Question Answering Evaluation, 2019, [ACL](https://aclanthology.org/D19-5817/)
