diff --git a/README.md b/README.md
index 6c01de8..203ffc8 100644
--- a/README.md
+++ b/README.md
@@ -62,7 +62,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
 - [Enterprise Scenarios, Patronus ](https://huggingface.co/blog/leaderboard-patronus)
 - [Vectara Hallucination Leaderboard ]( https://github.com/vectara/hallucination-leaderboard)
 - [Ray/Anyscale's LLM Performance Leaderboard]( https://github.com/ray-project/llmperf-leaderboard) ([explanation:]( https://www.anyscale.com/blog/comparing-llm-performance-introducing-the-open-source-leaderboard-for-llm))
--
+- [Multi-task Language Understanding on MMLU](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu)
 ---
 ### Evaluation Software
 - [MTEB](https://huggingface.co/spaces/mteb/leaderboard)
@@ -91,6 +91,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
 ---
 ### Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
 - Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 [arxiv](https://arxiv.org/abs/2311.17295)
+- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, [arxiv](https://arxiv.org/abs/2402.01781)
 - Are Emergent Abilities of Large Language Models a Mirage? Apr 23 [arxiv](https://arxiv.org/abs/2304.15004)
 - Don't Make Your LLM an Evaluation Benchmark Cheater nov 2023 [arxiv](https://arxiv.org/abs/2311.01964)
 - Evaluating Question Answering Evaluation, 2019, [ACL](https://aclanthology.org/D19-5817/)