
Commit

Update README.md
alopatenko authored Apr 20, 2024
1 parent 67a188a commit cf03bbc
Showing 1 changed file with 2 additions and 1 deletion.
README.md: 3 changes (2 additions & 1 deletion)
@@ -62,7 +62,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
- [Enterprise Scenarios, Patronus ](https://huggingface.co/blog/leaderboard-patronus)
- [Vectara Hallucination Leaderboard ]( https://github.com/vectara/hallucination-leaderboard)
- [Ray/Anyscale's LLM Performance Leaderboard]( https://github.com/ray-project/llmperf-leaderboard) ([explanation:]( https://www.anyscale.com/blog/comparing-llm-performance-introducing-the-open-source-leaderboard-for-llm))
--
+- [Multi-task Language Understanding on MMLU](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu)
---
### Evaluation Software
- [MTEB](https://huggingface.co/spaces/mteb/leaderboard)
@@ -91,6 +91,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
---
### Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 [arxiv](https://arxiv.org/abs/2311.17295)
+- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, [arxiv](https://arxiv.org/abs/2402.01781)
- Are Emergent Abilities of Large Language Models a Mirage? Apr 23 [arxiv](https://arxiv.org/abs/2304.15004)
- Don't Make Your LLM an Evaluation Benchmark Cheater nov 2023 [arxiv](https://arxiv.org/abs/2311.01964)
- Evaluating Question Answering Evaluation, 2019, [ACL](https://aclanthology.org/D19-5817/)
