Commit b0b5c1e

Update README.md
1 parent 21b4925 commit b0b5c1e


README.md

Lines changed: 2 additions & 0 deletions
@@ -99,6 +99,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
 - [Mozilla AI Exploring LLM Evaluation at scale](https://blog.mozilla.ai/exploring-llm-evaluation-at-scale-with-the-neurips-large-language-model-efficiency-challenge/)
 - Evaluation part of [How to Maximize LLM Performance](https://humanloop.com/blog/optimizing-llms)
 - Mozilla AI blog published multiple good articles in [Mozilla AI blog](https://blog.mozilla.ai/)
+- Andrej Karpathy on evaluation [X](https://twitter.com/karpathy/status/1795873666481402010)
 ---
 ### Large benchmarks
 - Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks EMNLP 2022, [pdf](https://aclanthology.org/2022.emnlp-main.340.pdf)
@@ -107,6 +108,7 @@ My view on LLM Evaluation: [Deck](LLMEvaluation.pdf), and [video Analytics Vidh
 -
 ---
 ### Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
+- Lessons from the Trenches on Reproducible Evaluation of Language Models, [arxiv](https://arxiv.org/abs/2405.14782)
 - *Synthetic data in evaluation*, see Chapter 3 in Best Practices and Lessons Learned on Synthetic Data for Language Models, Apr 2024, [arxiv](https://arxiv.org/abs/2404.07503)
 - Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 [arxiv](https://arxiv.org/abs/2311.17295)
 - When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, [arxiv](https://arxiv.org/abs/2402.01781)
