From 89a10e24fd4164e2f98b6f3d69644506a68e55fc Mon Sep 17 00:00:00 2001
From: slobentanzer
Date: Mon, 12 Feb 2024 10:27:46 +0100
Subject: [PATCH] issue template

---
 content/40.methods.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/content/40.methods.md b/content/40.methods.md
index 22c89d8..ba01128 100644
--- a/content/40.methods.md
+++ b/content/40.methods.md
@@ -56,6 +56,8 @@ For instance, we test the conversion of numbers (which LLMs are notoriously bad
 The Pytest framework is implemented at [https://github.com/biocypher/biochatter/blob/main/benchmark](https://github.com/biocypher/biochatter/blob/main/benchmark), and more information is available at [https://biochatter.org/benchmarking](https://biochatter.org/benchmarking).
 The benchmark is updated upon the release of new models and extensions to the datasets, and continuously available at [https://biochatter.org/benchmark](https://biochatter.org/benchmark).
+
+We will run the benchmark on new models and variants (including fine-tuned models) upon request from the community, which can be made on GitHub using our issue template (TODO link).
 
 The living benchmark process is inspired by test-driven development, meaning test cases are created based on specific features or behaviors that are desired.
 When a model doesn't initially produce the optimal response, which is often the case, adjustments are made to various elements of the framework, including prompts or functions, to enhance the model's effectiveness.
 Monitoring the model's performance on these tests over time allows us to assess the framework's reliability and pinpoint areas that need improvement.
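
For illustration, a minimal sketch of the kind of behaviour-driven test case the living benchmark relies on, written in the Pytest style referenced in the patch; the `query_model` helper, the prompt, and the canned response are hypothetical placeholders, not part of the actual biochatter benchmark suite:

```python
# Sketch of a behaviour-driven benchmark case in the test-driven spirit described
# above: the desired behaviour (correct unit/number conversion) is encoded first,
# and prompts or scaffolding are adjusted until the model passes.
import re

import pytest


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call; a real benchmark would query the model here."""
    canned = {
        "How many nanograms are in 1.5 micrograms? Answer with digits.":
            "1.5 micrograms corresponds to 1500 nanograms.",
    }
    return canned.get(prompt, "")


@pytest.mark.parametrize(
    "prompt,expected",
    [
        ("How many nanograms are in 1.5 micrograms? Answer with digits.", "1500"),
    ],
)
def test_number_conversion(prompt, expected):
    response = query_model(prompt)
    # Extract digit runs and require an exact match, so near-misses count as failures.
    assert expected in re.findall(r"\d+", response)
```

In a setup like this, parametrizing over prompts (and, by extension, over models) lets the same assertion be re-run unchanged whenever a new model or variant is requested via the issue template.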