-
There seems to be a standard for how models answer with code: in markdown code fences. We have code that parses such responses (and cleans up lots of edge cases). The parsing code is at https://github.com/symflower/eval-dev-quality/blob/main/model/llm/prompt/parse.go and some examples are at https://github.com/symflower/eval-dev-quality/blob/main/model/llm/prompt/parse_test.go

Looking at the v0.5.0 evaluation run results (I am currently writing a deep dive for those), ~65% of responses are such code responses. It might be that some of the other responses are code responses that we do not parse correctly, or they are just nonsense. In a future release we might simply switch to forced structured output.

For the assessment: when a model answers with code that we can parse (no matter if there is extra text), it receives one point, and if there is only code (and no additional chatter), it receives another point. However, I think no-chatter should receive a bigger weight.
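To illustrate the idea, here is a minimal Go sketch of extracting a fenced code block and applying the two-point scoring described above. This is not the actual implementation in parse.go (which handles many more cleanup cases); the function names and the regex are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// codeFenceRE matches a markdown code fence with an optional language tag.
// Simplified stand-in; the real parsing in model/llm/prompt/parse.go covers
// many more edge cases.
var codeFenceRE = regexp.MustCompile("(?s)```[a-zA-Z0-9]*\\n(.*?)```")

// extractCode returns the first fenced code block of a model response and
// whether the response contains anything besides that block ("chatter").
func extractCode(response string) (code string, hasChatter bool, ok bool) {
	match := codeFenceRE.FindStringSubmatchIndex(response)
	if match == nil {
		return "", false, false
	}

	code = response[match[2]:match[3]]
	before := strings.TrimSpace(response[:match[0]])
	after := strings.TrimSpace(response[match[1]:])

	return code, before != "" || after != "", true
}

// score awards one point for a parseable code response and a second point
// when the response contains nothing but code.
func score(response string) (points uint) {
	_, hasChatter, ok := extractCode(response)
	if !ok {
		return 0
	}
	points = 1
	if !hasChatter {
		points++
	}
	return points
}

func main() {
	fmt.Println(score("```go\npackage main\n```"))                   // 2: code only
	fmt.Println(score("Here is the fix:\n```go\npackage main\n```")) // 1: code plus chatter
	fmt.Println(score("I cannot help with that."))                   // 0: no code
}
```

Giving no-chatter responses a bigger weight would be a one-line change in this sketch, e.g. awarding two extra points instead of one in `score`.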
-
I just ran the evaluation once and did not look at the code in detail yet.
What is the strategy with regard to the model answering not just straight with code but with some text? Is there an attempt to parse the code from the markdown that is typically returned? IIRC I saw a prompt to only return the code, but not all models follow such "formatting" rules properly.