Followup questions about this framework and blog post + Reddit thread #1

@olaservo

Hi, thanks a lot for open sourcing your eval framework along with the blog post! I'm very interested in this type of evaluation, so it's useful to see how others are doing it.

After looking at the repo I was still confused about a few things that I am hoping you can clarify:

Evaluation model: In the Reddit comments you mentioned using Claude as the evaluator, but the GitHub repo shows Gemini being used. Could you confirm which model you actually used and share the statistical validation of the evaluator's accuracy?

Statistical methodology: The 67% vs 80% critical bug detection rates and the '3.7x more bugs' claim describe large differences. Could you share the confidence intervals, p-values, or other statistical significance tests that you used? How did you control for false positives in bug detection, and what was the distribution of results across different PRs?
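To make the question concrete, here is a rough sketch of the kind of check I have in mind. It is not taken from the repo: the function names are my own, and the counts (67 of 100 vs 80 of 100) are made up purely to match the quoted percentages, since I don't know the real denominators. It computes a Wilson score interval for each detection rate and a two-sided two-proportion z-test, which assumes the two samples are independent.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, chosen only to reproduce 67% and 80%.
ci_a = wilson_interval(67, 100)
ci_b = wilson_interval(80, 100)
z_stat, p_val = two_proportion_z_test(67, 100, 80, 100)
print(f"tool A 95% CI: {ci_a[0]:.3f} to {ci_a[1]:.3f}")
print(f"tool B 95% CI: {ci_b[0]:.3f} to {ci_b[1]:.3f}")
print(f"z = {z_stat:.2f}, p = {p_val:.3f}")
```

If both tools reviewed the same PRs, the samples are paired rather than independent, and a paired test such as McNemar's would probably be the better fit.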

Bug classification: How did you validate that the detected 'critical bugs' were actually critical issues? Did you have human reviewers verify a sample of the findings, and did you calculate false positive/negative rates?
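As an illustration of what I mean by false positive/negative rates, here is a minimal sketch that assumes a human-labeled sample exists. The `detection_quality` helper and the sample data are hypothetical, not something from the framework. Each entry pairs whether the model flagged a candidate issue as critical with whether a human reviewer confirmed it.

```python
def detection_quality(findings):
    """Summarize model flags against human labels.

    findings: list of (flagged_by_model, confirmed_by_human) boolean pairs.
    Returns precision, recall, and false positive rate (None if undefined).
    """
    tp = sum(1 for flagged, truth in findings if flagged and truth)
    fp = sum(1 for flagged, truth in findings if flagged and not truth)
    fn = sum(1 for flagged, truth in findings if not flagged and truth)
    tn = sum(1 for flagged, truth in findings if not flagged and not truth)
    return {
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
    }


# Made-up sample: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
sample = (
    [(True, True)] * 8
    + [(True, False)] * 2
    + [(False, True)] * 4
    + [(False, False)] * 6
)
print(detection_quality(sample))
```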

Sample distribution for the 500 PR dataset: How did you choose the PRs? What was the distribution across different types of changes and were there any differences in how the models performed across different PR types?

Model comparison methodology: You mentioned testing evaluator bias by having models evaluate each other's reviews. For this analysis, I'm curious which statistical methods you used to measure agreement between models, and how you controlled for potential prompt biases.
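One common agreement measure is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. I don't know whether you used it, so the sketch below is only an example of the kind of statistic I'm asking about. The labels are invented, and `cohens_kappa` is my own helper rather than anything in the repo.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented verdicts from two evaluator models on the same four reviews.
evaluator_1 = ["bug", "bug", "ok", "ok"]
evaluator_2 = ["bug", "ok", "ok", "ok"]
print(cohens_kappa(evaluator_1, evaluator_2))
```

Kappa near 1 indicates strong agreement and values near 0 indicate agreement no better than chance, so it would help separate real evaluator consistency from models that simply label most things the same way.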

I totally understand that this was a blog post and not an academic research paper, so this might be reaching farther than you intended with the original post. Still curious if you have any of these details you'd be willing to share.
