Hello,
Thank you for your great work and for releasing such a valuable resource to the community.
I am currently attempting to reproduce the BrowseComp evaluation for research purposes. While reviewing the evaluation pipeline, I noticed that the evaluation code appears to be tightly coupled to live, external search APIs, which makes it difficult to run fair, controlled, and fully reproducible experiments.
I was wondering whether the authors could share (or point me to) an evaluation setup that includes the search component, or a version that decouples retrieval from reasoning so that search results can be controlled or replayed offline. A rough sketch of what I have in mind is below.
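To make the request concrete, here is a minimal, purely illustrative sketch of a "replayable" search wrapper: it records live search results to a cache on the first run and replays them from disk afterwards. All names here (`ReplayableSearch`, `live_search`, the cache format) are hypothetical and not taken from this repository; I only mean to illustrate the kind of decoupling I am asking about.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


class ReplayableSearch:
    """Wraps a live search function and caches its results on disk,
    so a later run can replay exactly the same results offline."""

    def __init__(
        self,
        cache_path: str,
        live_search: Optional[Callable[[str], List[Dict]]] = None,
    ):
        self.cache_path = Path(cache_path)
        self.live_search = live_search
        # Load any previously recorded query -> results mapping.
        self.cache: Dict[str, List[Dict]] = (
            json.loads(self.cache_path.read_text()) if self.cache_path.exists() else {}
        )

    def search(self, query: str) -> List[Dict]:
        # Replay from the cache when possible; otherwise fall back to the live API
        # and record the results for future offline runs.
        if query in self.cache:
            return self.cache[query]
        if self.live_search is None:
            raise KeyError(f"No cached results and no live backend for query: {query!r}")
        results = self.live_search(query)
        self.cache[query] = results
        self.cache_path.write_text(json.dumps(self.cache, ensure_ascii=False, indent=2))
        return results
```

With something like this, the reasoning side of the pipeline would only ever see `search(query)`, and swapping between live and recorded retrieval would not change any evaluation logic.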