Hello,
Thank you for your great work and for releasing such a valuable resource to the community.
I am currently attempting to reproduce the BrowseComp evaluation for research purposes. While reviewing the evaluation pipeline, I noticed that the evaluation code appears to be tightly coupled to live, external search APIs, which makes it difficult to run fair, controlled, and fully reproducible experiments.
I was wondering whether the authors could share (or point me to) an evaluation setup that includes the search component, or a version that decouples retrieval from reasoning so that search results can be controlled or replayed offline. A rough sketch of what I have in mind is below.
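To make the request concrete, here is a minimal, purely illustrative sketch of a "replayable" search wrapper: it records live search results to a cache on the first run and replays them from disk afterwards. All names here (`ReplayableSearch`, `live_search`, the cache format) are hypothetical and not taken from this repository; I only mean to illustrate the kind of decoupling I am asking about.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


class ReplayableSearch:
    """Wraps a live search function and caches its results on disk,
    so a later run can replay exactly the same results offline."""

    def __init__(
        self,
        cache_path: str,
        live_search: Optional[Callable[[str], List[Dict]]] = None,
    ):
        self.cache_path = Path(cache_path)
        self.live_search = live_search
        # Load any previously recorded query -> results mapping.
        self.cache: Dict[str, List[Dict]] = (
            json.loads(self.cache_path.read_text()) if self.cache_path.exists() else {}
        )

    def search(self, query: str) -> List[Dict]:
        # Replay from the cache when possible; otherwise fall back to the live API
        # and record the results for future offline runs.
        if query in self.cache:
            return self.cache[query]
        if self.live_search is None:
            raise KeyError(f"No cached results and no live backend for query: {query!r}")
        results = self.live_search(query)
        self.cache[query] = results
        self.cache_path.write_text(json.dumps(self.cache, ensure_ascii=False, indent=2))
        return results
```

With something like this, the reasoning side of the pipeline would only ever see `search(query)`, and swapping between live and recorded retrieval would not change any evaluation logic.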