Right way to calculate eval scores #6
Good question. I started working on cleaning up this repo's code to make it compatible with the recent version of ScienceWorld. While cleaning up, I removed the internal scripts we used to extract the scores (see 4ed8909; maybe they can be useful to you). But for now, @PeterAJansen might be better placed to answer this question.
Hi @MarcCote, regarding the results from the ScienceWorld paper: I am studying and trying to replicate some of the results (specifically the DRRN result) and I have a question: can I consider the DRRN results as zero-shot performance on the test variations? I am asking because the Table 2 caption in the paper contains this part: "...Performance for RL agents is averaged over the last 10% of evaluation episodes..." and I am not sure whether "evaluation episodes" here refers to the eval or the test variations (in the training script, the default arg is "eval", not "test"). Also, please correct me if I am missing something.
Hi @yukioichida, apologies for being a little slow! If I'm remembering correctly, the setup for the DRRN was essentially:
Having this data also allows us to plot the performance-vs-task curves, like in Figure 2. I tried to set up this evaluation to match my best understanding of how the existing text-game literature/DRRN models were being used at the time. But, in retrospect, I think the cleanest evaluation would be to just do something similar to how we evaluated the LLM-based agents:
The above protocol would be much cleaner, give an assessment of model performance across all task variations, and also give a fairly direct comparison to the LLM-based models. I will say, as someone who stared at a very large number of DRRN trajectories during development: it really doesn't appear that the DRRN is learning much of anything, and most of its very modest performance seems to come from randomly selecting the occasional action that helps with incidental/optional goals (like moving to a new location, or opening a door) rather than actually helping task performance. The DRRN's performance is highest on the pick-and-place tasks (e.g. find a living/non-living thing), where its scores suggest that it successfully completes some of the very permissive picks some/most of the time. Other than that, task performance across the board is generally very low. So, while we should use the best possible protocol and research methods to measure performance, I doubt it would change the DRRN's performance much/at all, as there doesn't seem to be much signal there to measure in the first place.
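For concreteness, a minimal sketch of that kind of protocol (run the final trained agent once on every held-out test variation of a task and average the final scores) might look like the following. The `agent` object and its `act()` method are hypothetical stand-ins for a trained DRRN policy, and the ScienceWorld method names follow the Python API as I understand it, so they may need adjusting for your installed version:

```python
from scienceworld import ScienceWorldEnv

def evaluate_on_test_variations(agent, task_name, env_step_limit=100):
    """Run a trained agent once per test variation and average the final scores.

    `agent` is a hypothetical policy object with an `act(observation, info)`
    method; substitute your trained DRRN here.
    """
    env = ScienceWorldEnv("", envStepLimit=env_step_limit)
    env.load(task_name, 0)  # load the task so its variation lists are available
    scores = []

    for variation_idx in env.get_variations_test():    # held-out test variations
        env.load(task_name, variation_idx)
        obs, info = env.reset()
        for _ in range(env_step_limit):
            action = agent.act(obs, info)               # hypothetical agent call
            obs, reward, is_completed, info = env.step(action)
            if is_completed:
                break
        scores.append(info["score"])                    # final score for this variation

    return sum(scores) / len(scores) if scores else 0.0
```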
What would be the right way to get the eval results once the training finishes? Should I manually average the last 10% of episode scores using the eval json file? (I have occasionally encountered cases where all files are saved after training except the eval json.)
Or should I rely on progress.csv?
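As a reference point, here is a minimal sketch of the "average over the last 10% of evaluation episodes" computation, assuming the eval JSON file holds a list of per-episode records that each contain a `score` field; the actual layout written by the training script may differ, and progress.csv could be parsed analogously:

```python
import json

def average_last_10_percent(eval_json_path):
    """Average episode scores over the last 10% of evaluation episodes.

    Assumes the eval JSON is a list of per-episode records, each with a
    "score" field; adjust the field names to match your log format.
    """
    with open(eval_json_path) as f:
        episodes = json.load(f)

    scores = [ep["score"] for ep in episodes]
    n_last = max(1, len(scores) // 10)        # last 10% of episodes (at least one)
    return sum(scores[-n_last:]) / n_last
```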