Right way to calculate eval scores #6

hitzkrieg opened this issue Dec 5, 2022 · 3 comments
What would be the right way to get the eval results once training finishes? Should I manually average the last 10% of episode scores using the eval JSON file? (I have occasionally encountered cases where all files except the eval JSON are saved after training.)
Or should I rely on progress.csv?
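For concreteness, here is roughly what I mean by averaging the last 10% (a minimal sketch; it assumes the eval JSON is a list of per-episode records with a "score" field, and the file name is a placeholder — the actual layout may differ):

```python
import json

def avg_last_10pct(eval_json_path):
    """Average the scores from the last 10% of evaluation episodes."""
    with open(eval_json_path) as f:
        episodes = json.load(f)                  # assumed: a list of per-episode records
    scores = [ep["score"] for ep in episodes]    # "score" field name is an assumption
    tail = scores[-max(1, len(scores) // 10):]   # last 10% of episodes (at least one)
    return sum(tail) / len(tail)

print(avg_last_10pct("eval.json"))               # placeholder file name
```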

MarcCote commented Dec 6, 2022

Good question. I started working on cleaning up this repo's code to make it compatible with the recent version of ScienceWorld. While cleaning up, I removed the internal scripts we used to extract the scores (see 4ed8909; maybe they can be useful to you). But for now, @PeterAJansen might be better placed to answer this question.

yukioichida commented Jul 17, 2023

Hi @MarcCote

Speaking about the results from the ScienceWorld paper: I am studying and trying to replicate some results in the paper (specifically the DRRN result), and I have a question: can I consider the DRRN results as zero-shot learning on the test variations?

I am asking because the Table 2 description in the paper contains this part: "...Performance for RL agents is averaged over the last 10% of evaluation episodes..." and I am not sure whether "evaluation episodes" here refers to the eval or the test variations (in the training script, the default arg is "eval", not "test").

Also, please correct me if I am missing some point.

@PeterAJansen

Hi @yukioichida, apologies for being a little slow!

If I'm remembering correctly, the setup for the DRRN was essentially:

  • train for 100k steps (x8 threads, ~= 800k steps)
  • every so many training steps (I think 1k or 5k, under various configurations), spawn an evaluation environment set to a randomly selected variation index (drawn from dev for tuning, or test for the final eval), and run some steps in that environment (I think 100).
  • once all the training steps are completed, take the average of the last 10% of the scores on the evaluation set, and report that (see the sketch below).
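Something like the following sketch of that periodic-evaluation loop, using the scienceworld Python API from memory (the constructor/method signatures, the interval constants, and the agent_act / info["valid"] bits are assumptions for illustration, not the actual baseline code):

```python
import random
from scienceworld import ScienceWorldEnv  # pip install scienceworld

TASK = "find-living-thing"   # example task name (an assumption)
EVAL_INTERVAL = 1000         # spawn an eval episode every N training steps (1k or 5k in the paper setup)
EVAL_STEPS = 100             # steps per evaluation episode

eval_env = ScienceWorldEnv("", envStepLimit=EVAL_STEPS)
eval_env.load(TASK, 0)
test_variations = eval_env.getVariationsTest()   # or getVariationsDev() for tuning

def agent_act(obs, info):
    # placeholder policy: in the real setup this is the DRRN's action selection;
    # info["valid"] holding the valid-action strings is an assumption about the API
    return random.choice(info["valid"])

eval_scores = []
for step in range(1, 100_000 + 1):
    # ... one DRRN training step on the training environment goes here ...

    if step % EVAL_INTERVAL == 0:
        eval_env.load(TASK, random.choice(test_variations))
        obs, info = eval_env.reset()
        for _ in range(EVAL_STEPS):
            obs, reward, done, info = eval_env.step(agent_act(obs, info))
            if done:
                break
        eval_scores.append(info["score"])

# average the last 10% of evaluation scores, as reported in the paper
tail = eval_scores[len(eval_scores) * 9 // 10:]
print(sum(tail) / max(1, len(tail)))
```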

Having this data also allows us to plot the performance vs task curves, like in Figure 2.

I tried to set up this evaluation to match my best understanding of how the existing text-game literature/DRRN models were being used at the time. But, in retrospect, I think the cleanest evaluation would be to just do something similar to how we evaluated the LLM-based agents:

  • Train the model up to some number of episodes
  • Evaluate the model iteratively on every variation within a given task, then average that performance across variations (rather than running for some static number of steps or episodes and randomly starting a new variation when the last one finishes, as in the DRRN evaluation).

The above protocol would be much cleaner, give an assessment of model performance across all task variations, and also give a fairly direct comparison to the LLM-based models.
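A rough sketch of that per-variation protocol (again with the scienceworld API from memory, and a hypothetical agent_act policy function standing in for the trained model):

```python
from scienceworld import ScienceWorldEnv

def evaluate_all_variations(agent_act, task, max_steps=100):
    """Run the trained agent once on every test variation of `task` and
    return the mean final score across variations (the protocol sketched above)."""
    env = ScienceWorldEnv("", envStepLimit=max_steps)
    env.load(task, 0)
    scores = []
    for variation in env.getVariationsTest():
        env.load(task, variation)
        obs, info = env.reset()
        for _ in range(max_steps):
            obs, reward, done, info = env.step(agent_act(obs, info))
            if done:
                break
        scores.append(info["score"])
    return sum(scores) / len(scores)
```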

I will say, as someone who stared at a very large number of DRRN trajectories during development: it really doesn't appear that the DRRN is learning much of anything, and most of its very modest performance seems to come from randomly selecting the occasional action that satisfies incidental/optional goals (like moving to a new location, or opening a door) rather than actions that actually advance the task. The DRRN's performance is highest on the pick-and-place tasks (e.g. find a living/non-living thing), where its scores suggest that it successfully completes some of the very permissive picks some/most of the time. Other than that, task performance across the board is generally very low. So, while we should use the best possible protocol and research methods to measure performance, I doubt even that would change the DRRN performance much/at all, as there doesn't seem to be much signal there to measure in the first place.
