Baseline (Value Function) should account for KL penalty but currently does not #9

@Silent-Zebra

Description

Issue:
Because the objective is modified by the KL penalty, the baseline (value function) should take the penalty into account. The current training scheme has the baseline learn values as if there were no KL penalty.

Possible Effects:
If the baseline is suboptimal, the variance of the gradient estimate will be higher than with an optimal baseline. This extra noise can hamper learning. Fixing this issue should give better variance reduction, which would hopefully translate into better results. Since the baseline affects only the variance (not the bias) of the gradient estimate, even an incorrect baseline can still yield sensible learning outcomes, just with more noise, which is possibly what is happening with the current code.
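The claim that the baseline affects only variance, not bias, can be checked exactly on a toy softmax policy: the expectation of the baseline term, E_a[∇θ log π(a) · b], is zero for any constant b. A minimal sketch (the policy and baseline value here are illustrative, not from the repository):

```python
import numpy as np

# Two-action softmax policy with arbitrary logits.
theta = np.array([0.3, -0.7])
pi = np.exp(theta) / np.exp(theta).sum()

# Score function d/d(theta[0]) log pi(a) for each action a:
# it is 1{a=0} - pi[0].
score = np.array([1.0, 0.0]) - pi[0]

b = 5.0  # arbitrary constant baseline

# Exact expectation of the baseline term over actions.
bias = np.sum(pi * score * b)

# bias is 0 up to float error: a (state-dependent but action-independent)
# baseline changes the variance of the gradient estimate, never its mean.
```

This is why the current code can still learn despite the mismatched baseline: the gradient stays unbiased, only noisier.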

Fix:
Derive a formulation of the reward (including with DiCE) in which the KL divergence enters the reward directly at each time step, instead of being calculated separately. This might be trivial (just add the per-step KL term to the reward), but it is worth taking care to be sure of the right formulation. Implement this, merging the KL divergence into the reward, and use the same merged reward as the target when learning the value function.
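The merge step above can be sketched as follows. All names (`merge_kl_into_reward`, `value_targets`, `beta`) are illustrative, not from the repository, and the per-step KL term uses the standard single-sample estimate log π(a_t|s_t) − log p_ref(a_t|s_t):

```python
import numpy as np

def merge_kl_into_reward(task_rewards, logprobs_policy, logprobs_ref, beta):
    """Fold the per-step KL penalty into the reward.

    The shaped reward r'_t = r_t - beta * (log pi(a_t|s_t) - log p_ref(a_t|s_t))
    is then the single reward signal used both by the policy gradient
    (DiCE or otherwise) and by the value-function targets.
    """
    kl_per_step = logprobs_policy - logprobs_ref  # single-sample KL estimate
    return task_rewards - beta * kl_per_step

def value_targets(shaped_rewards, gamma=1.0):
    """Discounted returns of the shaped reward: the regression targets for
    the baseline, so that it accounts for the KL penalty."""
    targets = np.zeros_like(shaped_rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(shaped_rewards))):
        running = shaped_rewards[t] + gamma * running
        targets[t] = running
    return targets
```

With this formulation, there is no separate KL term to plumb through the training loop: the baseline automatically learns the KL-penalized values because it regresses onto returns of the shaped reward.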

Timeline:
Uncertain. I am currently quite busy, so I will not work on this for now, and I am unsure how much effort the fix will take. However, unlike some other previously discovered bugs, the effect of this one is relatively clear, and fixing it should only lead to better results.
