Baseline (Value Function) should account for KL penalty but currently does not #9

@Silent-Zebra

Description

Issue:
Because the objective is modified by the KL penalty, the baseline (value function) should take the penalty into account. The current training scheme has the baseline learn values as if there were no KL penalty.

Possible Effects:
If the baseline is suboptimal, the variance of the gradient estimate will be higher than with an optimal baseline. This extra noise can hamper learning. Fixing this issue should give better variance reduction, which would hopefully translate into better results. Since the baseline affects only the variance (not the bias) of the gradient estimate, even an incorrect baseline can still yield sensible learning outcomes, just with more noise, which is possibly what is happening with the current code.
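The claim that the baseline affects only variance, not bias, can be checked exactly on a toy softmax policy: the expectation of the baseline term, E_a[∇θ log π(a) · b], is zero for any constant b. A minimal sketch (the policy and baseline value here are illustrative, not from the repository):

```python
import numpy as np

# Two-action softmax policy with arbitrary logits.
theta = np.array([0.3, -0.7])
pi = np.exp(theta) / np.exp(theta).sum()

# Score function d/d(theta[0]) log pi(a) for each action a:
# it is 1{a=0} - pi[0].
score = np.array([1.0, 0.0]) - pi[0]

b = 5.0  # arbitrary constant baseline

# Exact expectation of the baseline term over actions.
bias = np.sum(pi * score * b)

# bias is 0 up to float error: a (state-dependent but action-independent)
# baseline changes the variance of the gradient estimate, never its mean.
```

This is why the current code can still learn despite the mismatched baseline: the gradient stays unbiased, only noisier.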

Fix:
Derive a formulation of the reward (including with DiCE) in which the KL divergence enters the reward directly at each time step, instead of being calculated separately. This might be trivial (just add the per-step KL term to the reward), but it is worth taking care to be sure of the right formulation. Implement this, merging the KL divergence into the reward, and use the same merged reward as the target when learning the value function.
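The merge step above can be sketched as follows. All names (`merge_kl_into_reward`, `value_targets`, `beta`) are illustrative, not from the repository, and the per-step KL term uses the standard single-sample estimate log π(a_t|s_t) − log p_ref(a_t|s_t):

```python
import numpy as np

def merge_kl_into_reward(task_rewards, logprobs_policy, logprobs_ref, beta):
    """Fold the per-step KL penalty into the reward.

    The shaped reward r'_t = r_t - beta * (log pi(a_t|s_t) - log p_ref(a_t|s_t))
    is then the single reward signal used both by the policy gradient
    (DiCE or otherwise) and by the value-function targets.
    """
    kl_per_step = logprobs_policy - logprobs_ref  # single-sample KL estimate
    return task_rewards - beta * kl_per_step

def value_targets(shaped_rewards, gamma=1.0):
    """Discounted returns of the shaped reward: the regression targets for
    the baseline, so that it accounts for the KL penalty."""
    targets = np.zeros_like(shaped_rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(shaped_rewards))):
        running = shaped_rewards[t] + gamma * running
        targets[t] = running
    return targets
```

With this formulation, there is no separate KL term to plumb through the training loop: the baseline automatically learns the KL-penalized values because it regresses onto returns of the shaped reward.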

Timeline:
Uncertain. I am currently quite busy, so I will not work on this for now, and I am unsure how much effort the fix will take. However, unlike some other previously discovered bugs, the effect of this one is relatively clear, and fixing it should only lead to better results.
