05‐22‐2024 Weekly Tag Up
- Joe
- Chi Hui
- Implemented new critic update for single-objective learning
- Essentially changes the "target" for the states associated with the sampled actions
- The idea is to make optimal states more valuable/obvious (see the sketch after this list)
- New single objective models have very similar learning curves to the previous versions
- Re-ran ablation study with true value functions and new single objective models
- Started with "pure queue" dataset (only actions from the queue policy) (exp 25.2)
- Results were very similar to the previous experiment (exp 24.2)
- Offline rollouts still ~100x larger than they should be
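
A rough sketch of what a fitted-Q style critic update with a modified target could look like; the network/optimizer setup, batch layout, and the `target_boost` knob are placeholders for illustration, not the exact change implemented above.

```python
import torch
import torch.nn.functional as F

def critic_update(q_net, q_target, optimizer, batch, gamma=0.99, target_boost=1.0):
    # batch: states (B, ...), actions (B,) long, rewards (B,), next_states, dones (B,) float
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        # Standard bootstrapped target: r + gamma * max_a' Q_target(s', a')
        next_q = q_target(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
        # Hypothetical modification: re-scale the target for the states paired
        # with the sampled actions so that "good" states stand out more.
        targets = target_boost * targets

    # Q(s, a) for the sampled actions only
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```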
- RECALL
- Part of the intention of the ablation study is to understand why our batch policy learning algorithm was not really controlling the performance of the multi-objective policy when evaluated using online rollouts
- The performance was always FAR above the constraint, and the lambda updates seemed to have no impact
- The behavior should match between online and offline rollouts
- Bad result from offline rollout should produce bad result in online rollout
- Values don't necessarily need to match
- The offline rollouts use the value function, so they account for the discount factor
- Online rollouts just accumulate (undiscounted) reward (see the sketch below)
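
A minimal sketch of the two evaluation modes being contrasted; `value_fn`, `env`, and `policy` are placeholders with an assumed interface, not our actual evaluation code.

```python
import numpy as np

def offline_estimate(value_fn, start_states):
    # Offline "rollout": average the learned value function over start states.
    # This estimate is discounted (the critic was trained with gamma), so its
    # scale reflects the discount factor.
    return float(np.mean([value_fn(s) for s in start_states]))

def online_return(env, policy, horizon=1000):
    # Online rollout: step the real environment and accumulate raw,
    # undiscounted reward.
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total
```

The comparison is about direction rather than magnitude: a constant per-step reward r contributes roughly r / (1 - gamma) to a discounted estimate but r * H to an undiscounted H-step sum, so the two numbers are not expected to match even when the rankings agree.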