05‐22‐2024 Weekly Tag Up

Attendees

  • Joe
  • Chi Hui

Updates

  • Implemented a new critic update for single-objective learning (sketched after this list)
    • Essentially changes the "target" for the states associated with the sampled actions
    • The idea is that we want to make optimal states more valuable/obvious
  • New single-objective models have learning curves very similar to the previous versions
  • Re-ran the ablation study with true value functions and the new single-objective models
    • Started with the "pure queue" dataset (only actions from the queue policy) (exp 25.2)
    • Results were very similar to the previous run (exp 24.2)
      • Offline rollout values are still ~100x larger than they should be
  • RECALL
    • Part of the intent of this ablation study is to understand why our batch policy learning algorithm was not actually controlling the performance of the multi-objective policy when evaluated with online rollouts
      • The performance was always FAR above the constraint, and the lambda updates seemed to have no impact (a reference lambda update is sketched below)
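
As a point of reference, here is a minimal sketch of a critic update along the lines described above: a fitted-Q style step where the TD target for the sampled (state, action) pairs can be shifted so that the states those actions lead to stand out. The network shape, the `target_boost` term, and the discount value are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QCritic(nn.Module):
    """Hypothetical Q-network for a discrete action space."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)  # Q(s, ·) for every action

def critic_step(critic, target_critic, opt, batch, gamma=0.99, target_boost=0.0):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset
    with torch.no_grad():
        # Standard TD target from the target network...
        td_target = r + gamma * (1.0 - done) * target_critic(s_next).max(dim=1).values
        # ...optionally shifted for the sampled actions, so the states they
        # lead to look more valuable (the exact adjustment is an assumption).
        td_target = td_target + target_boost
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, td_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```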
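For context on the lambda updates, a textbook dual-ascent step looks like the following; `constraint_cost`, `limit`, and the learning rate are placeholders rather than the project's actual symbols. If the estimated cost is always far above the limit, lambda grows monotonically regardless of what the policy does, which is one plausible reading of why the updates appeared to have no impact.

```python
def lambda_update(lmbda: float, constraint_cost: float, limit: float, lr: float = 1e-2) -> float:
    """Dual ascent: raise lambda when the estimated constraint cost exceeds
    the limit, lower it otherwise; clamp at zero."""
    return max(0.0, lmbda + lr * (constraint_cost - limit))
```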

Next Steps

  • The behavior should match between online and offline rollouts
    • A bad result from an offline rollout should correspond to a bad result in the online rollout
    • The values themselves don't necessarily need to match (see the sketch below)
      • Offline rollouts use the value function, so they account for the discount factor
      • Online rollouts just accumulate (undiscounted) reward
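
To make the intended check concrete, here is a rough sketch of the two quantities being compared; `env`, `policy`, and `critic` are hypothetical stand-ins with assumed interfaces. The point is that the two numbers should move together (bad offline ↔ bad online) without being numerically equal, since one is discounted and the other is not.

```python
def online_return(env, policy, horizon=1000):
    """Undiscounted return accumulated over a real (online) rollout."""
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state, reward, done = env.step(policy(state))
        total += reward  # raw reward, no discount factor
        if done:
            break
    return total

def offline_estimate(critic, start_state):
    """Offline estimate: the learned value function, which already
    folds the discount factor into its prediction."""
    return critic(start_state)
```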