05‐22‐2024 Weekly Tag Up

Attendees

  • Joe
  • Chi Hui

Updates

  • Implemented a new critic update for single-objective learning (sketched after this list)
    • Essentially changes the "target" for the states associated with the sampled actions
    • The idea is that we want to make optimal states more valuable/obvious
  • New single-objective models have learning curves very similar to the previous versions
  • Re-ran the ablation study with true value functions and the new single-objective models
    • Started with the "pure queue" dataset (only actions from the queue policy) (exp 25.2)
    • Results were very similar to the previous run (exp 24.2)
      • Offline rollout values are still ~100x larger than they should be
  • RECALL
    • Part of the intent of this ablation study is to understand why our batch policy learning algorithm was not actually controlling the performance of the multi-objective policy when evaluated with online rollouts
      • The performance was always FAR above the constraint, and the lambda updates seemed to have no impact (a reference lambda update is sketched below)
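
As a point of reference, here is a minimal sketch of a critic update along the lines described above: a fitted-Q style step where the TD target for the sampled (state, action) pairs can be shifted so that the states those actions lead to stand out. The network shape, the `target_boost` term, and the discount value are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QCritic(nn.Module):
    """Hypothetical Q-network for a discrete action space."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)  # Q(s, ·) for every action

def critic_step(critic, target_critic, opt, batch, gamma=0.99, target_boost=0.0):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset
    with torch.no_grad():
        # Standard TD target from the target network...
        td_target = r + gamma * (1.0 - done) * target_critic(s_next).max(dim=1).values
        # ...optionally shifted for the sampled actions, so the states they
        # lead to look more valuable (the exact adjustment is an assumption).
        td_target = td_target + target_boost
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, td_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```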
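For context on the lambda updates, a textbook dual-ascent step looks like the following; `constraint_cost`, `limit`, and the learning rate are placeholders rather than the project's actual symbols. If the estimated cost is always far above the limit, lambda grows monotonically regardless of what the policy does, which is one plausible reading of why the updates appeared to have no impact.

```python
def lambda_update(lmbda: float, constraint_cost: float, limit: float, lr: float = 1e-2) -> float:
    """Dual ascent: raise lambda when the estimated constraint cost exceeds
    the limit, lower it otherwise; clamp at zero."""
    return max(0.0, lmbda + lr * (constraint_cost - limit))
```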

Next Steps

  • The behavior should match between online and offline rollouts
    • A bad result from an offline rollout should correspond to a bad result in the online rollout
    • The values themselves don't necessarily need to match (see the sketch below)
      • Offline rollouts use the value function, so they account for the discount factor
      • Online rollouts just accumulate (undiscounted) reward
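
To make the intended check concrete, here is a rough sketch of the two quantities being compared; `env`, `policy`, and `critic` are hypothetical stand-ins with assumed interfaces. The point is that the two numbers should move together (bad offline ↔ bad online) without being numerically equal, since one is discounted and the other is not.

```python
def online_return(env, policy, horizon=1000):
    """Undiscounted return accumulated over a real (online) rollout."""
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state, reward, done = env.step(policy(state))
        total += reward  # raw reward, no discount factor
        if done:
            break
    return total

def offline_estimate(critic, start_state):
    """Offline estimate: the learned value function, which already
    folds the discount factor into its prediction."""
    return critic(start_state)
```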