VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

TL;DR:

  • The main difference between this method and other meta-RL methods that infer and use a probabilistic latent variable is that they don't just sample the latent variable from the posterior; they feed the whole distribution into the decision-making process => this is how they claim that their method approximates a Bayes-optimal strategy (see the sketch right after this list).
  • They changed the objective function for the encoder to predict future states and rewards (at this stage they still sample the latent variable). (I don't think this change matters much.)
  • They point out some weaknesses of posterior-sampling-based methods: they cannot do super-fast adaptation at test time (when the number of samples is small) (I think this is only true for simple problems), and sampling is also not the best way to exploit the (estimated) posterior.
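
A minimal sketch (mine, not the paper's code) of that first point, assuming the encoder outputs a diagonal Gaussian posterior with parameters mu and logvar; `policy` is a hypothetical network that just takes a concatenated input:

```python
import torch

def act_posterior_sampling(policy, state, mu, logvar):
    # PEARL-style posterior sampling: draw one task hypothesis z ~ q(m | tau_{:t})
    # and act as if that hypothesis were the true task.
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    return policy(torch.cat([state, z], dim=-1))

def act_on_belief(policy, state, mu, logvar):
    # VariBAD-style: condition on the full belief (here, the Gaussian's parameters),
    # so the policy can take its own uncertainty into account when acting.
    belief = torch.cat([mu, logvar], dim=-1)
    return policy(torch.cat([state, belief], dim=-1))
```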

Details (and my notes):

  • One way to solve this: compute the Bayes-optimal policy from a Bayes-adaptive Markov decision process (BAMDP): the agent maintains a "belief", i.e. a posterior distribution, over possible environments. Cons: planning in a BAMDP is intractable => prior work relies on posterior sampling.
  • Problem with posterior sampling: the agent takes the shortest route to a possible goal position (sampled from the posterior), and then resamples a different goal position from the updated posterior => not efficient, since the agent's uncertainty is not reduced optimally (e.g., states are revisited). => Figure 1, GridWorld experiment (see the belief-update sketch after this list).
    • My question: posterior sampling seems more stochastic => more robust than a fixed exploration strategy. I can imagine that for harder problems, e.g. where the goal might change over time, revisiting states might not be a bad idea.
    • But we can also reason that stochasticity alone is better than a naive/fixed exploration strategy, yet it is still not the best way to exploit the belief/posterior we infer over the meta-structure.
    • For example: consider a navigation problem where the goals (across individual tasks) are concentrated in n clusters. The optimal meta-strategy, I think, would be to visit all clusters, with the visiting order depending on each cluster's location and its probability of containing the goal. A naive exploration strategy probably couldn't even reason about the clusters, and posterior sampling would probably NOT use the location information. Thinking about PEARL, maybe the exploration strategy could be improved by NOT sampling z from the posterior, but by doing something based on the whole distribution (use gradients to select z somehow?).
  • From Figure 1: Bayes-optimal policies can explore much more efficiently than posterior sampling => the question is how to approximate Bayes-optimal policies while retaining the tractability of posterior sampling.
  • (As I thought above) This paper builds a variational auto-encoder that infers the posterior distribution, and a policy that conditions on this posterior distribution.
  • They formulate the Bayes-Adaptive Markov decision process (BAMDP) with a modified state, reward function, transition function, and initial state distribution that take the belief into account (the posterior, which is updated deterministically according to Bayes' rule).
  • The objective function is quite similar to other papers (amortized inference, ELBO; see the sketch after this list). One slight difference in the training phase: the decoder uses information from the whole trajectory, not just the past experience (is the policy itself trained with just normal REINFORCE?).
  • MODEL: an RNN encoder produces the posterior over task embeddings; the decoder samples latent variables from it and uses them to predict the next state and reward (this is how the encoder is trained). The policy, on the other hand, doesn't sample the latent variables but uses the full distribution over the latent variable to make decisions (see the model sketch after this list).
  • NOTE: the same thing happens in other papers: if the task-identification part is trained on the same on-policy data, it yields poor performance. Thus, they keep previous trajectories in a buffer for this part.
  • They point out one weakness of PEARL (and other posterior-sampling-based methods): it can't adapt as fast at TEST time as other methods (low sample efficiency at test time).
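
To make the GridWorld / posterior-sampling discussion concrete, a tiny sketch (mine; the cell coordinates and function names are made up) of the deterministic Bayes-rule belief update over candidate goal cells: visiting a cell and seeing no reward rules that hypothesis out and renormalizes the rest. Posterior sampling then commits to one surviving goal at a time, while a Bayes-optimal policy can plan against the whole remaining distribution.

```python
def update_goal_belief(belief, visited_cell, got_reward):
    """Deterministic Bayes-rule update for a belief over candidate goal cells.

    belief: dict mapping cell -> probability that this cell is the goal.
    """
    new_belief = {}
    for cell, prob in belief.items():
        # Under hypothesis "goal == cell", reward is observed iff we stand on the goal,
        # so the likelihood of the observation is either 1 or 0.
        consistent = (cell == visited_cell) == got_reward
        new_belief[cell] = prob if consistent else 0.0
    total = sum(new_belief.values())
    return {c: p / total for c, p in new_belief.items()} if total > 0 else new_belief

# Uniform prior over three candidate goals; visiting (0, 1) and seeing no reward
# removes that hypothesis and spreads its mass over the remaining candidates.
belief = {(0, 1): 1 / 3, (2, 2): 1 / 3, (4, 0): 1 / 3}
belief = update_goal_belief(belief, visited_cell=(0, 1), got_reward=False)
print(belief)  # {(0, 1): 0.0, (2, 2): 0.5, (4, 0): 0.5}
```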
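
For reference, my reading of the training objective (notation approximate, written from memory of the paper): at timestep t the encoder conditions only on the trajectory so far, but the decoder reconstructs the whole trajectory up to the horizon H, including future states and rewards:

$$
\mathrm{ELBO}_t(\phi, \theta) = \mathbb{E}_{q_\phi(m \mid \tau_{:t})}\big[\log p_\theta(\tau_{:H} \mid m)\big] - \mathrm{KL}\big(q_\phi(m \mid \tau_{:t}) \,\|\, p(m)\big)
$$

The reconstruction term is what the decoder's next-state/reward predictions implement; the KL term regularizes the per-timestep posterior towards the prior p(m).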
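
And a rough PyTorch-style sketch of the MODEL bullet as I understand it (module names, sizes, and the GRU/MLP choices are mine, not necessarily the paper's): the RNN encoder maps the trajectory so far to (mu, logvar); the decoder samples z and predicts the next state and reward, which is what trains the encoder; the policy consumes the state together with the belief parameters rather than a sample.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """RNN encoder: trajectory so far -> Gaussian posterior over the task embedding."""
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, transitions):           # (batch, time, state + action + reward)
        h, _ = self.rnn(transitions)
        h_t = h[:, -1]                         # belief after the latest transition
        return self.to_mu(h_t), self.to_logvar(h_t)

class Decoder(nn.Module):
    """Predicts next state and reward from (state, action, sampled z); trains the encoder."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim + 1),   # next state + scalar reward
        )

    def forward(self, state, action, mu, logvar):
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
        out = self.net(torch.cat([state, action, z], dim=-1))
        return out[..., :-1], out[..., -1]      # predicted next state, predicted reward

class BeliefPolicy(nn.Module):
    """Policy conditioned on the full belief (mu, logvar), not on a sampled z."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, mu, logvar):
        return self.net(torch.cat([state, mu, logvar], dim=-1))
```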