Acknowledgments

This template has been adapted from the World Model template, written and kindly open-sourced by David Ha.

The experiments in this article were performed on a P100 GPU and a 64-core CPU Ubuntu Linux virtual machine, both provided by Google Cloud Platform, using TensorFlow.

Open Source Code

The instructions to reproduce the experiments in this work are available here.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. Figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their captions.

Appendix

In this section we describe in more detail the models and training methods used in this work.

Bandit Toy Example

As a toy example of how EXP3 works, we can test it on a simple situation with fixed reward allocations: consider 3 tasks, providing rewards with fixed probabilities $p_1=0.2$, $p_2=0.5$, $p_3=0.3$. In this situation, the teacher should try these tasks enough times to discover that task 2 is the most valuable. The evolution of the probability of selecting each task is shown in the figure below.
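For reference, this is the standard EXP3 update (notation ours, not necessarily matching the rest of the article): each task $i$ keeps a weight $w_i$, the teacher samples task $i_t$ with probability $p_i(t)$, and only the chosen task's weight is updated, using an importance-weighted reward estimate:

$$
p_i(t) = (1-\gamma)\,\frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K},
\qquad
w_{i_t}(t+1) = w_{i_t}(t)\,\exp\!\left(\frac{\gamma}{K}\cdot\frac{r_t}{p_{i_t}(t)}\right),
$$

where $K$ is the number of tasks (here $K=3$) and $\gamma \in (0,1]$ controls the amount of uniform exploration.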

Example training curves for EXP3 on the toy example.
Given an environment with 3 tasks with different reward probabilities, EXP3 should collect enough evidence to discover which task to exploit. The plot shows the evolution of the task probabilities over time.

As one can see, the teacher explores early, sampling each task enough times to get a good estimate of its associated reward. Then, after enough evidence has been collected, it starts exploiting task 2, as one would expect.
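The embedded gist below contains the code used for this example. As an independent minimal sketch, the whole EXP3 loop fits in a few lines of Python (the step count, $\gamma = 0.1$, and the seed are arbitrary choices, not taken from the article):

```python
import numpy as np

def exp3(reward_probs=(0.2, 0.5, 0.3), gamma=0.1, n_steps=5000, seed=0):
    """Run EXP3 on a bandit with fixed Bernoulli reward probabilities.

    Returns the sampling probabilities at every step, i.e. the quantity
    plotted in the figure above.
    """
    rng = np.random.default_rng(seed)
    n_tasks = len(reward_probs)
    weights = np.ones(n_tasks)
    history = []
    for _ in range(n_steps):
        # Mix the weight-based distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_tasks
        task = rng.choice(n_tasks, p=probs)
        reward = float(rng.random() < reward_probs[task])
        # Importance-weighted reward estimate keeps the update unbiased.
        weights[task] *= np.exp(gamma * reward / (probs[task] * n_tasks))
        # Rescale to avoid overflow; probs are invariant to a common scale.
        weights /= weights.max()
        history.append(probs)
    return np.array(history)

if __name__ == "__main__":
    history = exp3()
    # Task 2 (index 1) should dominate after enough steps.
    print("final sampling probabilities:", history[-1].round(3))
```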

<script src="https://gist.github.com/Feryal/ddfe13322e7f2c6186f723f37c444a21.js"></script>

Student Architecture

  • Inputs:
    • Observations: flattened 5×5 egocentric view with one-hot features and inventory, 1072 features in total.
    • Task instructions: strings of task names.
  • Observation processing: 2 fully connected layers with 256 units each
  • Language processing:
    • Embedding: 20 units
    • LSTM for words: 64 units
  • LSTM (recurrent core): 64 units
  • Policy: Softmax over the 5 possible actions (Down/Right/Left/Up/Use)
  • Value prediction (Critic): linear layer to a scalar
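Putting these pieces together, a minimal single-timestep sketch of the student in TensorFlow/Keras might look as follows. The vocabulary size, instruction length, and ReLU activations are assumptions, since the article does not specify them, and a real agent would carry the core LSTM state across timesteps rather than process one step in isolation:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sizes from the list above; VOCAB and MAX_WORDS are assumed values,
# as the article does not specify them.
N_OBS = 1072   # flattened 5x5 egocentric view + inventory
VOCAB = 64     # assumed instruction vocabulary size
MAX_WORDS = 4  # assumed max words per task instruction

obs_in = layers.Input(shape=(N_OBS,), name="observation")
instr_in = layers.Input(shape=(MAX_WORDS,), name="instruction")  # word ids

# Observation processing: 2 fully connected layers with 256 units.
obs = layers.Dense(256, activation="relu")(obs_in)
obs = layers.Dense(256, activation="relu")(obs)

# Language processing: 20-unit embedding, 64-unit LSTM over the words.
emb = layers.Embedding(VOCAB, 20)(instr_in)   # (batch, MAX_WORDS, 20)
instr = layers.LSTM(64)(emb)                  # (batch, 64)

# Recurrent core: a 64-unit LSTM over the concatenated features,
# shown here for a single timestep; training unrolls this over time.
core_in = layers.Reshape((1, 256 + 64))(layers.Concatenate()([obs, instr]))
core = layers.LSTM(64)(core_in)               # (batch, 64)

# Actor-critic heads: Softmax policy over the 5 actions, scalar value.
policy = layers.Dense(5, activation="softmax", name="policy")(core)
value = layers.Dense(1, name="value")(core)

model = tf.keras.Model([obs_in, instr_in], [policy, value])
model.summary()
```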