Acknowledgments

This template has been adapted from the World Model template, written and kindly open-sourced by David Ha.

The experiments in this article were performed on a P100 GPU and a 64-core CPU Ubuntu Linux virtual machine, both provided by Google Cloud Platform, using TensorFlow.

Open Source Code

The instructions to reproduce the experiments in this work are available here.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. Figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their captions.

Appendix

In this section we describe in more detail the models and training methods used in this work.

Bandit Toy Example

As a toy example of how EXP3 works, we can test it on a simple situation with fixed reward allocations: consider 3 tasks, providing rewards with fixed probabilities $p_1=0.2$, $p_2=0.5$, $p_3=0.3$. In this situation, the teacher should try these tasks enough times to discover that task 2 is the most valuable. The evolution of the probability of selecting each task is shown in the figure below.
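For reference, this is the standard EXP3 update (notation ours, not necessarily matching the rest of the article): each task $i$ keeps a weight $w_i$, the teacher samples task $i_t$ with probability $p_i(t)$, and only the chosen task's weight is updated, using an importance-weighted reward estimate:

$$
p_i(t) = (1-\gamma)\,\frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K},
\qquad
w_{i_t}(t+1) = w_{i_t}(t)\,\exp\!\left(\frac{\gamma}{K}\cdot\frac{r_t}{p_{i_t}(t)}\right),
$$

where $K$ is the number of tasks (here $K=3$) and $\gamma \in (0,1]$ controls the amount of uniform exploration.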

Example training curves for EXP3 on the toy example.
Given an environment with 3 tasks with different reward probabilities, EXP3 should collect enough evidence to discover which task to exploit. The plot shows the evolution of the task probabilities over time.

As one can see, the teacher explores early, sampling each task enough times to get a good estimate of its associated reward. Then, after enough evidence has been collected, it starts exploiting task 2, as one would expect.
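The embedded gist below contains the code used for this example. As an independent minimal sketch, the whole EXP3 loop fits in a few lines of Python (the step count, $\gamma = 0.1$, and the seed are arbitrary choices, not taken from the article):

```python
import numpy as np

def exp3(reward_probs=(0.2, 0.5, 0.3), gamma=0.1, n_steps=5000, seed=0):
    """Run EXP3 on a bandit with fixed Bernoulli reward probabilities.

    Returns the sampling probabilities at every step, i.e. the quantity
    plotted in the figure above.
    """
    rng = np.random.default_rng(seed)
    n_tasks = len(reward_probs)
    weights = np.ones(n_tasks)
    history = []
    for _ in range(n_steps):
        # Mix the weight-based distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_tasks
        task = rng.choice(n_tasks, p=probs)
        reward = float(rng.random() < reward_probs[task])
        # Importance-weighted reward estimate keeps the update unbiased.
        weights[task] *= np.exp(gamma * reward / (probs[task] * n_tasks))
        # Rescale to avoid overflow; probs are invariant to a common scale.
        weights /= weights.max()
        history.append(probs)
    return np.array(history)

if __name__ == "__main__":
    history = exp3()
    # Task 2 (index 1) should dominate after enough steps.
    print("final sampling probabilities:", history[-1].round(3))
```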

<script src="https://gist.github.com/Feryal/ddfe13322e7f2c6186f723f37c444a21.js"></script>

Student Architecture

  • Inputs:
    • Observations: flattened 5×5 egocentric view with one-hot features and inventory, 1072 features in total.
    • Task instructions: strings of task names.
  • Observation processing: 2 fully connected layers with 256 units each
  • Language processing:
    • Embedding: 20 units
    • LSTM for words: 64 units
  • LSTM (recurrent core): 64 units
  • Policy: Softmax over the 5 possible actions (Down/Right/Left/Up/Use)
  • Value prediction (Critic): linear layer to a scalar
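Putting these pieces together, a minimal single-timestep sketch of the student in TensorFlow/Keras might look as follows. The vocabulary size, instruction length, and ReLU activations are assumptions, since the article does not specify them, and a real agent would carry the core LSTM state across timesteps rather than process one step in isolation:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sizes from the list above; VOCAB and MAX_WORDS are assumed values,
# as the article does not specify them.
N_OBS = 1072   # flattened 5x5 egocentric view + inventory
VOCAB = 64     # assumed instruction vocabulary size
MAX_WORDS = 4  # assumed max words per task instruction

obs_in = layers.Input(shape=(N_OBS,), name="observation")
instr_in = layers.Input(shape=(MAX_WORDS,), name="instruction")  # word ids

# Observation processing: 2 fully connected layers with 256 units.
obs = layers.Dense(256, activation="relu")(obs_in)
obs = layers.Dense(256, activation="relu")(obs)

# Language processing: 20-unit embedding, 64-unit LSTM over the words.
emb = layers.Embedding(VOCAB, 20)(instr_in)   # (batch, MAX_WORDS, 20)
instr = layers.LSTM(64)(emb)                  # (batch, 64)

# Recurrent core: a 64-unit LSTM over the concatenated features,
# shown here for a single timestep; training unrolls this over time.
core_in = layers.Reshape((1, 256 + 64))(layers.Concatenate()([obs, instr]))
core = layers.LSTM(64)(core_in)               # (batch, 64)

# Actor-critic heads: Softmax policy over the 5 actions, scalar value.
policy = layers.Dense(5, activation="softmax", name="policy")(core)
value = layers.Dense(1, name="value")(core)

model = tf.keras.Model([obs_in, instr_in], [policy, value])
model.summary()
```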