👍 There is no supervisor, only a reward signal
👍 Feedback is delayed, not instantaneous
👍 Time really matters (sequential)
👍 Agent's actions affect the subsequent data it receives
-
A reward is a scalar feedback signal; it indicates how well the agent is doing at step t. The agent's job is to maximise cumulative reward. RL is based on the reward hypothesis.
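A minimal numerical sketch (my own illustration, not from the notes) of what "maximise cumulative reward" means: the cumulative reward is just the sum of the scalar rewards that follow a step. The `gamma` discount factor is an assumption added for generality:

```python
# Sketch: cumulative (optionally discounted) reward from a sequence of scalar rewards.
def cumulative_reward(rewards, gamma=1.0):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(cumulative_reward([0.0, 0.0, 1.0]))        # 1.0: a purely delayed reward still counts
print(cumulative_reward([0.0, 0.0, 1.0], 0.9))   # 0.81: discounting shrinks delayed reward
```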
-
Sequential Decision Making:
- Goal: select actions to maximise total future reward
- Actions may have long-term consequences.
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (see the sketch after this list).
- Examples:
  - A financial investment (may take months to mature)
  - Refuelling a helicopter
  - Blocking opponent moves
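A toy illustration (my own, assuming the two reward sequences below) of why sacrificing immediate reward can pay off:

```python
# Two reward sequences over the same number of steps: the "patient" one gives up
# the small immediate reward and wins on total reward.
greedy_rewards  = [1, 0, 0, 0]   # grab the small immediate reward
patient_rewards = [0, 0, 0, 5]   # accept nothing now for a larger delayed payoff

print(sum(greedy_rewards), sum(patient_rewards))  # 1 vs 5
```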
-
Agent and Environment:
- At each step t the agent:
  - Executes action A_t
  - Receives observation O_t
  - Receives scalar reward R_t
- The environment:
  - Receives action A_t
  - Emits observation O_{t+1}
  - Emits scalar reward R_{t+1}
- t increments at the environment step (see the loop sketch below)
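A minimal sketch of this interaction loop. `Environment` and `Agent` are placeholder classes made up for illustration, not part of any particular library:

```python
import random

class Environment:
    def step(self, action):
        observation = random.random()            # emits O_{t+1}
        reward = 1.0 if action == 1 else 0.0     # emits R_{t+1}
        return observation, reward

class Agent:
    def act(self, observation):
        return random.choice([0, 1])             # chooses A_t given the observation

env, agent = Environment(), Agent()
observation = 0.0
for t in range(5):                               # t increments at each environment step
    action = agent.act(observation)              # agent executes A_t
    observation, reward = env.step(action)       # env receives A_t, emits O_{t+1}, R_{t+1}
    print(t, action, round(observation, 3), reward)
```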
-
History and State:
- The history is the sequence of {observations, actions, rewards}: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- All observable variables up to time t
- The sensorimotor stream of a robot or embodied agent.
- What happens next depends on the history
- State is the information used to determine what happens next
- Formally, state is a function of the history:
S_t = f(H_t)
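A rough sketch (my own example) of the history as a stream of (observation, reward, action) tuples, with state defined as some function f of that history:

```python
# H_t as a list of (observation, reward, action) tuples; the last action is None
# because the history ends at O_t, R_t before A_t is chosen.
history = [("O1", 0.0, "A1"), ("O2", 1.0, "A2"), ("O3", 0.5, None)]

def f(history):
    """One possible choice of state: keep only the most recent observation."""
    latest_observation, _, _ = history[-1]
    return latest_observation

S_t = f(history)   # S_t = f(H_t)
print(S_t)         # "O3"
```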
-
Environment State:
- The environment state S_t^e is the environment's private representation
- The environment state is not usually visible to the agent
-
Agent State:
- The agent state S_t^a is the information used by reinforcement learning algorithms
- It can be any function of the history: S_t^a = f(H_t)
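A hedged sketch (my own example, reusing the history format above) of one possible choice of agent state function:

```python
# S_t^a = f(H_t): here f keeps the k most recent observations rather than just the latest.
def agent_state(history, k=2):
    return tuple(obs for (obs, _, _) in history[-k:])

example_history = [("O1", 0.0, "A1"), ("O2", 1.0, "A2"), ("O3", 0.5, None)]
print(agent_state(example_history))  # ('O2', 'O3')
```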